Engineering Evolutionary Intelligent Systems
Studies in Computational Intelligence, Volume 82
Editor-in-chief
Prof. Janusz Kacprzyk
Systems Research Institute
Polish Academy of Sciences
ul. Newelska 6
01-447 Warsaw
Poland
E-mail: kacprzyk@ibspan.waw.pl
Further volumes of this series can be found on our homepage: springer.com

Vol. 60. Vladimir G. Ivancevic and Tijana T. Ivancevic
Computational Mind: A Complex Dynamics Perspective, 2007
ISBN 978-3-540-71465-1

Vol. 61. Jacques Teller, John R. Lee and Catherine Roussey (Eds.)
Ontologies for Urban Development, 2007
ISBN 978-3-540-71975-5

Vol. 62. Lakhmi C. Jain, Raymond A. Tedman and Debra K. Tedman (Eds.)
Evolution of Teaching and Learning Paradigms in Intelligent Environment, 2007
ISBN 978-3-540-71973-1

Vol. 63. Wlodzislaw Duch and Jacek Mańdziuk (Eds.)
Challenges for Computational Intelligence, 2007
ISBN 978-3-540-71983-0

Vol. 64. Lorenzo Magnani and Ping Li (Eds.)
Model-Based Reasoning in Science, Technology, and Medicine, 2007
ISBN 978-3-540-71985-4

Vol. 65. S. Vaidya, L.C. Jain and H. Yoshida (Eds.)
Advanced Computational Intelligence Paradigms in Healthcare-2, 2007
ISBN 978-3-540-72374-5

Vol. 66. Lakhmi C. Jain, Vasile Palade and Dipti Srinivasan (Eds.)
Advances in Evolutionary Computing for System Design, 2007
ISBN 978-3-540-72376-9

Vol. 67. Vassilis G. Kaburlasos and Gerhard X. Ritter (Eds.)
Computational Intelligence Based on Lattice Theory, 2007
ISBN 978-3-540-72686-9

Vol. 68. Cipriano Galindo, Juan-Antonio Fernández-Madrigal and Javier Gonzalez
A Multi-Hierarchical Symbolic Model of the Environment for Improving Mobile Robot Operation, 2007
ISBN 978-3-540-72688-3

Vol. 69. Falko Dressler and Iacopo Carreras (Eds.)
Advances in Biologically Inspired Information Systems: Models, Methods, and Tools, 2007
ISBN 978-3-540-72692-0

Vol. 70. Javaan Singh Chahl, Lakhmi C. Jain, Akiko Mizutani and Mika Sato-Ilic (Eds.)
Innovations in Intelligent Machines-1, 2007
ISBN 978-3-540-72695-1

Vol. 71. Norio Baba, Lakhmi C. Jain and Hisashi Handa (Eds.)
Advanced Intelligent Paradigms in Computer Games, 2007
ISBN 978-3-540-72704-0

Vol. 72. Raymond S.T. Lee and Vincenzo Loia (Eds.)
Computational Intelligence for Agent-based Systems, 2007
ISBN 978-3-540-73175-7

Vol. 73. Petra Perner (Ed.)
Case-Based Reasoning on Images and Signals, 2008
ISBN 978-3-540-73178-8

Vol. 74. Robert Schaefer
Foundation of Global Genetic Optimization, 2007
ISBN 978-3-540-73191-7

Vol. 75. Crina Grosan, Ajith Abraham and Hisao Ishibuchi (Eds.)
Hybrid Evolutionary Algorithms, 2007
ISBN 978-3-540-73296-9

Vol. 76. Subhas Chandra Mukhopadhyay and Gourab Sen Gupta (Eds.)
Autonomous Robots and Agents, 2007
ISBN 978-3-540-73423-9

Vol. 77. Barbara Hammer and Pascal Hitzler (Eds.)
Perspectives of Neural-Symbolic Integration, 2007
ISBN 978-3-540-73953-1

Vol. 78. Costin Badica and Marcin Paprzycki (Eds.)
Intelligent and Distributed Computing, 2008
ISBN 978-3-540-74929-5

Vol. 79. Xing Cai and T.-C. Jim Yeh (Eds.)
Quantitative Information Fusion for Hydrological Sciences, 2008
ISBN 978-3-540-75383-4

Vol. 80. Joachim Diederich
Rule Extraction from Support Vector Machines, 2008
ISBN 978-3-540-75389-6

Vol. 81. K. Sridharan
Robotic Exploration and Landmark Determination, 2008
ISBN 978-3-540-75393-3

Vol. 82. Ajith Abraham, Crina Grosan and Witold Pedrycz (Eds.)
Engineering Evolutionary Intelligent Systems, 2008
ISBN 978-3-540-75395-7
Ajith Abraham
Crina Grosan
Witold Pedrycz
(Eds.)
Engineering Evolutionary
Intelligent Systems
Ajith Abraham
Centre for Quantifiable Quality of Service in Communication Systems (Q2S), Centre of Excellence
Norwegian University of Science and Technology
O.S. Bragstads plass 2E, N-7491 Trondheim, Norway
ajith.abraham@ieee.org

Crina Grosan
Department of Computer Science, Faculty of Mathematics and Computer Science
Babes-Bolyai University
Kogalniceanu 1, 400084 Cluj-Napoca, Romania

Witold Pedrycz
Department of Electrical & Computer Engineering
University of Alberta
ECERF Bldg., 2nd floor, Edmonton AB T6G 2V4, Canada
© 2008 Springer-Verlag Berlin Heidelberg
This work is subject to copyright. All rights are reserved, whether the whole or part of the material is
concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting,
reproduction on microfilm or in any other way, and storage in data banks. Duplication of this publication
or parts thereof is permitted only under the provisions of the German Copyright Law of September 9,
1965, in its current version, and permission for use must always be obtained from Springer. Violations are
liable to prosecution under the German Copyright Law.
The use of general descriptive names, registered names, trademarks, etc. in this publication does not imply,
even in the absence of a specific statement, that such names are exempt from the relevant protective laws
and regulations and therefore free for general use.
Cover Design: Deblik, Berlin, Germany
Printed on acid-free paper
9 8 7 6 5 4 3 2 1
springer.com
Contents
Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 441
Preface
PNNs. The optimization of the FNN is realized with the aid of a standard
back-propagation learning algorithm and genetic optimization.
In the third Chapter, Oh and Pedrycz introduce Self-Organizing Neural Networks (SONN) that are based on a genetically optimized multilayer perceptron with Polynomial Neurons (PNs) or Fuzzy Polynomial Neurons (FPNs). In the conventional SONN, an evolutionary algorithm is used to extend the main characteristics of the extended Group Method of Data Handling (GMDH) method, which utilizes the polynomial order as well as the number of node inputs fixed at the corresponding nodes (PNs or FPNs) located in each layer during the development of the network. The genetically optimized SONN (gSONN) results in a structurally optimized network and comes with a higher level of flexibility than the conventional SONN.
Kim and Park discuss a new design methodology of the self-organizing technique which builds upon the use of evolutionary algorithms. The self-organizing network dwells on the idea of the group method of data handling. The order of the polynomial, the optimal number of input variables and their selection are encoded as a chromosome. The appropriate information of each node is evolved accordingly and tuned gradually using evolutionary algorithms. The evolved network is a sophisticated and versatile architecture, which can construct models from a limited data set as well as from poorly defined complex problems.
In the fifth Chapter, Ramanathan and Guan propose the Recursive Pattern-based Hybrid Supervised (RPHS) learning algorithm, which makes use of the concept of pseudo global optimal solutions to evolve a set of neural networks, each of which can correctly solve a subset of patterns. The pattern-based algorithm uses the topology of the training and validation data patterns to find a set of pseudo-optima, each learning a subset of patterns.
Ramanathan and Guan improve the RPHP algorithm (as discussed in Chapter 5) by using a combination of a genetic algorithm, weak learners and a pattern distributor. The global search component is achieved by a cluster-
based combinatorial optimization, whereby patterns are clustered according
to the output space of the problem. A combinatorial optimization problem is
therefore formed, which is solved using evolutionary algorithms. An algorithm
is also proposed to use the pattern distributor to determine the optimal num-
ber of recursions and hence the optimal number of weak learners suitable for
the problem at hand.
In the seventh Chapter, Markowska-Kaczmar proposes two methods of rule extraction referred to as REX and GEX. REX uses propositional fuzzy rules and is composed of two methods, REX Michigan and REX Pitt. GEX takes advantage of classical Boolean rules. The efficiency of REX and GEX was tested using different benchmark data sets from the UCI repository.
Tushar and Pratihar deal with the design of Takagi and Sugeno Fuzzy Logic Controllers (FLCs). The design proceeds by clustering the data based on their mutual similarity, after which cluster-wise regression analysis is carried out to determine the response equations for the consequent part of the rules. The performance of the developed cluster-wise linear regression approach, the cluster-wise Takagi and Sugeno model of FLC with linear membership functions, and the cluster-wise Takagi and Sugeno model of FLC with nonlinear membership functions is illustrated using two practical problems.
In the ninth Chapter, Prosperi and Ulivi propose fuzzy relational mod-
els for genotypic drug resistance analysis in Human Immunodeficiency Virus
type 1 (HIV-1). Fuzzy logic is introduced to model high-level medical lan-
guage, viral and pharmacological dynamics. Fuzzy evolutionary algorithms
and fuzzy evaluation functions are proposed to mine resistance rules, to
improve computational performance and to select relevant features.
Azzini and Tettamanzi present an approach to the joint optimization of neural network structure and weights, using the backpropagation algorithm as a specialized decoder and defining a simultaneous evolution of the architecture and weights of neural networks.
In the eleventh Chapter, Dempsey et al. present grammatical genetic programming to generate radial basis function networks. The authors tested the hybrid algorithm on several benchmark classification problems and report encouraging performance.
In the sequel, Cha et al. propose a neural-genetic model for wave-induced liquefaction, which provides a better prediction of liquefaction potential. The
wave-induced seabed liquefaction problem is one of the most critical issues for
analyzing and designing marine structures such as caissons, oil platforms and
harbors. In the past, various investigations into wave-induced seabed lique-
faction have been carried out including numerical models, analytical solutions
and some laboratory experiments. However, most previous numerical studies
are based on solving complicated partial differential equations. The neural-
genetic simulation results illustrate the applicability of the hybrid technique
for the accurate prediction of wave-induced liquefaction depth, which can also
provide coastal engineers with alternative tools to analyze the stability of
marine sediments.
In the thirteenth Chapter, Quintero and Pierre propose a multi-population Memetic Algorithm (MA) with migration and elitism to solve the problem of assigning cells to switches as a design step of large-scale mobile networks. This task is well known in the literature to be an NP-hard combinatorial optimization problem, and it therefore requires recourse to heuristic methods, which can practically lead to good feasible solutions, not necessarily optimal, the objective being rather to reduce the convergence time toward these solutions. Computational results reported on an extensive suite of tests confirm the efficiency and effectiveness of the MA in providing good solutions in comparison with other well-known heuristics, especially for large-scale cellular mobile networks.
Contributors

Jeff Achtnig
Nalisys (Research Division)
jeff.achtnig@nalisys.com

Enrique Alba
Department of Computer Science
University of Málaga
eat@lcc.uma.es

Antonia Azzini
Università degli Studi di Milano
Dipartimento di Tecnologie dell'Informazione
via Bramante 65, 26013 Crema, Italy
azzini@dti.unimi.it

Michael Blumenstein
School of Information and Communication Technology
Griffith University
Gold Coast Campus, QLD 4215, Australia
M.Blumenstein@griffith.edu.au

Daeho Cha
Griffith School of Engineering
Griffith University, Gold Coast Campus
QLD 4215, Australia
f.cha@griffith.edu.au

Ian Dempsey
Natural Computing Research and Applications Group
University College Dublin
Ireland
ian.dempsey@PipelineFinancial.com

Bernabé Dorronsoro
Department of Computer Science
University of Málaga
bernabe@lcc.uma.es
1 Introduction
Evolutionary Algorithms (EA) have recently received increased interest, par-
ticularly with regard to the manner in which they may be applied for practical
problem solving. Usually grouped under the term evolutionary computation
or evolutionary algorithms, we find the domains of Genetic Algorithms [34],
Evolution Strategies [68], [69], Evolutionary Programming [20], Learning Clas-
sifier Systems [36], Genetic Programming [45], Differential Evolution [67] and
Estimation of Distribution Algorithms [56]. They all share a common conceptual base of simulating the evolution of individual structures, and they differ in the way the problem is represented, in the selection process and in the usage/implementation of the reproduction operators. These processes depend on the perceived performance of the individual structures as defined by the problem.
Compared to other global optimization techniques, evolutionary algo-
rithms are easy to implement and very often they provide adequate solutions.
A population of candidate solutions (for the optimization task to be solved) is initialized. New solutions are created by applying reproduction operators (mutation and/or crossover). The fitness (how good the solutions are) of the resulting solutions is evaluated, and a suitable selection strategy is then applied to determine which solutions will be maintained in the next generation. The procedure is then iterated.
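The loop just described can be sketched in a few lines. The sphere fitness function, the population size, the rates and the truncation selection used below are illustrative assumptions, not choices prescribed by the chapter.

```python
# Minimal sketch of the generic evolutionary loop: initialize,
# reproduce (crossover + mutation), evaluate, select, iterate.
import random

def evolve(fitness, dim=5, pop_size=20, generations=100,
           mutation_rate=0.2, seed=0):
    rng = random.Random(seed)
    # 1. Initialize a population of candidate solutions.
    pop = [[rng.uniform(-5, 5) for _ in range(dim)] for _ in range(pop_size)]
    for _ in range(generations):
        # 2. Create offspring via uniform crossover and Gaussian mutation.
        offspring = []
        for _ in range(pop_size):
            a, b = rng.sample(pop, 2)
            child = [ai if rng.random() < 0.5 else bi for ai, bi in zip(a, b)]
            if rng.random() < mutation_rate:
                i = rng.randrange(dim)
                child[i] += rng.gauss(0, 0.5)
            offspring.append(child)
        # 3. Evaluate fitness and keep the best (truncation selection).
        pop = sorted(pop + offspring, key=fitness)[:pop_size]
    return pop[0]

best = evolve(lambda x: sum(xi * xi for xi in x))  # minimize the sphere function
```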
(Figures: generic architectures in which an intelligent paradigm operates on the problem/data alone, is optimized by an evolutionary algorithm, or is optimized by an evolutionary algorithm with error feedback from the output; a further figure shows the evolutionary search of learning rules proceeding on a slow time scale.)
3. Apply genetic operators to each child individual generated above and obtain the
next generation.
4. Check whether the network has achieved the required error rate or the specified
number of generations has been reached. Go to Step 2.
5. End
For the neural network to be fully optimal, the learning rules are to be adapted dynamically according to its architecture and the given problem. Deciding the learning rate and momentum can be considered as a first attempt at learning rules [48]. The basic learning rule can be generalized by the function

$$\Delta w(t) = \sum_{k=1}^{n} \; \sum_{i_1, i_2, \ldots, i_k = 1}^{n} \theta_{i_1, i_2, \ldots, i_k} \prod_{j=1}^{k} x_{i_j}(t-1) \qquad (1)$$
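A direct numerical transcription of Eq. (1) may make the indexing concrete: the weight change is a sum, over all k-tuples of input indices, of a theta-weighted product of the previous activations. The values of n, theta and x below are arbitrary illustrative choices, not anything from the chapter.

```python
# Transcription of Eq. (1): sum over tuple length k = 1..n and over
# all k-tuples (i1,...,ik) of theta_{i1,...,ik} * prod_j x_{i_j}(t-1).
from itertools import product

def delta_w(theta, x_prev):
    n = len(x_prev)
    total = 0.0
    for k in range(1, n + 1):
        for idx in product(range(n), repeat=k):
            coeff = theta.get(idx, 0.0)   # theta_{i1,...,ik}, default 0
            term = 1.0
            for i in idx:                 # product over j = 1..k
                term *= x_prev[i]
            total += coeff * term
    return total

# Example with n = 2 and only two nonzero coefficients (0-based indices):
x = [0.5, 2.0]
theta = {(0,): 1.0, (0, 1): 3.0}
print(delta_w(theta, x))  # 1.0*0.5 + 3.0*(0.5*2.0) = 3.5
```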
learning, in many cases a pre-defined architecture was used, and in a few cases architectures were evolved together. Abraham [2] proposed the meta-learning evolutionary neural network with a tight interaction of the different evolutionary search mechanisms, using the generic framework illustrated in Figure 7.
Cai et al. [11] used a hybrid of Particle Swarm Optimization (PSO) [41], [18]
and EA to train Recurrent Neural Networks (RNNs) for the prediction of
missing values in time series data. Experimental results illustrate that RNNs,
trained by the hybrid algorithm, are able to predict the missing values in
the time series with minimum error, in comparison with those trained with
standard EA and PSO algorithms.
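As a simplified stand-in for such evolutionary training schemes, the sketch below evolves the weights of a fixed 2-2-1 feedforward network on the XOR task with a (mu+lambda)-style strategy. The task, network shape and all settings are illustrative assumptions and not the method of any work cited above.

```python
# Hedged sketch: evolving neural network weights (no gradients).
import math
import random

XOR = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

def forward(w, x):
    # w holds 9 weights: 2x2 hidden weights + 2 hidden biases,
    # 2 output weights + 1 output bias.
    h1 = math.tanh(w[0] * x[0] + w[1] * x[1] + w[2])
    h2 = math.tanh(w[3] * x[0] + w[4] * x[1] + w[5])
    return 1.0 / (1.0 + math.exp(-(w[6] * h1 + w[7] * h2 + w[8])))

def mse(w):
    return sum((forward(w, x) - y) ** 2 for x, y in XOR) / len(XOR)

rng = random.Random(1)
pop = [[rng.gauss(0, 1) for _ in range(9)] for _ in range(30)]
for _ in range(400):
    # Each child is a Gaussian perturbation of a random parent;
    # elitist truncation keeps the 30 best of parents + children.
    children = [[wi + rng.gauss(0, 0.3) for wi in rng.choice(pop)]
                for _ in range(30)]
    pop = sorted(pop + children, key=mse)[:30]
best = pop[0]
```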
Castillo et al. [12] explored several methods that combine evolutionary algorithms and local search to optimize multilayer perceptrons. The authors explored a method that optimizes the architecture and initial weights of multilayer perceptrons, a search algorithm for the training algorithm parameters, and finally a co-evolutionary algorithm that handles the architecture, the network's initial weights and the training algorithm parameters. Experimental results show that the co-evolutionary method obtains similar or better results than the other approaches, requiring far fewer training epochs and thus reducing running time.
Hui [37] proposed a new method for predicting the reliability for repairable
systems using evolutionary neural networks. Genetic algorithms are used to
globally optimize the number of neurons in the hidden layer and learning
parameters of the neural network architecture.
Marwala [51] proposed a Bayesian neural network trained using Markov Chain Monte Carlo (MCMC) and genetic programming in binary space within the Metropolis framework. The proposed algorithm could learn using samples obtained from previous steps, merged using concepts of natural evolution, including mutation, crossover and reproduction. The reproduction function is given by the Metropolis framework, and binary mutation as well as simple crossover are also used.
Kim and Cho [42] proposed an incremental evolution method for neu-
ral networks based on cellular automata and a method of combining several
evolved modules by a rule-based approach. The incremental evolution method
evolves the neural network by starting with simple environment and gradu-
ally making it more complex. The multi-modules integration method can make
complex behaviors by combining several modules evolved or programmed to
do simple behaviors.
Kim [43] explored a genetic algorithm approach to instance selection in
artificial neural networks when the amount of data is very large. GA optimizes
adapting the fuzzy membership functions or by learning the fuzzy if-then rules
[55], [33]. Figure 8 shows the architecture of the adaptive fuzzy control system
wherein the fuzzy membership functions and the rule bases are optimized
using a hybrid global search procedure. An optimal design of an adaptive fuzzy
control system could be achieved by the adaptive evolution of membership
functions and the learning rules that progress on different time scales. Figure 9
illustrates the general interaction mechanism with the global search of fuzzy
rules evolving at the highest level on the slowest time scale. For each fuzzy
rule base, global search of membership functions proceeds at a faster time
scale in an environment decided by the problem.
(Figure: adaptive fuzzy control system. The evolutionary search adapts the fuzzy sets and rule base in the knowledge base, guided by a performance measure, on a slow time scale; the fuzzy controller, comprising the fuzzy sets and if-then rules, acts on the process on a fast time scale.)
The tuning of the scaling parameters and fuzzy membership functions (piece-
wise linear and/or differentiable functions) is an important task in the design
of fuzzy systems and is popularly known as genetic tuning. Evolutionary al-
gorithms could be used to search the optimal shape, number of membership
functions per linguistic variable and the parameters [31]. The genome encodes the parameters of trapezoidal, triangular, logistic, Laplace, hyperbolic-tangent or Gaussian membership functions, etc. Most of the existing methods assume
the existence of a predefined collection of fuzzy membership functions giving
meaning to the linguistic labels contained in the rules (database). Evolution-
ary algorithms are applied to obtain a suitable rule base, using chromosomes
that code single rules or complete rule bases. If prior knowledge of the mem-
bership functions is available, a simplified chromosome representation could
be formulated accordingly.
The first decision a designer has to make is how to represent a solution in a chromosome structure. The first approach is to have the chromosome encode the complete rule base; each chromosome then differs only in the fuzzy rule membership functions as defined in the database. In the second approach, each
chromosome encodes a different database definition based on the fuzzy domain
partitions. The global search for membership functions using evolutionary
algorithm is formulated in Algorithm 4.1.
3. Apply genetic operators to each child individual generated above and obtain the
next generation.
4. Check whether the fuzzy system has achieved the required error rate or the
specified number of generations has been reached. Go to Step 2.
5. End
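A minimal concrete instance of such a search is sketched below: the chromosome encodes the centers of three triangular membership functions together with zero-order Sugeno consequents, and the evolutionary loop above is applied to fit a simple target curve. All specifics (fixed membership-function width, the x² target, rates, population size) are illustrative assumptions.

```python
# Evolutionary tuning of fuzzy membership function parameters.
import random

def tri(x, a, b, c):
    # Triangular membership function with feet a, c and peak b.
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x < b else (c - x) / (c - b)

def fis(params, x):
    # Zero-order Sugeno system: 3 triangular sets (fixed width 1.2,
    # evolvable centers b1..b3) with one consequent y1..y3 each.
    (b1, b2, b3, y1, y2, y3) = params
    mems = [tri(x, b1 - 0.6, b1, b1 + 0.6),
            tri(x, b2 - 0.6, b2, b2 + 0.6),
            tri(x, b3 - 0.6, b3, b3 + 0.6)]
    s = sum(mems)
    if s == 0:
        return 0.0
    return sum(m * y for m, y in zip(mems, (y1, y2, y3))) / s

TARGET = [(i / 20, (i / 20) ** 2) for i in range(21)]  # fit y = x^2

def err(params):
    return sum((fis(params, x) - y) ** 2 for x, y in TARGET)

rng = random.Random(0)
pop = [[rng.uniform(0, 1) for _ in range(6)] for _ in range(30)]
for _ in range(200):
    children = []
    for _ in range(30):
        a, b = rng.sample(pop, 2)
        child = [(ai + bi) / 2 for ai, bi in zip(a, b)]  # arithmetic crossover
        i = rng.randrange(6)
        child[i] += rng.gauss(0, 0.1)                    # mutate one gene
        children.append(child)
    pop = sorted(pop + children, key=err)[:30]           # elitist selection
```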
according to the nature of the problem. The rule base of the fuzzy system may be represented using a relational matrix, a decision table or a set of rules.
In the Pittsburgh approach [71], each chromosome encodes a whole rule set.
Crossover serves to provide a new combination of rules and mutation provides
new rules. The disadvantage is the increased complexity of search space and
additional computational burden especially for online learning. The size of the
genotype depends on the number of input/output variables and fuzzy sets.
In the Michigan approach [35], each genotype represents a single fuzzy rule
and the entire population represents a solution. The fuzzy knowledge base is
adapted as a result of antagonistic roles of competition and cooperation of
fuzzy rules. A classifier rule triggers whenever its condition part matches the
current input, in which case the proposed action is sent to the process to be
controlled. The fuzzy behavior is created by an activation sequence of mutually
collaborating fuzzy rules. In the Michigan approach, techniques for judging
the performance of single rules are necessary.
The Iterative Rule Learning (IRL) approach [27] is similar to the Michi-
gan approach where the chromosomes encode individual rules. In IRL, only
the best individual is considered as the solution, discarding the remaining
chromosomes in the population. The evolutionary algorithm generates new
classifier rules based on the rule strengths acquired during the entire process.
Defuzzification operators and their parameters may also be formulated as an evolutionary search [46], [40], [5].
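The contrast between the Pittsburgh and Michigan encodings can be made concrete with two small data structures. The rule format shown is hypothetical, chosen only to illustrate the difference in granularity.

```python
# Pittsburgh: one individual = a whole rule base.
# Michigan: one individual = one rule; the population is the rule base,
# so per-rule credit assignment (a strength value) is needed.
from dataclasses import dataclass, field
from typing import List, Tuple

Rule = Tuple[str, str]  # hypothetical (antecedent, consequent) pair

@dataclass
class PittsburghChromosome:
    # The population competes between alternative complete rule bases.
    rules: List[Rule] = field(default_factory=list)

@dataclass
class MichiganChromosome:
    # The population as a whole forms a single rule base.
    rule: Rule = ("", "")
    strength: float = 0.0  # credit assigned to this one rule

pitt = PittsburghChromosome(rules=[("error is Negative", "action is Increase"),
                                   ("error is Positive", "action is Decrease")])
michigan_population = [
    MichiganChromosome(("error is Negative", "action is Increase")),
    MichiganChromosome(("error is Positive", "action is Decrease")),
]
```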
Tsang et al. [73] proposed a fuzzy rule-based system for intrusion detection,
which is evolved from an agent-based evolutionary framework and multi-
objective optimization. The proposed system can also act as a genetic feature
selection wrapper to search for an optimal feature subset for dimensionality
reduction.
Edwards et al. [19] modeled the complex export pattern behavior of multi-
national corporation subsidiaries in Malaysia using a Takagi-Sugeno fuzzy
inference system. The proposed fuzzy inference system is optimized by us-
ing neural network learning and evolutionary computation. Empirical results
clearly show that the proposed approach could model the export behavior
reasonably well compared to a direct neural network approach.
Chen et al. [15] proposed an automatic way of evolving hierarchical Takagi-Sugeno Fuzzy Systems (TS-FS). The hierarchical structure is evolved
using Probabilistic Incremental Program Evolution (PIPE) with specific in-
structions. The fine tuning of the if - then rules parameters encoded in the
structure is accomplished using Evolutionary Programming (EP). The pro-
posed method interleaves both PIPE and EP optimizations. Starting with
random structures and rules parameters, it first tries to improve the hierar-
chical structure and then as soon as an improved structure is found, it further
fine tunes the rules parameters. It then goes back to improve the structure
and the rules’ parameters. This loop continues until a satisfactory hierarchical
TS-FS model is found or a time limit is reached.
Pawar and Ganguli [61] developed a Genetic Fuzzy System (GFS) for online structural health monitoring of composite helicopter rotor blades. The authors formulated global and local GFSs. The global GFS is for matrix cracking
and debonding/delamination detection along the whole blade and the local
GFS is for matrix cracking and debonding/delamination detection in various
parts of the blade.
Chua et al. [16] proposed a GA-based fuzzy controller design for tunnel ventilation systems. The Fuzzy Logic Control (FLC) method was utilized due to the complex and nonlinear behavior of the system, and the FLC was optimized using the GA.
Franke et al. [22] presented a genetic-fuzzy system for automatically
generating online scheduling strategies for a complex objective defined by a
machine provider. The scheduling algorithm is based on a rule system, which
classifies all possible scheduling states and assigns a corresponding scheduling
strategy. The authors compared two different approaches. In the first approach, an iterative method is applied that assigns a standard scheduling strategy
to all situation classes. In the second approach, a symbiotic evolution varies
the parameter of Gaussian membership functions to establish the different
situation classes and also assigns the appropriate scheduling strategies.
5 Evolutionary Clustering
Clustering is the act of partitioning an unlabeled dataset into groups of similar objects. Each group, called a 'cluster', consists of objects that are similar to one another and dissimilar to objects of other groups. A
comprehensive review of the state-of-the-art clustering methods can be found
in [76], [64].
Data clustering is broadly based on two approaches: hierarchical and par-
titional. In hierarchical clustering, the output is a tree showing a sequence
of clustering with each cluster being a partition of the data set. Hierarchical
algorithms can be agglomerative (bottom-up) or divisive (top-down). Agglom-
erative algorithms begin with each element as a separate cluster and merge
them in successively larger clusters. Partitional clustering algorithms, on the
other hand, attempt to decompose the data set directly into a set of disjoint
clusters by optimizing certain criteria. The criterion function may emphasize
the local structure of the data, as by assigning clusters to peaks in the prob-
ability density function, or the global structure. Typically, the global criteria
involve minimizing some measure of dissimilarity in the samples within each
cluster, while maximizing the dissimilarity of different clusters. The advan-
tages of the hierarchical algorithms are the disadvantages of the partitional
algorithms and vice versa.
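A minimal sketch of the bottom-up (agglomerative) strategy, using single linkage over 1-D points; the data and the stopping criterion (a target number of clusters) are illustrative choices.

```python
# Single-linkage agglomerative clustering: start with singletons and
# repeatedly merge the closest pair of clusters until k remain.
def agglomerate(points, k):
    clusters = [[p] for p in points]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: distance between the closest members.
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] += clusters.pop(j)   # merge the closest pair
    return clusters

print(agglomerate([0.0, 0.1, 0.2, 5.0, 5.1], 2))
# prints [[0.0, 0.1, 0.2], [5.0, 5.1]]
```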
Clustering can also be performed in two different modes: crisp and fuzzy.
In crisp clustering, the clusters are disjoint and non-overlapping in nature.
Any pattern may belong to one and only one class in this case. In case of
fuzzy clustering, a pattern may belong to all the classes with a certain fuzzy
membership grade.
One of the widely used clustering methods is the fuzzy c-means (FCM)
algorithm developed by Bezdek [9]. FCM partitions a collection of n vectors x_i, i = 1, 2, ..., n, into c fuzzy groups and finds a cluster center in each group such that a cost function of a dissimilarity measure is minimized. To accommodate the introduction of fuzzy partitioning, the membership matrix U is allowed to have elements with values between 0 and 1. The FCM objective function takes the form:

$$J(U, c_1, \ldots, c_c) = \sum_{i=1}^{c} J_i = \sum_{i=1}^{c} \sum_{j=1}^{n} u_{ij}^{m} \, d_{ij}^{2} \qquad (4)$$

where u_ij is a numerical value between 0 and 1; c_i is the cluster center of fuzzy group i; d_ij = ||c_i − x_j|| is the Euclidean distance between the ith cluster center and the jth data point; and m is called the exponential weight, which influences the degree of fuzziness of the membership (partition) matrix.
Usually a number of cluster centers are randomly initialized, and the FCM algorithm provides an iterative approach to approximate the minimum of the objective function starting from a given position; it may converge to any of its local minima [3]. There is no guarantee that FCM converges to an optimum solution (it can be trapped by local extrema in the process of optimizing the clustering criterion), and its performance is very sensitive to the initialization of the cluster centers.
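The alternating update scheme can be sketched directly from Eq. (4): memberships are recomputed from the current centers, then each center becomes the u^m-weighted mean of the data. The 1-D data, c = 2 and m = 2 below are illustrative choices, and the initialization is deliberately naive.

```python
# Minimal fuzzy c-means iteration for 1-D data.
def fcm(data, c=2, m=2.0, iters=50):
    centers = list(data[:c])   # naive initialization
    for _ in range(iters):
        # Membership update: u_ij inversely related to d_ij, per the
        # standard FCM update derived from Eq. (4).
        u = []
        for x in data:
            d = [abs(x - ci) for ci in centers]
            if any(di == 0.0 for di in d):  # point coincides with a center
                u.append([1.0 if di == 0.0 else 0.0 for di in d])
                continue
            u.append([1.0 / sum((d[i] / d[k]) ** (2.0 / (m - 1.0))
                                for k in range(c))
                      for i in range(c)])
        # Center update: u^m-weighted mean of the data.
        centers = [sum(u[j][i] ** m * data[j] for j in range(len(data)))
                   / sum(u[j][i] ** m for j in range(len(data)))
                   for i in range(c)]
    return centers, u

centers, u = fcm([0.0, 0.2, 0.4, 9.6, 9.8, 10.0])
```

For this well-separated data the two centers settle near the two groups, with memberships close to 0 or 1 for points far from the opposite group.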
Research efforts have made it possible to view data clustering as an opti-
mization problem. This view offers us a chance to apply EA for evolving the
optimal number of clusters and their cluster centers. The algorithm is initial-
ized by constraining the initial values to be within the space defined by the
vectors to be clustered. An important advantage of the EA is its ability to
cope with local optima by maintaining, recombining and comparing several
candidate solutions simultaneously.
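Viewing clustering as optimization can be sketched as follows: a chromosome is a flat list of candidate centers, the fitness is the within-cluster sum of squared distances (a crisp stand-in for the fuzzy objective of Eq. (4)), and the initial centers are constrained to the space spanned by the data, as suggested above. The data and the EA settings are illustrative.

```python
# Evolving cluster centers with a simple elitist EA.
import random

DATA = [0.0, 0.2, 0.4, 9.6, 9.8, 10.0]

def fitness(centers):
    # Within-cluster sum of squared distances to the nearest center.
    return sum(min((x - c) ** 2 for c in centers) for x in DATA)

rng = random.Random(0)
lo, hi = min(DATA), max(DATA)
# Constrain the initial population to the span of the data.
pop = [[rng.uniform(lo, hi) for _ in range(2)] for _ in range(20)]
for _ in range(100):
    children = [[c + rng.gauss(0, 0.5) for c in rng.choice(pop)]
                for _ in range(20)]
    pop = sorted(pop + children, key=fitness)[:20]
best = sorted(pop[0])
```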
Abraham [3] proposed the concurrent architecture of a fuzzy clustering
algorithm (to discover data clusters) and a fuzzy inference system for Web
usage mining. A hybrid evolutionary FCM approach is proposed in this paper
to optimally segregate similar user interests. The clustered data is then used
to analyze the trends using a Takagi-Sugeno fuzzy inference system learned
using a combination of evolutionary algorithm and neural network learning.
combines fuzzy polynomial neurons (FPNs) [57] that are located at the first
layer of the network with polynomial neurons (PNs) forming the remaining
layers of the network. The GA-based design procedure being applied at each
layer of HSOFPNN leads to the selection of preferred nodes of the network
(FPNs or PNs) whose local characteristics (such as the number of input vari-
ables, the order of the polynomial, a collection of the specific subset of input
variables, the number of membership functions for each input variable, and
the type of membership function) can be easily adjusted.
Juang and Chung [39] proposed a recurrent Takagi-Sugeno-Kang (TSK) fuzzy network design using the hybridization of a multi-group genetic algorithm and particle swarm optimization (R-MGAPSO). Both the number of
fuzzy rules and the parameters in a TRFN are designed simultaneously by
R-MGAPSO. In R-MGAPSO, the techniques of variable-length individuals
and the local version of particle swarm optimization are incorporated into a
genetic algorithm, where individuals with the same length constitute the same
group, and there are multigroups in a population.
Aouiti et al. [6] proposed an evolutionary method for the design of beta ba-
sis function neural networks (BBFNN) and beta fuzzy systems (BFS). Authors
used a hierarchical genetic learning model of the BBFNN and the BFS.
Chen et al. [14] introduced a new time-series forecasting model based on
the flexible neural tree (FNT). The FNT model is generated initially as a flexi-
ble multi-layer feed-forward neural network and evolved using an evolutionary
procedure. FNT model could also select the appropriate input variables or
time-lags for constructing a time-series model.
them were kept as variables, and a Pareto front was effectively constructed
by minimizing the training error along with the network size.
8 Conclusions
This Chapter presented the various architectures for designing intelligent
paradigms using evolutionary algorithms. The main focus was on designing
evolutionary neural networks and evolutionary fuzzy systems. We also illus-
trated some of the recent generic evolutionary design architectures reported
in the literature including fuzzy neural networks and multiobjective design
strategies.
References
1. Abraham A, Grosan C, Han SY, Gelbukh A (2005) Evolutionary multiobjective
optimization approach for evolving ensemble of intelligent paradigms for stock
market modeling. In: Alexander Gelbukh et al. (eds.) 4th Mexican international
conference on artificial intelligence, Mexico, Lecture notes in computer science,
Springer, Berlin Heidelberg New York, pp 673–681
2. Abraham A (2004) Meta-learning evolutionary artificial neural networks.
Neurocomput J 56c:1–38
3. Abraham A (2003) i-Miner: A Web Usage Mining Framework Using Hierarchi-
cal Intelligent Systems, The IEEE International Conference on Fuzzy Systems,
FUZZ-IEEE’03, IEEE Press, ISBN 0780378113, pp 1129–1134
4. Abraham A, Ramos V (2003) Web usage mining using artificial ant colony
clustering and genetic programming. In: 2003 IEEE Congress on Evolutionary
Computation (CEC 2003), Australia, IEEE Press, ISBN 0780378040, pp 1384–
1391
5. Abraham A (2003), EvoNF: A Framework for Optimization of Fuzzy Infer-
ence Systems Using Neural Network Learning and Evolutionary Computation,
The 17th IEEE International Symposium on Intelligent Control, ISIC’02, IEEE
Press, ISBN 0780376218, pp 327–332
6. Aouiti C, Alimi AM, Karray F, Maalej A (2005) The design of beta basis func-
tion neural network and beta fuzzy systems by a hierarchical genetic algorithm.
Fuzzy Sets Syst 154(2):251–274
7. Bhattacharya A, Abraham A, Vasant P, Grosan C (2007) Meta-learning evo-
lutionary artificial neural network for selecting flexible manufacturing systems
under disparate level-of-satisfaction of decision maker. Int J Innovative Comput
Inf Control 3(1):131–140
8. Baxter J (1992) The evolution of learning algorithms for artificial neural
networks, Complex systems, IOS, Amsterdam, pp 313–326
9. Bezdek JC (1981) Pattern recognition with fuzzy objective function algorithms.
Plenum, New York
10. Boers EJW, Borst MV, Sprinkhuizen-Kuyper IG (1995) Artificial neural nets
and genetic algorithms. In: Pearson DW et al. (eds.) Proceedings of the
international conference in Ales, France, Springer, Berlin Heidelberg New York,
pp 333–336
Engineering Evolutionary Intelligent Systems 19
11. Cai X, Zhang N, Venayagamoorthy GK, Wunsch II DC (2007) Time series pre-
diction with recurrent neural networks trained by a hybrid PSOEA algorithm.
Neurocomputing 70(13–15):2342–2353
12. Castillo PA, Merelo JJ, Arenas MG, Romero G (2007) Comparing evolutionary
hybrid systems for design and optimization of multilayer perceptron structure
along training parameters. Inf Sci 177(14):2884–2905
13. Capi G, Doya K (2005), Evolution of recurrent neural controllers using an
extended parallel genetic algorithm. Rob Auton Syst 52(2–3):148–159
14. Chen Y, Yang B, Dong J, Abraham A (2005) Time-series forecasting using
flexible neural tree model. Inf Sci 174(3–4):219–235
15. Chen Y, Yang B, Abraham A, Peng L (2007) Automatic design of hierarchical
takagi-sugeno fuzzy systems using evolutionary algorithms. IEEE Trans Fuzzy
Syst 15(3):385–397
16. Chu B, Kim D, Hong D, Park J, Chung JT, Chung JH, Kim TH (2008) GA-based
fuzzy controller design for tunnel ventilation systems, Journal of Automation in
Construction, 17(2):130–136
17. Delgado M, Pegalajar MC (2005) A multiobjective genetic algorithm for obtain-
ing the optimal size of a recurrent neural network for grammatical inference.
Pattern Recognit 38(9):1444–1456
18. Eberhart RC, Kennedy J (1995) A new optimizer using particle swarm theory.
In: Proceedings of the 6th International Symposium on Micro Machine and Human
Science, Nagoya, Japan, IEEE Service Center, Piscataway, NJ, pp 39–43
19. Edwards R, Abraham A, Petrovic-Lazarevic S (2005) Computational intelli-
gence to model the export behaviour of multinational corporation subsidiaries
in Malaysia. Int J Am Soc Inf Sci Technol (JASIST) 56(11):1177–1186
20. Fogel LJ, Owens AJ, Walsh MJ (1966) Artificial intelligence through simulated
evolution. Wiley, USA
21. Fontanari JF, Meir R (1991) Evolving a learning algorithm for the binary
perceptron, Network, vol. 2, pp 353–359
22. Franke C, Hoffmann F, Lepping J, Schwiegelshohn U (2008) Development of
scheduling strategies with Genetic Fuzzy systems, Applied Soft Computing
Journal, 8(1):706–721
23. Frean M (1990), The upstart algorithm: a method for constructing and training
feed forward neural networks. Neural Comput 2:198–209
24. Fullmer B, Miikkulainen R (1992) Using marker-based genetic encoding of neu-
ral networks to evolve finite-state behaviour. In: Varela FJ, Bourgine P (eds.)
Proceedings of the first European conference on artificial life, France, pp 255–262
25. García-Pedrajas N, Hervás-Martínez C, Muñoz-Pérez J (2002) Multi-objective
cooperative coevolution of artificial neural networks (multi-objective cooperative
networks). Neural Netw 15(10):1259–1278
26. Grau F (1992) Genetic synthesis of boolean neural networks with a cell rewriting
developmental process. In: Whitely D, Schaffer JD (eds.) Proceedings of
the international workshop on combinations of genetic algorithms and neural
Networks, IEEE Computer Society Press, CA, pp 55–74
27. Gonzalez A, Herrera F (1997) Multi-stage genetic fuzzy systems based on the
iterative rule learning approach. Mathware Soft Comput 4(3)
28. Grosan C, Abraham A, Nicoara M (2005) Search optimization using hybrid
particle sub-swarms and evolutionary algorithms. Int J Simul Syst, Sci Technol
UK 6(10–11):60–79
20 A. Abraham and C. Grosan
29. Gutierrez G, Isasi P, Molina JM, Sanchis A, Galvan IM (2001) Evolutionary cel-
lular configurations for designing feedforward neural network architectures, con-
nectionist models of neurons. In: Jose Mira et al. (eds.) Learning processes, and
artificial intelligence, Springer, Berlin Heidelberg New York, LNCS 2084, pp
514–521
30. Harp SA, Samad T, Guha A (1989) Towards the genetic synthesis of neural
networks. In: Schaffer JD (ed.) Proceedings of the third international conference
on genetic algorithms and their applications, Morgan Kaufmann, CA, pp 360–
369
31. Herrera F, Lozano M, Verdegay JL (1995) Tuning fuzzy logic controllers by
genetic algorithms. Int J Approximate Reasoning 12:299–315
32. Herrera F, Lozano M, Verdegay JL (1995) Tackling fuzzy genetic algorithms.
In: Winter G, Periaux J, Galan M, Cuesta P (eds.) Genetic algorithms in
engineering and computer science, Wiley, USA, pp 167–189
33. Hoffmann F (1999) The Role of Fuzzy Logic in Evolutionary Robotics. In:
Saffiotti A, Driankov D (ed.) Fuzzy logic techniques for autonomous vehicle
navigation, Springer, Berlin Heidelberg New York
34. Holland JH (1975) Adaptation in Natural and Artificial Systems, The University
of Michigan Press, Ann Arbor, MI
35. Holland JH, Reitman JS (1978), Cognitive systems based on adaptive al-
gorithms. In: Waterman DA, Hayes-Roth F (eds.) Pattern-directed inference
systems. Academic, San Diego, CA
36. Holland, JH (1980) Adaptive algorithms for discovering and using general
patterns in growing knowledge bases. Int J Policy Anal Inf Sys 4(3):245–268
37. Hui LY (2007) Evolutionary neural network modeling for forecasting the field
failure data of repairable systems. Expert Syst Appl 33(4):1090–1096
38. Ishibuchi H, Nojima Y (2007) Analysis of interpretability-accuracy tradeoff of
fuzzy systems by multiobjective fuzzy genetics-based machine learning. Int J
Approximate Reason 44(1):4–31
39. Juang CF, Chung IF (2007) Recurrent fuzzy network design using hybrid
evolutionary learning algorithms, Neurocomputing 70(16–18):3001–3010
40. Jin Y, von Seelen W (1999) Evaluating flexible fuzzy controllers via evolution
strategies. Fuzzy Sets Syst 108(3):243–252
41. Kennedy J, Eberhart RC (1995). Particle swarm optimization. In: Proceedings of
IEEE International Conference on Neural Networks, Perth, Australia, pp 1942–
1948
42. Kim KJ, Cho SB (2006) Evolved neural networks based on cellular automata
for sensory-motor controller. Neurocomputing 69(16–18):2193–2207
43. Kim KJ (2006) Artificial neural networks with evolutionary instance selection
for financial forecasting. Expert Syst Appl 30(3):519–526
44. Kitano H (1990) Designing neural networks using genetic algorithms with graph
generation system. Complex Syst 4(4):461–476
45. Koza JR (1992) Genetic programming: on the programming of computers by
means of natural selection, MIT, Cambridge, MA
46. Kosinski W (2007) Evolutionary algorithm determining defuzzyfication opera-
tors. Eng Appl Artif Intell 20(5):619–627
47. Kelesoglu O (2007) Fuzzy multiobjective optimization of truss-structures using
genetic algorithm. Adv Eng Softw 38(10):717–721
67. Storn R, Price K (1997) Differential evolution – a simple and efficient adap-
tive scheme for global optimization over continuous spaces. J Global Optim
11(4):341–359
68. Rechenberg I, (1973) Evolutions strategie: optimierung technischer Systeme
nach Prinzipien der biologischen Evolution, Fromman-Holzboog, Stuttgart
69. Schwefel HP (1977) Numerische Optimierung von Computermodellen mittels
der Evolutionsstrategie, Birkhaeuser, Basel
70. Serra GLO, Bottura CP (2006) Multiobjective evolution based fuzzy PI con-
troller design for nonlinear systems. Eng Appl Artif Intell 19(2):157–167
71. Smith SF (1980) A learning system based on genetic adaptive algorithms. PhD
thesis, University of Pittsburgh
72. Takagi T, Sugeno M (1983) Derivation of fuzzy logic control rules from human
operator's control actions. In: Proceedings of the IFAC symposium on fuzzy
information representation and decision analysis, pp 55–60
73. Tsang CH, Kwong S, Wang A (2007) Genetic-fuzzy rule mining approach
and evaluation of feature selection techniques for anomaly intrusion detection.
Pattern Recognit 40(9):2373–2391
74. Wang H, Kwong S, Jin Y, Wei W, Man K (2005) A multi-objective hierarchical
genetic algorithm for interpretable rule-based knowledge extraction. Fuzzy Sets
Syst 149(1):149–186
75. Wolpert DH, Macready WG (1997) No free lunch theorems for optimization.
IEEE Trans Evol Comput 1(1):67–82
76. Xu R, Wunsch D, (2005) Survey of clustering algorithms. IEEE Trans Neural
Netw 16(3):645–678
77. Yao X (1999) Evolving artificial neural networks. Proc IEEE 87(9):1423–1447
Genetically Optimized Hybrid Fuzzy Neural
Networks: Analysis and Design of Rule-based
Multi-layer Perceptron Architectures
1 Introductory remarks
Recently, a lot of attention has been devoted to advanced techniques of
modeling complex systems inherently associated with nonlinearity, high-
order dynamics, time-varying behavior, and imprecise measurements. It is
anticipated that efficient modeling techniques should allow for a selection
of pertinent variables and in this way help cope with the dimensionality of the
problem at hand. The models should be able to take advantage of the existing
S.-K. Oh and W. Pedrycz: Genetically Optimized Hybrid Fuzzy Neural Networks: Analysis
and Design of Rule-based Multi-layer Perceptron Architectures, Studies in Computational
Intelligence (SCI) 82, 23–57 (2008)
www.springerlink.com
c Springer-Verlag Berlin Heidelberg 2008
the PD of each node has a different type in comparison with zi of the 1st layer.
The “NOP” node states that the corresponding node of the current layer is the
same as the node positioned in the previous layer (NOP stands for “no
operation”). An arrow to the NOP node is used to show that the same node
moves from the previous layer to the current one.
We consider two kinds of FNNs (viz. FS FNN and FR FNN) based on two
types of fuzzy inference, namely simplified and linear fuzzy inference. The
structure of the FNN is the same as that used in the premise of the conventional
HFNN. The FNN is designed by using space partitioning realized in terms of
the individual input variables or an ensemble of all variables. Each of its
topologies is concerned with a granulation carried out in terms of fuzzy sets
defined in each input variable, or fuzzy relations that capture an ensemble of
input variables, respectively. The fuzzy partitions formed for each case lead us
to the topologies visualized in Figs. 4–5.
The notation in these figures requires some clarification. The “circles” denote
units of the FNN, while “N” identifies a normalization procedure applied
to the membership grades of the input variable xi. The output fi(xi) of the
“Σ” neuron is described by some nonlinear function fi; fi is not necessarily
the sigmoid function encountered in conventional neural networks, as we allow
for more flexibility in this regard. Finally, in the case of FS FNN, the output y
of the FNN is governed by the following expression:
y = f1(x1) + f2(x2) + ··· + fm(xm) = Σ_{i=1}^{m} fi(xi)   (1)
with m being the number of input variables (viz. the number of outputs fi of
the “Σ” neurons in the network). As previously mentioned, FS FNN is
affected by the introduced fuzzy partition of each input variable. In this sense,
we can regard each fi as being given by the fuzzy rules shown in Table 2(a).
Table 2(a) compares the fuzzy rules, inference result, and learning for the
two types of FNNs. In Table 2(a), Rj is the j-th fuzzy rule, while Aij
denotes a fuzzy variable of the premise of the corresponding fuzzy rule and
represents the membership function µij. In the simplified fuzzy inference, ωij is
a constant consequence of the rules; in the linear fuzzy inference, ωsij
is a constant consequence and ωij is an input-variable consequence of the
rules. They express a connection (weight) existing between the neurons, as we
have already visualized in Fig. 4. The mapping from xi to fi(xi) is determined
by the fuzzy inferences and a standard defuzzification. The inference result
for individual fuzzy rules follows a standard center-of-gravity aggregation. An
input signal xi activates only two membership functions, so inference results
can be written as outlined in Table 2(a) [23,24]. The learning of the FNN is
realized by adjusting the connections of the neurons and as such follows a
standard Back-Propagation (BP) algorithm [23,24]. The complete update
formulas are covered in Table 2(a), where η is a positive learning rate and α
is a positive momentum coefficient. The case of FR FNN, see Table 2(b), is
carried out in the same manner as outlined in Table 2(a) (the case of FS FNN).
The task of optimizing any model involves two main phases. First, a class
of some optimization algorithms has to be chosen so that it meets the re-
quirements implied by the problem at hand. Secondly, various parameters of
the optimization algorithm need to be tuned in order to achieve its best per-
formance. Along this line, genetic algorithms (GAs) viewed as optimization
techniques based on the principles of natural evolution are worth consider-
ing. GAs have been experimentally demonstrated to provide robust search
capabilities in problems involving complex spaces thus offering a valid solu-
tion to problems requiring efficient searching. It is instructive to highlight the
main features that tell GAs apart from other optimization methods: (1)
GAs operate on the codes of the variables, not on the variables themselves.
(2) GAs search for optimal points starting from a group (population) of points
in the search space (potential solutions), rather than from a single point. (3) GA's
Table 2. (Continued)

Learning (FS FNN):
  simplified:  ∆ωij = 2·η·(yp − ŷp)·µij + α(ωij(t) − ωij(t − 1))
  linear:      ∆ωsij = 2·η·(yp − ŷp)·µij(xi) + α(ωsij(t) − ωsij(t − 1))
               ∆ωij = 2·η·(yp − ŷp)·µij(xi)·xi + α(ωij(t) − ωij(t − 1))

Inference result (FR FNN):
  simplified:  y = Σ_{i=1}^{n} fi = Σ_{i=1}^{n} µ̄i·ωi = Σ_{i=1}^{n} µi·ωi / Σ_{i=1}^{n} µi
  linear:      y = Σ_{i=1}^{n} fi = Σ_{i=1}^{n} µ̄i·(ω0i + ω1i·x1 + ··· + ωki·xk)
                 = Σ_{i=1}^{n} µi·(ω0i + ω1i·x1 + ··· + ωki·xk) / Σ_{i=1}^{n} µi

Learning (FR FNN):
  simplified:  ∆ωi = 2·η·(y − ŷ)·µ̄i + α(ωi(t) − ωi(t − 1))
  linear:      ∆ω0i = 2·η·(y − ŷ)·µ̄i + α(ω0i(t) − ω0i(t − 1))
               ∆ωki = 2·η·(y − ŷ)·µ̄i·xk + α(ωki(t) − ωki(t − 1))
search is directed only by some fitness function whose form could be quite
complex; we do not require its differentiability [8–10].
In order to enhance the learning of the FNN, we use GAs to adjust the learning
rate, the momentum coefficient, and the parameters of the membership functions
of the antecedents of the rules [19,20,34,35].
As underlined, the PNN algorithm is based upon the GMDH method [22] and
utilizes a class of polynomials (linear, quadratic, modified quadratic,
etc.) to describe the basic processing realized there. By choosing the most
significant input variables and an order of the polynomial among the various
forms available, we can obtain the best one; it comes under the name of a
partial description (PD). The network is realized by selecting nodes at each
layer and eventually generating additional layers until the best performance
has been reached. Such a methodology leads to an optimal PNN structure [14,15].
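As a sketch of what a partial description is, the following hypothetical example fits a quadratic PD of two inputs by least squares, in the spirit of GMDH. The specific polynomial form and all function names are our assumptions for illustration, not the authors' code.

```python
import numpy as np

def quadratic_pd_features(x1, x2):
    """Feature map of a quadratic partial description (PD) of two inputs:
    z = c0 + c1*x1 + c2*x2 + c3*x1^2 + c4*x2^2 + c5*x1*x2."""
    return np.column_stack([np.ones_like(x1), x1, x2, x1**2, x2**2, x1 * x2])

def fit_pd(x1, x2, y):
    """Least-squares estimate of the PD coefficients (GMDH-style)."""
    A = quadratic_pd_features(x1, x2)
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def eval_pd(coef, x1, x2):
    """Evaluate a fitted PD on new inputs."""
    return quadratic_pd_features(x1, x2) @ coef
```

In a PNN, many such PDs would be fitted over different input pairs and orders, and the best-performing ones retained as nodes of the next layer.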
In addressing the problems with the conventional PNN (see Fig. 6), we
introduce a new genetic design approach; in turn, we will be referring to these
networks as genetically optimized PNNs (“gPNN”). When we construct the
PNs of each layer in the conventional PNN, such parameters as the number of
input variables (nodes), the order of the polynomial, and the input variables
available within a PN are fixed (selected) in advance by the designer. This has
frequently contributed to the difficulties in the design of the optimal network.
To overcome this apparent drawback, we resort to genetic optimization; see
Figs. 8–9 of the next section for a more detailed flow of the development
activities.
The overall genetically-driven structural optimization process of PNN is
shown in Fig. 7. The determination of the optimal values of the parame-
ters available within an individual PN (viz. the number of input variables,
Fig. 7. Overall genetically-driven structural optimization process of PNN (E: entire inputs, S: selected PNs, zi: preferred outputs in the ith stage, zi = z1i, z2i, ..., zWi)
the order of the polynomial, and input variables) leads to a structurally and
parametrically optimized network. As a result, this network is more flexible
and exhibits a simpler topology in comparison to the conventional PNN
discussed in the previous research [14,15].
For the optimization of the PNN model, the GA uses binary serial coding,
roulette-wheel selection, one-point crossover, and binary inversion
(complementation) as the mutation operator. To retain the best individual
and carry it over to the next generation, we use an elitist strategy [8,9].
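The genetic operators just listed can be sketched as follows. This is a generic illustration of a binary GA (maximizing an arbitrary fitness over bit strings), not the authors' code; the population size, mutation probability pm, and fitness function are placeholders.

```python
import random

def roulette_select(pop, fitness):
    """Roulette-wheel selection: pick an individual with probability
    proportional to its fitness."""
    total = sum(fitness)
    r = random.uniform(0.0, total)
    acc = 0.0
    for ind, f in zip(pop, fitness):
        acc += f
        if acc >= r:
            return ind
    return pop[-1]

def one_point_crossover(a, b):
    """One-point crossover: swap tails of the two parents at a random cut."""
    cut = random.randrange(1, len(a))
    return a[:cut] + b[cut:], b[:cut] + a[cut:]

def invert_mutation(bits, pm=0.05):
    """Binary inversion (complementation) of each gene with probability pm."""
    return [1 - g if random.random() < pm else g for g in bits]

def evolve(pop, fit_fn, generations=50):
    """Elitist GA loop: the best individual is copied unchanged into the
    next generation, the rest are bred by selection/crossover/mutation."""
    for _ in range(generations):
        fitness = [fit_fn(ind) for ind in pop]
        children = [max(pop, key=fit_fn)]      # elitist strategy
        while len(children) < len(pop):
            p1 = roulette_select(pop, fitness)
            p2 = roulette_select(pop, fitness)
            c1, c2 = one_point_crossover(p1, p2)
            children += [invert_mutation(c1), invert_mutation(c2)]
        pop = children[:len(pop)]
    return max(pop, key=fit_fn)
```

Because of elitism, the best fitness in the population is non-decreasing from one generation to the next.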
Figs. 8–9. Overall design flowchart of the gHFNN (S: selected PNs, zi: outputs of the ith layer, xj: input variables of the jth layer, j = i + 1). The flow starts with the FNN premise (normalization of activations and fuzzy inference producing the output of FS_FNN/FR_FNN), selects a connection point (CP1 or CP2), and then iterates the gPNN consequence: PNs are generated from chromosomes in the population, evaluated (fitness), and W of them are selected through reproduction (roulette-wheel selection, one-point crossover, invert mutation) with an elitist strategy; the outputs of the preserved PNs serve as new inputs to the next layer until a stop condition is met. The final gHFNN is organized by the FS_FNN and the layers with optimal PNs.
[Layer 1] Input layer: The role of this layer is to distribute the signals to the
nodes in the next layer.
[Layer 2] Computing activation degrees of linguistic labels: Each node in
this layer corresponds to one linguistic label (small, large, etc.) of the input
variables in layer 1. The layer determines a degree of satisfaction (activation)
of this label by the input.
[Layer 3] Normalization of a degree of activation (firing) of the rule: As
described, a degree of activation of each rule was calculated in layer 2. In this
layer, we normalize the activation level by using the following expression:
µ̄ij = µij / Σ_{j=1}^{n} µij = µik / (µik + µik+1)   (2)
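A minimal sketch of the layer-3 normalization of (2), assuming the membership grades of one input variable are supplied as a list (function name ours):

```python
def normalize_activations(mu):
    """Layer-3 normalization of eq. (2): each membership grade is divided by
    the sum over all fuzzy sets of the same input variable. When only two
    neighboring sets k and k+1 are activated, this reduces to
    mu_k / (mu_k + mu_{k+1})."""
    s = sum(mu)
    return [m / s for m in mu]
```

The normalized grades always sum to one, which is what allows the defuzzification in the next layer to act as a weighted average.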
Fig. 10. Connection points used for combining FS FNN (Simplified) with gPNN
(Figure: genetic design of a PN — the binary chromosome is decoded into decimal substrings and normalized to determine the number of input variables r (up to the allowed maximum), the order of the polynomial (Type 1–Type 3), and the selection of the input variables themselves (from 1 to n, or W).)
model is expanded. The outputs of the preserved nodes (z1, z2, ..., zW) serve
as new inputs to the next layer (x1, x2, ..., xW). Repeating steps 3–10 carries
out the construction of the gPNN.
5 Experimental studies
In this section, the performance of the gHFNN is illustrated with the aid
of some well-known and widely used datasets. In the first experiment, the
network is used to model a three-input nonlinear function [1,13,19,25,28,29]. In
the second simulation, a gHFNN is used to model the gas furnace time series
(Box–Jenkins data) [2–6,26,30–35]. Finally, we use the gHFNN for the NOx
emission process of a gas turbine power plant [27,29,32]. The performance
indexes (objective functions) used here are (9) for the three-input nonlinear
function and (8) for both the gas furnace process and the NOx emission process.
i) Mean Squared Error (MSE)

E(PI or EPI) = (1/n) Σ_{p=1}^{n} (yp − ŷp)²   (8)

E(PI or EPI) = (1/n) Σ_{p=1}^{n} |yp − ŷp| / yp × 100 (%)   (9)
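The two performance indexes are straightforward to compute; a small sketch assuming NumPy, with our own function names:

```python
import numpy as np

def mse_index(y, y_hat):
    """Eq. (8): E = (1/n) * sum_p (y_p - y_hat_p)^2."""
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    return float(np.mean((y - y_hat) ** 2))

def relative_error_index(y, y_hat):
    """Eq. (9): E = (1/n) * sum_p |y_p - y_hat_p| / y_p * 100 (%).
    Assumes all targets y_p are nonzero."""
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    return float(np.mean(np.abs(y - y_hat) / y) * 100.0)
```

Evaluated on the training set these give PI, and on the testing set EPI, matching the usage in the experiments.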
by the values of PI equal to 0.299 and EPI given as 0.555, whereas under
the condition of similar performance, the best results for the proposed
network related to the output node mentioned previously were reported as
PI = 0.299 and EPI = 0.517. In the sequel, the depth (the number of layers)
as well as the width (the number of nodes) of the proposed genetically
optimized HFNN (gHFNN) can be lower in comparison to the “conventional
HFNN” (which immensely contributes to the compactness of the resulting
network); refer to Fig. 13. In what follows, the genetic design procedure at
each stage (layer) of the HFNN leads to the selection of the preferred nodes (or PNs)
Fig. 13. Comparison of the proposed model architecture (gHFNN) and the
conventional model architecture (SOFPNN [19])
Fig. 14. Optimal topology of genetically optimized HFNN for the nonlinear function
(In case of using FS FNN)
with optimal local characteristics (such as the number of input variables, the
order of the polynomial, and input variables). In addition, when considering
linear fuzzy inference and CP2, the best results are reported in the form of
the performance index such as PI = 0.113 and EPI = 0.299 for layer 2, and
PI = 0.0092 and EPI = 0.056 for layer 5. Their optimal topologies are shown
in Figs. 14 (a) and (b). Figs. 15(a) and (b) depict the optimization process by
showing the values of the performance index in successive cycles of both BP
learning and genetic optimization when using each linear fuzzy inference-based
FNN. Noticeably, the variation ratio (slope) of the performance of the network
changes radically around the 1st and 2nd layer. Therefore, to effectively reduce
Fig. 15. Values of the performance index (PI, E_PI) in successive iterations of BP learning (premise part: FNN) and generations of genetic optimization (consequence part: gPNN), layer by layer: (a) in case of using FS FNN (linear) with Type I and Type II; (b) in case of using FR FNN (linear)
The NOx emission process is modeled using data coming from a GE gas
turbine power plant located in Virginia, USA. The input variables include
AT (ambient temperature at site), CS (compressor speed), LPTS (low pressure
turbine speed), CDP (compressor discharge pressure), and TET (turbine
exhaust temperature). The output variable is NOx [27,29,32]. The performance
index is defined by (8). We consider 260 pairs of the original input–output
data; 130 out of the 260 pairs are used as the learning set, and the remaining
part serves as a
testing set. Using the NOx emission process data, the regression model is

y = −163.77341 − 0.06709 x1 + 0.00322 x2 + 0.00235 x3 + 0.26365 x4 + 0.20893 x5   (11)

and it comes with PI = 17.68 and EPI = 19.23. We will be using these results
as a reference point when discussing the gHFNN models. Table 10 summarizes
Table 10. Summary of the parameters of the optimization environment
(a) In case of using FS FNN
Generation 150
Population size 100
GAs Elite population size 50
Premise structure 10 (per one variable)
String length Consequence CP 1 3 + 3 + 70
structure CP 2 3 + 3 + 35
No. of entire system inputs 5
Learning iteration 1000
Premise Simplified 0.052
Learning rate tuned
(FNN) Linear 0.034
Momentum Simplified 0.010
Coefficient tuned Linear 0.001
gHFNN No. of rules 10(2 + 2 + 2 + 2 + 2)
CP 1 10
No. of system inputs
CP 2 5
Consequence Maximal layer 5
(gPNN) No. of inputs to be CP 1 1 ≤ N ≤ 10 (Max)
selected (N) CP 2 1 ≤ N ≤ 5 (Max)
Type (T) 1≤T≤3
N, T : integer
(b) In case of using FR FNN and Type II
Generation 150
Population size 100
GAs Elite population size (W) 50
Premise structure (FNN) 10 (per one variable)
String length
Consequence structure (PNN) 3 + 3 + 105
No. of entire system inputs 5
Learning iteration 1000
Premise Simplified 0.568
Learning rate tuned
Linear 0.651
Momentum Simplified 0.044
Coefficient tuned Linear 0.064
gHFNN No. of rules 32(2 × 2 × 2 × 2 × 2)
No. of entire inputs 32
Consequence Maximal layer 5
(gPNN) No. of inputs to be selected (N) 1 ≤ N ≤ 15 (Max)
Type (T) 1≤T≤3
N, T : integer
Fig. 18. Optimal procedure of gHFNN with FS FNN (linear) by BP and GAs
1 with Type 1 (linear function) and 9 nodes at the input; this network comes
with the value of PI equal to 0.0045 and EPI set as 0.026. In case of using
FR FNN and simplified fuzzy inference, the most preferred network architecture
has been reported with PI = 0.041 and EPI = 0.111. As shown in Table 11 and
Fig. 18, the variation ratio (slope) of the performance of the network changes
radically at the 2nd layer. Therefore, to effectively reduce a large number of
nodes and avoid a large amount of time-consuming iterations of the gHFNN,
the stopping criterion can be taken into consideration up to maximally the 2nd
layer.
Table 12 covers a comparative analysis including several previous fuzzy-
neural network models. The experimental results clearly reveal that the
proposed approach and the resulting model outperform the existing networks
in terms of both approximation and generalization capabilities.
6 Concluding remarks
7 Acknowledgement
This work has been supported by KESRI (I-2004-0-074-0-00), which is funded
by MOCIE (Ministry of Commerce, Industry and Energy).
References
1. Kang G, Sugeno M (1987) Fuzzy modeling. Trans Soc Instrum Control Eng
23(6):106–108
2. Oh SK, Pedrycz W (2000) Fuzzy identification by means of auto-tuning algo-
rithm and its application to nonlinear systems. Fuzzy Sets Syst 115(2):205–230
3. Park BJ, Pedrycz W, Oh SK (2001) Identification of fuzzy models with the aid of
evolutionary data granulation. IEE Proc-Control Theory Appl 148(5):406–418
4. Oh SK, Pedrycz W, Park BJ (2002) Hybrid identification of fuzzy rule-based
models. Int J Intell Syst 17(1):77–103
5. Park BJ, Oh SK, Ahn TC, Kim HK (1999) Optimization of fuzzy systems by
means of GA and weighting factor. Trans Korean Inst Electr Eng 48A(6):789–
799 (In Korean)
6. Oh SK, Park CS, Park BJ (1999) On-line modeling of nonlinear process sys-
tems using the adaptive fuzzy-neural networks. Trans Korean Inst Electr Eng
48A(10):1293–1302 (In Korean)
7. Narendra KS, Parthasarathy K (1991) Gradient methods for the optimization
of dynamical systems containing neural networks. IEEE Trans Neural Netw
2:252–262
8. Goldberg DE (1989) Genetic algorithms in search, optimization & machine
learning. Addison-wesley, Reading
9. Michalewicz Z (1996) Genetic algorithms + data structures = evolution
programs. Springer, Berlin Heidelberg New York
10. Holland JH (1975) Adaptation in natural and artificial systems. The University
of Michigan Press, Ann Arbor
11. Pedrycz W, Peters JF (1998) Computational intelligence and software engineer-
ing. World Scientific, Singapore
12. Computational intelligence by programming focused on fuzzy neural networks
and genetic algorithms. Naeha, Seoul (In Korean)
13. Horikawa S, Furuhashi T, Uchikawa Y (1992) On fuzzy modeling using fuzzy
neural networks with the back propagation algorithm. IEEE Trans Neural Netw
3(5):801–806
14. Oh SK, Pedrycz W (2002) The design of self-organizing polynomial neural
networks. Inf Sci 141(3–4):237–258
15. Oh SK, Pedrycz W, Park BJ (2003) Polynomial neural networks architecture:
Analysis and Design. Comput Electr Eng 29(6):653–725
16. Ohtani T, Ichihashi H, Miyoshi T, Nagasaka K (1998) Orthogonal and successive
projection methods for the learning of neurofuzzy GMDH. Inf Sci 110:5–24
17. Ohtani T, Ichihashi H, Miyoshi T, Nagasaka K (1998) Structural learning
with M-Apoptosis in neurofuzzy GMHD. In: Proceedings of the 7th IEEE
International Conference on Fuzzy Systems:1265–1270
18. Ichihashi H, Nagasaka K (1994) Differential minimum bias criterion for neuro-
fuzzy GMDH. In: Proceedings of 3rd International Conference on Fuzzy Logic
Neural Nets and Soft Computing IIZUKA’94:171–172
19. Park BJ, Pedrycz W, Oh SK (2002) Fuzzy polynomial neural networks: hybrid
architectures of fuzzy modeling. IEEE Trans Fuzzy Syst 10(5):607–621
20. Oh SK, Pedrycz W, Park BJ (2003) Self-organizing neurofuzzy networks based
on evolutionary fuzzy granulation. IEEE Trans Syst Man and Cybern A
33(2):271–277
21. Cordon O et al. (2004) Ten years of genetic fuzzy systems: current framework
and new trends. Fuzzy Sets Syst 141(1):5–31
22. Ivahnenko AG (1968) The group method of data handling: a rival of method of
stochastic approximation. Sov Autom Control 13(3):43–55
23. Yamakawa T (1993) A new effective learning algorithm for a neo fuzzy neuron
model. 5th IFSA World Conference:1017–1020
24. Oh SK, Yoon KC, Kim HK (2000) The design of optimal fuzzy-neural networks
structure by means of GA and an aggregate weighted performance index. J
Control Autom Syst Eng 6(3):273–283 (In Korean)
25. Park MY, Choi HS (1990) Fuzzy control system. Daeyoungsa, Seoul (In Korean)
26. Box GEP, Jenkins GM (1976) Time series analysis, forecasting, and control,
2nd edn. Holden-Day, San Francisco
27. Ahn TC, Oh SK (1997) Intelligent models concerning the pattern of an air
pollutant emission in a thermal power plant, Final Report, EESRI
28. Kondo T (1986) Revised GMDH algorithm estimating degree of the complete
polynomial. Trans Soc Instrum Control Eng 22(9):928–934
29. Park HS, Oh SK (2003) Multi-FNN identification based on HCM clustering and
evolutionary fuzzy granulation. Int J Control, Autom Syst 1(2):194–202
30. Kim E, Lee H, Park M, Park M (1998) A simply identified sugeno-type fuzzy
model via double clustering. Inf Sci 110:25–39
31. Lin Y, Cunningham III GA (1997) A new approach to fuzzy-neural modeling,
IEEE Trans Fuzzy Syst 3(2):190–197
32. Oh SK, Pedrycz W, Park HS (2003) Hybrid identification in fuzzy-neural
networks. Fuzzy Sets Syst 138(2):399–426
33. Park HS, Oh SK (2000) Multi-FNN identification by means of HCM cluster-
ing and its optimization using genetic algorithms. J Fuzzy Logic Intell Syst
10(5):487–496 (In Korean)
34. Park BJ, Oh SK, Jang SW (2002) The design of adaptive fuzzy polynomial
neural networks architectures based on fuzzy neural networks and self-organizing
networks. J Control Autom Syst Eng 8(2):126–135 (In Korean)
35. Park BJ, Oh SK (2002) The analysis and design of advanced neurofuzzy
polynomial networks. J Inst Electron Eng Korea 39-CI(3):18–31 (In Korean)
36. Park BJ, Oh SK, Pedrycz W, Kim HK (2005) Design of evolutionally optimized
rule-based fuzzy neural networks on fuzzy relation and evolutionary optimiza-
tion. International Conference on Computational Science. Lecture Notes in
Computer Science 3516:1100–1103
37. Oh SK, Park BJ, Pedrycz W, Kim HK (2005) Evolutionally optimized fuzzy
neural networks based on evolutionary fuzzy granulation. Lecture Notes in
Computer Science 3483:887–895
38. Oh SK, Park BJ, Pedrycz W, Kim HK (2005) Genetically optimized hybrid
fuzzy neural networks in modeling software data. Lecture Notes in Artificial
Intelligence 3558:338–345
Genetically Optimized Hybrid Fuzzy Neural Networks 57
1 Introduction
Recently, a great deal of attention has been directed toward advanced techniques
of complex system modeling. The challenging quest for constructing models
of systems that offer significant approximation and generalization
abilities while remaining easy to comprehend has occupied the community for
decades. While neural networks, fuzzy sets, and evolutionary computing, as the
technologies of Computational Intelligence (CI), have expanded and enriched
the field of modeling quite immensely, they have also given rise to a number
of new methodological issues and increased our awareness of the tradeoffs
one has to make in system modeling [1–4]. The most successful approaches
to hybridizing fuzzy systems with learning and adaptation have been made in
the realm of CI. In particular, neural fuzzy systems and genetic fuzzy systems
hybridize the approximate inference method of fuzzy systems with the learning
capabilities of neural networks and evolutionary algorithms [5]. When the
dimensionality of the model goes up (say, the number of variables increases),
so do the difficulties. Fuzzy sets emphasize the transparency of the
models and the role of a model designer, whose prior knowledge about the system
may be very helpful in facilitating all identification pursuits. On the other
hand, building models of substantial approximation capabilities calls
for advanced tools. The art of modeling is to reconcile these two tendencies
and find a workable and efficient synergistic environment. It is also
worth stressing that in many cases the nonlinear form of the model acts as a
double-edged sword: while we gain flexibility to cope with experimental data, we
are confronted with an abundance of nonlinear dependencies that need to be
exploited in a systematic manner. In particular, when dealing with high-order
nonlinear and multivariable model equations, we require a vast amount
of data for estimating all their parameters [1–2].
To help alleviate these problems, one of the first approaches along the line
of a systematic design of nonlinear relationships between a system's inputs
and outputs comes under the name of the Group Method of Data Handling
(GMDH). GMDH was developed in the late 1960s by Ivakhnenko [6–9] as a
vehicle for identifying nonlinear relations between input and output variables.
While providing a systematic design procedure, GMDH comes with some
drawbacks. First, it tends to generate quite complex polynomials even for rela-
tively simple systems (experimental data). Second, owing to its limited generic
structure (quadratic two-variable polynomials), GMDH also tends to
produce an overly complex network (model) when it comes to highly nonlinear
systems. Third, if there are fewer than three input variables, the GMDH algorithm
does not generate a highly versatile structure. To alleviate the problems as-
sociated with GMDH, Self-Organizing Neural Networks (SONN) (viz.
polynomial neuron (PN)-based SONN and fuzzy polynomial neuron (FPN)-
based SONN, also called SOPNN/FPNN) were introduced by Oh and Pedrycz
[10–13] as a new category of neural networks or neuro-fuzzy networks. In a
nutshell, these networks come with a high level of flexibility associated with
Genetically Optimized Self-organizing Neural Networks 61
each node: a processing element forming a Partial Description (PD), viz. a
Polynomial Neuron (PN) or a Fuzzy Polynomial Neuron (FPN), can have a different
number of input variables as well as exploit a different order of the polynomial
(say, linear, quadratic, cubic, etc.). In comparison to well-known neural networks
or neuro-fuzzy networks, whose topologies are commonly selected and
kept fixed prior to all detailed (parametric) learning, the SONN architecture is not
fixed in advance but is generated and optimized in a dynamic way. As
a consequence, the SONNs show a superb performance in comparison to the
previously presented intelligent models. Although the SONN has a flexible ar-
chitecture whose potential can be fully utilized through a systematic design,
it is difficult to obtain the structurally and parametrically optimized network
because of the limited design of the nodes (viz. PNs or FPNs) located in each
layer of the SONN. In other words, when we construct nodes of each layer
in the conventional SONN, such parameters as the number of input variables
(nodes), the order of the polynomial, and the input variables available within
a node (viz. PN or FPN) are fixed (selected) in advance by the designer.
Accordingly, the SONN algorithm exhibits some tendency to produce overly
complex networks, as well as a repetitive computational load caused by the
trial-and-error method and/or repetitive parameter adjustment by the designer,
as in the original GMDH algorithm. In order to generate a structurally and
parametrically optimized network, such parameters need to be optimized.
In this study, in addressing the above problems with the conventional
SONN as well as the GMDH algorithm, we introduce a new genetic design
approach; as a consequence we will be referring to these networks as geneti-
cally optimized SONN (gSONN). The determination of the optimal values of
the parameters available within an individual PN or FPN (viz. the number
of input variables, the order of the polynomial, and a collection of preferred
nodes) leads to a structurally and parametrically optimized network. As a
result, this network is more flexible as well as exhibits simpler topology in
comparison to the conventional SONN discussed in the previous research. Let
us reiterate that the objective of this study is to develop a general design
methodology of gSONN modeling, come up with a logic-based structure of
such model and propose a comprehensive evolutionary development environ-
ment in which the optimization of the models can be efficiently carried out
both at the structural as well as parametric level [14].
This chapter is organized in the following manner. First, Section 2 gives
a brief introduction to the architecture and development of the SONNs. Sec-
tion 3 introduces the genetic optimization used in SONN. The genetic design of
the SONN comes with an overall description of a detailed design methodology
of SONN based on genetically optimized multi-layer perceptron architecture
in Section 4. In Section 5, we report on a comprehensive set of experiments.
Finally, concluding remarks are covered in Section 6. To evaluate the perfor-
mance of the proposed model, we exploit two well-known time series data
sets [10, 11, 13, 15, 16, 20–41]. Furthermore, the network is directly contrasted with
several existing neurofuzzy models reported in the literature.
62 S.-K. Oh and W. Pedrycz
As underlined, the SONN algorithm is based on the GMDH method and utilizes
a class of polynomials (linear, quadratic, modified quadratic, etc.)
to describe the basic processing realized there. By choosing the most significant
input variables and an order of the polynomial among the various forms
available, we obtain the best one; it comes under the name of a partial
description (PD). The network is realized by selecting nodes at each layer and eventu-
ally generating additional layers until the best performance has been reached.
Such a methodology leads to an optimal SONN structure. Let us recall that the
input-output data are given in the form
y = f (x1 , x2 , · · · , xN ). (2)
y = c0 + Σ_{i=1}^{N} ci xi + Σ_{i=1}^{N} Σ_{j=1}^{N} cij xi xj + Σ_{i=1}^{N} Σ_{j=1}^{N} Σ_{k=1}^{N} cijk xi xj xk + · · ·   (3)
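Each node of a GMDH-type network realizes a low-order fragment of the series (3) over a pair of variables, fitted by least squares. A minimal sketch of fitting and evaluating one quadratic two-variable partial description (the function names are ours, not the chapter's):

```python
import numpy as np

def fit_partial_description(zp, zq, y):
    """Least-squares fit of the quadratic partial description
    z = c0 + c1*zp + c2*zq + c3*zp^2 + c4*zq^2 + c5*zp*zq."""
    X = np.column_stack([np.ones_like(zp), zp, zq, zp**2, zq**2, zp * zq])
    c, *_ = np.linalg.lstsq(X, y, rcond=None)
    return c

def eval_partial_description(c, zp, zq):
    """Evaluate a fitted partial description on (possibly new) inputs."""
    return c[0] + c[1]*zp + c[2]*zq + c[3]*zp**2 + c[4]*zq**2 + c[5]*zp*zq
```

Stacking such fits layer by layer, with each layer's best outputs feeding the next, is exactly the growth scheme described below.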
[Figure: overall topology of the PN-based SONN. The input variables x1–x4 feed PN (partial description) nodes in the 1st layer; outputs of selected nodes feed the 2nd and higher layers, terminating in the model output ŷ. Each PN combines two inputs zp and zq through a partial description of a selected polynomial order, e.g. z = C0 + C1 zp + C2 zq + C3 zp² + C4 zq² + C5 zp zq.]
[Figure: the inputs xp and xq are processed by the fuzzy sets {Al} and {Bk} in the fuzzy-set-based part F, producing rule activation levels μ1, . . . , μK; the normalized levels μ̂1, . . . , μ̂K weight the polynomial mappings P1, . . . , PK of part P, whose weighted sum yields the node output z.]
Fig. 2. A general topology of the generic FPN module (F: fuzzy set-based processing
part, P: the polynomial form of mapping)
where alk is a vector of the parameters of the conclusion part of the rule while
Plk (xi , xj , alk ) denotes the regression polynomial forming the consequence
part of the fuzzy rule which uses several types of high-order polynomials (lin-
ear, quadratic, and modified quadratic) besides the constant function forming
the simplest version of the consequence; refer to Table 2.
The types of the polynomial read as follows
• Bilinear = c0 + c1 x1 + c2 x2
• Biquadratic-1 = Bilinear + c3 x1² + c4 x2² + c5 x1 x2
• Biquadratic-2 = Bilinear + c3 x1 x2
• Trilinear = c0 + c1 x1 + c2 x2 + c3 x3
• Triquadratic-1 = Trilinear + c4 x1² + c5 x2² + c6 x3² + c7 x1 x2 + c8 x1 x3 + c9 x2 x3
• Triquadratic-2 = Trilinear + c4 x1 x2 + c5 x1 x3 + c6 x2 x3
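The six forms above differ only in which monomials they include, which makes them easy to generate programmatically. A sketch (our own helper, not from the chapter) that builds the regressor columns for the (bi/tri)linear form and the two quadratic variants:

```python
import numpy as np

def consequent_terms(x, form="linear"):
    """Regressor columns for the consequent polynomial of a fuzzy rule.
    x: (n_samples, n_vars) array with n_vars = 2 (bi-) or 3 (tri-).
    form: 'linear', 'quadratic-1' (squares + cross terms),
          or 'quadratic-2' (cross terms only, the 'modified quadratic')."""
    n = x.shape[1]
    cols = [np.ones(len(x))] + [x[:, i] for i in range(n)]   # constant + linear
    if form == "quadratic-1":
        cols += [x[:, i] ** 2 for i in range(n)]             # squared terms
    if form in ("quadratic-1", "quadratic-2"):
        cols += [x[:, i] * x[:, j]                           # cross terms
                 for i in range(n) for j in range(i + 1, n)]
    return np.column_stack(cols)
```

With two inputs this yields 3, 6, and 4 columns for the three forms (Bilinear, Biquadratic-1, Biquadratic-2), and with three inputs 4, 10, and 7 columns, matching the coefficient counts listed above.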
Alluding to the input variables of the FPN, especially a way in which they
interact with the two functional blocks shown there, we use the notation FPN
(xp , xq ; xi , xj ) to explicitly point at the variables. The processing of the FPN
is governed by the following expressions, which are in line with the rule-based
computing reported in the literature [15, 16].
The activation of the rule “K ” is computed as an and-combination of
the activations of the fuzzy sets occurring in the rule. This combination of
the subconditions is realized through any t-norm. In particular, we consider
the minimum and product operations as two widely used models of the logic
connectives. Subsequently, denote the resulting activation level of the rule
by µK . The activation levels of the rules contribute to the output of the
FPN being computed as a weighted average of the individual conclusion parts
(functional transformations) PK (note that the index of the rule, namely "K"
is a shorthand notation for the two indexes of fuzzy sets used in the rule (4),
that is K = (l, k)).
z = Σ_{K=1}^{all rules} μK PK(xi, xj, aK) / Σ_{K=1}^{all rules} μK = Σ_{K=1}^{all rules} μ̂K PK(xi, xj, aK)   (5)
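The rule firing and aggregation of (5) amount to a t-norm AND over the sub-conditions followed by a normalized weighted average. A compact sketch (the names are ours), assuming the membership degrees and consequent polynomial values have already been evaluated:

```python
import numpy as np

def rule_activation(mu_a, mu_b, tnorm="product"):
    """AND-combination of the two sub-conditions via a t-norm;
    minimum and product are the two logic connectives considered."""
    return min(mu_a, mu_b) if tnorm == "min" else mu_a * mu_b

def fpn_output(mu, p_vals):
    """Eq. (5): z = sum_K mu_K P_K / sum_K mu_K = sum_K muhat_K P_K."""
    mu = np.asarray(mu, dtype=float)
    muhat = mu / mu.sum()                     # normalized activation levels
    return float(np.dot(muhat, np.asarray(p_vals, dtype=float)))
```

The normalization makes the node output a convex combination of the rule consequents, which is what keeps the FPN output bounded by the consequent values.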
and the order of the polynomial forming the conclusion part of the rules as
well as a collection of the specific subset of input variables. The consequence
part can be expressed by linear, quadratic, or modified quadratic polynomial
equation as mentioned previously. Especially for the consequence part, we
consider two kinds of input vector formats in the conclusion part of the fuzzy
rules of the 1st layer, namely i) selected inputs and ii) entire system inputs,
see Table 3.
i) The input variables of the consequence part of the fuzzy rules are the same
as the input variables of the premise part.
ii) The input variables of the consequence part of the fuzzy rules in a node
of the 1st layer are the same as the entire system input variables, while the input
variables of the consequence part of the fuzzy rules in a node of the 2nd layer
or higher are the same as the input variables of the premise part.
Here the notation is as follows. A: vector of the selected input variables (x1, x2, · · · , xi);
B: vector of the entire system input variables (x1, x2, · · · , xi, xj, · · · );
Type T: f(A) = f(x1, x2, · · · , xi), the type of polynomial function standing in the
consequence part of the fuzzy rules; Type T*: f(B) = f(x1, x2, · · · , xi, xj, · · · ),
the type of polynomial function occurring in the consequence part of the fuzzy
rules.
is directed only by some fitness function whose form could be quite complex;
we do not require it to be differentiable.
In this study, for the optimization of the SONN model, the GA uses a serial
binary encoding, roulette-wheel selection, one-point
crossover, and a binary inversion (complementation)
operation as the mutation operator. To retain the best individual and carry it
over to the next generation, we use the elitist strategy [19]. The overall genetically
driven structural optimization process of the SONN is shown in Figs. 5–6.
As mentioned, when we construct PNs or FPNs of each layer in the con-
ventional SONN, such parameters as the number of input variables (nodes),
the SONN. Next, the testing data set is used to evaluate the quality of the
network.
[Step 3] Decide initial information for constructing the SONN structure.
We decide upon the design parameters of the SONN structure, which
include:
a) According to the stopping criterion, two termination methods are ex-
ploited:
– Criterion level for comparison of a minimal identification error of the
current layer with that occurring at the previous layer of the network.
– The maximum number of layers (predetermined by the designer) with
an intent to achieve a sound balance between model accuracy and its
complexity.
b) The maximum number of input variables coming to each node in the
corresponding layer.
c) The total number W of nodes to be retained (selected) at the next
generation of the SONN algorithm.
d) The depth of the SONN to be selected to reduce a conflict between
overfitting and generalization abilities of the developed SONN.
e) The depth and width of the SONN to be selected as a result of a tradeoff
between accuracy and complexity of the overall model.
In addition, in the case of the FPN-based SONN, parameters related to the
following item are considered besides those mentioned above.
f) The initial information for the fuzzy inference method and fuzzy
identification:
– Fuzzy inference method
– MF type: Triangular or Gaussian-like MF
– Number of MFs per input of a node (FPN)
– Structure of the consequence part of fuzzy rules
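Gathered in one place, the Step 3 design decisions can be viewed as a configuration record. The concrete values below are illustrative choices of ours, not settings prescribed by the chapter:

```python
# Illustrative Step-3 configuration for an FPN-based SONN;
# every concrete value here is an example, not a recommendation.
sonn_design = {
    # a) stopping criterion: error-improvement threshold and layer cap
    "error_criterion_level": 1e-3,
    "max_layers": 5,
    # b) maximal number of inputs entering each node
    "max_inputs_per_node": 3,
    # c) number W of nodes retained for the next layer
    "retained_nodes_W": 30,
    # f) FPN-specific fuzzy-inference choices
    "fuzzy_inference": "regression polynomial",
    "mf_type": "triangular",            # or "gaussian-like"
    "mfs_per_input": 2,
    "consequent_inputs": "selected",    # or "entire system inputs"
}
```

Items a) through c) govern the structural search, while the f) group only applies to FPN nodes.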
[Step 4] Decide a structure of the PN or FPN based SONN using genetic
design.
This concerns the selection of the number of input variables, the polynomial
order, and the input variables to be assigned in each node of the corresponding
layer. These important decisions are carried out through an extensive genetic
optimization.
When it comes to the organization of the chromosome representing (map-
ping) the structure of the SONN, we divide the chromosome to be used for
genetic optimization into three sub-chromosomes as shown in Figs. 8–9. The
1st sub-chromosome contains the number of input variables, the 2nd sub-
chromosome involves the order of the polynomial of the node, and the 3rd
sub-chromosome (remaining bits) contains input variables coming to the cor-
responding node (PN or FPN). All these elements are optimized when running
the GA.
In nodes (PN or FPNs) of each layer of SONN, we adhere to the nota-
tion of Fig. 10. ‘PNn’ or ‘FPNn’ denotes the nth node (PN or FPN) of the
Fig. 9. The FPN design used in the SONN architecture - structural considerations
and mapping the structure on a chromosome
are then decoded into a decimal value. Here, bit(1), bit(2), and bit(3) denote
the positions of these three bits, each taking the value "0" or "1".
Sub-step 3) The above decimal value is normalized and rounded off according to
γ = (β/α) × (Max − 1) + 1   (8)
where Max denotes the maximal number of input variables entering the
corresponding node (PN or FPN), β is the decimal value decoded from the
sub-chromosome, and α is the decoded decimal value corresponding to the situation
when all bits of the 1st sub-chromosome are set to 1.
Sub-step 4) The normalized integer value is then treated as the number of
input variables (or input nodes) coming to the corresponding node.
Evidently, the maximal number (Max) of input variables is equal to or less
than the number of all system’s input variables (x1 , x2 , · · · , xn ) coming to the
1st layer, that is, Max ≤ n.
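Sub-steps 1 through 4 reduce to decoding a bit string and normalizing it via (8). A sketch (the helper name is ours):

```python
def decode_normalized(bits, max_value):
    """Decode a sub-chromosome and normalize per Eq. (8):
    gamma = (beta/alpha) * (Max - 1) + 1, rounded off, so the
    result always lies in 1..Max.  bits: sequence of 0/1."""
    beta = int("".join(str(b) for b in bits), 2)   # decoded decimal value
    alpha = 2 ** len(bits) - 1                     # all bits set to 1
    gamma = (beta / alpha) * (max_value - 1) + 1
    return round(gamma)
```

The same routine serves both the 1st sub-chromosome (number of inputs, Max ≤ n) and the 2nd sub-chromosome (polynomial order, with Max replaced by 3 or 4 as described next).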
[Step 4-2] Selection of the order of polynomial (2nd sub-chromosome)
Sub-step 1) The 3 bits of the 2nd sub-chromosome are assigned to the binary
bits for the selection of the order of polynomial.
Sub-step 2) The 3 bits randomly selected using (7) are decoded into a
decimal format.
Sub-step 3) The decimal value obtained by means of (8) is normalized and
rounded off. The value of Max is replaced with 3 (in the case of a PN) or 4 (in the case
of an FPN); refer to Tables 1 and 2.
[Figure: membership functions and their parameters. (a) Triangular MF: the "Small" function is
μS(x) = 1 if x ≤ xmin, (xmax − x)/(xmax − xmin) if xmin < x < xmax, 0 if x ≥ xmax,
and the "Big" function is its complement, μB(x) = 1 − μS(x). (b) Gaussian-like MF, defined analogously over [xmin, xmax].]
Fig. 11. Triangular and Gaussian-like membership functions and their parameters
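The triangular pair of Fig. 11(a) can be written directly from the piecewise definition; the Gaussian-like parametrization below is our own assumption, since the chapter only sketches that class graphically:

```python
import math

def mf_small(x, x_min, x_max):
    """Triangular 'Small' MF of Fig. 11(a): 1 below x_min, 0 above x_max,
    linearly decreasing on (x_min, x_max)."""
    if x <= x_min:
        return 1.0
    if x >= x_max:
        return 0.0
    return (x_max - x) / (x_max - x_min)

def mf_big(x, x_min, x_max):
    """Complementary 'Big' MF: mu_B(x) = 1 - mu_S(x)."""
    return 1.0 - mf_small(x, x_min, x_max)

def mf_gaussian_like(x, c, sigma):
    """A Gaussian-like MF with center c and spread sigma (our parametrization)."""
    return math.exp(-0.5 * ((x - c) / sigma) ** 2)
```

The Gaussian-like class never reaches zero, which is the "infinite support" property the chapter later points to as its advantage.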
with the following notation: i: node number; k: data number; ntr: number
of data in the training subset; n: number of the selected input variables; m:
maximum order; n′: number of estimated coefficients.
[Step 5-2] In case of an FPN
At this step, the regression polynomial inference is considered. The inference
method deals with regression polynomial functions viewed as the consequents
of the rules. Regression polynomials (polynomial and in the very specific case,
a constant value) standing in the conclusion part of fuzzy rules are given as
different types of Type 1, 2, 3, or 4, see Table 2. In the fuzzy inference, we
consider two types of membership functions, namely triangular and Gaussian-
like membership functions.
The regression fuzzy inference (reasoning scheme) is envisioned as follows: the
consequence part can be expressed by a linear, quadratic, or modified quadratic
polynomial equation as shown in Table 2. The use of the regression polynomial
inference method gives rise to the expression
where n is the number of fuzzy rules, y is the inferred value, μi is the
premise fitness of Ri, and μ̂i is the normalized premise fitness derived from μi.
Here we consider a regression polynomial function of the input variables. The
consequence parameters are produced by the standard least-squares method,
that is,
â = (Xj^T Xj)^(−1) Xj^T Y   (14)

where

aj^T = [a10j, · · · , an0j, a11j, · · · , an1j, · · · , a1kj, · · · , ankj],
Xj = [x1j, x2j, · · · , xij, · · · , xmj]^T,
xlj^T = [μ̂lj1, · · · , μ̂ljn, xlj1 μ̂lj1, · · · , xlj1 μ̂ljn, · · · , xljk μ̂lj1, · · · , xljk μ̂ljn],
Y = [y1, · · · , ym]^T
j is the node number, m denotes the total number of data points, and k stands for
the number of input variables. Subsequently, l is the data index while n is
the number of fuzzy rules.
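The standard least-squares step of (14), together with the row layout of Xj, can be sketched as follows (our helper names; the normalized activations μ̂ are assumed precomputed):

```python
import numpy as np

def design_row(muhat, x):
    """One row x_l^T of Xj: [muhat_1..muhat_n,
    x_1*muhat_1..x_1*muhat_n, ..., x_k*muhat_1..x_k*muhat_n]."""
    row = list(muhat)
    for xv in x:
        row.extend(xv * m for m in muhat)
    return np.array(row)

def estimate_consequents(Xj, Y):
    """Eq. (14): a_hat = (Xj^T Xj)^{-1} Xj^T Y, computed via a linear solve
    rather than an explicit matrix inverse."""
    return np.linalg.solve(Xj.T @ Xj, Xj.T @ Y)
```

Using a linear solve instead of forming the inverse is the usual numerically safer realization of the same formula.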
The procedure described above is implemented iteratively for all nodes of
the layer and also for all layers of SONN; we start from the input layer and
move towards the output layer.
[Step 6] Select nodes (PNs or FPNs) with the best predictive capability and
construct their corresponding layer.
Fig. 12 depicts the genetic optimization procedure for the generation of the
optimal nodes in the corresponding layer. As shown in Fig. 12, all nodes of
the corresponding layer of SONN architecture are constructed through the
genetic optimization.
The generation process can be organized as the following sequence of steps
Sub-step 1) We set up initial genetic information necessary for genera-
tion of the SONN architecture. This concerns the number of generations and
populations, mutation rate, crossover rate, and the length of the chromosome.
Sub-step 2) The nodes (PNs or FPNs) are generated through the genetic
design. Here, a single population assumes the same role as the node (PN or
FPN) in the SONN architecture and its underlying processing is visualized
[Figure: per-generation GA flow. A population of T candidate nodes (PNs or FPNs) undergoes the fitness operation; reproduction uses roulette-wheel selection, one-point crossover, and inversion mutation; the elitist strategy preserves the node of highest fitness; from the combined array of 2W nodes, the W preferred nodes are selected for the layer.]
Fig. 12. The genetic optimization used in the generation of the optimal nodes in
the given layer of the network
of the retained nodes serve as inputs to the next layer of the network. There
are two cases as to the number of the retained nodes, that is
(i) If W * < W , then the number of nodes (PNs or FPNs) retained for
the next layer is equal to W *.
Here, W * denotes the number of retained nodes in each layer after
nodes with duplicated fitness values have been removed.
(ii) If W * ≥ W , then for the next layer, the number of the retained nodes
(PNs or FPNs) is equal to W .
The above design pattern is carried out for the successive layers of the
network. For the construction of the nodes in the corresponding layer of the
original SONN structure, the nodes obtained from (i) or (ii) are rearranged in
ascending order on the basis of their initial population numbers. This step is needed
to construct the final network architecture, as it allows us to trace the locations of the
original nodes of each layer generated by the genetic optimization.
Sub-step 6) For the elitist strategy, we select the node that has the highest
fitness value among the selected nodes (W ).
Sub-step 7) We generate new populations for the next generation using the GA
operators described in Sub-step 4, again employing the elitist strategy. This
sub-step is carried out by repeating Sub-steps 2–6. In particular, in Sub-step 5
we replace the node that has the lowest fitness value in the current genera-
tion with the node that reached the highest fitness value in the previous
generation, obtained from Sub-step 6.
Sub-step 8) We combine the nodes (W populations) obtained in the previous
generation with the nodes (W populations) obtained in the current genera-
tion. In the sequel, the W nodes with the higher fitness values among these 2W
candidates are selected; that is, this sub-step is carried out by repeating Sub-step 5.
Sub-step 9) This process is repeated through Sub-steps 7–8 until the last
generation is reached.
The iterative process generates the optimal nodes of the given layer of the
SONN.
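Sub-steps 2 through 9 describe one elitist GA pass per generation. The loop below is a simplified sketch of that scheme (our code, with an always-on single-bit mutation rather than a mutation probability): roulette-wheel selection and one-point crossover produce offspring, the elite node is carried over, and the W fittest of the combined pool are retained:

```python
import random

def next_generation(population, fitness, W, rng):
    """One elitist GA generation over candidate nodes (chromosomes as 0/1 lists)."""
    scored = [(fitness(ch), ch) for ch in population]
    total = sum(f for f, _ in scored)

    def roulette():  # fitness-proportional (roulette-wheel) selection
        r, acc = rng.random() * total, 0.0
        for f, ch in scored:
            acc += f
            if acc >= r:
                return ch
        return scored[-1][1]

    offspring = []
    while len(offspring) < len(population):
        p1, p2 = roulette(), roulette()
        cut = rng.randrange(1, len(p1))          # one-point crossover
        child = p1[:cut] + p2[cut:]
        i = rng.randrange(len(child))            # inversion (bit-flip) mutation
        child[i] = 1 - child[i]
        offspring.append(child)

    elite = max(scored, key=lambda s: s[0])[1]   # elitist strategy
    pool = population + offspring + [elite]
    return sorted(pool, key=fitness, reverse=True)[:W]
```

Because the parent population and the elite individual re-enter the selection pool, the best fitness in the retained set can never decrease across generations.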
[Step 7] Check the termination criterion.
The termination condition that controls the growth of the model consists
of two components, that is the performance index and a size of the network
(expressed in terms of the maximal number of the layers). As far as the per-
formance index is concerned (that reflects a numeric accuracy of the layers),
a termination is straightforward and comes in the form,
F1 ≤ F∗ (16)
In this study, we use two measures (performance indexes), that is, the Mean
Squared Error (MSE) and the Root Mean Squared Error (RMSE).
i) Mean Squared Error

E(PIs or EPIs) = (1/N) Σ_{p=1}^{N} (yp − ŷp)²   (17)

where yp is the p-th target output datum, ŷp stands for the p-th actual
output of the model for this specific data point, N is the number of training (PIs)
or testing (EPIs) input-output data pairs, and E is an overall (global) performance
index defined as the average of the squared errors over the N pairs.
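Both performance indexes of (17) are one-liners; a sketch:

```python
import math

def mse(targets, outputs):
    """Mean squared error E of Eq. (17) over N input-output pairs."""
    return sum((y - yh) ** 2 for y, yh in zip(targets, outputs)) / len(targets)

def rmse(targets, outputs):
    """Root mean squared error, the second performance index used."""
    return math.sqrt(mse(targets, outputs))
```

In the experiments below, PI denotes the index on the training set and EPI the same index on the testing set.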
[Step 8] Determine new input variables for the next layer.
If (16) has not been met, the model is expanded. The outputs of the
preserved nodes (z1i, z2i, · · · , zWi) serve as new inputs (x1j, x2j, · · · , xWj)
(j = i + 1) to the next layer; this is captured by the expression x1j = z1i, x2j = z2i, · · · , xWj = zWi.
The SONN algorithm is carried out by repeating steps 4–8 of the algorithm.
5 Experimental studies
In this section, we illustrate the development of the SONN and show its per-
formance on a number of well-known and widely used datasets. The first one
is the gas furnace time series (Box-Jenkins data), studied previously
in [10, 11, 13, 15, 16, 20–34]. The other deals with a chaotic time series
(the Mackey-Glass time series) [35–41].
of the series serves as a testing set. We consider the MSE given by (17) to be
a pertinent performance index. We choose the input variables of nodes in the
1st layer of SONN architecture from these input variables. We use four types
of system input variables of SONN architecture with vector formats such as
Type I, Type II, Type III, and Type IV as shown in Table 4. The forms
Type I, II, III, and IV utilize two, three, four, and six system input variables
respectively.
Table 5 summarizes the list of parameters used in the genetic optimiza-
tion of the PN-based and the FPN-based SONN. In the optimization of each
layer, we use 100 generations, 60 populations, a string of 36 bits, crossover
rate equal to 0.65, and a mutation probability of 0.1. A chromosome
used in the genetic optimization consists of a string comprising 3
sub-chromosomes. The 1st sub-chromosome contains the number of input vari-
ables, the 2nd sub-chromosome contains the order of the polynomial, and finally
the 3rd sub-chromosome contains the input variables. The numbers of bits allocated
to the sub-chromosomes are equal to 3, 3, and 30, respectively. The selected
population size, taken from the total population size (60), is equal to 30.
The process is realized as follows. 60 nodes (PNs or FPNs) are generated
in each layer of the network. The parameters of all nodes generated in each
layer are estimated and the network is evaluated using both the training and
testing data sets. Then we compare these values and choose 30 nodes (PNs
or FPNs) that produce the best (lowest) value of the performance index.
The maximal number (Max) of inputs to be selected is confined to two to
five (2–5). In the case of the PN-based SONN, the order of the polynomial is chosen
from three types, that is, Type 1, Type 2, and Type 3 (refer to Table 1),
while in case of FPN-based SONN, the polynomial order of the consequent
part of fuzzy rules is chosen from four types, that is Type 1, Type 2, Type
3, and Type 4, as shown in Table 2. As usual in fuzzy systems, we may exploit
a variety of membership functions in the condition part of the rules, and
this is another factor contributing to the flexibility of the network. Overall,
triangular or Gaussian fuzzy sets are of general interest. The first class,
triangular membership functions, provides a very simple implementation.
The second class is useful because of the infinite support of its fuzzy
sets.
Table 5. Parameters of the genetic optimization of the PN-based and FPN-based SONN (the same settings apply in each of the five experimental cases)

GA: total population size: 60; selected population size (W): 30; crossover rate: 0.65; mutation rate: 0.1; string length: 3 + 3 + 30
PN-based SONN: maximal number (Max) of inputs to be selected: 1 ≤ l ≤ Max (2–5); polynomial type (Type T) (#): 1 ≤ T ≤ 3
FPN-based SONN: maximal number (Max) of inputs to be selected: 1 ≤ l ≤ Max (2–5); polynomial type (Type T) of the consequent part of fuzzy rules (##): 1 ≤ T ≤ 4

l, T, Max: integers; #, ##, and ###: refer to Tables 1–3 respectively.
[Figure: a sample chromosome decoded into a PN. The 1st sub-chromosome yields 3 input variables, the 2nd sub-chromosome selects the modified quadratic polynomial, and the 3rd sub-chromosome selects the input variables x2, x5, x6, so the PN realizes z = a0 + a1x2 + a2x5 + a3x6 + a4x2x5 + a5x2x6 + a6x5x6.]
Fig. 13. The example of the PN design guided by some chromosome (Type IV and
Max = 3 in layer 1)
design process. The maximal number of input variables for the selection is
confined to 3 over nodes of each layer of the network.
Table 6 summarizes the results when using Type II and Type IV. According
to the maximal number of inputs to be selected (Max = 2 to 5), the
selected node numbers, the selected polynomial type (Type T), and the cor-
responding performance indices (PI and EPI) are shown for the genetic
optimization carried out at each layer. "Node" denotes the nodes for
which the fitness value is maximal in each layer. For example, in the case of Table
6(b), the fitness value in layer 1 is maximal for Max = 5 when nodes 3, 4, 5, 6
occurring in the previous layer are selected as the node inputs in the present
layer. Only 4 inputs of Type 1 (a linear function) were selected as the result
of the genetic optimization. Here, node "0" indicates that the corresponding input has not been
selected by the genetic operation. Therefore the width (the number of nodes)
of a layer can be lower than in the conventional PN-based SONN
(which contributes immensely to the compactness of the resulting network).
In that case, the minimal values of the performance index at the node, that is,
PI = 0.035 and EPI = 0.125, are obtained.
Fig. 14 depicts the values of the performance index of the PN-based
gSONN with respect to the maximal number of inputs to be selected when
Table 6. Performance index of the PN-based SONN viewed with regard to the increasing number of the layers

(a) In case of Type II (each cell lists the selected node numbers, polynomial type T, PI, EPI)

Max | 1st layer | 2nd layer | 3rd layer | 4th layer | 5th layer
2 | 2 3, 2, 0.104, 0.198 | 6 10, 2, 0.048, 0.124 | 10 12, 2, 0.024, 0.119 | 8 16, 2, 0.024, 0.114 | 23 30, 2, 0.024, 0.112
3 | 1 2 3, 3, 0.022, 0.135 | 5 7 14, 2, 0.045, 0.121 | 16 18 28, 2, 0.043, 0.115 | 12 16 23, 3, 0.041, 0.112 | 14 18 21, 2, 0.036, 0.109
4 | 1 2 3 0, 3, 0.022, 0.135 | 5 15 16 0, 2, 0.045, 0.121 | 8 13 14 21, 3, 0.042, 0.112 | 1 8 25 29, 3, 0.039, 0.108 | 5 15 16 0, 1, 0.038, 0.106
5 | 1 2 3 0 0, 3, 0.022, 0.135 | 4 7 18 0 0, 2, 0.045, 0.121 | 1 15 30 0 0, 2, 0.022, 0.115 | 23 29 30 0 0, 2, 0.021, 0.110 | 13 14 23 30 0, 1, 0.019, 0.108
[Figure: four panels, (a-1) training error and (a-2) testing error for Type II, and (b-1) training error and (b-2) testing error for Type IV, plotted versus layer (1 to 5) for Max = 2 (curve A), 3 (B), 4 (C), and 5 (D); the optimal node numbers selected at each layer are annotated next to each curve.]
Fig. 14. Performance index treated as a function of the maximal number of inputs
to be selected in Type II or Type IV
the number of system inputs is equal to 6 (Type IV). Next, Fig. 15 shows the
performance index of the gSONN with respect to the number of entire system
inputs when using Max = 5.
Considering the training and testing data sets in Type IV with Max = 5,
the best results for the network in the 2nd layer occur with the Type 2 poly-
nomial (quadratic function) and 3 nodes at the inputs (nodes numbered
4, 15, 22); the performance of this network is quantified by PI
equal to 0.015 and EPI given as 0.112. The best results for the network in the
5th layer, with PI = 0.012 and EPI = 0.091, have been reported when
using Max = 5 with the polynomial of Type 3 and 3 nodes at the inputs (node
numbers 6, 24, 25). In Fig. 14, A(•)–E(•) denote the optimal node
numbers at each layer of the network, namely those with the best predictive
performance. Here, the node numbers of the 1st layer represent system input
numbers, while the node numbers shown for the 2nd layer or higher represent
the output node numbers of the preceding layer, i.e., the optimal nodes
with the best output performance in the current layer. Fig. 16 illustrates
the detailed optimal topologies of the network with 2 or 3 layers, compared
to the conventional optimized network of Fig. 17. That is, in Fig. 17, the
86 S.-K. Oh and W. Pedrycz
[Fig. 15 plots: (a) training error and (b) testing error versus layer (1-5) for 2, 3, 4, and 6 system inputs.]
Fig. 15. Performance index regarded as a function of the number of system inputs
(SI) (in case of Max = 5)
[Fig. 16 diagrams: detailed optimal PN-based topologies, (a) Type II with 3 layers and Max = 5; (b) Type IV with 2 layers and Max = 5.]
[Fig. 17 diagrams: conventional optimized SONN topologies, (a) Type II with 5 layers; (b) Type IV with 5 layers.]
[Fig. 18 plots: training and testing error of the 1st through 5th layers over 500 generations of the genetic optimization.]
Fig. 19 shows an example of the FPN design driven by a specific
chromosome; refer to the case with performance values PI = 0.012 and
EPI = 0.145 in the 1st layer when using Type IV, the Gaussian-like MF, and
Max = 4 in Table 8(b-2). In the FPN of each layer, two Gaussian-like membership
Table 7. Performance index of PN-based SONN architectures for Types I, II, III,
and IV
(each row: layer; optimal polynomial neuron, i.e., no. of inputs (node no.) and
polynomial type; PIs; EPIs)
1 2(1,2) Type 1 0.022 0.335
2 4(1,4,5,6) Type 2 0.019 0.282
Max = 4 3 4(1,2,14,25) Type 2 0.020 0.273
4 3(17,22,25) Type 2 0.018 0.268
Type I 5 4(4,14,25,26) Type 2 0.018 0.265
(SI: 2) 1 2(1,2) Type 1 0.022 0.335
2 4(3,4,5,7) Type 2 0.020 0.282
Max = 5 3 5(7,9,15,24,28) Type 3 0.020 0.271
4 3(21,27,30) Type 2 0.017 0.263
5 3(2,15,13) Type 2 0.017 0.259
1 3(1,2,3) Type 3 0.022 0.135
2 3(5,15,16) Type 2 0.045 0.121
Max = 4 3 4(8,13,14,21) Type 3 0.042 0.112
4 4(1,8,25,29) Type 3 0.039 0.108
Type II 5 3(10,16,28) Type 1 0.038 0.106
(SI: 3) 1 3(1,2,3) Type 3 0.022 0.135
2 3(4,17,18) Type 2 0.045 0.121
Max = 5 3 3(1,15,30) Type 2 0.022 0.115
4 3(23,29,30) Type 2 0.021 0.110
5 4(13,14,23,30) Type 1 0.019 0.108
1 3(1,3,4) Type 3 0.022 0.135
2 4(24,26,29,30) Type 3 0.016 0.124
Max = 4 3 4(7,11,18,26) Type 2 0.014 0.113
4 3(2,28,29) Type 2 0.014 0.102
Type III 5 3(3,12,22) Type 1 0.013 0.099
(SI: 4) 1 3(1,3,4) Type 3 0.022 0.135
2 5(12,13,15,18,29) Type 2 0.015 0.119
Max = 5 3 4(7,9,10,14) Type 2 0.014 0.104
4 3(1,3,5) Type 2 0.014 0.100
5 4(8,9,23,24) Type 1 0.013 0.098
1 4(3,4,5,6) Type 1 0.035 0.125
2 4(6,9,15,27) Type 3 0.018 0.122
Max = 4 3 4(6,12,13,19) Type 3 0.017 0.113
4 4(4,7,10,28) Type 2 0.016 0.106
Type IV 5 3(27,28,29) Type 3 0.014 0.100
(SI: 6) 1 4(3,4,5,6) Type 1 0.035 0.125
2 3(4,15,22) Type 2 0.015 0.112
Max = 5 3 4(7,9,18,27) Type 3 0.015 0.107
4 3(3,23,30) Type 2 0.013 0.096
5 3(6,24,25) Type 3 0.012 0.091
Genetically Optimized Self-organizing Neural Networks 89
[Fig. 19 diagram: the chromosome is split into (i) bits selecting the number of input variables, (ii) bits selecting the polynomial order, and (iii) bits selecting the input variables; each field is decoded and normalized onto its admissible range, yielding a regression-polynomial fuzzy inference with 2 Gaussian MFs over the entire system input variables.]
Fig. 19. The example of the FPN design guided by some chromosome (Type IV,
Gaussian-like MF, Max = 4, 1st layer)
functions are used for each input variable. The number of input variables
considered in the 1st layer (here, the entire set of system input variables)
is 6 (Type IV), and the selected polynomial order is Type 2. Furthermore, the
entire set of system input variables is used in the regression polynomial of
the consequent part of the fuzzy rules in the first layer. That is, "Type 2∗"
is used as the consequent input type in the 1st layer, while in the 2nd layer
and higher the input variables of the conclusion part of the fuzzy rules are
kept the same as those of the conditional part. In particular, in the 2nd layer
and higher, the number of input variables is given as W, the number of nodes
selected in the current layer as output nodes of the preceding layer; refer to
sub-step 4 of step 4-3 of the introduced design process. As mentioned
previously, the maximal number (Max) of input variables available for
selection is confined to 4, and two variables (such as x2 and x3) were
selected among them. The parameters of the conclusion (polynomial) part of the
fuzzy rules are determined by the standard least-squares method.
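To make this last step concrete, the sketch below sets up one global least-squares problem in which each rule's quadratic regressors are weighted by the rule's normalized Gaussian activation, so all consequent coefficients are estimated at once. The helper name `fit_consequents` and the rule parameters `centers` and `sigma` are hypothetical illustrations rather than the authors' implementation, and a Type 2 consequent is taken here as a quadratic without cross terms.

```python
import numpy as np

def fit_consequents(X, y, centers, sigma):
    """Estimate the polynomial consequent parameters of all fuzzy rules by one
    global least-squares fit (a sketch; the rule shapes are assumptions)."""
    # Gaussian-like membership grades, one rule per center, normalized per sample
    d = X[:, None, :] - centers[None, :, :]                  # (N, R, n)
    w = np.exp(-np.sum(d ** 2, axis=2) / (2 * sigma ** 2))   # (N, R)
    w /= w.sum(axis=1, keepdims=True)
    # Type 2 consequent taken as a quadratic without cross terms: [1, x, x^2]
    Phi = np.concatenate([np.ones((len(X), 1)), X, X ** 2], axis=1)
    # stack the weighted regressors of every rule and solve a single LSE
    A = np.concatenate([w[:, [r]] * Phi for r in range(len(centers))], axis=1)
    theta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return theta.reshape(len(centers), -1)   # one coefficient row per rule
```

The model output is then the activation-weighted sum of the rules' polynomial outputs.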
Table 8 summarizes the results for Types II and IV: for each maximal
number of inputs to be selected (Max = 2 to 5), the selected node numbers, the
selected polynomial type (Type T), and the corresponding
Table 8. Performance index of the network of each layer versus the increase of the maximal number of inputs to be selected, for Types II and IV
Max  (per layer, 1st-5th: Node; T; PI; EPI)
2   (1 2) T4 0.017 0.142 | (8 14) T2 0.017 0.134 | (17 26) T2 0.017 0.132 | (15 25) T4 0.017 0.130 | (1 5) T4 0.016 0.128
3   (1 2 0) T4 0.017 0.142 | (25 16 17) T1 0.025 0.131 | (10 20 26) T1 0.024 0.127 | (2 4 0) T2 0.016 0.119 | (16 25 0) T2 0.015 0.117
4   (1 2 0 0) T4 0.017 0.142 | (3 9 14 25) T1 0.021 0.125 | (2 3 9 10) T1 0.023 0.111 | (10 20 21 28) T1 0.030 0.108 | (3 13 0 0) T2 0.018 0.101
5   (1 2 0 0 0) T4 0.017 0.142 | (5 6 9 24 0) T1 0.020 0.127 | (4 7 17 20 0) T1 0.022 0.117 | (17 29 0 0 0) T2 0.014 0.114 | (19 29 0 0 0) T3 0.014 0.108
performance index (PI and EPI) obtained when the genetic optimization
was carried out for each layer are shown. "Node" denotes the nodes for which
the fitness value is maximal in each layer. For example, in the case of
Table 8(b-2), the fitness value in layer 5 is maximal for Max = 5 when nodes
5, 15, 20 of the previous layer are selected as the inputs of the present
layer; only 3 inputs of Type 1 (constant) were selected as the result of the
genetic optimization. A node entry of "0" indicates that no node has been
selected by the genetic operation. Therefore the width (the number of nodes)
of a layer can be lower than in the conventional FPNN, which contributes
immensely to the compactness of the resulting network. In that case, the
minimal values of the performance index at the node, PI = 0.016 and
EPI = 0.100, are obtained.
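The decoding just described (three bit fields, normalization of each raw value onto its admissible range, and "0" marking an unselected slot) can be sketched as follows. The field widths and the rounding rule are not given in the chapter, so this particular layout is an assumption for illustration only.

```python
def decode_chromosome(bits, max_inputs, n_candidates, n_poly_types=4):
    """Decode one node's chromosome into (n_inputs, poly_type, input_nodes).

    Assumed layout: 3 bits for the number of inputs, 2 bits for the
    polynomial order, and the remaining bits split evenly into max_inputs
    fields naming candidate nodes of the preceding layer.
    """
    def field(b):                      # bit slice -> integer
        return int(''.join(map(str, b)), 2)

    def normalize(v, n_bits, lo, hi):  # scale a raw value onto [lo, hi]
        return lo + round(v * (hi - lo) / (2 ** n_bits - 1))

    n_in = normalize(field(bits[0:3]), 3, 1, max_inputs)
    ptype = normalize(field(bits[3:5]), 2, 1, n_poly_types)
    per = (len(bits) - 5) // max_inputs
    nodes = [normalize(field(bits[5 + k * per: 5 + (k + 1) * per]), per,
                       1, n_candidates) for k in range(max_inputs)]
    # only the first n_in slots are used; the rest are reported as node "0"
    return n_in, ptype, nodes[:n_in] + [0] * (max_inputs - n_in)
```

For instance, an all-zero chromosome decodes to the smallest admissible values in every field, while an all-one chromosome decodes to the largest.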
Fig. 20 shows the values of the performance index vis-à-vis the number of
layers of the gSONN with respect to the maximal number of inputs to be
selected, for the optimal architectures of each layer of the network included
in Table 8(b), while Fig. 21 summarizes the values of the performance index
vis-à-vis the increasing number of layers with regard to the number of
[Fig. 20 plots: (a) triangular MF, (b) Gaussian-like MF; panels (a-1)/(b-1) training error and (a-2)/(b-2) testing error versus layer (1-5), with A(•)-D(•) listing the node numbers selected for Max = 2-5.]
Fig. 20. Performance index of gSONN for Type IV according to the increase of
number of layers
[Fig. 21 plots: (a) triangular MF, (b) Gaussian-like MF; training and testing error versus layer (1-5) for 2, 3, 4, and 6 system inputs, with selected input variables compared against the entire system input variables.]
Fig. 21. Performance index of gSONN with respect to the increase of number of
system inputs (in case of Max = 5)
entire system inputs being used in gSONN when using Max = 5. In Fig. 20,
A(•)-D(•) denote the optimal node numbers at each layer of the network,
namely those with the best predictive performance. Here, the node numbers of
the 1st layer refer to the system inputs, and the node numbers in the 2nd
layer or higher refer to the output nodes of the preceding layer, i.e., the
optimal node with the best output performance in the current layer.
Fig. 22 illustrates the detailed optimal topology of the network with 2 layers,
compared to the conventional optimized network of Fig. 23. In Fig. 23, the
performance of the two types of conventional optimized SONNs is quantified by
PI = 0.020, EPI = 0.130 for Type II, and PI = 0.016, EPI = 0.128 for Type III,
whereas, under similar performance values, the two gSONN architectures shown
in Fig. 22(a) and (b) were obtained. As shown in Fig. 22, the genetic design
procedure at each stage (layer) of the SONN leads to the selection of the
preferred nodes (or FPNs) with optimal local characteristics (such as the
number of input variables, the order of the polynomial, and a specific subset
of input variables).
[Fig. 22 diagrams: (a) FPN-based gSONN with 2 layers for Type II, Max = 5, and Gaussian-like MF; (b) FPN-based gSONN with 2 layers for Type IV, Max = 4, and Gaussian-like MF.]
Fig. 22. Genetically optimized FPNN (gFPNN) architecture
[Fig. 23 diagrams: conventional optimized FPN-based SONN topologies, (a) Type II and (b) Type III.]
Table 9. Performance index of FPN-based SONN architectures for Types I, II, III,
and IV
(each row: layer; optimal fuzzy polynomial neuron, i.e., no. of inputs
(node no.) and polynomial type; PIs; EPIs)
1 1(1) Type 2 0.022 0.335
Max = 5 2 4(7, 9, 10, 11) Type 1 0.018 0.271
(Triangular MF) 3 4(6, 12, 25, 28) Type 1 0.017 0.270
4 4(6, 11, 12, 27) Type 1 0.017 0.268
Type I 5 3(10, 14, 19) Type 1 0.016 0.264
(SI: 2) 1 2(1,2) Type 3 0.022 0.281
Max = 5 2 3(6, 9, 11) Type 2 0.020 0.267
(Gaussian-like MF) 3 4(5, 13, 15, 25) Type 1 0.020 0.260
4 2(3, 6) Type 3 0.017 0.252
5 2(28, 30) Type 2 0.017 0.249
1 1(1) Type 4 0.020 0.133
Max = 5 2 2(4, 27) Type 1 0.049 0.129
(Triangular MF) 3 2(11, 22) Type 2 0.019 0.124
4 2(5, 23) Type 2 0.018 0.119
Type II 5 4(17, 21, 26, 27) Type 1 0.016 0.114
(SI: 3) 1 2(1, 2) Type 4 0.017 0.142
Max = 4 2 4(3, 9, 14, 25) Type 1 0.021 0.125
(Gaussian-like MF) 3 4(2, 3, 9, 10) Type 1 0.023 0.111
4 4(10, 20, 21, 28) Type 1 0.030 0.108
5 2(3, 13) Type 2 0.018 0.101
1 3(1, 3, 4) Type 1 0.021 0.135
Max = 5 2 2(1, 26) Type 1 0.021 0.125
(Triangular MF) 3 4(2, 6, 13, 15) Type 1 0.017 0.123
4 4(10, 14, 17, 28) Type 1 0.010 0.120
Type III 5 4(9, 14, 17, 24) Type 1 0.010 0.115
(SI: 4) 1 3(1, 3, 4) Type 2 0.016 0.146
Max = 5 2 4(1, 11, 24, 27) Type 1 0.043 0.128
(Gaussian-like MF) 3 4(1, 8, 19, 24) Type 1 0.024 0.117
4 2(2, 22) Type 3 0.012 0.104
5 4(8,9,23,24) Type 1 0.014 0.099
1 3(2, 5, 6) Type 1 0.021 0.136
Max = 5 2 2(16, 27) Type 1 0.021 0.124
(Triangular MF) 3 3(5, 6, 10) Type 4 0.015 0.106
4 2(5, 27) Type 2 0.015 0.105
Type IV 5 2(7, 19) Type 3 0.015 0.103
(SI: 6) 1 2(2, 3) Type 2 0.012 0.145
Max = 4 2 2(19, 29) Type 4 0.011 0.125
(Gaussian-like MF) 3 3(4, 13, 29) Type 1 0.020 0.104
4 1(29) Type 3 0.011 0.106
5 4(2, 20, 24, 27) Type 1 0.008 0.093
Table 10. Comparative analysis of the performance of the network; considered are
models reported in the literature
Performance index
Model PI PIs EPIs
Box and Jenkins' model [20] 0.710
Tong’s model [21] 0.469
Sugeno and Yasukawa’s model [22] 0.355
Sugeno and Yasukawa’s model [23] 0.190
Xu and Zailu’s model [24] 0.328
Pedrycz’s model [15] 0.320
Chen’s model [25] 0.268
Gomez-Skarmeta’s model [26] 0.157
Oh and Pedrycz’s model [16] 0.123 0.020 0.271
Kim et al.’s model [27] 0.055
Kim et al.’s model [28] 0.034 0.244
Leski and Czogala’s model [29] 0.047
Lin and Cunningham’s model [30] 0.071 0.261
NNFS model [31] 0.128
FPNN [32]: CASE I (SI = 4, 5th layer) 0.016 / 0.116; CASE II (SI = 4, 5th layer) 0.016 / 0.128
PNN [33]: Basic (SI = 4, 5th layer) 0.021 / 0.110; Modified (SI = 4, 5th layer) 0.015 / 0.103
HFPNN [34]: Triangular (SI = 4, 5th layer) 0.019 / 0.134; Gaussian (SI = 4, 5th layer) 0.021 / 0.119
Generic SOPNN [10]: Basic SOPNN (SI = 4, 5th layer) 0.027 / 0.021 / 0.085; Modified SOPNN (SI = 4, 5th layer) 0.035 / 0.017 / 0.095
Advanced SOPNN [11]: Basic SOPNN (SI = 4, 5th layer) 0.020 / 0.119; Modified SOPNN (SI = 4, 5th layer) 0.018 / 0.118
SONN∗ [13], Type I (SI = 2), Basic SONN: Case 1 (5th layer) 0.016 / 0.266; Case 2 (5th layer) 0.016 / 0.265
SONN∗ [13], Type I (SI = 2), Modified SONN: Case 1 (5th layer) 0.013 / 0.267; Case 2 (5th layer) 0.013 / 0.272
SONN∗ [13], Type II (SI = 4), Basic SONN: Case 1 (5th layer) 0.016 / 0.116; Case 2 (5th layer) 0.016 / 0.128
SONN∗ [13], Type II (SI = 4), Modified SONN: Case 1 (5th layer) 0.016 / 0.133; Case 2 (5th layer) 0.018 / 0.131
Proposed gSONN, PN-based: Type I (SI = 2): 2nd layer (Max = 4) 0.019 / 0.282, 5th layer (Max = 5) 0.017 / 0.259; Type II (SI = 3): 1st layer (Max = 5) 0.022 / 0.135, 3rd layer (Max = 5) 0.022 / 0.115; Type III (SI = 4): 2nd layer (Max = 5) 0.015 / 0.119, 5th layer (Max = 5) 0.018 / 0.098; Type IV (SI = 6): 2nd layer (Max = 5) 0.015 / 0.112, 3rd layer (Max = 5) 0.012 / 0.091
Proposed gSONN, FPN-based (Gaussian-like MF): Type I (SI = 2): 1st layer (Max = 5) 0.017 / 0.281, 5th layer (Max = 5) 0.014 / 0.249; Type II (SI = 3): 1st layer (Max = 5) 0.017 / 0.142, 5th layer (Max = 5) 0.014 / 0.108; Type III (SI = 4): 1st layer (Max = 5) 0.016 / 0.146, 5th layer (Max = 5) 0.014 / 0.099; Type IV (SI = 6): 2nd layer (Max = 5) 0.012 / 0.145, 5th layer (Max = 4) 0.008 / 0.093
PI - performance index over the entire data set,
PIs - performance index on the training data, EPIs - performance index on the testing data.
*: denotes “conventional optimized FPN-based SONN”.
ẋ(t) = 0.2x(t − τ) / (1 + x^10(t − τ)) − 0.1x(t)    (20)
Table 11. System’s input vector formats for the design of SONN
Number of System Inputs (SI) Inputs and output
4 (Type I): x(t−18), x(t−12), x(t−6), x(t); x(t+6)
5 (Type II): x(t−24), x(t−18), x(t−12), x(t−6), x(t); x(t+6)
6 (Type III): x(t−30), x(t−24), x(t−18), x(t−12), x(t−6), x(t); x(t+6)
where t = 118 to 1117.
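A minimal sketch of generating the chaotic series of eq. (20) and of assembling the Type I input-output patterns of Table 11 might look as follows. The integration scheme is not stated in the chapter, so plain Euler stepping with a constant initial history is an assumption made for illustration:

```python
from collections import deque

def mackey_glass(n, tau=30, dt=1.0, x0=1.2):
    """Integrate eq. (20), dx/dt = 0.2 x(t-tau)/(1 + x(t-tau)^10) - 0.1 x(t),
    with a plain Euler step and a constant initial history (assumptions)."""
    hist = deque([x0] * int(tau / dt), maxlen=int(tau / dt))
    x, series = x0, []
    for _ in range(n):
        xd = hist[0]                 # delayed value x(t - tau)
        hist.append(x)
        x += dt * (0.2 * xd / (1.0 + xd ** 10) - 0.1 * x)
        series.append(x)
    return series

def make_patterns(series, delays=(18, 12, 6, 0), horizon=6):
    """Type I patterns of Table 11: [x(t-18), x(t-12), x(t-6), x(t)] -> x(t+6)."""
    lo, hi = max(delays), len(series) - horizon
    X = [[series[t - d] for d in delays] for t in range(lo, hi)]
    y = [series[t + horizon] for t in range(lo, hi)]
    return X, y
```

Types II and III follow by extending `delays` with 24 and 30, matching the rows of Table 11.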
Table 12 summarizes the results for Type III when setting the maximal number
of inputs to be selected to 2, 3, 4, 5, and 10. The best results for the
network in the 5th layer, PI = 0.0009 and EPI = 0.0009, are reported when
using Max = 10.
Fig. 24 depicts the values of the performance index of the PN-based
gSONN with respect to the maximal number of inputs to be selected when
the number of system inputs is equal to 6 (Type III). Next, Fig. 25 shows the
performance index of the gSONN with respect to the number of entire system
inputs when using Max = 10.
Table 14 summarizes the performance of the 1st to 5th layers of the network
when changing the maximal number of inputs to be selected; here Max was
varied from 2 to 5. Fig. 26 depicts the performance index of each layer of
the FPN-based gSONN according to the increase of the maximal number of inputs
to be selected. Fig. 27 summarizes the values of the performance index
vis-à-vis the increasing number of layers with regard to the number of
selected inputs (or entire system inputs) used in the gSONN when using
Max = 5.
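The layer-by-layer construction the section keeps referring to can be caricatured by a greedy pass that fits a quadratic polynomial neuron to every pair of current nodes and keeps the few best by training error. This is only a structural analogy: the chapter selects node subsets, polynomial types, and the number of inputs by a GA rather than by the exhaustive pairwise search used below.

```python
import numpy as np

def layer_growth(X, y, width=5, n_layers=3):
    """Greedy GMDH-style sketch of stacking polynomial-neuron layers."""
    nodes = [X[:, j] for j in range(X.shape[1])]      # 1st layer: system inputs
    for _ in range(n_layers):
        cands = []
        for i in range(len(nodes)):
            for j in range(i + 1, len(nodes)):
                a, b = nodes[i], nodes[j]
                # full quadratic of a node pair (a Type 2-like polynomial)
                Phi = np.stack([np.ones_like(a), a, b, a * a, b * b, a * b],
                               axis=1)
                c, *_ = np.linalg.lstsq(Phi, y, rcond=None)
                out = Phi @ c
                cands.append((np.mean((out - y) ** 2), out))
        cands.sort(key=lambda t: t[0])                # rank by training error
        nodes = [out for _, out in cands[:width]]     # keep the `width` best
    return nodes[0]                                   # best node of last layer
```

Capping `width` plays the role of the "0"-marked unselected nodes above: each layer stays narrow, which is what makes the resulting network compact.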
Fig. 28(a)-(b) illustrate the detailed optimal topologies of the gSONN with
1 layer when using the triangular MF; the results of the network are reported
as PI = 4.0e-4, EPI = 4.0e-4 for Max = 3, and PI = 7.7e-5, EPI = 1.6e-4 for
Max = 5. Table 13 summarizes the detailed results of the optimal architecture
for each type of the network. Fig. 28(c)-(d) also illustrate the detailed
optimal topologies of the gSONN with 1 layer in the case of the Gaussian-like
MF; these are quantified as PI = 3.0e-4, EPI = 3.0e-4 for Max = 2, and
PI = 3.6e-5, EPI = 4.5e-5 for Max = 5. As shown in Fig. 28, the proposed
design leads to a structurally more optimized and simplified network than
the conventional SONN.
Fig. 29 illustrates the optimization process by visualizing the values of the
performance index obtained in successive generations of the GA. It also shows
the optimized network architecture when using the Gaussian-like MF (the
maximal number of inputs to be selected, Max, set to 5, with a structure
composed of 5 layers). As shown in Fig. 29, the variation (slope) of the
performance of the network is almost the same from the 2nd through the 5th
layer; in this case, a stopping criterion can therefore be applied after at
most 1 or 2 layers in order to effectively reduce the number of nodes as well
as layers of the network (from the viewpoint of the width and depth of a
compact gSONN).
Table 15 summarizes the results of the optimal architecture according to
each Type (such as Types I, II, and III) of the network when using Type T∗
as the consequent input type.
Table 12. Performance index of the PN-based SONN of each layer versus the increase of maximal number of inputs to be selected for
Type III
1st layer 2nd layer 3rd layer 4th layer 5th layer
Max
Node T PI EPI Node T PI EPI Node T PI EPI Node T PI EPI Node T PI EPI
2 4 6 2 0.0502 0.0497 12 23 2 0.0309 0.0305 17 64 2 0.0215 0.0211 31 86 2 0.0173 0.0170 70 73 2 0.0156 0.0154
3 3 4 6 2 0.0347 0.0339 31 39 73 2 0.0161 0.0159 1 50 73 2 0.0107 0.0103 15 80 93 2 0.0080 0.0078 8 14 43 2 0.0067 0.0065
4 1 3 4 6 2 0.0293 0.0283 4 25 59 63 2 0.0106 0.0105 47 51 61 69 2 0.0063 0.0062 6 31 46 66 2 0.0052 0.0052 29 40 75 95 2 0.0045 0.0044
5 1 3 4 5 6 2 0.0231 0.0223 2 51 70 74 83 2 0.0126 0.0124 74 76 79 86 95 2 0.0066 0.0063 16 36 43 53 58 2 0.0049 0.0046 10 21 34 46 67 2 0.0041 0.0038
10 1 2 3 4 5 6 2 0.0211 0.0203 1 35 38 42 45 56 57 70 91 97 2 0.0036 0.0034 2 3 4 31 52 61 72 78 81 85 2 0.0021 0.0019 3 4 20 27 38 57 80 82 91 96 2 0.0013 0.0012 3 8 23 38 50 53 56 63 85 99 2 0.0009 0.0009
[Fig. 24 plots: (a) training error and (b) testing error versus layer (1-5), with A(•)-E(•) listing the node numbers selected for Max = 2, 3, 4, 5, and 10.]
Fig. 24. Performance index according to the increase of number of layers in case of
Type III
[Fig. 25 plots: (a) training error and (b) testing error versus layer (1-5) for 4, 5, and 6 system inputs.]
Fig. 25. Performance index of gSONN with respect to the increase of number of
system inputs (Max = 10)
Table 13. Performance index of PN-based SONN architectures for Types I, II,
and III
(each row: layer; optimal polynomial neuron, i.e., no. of inputs (node no.)
and polynomial type; PIs; EPIs)
1 4(1, 2, 3, 4) Type 2 0.0302 0.0294
2 5(19, 26, 28, 34, 41) Type 2 0.0082 0.0081
Max = 5 3 5(2, 17, 23, 48, 86) Type 2 0.0063 0.0061
4 5(3, 25, 71, 75, 78) Type 2 0.0050 0.0049
Type I 5 5(1, 3, 54, 94, 98) Type 2 0.0043 0.0043
(SI: 4) 1 4(1, 2, 3, 4) Type 2 0.0302 0.0294
2 10(2,4,8,10,11,12,13,15,21,23) Type 2 0.0047 0.0045
Max = 10 3 10(12,34,35,43,52,60,62,72,89,95) Type 2 0.0039 0.0038
4 10(12,13,28,29,32,41,57,73,91,94) Type 2 0.0032 0.0031
5 9(44,51,53,67,69,75,86,92,93) Type 2 0.0026 0.0026
1 5(1, 2, 3, 4, 5) Type 2 0.0282 0.0274
2 5(39, 59, 71, 74, 84) Type 2 0.0088 0.0087
Max = 5 3 5(46, 51, 66, 67, 88) Type 2 0.0054 0.0052
4 5(29, 38, 58, 72, 83) Type 2 0.0042 0.0041
Type II 5 5(10, 21, 23, 63, 85) Type 2 0.0037 0.0035
(SI: 5) 1 5(1, 2, 3, 4, 5) Type 2 0.0282 0.0274
2 10(7,38,39,40,47,48,56,59,60,68) Type 2 0.0026 0.0024
Max = 10 3 10(3,11,18,35,42,50,52,53,67,73) Type 2 0.0019 0.0018
4 10(2,8,26,30,33,35,48,79,81,85) Type 2 0.0015 0.0014
5 10(3,7,10,31,54,59,79,89,95,97) Type 2 0.0012 0.0012
1 5(1, 3, 4, 5, 6) Type 2 0.0231 0.0223
2 5(2, 51, 70, 74, 83) Type 2 0.0126 0.0124
Max = 5 3 5(74, 76, 79, 86, 95) Type 2 0.0066 0.0063
4 5(16, 36, 43, 53, 58) Type 2 0.0049 0.0046
Type III 5 5(10, 21, 34, 46, 67) Type 2 0.0041 0.0038
(SI: 6) 1 6(1, 2, 3, 4, 5, 6) Type 2 0.0211 0.0203
2 10(1,35,38,42,45,56,57,70,91,97) Type 2 0.0036 0.0034
Max = 10 3 10(2,3,4,31,52,61,72,78,81,85) Type 2 0.0021 0.0019
4 10(3,4,20,27,38,57,80,82,91,96) Type 2 0.0013 0.0012
5 10(3,8,23,38,50,53,56,63,85,99) Type 2 0.0009 0.0009
6 Concluding remarks
In this study, we introduced a class of genetically optimized self-organizing
neural networks, discussed their topologies, came up with a detailed genetic
design procedure, and applied these networks to nonlinear system modeling.
The comprehensive experimental studies involving well-known datasets
demonstrate the superb performance of the network in comparison to the
existing fuzzy and neuro-fuzzy models. The key features of this approach can
be enumerated as follows:
• The gSONN is a sophisticated and optimized architecture capable of
constructing models from a small number of system inputs as well as a
limited data set.
• The proposed design methodology helps reach a compromise between ap-
proximation and generalization capabilities of the constructed gSONN
model.
• The depth (layer size) and width (node size of each layer) of the gSONN
can be selected as a result of a tradeoff between accuracy and complexity
of the overall model.
Table 14. Performance index of the network of each layer versus the increase of maximal number of inputs to be selected for Type III
(in case of Type T∗ )
(a) Triangular MF
1st layer 2nd layer 3rd layer 4th layer 5th layer
Max
Node T PI EPI Node T PI EPI Node T PI EPI Node T PI EPI Node T PI EPI
2 4 5 3 0.0019 0.0017 14 25 4 0.0016 0.0015 4 7 3 0.0016 0.0014 10 22 3 0.0015 0.0014 14 16 3 0.0015 0.0013
3 3 4 5 3 0.0004 0.0004 2 3 13 1 0.0003 0.0003 4 10 24 4 0.0003 0.0003 2 14 22 4 0.0002 0.0003 14 20 25 3 0.0002 0.0002
4 3 4 5 6 3 7.7e-5 1.6e-4 4 14 23 0 2 6.9e-5 1.3e-4 6 10 14 0 4 6.0e-5 1.2e-4 20 25 29 0 4 5.8e-5 1.0e-4 10 13 15 0 3 5.6e-5 9.9e-5
5 3 4 5 6 0 3 7.7e-5 1.6e-4 6 16 23 0 0 4 6.7e-5 1.2e-4 12 18 0 0 0 4 6.1e-5 1.0e-4 3 15 28 0 0 3 5.6e-5 9.7e-5 13 15 19 21 0 1 5.7e-5 9.4e-5
(b) Gaussian-like MF
1st layer 2nd layer 3rd layer 4th layer 5th layer
Max
Node T PI EPI Node T PI EPI Node T PI EPI Node T PI EPI Node T PI EPI
2 4 6 3 0.0003 0.0003 18 20 4 0.0002 0.0002 13 18 3 0.0002 0.0002 12 14 3 0.0002 0.0002 4 21 4 0.0002 0.0002
3 1 4 5 3 3.6e-5 0.199 3 18 19 2 3.2e-5 4.1e-5 7 17 21 2 3.1e-5 3.9e-5 6 18 29 2 2.7e-5 3.6e-5 15 23 0 3 2.7e-5 3.5e-5
4 1 4 5 0 3 3.6e-5 4.5e-5 7 16 19 24 2 2.8e-5 4.1e-5 11 17 23 28 4 2.8e-5 3.6e-5 2 3 18 0 2 2.4e-5 3.4e-5 7 19 0 0 2 2.4e-5 3.3e-5
5 1 4 5 0 0 3 3.6e-5 4.5e-5 12 14 18 0 0 2 3.2e-5 4.1e-5 7 8 13 14 20 4 2.7e-5 3.6e-5 6 14 15 0 0 3 2.4e-5 3.4e-5 18 26 28 0 0 4 2.3e-5 3.3e-5
[Fig. 26 plots: performance index versus layer (1-5) for Max = 2(A), 3(B), 4(C), 5(D); (a) triangular MF, (b) Gaussian-like MF; panels (-1) training error and (-2) testing error, with A(•)-D(•) listing the selected node numbers.]
• The structure of the network is not predetermined (as in most of the ex-
isting neural networks) but becomes dynamically adjusted and optimized
during the development process.
• With a properly selected type of membership functions and organization of
the layers, the FPN-based gSONN performs better than other fuzzy and
neuro-fuzzy models.
• The gSONN comes with a diversity of local neuron characteristics, such as
PNs or FPNs, that are useful in coping with the various nonlinear
characteristics of the modeled systems. The GA-based design procedure at each
stage (layer) of the gSONN leads to the optimal selection of these preferred
nodes (PNs or FPNs) with local characteristics (such as the number of input
variables, the order of the polynomial, and a specific subset of input
variables) available within a single node; based on these selections, we
build a flexible and optimized gSONN architecture.
[Fig. 27 plots: (a) triangular MF, (b) Gaussian-like MF; training and testing error versus layer (1-5) for 4, 5, and 6 system inputs, with selected input variables compared against the entire system input variables.]
Fig. 27. Performance index of gFPNN with respect to the increase of number of
system inputs (Max = 5)
[Fig. 28 diagrams: optimal one-layer gSONN topologies for Type III; (a) triangular MF, Max = 3; (b) triangular MF, Max = 5; (c)-(d) Gaussian-like MF.]
[Fig. 29 plots: training and testing error of the 1st through 5th layers over 500 generations.]
Fig. 29. The optimization process of each performance index by the genetic
algorithms (Type III, Max = 5, Gaussian)
Table 15. Performance index of FPN-based SONN architectures for Types I, II,
and III
(each row: layer; optimal fuzzy polynomial neuron, i.e., no. of inputs
(node no.) and polynomial type; PIs; EPIs)
1 4(1, 2, 3, 4) Type 3 0.0017 0.0016
Max = 5 2 5(8, 24, 25, 27, 28) Type 2 0.0014 0.0013
(Triangular MF) 3 4(2, 5, 9, 15) Type 2 0.0011 0.0011
4 4(2, 6, 17, 26) Type 3 0.0009 0.0010
Type I 5 4(1, 3, 17, 24) Type 1 0.0009 0.0010
(SI: 4) 1 4(1, 2, 3, 4) Type 3 0.0005 0.0006
Max = 5 2 5(7, 12, 18, 22, 29) Type 4 0.0004 0.0006
(Gaussian-like MF) 3 4(10, 18, 20, 22) Type 2 0.0003 0.0005
4 4(3, 7, 21, 26) Type 2 0.0003 0.0005
5 2(2, 15) Type 4 0.0003 0.0005
1 5(1, 2, 3, 4, 5) Type 4 0.0003 0.0004
Max = 5 2 4(1, 8, 9, 14) Type 4 0.0003 0.0003
(Triangular MF) 3 4(2, 17, 23, 24) Type 3 0.0002 0.0003
4 5(8, 11, 18, 21, 26) Type 1 0.0002 0.0003
Type II 5 4(7, 9, 14, 29) Type 1 0.0002 0.0003
(SI: 5) 1 3(2, 3, 5) Type 3 0.0004 0.0003
Max = 4 2 4(9, 11, 21, 30) Type 2 0.0003 0.0003
(Gaussian-like MF) 3 4(2, 8, 19, 21) Type 4 0.0002 0.0002
4 2(22, 24) Type 4 0.0002 0.0002
5 3(8, 18, 27) Type 3 0.0002 0.0002
1 4(3, 4, 5, 6) Type 3 7.7e-5 1.6e-4
Max = 5 2 3(6, 16, 23) Type 4 6.7e-5 1.2e-4
(Triangular MF) 3 2(12, 18) Type 4 6.1e-5 1.0e-4
4 3(3, 15, 28) Type 3 5.6e-5 9.7e-5
Type III 5 4(13, 15, 19, 21) Type 1 5.7e-5 9.4e-5
(SI: 6) 1 3(1, 4, 5) Type 3 3.6e-5 4.5e-5
Max = 5 2 3(12, 14, 18) Type 2 3.2e-5 4.1e-5
(Gaussian-like MF) 3 5(7, 8, 13, 14, 20) Type 4 2.7e-5 3.6e-5
4 3(6, 14, 15) Type 3 2.4e-5 3.4e-5
5 3(18, 26, 28) Type 4 2.3e-5 3.3e-5
Table 16. Comparative analysis of the performance of the network; considered are
models reported in the literature
Performance index
Model (values: PI / PIs / EPIs / NDEI∗)
Wang's model [35]: 0.044; 0.013; 0.010
Cascaded-correlation NN [36]: 0.06
Backpropagation MLP [36]: 0.02
6th-order polynomial [36]: 0.04
ANFIS [37]: 0.0016 / 0.0015 / 0.007
FNN model [38]: 0.014 / 0.009
Recurrent neural network [39]: 0.0138
SONN∗∗ [40], Type I (5th layer), Basic: Case 1 0.0011 / 0.0011 / 0.005; Case 2 0.0027 / 0.0028 / 0.011
SONN∗∗ [40], Type I (5th layer), Modified: Case 1 0.0012 / 0.0011 / 0.005; Case 2 0.0038 / 0.0038 / 0.016
SONN∗∗ [40], Type II (5th layer), Basic: Case 1 0.0003 / 0.0005 / 0.0016; Case 2 0.0002 / 0.0004 / 0.0011
SONN∗∗ [40], Type III (5th layer), Modified: Case 1 0.000001 / 0.00009 / 0.000006; Case 2 0.00004 / 0.00007 / 0.00015
Proposed gSONN, PN-based: Type I (5th layer), Max = 10: 0.0026 / 0.0026; Type II (5th layer), Max = 10: 0.0012 / 0.0012; Type III (5th layer), Max = 10: 0.0009 / 0.0009
Proposed gSONN, FPN-based: Type I (1st layer): Triangular, Max = 5: 0.0017 / 0.0016; Gaussian, Max = 5: 0.0005 / 0.0006; Type II (1st layer): Triangular, Max = 5: 0.0003 / 0.0004; Gaussian, Max = 5: 0.0004 / 0.0003; Type III (1st layer): Triangular, Max = 5: 7.7e-5 / 1.6e-4; Gaussian, Max = 5: 3.6e-5 / 4.5e-5
*Non-dimensional error index (NDEI), as used in [41], is defined as the root
mean square error divided by the standard deviation of the target series.
** is called “conventional optimized FPN-based SONN”.
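The NDEI of the footnote above is simple to reproduce; a small helper, written here purely for illustration (a population standard deviation is assumed):

```python
import math

def ndei(target, pred):
    """Non-dimensional error index: RMSE divided by the standard deviation
    of the target series (population standard deviation assumed)."""
    n = len(target)
    rmse = math.sqrt(sum((t - p) ** 2 for t, p in zip(target, pred)) / n)
    mean = sum(target) / n
    std = math.sqrt(sum((t - mean) ** 2 for t in target) / n)
    return rmse / std
```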
7 Acknowledgements
This work was supported by the Korea Research Foundation Grant funded by
the Korean Government (MOEHRD)(KRF-2006-311-D00194 and KRF-2006-
D00019).
References
1. Cherkassky V, Gehring D, Mulier F (1996) Comparison of adaptive methods for
function estimation from samples. IEEE Trans Neural Netw 7:969–984
2. Dickerson JA, Kosko B (1996) Fuzzy function approximation with ellipsoidal
rules. IEEE Trans Syst Man Cybern Part B 26:542–560
3. Sommer V, Tobias P, Kohl D, Sundgren H, Lundstrom L (1995) Neural networks
and abductive networks for chemical sensor signals: a case comparison. Sens
Actuators B 28:217–222
4. Kleinsteuber S, Sepehri N (1996) A polynomial network modeling approach to
a class of large-scale hydraulic systems. Comput Elect Eng 22:151–168
5. Cordon O, et al. (2004) Ten years of genetic fuzzy systems: current framework
and new trends. Fuzzy Set Syst 141(1):5–31
6. Ivakhnenko AG (1971) Polynomial theory of complex systems. IEEE Trans Syst
Man Cybern SMC-1:364–378
7. Ivakhnenko AG, Madala HR (1994) Inductive learning algorithms for complex
systems modeling. CRC, Boca Raton, FL
8. Ivakhnenko AG, Ivakhnenko GA (1995) The review of problems solvable by
algorithms of the group method of data handling (GMDH). Pattern Recogn
Image Anal 5(4):527–535
9. Ivakhnenko AG, Ivakhnenko GA, Muller JA (1994) Self-organization of neural
networks with active neurons. Pattern Recogn Image Anal 4(2):185–196
10. Oh SK, Pedrycz W (2002) The design of self-organizing polynomial neural
networks. Inf Sci 141:237–258
11. Oh SK, Pedrycz W, Park BJ (2003) Polynomial neural networks architecture:
analysis and design. Comput Electr Eng 29(6):703–725
12. Park HS, Park BJ, Oh SK (2002) Optimal design of self-organizing polynomial
neural networks by means of genetic algorithms. J Res Inst Eng Technol Dev
22:111–121 (in Korean)
13. Oh SK, Pedrycz W (2003) Fuzzy polynomial neuron-based self-organizing neural
networks. Int J Gen Syst 32(3):237–250
14. Pedrycz W, Reformat M (1996) Evolutionary optimization of fuzzy models in
fuzzy logic: a framework for the new millennium. In: Dimitrov V, Korotkich V
(eds.) Studies in fuzziness and soft computing, pp 51–67
15. Pedrycz W (1984) An identification algorithm in fuzzy relational system. Fuzzy
Set Syst 13:153–167
16. Oh SK, Pedrycz W (2000) Identification of fuzzy systems by means of an
auto-tuning algorithm and its application to nonlinear systems. Fuzzy Set Syst
115(2):205–230
17. Michalewicz Z (1996) Genetic algorithms + Data structures = Evolution
programs. Springer, Berlin Heidelberg New York
18. Holland JH (1975) Adaptation in natural and artificial systems. The University
of Michigan Press, Ann Arbor
19. De Jong KA (1992) Are genetic algorithms function optimizers? In: Männer R,
Manderick B (eds.) Parallel problem solving from nature 2. North-Holland,
Amsterdam
20. Box GEP, Jenkins GM (1976) Time series analysis: forecasting and control.
Holden-Day, San Francisco, CA
21. Tong RM (1980) The evaluation of fuzzy models derived from experimental
data. Fuzzy Set Syst 13:1–12
Genetically Optimized Self-organizing Neural Networks 107
1 Introduction
System modelling and identification is important in system analysis, control, and automation as well as in scientific research, so much effort has been directed to developing advanced techniques of system modelling. Neural networks (NNs) and fuzzy systems have been widely used for modelling nonlinear systems. The approximation capability of neural networks, such as multilayer perceptrons, radial basis function (RBF) networks, or dynamic recurrent neural networks, has been investigated [1–3]. On the other hand, fuzzy systems can approximate nonlinear functions with arbitrary accuracy [4, 5]. But the resultant neural network representation is very complex and difficult to understand, and fuzzy systems require too many fuzzy rules for accurate function approximation.
D. Kim and G.-T. Park: Evolution of Inductive Self-organizing Networks, Studies in
Computational Intelligence (SCI) 82, 109–128 (2008)
www.springerlink.com
© Springer-Verlag Berlin Heidelberg 2008
When the SOPNN is designed by using an EA, the most important consideration is the representation strategy, that is, the process by which the key factors of the SOPNN are encoded into the chromosome. A binary coding is employed for the available design specifications. The order and the inputs of each node in the SOPNN are coded as a finite-length string. Our chromosomes are made of three sub-chromosomes. The first one consists of 2 bits for the order of the polynomial (PD), the second one consists of 3 bits for the number of inputs of the PD, and the last one consists of N bits, where N equals the total number of input candidates in the current layer. These input candidates are the node outputs of the previous layer. The representation of binary chromosomes is illustrated in Fig. 1. The 1st sub-chromosome is made of 2 bits, which represent the several possible orders of the PD. The relationship between the bits of the 1st sub-chromosome and the order of the PD is shown in Table 1. Thus, each node can exploit a different order of the polynomial. The 3rd sub-chromosome has N bits, a concatenated string of 0's and 1's. An input candidate is represented by a '1' bit if it is chosen as an input variable to the PD and by a '0' bit if it is not. This type of representation solves the problem of which input variables are to be chosen. If many input candidates are chosen for a model design, the modelling becomes computationally complex and normally requires a lot of time to achieve good results. Also, complex modelling can produce inaccurate results and poor generalization. Good approximation performance does not necessarily guarantee good generalization.
Fig. 1. Representation of the binary chromosome: the 1st sub-chromosome (2 bits) encodes the order of the polynomial, the 2nd sub-chromosome (3 bits) the number of inputs, and the 3rd sub-chromosome selects the input variables. In the example shown, x1 and x6 are selected, giving a PD with 2 inputs and a quadratic (Type 2) polynomial.
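As a concrete illustration, decoding such a chromosome can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' implementation: the bit-to-order mapping stands in for Table 1 (which is not reproduced here), and decoding the 2nd sub-chromosome as "binary value + 1" inputs is likewise an assumption.

```python
# Hypothetical decoder for the three sub-chromosomes described in the text.
# The bit-pair -> polynomial-order mapping below is an assumed stand-in for
# Table 1; the chapter's actual mapping may differ.
ORDER_OF_PD = {(0, 0): 1, (0, 1): 2, (1, 0): 2, (1, 1): 3}

def decode_chromosome(bits, n_candidates):
    """Split a binary chromosome into (order, number of inputs, selected inputs)."""
    order = ORDER_OF_PD[tuple(bits[0:2])]                # 1st sub-chromosome: 2 bits
    n_inputs = 1 + int("".join(map(str, bits[2:5])), 2)  # 2nd sub-chromosome: 3 bits
    selection = bits[5:5 + n_candidates]                 # 3rd sub-chromosome: N bits
    selected = [i for i, b in enumerate(selection) if b == 1]
    return order, n_inputs, selected

# The example of Fig. 1: a quadratic PD with 2 inputs, selecting x1 and x6.
chrom = [1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1]  # N = 6 input candidates
print(decode_chromosome(chrom, 6))  # (2, 2, [0, 5])
```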
All chromosomes of the initial population are randomly initialized; thus, the use of heuristic knowledge is minimized. The fitness assignment in the EA serves as a guide in the search toward the optimal solution. The fitness function for the specific modeling cases will be explained later. After each chromosome is evaluated and associated with a fitness, the current population undergoes the reproduction process to create the next generation of the population. The roulette-wheel selection scheme is used to determine the members of the new generation. After the new population is built, a mating pool is formed for the crossover. The crossover proceeds in three steps. First, two newly reproduced strings are selected from the mating pool produced by reproduction. Second, a position (one point) along the two strings is selected uniformly at random. Third, all characters following the crossing site are exchanged. We use a one-point crossover operator with a crossover probability pc of 0.85. The crossover is then followed by the mutation operation. Mutation is the occasional alteration of the value at a particular bit position (we flip the state of a bit from 0 to 1 or vice versa). Mutation serves as an insurance policy for recovering the loss of a particular piece of information (any simple bit). The mutation rate pm is fixed at 0.05. Generally, after these three operations, the overall fitness of the population improves.
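The three operators above can be sketched as follows. This is an illustrative Python sketch using the stated pc = 0.85 and pm = 0.05, not the authors' code:

```python
import random

PC, PM = 0.85, 0.05  # crossover probability pc and mutation rate pm from the text

def roulette(population, fitnesses):
    """Fitness-proportionate (roulette-wheel) selection of one member."""
    pick = random.uniform(0, sum(fitnesses))
    acc = 0.0
    for member, fit in zip(population, fitnesses):
        acc += fit
        if acc >= pick:
            return member
    return population[-1]

def one_point_crossover(a, b):
    """With probability PC, exchange all bits following a random crossing site."""
    if random.random() < PC:
        site = random.randrange(1, len(a))
        a, b = a[:site] + b[site:], b[:site] + a[site:]
    return a, b

def mutate(chrom):
    """Flip each bit with probability PM (0 -> 1 or vice versa)."""
    return [1 - g if random.random() < PM else g for g in chrom]
```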
E = Θ × PI + (1 − Θ) × EPI    (2)

where Θ ∈ [0, 1] is a weighting factor, and PI and EPI are the performance index values computed over the training data and testing data, respectively.
3 Simulation Results
In this section, we show the performance of our new EA-based SOPNN on two well-known nonlinear system modeling problems. One is the time series of a gas furnace (Box–Jenkins data) [19], which was studied previously in [19–21], and the other is a nonlinear system already exploited in fuzzy modeling [22–26]. The delayed terms of the methane gas flow rate u(t) and the carbon dioxide density y(t), such as u(t−3), u(t−2), u(t−1), y(t−3), y(t−2), and y(t−1), are used
as input variables to the EA-based SOPNN. The actual system output y(t)
is used as the target output variable for this model. We choose the input
variables of nodes in the 1st layer from these input variables. The total data
set consisting of 296 input-output pairs is split into two parts. The first part
(consisting of 148 pairs) is used as the training data set, and the remaining
part of the data set serves as a testing data set. Using the training data set, the
EA-based SOPNN can estimate the coefficients of the polynomial by using the
standard LSE. The performance index is defined as the mean squared error
PI (EPI) = (1/m) Σ_{i=1}^{m} (y_i − ŷ_i)²    (4)
where yi is the actual system output, ŷi is the estimated output of each node,
and m is the number of data.
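The performance index of equation (4), together with the construction of the delayed input vector, translates directly into code. The sketch below is illustrative (the gas furnace data itself is not included):

```python
def performance_index(y_actual, y_model):
    """Mean squared error of equation (4); PI on training data, EPI on testing data."""
    m = len(y_actual)
    return sum((y - yh) ** 2 for y, yh in zip(y_actual, y_model)) / m

def delayed_inputs(u, y, t):
    """Input vector [u(t-3), u(t-2), u(t-1), y(t-3), y(t-2), y(t-1)]."""
    return [u[t - 3], u[t - 2], u[t - 1], y[t - 3], y[t - 2], y[t - 1]]

print(performance_index([1.0, 2.0], [1.5, 2.5]))  # 0.25
```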
The design parameters of EA-based SOPNN for modeling are shown in
Table 3. In the 1st layer, 20 chromosomes are generated and evolved during
40 generations, where each chromosome in the population is defined as a
corresponding node. So 20 nodes (PDs) are produced in the 1st layer based
on the EA operators. All PDs are estimated and evaluated using the training
and testing data sets, respectively. They are also evaluated by the fitness
function of (3) and ranked according to their fitness values. We choose a predetermined number w of nodes, starting from the highest-ranking one, and use their outputs as new input variables for the nodes in the next layer. In
other words, the chosen PDs (w nodes) must be preserved for the design of
the next layer, and the outputs of the preserved PDs serve as inputs to the
next layer. The value of w is different for each layer, which is also shown in
Table 3. This procedure is repeated for the 2nd layer and the 3rd layer.
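The ranking-and-preservation step can be sketched as below; the PD class and its fitness values are placeholders for illustration, not the chapter's implementation:

```python
from dataclasses import dataclass

@dataclass
class PD:
    fitness: float  # fitness value assigned by the fitness function of (3)
    output: list    # node output over the data set

def select_top_w(pds, w):
    """Rank a layer's PDs by fitness and preserve the best w; their outputs
    become the input candidates of the next layer."""
    ranked = sorted(pds, key=lambda p: p.fitness, reverse=True)
    return [p.output for p in ranked[:w]]

layer1 = [PD(0.9, [1.0]), PD(0.4, [2.0]), PD(0.7, [3.0])]
print(select_top_w(layer1, 2))  # [[1.0], [3.0]]
```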
Table 4 summarizes the values of the performance indices, PI and EPI,
of the proposed EA-based SOPNN according to the weighting factor. These
values are the lowest in each layer. The overall lowest values of the performance
indices are obtained at the third layer when the weighting factor is 0.5. If this
model is designed to have a fourth or higher layer, the performance values
become much lower, and the computation time increases considerably for the
model with such a complex network.
Fig. 5 depicts the trend of the performance index values produced in successive generations of the EA for the weighting factor Θ of 0.5. Fig. 6 illustrates the values of the error function and fitness function in successive EA generations when Θ = 0.5. Fig. 7 shows the proposed EA-based SOPNN model with 3 layers and its identification performance for Θ = 0.5. The model output follows the actual output very well. The performance indices of the proposed method are PI = 0.012 and EPI = 0.108.
Fig. 5. Trend of the performance index values with respect to generations through layers: (a) performance index (PI) for the training data set, (b) performance index (EPI) for the testing data set
Fig. 6. Values of the error function and fitness function for successive generations: (a) error function (E), (b) fitness function
Fig. 7. (a) Proposed EA-based SOPNN model with 3 layers, (b) actual output versus model output, (c) error: Proposed EA-based SOPNN model with 3 layers and its identification performance
Fig. 8. (a) Basic SOPNN & Case 1, (b) Modified SOPNN & Case 2: Conventional
SOPNN models with 5 layers
For the comparison of the network size of the proposed EA-based SOPNN
with that of conventional SOPNN, conventional SOPNN models are visualized
in Fig. 8. The structure of the basic SOPNN & Case 1 in Fig. 8(a) is obtained
by use of 4 input variables and a Type 3 polynomial for every node in all layers up to the fifth layer. Its performance is PI = 0.012 and EPI = 0.084.
On the other hand, the structure of the modified SOPNN and Case 2 in
Fig. 8(b) is obtained by use of 2 input variables and Type 1 polynomial for
every node in the 1st layer and 3 input variables and Type 2 polynomial for
every node from the 2nd layer to the 5th layer. In this model, PI is 0.016,
and EPI is 0.101. Figures 7 and 8 show that the structure of the EA-based
SOPNN is much simpler than that of the conventional SOPNN in terms of
the number of nodes and layers, despite their similar performance.
Table 5 provides a comparison of the proposed model with other tech-
niques. The comparison is realized on the basis of the same performance index
for the training and testing data sets. Additionally, PI represents the perfor-
mance index of the model for the training data set and EPI for the testing
data. The proposed architecture, EA-based SOPNN model, outperforms other
models both in terms of accuracy and generalization capability.
Table 5. Comparison of the proposed model with other techniques

Model                                      PI      EPI
Kim's model [20]                           0.034   0.244
Lin's model [21]                           0.071   0.261
Kim's model [12]                           0.013   0.126
SOPNN (5 layers) [13]   Basic & Case 1     0.012   0.084
                        Modified & Case 2  0.016   0.101
EA-based SOPNN (3 layers)                  0.012   0.108
The second modeling example is a nonlinear static system, which has been widely used by Takagi and Hayashi [22], Sugeno and Kang [23], and Kondo [24] to test their modelling approaches.
Table 6 shows 40 pairs of input-output data obtained from (5) [26]. The input x4 is a dummy variable, which is not related to (5). The data in Table 6 are divided into the training data set (Nos. 1–20) and the testing data set (Nos. 21–40). To compare the performance, the same performance index adopted in [22–26], the average percentage error (APE), is used.
APE = (1/m) Σ_{i=1}^{m} (|y_i − ŷ_i| / y_i) × 100    (6)
where m is the number of data pairs, and y_i and ŷ_i are the i-th actual output and model output, respectively.
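Equation (6) can be coded directly (illustrative sketch):

```python
def average_percentage_error(y_actual, y_model):
    """APE of equation (6): mean absolute relative error, in percent."""
    m = len(y_actual)
    return 100.0 / m * sum(abs(y - yh) / y for y, yh in zip(y_actual, y_model))

print(average_percentage_error([10.0, 20.0], [9.0, 22.0]))  # 10.0 (two 10% errors)
```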
Again, a series of comprehensive experiments was conducted, and the results are summarized. The design parameters of the EA-based SOPNN for each layer are shown in Table 7.
The simulation results of the EA-based SOPNN are summarized in Table 8.
The lowest values of the performance indices, PI = 0.188 and EPI = 1.087, are
obtained at the third layer when the weighting factor (Θ) is 0.25.
Figure 9 illustrates the trend of the performance index values produced in
successive generations of the EA for the weighting factor Θ of 0.25. Figure 10
shows the values of the error function and fitness function in successive EA
generations when Θ is 0.25.
No. x1 x2 x3 x4 y No. x1 x2 x3 x4 y
1 1 3 1 1 11.11 21 1 1 5 1 9.545
2 1 5 2 1 6.521 22 1 3 4 1 6.043
3 1 1 3 5 10.19 23 1 5 3 5 5.724
4 1 3 4 5 6.043 24 1 1 2 5 11.25
5 1 5 5 1 5.242 25 1 3 1 1 11.11
6 5 1 4 1 19.02 26 5 5 2 1 14.36
7 5 3 3 5 14.15 27 5 1 3 5 19.61
8 5 5 2 5 14.36 28 5 3 4 5 13.65
9 5 1 1 1 27.42 29 5 5 5 1 12.43
10 5 3 2 1 15.39 30 5 1 4 1 19.02
11 1 5 3 5 5.724 31 1 3 3 5 6.38
12 1 1 4 5 9.766 32 1 5 2 5 6.521
13 1 3 5 1 5.87 33 1 1 1 1 16
14 1 5 4 1 5.406 34 1 3 2 1 7.219
15 1 1 3 5 10.19 35 1 5 3 5 5.724
16 5 3 2 5 15.39 36 5 1 4 5 19.02
17 5 5 1 1 19.68 37 5 3 5 1 13.39
18 5 1 2 1 21.06 38 5 5 4 1 12.68
19 5 3 3 5 14.15 39 5 1 3 5 19.61
20 5 5 4 5 12.68 40 5 3 2 5 15.39
Fig. 9. Trend of the performance index values with respect to generations through layers: (a) performance index (PI) for the training data set, (b) performance index (EPI) for the testing data set
Fig. 10. Values of the error function and fitness function for successive generations: (a) error function (E), (b) fitness function (F)
Fig. 11. Structure of the EA-based SOPNN model with 3 layers (Θ = 0.25)
Fig. 12. (a) Actual output versus model output for the training data, (b) errors of (a), (c) actual output versus model output for the testing data, (d) errors of (c): Identification performance and errors of the EA-based SOPNN model with 3 layers
Model                                      PI          EPI
GMDH model [24]                            4.7         5.7
Fuzzy model [23]        model 1            1.5         2.1
                        model 2            0.59        3.4
FNN [26]                type 1             0.84        1.22
                        type 2             0.73        1.28
                        type 3             0.63        1.25
GD-FNN [25]                                2.11        1.54
SOPNN (5 layers) [13]   Basic & case 1     2.59        8.52
                        Modified           Impossible  Impossible
EA-based SOPNN (3 layers)                  0.188       1.087
4 Conclusions
5 Acknowledgement
The authors gratefully acknowledge the financial support of the Korea Science &
Engineering Foundation. This work was supported by grant No. R01-2005-000-
11044-0 from the Basic Research Program of the Korea Science & Engineering
Foundation. The authors are also very grateful to the anonymous reviewers
for their valuable comments.
APPENDIX
SELF-ORGANIZING POLYNOMIAL NEURAL NETWORK [13]
Fig. 13. Overall architecture of the SOPNN: the inputs x_{1i}, ..., x_{Ni} feed the PDs of the 1st layer; in each j-th layer, every PD takes outputs z^{j−1} of the previous layer as its inputs, generating N!/{(N−r)!r!} candidate nodes per layer, and the final output is ŷ_i
E_k = (1/n_tr) Σ_{i=1}^{n_tr} (y_i − z_{ki})²,   k = 1, 2, ..., N!/(r!(N−r)!)    (7)
where z_{ki} denotes the output of the k-th node with respect to the i-th data point. This step is repeated for all the nodes in the current layer and, subsequently, for all layers of the SOPNN, starting from the input layer and proceeding to the output layer.
Step 6: Select PDs with good predictive capability
The predictive capability of each PD is evaluated by the performance index using the testing data set. Then, we choose w PDs among the N!/(r!(N−r)!) PDs in the order of the best predictive capability (the lowest value of the performance index). Here, w is the pre-defined number of PDs that must be preserved to the next layer. The outputs of the chosen PDs serve as inputs to the next layer.
Step 7: Check the stopping criterion
The SOPNN algorithm terminates when the number of layers predetermined
by the designer is reached.
Step 8: Determine new input variables for the next layer
If the stopping criterion is not satisfied, the next layer is constructed by
repeating step 4 through step 8.
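The combinatorial core of steps 4–6 can be sketched as follows (an illustrative Python sketch; the (pd, error) pairs stand in for fitted PDs and their testing-set performance indices):

```python
from itertools import combinations
from math import factorial

def n_pds(n, r):
    """N!/((N-r)! r!): the number of PDs generated in one layer."""
    return factorial(n) // (factorial(n - r) * factorial(r))

def candidate_pds(n_inputs, r):
    """Step 4: one PD for every combination of r inputs out of N candidates."""
    return list(combinations(range(n_inputs), r))

def preserve(pds_with_errors, w):
    """Step 6: keep the w PDs with the lowest testing-set performance index."""
    return sorted(pds_with_errors, key=lambda pe: pe[1])[:w]

print(n_pds(6, 2))  # 15 candidate PDs when N = 6 inputs are taken r = 2 at a time
```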
The overall architecture of the SOPNN is shown in Fig. 13.
References
1. Hornik K, Stinchcombe M, White H (1989) Multilayer feedforward networks are
universal approximators. Neural Netw 2:359–366
24. Kondo T (1986) Revised GMDH algorithm estimating degree of the complete
polynomial. Trans Soc Instrum Control Eng 22:928–934
25. Wu S, Er MJ, Gao Y (2001) A fast approach for automatic generation of fuzzy
rules by generalized dynamic fuzzy neural networks. IEEE Trans Fuzzy Syst
9:578–594
26. Horikawa SI, Furuhashi T, Uchikawa Y (1992) On fuzzy modeling using fuzzy
neural networks with the back-propagation algorithm. IEEE Trans Neural Netw
3:801–806
Recursive Pattern based Hybrid Supervised
Training
1 Introduction
In this chapter we study the Recursive Pattern based Hybrid Supervised (RPHS) learning system, a hybrid evolutionary-learning approach to task decomposition. Typically, the RPHS system consists of a pattern distributor and several neural networks, or sub-networks. The input signal propagates through the system in a forward direction to the pattern distributor, which chooses the best sub-network to solve the problem; that sub-network then outputs the corresponding solution.
RPHS has been applied successfully to some difficult and diverse classification problems using a recursive hybrid training approach. The algorithm is based on the concept of pseudo-global optima, which in turn builds on the idea of local optima and the gradient descent rule.
K. Ramanathan and S.U. Guan: Recursive Pattern based Hybrid Supervised Training, Studies
in Computational Intelligence (SCI) 82, 129–156 (2008)
www.springerlink.com © Springer-Verlag Berlin Heidelberg 2008
Basically, RPHS learning involves four steps: evolutionary learning [19], decomposition, gradient descent [26], and integration. During evolutionary learning, a set of T patterns is fed to a population of chromosomes to decide the structure and weights of a preliminary pseudo-global optimal solution (also known as a preliminary subnetwork). Decomposition identifies the learnt and unlearnt patterns of the preliminary subnetwork. A gradient descent approach then optimizes the preliminary subnetwork using the learnt patterns. This process produces a pseudo-global optimal solution, or final subnetwork. The whole process is then repeated recursively with the unlearnt patterns. Integration works by creating a pattern distributor, a classification system which helps associate a given pattern with a corresponding subnetwork.
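One way to sketch this four-step recursion in code is shown below. This is a hypothetical outline, not the authors' implementation: fit and refine stand in for the evolutionary and gradient descent stages, and XI is an assumed tolerance value.

```python
XI = 0.1  # error tolerance for "learnt" patterns (an assumed value)

def rphs_train(patterns, fit, refine, min_size=2, nets=None):
    """Recursive decomposition: evolve a preliminary subnetwork, refine it
    on the learnt patterns, then recurse on the unlearnt remainder (the
    constructive-backpropagation last recursion is folded into refine)."""
    if nets is None:
        nets = []
    net = fit(patterns)                      # evolutionary learning stage
    if len(patterns) <= min_size:            # recursion K: learn what is left
        nets.append(refine(net, patterns))
        return nets
    learnt = [(x, y) for x, y in patterns if abs(net(x) - y) <= XI]
    unlearnt = [(x, y) for x, y in patterns if abs(net(x) - y) > XI]
    if not learnt or not unlearnt:           # nothing left to split
        nets.append(refine(net, patterns))
        return nets
    nets.append(refine(net, learnt))         # gradient descent on learnt set
    return rphs_train(unlearnt, fit, refine, min_size, nets)

# Toy stand-ins: "fit" predicts the first pattern's output, "refine" is identity.
fit = lambda pts: (lambda x, c=pts[0][1]: c)
refine = lambda net, pts: net
nets = rphs_train([(0, 1.0), (1, 1.02), (2, 5.0), (3, 7.0)], fit, refine)
print(len(nets))  # 2 subnetworks: one per discovered pattern subset
```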
The RPHS system has two distinctive characteristics:
1. Evolutionary algorithms are used to find the vicinity of a pseudo-global optimum, and gradient descent is used to find the actual optimum. The evolutionary algorithms therefore aim to span a larger area of the search space and to improve the performance of the gradient descent algorithm.
2. Two validation algorithms are used in training the system, to identify
the correct stopping point for the gradient descent [26] and to identify
when to stop the recursive decomposition. These two validation algorithms
ensure that the RPHS system is completely adapted to the topology of
the training and validation data. It is through a combination of these
characteristics, together with the ability to unlearn old information and
adapt to new information, that the RPHS system derives its high accuracy.
1.1 Motivation
2 Some preliminaries
2.1 Notation
m : Input dimension
n : Output dimension
K : Number of RPHS recursions
I : Input
O : Output
Tr : Training
V al : Validation
P : Ensemble of subsets
S : Neural network solution
E : Error
NH : Number of hidden nodes in the 3-layered perceptron
Further, let I_v = {I_1, I_2, ..., I_{Tv}} and O_v = {O_1, O_2, ..., O_{Tv}} represent the input and output patterns of a set of validation data, such that Val = {I_v, O_v}. We wish to take Tr as training patterns to the system and Val as the validation data and come up with an ensemble of K subsets. Let P represent this ensemble of K subsets:

P = {P^1, P^2, ..., P^K}

where, for i ∈ K,

P^i = {Tr^i, Val^i},  Tr^i = {I^i_tr, O^i_tr},  Val^i = {I^i_v, O^i_v}

Here, I^i_tr and O^i_tr are m × T^i and n × T^i matrices respectively, and I^i_v and O^i_v are m × T^i_v and n × T^i_v matrices respectively, such that Σ_{i=1}^{K} T^i = T and Σ_{i=1}^{K} T^i_v = T_v.
i=1 i=1
We need to find a set of neural networks S = {S^1, S^2, ..., S^K}, where S^1 solves P^1, S^2 solves P^2, and so on.
P should fulfill two conditions:
1. The individual subsets can be trained with a small mean square training error, i.e., E^i_tr = O^i_tr − S^i(I^i_tr) → 0.
2. None of the subsets Tr^i to Tr^K are overtrained, i.e., Σ_{i=1}^{j} E^i_val < Σ_{i=1}^{j+1} E^i_val;  j, j + 1 ∈ K.
The first property implies that each of the neural networks is a global optimum with respect to its training subset. The second property implies two things: firstly, each individual network should not be overtrained, and secondly, none of the decompositions should be detrimental to the system.
NP = mNH + NH n + n + NH (1)
Each of these free parameters is one element of the chromosome and represents
one of the weights or the biases in the network.
According to equation 1, a chromosome is therefore defined by the value of
NP which in turn depends on the value of NH . Initialization of the population
is done by generating a random number of hidden nodes for each individual
and a chromosome based on this number.
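Equation (1) and this initialization step can be sketched as follows (illustrative Python; the weight range [-1, 1] and the cap on hidden nodes are assumptions):

```python
import random

def n_free_params(m, n_h, n):
    """N_P = m*N_H + N_H*n + n + N_H (equation 1): input-to-hidden weights,
    hidden-to-output weights, output biases, and hidden biases."""
    return m * n_h + n_h * n + n + n_h

def random_individual(m, n, max_hidden=10):
    """Draw a random hidden-node count, then a chromosome of N_P real genes."""
    n_h = random.randint(1, max_hidden)
    return n_h, [random.uniform(-1.0, 1.0) for _ in range(n_free_params(m, n_h, n))]

print(n_free_params(4, 5, 2))  # 4*5 + 5*2 + 2 + 5 = 37 free parameters
```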
The performance of the RPHS algorithm can be attributed to the fact that
RPHS aims to find several pseudo-global solutions as opposed to a single
global solution. We define a pseudo-global optimal solution as follows:
Definition 1. A pseudo-global optimum is a global optimum when viewed from the perspective of a subset of training patterns, but could be a local (or global) optimum when viewed from the perspective of all the training patterns.
In this section, we highlight the differences among the multisieving algorithm [24], single-staged GANN training algorithms [11, 31], and the RPHS algorithm, based on the concept and a simplified model of pseudo-global optima. Consider the use of the RPHS algorithm to model a function S^i such that O^i_tr = S^i(w, I^i_tr), where w is the weight vector of values to be optimized. The training error at any point of time is the mean square error over the training patterns.
We know that at any given point, the training error can be split into the error of the T^i learnt patterns and the error of the (T − T^i) unlearnt patterns.
Definition 2. A pattern is considered learnt if its output differs from the ideal output by no more than an error tolerance ξ. The total number of patterns learnt is therefore
T^i = Σ_{i=0}^{T} δ[ (1/O) Σ_{j=0}^{O} φ( ξ − |O_{i,j} − Ô_{i,j}| ) − 1 ]    (3)
where δ(·) is the unit impulse function, φ(·) is the unit step function, and Ô_{i,j} is the network approximation to O_{i,j}.
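A direct transcription of equation (3) reads as follows (illustrative Python; xi denotes the tolerance ξ):

```python
def step(v):     # phi(.): unit step function
    return 1.0 if v >= 0 else 0.0

def impulse(v):  # delta(.): unit impulse function
    return 1.0 if v == 0 else 0.0

def n_learnt(targets, outputs, xi):
    """T^i of equation (3): a pattern counts as learnt only when every one of
    its O output components is within xi of the target, so that the inner
    average equals 1 and the impulse fires."""
    count = 0.0
    for t_row, o_row in zip(targets, outputs):
        inner = sum(step(xi - abs(t - o)) for t, o in zip(t_row, o_row))
        count += impulse(inner / len(t_row) - 1.0)
    return count

print(n_learnt([[1.0, 2.0], [0.0, 0.0]], [[1.05, 2.0], [0.5, 0.0]], 0.1))  # 1.0
```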
By the definition of learnt and unlearnt patterns, the error of each learnt pattern is at most ξ, while that of each unlearnt pattern exceeds ξ. Also, as we approach the optimal points, E^i_tr → 0. Consider that, at the end of evolutionary training, all the learnt patterns have an error less than the error tolerance ξ, i.e.,

E_tr = E^i_tr + E_unlearnt < ξT^i + E_unlearnt    (4)
Fig. 2. Illustration of solutions found by (i) RPHS (S_RPHS), (ii) single-staged hybrid search (S_ss), (iii) the multisieving algorithm (S_m)
The gradient descent step L_k is carried out with T^i patterns, and the step G_{k+1} with T − Σ_{i=1}^{k} T^i patterns. The value of E_unlearnt is therefore a constant during L_k, i.e., for any given gradient descent epoch,

E_tr = E^i_tr + C    (5)
Figure 2 illustrates how the RPHS algorithm, a single-staged training algorithm, and the multisieving algorithm [24] find their solutions. The graph shows a hypothetical one-dimensional error surface to be optimized with respect to w. Assume that at the end of the evolutionary training phase, solution S_g has an error value E_g computed according to equation 3. A single-staged algorithm such as backpropagation or a GA classifier will either try to search for an optimum at this stage or, if the probability of finding the optimum is too small, climb the hill and reach the local optimum marked by S_ss.
However, by virtue of equation 5, E_unlearnt is a constant value C. The error curve (represented by the dotted line) is just a vertically translated copy of the part of the original curve that is of interest to us.
Now, if we consider the multisieving algorithm [24] or the topology-based selection algorithm [22], the data splitting occurs with learnt patterns being classified as those with |O_j − Ô_j| < ξ. The splitting of the data therefore depends on the error tolerance of learnt patterns ξ, as defined in equation 3. With respect to figure 2, we can think of this algorithm as finding the solution S_g using a gradient descent algorithm. The final solution S_m is a vertically translated S_g. However,
S_m can only be equal to the translated local optimum if the error tolerance ξ is set to an optimum value. This is because the solution of the multisieving algorithm is considered found when the pattern is learnt to within the error tolerance ξ. On the other hand, we can see from figure 2 that the translated local optimum due to the splitting of patterns is more optimal than the other optimal solutions, i.e.,

E_TLO ≤ E_global    (6)

This is by virtue of the fact that E^i_tr → 0 as we approach the optimal point. Further, from equation 5, ∂E_tr/∂w → 0 as ∂E^i_tr/∂w → 0. Therefore, the solution found by the RPHS algorithm is a pseudo-global optimum, i.e., it could be a local optimum but it appears global from the perspective of a pattern subset.
In contrast to the multisieving algorithm, the RPHS solution adapts itself to the problem topology regardless of the error tolerance ξ, owing to the gradient descent at the end of each recursion. Finding a pseudo-global optimum therefore reduces the dependence of the algorithm on the error tolerance of learnt patterns ξ. It is also the natural optimum for the data subset. Note: since early stopping is implemented during backpropagation to prevent overtraining, the optimum found by RPHS may not necessarily be S_RPHS, but in the vicinity of S_RPHS.
3. In this stage, we use a condition similar to that in [22] and the multisieving network [24] to identify learnt patterns, i.e., a pattern is considered learnt if |O_j − Ô_j| < ξ. More formally, we can define the percentage of total patterns learnt as in equation 3. Note that, similar to the multisieving algorithm, a tolerance ξ is used to identify learnt patterns; the arbitrarily set value of ξ for RPHS does not affect the performance of the algorithm, as explained in section 2.
4. The dataset is now split into learnt and unlearnt patterns. With the
unlearnt patterns, we repeat steps 1 to 3.
5. Since the learnt patterns are only learnt up to a tolerance ξ, we use
gradient descent to train the learnt patterns. The aim of gradient descent
is to best adapt the solution to the data topology. Backpropagation is
used in all the recursions except the last one for which constructive
backpropagation is used. The optimum thus found is called the pseudo-
global optimal solution, and is found using a validation set of data to
prevent over training and to overcome the dependence of the algorithm
on ξ.
As the number of patterns in a data subset is small, especially as the number of recursions increases, it is possible for the pseudo-global optimal solution to overfit the data in the subset. To avoid this possibility, we use a validation dataset. The validation dataset is used along with the training data to detect generalization loss using an algorithm in [10].
The data decomposition technique of the RPHS algorithm can be best
described by figure 3. During the first recursion, the entire training set (size
T ) is learnt using evolutionary training until stagnation occurs. Only the learnt
patterns are learnt further using backpropagation, with measures to prevent
overtraining. This ensures the finding of a pseudo global optimal solution.
Legend:
BP: Backpropagation
CBP: Constructive backpropagation
GANN: genetic algorithm evolved neural nets
S^i: Solution corresponding to the dataset Tr^i
Fig. 3. Recursive data decomposition employed by RPHS
The second recursion repeats the same procedure with the unlearnt patterns.
The process repeats until the total number of patterns in a given recursion
(Recursion K) is too small, in which case, constructive backpropagation is
applied to the whole dataset to learn the remaining patterns to the best
possible extent.
3.2 Testing
Figure 5 presents the pseudo code for training the RPHS system. Train is
initially called with i = 1 and T r and V al as the whole training and validation
set. In addition to the algorithm described in the previous section, Figure 5
also introduces the two validation procedures used in terminating the system
(indicated in bold). Step 2 uses the validation subset while training the
patterns in a given recursion to ensure that the neural network does not
over-represent the patterns in question. Step 4 uses the validation procedure
to determine when the recursions should be stopped and whether a subsequent
decomposition is detrimental to the RPHS system.
Fig. 6. The two spiral data set and an example of how it can be decomposed into
several smaller datasets that are more easily separable
6.2 Separability
Separability criterion
In order to increase the inter-recursion separation, we modify the fitness
function for GANNs as given by the equation below, where f(x) denotes the
original fitness function and DB_att(x) the separation term:

g(x) = w_1 f(x) − (1 − w_1) DB_att(x)    (8)
The total time required for RPHS with MGG can also be expressed as a
summation of the time taken in each recursion i, t_RPHS = t_PD + Σ_{i=1}^{R} t_i.
The time taken for each recursion is given below.
where λ_g^i and λ_l^i refer to the number of epochs required for evolutionary
and backpropagation training in recursion i, and α_i is the number of learnt
patterns at the end of the recursion. The last term refers to the initialization
of the recursion population with N_pop chromosomes (as explained in figure 5).
The bulk of the time in the equation above depends on the third term,
i.e., the initial evaluation of the chromosome population in each recursion.
The justification of the above claim follows from the properties of RPHS and
evolutionary search:
1. As the patterns in each backpropagation epoch are already learnt, fewer
epochs are required than for the training of modules in output parallelism,
i.e., λ_l^{i,RPHS} < λ^{i,OP}.
2. From the experimental results observed, and due to the capability of
genetic algorithms to find partial solutions faster, we can also say that
λ_g^{i,RPHS} is small. In the experiments carried out, the value of λ_g^i is
usually less than 20 epochs.
The location of the pseudo-global optimal solution found by the GA is relatively
unimportant, as the pseudo-global optimum is always globally optimal in terms
of the patterns selected. This implies that with a small population size, the
RPHS algorithm is likely to be faster than the output parallelism algorithm.
In order to observe the effect of the number of chromosomes Npop on
the training time and the generalization accuracy of the RPHS system, we
performed a set of experiments using the MGG based RPHS algorithm with
a varying number of chromosomes. The graphs in figure 7 show the trend in
training time and generalization accuracy for initial population sizes between
5 and 30. The population size of 5 chromosomes was chosen so that MGG
can be implemented with 4 chromosomes for mating while still retaining the best
fitness values.
It is interesting to note that the number of chromosomes Npop in the initial
population of each recursion does not play a big role in the generalization
accuracy of the system. This is, once again, an expected property of the RPHS
algorithm as it is the backpropagation algorithm that completes the training
of the system according to the validation data. The part played by the genetic
algorithm is only partial training and it is the presence of the local optima, not
its relative position that is important for the RPHS algorithm. Therefore, if
training time is an issue, using the minimal requirement of 5 chromosomes and
implementing MGG can solve the problem to certain accuracy comparable to
that using a larger population.
Fig. 7. The effect of using different sized initial populations for RPHS with the
(a) SEGMENTATION, (b) VOWEL, (c) SPAM datasets. Graphs show the training
time and generalization error against the number of chromosomes in the
initial population.
Therefore, the most efficient training time for the RPHS algorithm will be
as given by equation 10, which is based on equation 9,
t_RPHS = t_PD + Σ_{i=1}^{R} t_i = t_PD + Σ_{i=1}^{R} [K_g^i T_i t + K_l^i (T_i − α_i) t + 5 T_i t]    (10)
For optimal training, it is necessary to use suitable validation data for each
decomposed training sets. In this section we propose and justify the algorithm
for choosing the optimal validation data for each subset of training data.
Consider the distribution of data shown in figure 8. Each colored zone
represents data from a different recursion. The patterns learnt by solution i
are explicitly exclusive of the patterns learnt by solution j, ∀ i ≠ j. The RPHS
decomposition tree in figure 3 can therefore be expressed as shown in figure 9.
According to figure 9 and the RPHS training algorithm described in section 4,
the first recursion begins with Tr, the data to be trained using EAs. At the end
of the recursion, Tr is split into Tr_1 (data to be trained with backpropagation
to give S^1, the network representing the data Tr_1) and (Tr − Tr_1) (data to
be trained using EAs to give solutions 2 to n). We represent all the networks
Given a set of patterns, FindV_i() finds out which patterns can possibly be
solved by the solutions that exist. Patterns that can be solved are isolated and
used as specific validation sets. Besides a more accurate validation dataset,
it is also possible to obtain the intermediate generalization capability of the
system, which is useful in stopping recursions, as described in the next section.
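A minimal sketch of this idea follows, with `solves(sol, p)` as a hypothetical predicate for "solution sol classifies pattern p correctly" (the names are illustrative, not from the original):

```python
def find_validation_subsets(solutions, val_patterns, solves):
    """For each existing solution, isolate the validation patterns it can
    already solve; these become that recursion's specific validation set.
    Whatever remains validates the later recursions."""
    subsets = []
    remaining = list(val_patterns)
    for sol in solutions:
        mine = [p for p in remaining if solves(sol, p)]
        remaining = [p for p in remaining if not solves(sol, p)]
        subsets.append(mine)
    return subsets, remaining
```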
² Both E_val^i and E_tr^i represent the percentage of validation and training
patterns in error of the RPHS system with i recursions.
The graphs in figure 11 below compare the mean squared training error for the
various datasets. The typical training curves for constructive backpropagation
[23] and for RPHS training are shown.
From Figure 11, we can observe that the training error obtained by RPHS
is lower than the training error that is obtained using CBP [23]. The empirical
comparison shows that, typically, the RPHS training curve converges to a
better optimum than the CBP curve. At this stage, a note should be made on
the shape of the RPHS curve: the MSE value increases at the end of each
recursion before reducing further, because the RPHS algorithm reinitializes
the population at the end of each recursion.
The reinitialization of the population at the end of each recursion benefits
the solution set reached. The new population is now focused on learning the
“unlearnt” patterns, thereby enabling the search to exit from the pseudo global
(local) optima of the previous recursion.
Tables 3 to 7 summarize the training time and generalization accuracy
obtained by CBP [23], Output Parallelism with [15] and without the pattern
distributor [10] and RPHS. The Output Parallelism algorithm is chosen so as
to illustrate the difference between manual choice of subsets and evolutionary
search based choice of subsets.
Table 3. Summary of the results obtained from the Segmentation problem (average
number of recursions: 7.2)
Table 4. Summary of the results obtained from the Vowel problem (average number
of recursions: 6.425)
Table 5. Summary of the results obtained from the Letter recognition problem
(average number of recursions: 21.3)
Table 6. Summary of the results obtained from the Spam problem (average number
of recursions: 2.475)
Table 7. Summary of the results obtained from the Two-spiral problem (average
number of recursions: 2.475)³
To show the effect of the heuristics proposed, the RPHS training is carried
out with four options:
1. Genetic Algorithms with no decomposition of validation patterns
(RPHS-GAND)
2. Genetic Algorithms with decomposition of validation patterns (RPHS-
GAD)
3. MGG with no decomposition of validation patterns (RPHS-MGGND)
4. MGG with decomposition of validation patterns (RPHS-MGGD)
The graphs in figure 11 compare CBP with the 4th training option (RPHS-
MGGD).
Based on the results presented, we can make the following observations
and classify them according to training time and generalization accuracy.
³ Results of the topology based subset selection algorithm [22]
Generalization accuracy
• All the RPHS algorithms give better generalization accuracy when com-
pared to the traditional algorithms (CBP, Topology based selection and
multisieving).
• The algorithms which include the decomposition of validation data, although
marginally slower than those without decomposition, have better
generalization accuracy than output parallelism. As the algorithms
implementing output parallelism do so with a manual decomposition of
validation data, it follows that a version of RPHS will be more accurate than a
corresponding algorithm based on output parallelism.
• Implementing RPHS with the separation criterion gives the best general-
ization accuracy although there is a large tradeoff in time. This is discussed
in greater detail in the following section.
• The RPHS algorithm that uses MGG with the decomposition of valida-
tion patterns (MGGD) provides the best tradeoff between training time
and generalization accuracy. When compared to RPHS-GAD and RPHS
with separation, the tradeoff in generalization accuracy is minimal when
compared to the reduction in training time.
• The number of recursions required by RPHS, on average, is lower than the
number of classes in a problem and gives better generalization accuracy.
This suggests that classwise decomposition of data is not always optimal.
Training time
• The training time required by CBP is the shortest of all the algorithms.
However, as seen from the graphs, this short training time is most likely
due to premature convergence of the CBP algorithm.
• The training of RPHS is also more gradual. While premature convergence
is easily observed in the case of CBP, RPHS converges more slowly. The
recursions reduce the training error in small steps, learning a small number
of patterns at a time.
• Apart from the CBP algorithm, the RPHS algorithm carried out with
MGG has shorter training time than the output parallelism algorithms.
The training time of the multisieving algorithm is larger or smaller than
that of the RPHS-MGG based algorithms depending on the dataset. This is
expected, as the nature of the dataset determines the number of levels with
which multisieving has to be implemented and therefore influences the training time.
• The basic contribution of the minimal generation gap genetic algorithms is
the reduction of training time. However, there is a small tradeoff in
generalization accuracy when MGGs are used. This can be observed across all
the problems.
• The use of the separation criterion with the RPHS algorithm increases the
training time by several fold. This is expected as the training time includes
the calculation of the inverse covariance matrix. This is the tradeoff for
8 Conclusions
References
1. Andreou AS, Efstratios F, Spiridon G, Likothanassis D (2002) Exchange rates
forecasting: a hybrid algorithm based on genetically optimized adaptive neural
networks. Comput Econ 20(3):191–200
2. Breiman L (1984) Classification and regression trees. Wadsworth International
Group, Belmont, California
3. Chiang CC, Fu HH (1994) Divide-and-conquer methodology for modular su-
pervised neural network design. In: Proceedings of the 1994 IEEE international
conference on neural networks 1:119–124
4. Dokur Z (2002) Segmentation of MR and CT images using hybrid neural network
trained by genetic algorithms. Neural Process Lett 16(3):211–225
5. Foody GM (1998) Issues in training set selection and refinement for classification
by a feedforward neural network. Geoscience and remote sensing symposium
proceeding:401–411
6. Fukunaga K (1990) Introduction to statistical pattern recognition, Academic,
Boston
7. Goldberg DE, Deb K, Korb B (1991) Don’t worry, be messy. In: Belew R, Booker
L (eds.) Proceedings of the fourth international conference in genetic algorithms
and their applications, pp 24–30
8. Gong DX, Ruan XG, Qiao JF (2004) A neuro computing model for real
coded genetic algorithm with the minimal generation gap. Neural Comput Appl
13:221–228
9. Guan SU, Liu J (2002) Incremental ordered neural network training. J Intell
Syst 12(3):137–172
10. Guan SU, Li S (2002) Parallel growing and training of neural networks using
output parallelism. IEEE Trans Neural Netw 13(3):542–550
11. Guan SU, Ramanathan K (2007) Percentage-based hybrid pattern training with
neural network specific cross over, Journal of Intelligent Systems 16(1):1–26
12. Guan SU, Li P (2002) A hierarchical incremental learning approach to task
decomposition. J Intell Syst 12(3):201–226
13. Guan SU, Li S, Liu J (2002) Incremental learning with an increasing input
dimension. J Inst Eng Singapore 42(4):33–38
14. Guan SU, Liu J (2004) Incremental neural network training with an increasing
input dimension. J Intell Syst 13(1):43–69
15. Guan SU, Neo TN, Bao C (2004) Task decomposition using pattern distributor.
J Intell Syst 13(2):123–150
16. Guan SU, Zhu F (2004) Class decomposition for GA-based classifier agents – a
pitt approach. IEEE Trans Syst Man Cybern, Part B: Cybern 34(1):381–392
17. Guan SU, Zhu F (2005) An incremental approach to genetic algorithms based
classification. IEEE Trans Syst Man Cybern Part B 35(2):227–239
18. Haykins S (1999) Neural networks, a comprehensive foundation, Prentice Hall,
Englewood Cliffs, NJ
19. Holland JH (1973) Genetic algorithms and the optimal allocation of trials. SIAM
J Comput 2(2):88–105
20. Kim SP, Sanchez JC, Erdogmus D, Rao YN, Wessberg J, Principe J, Nicolelis M
(2003) Divide and conquer approach for brain machine interfaces: non linear
mixture of competitive linear models, Neural Netw 16(5–6):865–871
21. Lang KJ, Witbrock MJ (1988) Learning to tell two spirals apart. In:
Touretzky D, Hinton G, Sejnowski T (eds.) Proceedings of the 1988 connec-
tionist models summer School, Morgan Kaufmann, San Mateo, CA
22. Lasarzyck CWG, Dittrich P, Banzhaf W (2004) Dynamic subset selection based
on a fitness case topology. Evol Comput, 12(4):223–242
23. Lehtokangas M (1999) Modelling with constructive backpropagation. Neural
Netw 12:707–714
24. Lu BL, Ito K, Kita H, Nishikawa Y (1995) Parallel and modular multi-sieving
neural network architecture for constructive learning. In: Proceedings of the 4th
international conference on artificial neural networks 409:92–97
25. Quinlan JR (1986) Induction of decision trees. Mach Learn 1:81–106
26. Rumelhart D, Hinton G, Williams R (1986) Learning internal representations
by error propagation. In Rumelhart D, McClelland J (eds.) Parallel distributed
processing, 1: Foundations. MIT, Cambridge, MA
27. Satoh H, Yamamura M, Kobayashi S (1996) Minimal generation gap model for
GAs considering both exploration and exploitation. In: Proceedings of 4th Int
Conference on Soft Computing, Iizuka:494–497
28. The UCI Machine Learning repository:
http://www.ics.uci.edu/mlearn/MLRepository.html
29. Wong MA, Lane T (1983) A kth nearest neighbor clustering procedure. JR Stat
Soc (B) 45(3):362–368
30. Yao X (1993) A review of evolutionary artificial neural networks. Int J Intell
Syst 8(4):539–567
31. Yasunaga M, Yoshida E, Yoshihara I (1999) Parallel backpropagation using
genetic algorithm: real-time BP learning on the massively parallel computer
CP-PACS. In: International Joint Conference on Neural Networks 6:4175–4180
32. Zhang BT, Cho DY (1998) Genetic programming with active data selection.
Simulated Evol Learn 1485:146–153
Enhancing Recursive Supervised Learning
Using Clustering and Combinatorial
Optimization (RSL-CC)
Summary. The use of a team of weak learners to learn a dataset has been shown
to be better than the use of a single strong learner. In fact, the idea is so successful
that boosting, an algorithm combining several weak learners for supervised learn-
ing, has been considered one of the best off-the-shelf classifiers. However, some
problems still remain, including determining the optimal number of weak learners
and the overfitting of data. In an earlier work, we developed the RPHP algorithm
which solves both these problems by using a combination of genetic algorithm, weak
learner and pattern distributor. In this paper, we revise the global search component
by replacing it with a cluster-based combinatorial optimization. Patterns are clus-
tered according to the output space of the problem, i.e., natural clusters are formed
based on patterns belonging to each class. A combinatorial optimization problem
is therefore formed, which is solved using evolutionary algorithms. The evolution-
ary algorithms identify the “easy” and the “difficult” clusters in the system. The
removal of the easy patterns then gives way to the focused learning of the more com-
plicated patterns. The problem therefore becomes recursively simpler. Overfitting is
overcome by using a set of validation patterns along with a pattern distributor. An
algorithm is also proposed to use the pattern distributor to determine the optimal
number of recursions and hence the optimal number of weak learners for the prob-
lem. Empirical studies show generally good performance when compared to other
state-of-the-art methods.
1 Introduction
Recursive supervised learners are a combination of weak learners, data de-
composition and integration. Instead of learning the whole dataset, different
learners (neural networks, for instance) are used to learn different subsets
of the data, resulting in several sub solutions (or sub-networks). These sub-
networks are then integrated together to form the final solution to the system.
Figure 1 shows the general architecture of such learners.
K. Ramanathan and S.U. Guan: Enhancing Recursive Supervised Learning Using Clustering
and Combinatorial Optimization (RSL-CC), Studies in Computational Intelligence (SCI) 82,
157–176 (2008)
www.springerlink.com c Springer-Verlag Berlin Heidelberg 2008
In the design of recursive learners, several factors come into play in deter-
mining the generalization accuracy of the system. Factors include
1. The accuracy of the sub networks
2. The accuracy of the integrator
3. The number of subnetworks
The choice of subsets for training each of the subnetworks plays an impor-
tant part in determining the effect of all these factors. Various methods have
been used for subset selection in literature. Several works, including topology
based selection [17], and recursive pattern based training [6], use evolutionary
algorithms to choose subsets of data.
Algorithms such as active data selection [23], boosting [21] and multi-
sieving [19] implement this subset choice using neural networks trained with
various training methods. Multiple weak learners have also been used and the
best weak learner given the responsibility of training and solving a subset [9].
Other algorithms make use of more brute force methods such as the use of
the Mahalanobhis distance [3]. Even other algorithms such as the output
parallelism algorithm manually decompose their tasks [5], [7], [10]. Clustering
has also been used to decompose datasets in some cases [2].
The common method used in these algorithms (except the manual de-
composition), is to allow a network to learn some patterns and declare these
patterns as learnt. The system then creates other networks to deal with the
unlearnt patterns. The process is done recursively and with successively de-
creasing subset size, allowing the system to concentrate more and more on
the difficult patterns.
While all these methods work relatively well, the hitch lies in the fact that,
with the exception of manual partitioning, most of the techniques above use
some kind of intelligent learner to split the data. While intelligent learners
and algorithms such as neural networks [12], genetic algorithms [13] and such
are effective algorithms, they are usually considered as black boxes [1], with
little known about the structure of the underlying solution.
In this chapter, our aim is to reduce, by a certain degree, this black box
nature of recursive data decomposition and training. As in previous works,
genetic algorithms are used to select subsets; unlike previous works, however,
they are used not to select individual patterns for a subset but to select
clusters of patterns. By using this approach, we hope to group patterns into
subsets and derive a more optimal partitioning of data.
We also aim to gain a better understanding of optimal data partitioning and
the features of well partitioned data.
The system proposed consists of a pre-trainer and a trainer. The pre-
trainer is made up of a clusterer and a pattern distributor. The clusterer
splits the dataset into clusters of patterns using agglomerative hierarchical
clustering. The pattern distributor assigns validation patterns to each of
these clusters. The trainer then solves a combinatorial optimization problem,
choosing the clusters that can be learnt with best training and validation accu-
racy. These clusters now form the easy patterns which are then learnt using a
gradient descent with the constructive backpropagation algorithm [18] to create
the first subnetwork (a three-layered perceptron). The remaining clusters
form the difficult patterns. The trainer now focuses attention on the diffi-
cult patterns, thereby recursively isolating and learning increasingly difficult
patterns and creating several corresponding subnetworks.
The use of genetic algorithms in selecting clusters is expected to be more
efficient than their use in the selection of patterns for two reasons
1. The number of combinations is now C(n, k) as opposed to C(T, L), where
the number of available clusters n is less than the number of training
patterns T. Similarly, the number of clusters chosen k is smaller than the
number of training patterns chosen L. The solution space is now smaller,
therefore increasing the probability of finding a better solution.
2. The distribution of validation information is performed during pre-
training, as opposed to during the training time. Validation pattern
distribution is therefore a one-off process, thereby saving training time.
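The size reduction in point 1 is easy to check numerically; the figures below are illustrative, not taken from the experiments:

```python
from math import comb

# Illustrative sizes only: T patterns versus n clusters.
T, L = 1000, 500   # pattern-level selection: C(T, L) candidate subsets
n, k = 20, 10      # cluster-level selection: C(n, k) candidate subsets

pattern_space = comb(T, L)   # astronomically large
cluster_space = comb(n, k)   # 184756 -- many orders of magnitude smaller
```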
The rest of the paper is organized as follows. We begin with some pre-
liminaries and related work in section 2. In section 3, we present a detailed
description of the proposed Recursive supervised learning with clustering and
combinatorial optimization (RSL-CC) algorithm. To increase the practical
significance of the paper, we include pseudo code, remarks and parame-
ter considerations where appropriate. More details and specifications of the
algorithm are then presented in section 4. In section 5, we present some
heuristics and practical guidelines for making the algorithm perform better.
Section 5.4 presents the results of the RSL-CC algorithm on some benchmark
pattern recognition problems, comparing them with other recursive hybrid
learning algorithms.
2 Some preliminaries
2.1 Notation
m : Input dimension
n : Output dimension
K : Number of recursions
I : Input
O : Output
Tr : Training
V al : Validation
P : Ensemble of subsets
S : Neural network solution
E : Error
T : Number of training patterns
r : Recursion index
Nchrom : Number of chromosomes
Nc : Number of clusters
Here, I_i^tr and O_i^tr are m × T_i and n × T_i matrices respectively, and
I_i^v and O_i^v are m × T_v^i and n × T_v^i matrices respectively, such that
Σ_{i=1}^{K} T_i = T and Σ_{i=1}^{K} T_v^i = T_v. We need to find a set of
neural networks S = {S^1, S^2, ..., S^K}, where S^1 solves P_1, S^2 solves P_2,
and so on.
Manual decomposition
Output parallelism was developed by Guan and Li [5]. The idea involves
splitting an n-class problem into n two-class problems, the rationale being
that a two-class problem is easier to solve than an n-class problem and hence
more efficient. Each of the n sub-problems consists of two outputs, class i
and its complement (not class i). Guan et al. [7] later added an integrator in
the form of a pattern distributor to the system. The output parallelism
algorithm essentially develops a set of n sub-networks, each catering to a
two-class problem, and integrates them using a pattern distributor.
While the algorithm is shown to be effective in terms of both training time
and generalization accuracy, a major drawback of the algorithm is its class
based manual decomposition. In fact, research carried out in [7] shows
empirically that the 2-class decomposition is not necessarily the optimum
decomposition for a problem. This optimum decomposition is a problem de-
pendent value. Some problems are better solved when decomposed into three
class sub problems, others when decomposed into sub-problems with a vari-
able number of classes. While automatic algorithms have been developed to
overcome this problem of manual decomposition [8], the net result is an algo-
rithm which is computationally expensive. The other concern associated with
the output parallelism is that it can only be applied to classification problems.
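The class-based decomposition of output parallelism amounts to relabelling the targets one class versus the rest; a minimal sketch (the function name is illustrative):

```python
def output_parallelism_split(labels):
    """Split an n-class label sequence into n one-vs-rest two-class
    problems, as in output parallelism [5]: sub-problem i distinguishes
    class i (target 1) from its complement (target 0)."""
    classes = sorted(set(labels))
    return {c: [1 if y == c else 0 for y in labels] for c in classes}
```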
beginning, to separate the difficult patterns from the easy ones. In the case of
various algorithms, the criteria used include the history of difficulty, the
degree of learning [17], and so on.
In the Recursive Pattern based hybrid supervised learning (RPHS) algo-
rithm [6], the problem of the degree of learning is overcome by hybridizing the
algorithm and adapting the solution to the problem topology by using a neu-
ral network. In this algorithm, GA is only a pattern selection tool. However,
the problem of computational cost still remains.
Separability
For the purpose of this chapter, we are interested in subsets which fulfill the
following conditions:
Condition set 2 Separability conditions for good subset partitioning
1. Each class in a subset must be well separated from other classes in the
same subset
2. Each subset must be well separated from another subset
Intuitively, we can observe that the fulfilling of these two conditions is
equivalent to fulfilling condition set 1
3.1 Pre-training
1. We express I_tr, O_tr, I_v and O_v as a combination of n classes of patterns,
i.e.,

O_v = {O_v^C1, O_v^C2, ..., O_v^Cn}
3.2 Training
1. Number of recursions r = 1.
2. A set of binary chromosomes is created, each chromosome having N_c
elements, where N_c is defined in equation 2. An element in a chromosome
is set to 0 or 1, with 1 indicating that the corresponding cluster will be
selected for solving in recursion r.
3. A genetic algorithm is executed to minimize E_tot, the average of the
training and validation errors E_tr and E_val:

E_tot = (E_tr + E_val) / 2    (3)
4. The best chromosome Chrom_best is a binary string with a combination of
0s and 1s, of size N_c. The following steps are executed:
   a) N_c^r = 0, Tr^r = [], Val^r = []
   b) For e = 1 to N_c
   c)    if Chrom_best(e) == 1
            N_c^r ++
            Tr^r = Tr^r + Tr_e
            Val^r = Val^r + Val_e
   d) The data is updated as follows:
      Tr = Tr − Tr^r
      Val = Val − Val^r
      N_c = N_c − N_c^r
      r ++
   e) Tr^r and Val^r are used to find S^r, the solution network corresponding
      to the subset of data in recursion r.
5. Steps 2 to 4 are repeated with the new values of Tr, Val, N_c and r.
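Step 4 above, decoding the best chromosome and updating the remaining data, can be sketched as follows (clusters are represented simply as lists of patterns; the names are illustrative):

```python
def apply_best_chromosome(chrom_best, tr_clusters, val_clusters):
    """Move the clusters flagged with 1 in the best chromosome into the
    current recursion's training/validation subset, and keep the rest
    as the pool for later recursions."""
    tr_r, val_r = [], []            # data for this recursion
    tr_rest, val_rest = [], []      # remaining cluster pool
    for bit, tc, vc in zip(chrom_best, tr_clusters, val_clusters):
        if bit == 1:
            tr_r.extend(tc)
            val_r.extend(vc)
        else:
            tr_rest.append(tc)
            val_rest.append(vc)
    return tr_r, val_r, tr_rest, val_rest
```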
3.3 Simulation
way. Later in this paper, we observe how the use of GA's combinatorial
optimization automatically takes care of when to stop recursions. The use of
GAs to select patterns [17], [6] requires extensive tests for detrimental
decomposition and overtraining; the proposed RSL-CC algorithm eliminates the
need for such tests. As a result, the algorithm is self-sufficient and very
simple, with minimal adaptations.
4.1 Illustration
The grouping of patterns means that clusters of patterns are selected for each
subset. Further, in contrast with other methods, the proposed GA based
recursive subset selection selects the optimal subset combination. Therefore,
we can assume that the decomposition performed is optimal; an illustrative
example is shown in the figure (panels a–e).
The design of a neural network system “is more an art than a science in the
sense that many of the numerous factors involved in the design are as a result
of one’s personal experience.” [12]. While the statement is true, we wish to
make the RSL-CC system less of an art and more of a science. Therefore, we
propose several methods which improve the algorithm and make it “to the
point” as far as implementation is concerned.
This means that the population size is either POPSIZE, a constant for
the maximal population size, or, if N_c is small, 2^{N_c}.
The argument behind the use of a smaller population size is that when there
are, for example, 4 clusters, it is not efficient to evaluate a large number of
chromosomes; only 2^4 = 16 chromosomes are created and evaluated.
In the case where the number of chromosomes is 2^{N_c}, only one generation is
executed. This step ensures the efficiency of the algorithm. Again with
efficiency in mind, we ensure that in this case all the chromosomes are unique.
Therefore, when the number of clusters is small, the algorithm becomes a brute
force technique.
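This population-size rule can be sketched directly; `POP_SIZE = 20` follows the evolutionary-search population size used in the experiments, and the exhaustive branch makes the small-cluster case a one-generation brute-force search (the function name is illustrative):

```python
import random

POP_SIZE = 20  # maximal population size

def population(n_clusters):
    """Create the initial population: all 2**n_clusters unique binary
    chromosomes when the cluster count is small (a brute-force
    enumeration run for one generation), otherwise POP_SIZE random
    binary chromosomes."""
    if 2 ** n_clusters <= POP_SIZE:
        # exhaustive and unique: every possible cluster selection
        return [[(i >> b) & 1 for b in range(n_clusters)]
                for i in range(2 ** n_clusters)]
    return [[random.randint(0, 1) for _ in range(n_clusters)]
            for _ in range(POP_SIZE)]
```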
Experimental results
The two spiral dataset consists of 194 patterns. To ensure a fair comparison
to the Dynamic subset selection algorithm [17], test and validation datasets of
192 patterns were constructed by choosing points next to the original points
in the dataset as mentioned in the paper.
Table 2. Parameters used

Parameter                                                Value
Evolutionary search parameters
  Population size                                        20
  Crossover probability                                  0.9
  Small change mutation probability                      0.1
  MGG parameters                                         µ = 4, θ = 1
Neural network parameters
  Generalization loss tolerance for validation           1.5
  Backpropagation learning rate                          10^-2
  Number of neighbors in the KNN pattern distributor     1
The following control experiments were carried out based on the multi-
sieving algorithm and the dynamic topology based subset finding algorithm.
Both the versions of output parallelism implemented also show the effect of
hybrid selection.
In order to illustrate the effect of the GA based combinatorial optimization,
we also implement the single cluster algorithm explained in section 2 [2]. The
algorithm, in contrast to the RSL-CC algorithm, simply divides the data into
clusters and develops a network to solve each cluster separately.
The RSL-CC algorithm was also compared to our earlier work on RPHS [6],
which uses a hybrid algorithm to recursively select patterns, as opposed to
clusters.
In a nutshell, the following algorithms were implemented to compare with
the various properties of the RSL-CC algorithm
1. Constructive Backpropagation
2. Multisieving¹
3. Dynamic topology based subset finding
4. Output parallelism without pattern distributor [5]
5. Output parallelism with pattern distributor [7]
6. Single clustering for supervised learning
7. Recursive pattern based hybrid supervised learning
Table 2 summarizes the parameters used. For clustering, agglomerative
hierarchical clustering is employed with complete linkage and the cityblock
distance. Using thresholding, the natural clusters of patterns in each class
were obtained. AHC was preferred to other clustering methods such as K-means
or SOMs due to its non-parametric nature, since the number of target clusters
is not required beforehand.
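As an illustration of this clustering step (not the authors' implementation; a library routine would normally be used), complete-linkage agglomerative clustering with cityblock distance and a distance threshold can be written in a few lines:

```python
def cityblock(a, b):
    """Cityblock (L1) distance between two feature vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))

def complete_linkage_clusters(points, threshold):
    """Agglomerative hierarchical clustering sketch: repeatedly merge the
    two closest clusters under complete linkage (distance between the
    farthest members) until the smallest merge distance exceeds the
    threshold, so the number of clusters emerges naturally."""
    clusters = [[p] for p in points]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = max(cityblock(a, b)
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        if d > threshold:  # no pair is close enough: stop merging
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters
```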
5.7 Results
We divide this section into two parts. In the first part, we compare the mean
generalization accuracies of the various recent algorithms described in this
paper with the generalization accuracy of the RSL-CC algorithm. In the second
part, we present the clusters and data decomposition employed by RSL-CC,
illustrating the finding of simple subsets.

Comparison of generalization accuracies
From the tables above, we can observe that the generalization error of
the RSL-CC is comparable to the generalization error of the RPHS algorithm
and is a general improvement over other recent algorithms. A particularly
significant improvement can be observed in the vowel dataset.
There is some tradeoff observed in terms of training time. The training
times for the RSL-CC algorithm for the vowel and two-spiral problems are
1 The multisieving algorithm [19] did not propose a testing system. We test
the generalization accuracy of the system using the KNN pattern distributor,
similar to the RSL-CC pattern distributor.
RSL Using Clustering and Combinatorial Optimization 173
Table 3. Summary of the results obtained from the VOWEL problem (38 clusters,
12 recursions)
higher than those of the other methods. However, it is interesting to note
that the training time for the Letter Recognition problem is 50% less than
that of any of the recent algorithms. It is felt that this reduction in
training time comes from the reduction of the problem space from the
selection of patterns to the selection of clusters: clusters are selected
from 100 possible clusters, while RPHS has to select patterns out of 10,000,
thereby reducing the solution space 100-fold.
On the other hand, for the vowel problem, the problem space is reduced by
only about 13-fold. RSL-CC performs more efficiently when the reduction of
the problem space outweighs the cost of the GA-based combinatorial
optimization.
The RSL-CC decomposition figures
Figures 5 and 6 illustrate the data decomposition for the letter recognition
and the vowel problems. Only one instance of decomposition is presented
in the figures. From the figures, we can observe the data being split into
increasingly smaller subsets, thereby increasing focus on the difficult patterns.
The decomposition presented is the two-dimensional projection onto the
principal component axes (PCA) [3] of the input space.
174 K. Ramanathan and S.U. Guan
Table 5. Summary of the results obtained from the TWO-spiral problem (4 clusters,
2 recursions)
References
1. Dayhoff JE, DeLeo JM (2001) Artificial neural networks: Opening the black
box. Cancer 91(8):1615–1635
2. Engelbrecht AP, Brits R (2002) Supervised training using an unsupervised
approach to active learning. Neural Process Lett 15(3):247–260
3. Fukunaga K (1990) Introduction to statistical pattern recognition. Academic,
Boston
4. Gong DX, Ruan XG, Qiao JF (2004) A neuro computing model for real
coded genetic algorithm with the minimal generation gap. Neural Comput Appl
13:221–228
5. Guan SU, Li S (2002) Parallel growing and training of neural networks using
output parallelism. IEEE Trans Neural Netw 13(3):542–550
6. Guan SU, Ramanathan K (2004) Recursive percentage based hybrid pattern
training. In: Proceedings of the IEEE conference on cybernetics and intelligent
systems, pp 455–460
7. Guan SU, Neo TN, Bao C (2004) Task decomposition using pattern distributor.
J Intell Syst 13(2):123–150
8. Guan SU, Qi Y, Tan SK, Li S (2005) Output partitioning of neural networks.
Neurocomputing 68:38–53
9. Guan SU, Ramanathan K, Iyer LR (2006) Multi learner based recursive training.
In: Proceedings of the IEEE conference on cybernetics and intelligent systems
(Accepted)
10. Guan SU, Zhu F (2004) Class decomposition for GA-based classifier agents – a
pitt approach. IEEE Trans Syst Man Cybern B Cybern 34(1):381–392
11. Hastie T, Tibshirani R, Friedman J (2001) The elements of statistical learning:
Data mining, inference, and prediction. Springer, New York
12. Haykin S (1999) Neural networks: A comprehensive foundation. Prentice Hall,
Upper Saddle River, NJ
13. Holland JH (1973) Genetic algorithms and the optimal allocation of trials. SIAM
J Comput 2(2):88–105
14. MacQueen JB (1967) Some methods for classification and analysis of multi-
variate observations. In: Proceedings of 5th Berkeley symposium on mathemat-
ical statistics and probability, vol 1. University of California Press, Berkeley,
pp 281–297
15. Jain AK, Murty MN, Flynn PJ (1999) Data clustering: A review. ACM Comput
Surv 31(3):264–323
16. Kohonen T (1997) Self organizing maps. Springer, Berlin
17. Lasarczyk CWG, Dittrich P, Banzhaf W (2004) Dynamic subset selection based
on a fitness case topology. Evol Comput 12(4):223–242
18. Lehtokangas M (1999) Modeling with constructive backpropagation. Neural
Netw 12:707–714
19. Lu BL, Ito K, Kita H, Nishikawa Y (1995) Parallel and modular multi-sieving
neural network architecture for constructive learning. In: Proceedings of the 4th
international conference on artificial neural networks, 409, pp 92–97
20. Satoh H, Yamamura M, Kobayashi S (1996) Minimal generation gap model
for GAs considering both exploration and exploitation. In: Proceedings of 4th
international conference on soft computing, Iizuka, pp 494–497
21. Schapire RE (1999) A brief introduction to boosting. In: Proceedings of the 16th
international joint conference on artificial intelligence, pp 1–5
22. Wong MA, Lane T (1983) A kth nearest neighbor clustering procedure. J Roy
Stat Soc B 45(3):362–368
23. Zhang BT, Cho DY (1998) Genetic programming with active data selection.
Lect Notes Comput Sci 1585:146–153
Evolutionary Approaches to Rule Extraction
from Neural Networks
Urszula Markowska-Kaczmar
Summary. This chapter begins with a short survey of existing methods of rule
extraction from neural networks. Because searching for rules is close to an
NP-hard problem, the application of evolutionary algorithms to rule
extraction is justified; the survey therefore also contains a short
description of evolutionary methods. This creates the background for
presenting our own experience with successful applications of evolutionary
algorithms to this process. Two methods of rule extraction, REX and GEX, are
presented in detail. They represent a global approach to rule extraction,
perceiving a neural network as a set of pairs: an input pattern and the
response produced by the neural network. REX uses propositional fuzzy rules
and is composed of two methods, REX Michigan and REX Pitt. GEX takes
advantage of classical crisp rules. All details of these methods are
described in the chapter. Their efficiency was tested in experimental
studies using different benchmark data sets from the UCI repository. A
comparison to other existing methods was made and is presented in the
chapter.
1 Introduction
Neural networks are widely used in real life. Many successful applications
in various areas may be listed here, for example in show business [29], in
pattern recognition [36] and [11], in medicine, e.g. in drug development,
image analysis and patient diagnosis (two major cornerstones are the
detection of coronary artery disease and the processing of EEG signals
[30]), in robotics [27], in industry [37] and in optimization [7].
Their popularity is a consequence of their ability to learn. Instead of an
algorithm that describes how to solve the problem, neural networks need a
training set with patterns representing the proper transformation of input
patterns onto output patterns. After training, they can generalize the
knowledge they acquired during training, performing the trained task on
previously unseen data. They are also well known for their ability to remove
noise from data. A further advantage is their resistance to damage.
Once the net input is calculated, it is converted to the output value by
applying an activation function f:
yi = fi (neti ) (2)
Various types of functions can be applied as activation functions. Fig. 2
shows a typical one, the sigmoidal function. Another popular one is the
hyperbolic tangent. Its shape is similar to the sigmoidal function, but its
values belong to the range (−1, 1) instead of (0, 1). These simple elements
connected together create a neural network. Depending on the way they are
joined, one can distinguish feedforward and recurrent neural networks.
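A minimal sketch of the two activation functions mentioned above; the slope parameter beta is an assumed parametrization, not taken from the chapter's Fig. 2:

```python
# The sigmoidal function maps the net input into (0, 1); the hyperbolic
# tangent has the same S-shape but maps into (-1, 1).
import math

def sigmoid(net, beta=1.0):
    # logistic function: output in (0, 1)
    return 1.0 / (1.0 + math.exp(-beta * net))

def tanh_act(net, beta=1.0):
    # hyperbolic tangent: same shape, output in (-1, 1)
    return math.tanh(beta * net)

print(sigmoid(0.0))   # 0.5, the midpoint of (0, 1)
print(tanh_act(0.0))  # 0.0, the midpoint of (-1, 1)
```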
Here we will focus on the first type of neural network architecture.
Frequently, in a feedforward network, neurons are organized in layers. The
neurons in a given layer are not connected to each other; connections exist
only between neurons of neighbouring layers. Such a network is shown in
Fig. 3. One can distinguish the input layer, which distributes the input
information to the neurons in the next layer. Then the information is
processed by the hidden layer. Each neuron in this layer adds the weighted
signals and processes the total net input with an activation function. Once
all neurons in this layer have calculated their output values, the neurons
in the output layer become active. They calculate the total net input and
process it with the activation function, as described for the previous
layer. This is the way information is processed by the network. In general,
the network can contain more than one hidden layer.
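The layered forward pass just described can be sketched as follows; the weights and layer sizes are illustrative, not from the chapter:

```python
# Each layer computes weighted sums of its inputs and applies the
# activation function; the hidden layer fires before the output layer.
import math

def sigmoid(net):
    return 1.0 / (1.0 + math.exp(-net))

def layer_forward(inputs, weights):
    # weights[i][j] connects input j to neuron i of this layer
    return [sigmoid(sum(w * x for w, x in zip(row, inputs)))
            for row in weights]

# 2 inputs -> 3 hidden neurons -> 1 output neuron
w_hidden = [[0.5, -0.2], [0.1, 0.8], [-0.4, 0.3]]
w_output = [[0.7, -0.6, 0.2]]

x = [1.0, 0.5]
hidden = layer_forward(x, w_hidden)       # hidden layer fires first
output = layer_forward(hidden, w_output)  # then the output layer
print(len(hidden), len(output))  # 3 1
```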
As we mentioned above, the network acquires its ability to solve the problem
through the training process, which consists in searching for the weights in
an iterative way. After training, the network can produce a response on the
basis of the input value. To train the network, it is necessary to collect
data in the training set T:
T = {(x1 , y1 ), (x2 , y2 ), . . . , (xp , yp )} (3)
Each element is a pair that is an example of the proper transformation of an
input value represented by the vector x = [x1 , x2 , . . . , xN ] onto the
desired output value represented by the vector y = [y1 , y2 , . . . , yM ].
The most popular training rule is backpropagation. It minimizes the squared
error Ep between the desired output y for the given pattern x and the
answer of the neural network, represented by the vector
o = [o1 , o2 , . . . , oM ], for this input:
Ep = 1/2 Σk (ypk − opk )2 (4)
The weights are updated iteratively until the error reaches the assumed
value. In each time step the weight is changed according to Eq. (5):
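A minimal numerical sketch of this iterative scheme, assuming the standard gradient-descent update Δw = η(y − o)x for a single linear output; the learning rate, data and single-neuron model are illustrative:

```python
# Gradient descent on the squared error Ep = 1/2 * sum_k (y_k - o_k)^2
# for a single linear neuron o = w * x (an assumed illustration).
def train(pairs, w=0.0, eta=0.1, steps=100):
    for _ in range(steps):
        for x, y in pairs:
            o = w * x
            # dEp/dw = -(y - o) * x, so this update moves w downhill
            w += eta * (y - o) * x
    return w

pairs = [(1.0, 2.0), (2.0, 4.0)]   # target mapping: y = 2x
w = train(pairs)
print(round(w, 3))  # 2.0
```

Repeating the small per-pattern corrections drives the weight toward the value that minimizes the squared error over the training set.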
of this pattern. We assume that during rule extraction we know the patterns
from the training set.
The aim of rule extraction from a network is to find a set of rules that
describe the performance of the neural network. The most popular form is
propositional rules; however, predicates are applied as well. Because of
their comprehensibility, we concentrate on propositional rules, which have
the following form:
IF prem1 AND prem2 AND . . . AND premN THEN classj (6)
The premise premi refers to the i-th network input and formulates a
condition that has to be satisfied by the value of the i-th attribute in
order to classify the pattern to the class indicated by the conclusion of
the rule.
Another popular form of rule is the propositional fuzzy rule, shown below:
IF x1 is Z1r AND . . . AND xN is ZNb THEN y1 y2 . . . yk (7)
where xi corresponds to the i-th input of the NN, the premise xi is Zib
states that the attribute (input variable) xi belongs to the fuzzy set Zib
(i ∈ [1, N ]), and y1 y2 . . . yk is a code of a class, where k is the
number of classes and only one yc = 1 (c ∈ [1, k]; for i ≠ c, yi = 0). In
general, the number of premises is less than or equal to N , which stands
for the number of inputs of the neural network. Fuzzy sets can take
different shapes; the most popular are the triangular, trapezoidal and
Gaussian functions. Examples of triangular fuzzy sets are shown in Fig. 5.
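A triangular membership function, the most popular of the shapes listed above, can be sketched as follows (the apex positions are illustrative):

```python
# Membership rises linearly from the left apex to 1 at the center apex,
# then falls linearly to 0 at the right apex.
def triangular(x, left, center, right):
    if x <= left or x >= right:
        return 0.0
    if x <= center:
        return (x - left) / (center - left)
    return (right - x) / (right - center)

print(triangular(5.0, 0.0, 5.0, 10.0))  # 1.0 at the apex
print(triangular(2.5, 0.0, 5.0, 10.0))  # 0.5 halfway up the left slope
```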
There are many criteria of the evaluation of the extracted rules’ quality.
The most frequently used include:
• fidelity – stands for the degree to which the rules reflect the behaviour of
the network they have been extracted from,
• accuracy – is determined on the basis of the number of previously unseen
patterns that have been correctly classified,
• consistency – occurs if, during different training sessions, the network pro-
duces sets of rules that classify unseen patterns in the same way,
with threshold θ). The neuron will fire if the following condition is
satisfied:
Σi=1..5 xi wi > θ (8)
0 ≤ x3 w3 + x4 w4 ≤ 4 (9)
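Condition (8) can be checked directly; the weights and threshold below are illustrative:

```python
# The neuron fires when the weighted sum of its five inputs exceeds the
# threshold theta.
def fires(inputs, weights, theta):
    net = sum(x * w for x, w in zip(inputs, weights))
    return net > theta

weights = [1, 1, 2, 2, 1]
theta = 3
print(fires([1, 1, 1, 0, 0], weights, theta))  # True: 4 > 3
print(fires([1, 0, 0, 0, 1], weights, theta))  # False: 2 > 3 fails
```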
What does rule extraction mean in the context of a neural network with many
hidden and output neurons and a continuous activation function? When the
sigmoidal function or the hyperbolic tangent is applied, then by setting an
appropriate value of β we can model the step function (Fig. 2). In this
case, to obtain rules for a single neuron we can follow the procedure
presented above. Such rules are created for all neurons and concatenated on
the basis of mutual dependencies. Thus we obtain rules that describe the
relations between the inputs and outputs of the entire network, as
illustrated in Fig. 7. As examples one may mention the following methods:
Partial RE [34], M of N [12], Full RE [34], RULEX [3]. In order to create a
rule describing the activation of a neuron, the majority of algorithms
belonging to this group consider the influence of the sign of each weight:
when the weight is positive, the input helps to drive the neuron towards
firing (output = 1); when the weight is negative, the input hinders the
firing of the neuron. The problem of rule extraction by means of the local
approach is simple only if the network is relatively small.
Global methods treat a network as a black box, observing its inputs and the
responses produced at the outputs. In other words, the network provides the
method with training patterns. Examples include VIA [35], which uses a
procedure similar to classical sensitivity analysis, and BIO-RE [34], which
applies
truth tables to extract rules, Ruleneg [13], where an adaptation of the PAC
algorithm is applied, and [28], based on inversion. It is worth mentioning
that with such an approach the architecture of the neural network is
insignificant. Most methods of rule extraction concern multilayer
perceptrons (MLP networks, Fig. 3); however, methods dedicated to other
networks are developed as well, e.g. [33], [10]. A method that would meet
all these requirements (or at least the vast majority) has not been
developed yet. Some of the methods are applicable to enumerable or real
data only, some require repeating the process of training or changing the
network's architecture, and some require providing a default rule that is
used if no other rule can be applied in a given case. The methods also
differ in computational complexity (which is not specified in most cases).
Hence the necessity of developing new methods. Some of them use
evolutionary algorithms to solve this problem, as will be presented in
Section 5.
1 Although evolutionary algorithms in the literature [9], [25] are used as a
general term for different concepts such as genetic algorithms, evolution
strategies and genetic programming, here we will use the term in a narrowed
sense in order to underline the difference with reference to the genetic
algorithm. In the classical genetic algorithm a solution is encoded in a
binary way; here, real numbers are used to encode information. This requires
specifying special genetic operators, but the general outline of the
algorithm is the same.
2 There are evolutionary algorithms with proven convergence to a global
optimum [8].
Let us recall that the local approach to rule extraction (Section 3.2),
which starts by searching for rules describing the activation of each neuron
in the network, is simple only when the network is relatively small.
Otherwise, methods that decrease the number of connections and neurons are
used. Examples of solutions applied in this case include the clusterisation
of hidden neuron activations (a cluster of neurons is substituted by one
neuron) and the optimization of the structure of a neural network by
introducing a special training procedure that causes pruning of the network.
Using evolutionary algorithms, as shown in [31], [15], may also be very
helpful.
Let us start our survey of EA applications in rule extraction from neural
networks by presenting the approach from [31]. The authors evolve the
topology of feedforward neural networks trained by the RPROP algorithm, a
gradient method similar to backpropagation. They use the ENZO tool, which
evolves fully connected feedforward neural networks with a single hidden
layer. The tool is extended by an implementation of the RX algorithm of rule
extraction from neural networks (proposed by [19]), which is a local method.
In ENZO, each gene in the genotype representing an individual is interpreted
as a connection between two neurons. The selection is based on ranking: the
higher the ranking of an individual, the higher the probability that it will
be selected.
Two crossover operators are designed. The first one inserts all connections
from the parents into the child, and the weights are the average of the
weights inherited from the parents. The second one inserts a randomly chosen
hidden neuron and its connections from the parent with the higher fitness
value into the current topology of the child. The mutation operator inserts
a neuron into, or removes a neuron from, the current topology.
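The first crossover operator can be sketched as follows, under the assumption that a network is represented as a mapping from connections (neuron pairs) to weights; this representation is hypothetical, not ENZO's actual data structure:

```python
# The child receives the union of the parents' connections; where both
# parents share a connection, the child's weight is the average of the
# weights inherited from the two parents.
def crossover_average(parent_a, parent_b):
    child = {}
    for conn in set(parent_a) | set(parent_b):
        ws = [p[conn] for p in (parent_a, parent_b) if conn in p]
        child[conn] = sum(ws) / len(ws)
    return child

a = {("in0", "h0"): 0.4, ("h0", "out"): 1.0}
b = {("in0", "h0"): 0.8, ("in1", "h0"): -0.5}
child = crossover_average(a, b)
print(round(child[("in0", "h0")], 3))  # 0.6, the average of 0.4 and 0.8
```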
The next crucial element of a method based on an evolutionary algorithm is
the fitness function. Here the fitness function is expressed as follows:
fitness = Wacc · Acc + Wcom · comprehensibility (10)
where Wacc and Wcom are the importance weights for the accuracy and
comprehensibility terms, Acc is the accuracy, expressed as the quotient of
true-positive covered patterns to the number of all patterns, and
comprehensibility in turn is defined as follows:
comprehensibility = 1 − (2 · R/MaxR + C/MaxC)/3 (11)
In this equation R is the number of rules, C is the total number of rule
conditions, MaxR is the maximum number of rules extracted from an
individual, and MaxC is the maximum number of rule conditions among all
individuals evolved so far.
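The comprehensibility term (11) can be computed directly; the counts below are illustrative:

```python
# Fewer rules (r) and fewer rule conditions (c) yield a value closer to 1;
# max_r and max_c are the running maxima described above.
def comprehensibility(r, c, max_r, max_c):
    return 1 - (2 * r / max_r + c / max_c) / 3

# 2 rules out of at most 10, 4 conditions out of at most 40
print(round(comprehensibility(2, 4, 10, 40), 4))  # 0.8333
```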
Fig. 8. One genetic cycle in extended ENZO includes rule extraction and evaluation
min(f1 , f2 ) (12)
where f1 = E (13)
f2 = Ω (14)
where E is the error of the neural network and Ω is the regularization term,
expressed by the number of connections in the network. The paper contains an
illustrative example, a feasibility study of the idea of searching for the
optimal architecture of a neural network and acquiring rules for that
network. It uses the breast cancer benchmark problem from the UCI repository
[4], which contains 699 examples with 9 input features belonging to two
classes: benign and malignant. Only the first output of the neural network
is considered in the
Based on these two rules, only 2 out of 100 test samples were misclassified,
and 4 of them could not be decided, with a predicted value of 0.49.
The paper [14] is an example of the application of an evolutionary algorithm
to clustering the activations of hidden neurons. It uses a simple encoding,
and the chromosomes have a fixed length, which allows classical genetic
operators to be used. The fitness function is based on the Euclidean
distance between the activation values of hidden neurons and the number of
objects belonging to the given cluster. The rules in this case have the
following form:
where ai is the activation of hidden neuron i, vimin and vimax are
respectively its minimal and maximal values, and n is the number of hidden
neurons. In order to acquire rules, in the first step the equations
describing the activation of each hidden neuron are created. Then, while
processing the training patterns, each neuron is assigned its activation
value and the label of the class. Next, the activation values obtained in
the previous step are separated according to class, and the hidden neuron
activations are clustered with the application of an EA. This means that the
EA searches for groups of activation values of hidden neurons. Finally, in
order to acquire the rules expressed in (Eq. 17), the minimal and maximal
values for each cluster are searched for.
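The final step, deriving rule bounds from clusters of activation values, can be sketched as follows; the clusters themselves are assumed to be given (found by the EA), and the values are illustrative:

```python
# Each cluster's minimal and maximal values give the bounds
# v_min <= a_i <= v_max of one rule premise for that hidden neuron.
def cluster_bounds(clusters):
    # clusters: list of lists of activation values for one hidden neuron
    return [(min(c), max(c)) for c in clusters]

clusters = [[0.05, 0.12, 0.09], [0.81, 0.95, 0.88]]
print(cluster_bounds(clusters))  # [(0.05, 0.12), (0.81, 0.95)]
```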
The global methods of rule extraction from neural networks, by offering
independence from the neural network architecture, are very attractive
proposals. Below, a few examples are described in more detail.
We will start with the presentation of a relatively simple method, drawn up
by [17], where an evolutionary algorithm searches for a path composed of
connections from an input neuron to the output neuron. Knowing that the
neural network solves a classification task, the method searches for an
essential input feature causing the classification of patterns to the given
class.
A single gene encodes one connection; this means that in order to find a
path from the input layer to the output layer of a two-layered network, the
chromosome consists of two genes. The fitness function is defined as the
product of the connection weights belonging to the path. It reflects the
strength of the connections belonging to the path in producing the output.
Classical genetic operators are used.
Another idea is shown in [18]. In this paper, the range of values of each
attribute is divided into 10 subranges encoded by natural numbers.
Nominative attributes are substituted by a number of binary attributes equal
to the number of values of the attribute. The chromosome has as many genes
as the neural network has inputs. An example is shown in Fig. 10. A gene
value of 0 has a special meaning: the condition referring to this attribute
is omitted from the final form of the rule. The chromosome encodes the
premise part of the rule; the conclusion is established on the basis of the
neural network response. Fig. 10 explains the way the chromosome is decoded.
Knowing that 1 corresponds to the first subrange of the attribute, in the
rule-creating phase the first subrange is taken for the premise referring to
the first attribute. The values of the second and third genes are
transformed in the same way. The pattern is processed by the neural network,
and in response the code of the class is delivered on the output layer. The
output of the neural network is coded in a local way. According to this
principle, the biggest value in the output layer is found; if this value is
greater than an assumed threshold, the chromosome creates a rule.
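Decoding such a chromosome can be sketched as follows, assuming each attribute's range is split into 10 equal subranges (the attribute ranges are illustrative):

```python
# Each gene holds a subrange index (1..10) of its attribute; 0 means
# "no condition on this attribute" and the premise is skipped.
def decode(genes, ranges, subranges=10):
    premises = []
    for i, g in enumerate(genes):
        if g == 0:
            continue  # gene 0: attribute omitted from the rule
        lo, hi = ranges[i]
        step = (hi - lo) / subranges
        premises.append((i, lo + (g - 1) * step, lo + g * step))
    return premises  # list of (attribute index, lower bound, upper bound)

ranges = [(0.0, 10.0), (0.0, 1.0), (-5.0, 5.0)]
print(decode([1, 0, 3], ranges))
# attribute 0 in its 1st subrange, attribute 1 skipped, attribute 2 in its 3rd
```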
Fig. 10. An example of the chromosome in the method described in [18]
Fig. 11. The idea of the evolutionary algorithm application in a global method of
rule extraction
Now we focus on the approaches that treat a neural network as a black box
that delivers patterns to the method extracting rules. Fig. 11 illustrates
the idea of this approach. The evolutionary algorithm works here as the tool
extracting rules. In principle, this application of an evolutionary
algorithm may be seen as similar to rule extraction directly from data. The
main difference lies in the form of the fitness function, which in the case
of rule extraction from neural networks has to consider the criteria
described in Section 3.1.
Generally speaking, there are two possibilities for extracting a set of
rules that describes the classification of patterns. In the so-called
Michigan [25] approach, one individual encodes one rule. In order to obtain
the set of rules, different techniques are used, for example:
• sequential covering of examples: the patterns covered by the rule are
removed from the set of patterns, and a new rule is searched for over the
remaining patterns;
• an application of two hierarchical evolutionary algorithms, where one
searches for single rules, the set of which is then optimized by the
evolutionary algorithm on the higher level;
• special heuristics, applied to decide when a rule can be substituted by a
more general rule in the final set of rules.
In the Pitt [25] approach, the set of rules is encoded in one individual,
which makes all of the solutions mentioned above unnecessary, but the
chromosome is much more complex.
Now the Pitt and Michigan approaches to rule extraction from neural networks
will be presented on the basis of [23]. In this paper two methods of rule
Fig. 12. A part of the chromosome coding fuzzy sets; Gn – gene of the n-th
group of fuzzy sets; FSn,l – code of the l-th fuzzy set in the n-th group;
Fl,k – flag; Dl,k – code of the dk apex of the triangle of the FSl,k fuzzy
set
Fig. 14. The scheme of the chromosome in REX Pitt; F – activation bit, PC –
code of premise, CC – code of conclusion, D – the number corresponding to
the triangle apex of the fuzzy set
independent. The first one decides which premises are present in the rule,
while the second one determines the form of a fuzzy set group (the number of
fuzzy sets and their ranges).
Fig. 14 presents the general scheme of an individual in REX Pitt. F is the
activation bit. It stands before each rule, premise and fuzzy set; if it is
set to 1, the part after this bit is active. PC is the gene of a premise and
is coded as an integer number indicating the index of the fuzzy set in the
group for a given input variable. The description of the fuzzy sets included
in one individual applies to all rules coded in the individual.
To form a new generation, genetic operators are applied. One of them is
mutation, which consists in the random change of chosen elements in the
chromosome. There are several mutation operators in the REX algorithm. The
mutation of the central point of a fuzzy set (gene D in Fig. 14, which
corresponds to d in Fig. 5) is based on adding or subtracting a random
floating-point number (the operation is chosen randomly). It is performed
according to (17):
d ← d ± rand() · range / 10 (17)
where rand() is a function giving a random number uniformly distributed in
the range [0, 1) and range is the range of possible values of d. The
parameter equal to 10 in the denominator is used in order to ensure a small
change of d.
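Mutation (17) can be sketched as follows (the apex value and value range are illustrative):

```python
# The apex d of a triangular fuzzy set is shifted by a random fraction of
# its value range, with the sign of the shift chosen randomly.
import random

def mutate_center(d, value_range):
    delta = random.random() * value_range / 10  # rand() in [0, 1)
    return d + delta if random.random() < 0.5 else d - delta

random.seed(1)
d = mutate_center(5.0, value_range=10.0)
print(abs(d - 5.0) < 1.0)  # True: the change is always less than range/10
```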
The mutation of an integer value (PC) relies on adding a random integer
number modulo z, which is formally expressed by (18):
where z is the maximum number of fuzzy sets in a group and is set by the
user. The mutation of a bit (for example bit F, or the bits in CC) is simply
its negation. There is also a mutation operator which changes the sequence
of the rules; as mentioned, the sequence of rules is essential for the
inference process.
The second operator is crossover, which is a uniform one ([25]). Because of
the floating-point representation and the complex semantics of the
chromosome in the REX method, the uniform crossover is performed at the
level of genes representing the rules and groups of fuzzy sets (see
Fig. 14). This means that the offspring inherits the genes of a whole rule
or a whole group of fuzzy sets from one parent or the other. After these
genetic operations, an individual that is not consistent has to be repaired.
One of the most important elements in the evolutionary approach is the
design of the fitness function. Here, the evaluation of individuals consists
in performing fuzzy reasoning with the decoded set of rules on the training
and/or testing examples and calculating the metrics defined below, which are
used in the fitness function expressed by (19). The fuzzy reasoning result
has to be compared with that of the NN, giving the following metrics:
• corr – the number of correctly classified patterns: those patterns which
fired at least one rule and for which the result of reasoning
(classification) was equal to the output of the neural network; this
parameter determines the fidelity of the rule set;
• incorr – the number of incorrectly classified patterns: those patterns for
which the reasoning gave results different from those of the neural
network;
• unclass – the number of patterns that were not classified: those patterns
for which none of the rules fired, so it was impossible to perform the
reasoning.
There are also metrics describing the complexity of rules and fuzzy sets:
• prem – the total number of active premises in active rules,
• fsets – the total number of active fuzzy sets in an individual.
All the metrics mentioned above were used to create the following evaluation
function (19):
of steps; or when the evaluation function for the best individual reaches a
certain value.
REX Michigan consists of two specialized evolutionary algorithms alternating
with each other (see Fig. 15). The first one, EARules, searches for rules,
while the second, which we called EAFuzzySets, optimises the membership
functions of the fuzzy sets applied in the rules. This approach can be
promising when the initial form of the fuzzy sets is given by experts; the
role of EAFuzzySets is then only the tuning of the fuzzy sets.
One individual in EARules codes one rule. In one cycle, EARules is run
several times to find a set of rules describing the neural network. Each
time, patterns covered by the previously found rule are removed from the
training set. This method is known as sequential covering. The rules are
evaluated using simplified fuzzy reasoning, where one of the fuzzy set
collections found by EAFuzzySets is used. In the first cycle of REX
Michigan, the set of fuzzy sets is chosen at random or can be established by
an expert (in the presented experiments the first option was used). In the
evolutionary algorithm optimising fuzzy sets (EAFuzzySets), one individual
codes a collection of fuzzy set groups. These are evaluated on the basis of
simplified fuzzy reasoning as well. EAFuzzySets processes a given number of
generations (GF); then the best individual is passed to EARules. The cycle
of the alternate work
of these two stages lasts until the final set of rules with appropriate
fuzzy sets represents the knowledge contained in the neural network at the
appropriate level, or until the given number of cycles has elapsed.
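The sequential covering scheme used by EARules can be sketched as follows; find_rule stands in for one EARules run, and the toy threshold rules are purely illustrative:

```python
# After each rule is found, the patterns it covers are removed, and the
# next rule is searched for on the remainder.
def sequential_covering(patterns, find_rule, max_rules=10):
    rules = []
    remaining = list(patterns)
    while remaining and len(rules) < max_rules:
        rule = find_rule(remaining)      # one EA run (stand-in)
        covered = [p for p in remaining if rule(p)]
        if not covered:
            break                        # no progress: stop
        rules.append(rule)
        remaining = [p for p in remaining if not rule(p)]
    return rules

# Toy example: "rules" are threshold predicates on 1-D patterns.
def find_rule(pats):
    t = min(pats)                        # cover the smallest value upward
    return lambda p, t=t: p <= t + 1.0

rules = sequential_covering([0.2, 0.5, 3.1, 3.4], find_rule)
print(len(rules))  # 2: one rule per group of nearby values
```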
The evolutionary algorithm EARules concentrates on the optimization of the
rules. It consists in periodically searching for the best rule describing
the neural network, which is then added to the set of rules. All training
examples for which the result of reasoning with the investigated rule was
the same as the neural network answer are marked as covered by the rule.
This allows the rule for the uncovered examples to be searched for in the
next cycle. This approach is similar to ([6]). The form of the rule is the
same as in REX Pitt. The mutation operator from REX Pitt is also applied in
REX Michigan, and the crossover is a uniform operator. After crossover or
mutation, the consistency of each rule is tested, and the individuals are
repaired if necessary. To evaluate an individual, simplified fuzzy reasoning
is performed for each training example, and appropriate metrics are
calculated on the basis of the activation of the rule. The following metrics
are used in the fitness function:
• corri – the number of correctly classified examples, i.e. those examples
for which the i-th rule fired with an appropriate strength (greater than θ)
and its conclusion was the same as the classification of the neural network
for that pattern; this metric is computed over all training examples, even
those covered by other rules;
• corr newi – the number of newly covered examples, i.e. the covering of
unmarked examples;
• incorri – the number of incorrectly classified examples: the i-th rule
fired with an appropriate strength (the activation of the rule is greater
than the threshold θ), but its conclusion was inconsistent with the neural
network output; here all examples are considered, including those marked as
covered;
• unclassi – the number of unclassified examples, for which the i-th rule
did not fire but the conclusion of the rule was consistent with the neural
network output for the example; all examples are considered in this case;
• unclass newi – the number of unclassified examples which are not marked
as covered by another rule.
Additionally, the fitness function applies the metric premi, which gives the
number of active premises in the rule. The fitness function used at this
level is presented by (20):
The first component of the fitness function (20) ensures maximizing the
number of newly correctly classified patterns while minimizing the number of
incorrectly classified patterns. The second one has a strong selective role,
because its value for an individual representing a rule that does not cover
any new pattern is much smaller than for one that covers even a single
example. The third component minimizes the number of premises in the rule.
The optimization of fuzzy sets is performed by an evolutionary algorithm,
which is called EAFuzzySets. A collection of groups of fuzzy sets is coded in
one individual. One group Gj corresponds to one input attribute that relates
to the neural network input. Additionally, a gene RS coding the sequence of
fuzzy rules is included. The initial experiments showed that classical fuzzy
reasoning was not an effective solution in this case, so determining the
sequence of fuzzy rules was necessary. The form of the chromosome in EAFuzzySets
is presented in Fig. 12. The coding of a group of fuzzy sets is common to both
approaches (REX Pitt and REX Michigan). The coding of a single rule is
similar to that in REX Pitt. The first number in a premise indicates whether
the premise is active. The second number is simply the index of a fuzzy set in
the group corresponding to the given neural network input.
Generalising, we can say that the information contained in one chromo-
some in REX Pitt is split into two chromosomes in REX Michigan. These two
chromosomes represent individuals in EARules and EAFuzzySets.
The evaluation of individuals involves their decoding. As a result, in
EARules the set of rules is obtained. Then, for each example included in the
training set, simplified fuzzy reasoning takes place. For individual j at this
level, the following statistics are collected during this process: corrj , incorrj ,
unclassj . Additionally, the complexity of the collection of fuzzy sets is mea-
sured by the parameter f setsj describing the number of active fuzzy sets in
the collection of fuzzy sets. The form of fitness function is as follows (21):
Fig. 17. The comparison of the extracted rules for REX Pitt and Full–RE for the
IRIS data set
• xi > value1 ,
• xi < value2 ,
• value1 < xi ∧ xi < value2 ⇔ xi ∈ (value1 ; value2 ),
• xi < value1 ∨ value2 < xi .
For a discrete attribute, the non-strict inequalities (≤, ≥) are used instead
of (<, >). One assumes that value1 < value2 .
For enumerative attributes, only two relation operators are used, {=, ≠},
so the premise has one of the following forms:
• xi = valuei ,
• xi ≠ valuei .
For boolean attributes there is only one relation operator, =. The premise
can take one of the following forms:
• xi = True,
• xi = False.
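The premise forms above can be evaluated with a small dispatcher. The operator codes and argument layout used here are an illustrative assumption, not GEX's exact chromosome encoding.

```python
def premise_holds(kind, x, op, v1, v2=None):
    """Evaluate one premise for an attribute value x.

    kind selects the attribute type ("real", "enum" or "bool");
    op and the v1/v2 layout are an assumed encoding for illustration.
    """
    if kind == "real":                        # continuous attribute
        if op == ">":
            return x > v1
        if op == "<":
            return x < v2
        if op == "in":                        # x in (v1; v2)
            return v1 < x < v2
        if op == "out":                       # outside the range
            return x < v1 or v2 < x
    if kind == "enum":                        # enumerative attribute: = or !=
        return (x == v1) if op == "=" else (x != v1)
    if kind == "bool":                        # boolean attribute: = only
        return x == v1
    raise ValueError(kind)
```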
All rules in one subpopulation have an identical conclusion. The evolutionary
algorithm runs in a classical way (Fig. 19). The only difference between the
classical evolutionary algorithm and the proposed one lies in the evaluation
of individuals, which requires a decision system based on the processing of
rules. After each generation the rules are evaluated using the set of patterns,
which are processed by the neural network and by the rules from the current
population (Fig. 11). To realize this, a decision system searching for the rules
that cover a given pattern is implemented. A rule covers a given example
according to Definition 1.
Definition 1 The rule ri covers a given example p when, for all values of the
attributes present in this pattern, the premises of the rule are true.
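Definition 1 translates directly into code. Representing a rule as (attribute index, predicate) pairs is an assumption made for illustration; GEX encodes rules in a chromosome as described below.

```python
def covers(rule, example):
    """Definition 1: the rule covers the example when every active premise
    is true for the example's attribute values.

    rule    : list of (attribute_index, predicate) pairs (assumed encoding)
    example : sequence of attribute values
    """
    return all(pred(example[i]) for i, pred in rule)
```

For example, a rule with the active premises x0 > 1.0 and x2 < 5.0 covers the pattern (2.0, 9.9, 3.0) regardless of the value of the inactive attribute x1.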
The comparison of the classification results is the basis for the evaluation
of each rule, expressed by the value of a fitness function.
An evolutionary algorithm working in this way will look for the best rules,
covering as many patterns as possible. There is a risk that some patterns
would never be covered by any rule. To solve this problem, a niching
mechanism is implemented in GEX. The final set of rules is created from the
best rules found by the evolutionary algorithm, but some heuristics are also
applied in order to optimize it. The details of the evolutionary algorithm and
the above-mentioned heuristics are presented in the following subsections.
Figure 20 shows the general scheme of the genotype in GEX. It is composed
of chromosomes corresponding to the inputs of the neural network and
a single conclusion gene. A chromosome consists of a flag gene and genes
encoding the premise, which are specific to the type of the attribute the
premise refers to.
The flag allows rules of different lengths, because a premise is included in
the body of the rule only if its flag is set to 1. The chromosome is designed
according to the type of the attribute (Fig. 21), in order to reflect the
condition in the premise. For a real attribute the chromosome consists of the
code of the relation operator and two values determining the limits of the
range (Fig. 21c). For a nominal attribute there is an operator code and
a value (Fig. 21b). Fig. 21a shows the chromosome for a binary attribute:
besides the flag gene, it consists of one gene referring to the value of the
attribute,
where ximax and ximin are, respectively, the maximal and minimal values of
the i-th attribute, and rch is a parameter defining how much the limits of the
range can be changed. For a discrete attribute a new value is chosen at random
from the values defined for this type. The fitness function is defined as the
weighted average of the following parameters: accuracy (acc), classCovering
(classCov), inaccuracy (inacc), and comprehensibility (compr):
Fun = (A ∗ acc + B ∗ inacc + C ∗ classCov + D ∗ compr) / (A + B + C + D)   (23)
Weights (A, B, C, D) are implemented as the parameters of the application.
Accuracy measures how well the rule mimics the knowledge contained in the
neural network. It is defined by (24).
acc = correctFires / totalFiresCount   (24)
inacc is a measure of incorrect classification made by the rule. It is expressed
by Eq. (25).
inacc = missingFires / totalFiresCount   (25)
Parameter classCovering, abbreviated as classcov, gives the fraction of all
patterns from a given class that are covered by the evaluated rule. It is
formally defined by Eq. (26):
classcov = correctFires / classExamplesCount,   (26)
where classExamplesCount is the number of patterns from a given class. The
last parameter, comprehensibility, abbreviated as compr, is calculated on the
basis of Eq. (27):
compr = (maxConditionCount − ruleLength) / (maxConditionCount − 1),   (27)
where ruleLength is the number of premises of the rule and maxConditionCount
is the maximal number of premises in a rule, i.e., the number of inputs of the
neural network.
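Equations (23) to (27) combine into a single score. The sketch below follows the formulas as printed; note that Eq. (23) adds the inaccuracy term, so in practice a negative weight B (or a subtraction) would presumably be needed for inaccuracy to act as a penalty. The default weights here are arbitrary placeholders, since A, B, C, D are application parameters.

```python
def gex_fitness(correct_fires, missing_fires, total_fires,
                class_examples, rule_length, max_conditions,
                A=1.0, B=1.0, C=1.0, D=1.0):
    """Weighted-average fitness of Eq. (23) built from Eqs. (24)-(27)."""
    acc = correct_fires / total_fires                       # Eq. (24)
    inacc = missing_fires / total_fires                     # Eq. (25)
    classcov = correct_fires / class_examples               # Eq. (26)
    compr = (max_conditions - rule_length) / (max_conditions - 1)  # Eq. (27)
    return (A * acc + B * inacc + C * classcov + D * compr) / (A + B + C + D)
```

With 8 correct and 2 missing fires out of 10, a class of 16 patterns, and a 3-premise rule over 5 inputs, the four parameters are 0.8, 0.2, 0.5 and 0.5, so the unweighted average is 0.5.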
During the evolution the set of rules is updated: some rules are added and
some are removed. In each generation, individuals with accuracy and
classCovering greater than minAccuracy and minClassCovering are the
candidates to update the set of rules. The values minAccuracy and minClass-
Covering are parameters of the method. Rules are added to the set of rules
when they are more general than the rules currently in the set, according to
Definition 2.
Definition 2 Rule r1 is more general than rule r2 when the set of examples
covered by r2 is a subset of the set of examples covered by r1 . In the case
where r1 and r2 cover the same examples, the rule with the higher fitness
value is taken as more general.
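Definition 2 can be checked by comparing the sets of covered examples. Representing a rule as a predicate over examples and passing the two fitness values explicitly are illustrative assumptions.

```python
def more_general(r1, r2, examples, f1, f2):
    """Definition 2: r1 is more general than r2 when the examples covered
    by r2 form a subset of those covered by r1; if both cover exactly the
    same examples, the rule with the higher fitness (f1 vs f2) wins.

    r1, r2 : predicates example -> bool (assumed representation)
    """
    cov1 = {i for i, e in enumerate(examples) if r1(e)}
    cov2 = {i for i, e in enumerate(examples) if r2(e)}
    if cov1 == cov2:
        return f1 > f2          # tie on coverage: compare fitness
    return cov2 <= cov1         # subset test on the covered-example sets
```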
Rule Extraction from Neural Networks 203
Furthermore, the less general rules are removed. After all patterns have been
presented, usability is calculated for each rule according to Eq. (28).

usability = usabilityCount / examplesCount   (28)

All rules with usability less than minUsability, which is a parameter set by
the user, are removed from the set of rules. The optimization of the set of
rules thus consists in removing less general and rarely used rules and
replacing them with more general rules from the current generation.
The following statistics characterize the quality of the set of rules. The
value covering defines the percentage of classified examples among all
examples used in the evaluation of the set of rules (Eq. 29).

covering = classifiedCount / examplesCount   (29)
Fidelity, expressed in Eq. (30), describes the percentage of examples
classified correctly (according to the neural network answer) among all
examples classified by the set of rules.

fidelity = correctClassifiedCount / classifiedCount   (30)
Covering and fidelity are two measures of the quality of the acquired set of
rules that describe its accuracy and generalization. Additionally, performance
(Eq. 31) is defined, giving the percentage of correctly classified examples
among all examples used in the evaluation process.

performance = correctClassifiedCount / examplesCount   (31)
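The three statistics can be computed together. Note that Eqs. (29)-(31) imply performance = covering × fidelity, which is a convenient consistency check.

```python
def rule_set_stats(classified, correctly_classified, total):
    """Quality statistics of a rule set per Eqs. (29)-(31).

    classified           : examples classified by the rule set
    correctly_classified : of those, examples agreeing with the network
    total                : all examples used in the evaluation
    """
    covering = classified / total                   # Eq. (29)
    fidelity = correctly_classified / classified    # Eq. (30)
    performance = correctly_classified / total      # Eq. (31)
    return covering, fidelity, performance
```

For example, 80 of 100 examples classified, 72 of them in agreement with the network, gives covering 0.8, fidelity 0.9 and performance 0.72.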
Table 1 shows the results of experimental studies on benchmark data files.
The aim of the experiments was to test the quality of rule extraction by GEX
for data sets with different types of attributes. The procedure was as follows.
First, the data set was divided into 10 parts. Each time, 9 parts were used for
training and the remaining part for testing. This was repeated 25 times
(equivalent to 2.5 runs of 10-fold cross-validation). The results were then
averaged, and the mean and standard deviation were calculated. The 15 tested
data sets come from the UCI repository. They represent a spectrum of data
sets with different types of attributes. The files Agaricus lepiota, Breast
cancer, Tictactoe and Monk1 have nominal attributes only. WDBC, Vehicle,
Wine and Liver are examples of files with continuous attributes. Dermatology
has discrete attributes, and the Zoo data file logical ones. The files
Australian credit, Heart and Ecoli have mixed attributes.
In these experiments the values of parameters were set as follows:
• mutation = 0.2,
• crossover = 0.5,
204 U. Markowska-Kaczmar
Table 1. The results of experiments with GEX in terms of: the number of generations
(Ngenerations), the number of rules (Nrules), covering and fidelity, for data sets
from the UCI repository with different types of attributes, using 10-fold cross-
validation (average value ± standard deviation is shown)
Other tests, made with the following data sets from UCI: SAT (36 contin-
uous attributes, 4435 patterns), Artificial characters (16 discrete attributes,
16,000 patterns), Sonar (60 continuous attributes, 208 patterns), Animals
(72 nominal and continuous attributes, 256 patterns), German Credit (21
nominal, discrete and continuous attributes, 1000 patterns) and Shuttle (9
continuous attributes, 43,500 patterns), confirmed that the method is scalable.
The comprehensibility of the rules acquired by GEX was tested as well. In
this case the LED data set was chosen, because its patterns are easily
visualized. The simple version of this data file contains seven attributes
representing the segments of a LED display, as shown in Fig. 22. In the applied
file these 7 attributes were extended with 17 additional attributes set at
random to 0 or 1. To keep the task nontrivial, 1% of noise was introduced into
the proper attributes representing the LED segments (1% of the values of the
proper attributes were flipped to the opposite value). This problem was not
easy for the neural network to classify either; it was trained to a performance
of 80%.
An example of the set of rules extracted after several runs of the GEX
application with different parameter values is presented in Table 2. It can
be noticed that only the rules for digit 3 contain a premise referring to an
additional, artificial attribute. For the remaining digits the rules are perfect.
They are visualized in Fig. 23 (except the rules for digit 3). During the tests
one could observe that the time of rule extraction and the final result (rules for all
Table 2. The example of the set of rules for LED data set
Fig. 23. Visualization of LED data set - the first row, visualization of rules extracted
for LED data set - the second row
classes and the number of rules for each class) was strongly dependent on the
first found rule describing one of the similar digits (6, 9, 8). During these
experiments the crucial parameters were minAccuracy and minUsability.
For noisy data, minAccuracy equal to 0.9 or even 0.8 gave good results. The
greater the value of minUsability, the more general the rules in the final set
of rules.
Table 3 presents a comparison of the results obtained by GEX and other
methods of rule extraction from neural networks. The results do not allow one
to state that GEX is always better than its counterpart methods (e.g., see the
Iris data set for Santos's method), but it can be observed that in most cases
GEX surpasses the other methods. Besides, it delivers rules for all classes
without the need for a default rule.
6 Conclusion
Neural networks are very efficient in solving various problems, but they lack
the ability to explain their answers and to present the gathered knowledge in
a comprehensible way. In recent years numerous methods of rule extraction
have been developed. Two main approaches are used, namely the global one
that treats a network as a black box and the local one that examines its
structure. The purpose of these algorithms is to produce a set of rules that
would describe a network’s performance with the highest fidelity possible,
taking into account its ability to generalise, in a comprehensible way.
Because the problem of rule extraction is very complex, the use of evolutionary
algorithms for this purpose, especially for networks that solve classification
problems, is more and more popular. In this chapter examples of both kinds of
usage were presented: local rule extraction as well as global. Effective
application of the local approach depends on the number of neurons in the
network, so existing methods tend to limit the architecture of the neural
network. An evolutionary algorithm can be applied to cluster the activations
of hidden neurons and to substitute one neuron for a cluster of neurons. In
other applications it optimizes the structure of the neural network. Global
methods treat a neural network as a black box; as such, the network delivers
the training examples for the rule-acquiring method. In this group three
different methods based on evolutionary algorithms were presented. The
experimental studies presented in this chapter show that methods of rule
extraction from neural networks based on evolutionary algorithms are not only
a promising idea, but can deliver fuzzy or crisp rules that are accurate and
comprehensible for humans in acceptable time. Because of the limited space of
this chapter, methods using multiobjective optimization in the Pareto sense in
the global group are not presented here. Interested readers are referred
to [24] and [22].
References
1. Alexander JA, Mozer M (1999) Template-based procedures for neural network
interpretation. Neural Netw 12:479–498
2. Andrews R, Diederich J, Tickle A (1995) Survey and critique of techniques
for extracting rules from trained artificial neural networks. Knowl-Based Syst
8(6):373–389
3. Andrews R, Geva S (1994) Rule extraction from constrained error back propa-
gation MLP. In: Proceedings of 5th Australian conference on neural networks,
Brisbane, Queensland, pp 9–12
4. Blake CC, Merz C (1998) UCI Repository of Machine Learning Databases.
University of California, Irvine, Department of Information and Computer
Sciences.
5. Bologna G (2000) A study on rule extraction from neural network applied to
medical databases. In: The 4th European conference on principles and practice
of knowledge discovery
6. Castillo L, González A, Pérez R (2001) Including a simplicity criterion in the
selection of the best rule in a genetic algorithm. Fuzzy Sets Syst 120:309–321
7. Cichocki A, Unbehauen R (1993) Neural networks for optimization and signal
processing. Wiley, London
8. Eiben A, Aarts E, Hee K (1991) Parallel problem solving from nature. Chapter
Global convergence of genetic algorithms: a Markov chain analysis, Springer,
Berlin Heidelberg New York, p 412
27. Omidvar O, Van der Smagt P (eds.) (1997) Neural systems for robotics.
Academic, New York
28. Palade V, Neagu DC, Patton RJ (2001) Interpretation of trained neural net-
works by rule extraction. Fuzzy days 2001, LNCS 2206, pp 152–161
29. Reil T, Husbands P (2002) Evolution of central pattern generators for
bipedal walking in a real-time physics environment. IEEE Trans Evol Comput
6(2):159–168
30. Robert C, Gaudy JF, Limoge A (2002) Electroencephalogram processing using
neural network. Clin Neurophysiol 113(5):694–701
31. Santos R, Nievola J, Freitas A (2000) Extracting comprehensible rules from
neural networks via genetic algorithms. In: Symposium on combinations of
evolutionary computation and neural networks 1:130–139
32. Setiono R, Leow WK, Zurada J (2002) Extraction of rules from artificial neural
networks for nonlinear regression. IEEE Trans Neural Netw 13(3):564–577
33. Siponen M, Vesanto J, Simula O, Vasara P (2001) An approach to automated
interpretation of SOM. In: Proceedings of workshop on self-organizing map 2001
(WSOM2001), pp 89–94
34. Taha I, Ghosh J (1999) Symbolic interpretation of artificial neural networks.
IEEE Trans Knowl Data Eng 11(3):448–463
35. Thrun SB (1995) Advances in neural information processing systems. MIT, San
Mateo, CA
36. Van der Zwaag B-J (2001) Handwritten digit Recognition: a neural network
demo. In: Computational intelligence: theory and applications, Vol. 2206 of
Springer LNCS. Dortmund, Germany, pp 762–771
37. Widrow B, Rumelhart DE, Lehr M (1994) Neural networks: applications in
industry, business and science. Commun ACM 37(3):93–105
Cluster-wise Design of Takagi and Sugeno
Approach of Fuzzy Logic Controller
α Constant
β Threshold for similarity
γ Threshold for determining a valid cluster
µ Membership function value
Takagi and Sugeno Approach of FLC 213
1 Introduction
We, human beings, have a natural thirst to gather input-output relationships
of a process, which are necessary, particularly for on-line control of the same.
Several attempts were made in the past, to capture the above relationships
for a number of processes, by using the tools of traditional mathematics (e.g.,
differential equations and their solutions), statistical analysis (e.g, regression
analysis based on some experimental data collected in a particular fashion,
say full factorial design, fractional factorial design, central composite design,
and others), and others. Most of the real-world problems are so complex that
it might be difficult to formulate them, in the form of differential equations.
Moreover, even if it is possible to determine the differential equation, it could
be difficult to get its solution. The situation becomes worse, when the input
and output variables are associated with imprecision and uncertainty. A fuzzy
logic controller (FLC), which works based on Zadeh’s fuzzy set theory [1],
could be a natural choice, to solve the above problem, as it is a potential tool
for dealing with imprecision and uncertainty. The FLCs are becoming more
and more popular nowadays, due to their simplicity, ease of implementation
and ability to tackle complex real-world problems.
To design an FLC for controlling a process, we try to model the human rea-
soning used to solve the said problem, artificially. The variables are expressed
in terms of some linguistic terms (such as VN–very near, N–near, and others)
and the degree of belongingness of an element to a class is expressed by its
membership function value (which lies between 0 and 1). The rules generally
express the relationships among the input and output variables of a process.
The performance of an FLC depends on its knowledge base (KB), which con-
sists of both data base (i.e., membership function distributions) as well as
rule base. Two basic approaches of FLC, namely Mamdani Approach [2] and
Takagi and Sugeno Approach [3], are generally available in the literature. An
FLC developed based on Mamdani Approach may not be accurate enough but
it is interpretable. On the other hand, Takagi and Sugeno Approach can yield
a more accurate controller compared to that designed based on the Mamdani
Approach but at the cost of interpretability. In Mamdani Approach, the crisp
value corresponding to the fuzzified output of the controller is determined by
using a defuzzification method, which could be computationally expensive,
whereas in Takagi and Sugeno Approach, the output is directly expressed as
a function of the input variables.
214 Tushar and D.K. Pratihar
Attempts were also made to optimize the Takagi and Sugeno type of FLC.
Smith and Comer used the Least Mean Square (LMS) learning algorithm to
tune a general Takagi and Sugeno type FLC [17]. The co-efficients of the
output functions in the Takagi-Sugeno type FLC had been optimized using a
GA by Kim et al. [18]. The structure of ANFIS (adaptive-network-based fuzzy
inference system) was developed by Jang [19], in which the main aim was to
design an optimal Takagi-Sugeno type FLC by using a neural network.
The present chapter deals with GA-based (off-line) tuning of the FLCs
working based on Takagi and Sugeno approach. After the training of the
FLCs is over, their performances have been tested on two physical problems.
The rest of the text is organized as follows: Section 2 explains the principle
of Takagi and Sugeno approach of FLC. Section 3 gives a brief introduction
to the GA. The issues related to entropy-based fuzzy clustering and cluster-
wise linear regression are discussed in Section 4. The proposed method for
cluster-wise design of the FLCs working based on Takagi and Sugeno approach
and their GA-based tuning are explained in Section 5. The performances of
the developed approaches are compared among themselves on two physical
problems in Section 6. Some concluding remarks are made in Section 7 and
the scope for future work is indicated in Section 8.
3 Genetic Algorithm
Genetic algorithm (GA) is a population-based probabilistic search and opti-
mization technique, which works based on the mechanism of natural genetics
and Darwin’s principle of natural selection (i.e., survival of the fittest) [4].
The concept of GA was introduced by Prof. John Holland of the University
of Michigan, Ann Arbor, USA, in the year 1965, but his seminal book was
published in the year 1975 [20]. This book lays the foundation of genetic al-
gorithms. It is basically a heuristic search technique, which works using the
concept of probability. The working principle of a GA can be explained briefly
with the help of Fig. 1.
• A GA starts with a population of initial solutions, chosen at random.
• The fitness/goodness value (i.e., objective function value in case of a max-
imization problem) of each solution in the population is calculated.
• The population of solutions is then modified by using different operators,
namely reproduction, crossover, mutation, and others.
• All the solutions in a population may not be equally good in terms of
their fitness values. An operator named reproduction is used to select
the good solutions by using their fitness information. Thus, reproduction
forms a mating pool, which hopefully consists of good solutions.
Fig. 1. Flowchart of a GA: initialize a population of solutions and set Gen = 0;
assign fitness to all solutions in the population; apply reproduction, crossover
and mutation; increment Gen; repeat until Gen > Max_gen
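The loop of Fig. 1 can be sketched as below. The operator signatures, the use of fitness-proportionate selection for reproduction, and the pairwise application of crossover are illustrative assumptions; the chapter does not fix these details.

```python
import random

def genetic_algorithm(pop, fitness, crossover, mutate,
                      max_gen=100, pc=0.9, pm=0.01, rng=None):
    """Skeleton GA: evaluate, reproduce (fitness-proportionate selection
    into a mating pool), apply crossover and mutation, repeat until the
    generation counter exceeds max_gen, then return the best solution."""
    rng = rng or random.Random(0)
    for _ in range(max_gen):
        fits = [fitness(ind) for ind in pop]
        # reproduction: mating pool that hopefully consists of good solutions
        pool = rng.choices(pop, weights=fits, k=len(pop))
        nxt = []
        for i in range(0, len(pool) - 1, 2):
            a, b = pool[i], pool[i + 1]
            if rng.random() < pc:           # crossover with probability pc
                a, b = crossover(a, b, rng)
            nxt += [mutate(a, pm, rng), mutate(b, pm, rng)]
        if len(pool) % 2:                   # keep odd-sized populations intact
            nxt.append(pool[-1])
        pop = nxt
    return max(pop, key=fitness)
```

A quick sanity check on a onemax problem (maximize the number of 1s in a bit string) shows the loop climbing toward all-ones.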
(i.e., xi1 , xi2 , . . . , xiR ). The following steps are followed to determine entropy
of each point.
• Arrange the data set in rows and columns. Thus, there are N rows and
R columns.
• Determine the Euclidean distance between the points i and j as follows:

Dij = ( Σ_{k=1}^{R} (xik − xjk)² )^{1/2} .   (3)
From the above equation, it can be observed that entropy E becomes equal
to 0.0, for a value of S = 0 and S = 1.0. Moreover, entropy E takes the
maximum value of 1.0, corresponding to a value of S = 0.5. Thus, entropy
of a point with respect to another point varies between 0.0 and 1.0.
• Calculate total entropy value at a data point xi with respect to all other
data points by using the expression given below.
Ei = − Σ_{j∈X, j≠i} ( Sij log2 Sij + (1 − Sij ) log2 (1 − Sij ) )   (7)
It is also important to note that during clustering, the point having the
minimum total entropy may be selected as the cluster center, because
(1 − E) indicates the probability of a point being selected as a cluster
center.
Takagi and Sugeno Approach of FLC 219
Clustering Algorithm
Let us suppose that [T ] is the data set containing N data points and each
data point has R dimensions. Thus, [T ] is an N × R matrix. The clustering
algorithm consists of the following steps.
• Step 1: Calculate entropy Ei for each data point xi lying in [T ].
• Step 2: Locate xi , which has the minimum Ei value and select it
(xi,minimum ) as the cluster center.
• Step 3: Put xi,minimum and the data points whose similarity to
xi,minimum is greater than β (a threshold value of similarity) into a cluster,
and remove them from [T ].
• Step 4: Check if [T ] is empty. If yes, terminate; otherwise go to Step 2.
In the above algorithm, entropy has been defined in such a way that a data
point far away from the rest of the data may also be selected as a cluster
center, because a very distant point may also have a low value of entropy.
To overcome this, another parameter γ (in %) is introduced: a threshold used
to declare a cluster valid. After the clustering is over, the number of data
points in each cluster is counted, and if this number is greater than or equal
to γ% of the total number of data points, the cluster is declared valid.
Otherwise, the data points that could not form a valid cluster are declared
outliers.
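Steps 1 to 4 of the algorithm, together with the γ-based validity check, can be sketched as follows. The similarity form S_ij = exp(−α·D_ij) is an assumption: it is the usual choice in entropy-based fuzzy clustering, and the chapter's symbol list names α only as a constant.

```python
import math

def euclidean(a, b):
    """Eq. (3): Euclidean distance between two R-dimensional points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def entropy_clustering(T, beta, gamma, alpha=1.0):
    """Entropy-based clustering of the N x R data set T.

    beta  : similarity threshold for joining a cluster
    gamma : minimum cluster size as a percentage of all points
    alpha : constant in the assumed similarity S_ij = exp(-alpha * D_ij)
    """
    def sim(i, j):
        return math.exp(-alpha * euclidean(T[i], T[j]))

    def total_entropy(i, points):
        # Eq. (7): total entropy of point i w.r.t. the remaining points
        e = 0.0
        for j in points:
            s = sim(i, j)
            if j != i and 0.0 < s < 1.0:
                e -= s * math.log2(s) + (1.0 - s) * math.log2(1.0 - s)
        return e

    points = list(range(len(T)))
    clusters = []
    while points:
        # Steps 1-2: the point with minimum total entropy is the centre
        centre = min(points, key=lambda i: total_entropy(i, points))
        # Step 3: the centre and all points similar beyond beta form a cluster
        members = [j for j in points if j == centre or sim(centre, j) > beta]
        clusters.append(members)
        points = [j for j in points if j not in members]   # Step 4: repeat
    # gamma check: clusters smaller than gamma% of the data become outliers
    valid = [c for c in clusters if len(c) >= gamma / 100.0 * len(T)]
    outliers = [j for c in clusters if c not in valid for j in c]
    return valid, outliers
```

On a toy set with two tight groups and one far-away point, the distant point is indeed picked early (low entropy), forms a singleton cluster, and is then rejected by the γ check, illustrating the problem and remedy described above.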
yij = a0ij + a1ij x1 + a2ij x2 + . . . + akij xk + . . . + apij xp ,   (8)

where a0ij , a1ij , . . . , apij are the coefficients to be tuned properly during
the GA-based learning.
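Eq. (8) is a plain linear function of the inputs; flattening the coefficient vector as (a0, a1, ..., ap) for one rule/output pair is the only assumption in this sketch.

```python
def ts_output(a, x):
    """Eq. (8): Takagi-Sugeno rule output as a linear function of the
    inputs. a = (a0, a1, ..., ap) holds the coefficients for one rule j
    and one output i; x = (x1, ..., xp) is the input vector."""
    return a[0] + sum(ak * xk for ak, xk in zip(a[1:], x))
```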
Let us also suppose that each of the input variables (i.e.,
x1 , x2 , . . . , xp ) is expressed by using three linguistic terms – L (Low), M
(Medium) and H (High). The membership function distributions of the input
variables are assumed to be linear in Approach 2, whereas they are expressed
by third-order polynomial functions [25] in Approach 3 (refer to
Fig. 2). It is to be noted that the parameter gk (k = 1, 2, . . . , p) can be
varied while carrying out optimization by using a GA.
Fig. 2. Membership function distributions L, M, H (with µ between 0.0 and 1.0)
of an input variable xk , parameterized by gk and starting from xk,min :
(a) linear (Approach 2); (b) third-order polynomial (Approach 3)
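A minimal sketch of the linear L/M/H distributions of Approach 2 is given below. It assumes the three memberships span [xk,min, xk,min + 2gk] with half-base gk, which is what Fig. 2a appears to show; the exact placement of the triangles is an assumption, not the chapter's specification.

```python
def tri(x, a, b, c):
    """Triangular membership with corners a <= b <= c."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

def memberships(x, x_min, g):
    """Assumed linear L/M/H distributions of Approach 2: L falls from 1 at
    x_min to 0 at x_min + g, M peaks at x_min + g, H rises from x_min + g
    and saturates at x_min + 2g."""
    L = 1.0 if x <= x_min else max(0.0, 1.0 - (x - x_min) / g)
    M = tri(x, x_min, x_min + g, x_min + 2.0 * g)
    H = min(1.0, max(0.0, (x - (x_min + g)) / g))
    return L, M, H
```

Under these assumptions, at x = x_min + g the memberships are (0, 1, 0), so M is fully active exactly where L and H vanish.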
[Figure: the knowledge base of the FLC is tuned off-line by the GA and then
used on-line]
[Figure: binary GA string encoding the parameters g1 , . . . , gk , . . . , gp and
the coefficients h111 , . . . , hijk , . . . , hqnp ]
The performances of the above three developed approaches have been tested
and compared among themselves on two different problems – one is related
to Abrasive Flow Machining (AFM) [26] and the other deals with Tungsten
Inert Gas (TIG) welding [27], which are discussed below.
[Figure: (a) number of clusters vs. β; (b) number of outliers vs. β]
Results of Approach 1
Fig. 5 shows the comparison between the target values (determined by using the
empirical relationships) and the calculated values (obtained by utilizing the
above regression equations) of the two outputs, MRR and Ra , for 50 randomly
generated test cases. The belongingness of a test case to a particular cluster
is decided by considering the minimum of its Euclidean distances from the
four cluster centers obtained above. The model is able to make reasonably
good predictions of M RR for a number of test cases (corresponding to which
the points are lying on the ideal y = x line) but not all (refer to Fig. 5(a)).
The above figure shows that the model has under-estimated the M RR values
for a few test cases. It is interesting to notice from Fig. 5(b) that the model
has predicted surface roughness values almost accurately for most of the test
cases but not all. It is also important to note that the points are found to lie
on both sides of the ideal y = x line. Thus, reasonably good predictions of
both the outputs have been made by Approach 1, for most of the random test
cases. The actual input-output relationships of the above process may be non-
linear in nature. Moreover, the degree of non-linearity may not be the same
throughout the entire input-output space. As the clustering is done based on
the concept of similarity, the similar points are expected to form a cluster.
Here, the non-linear input-output space has been divided into four clusters
based on the similarity and at each cluster, the input-output relationships
have been determined using the linear regression analysis. Thus, a non-linear
space has been divided into four regions and at each region, the input-output
relationships have been approximated as the linear ones. The deviations in
predictions as shown in Fig. 5 could be due to the above reason.
[Fig. 5: target vs. calculated values of (a) MRR and (b) Ra ]
Results of Approach 2
Fuzzy logic controllers (FLCs) have been designed based on Takagi and
Sugeno’s approach cluster-wise, in which each of the outputs has been
expressed as a linear function of the inputs, as obtained above. The Knowledge
Base (KB) of the FLCs has been tuned by a GA to improve their performance.
As the performance of the GA depends on its parameters, a systematic
study is conducted to determine the optimal GA-parameters, in which only
one parameter has been varied at a time after keeping the others fixed. Fig. 6
shows the results of the above parametric study. It is important to note that
Fig. 6. Results of the GA-parametric study: (a) fitness vs. pc ; (b) fitness vs. pm
Fig. 6. (Continued): (c) fitness vs. population size; (d) fitness vs. no. of
generations
the best results are obtained with the following GA-parameters: probability
of crossover pc = 0.87, probability of mutation pm = 0.00223, population size
P = 190 and maximum number of generations G = 450. In this problem, there
are four input variables (i.e., p = 4), two outputs (i.e., q = 2) and the number
of clusters n is set equal to 4. Thus, a total of p +q × n × p = 4 + 2 × 4 × 4 = 36
values are to be varied within their respective ranges (refer to Table 1), during
Fig. 7. Performance testing of the FLC having linear membership function distri-
butions – AFM data
Results of Approach 3
A parametric study has been carried out for this approach by following the
same procedure explained above, and the following GA-parameters – pc = 0.93,
pm = 0.00133, P = 190 and G = 430 – are found to yield the best results. The
optimal values of gk and hijk obtained by using this approach are shown in
Table 1. The values of MRR and Ra, predicted by using Approach 3, have
been compared with their respective target values in Fig. 8. It is interesting
to note that Approach 3 has predicted both MRR and Ra values with almost
the same accuracy level as that obtained by Approach 2.
Comparisons
In the present work, five inputs, namely welding speed (C1), wire feed rate
(C2), % cleaning (C3), arc gap (C4) and welding current (C5), and four outputs,
Fig. 8. Target vs. calculated values of MRR and Ra (Approach 3)
namely weld bead front height (O1 ), front width (O2 ), back height (O3 ), back
width (O4 ), have been considered to model the TIG welding process. The
ranges of different input parameters have been set as follows: C1 (24.0 to
46.0 cm/min), C2 (1.5 to 2.5 cm/min), C3 (30.0 to 70.0), C4 (2.4 to 3.2 mm)
and C5 (80.0 to 110.0 amp). To establish input-output relationships of this
(a) Output 1: MRR – % deviation in prediction over the test cases ('CWLRMRR', 'LRFLCMRR', 'NONLRFLCMRR')
(b) Output 2: Surface roughness Ra – % deviation in prediction over the test cases ('CWLRRa', 'LRFLCRa', 'NONLRFLCRa')
process by using statistical regression analysis, the data collected as per a full
factorial design of experiments (as there are five input variables, the number
of experiments will be equal to 2^5 = 32) [27], have been used. Table 2 shows
the above set of 32 data involving input-output relationships of the process.
The following response equations have been obtained by using the MINITAB-14 software package on the above 32 data.
O1 = − 17.2504 + 0.6202C1 + 4.6762C2 + 0.0866C3 + 7.4479C4 + 0.0431C5
− 0.1870C1 C2 − 0.0058C1 C3 − 0.2210C1 C4 − 0.0029C1 C5 + 0.0018C2 C3
− 1.8396C2 C4 + 0.0191C2 C5 − 0.0586C3 C4 + 0.0018C3 C5 − 0.0352C4 C5
+ 0.014C1 C2 C3 + 0.0623C1 C2 C4 + 0.0002C1 C2 C5 + 0.0022C1 C3 C4
− 0.0070 × 10^-3 C1 C3 C5 + 0.0011C1 C4 C5 + 0.0061C2 C3 C4
− 0.0014C2 C3 C5 − 0.0030C2 C4 C5 − 0.0003C3 C4 C5 − 0.0004C1 C2 C3 C4
+ 0.0189 × 10^-3 C1 C2 C3 C5 − 0.0460 × 10^-3 C1 C2 C4 C5 − 0.0009C1 C3 C4 C5
− 0.0004C2 C3 C4 C5 − 0.0069C1 C2 C3 C4 C5 , (23)
Table 2. Data (as per full factorial design of experiments) used to carry out
regression analysis
One thousand data have been generated artificially by using the above re-
gression equations, and those are grouped into a number of clusters based on
similarity by utilizing the entropy-based fuzzy clustering algorithm. The best
set of four clusters with zero outliers is obtained, corresponding to a threshold
of similarity β = 0.43. Three approaches have been developed with the best
set of clusters obtained above, the results of which are explained below.
Results of Approach 1
Cluster 1:
Cluster 2:
Cluster 3:
Cluster 4:
outputs. The points in Fig. 10 are seen to lie either on the ideal y = x line or
on both sides of it. The scatter of a few points about the ideal y = x line could
be due to the fact that, within each cluster, the responses have been determined
as linear functions of the input process parameters, while their interaction
and nonlinear terms have been neglected for simplicity.
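The cluster-wise linear regression idea described above can be sketched as follows; the data, the least-squares fit and the nearest-centre assignment rule are all illustrative assumptions, not the chapter's exact formulation.

```python
import numpy as np

def fit_clusterwise(X, y, labels):
    """Fit one linear model (with intercept) per cluster."""
    models, centres = {}, {}
    for c in np.unique(labels):
        idx = labels == c
        A = np.hstack([X[idx], np.ones((idx.sum(), 1))])  # append intercept column
        coef, *_ = np.linalg.lstsq(A, y[idx], rcond=None)
        models[c] = coef
        centres[c] = X[idx].mean(axis=0)
    return models, centres

def predict(x, models, centres):
    """Predict with the linear model of the nearest cluster centre."""
    c = min(centres, key=lambda k: np.linalg.norm(x - centres[k]))
    return np.append(x, 1.0) @ models[c]

# toy data: two clusters, each with its own linear response
rng = np.random.default_rng(0)
X = np.vstack([rng.uniform(0, 1, (50, 2)), rng.uniform(5, 6, (50, 2))])
labels = np.array([0] * 50 + [1] * 50)
y = np.where(labels == 0, X @ [1.0, 2.0], X @ [3.0, -1.0])
models, centres = fit_clusterwise(X, y, labels)
print(predict(np.array([0.5, 0.5]), models, centres))  # ≈ 1.5
```

Each cluster keeps its own linear model, so mild nonlinearity is approximated piecewise while interaction terms are still ignored, as noted above.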
(a) Target vs. calculated values of output 1
(b) Target vs. calculated values of output 2
Fig. 10. Performance testing of cluster-wise linear regression – TIG data
(c) Target vs. calculated values of output 3
(d) Target vs. calculated values of output 4
Results of Approach 2
In this approach, FLCs have been developed cluster-wise by following the Takagi and Sugeno approach and utilizing the linear response equations obtained
Results of Approach 3
Comparisons
Table 3. (Continued)
(a) Target vs. calculated values of output 1
(b) Target vs. calculated values of output 2
Fig. 11. Performance testing of the FLC having linear membership function distributions – TIG data
(c) Target vs. calculated values of output 3
(d) Target vs. calculated values of output 4
Approaches 2 and 3 are found to perform better than Approach 1, i.e., GA-tuned FLCs (de-
veloped cluster-wise) are seen to outperform the cluster-wise linear regression
analysis. It could be due to the fact that the input-output relationships of
the above processes are not exactly linear in nature, although those have
(a) Target vs. calculated values of output 1
(b) Target vs. calculated values of output 2
Fig. 12. Performance testing of the FLC having nonlinear membership function
distributions – TIG data
(c) Target vs. calculated values of output 3
(d) Target vs. calculated values of output 4
A close watch on the above results indicates that Approach 3 has performed
slightly better than Approach 2. It might be due to the reason that the linear
membership function distributions of the input variables used in Approach 2
have been replaced by nonlinear ones in Approach 3. Thus, Approach 3
(a) Output 1: Weld bead front height ('CWLR1', 'LRFLC1', 'NLRFLC1')
(b) Output 2: Weld bead front width ('CWLR2', 'LRFLC2', 'NLRFLC2')
Fig. 13. Comparison of three approaches in terms of % deviation in prediction of
different outputs – TIG
(c) Output 3: Weld bead back height ('CWLR3', 'LRFLC3', 'NLRFLC3')
(d) Output 4: Weld bead back width ('CWLR4', 'LRFLC4', 'NLRFLC4')
Fig. 13. (Continued)
7 Concluding Remarks
To establish input-output relationships of two physical processes, three ap-
proaches (one is a statistical regression analysis and the other two deal with
GA-tuned FLCs) have been developed cluster-wise and their performances
are tested on 50 randomly-generated new cases. From the above study, the
following conclusions have been drawn:
1. GA-tuned FLCs (both Approach 2 and Approach 3) developed based on
the Takagi and Sugeno approach have outperformed the linear regression
analysis, i.e., Approach 1, for the random test cases. It might be because
adaptability has been injected into the FLCs during their training carried
out by using a GA, whereas Approach 1 has no provision to gain such a
property.
2. Approach 3 has yielded better results compared to Approach 2. It could
be due to the fact that the membership function distributions of the input
variables have been assumed to be nonlinear in Approach 3, whereas those
in Approach 2 are linear in nature. Thus, Approach 3 is able to capture the
nonlinearity of the processes more effectively. In fact, Approach 3
is found to be the best of all.
3. Performances of the approaches are problem-dependent.
References
1. Zadeh LA (1965) Fuzzy Sets, Information and Control, 8(3):338–353
2. Mamdani EH, Assilian S (1975) An experiment in linguistic synthesis with a
fuzzy logic controller. Int J Man Mach Stud 7:1–13
3. Takagi T, Sugeno M (1985) Fuzzy identification of systems and its application
to modeling and control. IEEE Trans Syst Man Cybern, SMC-15:116–132
24. Yao J, Dash M, Tan ST (2000) Entropy-based fuzzy clustering and fuzzy mod-
eling. Fuzzy Set Syst 113:381–388
25. Nandi AK, Pratihar DK (2004) Design of a genetic-fuzzy system to predict
surface finish and power requirement in grinding. Fuzzy Set Syst 148:487–504
26. Jain RK, Jain VK (2000) Optimum selection of machining conditions in abrasive
flow machining using neural networks. J Mater Process Technol 108:62–67
27. Juang SC, Tarng YS, Lii HR (1998) A comparison between the back-propagation
and counter-propagation networks in the modeling of the TIG welding process.
J Mater Process Technol 75:54–62
Evolutionary Fuzzy Modelling for Drug
Resistant HIV-1 Treatment Optimization
Summary. Fuzzy relational models for genotypic drug resistance analysis in Human
Immunodeficiency Virus type 1 (HIV-1) are discussed. Fuzzy logic is introduced to
model high-level medical language and viral and pharmacological dynamics. In-vitro
experiments on genotype/phenotype pairs and in-vivo clinical data bases form the
basis for the knowledge mining. Fuzzy evolutionary algorithms and fuzzy evaluation
functions are proposed to mine resistance rules, to improve computational perfor-
mance and to select relevant features.
1 Introduction
1.1 Artificial Intelligence in Medicine
Recent years have seen medicine and artificial intelligence cross paths and
proceed together: statistical analysis has long been a valid support for epidemiology
and diagnosis, but after the first genomic regions were sequenced and interpreted,
the scenario and the needs became more complex, involving biology and bio-
chemistry. Today computer science - through machine learning and intelligent
systems - integrates medicine and biology in several fields, from sequence anal-
ysis to protein structure and function prediction, to gene regulatory network
modelling, to molecular design, to medical diagnosis. Medical and biological
data bases have started to grow and assume standard structures. Biological
systems are complex systems; medical measures are extremely variable even
under the same conditions and are only indirect indicators of the real processes.
One way to handle them is to use the uncertainty and vagueness concepts of
Fuzzy Logic, which is thereby a suitable modelling framework. In the HIV
treatment optimization scenario, the virus develops drug resistance through
genomic variation under drug pressure: the high mutation rate determines a
huge state variable space. There is a large number of different drugs that attack
different viral target genes and have to be combined in order to maintain viral
suppression and control the chance of resistance arising. Moreover, in the human body
M. Prosperi and G. Ulivi: Evolutionary Fuzzy Modelling for Drug Resistant HIV-1 Treatment
Optimization, Studies in Computational Intelligence (SCI) 82, 251–287 (2008)
www.springerlink.com
© Springer-Verlag Berlin Heidelberg 2008
252 M. Prosperi and G. Ulivi
The first sections (2 and 3) of this chapter are intended to provide the necessary
introduction and literature references to probe further: in detail, biological
background on antiviral treatments, resistance development, viral sequence
analysis and data collection is given in section 2. The following section 3 is
an overview of the current machine learning approaches for drug resistant
HIV-1 treatment optimization. Section 4 introduces fuzzy modelling for med-
ical science: starting from previous studies in the field, it goes on to define the
fuzzy relational system for in-vitro and in-vivo modelling. In section 5 optimiza-
tion techniques are discussed, describing Fuzzy Genetic Algorithms, Random
Searches and Fuzzy Feature Selection criteria. Section 6 presents application
results for phenotype and in-vivo clinical prediction, with conclusions and
future perspectives.
amino acids, which are encoded by blocks of three adjacent nucleotides in the
genome, called codons.
Genomic sequences are the building blocks of biological mechanisms: com-
puter science is today necessary to investigate genes and their functions;
even simple organisms like viruses are characterized by long character
sequences. The base for sequence analysis is [5], which is also a complete and
generic guide to the whole set of derived subtasks.
In the virus life cycle (when it reproduces) the genome string has to be copied
from one generation to the next. Soon after HIV enters the body, the virus
begins reproducing at a rapid rate and billions of new viruses are produced
every day. In the process, HIV produces both perfect copies of itself (wild
type) and copies containing errors (mutants): copying errors occur frequently.
Mutations can change the virus structure or functions and thus modify its inter-
action with the environment: the high mutation rate of HIV (combined with the
fact that it attacks the immune system) leads to difficulties in the design of a
vaccine, and to rapid selection of mutant strains resistant to drugs. At present
three classes of drugs are approved by the FDA (Food and Drug Administration
of the USA) as antiviral treatment against HIV: these are Reverse Transcriptase
inhibitors (RTi), Protease inhibitors (PRi) and Fusion inhibitors (Fi); each
class acts against a step of the viral replication process, and there are around
15 different molecules on the market.
The viral genotype is an RNA sequence over a 4-character alphabet, from
which mutations are usually extracted by comparing the sequence with the
wild type. Usually mutations are identified by a number representing a codon
(a position in the genomic sequence), preceded by a letter that indicates the
amino acid present in the wild type (i.e. the standard virus, without mutations)
and followed by another letter that denotes the amino acid replacing it in the
mutant. For instance, a mutation that usually confers resistance to Lamivudine
(3TC) is M184V: it indicates that at codon 184 the amino acid Methionine (M)
has been replaced by Valine (V).
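The mutation naming convention is easy to parse mechanically; the sketch below handles only simple substitution codes such as M184V (insertions and mixtures would need a richer grammar).

```python
import re

def parse_mutation(code):
    """Split a mutation code like 'M184V' into (wild-type amino acid, codon, mutant amino acid)."""
    m = re.fullmatch(r"([A-Z])(\d+)([A-Z])", code)
    if m is None:
        raise ValueError(f"not a simple substitution code: {code}")
    return m.group(1), int(m.group(2)), m.group(3)

print(parse_mutation("M184V"))  # ('M', 184, 'V')
```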
During infection there is no single virus in the body, but a large population
of mixed viruses called quasispecies. The wild type virus is the one naturally
evolved with the highest replicative capacity: before therapy is started, it is the
most abundant in the body and dominates all other quasispecies. Some mutant
variants are too weak to survive and/or cannot reproduce. Others are strong
enough to reproduce but still cannot compete with the fitter wild type; as a
result, their numbers in the body are lower than those of the wild type.
A drug usually works by blocking a key step in the virus life cycle. Some
variants have mutations that allow the virus to partly, or even fully, resist an
antiretroviral drug. Under constant therapy, mutant resistant strains can become
dominant in the patient (though having lower replicative capacity, or fitness).
This is called selective resistance, because the mutant is selected by the drug.
If it is not recognized, treatment loses its efficacy. Selected mutants are more
challenging to treat because therapy options are reduced. If the drug regimen is
changed, new mutations can be selected; furthermore there are mutations
(such as the insertion at codon 69 in the Reverse Transcriptase gene) that
cause cross-resistance to a whole class of antiretrovirals.
Treatment interruptions have shown that in a couple of months HIV reverts
to the wild type, but maintains low concentrations of resistant mutants, so if
a heavily experienced drug is reused, resistance arises shortly.
Combined therapies that involve multiple drugs are an approach to avoid
resistance. If the virus changes to resist one drug, but is inhibited by many
different others, it can be suppressed to undetectable levels (even though
complete eradication is not possible). Combined treatments can contain from
three to five different drugs, but often lead to tolerability and toxicity prob-
lems. Such therapies (usually two or three RTis and at least a PRi) are called
HAARTs (Highly Active Anti Retroviral Therapies) or cARTs (combined Anti
Retroviral Therapies): usually HAARTs produce a substantial reduction of viral
load within three to four weeks, and can be sustained over a long time window.
Unfortunately mutations occur also under HAARTs, even though at a lower
rate.
Before being approved and commercialized by the FDA, drugs follow a
long process in which their efficacies are tested through different phases (namely,
from phase I to phase IV): first they are designed, synthesized and put in viral
cultures; once proven effective in-vitro, they start to be tested for
absorption levels and toxicity in-vivo, until they are judged to be relatively
safe for the human body and effective in viral eradication. In-vitro studies,
however, are always carried on - even after commercialization - in order to
point out further resistance development.
In-vitro and in-vivo studies are the data available for modelling.
In-vitro studies are collections of experiments that measure how a mutated
virus responds in a culture to inhibition by a single drug, compared with the
replication of the wild type under the same drug pressure. The phenotype
is a numeric indicator of viral replication power, expressed as the Fold Change
of the drug concentration needed to inhibit 50% of the viral replication as
compared to a wild type drug-susceptible reference viral strain; the data sets
are pairs of genotype sequences and Fold Change values. At present, for each
drug there are thousands of such pairs freely available, and the quality - given
fixed environmental conditions and repeatability - is fairly high. These tests are
expensive compared to the cost of sequencing a viral strain, so a first attempt
is to define models that give phenotype predictions from genotype sequences.
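As a deliberately simple genotype-to-phenotype baseline (not the fuzzy predictor developed in this chapter), mutations can be encoded as binary indicators and the log Fold Change regressed linearly; the mutation panel and training pairs below are made up.

```python
import numpy as np

# hypothetical panel of resistance-relevant mutations
PANEL = ["M41L", "K65R", "M184V", "T215Y"]

def encode(mutation_list):
    """Binary indicator vector: 1 if the mutation is present in the genotype."""
    return np.array([1.0 if m in mutation_list else 0.0 for m in PANEL])

# toy training set: (observed mutations, log Fold Change) pairs
train = [(["M184V"], 2.0), (["M41L", "T215Y"], 1.5),
         ([], 0.0), (["K65R", "M184V"], 2.5)]
X = np.vstack([encode(m) for m, _ in train])
y = np.array([fc for _, fc in train])
w, *_ = np.linalg.lstsq(X, y, rcond=None)  # least-squares weight per mutation

print(float(encode(["M184V"]) @ w))  # predicted log Fold Change for M184V alone
```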
In-vivo studies usually are data bases collecting patients’ Follow Ups, i.e.
analyses carried out before and after a therapy switch: usually a therapy
Evolutionary Fuzzy HIV Modelling 255
is stopped and considered failing when the Viral Load1 is detectable in the
patient's blood and/or the CD4+ T2 cell counts are very low; when the virus
is detectable, it can also be sequenced, so it is possible to find which mutations
have been selected. Unfortunately, it is not always possible to obtain
clean and large data sets: prospective cohorts (called clinical trials - studies on
precise therapeutic protocols led by a team of physicians, in which pa-
tients are monitored weekly) are the most reliable ones, but not always free
and not so large; retrospective cohorts (collections of clinical reports from
the hospitals) are larger in size, but suffer from noise such as time delays, miss-
ing data and non-adherent patients. In addition, in-vivo measures are biased
by the instruments' systematic errors: viral load measures are reliable within 1
Log and cannot detect copies under certain limits (500 or 50 cp/ml); genotype
sequencing methods have an accuracy of 90% in revealing mutations, but per-
formance decreases when using plasma samples with low viral concentrations.
Even input errors are not negligible: data bases are not automated and have bad
relational structures and implementations; mostly, data are recorded manually
from paper clinical reports to spreadsheets. There are thousands of instances
available, but the variability of in vivo data is extremely high and the space of
investigation is huge: in theory there are about 20^400 possible genotypes (in the
sequenced region) and (15 choose 1) + · · · + (15 choose 5) = 4943 therapeutic combinations.
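The count of therapeutic combinations quoted above can be checked directly:

```python
from math import comb

# number of cART combinations of one to five drugs out of 15
total = sum(comb(15, k) for k in range(1, 6))
print(total)  # 4943
```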
cARTs with three or more antiretroviral drugs have led to significant de-
creases in HIV-related morbidity and mortality by reducing HIV replication.
The goal of therapy is to suppress the plasma viral load as much as possible
for as long as possible. Given the fact that viral eradication is not feasible
with the current treatment armamentarium, HIV mutants ultimately de-
velop, with different degrees of decreased susceptibility to the ongoing treat-
ment regimen and of cross-resistance to other agents. This results in virologic
rebound and eventually disease progression: for these patients, the aim is to
build a therapy optimization tool that will explore efficacies among a set of
possible cARTs and assure the best subsequent Viral Load reduction, taking
into account input attributes among the viral genotype, Baseline Viral Load
(virion counts in the plasma taken at a therapy switch), CD4+ T cell counts,
pharmacodynamics-kinetics, viral drug resistance mechanisms, et cetera.
One of the first studies published on drug resistant HIV treatment opti-
mization was the CTSHIV (Customized Treatment Strategy for HIV, see [18]):
the system operated on a predetermined fuzzy-like set of in vivo viral
drug resistance rules applied to a patient's viral genotype and nearby mutants.
1 Viral Load is the virion count in the plasma.
2 CD4+ T cells are immune cells targeted and infected by the virus, so their count
is a measure of the immune response to the infection.
Advantages of the fuzzy modelling come from the strong flexibility gained us-
ing linguistic variables, i.e. the possibility of modelling hypotheses in complex
natural language, like the medical one.
However, due to the multi-dimensional and heterogeneous characteristics
of the input attribute space (discrete-character viral genotype mutational
sequences, cARTs and real-valued plasma analyses), defining linguistic variables
is complex, and mining them is even harder.
Before introducing the HIV models, it is worth citing here the CADIAG
system, a fuzzy inference relational model proposed by Sanchez, Adlassnig
and Gupta [9, 10], developed as an automated tool for medical diagnosis: this
study in fact inspired the HIV model. In the Sanchez approach the medical
knowledge is represented as a fuzzy relation between symptoms and diseases.
So, given the fuzzy set A of the symptoms observed in the patient3 and the
fuzzy relation R = (A → B) representing the medical knowledge that relates
symptoms s ∈ S and diseases d ∈ D, a fuzzy set B of the patient's
possible diseases can be calculated through
B = A ◦ R (1)
or equivalently
T = Q ◦ R (3)
3 Here membership functions are set up on healthy/ill distributions from real-valued
analyses.
M ◦ W = R (4)
[ m1,1 · · · m1,M ]   [ w1,AZT · · · w1,3TC ]   [ r1,AZT · · · r1,3TC ]
[  ...  ...  ...  ] ◦ [  ...    ...    ...  ] = [  ...    ...    ...  ]
[ mN,1 · · · mN,M ]   [ wM,AZT · · · wM,3TC ]   [ rN,AZT · · · rN,3TC ]
Define furthermore another weight matrix W′, in which there are the de-
grees of susceptibility (i.e. how much a mutation increases the power of a
drug), obtaining a result matrix S:
M ◦ W′ = S (5)
The loss compares the phenotype predictions derived from these compositions with the measured phenotypes P:
L(f(R = M ◦ W, S = M ◦ W′), P) = L(P̂, P)
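A common choice for the composition ∘ in (4) and (5) is the max-min composition; the sketch below uses it with made-up membership degrees and weights (the actual operator and weight values are what the optimization has to determine).

```python
import numpy as np

def max_min(M, W):
    """Max-min composition: result[n, d] = max over m of min(M[n, m], W[m, d])."""
    N, D = M.shape[0], W.shape[1]
    return np.array([[np.max(np.minimum(M[n], W[:, d])) for d in range(D)]
                     for n in range(N)])

# 2 viral strains x 3 mutations (made-up membership degrees)
M = np.array([[1.0, 0.0, 0.6],
              [0.2, 0.9, 0.0]])
# 3 mutations x 2 drugs (made-up resistance weights; columns: AZT, 3TC)
W = np.array([[0.8, 0.1],
              [0.3, 0.9],
              [0.5, 0.4]])
R = max_min(M, W)
print(R)  # [[0.8 0.4] [0.3 0.9]]
```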
Facing in vivo drug resistance and viral outcome prediction is a harder task.
The best way probably would be to have a predetermined rule system and
optimize its parameters: the problem is that the rules have to be found too.
Relying on the existing rules would avoid the input space partition search
problem; however the optimization would be biased and would not permit
finding new associations. Moreover, data bases collecting in-vivo analyses are still
fragmented, small in size and incomplete. The following models are a com-
promise due to this lack of training data. Two methods will be presented: the
first one - already known in the literature - uses the in-vitro predictor to infer con-
clusions for in-vivo treatments; the second is an extension of the in-vitro fuzzy
relational predictor introduced above, modified to capture in-vivo dynamics, but
it does not need in-vitro knowledge and results.
Existing Model
This model, proposed by Beerenwinkel in [7], is one of the few complete and
well documented studies in the literature that can be compared and/or integrated
with the system presented in the previous subsection. It is worth citing here
because it is not a standard machine learning technique and has many points
in common with the modelling assumptions of the fuzzy system.
The author assumes that in vitro genotype→phenotype predictions are a
reliable starting point for predicting in vivo cART efficacies. This is an arbitrary
hypothesis, and statistical studies between phenotype and clinical response
still do not give precise confidences today, but this application nevertheless
revealed promising results.
Assume a genotype→phenotype in vitro predictor is available: for instance
the fuzzy one described above, a Linear Regressor or a Support Vector Machine
regressor (the latter was the one chosen in [7]).
Analysis of phenotype predictions (among naive and treated patients)
shows large differences in range, location and deviation across models, but in
general reveals a bimodal nature of the distributions (resistance and susceptibil-
ity) across the whole set of drugs (even though for some drugs less clearly), as
can be seen in Figure 1. Thus Beerenwinkel models the probability density of
predicted phenotypes (y) using a two-component gaussian mixture model for
each drug:
αφ(y, µ1, σ1) + (1 − α)φ(y, µ2, σ2)
where φ(y, µi, σi) is the density of the normal distribution i with mean µi and
standard deviation σi, and α is the mixing parameter. Parameters are estimated
by the Expectation Maximization (EM) algorithm.
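A minimal EM fit of such a two-component mixture, on synthetic bimodal data; the quantile-based initialization is an implementation choice, not necessarily the one used in [7].

```python
import numpy as np

def em_two_gaussians(y, iters=200):
    """EM for the 1-D mixture alpha*N(mu1, s1) + (1 - alpha)*N(mu2, s2)."""
    mu1, mu2 = np.quantile(y, 0.25), np.quantile(y, 0.75)  # init from quantiles
    s1 = s2 = y.std()
    alpha = 0.5
    pdf = lambda x, m, s: np.exp(-0.5 * ((x - m) / s) ** 2) / (s * np.sqrt(2 * np.pi))
    for _ in range(iters):
        w1 = alpha * pdf(y, mu1, s1)
        w2 = (1 - alpha) * pdf(y, mu2, s2)
        g = w1 / (w1 + w2)                              # E-step: responsibilities
        alpha = g.mean()                                # M-step: update parameters
        mu1 = (g * y).sum() / g.sum()
        mu2 = ((1 - g) * y).sum() / (1 - g).sum()
        s1 = np.sqrt((g * (y - mu1) ** 2).sum() / g.sum())
        s2 = np.sqrt(((1 - g) * (y - mu2) ** 2).sum() / (1 - g).sum())
    return alpha, mu1, s1, mu2, s2

# synthetic bimodal phenotypes: susceptible mode near 0, resistant mode near 4
rng = np.random.default_rng(1)
y = np.concatenate([rng.normal(0.0, 0.5, 300), rng.normal(4.0, 0.5, 700)])
alpha, mu1, s1, mu2, s2 = em_two_gaussians(y)
print(mu1, mu2)  # component means recovered near 0 and 4
```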
Then a log-likelihood ratio is defined to decide whether a given phenotype
was more likely belonging to the resistant or susceptible subpopulation:
l(y) = log [ Pr(res|y) / Pr(sus|y) ]   (6)
l(y) = log [ (Pr(y|res) Pr(res) / Pr(y)) / (Pr(y|sus) Pr(sus) / Pr(y)) ]   (7)
l(y) = log [ φ(y, µ1, σ1) / φ(y, µ2, σ2) ] + log [ Pr(res) / Pr(sus) ]   (8)
Then l(y) is approximated by its tangent l′(y) at its zero y0.
Finally the probability score ps is introduced as the logistic function of l′(y)
for a given genotype with respect to a drug:
ps = 1 / (1 + e^(−l′(y))) ≈ Pr(res|y)   (9)
A scoring function can then be defined as an estimate of the activity of a
treatment against a given viral strain:
activity(d, x) = 1 / (1 + e^(l′(fd(x))))   (10)
where d is a drug, x is a genotype sequence and yd = fd(x) is the Log
Fold Change phenotype (prediction) for sequence x and drug d, having
activity(d, x) ≈ Pr(sus|yd).
To calculate a score that takes drug combinations into account, it is im-
portant to note that treatments with drugs coming from different drug classes
(at present there are the NRTi, NNRTi, PRi and Fi classes) benefit from synergic
effects, while combinations restricted to a single drug class are in general less
potent. Thus the score is taken as the maximum over drugs in the same class
and is additive across classes: for a mono-class restricted drug combination
Ci = {d1 . . . dn} ⊂ Di, where Di is the set of all available drugs in class i,
the overall score of a cART C composed of the class-restricted combinations Ck is
activity(C, x) = Σk activity(Ck, x)   (12)
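The whole scoring scheme (logistic single-drug activity from the linearized log-odds l′, maximum within a drug class, additive across classes) can be sketched as follows; the l′ values and the class grouping are made up.

```python
from math import exp

def activity_drug(l_prime):
    """Single-drug activity from the linearized log-likelihood ratio l'(f_d(x)), equation (10)."""
    return 1.0 / (1.0 + exp(l_prime))  # ~ Pr(sus | predicted phenotype)

def activity_cart(l_by_class):
    """Combination score: maximum within each drug class, additive across classes."""
    return sum(max(activity_drug(l) for l in ls) for ls in l_by_class.values())

# hypothetical l' values per drug, grouped by drug class
l_by_class = {"NRTi": [-2.0, 1.0], "PRi": [0.0]}
print(activity_cart(l_by_class))  # best NRTi activity plus the PRi activity
```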
progression was significantly shifted in time: in fact the more drugs one
keeps, the fewer chances the virus has to develop resistance, because it is
attacked at different replicative steps (depending on the drug class) and must
select cross-resistance mutations while also being in low concentrations.
Roughly speaking, drugs have different power: an old drug like Zidovudine
(AZT) is able to bring the viral load of a naive patient (i.e. with wild type virus)
down by 1 Log from the baseline after three months of single-drug therapy, while
Lopinavir (LPV) can provide a 2 Log reduction. Drug power is closely related
to pharmacodynamics and pharmacokinetics, but no precise values can be
obtained: think about Saquinavir (SQV), a Protease inhibitor that shows high
viral load reduction in vitro, but is badly absorbed by the human body; its power
however increases when taken together with small doses of Ritonavir (RTV).
A combined therapy of AZT+LPV for a naive patient will not lead to a
simple 1 + 2 Log reduction, but rather to a slightly lower value.
The goal is to obtain a unique indicator of cART activity taking into
account genotypic resistance and susceptibility, drug powers and combination
effects.
This leads again to a fuzzy formula:
aD = ∗d∈D ((¬rd ∧ pd) ∨ sd)   (13)
where aD is the overall activity for the cART D, d is a drug included in the
cART, rd is the viral genotypic resistance to drug d (its negation has the
meaning of an efficacy), pd is the power of the single drug d and sd is the viral
genotypic susceptibility to drug d. Clearly r and s are given by the above
matrix compositions. The power is coupled first with the efficacy and then
with the susceptibility, in order to take into account the fact that a drug can
be more powerful (thus more efficient) than usual in the presence of hyper-
susceptibility mutations.
The ∧ and ∨ operators are the algebraic norm and conorm, while the
aggregation ∗ is the Hamacher conorm:
⊥Hamacher(a, b, γ) = (a + b − ab − (1 − γ)ab) / (1 − (1 − γ)ab)   (14)
where a, b are two generic single-drug activities and γ > 0 is a parameter that
takes into account drug synergies (and if γ < 1 then ⊥Hamacher < ⊥Algebraic ).
The γ and pd parameters were not estimated, but provided by physicians and
virologists according to their experimental evidence.
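Equations (13) and (14) can be sketched as follows; iterating the conorm pairwise over the drugs is an implementation assumption, and γ = 1 recovers the algebraic conorm.

```python
def hamacher_conorm(a, b, gamma):
    """Hamacher conorm of equation (14); gamma > 0 tunes drug synergy."""
    return (a + b - a * b - (1 - gamma) * a * b) / (1 - (1 - gamma) * a * b)

def cart_activity(single_drug_activities, gamma=1.0):
    """Aggregate the per-drug terms ((not r ∧ p) ∨ s) of equation (13) with the Hamacher conorm."""
    acc = 0.0
    for a in single_drug_activities:
        acc = hamacher_conorm(acc, a, gamma)
    return acc

print(cart_activity([0.6, 0.5]))  # gamma = 1 reduces to the algebraic sum: 0.8
```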
Now there is a unique indicator in [0,1]. The next step is to transform it (as
was done for the phenotype prediction) into a Viral Load value: help comes
from [19], in which a system of differential equations - parameterized by drug
activities - models HIV-1 replication in the human body. This again permits
defining a loss function and optimizing it.
Fig. 2. Schematic summary of the dynamics of HIV-1 infection in vivo: shown
in the center is the cell-free virion population sampled in the plasma - image taken
from [46]
HIV-1 replication in the human body follows the process described in figure 2.
Drugs target different replication steps, as can be seen in figure 3. Several
mathematical models have been proposed; the one coming from Perelson [19] is:
dT/dt = s + pT(1 − T/Tmax) − dT T − kVi T   (15)
dT*/dt = (1 − ηRT) kVi T − δT*   (16)
dVi/dt = (1 − ηPR) NδT* − cVi   (17)
dVni/dt = ηPR NδT* − cVni   (18)
where T are uninfected CD4+ T cells, T* are infected CD4+ T cells, Vi
and Vni are infectious and non-infectious virions respectively, and the ηi are
the drug class efficacies. Cytotoxic response (CD8), latent and long-lived cells
are not included in the model, nor is the presence of multiple viral strains.
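Equations (15)-(18) can be integrated numerically with a simple forward-Euler scheme; all parameter values below are illustrative placeholders, not the estimates of [19].

```python
def simulate(eta_rt, eta_pr, days=30.0, dt=0.01):
    """Forward-Euler integration of equations (15)-(18); parameter values are illustrative only."""
    s, p, Tmax, d_T = 10.0, 0.03, 1500.0, 0.01   # CD4+ source, growth, carrying capacity, death
    k, delta, N, c = 2.4e-5, 0.5, 1000.0, 3.0    # infection rate, infected-cell death, burst size, clearance
    T, Ts, Vi, Vni = 1000.0, 10.0, 1e4, 0.0      # T, T*, infectious and non-infectious virions
    for _ in range(int(days / dt)):
        dT = s + p * T * (1 - T / Tmax) - d_T * T - k * Vi * T
        dTs = (1 - eta_rt) * k * Vi * T - delta * Ts
        dVi = (1 - eta_pr) * N * delta * Ts - c * Vi
        dVni = eta_pr * N * delta * Ts - c * Vni
        T, Ts, Vi, Vni = T + dt * dT, Ts + dt * dTs, Vi + dt * dVi, Vni + dt * dVni
    return T, Ts, Vi, Vni

# a fully effective protease inhibitor (eta_PR = 1) makes all new virions
# non-infectious, so the infectious pool Vi decays at the clearance rate c
T, Ts, Vi, Vni = simulate(eta_rt=0.0, eta_pr=1.0)
print(T, Ts, Vi, Vni)
```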
Unfortunately, information about T cells is too often missing. A simplified
model is:
dV/dt = (1 − α)cVeq − cV   (19)
where α is the overall cART activity, c is the viral clearance rate (corresponding
to a clearance time of about three days) and Veq is the Steady State Viral Load.
Solution of this equation is V(t) = (1 − α)Veq + (V(0) − (1 − α)Veq) e^(−ct), i.e. the viral load decays exponentially towards the new steady state (1 − α)Veq.
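The exponential solution of (19) can be checked against a direct numerical integration; taking c = 1/3 per day is an assumption matching the "about three days" clearance quoted above.

```python
from math import exp

def viral_load(t, V0, Veq, alpha, c=1.0 / 3.0):
    """Closed-form solution of dV/dt = (1 - alpha)*c*Veq - c*V (equation 19)."""
    Vinf = (1 - alpha) * Veq  # new steady state under the cART
    return Vinf + (V0 - Vinf) * exp(-c * t)

# forward-Euler check over 30 days
V, dt, alpha, Veq, c = 1e5, 0.001, 0.9, 1e5, 1.0 / 3.0
for _ in range(int(30 / dt)):
    V += dt * ((1 - alpha) * c * Veq - c * V)
print(V, viral_load(30.0, 1e5, 1e5, 0.9))  # numerical and analytic values agree closely
```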
5 Optimization Techniques
Simple optimization algorithms are often limited to regular convex func-
tions. Actually, most real problems lead to facing multi-modal, discontinuous,
non-differentiable functions. To optimize such functions, traditional
techniques use gradient-based algorithms [4], while newer approaches rely on
stochastic mechanisms: the latter base the search for the next point on
stochastic decision rules, rather than on deterministic processes.
Genetic Algorithms, Simulated Annealing and Random Searches (see again
[4]) are among these, and often are used either when the problems are difficult
to define, or when "comfortable" properties - such as differentiability - are missing.
There are two possible ways to integrate Fuzzy Logic and Genetic Algorithms.
The first involves the application of GAs to solve optimization and search
problems concerning fuzzy sets and rule systems; the second uses fuzzy tools
to model different components of a GA in order to improve its computational
performance. In this chapter both ways are discussed: a GA is used to mine
fuzzy rules, and the same GA is optimized through fuzzy tools.
Fuzzy Logic can be integrated into GAs at several levels:
Chromosome Representation: classical binary or real-valued representations can
be generalized to fuzzy sets and membership functions.
Crossover Operators: fuzzy operators can be used to design crossover operators
able to guarantee an adequate, parameterizable level of diversity in the
population, in order to avoid premature convergence.
Evaluation Criteria: uncertainty, vagueness, belief and plausibility measures
can be introduced to define more powerful fitness functions.
F, S, M Crossover Operators
E ≤ A ≤ Z ≤ ΥA ≤ ΥE ≤ ΥZ ≤ ⊥Z ≤ ⊥A ≤ ⊥E
Table 3. F family
Table 4. S family
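The operator definitions in Tables 3 and 4 are not reproduced here; as a hedged sketch, the F, S and M families can be read as gene-wise application of a t-norm, a t-conorm and an averaging function, respectively, to real-coded parents in [0, 1], in the spirit of the fuzzy-connective crossover of [17]. The resulting offspring are then ordered gene-wise, F-children ≤ M-children ≤ S-children:

```python
# Gene-wise fuzzy-connective crossover for chromosomes in [0, 1].
# F family: a t-norm (here the algebraic product); S family: the dual
# t-conorm (probabilistic sum); M family: a parameterized average.

def f_cross(x, y):
    return [a * b for a, b in zip(x, y)]

def s_cross(x, y):
    return [a + b - a * b for a, b in zip(x, y)]

def m_cross(x, y, lam=0.5):
    return [lam * a + (1 - lam) * b for a, b in zip(x, y)]

p1, p2 = [0.2, 0.8, 0.5], [0.6, 0.4, 0.5]
f_child, m_child, s_child = f_cross(p1, p2), m_cross(p1, p2), s_cross(p1, p2)
```

Using all three families together yields offspring spread between an "exploitative" child (F) and an "explorative" one (S), which is one way such operators control diversity in the population.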
Other approaches to crossover operators that are worth citing have been
presented: in [16], Soft Genetic Operators are introduced, while in [15]
Sanchez proposes the Template Fuzzy Crossover.
FGAs have proven more efficient than GAs in several scenarios. Performance
tests on the minimization of non-linear and non-differentiable problems for
GAs and FGAs are shown in [4, 16, 17].
In GAs, when the mutation rate is high (above 0.1), performance approaches
that of a primitive random search. The advantage of using such an optimization
algorithm is that it remains derivative-free and does not need to keep alive a
large population of solutions: just one is iteratively modified and evaluated.
As described in [4], Random Search explores the parameter space of an
objective function by sequentially adding random values to the solution vector
in order to find an optimal point that minimizes or maximizes the objective
function. Despite its simplicity, it has been proved to converge to the global
optimum.
Let f(x) be an objective function to be minimized and x the current vector
point. The basic algorithm iterates the following steps:
1. x is the current starting point
2. add a random vector dx to x and evaluate f (x + dx)
3. IF f (x + dx) < f (x) THEN set x = x + dx
4. IF (optimal f (x) or maximum number of iterations is reached) THEN
stop ELSE go to 2
Improvements to this primitive version are suggested in [4].
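The four steps above translate almost directly into code; the objective used below is a hypothetical non-differentiable test function, not one from the chapter:

```python
import random

def random_search(f, x0, step=0.5, iters=2000, seed=42):
    """Basic Random Search (steps 1-4 above): perturb x with a random
    vector dx and keep the move whenever it improves the objective."""
    rng = random.Random(seed)
    x, fx = list(x0), f(x0)
    for _ in range(iters):
        cand = [xi + rng.uniform(-step, step) for xi in x]
        fc = f(cand)
        if fc < fx:  # step 3: accept only improving moves
            x, fx = cand, fc
    return x, fx

# A non-differentiable test objective: f(x) = sum_i |x_i - 1|
best, fbest = random_search(lambda v: sum(abs(vi - 1) for vi in v), [5.0, -3.0])
```

Note that only an evaluation of f is needed, never a gradient, which is exactly why the method applies to discontinuous loss functions.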
5.3 Implementation
In the previous sections the models for in-vitro and in-vivo prediction were
defined. The aim of this section is the estimation of their parameters. For
each model there is a loss function to be minimized:
\mathcal{L}\big(f(R = M \circ W,\; S = M \circ \bar{W}),\, P\big) = \mathcal{L}(\hat{P}, P)
For the in-vitro model, f = f(R, S) = f(W, W̄) and p̂_{i,d} = tanh(r_{i,d} − s_{i,d}).
The same holds for the in-vivo model: there are the relational compositions
M ∘ W = R and M ∘ W̄ = S; then R and S are combined through equation 13 (to
get the overall cART activity α) and through equation 21 to get Viral Loads.
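As a sketch of the relational machinery, the classical sup-min (max-min) composition can combine a mutation relation M (genotypes by mutations) with a weight relation W (mutations by drugs); the actual composition operator used in the chapter is the one fixed in section 4.2, and the matrices below are invented toy data:

```python
# Sup-min (max-min) relational composition R = M ∘ W.

def max_min_compose(M, W):
    rows, inner, cols = len(M), len(W), len(W[0])
    return [[max(min(M[i][k], W[k][j]) for k in range(inner))
             for j in range(cols)] for i in range(rows)]

M = [[1.0, 0.0, 1.0],   # two genotypes over three mutations (toy data)
     [0.0, 1.0, 0.0]]
W = [[0.9, 0.1],        # per-mutation weights for two drugs (toy data)
     [0.3, 0.7],
     [0.2, 0.5]]
R = max_min_compose(M, W)   # genotype-by-drug relation
```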
The parameters to be estimated are, for both models, the values in the
relations W and W̄. The idea is to use either an FGA or an RS to minimize the
loss function (embedded in the fitness function) and to compare the different
approximate solutions obtained across runs. Unlike a derivative-based approach
(for instance Gradient Descent), in which the solutions can get stuck in local
optima, an evolutionary algorithm can in this scenario explore a wider set of
solutions with fewer constraints, while still keeping computational times
reasonable.
To define the chromosome coding, note that the W and W̄ relations are matrices
of real numbers in [0, 1]: these matrices can be used directly as chromosomes,
making it possible to apply fuzzy crossover operators directly; the FGA thus
searches for solutions within a population of weight matrices.
The loss functions described above tend to minimize the error or maximize the
correlation between observed and predicted vectors. They do not account for
the number of parameters used. In usual engineering scenarios the parameters
to be optimized are few and related to significant variables such as position,
speed and acceleration; in this biological framework, instead, there is a huge
number of variables (all the mutations in the viral genotype) and a
correspondingly large parameter space. Many of the input variables may not be
significant for the model, and many parameters mean that the system can easily
be overfitted. Feature Selection is closely related to Occam's principle,
probabilistically interpreted as the Minimum Description Length (MDL)
principle (see [6]), by which models that use fewer parameters are preferred
when prediction performance is equal. This is useful when dealing with
high-dimensional data sets, where many input attributes may be irrelevant or
redundant with respect to the dependent variables and act just as noise. By
allowing learning algorithms to focus only on highly predictive variables,
their accuracy can even be improved.
Feature Selection methods can be classified into two groups: Filter and
Wrapper methods (see again [6]). Filter methods usually rank each attribute
individually by some quality criterion (for example the p-value of a t-test,
mutual information values, et cetera) and then select the subset of
best-ranked attributes. Wrapper methods evaluate the performance of the
learning algorithm on different subsets of the attributes: an exhaustive
search, requiring 2^n variable subsets to be evaluated, is clearly not
feasible, so search algorithms use (greedy) heuristics to steer towards
promising subsets. One such search method is the Genetic Algorithm.
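A minimal sketch of one such greedy wrapper heuristic is forward selection: grow the subset one attribute at a time, keeping the addition that most improves the score. The scoring function and mutation names below are invented placeholders, not the chapter's evaluation procedure:

```python
def forward_select(features, score, max_size=3):
    """Greedy forward wrapper: add the attribute with the best score gain,
    stopping when no addition improves the (placeholder) score."""
    selected, best = [], score([])
    while len(selected) < max_size:
        gains = [(score(selected + [f]), f) for f in features if f not in selected]
        if not gains:
            break
        top, f = max(gains)
        if top <= best:
            break
        selected.append(f)
        best = top
    return selected, best

# Toy score: rewards a hypothetical "informative" pair, penalizes subset size
informative = {"m41L", "m184V"}
toy_score = lambda s: len(informative & set(s)) - 0.1 * len(s)
subset, s = forward_select(["m41L", "m70R", "m184V", "m215Y"], toy_score)
```

This evaluates O(n · k) subsets for a subset of size k instead of the 2^n of exhaustive search, at the price of possibly missing interacting attribute groups.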
The idea of Fuzzy Feature Selection arises from the AIC definition, extended
in fuzzy terms. While the AIC formula is fixed and selects variables only on
the basis of their statistical significance, families of parameterized fuzzy
functions taking into account both the number of parameters and the loss
function can be designed, in order to decide more flexibly how simple the
model has to be (i.e. how many parameters are included) jointly with its
goodness of fit. The fuzzy formulae are set up to select models with a high
squared correlation ρ² (or a low (Mean) Squared Error SE) and few parameters:
where v is related to the number of active parameters. The fuzzy set for v
can be defined as a parameterized function of the variable weights: the
closer they are to zero, the better, because then they do not participate in
the model. Excluding non-interactive operators (like min or the drastic
product), every fuzzy t-norm is admissible: the algebraic product was the
choice for this application. Two simple examples of membership functions are:
\mu_v(w) = 1 - \frac{1}{M} \sum_{i=1}^{M} e^{-\frac{w_i^2}{2\sigma^2}} \quad (23)

\mu_v(w) = 1 - \frac{\left|\{ w \in W : |w| < \sigma \}\right|}{M} \quad (24)
where | · | denotes the cardinality of the set. A reasonable value for σ is
0.01, as used in the application. The first function is smoother, while the
second cuts variables regardless of their weight, simply fixing a bound. For
SE, an analogous definition holds:
\mu_{Error}(SE) = 1 - \frac{1}{N} \sum_{i=1}^{N} e^{-\frac{(\hat{x}_i - x_i)^2}{2\sigma^2}} \quad (25)
where x̂_i and x_i are the predicted and observed values respectively; here σ
can be set to 0.5 (a choice that arose from the variability of plasma analyses
within ±0.5 Log). Finally, ρ² is by itself a goodness-of-fit indicator defined
in [0, 1]. The results in section 6 will show the advantages gained with the
Fuzzy Feature Selection.
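The membership functions (23)-(25) implement directly, with the σ values quoted in the text:

```python
import math

def mu_v_smooth(w, sigma=0.01):
    # eq. (23): one minus the mean Gaussian bump around zero
    return 1 - sum(math.exp(-wi ** 2 / (2 * sigma ** 2)) for wi in w) / len(w)

def mu_v_count(w, sigma=0.01):
    # eq. (24): one minus the fraction of weights inside the band |w| < sigma
    return 1 - sum(1 for wi in w if abs(wi) < sigma) / len(w)

def mu_error(residuals, sigma=0.5):
    # eq. (25) with residuals r_i = x_hat_i - x_i
    return 1 - sum(math.exp(-r ** 2 / (2 * sigma ** 2)) for r in residuals) / len(residuals)
```

Both parsimony memberships evaluate to 0 when all weights are (near) zero and approach 1 as more weights become active, so combining them with the error membership through a t-norm penalizes large, badly fitting models.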
6 Application
6.1 Phenotype Prediction
Genotype/phenotype pairs were collected from the public Stanford data base
[37] and from the VIRCO Laboratories [38]. Available data set sizes ranged
from 700 to 1000 pairs for the whole set of drugs {AZT, DDI, DDC, 3TC, TDF,
NVP, ABC, NFV, EFV, SQV, IDV, LPV, APV, D4T, DLV, RTV}, except for {TPV, ATV,
FTC}, for which sizes were in the order of 70 to 300. Data sets were split
into training (90%) and validation (10%) parts in order to assess the
robustness of the results. Viral nucleotide sequences were aligned to the
consensus B wild-type viral reference strain with a global alignment algorithm
(CLUSTALW), using high gap penalties for insertions and deletions. Mutations
were then extracted, also handling ambiguous sequencing.
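The global alignment step can be sketched with a minimal Needleman-Wunsch scoring recursion, the dynamic-programming family CLUSTALW builds on; the short amino-acid strings and the scoring parameters below are illustrative, not the chapter's actual settings:

```python
# Minimal Needleman-Wunsch global alignment score, with a high gap
# penalty as described in the text.

def global_align(a, b, match=2, mismatch=-1, gap=-5):
    n, m = len(a), len(b)
    # F[i][j]: best score aligning a[:i] with b[:j]
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = i * gap
    for j in range(1, m + 1):
        F[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s,   # (mis)match
                          F[i - 1][j] + gap,     # deletion in b
                          F[i][j - 1] + gap)     # insertion in b
    return F[n][m]

score = global_align("MKVLW", "MKLW")  # toy sequences
```

A high gap penalty, as used in the chapter, makes the aligner prefer substitutions over indels, which keeps mutation positions comparable to the reference strain.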
The M relation was filled according to the definition given in section 4.2.
All mutational positions were included in the system (550 in the RT gene and
330 in the PR gene), but only positions related to the corresponding drug
target were allowed to have a weight, i.e. only mutations in the Protease gene
were considered for Protease inhibitors, and likewise for Reverse
Transcriptase; this was done to respect the real biological mechanisms. The
Fuzzy Feature Selection function used in conjunction with either the FGA or
the RS was (Error is low) ∧ (v is low), as proposed in section 5.4.
Results
In order to assess performance, the Fuzzy System was trained and validated on
the whole set of commercial drugs for which phenotypic tests are available.
The system was then compared with a Linear Regressor, a literature standard
for genotype→phenotype prediction. Given the huge number of input variables,
the Linear Regressor was enhanced in two ways: first, a Singular Value
Decomposition (SVD) cut the quasi-collinear attributes; secondly, a stepwise
selection heuristic (starting from an input variable subset, variables are
added or removed according to the Akaike Information Criterion) was used to
reduce the number of variables. The performance indicator was the squared
linear correlation ρ² between the predicted and observed vectors, a widely
used measure in biology. While the validation performances of the three models
were not significantly different (Kruskal-Wallis rank sum test), i.e. the
models have the same predictive power, the Fuzzy Feature Selection method
selected a significantly lower number of input variables (p < 4 · 10⁻⁷,
Wilcoxon rank sum test). Table 5 summarizes the results. The SVD LR produced
robust models, but the interpretation of the weights is difficult, because
(even though cut in the decomposition) they are re-projected into the original
attribute space. The stepwise heuristic reduced the input attribute space, but
the resulting models possessed an order of magnitude more variables than the
Fuzzy engine. A simpler model has the advantage of being more understandable
and of pointing out features that can have real biological meaning. In fact,
for all the drugs, the Fuzzy model optimized with the Fuzzy Feature Selection
yielded a set of weights that matches medical hypotheses with high accuracy.
Table 6 shows estimated weights for three drugs that completely agree with the
list of resistance/susceptibility mutations approved by IAS/USA [36]. Note
that the Fuzzy Feature Selection function independently selected these among
more than 500 input variables. Usually Machine Learners are trained using only
the IAS/USA list.
Table 6. Estimated weights for AZT, IDV and EFV (EFV column shown; the AZT
and IDV columns are not reproduced here)

EFV
RT mutation   weight   effect
100I          0.85     resistance
101E          0.55     resistance
101P          0.9      resistance
101Q          0.5      resistance
103N          0.9      resistance
103S          0.8      resistance
108I          0.45     resistance
181C          0.35     resistance
184V          0.05     resistance
188L          0.95     resistance
190A          0.85     resistance
190S          0.95     resistance
221Y          0.3      resistance
225H          0.7      resistance
Fig. 4. Log Fold Change regression - Fuzzy Relational System + Feature Selection
- AZT Validation Set
Fig. 5. Log Fold Change regression - Fuzzy Relational System + Feature Selection
- ATV Validation Set
The data bases available were the five clinical trials {GART, HAVANA,
ARGENTA, ACTG 320, ACTG 364} and the retrospective cohort ARCA (taken from
[37, 39]): 1329 instances were selected, according to the following
constraints:
• Viral Load Equilibrium was the maximum viral load value ever observed
in a patient
Fig. 6. Log Fold Change regression - Fuzzy Relational System + Feature Selection
- SQV Validation Set
Fig. 7. Log Fold Change regression - Fuzzy Relational System + Feature Selection
- IDV Validation Set
• Baseline Viral Load had to be collected in the interval [−15, 7] days from
the therapy switch date
• Viral Genotype sequenced in the interval [−90, 30] days from therapy
switch date
• 12-Weeks Viral Load taken from 8 to 16 weeks after the therapy switch
date
For the clinical trials, the equilibrium viral load was not entirely reliable,
because patients were enrolled while already under treatment and little
information about each patient's history was available. Retrospective cohorts,
on the other hand, were often missing the baseline measure. Mutations were
extracted by aligning each patient's viral genotype with the consensus B
wild-type reference strain, as for the in-vitro tests. 12 ARVs were included
in the model: {AZT, 3TC, D4T, DDI, ABC, EFV, NVP, NFV, SQV, LPV, RTV, IDV}.
Data were split into training (90%) and validation (10%) sets. Furthermore, a
blind-validation set of 42 instances coming from different clinics was
considered.
Unlike the in-vitro scenario, in-vivo clinical data sets possess high
variability: this is due to patients' non-adherence, different drug absorption
levels, different boundary conditions (psychological state, co-infections,
...), and even systematic instrument errors and wrong data insertions.
Furthermore, Viral Loads (and CD4+ counts) are just a surrogate of the real
disease progression in the body, because they reflect only the viral strains
present in the peripheral blood (while the infection mostly acts in the
lymphatic tissues). Input attributes are chosen among a set of variables that
have been shown to be relevant in-vitro and in limited in-vivo statistical
tests, so they may not be the most predictive ones, or predictive at all. For
instance, an important piece of information such as the therapy history is
rarely recorded, and thus cannot be contemplated in a model. Before presenting
the results, it is useful to show the high variance in the follow-up Viral
Loads among observations that share the same input attributes. Figures 8 and 9
clearly illustrate the situation for two selected observation subsets: the
outcome distributions for patients that are wild-type (possess no mutations),
have the same Baseline Viral Load and take the same cARTs are almost flat.
Whatever Machine Learning technique is used, for such biased data the results
will be poor.
Fig. 8. 12-Weeks Viral Load Log distribution for 6 Wild Type patients under
AZT+3TC+SQV with a Baseline Viral Load of 5 Log
Fig. 9. 12-Weeks Viral Load Log distribution for 4 Wild Type patients under
AZT+3TC with a Baseline Viral Load of 4.75 Log
Results
The Fuzzy system was trained and validated under different parametric con-
ditions:
• zero-resistance/susceptibility model: a null model in which mutations are
assumed not to contribute to resistance or susceptibility (useful to compare
performances of the others)
• input variables taken from IAS/USA [36] list of relevant mutations (with
or without Fuzzy Feature Selection)
• input variables on the entire RT + PR genes (all mutational positions, all
amino acid substitutions) with Fuzzy Feature Selection
All the results are summarized in Table 8.

Fig. 10. In Vivo 12-Weeks Viral Load Log regression - Fuzzy model, no Feature
Selection - Mutations from IAS/USA - Validation Set - observed values on x
axis, predictions on y axis

The zero-resistance/susceptibility model, which assumes perfect inhibition
regardless of mutational profiles and relies just on drug powers and the
exponential Viral Load decrease, was the poorest predictor: the Fuzzy system
explains the data better. Figure 10 depicts the validation results (real
outcomes vs. predictions) after training the system on the IAS/USA list
without Feature Selection: the validation ρ² was poor, only 0.2488 (training
yielded ρ² = 0.36), and the weight matrices were quite unstable when executing
the algorithms several times with different starting points. Note that the
perfectly aligned bunch of points is due to saturation at the undetectability
threshold. Results improved when using the Fuzzy Feature Selection with the
IAS/USA list: Figures 11, 12 and 13 show the training and validation
performances (the latter also using the blind-validation set coming from the
different clinics), for which ρ² was always above 0.37. Different runs with
perturbed starting points yielded slightly different weight matrices and
included variables, but with low variability. Furthermore, the weights often
resembled
Fig. 11. In Vivo 12-Weeks Viral Load Log regression - Fuzzy model + Feature
Selection - Mutations from IAS/USA - Training Set - observed values on x axis,
predictions on y axis
Fig. 12. In Vivo 12-Weeks Viral Load Log regression - Fuzzy model + Feature
Selection - Mutations from IAS/USA - Validation set - observed values on x
axis, predictions on y axis
Fig. 13. In Vivo 12-Weeks Viral Load Log regression - Fuzzy model + Feature
Selection - Mutations from IAS/USA - Validation set from different clinics -
observed values on x axis, predictions on y axis
Table 9. Weight Matrices for the Fuzzy System. NRT stands for mutations in
Reverse Transcriptase targeted by Nucleoside/Nucleotide analogues, NNRT for
Non-Nucleoside RT inhibitors, PR for Protease inhibitors. Weights were forced
to zero for mutations in regions not targeted by the corresponding drug class,
in order to resemble physical behaviors.
medical hypotheses. Table 9 reports a small set of weights (amino acid
substitutions are not shown for clarity) that achieves ρ² = 0.39 in
validation; the emphasized terms disagree with medical hypotheses. Differently
from the phenotype prediction, the FGA did not manage to improve upon and
quickly escape from the zero-resistance/susceptibility model, remaining stuck
in this local optimum, while the RS was able to find better solutions in a
shorter time. The last test was made using the complete mutational regions in
RT and PR, forcing only mutations in RT not to interact with PR drugs and
vice-versa, and relying on the Fuzzy Feature Selection function for the
feature selection. The parameter search space was huge: around 400 mutations
and 3000 weights to be estimated (not counting amino acid substitutions, and
with just a thousand training examples). Training performance was optimal,
yielding ρ² = 0.6672, but the validation results did not improve, yielding
ρ² = 0.3553. The system was obviously over-parameterized and, even though the
Feature Selection was used, the weights in the relational matrices were
unstable, changing their values largely across different executions of the FGA
and RS algorithms. A final comparison was made with Beerenwinkel's model
described in section 4.3, which uses an in-vitro predictor: this system was
tested on a set of 96 therapy switch episodes (not coming from the sets used
here), using the overall activity score as a predictor of the 4-weeks
(28 ± 10 days) virological response (a very short-term response). Linear
least-squares regression analysis gave ρ² = 0.368. The validation settings
being different, it is not possible to compare the models directly: however,
since this model was trained with in-vitro data, it is at least an indication
that the two experimental settings are related.
6.3 Conclusions
The Fuzzy Relational system has been shown to be accurate, robust and compact
for in-vitro prediction: differently from other models such as Linear
Regressors or SVMs, it has the advantage of providing a meaningful explanation
of biological mechanisms and, joined with the Fuzzy Feature Selection
function, it selects the best models according to Occam's principle. Its
extension to in-vivo cART optimization and Viral Load prediction gives
encouraging results while still providing a compact model: moreover, its
derivation from the in-vitro framework and the comparison with Beerenwinkel's
model emphasize the relationship between in-vitro tests and in-vivo
treatments. However, for the purpose of pure therapy optimization, the
correlation results are still not satisfactory: the variance estimation, in
any case, shows how biased the clinical data are, due either to limited
attribute recording or to intrinsic variability in the human body.
The Fuzzy Relational system described in this chapter is designed for a
limited-data scenario. For instance, the differential equation system that
models viral reproduction had to be simplified by ignoring the contribution of
CD4+ T cells, because they were too often missing in the data bases. Mutations
were treated separately, because a preliminary clustering did not work due to
the large number of therapy combinations compared to the small number of
training instances. Therapeutic history, which could have a crucial role in
the learning process, is missing as well. However, the public availability of
clinical data bases is increasing today, as confirmed by the EuResist data
base [40] (a European project that aims to integrate several clinical data
bases on HIV and build a treatment decision system), as is the quality and
additional attribute recording in the data sets. In this perspective, it is
possible to design more complex models, while still aiming to model biological
mechanisms meaningfully and to handle uncertainty and vagueness. The Fuzzy
Relational system can be modified and extended to produce a rule set capable
of inferring predictions, at least eliminating the noise produced by
(previously) unseen significant attributes. An appropriate rule base for an
in-vivo HIV resistance/susceptibility FIS (Mamdani, ...), which still waits to
be trained with a sufficient amount of data, is defined in Table 10.
7 Acknowledgements
We want to thank the physician Andrea De Luca (Institute of Clinical
Infectious Diseases - Catholic University of Rome - UCSC) and the virologist
Maurizio Zazzi (University of Siena, Italy), who actively collaborated in this
study. Furthermore, we want to thank the ARCA consortium [39], which provided
the in-vivo retrospective data sets, Stanford University [37] for the in-vivo
clinical trials, and VIRCO [38] for the in-vitro genotype/phenotype data sets.
References
1. Klir G (1988) Fuzzy sets, uncertainty and information. Prentice-Hall, Englewood
Cliffs, NJ
2. Bandemer H (1992) Fuzzy data analysis. Kluwer Academic, Dordrecht
3. Michalewicz Z (1994) Genetic algorithms + data structures = evolution pro-
grams. AI Series. Springer, Berlin Heidelberg New York
4. Jang JSR, Sun CT, Mizutani E (1997) Neuro-fuzzy and soft computing. Prentice
Hall, Englewood Cliffs, NJ
5. Brunak S, Baldi P (2001) Bioinformatics: the machine learning approach. MIT,
Cambridge, MA
6. Witten IH, Frank E (2005) Data mining: practical machine learning tools and
techniques. Morgan Kaufmann, Los Altos, CA
7. Beerenwinkel N (2003) Computational analysis of HIV drug resistant data. PhD
Thesis, MPS-MPI for Informatics, University of Saarland, Saarbruecken, Ger-
many
8. Sanchez E (1977) Solutions in composite fuzzy relation equations. In: Gupta,
Saridis, Gaines (eds) Fuzzy automata and decision processes. North-Holland,
New York, pp 221–234
9. Sanchez E (1979) Medical diagnosis and composite fuzzy relations. In: Gupta,
Ragade, Yager (eds) Advances in fuzzy set theory and applications. North-
Holland, New York, pp 437–444
10. Adlassnig K, Kolarz G (1982) CADIAG-2: Computer-assisted medical diagnosis
using fuzzy subsets. In: Gupta, Sanchez (eds) Approximate reasoning in decision
analysis. North-Holland, New York, pp 203–217, 219–247
11. Zadeh L (1965) Fuzzy sets. Inf Control 8:338–353
12. Sanchez E (1984) Solution of fuzzy equations with extended operations. Fuzzy
Sets Syst 12:237–248
13. Mizumoto M (1989) Pictorial representations of fuzzy connectives, part II: cases
of compensatory operators and self-dual operators. Fuzzy Sets Syst 32:45–79
14. Pedrycz W (1993) Fuzzy relational equations. Fuzzy Sets Syst 59:189–195
15. Sanchez E (1993) Fuzzy genetic algorithms in soft computing environment.
Invited Plenary Lecture at the Fifth IFSA World Congress, Seoul
16. Voigt HM (1995) Soft genetic operators in evolutionary algorithms. In: Banzhaf
W, Eeckman FH (eds). Evolution and biocomputation. Lecture notes in com-
puter science, vol 899. Springer, Berlin Heidelberg New York, pp 123–121
17. Herrera F (1996) Dynamic and heuristic fuzzy connective based crossover oper-
ators for controlling the diversity and the convergence of real coded algorithms.
Int J Intell Syst 11:1013–1041
18. Lathrop R, Pazzani MJ (1999) Combinatorial optimization in rapidly mutating
drug-resistant viruses. J Combinatorial Optimiz 3:301–320
19. Perelson AS, Nelson PW (1999) Mathematical analysis of HIV-1 dynamics in
vivo. SIAM Rev 41:3–44
20. Beerenwinkel N, Schmidt B, Walter H, Kaiser R, Lengauer T, Hoffmann D,
Korn K, Selbig J (2001) Geno2pheno: interpreting genotypic HIV drug resistance
tests. IEEE Intell Syst Biol 16(6):35–41
1 Introduction
Evolutionary algorithms (EAs) are models based on the natural evolution of
individuals in a defined environment. They are especially useful for complex
optimization problems where the number of parameters is large and analytical
solutions are difficult to obtain.
Reference and survey texts in the field of evolutionary algorithms are
[8, 36], and recent advances in evolutionary computation are described in
[60]; these approaches attract increasing interest from both academia and
industry.
EAs can help to find the globally optimal solution over a domain, although
convergence to the global optimum may only be guaranteed in probability,
provided certain rather mild assumptions are met [1]. They have been applied
in different areas such as fuzzy control, path planning, modeling and
classification, etc. Their strength is essentially due to their updating a
whole population of possible solutions at each iteration; this is equivalent
to carrying out parallel explorations of the overall search space of a
problem.
New evolving systems, neuro-genetic systems, have become a very impor-
tant topic of study in evolutionary computation. As indicated by Yao et al.
A. Azzini and A.G.B. Tettamanzi: A New Genetic Approach for Neural Network Design, Studies
in Computational Intelligence (SCI) 82, 289–323 (2008)
www.springerlink.com c Springer-Verlag Berlin Heidelberg 2008
2 Evolving ANNs
There are several approaches to evolving ANNs and EAs are used to perform
various tasks, such as connection weight training, architecture design, learning
rule adaptation, connection weight initialization, rule extraction from ANNs,
etc. Three of them are considered as the most popular approaches at these
levels:
• Connection weights: this approach concentrates on weight optimization alone,
assuming that the architecture of the network is static. The evolution of
connection weights introduces an adaptive and global approach to training,
especially in the reinforcement learning and recurrent network learning
paradigms, where gradient-based training algorithms often experience great
difficulties.
• Learning rules: this approach can be regarded as a process of learning to
learn in ANNs, where the adaptation of learning rules is achieved through
evolution. It can also be regarded as an adaptive process of automatic
discovery of novel learning rules.
Weight Optimization
Several standard learning rules govern the speed and the accuracy with which
the network is trained and, when there is little knowledge about the most
suitable architecture for a given problem, the possibly dynamic adaptation of
the learning rules becomes very useful. Examples of such parameters are the
learning rate and the momentum, which can be difficult to assign by hand and
therefore become good candidates for evolutionary adaptation.
One of the first studies in this field was conducted by Chalmers [13]. The
aim of his work was to see if the well-known delta rule, or a fitter variant, could
be evolved automatically by a genetic algorithm. With a suitable chromosome
encoding and using a number of linearly separable mappings for training,
Chalmers was able to evolve a rule analogous to the delta rule, as well as
some of its variants. Although this study was limited to somewhat constrained
network and parameter spaces, it paved the way for further progress.
Several studies have been carried out in this direction: Merelo and
colleagues [35] present a search for the optimal learning parameters of
multilayer competitive-learning neural networks. Another work, based on
simulated annealing, is proposed by Castillo et al. in [12].
Architecture Optimization
There are two major ways in which EAs have been used to search for network
topologies: either all aspects of a network architecture are encoded into an
individual, or a compressed description of the network is evolved. The first
case defines a direct encoding, while the second leads to an indirect
encoding.
• Direct Encoding specifies each parameter of the neural network, and little
effort in decoding is required, since a direct transformation of genotypes
into phenotypes is defined. Several examples of this approach are shown in the
literature, as in [37, 56] and in [61], in which the direct encoding scheme is
used to represent ANN architectures and connection weights (including biases).
EP-Net [61] is based on evolutionary programming with several different
sophisticated mutation operators.
• Indirect Encoding requires a considerable effort for decoding the neural
network but, in some cases, the network can be pre-structured, using
restrictions to rule out undesirable architectures, which makes the search
space much smaller. Some sophisticated encoding methods are implemented based
on network parameter definitions. These parameters may represent the number of
layers, the size of the layers (i.e., the number of neurons in each layer),
the bias of each neuron and the connections among them. This kind of encoding
is an interesting idea that has been further pursued by other researchers,
such as Filho and colleagues [19] and Harp and colleagues [23]. Their method
is aimed at the choice of the architecture and connections, and uses a
representation which describes the main components of the networks, dividing
them into two classes, i.e., parameter and layer sections.
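A minimal sketch of a direct encoding follows: each potential connection of a fixed feed-forward net is one gene, a (connectivity bit, weight) pair, and decoding is a direct genotype-to-phenotype mapping. The 2-2-1 layer sizes and the genome values are invented for illustration:

```python
import math

SHAPE = [2, 2, 1]  # inputs, hidden units, outputs (illustrative)

def decode(genome):
    """Split the flat genome into per-layer matrices of (bit, weight) genes."""
    net, idx = [], 0
    for n_in, n_out in zip(SHAPE, SHAPE[1:]):
        layer = [[genome[idx + r * n_in + c] for c in range(n_in)]
                 for r in range(n_out)]
        idx += n_in * n_out
        net.append(layer)
    return net

def forward(net, x):
    """Evaluate the decoded network; a zero connectivity bit disables a link."""
    for layer in net:
        x = [math.tanh(sum(b * w * xi for (b, w), xi in zip(row, x)))
             for row in layer]
    return x

genome = [(1, 0.5), (0, 2.0), (1, -0.7), (1, 0.3),   # input -> hidden genes
          (1, 1.2), (1, -0.4)]                       # hidden -> output genes
out = forward(decode(genome), [1.0, -1.0])
```

Because the genome is a flat list, standard genetic operators can act on it directly, mutating connectivity bits (topology) and weight values independently.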
The design of an optimal NN architecture can be formulated as a search problem
in the architecture space, where each point represents an architecture. As
pointed out by Yao [59, 61, 62], given some performance (optimality) criteria
about architectures, e.g., minimum error, learning speed, lower complexity,
etc., the performance levels of all architectures form a surface in the design
space. Determining the optimal architecture design is then equivalent to
finding the highest point on this surface. There are several arguments which
make the case for using EAs to search for the best network topology [37, 53].
Pattern classification approaches [54] can also be found to design the network
structure, and constructive and destructive algorithms can be implemented
[59]. A constructive algorithm starts with a small network: hidden layers,
nodes, and connections are added to expand the network dynamically [61]. A
destructive algorithm starts with a large network: hidden layers, nodes, and
connections are then deleted to contract the network dynamically [41].
Stanley and Miikkulainen [53] presented a neuro-evolutionary method through
augmenting topologies (NEAT). This algorithm outperforms other solutions by
employing a principled method of crossover of different topologies, protecting
structural innovation using speciation, and incrementally growing from minimal
structure.
One of the most important forms of deception in ANN structure opti-
mization arises from the many-to-one and one-to-many mappings from
genotypes in the representation space to phenotypes in the evaluation space.
The existence of functionally equivalent networks with different encodings
makes the evolution inefficient. This problem is usually termed the
permutation problem [22] or the competing conventions problem [51]. It is clear
that the evolution of pure architectures has difficulties in evaluating fitness
accurately. As a result, the evolution would be very inefficient.
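The permutation problem can be demonstrated in a few lines: permuting the hidden neurons of a network changes the genotype but not the function computed (a minimal sketch; the shapes and the seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(3, 2))    # input -> hidden weights (3 hidden, 2 inputs)
W2 = rng.normal(size=(1, 3))    # hidden -> output weights

def forward(x, W1, W2):
    # One-hidden-layer network, biases omitted for brevity.
    return W2 @ np.tanh(W1 @ x)

perm = [2, 0, 1]                # reorder the hidden neurons
W1p, W2p = W1[perm, :], W2[:, perm]

x = np.array([0.5, -1.0])
# The two genotypes differ, yet the phenotypes compute the same function.
same = np.allclose(forward(x, W1, W2), forward(x, W1p, W2p))
```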
One solution to decrease the noisy fitness evaluation problem in ANNs struc-
ture optimization is to consider a one-to-one mapping between genotypes and
phenotypes of each individual. This is possible by considering a simultane-
ous evolution of the architecture and the network weights. The advantage of
combining these two basic elements of a NN is that a completely functioning
network can be evolved without any intervention by an expert.
Some methods that evolve both the network structure and connection
weights were proposed in the literature.
The ANNA ELEONORA algorithm [34] introduced a new genetic operator, called
GA-simplex [9], and an encoding procedure, called granularity encoding [32, 33],
which allows the algorithm to autonomously identify an appropriate
length of the coding string. Each gene consists of two parts:
the connectivity bits and the connection weight bits. The former indicate the
absence or presence of a link, and the latter encode the value of the
weight of that link. This approach employs four genetic operators (reproduc-
tion, crossover, mutation, and GA-simplex) and has been presented in both
parallel and sequential versions.
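A gene of this form can be decoded as sketched below; the field widths and the weight range are assumptions for illustration, since in ANNA ELEONORA the granularity encoding adapts the string length:

```python
def decode_gene(bits, weight_bits=8, w_min=-1.0, w_max=1.0):
    """Decode one gene: a leading connectivity bit followed by weight bits.

    Returns None for an absent link, otherwise the decoded weight.
    (Field widths and weight range are illustrative assumptions.)
    """
    connect, rest = bits[0], bits[1:1 + weight_bits]
    if connect == 0:
        return None                       # no link between the two neurons
    level = int("".join(map(str, rest)), 2)
    return w_min + (w_max - w_min) * level / (2 ** weight_bits - 1)

w = decode_gene([1, 1, 1, 1, 1, 1, 1, 1, 1])   # all-ones weight field
```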
296 A. Azzini and A.G.B. Tettamanzi
3 Neuro-Genetic Approach
This work presents an approach to the design of NNs based on EAs, whose
aim is both to find an optimal network architecture and to train the network
on a given data set.
The approach is designed to be able to take advantage of BP if that is
possible and beneficial; however, it can also do without it. The basic idea
is to exploit the ability of the EA to find a solution close enough to the
global optimum, together with the ability of the BP algorithm to finely tune
a solution and reach the nearest local minimum.
As indicated in Section 1, this research was primed by an industrial applica-
tion [4, 5], in which it is required to design neural engine controllers to be imple-
mented in hardware, with particular attention to reduced power consumption
and silicon area occupation. The validity of the resulting approach, however, is by
no means limited to hardware implementations of NNs. A second application
concerns brain wave signal processing [7], in particular a classification
algorithm for the analysis of the P300 Evoked Potential. Finally, a third appli-
cation considers a financial problem, whereby a factor model capturing the
mutual relationships among several financial instruments is sought [6].
The attention is restricted to a specific subset of the NN architectures,
namely the Multi-Layer Perceptron (MLP). MLPs are feedforward NNs with
one layer of input neurons, one layer of one or more output neurons and zero
or more ‘hidden’ (i.e., internal) layers of neurons in between; neurons in a
layer can take inputs from the previous layer only.
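This layer-to-layer restriction makes evaluation a simple chain of weighted sums and activations; a minimal sketch (the logistic activation and the example shapes are assumptions):

```python
import numpy as np

def mlp_forward(x, weights, biases):
    """Evaluate an MLP: each layer feeds only on the previous layer's output."""
    a = np.asarray(x, dtype=float)
    for W, b in zip(weights, biases):
        a = 1.0 / (1.0 + np.exp(-(W @ a + b)))   # logistic activation
    return a

# A [3, 1] phenotype (one hidden layer of 3 neurons, 1 output), 2 inputs.
W = [np.ones((3, 2)), np.ones((1, 3))]
b = [np.zeros(3), np.zeros(1)]
y = mlp_forward([0.0, 0.0], W, b)
```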
A peculiar aspect of this approach is that BP is not used as a genetic
operator, as is the case in some related work [11]. Instead, the EA optimizes
both the topology and the weights of the networks; BP is optionally used
to decode a genotype into a phenotype NN. Accordingly, it is the genotype
which undergoes the genetic operators and which reproduces itself, whereas
the phenotype is used only for calculating the genotype’s fitness.
The idea proposed in this work is close to the solution presented in EPNet
[61], an evolutionary system for evolving feedforward ANNs that puts the
emphasis on evolving ANN behaviors. Like EPNet, this neuro-genetic approach
evolves ANN architectures and connection weights simultaneously, in order
to reduce the noise in fitness evaluation.
A close behavioral link between parent and offspring is also maintained by
applying different techniques, such as weight mutation and partial training, in
order to reduce behavioral disruption. The first technique is attempted before
any architectural mutation; the second is employed after each architectural
mutation. Moreover, a hidden node is not added to an existing ANN at ran-
dom, but through splitting an existing node. In our work we carry out
weight mutation before topology mutation, since we want to perturb the con-
nection weights of the neurons in a neural network, and then carry out a
weight control in order to delete neurons whose contribution is negligible
with respect to the overall network output. This allows one to implement, if it is
The fitness of an individual depends both on its accuracy (i.e., its mse) and
on its cost. Although it is customary in EAs to assume that better individuals
have higher fitness, the convention that a lower fitness means a better NN
is adopted in this work. This maps directly to the objective function of the
genetic problem, which is a cost-minimization problem. The fitness is therefore
proportional to the value of the mse and to the cost of the considered
network. It is defined as
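A lower-is-better fitness combining mse and cost can be sketched as follows; the linear-combination form and the weight `lam` are assumptions, not the chapter's exact formula:

```python
def fitness(mse, cost, lam=0.5):
    """Lower-is-better fitness, increasing in both error and network cost.

    The convex-combination form and the weight lam are illustrative
    assumptions; any monotone combination of mse and cost would do.
    """
    return lam * cost + (1.0 - lam) * mse

# A cheaper network with the same accuracy gets the better (lower) fitness.
f_small = fitness(mse=0.10, cost=20)
f_large = fitness(mse=0.10, cost=80)
```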
3.4 Selection
In the evolution process, two important and strongly related issues are pop-
ulation diversity and selective pressure: an increase in the selective
pressure decreases the diversity of the population, and vice versa. As indi-
cated by Michalewicz [36], it is important to strike a balance between these
two factors, and sampling mechanisms attempt to achieve this goal. As
observed in that work, many of the parameters used in the genetic search
affect these factors. In this sense, as selective pressure is increased, the search
3.5 Mutation
with

    τ = 1 / (2 N_syn)^(1/3),          (6)
    τ = 1 / (4 (2 N_syn)^(1/3)).      (7)
In this work, two different kinds of threshold are considered and alternately
applied to the weight perturbation. The first is a fixed threshold, simply
defined as a parameter set before execution. The following pseudocode is
implemented in the mutation operator, applying a comparison between
that parameter and all weight-matrix values.
for i = 1 to l − 1 do
    if N_i > 1 then
        for j = 1 to N_i do
            if ||W_j^(i)|| <
the (i − 1)th layer and the (i + 1)th layer (to become the ith layer)
are rewired as follows: since W^(i−1) is a row vector and W^(i) is a column
vector, the result of the product of their transposes is an N_{i+1} × N_{i−1} matrix.
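For illustration, the rewiring that removes a one-neuron hidden layer can be sketched with NumPy; composing the two weight maps this way preserves only the layer's linear part, so the operator is approximate for nonlinear units (shapes and values below are arbitrary):

```python
import numpy as np

# Incoming weights of the layer being removed (1 neuron, 3 inputs) and its
# outgoing weights (2 neurons in the next layer). Values are illustrative.
W_in = np.array([[0.2, -0.5, 1.0]])      # shape (1, 3): one row per neuron
W_out = np.array([[0.7], [-0.3]])        # shape (2, 1): one column per neuron

# Rewire layer i-1 directly to layer i+1: the composition of the two maps,
# yielding an N_{i+1} x N_{i-1} matrix as described in the text.
W_new = W_out @ W_in                     # shape (2, 3)
```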
c) Insertion of a neuron: with probability p+_neuron, the jth neuron in
hidden layer i is randomly selected for duplication. A copy of it is
inserted into the same layer i as the (N_i + 1)th neuron; the weight
matrices are then updated as follows:
i. a new row is appended to W^(i−1), which is a copy of the jth row of
W^(i−1);
ii. a new column W^(i)_{N_i+1} is appended to W^(i), where

    W^(i)_j ← (1/2) W^(i)_j,          (9)
    W^(i)_{N_i+1} ← W^(i)_j.          (10)
The rationale for halving the output weights from both the jth neuron
and its copy is that, by doing so, the overall network behavior remains
unchanged, i.e., this kind of mutation is neutral.
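The neutrality claim can be checked numerically: duplicating a hidden neuron and halving the outgoing weights of both copies leaves the output unchanged (a sketch with arbitrary shapes and tanh units):

```python
import numpy as np

def forward(x, W_in, W_out):
    return W_out @ np.tanh(W_in @ x)     # one hidden layer, tanh units

rng = np.random.default_rng(1)
W_in = rng.normal(size=(3, 2))           # 3 hidden neurons, 2 inputs
W_out = rng.normal(size=(1, 3))
x = rng.normal(size=2)
y_before = forward(x, W_in, W_out)

j = 1                                    # neuron selected for duplication
W_in2 = np.vstack([W_in, W_in[j]])       # copy the jth row of incoming weights
col = W_out[:, [j]] / 2.0                # halve the outgoing weights ...
W_out2 = np.hstack([W_out, col])         # ... and give the copy the same half
W_out2[:, j] = col[:, 0]
y_after = forward(x, W_in2, W_out2)

neutral = np.allclose(y_before, y_after) # the mutation is behavior-preserving
```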
All three topology mutation operators are designed so as to minimize their
impact on the behavior of the network; in other words, they are designed to
be as little disruptive (and as much neutral) as possible.
3.6 Recombination
As indicated in [34], there has been some debate in the literature about the
opportunity of applying crossover to ANN evolution, based on the disruptive
effects it could introduce into the neural model. In this approach, two kinds of
crossover are independently implemented: the first is a kind of single-point
crossover with different cutting points; the second implements a kind of ver-
tical crossover, defining a merge operator between the topologies and weight
matrices of two parents in order to create the offspring.
Single-Point Crossover
(Figure: schematic of the single-point crossover, showing the input-weight
matrices iw and layer-weight matrices lw{i, j} of the two parents exchanged
at the cutting point.)
Vertical Crossover
(Figure: schematic of the vertical crossover, merging the input-weight
matrices iw and layer-weight matrices lw{i, j} of parents (a) and (b).)
The merged layer-weight matrix is block diagonal:

⎡ lw(a)_{1,1}  lw(a)_{1,2}  0            0           ⎤
⎢ lw(a)_{2,1}  lw(a)_{2,2}  0            0           ⎥
⎣ 0            0            lw(b)_{1,1}  lw(b)_{1,2} ⎦
The rationale for halving the weights is that, if both parents were ‘good’ networks, they would both supply the
appropriate input to the output layer; without halving it, the contribution
from the two subnetworks would add and yield an approximately double input
to the output layer. Therefore, halving the weights helps to make the operator
as little disruptive as possible.
The main problem in the crossover implementation is to maintain a
structural alignment between the neurons of each parent when a new offspring
is created; without alignment, disruptive effects could be introduced into
the neural model. Another important open issue, which also arises in
the two approaches implemented in this work, regards the initialization of
the connection-weight values at the merging point between the two selected
parents. In the first, single-point crossover operator, the new weight matrices
at the merging point of the offspring are initialized from a normal distribution,
while the corresponding variance matrices are set to matrices of all ones.
In the vertical crossover, the new weight matrices at the merging point are
defined as the block-diagonal matrix of the matrix of
the first parent and the matrix of the second parent. Also in this case, the new
connections will be initialized from a normal distribution.
In any case, the frequency and the size of crossover changes must be care-
fully selected in order to assure the balance between exploring and exploiting
abilities.
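The merge of two parents' layer-weight matrices can be sketched as a block-diagonal composition (a minimal illustration; `merge_layer` is a hypothetical name):

```python
import numpy as np

def merge_layer(Wa, Wb):
    """Vertical-crossover merge: block-diagonal composition of the two
    parents' layer-weight matrices (zeros leave the subnetworks unlinked)."""
    ra, ca = Wa.shape
    rb, cb = Wb.shape
    W = np.zeros((ra + rb, ca + cb))
    W[:ra, :ca] = Wa
    W[ra:, ca:] = Wb
    return W

Wa = np.array([[1.0, 2.0], [3.0, 4.0]])   # parent (a), 2x2 layer matrix
Wb = np.array([[5.0, 6.0]])               # parent (b), 1x2 layer matrix
W = merge_layer(Wa, Wb)                   # 3x4 block-diagonal matrix
```

The zero blocks keep the two parent subnetworks independent until subsequent mutation or training introduces cross-connections.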
4 Real-World Applications
As indicated in the previous sections, this approach has been applied to three dif-
ferent real-world problems. Section 4.1 describes an industrial application to
the design of neural engine controllers. The second application, presented in Section
4.2, concerns a neural classification algorithm for brain wave signal processing,
in particular the analysis of the P300 Evoked Potential. Finally, the third applica-
tion, presented in Section 4.3, considers a financial modeling problem, whereby a factor
model capturing the mutual relationships among several financial instruments
is sought.
Fig. 3. The experimental setup used for the fault diagnosis application: an AC
motor drive (rectifier, DC link, PWM inverter) monitored through a Hall-effect
transducer, feeding an A/D acquisition board, a wavelet processing stage, and a
GA-tuned ANN diagnosis stage
information about the output circuits of the drive and the motor. Instead, it
was proved that the operating condition of the AC motor will appear on the
AC side as a transient phenomenon or a sudden variation in the load power.
The presence of this electrical transient in the current suggests an approach
based on time-frequency or, better, time-scale analysis. In particular, the
Discrete Wavelet Transform (DWT) [52] can be used efficiently in the
process of separating the information.
The genetic approach involves the analysis of the signal—the load
current—through wavelet series decomposition. The decomposition results in
a set of coefficients, each carrying local time-frequency information. An or-
thogonal basis function is chosen, thus avoiding redundancy of information
and allowing for easy computation.
The computation of the wavelet series coefficients can be performed effi-
ciently with the Mallat algorithm [31]. The coefficients are computed by pro-
cessing the samples of the signal with a filter bank, where the coefficients of the
filters are peculiar to the chosen wavelet family. Figure 4 shows
the bandpass filtering, which is implemented as a lowpass g_i(n) and highpass
h_i(n) filter pair with mirrored characteristics.
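One analysis level of the filter bank can be sketched as convolution followed by downsampling by two; the Haar pair is used here for brevity in place of the 6-coefficient Daubechies wavelet of the application:

```python
import numpy as np

def dwt_level(signal, g, h):
    """One analysis level of the Mallat filter bank: lowpass g and highpass h,
    each followed by downsampling by 2 (approximation and detail coefficients)."""
    lo = np.convolve(signal, g)[1::2]    # keep every other sample
    hi = np.convolve(signal, h)[1::2]
    return lo, hi

s2 = 1.0 / np.sqrt(2.0)
g = np.array([s2, s2])                   # Haar lowpass
h = np.array([s2, -s2])                  # Haar highpass (mirrored pair)

x = np.array([1.0, 1.0, 2.0, 2.0])
approx, detail = dwt_level(x, g, h)      # constant pairs give zero detail
```

Iterating `dwt_level` on the approximation output yields the coarser decomposition levels whose maximum coefficients (w1 to w8) feed the classifier.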
In particular, in this application the 6-coefficient Daubechies wavelet [15]
was used. In Table 3 the filter coefficients of the utilized wavelet are reported.
The wavelet coefficients allow a compact representation of the signal and
the features of the normal operating or faulty conditions are condensed in the
wavelet coefficients. Conversely, the features of given operating modes can be
Fig. 5. A depiction of the logical structure of a case of study in the fault diagnosis
problem. The elements w1 to w8 are the maximum coefficients for each level of
wavelet decomposition; C indicates whether the case is faulty or not
recognized in the wavelet coefficients of the signal and the operating mode
can be identified.
Employing the wavelet analysis, both the identification of the drive operat-
ing conditions (faulty or normal operation) and the identification of the significant
parameters for the specific condition have been obtained.
Figure 5 depicts the logical structure of the data describing a case of study.
Each vector is known to have been originated in fault or non-fault conditions,
so it can be associated with a fault index C equal to 1 (fault condition), or 0
(normal condition).
This problem has already been approached with a neuro-fuzzy network,
whose structure was defined a priori and trained with BP [14], as indicated in
Figure 6.
Experiments
In this approach, both the network structure (topology) and the weights have
to be determined through evolution at the same time, as depicted in Figure 7.
The proposed model has to look for networks with 8 input attributes (the
features w1 to w8 of Figure 5, corresponding to the maximum coefficients
for each level of wavelet decomposition) and 1 output, the diagnosis C, which
will be interpreted as an estimate of the fault probability: zero means
a fault is not expected at all, whereas one means certainty that a fault is about
to occur.
The data used for learning have been obtained from a Virtual Test Bed
(VTB) model simulator of a real engine.
Several settings of five parameters (backpropagation bp, population size
n, and the three mutation probabilities relevant to structural mutation,
p+_layer, p−_layer, and p+_neuron) have been explored in order to assess
the robustness of the approach and to determine an optimal set-up. The
pcross parameter, which defines the crossover probability, is set to 0 for all
runs, because neither single-point crossover nor merge crossover gives
satisfactory results for this problem. All the other parameters are set to the
default values shown in Table 2.
All runs were allocated the same fixed amount of neural network exe-
cutions, to allow for a fair comparison between the cases with and without
backpropagation. The results are respectively summarized in Table 4 and in
Table 5. Ten runs are executed for each setting, of which the average and
standard deviation for the best solutions found are reported.
Results
A first comment can be made regarding the size of the population. In most
cases it is possible to observe that the solutions found with a larger population
are better than those found with a smaller population. With bp = 1, 15 settings
out of 27 give better results with n = 60, while with bp = 0, 19 settings out
of 27 give better results with the larger population.
The best solutions, on average, have been found with p+_layer = 0.1,
p−_layer = 0.2, and p+_neuron = 0.05, and there is a clear tendency for the
runs using backpropagation (bp = 1) to consistently obtain better-quality solutions.
The best model over all runs performed is a multi-layer perceptron with a
phenotype of type [3, 1], here specified without input layer.
Table 4. Experimental results for the engine fault diagnosis problem with BP = 1
setting  p+_layer  p−_layer  p+_neuron  n = 30 (avg, stdev)  n = 60 (avg, stdev)
1 0.05 0.05 0.05 0.11114 0.0070719 0.106 0.0027268
2 0.05 0.05 0.1 0.10676 0.003172 0.10606 0.0029861
3 0.05 0.05 0.2 0.10776 0.0066295 0.10513 0.0044829
4 0.05 0.1 0.05 0.10974 0.0076066 0.10339 0.0036281
5 0.05 0.1 0.1 0.1079 0.0067423 0.10696 0.0050514
6 0.05 0.1 0.2 0.10595 0.0035799 0.10634 0.0058783
7 0.05 0.2 0.05 0.10332 0.0051391 0.10423 0.0030827
8 0.05 0.2 0.1 0.10723 0.0097194 0.10496 0.0050782
9 0.05 0.2 0.2 0.10684 0.007072 0.1033 0.0031087
10 0.1 0.05 0.05 0.10637 0.0041459 0.10552 0.0031851
11 0.1 0.05 0.1 0.10579 0.0050796 0.10322 0.0035797
12 0.1 0.05 0.2 0.10635 0.0049606 0.10642 0.0042313
13 0.1 0.1 0.05 0.10592 0.0065002 0.10889 0.0038811
14 0.1 0.1 0.1 0.10814 0.0064667 0.10719 0.0054168
15 0.1 0.1 0.2 0.10851 0.0051502 0.11015 0.0055841
16 0.1 0.2 0.05 0.10267 0.005589 0.10318 0.0085395
17 0.1 0.2 0.1 0.10644 0.0045312 0.10431 0.0041649
18 0.1 0.2 0.2 0.10428 0.004367 0.10613 0.0052063
19 0.2 0.05 0.05 0.10985 0.0059448 0.10757 0.0045103
20 0.2 0.05 0.1 0.10593 0.0048254 0.10643 0.0056578
21 0.2 0.05 0.2 0.10714 0.0043861 0.10884 0.0049295
22 0.2 0.1 0.05 0.10441 0.0051143 0.10789 0.0046945
23 0.2 0.1 0.1 0.1035 0.0030094 0.1083 0.0031669
24 0.2 0.1 0.2 0.10722 0.0048851 0.1069 0.0050953
25 0.2 0.2 0.05 0.10285 0.0039064 0.1079 0.0045474
26 0.2 0.2 0.1 0.10785 0.008699 0.10768 0.0061734
27 0.2 0.2 0.2 0.10694 0.0052523 0.10652 0.0050768
Table 5. Experimental results for the engine fault diagnosis problem with BP = 0
setting  p+_layer  p−_layer  p+_neuron  n = 30 (avg, stdev)  n = 60 (avg, stdev)
1 0.05 0.05 0.05 0.14578 0.013878 0.13911 0.0086825
2 0.05 0.05 0.1 0.1434 0.011187 0.13573 0.013579
3 0.05 0.05 0.2 0.13977 0.014003 0.13574 0.010239
4 0.05 0.1 0.05 0.14713 0.0095158 0.13559 0.011214
5 0.05 0.1 0.1 0.14877 0.010932 0.13759 0.014255
6 0.05 0.1 0.2 0.14321 0.0095505 0.1309 0.012189
7 0.05 0.2 0.05 0.14304 0.014855 0.13855 0.0089141
8 0.05 0.2 0.1 0.13495 0.015099 0.13655 0.0079848
9 0.05 0.2 0.2 0.14613 0.010733 0.14165 0.013385
10 0.1 0.05 0.05 0.13939 0.013532 0.13473 0.0085242
11 0.1 0.05 0.1 0.13781 0.0094961 0.13991 0.012132
12 0.1 0.05 0.2 0.13692 0.017408 0.13143 0.012919
13 0.1 0.1 0.05 0.13348 0.009155 0.1363 0.013102
14 0.1 0.1 0.1 0.13785 0.013465 0.13836 0.0094587
15 0.1 0.1 0.2 0.14076 0.01551 0.13994 0.011786
16 0.1 0.2 0.05 0.1396 0.0098416 0.13719 0.016372
17 0.1 0.2 0.1 0.13597 0.012948 0.14091 0.014344
18 0.1 0.2 0.2 0.14049 0.013535 0.13665 0.011426
19 0.2 0.05 0.05 0.13486 0.0079435 0.14068 0.013874
20 0.2 0.05 0.1 0.13536 0.0112 0.12998 0.013489
21 0.2 0.05 0.2 0.13328 0.0087402 0.1314 0.0088282
22 0.2 0.1 0.05 0.13693 0.0096481 0.13456 0.012431
23 0.2 0.1 0.1 0.13771 0.015971 0.13939 0.0092643
24 0.2 0.1 0.2 0.13204 0.010325 0.1378 0.01028
25 0.2 0.2 0.05 0.14062 0.012129 0.14005 0.011195
26 0.2 0.2 0.1 0.14171 0.008802 0.13877 0.0094973
27 0.2 0.2 0.2 0.14216 0.015659 0.13965 0.015732
Brain-Computer Interfaces
signal processing that can be used to analyze brain waves. During the first
international meeting on BCI technology, Jonathan R. Wolpaw formalized the
definition of the BCI systems as follows:
A brain-computer interface (BCI) is a communication or control sys-
tem in which the user’s messages or commands do not depend on the
brain’s normal output channels. That is, the message is not carried by
nerves and muscles, and, furthermore, neuromuscular activity is not
needed to produce the activity that does carry the message [57].
According to this definition, BCI systems appear as a possible and some-
times unique mode of communication for people with severe neuromuscular
disorders such as spinal cord injury or cerebral paralysis. Exploiting the residual
functions of the brain may allow those patients to communicate.
The human brain has an intense chemical and electrical activity, partially
characterized by peculiar electrical patterns, which occur at specific times and
at well-localized brain sites. All of that is observable with a certain level of
repeatability under well-defined environmental conditions. These simple phys-
iological issues can lead to the development of new communication systems.
Problem Description
One of the electrical activities of the brain most utilized for BCI is the so-called
P300 Evoked Potential. This wave is a late-appearing component of an Event-
Related Potential (ERP), which can be auditory, visual, or somatosensory. It
has a latency of about 300 ms and is elicited by rare or significant stimuli
when these are interspersed with frequent or routine stimuli. Its amplitude is
strongly related to the unpredictability of the stimulus: the more unexpected
the stimulus, the higher the amplitude. This particular wave has been used
to make a subject choose between different stimuli [17, 18].
The general idea of Donchin's solution is that the patient is able to generate
this signal without any training. This is due to the fact that the P300 is
the brain's response to an unexpected or surprising event and is generated
naturally. Donchin developed a BCI system able to detect an elicited P300 by
signal-averaging techniques (to reduce the noise) and used a specific method
to speed up the overall performance. Donchin's idea has been adopted and
further developed by Beverina and colleagues at ST Microelectronics [10].
In this application, the neuro-genetic approach described in Section 3 has
been applied to the same dataset of P300 evoked potential used by Beverina
and colleagues for their approach on brain signal analysis based on support
vector machines.
Experiments
Results
The results obtained with the settings defined above are presented
in Table 6.
Due to the way the training set and the test set are used, it is not surprising
that error rates on the test sets look better than error rates on the training
sets. That happens because, in the case of bp = 1, the performance of a
network on the test set is used to calculate its fitness, which is used by the
evolutionary algorithm to perform selection. Therefore, it is only networks
whose performance on the test set is better than average which are selected
for reproduction. The best solution has been found by the algorithm using
backpropagation and is a multi-layer perceptron with one hidden layer with 4
Table 6. Error rates of the best solutions found by the neuro-genetic approach with
and without the use of backpropagation, averaged over 100 runs
bp  training FP (avg, stdev)  training FN (avg, stdev)  test FP (avg, stdev)  test FN (avg, stdev)
0 93.28 38.668 86.14 38.289 7.62 3.9817 7.39 3.9026
1 29.42 14.329 36.47 12.716 1.96 1.4697 2.07 1.4924
neurons, which gives 22 false positives and 29 false negatives on the training
set, while it commits no classification error on the test set.
The results obtained by the neuro-genetic approach, without any specific
tuning of the parameters, appear promising. To provide a reference, the
average number of false positives obtained by Beverina and colleagues with
support vector machines is 9.62 on the training set and 3.26 on the test set,
whereas the number of false negatives is 21.34 on the training set and 4.45
on the test set [43].
Experiments
In this application, the training and test sets are created by considering daily
returns for the period from 2 January 2001 to 30 November 2005. All data
are divided into two datasets, with 1000 cases for the training set and 231
cases for the test set, respectively. The validation set consists of the daily
returns for the period from 1 December 2005 to 13 January 2006.
All the parameters are set to the default values shown in Table 2, while
the pcross parameter is set to 0, because the two types of crossover defined give
unsatisfactory results in this application. This is probably due to the fact that,
whereas evolutionary algorithms are known to be quite effective in exploring
the search space, they are in general quite poor at closing in on a local optimum;
backpropagation, which is essentially a local optimization algorithm, appears
to complement the evolutionary approach well.
The time series of the training, test, and validation sets are preprocessed by
subtracting the 20-day moving average from all the components. Several runs
of this approach have been carried out in order to find optimal settings of the
genetic parameters p+_layer, p−_layer, and p+_neuron. For each run of the evolutionary
algorithm, up to 100,000 network evaluations (i.e., simulations of the network
on the whole training set) have been allowed, including those performed by
the backpropagation algorithm.
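The detrending step can be sketched as follows (a minimal illustration; the trailing-window convention and dropping the first window − 1 samples are assumptions):

```python
import numpy as np

def detrend(series, window=20):
    """Subtract the trailing moving average (20 days, as in the text) from a
    series; the first window - 1 values, lacking a full window, are dropped."""
    kernel = np.ones(window) / window
    ma = np.convolve(series, kernel, mode="valid")   # trailing averages
    return series[window - 1:] - ma

prices = np.arange(100, dtype=float)      # toy linearly rising series
resid = detrend(prices)                   # constant residual for a linear trend
```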
All simulations have been carried out with bp = 1; cases with bp = 0 have
not been considered, since no optimal results were obtained from those
simulations. The results are presented in Table 8, which reports the average
and the standard deviation of the test fitness values of the best solutions
found for each parameter setting over all runs.
The best solutions, on average, have been found with p+_layer = 0.2, p−_layer =
0.2, and p+_neuron = 0.2, although they do not differ significantly from other
solutions found with bp = 1.
The best model over all runs performed has been found by the algorithm
using backpropagation, and it is a multi-layer perceptron with a phenotype of
type [2, 1], specified without input layer, which obtained a mean square error
of 0.39 on the test set.
Results
One observation is that the approach is substantially robust with respect to
the setting of parameters other than bp.
Fig. 8. Comparison between the daily closing prices predicted by the best model
(dashed line) and actual daily closing prices (solid line) of the DJIA on the valida-
tion set.
where the w_i are the same weight values as those of the best solution.
The predictions obtained by the linear regression model are compared with
our best solution, as shown in Figure 9. The neuro-genetic solution
obtained with our approach has an mse of 1291.7, a better result compared to
the mse of 1320.5 of the prediction based on linear regression on the same
validation dataset.
The usefulness of such a model is evaluated with a paper simulation of a
very simple statistical arbitrage strategy, carried out on the same validation
set as the financial modeling. The strategy is described in detail in [6], and
the results show that the information given by a neural network obtained with
this approach would enable an arbitrageur to gain a significant profit.
Fig. 9. Comparison between the daily closing prices predicted by the best model
(dashed line), those predicted by the linear regression (dash-dotted line), and the
actual daily closing prices (solid line) of the DJIA on the validation set.
Future work will consider the study of the efficiency and robustness of
this approach when the input data are affected by uncertainty due to
errors introduced by measurement instrumentation. A further improve-
ment could come from the elimination of algorithm parameters, even though
this approach has been shown to be robust with respect to parameter
tuning.
Further study of new crossover designs, aimed at being as little disruptive
as possible, could improve the genetic algorithm implementation. The new
merge crossover implemented in this work seems to be a promising step in
that direction, even though, in its present form, its use did not boost the
performance of the algorithm significantly.
References
1. Aarts EHL, Eiben AE, van Hee KM (1989) A general theory of genetic
algorithms. Computing Science Notes. Eindhoven University of Technology,
Eindhoven
2. Abraham A (2004) Meta learning evolutionary artificial neural networks. Neuro-
computing 56:1–38
3. Angeline PJ, Saunders GM, Pollack JB (1994) An evolutionary algorithm that
constructs recurrent neural networks. IEEE Trans Neural Netw 5:54–65
4. Azzini A, Cristaldi L, Lazzaroni M, Monti A, Ponci F, Tettamanzi AGB (2006)
Incipient fault diagnosis in electrical drives by tuned neural networks. In: Pro-
ceedings of the IEEE instrumentation and measurement technology conference,
IMTC 2006, Sorrento, Italy. IEEE, April, 24–27
5. Azzini A, Lazzaroni M, Tettamanzi AGB (2005) A neuro-genetic approach to
neural network design. In: Sartori F, Manzoni S, Palmonari M (eds) AI*IA 2005
23. Harp S, Samad T, Guha A (1991) Towards the genetic synthesis of neural net-
works. Fourth international conference on genetic algorithms, pp 360–369
24. Harris L (2003) Trading and exchanges: market microstructure for practitioners.
Oxford University Press, Oxford
25. Holland JH (1975) Adaptation in natural and artificial systems. The University
of Michigan Press, Ann Arbor, MI
26. De Jong KA (1993) Evol Comput. MIT, Cambridge, MA
27. Keesing R, Stork DG (1991) Evolution and learning in neural networks: the
number and distribution of learning trials affect the rate of evolution. Adv Neu-
ral Inf Process Syst 3:805–810
28. Knerr S, Personnaz L, Dreyfus G (1992) Handwritten digit recognition by neural
networks with single-layer training. IEEE Trans Neural Netw 3:962–968
29. Koza JR (1994) Genetic programming. The MIT, Cambridge, MA
30. Leung EHF, Lam HF, Ling SH, Tam PKS (2003) Tuning of the structure and
parameters of a neural network using an improved genetic algorithm. IEEE
Trans Neural Netw 14(1):54–65
31. Mallat S (1999) A wavelet tour of signal processing. Academic, San Diego, CA
32. Maniezzo V (1993) Granularity evolution. In: Proceedings of the fifth interna-
tional conference on genetic algorithm and their applications, p 644
33. Maniezzo V (1993) Searching among search spaces: hastening the genetic evo-
lution of feedforward neural networks. In: International conference on neural
networks and genetic algorithms, GA-ANN’93, pp 635–642
34. Maniezzo V (1994) Genetic evolution of the topology and weight distribution of
neural networks. IEEE Trans Neural Netw 5(1):39–53
35. Merelo JJ, Patón M, Cañas A, Prieto A, Morán F (1993) Optimization of a com-
petitive learning neural network by genetic algorithms. IWANN93. Lect Notes
Comp Sci 686:185–192
36. Michalewicz Z (1996) Genetic algorithms + data structures = evolution programs.
Springer, Berlin Heidelberg New York
37. Miller GF, Todd PM, Hegde SU (1989) Designing neural networks using genetic
algorithms. In: Schaffer JD (ed) Proceedings of the third international confer-
ence on genetic algorithms, pp 379–384
38. Moller MF (1993) A scaled conjugate gradient algorithm for fast supervised
learning. Neural Netw 6(4):525–533
39. Montana D, Davis L (1989) Training feedforward neural networks using genetic
algorithms. In: Proceedings of the eleventh international conference on artificial
intelligence. Morgan Kaufmann, Los Altos, CA, pp 762–767
40. Mordaunt P, Zalzala AMS (2002) Towards an evolutionary neural network for
gait analysis. In: Proceedings of the 2002 congress on evolutionary computation,
vol 2. pp 1238–1243
41. Mozer MC, Smolensky P (1989) Using relevance to reduce network size automat-
ically. Connect Sci 1(1):3–16
42. Mühlenbein H, Schlierkamp-Voosen D (1993) The science of breeding and its
application to the breeder genetic algorithm (bga). Evol Comput 1(4):335–360
43. Palmas G (2005) Personal communication, November 2005
44. Palmes PP, Hayasaka T, Usui S (2005) Mutation-based genetic neural network.
IEEE Trans Neural Netw 16(3):587–600
45. Pedrajas NG, Martinez CH, Prez JM (2003) Covnet: a cooperative coevolu-
tionary model for evolving artificial neural networks. IEEE Trans Neural Netw
14(3):575–596
A New Genetic Approach for Neural Network Design 323
1 Introduction
General purpose neural network (NN) models such as multi-layer perceptrons
(MLPs) and radial basis function networks (RBFNs) have been applied to
many real-world problems. Although these models have very general utility,
the construction of a quality network can be time consuming. Practical prob-
lems faced by the modeller include the selection of model inputs, the selection
of model form, and the selection of appropriate parameters for the model
such as weights. The use of evolutionary algorithms (EAs) such as the ge-
netic algorithm provides scope to automate one or more of these decisions.
Traditional methods of combining EA and NN methodologies typically entail
the encoding of aspects of the NN model using a fixed-length binary or real-
valued chromosome. The EA is then applied to a population of chromosomes,
where each member of this population encodes a specific NN structure. The
population of chromosomes is evolved over time so that better NN structures
are uncovered. A drawback of this method is that the use of a fixed length
chromosome places a restriction on the nature of the NN models that can be
evolved by the EA. This study adopts an alternative approach using a novel
hybrid algorithm where evolutionary computation, in the form of grammatical
genetic programming, is used to generate an RBFN. This approach employs
a variable length chromosome which implies that the structure of the RBFN
is not determined a priori but rather is uncovered by means of an evolution-
ary process. This study represents the first application of a grammar-based genetic programming approach to the generation of RBFNs.
2 Grammatical Evolution
Grammatical Evolution (GE) is an evolutionary algorithm that can evolve
computer programs in any language [12–16] and it can be considered a form
of grammar-based genetic programming. GE has enjoyed particular success in
the domain of Financial Modelling [2] amongst numerous other applications
including Bioinformatics, Systems Biology, Combinatorial Optimisation and
Design [3, 4, 9, 11]. Rather than representing the programs as parse trees, as
in GP [1, 5–8], a linear genome representation is used. A genotype-phenotype
mapping is employed such that each individual’s variable length binary string
contains in its codons (groups of 8 bits) the information to select production
rules from a Backus Naur Form (BNF) grammar. The grammar allows the
generation of programs (or in this study, RBFN forms) in an arbitrary lan-
guage that are guaranteed to be syntactically correct. As such, it is used as a
generative grammar, as opposed to the classical use of grammars in compilers
to check syntactic correctness of sentences. The user can tailor the grammar
to produce solutions that are purely syntactically constrained, or they may in-
corporate domain knowledge by biasing the grammar to produce very specific
forms of sentences.
BNF is a notation that represents a language in the form of production
rules. It comprises a set of non-terminals that can be mapped to elements
of the set of terminals (the primitive symbols that can be used to construct the
output program or sentence(s)), according to the production rules. A simple
example of a BNF grammar is given below, where <expr> is the start symbol
from which all programs are generated. These productions state that <expr>
can be replaced with either one of <expr><op><expr> or <var>. An <op> can
become either +, -, or *, and a <var> can become either x, or y.
<expr> ::= <expr><op><expr> (0)
| <var> (1)
<op> ::= + (0)
| - (1)
| * (2)
<var> ::= x (0)
| y (1)
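As an illustration of the genotype-phenotype mapping described above, the following sketch (an assumed implementation, not taken from this chapter; the function name ge_map and the codon values are ours) repeatedly selects a production rule for the leftmost non-terminal using the current codon value modulo the number of available rules, wrapping the genome if it is exhausted:

```python
# Illustrative sketch of GE's genotype-to-phenotype mapping over the
# example grammar above. Each codon selects a production rule for the
# leftmost non-terminal via "codon mod number-of-choices".
GRAMMAR = {
    "<expr>": [["<expr>", "<op>", "<expr>"], ["<var>"]],
    "<op>":   [["+"], ["-"], ["*"]],
    "<var>":  [["x"], ["y"]],
}

def ge_map(codons, start="<expr>", max_wraps=2):
    """Map a list of integer codons to a sentence of the grammar."""
    seq, used, wraps = [start], 0, 0
    while True:
        # find the leftmost non-terminal; if none remain, mapping is complete
        i = next((k for k, s in enumerate(seq) if s in GRAMMAR), None)
        if i is None:
            return " ".join(seq)
        if used == len(codons):          # wrap the genome when it runs out
            used, wraps = 0, wraps + 1
            if wraps > max_wraps:
                raise ValueError("mapping did not terminate")
        choices = GRAMMAR[seq[i]]
        seq = seq[:i] + choices[codons[used] % len(choices)] + seq[i + 1:]
        used += 1
```

For instance, the codon sequence [4, 3, 2, 5, 9, 7] maps to the sentence x * y under this grammar.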
A Grammatical Genetic Programming Representation 327
A detailed description of GE can be found in O'Neill and Ryan (2003) [12]. Some more recent developments are covered in Brabazon & O'Neill (2005) [2].
Fig. 2. A radial basis function network. The output from each hidden node (H0 is
a bias node, with a fixed input value of 1) is obtained by measuring the distance
between each input pattern and the location of the hidden node, and applying the
radial basis function to that distance. The final output from the network is obtained
by taking the weighted sum (using w0, w1 and w5) of the outputs from the hidden
layer and from H0
Each hidden node transforms the distance between the input pattern and its centre using a radial basis function. This value represents the quality of the match
between the input vector and the location of that centre in the input space.
Each hidden node, therefore, can be considered as a local detector in the
input data space. The most commonly used radial basis function is a Gaussian
function. This produces an output value of one if the input and weight vectors
are identical, falling towards zero as the distance between the two vectors gets
large. A range of alternative radial basis functions exists, including the inverse
multi-quadratic function and the spline function.
The second phase of the model construction process is the determination
of the value of the weights on the connections between the hidden layer and
the output layer. In training these weights, the output value for each input
vector will be known, as will the activation values for that input vector at each
hidden layer node, so a supervised learning method can be used. The simplest
transfer function for the node(s) in the output layer is a linear function where
the network’s output is a linearly weighted sum of the outputs from the hidden
nodes. In this case, the weights on the arcs to the output node(s) can be found
using linear regression, with the weight values being the regression coefficients.
Sometimes it may be preferable to implement a non-linear transfer function
at the output node(s). For example, when the RBFN is acting as a binary
classifier it would be useful to use a sigmoid transfer function to limit outputs
to the range 0 → 1. In this case, the weights between the hidden and output
layer could be determined using the backpropagation algorithm.
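The sigmoid output option just described can be sketched as follows (a minimal illustration; the function names are ours, not the authors'):

```python
# Minimal sketch of a sigmoid output node for a binary-classifier RBFN:
# the linearly weighted sum of hidden-layer activations (plus the bias
# node H0, whose input is fixed at 1) is squashed into the range (0, 1).
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def rbfn_output(activations, weights, bias_weight):
    # weighted sum over the hidden-node activations, plus the bias term
    z = bias_weight + sum(w * a for w, a in zip(weights, activations))
    return sigmoid(z)
```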
Once the RBFN has been constructed using a training set of input-output
data vectors it can then be used to classify or to predict outputs for new
input data vectors, for which an output value is not known. The new input
data vector is presented to the network, and an activation value is calculated
for each hidden node. Assuming that a linear transfer function is used in the
output node(s), the final output produced by the network is the weighted sum
of the activation values from the hidden layer, where these weights are the
coefficient values obtained in the linear regression step during training. The
basic algorithm for the canonical RBFN is as follows:
i. Select the initial number of centres (m).
ii. Select the initial location of each of the centres in the data space.
iii. For each input data vector/centre pairing calculate the activation value
φ(||x − y||), where φ is a radial basis function and ||...|| is a distance
measure between input vector x and a centre y in the data space. As an
example, let d = ||x − y||. The value of a Gaussian RBF is then given by y = exp(−d²/(2σ²)), where σ is a modeller-selected parameter which determines the size of the region of input space a given centre will respond to.
iv. Once all the activation values for each input vector have been obtained,
calculate the weights for the connections between the hidden and output
layers using linear regression.
v. Go to step (iii) and repeat until a stopping condition is reached.
330 I. Dempsey et al.
vi. Improve the fit of the RBFN to the training data by adjusting some or
all of the following: the number of centres, their location, or the width of
the radial basis functions.
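Steps (i)-(iv) above can be sketched as follows (a pure-Python illustration under assumed data layouts, not the authors' implementation): Gaussian activations are computed for each input/centre pairing, and the output weights, including the bias-node weight, are fitted by linear regression via the normal equations:

```python
# Sketch of the canonical RBFN: fixed Gaussian centres, hidden-layer
# activations phi(d) = exp(-d^2 / (2 sigma^2)), and output weights
# obtained by least squares (the linear-regression step iv).
import math

def activations(x, centres, sigma):
    """Hidden-layer outputs for one input vector, with bias node H0 = 1."""
    out = [1.0]
    for c in centres:
        d2 = sum((xi - ci) ** 2 for xi, ci in zip(x, c))
        out.append(math.exp(-d2 / (2.0 * sigma ** 2)))
    return out

def solve(A, b):
    """Gaussian elimination with partial pivoting (for the normal equations)."""
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for k in range(col, n + 1):
                M[r][k] -= f * M[col][k]
    w = [0.0] * n
    for r in range(n - 1, -1, -1):
        w[r] = (M[r][n] - sum(M[r][k] * w[k] for k in range(r + 1, n))) / M[r][r]
    return w

def train_rbfn(X, t, centres, sigma):
    """Least-squares output weights: solve (H^T H) w = H^T t."""
    H = [activations(x, centres, sigma) for x in X]
    p = len(H[0])
    HtH = [[sum(h[i] * h[j] for h in H) for j in range(p)] for i in range(p)]
    Htt = [sum(h[i] * ti for h, ti in zip(H, t)) for i in range(p)]
    return solve(HtH, Htt)

def predict(x, centres, sigma, w):
    return sum(wi * hi for wi, hi in zip(w, activations(x, centres, sigma)))
```

Steps (v)-(vi), the outer loop that adjusts the number, location and width of the centres, would wrap these functions.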
As the number of centres increases, the predictive ability of the RBFN on the
training data will tend to increase, possibly leading to overfit and poor out-
of-sample generalisation. Hence, the object is to choose a sufficient number
of hidden layer nodes to capture the essential features in the training data,
without overfitting it.
4 GE-RBFN Hybrid
Despite the apparent dissimilarities between GE and RBFN methodologies,
the methods can complement each other. A practical problem in utilising
RBFNs is the selection of model inputs and model form. By defining an appro-
priate grammar, GE is capable of automatically generating a range of RBFN
forms. Hence, a combined GE-RBFN hybrid can be considered as embedding
both hypothesis generation and hypothesis optimisation components.
The basic operation of the GE-RBFN methodology is as follows. Initially,
a population of binary strings is randomly created. In turn, each of these is mapped to an RBFN structure using a grammar which has been constructed
specifically for the task of generating RBFNs (see next subsection). The qual-
ity of each resulting RBFN is then assessed using the training data. Based
on this information, the binary strings resulting in higher quality networks
are preferentially selected for survival and reproduction. Over successive iter-
ations, the quality of the networks encoded in the population of binary strings
improves.
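The generate-map-evaluate-select loop just described can be sketched as follows (illustrative only: map_to_rbfn and fitness are placeholders for the grammar mapping and the training-data evaluation, and the truncation selection, one-point crossover and single-codon mutation are simplifications, not the canonical GE operators):

```python
# Hedged sketch of the overall GE-RBFN loop: binary genomes (here, lists
# of 8-bit codons) are mapped to networks, scored, and the better genomes
# preferentially survive and reproduce.
import random

def evolve(pop_size=20, genome_len=32, generations=10,
           map_to_rbfn=None, fitness=None):
    pop = [[random.randint(0, 255) for _ in range(genome_len)]
           for _ in range(pop_size)]
    for _ in range(generations):
        scored = sorted(pop, key=lambda g: fitness(map_to_rbfn(g)),
                        reverse=True)
        parents = scored[:pop_size // 2]          # truncation selection
        children = []
        for _ in range(pop_size - len(parents)):
            a, b = random.sample(parents, 2)      # one-point crossover
            cut = random.randrange(1, genome_len)
            child = a[:cut] + b[cut:]
            i = random.randrange(genome_len)      # mutate one codon
            child[i] = random.randint(0, 255)
            children.append(child)
        pop = parents + children
    return max(pop, key=lambda g: fitness(map_to_rbfn(g)))
```

With placeholder arguments (e.g. the identity mapping and codon sum as fitness) the loop demonstrably increases population quality over the generations.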
4.1 Grammar
There are multiple grammars that could be defined in order to generate
RBFNs depending on exactly what the modeller wishes to evolve. For ex-
ample, if little was known about which inputs would be useful for the RBFN,
the grammar could be written so that GE selected which inputs to use, in
addition to selecting the form of the RBFN itself (the number of hidden layer
nodes, their associated weight vectors, the form of their associated radial basis
functions and so on).
In this study we define a grammar which permits GE to construct RBFNs
with differing numbers of centres. GE is also used to decide where to locate
those centres in the input space. The Backus Naur Form grammar for this is
as follows.
<RBFN> ::= 1 / (1 + exp(-<HL>))
Fig. 3. An output radial basis function network in the form of a derivation tree. Tree (A) represents the common structure of all RBFNs generated by the example grammar. Trees (B) and (C) represent the two possible sub-trees that can replace the <HL> non-terminal in (A). (B) represents the case where an <HL> becomes a single node, and (C) represents the case where <HL> becomes at least two nodes
Table 1. Problem instance statistics and the training and test set partition sizes in
each case
Dataset Training Test #variables #classes
Wisconsin 559 140 9 2
Pima 614 154 8 2
Thyroid 172 43 5 3
Australian 552 138 6 2
Bupa 276 69 6 2
Table 2. Results for GE/RBFN including average fitnesses for both in and out of
sample data sets along with standard deviation for the out of sample data
The developed GE-RBFN hybrid was applied to five benchmark instances from the UCI Machine Learning repository with encouraging results.
Substantial scope exists to further develop the RBFN-GE methodology
outlined in this chapter. In this initial study we did not include the selection
of inputs, or the selection of the form of the RBFs in the evolutionary process.
However, the RBFN grammar could be easily adapted in order to incorporate
these steps if required. The use of the GE methodology also opens up a variety
of other research avenues. The methodology applied in this study is based
on the canonical form of the GE algorithm. As already noted, a substantial
literature exists on GE covering such issues as the use of alternative search
engines for the algorithm, and the use of alternatives to the strict left-to-right
mapping of the genome (piGE). Future work could usefully examine the utility
of these GE variants for the purposes of evolving RBFNs.
References
1. Banzhaf W, Nordin P, Keller RE, Francone FD (1998) Genetic programming –
an introduction. On the automatic evolution of computer programs and its ap-
plications. Morgan Kaufmann, Los Altos, CA
2. Brabazon A, O’Neill M (2005) Biologically inspired algorithms for financial mod-
elling. Springer, Berlin Heidelberg New York
3. Cleary R, O’Neill M (2005) An attribute grammar decoder for the 01 multiCon-
strained knapsack problem. In: LNCS 3448 Proceedings of evolutionary com-
putation in combinatorial optimization EvoCOP 2005, Lausanne, Switzerland.
Springer, Berlin Heidelberg New York, pp 34–45
4. Hemberg M, O’Reilly U-M (2002) GENR8 – using grammatical evolution in a
surface design tool. In: Proceedings of the first grammatical evolution workshop
GEWS2002, New York City, New York, US. ISGEC, pp 120–123
5. Koza JR (1992) Genetic programming. MIT, Cambridge, MA
6. Koza JR (1994) Genetic programming II: automatic discovery of reusable pro-
grams. MIT, Cambridge, MA
7. Koza JR, Andre D, Bennett III FH, Keane M (1999) Genetic programming 3:
Darwinian invention and problem solving. Morgan Kaufmann, Los Altos, CA
8. Koza JR, Keane M, Streeter MJ, Mydlowec W, Yu J, Lanza G (2003) Genetic
programming IV: routine human-competitive machine intelligence. Kluwer
Academic, Dordrecht
9. Moore JH, Hahn LW (2004) Systems biology modeling in human genetics using
petri nets and grammatical evolution. In: LNCS 3102 Proceedings of the genetic
and evolutionary computation conference GECCO 2004, Seattle, WA, USA,
Springer, Berlin Heidelberg New York, pp 392–401
10. O’Neill M, Brabazon A (2004) Grammatical swarm. In: LNCS 3102 Proceedings
of the genetic and evolutionary computation conference GECCO 2004, Seattle,
WA, USA. Springer Berlin Heidelberg New York, pp 163–174
11. O’Neill M, Adley C, Brabazon A (2005) A grammatical evolution approach to
eukaryotic promoter recognition. In: Proceedings of Bioinformatics INFORM
2005, Dublin City University, Dublin, Ireland
1 Introduction
1.1 Artificial Neural Networks in Engineering
Artificial Neural Networks (ANNs) are amongst the most successful empir-
ical processing technologies to be used in engineering applications. ANNs
serve as an important function for engineering purposes such as modelling
and predicting the evolution of dynamic systems.
Hagan et al. [1] noted that the pioneering work in neural networks commenced in 1943, when McCulloch and Pitts [2] postulated a simple mathematical model to explain how biological neurons work; this is generally considered the first significant publication on the theory of artificial neural networks. ANN models have been widely applied to various engineering problems. For example, the prediction of water quality parameters
D. Cha et al.: A Neural-Genetic Technique for Coastal Engineering: Determining Wave-
induced Seabed Liquefaction Depth, Studies in Computational Intelligence (SCI) 82, 337–351
(2008)
www.springerlink.com © Springer-Verlag Berlin Heidelberg 2008
[3], generation of wave equations based on hydraulic data [4], soil dynamic
amplification analysis [5], tide forecasting using artificial neural networks
[6], prediction of settlement of shallow foundations [7], earthquake-induced
liquefaction [8], and ground settlement by deep excavation [9].
Unlike conventional approaches based on engineering mechanics, the main
requirement for accurate prediction using ANN models is an appropriate
database. With sufficient quality training data, ANNs can provide accurate
predictions for various engineering problems.
Generally speaking, Genetic Algorithms (GAs) are one of the various Com-
putational Intelligence (CI) technologies, which also include ANNs and Fuzzy
Logic.
Fundamental theories of GAs were established by Holland [10] in the early
1970s. Holland [10] was amongst the first to put computational evolution on
a firm theoretical footing. The GA’s main role is numerical optimisation in-
spired by natural evolution. GAs can be applied to an extremely wide range of
problems. The basic components of GAs are strings of binary values (sometimes real values) called chromosomes. A GA operates on a population of individuals (chromosomes), each representing a possible solution to a given problem. Each chromosome is assigned a fitness value based on a fitness function, and the GA's operation is based on selection, crossover and mutation.
Fig. 1. Comparison of a traditional ANN model (a) and an ANN model trained by
GAs (b) (MSE: Mean Squared Error)
Bjerrum [16] was possibly the first author to consider wave-induced liquefaction occurring in saturated seabed sediments. Later, Nataraja et al. [17] suggested a simplified procedure for ocean-based, wave-induced liquefaction analysis. Recently, Rahman [18] established the relationship between liquefaction and the characteristics of wave and soil. He concluded that liquefaction potential increases with a decrease in the degree of saturation and with an increase in wave period. Jeng [19] examined a wave-induced liquefied state for several different cases, together with Zen and Yamazaki's [20] field data. He found that
no liquefaction occurs in a saturated seabed, except in very shallow water,
for large waves and a seabed with very low permeability. For more advanced
poro-elastic models for the wave-induced liquefaction potential, the readers
can refer to Sassa and Sekiguchi [21] and Sassa et al. [22]. All aforementioned
investigations have been reviewed by Jeng [23]. However, most previous inves-
tigations for the wave-induced liquefaction potential in a porous seabed have
been based on various assumptions of engineering mechanics, which limits the application of these models to realistic engineering problems.
CI models for the estimation of the wave-induced liquefaction apply differ-
ent techniques to investigate coastal engineering problems, as compared to the
traditional engineering approach. Traditional engineering methods for wave-
induced liquefaction prediction always use deterministic models, which involve
output layer of neurons. In the present study, GAs are utilised to optimise the
weights of the network and adjust the interconnections to minimise its output
error. The approach can be applied to any network that has at least one hidden layer and is fully connected between successive layers. The goal of this procedure is to
obtain a desired output when certain inputs are given. The general network
error E is shown in (1), where Dx and Ox are the desired and actual output
values, respectively.
E(x) = (1/n) Σ_{x=1}^{n} (Dx − Ox)²    (1)
Since the error is the difference between the actual output and the target
output, the error depends on the weights, so we employed this error function
in the GA fitness function for optimising the weights instead of a standard,
iterative, gradient descent-based training method (2),
f(x) = Emax − E(x)    (2)
where, f (x) is the corresponding fitness and Emax is the maximum perfor-
mance error.
For the GA selection function, we used the Roulette wheel method
developed by Holland [10]. It is defined as
Pi = Fi / Σ_{i=1}^{S} Fi    (3)
where, Pi is the probability for each individual chromosome, S is the popula-
tion size, and Fi is the fitness value of each chromosome.
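Equations (1)-(3) can be sketched in code as follows (an illustrative implementation; the function names are ours, not the authors'):

```python
# Sketch of the network error (1), the derived fitness (2), and
# roulette-wheel selection (3): a chromosome index is drawn with
# probability proportional to its fitness.
import random

def network_error(desired, actual):                 # equation (1)
    n = len(desired)
    return sum((d - o) ** 2 for d, o in zip(desired, actual)) / n

def fitness(error, e_max):                          # equation (2)
    return e_max - error

def roulette_select(fitnesses, rng=random):         # equation (3)
    total = sum(fitnesses)
    r = rng.uniform(0.0, total)
    acc = 0.0
    for i, f in enumerate(fitnesses):
        acc += f
        if r <= acc:
            return i
    return len(fitnesses) - 1
```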
In GAs, two basic operators are employed and modified in the search mechanism: crossover and mutation. For example, the basic two-point crossover operator takes two parent chromosomes and produces two new individuals, whilst mutation alters one individual to produce a single new solution. Further discussion and details pertaining to
Crossover and Mutation settings in this research are presented in the results
section.
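The two operators can be sketched as follows (illustrative, assuming a binary encoding; these are not the specific settings used in this research):

```python
# Two-point crossover swaps the segment between two cut points of two
# parent chromosomes; mutation flips a single gene of one individual.
import random

def two_point_crossover(parent_a, parent_b, rng=random):
    i, j = sorted(rng.sample(range(1, len(parent_a)), 2))
    child_a = parent_a[:i] + parent_b[i:j] + parent_a[j:]
    child_b = parent_b[:i] + parent_a[i:j] + parent_b[j:]
    return child_a, child_b

def mutate(chromosome, rng=random):
    i = rng.randrange(len(chromosome))
    mutated = list(chromosome)
    mutated[i] = 1 - mutated[i]        # flip one bit (binary encoding)
    return mutated
```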
Poro-elastic model
Then substituting (9) into (4) and (6), the governing equation can be
rewritten as
(Kf/n)(εii + wii),i = ρf üi + (ρf/n) ẅi + (ρf g/kz) ẇi    (10)
If the acceleration terms are neglected in the above equation, it be-
comes the consolidation equation, which has been used in previous work [19].
Based on the wave-induced soil and fluid displacements, we can obtain the
wave-induced pore pressure, effective stresses and shear stresses. Detailed
information of the above solution can be found in [27].
Estimation of liquefaction
It has generally been accepted that when the vertical effective stress van-
ishes, the soil will be liquefied. Thus, the soil matrix loses its strength to carry any load, which consequently causes seabed instability. Based on the con-
cept of excess pore pressure, Zen and Yamazaki [20] proposed a criterion of
Training model
Number of inputs: 5
Number of output neurons: 1
Number of hidden neurons: 1 to 5
Learning rate: 0.5
Momentum factor: 0.2
Wave characteristics
Wave period (T): 8 s to 12.5 s
Wave height (H): 7.5 m to 10.5 m
Water depth (d): 50 m to 100 m
Soil characteristics
Soil permeability (Kz): 10^-4 and 5 × 10^-4 m/s
Seabed thickness (h): 10 to 80 m
Degree of saturation (S): 0.95 to 0.99
the training procedure, and the remaining data for validating the prediction
capability using the best run of each case.
In this paper, we not only use a correlation value (R²) for comparison between the database and the neural-genetic prediction but we also use the
where LAi and LPi are the liquefaction depths from the ANN model and the poro-elastic model, respectively; N is the total number of liquefaction depth data.
slightly better than those shown in Fig. 6, because the GA operation settings were based on those in Fig. 6(b). It is clearly shown that better results may be produced by varying the GA settings, with specific attention to increasing the number of generations and adjusting the crossover and mutation parameters.
Fig. 7. Comparisons of the wave-induced liquefaction depth by the approach versus
the poro-elastic model (Soil Permeability: 5 × 10−4 m/sec). (a) 6000 Generations.
(b) 8000 Generations
These results indicate that the performance of the neural-genetic model for the prediction of maximum wave-induced seabed liquefaction compares favourably with the results of previous authors [29]. In this study, three crossover and mutation functions were adopted in the neural-genetic model.
4 Conclusions
In this study, we adopted the concept of GA-based training of ANN models
in an effort to overcome the problems inherent in some ANN training pro-
cedures (i.e. gradient-based techniques) whilst providing accurate results for
determining maximum liquefaction depth in a real-world application.
Unlike the conventional engineering mechanics approaches, the neural-genetic technique is data-driven: it is based on statistical theory, can be built given a quality database, and can save time in configuring and adjusting the settings of an ANN model during the supervised training process.
In the proposed neural-genetic model, based on a physical understanding
of wave-induced seabed liquefaction, several important parameters, includ-
ing wave period, water depth, wave height, seabed thickness and the degree
of saturation, were used as the input parameters with constant soil perme-
ability, whilst the maximum liquefaction depth was the output parameter.
Experimental results demonstrate that the neural-genetic model is a successful
technique in predicting the wave-induced maximum liquefaction depth.
References
1. Hagan MT, Demuth HB, Beale M (1996) Neural network design. PWS, Boston,
MA
2. McCulloch WS, Pitts W (1943) A logical calculus of the ideas immanent in nervous activity. Bull Math Biophys 5:115–133
3. Maier HR, Dandy GC (1997) Modeling cyanobacteria (blue-green algae) in the
River Murray using artificial neural networks. Math Comput Simulation 43:
377–386
4. Dibike YB, Minns AW, Abbott MB (1999) Applications of artificial neural net-
works to the generation of wave equations from hydraulic data. J Hydraulic Res
37(1):81–97
5. Hurtado JE, Londono JE, Meza JE (2001) On the applicability of neural
networks for soil dynamic amplification analysis. Soil Dyn Earthquake Eng
21(7):579–591
6. Lee TL, Jeng DS (2002) Application of artificial neural networks in tide
forecasting. Ocean Eng 29(9):1003–1022
7. Mohamed AS, Holger RM, Mark BJ (2002) Predicting settlement of shallow
foundations using neural networks. J Geotech Geo Envir Eng 128(9):785–793
8. Jeng DS, Lee TL, Lin C (2003) Application of artificial neural networks in
assessment of Chi–Chi earthquake-induced liquefaction. Asian J Inf Technol
2(3):190–198
9. Leo SS, Lo HS (2004) Neural network based regression model of ground surface settlement induced by deep excavation. Automation in Construction 13:279–289
10. Holland J (1975) Adaptation in natural and artificial systems. University of
Michigan Press. (Second edition: MIT, Cambridge, MA, 1999)
11. Yao X (1993) Evolving artificial neural networks. Int J Neural Syst 4(3):203–222
1 Introduction
A Personal Communication Network is a wireless communication network which integrates various services such as voice, video and electronic mail, accessible from a single mobile terminal and for which the subscriber receives a single invoice. These services are offered in an area called the coverage zone, which is divided into cells. A base station installed in each cell manages all the communications within that cell. In the coverage zone, cells are connected to special units called switches, which are located in mobile switching centers (MSCs). When a user in communication moves from one cell to another, the base station of the new cell has the responsibility of relaying this communication by allotting a new radio channel to the user. Supporting the transfer of a communication from one base station to another is called handoff. This mechanism, which primarily involves the switches, occurs when the level of the signal received by the user reaches a certain threshold. We distinguish two types of handoff. In the case of Figure 1 for example, when a user moves from cell A to cell B, this is referred to as a soft handoff because these two cells are connected to the same switch. The MSC which supervises the two cells remains the same
A. Quintero and S. Pierre: On the Design of Large-scale Cellular Mobile Networks Using Multi-
population Memetic Algorithms, Studies in Computational Intelligence (SCI) 82, 353–377 (2008)
www.springerlink.com © Springer-Verlag Berlin Heidelberg 2008
Fig. 1. Assignment of cells to switches: cells A and B are attached to switch 1 and cell C to switch 2; a handoff between cells A and C is a complex handoff
and the induced cost is low. On the other hand, when the user moves from cell A to cell C, there is a complex handoff. The induced cost is high because both switches 1 and 2 remain active during the handoff procedure and the database containing information on subscribers must be updated.
The total operating cost of a cellular network includes two components: the
cost of the links between the cells (base station) and the switches to which they
are joined, and the cost generated by the handoffs between cells. It therefore appears intuitively preferable to connect cells A and C to the same switch if the frequency of handoffs between them is high. The problem
of assigning cells to switches essentially consists of finding the configuration
that minimizes the total operating cost of the network. The resolution of
this problem by an exhaustive search method would entail a combinatorial
explosion, and therefore an exponential growth of execution times.
Since assigning cells to switches in cellular mobile networks is an NP-hard problem, enumerative search methods are practically inappropriate for solving large-sized instances of this problem [1, 40]. Because they exhaustively examine the entire search space in order to find the optimal solution, they are only efficient for the small search spaces corresponding to small-sized instances of the problem. For example, for a network with m switches and n cells, m^n solutions would have to be examined.
Merchant and Sengupta [27] proposed the first heuristic to solve this problem. Their algorithm starts from an initial solution, which they attempt to improve through a series of greedy moves, while avoiding becoming trapped in a local minimum. The moves used to escape a local minimum explore a very limited set of options. These moves depend on the initial solution and do not necessarily lead to a good final solution. Other heuristic approaches have been developed for this kind of problem [1, 2, 7, 23, 39, 49].
yij = Σ_{k=1}^{m} zijk    for i, j = 1, . . . , n and i ≠ j.
The first term of the equation represents the link or cabling cost. The
second term takes into account the complex handoffs cost and the third, the
cost of simple handoffs. We should keep in mind that the cost function is
quadratic in xik , because yij is a quadratic function of xik . Let’s mention that
an eventual weighting could be taken into account directly in the link and
handoff costs definitions.
The capacity of a switch k is denoted Mk. If λi denotes the number of calls per unit of time directed to cell i, the limited capacity of the switches imposes
the following constraint:
Σ_{i=1}^{n} λi xik ≤ Mk    for k = 1, . . . , m    (3)
according to which the total load of all cells assigned to switch k must not exceed the capacity Mk of that switch. Finally, the constraints of the
problem are completed by:
(1), (3) and (4) are constraints of transport problems. In fact, each cell i could
be likened to a factory which produces a call volume λi . The switches are then
considered as warehouses of capacity Mk where the cells' production could be stored. Therefore, the problem is to minimize (2) under (1), and (3) to (6). When the problem is formulated in this way, it cannot be solved with a polynomial-time algorithm such as linear programming because constraint
(5) is not linear. Merchant and Sengupta [27, 28] replaced it by the following
equivalent set of constraints:
hij refers to the reduced cost per time unit of a complex handoff between cells
i and j. Relation (2) is then re-written as follows:
f = Σ_{i=1}^{n} Σ_{k=1}^{m} cik xik + Σ_{i=1}^{n} Σ_{j=1, j≠i}^{n} hij (1 − yij) + Σ_{i=1}^{n} Σ_{j=1, j≠i}^{n} Hij,
where the last (underbraced) term is constant,
subject to (1), (3), (4) and (7) to (10). In this form, the assignment problem
could be solved by usual programming methods such as integer programming.
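For illustration, the total cost of a candidate assignment and its capacity feasibility, following the formulation above, can be evaluated as follows (an assumed data layout, not the authors' code; the constant handoff term is omitted since it does not affect the optimisation):

```python
# Sketch evaluating a candidate assignment of cells to switches:
# cabling cost plus complex-handoff cost (handoffs between cells
# attached to different switches), with a capacity feasibility check.
def assignment_cost(assign, cable, handoff, load, capacity):
    """assign[i] = switch of cell i; cable[i][k] = link cost of cell i to
    switch k; handoff[i][j] = reduced handoff cost per time unit between
    cells i and j; load[i] = call volume of cell i; capacity[k] = M_k."""
    m, n = len(capacity), len(assign)
    # capacity constraint (3): total load per switch must not exceed M_k
    used = [0.0] * m
    for i, k in enumerate(assign):
        used[k] += load[i]
    if any(used[k] > capacity[k] for k in range(m)):
        return float("inf")               # infeasible candidate
    cabling = sum(cable[i][assign[i]] for i in range(n))
    complex_handoff = sum(handoff[i][j]
                          for i in range(n) for j in range(n)
                          if i != j and assign[i] != assign[j])
    return cabling + complex_handoff
```

A heuristic or evolutionary search would minimise this function over candidate assignment vectors.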
The total cost includes two types of cost, namely cost of handoff between
two adjacent cells, and cost of cabling between cells and switches. The design
is to be optimized subject to the constraint that the call volume of each
switch must not exceed its call handling capacity. This kind of problem is
NP-hard, so enumerative searches are practically inappropriate for moderate-
and large-sized cellular mobile networks [27].
Merchant and Sengupta [27, 28] studied this assignment problem. Their algorithm starts from an initial solution, which they attempt to improve through a series of greedy moves, while avoiding becoming stranded in a local minimum. The moves used to escape a local minimum explore only a very limited set of options; they depend on the initial solution and do not necessarily lead to a good final solution.
In [46], an engineering cost model has been proposed to estimate the cost
of providing personal communications services in a new residential develop-
ment. The cost model estimated the costs of building and operating a new
PCS using existing infrastructure such as the telephone, cable television and
cellular networks. In [14], economic aspects of configuring cellular networks
are presented. Major components of costs and revenues as well as the ma-
jor stakeholders were identified and a model was developed to determine the
system configuration (e.g., cell size, number of channels, link cost, etc.). For
example, in a large cellular network, it is impossible for a cell located in east
America to be assigned to a switch located in west America. In this case,
the variable link cost is ∞. The geographical relationships between cells and
switches are considered in the value of the cost of linking, so that the base
station of a cell is generally assigned to a neighbouring switch rather than to
a distant one [60]. In [15], different methods have been proposed to estimate the
handoff rate in PCS and the economic impacts of mobility on system configu-
ration decisions (e.g., annual maintenance and operations, channel cost, etc.).
The cost model used in this chapter is based on [14, 15, 46].
On the Design of Large-scale Cellular Mobile Networks 359
3 Memetic Approach
In the field of combinatorial optimization, it has been shown that com-
bining evolutionary algorithms with problem-specific heuristics can lead to
highly effective approaches [10, 48]. These hybrid evolutionary algorithms
combine the advantages of efficient heuristics incorporating domain knowl-
edge and population-based search approaches [32, 34]. For further details on
evolutionary algorithms, see [55, 56].
Genetic algorithms (GA) are robust search techniques based on natural selection
and genetic reproduction mechanisms. GAs perform a search by evolving
a population of candidate solutions through non-deterministic operators and
by incrementally improving the individual solutions forming the population
using mechanisms inspired from natural genetics and heredity (e.g., selection,
crossover and mutation). In many cases, especially with problems character-
ized by many local optima (graph coloring, travelling salesman, network design
problems, etc.), traditional optimization techniques fail to find high quality
solutions. GAs can be considered as an efficient and interesting option [22, 52].
GAs [18] are composed of a first step (initialization of the population)
and a loop. In each loop step, called generation, the population is altered
through selection and variation operators, then the resulting individuals are
evaluated. It is hoped that the last generation will contain a good solution,
but this solution is not necessarily the optimum [9].
Crossover is a process by which two chosen string genes are interchanged.
To execute the crossover, strings of the mating pool are coupled at random.
The crossover of a string pair of length l is performed as follows: a position i
is chosen uniformly between 1 and (l − 1), then two new strings are created
by exchanging all values between positions (i + 1) and l of each string of the
pair considered.
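The crossover just described can be sketched in Python (an illustrative sketch; the function name and string encoding are our own assumptions):

```python
import random

def one_point_crossover(a, b, i=None):
    """One-point crossover: choose a cut position i uniformly in
    [1, l-1], then exchange all values between positions i+1 and l
    of the two strings, creating two new strings."""
    assert len(a) == len(b)
    if i is None:
        i = random.randint(1, len(a) - 1)  # 1 <= i <= l-1
    return a[:i] + b[i:], b[:i] + a[i:]
```

For instance, crossing 2111321 and 1321212 at position i = 3 yields 2111212 and 1321321.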
Mutation is the process by which a randomly chosen bit in a chromosome
is flipped. It is employed to introduce new information into the population
and also to prevent the population from becoming saturated with similar
chromosomes (premature convergence). Large mutation rates increase the
probability that good schemata are destroyed, but they also increase population
diversity. A schema is a subset of chromosomes which are identical in certain fixed
positions [18, 22].
The next generation of chromosomes is generated from the present population
by selection and reproduction. The selection process is based on the fitness
of the present population, such that the fitter chromosomes contribute more
to the reproductive pool; typically this is also done probabilistically. For a
selection, inheritance is required; the offspring must retain at least some of
the features that made their parents fitter than average [18].
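Fitness-proportional selection is typically realized as a roulette wheel; the following sketch (our own minimal version, assuming positive fitness values) draws a chromosome with probability proportional to its fitness:

```python
import random

def roulette_select(population, fitness):
    """Roulette-wheel selection: spin a wheel whose slots are sized
    by fitness; fitter chromosomes are drawn more often."""
    r = random.uniform(0, sum(fitness))
    acc = 0.0
    for chrom, f in zip(population, fitness):
        acc += f
        if acc >= r:
            return chrom
    return population[-1]  # guard against floating-point round-off
```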
360 A. Quintero and S. Pierre
These mechanisms are subject to many variations, which gives tabu search its
meta-heuristic nature. The tabu list is not always a list of solutions, but can
be a list of forbidden moves/perturbations [16, 45].
Tabu search is a hill-climber endowed with a tabu list (a list of solutions or
moves) [54]. Let Xi denote the current point and let N (Xi ) denote all admissible
neighbors of Xi , where Y is an admissible neighbor of Xi if Y is obtained
from Xi through a single move not in the tabu list, and Y does not itself belong
to the tabu list. The tabu list is updated as Xi is replaced with the best point
in N (Xi ); the search stops after nbmax steps or if N (Xi ) is empty.
Other mechanisms of tabu search are intensification and diversification:
by the intensification mechanism, the algorithm does a more comprehensive
exploration of attractive regions which may lead to a local optimal point;
by the diversification mechanism, on the other hand, the search is moved to
previously unvisited regions, something that is important in order to avoid
getting trapped into local minimum points [16].
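The scheme above can be condensed into a short skeleton (an illustrative sketch over an abstract neighborhood function, not the exact procedure used later in this chapter):

```python
from collections import deque

def tabu_search(x0, neighbors, cost, nbmax=100, tabu_len=7):
    """Hill-climber with a tabu list: move to the best admissible
    neighbor, forbid returns to recently visited points, and stop
    after nbmax steps or when no admissible neighbor remains."""
    tabu = deque([x0], maxlen=tabu_len)
    best = current = x0
    for _ in range(nbmax):
        candidates = [y for y in neighbors(current) if y not in tabu]
        if not candidates:          # N(Xi) is empty
            break
        current = min(candidates, key=cost)
        tabu.append(current)
        if cost(current) < cost(best):
            best = current
    return best
```

Note that the best admissible neighbor is accepted even when it is worse than the current point, which is precisely what lets the search leave a local minimum.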
Canonical genetic algorithms are powerful and perform well on a broad class
of problems. However, part of the biological and cultural analogies used to
motivate a genetic algorithm search is inherently parallel.
One approach is the partitioning of the population into several subpopulations
(multi-population approach) [57]. The evolution of each subpopulation
is handled independently of the others, which helps maintain genetic diversity.
Diversity is the term used to describe the relative uniqueness of each individ-
ual in the population. From time to time, there is however some interchange
of genetic material between different subpopulations. This exchange of indi-
viduals is called migration [37]. Sometimes a topology is introduced on the
population, so that individuals can only interact with nearby chromosomes in
their neighborhood [20].
The parallel implementation of the migration model not only shows a
speedup in computation time, but also needs fewer objective function evaluations
when compared to a single-population algorithm for some classes of
problems [55]. Cohoon et al. [5] present results in which parallel algorithms
with migration found better solutions than a sequential GA for optimization
problems, and Lienig [26] indicates that parallel genetic algorithms in isolated
evolving subpopulations with migrations may offer advantages over sequential
approaches.
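A single migration step of such an island model can be sketched as follows; the ring topology, the minimization convention and the replacement of the worst individuals are assumptions of this illustration:

```python
def migrate(islands, rate=1):
    """Ring-topology migration: each island sends clones of its best
    'rate' individuals to the next island, which drops its worst ones.
    Individuals are plain numbers here; lower is better."""
    clones = [sorted(isl)[:rate] for isl in islands]     # emigrants stay home too
    migrated = []
    for i, isl in enumerate(islands):
        incoming = clones[(i - 1) % len(islands)]        # from the previous island
        kept = sorted(isl)[:len(isl) - rate]             # drop the worst
        migrated.append(kept + incoming)
    return migrated
```

Because the emigrants are clones, the sending island keeps its best individuals, matching the behaviour used later in this chapter.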
The migration algorithm is controlled by many parameters that affect its
efficiency and accuracy. Among other things, one must decide the number and
the size of the populations, the rate of the migration, the migration interval
and the destination of the migrants. The migration interval is the number of
generations between each migration, and the migration rate is the number
of individuals selected for migration. An important property of the architec-
ture used between the demes is its degree, which is the number of neighbors
4 Implementation Details
4.1 Memetic Algorithm Implementation
[Mutation changes one randomly chosen gene: A = 2111321 becomes A′ = 2113321.
Crossover cuts A = 211|1321 and B = 132|1212 after position 3 and exchanges
the tails, giving A′ = 2111212 and B′ = 1321321.]
Fig. 2. An example of mutation and crossover
The choice of the candidates is based on the evaluation function given by:
f = \sum_{i=1}^{n}\sum_{k=1}^{m} c_{ik}\,x_{ik} \;+\; \sum_{i=1}^{n}\sum_{\substack{j=1\\ j\neq i}}^{n} h_{ij}\,(1-y_{ij}) \qquad (12)
[Fig. 3: migration between subpopulations; the Pr worst individuals are replaced
by immigrant clones (before/after the exchange).]
This section presents the implementation details of the local refinement strategy
used to improve the individuals representing solutions provided by genetic
algorithms: tabu search.
To solve the assignment problem with tabu search, we have chosen a
search domain free from capacity constraints on the switches, but respect-
ing the constraints of unique assignment of cells to switches. We associate
with each solution two values: the first one is the intrinsic cost of the solution,
which is calculated from the objective function; the second is the evaluation
of the solution, which takes into account the cost and the penalty for not
respecting the capacity constraints. At each step, the solution that has the
best evaluation is chosen. Once an initial solution is built from the problem
Initialize_population(gen)
Evaluate_population(gen)
for each individual i ∈ gen do
    i = tabu_search(i)
end for
while not terminated do
    repeat
        select two individuals i, j ∈ gen
        apply Crossover(i, j), giving children c
        c = tabu_search(c)
        add children c to newgen
    until crossover = false
    for each individual i ∈ gen do
        if probability_mutation then
            i = tabu_search(Mutation(i))
            add i to newgen
        end if
    end for
    gen = Select_elitist(newgen)
    begin migration
        if migration appropriate then
            choose emigrants from population(gen)
            send clones of emigrants
        end if
        if immigrants available then
            Im = receive immigrants
        end if
    end migration
    gen = Select_elitist(Im ∪ gen)
end while
Fig. 4. Multi-population memetic algorithm with migration and elitism
data, the short-term memory component attempts to improve it, while avoiding
cycles. The middle-term memory component seeks to intensify the search
in specified neighbourhoods, while the long-term memory aims at diversifying
the exploration area.
The neighbourhood N (S) of a solution S is defined by all the solutions that
are accessible from S by applying a move a → b to S. a → b is defined as re-
assignment of cell a to switch b. To evaluate the solutions in the neighbourhood
N (S), we define the gain GS (a, b) associated to the move a → b and to the
solution S by:
G_S(a, b) =
\begin{cases}
\displaystyle\sum_{\substack{i=1\\ i\neq a}}^{n} (h_{ai}+h_{ia})\,x_{ib_0} \;-\; \sum_{\substack{i=1\\ i\neq a}}^{n} (h_{ai}+h_{ia})\,x_{ib} \;+\; c_{ab} - c_{ab_0} & \text{if } b \neq b_0 \\[6pt]
M & \text{otherwise}
\end{cases} \qquad (13)
where:
• hij refers to the handoff cost between cells i and j;
• b0 is the switch of cell a in solution S, that is, before the application of
move a → b;
• xik takes value 1 if cell i is assigned to switch k, 0 otherwise;
• cik is the cost of linking cell i to switch k;
• M is an arbitrary large number.
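With these definitions, the gain of Eq. (13) can be evaluated directly; representing S as a cell-to-switch array and h, c as matrices is an assumption of this sketch:

```python
def gain(S, a, b, h, c, M=10**9):
    """Gain G_S(a, b) of re-assigning cell a from its current switch
    b0 = S[a] to switch b. h[i][j] is the handoff cost between cells
    i and j, c[i][k] the cost of linking cell i to switch k."""
    b0 = S[a]
    if b == b0:
        return M                       # the move changes nothing
    g = 0
    for i in range(len(S)):
        if i == a:
            continue
        w = h[a][i] + h[i][a]
        if S[i] == b0:                 # x_{i,b0} = 1
            g += w
        if S[i] == b:                  # x_{i,b} = 1
            g -= w
    return g + c[a][b] - c[a][b0]
```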
The short-term memory moves iteratively from one solution to another,
by applying moves, while prohibiting a return to the k latest visited solutions.
It starts with an initial solution, obtained simply by assigning each cell to the
closest switch, according to a Euclidean distance metric. The objective of this
memory component is to improve the current solution, either by diminishing
its cost or by diminishing the penalties.
The middle-term memory component tries to intensify the search in
promising regions. It is introduced after the end of the short-term memory
component and allows a return to solutions we may have omitted. It mainly
consists in defining the regions of intensified search, and then choosing the
types of move to be applied.
To diversify the search, we use a long-term memory structure in order to
guide the search towards regions that have not been explored. This is often
done by generating new initial solutions. In this case, a table n×m (where n is
the number of cells and m the number of switches) counts, for each link (a, b),
the number of times this link appears in the visited solutions. A new initial
solution is generated by choosing, for each cell a, the least visited link (a, b).
Solutions visited during the intensification phase are not taken into account
because they result from different types of moves than those applied in the
short- and long-term memory components.
distance separating both [40]. The call rate γi of a cell i follows a gamma
distribution with mean and variance equal to one. The call durations inside the
cells are distributed according to an exponential law with parameter equal to
1 [8]. If a cell j has k neighbors, the [0,1] interval is divided into k + 1 sub-
intervals by choosing k random numbers distributed evenly between 0 and 1.
At the end of the service period in cell j, the call could be either transferred
to the ith neighbour (i = 1, . . . , k) with a handoff probability rij equal to the
length of ith interval, or ended with a probability equal to the length of the
k + 1th interval. To find the call volumes and the rates of coherent handoff,
the cells are considered as M/M/1 queues forming a Jackson network [25].
The incoming rates αi in the cells are obtained by solving the following system:

\alpha_i \;-\; \sum_{j=1}^{n} \alpha_j\, r_{ij} \;=\; \gamma_i, \qquad i = 1, \ldots, n
If the incoming rate αi is greater than the service rate, the distribution is
rejected and chosen again. The handoff rate hij is defined by:

h_{ij} = \lambda_i\, r_{ij}

The capacity M of each switch is set to:

M = \frac{1}{m}\left(1 + \frac{K}{100}\right) \sum_{i=1}^{n} \lambda_i

where K is uniformly chosen between 10 and 50, which ensures a global excess
of 10 to 50% of the switches' capacity compared to the cells' call volume.
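The system above is linear and can be solved directly; a sketch with NumPy, where identifying λi with the incoming rate αi is our assumption:

```python
import numpy as np

def incoming_rates(R, gamma):
    """Solve alpha_i - sum_j alpha_j * r_ij = gamma_i,
    i.e. the linear system (I - R) alpha = gamma."""
    return np.linalg.solve(np.eye(len(gamma)) - R, gamma)

def handoff_rates(alpha, R):
    """h_ij = lambda_i * r_ij, taking lambda_i = alpha_i."""
    return alpha[:, None] * R
```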
In the second step, we generate an initial population of 100 chromosomes.
In the third step, we evaluate each chromosome with the objective function,
which allows us to deduce its capacity value. Finally, in the last step, the cycle
of generations of the populations begins, each new population replacing the
previous one. The number of generations is initially set to 400.
To determine the number of subpopulations in parallel, MA was executed
over a set of 600 cases with 3 instances of problem in series of 20 runs for each
assignment pattern, with a number of populations varying between 1 and 10.
This experiment shows that MA converges to good solutions with a number
of populations varying between 7 and 10, as shown in Figure 5.
To define the population size, MA was executed over a set of 600 cases
with 3 instances of problem in series of 20 runs for each assignment pattern
with 8 populations. This experiment shows that MA converges to good
solutions with a population size varying between 80 and 140.
The values used by MA are: the number of generations is 400; the pop-
ulation size is 100; the number of populations is 8 for MA; the crossover
probability is 0.9; the mutation probability is 0.08; the migration interval
(Pm ) is 0.1; the migration rate (Sm ) is 0.4 and the emigrants accepted (Pr )
[Fig. 5: evaluation cost as a function of the number of populations in parallel
(1 to 10), for instances with 5 switches/100 cells, 6 switches/150 cells and
7 switches/200 cells.]
[Fig. 6: fitness value as a function of the number of generations (1 to 401),
for 0, 20, 40, 60 and 80 migrants.]
Merchant and Sengupta [27] have designed a heuristic, which we call H, for
solving the cell assignment problem. Pierre and Houéto [40] have used tabu
search (TS) for solving the same problem.
We compare TS and heuristic H with MA. For the experiments, we used
a number of cells varying between 100 and 1000, and a number of switches
varying between 5 and 10, which means the search space size is between 5^100
and 10^1000. In all our tests, the total number of evaluations remained the
same.
The three heuristics always find feasible solutions. However, this only
establishes the feasibility of the obtained results, without demonstrating
whether these solutions are among the best. Figure 7 shows the results
obtained for 5 different instances of the problem: 5 switches and 100 cells,
6 switches and 150 cells, 7 switches and 200 cells, 8 switches and 500 cells,
and 10 switches and 1000 cells. For each instance, we tested 16 different cases
whose evaluation costs represent the average over 100 runs of each algorithm.
[Fig. 7, panels (a)-(c): evaluation cost comparison of the memetic algorithm,
tabu search and heuristic H over 16 test cases; panel (c) corresponds to the
instance with 7 switches and 200 cells.]
[Fig. 7(d): evaluation cost comparison of the memetic algorithm, tabu search
and heuristic H over 16 test cases, for the instance with 8 switches and 500
cells.]
[Fig. 7(e): evaluation cost comparison of the memetic algorithm, tabu search
and heuristic H over 16 test cases, for the instance with 10 switches and 1000
cells.]
The three heuristics always find feasible solutions with objective values
close to the optimum solution. In all the considered series of tests, MA yields
an improvement in the cost function in comparison with the other two
heuristics. In terms of evaluation fitness, MA provides better results than
tabu search and heuristic H. Table 2 summarizes the results. Nevertheless,
given the initial link, handoff and annual maintenance costs for large-sized
cellular mobile networks (in the order of hundreds of millions of dollars),
this small improvement represents a large reduction in costs over a 10-year
period, in the order of millions of dollars. For example, in a cellular network
composed of 300 cells, with an initial link and handoff cost of $350,000 for each
cell, an improvement of 2% in the cost function represents an approximate
saving of $2M over 10 years.
In terms of CPU time, for a large number of cells, TS is a bit faster
than heuristic H. Conversely, for problems of smaller size, TS is a bit slower.
MA is slower than heuristic H and TS. However, this is not an important
fact, because this heuristic is used in the design and planning phase of cellular
mobile networks.

Table 3. Relative distances between the MA solutions and the lower bound

Instance of problem      5 switches  6 switches  7 switches  8 switches  10 switches
                         100 cells   150 cells   200 cells   500 cells   1000 cells
MA's mean distance (%)   1.35        1.94        2.19        3.97        4.84
which is the link cost of the solution obtained by assigning each cell i to the
nearest switch k. This lower bound does not take handoff costs into account.
In fact, we suppose that the capacity constraint is relaxed and that all cells
could be assigned to a single switch. Thus, we have a lower bound whatever
the values of Mk and λi . Table 3 summarizes the results.
MA gives good solutions in comparison with the lower bound. Note that
the lower bound does not include handoff costs and therefore no solution could
equal the lower bound.
6 Conclusion
In this chapter, we proposed a multi-population memetic algorithm with
elitism (MA) to design large-scale cellular mobile networks, and specifically
to solve the problem of assigning cells to switches in cellular mobile networks.
To select the offspring of the new generation, we have used the concept of
elitism; according to this concept, only the lowest-ranked string is deleted and
the best string is automatically kept in the population. Also, the migrants
are clones, and they are not deleted from their original populations. The local
refinement strategy used with our memetic algorithm is tabu search. Exper-
iments have been conducted to measure the quality of solutions provided by
this algorithm.
To evaluate the performance of this approach, we defined two lower bounds
for the global optimum, which are used as references to judge the quality of
the obtained solutions. Generally, the results are sufficiently close to the global
optimum.
References
[1] Beaubrun R, Pierre S, Conan J (1999) An efficient method for optimizing the
assignment of cells to MSCs in PCS networks. In: Proceedings of the eleventh
international conference on wireless communication, wireless 99, vol 1. Calgary
(AB), July 1999, pp 259–265
[2] Bhattacharjee P, Saha D, Mukherjee A (1999) Heuristics for assignment of cells
to switches in a PCSN: a comparative study. In: International conference on
personal wireless communications, Jaipur, India, February 1999, pp 331–334
[3] Cantu-Paz E (2000) Efficient and accurate parallel genetic algorithms. Kluwer
Academic, Dordrecht
[4] Ching-Hung W, Tzung-Pei H, Shian-Shyong T (1998) Integrating fuzzy knowl-
edge by genetic algorithms. IEEE Trans Evol Comput 2(4):138–149
[5] Cohoon J, Martin W, Richards D (1991) A multi-population genetic algorithm
for solving the K-partition problem on hyper-cubes. In: Proceedings of the
fourth international conference on genetic algorithms, pp 244–248
[6] Costa D (1995) An evolutionary Tabu Search algorithm and the NHL scheduling
problem. INFOR 33(3):161–178
[7] Demirkol I, Ersoy C, Caglayan MU, Delic H (2001) Location area planning
in cellular networks using simulated annealing. In: Proceedings of IEEE-
INFOCOM 2001, vol 1, 2001, pp 13–20
[8] Fang Y, Chlamtac I, Lin Y (1997) Modeling PCS networks under general call
holding time and cell residence time distributions. IEEE/ACM Trans Network
5(6):893–905
[9] Fogel D (1995) Evolutionary computation. Piscataway, NJ
[10] Fogel D (1995) Evolutionary computation: toward a new philosophy of machine
intelligence. IEEE, New York
[11] Fogel D (1999) An overview of evolutionary programming. Springer-Verlag,
Berlin Heidelberg New York, pp 89–109
[12] Fogel D (1999) An introduction to evolutionary computation and some
applications. Wiley, Chichester, UK
[13] Forrest S, Mitchell M (1999) What makes a problem hard for a genetic al-
gorithm? Some anomalous results and their explanation. Machine Learning
13(2):285–319
[14] Gavish B, Sridhar S (1995) Economic aspects of configuring cellular networks.
Wireless Netw 1(1):115–128
[15] Gavish B, Sridhar S (2001) The impact of mobility on cellular network
configuration. Wireless Netw 7(1):173–185
[59] Vavak F, Fogarty T (1996) Comparison of steady state and generational genetic
algorithms for use in non stationary environments. In: Proceedings of IEEE
international conference on evolutionary computation, pp 192–195
[60] Wheatly C (1995) Trading coverage for capacity in cellular systems: a system
perspective. Microwave J 38(7):62–76
A Hybrid Cellular Genetic Algorithm for the
Capacitated Vehicle Routing Problem
Summary. Cellular genetic algorithms (cGAs) are a kind of genetic algorithm (GA)
– a population-based heuristic – with a structured population, so that each individual
can only interact with its neighbors. The existence of small overlapped neighbor-
hoods in this decentralized population provides both diversity and opportunities
for exploration, while the exploitation of the search space is strengthened inside
each neighborhood. This balance between intensification and diversification makes
cGAs naturally suitable for solving complex problems. In this chapter, we solve a
large benchmark (composed of 160 instances) of the Capacitated Vehicle Routing
Problem (CVRP) with a cGA hybridized with a problem customized recombination
operation, an advanced mutation operator integrating three mutation methods, and
two well-known local search algorithms for routing problems. The studied test-suite
contains almost every existing instance for CVRP in the literature. In this work, the
best-so-far solution is found (or even improved) in 80% of the tested instances (126
out of 160), and in the other cases (20%, i.e. 34 out of 160) the deviation between our
best solution and the best-known one is always very low (under 2.90%). Moreover,
9 new best-known solutions have been found.
1 Introduction
Transportation plays an important role in logistic tasks of many companies
since it usually accounts for a high percentage of the value added to goods.
Therefore, the use of computerized methods in transportation often results in
significant savings of up to 20% of the total costs (see Chap. 1 in [1]).
A distinguished problem in the field of transportation consists in finding
the optimal routes for a fleet of vehicles which serve a set of clients. In this
problem, an arbitrary set of clients must receive goods from a central depot.
This general scenario presents many chances for defining (related) problem
scenarios: determining the optimal number of vehicles, finding the shortest
routes, and so on, all of them are subject to many restrictions like vehicle
capacity, time windows for deliveries, etc. This variety of scenarios leads to a
plethora of problem variants in practice. Some reference case studies where the
E. Alba and B. Dorronsoro: A Hybrid Cellular Genetic Algorithm for the Capacitated Vehicle
Routing Problem, Studies in Computational Intelligence (SCI) 82, 379–422 (2008)
www.springerlink.com
c Springer-Verlag Berlin Heidelberg 2008
380 E. Alba and B. Dorronsoro
view of this problem not biased by any ad hoc selection of individual instances.
The included instances are characterized by many different features: instances
from real world, theoretical ones, clustered, not clustered, with homogeneous
and heterogeneous demands on customers, with the existence of drop times
or not, etc.
In recent VRP history, there has been a steady evolution in the quality
of the solution methodologies used for this problem, borrowed both from the
exact and the heuristic fields of research. However, due to the difficulty of
the problem, no known exact method is capable of consistently solving to op-
timality instances involving more than 50 customers [1, 5]. In fact, it is also
clear that, as for most complex problems, non-customized heuristics would
not compete in terms of the solution quality with present techniques like the
ones described in [1, 6]. Moreover, the power of exploration of some modern
techniques like Genetic Algorithms or Ant Systems is not yet fully explored,
specially when combined with an effective local search step. All these con-
siderations could allow us to refine solutions to optimality. Particularly, we
present in this chapter a Cellular Genetic Algorithm (cGA) [7] hybridized with
specialized recombination and mutation operators and with a local search al-
gorithm for solving CVRP.
Genetic Algorithms (GAs) are heuristics based on an analogy with biolog-
ical evolution. A population of individuals representing tentative solutions is
maintained. New individuals are produced by combining members of the pop-
ulation, and they replace existing individuals according to some given policy.
In order to induce a lower selection pressure (larger exploration) compared to
that of panmictic GAs (here panmictic means that an individual may mate
with any other in the population – see Fig. 2a), the population can be decen-
tralized by structuring it in some way [8] (see Figs. 2b and 2c). Cellular GAs
are a subclass of GAs in which the population is structured in a specified topol-
ogy (usually a toroidal mesh of dimensions d = 1, 2 or 3), so that individuals
may only interact with their neighbors (see Fig. 2c). The pursued effect is to
improve on the diversity and exploration capabilities of the algorithm (due to
the presence of overlapped small neighborhoods) while still admitting an easy
Fig. 2. (a) Panmictic GA has all its individual black points in the same population.
Structuring the population usually leads to distinguish between (b) distributed, and
(c) cellular GAs
\mathrm{Cost}(R_i) = \sum_{j=0}^{k} c_{j,j+1} + \sum_{j=0}^{k} \delta_j, \qquad (1)

F_{\mathrm{CVRP}}(S) = \sum_{i=1}^{m} \mathrm{Cost}(R_i). \qquad (2)
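The costs of Eqs. (1) and (2) can be sketched as follows, assuming each route lists the depot as its first and last stop and that the drop times δ are given per visited node (conventions of this illustration, not the chapter):

```python
def route_cost(route, c, drop):
    """Cost of one route (Eq. 1): travel costs c[j][j+1] along
    consecutive stops plus the drop times of the visited nodes."""
    travel = sum(c[route[j]][route[j + 1]] for j in range(len(route) - 1))
    return travel + sum(drop[v] for v in route)

def fcvrp(routes, c, drop):
    """Total cost of a solution (Eq. 2): sum over its m routes."""
    return sum(route_cost(r, c, drop) for r in routes)
```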
tournament selection (BT) inside this neighborhood (line 6), while the other
parent will be the current individual itself (line 7). The two parents can be
the same individual (selection with replacement) or not. The algorithm iteratively
considers as “current” each individual in the grid. Genetic operators are applied
to the individuals in lines 8 and 9 to increasingly improve on the average
fitness of individuals in the grid also on a neighborhood basis (explained in
Sects. 3.2 and 3.3). We add to this basic cGA a local search technique in line
10 consisting in applying 2-Opt and 1-Interchange, which are well-known lo-
cal optimization methods (see Sect. 3.4). After applying these operators, we
evaluate the fitness value of the new individual (line 11), and insert it in the
new (auxiliary) population – line 12 – only if its fitness value is larger than
that of the parent located at that position in the current population (elitist
replacement).
After applying the above-mentioned operators to the individuals, we replace
the old population with the new one at once (line 15), and we then calculate
some statistics (line 16). It can be noticed that new individuals replace
the old ones en bloc (synchronously) and not incrementally (see [22] for other
replacement techniques). The algorithm stops when an optimal solution is
found or when an a priori predetermined maximum number of generations is
reached.
The fitness value assigned to every individual is computed as follows
[14, 15]:
f (S) = FCVRP (S) + λ · overcap(S) + µ · overtm(S), (3)
feval (S) = fmax − f (S). (4)
The objective of our algorithm is to maximize feval (S) (4) by minimizing
f (S) (3). The value fmax must be larger than or equal to that of
the worst feasible solution for the problem. Function f (S) is computed by
adding the total costs of all the routes (FCVRP (S) – see (2) –), and penalizes
the fitness value only in the case that the capacity of any vehicle and/or
A Hybrid cGA for the CVRP 385
the time of any route are exceeded. Functions ‘overcap(S)’ and ‘overtm(S)’
return the overhead in capacity and time of the solution (respectively) with
respect to the maximum allowed value of each route. These values returned
by ‘overcap(S)’ and ‘overtm(S)’ are weighted by multiplying them by factors
λ and µ, respectively. In this work we have used λ = µ = 1000 [23].
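The penalized fitness of Eqs. (3) and (4) can be sketched as follows; the per-route computation of load and duration is an assumption of this illustration:

```python
def fitness(routes, c, drop, capacity, max_time, demand,
            lam=1000, mu=1000, f_max=10**6):
    """f(S) = route costs + lam * overcap(S) + mu * overtm(S);
    the returned value is feval(S) = f_max - f(S), to be maximized."""
    cost = overcap = overtm = 0
    for r in routes:
        travel = sum(c[r[j]][r[j + 1]] for j in range(len(r) - 1))
        service = sum(drop[v] for v in r)
        cost += travel + service
        load = sum(demand[v] for v in r)
        overcap += max(0, load - capacity)             # capacity overhead
        overtm += max(0, travel + service - max_time)  # time overhead
    return f_max - (cost + lam * overcap + mu * overtm)
```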
In Sects. 3.1 to 3.4 we proceed to explain in detail the main features that
characterize our algorithm (JCell). The algorithm itself can be applied with
all the mentioned operations and also applying only some of them to analyze
their separate contribution to the performance of the search.
3.2 Recombination
3.3 Mutation
The mutation operator we use in our algorithms will play an important role
during the evolution since it is in charge of introducing a considerable degree
of diversity in each generation, counteracting in this way the strong selective
pressure which is a result of the local search method we plan to use. The
mutation consists in applying Insertion, Swap or Inversion operations to each
gene with equal probability (see Algorithm 3.2).
These three mutation operators (see Fig. 5) are well-known methods
found in the literature, and typically applied sooner than later in routing
problems. Our idea here has been to merge these three in a new com-
bined operator. The Insertion operator [25] selects a gene (either customer
or route splitter) and inserts it in another randomly selected place of the
same individual. Swap [26] consists in randomly selecting two genes in a
solution and exchanging them. Finally, Inversion [27] reverses the visiting
order of the genes between two randomly selected points of the permuta-
tion. Note that the induced changes might occur in an intra or inter-route
way in all the three operators. Formally stated, given a potential solution
S = {s1 , . . . , sp−1 , sp , sp+1 , . . . , sq−1 , sq , sq+1 , . . . , sn }, where p and q are ran-
domly selected indexes, and n is the sum of the number of customers plus the
number of route splitters (n = c + k), then the new solution S obtained after
applying each of the different proposed mechanisms is shown below:
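As an illustration of the three mechanisms (with our own parameterization; passing p and q explicitly makes the moves deterministic):

```python
import random

def insertion(s, p=None, q=None):
    """Remove the gene at position p and re-insert it at position q."""
    s = list(s)
    p = random.randrange(len(s)) if p is None else p
    q = random.randrange(len(s)) if q is None else q
    s.insert(q, s.pop(p))
    return s

def swap(s, p=None, q=None):
    """Exchange the genes at positions p and q."""
    s = list(s)
    p = random.randrange(len(s)) if p is None else p
    q = random.randrange(len(s)) if q is None else q
    s[p], s[q] = s[q], s[p]
    return s

def inversion(s, p=None, q=None):
    """Reverse the visiting order of the genes between p and q."""
    s = list(s)
    p = random.randrange(len(s)) if p is None else p
    q = random.randrange(len(s)) if q is None else q
    if p > q:
        p, q = q, p
    return s[:p] + s[p:q + 1][::-1] + s[q + 1:]

def mutate(s):
    """Combined operator: one of the three, with equal probability."""
    return random.choice([insertion, swap, inversion])(s)
```

All three preserve the multiset of genes, so a mutated individual remains a permutation of customers and route splitters.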
It is clear from the existing literature on VRP that the use of a
local search method is almost mandatory to achieve results of high quality [14,
17, 28]. This is why we envisioned from the beginning the application of two of
the most successful techniques in recent years. In effect, we will add a local
refining step to some of our algorithms consisting in applying 2-Opt [29] and
1-Interchange [30] local optimization to every individual.
Fig. 6. 2-Opt works into a route (a), while λ-Interchange affects two routes (b)
On the one hand, the simple 2-Opt local search method works inside each
route. It randomly selects two non-adjacent edges (say, (a, b) and (c, d)) of a
single route, deletes them, thus breaking the tour into two parts, and then
reconnects those parts in the only other possible way: (a, c) and (b, d) (Fig. 6a).
Hence, given a route R = {r1, ..., ra, rb, ..., rc, rd, ..., rn}, with (ra, rb) and
(rc, rd) two randomly selected non-adjacent edges, the new route R′ obtained
after applying the 2-Opt method to the two considered edges will be R′ =
{r1, ..., ra, rc, ..., rb, rd, ..., rn}.
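The 2-Opt move therefore amounts to reversing the segment between the two removed edges. A minimal index-based sketch (with i and j marking the first endpoint of each removed edge) could look like:

```python
def two_opt_move(route, i, j):
    """2-Opt: remove edges (route[i], route[i+1]) and (route[j], route[j+1]),
    then reconnect by reversing the segment between them."""
    # the two edges must be non-adjacent and inside the route
    assert 0 <= i and i + 1 < j < len(route) - 1
    return route[:i + 1] + route[i + 1:j + 1][::-1] + route[j + 1:]

# example: removing edges (2, 3) and (4, 5) yields [1, 2, 4, 3, 5, 6]
print(two_opt_move([1, 2, 3, 4, 5, 6], 1, 3))
```

The full local search would try this move for all pairs of non-adjacent edges in every route and keep any improving result.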
On the other hand, the λ-Interchange local optimization method that we
use is based on the analysis of all the possible combinations of up to λ
customers between pairs of routes (Fig. 6b). Hence, this method results in
customers either being shifted from one route to another, or being exchanged
between routes. The mechanism can be described as follows. A solution to the
problem is represented by a set of routes S = {R1, ..., Rp, ..., Rq, ..., Rk},
where Ri is the set of customers serviced in route i. New neighboring solutions
can be obtained by applying λ-Interchange between a pair of routes Rp and
Rq: we replace each subset of customers S1 ⊆ Rp of size |S1| ≤ λ with any
other subset S2 ⊆ Rq of size |S2| ≤ λ. This way, we obtain two new routes
R′p = (Rp − S1) ∪ S2 and R′q = (Rq − S2) ∪ S1, which are part of the
new solution S′ = {R1, ..., R′p, ..., R′q, ..., Rk}.
Hence, 2-Opt searches for better solutions by modifying the order in which
customers are visited inside a route, while the λ-Interchange method results
in customers either being shifted from one route to another or exchanged
between routes. This local search step is applied to an individual after the
recombination and mutation operators, and returns the best solution among
those found by 2-Opt and 1-Interchange, or the current one if it is better
(see the pseudocode in Algorithm 3.3). In the local search step, the algorithm
applies 2-Opt to all the pairs of non-adjacent edges in every route, and
1-Interchange to all the subsets of up to one customer between every pair
of routes.
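For λ = 1 (the 1-Interchange used here), the neighborhood of a pair of routes can be enumerated as follows. This is an illustrative sketch in which the empty subset encodes a pure shift of a customer from one route to the other; capacity feasibility checks are omitted:

```python
def _subsets_up_to_one(route):
    """All subsets of a route with size <= 1 (the empty tuple and singletons)."""
    return [()] + [(c,) for c in route]

def one_interchange_neighbors(Rp, Rq):
    """Yield all neighbor route pairs obtained by moving or exchanging
    subsets of size <= 1 between routes Rp and Rq."""
    for S1 in _subsets_up_to_one(Rp):
        for S2 in _subsets_up_to_one(Rq):
            if not S1 and not S2:
                continue  # both subsets empty: no change
            new_p = [c for c in Rp if c not in S1] + list(S2)
            new_q = [c for c in Rq if c not in S2] + list(S1)
            yield new_p, new_q
```

For example, routes [1, 2] and [3] yield five neighbors: two shifts out of the first route, one shift out of the second, and two exchanges.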
In summary, the functioning of JCell is quite simple: in each generation,
an offspring is obtained for every individual by applying the recombination
operator (ERX) to its two parents (selected from its neighborhood).
The offspring are mutated with the special combined mutation, and then a
local post-optimization step is applied to the mutated individuals. This local
search step consists of applying two different local search methods (2-Opt and
1-Interchange) to the individual, and returns the best individual from among
the input individual and the outputs of 2-Opt and 1-Interchange. The
population of the next generation is composed of the current one after
replacing each individual with its offspring whenever the offspring is better.
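The generation just described can be summarized in schematic Python. This is a sketch under simplifying assumptions (minimization, and parent selection reduced to picking the two best neighbors), not the exact JCell implementation:

```python
def jcell_generation(pop, neighbors, recombine, mutate, local_search, fitness):
    """One synchronous generation of a cellular GA: for each individual,
    recombine two parents from its neighborhood, mutate, locally optimize,
    and replace the individual only if the offspring is better (minimization).
    `neighbors(i)` returns the indices forming individual i's neighborhood."""
    next_pop = []
    for i, ind in enumerate(pop):
        # simplified parent selection: the two best individuals in the neighborhood
        cands = sorted(neighbors(i), key=lambda j: fitness(pop[j]))
        p1, p2 = pop[cands[0]], pop[cands[1]]
        child = local_search(mutate(recombine(p1, p2)))
        # elitist replacement: keep the current individual unless the child is better
        next_pop.append(child if fitness(child) < fitness(ind) else ind)
    return next_pop
```

With the permutation operators sketched earlier plugged in as `recombine`, `mutate` and `local_search`, this loop reproduces the overall structure described above.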
For these experiments we selected the three smallest instances of the
benchmark, with and without drop times and maximum route length
restrictions (CMT-n51-k5, CMT-n76-k10 and CMT-n101-k8, and CMTd-n51-k6,
CMTd-n76-k11 and CMTd-n101-k9). In order to obtain a more complete
benchmark (and thus more reliable results), we additionally selected the
clustered instances CMT-n101-k10 and CMTd-n101-k11. The latter
(CMTd-n101-k11) is obtained by adding drop times and maximum route
lengths to CMT-n101-k10.
The algorithms studied in this section are compared by means of the average
total travelled length of the solutions found, the average number of
evaluations needed to reach those solutions, and the hit rate (percentage of
successful runs) over 30 independent runs. Hence, we are comparing them in
terms of accuracy, efficiency, and efficacy. To check the statistical reliability
of the comparisons, ANOVA tests are performed. The parameters used for all
the algorithms in this work are essentially those of Table 1, with some
exceptions mentioned in each case. The average values in the tables are shown
along with their standard deviation, and the best ones are shown in bold.
In this section we compare the behavior of our cellular GA, JCell2o1i, with a
generational panmictic one (genGA), in order to justify our decision to use a
structured cellular model for solving the CVRP. The same algorithm is run in
the two cases, the only difference being the neighborhood structure of the
cGA. As we will see, structuring the population is useful for improving the
algorithm in many domains, since population diversity is better maintained
than in a non-structured population. The results of this comparison are shown
in Table 2. As can be seen, the genGA is not able to find the optimum for half
of the tested instances (CMT-n76-k10, CMT-n101-k8, CMTd-n51-k6, and
CMTd-n101-k9), while JCell2o1i always finds it, with larger hit rates and
lower effort.
In Fig. 7 we graphically compare the two algorithms by plotting the dif-
ference of the values obtained (genGA minus cGA) in terms of the accuracy
–average solution length– (Fig. 7a), efficiency –average evaluations– (Fig. 7b),
[Fig. 7: per-instance differences (genGA minus cGA) in (a) average solution length, (b) average number of evaluations, and (c) hit rate (%)]
and efficacy –hit rate– (Fig. 7c). As can be seen, JCell2o1i outperforms the
generational GA on all three compared metrics (8 × 3 = 24 tests), except for
the hit rate on CMTd-n76-k11. The histogram of Fig. 7b is incomplete because
the genGA found no solutions for some instances (see the second column of
Table 2). After applying t-tests to the results obtained for the number of
evaluations and the solution distances, we conclude that the differences
between the two algorithms are always statistically significant (p-values
below 0.05).
These algorithms differ from each other only in the mutation method applied:
inversion (JCell2o1iinv), insertion (JCell2o1iins), swap (JCell2o1isw), or all
three combined (JCell2o1i) —for details on the mutation operators refer to
Sect. 3.3.
An interesting observation is that using a single mutation operator does not
yield a clear overall winner, i.e., no best performance of any of the three basic
cellular variants can be concluded. For example, using the inversion mutation
(JCell2o1iinv) we obtain very accurate solutions for most instances, although
it is less accurate for some others (i.e., instances CMT-n51-k5, CMT-n101-k8,
and CMTd-n76-k11). The same behavior is observed in terms of the average
number of evaluations and the hit rate; it is thus clear that, when a single
mutation operator is used, the best algorithm depends on the instance being
solved. Hence, as in the work of Chellapilla [31], whose algorithm
(evolutionary programming) was improved by combining two different
mutations, we decided to develop a new mutation composed of the three
proposed ones (called combined mutation) in order to exploit the behavior of
the best of them for each instance. As we can see in Table 3, this idea works:
JCell2o1i stands out as the best of the four compared methods. In terms of
the average number of evaluations, it is the best algorithm for all the tested
instances. Moreover, in terms of accuracy and hit rate, JCell2o1i is the
algorithm that finds the best values for the largest number of instances (see
the bold values for the three metrics).
In Fig. 8 we show a graphical evaluation of the four algorithms in terms
of accuracy, efficiency, and efficacy. JCell2o1i is, in general, more accurate
than the other three algorithms, although differences are very small (Fig. 8a).
In terms of the average evaluations made (efficiency), the algorithm using
the combined mutation largely outperforms the other three (Fig. 8b), while
the obtained values are slightly worse in the case of the hit rate only for
a few instances (Fig. 8c). After applying the ANOVA test to the results of
Fig. 8. Comparison of the behavior of the cGA using the three different mutations
independently and together in terms of (a) accuracy, (b) effort, and (c) efficacy
Table 4. Analysis of the effects of using local search on the JCell with combined
mutation
Inst.          | entries are: Avg. Dist. ± s.d. / Avg. Evals. ± s.d. / Hit (%)
               | JCell (no LS)       | JCell2o             | JCell1i                      | JCell2o1i                     | JCell2o2i
CMT-n51-k5     | 576.1±14.4 / — / 0  | 551.6±9.4 / — / 0   | 529.9±6.7 / 1.0E5±3.7E4 / 50 | 525.2±1.5 / 2.6E4±6.4E3 / 53  | 526.1±2.9 / 1.6E5±6.8E4 / 77
CMT-n76-k10    | 956.4±20.5 / — / 0  | 901.8±15.8 / — / 0  | 851.5±6.3 / — / 0            | 842.8±4.6 / 7.2E4±0.0 / 3     | 844.7±5.3 / 7.5E5±0.0 / 3
CMT-n101-k8    | 994.4±29.9 / — / 0  | 867.2±14.5 / — / 0  | 840.6±6.5 / — / 0            | 832.4±0.0 / 1.8E4±0.0 / 3     | 831.9±7.2 / 8.2E5±0.0 / 3
CMT-n101-k10   | 1053.6±43.5 / — / 0 | 949.6±29.0 / — / 0  | 822.8±5.9 / 2.3E5±4.2E5 / 10 | 820.9±4.9 / 7.1E4±1.3E4 / 90  | 823.3±7.6 / 1.2E5±2.7E4 / 17
CMTd-n51-k6    | 611.5±15.2 / — / 0  | 571.9±13.4 / — / 0  | 561.7±4.0 / 6.9E4±5.9E3 / 3  | 558.3±2.1 / 2.8E4±6.1E3 / 23  | 558.6±2.5 / 3.7E5±1.8E5 / 13
CMTd-n76-k11   | 1099.1±26.1 / — / 0 | 1002.6±21.7 / — / 0 | 924.0±8.2 / 2.0E5±0.0 / 3    | 918.50±7.4 / 6.5E4±2.6E4 / 7  | 917.7±9.1 / 7.1E5±3.5E5 / 13
CMTd-n101-k9   | 1117.5±34.8 / — / 0 | 936.5±15.1 / — / 0  | 882.2±10.4 / 3.8E5±0.0 / 13  | 876.94±5.9 / 9.2E4±0.0 / 3    | 873.7±4.9 / 2.87E5±1.2E5 / 80
CMTd-n101-k11  | 1110.1±58.8 / — / 0 | 1002.8±40.4 / — / 0 | 869.5±3.5 / 1.8E5±6.7E4 / 13 | 867.4±2.0 / 3.3E5±1.3E5 / 70  | 867.9±3.9 / 3.5E5±1.4E5 / 83
The differences among these three algorithms in the average number of
evaluations are also (with minor exceptions) statistically significant.
Regarding the two algorithms with the overall best results (those using 2-Opt
and λ-Interchange), JCell2o1i always obtains better values than JCell2o2i
(see the bold values) in terms of the average number of evaluations, with
statistical significance. Hence, applying 1-Interchange is faster than applying
2-Interchange (as can be expected, since it is a lighter operation), while also
achieving greater accuracy and efficiency (fewer computational resources).
We summarize in Fig. 9 the results of the compared algorithms. JCell and
JCell2o do not appear in Figs. 9b and 9c because these two algorithms were
not able to find the best-known solution for any of the tested instances. In
terms of the average distance of the solutions found (Fig. 9a), it can be seen
that the three algorithms using the λ-Interchange method obtain similar
results, outperforming the rest in all cases. JCell2o1i is the algorithm that
needs the lowest number of evaluations to reach the solution in almost all
cases (Fig. 9b). In terms of the hit rate (Fig. 9c), JCell1i is, in general, the
worst algorithm (among the three using λ-Interchange) for all the tested
instances.
To summarize in one example the studies made in this subsection, we plot in
Fig. 10 the evolution of the best (10a) and the average (10b) solutions found
during a run of the five algorithms studied in this section on CMT-n51-k5.
All the algorithms implement the combined mutation, composed of the three
methods studied in Sect. 3.3: insertion, inversion and swap. In Figs. 10a and
10b we can see a broadly similar behavior for all the algorithms: the
population quickly evolves towards solutions close to the best-known one
(value 524.61) in all cases. Zooming in on the graphics, we can see more
clearly that both JCell and JCell2o converge more slowly than the others and
finally get stuck in local optima. JCell2o maintains diversity for longer than
the other algorithms. Although the algorithms with λ-Interchange converge
faster than JCell and JCell2o, they are able to find
Fig. 9. Comparison of the behavior of the cGA without local search and four cGAs
using different local search techniques in terms of (a) accuracy, (b) effort, and (c)
efficacy
the optimal value, escaping from the local optima thanks to the λ-Interchange
local search method.
Fig. 10. Evolution of the best (a) and average (b) solutions for CMT-n51-k5 with
some algorithms differing only in the local search applied
one so far for the instance (in percentage), and the previously best-known
solution for each instance. The deviation between our best solution (sol ) and
the best-so-far one (best) is calculated by Equation 8.
∆(sol) = ((best − sol) / best) ∗ 100 .    (8)
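As a direct transcription of Equation 8:

```python
def deviation(sol, best):
    """Percentage deviation (Equation 8) between our best solution `sol`
    and the previously best-known one `best`."""
    return (best - sol) / best * 100.0
```

For instance, a solution of length 98 against a best-known length of 100 gives a deviation of 2%.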
In Sections 5.1, 5.2, 5.3, 5.5, and 5.8, distances between customers have
been rounded to the closest integer value, following the TSPLIB
convention [37]. In the other sections no rounding has been done. We discuss
each benchmark in a separate section.
Set A.
Fig. 11. Success rate for the benchmark of Augerat et al., set A
Fig. 12. Success rate for the benchmark of Augerat et al., set B
The algorithm finds it increasingly hard to reach the optimal solution as the
number of customers grows, since the hit rate decreases as the size of the
instance increases. Indeed, the hit rate is under 8% for instances involving
more than 59 customers, although it is also low for smaller instances like
A-n39-k5 and A-n45-k6.
Set B.
The instances in this set are mainly characterized by clustered customers.
This set does not seem to pose a real challenge to our algorithm, since it is
able to find the optimal values for all of them (Table 15). There is just one
exception (instance B-n68-k9), but the deviation between our best found
solution and the best-known one is really low (0.08%). In Fig. 12 we show the
success rate obtained for every instance. The hit rate is lower than 5% for
only a few instances (B-n50-k8, B-n63-k10, and B-n66-k9).
Set P.
The instances in class ‘P’ are modified versions of other instances taken from
the literature. In Table 16 we summarize our results. The reader can see that
our algorithm is also able to find all the optimal solutions for this benchmark.
The success rates are shown in Fig. 13, and the same trend as in the two
previous sets appears: low hit rates for the largest instances (more than 60
customers).
To end this section, we point out the remarkable properties of the algorithm,
which is competitive with many other algorithms across all three benchmarks.
Fig. 13. Success rate for the benchmark of Augerat et al., set P
clusters, cones, ...) and the depot (central, inside, outside) in the space, time
windows, pick-up and delivery, or heterogeneous demands. Van Breedam also
proposed a reduced set of instances (composed of 15 problems) drawn from
the initial one, and solved it with many different heuristics [33] (there exist
more recent works using this benchmark, but no new best solutions were
reported [39, 40]). This reduced set of instances is the one we study in this
chapter.
All the instances of this benchmark have the same number of customers
(n = 100), and the total demand of the customers is always the same, 1000
units. If we adopted the nomenclature used so far in this chapter, instances
using the same vehicle capacity would share the same name; to avoid
repeating names for different instances, we therefore use a special
nomenclature for this benchmark, numbering the problems from Bre-1 to
Bre-15. Problems Bre-1 to Bre-6 are constrained only by vehicle capacity.
The demand at each stop is 10 units. The vehicle capacity is 100 units for
Bre-1 and Bre-2, 50 units for Bre-3 and Bre-4, and 200 units for Bre-5 and
Bre-6. Problems Bre-7 and Bre-8 are not studied in this work because they
include pick-up and delivery constraints. A specific feature of this benchmark
is the use of homogeneous demands at the stops, i.e. all the customers of an
instance require the same amount of goods. The exceptions are problems
Bre-9 to Bre-11, for which the customer demands are heterogeneous. The
remaining four problems, Bre-12 to Bre-15, are not studied in this work
because they include time window constraints (out of our scope).
Our first conclusion is that JCell2o1i improves on the best-so-far solutions
for eight out of the nine instances. Furthermore, the average solution found
over all the runs is better than the previously best-known solution for all
these eight improved problems (see Table 17), which is an indication of the
high accuracy and stability of our algorithm. As we can see in Fig. 14, the
algorithm finds the new best solutions in a high percentage of the runs for
some instances (Bre-1, Bre-2, Bre-4, and Bre-6), while for some others the
new best solutions were found in several runs (Bre-5, Bre-9, Bre-10, and
Bre-11). In Fig. 15 we show the deviations of the solutions found (the
symbols mark
[Fig. 14: hit rates for instances Bre-1 to Bre-11]
[Fig. 15: deviation (%) from the previous best-known solution for Bre-1 to Bre-11; * marks a new best solution]
the improved instances). The new best solutions found for this benchmark are
shown in Tables 5 to 12 in Appendix A.
We have not found common patterns in the different instances (neither in the
distribution of customers nor in the location of the depot) that would allow
us to predict the accuracy of the algorithm when solving them. We noticed
that the four instances with the lowest hit rates (Bre-5, Bre-9, Bre-10, and
Bre-11) share common features, such as a non-centered depot and the
presence of a single customer located far away from the rest, but these two
features can also be found in instances Bre-2 and Bre-6. However, a common
aspect of instances Bre-9 to Bre-11 (not present in the other instances of this
benchmark) is their heterogeneous demands at the stops. Hence, we can
conclude that the use of heterogeneous demands in this benchmark reduces
the number of runs in which the algorithm obtains the best found solution
(hit rate). Finally, in the case of Bre-3 (clusters uniformly distributed on a
circle and a non-centered depot), the best-so-far solution could not be
improved, but it was found in every run.
The instances composing this benchmark range from easy problems of just
21 customers to more difficult ones of 100 customers. We can distinguish two
different kinds of instances in the benchmark. In the smaller instances (up to
Fig. 16. Success rate for the benchmark of Christofides and Eilon
32 customers), the depot and all the customers (except one) lie in a small
region of the plane, with a single customer located far away from the region
containing all the others (as in some instances of the Van Breedam
benchmark). In the remaining instances of this benchmark, the customers are
randomly distributed in the plane and the depot is either at the center or
near it.
In order to compare our results with those found in the literature, distances
between customers have been rounded to the closest integer value, as we
previously did for the three sets of instances of Augerat et al. in Sect. 5.1.
We show the percentage of runs in which the optimal solution is found in
Fig. 16. It can be seen there that the algorithm has some difficulty solving
the largest instances (over 50 customers, with customers distributed over the
whole problem area), for which the hit rate ranges from 1% to 8%. For the
smaller instances (32 customers or less), the solution is found in 100% of the
runs, except for E-n31-k7 (67%). The deviations of the solutions found (∆)
are not plotted because JCell2o1i is able to find the optimal solutions
(∆ = 0.0) for all the instances in this benchmark (see Table 18).
The other seven instances (named CMTd-nXX-kXX) use the same customer
locations as the previous seven, but they are subject to additional constraints,
such as limited route lengths and drop times (the time needed to unload
goods) at each customer. These instances are unique among all the
benchmarks studied in this work because they have drop times associated
with the customers. These drop times are homogeneous for all the customers
of an instance, i.e. all the customers spend the same time unloading goods.
Conversely, the demands at the stops are heterogeneous, so different
customers can require distinct amounts of goods.
The four clustered instances have a non-centered depot, while in the other
ones the depot is centered (with the exceptions of CMT-n51-k5 and
CMTd-n51-k6).
The reader can see in Table 19 that these instances are, in general, harder
than all the previously studied ones, since for the first time in our study
JCell2o1i is not able to find the optimal solution for some of them (see
Fig. 17). Specifically, the best-known solution cannot be found by our
algorithm for six instances (the largest ones). Nevertheless, it can be seen in
Fig. 18 that the
Fig. 17. Success rate for the benchmark of Christofides, Mingozzi and Toth
Fig. 18. Deviation rates for the benchmark of Christofides, Mingozzi and Toth
difference (∆) between our best solution and the best-known one is always
lower than 2%. The best-known solution has been found for all the clustered
instances except CMTd-n121-k11 (∆ = 0.16). Notice that the two
non-clustered instances with a non-centered depot were solved in 51%
(CMT-n51-k5) and 23% (CMTd-n51-k6) of the runs. Hence, the best-known
solution was found for all instances having a non-centered depot, with the
exception of CMTd-n121-k11.
This benchmark is composed of just three instances, taken from real-life
vehicle routing applications. Problems F-n45-k4 and F-n135-k7 represent a
day of grocery deliveries from the Peterboro and Bramalea, Ontario,
terminals, respectively, of National Grocers Limited. The other problem
(F-n72-k4) concerns the delivery of tires, batteries and accessories to gasoline
service stations, and is based on data obtained from Exxon (USA). The depot
is not centered in any of the three instances. Since this benchmark also
belongs to the TSPLIB, distances between customers are rounded to the
nearest integer value. Once again, all the instances have been solved to
optimality by our algorithm (see Table 20), with high hit rates for two of the
three instances (the hit rate is 3% for F-n135-k7). In Fig. 19 we show the
obtained hit rates. This figure clearly shows how the difficulty of finding the
optimal solution grows with the size of the instance. Once more, the
deviations (∆) are not plotted because they are 0.00 for the three instances.
[Fig. 19: hit rates for F-n45-k4, F-n72-k4 and F-n135-k7]
[Figure: deviation rates (%) for the benchmark of Golden et al.]
There is also one pseudo-real problem involving 385 cities, which was
generated as follows: each city is the most important town or village of the
smallest political entity (commune) of the canton of Vaud, in Switzerland.
The census of inhabitants of each commune was taken at the beginning of
1990. A demand of 1 unit per 100 inhabitants (but at least 1 for each
commune) was considered, and vehicles have a capacity of 65 units.
As we can see in Table 22, JCell2o1i is able to obtain the best-known
solutions for the four smaller instances (those with 75 customers) and also
for ta-n101-k11b (see Table 21 for the hit rates). Moreover, for instance
ta-n76-k10b our algorithm has improved on the previously best-known
solution by 0.0015% (new solution = 1344.62); this solution is shown in
Table 13 of Appendix A. Note that this best solution is quite close in length
to the previously existing one, but the two represent very different solutions
in terms of the resulting routes. JCell2o1i found the previous best solution
for this instance in 30% of the runs made. For the instances in which the
known best solution was found but not improved, the hit rate is over 60%.
The deviations in the other instances, as in the previous section, are very low
(under 2.90%), as can be seen in Fig. 22, i.e., the algorithm is very stable.
[Figure: hit rates for the benchmark of Taillard]
Fig. 22. Deviation rates for the benchmark of Taillard (* marks the new best solution found by JCell2o1i)
Fig. 23. Success rate for the benchmark of translated instances from TSP
it has been able to improve on the best-known solution for nine of the tested
instances, which represents an important achievement in current research.
Hence, we can say that the performance of JCell2o1i is similar, or even
superior, to that of the best algorithm for each instance. Besides, our
algorithm is quite simple, since we have designed a canonical cGA with three
mutations widely used in the literature for this problem, plus two well-known
local search methods.
As future work, it may be interesting to test the behavior of the algorithm
with other local search methods, e.g., 3-Opt. A further step is to adapt the
algorithm to other variants of the studied problem, such as the VRP with
time windows (VRPTW), multiple depots (MDVRP), or backhauls (VRPB).
Finally, it will also be interesting to study the behavior of JCell2o1i when
the local search step is applied in a less exhaustive form, thus yielding a
faster algorithm.
7 Acknowledgement
This work has been partially funded by MCYT and FEDER under
contract TIN2005-08818-C04-01 (the OPLINK project: http://oplink.lcc.
uma.es).
References
1. Toth P, Vigo D (2001) The vehicle routing problem. Monographs on discrete
mathematics and applications. SIAM, Philadelphia
2. Dantzig GB, Ramser JH (1959) The truck dispatching problem. Manag Sci
6:80–91
3. Christofides N, Mingozzi A, Toth P (1979) The vehicle routing problem. In:
Combinatorial optimization. Wiley, New York, pp 315–338
4. http://neo.lcc.uma.es/radi-aeb/WebVRP/index.html
5. Golden B, Wasil E, Kelly J, Chao IM (1998) The impact of metaheuristics on
solving the vehicle routing problem: algorithms, problem sets, and computa-
tional results. In: Fleet management and logistics. Kluwer, Boston, pp 33–56
6. Cordeau JF, Gendreau M, Hertz A, Laporte G, Sormany JS (2005) New heuris-
tics for the vehicle routing problem. In: Langevin A, Riopel D (eds.) Logistics
systems: design and optimization. Kluwer Academic, Dordrecht, Springer Verlag
NY, pp. 279–297
7. Manderick B, Spiessens P (1989) Fine-grained parallel genetic algorithm. In:
Schaffer J (ed.) Proceedings of the third international conference on genetic
algorithms – ICGA89, Morgan-Kaufmann, Los Altos, CA, pp 428–433
8. Alba E, Tomassini M (2002) Parallelism and evolutionary algorithms. IEEE
Trans Evol Comput 6:443–462
9. Sarma J, Jong KD (1996) An analysis of the effect of the neighborhood size
and shape on local selection algorithms. In: Voigt H, Ebeling W, Rechenberg I,
Schwefel H (eds.) Parallel problem solving from nature (PPSN IV). Volume 1141
of lecture notes in computer science. Springer, Berlin Heidelberg New York, pp
236–244
B Results
The tables containing the results obtained by JCell2o1i when solving all the
instances proposed in this work are shown in this appendix. The values
included in Tables 14 to 23 are obtained from 100 independent runs (for
statistical significance), except for the benchmark of Golden et al. (Table 21),
where only 30 runs were made due to its difficulty.
The values in the tables of this section correspond to the best solution found
in our runs for each instance, the average number of evaluations made to
find that solution, the success rate, the average value of the solutions found
over all the independent runs, the deviation (∆) between our best solution
and the previously best-known one (in percentage), and the known best
solution for each instance. The execution times of our tests range from
153.50 hours for the most complex instance to 0.32 seconds for the simplest
one.
Jeff Achtnig
1 Introduction
1.1 Outline
The PSO algorithm was first proposed by Kennedy and Eberhart [1] as a
new method for function optimization. The inspiration for PSO came from
observing social behavior in nature, such as that of a flock of birds,
a swarm of bees, or a school of fish.
J. Achtnig: Particle Swarm Optimization with Mutation for High Dimensional Problems,
Studies in Computational Intelligence (SCI) 82, 423–439 (2008)
www.springerlink.com
© Springer-Verlag Berlin Heidelberg 2008
424 J. Achtnig
that problem space, often caused the swarm to quickly converge on relatively
poor solutions.
PSO’s difficulty with high dimensional problems can be understood as
follows. A problem with 1000 variables, for example, would require 2000 it-
erations in order for a given particle to explore the effects of changing just a
single variable at a time (i.e. two iterations per variable would be needed to
explore both a positive and a negative change in direction from its current
position). Exploring the effects of simultaneous variable changes would require
even more iterations. In that time, the swarm would have already begun con-
verging on one of the particles. Once the swarm converges the search becomes
much more local, and the ability to explore large-scale changes in each of the
variables diminishes.
2 PSO Modifications
Our initial goal was to examine the ability of PSO in high dimensional problem
spaces, and see if there might be a simple way of improving its performance on
those types of problems. To that end, our research was narrowed to two tech-
niques that did not require any large or computationally expensive changes to
the PSO algorithm. The first was the inclusion of a mutation operator. The
second involved using a random constriction coefficient instead of a fixed one.
2.1 Mutation
The basic PSO algorithm is modelled around the concept of swarming, and
therefore does not make use of an explicit mutation operator. In typical evo-
lutionary algorithms, however, mutation is an important operation that helps
to maintain the diversity of the population. It was therefore hypothesized that
combining the basic PSO algorithm with a mutation operator would help the
swarm avoid premature convergence in high dimensional problems.
One possible approach for creating a mutation operator for PSO is to
give each problem dimension of every particle a random chance of getting
mutated. For a 1,600 dimensional problem with 100 particles, however, this
would result in 160,000 random numbers generated in each iteration of the
swarm. To reduce the number of random number calculations required, it was
decided instead to give each particle a chance of getting selected for mutation.
If selected, then a second random variable would be generated to determine
which single dimension/variable for that particle would get mutated. This
approach considerably reduces the number of random number calculations
required.
Note that limiting the mutation to a single dimension/variable does not
preclude the ability to explore the effects of multiple variable changes. A parti-
cle can still be selected for mutation multiple times in succession, thus allowing
potentially more than one variable in each particle to be affected by mutation.
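The per-particle selection scheme described above can be sketched as follows. The mutation probability and the choice to re-randomize the selected variable within its bounds are assumptions for illustration; the chapter specifies the selection scheme but not these details:

```python
import random

MUTATION_PROB = 0.05  # assumed per-particle mutation probability

def mutate_swarm(positions, lower, upper, rng=random):
    """Give each particle a chance of being selected for mutation; if
    selected, draw a second random number to pick a single dimension of
    that particle and mutate it.

    positions: list of position vectors (one per particle).
    Only two random draws are needed per mutated particle, instead of
    one draw per dimension per particle.
    """
    for pos in positions:
        if rng.random() < MUTATION_PROB:          # chance of being selected
            d = rng.randrange(len(pos))           # which single dimension
            pos[d] = rng.uniform(lower, upper)    # re-randomize that variable
    return positions
```

For a 1,600-dimensional problem with 100 particles, this scheme needs at most 200 random draws per iteration rather than 160,000.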
426 J. Achtnig
A common variant of PSO uses a linearly decreasing inertia weight [4]
instead of the constriction coefficient, χ, of equation 1. Using the inertia
weight in this way, however, requires knowing in advance how many iterations
the PSO algorithm will run for, so that the weight can be decreased
accordingly. Rather than deciding on a fixed number of iterations arbitrarily,
or through trial and error, we chose Clerc's version of PSO (which uses a
static constriction coefficient) for our experiments.
However, in an attempt to improve some of our initial results with using a
mutation operator, we also experimented with a number of other PSO modi-
fications in combination with the mutation operator. After a few initial tests,
we settled on utilizing a random constriction coefficient in place of a static one.
In this approach, a uniform random number in the range [0.2, 0.9] was
generated for each particle in the swarm, at every iteration. This random
number was then used as the value of the constriction coefficient for that
particle, and was applied to every dimension/variable of that particle.
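A sketch of the resulting velocity update is shown below. The constricted update form and the acceleration coefficients c1 = c2 = 2.05 are assumptions taken from the standard PSO literature, not from this chapter; only the χ ~ U[0.2, 0.9] draw per particle per iteration comes from the text:

```python
import random

C1 = C2 = 2.05  # acceleration coefficients (assumed, standard values)

def update_velocity(v, x, pbest, gbest, rng=random):
    """Constricted PSO velocity update with a *random* constriction
    coefficient: chi ~ U[0.2, 0.9] is drawn once per particle per
    iteration and applied to every dimension of that particle."""
    chi = rng.uniform(0.2, 0.9)
    return [
        chi * (v[j]
               + C1 * rng.random() * (pbest[j] - x[j])
               + C2 * rng.random() * (gbest[j] - x[j]))
        for j in range(len(x))
    ]
```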
3 Experimental Settings
velocity for that particle at that time step was set equal to vmax , where vmax
was determined according to the following equation:
vmax = 0.50 · (FunctionUpperLimit − FunctionLowerLimit) / 2.    (3)
The performance of each algorithm was compared on each test function
in 400, 800, and 1600 dimensions (the dimensionality of the neural
network problems is roughly the same). With the exception of the neural
network problems of 800 and 1600 dimensions, each test comprised 50 runs and
the average fitness was used in our comparisons. For those neural network
problems, 25 runs were averaged due to time constraints. For
each run, each of the four PSO algorithms being tested was initialized with
the same random seed, ensuring that the initial positions of the particles
would be the same. The population size for all tests was fixed at 100 particles,
while the number of iterations varied depending on the test function and its
dimensionality.
We also experimented with larger populations of particles, equal to the
number of dimensions of the problem, but the few tests we ran suggested
that running 100 particles for more iterations produced equal or better
solutions. Since we were also looking to reduce the computation time
involved, we decided that using fewer particles would best meet that goal. All
particles in our tests were connected in the standard “star” (fully connected)
topology.
Finally, we decided to compare the PSO algorithms with another evolu-
tionary algorithm: Differential Evolution (DE) [12]. After some preliminary
tests with a few DE variants and settings, we chose the DE/rand/1/best strat-
egy with C = 0.80 and F = 0.50 for our comparisons. As in the PSO variants,
the population size was fixed at 100 particles.
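The chapter names the strategy DE/rand/1/best with C = 0.80 and F = 0.50 but does not spell the variant out; the sketch below shows the closely related standard DE/rand/1/bin trial-vector construction, treating C as the crossover rate. Both of those readings are assumptions, so this illustrates generic DE rather than the authors' exact implementation:

```python
import random

F = 0.50   # differential weight (from the chapter)
CR = 0.80  # crossover rate (the chapter's C = 0.80, assumed meaning)

def de_trial(pop, i, rng=random):
    """Build a DE/rand/1/bin trial vector for population member i:
    mutant = a + F * (b - c) for three distinct random members (all
    different from i), then binomial crossover with the target vector."""
    idx = [k for k in range(len(pop)) if k != i]
    a, b, c = (pop[k] for k in rng.sample(idx, 3))
    target = pop[i]
    jrand = rng.randrange(len(target))  # ensure at least one mutant gene
    return [
        a[j] + F * (b[j] - c[j]) if (rng.random() < CR or j == jrand)
        else target[j]
        for j in range(len(target))
    ]
```

In the selection step, the trial vector replaces the target only if it has equal or better fitness.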
The four numerical optimization problems used in our tests are the unimodal
Sphere and Rosenbrock functions, and the multimodal Rastrigin and Schwefel
functions.
Sphere = Σ_{i=1}^{n} x_i^2                                          (4)

Rosenbrock = Σ_{i=1}^{n−1} [100 · (x_{i+1} − x_i^2)^2 + (1 − x_i)^2]    (5)

Rastrigin = 10n + Σ_{i=1}^{n} (x_i^2 − 10 cos(2πx_i))               (6)

Schwefel = 418.9829 · n + Σ_{i=1}^{n} (−x_i sin(√|x_i|))            (7)
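Read directly from equations (4) to (7), the four benchmarks can be implemented as:

```python
import math

def sphere(x):
    # Unimodal; global minimum 0 at the origin.
    return sum(v * v for v in x)

def rosenbrock(x):
    # Unimodal (for these dimensions); global minimum 0 at all-ones.
    return sum(100.0 * (x[i + 1] - x[i] ** 2) ** 2 + (1.0 - x[i]) ** 2
               for i in range(len(x) - 1))

def rastrigin(x):
    # Multimodal; global minimum 0 at the origin.
    return 10.0 * len(x) + sum(v * v - 10.0 * math.cos(2.0 * math.pi * v)
                               for v in x)

def schwefel(x):
    # Multimodal; global minimum near x_i = 420.9687 in every dimension.
    return 418.9829 * len(x) + sum(-v * math.sin(math.sqrt(abs(v)))
                                   for v in x)
```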
428 J. Achtnig
For the feed-forward network, Kramer’s nonlinear PCA neural network [9]
was used. As shown in Figure 2, this is a five layer auto-associative neural
network with a bottleneck layer in the middle to reduce the dimensionality of
the input variables (the number of nodes in the bottleneck layer is less than
the number of nodes in the input/output layers).
The following network configurations (nodes per layer) were tested:
1) 12–11–5–11–12, which – including the bias weights – results in 413
weights to optimize;
2) 18–15–8–15–18, which results in 836 weights to optimize; and
3) 24–24–11–24–24, which results in 1763 weights to optimize.
Each network was trained on 200 test cases, and the average error of
all of the test cases between the input and output values was used as our
fitness function. The test cases were created such that there would be some
correlation between the input variables – a non-linear correlation and a linear
one. The following scheme was used to generate the inputs for each test case:
1) input[1] = random variable uniformly chosen over the range [0.0, 1.0];
2) input[2] = (input[1])^3; and
3) input[3] = 0.75 · input[1].
This pattern was repeated for the remainder of the inputs (4..6, 7..9, 10..12,
etc.) for each test case. As such, the number of inputs was always chosen to
be a multiple of three.
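A sketch of the test-case generator is given below. Taking input[2] as the cube of input[1] is our reading of the non-linear correlation mentioned in the text, and the function name is illustrative:

```python
import random

def make_test_case(n_inputs, rng=random):
    """Generate one test case with n_inputs inputs (n_inputs must be a
    multiple of 3). Each group of three holds one free uniform variable,
    a non-linear function of it, and a linear function of it."""
    assert n_inputs % 3 == 0
    case = []
    for _ in range(n_inputs // 3):
        u = rng.uniform(0.0, 1.0)
        case.extend([u, u ** 3, 0.75 * u])  # free, non-linear, linear
    return case
```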
The recurrent neural network consisted of a three layer network with the pre-
vious values of both the middle and output layers fed back into the input layer,
similar to the one shown in Figure 3. Three external inputs were connected
to the network. The first two of these inputs were generated from a chaotic
time series (the logistic map with r = 4), while the third input was a non-
linear combination of the first two inputs. The outputs of the network were
the future values of each of the three input values (ranging from two future
values per input, to four future values per input depending on the network
problem). The three inputs were:
1) input[1] = logistic map series, with initial value 0.3;
2) input[2] = logistic map series, with initial value 0.5; and
3) input[3] = (input[1])^2 + (input[2])^2.
The logistic map, with r = 4, is: x_{t+1} = 4 · x_t · (1 − x_t).
The following network configurations (nodes per layer) were tested:
1) (3 + 20) − 14 − 6, which – including the bias weights – results in 426
weights to optimize;
2) (3 + 29) − 20 − 9, which results in 849 weights to optimize; and
3) (3 + 40) − 28 − 12, which results in 1580 weights to optimize.
The number of input nodes was always 3 plus the number of middle and
output layer nodes. Each network was trained on 200 successive values of the
input series.
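The three input series can be sketched as follows; the function names are illustrative, and the series length is a parameter since the text fixes only the 200 training values:

```python
def logistic_series(x0, n):
    """Iterate the logistic map x_{t+1} = 4 * x_t * (1 - x_t)."""
    xs = [x0]
    for _ in range(n - 1):
        xs.append(4.0 * xs[-1] * (1.0 - xs[-1]))
    return xs

def make_rnn_inputs(n):
    """Three input series for the recurrent network: two logistic-map
    series and a non-linear combination (sum of squares) of them."""
    s1 = logistic_series(0.3, n)
    s2 = logistic_series(0.5, n)
    s3 = [a * a + b * b for a, b in zip(s1, s2)]
    return s1, s2, s3
```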
Rosenbrock (10,000 iterations):  46,719,745 (σ=8,543,113)   843 (σ=108)   918 (σ=122)   27,606,006 (σ=5,388,505)   413 (σ=139)
Schwefel (15,000 iterations):    81,092 (σ=5,543)   133,310 (σ=4,605)   22,715 (σ=1,070)   90,801 (σ=6,555)   21,825 (σ=1,317)
NN−PCA∗ (20,000 iterations):     14.85 (σ=1.99)   13.42 (σ=0.57)   10.80 (σ=1.41)   14.08 (σ=1.92)   6.56 (σ=2.63)
RNN∗ (20,000 iterations):        17.08 (σ=3.65)   21.86 (σ=4.43)   14.91 (σ=3.82)   18.32 (σ=4.04)   13.76 (σ=5.03)
compared. There are no local minima to get stuck in; there is only one
globally optimal solution, centered at the origin, and, regardless of where a
particle is positioned in the search space, following the gradient leads
directly to that solution. Yet, despite these favorable conditions, the bPSO
algorithm still gets stuck in a non-optimal solution. PSO's
problem of premature convergence appears to have a much more noticeable
effect in higher dimensions.
Adding a small mutation to the PSO algorithm, however, seems to give the
swarm a needed push that helps to keep it from getting stuck. While the bPSO
algorithm converges prematurely in each of the Sphere tests, the PSO.mut al-
gorithm continues to improve its fitness throughout each of the iterations. The
random constriction coefficient (PSO.rnd) by itself didn’t offer much improve-
ment over the bPSO algorithm, but combining it with the mutation operator
(PSO.cmb) resulted in the best performance.
The relative performance of the algorithms was found to be similar on all
of the other test problems – including the two neural network problems. In
each case the bPSO algorithm performed relatively poorly, while the algo-
rithms that performed the best were those with the mutation operator. The
combination of the mutation operator along with the random constriction
coefficient performed the best in all instances.
PSO with Mutation 433
All of the PSO algorithms tested still had a tendency, albeit to varying
degrees, to converge to non-optimal solutions. This was most noticeable
for the Schwefel function in all dimensions, and in general it became more
noticeable for all functions as the dimensionality of the problem increased.
Even the PSO.cmb algorithm appeared to have moments of difficulty on the
Sphere function in 1600 dimensions (although it did manage to recover).
The final set of graphs plot the diversity of the swarm for the Sphere, Ras-
trigin, Rosenbrock, and Schwefel functions in 400 dimensions. The diversity
of the swarm is given by:
Diversity = (1 / (|S| · D)) · Σ_{i=1}^{|S|} √( Σ_{j=1}^{N} (p_ij − p̄_j)^2 ),    (8)
where |S| is the swarm size, D is the length of the largest diagonal in the
search space, N is the number of dimensions of the problem, p_ij is the j-th
value of the i-th particle, and p̄_j is the j-th value of the average point p̄.
In all cases,
the PSO.rnd algorithm quickly achieves the lowest diversity compared with all
of the other algorithms, suggesting that the random constriction coefficient
has the effect of speeding up the convergence of the swarm. The PSO.mut
algorithm, on the other hand, has a diversity greater than or roughly equal
to the bPSO algorithm in all cases. This is to be expected, as the random
mutations should increase the diversity of the swarm.
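Equation 8, read as the standard distance-to-average-point measure of Riget and Vesterstroem [11], can be computed as follows; D, the length of the search space's largest diagonal, is passed in:

```python
import math

def diversity(swarm, diag):
    """Mean Euclidean distance of the particles from the swarm's average
    point, normalized by the length of the search space's largest
    diagonal (diag). swarm: list of position vectors."""
    n = len(swarm[0])
    avg = [sum(p[j] for p in swarm) / len(swarm) for j in range(n)]
    total = sum(math.sqrt(sum((p[j] - avg[j]) ** 2 for j in range(n)))
                for p in swarm)
    return total / (len(swarm) * diag)
```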
The best performing variant – the PSO.cmb algorithm – was subjected
both to forces trying to increase its diversity (i.e. the mutation operator),
and to those trying to decrease its diversity (i.e. the random constriction
coefficient), with results somewhere in-between depending on the problem. In
all cases the diversity of the PSO.cmb algorithm very quickly decreased in the
first few iterations – much more so than the bPSO algorithm, and generally
matching the decreased diversity of the PSO.rnd algorithm. After the first few
iterations of the Rastrigin and Schwefel functions, however, the PSO.cmb’s
diversity started to match that of the PSO.mut algorithm. Conversely, on
the Sphere and Rosenbrock functions the PSO.cmb algorithm continued to
decrease its diversity similar to the PSO.rnd for an extended period of time,
before levelling off (eventually, near the end of the iterations for both of these
latter cases, the bPSO algorithm’s diversity continued to decrease until it fell
below that of the PSO.cmb algorithm).
The DE algorithm performed quite well on its own, beating the “standard”
bPSO algorithm on most – but not all – test functions. Note especially the
Rastrigin function, where the DE algorithm in 800 dimensions performed bet-
ter than it did on the same problem in 400 dimensions. However, the PSO.cmb
algorithm still performed the best out of all of the algorithms on all of the test
problems. The PSO.mut algorithm was also better than the DE algorithm on
all but two problems (Rosenbrock in 400 dimensions, and the Rastrigin func-
tion in 1600 dimensions).
In general, the two PSO variants with a mutation operator (PSO.mut
and PSO.cmb) performed well across all of the problems tested, whereas
the DE algorithm performed relatively well on some problems but not on
others (such as the Schwefel function). Also, on problems such as the
Rosenbrock function, the DE algorithm performed noticeably worse as the
dimensionality of the problem increased, while the PSO.mut and PSO.cmb
algorithms handled the increasing dimensionality better.
References
1. Kennedy J, Eberhart R (2001) Swarm intelligence. Morgan Kaufmann, San
Francisco, CA
2. Clerc M, Kennedy J (2002) The particle swarm: explosion, stability, and conver-
gence in a multidimensional complex space. IEEE Trans Evol Comput 6:58–73
3. Kennedy J (1999) Small worlds and mega-minds: effects of neighborhood topol-
ogy on particle swarm performance. In: Proceedings of the 1999 Congress of
Evolutionary Computation, vol 3. IEEE, pp 1931–1938
4. Eberhart R, Shi Y (1998) A modified particle swarm optimizer. IEEE Int Conf
Evol Comput. Anchorage, Alaska
5. Bellman R (1961) Adaptive control processes: a guided tour. Princeton Univer-
sity Press
6. Fogel D, Beyer H-G (1996) A note on the empirical evaluation of intermediate
recombination. Evol Comput 3(4):491–495
7. Angeline PJ (1998) Evolutionary optimization versus particle swarm optimiza-
tion. In: Porto VW, Saravanan N, Waagen D, Eiben AE (eds) Evolutionary
programming VII. Lecture notes in computer science, vol 1447. Springer, Berlin
Heidelberg New York, pp 601–610
8. Vesterstroem J, Riget J, Krink T (2002) Division of labor in particle swarm
optimisation. In: Proceedings of the IEEE congress on evolutionary computa-
tion (CEC 2002), Honolulu, Hawaii
9. Kramer MA (1991) Nonlinear principal component analysis using autoassocia-
tive neural networks. AIChE J 37(2):233–243
10. Riget J, Vesterstroem J, Krink T (2002) In: Proceedings of the congress
on evolutionary computation (CEC '02), vol 2, pp 1474–1479
11. Riget J, Vesterstroem J (2002) A diversity-guided particle swarm optimizer –
the arPSO. In: EVALife, Department of Computer Science, University of
Aarhus, Denmark
12. Lampinen J, Storn R (2004) Differential evolution. In: Onwubolu GC, Babu BV
(eds) New optimization techniques in engineering. Studies in fuzziness and soft
computing, vol 141. Springer, Berlin Heidelberg, New York, pp 123–166