Engineering Evolutionary Intelligent Systems
Studies in Computational Intelligence, Volume 82
Editor-in-chief
Prof. Janusz Kacprzyk
Systems Research Institute
Polish Academy of Sciences
ul. Newelska 6
01-447 Warsaw
Poland
E-mail: kacprzyk@ibspan.waw.pl
Further volumes of this series can be found on our homepage: springer.com

Vol. 60. Vladimir G. Ivancevic and Tijana T. Ivancevic
Computational Mind: A Complex Dynamics Perspective, 2007
ISBN 978-3-540-71465-1

Vol. 61. Jacques Teller, John R. Lee and Catherine Roussey (Eds.)
Ontologies for Urban Development, 2007
ISBN 978-3-540-71975-5

Vol. 62. Lakhmi C. Jain, Raymond A. Tedman and Debra K. Tedman (Eds.)
Evolution of Teaching and Learning Paradigms in Intelligent Environment, 2007
ISBN 978-3-540-71973-1

Vol. 63. Wlodzislaw Duch and Jacek Mańdziuk (Eds.)
Challenges for Computational Intelligence, 2007
ISBN 978-3-540-71983-0

Vol. 64. Lorenzo Magnani and Ping Li (Eds.)
Model-Based Reasoning in Science, Technology, and Medicine, 2007
ISBN 978-3-540-71985-4

Vol. 65. S. Vaidya, L.C. Jain and H. Yoshida (Eds.)
Advanced Computational Intelligence Paradigms in Healthcare-2, 2007
ISBN 978-3-540-72374-5

Vol. 66. Lakhmi C. Jain, Vasile Palade and Dipti Srinivasan (Eds.)
Advances in Evolutionary Computing for System Design, 2007
ISBN 978-3-540-72376-9

Vol. 67. Vassilis G. Kaburlasos and Gerhard X. Ritter (Eds.)
Computational Intelligence Based on Lattice Theory, 2007
ISBN 978-3-540-72686-9

Vol. 68. Cipriano Galindo, Juan-Antonio Fernández-Madrigal and Javier Gonzalez
A Multi-Hierarchical Symbolic Model of the Environment for Improving Mobile Robot Operation, 2007
ISBN 978-3-540-72688-3

Vol. 69. Falko Dressler and Iacopo Carreras (Eds.)
Advances in Biologically Inspired Information Systems: Models, Methods, and Tools, 2007
ISBN 978-3-540-72692-0

Vol. 70. Javaan Singh Chahl, Lakhmi C. Jain, Akiko Mizutani and Mika Sato-Ilic (Eds.)
Innovations in Intelligent Machines-1, 2007
ISBN 978-3-540-72695-1

Vol. 71. Norio Baba, Lakhmi C. Jain and Hisashi Handa (Eds.)
Advanced Intelligent Paradigms in Computer Games, 2007
ISBN 978-3-540-72704-0

Vol. 72. Raymond S.T. Lee and Vincenzo Loia (Eds.)
Computational Intelligence for Agent-based Systems, 2007
ISBN 978-3-540-73175-7

Vol. 73. Petra Perner (Ed.)
Case-Based Reasoning on Images and Signals, 2008
ISBN 978-3-540-73178-8

Vol. 74. Robert Schaefer
Foundation of Global Genetic Optimization, 2007
ISBN 978-3-540-73191-7

Vol. 75. Crina Grosan, Ajith Abraham and Hisao Ishibuchi (Eds.)
Hybrid Evolutionary Algorithms, 2007
ISBN 978-3-540-73296-9

Vol. 76. Subhas Chandra Mukhopadhyay and Gourab Sen Gupta (Eds.)
Autonomous Robots and Agents, 2007
ISBN 978-3-540-73423-9

Vol. 77. Barbara Hammer and Pascal Hitzler (Eds.)
Perspectives of Neural-Symbolic Integration, 2007
ISBN 978-3-540-73953-1

Vol. 78. Costin Badica and Marcin Paprzycki (Eds.)
Intelligent and Distributed Computing, 2008
ISBN 978-3-540-74929-5

Vol. 79. Xing Cai and T.-C. Jim Yeh (Eds.)
Quantitative Information Fusion for Hydrological Sciences, 2008
ISBN 978-3-540-75383-4

Vol. 80. Joachim Diederich
Rule Extraction from Support Vector Machines, 2008
ISBN 978-3-540-75389-6

Vol. 81. K. Sridharan
Robotic Exploration and Landmark Determination, 2008
ISBN 978-3-540-75393-3

Vol. 82. Ajith Abraham, Crina Grosan and Witold Pedrycz (Eds.)
Engineering Evolutionary Intelligent Systems, 2008
ISBN 978-3-540-75395-7
Ajith Abraham
Crina Grosan
Witold Pedrycz
(Eds.)
Engineering Evolutionary
Intelligent Systems
Ajith Abraham
Centre for Quantifiable Quality of Service in Communication Systems (Q2S), Centre of Excellence
Norwegian University of Science and Technology
O.S. Bragstads plass 2E, N-7491 Trondheim, Norway
ajith.abraham@ieee.org

Crina Grosan
Department of Computer Science, Faculty of Mathematics and Computer Science
Babes-Bolyai University
Kogalniceanu 1, 400084 Cluj-Napoca, Romania

Witold Pedrycz
Department of Electrical & Computer Engineering
University of Alberta
ECERF Bldg., 2nd floor, Edmonton AB T6G 2V4, Canada
© 2008 Springer-Verlag Berlin Heidelberg
This work is subject to copyright. All rights are reserved, whether the whole or part of the material is
concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting,
reproduction on microfilm or in any other way, and storage in data banks. Duplication of this publication
or parts thereof is permitted only under the provisions of the German Copyright Law of September 9,
1965, in its current version, and permission for use must always be obtained from Springer. Violations are
liable to prosecution under the German Copyright Law.
The use of general descriptive names, registered names, trademarks, etc. in this publication does not imply,
even in the absence of a specific statement, that such names are exempt from the relevant protective laws
and regulations and therefore free for general use.
Cover Design: Deblik, Berlin, Germany
Printed on acid-free paper
9 8 7 6 5 4 3 2 1
springer.com
Contents
Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 441
Preface
PNNs. The optimization of the FNN is realized with the aid of a standard
back-propagation learning algorithm and genetic optimization.
In the third Chapter, Oh and Pedrycz introduce Self-Organizing Neural Networks (SONN) that are based on a genetically optimized multilayer perceptron with Polynomial Neurons (PNs) or Fuzzy Polynomial Neurons (FPNs). In the conventional SONN, an evolutionary algorithm is used to extend the main characteristics of the extended Group Method of Data Handling (GMDH) method, which utilizes the polynomial order as well as the number of node inputs fixed at the corresponding nodes (PNs or FPNs) located in each layer during the development of the network. The genetically optimized SONN (gSONN) results in a structurally optimized network and comes with a higher level of flexibility than the conventional SONN.
Kim and Park discuss a new design methodology of the self-organizing technique which builds upon the use of evolutionary algorithms. The self-organizing network dwells on the idea of the group method of data handling. The order of the polynomial, the optimal number of input variables and their selection are encoded as a chromosome. The appropriate information of each node is evolved accordingly and tuned gradually using evolutionary algorithms. The evolved network is a sophisticated and versatile architecture, which can construct models from a limited data set as well as from poorly defined complex problems.
In the fifth Chapter, Ramanathan and Guan propose the Recursive Pattern-based Hybrid Supervised (RPHS) learning algorithm, which makes use of the concept of pseudo global optimal solutions to evolve a set of neural networks, each of which can correctly solve a subset of patterns. The pattern-based algorithm uses the topology of the training and validation data patterns to find a set of pseudo-optima, each learning a subset of patterns.
Ramanathan and Guan improve the RPHP algorithm (as discussed in Chapter 5) by using a combination of a genetic algorithm, weak learners and a pattern distributor. The global search component is achieved by a cluster-
based combinatorial optimization, whereby patterns are clustered according
to the output space of the problem. A combinatorial optimization problem is
therefore formed, which is solved using evolutionary algorithms. An algorithm
is also proposed to use the pattern distributor to determine the optimal num-
ber of recursions and hence the optimal number of weak learners suitable for
the problem at hand.
In the seventh Chapter, Markowska-Kaczmar proposes two methods of rule extraction referred to as REX and GEX. REX uses propositional fuzzy rules and is composed of two methods, REX Michigan and REX Pitt. GEX takes advantage of classical Boolean rules. The efficiency of REX and GEX was tested using different benchmark data sets from the UCI repository.
Tushar and Pratihar deal with the design of Takagi and Sugeno Fuzzy Logic Controllers (FLCs). The design proceeds by clustering the data based on their mutual similarity, after which cluster-wise regression analysis is carried out to determine the response equations for the consequent part of the rules. The performance of the developed cluster-wise linear regression approach, the cluster-wise Takagi and Sugeno model of FLC with linear membership functions, and the cluster-wise Takagi and Sugeno model of FLC with nonlinear membership functions is illustrated using two practical problems.
In the ninth Chapter, Prosperi and Ulivi propose fuzzy relational mod-
els for genotypic drug resistance analysis in Human Immunodeficiency Virus
type 1 (HIV-1). Fuzzy logic is introduced to model high-level medical lan-
guage, viral and pharmacological dynamics. Fuzzy evolutionary algorithms
and fuzzy evaluation functions are proposed to mine resistance rules, to
improve computational performance and to select relevant features.
Azzini and Tettamanzi present an approach to the joint optimization of neural network structure and weights, using the backpropagation algorithm as a specialized decoder and defining a simultaneous evolution of the architecture and weights of neural networks.
In the eleventh Chapter, Dempsey et al. present grammatical genetic programming to generate radial basis function networks. The authors tested the hybrid algorithm on several benchmark classification problems and report encouraging performance.
In the sequel, Cha et al. propose a neural-genetic model for wave-induced liquefaction, which provides a better prediction of liquefaction potential. The
wave-induced seabed liquefaction problem is one of the most critical issues for
analyzing and designing marine structures such as caissons, oil platforms and
harbors. In the past, various investigations into wave-induced seabed lique-
faction have been carried out including numerical models, analytical solutions
and some laboratory experiments. However, most previous numerical studies
are based on solving complicated partial differential equations. The neural-
genetic simulation results illustrate the applicability of the hybrid technique
for the accurate prediction of wave-induced liquefaction depth, which can also
provide coastal engineers with alternative tools to analyze the stability of
marine sediments.
In the thirteenth Chapter, Quintero and Pierre propose a multi-population Memetic Algorithm (MA) with migration and elitism to solve the problem of assigning cells to switches as a design step of large-scale mobile networks. This task is well known in the literature to be an NP-hard combinatorial optimization problem, and it therefore requires recourse to heuristic methods, which can practically lead to good feasible solutions, not necessarily optimal, the objective being rather to reduce the convergence time toward these solutions. Computational results reported on an extensive suite of tests confirm the efficiency and effectiveness of the MA in providing good solutions in comparison with other well-known heuristics, especially for large-scale cellular mobile networks.
Contributors

Jeff Achtnig
Nalisys (Research Division)
jeff.achtnig@nalisys.com

Enrique Alba
Department of Computer Science
University of Málaga
eat@lcc.uma.es

Antonia Azzini
Università degli Studi di Milano
Dipartimento di Tecnologie dell'Informazione
via Bramante 65, 26013 Crema, Italy
azzini@dti.unimi.it

Michael Blumenstein
School of Information and Communication Technology
Griffith University
Gold Coast Campus, QLD 4215, Australia
M.Blumenstein@griffith.edu.au

Daeho Cha
Griffith School of Engineering
Griffith University, Gold Coast Campus
QLD 4215, Australia
f.cha@griffith.edu.au

Ian Dempsey
Natural Computing Research and Applications Group
University College Dublin
Ireland
ian.dempsey@PipelineFinancial.com

Bernabé Dorronsoro
Department of Computer Science
University of Málaga
bernabe@lcc.uma.es
1 Introduction
Evolutionary Algorithms (EA) have recently received increased interest, par-
ticularly with regard to the manner in which they may be applied for practical
problem solving. Usually grouped under the term evolutionary computation
or evolutionary algorithms, we find the domains of Genetic Algorithms [34],
Evolution Strategies [68], [69], Evolutionary Programming [20], Learning Clas-
sifier Systems [36], Genetic Programming [45], Differential Evolution [67] and
Estimation of Distribution Algorithms [56]. They all share a common conceptual base of simulating the evolution of individual structures, and they differ in the way the problem is represented, in the selection process and in the usage/implementation of the reproduction operators. These processes depend on the perceived performance of the individual structures as defined by the problem.
Compared to other global optimization techniques, evolutionary algo-
rithms are easy to implement and very often they provide adequate solutions.
A population of candidate solutions (for the optimization task to be solved) is initialized. New solutions are created by applying reproduction operators (mutation and/or crossover). The fitness (how good the solutions are) of the resulting solutions is evaluated, and a suitable selection strategy is then applied to determine which solutions will be maintained in the next generation. The procedure is then iterated.
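The loop just described can be sketched in a few lines. The sphere fitness function, the population size, the rates and the truncation selection used below are illustrative assumptions, not choices prescribed by the chapter.

```python
# Minimal sketch of the generic evolutionary loop: initialize,
# reproduce (crossover + mutation), evaluate, select, iterate.
import random

def evolve(fitness, dim=5, pop_size=20, generations=100,
           mutation_rate=0.2, seed=0):
    rng = random.Random(seed)
    # 1. Initialize a population of candidate solutions.
    pop = [[rng.uniform(-5, 5) for _ in range(dim)] for _ in range(pop_size)]
    for _ in range(generations):
        # 2. Create offspring via uniform crossover and Gaussian mutation.
        offspring = []
        for _ in range(pop_size):
            a, b = rng.sample(pop, 2)
            child = [ai if rng.random() < 0.5 else bi for ai, bi in zip(a, b)]
            if rng.random() < mutation_rate:
                i = rng.randrange(dim)
                child[i] += rng.gauss(0, 0.5)
            offspring.append(child)
        # 3. Evaluate fitness and keep the best (truncation selection).
        pop = sorted(pop + offspring, key=fitness)[:pop_size]
    return pop[0]

best = evolve(lambda x: sum(xi * xi for xi in x))  # minimize the sphere function
```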
(Figures: generic architectures in which an intelligent paradigm operates on the problem/data alone, is optimized by an evolutionary algorithm, or is optimized by an evolutionary algorithm with error feedback from the output; a further figure shows the evolutionary search of learning rules proceeding on a slow time scale.)
3. Apply genetic operators to each child individual generated above and obtain the
next generation.
4. Check whether the network has achieved the required error rate or the specified
number of generations has been reached. Go to Step 2.
5. End
For the neural network to be fully optimal, the learning rules are to be adapted dynamically according to its architecture and the given problem. Deciding the learning rate and momentum can be considered as a first attempt at learning rules [48]. The basic learning rule can be generalized by the function

$$\Delta w(t) = \sum_{k=1}^{n} \; \sum_{i_1, i_2, \ldots, i_k = 1}^{n} \theta_{i_1, i_2, \ldots, i_k} \prod_{j=1}^{k} x_{i_j}(t-1) \qquad (1)$$
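A direct numerical transcription of Eq. (1) may make the indexing concrete: the weight change is a sum, over all k-tuples of input indices, of a theta-weighted product of the previous activations. The values of n, theta and x below are arbitrary illustrative choices, not anything from the chapter.

```python
# Transcription of Eq. (1): sum over tuple length k = 1..n and over
# all k-tuples (i1,...,ik) of theta_{i1,...,ik} * prod_j x_{i_j}(t-1).
from itertools import product

def delta_w(theta, x_prev):
    n = len(x_prev)
    total = 0.0
    for k in range(1, n + 1):
        for idx in product(range(n), repeat=k):
            coeff = theta.get(idx, 0.0)   # theta_{i1,...,ik}, default 0
            term = 1.0
            for i in idx:                 # product over j = 1..k
                term *= x_prev[i]
            total += coeff * term
    return total

# Example with n = 2 and only two nonzero coefficients (0-based indices):
x = [0.5, 2.0]
theta = {(0,): 1.0, (0, 1): 3.0}
print(delta_w(theta, x))  # 1.0*0.5 + 3.0*(0.5*2.0) = 3.5
```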
learning, in many cases a pre-defined architecture was used, and in a few cases architectures were evolved together. Abraham [2] proposed the meta-learning evolutionary neural network with a tight interaction of the different evolutionary search mechanisms, using the generic framework illustrated in Figure 7.
Cai et al. [11] used a hybrid of Particle Swarm Optimization (PSO) [41], [18]
and EA to train Recurrent Neural Networks (RNNs) for the prediction of
missing values in time series data. Experimental results illustrate that RNNs,
trained by the hybrid algorithm, are able to predict the missing values in
the time series with minimum error, in comparison with those trained with
standard EA and PSO algorithms.
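As a simplified stand-in for such evolutionary training schemes, the sketch below evolves the weights of a fixed 2-2-1 feedforward network on the XOR task with a (mu+lambda)-style strategy. The task, network shape and all settings are illustrative assumptions and not the method of any work cited above.

```python
# Hedged sketch: evolving neural network weights (no gradients).
import math
import random

XOR = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

def forward(w, x):
    # w holds 9 weights: 2x2 hidden weights + 2 hidden biases,
    # 2 output weights + 1 output bias.
    h1 = math.tanh(w[0] * x[0] + w[1] * x[1] + w[2])
    h2 = math.tanh(w[3] * x[0] + w[4] * x[1] + w[5])
    return 1.0 / (1.0 + math.exp(-(w[6] * h1 + w[7] * h2 + w[8])))

def mse(w):
    return sum((forward(w, x) - y) ** 2 for x, y in XOR) / len(XOR)

rng = random.Random(1)
pop = [[rng.gauss(0, 1) for _ in range(9)] for _ in range(30)]
for _ in range(400):
    # Each child is a Gaussian perturbation of a random parent;
    # elitist truncation keeps the 30 best of parents + children.
    children = [[wi + rng.gauss(0, 0.3) for wi in rng.choice(pop)]
                for _ in range(30)]
    pop = sorted(pop + children, key=mse)[:30]
best = pop[0]
```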
Castillo et al. [12] explored several methods that combine evolutionary algorithms and local search to optimize multilayer perceptrons. The authors explored a method that optimizes the architecture and initial weights of multilayer perceptrons, a search algorithm for the training algorithm parameters, and finally a co-evolutionary algorithm that handles the architecture, the network's initial weights and the training algorithm parameters. Experimental results show that the co-evolutionary method obtains similar or better results than the other approaches, requiring far fewer training epochs and thus reducing running time.
Hui [37] proposed a new method for predicting the reliability for repairable
systems using evolutionary neural networks. Genetic algorithms are used to
globally optimize the number of neurons in the hidden layer and learning
parameters of the neural network architecture.
Marwala [51] proposed a Bayesian neural network trained using Markov Chain Monte Carlo (MCMC) and genetic programming in binary space within the Metropolis framework. The proposed algorithm could learn using samples obtained from previous steps, merged using concepts of natural evolution, including mutation, crossover and reproduction. The reproduction function is given by the Metropolis framework, and binary mutation as well as simple crossover are also used.
Kim and Cho [42] proposed an incremental evolution method for neu-
ral networks based on cellular automata and a method of combining several
evolved modules by a rule-based approach. The incremental evolution method
evolves the neural network by starting with simple environment and gradu-
ally making it more complex. The multi-modules integration method can make
complex behaviors by combining several modules evolved or programmed to
do simple behaviors.
Kim [43] explored a genetic algorithm approach to instance selection in
artificial neural networks when the amount of data is very large. GA optimizes
adapting the fuzzy membership functions or by learning the fuzzy if-then rules
[55], [33]. Figure 8 shows the architecture of the adaptive fuzzy control system
wherein the fuzzy membership functions and the rule bases are optimized
using a hybrid global search procedure. An optimal design of an adaptive fuzzy
control system could be achieved by the adaptive evolution of membership
functions and the learning rules that progress on different time scales. Figure 9
illustrates the general interaction mechanism with the global search of fuzzy
rules evolving at the highest level on the slowest time scale. For each fuzzy
rule base, global search of membership functions proceeds at a faster time
scale in an environment decided by the problem.
(Figure: adaptive fuzzy control system. The evolutionary search adapts the fuzzy sets and rule base in the knowledge base, guided by a performance measure, on a slow time scale; the fuzzy controller, comprising the fuzzy sets and if-then rules, acts on the process on a fast time scale.)
The tuning of the scaling parameters and fuzzy membership functions (piece-
wise linear and/or differentiable functions) is an important task in the design
of fuzzy systems and is popularly known as genetic tuning. Evolutionary al-
gorithms could be used to search the optimal shape, number of membership
functions per linguistic variable and the parameters [31]. The genome encodes the parameters of trapezoidal, triangular, logistic, Laplace, hyperbolic-tangent or Gaussian membership functions, etc. Most of the existing methods assume
the existence of a predefined collection of fuzzy membership functions giving
meaning to the linguistic labels contained in the rules (database). Evolution-
ary algorithms are applied to obtain a suitable rule base, using chromosomes
that code single rules or complete rule bases. If prior knowledge of the mem-
bership functions is available, a simplified chromosome representation could
be formulated accordingly.
The first decision a designer has to make is how to represent a solution in a chromosome structure. The first approach is to have the chromosome encode the complete rule base; each chromosome then differs only in the fuzzy rule membership functions as defined in the database. In the second approach, each
chromosome encodes a different database definition based on the fuzzy domain
partitions. The global search for membership functions using evolutionary
algorithm is formulated in Algorithm 4.1.
3. Apply genetic operators to each child individual generated above and obtain the
next generation.
4. Check whether the fuzzy system has achieved the required error rate or the
specified number of generations has been reached. Go to Step 2.
5. End
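A minimal concrete instance of such a search is sketched below: the chromosome encodes the centers of three triangular membership functions together with zero-order Sugeno consequents, and the evolutionary loop above is applied to fit a simple target curve. All specifics (fixed membership-function width, the x² target, rates, population size) are illustrative assumptions.

```python
# Evolutionary tuning of fuzzy membership function parameters.
import random

def tri(x, a, b, c):
    # Triangular membership function with feet a, c and peak b.
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x < b else (c - x) / (c - b)

def fis(params, x):
    # Zero-order Sugeno system: 3 triangular sets (fixed width 1.2,
    # evolvable centers b1..b3) with one consequent y1..y3 each.
    (b1, b2, b3, y1, y2, y3) = params
    mems = [tri(x, b1 - 0.6, b1, b1 + 0.6),
            tri(x, b2 - 0.6, b2, b2 + 0.6),
            tri(x, b3 - 0.6, b3, b3 + 0.6)]
    s = sum(mems)
    if s == 0:
        return 0.0
    return sum(m * y for m, y in zip(mems, (y1, y2, y3))) / s

TARGET = [(i / 20, (i / 20) ** 2) for i in range(21)]  # fit y = x^2

def err(params):
    return sum((fis(params, x) - y) ** 2 for x, y in TARGET)

rng = random.Random(0)
pop = [[rng.uniform(0, 1) for _ in range(6)] for _ in range(30)]
for _ in range(200):
    children = []
    for _ in range(30):
        a, b = rng.sample(pop, 2)
        child = [(ai + bi) / 2 for ai, bi in zip(a, b)]  # arithmetic crossover
        i = rng.randrange(6)
        child[i] += rng.gauss(0, 0.1)                    # mutate one gene
        children.append(child)
    pop = sorted(pop + children, key=err)[:30]           # elitist selection
```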
according to the nature of the problem. The rule base of the fuzzy system may be represented using a relational matrix, a decision table or a set of rules.
In the Pittsburgh approach [71], each chromosome encodes a whole rule set.
Crossover serves to provide a new combination of rules and mutation provides
new rules. The disadvantage is the increased complexity of search space and
additional computational burden especially for online learning. The size of the
genotype depends on the number of input/output variables and fuzzy sets.
In the Michigan approach [35], each genotype represents a single fuzzy rule
and the entire population represents a solution. The fuzzy knowledge base is
adapted as a result of antagonistic roles of competition and cooperation of
fuzzy rules. A classifier rule triggers whenever its condition part matches the
current input, in which case the proposed action is sent to the process to be
controlled. The fuzzy behavior is created by an activation sequence of mutually
collaborating fuzzy rules. In the Michigan approach, techniques for judging
the performance of single rules are necessary.
The Iterative Rule Learning (IRL) approach [27] is similar to the Michi-
gan approach where the chromosomes encode individual rules. In IRL, only
the best individual is considered as the solution, discarding the remaining
chromosomes in the population. The evolutionary algorithm generates new
classifier rules based on the rule strengths acquired during the entire process.
Defuzzification operators and their parameters may also be formulated as an evolutionary search [46], [40], [5].
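The contrast between the Pittsburgh and Michigan encodings can be made concrete with two small data structures. The rule format shown is hypothetical, chosen only to illustrate the difference in granularity.

```python
# Pittsburgh: one individual = a whole rule base.
# Michigan: one individual = one rule; the population is the rule base,
# so per-rule credit assignment (a strength value) is needed.
from dataclasses import dataclass, field
from typing import List, Tuple

Rule = Tuple[str, str]  # hypothetical (antecedent, consequent) pair

@dataclass
class PittsburghChromosome:
    # The population competes between alternative complete rule bases.
    rules: List[Rule] = field(default_factory=list)

@dataclass
class MichiganChromosome:
    # The population as a whole forms a single rule base.
    rule: Rule = ("", "")
    strength: float = 0.0  # credit assigned to this one rule

pitt = PittsburghChromosome(rules=[("error is Negative", "action is Increase"),
                                   ("error is Positive", "action is Decrease")])
michigan_population = [
    MichiganChromosome(("error is Negative", "action is Increase")),
    MichiganChromosome(("error is Positive", "action is Decrease")),
]
```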
Tsang et al. [73] proposed a fuzzy rule-based system for intrusion detection,
which is evolved from an agent-based evolutionary framework and multi-
objective optimization. The proposed system can also act as a genetic feature
selection wrapper to search for an optimal feature subset for dimensionality
reduction.
Edwards et al. [19] modeled the complex export pattern behavior of multi-
national corporation subsidiaries in Malaysia using a Takagi-Sugeno fuzzy
inference system. The proposed fuzzy inference system is optimized by us-
ing neural network learning and evolutionary computation. Empirical results
clearly show that the proposed approach could model the export behavior
reasonably well compared to a direct neural network approach.
Chen et al. [15] proposed an automatic way of evolving hierarchical Takagi-Sugeno Fuzzy Systems (TS-FS). The hierarchical structure is evolved
using Probabilistic Incremental Program Evolution (PIPE) with specific in-
structions. The fine tuning of the if - then rules parameters encoded in the
structure is accomplished using Evolutionary Programming (EP). The pro-
posed method interleaves both PIPE and EP optimizations. Starting with
random structures and rules parameters, it first tries to improve the hierar-
chical structure and then as soon as an improved structure is found, it further
fine tunes the rules parameters. It then goes back to improve the structure
and the rules’ parameters. This loop continues until a satisfactory hierarchical
TS-FS model is found or a time limit is reached.
Pawar and Ganguli [61] developed a Genetic Fuzzy System (GFS) for online structural health monitoring of composite helicopter rotor blades. The authors formulated global and local GFSs. The global GFS is for matrix cracking
and debonding/delamination detection along the whole blade and the local
GFS is for matrix cracking and debonding/delamination detection in various
parts of the blade.
Chua et al. [16] proposed a GA-based fuzzy controller design for tunnel ventilation systems. The Fuzzy Logic Control (FLC) method was utilized due to the complex and nonlinear behavior of the system, and the FLC was optimized using the GA.
Franke et al. [22] presented a genetic-fuzzy system for automatically
generating online scheduling strategies for a complex objective defined by a
machine provider. The scheduling algorithm is based on a rule system, which
classifies all possible scheduling states and assigns a corresponding scheduling
strategy. The authors compared two different approaches. In the first approach, an iterative method is applied that assigns a standard scheduling strategy
to all situation classes. In the second approach, a symbiotic evolution varies
the parameter of Gaussian membership functions to establish the different
situation classes and also assigns the appropriate scheduling strategies.
5 Evolutionary Clustering
Clustering is the act of partitioning an unlabeled dataset into groups of similar objects. Each group, called a 'cluster', consists of objects that are similar to one another and dissimilar to objects of other groups. A
comprehensive review of the state-of-the-art clustering methods can be found
in [76], [64].
Data clustering is broadly based on two approaches: hierarchical and par-
titional. In hierarchical clustering, the output is a tree showing a sequence
of clustering with each cluster being a partition of the data set. Hierarchical
algorithms can be agglomerative (bottom-up) or divisive (top-down). Agglom-
erative algorithms begin with each element as a separate cluster and merge
them in successively larger clusters. Partitional clustering algorithms, on the
other hand, attempt to decompose the data set directly into a set of disjoint
clusters by optimizing certain criteria. The criterion function may emphasize
the local structure of the data, as by assigning clusters to peaks in the prob-
ability density function, or the global structure. Typically, the global criteria
involve minimizing some measure of dissimilarity in the samples within each
cluster, while maximizing the dissimilarity of different clusters. The advan-
tages of the hierarchical algorithms are the disadvantages of the partitional
algorithms and vice versa.
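A minimal sketch of the bottom-up (agglomerative) strategy, using single linkage over 1-D points; the data and the stopping criterion (a target number of clusters) are illustrative choices.

```python
# Single-linkage agglomerative clustering: start with singletons and
# repeatedly merge the closest pair of clusters until k remain.
def agglomerate(points, k):
    clusters = [[p] for p in points]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: distance between the closest members.
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] += clusters.pop(j)   # merge the closest pair
    return clusters

print(agglomerate([0.0, 0.1, 0.2, 5.0, 5.1], 2))
# prints [[0.0, 0.1, 0.2], [5.0, 5.1]]
```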
Clustering can also be performed in two different modes: crisp and fuzzy.
In crisp clustering, the clusters are disjoint and non-overlapping in nature.
Any pattern may belong to one and only one class in this case. In case of
fuzzy clustering, a pattern may belong to all the classes with a certain fuzzy
membership grade.
One of the widely used clustering methods is the fuzzy c-means (FCM)
algorithm developed by Bezdek [9]. FCM partitions a collection of n vectors x_i, i = 1, 2, ..., n, into c fuzzy groups and finds a cluster center in each group such that a cost function of a dissimilarity measure is minimized. To accommodate the introduction of fuzzy partitioning, the membership matrix U is allowed to have elements with values between 0 and 1. The FCM objective function takes the form:

$$J(U, c_1, \ldots, c_c) = \sum_{i=1}^{c} J_i = \sum_{i=1}^{c} \sum_{j=1}^{n} u_{ij}^{m} \, d_{ij}^{2} \qquad (4)$$

where u_ij is a numerical value between 0 and 1; c_i is the cluster center of fuzzy group i; d_ij = ||c_i − x_j|| is the Euclidean distance between the ith cluster center and the jth data point; and m is called the exponential weight, which influences the degree of fuzziness of the membership (partition) matrix.
Usually a number of cluster centers are randomly initialized, and the FCM algorithm provides an iterative approach to approximate the minimum of the objective function starting from a given position; it may converge to any of its local minima [3]. There is no guarantee that FCM converges to an optimum solution (it can be trapped by local extrema in the process of optimizing the clustering criterion), and its performance is very sensitive to the initialization of the cluster centers.
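The alternating update scheme can be sketched directly from Eq. (4): memberships are recomputed from the current centers, then each center becomes the u^m-weighted mean of the data. The 1-D data, c = 2 and m = 2 below are illustrative choices, and the initialization is deliberately naive.

```python
# Minimal fuzzy c-means iteration for 1-D data.
def fcm(data, c=2, m=2.0, iters=50):
    centers = list(data[:c])   # naive initialization
    for _ in range(iters):
        # Membership update: u_ij inversely related to d_ij, per the
        # standard FCM update derived from Eq. (4).
        u = []
        for x in data:
            d = [abs(x - ci) for ci in centers]
            if any(di == 0.0 for di in d):  # point coincides with a center
                u.append([1.0 if di == 0.0 else 0.0 for di in d])
                continue
            u.append([1.0 / sum((d[i] / d[k]) ** (2.0 / (m - 1.0))
                                for k in range(c))
                      for i in range(c)])
        # Center update: u^m-weighted mean of the data.
        centers = [sum(u[j][i] ** m * data[j] for j in range(len(data)))
                   / sum(u[j][i] ** m for j in range(len(data)))
                   for i in range(c)]
    return centers, u

centers, u = fcm([0.0, 0.2, 0.4, 9.6, 9.8, 10.0])
```

For this well-separated data the two centers settle near the two groups, with memberships close to 0 or 1 for points far from the opposite group.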
Research efforts have made it possible to view data clustering as an opti-
mization problem. This view offers us a chance to apply EA for evolving the
optimal number of clusters and their cluster centers. The algorithm is initial-
ized by constraining the initial values to be within the space defined by the
vectors to be clustered. An important advantage of the EA is its ability to
cope with local optima by maintaining, recombining and comparing several
candidate solutions simultaneously.
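Viewing clustering as optimization can be sketched as follows: a chromosome is a flat list of candidate centers, the fitness is the within-cluster sum of squared distances (a crisp stand-in for the fuzzy objective of Eq. (4)), and the initial centers are constrained to the space spanned by the data, as suggested above. The data and the EA settings are illustrative.

```python
# Evolving cluster centers with a simple elitist EA.
import random

DATA = [0.0, 0.2, 0.4, 9.6, 9.8, 10.0]

def fitness(centers):
    # Within-cluster sum of squared distances to the nearest center.
    return sum(min((x - c) ** 2 for c in centers) for x in DATA)

rng = random.Random(0)
lo, hi = min(DATA), max(DATA)
# Constrain the initial population to the span of the data.
pop = [[rng.uniform(lo, hi) for _ in range(2)] for _ in range(20)]
for _ in range(100):
    children = [[c + rng.gauss(0, 0.5) for c in rng.choice(pop)]
                for _ in range(20)]
    pop = sorted(pop + children, key=fitness)[:20]
best = sorted(pop[0])
```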
Abraham [3] proposed the concurrent architecture of a fuzzy clustering
algorithm (to discover data clusters) and a fuzzy inference system for Web
usage mining. A hybrid evolutionary FCM approach is proposed in this paper
to optimally segregate similar user interests. The clustered data is then used
to analyze the trends using a Takagi-Sugeno fuzzy inference system learned
using a combination of evolutionary algorithm and neural network learning.
combines fuzzy polynomial neurons (FPNs) [57] that are located at the first
layer of the network with polynomial neurons (PNs) forming the remaining
layers of the network. The GA-based design procedure being applied at each
layer of HSOFPNN leads to the selection of preferred nodes of the network
(FPNs or PNs) whose local characteristics (such as the number of input vari-
ables, the order of the polynomial, a collection of the specific subset of input
variables, the number of membership functions for each input variable, and
the type of membership function) can be easily adjusted.
Juang and Chung [39] proposed a recurrent Takagi-Sugeno-Kang (TSK) fuzzy network design using the hybridization of a multi-group genetic algorithm and particle swarm optimization (R-MGAPSO). Both the number of
fuzzy rules and the parameters in a TRFN are designed simultaneously by
R-MGAPSO. In R-MGAPSO, the techniques of variable-length individuals
and the local version of particle swarm optimization are incorporated into a
genetic algorithm, where individuals with the same length constitute the same
group, and there are multigroups in a population.
Aouiti et al. [6] proposed an evolutionary method for the design of beta ba-
sis function neural networks (BBFNN) and beta fuzzy systems (BFS). Authors
used a hierarchical genetic learning model of the BBFNN and the BFS.
Chen et al. [14] introduced a new time-series forecasting model based on
the flexible neural tree (FNT). The FNT model is generated initially as a flexi-
ble multi-layer feed-forward neural network and evolved using an evolutionary
procedure. FNT model could also select the appropriate input variables or
time-lags for constructing a time-series model.
them were kept as variables, and a Pareto front was effectively constructed
by minimizing the training error along with the network size.
8 Conclusions
This Chapter presented the various architectures for designing intelligent
paradigms using evolutionary algorithms. The main focus was on designing
evolutionary neural networks and evolutionary fuzzy systems. We also illus-
trated some of the recent generic evolutionary design architectures reported
in the literature including fuzzy neural networks and multiobjective design
strategies.
References
1. Abraham A, Grosan C, Han SY, Gelbukh A (2005) Evolutionary multiobjective
optimization approach for evolving ensemble of intelligent paradigms for stock
market modeling. In: Alexander Gelbukh et al. (eds.) 4th Mexican international
conference on artificial intelligence, Mexico, Lecture notes in computer science,
Springer, Berlin Heidelberg New York, pp 673–681
2. Abraham A (2004) Meta-learning evolutionary artificial neural networks.
Neurocomput J 56c:1–38
3. Abraham A (2003) i-Miner: A Web Usage Mining Framework Using Hierarchi-
cal Intelligent Systems, The IEEE International Conference on Fuzzy Systems,
FUZZ-IEEE’03, IEEE Press, ISBN 0780378113, pp 1129–1134
4. Abraham A, Ramos V (2003) Web usage mining using artificial ant colony
clustering and genetic programming. In: 2003 IEEE Congress on Evolutionary
Computation (CEC 2003), Australia, IEEE Press, ISBN 0780378040, pp 1384–
1391
5. Abraham A (2003), EvoNF: A Framework for Optimization of Fuzzy Infer-
ence Systems Using Neural Network Learning and Evolutionary Computation,
The 17th IEEE International Symposium on Intelligent Control, ISIC’02, IEEE
Press, ISBN 0780376218, pp 327–332
6. Aouiti C, Alimi AM, Karray F, Maalej A (2005) The design of beta basis func-
tion neural network and beta fuzzy systems by a hierarchical genetic algorithm.
Fuzzy Sets Syst 154(2):251–274
7. Bhattacharya A, Abraham A, Vasant P, Grosan C (2007) Meta-learning evo-
lutionary artificial neural network for selecting flexible manufacturing systems
under disparate level-of-satisfaction of decision maker. Int J Innovative Comput
Inf Control 3(1):131–140
8. Baxter J (1992) The evolution of learning algorithms for artificial neural
networks, Complex systems, IOS, Amsterdam, pp 313–326
9. Bezdek JC (1981) Pattern recognition with fuzzy objective function algorithms.
Plenum, New York
10. Boers EJW, Borst MV, Sprinkhuizen-Kuyper IG (1995) Artificial neural nets
and genetic algorithms. In: Pearson DW et al. (eds.) Proceedings of the
international conference in Ales, France, Springer, Berlin Heidelberg New York,
pp 333–336
Engineering Evolutionary Intelligent Systems 19
11. Cai X, Zhang N, Venayagamoorthy GK, Wunsch II DC (2007) Time series pre-
diction with recurrent neural networks trained by a hybrid PSOEA algorithm.
Neurocomputing 70(13–15):2342–2353
12. Castillo PA, Merelo JJ, Arenas MG, Romero G (2007) Comparing evolutionary
hybrid systems for design and optimization of multilayer perceptron structure
along training parameters. Inf Sci 177(14):2884–2905
13. Capi G, Doya K (2005), Evolution of recurrent neural controllers using an
extended parallel genetic algorithm. Rob Auton Syst 52(2–3):148–159
14. Chen Y, Yang B, Dong J, Abraham A (2005) Time-series forecasting using
flexible neural tree model. Inf Sci 174(3–4):219–235
15. Chen Y, Yang B, Abraham A, Peng L (2007) Automatic design of hierarchical
takagi-sugeno fuzzy systems using evolutionary algorithms. IEEE Trans Fuzzy
Syst 15(3):385–397
16. Chu B, Kim D, Hong D, Park J, Chung JT, Chung JH, Kim TH (2008) GA-based
fuzzy controller design for tunnel ventilation systems, Journal of Automation in
Construction, 17(2):130–136
17. Delgado M, Pegalajar MC (2005) A multiobjective genetic algorithm for obtain-
ing the optimal size of a recurrent neural network for grammatical inference.
Pattern Recognit 38(9):1444–1456
18. Eberhart RC, Kennedy J (1995) A new optimizer using particle swarm theory.
In: Proceedings of the 6th International Symposium on Micro Machine and Human
Science, Nagoya, Japan, IEEE Service Center, Piscataway, NJ, pp 39–43
19. Edwards R, Abraham A, Petrovic-Lazarevic S (2005) Computational intelli-
gence to model the export behaviour of multinational corporation subsidiaries
in Malaysia. Int J Am Soc Inf Sci Technol (JASIST) 56(11):1177–1186
20. Fogel LJ, Owens AJ, Walsh MJ (1966) Artificial intelligence through simulated
evolution. Wiley, USA
21. Fontanari JF, Meir R (1991) Evolving a learning algorithm for the binary
perceptron, Network, vol. 2, pp 353–359
22. Franke C, Hoffmann F, Lepping J, Schwiegelshohn U (2008) Development of
scheduling strategies with Genetic Fuzzy systems, Applied Soft Computing
Journal, 8(1):706–721
23. Frean M (1990), The upstart algorithm: a method for constructing and training
feed forward neural networks. Neural Comput 2:198–209
24. Fullmer B, Miikkulainen R (1992) Using marker-based genetic encoding of neu-
ral networks to evolve finite-state behaviour. In: Varela FJ, Bourgine P (eds.)
Proceedings of the first European conference on artificial life, France, pp 255–262
25. García-Pedrajas N, Hervás-Martínez C, Muñoz-Pérez J (2002) Multi-objective
cooperative coevolution of artificial neural networks (multi-objective cooperative
networks). Neural Netw 15(10):1259–1278
26. Grau F (1992) Genetic synthesis of boolean neural networks with a cell rewriting
developmental process. In: Whitely D, Schaffer JD (eds.) Proceedings of
the international workshop on combinations of genetic algorithms and neural
Networks, IEEE Computer Society Press, CA, pp 55–74
27. Gonzalez A, Herrera F (1997) Multi-stage genetic fuzzy systems based on the
iterative rule learning approach. Mathware Soft Comput 4(3)
28. Grosan C, Abraham A, Nicoara M (2005) Search optimization using hybrid
particle sub-swarms and evolutionary algorithms. Int J Simul Syst, Sci Technol
UK 6(10–11):60–79
20 A. Abraham and C. Grosan
29. Gutierrez G, Isasi P, Molina JM, Sanchis A, Galvan IM (2001) Evolutionary cel-
lular configurations for designing feedforward neural network architectures, con-
nectionist models of neurons. In: Jose Mira et al. (eds.) Learning processes, and
artificial intelligence, Springer, Berlin Heidelberg New York, LNCS 2084, pp
514–521
30. Harp SA, Samad T, Guha A (1989) Towards the genetic synthesis of neural
networks. In: Schaffer JD (ed.) Proceedings of the third international conference
on genetic algorithms and their applications, Morgan Kaufmann, CA, pp 360–
369
31. Herrera F, Lozano M, Verdegay JL (1995) Tuning fuzzy logic controllers by
genetic algorithms. Int J Approximate Reasoning 12:299–315
32. Herrera F, Lozano M, Verdegay JL (1995) Tackling fuzzy genetic algorithms.
In: Winter G, Periaux J, Galan M, Cuesta P (eds.) Genetic algorithms in
engineering and computer science, Wiley, USA, pp 167–189
33. Hoffmann F (1999) The Role of Fuzzy Logic in Evolutionary Robotics. In:
Saffiotti A, Driankov D (ed.) Fuzzy logic techniques for autonomous vehicle
navigation, Springer, Berlin Heidelberg New York
34. Holland JH (1975) Adaptation in Natural and Artificial Systems, The University
of Michigan Press, Ann Arbor, MI
35. Holland JH, Reitman JS (1978), Cognitive systems based on adaptive al-
gorithms. In: Waterman DA, Hayes-Roth F (eds.) Pattern-directed inference
systems. Academic, San Diego, CA
36. Holland, JH (1980) Adaptive algorithms for discovering and using general
patterns in growing knowledge bases. Int J Policy Anal Inf Sys 4(3):245–268
37. Hui LY (2007) Evolutionary neural network modeling for forecasting the field
failure data of repairable systems. Expert Syst Appl 33(4):1090–1096
38. Ishibuchi H, Nojima Y (2007) Analysis of interpretability-accuracy tradeoff of
fuzzy systems by multiobjective fuzzy genetics-based machine learning. Int J
Approximate Reason 44(1):4–31
39. Juang CF, Chung IF (2007) Recurrent fuzzy network design using hybrid
evolutionary learning algorithms, Neurocomputing 70(16–18):3001–3010
40. Jin Y, von Seelen W (1999) Evaluating flexible fuzzy controllers via evolution
strategies. Fuzzy Sets Syst 108(3):243–252
41. Kennedy J, Eberhart RC (1995). Particle swarm optimization. In: Proceedings of
IEEE International Conference on Neural Networks, Perth, Australia, pp 1942–
1948
42. Kim KJ, Cho SB (2006) Evolved neural networks based on cellular automata
for sensory-motor controller. Neurocomputing 69(16–18):2193–2207
43. Kim KJ (2006) Artificial neural networks with evolutionary instance selection
for financial forecasting. Expert Syst Appl 30(3):519–526
44. Kitano H (1990) Designing neural networks using genetic algorithms with graph
generation system. Complex Syst 4(4):461–476
45. Koza JR (1992) Genetic programming: on the programming of computers by
means of natural selection, MIT, Cambridge, MA
46. Kosinski W (2007) Evolutionary algorithm determining defuzzyfication opera-
tors. Eng Appl Artif Intell 20(5):619–627
47. Kelesoglu O (2007) Fuzzy multiobjective optimization of truss-structures using
genetic algorithm. Adv Eng Softw 38(10):717–721
67. Storn R, Price K (1997) Differential evolution – a simple and efficient adap-
tive scheme for global optimization over continuous spaces. J Global Optim
11(4):341–359
68. Rechenberg I, (1973) Evolutions strategie: optimierung technischer Systeme
nach Prinzipien der biologischen Evolution, Fromman-Holzboog, Stuttgart
69. Schwefel HP (1977) Numerische Optimierung von Computermodellen mittels
der Evolutionsstrategie, Birkhaeuser, Basel
70. Serra GLO, Bottura CP (2006) Multiobjective evolution based fuzzy PI con-
troller design for nonlinear systems. Eng Appl Artif Intell 19(2):157–167
71. Smith SF (1980) A learning system based on genetic adaptive algorithms. PhD
thesis, University of Pittsburgh
72. Takagi T, Sugeno M (1983) Derivation of fuzzy logic control rules from human
operator's control actions. In: Proceedings of the IFAC symposium on fuzzy
information representation and decision analysis, pp 55–60
73. Tsang CH, Kwong S, Wang A (2007) Genetic-fuzzy rule mining approach
and evaluation of feature selection techniques for anomaly intrusion detection.
Pattern Recognit 40(9):2373–2391
74. Wang H, Kwong S, Jin Y, Wei W, Man K (2005) A multi-objective hierarchical
genetic algorithm for interpretable rule-based knowledge extraction. Fuzzy Sets
Syst 149(1):149–186
75. Wolpert DH, Macready WG (1997) No free lunch theorems for optimization.
IEEE Trans Evol Comput 1(1):67–82
76. Xu R, Wunsch D, (2005) Survey of clustering algorithms. IEEE Trans Neural
Netw 16(3):645–678
77. Yao X (1999) Evolving artificial neural networks. Proc IEEE 87(9):1423–1447
Genetically Optimized Hybrid Fuzzy Neural
Networks: Analysis and Design of Rule-based
Multi-layer Perceptron Architectures
1 Introductory remarks
Recently, a lot of attention has been devoted to advanced techniques of
modeling complex systems inherently associated with nonlinearity, high-
order dynamics, time-varying behavior, and imprecise measurements. It is
anticipated that efficient modeling techniques should allow for a selection
of pertinent variables and in this way help cope with the dimensionality of the
problem at hand. The models should be able to take advantage of the existing
S.-K. Oh and W. Pedrycz: Genetically Optimized Hybrid Fuzzy Neural Networks: Analysis
and Design of Rule-based Multi-layer Perceptron Architectures, Studies in Computational
Intelligence (SCI) 82, 23–57 (2008)
www.springerlink.com
c Springer-Verlag Berlin Heidelberg 2008
the PD of each node has a different type in comparison with zi of the 1st layer.
The “NOP” node states that the corresponding node of the current layer is the
same as the node positioned in the previous layer (NOP stands for “no
operation”). An arrow to the NOP node is used to show that the same node
moves from the previous layer to the current one.
We consider two kinds of FNNs (viz. FS FNN and FR FNN) based on two
types of fuzzy inference, namely simplified and linear fuzzy inference. The
structure of the FNN is the same as that used in the premise of the conventional
HFNN. The FNN is designed by using space partitioning realized in terms of
the individual input variables or an ensemble of all variables. Each of its
topologies is concerned with a granulation carried out in terms of fuzzy sets
defined in each input variable, or fuzzy relations that capture an ensemble of
input variables, respectively. The fuzzy partitions formed for each case lead us
to the topologies visualized in Figs. 4–5.
The notation in these figures requires some clarification. The “circles” denote
units of the FNN, while “N” identifies a normalization procedure applied
to the membership grades of the input variable xi. The output fi(xi) of the
“Σ” neuron is described by some nonlinear function fi; fi is not necessarily
the sigmoid function encountered in conventional neural networks, as we allow
for more flexibility in this regard. Finally, in the case of FS FNN, the output y
of the FNN is governed by the following expression:
y = f1(x1) + f2(x2) + ··· + fm(xm) = Σ_{i=1}^{m} fi(xi)   (1)
with m being the number of input variables (viz. the number of outputs fi of
the “Σ” neurons in the network). As previously mentioned, FS FNN is
affected by the introduced fuzzy partition of each input variable. In this sense,
we can regard each fi as being given by the fuzzy rules shown in Table 2(a).
Table 2(a) compares the fuzzy rules, inference result, and learning for the
two types of FNNs. In Table 2(a), Rj is the j-th fuzzy rule, while Aij
denotes a fuzzy variable of the premise of the corresponding fuzzy rule and
represents the membership function µij. In the simplified fuzzy inference, ωij is
a constant consequence of the rules; in the linear fuzzy inference, ωsij
is a constant consequence and ωij is an input-variable consequence of the
rules. They express a connection (weight) existing between the neurons, as we
have already visualized in Fig. 4. The mapping from xi to fi(xi) is determined
by the fuzzy inferences and a standard defuzzification. The inference result
for individual fuzzy rules follows a standard center-of-gravity aggregation. An
input signal xi activates only two membership functions, so inference results
can be written as outlined in Table 2(a) [23,24]. The learning of the FNN is
realized by adjusting the connections of the neurons and as such follows a
standard Back-Propagation (BP) algorithm [23,24]. The complete update
formulas are covered in Table 2(a), where η is a positive learning rate and α
is a positive momentum coefficient. The case of FR FNN, see Table 2(b), is
carried out in the same manner as outlined in Table 2(a) (the case of FS FNN).
The task of optimizing any model involves two main phases. First, a class
of some optimization algorithms has to be chosen so that it meets the re-
quirements implied by the problem at hand. Secondly, various parameters of
the optimization algorithm need to be tuned in order to achieve its best per-
formance. Along this line, genetic algorithms (GAs) viewed as optimization
techniques based on the principles of natural evolution are worth consider-
ing. GAs have been experimentally demonstrated to provide robust search
capabilities in problems involving complex spaces thus offering a valid solu-
tion to problems requiring efficient searching. It is instructive to highlight the
main features that tell GAs apart from other optimization methods: (1)
GAs operate on the codes of the variables, not on the variables themselves.
(2) GAs search for optimal points starting from a group (population) of points
in the search space (potential solutions), rather than from a single point. (3) GA's
Table 2. (Continued)

Learning (FS FNN):
  simplified:  ∆ωij = 2·η·(yp − ŷp)·µij + α(ωij(t) − ωij(t − 1))
  linear:      ∆ωsij = 2·η·(yp − ŷp)·µij(xi) + α(ωsij(t) − ωsij(t − 1))
               ∆ωij = 2·η·(yp − ŷp)·µij(xi)·xi + α(ωij(t) − ωij(t − 1))

Inference result (FR FNN):
  simplified:  y = Σ_{i=1}^{n} fi = Σ_{i=1}^{n} µ̄i·ωi = Σ_{i=1}^{n} µi·ωi / Σ_{i=1}^{n} µi
  linear:      y = Σ_{i=1}^{n} fi = Σ_{i=1}^{n} µ̄i·(ω0i + ω1i·x1 + ··· + ωki·xk)
                 = Σ_{i=1}^{n} µi·(ω0i + ω1i·x1 + ··· + ωki·xk) / Σ_{i=1}^{n} µi

Learning (FR FNN):
  simplified:  ∆ωi = 2·η·(y − ŷ)·µ̄i + α(ωi(t) − ωi(t − 1))
  linear:      ∆ω0i = 2·η·(y − ŷ)·µ̄i + α(ω0i(t) − ω0i(t − 1))
               ∆ωki = 2·η·(y − ŷ)·µ̄i·xk + α(ωki(t) − ωki(t − 1))
search is directed only by some fitness function whose form could be quite
complex; we do not require its differentiability [8–10].
In order to enhance the learning of the FNN, we use GAs to adjust the learning
rate, the momentum coefficient, and the parameters of the membership functions
of the antecedents of the rules [19,20,34,35].
As underlined, the PNN algorithm is based upon the GMDH method [22] and
utilizes a class of polynomials (linear, quadratic, modified quadratic,
etc.) to describe the basic processing realized there. By choosing the most
significant input variables and an order of the polynomial among the various
forms available, we can obtain the best one; it comes under the name of a
partial description (PD). The network is realized by selecting nodes at each
layer and eventually generating additional layers until the best performance
has been reached. Such a methodology leads to an optimal PNN structure [14,15].
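As a sketch of what a partial description is, the following hypothetical example fits a quadratic PD of two inputs by least squares, in the spirit of GMDH. The specific polynomial form and all function names are our assumptions for illustration, not the authors' code.

```python
import numpy as np

def quadratic_pd_features(x1, x2):
    """Feature map of a quadratic partial description (PD) of two inputs:
    z = c0 + c1*x1 + c2*x2 + c3*x1^2 + c4*x2^2 + c5*x1*x2."""
    return np.column_stack([np.ones_like(x1), x1, x2, x1**2, x2**2, x1 * x2])

def fit_pd(x1, x2, y):
    """Least-squares estimate of the PD coefficients (GMDH-style)."""
    A = quadratic_pd_features(x1, x2)
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def eval_pd(coef, x1, x2):
    """Evaluate a fitted PD on new inputs."""
    return quadratic_pd_features(x1, x2) @ coef
```

In a PNN, many such PDs would be fitted over different input pairs and orders, and the best-performing ones retained as nodes of the next layer.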
In addressing the problems with the conventional PNN (see Fig. 6), we
introduce a new genetic design approach; in turn, we will be referring to these
networks as genetically optimized PNNs (“gPNN”). When we construct the
PNs of each layer in the conventional PNN, such parameters as the number of
input variables (nodes), the order of the polynomial, and the input variables
available within a PN are fixed (selected) in advance by the designer. This has
frequently contributed to the difficulties in the design of the optimal network.
To overcome this apparent drawback, we resort to genetic optimization; see
Figs. 8–9 of the next section for a more detailed flow of the development
activities.
The overall genetically-driven structural optimization process of PNN is
shown in Fig. 7. The determination of the optimal values of the parame-
ters available within an individual PN (viz. the number of input variables,
Fig. 7. Overall genetically-driven structural optimization process of PNN (E: entire inputs, S: selected PNs, zi: preferred outputs in the ith stage, zi = z1i, z2i, ..., zWi)
the order of the polynomial, and input variables) leads to a structurally and
parametrically optimized network. As a result, this network is more flexible
and exhibits a simpler topology in comparison to the conventional PNN
discussed in the previous research [14,15].
For the optimization of the PNN model, the GA uses binary serial coding,
roulette-wheel selection, one-point crossover, and binary inversion
(complementation) as the mutation operator. To retain the best individual
and carry it over to the next generation, we use an elitist strategy [8,9].
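The genetic operators just listed can be sketched as follows. This is a generic illustration of a binary GA (maximizing an arbitrary fitness over bit strings), not the authors' code; the population size, mutation probability pm, and fitness function are placeholders.

```python
import random

def roulette_select(pop, fitness):
    """Roulette-wheel selection: pick an individual with probability
    proportional to its fitness."""
    total = sum(fitness)
    r = random.uniform(0.0, total)
    acc = 0.0
    for ind, f in zip(pop, fitness):
        acc += f
        if acc >= r:
            return ind
    return pop[-1]

def one_point_crossover(a, b):
    """One-point crossover: swap tails of the two parents at a random cut."""
    cut = random.randrange(1, len(a))
    return a[:cut] + b[cut:], b[:cut] + a[cut:]

def invert_mutation(bits, pm=0.05):
    """Binary inversion (complementation) of each gene with probability pm."""
    return [1 - g if random.random() < pm else g for g in bits]

def evolve(pop, fit_fn, generations=50):
    """Elitist GA loop: the best individual is copied unchanged into the
    next generation, the rest are bred by selection/crossover/mutation."""
    for _ in range(generations):
        fitness = [fit_fn(ind) for ind in pop]
        children = [max(pop, key=fit_fn)]      # elitist strategy
        while len(children) < len(pop):
            p1 = roulette_select(pop, fitness)
            p2 = roulette_select(pop, fitness)
            c1, c2 = one_point_crossover(p1, p2)
            children += [invert_mutation(c1), invert_mutation(c2)]
        pop = children[:len(pop)]
    return max(pop, key=fit_fn)
```

Because of elitism, the best fitness in the population is non-decreasing from one generation to the next.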
Figs. 8–9. Overall design flowchart of the gHFNN (S: selected PNs, zi: outputs of the ith layer, xj: input variables of the jth layer, j = i + 1). The flow starts with the FNN premise (normalization of activations and fuzzy inference producing the output of FS_FNN/FR_FNN), selects a connection point (CP1 or CP2), and then iterates the gPNN consequence: PNs are generated from chromosomes in the population, evaluated (fitness), and W of them are selected through reproduction (roulette-wheel selection, one-point crossover, invert mutation) with an elitist strategy; the outputs of the preserved PNs serve as new inputs to the next layer until a stop condition is met. The final gHFNN is organized by the FS_FNN and the layers with optimal PNs.
[Layer 1] Input layer: The role of this layer is to distribute the signals to the
nodes in the next layer.
[Layer 2] Computing activation degrees of linguistic labels: Each node in
this layer corresponds to one linguistic label (small, large, etc.) of the input
variables in layer 1. The layer determines a degree of satisfaction (activation)
of this label by the input.
[Layer 3] Normalization of a degree of activation (firing) of the rule: As
described, a degree of activation of each rule was calculated in layer 2. In this
layer, we normalize the activation level by using the following expression:
µ̄ij = µij / Σ_{j=1}^{n} µij = µik / (µik + µik+1)   (2)
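A minimal sketch of the layer-3 normalization of (2), assuming the membership grades of one input variable are supplied as a list (function name ours):

```python
def normalize_activations(mu):
    """Layer-3 normalization of eq. (2): each membership grade is divided by
    the sum over all fuzzy sets of the same input variable. When only two
    neighboring sets k and k+1 are activated, this reduces to
    mu_k / (mu_k + mu_{k+1})."""
    s = sum(mu)
    return [m / s for m in mu]
```

The normalized grades always sum to one, which is what allows the defuzzification in the next layer to act as a weighted average.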
Fig. 10. Connection points used for combining FS FNN (Simplified) with gPNN
(Figure: genetic design of a PN — the binary chromosome is decoded into decimal substrings and normalized to determine the number of input variables r (up to the allowed maximum), the order of the polynomial (Type 1–Type 3), and the selection of the input variables themselves (from 1 to n, or W).)
model is expanded. The outputs of the preserved nodes (z1, z2, ..., zW) serve
as new inputs to the next layer (x1, x2, ..., xW). Repeating steps 3–10 carries
out the construction of the gPNN.
5 Experimental studies
In this section, the performance of the gHFNN is illustrated with the aid
of some well-known and widely used datasets. In the first experiment, the
network is used to model a three-input nonlinear function [1,13,19,25,28,29]. In
the second simulation, a gHFNN is used to model the gas furnace time series
(Box–Jenkins data) [2–6,26,30–35]. Finally, we use the gHFNN for the NOx
emission process of a gas turbine power plant [27,29,32]. The performance
indexes (objective functions) used here are (9) for the three-input nonlinear
function and (8) for both the gas furnace process and the NOx emission process.
i) Mean Squared Error (MSE)

E(PI or EPI) = (1/n) Σ_{p=1}^{n} (yp − ŷp)²   (8)

E(PI or EPI) = (1/n) Σ_{p=1}^{n} |yp − ŷp| / yp × 100 (%)   (9)
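The two performance indexes are straightforward to compute; a small sketch assuming NumPy, with our own function names:

```python
import numpy as np

def mse_index(y, y_hat):
    """Eq. (8): E = (1/n) * sum_p (y_p - y_hat_p)^2."""
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    return float(np.mean((y - y_hat) ** 2))

def relative_error_index(y, y_hat):
    """Eq. (9): E = (1/n) * sum_p |y_p - y_hat_p| / y_p * 100 (%).
    Assumes all targets y_p are nonzero."""
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    return float(np.mean(np.abs(y - y_hat) / y) * 100.0)
```

Evaluated on the training set these give PI, and on the testing set EPI, matching the usage in the experiments.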
by the values of PI equal to 0.299 and EPI given as 0.555, whereas under
the condition of similar performance, the best results for the proposed
network related to the output node mentioned previously were reported as
PI = 0.299 and EPI = 0.517. In the sequel, the depth (the number of layers)
as well as the width (the number of nodes) of the proposed genetically
optimized HFNN (gHFNN) can be lower in comparison to the “conventional
HFNN” (which immensely contributes to the compactness of the resulting
network); refer to Fig. 13. In what follows, the genetic design procedure at
each stage (layer) of the HFNN leads to the selection of the preferred nodes (or PNs)
Fig. 13. Comparison of the proposed model architecture (gHFNN) and the
conventional model architecture (SOFPNN [19])
Fig. 14. Optimal topology of genetically optimized HFNN for the nonlinear function
(In case of using FS FNN)
with optimal local characteristics (such as the number of input variables, the
order of the polynomial, and input variables). In addition, when considering
linear fuzzy inference and CP2, the best results are reported in the form of
the performance index such as PI = 0.113 and EPI = 0.299 for layer 2, and
PI = 0.0092 and EPI = 0.056 for layer 5. Their optimal topologies are shown
in Figs. 14 (a) and (b). Figs. 15(a) and (b) depict the optimization process by
showing the values of the performance index in successive cycles of both BP
learning and genetic optimization when using each linear fuzzy inference-based
FNN. Noticeably, the variation ratio (slope) of the performance of the network
changes radically around the 1st and 2nd layer. Therefore, to effectively reduce
Fig. 15. Values of the performance index (PI, E_PI) in successive iterations of BP learning (premise part: FNN) and generations of genetic optimization (consequence part: gPNN), layer by layer: (a) in case of using FS FNN (linear) with Type I and Type II; (b) in case of using FR FNN (linear)
The NOx emission process is modeled using data coming from a GE gas
turbine power plant located in Virginia, USA. The input variables include
AT (ambient temperature at site), CS (compressor speed), LPTS (low pressure
turbine speed), CDP (compressor discharge pressure), and TET (turbine
exhaust temperature). The output variable is NOx [27,29,32]. The performance
index is defined by (8). We consider 260 pairs of the original input–output
data; 130 out of the 260 pairs are used as the learning set, and the remaining
part serves as a
testing set. Using the NOx emission process data, the regression model is

y = −163.77341 − 0.06709 x1 + 0.00322 x2 + 0.00235 x3 + 0.26365 x4 + 0.20893 x5   (11)

and it comes with PI = 17.68 and EPI = 19.23. We will be using these results
as a reference point when discussing the gHFNN models. Table 10 summarizes
Table 10. Summary of the parameters of the optimization environment
(a) In case of using FS FNN
Generation 150
Population size 100
GAs Elite population size 50
Premise structure 10 (per one variable)
String length Consequence CP 1 3 + 3 + 70
structure CP 2 3 + 3 + 35
No. of entire system inputs 5
Learning iteration 1000
Premise Simplified 0.052
Learning rate tuned
(FNN) Linear 0.034
Momentum Simplified 0.010
Coefficient tuned Linear 0.001
gHFNN No. of rules 10(2 + 2 + 2 + 2 + 2)
CP 1 10
No. of system inputs
CP 2 5
Consequence Maximal layer 5
(gPNN) No. of inputs to be CP 1 1 ≤ N ≤ 10 (Max)
selected (N) CP 2 1 ≤ N ≤ 5 (Max)
Type (T) 1≤T≤3
N, T : integer
(b) In case of using FR FNN and Type II
Generation 150
Population size 100
GAs Elite population size (W) 50
Premise structure (FNN) 10 (per one variable)
String length
Consequence structure (PNN) 3 + 3 + 105
No. of entire system inputs 5
Learning iteration 1000
Premise Simplified 0.568
Learning rate tuned
Linear 0.651
Momentum Simplified 0.044
Coefficient tuned Linear 0.064
gHFNN No. of rules 32(2 × 2 × 2 × 2 × 2)
No. of entire inputs 32
Consequence Maximal layer 5
(gPNN) No. of inputs to be selected (N) 1 ≤ N ≤ 15 (Max)
Type (T) 1≤T≤3
N, T : integer
Fig. 18. Optimal procedure of gHFNN with FS FNN (linear) by BP and GAs
1 with Type 1 (linear function) and 9 nodes at the input; this network comes
with the value of PI equal to 0.0045 and EPI set as 0.026. In case of using
FR FNN and simplified fuzzy inference, the most preferred network architecture
has been reported with PI = 0.041 and EPI = 0.111. As shown in Table 11 and
Fig. 18, the variation ratio (slope) of the performance of the network changes
radically at the 2nd layer. Therefore, to effectively reduce a large number of
nodes and avoid a large amount of time-consuming iterations of the gHFNN,
the stopping criterion can be taken into consideration up to maximally the 2nd
layer.
Table 12 covers a comparative analysis including several previous fuzzy-
neural network models. The experimental results clearly reveal that the
proposed approach and the resulting model outperform the existing networks
in terms of both approximation and generalization capabilities.
6 Concluding remarks
7 Acknowledgement
This work has been supported by KESRI (I-2004-0-074-0-00), which is funded
by MOCIE (Ministry of Commerce, Industry and Energy).
References
1. Kang G, Sugeno M (1987) Fuzzy modeling. Trans Soc Instrum Control Eng
23(6):106–108
2. Oh SK, Pedrycz W (2000) Fuzzy identification by means of auto-tuning algo-
rithm and its application to nonlinear systems. Fuzzy Sets Syst 115(2):205–230
3. Park BJ, Pedrycz W, Oh SK (2001) Identification of fuzzy models with the aid of
evolutionary data granulation. IEE Proc-Control Theory Appl 148(5):406–418
4. Oh SK, Pedrycz W, Park BJ (2002) Hybrid identification of fuzzy rule-based
models. Int J Intell Syst 17(1):77–103
5. Park BJ, Oh SK, Ahn TC, Kim HK (1999) Optimization of fuzzy systems by
means of GA and weighting factor. Trans Korean Inst Electr Eng 48A(6):789–
799 (In Korean)
6. Oh SK, Park CS, Park BJ (1999) On-line modeling of nonlinear process sys-
tems using the adaptive fuzzy-neural networks. Trans Korean Inst Electr Eng
48A(10):1293–1302 (In Korean)
7. Narendra KS, Parthasarathy K (1991) Gradient methods for the optimization
of dynamical systems containing neural networks. IEEE Trans Neural Netw
2:252–262
8. Goldberg DE (1989) Genetic algorithms in search, optimization & machine
learning. Addison-wesley, Reading
9. Michalewicz Z (1996) Genetic algorithms + data structures = evolution
programs. Springer, Berlin Heidelberg New York
10. Holland JH (1975) Adaptation in natural and artificial systems. The University
of Michigan Press, Ann Arbor
11. Pedrycz W, Peters JF (1998) Computational intelligence and software engineer-
ing. World Scientific, Singapore
12. Computational intelligence by programming focused on fuzzy neural networks
and genetic algorithms. Naeha, Seoul (In Korean)
13. Horikawa S, Furuhashi T, Uchikawa Y (1992) On fuzzy modeling using fuzzy
neural networks with the back propagation algorithm. IEEE Trans Neural Netw
3(5):801–806
14. Oh SK, Pedrycz W (2002) The design of self-organizing polynomial neural
networks. Inf Sci 141(3–4):237–258
15. Oh SK, Pedrycz W, Park BJ (2003) Polynomial neural networks architecture:
Analysis and Design. Comput Electr Eng 29(6):653–725
16. Ohtani T, Ichihashi H, Miyoshi T, Nagasaka K (1998) Orthogonal and successive
projection methods for the learning of neurofuzzy GMDH. Inf Sci 110:5–24
17. Ohtani T, Ichihashi H, Miyoshi T, Nagasaka K (1998) Structural learning
with M-Apoptosis in neurofuzzy GMHD. In: Proceedings of the 7th IEEE
International Conference on Fuzzy Systems:1265–1270
18. Ichihashi H, Nagasaka K (1994) Differential minimum bias criterion for neuro-
fuzzy GMDH. In: Proceedings of 3rd International Conference on Fuzzy Logic
Neural Nets and Soft Computing IIZUKA’94:171–172
19. Park BJ, Pedrycz W, Oh SK (2002) Fuzzy polynomial neural networks: hybrid
architectures of fuzzy modeling. IEEE Trans Fuzzy Syst 10(5):607–621
20. Oh SK, Pedrycz W, Park BJ (2003) Self-organizing neurofuzzy networks based
on evolutionary fuzzy granulation. IEEE Trans Syst Man and Cybern A
33(2):271–277
21. Cordon O et al. (2004) Ten years of genetic fuzzy systems: current framework
and new trends. Fuzzy Sets Syst 141(1):5–31
22. Ivahnenko AG (1968) The group method of data handling: a rival of method of
stochastic approximation. Sov Autom Control 13(3):43–55
23. Yamakawa T (1993) A new effective learning algorithm for a neo fuzzy neuron
model. 5th IFSA World Conference:1017–1020
24. Oh SK, Yoon KC, Kim HK (2000) The design of optimal fuzzy-neural networks
structure by means of GA and an aggregate weighted performance index. J
Control Autom Syst Eng 6(3):273–283 (In Korean)
25. Park MY, Choi HS (1990) Fuzzy control system. Daeyoungsa, Seoul (In Korean)
26. Box GEP, Jenkins GM (1976) Time series analysis, forecasting, and control,
2nd edn. Holden-Day, San Francisco
27. Ahn TC, Oh SK (1997) Intelligent models concerning the pattern of an air
pollutant emission in a thermal power plant, Final Report, EESRI
28. Kondo T (1986) Revised GMDH algorithm estimating degree of the complete
polynomial. Trans Soc Instrum Control Eng 22(9):928–934
29. Park HS, Oh SK (2003) Multi-FNN identification based on HCM clustering and
evolutionary fuzzy granulation. Int J Control, Autom Syst 1(2):194–202
30. Kim E, Lee H, Park M, Park M (1998) A simply identified sugeno-type fuzzy
model via double clustering. Inf Sci 110:25–39
31. Lin Y, Cunningham III GA (1997) A new approach to fuzzy-neural modeling,
IEEE Trans Fuzzy Syst 3(2):190–197
32. Oh SK, Pedrycz W, Park HS (2003) Hybrid identification in fuzzy-neural
networks. Fuzzy Sets Syst 138(2):399–426
33. Park HS, Oh SK (2000) Multi-FNN identification by means of HCM cluster-
ing and its optimization using genetic algorithms. J Fuzzy Logic Intell Syst
10(5):487–496 (In Korean)
34. Park BJ, Oh SK, Jang SW (2002) The design of adaptive fuzzy polynomial
neural networks architectures based on fuzzy neural networks and self-organizing
networks. J Control Autom Syst Eng 8(2):126–135 (In Korean)
35. Park BJ, Oh SK (2002) The analysis and design of advanced neurofuzzy
polynomial networks. J Inst Electron Eng Korea 39-CI(3):18–31 (In Korean)
36. Park BJ, Oh SK, Pedrycz W, Kim HK (2005) Design of evolutionally optimized
rule-based fuzzy neural networks on fuzzy relation and evolutionary optimiza-
tion. International Conference on Computational Science. Lecture Notes in
Computer Science 3516:1100–1103
37. Oh SK, Park BJ, Pedrycz W, Kim HK (2005) Evolutionally optimized fuzzy
neural networks based on evolutionary fuzzy granulation. Lecture Notes in
Computer Science 3483:887–895
38. Oh SK, Park BJ, Pedrycz W, Kim HK (2005) Genetically optimized hybrid
fuzzy neural networks in modeling software data. Lecture Notes in Artificial
Intelligence 3558:338–345
Genetically Optimized Hybrid Fuzzy Neural Networks 57
1 Introduction
Recently, a great deal of attention has been directed toward advanced techniques
of complex system modeling. The challenging quest for constructing models
of systems that offer significant approximation and generalization
abilities while remaining easy to comprehend has occupied the community for
decades. While neural networks, fuzzy sets, and evolutionary computing, as the
technologies of Computational Intelligence (CI), have expanded and enriched
the field of modeling quite immensely, they have also given rise to a number
of new methodological issues and increased our awareness of the tradeoffs
one has to make in system modeling [1–4]. The most successful approaches
to hybridizing fuzzy systems with learning and adaptation have been made in
the realm of CI. In particular, neural fuzzy systems and genetic fuzzy systems
hybridize the approximate inference method of fuzzy systems with the learning
capabilities of neural networks and evolutionary algorithms [5]. When the
dimensionality of the model goes up (say, the number of variables increases),
so do the difficulties. Fuzzy sets emphasize the transparency of the
models and the role of a model designer, whose prior knowledge about the system
may be very helpful in facilitating all identification pursuits. On the other
hand, building models of substantial approximation capabilities calls
for advanced tools. The art of modeling is to reconcile these two tendencies
and find a workable and efficient synergistic environment. It is also
worth stressing that in many cases the nonlinear form of the model acts as a
double-edged sword: while we gain flexibility to cope with experimental data, we
are confronted with an abundance of nonlinear dependencies that need to be
exploited in a systematic manner. In particular, when dealing with high-order
nonlinear and multivariable model equations, we require a vast amount
of data for estimating all their parameters [1–2].
To help alleviate these problems, one of the first approaches along the line
of a systematic design of nonlinear relationships between a system's inputs
and outputs comes under the name of the Group Method of Data Handling
(GMDH). GMDH was developed in the late 1960s by Ivakhnenko [6–9] as a
vehicle for identifying nonlinear relations between input and output variables.
While providing a systematic design procedure, GMDH comes with some
drawbacks. First, it tends to generate quite complex polynomials even for rela-
tively simple systems (experimental data). Second, owing to its limited generic
structure (quadratic two-variable polynomials), GMDH also tends to
produce an overly complex network (model) when it comes to highly nonlinear
systems. Third, if there are fewer than three input variables, the GMDH algorithm
does not generate a highly versatile structure. To alleviate the problems as-
sociated with GMDH, Self-Organizing Neural Networks (SONN) (viz.
polynomial neuron (PN)-based SONN and fuzzy polynomial neuron (FPN)-
based SONN, also called SOPNN/FPNN) were introduced by Oh and Pedrycz
[10–13] as a new category of neural networks or neuro-fuzzy networks. In a
nutshell, these networks come with a high level of flexibility associated with
Genetically Optimized Self-organizing Neural Networks 61
each node: a processing element forming a Partial Description (PD), viz. a
Polynomial Neuron (PN) or a Fuzzy Polynomial Neuron (FPN), can have a different
number of input variables as well as exploit a different order of the polynomial
(say, linear, quadratic, cubic, etc.). In comparison to well-known neural networks
or neuro-fuzzy networks, whose topologies are commonly selected and
kept fixed prior to all detailed (parametric) learning, the SONN architecture is not
fixed in advance but is generated and optimized in a dynamic way. As
a consequence, the SONNs show a superb performance in comparison to the
previously presented intelligent models. Although the SONN has a flexible ar-
chitecture whose potential can be fully utilized through a systematic design,
it is difficult to obtain the structurally and parametrically optimized network
because of the limited design of the nodes (viz. PNs or FPNs) located in each
layer of the SONN. In other words, when we construct nodes of each layer
in the conventional SONN, such parameters as the number of input variables
(nodes), the order of the polynomial, and the input variables available within
a node (viz. PN or FPN) are fixed (selected) in advance by the designer.
Accordingly, the SONN algorithm exhibits some tendency to produce overly
complex networks, as well as a repetitive computational load caused by the
trial-and-error method and/or repetitive parameter adjustment by the designer,
as in the original GMDH algorithm. In order to generate a structurally and
parametrically optimized network, such parameters need to be optimized.
In this study, in addressing the above problems with the conventional
SONN as well as the GMDH algorithm, we introduce a new genetic design
approach; as a consequence we will be referring to these networks as geneti-
cally optimized SONN (gSONN). The determination of the optimal values of
the parameters available within an individual PN or FPN (viz. the number
of input variables, the order of the polynomial, and a collection of preferred
nodes) leads to a structurally and parametrically optimized network. As a
result, this network is more flexible as well as exhibits simpler topology in
comparison to the conventional SONN discussed in the previous research. Let
us reiterate that the objective of this study is to develop a general design
methodology of gSONN modeling, come up with a logic-based structure of
such model and propose a comprehensive evolutionary development environ-
ment in which the optimization of the models can be efficiently carried out
both at the structural as well as parametric level [14].
This chapter is organized in the following manner. First, Section 2 gives
a brief introduction to the architecture and development of the SONNs. Sec-
tion 3 introduces the genetic optimization used in SONN. The genetic design of
the SONN comes with an overall description of a detailed design methodology
of SONN based on genetically optimized multi-layer perceptron architecture
in Section 4. In Section 5, we report on a comprehensive set of experiments.
Finally, concluding remarks are covered in Section 6. To evaluate the perfor-
mance of the proposed model, we exploit two well-known time series data
sets [10, 11, 13, 15, 16, 20–41]. Furthermore, the network is directly contrasted with
several existing neurofuzzy models reported in the literature.
62 S.-K. Oh and W. Pedrycz
As underlined, the SONN algorithm is based on the GMDH method and utilizes
a class of polynomials (linear, quadratic, modified quadratic, etc.)
to describe the basic processing realized there. By choosing the most significant
input variables and an order of the polynomial among the various forms
available, we obtain the best one; it comes under the name of a partial
description (PD). The network is realized by selecting nodes at each layer and eventu-
ally generating additional layers until the best performance has been reached.
Such a methodology leads to an optimal SONN structure. Let us recall that the
input-output data are given in the form
y = f (x1 , x2 , · · · , xN ). (2)
y = c0 + Σ_{i=1}^{N} ci xi + Σ_{i=1}^{N} Σ_{j=1}^{N} cij xi xj + Σ_{i=1}^{N} Σ_{j=1}^{N} Σ_{k=1}^{N} cijk xi xj xk + · · ·   (3)
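Each node of a GMDH-type network realizes a low-order fragment of the series (3) over a pair of variables, fitted by least squares. A minimal sketch of fitting and evaluating one quadratic two-variable partial description (the function names are ours, not the chapter's):

```python
import numpy as np

def fit_partial_description(zp, zq, y):
    """Least-squares fit of the quadratic partial description
    z = c0 + c1*zp + c2*zq + c3*zp^2 + c4*zq^2 + c5*zp*zq."""
    X = np.column_stack([np.ones_like(zp), zp, zq, zp**2, zq**2, zp * zq])
    c, *_ = np.linalg.lstsq(X, y, rcond=None)
    return c

def eval_partial_description(c, zp, zq):
    """Evaluate a fitted partial description on (possibly new) inputs."""
    return c[0] + c[1]*zp + c[2]*zq + c[3]*zp**2 + c[4]*zq**2 + c[5]*zp*zq
```

Stacking such fits layer by layer, with each layer's best outputs feeding the next, is exactly the growth scheme described below.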
[Figure: overall topology of the PN-based SONN. The input variables x1–x4 feed PN (partial description) nodes in the 1st layer; outputs of selected nodes feed the 2nd and higher layers, terminating in the model output ŷ. Each PN combines two inputs zp and zq through a partial description of a selected polynomial order, e.g. z = C0 + C1 zp + C2 zq + C3 zp² + C4 zq² + C5 zp zq.]
[Figure: the inputs xp and xq are processed by the fuzzy sets {Al} and {Bk} in the fuzzy-set-based part F, producing rule activation levels μ1, . . . , μK; the normalized levels μ̂1, . . . , μ̂K weight the polynomial mappings P1, . . . , PK of part P, whose weighted sum yields the node output z.]
Fig. 2. A general topology of the generic FPN module (F: fuzzy set-based processing
part, P: the polynomial form of mapping)
where alk is a vector of the parameters of the conclusion part of the rule while
Plk (xi , xj , alk ) denotes the regression polynomial forming the consequence
part of the fuzzy rule which uses several types of high-order polynomials (lin-
ear, quadratic, and modified quadratic) besides the constant function forming
the simplest version of the consequence; refer to Table 2.
The types of the polynomial read as follows
• Bilinear = c0 + c1 x1 + c2 x2
• Biquadratic-1 = Bilinear + c3 x1² + c4 x2² + c5 x1 x2
• Biquadratic-2 = Bilinear + c3 x1 x2
• Trilinear = c0 + c1 x1 + c2 x2 + c3 x3
• Triquadratic-1 = Trilinear + c4 x1² + c5 x2² + c6 x3² + c7 x1 x2 + c8 x1 x3 + c9 x2 x3
• Triquadratic-2 = Trilinear + c4 x1 x2 + c5 x1 x3 + c6 x2 x3
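The six forms above differ only in which monomials they include, which makes them easy to generate programmatically. A sketch (our own helper, not from the chapter) that builds the regressor columns for the (bi/tri)linear form and the two quadratic variants:

```python
import numpy as np

def consequent_terms(x, form="linear"):
    """Regressor columns for the consequent polynomial of a fuzzy rule.
    x: (n_samples, n_vars) array with n_vars = 2 (bi-) or 3 (tri-).
    form: 'linear', 'quadratic-1' (squares + cross terms),
          or 'quadratic-2' (cross terms only, the 'modified quadratic')."""
    n = x.shape[1]
    cols = [np.ones(len(x))] + [x[:, i] for i in range(n)]   # constant + linear
    if form == "quadratic-1":
        cols += [x[:, i] ** 2 for i in range(n)]             # squared terms
    if form in ("quadratic-1", "quadratic-2"):
        cols += [x[:, i] * x[:, j]                           # cross terms
                 for i in range(n) for j in range(i + 1, n)]
    return np.column_stack(cols)
```

With two inputs this yields 3, 6, and 4 columns for the three forms (Bilinear, Biquadratic-1, Biquadratic-2), and with three inputs 4, 10, and 7 columns, matching the coefficient counts listed above.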
Alluding to the input variables of the FPN, especially a way in which they
interact with the two functional blocks shown there, we use the notation FPN
(xp , xq ; xi , xj ) to explicitly point at the variables. The processing of the FPN
is governed by the following expressions, which are in line with the rule-based
computing reported in the literature [15, 16].
The activation of the rule “K ” is computed as an and-combination of
the activations of the fuzzy sets occurring in the rule. This combination of
the subconditions is realized through any t-norm. In particular, we consider
the minimum and product operations as two widely used models of the logic
connectives. Subsequently, denote the resulting activation level of the rule
by µK . The activation levels of the rules contribute to the output of the
FPN being computed as a weighted average of the individual conclusion parts
(functional transformations) PK (note that the index of the rule, namely "K"
is a shorthand notation for the two indexes of fuzzy sets used in the rule (4),
that is K = (l, k)).
z = Σ_{K=1}^{all rules} μK PK(xi, xj, aK) / Σ_{K=1}^{all rules} μK = Σ_{K=1}^{all rules} μ̂K PK(xi, xj, aK)   (5)
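The rule firing and aggregation of (5) amount to a t-norm AND over the sub-conditions followed by a normalized weighted average. A compact sketch (the names are ours), assuming the membership degrees and consequent polynomial values have already been evaluated:

```python
import numpy as np

def rule_activation(mu_a, mu_b, tnorm="product"):
    """AND-combination of the two sub-conditions via a t-norm;
    minimum and product are the two logic connectives considered."""
    return min(mu_a, mu_b) if tnorm == "min" else mu_a * mu_b

def fpn_output(mu, p_vals):
    """Eq. (5): z = sum_K mu_K P_K / sum_K mu_K = sum_K muhat_K P_K."""
    mu = np.asarray(mu, dtype=float)
    muhat = mu / mu.sum()                     # normalized activation levels
    return float(np.dot(muhat, np.asarray(p_vals, dtype=float)))
```

The normalization makes the node output a convex combination of the rule consequents, which is what keeps the FPN output bounded by the consequent values.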
and the order of the polynomial forming the conclusion part of the rules as
well as a collection of the specific subset of input variables. The consequence
part can be expressed by linear, quadratic, or modified quadratic polynomial
equation as mentioned previously. Especially for the consequence part, we
consider two kinds of input vector formats in the conclusion part of the fuzzy
rules of the 1st layer, namely i) selected inputs and ii) entire system inputs,
see Table 3.
i) The input variables of the consequence part of the fuzzy rules are the same
as the input variables of the premise part.
ii) The input variables of the consequence part of the fuzzy rules in a node
of the 1st layer are the same as the entire system input variables, while the input
variables of the consequence part of the fuzzy rules in a node of the 2nd layer
or higher are the same as the input variables of the premise part.
Here the notation is as follows. A: vector of the selected input variables (x1, x2, · · · , xi);
B: vector of the entire system input variables (x1, x2, · · · , xi, xj, · · · );
Type T: f(A) = f(x1, x2, · · · , xi), the type of polynomial function standing in the
consequence part of the fuzzy rules; Type T*: f(B) = f(x1, x2, · · · , xi, xj, · · · ),
the type of polynomial function occurring in the consequence part of the fuzzy
rules.
is directed only by some fitness function whose form could be quite complex;
we do not require it to be differentiable.
In this study, for the optimization of the SONN model, the GA uses a serial
binary encoding, roulette-wheel selection, one-point
crossover, and a binary inversion (complementation)
operation as the mutation operator. To retain the best individual and carry it
over to the next generation, we use the elitist strategy [19]. The overall genetically
driven structural optimization process of the SONN is shown in Figs. 5–6.
As mentioned, when we construct PNs or FPNs of each layer in the con-
ventional SONN, such parameters as the number of input variables (nodes),
the SONN. Next, the testing data set is used to evaluate the quality of the
network.
[Step 3] Decide initial information for constructing the SONN structure.
We decide upon the design parameters of the SONN structure, which
include:
a) According to the stopping criterion, two termination methods are ex-
ploited:
– Criterion level for comparison of a minimal identification error of the
current layer with that occurring at the previous layer of the network.
– The maximum number of layers (predetermined by the designer) with
an intent to achieve a sound balance between model accuracy and its
complexity.
b) The maximum number of input variables coming to each node in the
corresponding layer.
c) The total number W of nodes to be retained (selected) at the next
generation of the SONN algorithm.
d) The depth of the SONN to be selected to reduce a conflict between
overfitting and generalization abilities of the developed SONN.
e) The depth and width of the SONN to be selected as a result of a tradeoff
between accuracy and complexity of the overall model.
In addition, in the case of the FPN-based SONN, parameters related to the
following item are considered besides those mentioned above.
f) The initial information for the fuzzy inference method and fuzzy
identification:
– Fuzzy inference method
– MF type: Triangular or Gaussian-like MF
– Number of MFs per input of a node (FPN)
– Structure of the consequence part of fuzzy rules
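Gathered in one place, the Step 3 design decisions can be viewed as a configuration record. The concrete values below are illustrative choices of ours, not settings prescribed by the chapter:

```python
# Illustrative Step-3 configuration for an FPN-based SONN;
# every concrete value here is an example, not a recommendation.
sonn_design = {
    # a) stopping criterion: error-improvement threshold and layer cap
    "error_criterion_level": 1e-3,
    "max_layers": 5,
    # b) maximal number of inputs entering each node
    "max_inputs_per_node": 3,
    # c) number W of nodes retained for the next layer
    "retained_nodes_W": 30,
    # f) FPN-specific fuzzy-inference choices
    "fuzzy_inference": "regression polynomial",
    "mf_type": "triangular",            # or "gaussian-like"
    "mfs_per_input": 2,
    "consequent_inputs": "selected",    # or "entire system inputs"
}
```

Items a) through c) govern the structural search, while the f) group only applies to FPN nodes.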
[Step 4] Decide a structure of the PN or FPN based SONN using genetic
design.
This concerns the selection of the number of input variables, the polynomial
order, and the input variables to be assigned in each node of the corresponding
layer. These important decisions are carried out through an extensive genetic
optimization.
When it comes to the organization of the chromosome representing (map-
ping) the structure of the SONN, we divide the chromosome to be used for
genetic optimization into three sub-chromosomes as shown in Figs. 8–9. The
1st sub-chromosome contains the number of input variables, the 2nd sub-
chromosome involves the order of the polynomial of the node, and the 3rd
sub-chromosome (remaining bits) contains input variables coming to the cor-
responding node (PN or FPN). All these elements are optimized when running
the GA.
In nodes (PN or FPNs) of each layer of SONN, we adhere to the nota-
tion of Fig. 10. ‘PNn’ or ‘FPNn’ denotes the nth node (PN or FPN) of the
Fig. 9. The FPN design used in the SONN architecture - structural considerations
and mapping the structure on a chromosome
are then decoded into a decimal value. Here, bit(1), bit(2), and bit(3) denote
the positions of these three bits, each taking the value "0" or "1".
Sub-step 3) The above decimal value is normalized and rounded off according to
γ = (β/α) × (Max − 1) + 1   (8)
where Max denotes the maximal number of input variables entering the
corresponding node (PN or FPN), β is the decimal value decoded from the
sub-chromosome, and α is the decoded decimal value corresponding to the situation
when all bits of the 1st sub-chromosome are set to 1.
Sub-step 4) The normalized integer value is then treated as the number of
input variables (or input nodes) coming to the corresponding node.
Evidently, the maximal number (Max) of input variables is equal to or less
than the number of all system’s input variables (x1 , x2 , · · · , xn ) coming to the
1st layer, that is, Max ≤ n.
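Sub-steps 1 through 4 reduce to decoding a bit string and normalizing it via (8). A sketch (the helper name is ours):

```python
def decode_normalized(bits, max_value):
    """Decode a sub-chromosome and normalize per Eq. (8):
    gamma = (beta/alpha) * (Max - 1) + 1, rounded off, so the
    result always lies in 1..Max.  bits: sequence of 0/1."""
    beta = int("".join(str(b) for b in bits), 2)   # decoded decimal value
    alpha = 2 ** len(bits) - 1                     # all bits set to 1
    gamma = (beta / alpha) * (max_value - 1) + 1
    return round(gamma)
```

The same routine serves both the 1st sub-chromosome (number of inputs, Max ≤ n) and the 2nd sub-chromosome (polynomial order, with Max replaced by 3 or 4 as described next).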
[Step 4-2] Selection of the order of polynomial (2nd sub-chromosome)
Sub-step 1) The 3 bits of the 2nd sub-chromosome are assigned to the binary
bits for the selection of the order of polynomial.
Sub-step 2) The 3 bits randomly selected using (7) are decoded into a
decimal format.
Sub-step 3) The decimal value obtained by means of (8) is normalized and
rounded off. The value of Max is replaced with 3 (in the case of a PN) or 4 (in the case
of an FPN); refer to Tables 1 and 2.
[Figure: membership functions and their parameters. (a) Triangular MF: the "Small" function is
μS(x) = 1 if x ≤ xmin, (xmax − x)/(xmax − xmin) if xmin < x < xmax, 0 if x ≥ xmax,
and the "Big" function is its complement, μB(x) = 1 − μS(x). (b) Gaussian-like MF, defined analogously over [xmin, xmax].]
Fig. 11. Triangular and Gaussian-like membership functions and their parameters
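The triangular pair of Fig. 11(a) can be written directly from the piecewise definition; the Gaussian-like parametrization below is our own assumption, since the chapter only sketches that class graphically:

```python
import math

def mf_small(x, x_min, x_max):
    """Triangular 'Small' MF of Fig. 11(a): 1 below x_min, 0 above x_max,
    linearly decreasing on (x_min, x_max)."""
    if x <= x_min:
        return 1.0
    if x >= x_max:
        return 0.0
    return (x_max - x) / (x_max - x_min)

def mf_big(x, x_min, x_max):
    """Complementary 'Big' MF: mu_B(x) = 1 - mu_S(x)."""
    return 1.0 - mf_small(x, x_min, x_max)

def mf_gaussian_like(x, c, sigma):
    """A Gaussian-like MF with center c and spread sigma (our parametrization)."""
    return math.exp(-0.5 * ((x - c) / sigma) ** 2)
```

The Gaussian-like class never reaches zero, which is the "infinite support" property the chapter later points to as its advantage.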
with the following notation: i: node number; k: data number; ntr: number
of data in the training subset; n: number of the selected input variables; m:
maximum order; n′: number of estimated coefficients.
[Step 5-2] In case of an FPN
At this step, the regression polynomial inference is considered. The inference
method deals with regression polynomial functions viewed as the consequents
of the rules. Regression polynomials (polynomial and in the very specific case,
a constant value) standing in the conclusion part of fuzzy rules are given as
different types of Type 1, 2, 3, or 4, see Table 2. In the fuzzy inference, we
consider two types of membership functions, namely triangular and Gaussian-
like membership functions.
The regression fuzzy inference (reasoning scheme) is envisioned as follows: the
consequence part can be expressed by a linear, quadratic, or modified quadratic
polynomial equation as shown in Table 2. The use of the regression polynomial
inference method gives rise to the expression
where n is the number of fuzzy rules, y is the inferred value, μi is the
premise fitness of Ri, and μ̂i is the normalized premise fitness derived from μi.
Here we consider a regression polynomial function of the input variables. The
consequence parameters are produced by the standard least-squares method,
that is,
â = (Xj^T Xj)^(−1) Xj^T Y   (14)

where

aj^T = [a10j, · · · , an0j, a11j, · · · , an1j, · · · , a1kj, · · · , ankj],
Xj = [x1j, x2j, · · · , xij, · · · , xmj]^T,
xlj^T = [μ̂lj1, · · · , μ̂ljn, xlj1 μ̂lj1, · · · , xlj1 μ̂ljn, · · · , xljk μ̂lj1, · · · , xljk μ̂ljn],
Y = [y1, · · · , ym]^T
j is the node number, m denotes the total number of data points, and k stands for
the number of input variables. Subsequently, l is the data index while n is
the number of fuzzy rules.
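The standard least-squares step of (14), together with the row layout of Xj, can be sketched as follows (our helper names; the normalized activations μ̂ are assumed precomputed):

```python
import numpy as np

def design_row(muhat, x):
    """One row x_l^T of Xj: [muhat_1..muhat_n,
    x_1*muhat_1..x_1*muhat_n, ..., x_k*muhat_1..x_k*muhat_n]."""
    row = list(muhat)
    for xv in x:
        row.extend(xv * m for m in muhat)
    return np.array(row)

def estimate_consequents(Xj, Y):
    """Eq. (14): a_hat = (Xj^T Xj)^{-1} Xj^T Y, computed via a linear solve
    rather than an explicit matrix inverse."""
    return np.linalg.solve(Xj.T @ Xj, Xj.T @ Y)
```

Using a linear solve instead of forming the inverse is the usual numerically safer realization of the same formula.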
The procedure described above is implemented iteratively for all nodes of
the layer and also for all layers of SONN; we start from the input layer and
move towards the output layer.
[Step 6] Select nodes (PNs or FPNs) with the best predictive capability and
construct their corresponding layer.
Fig. 12 depicts the genetic optimization procedure for the generation of the
optimal nodes in the corresponding layer. As shown in Fig. 12, all nodes of
the corresponding layer of SONN architecture are constructed through the
genetic optimization.
The generation process can be organized as the following sequence of steps
Sub-step 1) We set up initial genetic information necessary for genera-
tion of the SONN architecture. This concerns the number of generations and
populations, mutation rate, crossover rate, and the length of the chromosome.
Sub-step 2) The nodes (PNs or FPNs) are generated through the genetic
design. Here, a single population assumes the same role as the node (PN or
FPN) in the SONN architecture and its underlying processing is visualized
[Figure: per-generation GA flow. A population of T candidate nodes (PNs or FPNs) undergoes the fitness operation; reproduction uses roulette-wheel selection, one-point crossover, and inversion mutation; the elitist strategy preserves the node of highest fitness; from the combined array of 2W nodes, the W preferred nodes are selected for the layer.]
Fig. 12. The genetic optimization used in the generation of the optimal nodes in
the given layer of the network
of the retained nodes serve as inputs to the next layer of the network. There
are two cases as to the number of the retained nodes, that is
(i) If W * < W , then the number of nodes (PNs or FPNs) retained for
the next layer is equal to W *.
Here, W * denotes the number of retained nodes in each layer after
nodes with duplicated fitness values have been removed.
(ii) If W * ≥ W , then for the next layer, the number of the retained nodes
(PNs or FPNs) is equal to W .
The above design pattern is carried out for the successive layers of the
network. For the construction of the nodes in the corresponding layer of the
original SONN structure, the nodes obtained from (i) or (ii) are rearranged in
ascending order on the basis of their initial population numbers. This step is needed
to construct the final network architecture, as it allows us to trace the locations of the
original nodes of each layer generated by the genetic optimization.
Sub-step 6) For the elitist strategy, we select the node that has the highest
fitness value among the selected nodes (W ).
Sub-step 7) We generate new populations for the next generation using the GA
operators described in Sub-step 4, again employing the elitist strategy. This
sub-step is carried out by repeating Sub-steps 2–6. In particular, in Sub-step 5
we replace the node that has the lowest fitness value in the current genera-
tion with the node that reached the highest fitness value in the previous
generation, obtained from Sub-step 6.
Sub-step 8) We combine the nodes (W populations) obtained in the previous
generation with the nodes (W populations) obtained in the current genera-
tion. In the sequel, the W nodes with the higher fitness values among these 2W
candidates are selected; that is, this sub-step is carried out by repeating Sub-step 5.
Sub-step 9) This process is repeated through Sub-steps 7–8 until the last
generation is reached.
The iterative process generates the optimal nodes of the given layer of the
SONN.
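Sub-steps 2 through 9 describe one elitist GA pass per generation. The loop below is a simplified sketch of that scheme (our code, with an always-on single-bit mutation rather than a mutation probability): roulette-wheel selection and one-point crossover produce offspring, the elite node is carried over, and the W fittest of the combined pool are retained:

```python
import random

def next_generation(population, fitness, W, rng):
    """One elitist GA generation over candidate nodes (chromosomes as 0/1 lists)."""
    scored = [(fitness(ch), ch) for ch in population]
    total = sum(f for f, _ in scored)

    def roulette():  # fitness-proportional (roulette-wheel) selection
        r, acc = rng.random() * total, 0.0
        for f, ch in scored:
            acc += f
            if acc >= r:
                return ch
        return scored[-1][1]

    offspring = []
    while len(offspring) < len(population):
        p1, p2 = roulette(), roulette()
        cut = rng.randrange(1, len(p1))          # one-point crossover
        child = p1[:cut] + p2[cut:]
        i = rng.randrange(len(child))            # inversion (bit-flip) mutation
        child[i] = 1 - child[i]
        offspring.append(child)

    elite = max(scored, key=lambda s: s[0])[1]   # elitist strategy
    pool = population + offspring + [elite]
    return sorted(pool, key=fitness, reverse=True)[:W]
```

Because the parent population and the elite individual re-enter the selection pool, the best fitness in the retained set can never decrease across generations.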
[Step 7] Check the termination criterion.
The termination condition that controls the growth of the model consists
of two components, that is the performance index and a size of the network
(expressed in terms of the maximal number of the layers). As far as the per-
formance index is concerned (that reflects a numeric accuracy of the layers),
a termination is straightforward and comes in the form,
F1 ≤ F∗ (16)
In this study, we use two measures (performance indexes), that is, the Mean
Squared Error (MSE) and the Root Mean Squared Error (RMSE).
i) Mean Squared Error

E(PIs or EPIs) = (1/N) Σ_{p=1}^{N} (yp − ŷp)²   (17)

where yp is the p-th target output datum, ŷp stands for the p-th actual
output of the model for this specific data point, N is the number of training (PIs)
or testing (EPIs) input-output data pairs, and E is an overall (global) performance
index defined as the average of the squared errors over the N pairs.
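Both performance indexes of (17) are one-liners; a sketch:

```python
import math

def mse(targets, outputs):
    """Mean squared error E of Eq. (17) over N input-output pairs."""
    return sum((y - yh) ** 2 for y, yh in zip(targets, outputs)) / len(targets)

def rmse(targets, outputs):
    """Root mean squared error, the second performance index used."""
    return math.sqrt(mse(targets, outputs))
```

In the experiments below, PI denotes the index on the training set and EPI the same index on the testing set.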
[Step 8] Determine new input variables for the next layer.
If (16) has not been met, the model is expanded. The outputs of the
preserved nodes (z1i, z2i, · · · , zWi) serve as new inputs (x1j, x2j, · · · , xWj)
(j = i + 1) to the next layer; this is captured by the expression x1j = z1i, x2j = z2i, · · · , xWj = zWi.
The SONN algorithm is carried out by repeating steps 4–8 of the algorithm.
5 Experimental studies
In this section, we illustrate the development of the SONN and show its per-
formance on a number of well-known and widely used datasets. The first one
is the gas furnace time series (Box-Jenkins data), studied previously
in [10, 11, 13, 15, 16, 20–34]. The other deals with a chaotic time series
(the Mackey-Glass time series) [35–41].
of the series serves as a testing set. We consider the MSE given by (17) to be
a pertinent performance index. We choose the input variables of nodes in the
1st layer of SONN architecture from these input variables. We use four types
of system input variables of SONN architecture with vector formats such as
Type I, Type II, Type III, and Type IV as shown in Table 4. The forms
Type I, II, III, and IV utilize two, three, four, and six system input variables
respectively.
Table 5 summarizes the list of parameters used in the genetic optimiza-
tion of the PN-based and the FPN-based SONN. In the optimization of each
layer, we use 100 generations, 60 populations, a string of 36 bits, crossover
rate equal to 0.65, and a mutation probability of 0.1. A chromosome
used in the genetic optimization consists of a string comprising 3
sub-chromosomes. The 1st sub-chromosome contains the number of input vari-
ables, the 2nd sub-chromosome contains the order of the polynomial, and finally
the 3rd sub-chromosome contains the input variables. The numbers of bits allocated
to the sub-chromosomes are equal to 3, 3, and 30, respectively. The selected
population size, taken from the total population size (60), is equal to 30.
The process is realized as follows. 60 nodes (PNs or FPNs) are generated
in each layer of the network. The parameters of all nodes generated in each
layer are estimated and the network is evaluated using both the training and
testing data sets. Then we compare these values and choose 30 nodes (PNs
or FPNs) that produce the best (lowest) value of the performance index.
The maximal number (Max) of inputs to be selected is confined to two to
five (2–5). In the case of the PN-based SONN, the order of the polynomial is chosen
from three types, that is, Type 1, Type 2, and Type 3 (refer to Table 1),
while in case of FPN-based SONN, the polynomial order of the consequent
part of fuzzy rules is chosen from four types, that is Type 1, Type 2, Type
3, and Type 4, as shown in Table 2. As usual in fuzzy systems, we may exploit
a variety of membership functions in the condition part of the rules, and
this is another factor contributing to the flexibility of the network. Overall,
triangular or Gaussian fuzzy sets are of general interest. The first class,
triangular membership functions, provides a very simple implementation.
The second class is useful because of the infinite support of its fuzzy
sets.
Table 5. Parameters of the genetic optimization of the PN-based and FPN-based SONN (the same settings apply in each of the five experimental cases)

GA: total population size: 60; selected population size (W): 30; crossover rate: 0.65; mutation rate: 0.1; string length: 3 + 3 + 30
PN-based SONN: maximal number (Max) of inputs to be selected: 1 ≤ l ≤ Max (2–5); polynomial type (Type T) (#): 1 ≤ T ≤ 3
FPN-based SONN: maximal number (Max) of inputs to be selected: 1 ≤ l ≤ Max (2–5); polynomial type (Type T) of the consequent part of fuzzy rules (##): 1 ≤ T ≤ 4

l, T, Max: integers; #, ##, and ###: refer to Tables 1–3 respectively.
[Figure: a sample chromosome decoded into a PN. The 1st sub-chromosome yields 3 input variables, the 2nd sub-chromosome selects the modified quadratic polynomial, and the 3rd sub-chromosome selects the input variables x2, x5, x6, so the PN realizes z = a0 + a1x2 + a2x5 + a3x6 + a4x2x5 + a5x2x6 + a6x5x6.]
Fig. 13. The example of the PN design guided by some chromosome (Type IV and
Max = 3 in layer 1)
design process. The maximal number of input variables for the selection is
confined to 3 over nodes of each layer of the network.
Table 6 summarizes the results when using Type II and Type IV. According
to the maximal number of inputs to be selected (Max = 2 to 5), the
selected node numbers, the selected polynomial type (Type T), and the cor-
responding performance indices (PI and EPI) are shown for the genetic
optimization carried out at each layer. "Node" denotes the nodes for
which the fitness value is maximal in each layer. For example, in the case of Table
6(b), the fitness value in layer 1 is maximal for Max = 5 when nodes 3, 4, 5, 6
occurring in the previous layer are selected as the node inputs in the present
layer. Only 4 inputs of Type 1 (a linear function) were selected as the result
of the genetic optimization. Here, node "0" indicates that the corresponding input has not been
selected by the genetic operation. Therefore the width (the number of nodes)
of a layer can be lower than in the conventional PN-based SONN
(which contributes immensely to the compactness of the resulting network).
In that case, the minimal values of the performance index at the node, that is,
PI = 0.035 and EPI = 0.125, are obtained.
Fig. 14 depicts the values of the performance index of the PN-based
gSONN with respect to the maximal number of inputs to be selected when
Table 6. Performance index of the PN-based SONN viewed with regard to the increasing number of the layers

(a) In case of Type II (each cell lists the selected node numbers, polynomial type T, PI, EPI)

Max | 1st layer | 2nd layer | 3rd layer | 4th layer | 5th layer
2 | 2 3, 2, 0.104, 0.198 | 6 10, 2, 0.048, 0.124 | 10 12, 2, 0.024, 0.119 | 8 16, 2, 0.024, 0.114 | 23 30, 2, 0.024, 0.112
3 | 1 2 3, 3, 0.022, 0.135 | 5 7 14, 2, 0.045, 0.121 | 16 18 28, 2, 0.043, 0.115 | 12 16 23, 3, 0.041, 0.112 | 14 18 21, 2, 0.036, 0.109
4 | 1 2 3 0, 3, 0.022, 0.135 | 5 15 16 0, 2, 0.045, 0.121 | 8 13 14 21, 3, 0.042, 0.112 | 1 8 25 29, 3, 0.039, 0.108 | 5 15 16 0, 1, 0.038, 0.106
5 | 1 2 3 0 0, 3, 0.022, 0.135 | 4 7 18 0 0, 2, 0.045, 0.121 | 1 15 30 0 0, 2, 0.022, 0.115 | 23 29 30 0 0, 2, 0.021, 0.110 | 13 14 23 30 0, 1, 0.019, 0.108
[Figure: four panels, (a-1) training error and (a-2) testing error for Type II, and (b-1) training error and (b-2) testing error for Type IV, plotted versus layer (1 to 5) for Max = 2 (curve A), 3 (B), 4 (C), and 5 (D); the optimal node numbers selected at each layer are annotated next to each curve.]
Fig. 14. Performance index treated as a function of the maximal number of inputs
to be selected in Type II or Type IV
the number of system inputs is equal to 6 (Type IV). Next, Fig. 15 shows the
performance index of the gSONN with respect to the number of entire system
inputs when using Max = 5.
Considering the training and testing data sets in Type IV with Max = 5,
the best results for the network in the 2nd layer occur with the Type 2 poly-
nomial (quadratic function) and 3 nodes at the inputs (nodes numbered
4, 15, 22); the performance of this network is quantified by PI
equal to 0.015 and EPI given as 0.112. The best results for the network in the
5th layer, with PI = 0.012 and EPI = 0.091, have been reported when
using Max = 5 with the polynomial of Type 3 and 3 nodes at the inputs (node
numbers 6, 24, 25). In Fig. 14, A(•)–E(•) denote the optimal node
numbers at each layer of the network, namely those with the best predictive
performance. Here, the node numbers of the 1st layer represent system input
numbers, while the node numbers shown for the 2nd layer or higher represent
the output node numbers of the preceding layer, i.e., the optimal nodes
with the best output performance in the current layer. Fig. 16 illustrates
the detailed optimal topologies of the network with 2 or 3 layers, compared
to the conventional optimized network of Fig. 17. That is, in Fig. 17, the
86 S.-K. Oh and W. Pedrycz
[Fig. 15 plots: (a) training error and (b) testing error versus layer (1-5) for 2, 3, 4, and 6 system inputs.]
Fig. 15. Performance index regarded as a function of the number of system inputs
(SI) (in case of Max = 5)
[Fig. 16 diagrams: detailed optimal PN-based topologies, (a) Type II with 3 layers and Max = 5; (b) Type IV with 2 layers and Max = 5.]
[Fig. 17 diagrams: conventional optimized SONN topologies, (a) Type II with 5 layers; (b) Type IV with 5 layers.]
[Fig. 18 plots: training and testing error of the 1st through 5th layers over 500 generations of the genetic optimization.]
Fig. 19 shows an example of the FPN design driven by a specific
chromosome; refer to the case with performance values PI = 0.012 and
EPI = 0.145 in the 1st layer when using Type IV, the Gaussian-like MF, and
Max = 4 in Table 8(b-2). In the FPN of each layer, two Gaussian-like membership
Table 7. Performance index of PN-based SONN architectures for Types I, II, III,
and IV
(each row: layer; optimal polynomial neuron, i.e., no. of inputs (node no.) and
polynomial type; PIs; EPIs)
1 2(1,2) Type 1 0.022 0.335
2 4(1,4,5,6) Type 2 0.019 0.282
Max = 4 3 4(1,2,14,25) Type 2 0.020 0.273
4 3(17,22,25) Type 2 0.018 0.268
Type I 5 4(4,14,25,26) Type 2 0.018 0.265
(SI: 2) 1 2(1,2) Type 1 0.022 0.335
2 4(3,4,5,7) Type 2 0.020 0.282
Max = 5 3 5(7,9,15,24,28) Type 3 0.020 0.271
4 3(21,27,30) Type 2 0.017 0.263
5 3(2,15,13) Type 2 0.017 0.259
1 3(1,2,3) Type 3 0.022 0.135
2 3(5,15,16) Type 2 0.045 0.121
Max = 4 3 4(8,13,14,21) Type 3 0.042 0.112
4 4(1,8,25,29) Type 3 0.039 0.108
Type II 5 3(10,16,28) Type 1 0.038 0.106
(SI: 3) 1 3(1,2,3) Type 3 0.022 0.135
2 3(4,17,18) Type 2 0.045 0.121
Max = 5 3 3(1,15,30) Type 2 0.022 0.115
4 3(23,29,30) Type 2 0.021 0.110
5 4(13,14,23,30) Type 1 0.019 0.108
1 3(1,3,4) Type 3 0.022 0.135
2 4(24,26,29,30) Type 3 0.016 0.124
Max = 4 3 4(7,11,18,26) Type 2 0.014 0.113
4 3(2,28,29) Type 2 0.014 0.102
Type III 5 3(3,12,22) Type 1 0.013 0.099
(SI: 4) 1 3(1,3,4) Type 3 0.022 0.135
2 5(12,13,15,18,29) Type 2 0.015 0.119
Max = 5 3 4(7,9,10,14) Type 2 0.014 0.104
4 3(1,3,5) Type 2 0.014 0.100
5 4(8,9,23,24) Type 1 0.013 0.098
1 4(3,4,5,6) Type 1 0.035 0.125
2 4(6,9,15,27) Type 3 0.018 0.122
Max = 4 3 4(6,12,13,19) Type 3 0.017 0.113
4 4(4,7,10,28) Type 2 0.016 0.106
Type IV 5 3(27,28,29) Type 3 0.014 0.100
(SI: 6) 1 4(3,4,5,6) Type 1 0.035 0.125
2 3(4,15,22) Type 2 0.015 0.112
Max = 5 3 4(7,9,18,27) Type 3 0.015 0.107
4 3(3,23,30) Type 2 0.013 0.096
5 3(6,24,25) Type 3 0.012 0.091
Genetically Optimized Self-organizing Neural Networks 89
[Fig. 19 diagram: the chromosome is split into (i) bits selecting the number of input variables, (ii) bits selecting the polynomial order, and (iii) bits selecting the input variables; each field is decoded and normalized onto its admissible range, yielding a regression-polynomial fuzzy inference with 2 Gaussian MFs over the entire system input variables.]
Fig. 19. The example of the FPN design guided by some chromosome (Type IV,
Gaussian-like MF, Max = 4, 1st layer)
functions are used for each input variable. The number of input variables
considered in the 1st layer (here, the entire set of system input variables)
is 6 (Type IV), and the selected polynomial order is Type 2. Furthermore, the
entire set of system input variables is used in the regression polynomial of
the consequent part of the fuzzy rules in the first layer. That is, "Type 2∗"
is used as the consequent input type in the 1st layer, while in the 2nd layer
and higher the input variables of the conclusion part of the fuzzy rules are
kept the same as those of the conditional part. In particular, in the 2nd layer
and higher, the number of input variables is given as W, the number of nodes
selected in the current layer as output nodes of the preceding layer; refer to
sub-step 4 of step 4-3 of the introduced design process. As mentioned
previously, the maximal number (Max) of input variables available for
selection is confined to 4, and two variables (such as x2 and x3) were
selected among them. The parameters of the conclusion (polynomial) part of the
fuzzy rules are determined by the standard least-squares method.
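To make this last step concrete, the sketch below sets up one global least-squares problem in which each rule's quadratic regressors are weighted by the rule's normalized Gaussian activation, so all consequent coefficients are estimated at once. The helper name `fit_consequents` and the rule parameters `centers` and `sigma` are hypothetical illustrations rather than the authors' implementation, and a Type 2 consequent is taken here as a quadratic without cross terms.

```python
import numpy as np

def fit_consequents(X, y, centers, sigma):
    """Estimate the polynomial consequent parameters of all fuzzy rules by one
    global least-squares fit (a sketch; the rule shapes are assumptions)."""
    # Gaussian-like membership grades, one rule per center, normalized per sample
    d = X[:, None, :] - centers[None, :, :]                  # (N, R, n)
    w = np.exp(-np.sum(d ** 2, axis=2) / (2 * sigma ** 2))   # (N, R)
    w /= w.sum(axis=1, keepdims=True)
    # Type 2 consequent taken as a quadratic without cross terms: [1, x, x^2]
    Phi = np.concatenate([np.ones((len(X), 1)), X, X ** 2], axis=1)
    # stack the weighted regressors of every rule and solve a single LSE
    A = np.concatenate([w[:, [r]] * Phi for r in range(len(centers))], axis=1)
    theta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return theta.reshape(len(centers), -1)   # one coefficient row per rule
```

The model output is then the activation-weighted sum of the rules' polynomial outputs.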
Table 8 summarizes the results for Types II and IV: for each maximal
number of inputs to be selected (Max = 2 to 5), the selected node numbers, the
selected polynomial type (Type T), and the corresponding
Table 8. Performance index of the network of each layer versus the increase of the maximal number of inputs to be selected, for Types II and IV
Max  (per layer, 1st-5th: Node; T; PI; EPI)
2   (1 2) T4 0.017 0.142 | (8 14) T2 0.017 0.134 | (17 26) T2 0.017 0.132 | (15 25) T4 0.017 0.130 | (1 5) T4 0.016 0.128
3   (1 2 0) T4 0.017 0.142 | (25 16 17) T1 0.025 0.131 | (10 20 26) T1 0.024 0.127 | (2 4 0) T2 0.016 0.119 | (16 25 0) T2 0.015 0.117
4   (1 2 0 0) T4 0.017 0.142 | (3 9 14 25) T1 0.021 0.125 | (2 3 9 10) T1 0.023 0.111 | (10 20 21 28) T1 0.030 0.108 | (3 13 0 0) T2 0.018 0.101
5   (1 2 0 0 0) T4 0.017 0.142 | (5 6 9 24 0) T1 0.020 0.127 | (4 7 17 20 0) T1 0.022 0.117 | (17 29 0 0 0) T2 0.014 0.114 | (19 29 0 0 0) T3 0.014 0.108
performance index (PI and EPI) obtained when the genetic optimization
was carried out for each layer are shown. "Node" denotes the nodes for which
the fitness value is maximal in each layer. For example, in the case of
Table 8(b-2), the fitness value in layer 5 is maximal for Max = 5 when nodes
5, 15, 20 of the previous layer are selected as the inputs of the present
layer; only 3 inputs of Type 1 (constant) were selected as the result of the
genetic optimization. A node entry of "0" indicates that no node has been
selected by the genetic operation. Therefore the width (the number of nodes)
of a layer can be lower than in the conventional FPNN, which contributes
immensely to the compactness of the resulting network. In that case, the
minimal values of the performance index at the node, PI = 0.016 and
EPI = 0.100, are obtained.
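The decoding just described (three bit fields, normalization of each raw value onto its admissible range, and "0" marking an unselected slot) can be sketched as follows. The field widths and the rounding rule are not given in the chapter, so this particular layout is an assumption for illustration only.

```python
def decode_chromosome(bits, max_inputs, n_candidates, n_poly_types=4):
    """Decode one node's chromosome into (n_inputs, poly_type, input_nodes).

    Assumed layout: 3 bits for the number of inputs, 2 bits for the
    polynomial order, and the remaining bits split evenly into max_inputs
    fields naming candidate nodes of the preceding layer.
    """
    def field(b):                      # bit slice -> integer
        return int(''.join(map(str, b)), 2)

    def normalize(v, n_bits, lo, hi):  # scale a raw value onto [lo, hi]
        return lo + round(v * (hi - lo) / (2 ** n_bits - 1))

    n_in = normalize(field(bits[0:3]), 3, 1, max_inputs)
    ptype = normalize(field(bits[3:5]), 2, 1, n_poly_types)
    per = (len(bits) - 5) // max_inputs
    nodes = [normalize(field(bits[5 + k * per: 5 + (k + 1) * per]), per,
                       1, n_candidates) for k in range(max_inputs)]
    # only the first n_in slots are used; the rest are reported as node "0"
    return n_in, ptype, nodes[:n_in] + [0] * (max_inputs - n_in)
```

For instance, an all-zero chromosome decodes to the smallest admissible values in every field, while an all-one chromosome decodes to the largest.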
Fig. 20 shows the values of the performance index vis-à-vis the number of
layers of the gSONN with respect to the maximal number of inputs to be
selected, for the optimal architectures of each layer of the network included
in Table 8(b), while Fig. 21 summarizes the values of the performance index
vis-à-vis the increasing number of layers with regard to the number of
[Fig. 20 plots: (a) triangular MF, (b) Gaussian-like MF; panels (a-1)/(b-1) training error and (a-2)/(b-2) testing error versus layer (1-5), with A(•)-D(•) listing the node numbers selected for Max = 2-5.]
Fig. 20. Performance index of gSONN for Type IV according to the increase of
number of layers
[Fig. 21 plots: (a) triangular MF, (b) Gaussian-like MF; training and testing error versus layer (1-5) for 2, 3, 4, and 6 system inputs, with selected input variables compared against the entire system input variables.]
Fig. 21. Performance index of gSONN with respect to the increase of number of
system inputs (in case of Max = 5)
entire system inputs being used in gSONN when using Max = 5. In Fig. 20,
A(•)-D(•) denote the optimal node numbers at each layer of the network,
namely those with the best predictive performance. Here, the node numbers of
the 1st layer refer to the system inputs, and the node numbers in the 2nd
layer or higher refer to the output nodes of the preceding layer, i.e., the
optimal node with the best output performance in the current layer.
Fig. 22 illustrates the detailed optimal topology of the network with 2 layers,
compared to the conventional optimized network of Fig. 23. In Fig. 23, the
performance of the two types of conventional optimized SONNs is quantified by
PI = 0.020, EPI = 0.130 for Type II, and PI = 0.016, EPI = 0.128 for Type III,
whereas, under similar performance values, the two gSONN architectures shown
in Fig. 22(a) and (b) were obtained. As shown in Fig. 22, the genetic design
procedure at each stage (layer) of the SONN leads to the selection of the
preferred nodes (or FPNs) with optimal local characteristics (such as the
number of input variables, the order of the polynomial, and a specific subset
of input variables).
[Fig. 22 diagrams: (a) FPN-based gSONN with 2 layers for Type II, Max = 5, and Gaussian-like MF; (b) FPN-based gSONN with 2 layers for Type IV, Max = 4, and Gaussian-like MF.]
Fig. 22. Genetically optimized FPNN (gFPNN) architecture
[Fig. 23 diagrams: conventional optimized FPN-based SONN topologies, (a) Type II and (b) Type III.]
Table 9. Performance index of FPN-based SONN architectures for Types I, II, III,
and IV
(each row: layer; optimal fuzzy polynomial neuron, i.e., no. of inputs
(node no.) and polynomial type; PIs; EPIs)
1 1(1) Type 2 0.022 0.335
Max = 5 2 4(7, 9, 10, 11) Type 1 0.018 0.271
(Triangular MF) 3 4(6, 12, 25, 28) Type 1 0.017 0.270
4 4(6, 11, 12, 27) Type 1 0.017 0.268
Type I 5 3(10, 14, 19) Type 1 0.016 0.264
(SI: 2) 1 2(1,2) Type 3 0.022 0.281
Max = 5 2 3(6, 9, 11) Type 2 0.020 0.267
(Gaussian-like MF) 3 4(5, 13, 15, 25) Type 1 0.020 0.260
4 2(3, 6) Type 3 0.017 0.252
5 2(28, 30) Type 2 0.017 0.249
1 1(1) Type 4 0.020 0.133
Max = 5 2 2(4, 27) Type 1 0.049 0.129
(Triangular MF) 3 2(11, 22) Type 2 0.019 0.124
4 2(5, 23) Type 2 0.018 0.119
Type II 5 4(17, 21, 26, 27) Type 1 0.016 0.114
(SI: 3) 1 2(1, 2) Type 4 0.017 0.142
Max = 4 2 4(3, 9, 14, 25) Type 1 0.021 0.125
(Gaussian-like MF) 3 4(2, 3, 9, 10) Type 1 0.023 0.111
4 4(10, 20, 21, 28) Type 1 0.030 0.108
5 2(3, 13) Type 2 0.018 0.101
1 3(1, 3, 4) Type 1 0.021 0.135
Max = 5 2 2(1, 26) Type 1 0.021 0.125
(Triangular MF) 3 4(2, 6, 13, 15) Type 1 0.017 0.123
4 4(10, 14, 17, 28) Type 1 0.010 0.120
Type III 5 4(9, 14, 17, 24) Type 1 0.010 0.115
(SI: 4) 1 3(1, 3, 4) Type 2 0.016 0.146
Max = 5 2 4(1, 11, 24, 27) Type 1 0.043 0.128
(Gaussian-like MF) 3 4(1, 8, 19, 24) Type 1 0.024 0.117
4 2(2, 22) Type 3 0.012 0.104
5 4(8,9,23,24) Type 1 0.014 0.099
1 3(2, 5, 6) Type 1 0.021 0.136
Max = 5 2 2(16, 27) Type 1 0.021 0.124
(Triangular MF) 3 3(5, 6, 10) Type 4 0.015 0.106
4 2(5, 27) Type 2 0.015 0.105
Type IV 5 2(7, 19) Type 3 0.015 0.103
(SI: 6) 1 2(2, 3) Type 2 0.012 0.145
Max = 4 2 2(19, 29) Type 4 0.011 0.125
(Gaussian-like MF) 3 3(4, 13, 29) Type 1 0.020 0.104
4 1(29) Type 3 0.011 0.106
5 4(2, 20, 24, 27) Type 1 0.008 0.093
Table 10. Comparative analysis of the performance of the network; considered are
models reported in the literature
Performance index
Model PI PIs EPIs
Box and Jenkins' model [20] 0.710
Tong’s model [21] 0.469
Sugeno and Yasukawa’s model [22] 0.355
Sugeno and Yasukawa’s model [23] 0.190
Xu and Zailu’s model [24] 0.328
Pedrycz’s model [15] 0.320
Chen’s model [25] 0.268
Gomez-Skarmeta’s model [26] 0.157
Oh and Pedrycz’s model [16] 0.123 0.020 0.271
Kim et al.’s model [27] 0.055
Kim et al.’s model [28] 0.034 0.244
Leski and Czogala’s model [29] 0.047
Lin and Cunningham’s model [30] 0.071 0.261
NNFS model [31] 0.128
FPNN [32]: CASE I (SI = 4, 5th layer) 0.016 / 0.116; CASE II (SI = 4, 5th layer) 0.016 / 0.128
PNN [33]: Basic (SI = 4, 5th layer) 0.021 / 0.110; Modified (SI = 4, 5th layer) 0.015 / 0.103
HFPNN [34]: Triangular (SI = 4, 5th layer) 0.019 / 0.134; Gaussian (SI = 4, 5th layer) 0.021 / 0.119
Generic SOPNN [10]: Basic SOPNN (SI = 4, 5th layer) 0.027 / 0.021 / 0.085; Modified SOPNN (SI = 4, 5th layer) 0.035 / 0.017 / 0.095
Advanced SOPNN [11]: Basic SOPNN (SI = 4, 5th layer) 0.020 / 0.119; Modified SOPNN (SI = 4, 5th layer) 0.018 / 0.118
SONN∗ [13], Type I (SI = 2), Basic SONN: Case 1 (5th layer) 0.016 / 0.266; Case 2 (5th layer) 0.016 / 0.265
SONN∗ [13], Type I (SI = 2), Modified SONN: Case 1 (5th layer) 0.013 / 0.267; Case 2 (5th layer) 0.013 / 0.272
SONN∗ [13], Type II (SI = 4), Basic SONN: Case 1 (5th layer) 0.016 / 0.116; Case 2 (5th layer) 0.016 / 0.128
SONN∗ [13], Type II (SI = 4), Modified SONN: Case 1 (5th layer) 0.016 / 0.133; Case 2 (5th layer) 0.018 / 0.131
Proposed gSONN, PN-based: Type I (SI = 2): 2nd layer (Max = 4) 0.019 / 0.282, 5th layer (Max = 5) 0.017 / 0.259; Type II (SI = 3): 1st layer (Max = 5) 0.022 / 0.135, 3rd layer (Max = 5) 0.022 / 0.115; Type III (SI = 4): 2nd layer (Max = 5) 0.015 / 0.119, 5th layer (Max = 5) 0.018 / 0.098; Type IV (SI = 6): 2nd layer (Max = 5) 0.015 / 0.112, 3rd layer (Max = 5) 0.012 / 0.091
Proposed gSONN, FPN-based (Gaussian-like MF): Type I (SI = 2): 1st layer (Max = 5) 0.017 / 0.281, 5th layer (Max = 5) 0.014 / 0.249; Type II (SI = 3): 1st layer (Max = 5) 0.017 / 0.142, 5th layer (Max = 5) 0.014 / 0.108; Type III (SI = 4): 1st layer (Max = 5) 0.016 / 0.146, 5th layer (Max = 5) 0.014 / 0.099; Type IV (SI = 6): 2nd layer (Max = 5) 0.012 / 0.145, 5th layer (Max = 4) 0.008 / 0.093
PI - performance index over the entire data set,
PIs - performance index on the training data, EPIs - performance index on the testing data.
*: denotes “conventional optimized FPN-based SONN”.
ẋ(t) = 0.2x(t − τ) / (1 + x^10(t − τ)) − 0.1x(t)    (20)
Table 11. System’s input vector formats for the design of SONN
Number of System Inputs (SI) Inputs and output
4 (Type I): x(t−18), x(t−12), x(t−6), x(t); x(t+6)
5 (Type II): x(t−24), x(t−18), x(t−12), x(t−6), x(t); x(t+6)
6 (Type III): x(t−30), x(t−24), x(t−18), x(t−12), x(t−6), x(t); x(t+6)
where t = 118 to 1117.
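A minimal sketch of generating the chaotic series of eq. (20) and of assembling the Type I input-output patterns of Table 11 might look as follows. The integration scheme is not stated in the chapter, so plain Euler stepping with a constant initial history is an assumption made for illustration:

```python
from collections import deque

def mackey_glass(n, tau=30, dt=1.0, x0=1.2):
    """Integrate eq. (20), dx/dt = 0.2 x(t-tau)/(1 + x(t-tau)^10) - 0.1 x(t),
    with a plain Euler step and a constant initial history (assumptions)."""
    hist = deque([x0] * int(tau / dt), maxlen=int(tau / dt))
    x, series = x0, []
    for _ in range(n):
        xd = hist[0]                 # delayed value x(t - tau)
        hist.append(x)
        x += dt * (0.2 * xd / (1.0 + xd ** 10) - 0.1 * x)
        series.append(x)
    return series

def make_patterns(series, delays=(18, 12, 6, 0), horizon=6):
    """Type I patterns of Table 11: [x(t-18), x(t-12), x(t-6), x(t)] -> x(t+6)."""
    lo, hi = max(delays), len(series) - horizon
    X = [[series[t - d] for d in delays] for t in range(lo, hi)]
    y = [series[t + horizon] for t in range(lo, hi)]
    return X, y
```

Types II and III follow by extending `delays` with 24 and 30, matching the rows of Table 11.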
Table 12 summarizes the results for Type III when setting the maximal number
of inputs to be selected to 2, 3, 4, 5, and 10. The best results for the
network in the 5th layer, PI = 0.0009 and EPI = 0.0009, are reported when
using Max = 10.
Fig. 24 depicts the values of the performance index of the PN-based
gSONN with respect to the maximal number of inputs to be selected when
the number of system inputs is equal to 6 (Type III). Next, Fig. 25 shows the
performance index of the gSONN with respect to the number of entire system
inputs when using Max = 10.
Table 14 summarizes the performance of the 1st to 5th layers of the network
when changing the maximal number of inputs to be selected; here Max was
varied from 2 to 5. Fig. 26 depicts the performance index of each layer of
the FPN-based gSONN according to the increase of the maximal number of inputs
to be selected. Fig. 27 summarizes the values of the performance index
vis-à-vis the increasing number of layers with regard to the number of
selected inputs (or entire system inputs) used in the gSONN when using
Max = 5.
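The layer-by-layer construction the section keeps referring to can be caricatured by a greedy pass that fits a quadratic polynomial neuron to every pair of current nodes and keeps the few best by training error. This is only a structural analogy: the chapter selects node subsets, polynomial types, and the number of inputs by a GA rather than by the exhaustive pairwise search used below.

```python
import numpy as np

def layer_growth(X, y, width=5, n_layers=3):
    """Greedy GMDH-style sketch of stacking polynomial-neuron layers."""
    nodes = [X[:, j] for j in range(X.shape[1])]      # 1st layer: system inputs
    for _ in range(n_layers):
        cands = []
        for i in range(len(nodes)):
            for j in range(i + 1, len(nodes)):
                a, b = nodes[i], nodes[j]
                # full quadratic of a node pair (a Type 2-like polynomial)
                Phi = np.stack([np.ones_like(a), a, b, a * a, b * b, a * b],
                               axis=1)
                c, *_ = np.linalg.lstsq(Phi, y, rcond=None)
                out = Phi @ c
                cands.append((np.mean((out - y) ** 2), out))
        cands.sort(key=lambda t: t[0])                # rank by training error
        nodes = [out for _, out in cands[:width]]     # keep the `width` best
    return nodes[0]                                   # best node of last layer
```

Capping `width` plays the role of the "0"-marked unselected nodes above: each layer stays narrow, which is what makes the resulting network compact.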
Fig. 28(a)-(b) illustrate the detailed optimal topologies of the gSONN with
1 layer when using the triangular MF; the results of the network are reported
as PI = 4.0e-4, EPI = 4.0e-4 for Max = 3, and PI = 7.7e-5, EPI = 1.6e-4 for
Max = 5. Table 13 summarizes the detailed results of the optimal architecture
for each type of the network. Fig. 28(c)-(d) also illustrate the detailed
optimal topologies of the gSONN with 1 layer in the case of the Gaussian-like
MF; these are quantified as PI = 3.0e-4, EPI = 3.0e-4 for Max = 2, and
PI = 3.6e-5, EPI = 4.5e-5 for Max = 5. As shown in Fig. 28, the proposed
design leads to a structurally more optimized and simplified network than
the conventional SONN.
Fig. 29 illustrates the optimization process by visualizing the values of the
performance index obtained in successive generations of the GA. It also shows
the optimized network architecture when using the Gaussian-like MF (the
maximal number of inputs to be selected, Max, set to 5, with a structure
composed of 5 layers). As shown in Fig. 29, the variation (slope) of the
performance of the network is almost the same from the 2nd through the 5th
layer; in this case, a stopping criterion can therefore be applied after at
most 1 or 2 layers in order to effectively reduce the number of nodes as well
as layers of the network (from the viewpoint of the width and depth of a
compact gSONN).
Table 15 summarizes the results of the optimal architecture according to
each Type (such as Types I, II, and III) of the network when using Type T∗
as the consequent input type.
Table 12. Performance index of the PN-based SONN of each layer versus the increase of maximal number of inputs to be selected for
Type III
1st layer 2nd layer 3rd layer 4th layer 5th layer
Max
Node T PI EPI Node T PI EPI Node T PI EPI Node T PI EPI Node T PI EPI
2 4 6 2 0.0502 0.0497 12 23 2 0.0309 0.0305 17 64 2 0.0215 0.0211 31 86 2 0.0173 0.0170 70 73 2 0.0156 0.0154
3 3 4 6 2 0.0347 0.0339 31 39 73 2 0.0161 0.0159 1 50 73 2 0.0107 0.0103 15 80 93 2 0.0080 0.0078 8 14 43 2 0.0067 0.0065
4 1 3 4 6 2 0.0293 0.0283 4 25 59 63 2 0.0106 0.0105 47 51 61 69 2 0.0063 0.0062 6 31 46 66 2 0.0052 0.0052 29 40 75 95 2 0.0045 0.0044
5 1 3 4 5 6 2 0.0231 0.0223 2 51 70 74 83 2 0.0126 0.0124 74 76 79 86 95 2 0.0066 0.0063 16 36 43 53 58 2 0.0049 0.0046 10 21 34 46 67 2 0.0041 0.0038
10 1 2 3 4 5 6 2 0.0211 0.0203 1 35 38 42 45 56 57 70 91 97 2 0.0036 0.0034 2 3 4 31 52 61 72 78 81 85 2 0.0021 0.0019 3 4 20 27 38 57 80 82 91 96 2 0.0013 0.0012 3 8 23 38 50 53 56 63 85 99 2 0.0009 0.0009
[Fig. 24 plots: (a) training error and (b) testing error versus layer (1-5), with A(•)-E(•) listing the node numbers selected for Max = 2, 3, 4, 5, and 10.]
Fig. 24. Performance index according to the increase of number of layers in case of
Type III
[Fig. 25 plots: (a) training error and (b) testing error versus layer (1-5) for 4, 5, and 6 system inputs.]
Fig. 25. Performance index of gSONN with respect to the increase of number of
system inputs (Max = 10)
Table 13. Performance index of PN-based SONN architectures for Types I, II,
and III
(each row: layer; optimal polynomial neuron, i.e., no. of inputs (node no.)
and polynomial type; PIs; EPIs)
1 4(1, 2, 3, 4) Type 2 0.0302 0.0294
2 5(19, 26, 28, 34, 41) Type 2 0.0082 0.0081
Max = 5 3 5(2, 17, 23, 48, 86) Type 2 0.0063 0.0061
4 5(3, 25, 71, 75, 78) Type 2 0.0050 0.0049
Type I 5 5(1, 3, 54, 94, 98) Type 2 0.0043 0.0043
(SI: 4) 1 4(1, 2, 3, 4) Type 2 0.0302 0.0294
2 10(2,4,8,10,11,12,13,15,21,23) Type 2 0.0047 0.0045
Max = 10 3 10(12,34,35,43,52,60,62,72,89,95) Type 2 0.0039 0.0038
4 10(12,13,28,29,32,41,57,73,91,94) Type 2 0.0032 0.0031
5 9(44,51,53,67,69,75,86,92,93) Type 2 0.0026 0.0026
1 5(1, 2, 3, 4, 5) Type 2 0.0282 0.0274
2 5(39, 59, 71, 74, 84) Type 2 0.0088 0.0087
Max = 5 3 5(46, 51, 66, 67, 88) Type 2 0.0054 0.0052
4 5(29, 38, 58, 72, 83) Type 2 0.0042 0.0041
Type II 5 5(10, 21, 23, 63, 85) Type 2 0.0037 0.0035
(SI: 5) 1 5(1, 2, 3, 4, 5) Type 2 0.0282 0.0274
2 10(7,38,39,40,47,48,56,59,60,68) Type 2 0.0026 0.0024
Max = 10 3 10(3,11,18,35,42,50,52,53,67,73) Type 2 0.0019 0.0018
4 10(2,8,26,30,33,35,48,79,81,85) Type 2 0.0015 0.0014
5 10(3,7,10,31,54,59,79,89,95,97) Type 2 0.0012 0.0012
1 5(1, 3, 4, 5, 6) Type 2 0.0231 0.0223
2 5(2, 51, 70, 74, 83) Type 2 0.0126 0.0124
Max = 5 3 5(74, 76, 79, 86, 95) Type 2 0.0066 0.0063
4 5(16, 36, 43, 53, 58) Type 2 0.0049 0.0046
Type III 5 5(10, 21, 34, 46, 67) Type 2 0.0041 0.0038
(SI: 6) 1 6(1, 2, 3, 4, 5, 6) Type 2 0.0211 0.0203
2 10(1,35,38,42,45,56,57,70,91,97) Type 2 0.0036 0.0034
Max = 10 3 10(2,3,4,31,52,61,72,78,81,85) Type 2 0.0021 0.0019
4 10(3,4,20,27,38,57,80,82,91,96) Type 2 0.0013 0.0012
5 10(3,8,23,38,50,53,56,63,85,99) Type 2 0.0009 0.0009
6 Concluding remarks
In this study, we introduced a class of genetically optimized self-organizing
neural networks, discussed their topologies, came up with a detailed genetic
design procedure, and applied these networks to nonlinear system modeling.
The comprehensive experimental studies involving well-known datasets
demonstrate the superb performance of the network in comparison to the
existing fuzzy and neuro-fuzzy models. The key features of this approach can
be enumerated as follows:
• The gSONN is a sophisticated and optimized architecture capable of
constructing models from a small number of system inputs as well as a
limited data set.
• The proposed design methodology helps reach a compromise between ap-
proximation and generalization capabilities of the constructed gSONN
model.
• The depth (layer size) and width (node size of each layer) of the gSONN
can be selected as a result of a tradeoff between accuracy and complexity
of the overall model.
Table 14. Performance index of the network of each layer versus the increase of maximal number of inputs to be selected for Type III
(in case of Type T∗ )
(a) Triangular MF
1st layer 2nd layer 3rd layer 4th layer 5th layer
Max
Node T PI EPI Node T PI EPI Node T PI EPI Node T PI EPI Node T PI EPI
2 4 5 3 0.0019 0.0017 14 25 4 0.0016 0.0015 4 7 3 0.0016 0.0014 10 22 3 0.0015 0.0014 14 16 3 0.0015 0.0013
3 3 4 5 3 0.0004 0.0004 2 3 13 1 0.0003 0.0003 4 10 24 4 0.0003 0.0003 2 14 22 4 0.0002 0.0003 14 20 25 3 0.0002 0.0002
4 3 4 5 6 3 7.7e-5 1.6e-4 4 14 23 0 2 6.9e-5 1.3e-4 6 10 14 0 4 6.0e-5 1.2e-4 20 25 29 0 4 5.8e-5 1.0e-4 10 13 15 0 3 5.6e-5 9.9e-5
5 3 4 5 6 0 3 7.7e-5 1.6e-4 6 16 23 0 0 4 6.7e-5 1.2e-4 12 18 0 0 0 4 6.1e-5 1.0e-4 3 15 28 0 0 3 5.6e-5 9.7e-5 13 15 19 21 0 1 5.7e-5 9.4e-5
(b) Gaussian-like MF
1st layer 2nd layer 3rd layer 4th layer 5th layer
Max
Node T PI EPI Node T PI EPI Node T PI EPI Node T PI EPI Node T PI EPI
2 4 6 3 0.0003 0.0003 18 20 4 0.0002 0.0002 13 18 3 0.0002 0.0002 12 14 3 0.0002 0.0002 4 21 4 0.0002 0.0002
3 1 4 5 3 3.6e-5 0.199 3 18 19 2 3.2e-5 4.1e-5 7 17 21 2 3.1e-5 3.9e-5 6 18 29 2 2.7e-5 3.6e-5 15 23 0 3 2.7e-5 3.5e-5
4 1 4 5 0 3 3.6e-5 4.5e-5 7 16 19 24 2 2.8e-5 4.1e-5 11 17 23 28 4 2.8e-5 3.6e-5 2 3 18 0 2 2.4e-5 3.4e-5 7 19 0 0 2 2.4e-5 3.3e-5
5 1 4 5 0 0 3 3.6e-5 4.5e-5 12 14 18 0 0 2 3.2e-5 4.1e-5 7 8 13 14 20 4 2.7e-5 3.6e-5 6 14 15 0 0 3 2.4e-5 3.4e-5 18 26 28 0 0 4 2.3e-5 3.3e-5
[Fig. 26 plots: performance index versus layer (1-5) for Max = 2(A), 3(B), 4(C), 5(D); (a) triangular MF, (b) Gaussian-like MF; panels (-1) training error and (-2) testing error, with A(•)-D(•) listing the selected node numbers.]
• The structure of the network is not predetermined (as in most of the ex-
isting neural networks) but becomes dynamically adjusted and optimized
during the development process.
• With a properly selected type of membership functions and organization of
the layers, the FPN-based gSONN performs better than other fuzzy and
neuro-fuzzy models.
• The gSONN comes with a diversity of local neuron characteristics, such as
PNs or FPNs, that are useful in coping with the various nonlinear
characteristics of the modeled systems. The GA-based design procedure at each
stage (layer) of the gSONN leads to the optimal selection of these preferred
nodes (PNs or FPNs) with local characteristics (such as the number of input
variables, the order of the polynomial, and a specific subset of input
variables) available within a single node; based on these selections, we
build a flexible and optimized gSONN architecture.
[Fig. 27 plots: (a) triangular MF, (b) Gaussian-like MF; training and testing error versus layer (1-5) for 4, 5, and 6 system inputs, with selected input variables compared against the entire system input variables.]
Fig. 27. Performance index of gFPNN with respect to the increase of number of
system inputs (Max = 5)
[Fig. 28 diagrams: optimal one-layer gSONN topologies for Type III; (a) triangular MF, Max = 3; (b) triangular MF, Max = 5; (c)-(d) Gaussian-like MF.]
[Fig. 29 plots: training and testing error of the 1st through 5th layers over 500 generations.]
Fig. 29. The optimization process of each performance index by the genetic
algorithms (Type III, Max = 5, Gaussian)
Table 15. Performance index of FPN-based SONN architectures for Types I, II,
and III
(each row: layer; optimal fuzzy polynomial neuron, i.e., no. of inputs
(node no.) and polynomial type; PIs; EPIs)
1 4(1, 2, 3, 4) Type 3 0.0017 0.0016
Max = 5 2 5(8, 24, 25, 27, 28) Type 2 0.0014 0.0013
(Triangular MF) 3 4(2, 5, 9, 15) Type 2 0.0011 0.0011
4 4(2, 6, 17, 26) Type 3 0.0009 0.0010
Type I 5 4(1, 3, 17, 24) Type 1 0.0009 0.0010
(SI: 4) 1 4(1, 2, 3, 4) Type 3 0.0005 0.0006
Max = 5 2 5(7, 12, 18, 22, 29) Type 4 0.0004 0.0006
(Gaussian-like MF) 3 4(10, 18, 20, 22) Type 2 0.0003 0.0005
4 4(3, 7, 21, 26) Type 2 0.0003 0.0005
5 2(2, 15) Type 4 0.0003 0.0005
1 5(1, 2, 3, 4, 5) Type 4 0.0003 0.0004
Max = 5 2 4(1, 8, 9, 14) Type 4 0.0003 0.0003
(Triangular MF) 3 4(2, 17, 23, 24) Type 3 0.0002 0.0003
4 5(8, 11, 18, 21, 26) Type 1 0.0002 0.0003
Type II 5 4(7, 9, 14, 29) Type 1 0.0002 0.0003
(SI: 5) 1 3(2, 3, 5) Type 3 0.0004 0.0003
Max = 4 2 4(9, 11, 21, 30) Type 2 0.0003 0.0003
(Gaussian-like MF) 3 4(2, 8, 19, 21) Type 4 0.0002 0.0002
4 2(22, 24) Type 4 0.0002 0.0002
5 3(8, 18, 27) Type 3 0.0002 0.0002
1 4(3, 4, 5, 6) Type 3 7.7e-5 1.6e-4
Max = 5 2 3(6, 16, 23) Type 4 6.7e-5 1.2e-4
(Triangular MF) 3 2(12, 18) Type 4 6.1e-5 1.0e-4
4 3(3, 15, 28) Type 3 5.6e-5 9.7e-5
Type III 5 4(13, 15, 19, 21) Type 1 5.7e-5 9.4e-5
(SI: 6) 1 3(1, 4, 5) Type 3 3.6e-5 4.5e-5
Max = 5 2 3(12, 14, 18) Type 2 3.2e-5 4.1e-5
(Gaussian-like MF) 3 5(7, 8, 13, 14, 20) Type 4 2.7e-5 3.6e-5
4 3(6, 14, 15) Type 3 2.4e-5 3.4e-5
5 3(18, 26, 28) Type 4 2.3e-5 3.3e-5
Table 16. Comparative analysis of the performance of the network; considered are
models reported in the literature
Performance index
Model (values: PI / PIs / EPIs / NDEI∗)
Wang's model [35]: 0.044; 0.013; 0.010
Cascaded-correlation NN [36]: 0.06
Backpropagation MLP [36]: 0.02
6th-order polynomial [36]: 0.04
ANFIS [37]: 0.0016 / 0.0015 / 0.007
FNN model [38]: 0.014 / 0.009
Recurrent neural network [39]: 0.0138
SONN∗∗ [40], Type I (5th layer), Basic: Case 1 0.0011 / 0.0011 / 0.005; Case 2 0.0027 / 0.0028 / 0.011
SONN∗∗ [40], Type I (5th layer), Modified: Case 1 0.0012 / 0.0011 / 0.005; Case 2 0.0038 / 0.0038 / 0.016
SONN∗∗ [40], Type II (5th layer), Basic: Case 1 0.0003 / 0.0005 / 0.0016; Case 2 0.0002 / 0.0004 / 0.0011
SONN∗∗ [40], Type III (5th layer), Modified: Case 1 0.000001 / 0.00009 / 0.000006; Case 2 0.00004 / 0.00007 / 0.00015
Proposed gSONN, PN-based: Type I (5th layer), Max = 10: 0.0026 / 0.0026; Type II (5th layer), Max = 10: 0.0012 / 0.0012; Type III (5th layer), Max = 10: 0.0009 / 0.0009
Proposed gSONN, FPN-based: Type I (1st layer): Triangular, Max = 5: 0.0017 / 0.0016; Gaussian, Max = 5: 0.0005 / 0.0006; Type II (1st layer): Triangular, Max = 5: 0.0003 / 0.0004; Gaussian, Max = 5: 0.0004 / 0.0003; Type III (1st layer): Triangular, Max = 5: 7.7e-5 / 1.6e-4; Gaussian, Max = 5: 3.6e-5 / 4.5e-5
*Non-dimensional error index (NDEI), as used in [41], is defined as the root
mean square error divided by the standard deviation of the target series.
** is called “conventional optimized FPN-based SONN”.
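The NDEI of the footnote above is simple to reproduce; a small helper, written here purely for illustration (a population standard deviation is assumed):

```python
import math

def ndei(target, pred):
    """Non-dimensional error index: RMSE divided by the standard deviation
    of the target series (population standard deviation assumed)."""
    n = len(target)
    rmse = math.sqrt(sum((t - p) ** 2 for t, p in zip(target, pred)) / n)
    mean = sum(target) / n
    std = math.sqrt(sum((t - mean) ** 2 for t in target) / n)
    return rmse / std
```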
7 Acknowledgements
This work was supported by the Korea Research Foundation Grant funded by
the Korean Government (MOEHRD)(KRF-2006-311-D00194 and KRF-2006-
D00019).
References
1. Cherkassky V, Gehring D, Mulier F (1996) Comparison of adaptive methods for
function estimation from samples. IEEE Trans Neural Netw 7:969–984
2. Dickerson JA, Kosko B (1996) Fuzzy function approximation with ellipsoidal
rules. IEEE Trans Syst Man Cybern Part B 26:542–560
3. Sommer V, Tobias P, Kohl D, Sundgren H, Lundstrom L (1995) Neural networks
and abductive networks for chemical sensor signals: a case comparison. Sens
Actuators B 28:217–222
4. Kleinsteuber S, Sepehri N (1996) A polynomial network modeling approach to
a class of large-scale hydraulic systems. Comput Elect Eng 22:151–168
5. Cordon O, et al. (2004) Ten years of genetic fuzzy systems: current framework
and new trends. Fuzzy Set Syst 141(1):5–31
6. Ivakhnenko AG (1971) Polynomial theory of complex systems. IEEE Trans Syst
Man Cybern SMC-1:364–378
7. Ivakhnenko AG, Madala HR (1994) Inductive learning algorithms for complex
systems modeling. CRC, Boca Raton, FL
8. Ivakhnenko AG, Ivakhnenko GA (1995) The review of problems solvable by
algorithms of the group method of data handling (GMDH). Pattern Recogn
Image Anal 5(4):527–535
9. Ivakhnenko AG, Ivakhnenko GA, Muller JA (1994) Self-organization of neural
networks with active neurons. Pattern Recogn Image Anal 4(2):185–196
10. Oh SK, Pedrycz W (2002) The design of self-organizing polynomial neural
networks. Inf Sci 141:237–258
11. Oh SK, Pedrycz W, Park BJ (2003) Polynomial neural networks architecture:
analysis and design. Comput Electr Eng 29(6):703–725
12. Park HS, Park BJ, Oh SK (2002) Optimal design of self-organizing polynomial
neural networks by means of genetic algorithms. J Res Inst Eng Technol Dev
22:111–121 (in Korean)
13. Oh SK, Pedrycz W (2003) Fuzzy polynomial neuron-based self-organizing neural
networks. Int J Gen Syst 32(3):237–250
14. Pedrycz W, Reformat M (1996) Evolutionary optimization of fuzzy models in
fuzzy logic: a framework for the new millennium. In: Dimitrov V, Korotkich V
(eds.) Studies in fuzziness and soft computing, pp 51–67
15. Pedrycz W (1984) An identification algorithm in fuzzy relational system. Fuzzy
Set Syst 13:153–167
16. Oh SK, Pedrycz W (2000) Identification of fuzzy systems by means of an
auto-tuning algorithm and its application to nonlinear systems. Fuzzy Set Syst
115(2):205–230
17. Michalewicz Z (1996) Genetic algorithms + Data structures = Evolution
programs. Springer, Berlin Heidelberg New York
18. Holland JH (1975) Adaptation in natural and artificial systems. The University
of Michigan Press, Ann Arbor
19. De Jong KA (1992) Are genetic algorithms function optimizers? In: Männer R,
Manderick B (eds.) Parallel problem solving from nature 2. North-Holland,
Amsterdam
20. Box GEP, Jenkins GM (1976) Time series analysis: forecasting and control.
Holden-Day, San Francisco, CA
21. Tong RM (1980) The evaluation of fuzzy models derived from experimental
data. Fuzzy Set Syst 13:1–12
Genetically Optimized Self-organizing Neural Networks 107
1 Introduction
System modelling and identification is important in system analysis, control, and automation as well as in scientific research, so much effort has been directed to developing advanced techniques of system modelling. Neural networks (NNs) and fuzzy systems have been widely used for modelling nonlinear systems. The approximation capability of neural networks, such as multilayer perceptrons, radial basis function (RBF) networks, or dynamic recurrent neural networks, has been investigated [1–3]. On the other hand, fuzzy systems can approximate nonlinear functions with arbitrary accuracy [4, 5]. But the resultant neural network representation is very complex and difficult to understand, and fuzzy systems require too many fuzzy rules for accurate function approximation.
D. Kim and G.-T. Park: Evolution of Inductive Self-organizing Networks, Studies in
Computational Intelligence (SCI) 82, 109–128 (2008)
www.springerlink.com
© Springer-Verlag Berlin Heidelberg 2008
When the SOPNN is designed by using an EA, the most important consideration is the representation strategy, that is, the process by which the key factors of the SOPNN are encoded into the chromosome. A binary coding is employed for the available design specifications. The order and the inputs of each node in the SOPNN are coded as a finite-length string. Our chromosomes are made of three sub-chromosomes. The first one consists of 2 bits for the order of the polynomial (PD), the second one consists of 3 bits for the number of inputs of the PD, and the last one consists of N bits, where N equals the total number of input candidates in the current layer. These input candidates are the node outputs of the previous layer. The representation of binary chromosomes is illustrated in Fig. 1. The 1st sub-chromosome is made of 2 bits, which represent the several possible orders of the PD. The relationship between the bits of the 1st sub-chromosome and the order of the PD is shown in Table 1. Thus, each node can exploit a different order of the polynomial. The 3rd sub-chromosome has N bits, a concatenated string of 0's and 1's. An input candidate is represented by a '1' bit if it is chosen as an input variable to the PD and by a '0' bit if it is not. This type of representation solves the problem of which input variables are to be chosen. If many input candidates are chosen for a model design, the modelling becomes computationally complex and normally requires a lot of time to achieve good results. Also, complex modelling can produce inaccurate results and poor generalization. Good approximation performance does not necessarily guarantee good generalization.
Fig. 1. Representation of the binary chromosome: the 1st sub-chromosome (2 bits) encodes the order of the polynomial, the 2nd sub-chromosome (3 bits) the number of inputs, and the 3rd sub-chromosome selects the input variables. In the example shown, x1 and x6 are selected, giving a PD with 2 inputs and a quadratic (Type 2) polynomial.
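As a concrete illustration, decoding such a chromosome can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' implementation: the bit-to-order mapping stands in for Table 1 (which is not reproduced here), and decoding the 2nd sub-chromosome as "binary value + 1" inputs is likewise an assumption.

```python
# Hypothetical decoder for the three sub-chromosomes described in the text.
# The bit-pair -> polynomial-order mapping below is an assumed stand-in for
# Table 1; the chapter's actual mapping may differ.
ORDER_OF_PD = {(0, 0): 1, (0, 1): 2, (1, 0): 2, (1, 1): 3}

def decode_chromosome(bits, n_candidates):
    """Split a binary chromosome into (order, number of inputs, selected inputs)."""
    order = ORDER_OF_PD[tuple(bits[0:2])]                # 1st sub-chromosome: 2 bits
    n_inputs = 1 + int("".join(map(str, bits[2:5])), 2)  # 2nd sub-chromosome: 3 bits
    selection = bits[5:5 + n_candidates]                 # 3rd sub-chromosome: N bits
    selected = [i for i, b in enumerate(selection) if b == 1]
    return order, n_inputs, selected

# The example of Fig. 1: a quadratic PD with 2 inputs, selecting x1 and x6.
chrom = [1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1]  # N = 6 input candidates
print(decode_chromosome(chrom, 6))  # (2, 2, [0, 5])
```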
All chromosomes of the initial population are randomly initialized; thus, the use of heuristic knowledge is minimized. The fitness assignment in the EA serves as a guide in the search toward the optimal solution. The fitness function for the specific modeling cases will be explained later. After each chromosome is evaluated and associated with a fitness, the current population undergoes the reproduction process to create the next generation of the population. The roulette-wheel selection scheme is used to determine the members of the new generation. After the new population is built, a mating pool is formed for the crossover. The crossover proceeds in three steps. First, two newly reproduced strings are selected from the mating pool produced by reproduction. Second, a position (one point) along the two strings is selected uniformly at random. Third, all characters following the crossing site are exchanged. We use a one-point crossover operator with a crossover probability pc of 0.85. The crossover is then followed by the mutation operation. Mutation is the occasional alteration of the value at a particular bit position (we flip the state of a bit from 0 to 1 or vice versa). Mutation serves as an insurance policy for recovering the loss of a particular piece of information (any simple bit). The mutation rate pm is fixed at 0.05. Generally, after these three operations, the overall fitness of the population improves.
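The three operators above can be sketched as follows. This is an illustrative Python sketch using the stated pc = 0.85 and pm = 0.05, not the authors' code:

```python
import random

PC, PM = 0.85, 0.05  # crossover probability pc and mutation rate pm from the text

def roulette(population, fitnesses):
    """Fitness-proportionate (roulette-wheel) selection of one member."""
    pick = random.uniform(0, sum(fitnesses))
    acc = 0.0
    for member, fit in zip(population, fitnesses):
        acc += fit
        if acc >= pick:
            return member
    return population[-1]

def one_point_crossover(a, b):
    """With probability PC, exchange all bits following a random crossing site."""
    if random.random() < PC:
        site = random.randrange(1, len(a))
        a, b = a[:site] + b[site:], b[:site] + a[site:]
    return a, b

def mutate(chrom):
    """Flip each bit with probability PM (0 -> 1 or vice versa)."""
    return [1 - g if random.random() < PM else g for g in chrom]
```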
E = Θ × PI + (1 − Θ) × EPI    (2)

where Θ ∈ [0, 1] is a weighting factor, and PI and EPI are the performance index values computed over the training data and testing data, respectively.
3 Simulation Results
In this section, we show the performance of our new EA-based SOPNN on two well-known nonlinear system modeling problems. One is the time series of a gas furnace (Box–Jenkins data) [19], which was studied previously in [19–21], and the other is a nonlinear system already exploited in fuzzy modeling [22–26]. The delayed terms of the methane gas flow rate u(t) and the carbon dioxide density y(t), such as u(t−3), u(t−2), u(t−1), y(t−3), y(t−2), and y(t−1), are used
as input variables to the EA-based SOPNN. The actual system output y(t)
is used as the target output variable for this model. We choose the input
variables of nodes in the 1st layer from these input variables. The total data
set consisting of 296 input-output pairs is split into two parts. The first part
(consisting of 148 pairs) is used as the training data set, and the remaining
part of the data set serves as a testing data set. Using the training data set, the
EA-based SOPNN can estimate the coefficients of the polynomial by using the
standard LSE. The performance index is defined as the mean squared error
PI (EPI) = (1/m) Σ_{i=1}^{m} (y_i − ŷ_i)²    (4)
where yi is the actual system output, ŷi is the estimated output of each node,
and m is the number of data.
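The performance index of equation (4), together with the construction of the delayed input vector, translates directly into code. The sketch below is illustrative (the gas furnace data itself is not included):

```python
def performance_index(y_actual, y_model):
    """Mean squared error of equation (4); PI on training data, EPI on testing data."""
    m = len(y_actual)
    return sum((y - yh) ** 2 for y, yh in zip(y_actual, y_model)) / m

def delayed_inputs(u, y, t):
    """Input vector [u(t-3), u(t-2), u(t-1), y(t-3), y(t-2), y(t-1)]."""
    return [u[t - 3], u[t - 2], u[t - 1], y[t - 3], y[t - 2], y[t - 1]]

print(performance_index([1.0, 2.0], [1.5, 2.5]))  # 0.25
```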
The design parameters of EA-based SOPNN for modeling are shown in
Table 3. In the 1st layer, 20 chromosomes are generated and evolved during
40 generations, where each chromosome in the population is defined as a
corresponding node. So 20 nodes (PDs) are produced in the 1st layer based
on the EA operators. All PDs are estimated and evaluated using the training
and testing data sets, respectively. They are also evaluated by the fitness
function of (3) and ranked according to their fitness values. We choose a predetermined number w of nodes, starting from the highest-ranking one, and use their outputs as new input variables for the nodes in the next layer. In
other words, the chosen PDs (w nodes) must be preserved for the design of
the next layer, and the outputs of the preserved PDs serve as inputs to the
next layer. The value of w is different for each layer, which is also shown in
Table 3. This procedure is repeated for the 2nd layer and the 3rd layer.
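The ranking-and-preservation step can be sketched as below; the PD class and its fitness values are placeholders for illustration, not the chapter's implementation:

```python
from dataclasses import dataclass

@dataclass
class PD:
    fitness: float  # fitness value assigned by the fitness function of (3)
    output: list    # node output over the data set

def select_top_w(pds, w):
    """Rank a layer's PDs by fitness and preserve the best w; their outputs
    become the input candidates of the next layer."""
    ranked = sorted(pds, key=lambda p: p.fitness, reverse=True)
    return [p.output for p in ranked[:w]]

layer1 = [PD(0.9, [1.0]), PD(0.4, [2.0]), PD(0.7, [3.0])]
print(select_top_w(layer1, 2))  # [[1.0], [3.0]]
```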
Table 4 summarizes the values of the performance indices, PI and EPI,
of the proposed EA-based SOPNN according to the weighting factor. These
values are the lowest in each layer. The overall lowest values of the performance
indices are obtained at the third layer when the weighting factor is 0.5. If this
model is designed to have a fourth or higher layer, the performance values
become much lower, and the computation time increases considerably for the
model with such a complex network.
Fig. 5 depicts the trend of the performance index values produced in successive generations of the EA for the weighting factor Θ of 0.5. Fig. 6 illustrates the values of the error function and fitness function in successive EA generations when Θ = 0.5. Fig. 7 shows the proposed EA-based SOPNN model with 3 layers and its identification performance for Θ = 0.5. The model output follows the actual output very well. The performance indices of the proposed method are PI = 0.012 and EPI = 0.108.
Fig. 5. Trend of the performance index values with respect to generations through layers: (a) performance index (PI) for the training data set, (b) performance index (EPI) for the testing data set
Fig. 6. Values of the error function and fitness function for successive generations: (a) error function (E), (b) fitness function
Fig. 7. (a) Proposed EA-based SOPNN model with 3 layers, (b) actual output versus model output, (c) error: Proposed EA-based SOPNN model with 3 layers and its identification performance
Fig. 8. (a) Basic SOPNN & Case 1, (b) Modified SOPNN & Case 2: Conventional
SOPNN models with 5 layers
For the comparison of the network size of the proposed EA-based SOPNN
with that of conventional SOPNN, conventional SOPNN models are visualized
in Fig. 8. The structure of the basic SOPNN & Case 1 in Fig. 8(a) is obtained
by use of 4 input variables and a Type 3 polynomial for every node in all layers up to the fifth layer. Its performance is PI = 0.012 and EPI = 0.084.
On the other hand, the structure of the modified SOPNN and Case 2 in
Fig. 8(b) is obtained by use of 2 input variables and Type 1 polynomial for
every node in the 1st layer and 3 input variables and Type 2 polynomial for
every node from the 2nd layer to the 5th layer. In this model, PI is 0.016,
and EPI is 0.101. Figures 7 and 8 show that the structure of the EA-based
SOPNN is much simpler than that of the conventional SOPNN in terms of
the number of nodes and layers, despite their similar performance.
Table 5 provides a comparison of the proposed model with other tech-
niques. The comparison is realized on the basis of the same performance index
for the training and testing data sets. Additionally, PI represents the perfor-
mance index of the model for the training data set and EPI for the testing
data. The proposed architecture, EA-based SOPNN model, outperforms other
models both in terms of accuracy and generalization capability.
Table 5. Comparison of the proposed model with other techniques

Model                                      PI      EPI
Kim's model [20]                           0.034   0.244
Lin's model [21]                           0.071   0.261
Kim's model [12]                           0.013   0.126
SOPNN (5 layers) [13]   Basic & Case 1     0.012   0.084
                        Modified & Case 2  0.016   0.101
EA-based SOPNN (3 layers)                  0.012   0.108
The second modeling example is a nonlinear static system, which has been widely used by Takagi and Hayashi [22], Sugeno and Kang [23], and Kondo [24] to test their modelling approaches.
Table 6 shows 40 pairs of input-output data obtained from (5) [26]. The input x4 is a dummy variable, which is not related to (5). The data in Table 6 are divided into the training data set (Nos. 1–20) and the testing data set (Nos. 21–40). To compare the performance, the same performance index adopted in [22–26], the average percentage error (APE), is used.
APE = (1/m) Σ_{i=1}^{m} (|y_i − ŷ_i| / y_i) × 100    (6)
where m is the number of data pairs, and y_i and ŷ_i are the i-th actual output and model output, respectively.
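Equation (6) can be coded directly (illustrative sketch):

```python
def average_percentage_error(y_actual, y_model):
    """APE of equation (6): mean absolute relative error, in percent."""
    m = len(y_actual)
    return 100.0 / m * sum(abs(y - yh) / y for y, yh in zip(y_actual, y_model))

print(average_percentage_error([10.0, 20.0], [9.0, 22.0]))  # 10.0 (two 10% errors)
```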
Again, a series of comprehensive experiments was conducted, and the results are summarized. The design parameters of the EA-based SOPNN for each layer are shown in Table 7.
The simulation results of the EA-based SOPNN are summarized in Table 8.
The lowest values of the performance indices, PI = 0.188 and EPI = 1.087, are
obtained at the third layer when the weighting factor (Θ) is 0.25.
Figure 9 illustrates the trend of the performance index values produced in
successive generations of the EA for the weighting factor Θ of 0.25. Figure 10
shows the values of the error function and fitness function in successive EA
generations when Θ is 0.25.
No. x1 x2 x3 x4 y No. x1 x2 x3 x4 y
1 1 3 1 1 11.11 21 1 1 5 1 9.545
2 1 5 2 1 6.521 22 1 3 4 1 6.043
3 1 1 3 5 10.19 23 1 5 3 5 5.724
4 1 3 4 5 6.043 24 1 1 2 5 11.25
5 1 5 5 1 5.242 25 1 3 1 1 11.11
6 5 1 4 1 19.02 26 5 5 2 1 14.36
7 5 3 3 5 14.15 27 5 1 3 5 19.61
8 5 5 2 5 14.36 28 5 3 4 5 13.65
9 5 1 1 1 27.42 29 5 5 5 1 12.43
10 5 3 2 1 15.39 30 5 1 4 1 19.02
11 1 5 3 5 5.724 31 1 3 3 5 6.38
12 1 1 4 5 9.766 32 1 5 2 5 6.521
13 1 3 5 1 5.87 33 1 1 1 1 16
14 1 5 4 1 5.406 34 1 3 2 1 7.219
15 1 1 3 5 10.19 35 1 5 3 5 5.724
16 5 3 2 5 15.39 36 5 1 4 5 19.02
17 5 5 1 1 19.68 37 5 3 5 1 13.39
18 5 1 2 1 21.06 38 5 5 4 1 12.68
19 5 3 3 5 14.15 39 5 1 3 5 19.61
20 5 5 4 5 12.68 40 5 3 2 5 15.39
Fig. 9. Trend of the performance index values with respect to generations through layers: (a) performance index (PI) for the training data set, (b) performance index (EPI) for the testing data set
Fig. 10. Values of the error function and fitness function for successive generations: (a) error function (E), (b) fitness function (F)
Fig. 11. Structure of the EA-based SOPNN model with 3 layers (Θ = 0.25)
Fig. 12. (a) Actual output versus model output for the training data, (b) errors of (a), (c) actual output versus model output for the testing data, (d) errors of (c): Identification performance and errors of the EA-based SOPNN model with 3 layers
Model                                      PI          EPI
GMDH model [24]                            4.7         5.7
Fuzzy model [23]        model 1            1.5         2.1
                        model 2            0.59        3.4
FNN [26]                type 1             0.84        1.22
                        type 2             0.73        1.28
                        type 3             0.63        1.25
GD-FNN [25]                                2.11        1.54
SOPNN (5 layers) [13]   Basic & case 1     2.59        8.52
                        Modified           Impossible  Impossible
EA-based SOPNN (3 layers)                  0.188       1.087
4 Conclusions
5 Acknowledgement
The authors gratefully acknowledge the financial support of the Korea Science &
Engineering Foundation. This work was supported by grant No. R01-2005-000-
11044-0 from the Basic Research Program of the Korea Science & Engineering
Foundation. The authors are also very grateful to the anonymous reviewers
for their valuable comments.
APPENDIX
SELF-ORGANIZING POLYNOMIAL NEURAL NETWORK [13]
Fig. 13. Overall architecture of the SOPNN: the inputs x_{1i}, ..., x_{Ni} feed the PDs of the 1st layer; in each j-th layer, every PD takes outputs z^{j−1} of the previous layer as its inputs, generating N!/{(N−r)!r!} candidate nodes per layer, and the final output is ŷ_i
E_k = (1/n_tr) Σ_{i=1}^{n_tr} (y_i − z_{ki})²,   k = 1, 2, ..., N!/(r!(N−r)!)    (7)
where z_{ki} denotes the output of the k-th node with respect to the i-th data point. This step is repeated for all the nodes in the current layer and, subsequently, for all layers of the SOPNN, starting from the input layer and proceeding to the output layer.
Step 6: Select PDs with good predictive capability
The predictive capability of each PD is evaluated by the performance index using the testing data set. Then, we choose w PDs among the N!/(r!(N−r)!) PDs in the order of the best predictive capability (the lowest value of the performance index). Here, w is the pre-defined number of PDs that must be preserved to the next layer. The outputs of the chosen PDs serve as inputs to the next layer.
Step 7: Check the stopping criterion
The SOPNN algorithm terminates when the number of layers predetermined
by the designer is reached.
Step 8: Determine new input variables for the next layer
If the stopping criterion is not satisfied, the next layer is constructed by
repeating step 4 through step 8.
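The combinatorial core of steps 4–6 can be sketched as follows (an illustrative Python sketch; the (pd, error) pairs stand in for fitted PDs and their testing-set performance indices):

```python
from itertools import combinations
from math import factorial

def n_pds(n, r):
    """N!/((N-r)! r!): the number of PDs generated in one layer."""
    return factorial(n) // (factorial(n - r) * factorial(r))

def candidate_pds(n_inputs, r):
    """Step 4: one PD for every combination of r inputs out of N candidates."""
    return list(combinations(range(n_inputs), r))

def preserve(pds_with_errors, w):
    """Step 6: keep the w PDs with the lowest testing-set performance index."""
    return sorted(pds_with_errors, key=lambda pe: pe[1])[:w]

print(n_pds(6, 2))  # 15 candidate PDs when N = 6 inputs are taken r = 2 at a time
```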
The overall architecture of the SOPNN is shown in Fig. 13.
References
1. Hornik K, Stinchcombe M, White H (1989) Multilayer feedforward networks are
universal approximators. Neural Netw 2:359–366
24. Kondo T (1986) Revised GMDH algorithm estimating degree of the complete
polynomial. Trans Soc Instrum Control Eng 22:928–934
25. Wu S, Er MJ, Gao Y (2001) A fast approach for automatic generation of fuzzy
rules by generalized dynamic fuzzy neural networks. IEEE Trans Fuzzy Syst
9:578–594
26. Horikawa SI, Furuhashi T, Uchikawa Y (1992) On fuzzy modeling using fuzzy
neural networks with the back-propagation algorithm. IEEE Trans Neural Netw
3:801–806
Recursive Pattern based Hybrid Supervised
Training
1 Introduction
In this chapter we study the Recursive Pattern based Hybrid Supervised (RPHS) learning system, a hybrid evolutionary-learning approach to task decomposition. Typically, the RPHS system consists of a pattern distributor and several neural networks, or sub-networks. The input signal propagates through the system in a forward direction to the pattern distributor, which chooses the best sub-network to solve the problem; that sub-network then outputs the corresponding solution.
RPHS has been applied successfully to some difficult and diverse classification problems using a recursive hybrid training approach. The algorithm is based on the concept of pseudo-global optima, which in turn builds on the idea of local optima and the gradient descent rule.
K. Ramanathan and S.U. Guan: Recursive Pattern based Hybrid Supervised Training, Studies
in Computational Intelligence (SCI) 82, 129–156 (2008)
www.springerlink.com © Springer-Verlag Berlin Heidelberg 2008
Basically, RPHS learning involves four steps: evolutionary learning [19], decomposition, gradient descent [26], and integration. During evolutionary learning, a set of T patterns is fed to a population of chromosomes to decide the structure and weights of a preliminary pseudo-global optimal solution (also known as a preliminary subnetwork). Decomposition identifies the learnt and unlearnt patterns of the preliminary subnetwork. A gradient descent approach then optimizes the preliminary subnetwork using the learnt patterns. This process produces a pseudo-global optimal solution, or final subnetwork. The whole process is then repeated recursively with the unlearnt patterns. Integration works by creating a pattern distributor, a classification system which helps associate a given pattern with a corresponding subnetwork.
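One way to sketch this four-step recursion in code is shown below. This is a hypothetical outline, not the authors' implementation: fit and refine stand in for the evolutionary and gradient descent stages, and XI is an assumed tolerance value.

```python
XI = 0.1  # error tolerance for "learnt" patterns (an assumed value)

def rphs_train(patterns, fit, refine, min_size=2, nets=None):
    """Recursive decomposition: evolve a preliminary subnetwork, refine it
    on the learnt patterns, then recurse on the unlearnt remainder (the
    constructive-backpropagation last recursion is folded into refine)."""
    if nets is None:
        nets = []
    net = fit(patterns)                      # evolutionary learning stage
    if len(patterns) <= min_size:            # recursion K: learn what is left
        nets.append(refine(net, patterns))
        return nets
    learnt = [(x, y) for x, y in patterns if abs(net(x) - y) <= XI]
    unlearnt = [(x, y) for x, y in patterns if abs(net(x) - y) > XI]
    if not learnt or not unlearnt:           # nothing left to split
        nets.append(refine(net, patterns))
        return nets
    nets.append(refine(net, learnt))         # gradient descent on learnt set
    return rphs_train(unlearnt, fit, refine, min_size, nets)

# Toy stand-ins: "fit" predicts the first pattern's output, "refine" is identity.
fit = lambda pts: (lambda x, c=pts[0][1]: c)
refine = lambda net, pts: net
nets = rphs_train([(0, 1.0), (1, 1.02), (2, 5.0), (3, 7.0)], fit, refine)
print(len(nets))  # 2 subnetworks: one per discovered pattern subset
```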
The RPHS system has two distinctive characteristics:
1. Evolutionary algorithms are used to find the vicinity of a pseudo-global optimum, and gradient descent is used to find the actual optimum. The evolutionary algorithms therefore aim to span a larger area of the search space and to improve the performance of the gradient descent algorithm.
2. Two validation algorithms are used in training the system, to identify
the correct stopping point for the gradient descent [26] and to identify
when to stop the recursive decomposition. These two validation algorithms
ensure that the RPHS system is completely adapted to the topology of
the training and validation data. It is through a combination of these
characteristics, together with the ability to unlearn old information and
adapt to new information, that the RPHS system derives its high accuracy.
1.1 Motivation
2 Some preliminaries
2.1 Notation
m : Input dimension
n : Output dimension
K : Number of RPHS recursions
I : Input
O : Output
Tr : Training
V al : Validation
P : Ensemble of subsets
S : Neural network solution
E : Error
NH : Number of hidden nodes in the 3-layered perceptron
Further, let I_v = {I_1, I_2, ..., I_{Tv}} and O_v = {O_1, O_2, ..., O_{Tv}} represent the input and output patterns of a set of validation data, such that Val = {I_v, O_v}. We wish to take Tr as training patterns to the system and Val as the validation data and come up with an ensemble of K subsets. Let P represent this ensemble of K subsets:

P = {P^1, P^2, ..., P^K}

where, for i ∈ K,

P^i = {Tr^i, Val^i},  Tr^i = {I^i_tr, O^i_tr},  Val^i = {I^i_v, O^i_v}

Here, I^i_tr and O^i_tr are m × T^i and n × T^i matrices respectively, and I^i_v and O^i_v are m × T^i_v and n × T^i_v matrices respectively, such that Σ_{i=1}^{K} T^i = T and Σ_{i=1}^{K} T^i_v = T_v.
i=1 i=1
We need to find a set of neural networks S = {S^1, S^2, ..., S^K}, where S^1 solves P^1, S^2 solves P^2, and so on.
P should fulfill two conditions:
1. The individual subsets can be trained with a small mean square training error, i.e., E^i_tr = O^i_tr − S^i(I^i_tr) → 0.
2. None of the subsets Tr^i to Tr^K are overtrained, i.e., Σ_{i=1}^{j} E^i_val < Σ_{i=1}^{j+1} E^i_val;  j, j + 1 ∈ K.
The first property implies that each of the neural networks is a global optimum with respect to its training subset. The second property implies two things: firstly, each individual network should not be overtrained, and secondly, none of the decompositions should be detrimental to the system.
NP = mNH + NH n + n + NH (1)
Each of these free parameters is one element of the chromosome and represents
one of the weights or the biases in the network.
According to equation 1, a chromosome is therefore defined by the value of
NP which in turn depends on the value of NH . Initialization of the population
is done by generating a random number of hidden nodes for each individual
and a chromosome based on this number.
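Equation (1) and this initialization step can be sketched as follows (illustrative Python; the weight range [-1, 1] and the cap on hidden nodes are assumptions):

```python
import random

def n_free_params(m, n_h, n):
    """N_P = m*N_H + N_H*n + n + N_H (equation 1): input-to-hidden weights,
    hidden-to-output weights, output biases, and hidden biases."""
    return m * n_h + n_h * n + n + n_h

def random_individual(m, n, max_hidden=10):
    """Draw a random hidden-node count, then a chromosome of N_P real genes."""
    n_h = random.randint(1, max_hidden)
    return n_h, [random.uniform(-1.0, 1.0) for _ in range(n_free_params(m, n_h, n))]

print(n_free_params(4, 5, 2))  # 4*5 + 5*2 + 2 + 5 = 37 free parameters
```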
The performance of the RPHS algorithm can be attributed to the fact that
RPHS aims to find several pseudo-global solutions as opposed to a single
global solution. We define a pseudo-global optimal solution as follows:
Definition 1. A pseudo-global optimum is a global optimum when viewed from the perspective of a subset of training patterns, but could be a local (or global) optimum when viewed from the perspective of all the training patterns.
In this section, we highlight the differences among the multisieving algorithm [24], single-staged GANN training algorithms [11, 31], and the RPHS algorithm, based on the concept and a simplified model of pseudo-global optima. Consider the use of the RPHS algorithm to model a function S^i such that O^i_tr = S^i(w, I^i_tr), where w is the weight vector of values to be optimized. The training error at any point of time is the mean square error over the training patterns.
We know that at any given point, the training error can be split into the error of the T^i learnt patterns and the error of the (T − T^i) unlearnt patterns.
Definition 2. A pattern is considered learnt if its output differs from the ideal output by no more than an error tolerance ξ. The total number of patterns learnt is therefore
T^i = Σ_{i=0}^{T} δ[ (1/O) Σ_{j=0}^{O} φ( ξ − |O_{i,j} − Ô_{i,j}| ) − 1 ]    (3)
where δ(·) is the unit impulse function, φ(·) is the unit step function, and Ô_{i,j} is the network approximation to O_{i,j}.
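A direct transcription of equation (3) reads as follows (illustrative Python; xi denotes the tolerance ξ):

```python
def step(v):     # phi(.): unit step function
    return 1.0 if v >= 0 else 0.0

def impulse(v):  # delta(.): unit impulse function
    return 1.0 if v == 0 else 0.0

def n_learnt(targets, outputs, xi):
    """T^i of equation (3): a pattern counts as learnt only when every one of
    its O output components is within xi of the target, so that the inner
    average equals 1 and the impulse fires."""
    count = 0.0
    for t_row, o_row in zip(targets, outputs):
        inner = sum(step(xi - abs(t - o)) for t, o in zip(t_row, o_row))
        count += impulse(inner / len(t_row) - 1.0)
    return count

print(n_learnt([[1.0, 2.0], [0.0, 0.0]], [[1.05, 2.0], [0.5, 0.0]], 0.1))  # 1.0
```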
By the definition of learnt and unlearnt patterns, the error of each learnt pattern is at most ξ, while that of each unlearnt pattern exceeds ξ. Also, as we approach the optimal points, E^i_tr → 0. Consider that, at the end of evolutionary training, all the learnt patterns have an error less than the error tolerance ξ, i.e.,

E_tr = E^i_tr + E_unlearnt < ξT^i + E_unlearnt    (4)
Fig. 2. Illustration of solutions found by (i) RPHS (S_RPHS), (ii) single-staged hybrid search (S_ss), (iii) the multisieving algorithm (S_m)
The gradient descent step L_k is carried out with T^i patterns, and the step G_{k+1} with T − Σ_{i=1}^{k} T^i patterns. The value of E_unlearnt is therefore a constant during L_k, i.e., for any given gradient descent epoch,

E_tr = E^i_tr + C    (5)
Figure 2 illustrates how the RPHS algorithm, a single-staged training algorithm, and the multisieving algorithm [24] find their solutions. The graph shows a hypothetical one-dimensional error surface to be optimized with respect to w. Assume that at the end of the evolutionary training phase, solution S_g has an error value E_g computed according to equation 3. A single-staged algorithm such as backpropagation or a GA classifier will either try to search for an optimum at this stage or, if the probability of finding the optimum is too small, climb the hill and reach the local optimum marked by S_ss.
However, by virtue of equation 5, E_unlearnt is a constant value C. The error curve (represented by the dotted line) is just a vertically translated copy of the part of the original curve that is of interest to us.
Now, if we consider the multisieving algorithm [24] or the topology-based selection algorithm [22], the data splitting occurs with learnt patterns being classified as those with |O_j − Ô_j| < ξ. The splitting of the data therefore depends on the error tolerance of learnt patterns ξ, as defined in equation 3. With respect to figure 2, we can think of this algorithm as finding the solution S_g using a gradient descent algorithm. The final solution S_m is a vertically translated S_g. However,
S_m can only be equal to the translated local optimum if the error tolerance ξ is set to an optimum value. This is because the solution of the multisieving algorithm is considered found when the pattern is learnt to within the error tolerance ξ. On the other hand, we can see from figure 2 that the translated local optimum due to the splitting of patterns is more optimal than the other optimal solutions, i.e.,

E_TLO ≤ E_global    (6)

This is by virtue of the fact that E^i_tr → 0 as we approach the optimal point. Further, from equation 5, ∂E_tr/∂w → 0 as ∂E^i_tr/∂w → 0. Therefore, the solution found by the RPHS algorithm is a pseudo-global optimum, i.e., it could be a local optimum but it appears global from the perspective of a pattern subset.
In contrast to the multisieving algorithm, the RPHS solution adapts itself to the problem topology regardless of the error tolerance ξ, owing to the gradient descent at the end of each recursion. Finding a pseudo-global optimum therefore reduces the dependence of the algorithm on the error tolerance of learnt patterns ξ. It is also the natural optimum for the data subset. Note: since early stopping is implemented during backpropagation to prevent overtraining, the optimum found by RPHS may not necessarily be S_RPHS, but in the vicinity of S_RPHS.
3. In this stage, we use a condition similar to that in [22] and the multisieving network [24] to identify learnt patterns, i.e., a pattern is considered learnt if |O_j − Ô_j| < ξ. More formally, we can define the percentage of total patterns learnt as in equation 3. Note that, similar to the multisieving algorithm, a tolerance ξ is used to identify learnt patterns; the arbitrarily set value of ξ for RPHS does not affect the performance of the algorithm, as explained in section 2.
4. The dataset is now split into learnt and unlearnt patterns. With the
unlearnt patterns, we repeat steps 1 to 3.
5. Since the learnt patterns are only learnt up to a tolerance ξ, we use
gradient descent to train the learnt patterns. The aim of gradient descent
is to best adapt the solution to the data topology. Backpropagation is
used in all the recursions except the last one for which constructive
backpropagation is used. The optimum thus found is called the pseudo-
global optimal solution, and is found using a validation set of data to
prevent over training and to overcome the dependence of the algorithm
on ξ.
As the number of patterns in a data subset is small, especially as the number of recursions increases, it is possible for the pseudo-global optimal solution to overfit the data in the subset. To avoid this possibility, we use a validation dataset. The validation dataset is used along with the training data to detect generalization loss using an algorithm in [10].
The data decomposition technique of the RPHS algorithm can be best
described by figure 3. During the first recursion, the entire training set (size
T ) is learnt using evolutionary training until stagnation occurs. Only the learnt
patterns are learnt further using backpropagation, with measures to prevent
overtraining. This ensures the finding of a pseudo global optimal solution.
Legend:
BP: Backpropagation
CBP: Constructive backpropagation
GANN: genetic algorithm evolved neural nets
S^i: Solution corresponding to the dataset Tr^i
Fig. 3. Recursive data decomposition employed by RPHS
The second recursion repeats the same procedure with the unlearnt patterns.
The process repeats until the total number of patterns in a given recursion
(Recursion K) is too small, in which case, constructive backpropagation is
applied to the whole dataset to learn the remaining patterns to the best
possible extent.
3.2 Testing
Figure 5 presents the pseudo code for training the RPHS system. Train is
initially called with i = 1 and T r and V al as the whole training and validation
set. In addition to the algorithm described in the previous section, Figure 5
also introduces the two validation procedures used in terminating the system
(indicated in bold). Step 2 uses the validation subset while training the
patterns in a given recursion to ensure that the neural network does not
over-represent the patterns in question. Step 4 uses the validation procedure
to determine when the recursions should be stopped and whether a subsequent
decomposition is detrimental to the RPHS system.
Fig. 6. The two spiral data set and an example of how it can be decomposed into
several smaller datasets that are more easily separable
6.2 Separability
Separability criterion
In order to increase the inter-recursion separation, we modify the fitness
function for GANNs as given by the equation below, where f(x) denotes the
original fitness function and DB_att(x) the separation term:

g(x) = w_1 f(x) − (1 − w_1) DB_att(x)    (8)
The total time required for RPHS with MGG can also be expressed as a
summation of the time taken in each recursion i, t_RPHS = t_PD + Σ_{i=1}^{R} t_i.
The time taken for each recursion is given below.
where λ_g^i and λ_l^i refer to the number of epochs required for evolutionary
and backpropagation training in recursion i, and α_i is the number of learnt
patterns at the end of the recursion. The last term refers to the initialization
of the recursion population with N_pop chromosomes (as explained in figure 5).
The bulk of the time in the equation above depends on the third term,
i.e., the initial evaluation of the chromosome population in each recursion.
The justification of the above claim follows from the properties of RPHS and
evolutionary search:
1. As the patterns in each backpropagation epoch are already learnt, fewer
epochs are required than for the training of modules in output parallelism,
i.e., λ_l^{i,RPHS} < λ^{i,OP}.
2. From the experimental results observed, and due to the capability of
genetic algorithms to find partial solutions faster, we can also say that
λ_g^{i,RPHS} is small. In the experiments carried out, the value of λ_g^i is
usually less than 20 epochs.
The location of the pseudo-global optimal solution found by the GA is relatively
unimportant, as the pseudo-global optimum is always globally optimal in terms
of the patterns selected. This implies that with a small population size, the
RPHS algorithm is likely to be faster than the output parallelism algorithm.
In order to observe the effect of the number of chromosomes Npop on
the training time and the generalization accuracy of the RPHS system, we
performed a set of experiments using the MGG based RPHS algorithm with
a varying number of chromosomes. The graphs in figure 7 show the trend in
training time and generalization accuracy for initial population sizes between
5 and 30. The population size of 5 chromosomes was chosen so that MGG
can be implemented with 4 chromosomes for mating while still retaining the best
fitness values.
It is interesting to note that the number of chromosomes Npop in the initial
population of each recursion does not play a big role in the generalization
accuracy of the system. This is, once again, an expected property of the RPHS
algorithm as it is the backpropagation algorithm that completes the training
of the system according to the validation data. The part played by the genetic
algorithm is only partial training and it is the presence of the local optima, not
its relative position that is important for the RPHS algorithm. Therefore, if
training time is an issue, using the minimal requirement of 5 chromosomes and
implementing MGG can solve the problem to certain accuracy comparable to
that using a larger population.
Fig. 7. The effect of using different sized initial populations for RPHS with the
(a) SEGMENTATION, (b) VOWEL, (c) SPAM datasets. Graphs show the training
time and generalization error against the number of chromosomes in the
initial population.
Therefore, the most efficient training time for the RPHS algorithm will be
as given by equation 10, which is based on equation 9,
t_RPHS = t_PD + Σ_{i=1}^{R} t_i = t_PD + Σ_{i=1}^{R} [K_g^i T_i t + K_l^i (T_i − α_i) t + 5 T_i t]    (10)
For optimal training, it is necessary to use suitable validation data for each
decomposed training sets. In this section we propose and justify the algorithm
for choosing the optimal validation data for each subset of training data.
Consider the distribution of data shown in figure 8. Each colored zone
represents data from a different recursion. The patterns learnt by solution i
are explicitly exclusive of the patterns learnt by solution j, ∀ i ≠ j. The RPHS
decomposition tree in figure 3 can therefore be expressed as shown in figure 9.
According to figure 9 and the RPHS training algorithm described in section 4,
the first recursion begins with Tr, the data to be trained using EAs. At the end
of the recursion, Tr is split into Tr_1 (data to be trained with backpropagation
to give S^1, the network representing the data Tr_1) and (Tr − Tr_1) (data to
be trained using EAs to give solutions 2 to n). We represent all the networks
Given a set of patterns, FindV_i() finds out which patterns can possibly be
solved by the solutions that exist. Patterns that can be solved are isolated and
used as specific validation sets. Besides a more accurate validation dataset,
it is also possible to obtain the intermediate generalization capability of the
system, which is useful in stopping recursions, as described in the next section.
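A minimal sketch of this idea follows, with `solves(sol, p)` as a hypothetical predicate for "solution sol classifies pattern p correctly" (the names are illustrative, not from the original):

```python
def find_validation_subsets(solutions, val_patterns, solves):
    """For each existing solution, isolate the validation patterns it can
    already solve; these become that recursion's specific validation set.
    Whatever remains validates the later recursions."""
    subsets = []
    remaining = list(val_patterns)
    for sol in solutions:
        mine = [p for p in remaining if solves(sol, p)]
        remaining = [p for p in remaining if not solves(sol, p)]
        subsets.append(mine)
    return subsets, remaining
```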
² Both E_val^i and E_tr^i represent the percentage of validation and training
patterns in error of the RPHS system with i recursions.
The graphs in figure 11 below compare the mean squared training error for the
various datasets. The typical training curves for constructive backpropagation
[23] and for RPHS training are shown.
From Figure 11, we can observe that the training error obtained by RPHS
is lower than the training error that is obtained using CBP [23]. The empirical
comparison shows that, typically, the RPHS training curve converges to a
better optimum than the CBP curve. At this stage, a note should be made on
the shape of the RPHS curve: the MSE value increases at the end of each
recursion before reducing further, because the RPHS algorithm reinitializes
the population at the end of each recursion.
The reinitialization of the population at the end of each recursion benefits
the solution set reached. The new population is now focused on learning the
“unlearnt” patterns, thereby enabling the search to exit from the pseudo global
(local) optima of the previous recursion.
Tables 3 to 7 summarize the training time and generalization accuracy
obtained by CBP [23], Output Parallelism with [15] and without the pattern
distributor [10] and RPHS. The Output Parallelism algorithm is chosen so as
to illustrate the difference between manual choice of subsets and evolutionary
search based choice of subsets.
Table 3. Summary of the results obtained from the Segmentation problem (average
number of recursions: 7.2)
Table 4. Summary of the results obtained from the Vowel problem (average number
of recursions: 6.425)
Table 5. Summary of the results obtained from the Letter recognition problem
(average number of recursions: 21.3)
Table 6. Summary of the results obtained from the Spam problem (average number
of recursions: 2.475)
Table 7. Summary of the results obtained from the Two-spiral problem (average
number of recursions: 2.475)³
To show the effect of the heuristics proposed, the RPHS training is carried
out with four options:
1. Genetic Algorithms with no decomposition of validation patterns
(RPHS-GAND)
2. Genetic Algorithms with decomposition of validation patterns (RPHS-
GAD)
3. MGG with no decomposition of validation patterns (RPHS-MGGND)
4. MGG with decomposition of validation patterns (RPHS-MGGD)
The graphs in figure 11 compare CBP with the 4th training option (RPHS-
MGGD).
Based on the results presented, we can make the following observations
and classify them according to training time and generalization accuracy.
³ Results of the topology based subset selection algorithm [22]
Generalization accuracy
• All the RPHS algorithms give better generalization accuracy when com-
pared to the traditional algorithms (CBP, Topology based selection and
multisieving).
• The algorithms which include the decomposition of validation data, although
marginally slower than those without decomposition, have better
generalization accuracy than output parallelism. As the algorithms
implementing output parallelism do so with a manual decomposition of
validation data, it follows that a version of RPHS will be more accurate than a
corresponding algorithm based on output parallelism.
• Implementing RPHS with the separation criterion gives the best general-
ization accuracy although there is a large tradeoff in time. This is discussed
in greater detail in the following section.
• The RPHS algorithm that uses MGG with the decomposition of valida-
tion patterns (MGGD) provides the best tradeoff between training time
and generalization accuracy. When compared to RPHS-GAD and RPHS
with separation, the tradeoff in generalization accuracy is minimal when
compared to the reduction in training time.
• The number of recursions required by RPHS, on average, is lower than the
number of classes in a problem and gives better generalization accuracy.
This suggests that classwise decomposition of data is not always optimal.
Training time
• The training time required by CBP is the shortest of all the algorithms.
However, as seen from the graphs, this short training time is most likely
due to premature convergence of the CBP algorithm.
• The training of RPHS is also more gradual. While premature convergence
is easily observed in the case of CBP, RPHS converges more slowly. The
recursions reduce the training error in small steps, learning a small number
of patterns at a time.
• Apart from the CBP algorithm, the RPHS algorithm carried out with
MGG has shorter training time than the output parallelism algorithms.
The training time of the multisieving algorithm is larger or smaller than
that of the RPHS-MGG based algorithms depending on the dataset. This is
expected, as the nature of the dataset determines the number of levels with
which multisieving has to be implemented and therefore influences the training time.
• The basic contribution of the minimal generation gap genetic algorithms is
the reduction of training time. However, there is a small tradeoff in
generalization accuracy when MGGs are used. This can be observed across all
the problems.
• The use of the separation criterion with the RPHS algorithm increases the
training time by several fold. This is expected as the training time includes
the calculation of the inverse covariance matrix. This is the tradeoff for
8 Conclusions
References
1. Andreou AS, Efstratios F, Spiridon G, Likothanassis D (2002) Exchange rates
forecasting: a hybrid algorithm based on genetically optimized adaptive neural
networks. Comput Econ 20(3):191–200
2. Breiman L (1984) Classification and regression trees. Wadsworth International
Group, Belmont, California
3. Chiang CC, Fu HH (1994) Divide-and-conquer methodology for modular su-
pervised neural network design. In: Proceedings of the 1994 IEEE international
conference on neural networks 1:119–124
4. Dokur Z (2002) Segmentation of MR and CT images using hybrid neural network
trained by genetic algorithms. Neural Process Lett 16(3):211–225
5. Foody GM (1998) Issues in training set selection and refinement for classification
by a feedforward neural network. Geoscience and remote sensing symposium
proceeding:401–411
6. Fukunaga K (1990) Introduction to statistical pattern recognition, Academic,
Boston
7. Goldberg DE, Deb K, Korb B (1991) Don’t worry, be messy. In: Belew R, Booker
L (eds.) Proceedings of the fourth international conference in genetic algorithms
and their applications, pp 24–30
8. Gong DX, Ruan XG, Qiao JF (2004) A neuro computing model for real
coded genetic algorithm with the minimal generation gap. Neural Comput Appl
13:221–228
9. Guan SU, Liu J (2002) Incremental ordered neural network training. J Intell
Syst 12(3):137–172
10. Guan SU, Li S (2002) Parallel growing and training of neural networks using
output parallelism. IEEE Trans Neural Netw 13(3):542–550
11. Guan SU, Ramanathan K (2007) Percentage-based hybrid pattern training with
neural network specific cross over, Journal of Intelligent Systems 16(1):1–26
12. Guan SU, Li P (2002) A hierarchical incremental learning approach to task
decomposition. J Intell Syst 12(3):201–226
13. Guan SU, Li S, Liu J (2002) Incremental learning with an increasing input
dimension. J Inst Eng Singapore 42(4):33–38
14. Guan SU, Liu J (2004) Incremental neural network training with an increasing
input dimension. J Intell Syst 13(1):43–69
15. Guan SU, Neo TN, Bao C (2004) Task decomposition using pattern distributor.
J Intell Syst 13(2):123–150
16. Guan SU, Zhu F (2004) Class decomposition for GA-based classifier agents – a
pitt approach. IEEE Trans Syst Man Cybern, Part B: Cybern 34(1):381–392
17. Guan SU, Zhu F (2005) An incremental approach to genetic algorithms based
classification. IEEE Trans Syst Man Cybern Part B 35(2):227–239
18. Haykins S (1999) Neural networks, a comprehensive foundation, Prentice Hall,
Englewood Cliffs, NJ
19. Holland JH (1973) Genetic algorithms and the optimal allocation of trials. SIAM
J Comput 2(2):88–105
20. Kim SP, Sanchez JC, Erdogmus D, Rao YN, Wessberg J, Principe J, Nicolelis M
(2003) Divide and conquer approach for brain machine interfaces: non linear
mixture of competitive linear models, Neural Netw 16(5–6):865–871
21. Lang KJ, Witbrock MJ (1988) Learning to tell two spirals apart. In:
Touretzky D, Hinton G, Sejnowski T (eds.) Proceedings of the 1988 connec-
tionist models summer School, Morgan Kaufmann, San Mateo, CA
22. Lasarzyck CWG, Dittrich P, Banzhaf W (2004) Dynamic subset selection based
on a fitness case topology. Evol Comput, 12(4):223–242
23. Lehtokangas M (1999) Modelling with constructive backpropagation. Neural
Netw 12:707–714
24. Lu BL, Ito K, Kita H, Nishikawa Y (1995) Parallel and modular multi-sieving
neural network architecture for constructive learning. In: Proceedings of the 4th
international conference on artificial neural networks 409:92–97
25. Quinlan JR (1986) Induction of decision trees. Mach Learn 1:81–106
26. Rumelhart D, Hinton G, Williams R (1986) Learning internal representations
by error propagation. In Rumelhart D, McClelland J (eds.) Parallel distributed
processing, 1: Foundations. MIT, Cambridge, MA
27. Satoh H, Yamamura M, Kobayashi S (1996) Minimal generation gap model for
GAs considering both exploration and exploitation. In: Proceedings of 4th Int
Conference on Soft Computing, Iizuka:494–497
28. The UCI Machine Learning repository:
http://www.ics.uci.edu/mlearn/MLRepository.html
29. Wong MA, Lane T (1983) A kth nearest neighbor clustering procedure. JR Stat
Soc (B) 45(3):362–368
30. Yao X (1993) A review of evolutionary artificial neural networks. Int J Intell
Syst 8(4):539–567
31. Yasunaga M, Yoshida E, Yoshihara I (1999) Parallel backpropagation using
genetic algorithm: real-time BP learning on the massively parallel computer
CP-PACS. In: International Joint Conference on Neural Networks 6:4175–4180
32. Zhang BT, Cho DY (1998) Genetic programming with active data selection.
Simulated Evol Learn 1485:146–153
Enhancing Recursive Supervised Learning
Using Clustering and Combinatorial
Optimization (RSL-CC)
Summary. The use of a team of weak learners to learn a dataset has been shown
to be better than the use of a single strong learner. In fact, the idea is so successful
that boosting, an algorithm combining several weak learners for supervised learn-
ing, has been considered one of the best off-the-shelf classifiers. However, some
problems still remain, including determining the optimal number of weak learners
and the overfitting of data. In an earlier work, we developed the RPHP algorithm
which solves both these problems by using a combination of genetic algorithm, weak
learner and pattern distributor. In this paper, we revise the global search component
by replacing it with a cluster-based combinatorial optimization. Patterns are clus-
tered according to the output space of the problem, i.e., natural clusters are formed
based on patterns belonging to each class. A combinatorial optimization problem
is therefore formed, which is solved using evolutionary algorithms. The evolution-
ary algorithms identify the “easy” and the “difficult” clusters in the system. The
removal of the easy patterns then gives way to the focused learning of the more com-
plicated patterns. The problem therefore becomes recursively simpler. Overfitting is
overcome by using a set of validation patterns along with a pattern distributor. An
algorithm is also proposed to use the pattern distributor to determine the optimal
number of recursions and hence the optimal number of weak learners for the prob-
lem. Empirical studies show generally good performance when compared to other
state-of-the-art methods.
1 Introduction
Recursive supervised learners are a combination of weak learners, data de-
composition and integration. Instead of learning the whole dataset, different
learners (neural networks, for instance) are used to learn different subsets
of the data, resulting in several sub solutions (or sub-networks). These sub-
networks are then integrated together to form the final solution to the system.
Figure 1 shows the general architecture of such learners.
K. Ramanathan and S.U. Guan: Enhancing Recursive Supervised Learning Using Clustering
and Combinatorial Optimization (RSL-CC), Studies in Computational Intelligence (SCI) 82,
157–176 (2008)
www.springerlink.com c Springer-Verlag Berlin Heidelberg 2008
In the design of recursive learners, several factors come into play in deter-
mining the generalization accuracy of the system. Factors include
1. The accuracy of the sub networks
2. The accuracy of the integrator
3. The number of subnetworks
The choice of subsets for training each of the subnetworks plays an impor-
tant part in determining the effect of all these factors. Various methods have
been used for subset selection in literature. Several works, including topology
based selection [17], and recursive pattern based training [6], use evolutionary
algorithms to choose subsets of data.
Algorithms such as active data selection [23], boosting [21] and multi-
sieving [19] implement this subset choice using neural networks trained with
various training methods. Multiple weak learners have also been used and the
best weak learner given the responsibility of training and solving a subset [9].
Other algorithms make use of more brute force methods such as the use of
the Mahalanobhis distance [3]. Even other algorithms such as the output
parallelism algorithm manually decompose their tasks [5], [7], [10]. Clustering
has also been used to decompose datasets in some cases [2].
The common method used in these algorithms (except the manual de-
composition), is to allow a network to learn some patterns and declare these
patterns as learnt. The system then creates other networks to deal with the
unlearnt patterns. The process is done recursively and with successively de-
creasing subset size, allowing the system to concentrate more and more on
the difficult patterns.
While all these methods work relatively well, the hitch lies in the fact that,
with the exception of manual partitioning, most of the techniques above use
some kind of intelligent learner to split the data. While intelligent learners
and algorithms such as neural networks [12], genetic algorithms [13] and such
are effective algorithms, they are usually considered as black boxes [1], with
little known about the structure of the underlying solution.
In this chapter, our aim is to reduce, by a certain degree, this black box
nature of recursive data decomposition and training. As in previous works,
genetic algorithms are used to select subsets; unlike previous works, however,
they are used not to select individual patterns for a subset but to select
clusters of patterns. By using this approach, we hope to group patterns into
subsets and derive a more optimal partitioning of data.
We also aim to gain a better understanding of optimal data partitioning and
the features of well partitioned data.
The system proposed consists of a pre-trainer and a trainer. The pre-
trainer is made up of a clusterer and a pattern distributor. The clusterer
splits the dataset into clusters of patterns using agglomerative hierarchical
clustering. The pattern distributor assigns validation patterns to each of
these clusters. The trainer then solves a combinatorial optimization problem,
choosing the clusters that can be learnt with best training and validation accu-
racy. These clusters now form the easy patterns which are then learnt using a
gradient descent with the constructive backpropagation algorithm [18] to create
the first subnetwork (a three-layered perceptron). The remaining clusters
form the difficult patterns. The trainer now focuses attention on the diffi-
cult patterns, thereby recursively isolating and learning increasingly difficult
patterns and creating several corresponding subnetworks.
The use of genetic algorithms in selecting clusters is expected to be more
efficient than their use in the selection of patterns for two reasons
1. The number of combinations is now C(n, k) as opposed to C(T, L), where
the number of available clusters n is less than the number of training
patterns T. Similarly, the number of clusters chosen k is smaller than the
number of training patterns chosen L. The solution space is now smaller,
therefore increasing the probability of finding a better solution.
2. The distribution of validation information is performed during pre-
training, as opposed to during the training time. Validation pattern
distribution is therefore a one-off process, thereby saving training time.
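The size reduction in point 1 is easy to check numerically; the figures below are illustrative, not taken from the experiments:

```python
from math import comb

# Illustrative sizes only: T patterns versus n clusters.
T, L = 1000, 500   # pattern-level selection: C(T, L) candidate subsets
n, k = 20, 10      # cluster-level selection: C(n, k) candidate subsets

pattern_space = comb(T, L)   # astronomically large
cluster_space = comb(n, k)   # 184756 -- many orders of magnitude smaller
```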
The rest of the paper is organized as follows. We begin with some pre-
liminaries and related work in section 2. In section 3, we present a detailed
description of the proposed Recursive supervised learning with clustering and
combinatorial optimization (RSL-CC) algorithm. To increase the practical
significance of the paper, we include pseudo code, remarks and parame-
ter considerations where appropriate. More details and specifications of the
algorithm are then presented in section 4. In section 5, we present some
heuristics and practical guidelines for making the algorithm perform better.
Section 5.4 presents the results of the RSL-CC algorithm on some benchmark
pattern recognition problems, comparing them with other recursive hybrid
learning algorithms.
2 Some preliminaries
2.1 Notation
m : Input dimension
n : Output dimension
K : Number of recursions
I : Input
O : Output
Tr : Training
V al : Validation
P : Ensemble of subsets
S : Neural network solution
E : Error
T : Number of training patterns
r : Recursion index
Nchrom : Number of chromosomes
Nc : Number of clusters
Here, I_i^tr and O_i^tr are m × T_i and n × T_i matrices respectively, and
I_i^v and O_i^v are m × T_v^i and n × T_v^i matrices respectively, such that
Σ_{i=1}^{K} T_i = T and Σ_{i=1}^{K} T_v^i = T_v. We need to find a set of
neural networks S = {S^1, S^2, ..., S^K}, where S^1 solves P_1, S^2 solves P_2,
and so on.
Manual decomposition
Output parallelism was developed by Guan and Li [5]. The idea involves
splitting an n-class problem into n two-class problems, the rationale being
that a two-class problem is easier to solve than an n-class problem and hence
more efficient. Each of the n sub-problems consists of two outputs, class i
and its complement (not class i). Guan et al. [7] later added an integrator in
the form of a pattern distributor to the system. The output parallelism
algorithm essentially develops a set of n sub-networks, each catering to a
two-class problem, and integrates them using a pattern distributor.
While the algorithm is shown to be effective in terms of both training time
and generalization accuracy, a major drawback of the algorithm is its class
based manual decomposition. In fact, research carried out in [7] shows
empirically that the 2-class decomposition is not necessarily the optimum
decomposition for a problem. This optimum decomposition is a problem de-
pendent value. Some problems are better solved when decomposed into three
class sub problems, others when decomposed into sub-problems with a vari-
able number of classes. While automatic algorithms have been developed to
overcome this problem of manual decomposition [8], the net result is an algo-
rithm which is computationally expensive. The other concern associated with
the output parallelism is that it can only be applied to classification problems.
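The class-based decomposition of output parallelism amounts to relabelling the targets one class versus the rest; a minimal sketch (the function name is illustrative):

```python
def output_parallelism_split(labels):
    """Split an n-class label sequence into n one-vs-rest two-class
    problems, as in output parallelism [5]: sub-problem i distinguishes
    class i (target 1) from its complement (target 0)."""
    classes = sorted(set(labels))
    return {c: [1 if y == c else 0 for y in labels] for c in classes}
```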
beginning, to separate the difficult patterns from the easy ones. In the case of
various algorithms, the criteria used include the history of difficulty, the
degree of learning [17], and so on.
In the Recursive Pattern based hybrid supervised learning (RPHS) algo-
rithm [6], the problem of the degree of learning is overcome by hybridizing the
algorithm and adapting the solution to the problem topology by using a neu-
ral network. In this algorithm, GA is only a pattern selection tool. However,
the problem of computational cost still remains.
Separability
For the purpose of this chapter, we are interested in subsets which fulfill the
following conditions:
Condition set 2 Separability conditions for good subset partitioning
1. Each class in a subset must be well separated from other classes in the
same subset
2. Each subset must be well separated from another subset
Intuitively, we can observe that the fulfilling of these two conditions is
equivalent to fulfilling condition set 1
3.1 Pre-training
1. We express I_tr, O_tr, I_v and O_v as a combination of n classes of patterns,
i.e.,

O_v = {O_v^C1, O_v^C2, ..., O_v^Cn}
3.2 Training
1. Number of recursions r = 1.
2. A set of binary chromosomes is created, each chromosome having N_c
elements, where N_c is defined in equation 2. An element in a chromosome
is set to 0 or 1, with 1 indicating that the corresponding cluster will be
selected for solving in recursion r.
3. A genetic algorithm is executed to minimize E_tot, the average of the
training and validation errors E_tr and E_val:

E_tot = (E_tr + E_val) / 2    (3)
4. The best chromosome Chrom_best is a binary string with a combination of
0s and 1s, of size N_c. The following steps are executed:
   a) N_c^r = 0, Tr^r = [], Val^r = []
   b) For e = 1 to N_c
   c)    if Chrom_best(e) == 1
            N_c^r ++
            Tr^r = Tr^r + Tr_e
            Val^r = Val^r + Val_e
   d) The data is updated as follows:
      Tr = Tr − Tr^r
      Val = Val − Val^r
      N_c = N_c − N_c^r
      r ++
   e) Tr^r and Val^r are used to find S^r, the solution network corresponding
      to the subset of data in recursion r.
5. Steps 2 to 4 are repeated with the new values of Tr, Val, N_c and r.
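Step 4 above, decoding the best chromosome and updating the remaining data, can be sketched as follows (clusters are represented simply as lists of patterns; the names are illustrative):

```python
def apply_best_chromosome(chrom_best, tr_clusters, val_clusters):
    """Move the clusters flagged with 1 in the best chromosome into the
    current recursion's training/validation subset, and keep the rest
    as the pool for later recursions."""
    tr_r, val_r = [], []            # data for this recursion
    tr_rest, val_rest = [], []      # remaining cluster pool
    for bit, tc, vc in zip(chrom_best, tr_clusters, val_clusters):
        if bit == 1:
            tr_r.extend(tc)
            val_r.extend(vc)
        else:
            tr_rest.append(tc)
            val_rest.append(vc)
    return tr_r, val_r, tr_rest, val_rest
```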
3.3 Simulation
way. Later in this paper, we observe how the use of GA's combinatorial
optimization automatically takes care of when to stop recursions. The use of
GAs to select patterns [17], [6] requires extensive tests for detrimental
decomposition and overtraining; the proposed RSL-CC algorithm eliminates the
need for such tests. As a result, the algorithm is self-sufficient and very
simple, with minimal adaptations.
4.1 Illustration
The grouping of patterns means that clusters of patterns are selected for each
subset. Further, in contrast with other methods, the proposed GA based
recursive subset selection selects the optimal subset combination. Therefore,
we can assume that the decomposition performed is optimal; an illustrative
example is shown in the figure (panels a–e).
The design of a neural network system “is more an art than a science in the
sense that many of the numerous factors involved in the design are as a result
of one’s personal experience.” [12]. While the statement is true, we wish to
make the RSL-CC system less of an art and more of a science. Therefore, we
propose several methods which improve the algorithm and make it “to the
point” as far as implementation is concerned.
This means that the population size is either POPSIZE, a constant for
the maximal population size, or, if N_c is small, 2^{N_c}.
The argument behind the use of a smaller population size is that when there
are, for example, 4 clusters, it is not efficient to evaluate a large number of
chromosomes; only 2^4 = 16 chromosomes are created and evaluated.
In the case where the number of chromosomes is 2^{N_c}, only one generation is
executed. This step ensures the efficiency of the algorithm. Again with
efficiency in mind, we ensure that in this case all the chromosomes are unique.
Therefore, when the number of clusters is small, the algorithm becomes a brute
force technique.
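This population-size rule can be sketched directly; `POP_SIZE = 20` follows the evolutionary-search population size used in the experiments, and the exhaustive branch makes the small-cluster case a one-generation brute-force search (the function name is illustrative):

```python
import random

POP_SIZE = 20  # maximal population size

def population(n_clusters):
    """Create the initial population: all 2**n_clusters unique binary
    chromosomes when the cluster count is small (a brute-force
    enumeration run for one generation), otherwise POP_SIZE random
    binary chromosomes."""
    if 2 ** n_clusters <= POP_SIZE:
        # exhaustive and unique: every possible cluster selection
        return [[(i >> b) & 1 for b in range(n_clusters)]
                for i in range(2 ** n_clusters)]
    return [[random.randint(0, 1) for _ in range(n_clusters)]
            for _ in range(POP_SIZE)]
```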
Experimental results
The two spiral dataset consists of 194 patterns. To ensure a fair comparison
to the Dynamic subset selection algorithm [17], test and validation datasets of
192 patterns were constructed by choosing points next to the original points
in the dataset as mentioned in the paper.
Table 2. Parameters used

Parameter                                                Value
Evolutionary search parameters
  Population size                                        20
  Crossover probability                                  0.9
  Small change mutation probability                      0.1
  MGG parameters                                         µ = 4, θ = 1
Neural network parameters
  Generalization loss tolerance for validation           1.5
  Backpropagation learning rate                          10^-2
  Number of neighbors in the KNN pattern distributor     1
The following control experiments were carried out based on the multi-
sieving algorithm and the dynamic topology based subset finding algorithm.
Both the versions of output parallelism implemented also show the effect of
hybrid selection.
In order to illustrate the effect of the GA based combinatorial optimization,
we also implement the single cluster algorithm explained in section 2 [2]. The
algorithm, in contrast to the RSL-CC algorithm, simply divides the data into
clusters and develops a network to solve each cluster separately.
The RSL-CC algorithm was also compared to our earlier work on RPHS [6],
which uses a hybrid algorithm to recursively select patterns, as opposed to
clusters.
In a nutshell, the following algorithms were implemented to compare with
the various properties of the RSL-CC algorithm
1. Constructive Backpropagation
2. Multisieving¹
3. Dynamic topology based subset finding
4. Output parallelism without pattern distributor [5]
5. Output parallelism with pattern distributor [7]
6. Single clustering for supervised learning
7. Recursive pattern based hybrid supervised learning
Table 2 summarizes the parameters used. For clustering, agglomerative
hierarchical clustering is employed with complete linkage and the cityblock
distance. Using thresholding, the natural clusters of patterns in each class
were obtained. AHC was preferred to other clustering methods such as K-means
or SOMs due to its non-parametric nature, since the number of target clusters
is not required beforehand.
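As an illustration of this clustering step (not the authors' implementation; a library routine would normally be used), complete-linkage agglomerative clustering with cityblock distance and a distance threshold can be written in a few lines:

```python
def cityblock(a, b):
    """Cityblock (L1) distance between two feature vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))

def complete_linkage_clusters(points, threshold):
    """Agglomerative hierarchical clustering sketch: repeatedly merge the
    two closest clusters under complete linkage (distance between the
    farthest members) until the smallest merge distance exceeds the
    threshold, so the number of clusters emerges naturally."""
    clusters = [[p] for p in points]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = max(cityblock(a, b)
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        if d > threshold:  # no pair is close enough: stop merging
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters
```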
5.7 Results
We divide this section into two parts. In the first part, we compare the mean
generalization accuracies of the various recent algorithms described in this
paper with the generalization accuracy of the RSL-CC algorithm. In the second
part, we present the clusters and data decomposition employed by RSL-CC,
illustrating the finding of simple subsets.

Comparison of generalization accuracies
From the tables above, we can observe that the generalization error of
the RSL-CC is comparable to the generalization error of the RPHS algorithm
and is a general improvement over other recent algorithms. A particularly
significant improvement can be observed in the vowel dataset.
There is some tradeoff observed in terms of training time. The training
times for the RSL-CC algorithm for the vowel and two-spiral problems are
1 The multisieving algorithm [19] did not propose a testing system. We test
the generalization accuracy of the system using the KNN pattern distributor,
similar to the RSL-CC pattern distributor.
RSL Using Clustering and Combinatorial Optimization 173
Table 3. Summary of the results obtained from the VOWEL problem (38 clusters,
12 recursions)
higher than those of the other methods. However, it is interesting to note
that the training time for the Letter Recognition problem is 50% less than
that of any of the recent algorithms. It is felt that this reduction in
training time comes from the reduction of the problem space from the
selection of patterns to the selection of clusters: clusters are selected
from 100 possible clusters, while RPHS has to select patterns out of 10,000,
thereby reducing the solution space 100-fold.
On the other hand, for the vowel problem, the problem space is reduced by
only about 13-fold. RSL-CC performs more efficiently when the reduction of
the problem space outweighs the cost of the GA-based combinatorial
optimization.
The RSL-CC decomposition figures
Figures 5 and 6 illustrate the data decomposition for the letter recognition
and the vowel problems. Only one instance of decomposition is presented
in the figures. From the figures, we can observe the data being split into
increasingly smaller subsets, thereby increasing focus on the difficult patterns.
The decomposition presented is the two-dimensional projection onto the
principal component axes (PCA) [3] of the input space.
174 K. Ramanathan and S.U. Guan
Table 5. Summary of the results obtained from the TWO-spiral problem (4 clusters,
2 recursions)
References
1. Dayhoff JE, DeLeo JM (2001) Artificial neural networks: Opening the black
box. Cancer 91(8):1615–1635
2. Engelbrecht AP, Brits R (2002) Supervised training using an unsupervised
approach to active learning. Neural Process Lett 15(3):247–260
3. Fukunaga K (1990) Introduction to statistical pattern recognition. Academic,
Boston
4. Gong DX, Ruan XG, Qiao JF (2004) A neuro computing model for real
coded genetic algorithm with the minimal generation gap. Neural Comput Appl
13:221–228
5. Guan SU, Li S (2002) Parallel growing and training of neural networks using
output parallelism. IEEE Trans Neural Netw 13(3):542–550
6. Guan SU, Ramanathan K (2004) Recursive percentage based hybrid pattern
training. In: Proceedings of the IEEE conference on cybernetics and intelligent
systems, pp 455–460
7. Guan SU, Neo TN, Bao C (2004) Task decomposition using pattern distributor.
J Intell Syst 13(2):123–150
8. Guan SU, Qi Y, Tan SK, Li S (2005) Output partitioning of neural networks.
Neurocomputing 68:38–53
9. Guan SU, Ramanathan K, Iyer LR (2006) Multi learner based recursive training.
In: Proceedings of the IEEE conference on cybernetics and intelligent systems
(Accepted)
10. Guan SU, Zhu F (2004) Class decomposition for GA-based classifier agents – a
pitt approach. IEEE Trans Syst Man Cybern B Cybern 34(1):381–392
11. Hastie T, Tibshirani R, Friedman J (2001) The elements of statistical learning:
Data mining, inference, and prediction. Springer, New York
12. Haykin S (1999) Neural networks: A comprehensive foundation. Prentice Hall,
Upper Saddle River, NJ
13. Holland JH (1973) Genetic algorithms and the optimal allocation of trials. SIAM
J Comput 2(2):88–105
14. MacQueen JB (1967) Some methods for classification and analysis of multi-
variate observations. In: Proceedings of 5th Berkeley symposium on mathemat-
ical statistics and probability, vol 1. University of California Press, Berkeley,
pp 281–297
15. Jain AK, Murty MN, Flynn PJ (1999) Data clustering: A review. ACM Comput
Surv 31(3):264–323
16. Kohonen T (1997) Self organizing maps. Springer, Berlin
17. Lasarczyk CWG, Dittrich P, Banzhaf W (2004) Dynamic subset selection based
on a fitness case topology. Evol Comput 12(4):223–242
18. Lehtokangas M (1999) Modeling with constructive backpropagation. Neural
Netw 12:707–714
19. Lu BL, Ito K, Kita H, Nishikawa Y (1995) Parallel and modular multi-sieving
neural network architecture for constructive learning. In: Proceedings of the 4th
international conference on artificial neural networks, 409, pp 92–97
20. Satoh H, Yamamura M, Kobayashi S (1996) Minimal generation gap model
for GAs considering both exploration and exploitation. In: Proceedings of 4th
international conference on soft computing, Iizuka, pp 494–497
21. Schapire RE (1999) A brief introduction to boosting. In: Proceedings of the 16th
international joint conference on artificial intelligence, pp 1–5
22. Wong MA, Lane T (1983) A kth nearest neighbor clustering procedure. J Roy
Stat Soc B 45(3):362–368
23. Zhang BT, Cho DY (1998) Genetic programming with active data selection.
Lect Notes Comput Sci 1585:146–153
Evolutionary Approaches to Rule Extraction
from Neural Networks
Urszula Markowska-Kaczmar
Summary. This chapter begins with a short survey of existing methods of rule
extraction from neural networks. Because searching for rules is close to an
NP-hard problem, the application of evolutionary algorithms to rule
extraction is justified; the survey therefore also contains a short
description of evolutionary methods. This creates the background for
presenting our own experience with successful applications of evolutionary
algorithms to this process. Two methods of rule extraction, REX and GEX, are
presented in detail. They represent a global approach to rule extraction,
perceiving a neural network as a set of pairs: an input pattern and the
response produced by the neural network. REX uses propositional fuzzy rules
and is composed of two methods, REX Michigan and REX Pitt. GEX takes
advantage of classical crisp rules. All details of these methods are
described in the chapter. Their efficiency was tested in experimental
studies using different benchmark data sets from the UCI repository. A
comparison to other existing methods was made and is presented in the
chapter.
1 Introduction
Neural networks are widely used in real life. Many successful applications
in various areas may be listed here, for example in show business [29], in
pattern recognition [36] and [11], in medicine, e.g. in drug development,
image analysis and patient diagnosis (two major cornerstones are the
detection of coronary artery disease and the processing of EEG signals
[30]), in robotics [27], in industry [37] and in optimization [7].
Their popularity is a consequence of their ability to learn. Instead of an
algorithm that describes how to solve the problem, neural networks need a
training set with patterns representing the proper transformation of input
patterns onto output patterns. After training, they can generalize the
knowledge they acquired during training, performing the trained task on
previously unseen data. They are also well known for their ability to remove
noise from data. A further advantage is their resistance to damage.
Once the net input is calculated, it is converted to the output value by
applying an activation function f:
yi = fi (neti ) (2)
Various types of functions can be applied as activation functions. Fig. 2
shows a typical one, the sigmoidal function. Another popular one is the
hyperbolic tangent. Its shape is similar to the sigmoidal function, but its
values belong to the range (−1, 1) instead of (0, 1). These simple elements
connected together create a neural network. Depending on the way they are
joined, one can distinguish feedforward and recurrent neural networks.
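A minimal sketch of the two activation functions mentioned above; the slope parameter beta is an assumed parametrization, not taken from the chapter's Fig. 2:

```python
# The sigmoidal function maps the net input into (0, 1); the hyperbolic
# tangent has the same S-shape but maps into (-1, 1).
import math

def sigmoid(net, beta=1.0):
    # logistic function: output in (0, 1)
    return 1.0 / (1.0 + math.exp(-beta * net))

def tanh_act(net, beta=1.0):
    # hyperbolic tangent: same shape, output in (-1, 1)
    return math.tanh(beta * net)

print(sigmoid(0.0))   # 0.5, the midpoint of (0, 1)
print(tanh_act(0.0))  # 0.0, the midpoint of (-1, 1)
```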
Here we will focus on the first type of neural network architecture.
Frequently, in a feedforward network, neurons are organized in layers. The
neurons in a given layer are not connected to each other; connections exist
only between neurons of neighbouring layers. Such a network is shown in
Fig. 3. One can distinguish the input layer, which distributes the input
information to the neurons in the next layer. Then the information is
processed by the hidden layer. Each neuron in this layer adds the weighted
signals and processes the total net input with an activation function. Once
all neurons in this layer have calculated their output values, the neurons
in the output layer become active. They calculate the total net input and
process it with the activation function, as described for the previous
layer. This is the way information is processed by the network. In general,
the network can contain more than one hidden layer.
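The layered forward pass just described can be sketched as follows; the weights and layer sizes are illustrative, not from the chapter:

```python
# Each layer computes weighted sums of its inputs and applies the
# activation function; the hidden layer fires before the output layer.
import math

def sigmoid(net):
    return 1.0 / (1.0 + math.exp(-net))

def layer_forward(inputs, weights):
    # weights[i][j] connects input j to neuron i of this layer
    return [sigmoid(sum(w * x for w, x in zip(row, inputs)))
            for row in weights]

# 2 inputs -> 3 hidden neurons -> 1 output neuron
w_hidden = [[0.5, -0.2], [0.1, 0.8], [-0.4, 0.3]]
w_output = [[0.7, -0.6, 0.2]]

x = [1.0, 0.5]
hidden = layer_forward(x, w_hidden)       # hidden layer fires first
output = layer_forward(hidden, w_output)  # then the output layer
print(len(hidden), len(output))  # 3 1
```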
As we mentioned above, the network acquires its ability to solve the problem
through the training process, which consists in searching for the weights in
an iterative way. After training, the network can produce a response on the
basis of the input value. To train the network, it is necessary to collect
data in the training set T:
T = {(x1 , y1 ), (x2 , y2 ), . . . , (xp , yp )} (3)
Each element is a pair that is an example of the proper transformation of an
input value represented by the vector x = [x1 , x2 , . . . , xN ] onto the
desired output value represented by the vector y = [y1 , y2 , . . . , yM ].
The most popular training rule is backpropagation. It minimizes the squared
error Ep between the desired output y for the given pattern x and the
answer of the neural network, represented by the vector
o = [o1 , o2 , . . . , oM ], for this input:
Ep = 1/2 Σk (ypk − opk )2 (4)
The weights are updated iteratively until the error reaches the assumed
value. In each time step the weight is changed according to Eq. (5):
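A minimal numerical sketch of this iterative scheme, assuming the standard gradient-descent update Δw = η(y − o)x for a single linear output; the learning rate, data and single-neuron model are illustrative:

```python
# Gradient descent on the squared error Ep = 1/2 * sum_k (y_k - o_k)^2
# for a single linear neuron o = w * x (an assumed illustration).
def train(pairs, w=0.0, eta=0.1, steps=100):
    for _ in range(steps):
        for x, y in pairs:
            o = w * x
            # dEp/dw = -(y - o) * x, so this update moves w downhill
            w += eta * (y - o) * x
    return w

pairs = [(1.0, 2.0), (2.0, 4.0)]   # target mapping: y = 2x
w = train(pairs)
print(round(w, 3))  # 2.0
```

Repeating the small per-pattern corrections drives the weight toward the value that minimizes the squared error over the training set.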
of this pattern. We assume that during rule extraction we know the patterns
from the training set.
The aim of rule extraction from a network is to find a set of rules that
describe the performance of the neural network. The most popular form is
propositional rules; however, predicates are applied as well. Because of
their comprehensibility, we concentrate on propositional rules, which have
the following form:
IF prem1 AND prem2 AND . . . AND premN THEN classj (6)
The premise premi refers to the i-th network input and formulates a
condition that has to be satisfied by the value of the i-th attribute in
order to classify the pattern to the class indicated by the conclusion of
the rule.
Another popular form of rule is the propositional fuzzy rule, shown below:
IF x1 is Z1r AND . . . AND xN is ZNb THEN y1 y2 . . . yk (7)
where xi corresponds to the i-th input of the NN, the premise xi is Zib
states that the attribute (input variable) xi belongs to the fuzzy set Zib
(i ∈ [1, N ]), and y1 y2 . . . yk is a code of a class, where k is the
number of classes and only one yc = 1 (c ∈ [1, k]; for i ≠ c, yi = 0). In
general, the number of premises is less than or equal to N , which stands
for the number of inputs of the neural network. Fuzzy sets can take
different shapes; the most popular are the triangular, trapezoidal and
Gaussian functions. Examples of triangular fuzzy sets are shown in Fig. 5.
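A triangular membership function, the most popular of the shapes listed above, can be sketched as follows (the apex positions are illustrative):

```python
# Membership rises linearly from the left apex to 1 at the center apex,
# then falls linearly to 0 at the right apex.
def triangular(x, left, center, right):
    if x <= left or x >= right:
        return 0.0
    if x <= center:
        return (x - left) / (center - left)
    return (right - x) / (right - center)

print(triangular(5.0, 0.0, 5.0, 10.0))  # 1.0 at the apex
print(triangular(2.5, 0.0, 5.0, 10.0))  # 0.5 halfway up the left slope
```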
There are many criteria of the evaluation of the extracted rules’ quality.
The most frequently used include:
• fidelity – stands for the degree to which the rules reflect the behaviour of
the network they have been extracted from,
• accuracy – is determined on the basis of the number of previously unseen
patterns that have been correctly classified,
• consistency – occurs if, during different training sessions, the network pro-
duces sets of rules that classify unseen patterns in the same way,
with threshold θ). The neuron will fire if the following condition is
satisfied:
Σi=1..5 xi wi > θ (8)
0 ≤ x3 w3 + x4 w4 ≤ 4 (9)
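Condition (8) can be checked directly; the weights and threshold below are illustrative:

```python
# The neuron fires when the weighted sum of its five inputs exceeds the
# threshold theta.
def fires(inputs, weights, theta):
    net = sum(x * w for x, w in zip(inputs, weights))
    return net > theta

weights = [1, 1, 2, 2, 1]
theta = 3
print(fires([1, 1, 1, 0, 0], weights, theta))  # True: 4 > 3
print(fires([1, 0, 0, 0, 1], weights, theta))  # False: 2 > 3 fails
```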
What does rule extraction mean in the context of a neural network with many
hidden and output neurons and a continuous activation function? When the
sigmoidal function or the hyperbolic tangent is applied, then by setting an
appropriate value of β we can model the step function (Fig. 2). In this
case, to obtain rules for a single neuron we can follow the procedure
presented above. Such rules are created for all neurons and concatenated on
the basis of mutual dependencies. Thus we obtain rules that describe the
relations between the inputs and outputs of the entire network, as
illustrated in Fig. 7. As examples one may mention the following methods:
Partial RE [34], M of N [12], Full RE [34], RULEX [3]. In order to create a
rule describing the activation of a neuron, the majority of algorithms
belonging to this group consider the influence of the sign of each weight:
when the weight is positive, the input helps to drive the neuron towards
firing (output = 1); when the weight is negative, the input hinders the
firing of the neuron. The problem of rule extraction by means of the local
approach is simple only if the network is relatively small.
Global methods treat a network as a black box, observing its inputs and the
responses produced at the outputs. In other words, the network provides the
method with training patterns. Examples include VIA [35], which uses a
procedure similar to classical sensitivity analysis, and BIO-RE [34], which
applies
truth tables to extract rules, Ruleneg [13], where an adaptation of the PAC
algorithm is applied, and [28], based on inversion. It is worth mentioning
that with such an approach the architecture of the neural network is
insignificant. Most methods of rule extraction concern multilayer
perceptrons (MLP networks, Fig. 3); however, methods dedicated to other
networks are developed as well, e.g. [33], [10]. A method that would meet
all these requirements (or at least the vast majority) has not been
developed yet. Some of the methods are applicable to enumerable or real
data only, some require repeating the process of training or changing the
network's architecture, and some require providing a default rule that is
used if no other rule can be applied in a given case. The methods also
differ in computational complexity (which is not specified in most cases).
Hence the necessity of developing new methods. Some of them use
evolutionary algorithms to solve this problem, as will be presented in
Section 5.
1 Although evolutionary algorithms in the literature [9], [25] are used as a
general term for different concepts such as genetic algorithms, evolution
strategies and genetic programming, here we will use the term in a narrowed
sense in order to underline the difference with reference to the genetic
algorithm. In the classical genetic algorithm a solution is encoded in a
binary way; here, real numbers are used to encode information. This requires
specifying special genetic operators, but the general outline of the
algorithm is the same.
2 There are evolutionary algorithms with proven convergence to a global
optimum [8].
Let us recall that the local approach to rule extraction (Section 3.2),
which starts by searching for rules describing the activation of each neuron
in the network, is simple only when the network is relatively small.
Otherwise, methods that decrease the number of connections and neurons are
used. Examples of solutions applied in this case include the clusterisation
of hidden neuron activations (a cluster of neurons is substituted by one
neuron) and the optimization of the structure of a neural network by
introducing a special training procedure that causes pruning of the network.
Using evolutionary algorithms, as shown in [31], [15], may also be very
helpful.
Let us start our survey of EA applications in rule extraction from neural
networks by presenting the approach from [31]. The authors evolve the
topology of feedforward neural networks trained by the RPROP algorithm, a
gradient method similar to backpropagation. They use the ENZO tool, which
evolves fully connected feedforward neural networks with a single hidden
layer. The tool is extended by an implementation of the RX algorithm of rule
extraction from neural networks (proposed by [19]), which is a local method.
In ENZO, each gene in the genotype representing an individual is interpreted
as a connection between two neurons. The selection is based on ranking: the
higher the ranking of an individual, the higher the probability that it will
be selected.
Two crossover operators are designed. The first one inserts all connections
from the parents into the child, and the weights are the average of the
weights inherited from the parents. The second one inserts a randomly chosen
hidden neuron and its connections from the parent with the higher fitness
value into the current topology of the child. The mutation operator inserts
a neuron into, or removes a neuron from, the current topology.
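The first crossover operator can be sketched as follows, under the assumption that a network is represented as a mapping from connections (neuron pairs) to weights; this representation is hypothetical, not ENZO's actual data structure:

```python
# The child receives the union of the parents' connections; where both
# parents share a connection, the child's weight is the average of the
# weights inherited from the two parents.
def crossover_average(parent_a, parent_b):
    child = {}
    for conn in set(parent_a) | set(parent_b):
        ws = [p[conn] for p in (parent_a, parent_b) if conn in p]
        child[conn] = sum(ws) / len(ws)
    return child

a = {("in0", "h0"): 0.4, ("h0", "out"): 1.0}
b = {("in0", "h0"): 0.8, ("in1", "h0"): -0.5}
child = crossover_average(a, b)
print(round(child[("in0", "h0")], 3))  # 0.6, the average of 0.4 and 0.8
```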
The next crucial element of a method based on an evolutionary algorithm is
the fitness function. Here the fitness function is expressed as follows:
fitness = Wacc · Acc + Wcom · comprehensibility (10)
where Wacc and Wcom are the importance weights for the accuracy and
comprehensibility terms, Acc is the accuracy, expressed as the quotient of
true-positive covered patterns to the number of all patterns, and
comprehensibility in turn is defined as follows:
comprehensibility = 1 − (2 · R/MaxR + C/MaxC)/3 (11)
In this equation R is the number of rules, C is the total number of rule
conditions, MaxR is the maximum number of rules extracted from an
individual, and MaxC is the maximum number of rule conditions among all
individuals evolved so far.
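The comprehensibility term (11) can be computed directly; the counts below are illustrative:

```python
# Fewer rules (r) and fewer rule conditions (c) yield a value closer to 1;
# max_r and max_c are the running maxima described above.
def comprehensibility(r, c, max_r, max_c):
    return 1 - (2 * r / max_r + c / max_c) / 3

# 2 rules out of at most 10, 4 conditions out of at most 40
print(round(comprehensibility(2, 4, 10, 40), 4))  # 0.8333
```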
Fig. 8. One genetic cycle in extended ENZO includes rule extraction and evaluation
min(f1 , f2 ) (12)
where f1 = E (13)
f2 = Ω (14)
where E is the error of the neural network and Ω is the regularization term,
expressed by the number of connections in the network. The paper contains an
illustrative example, a feasibility study of the idea of searching for the
optimal architecture of a neural network and acquiring rules for that
network. It uses the breast cancer benchmark problem from the UCI repository
[4], which contains 699 examples with 9 input features belonging to two
classes: benign and malignant. Only the first output of the neural network
is considered in the
Based on these two rules, only 2 out of 100 test samples were misclassified,
and 4 of them could not be decided, with a predicted value of 0.49.
The paper [14] is an example of the application of an evolutionary algorithm
to clustering the activations of hidden neurons. It uses a simple encoding,
and the chromosomes have a fixed length, which allows classical genetic
operators to be used. The fitness function is based on the Euclidean
distance between the activation values of hidden neurons and the number of
objects belonging to the given cluster. The rules in this case have the
following form:
where ai is the activation of hidden neuron i, vimin and vimax are
respectively its minimal and maximal values, and n is the number of hidden
neurons. In order to acquire rules, in the first step the equations
describing the activation of each hidden neuron are created. Then, while
processing the training patterns, each neuron is assigned its activation
value and the label of the class. Next, the activation values obtained in
the previous step are separated according to class, and the hidden neuron
activations are clustered with the application of an EA. This means that the
EA searches for groups of activation values of hidden neurons. Finally, in
order to acquire the rules expressed in (Eq. 17), the minimal and maximal
values for each cluster are searched for.
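The final step, deriving rule bounds from clusters of activation values, can be sketched as follows; the clusters themselves are assumed to be given (found by the EA), and the values are illustrative:

```python
# Each cluster's minimal and maximal values give the bounds
# v_min <= a_i <= v_max of one rule premise for that hidden neuron.
def cluster_bounds(clusters):
    # clusters: list of lists of activation values for one hidden neuron
    return [(min(c), max(c)) for c in clusters]

clusters = [[0.05, 0.12, 0.09], [0.81, 0.95, 0.88]]
print(cluster_bounds(clusters))  # [(0.05, 0.12), (0.81, 0.95)]
```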
The global methods of rule extraction from neural networks, by offering
independence from the neural network architecture, are very attractive
proposals. Below, a few examples are described in more detail.
We will start with the presentation of a relatively simple method, drawn up
by [17], where an evolutionary algorithm searches for a path composed of
connections from an input neuron to the output neuron. Knowing that the
neural network solves a classification task, the method searches for an
essential input feature causing the classification of patterns to the given
class.
A single gene encodes one connection; this means that in order to find a
path from the input layer to the output layer of a two-layered network, the
chromosome consists of two genes. The fitness function is defined as the
product of the connection weights belonging to the path. It reflects the
strength of the connections belonging to the path in producing the output.
Classical genetic operators are used.
Another idea is shown in [18]. In this paper, the range of values of each
attribute is divided into 10 subranges encoded by natural numbers.
Nominative attributes are substituted by a number of binary attributes equal
to the number of values of the attribute. The chromosome has as many genes
as the neural network has inputs. An example is shown in Fig. 10. A gene
value of 0 has a special meaning: the condition referring to this attribute
is omitted from the final form of the rule. The chromosome encodes the
premise part of the rule; the conclusion is established on the basis of the
neural network response. Fig. 10 explains the way the chromosome is decoded.
Knowing that 1 corresponds to the first subrange of the attribute, in the
rule-creating phase the first subrange is taken for the premise referring to
the first attribute. The values of the second and third genes are
transformed in the same way. The pattern is processed by the neural network,
and in response the code of the class is delivered on the output layer. The
output of the neural network is coded in a local way. According to this
principle, the biggest value in the output layer is found; if this value is
greater than an assumed threshold, the chromosome creates a rule.
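Decoding such a chromosome can be sketched as follows, assuming each attribute's range is split into 10 equal subranges (the attribute ranges are illustrative):

```python
# Each gene holds a subrange index (1..10) of its attribute; 0 means
# "no condition on this attribute" and the premise is skipped.
def decode(genes, ranges, subranges=10):
    premises = []
    for i, g in enumerate(genes):
        if g == 0:
            continue  # gene 0: attribute omitted from the rule
        lo, hi = ranges[i]
        step = (hi - lo) / subranges
        premises.append((i, lo + (g - 1) * step, lo + g * step))
    return premises  # list of (attribute index, lower bound, upper bound)

ranges = [(0.0, 10.0), (0.0, 1.0), (-5.0, 5.0)]
print(decode([1, 0, 3], ranges))
# attribute 0 in its 1st subrange, attribute 1 skipped, attribute 2 in its 3rd
```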
Fig. 10. An example of the chromosome in the method described in [18]
Fig. 11. The idea of the evolutionary algorithm application in a global method of
rule extraction
Now we focus on the approaches that treat a neural network as a black box
that delivers patterns to the method extracting rules. Fig. 11 illustrates
the idea of this approach. The evolutionary algorithm works here as the tool
extracting rules. In principle, this application of an evolutionary
algorithm may be seen as similar to rule extraction directly from data. The
main difference lies in the form of the fitness function, which in the case
of rule extraction from neural networks has to consider the criteria
described in Section 3.1.
Generally speaking, there are two possibilities for extracting a set of
rules that describes the classification of patterns. In the so-called
Michigan [25] approach, one individual encodes one rule. In order to obtain
the set of rules, different techniques are used, for example:
• sequential covering of examples: the patterns covered by the rule are
removed from the set of patterns, and a new rule is searched for over the
remaining patterns;
• an application of two hierarchical evolutionary algorithms, where one
searches for single rules, the set of which is then optimized by the
evolutionary algorithm on the higher level;
• special heuristics, applied to decide when a rule can be substituted by a
more general rule in the final set of rules.
In the Pitt [25] approach, the set of rules is encoded in one individual,
which makes all of the solutions mentioned above unnecessary, but the
chromosome is much more complex.
Now the Pitt and Michigan approaches to rule extraction from neural networks
will be presented on the basis of [23]. In this paper two methods of rule
Fig. 12. A part of the chromosome coding fuzzy sets; Gn – gene of the n-th
group of fuzzy sets; FSn,l – code of the l-th fuzzy set in the n-th group;
Fl,k – flag; Dl,k – code of the dk apex of the triangle of the FSl,k fuzzy
set
Fig. 14. The scheme of the chromosome in REX Pitt; F – activation bit, PC –
code of premise, CC – code of conclusion, D – the number corresponding to
the triangle apex of the fuzzy set
independent. The first one decides which premises are present in the rule,
while the second one determines the form of a fuzzy set group (the number of
fuzzy sets and their ranges).
Fig. 14 presents the general scheme of an individual in REX Pitt. F is the
activation bit. It stands before each rule, premise and fuzzy set; if it is
set to 1, the part after this bit is active. PC is the gene of a premise and
is coded as an integer number indicating the index of the fuzzy set in the
group for a given input variable. The description of the fuzzy sets included
in one individual applies to all rules coded in the individual.
To form a new generation, genetic operators are applied. One of them is
mutation, which consists in the random change of chosen elements in the
chromosome. There are several mutation operators in the REX algorithm. The
mutation of the central point of a fuzzy set (gene D in Fig. 14, which
corresponds to d in Fig. 5) is based on adding or subtracting a random
floating-point number (the operation is chosen randomly). It is performed
according to (17):
d ← d ± rand() · range / 10 (17)
where rand() is a function giving a random number uniformly distributed in
the range [0, 1) and range is the range of possible values of d. The
parameter equal to 10 in the denominator is used in order to ensure a small
change of d.
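Mutation (17) can be sketched as follows (the apex value and value range are illustrative):

```python
# The apex d of a triangular fuzzy set is shifted by a random fraction of
# its value range, with the sign of the shift chosen randomly.
import random

def mutate_center(d, value_range):
    delta = random.random() * value_range / 10  # rand() in [0, 1)
    return d + delta if random.random() < 0.5 else d - delta

random.seed(1)
d = mutate_center(5.0, value_range=10.0)
print(abs(d - 5.0) < 1.0)  # True: the change is always less than range/10
```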
The mutation of an integer value (PC) relies on adding a random integer
number modulo z, which is formally expressed by (18):
where z is the maximum number of fuzzy sets in a group and is set by the
user. The mutation of a bit (for example bit F, or the bits in CC) is simply
its negation. There is also a mutation operator which changes the sequence
of the rules; as mentioned, the sequence of rules is essential for the
inference process.
The second operator is crossover, which is a uniform one ([25]). Because of
the floating-point representation and the complex semantics of the
chromosome in the REX method, the uniform crossover is performed at the
level of genes representing the rules and groups of fuzzy sets (see
Fig. 14). This means that the offspring inherits the genes of a whole rule
or a whole group of fuzzy sets from one parent or the other. After these
genetic operations, an individual that is not consistent has to be repaired.
One of the most important elements in the evolutionary approach is the
design of the fitness function. Here, the evaluation of individuals consists
in performing fuzzy reasoning with the decoded set of rules on the training
and/or testing examples and calculating the metrics defined below, which are
used in the fitness function expressed by (19). The fuzzy reasoning result
has to be compared with that of the NN, giving the following metrics:
• corr – the number of correctly classified patterns: those patterns which
fired at least one rule and for which the result of reasoning
(classification) was equal to the output of the neural network; this
parameter determines the fidelity of the rule set;
• incorr – the number of incorrectly classified patterns: those patterns for
which the reasoning gave results different from those of the neural
network;
• unclass – the number of patterns that were not classified: those patterns
for which none of the rules fired, so it was impossible to perform the
reasoning.
There are also metrics describing the complexity of rules and fuzzy sets:
• prem – the total number of active premises in active rules,
• fsets – the total number of active fuzzy sets in an individual.
All the metrics mentioned above were used to create the following evaluation
function (19):
of steps; or when the evaluation function for the best individual reaches a
certain value.
REX Michigan consists of two specialized evolutionary algorithms alternating
with each other (see Fig. 15). The first one, EARules, searches for rules,
while the second, which we called EAFuzzySets, optimises the membership
functions of the fuzzy sets applied in the rules. This approach can be
promising when the initial form of the fuzzy sets is given by experts; the
role of EAFuzzySets is then only the tuning of the fuzzy sets.
One individual in EARules codes one rule. In one cycle, EARules is run
several times to find a set of rules describing the neural network. Each
time, patterns covered by the previously found rule are removed from the
training set. This method is known as sequential covering. The rules are
evaluated using simplified fuzzy reasoning, where one of the fuzzy set
collections found by EAFuzzySets is used. In the first cycle of REX
Michigan, the set of fuzzy sets is chosen at random or can be established by
an expert (in the presented experiments the first option was used). In the
evolutionary algorithm optimising fuzzy sets (EAFuzzySets), one individual
codes a collection of fuzzy set groups. These are evaluated on the basis of
simplified fuzzy reasoning as well. EAFuzzySets processes a given number of
generations (GF); then the best individual is passed to EARules. The cycle
of the alternate work
of these two stages lasts until the final set of rules with appropriate
fuzzy sets represents the knowledge contained in the neural network at the
appropriate level, or until the given number of cycles has elapsed.
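The sequential covering scheme used by EARules can be sketched as follows; find_rule stands in for one EARules run, and the toy threshold rules are purely illustrative:

```python
# After each rule is found, the patterns it covers are removed, and the
# next rule is searched for on the remainder.
def sequential_covering(patterns, find_rule, max_rules=10):
    rules = []
    remaining = list(patterns)
    while remaining and len(rules) < max_rules:
        rule = find_rule(remaining)      # one EA run (stand-in)
        covered = [p for p in remaining if rule(p)]
        if not covered:
            break                        # no progress: stop
        rules.append(rule)
        remaining = [p for p in remaining if not rule(p)]
    return rules

# Toy example: "rules" are threshold predicates on 1-D patterns.
def find_rule(pats):
    t = min(pats)                        # cover the smallest value upward
    return lambda p, t=t: p <= t + 1.0

rules = sequential_covering([0.2, 0.5, 3.1, 3.4], find_rule)
print(len(rules))  # 2: one rule per group of nearby values
```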
The evolutionary algorithm EARules concentrates on the optimization of the
rules. It consists in periodically searching for the best rule describing
the neural network, which is then added to the set of rules. All training
examples for which the result of reasoning with the investigated rule was
the same as the neural network answer are marked as covered by the rule.
This allows the rule for the uncovered examples to be searched for in the
next cycle. This approach is similar to ([6]). The form of the rule is the
same as in REX Pitt. The mutation operator from REX Pitt is also applied in
REX Michigan, and the crossover is a uniform operator. After crossover or
mutation, the consistency of each rule is tested, and the individuals are
repaired if necessary. To evaluate an individual, simplified fuzzy reasoning
is performed for each training example, and appropriate metrics are
calculated on the basis of the activation of the rule. The following metrics
are used in the fitness function:
• corri – the number of correctly classified examples, i.e. those examples
for which the i-th rule fired with an appropriate strength (greater than θ)
and its conclusion was the same as the classification of the neural network
for that pattern; this metric is computed over all training examples, even
those covered by other rules;
• corr newi – the number of newly covered examples, i.e. the covering of
unmarked examples;
• incorri – the number of incorrectly classified examples: the i-th rule
fired with an appropriate strength (the activation of the rule is greater
than the threshold θ), but its conclusion was inconsistent with the neural
network output; here all examples are considered, including those marked as
covered;
• unclassi – the number of unclassified examples, for which the i-th rule
did not fire but the conclusion of the rule was consistent with the neural
network output for the example; all examples are considered in this case;
• unclass newi – the number of unclassified examples which are not marked
as covered by another rule.
Additionally, the fitness function applies the metric premi, which gives the
number of active premises in the rule. The fitness function used at this
level is presented by (20):
The first component of the fitness function (20) ensures maximizing the
number of newly correctly classified patterns while minimizing the number of
incorrectly classified patterns. The second one has a strong selective role,
because its value for an individual representing a rule that does not cover
any new pattern is much smaller than for one that covers even a single
example. The third component minimizes the number of premises in the rule.
The optimization of fuzzy sets is performed by an evolutionary algorithm,
which is called EAFuzzySets. A collection of groups of fuzzy sets is coded in
one individual. One group Gj corresponds to one input attribute that relates
to the neural network input. Additionally, a gene RS coding the sequence of
fuzzy rules is included. The initial experiments showed that classical fuzzy
reasoning was not an effective solution in this case, so determining the
sequence of fuzzy rules was necessary. The form of the chromosome in EAFuzzySets
is presented in Fig. 12. The coding of a group of fuzzy sets is common to both
approaches (REX Pitt and REX Michigan). The coding of a single rule is
similar to that in REX Pitt. The first number in a premise indicates whether
the premise is active. The second number is simply the index of a fuzzy set in
the group corresponding to the given neural network input.
Generalising, we can say that the information contained in one chromo-
some in REX Pitt is split into two chromosomes in REX Michigan. These two
chromosomes represent individuals in EARules and EAFuzzySets.
The evaluation of individuals involves their decoding. As a result, in
EARules the set of rules is obtained. Then, for each example included in the
training set, simplified fuzzy reasoning takes place. For individual j at this
level, the following statistics are collected during this process: corrj , incorrj ,
unclassj . Additionally, the complexity of the collection of fuzzy sets is mea-
sured by the parameter f setsj describing the number of active fuzzy sets in
the collection of fuzzy sets. The form of fitness function is as follows (21):
Fig. 17. The comparison of the extracted rules for REX Pitt and Full–RE for the
IRIS data set
• xi > value1 ,
• xi < value2 ,
• value1 < xi ∧ xi < value2 ⇔ xi ∈ (value1 ; value2 ),
• xi < value1 ∨ value2 < xi .
For a discrete attribute, the non-strict inequalities (≤, ≥) are used instead
of (<, >). One assumes that value1 < value2 .
For enumerative attributes, only two relation operators are used, {=, ≠},
so the premise has one of the following forms:
• xi = valuei ,
• xi ≠ valuei .
For boolean attributes there is only one relation operator, =. The premise
can take one of the following forms:
• xi = True,
• xi = False.
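The premise forms above can be evaluated with a small dispatcher. The operator codes and argument layout used here are an illustrative assumption, not GEX's exact chromosome encoding.

```python
def premise_holds(kind, x, op, v1, v2=None):
    """Evaluate one premise for an attribute value x.

    kind selects the attribute type ("real", "enum" or "bool");
    op and the v1/v2 layout are an assumed encoding for illustration.
    """
    if kind == "real":                        # continuous attribute
        if op == ">":
            return x > v1
        if op == "<":
            return x < v2
        if op == "in":                        # x in (v1; v2)
            return v1 < x < v2
        if op == "out":                       # outside the range
            return x < v1 or v2 < x
    if kind == "enum":                        # enumerative attribute: = or !=
        return (x == v1) if op == "=" else (x != v1)
    if kind == "bool":                        # boolean attribute: = only
        return x == v1
    raise ValueError(kind)
```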
All rules in one subpopulation have an identical conclusion. The evolutionary
algorithm runs in a classical way (Fig. 19). The only difference between the
classical evolutionary algorithm and the proposed one lies in the evaluation
of individuals, which requires a decision system based on the processing of
rules. After each generation the rules are evaluated using the set of patterns,
which are processed by the neural network and by the rules from the current
population (Fig. 11). To realize this, a decision system searching for the rules
that cover a given pattern is implemented. A rule covers a given example
according to Definition 1.
Definition 1 The rule ri covers a given example p when, for all values of the
attributes present in this pattern, the premises of the rule are true.
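Definition 1 translates directly into code. Representing a rule as (attribute index, predicate) pairs is an assumption made for illustration; GEX encodes rules in a chromosome as described below.

```python
def covers(rule, example):
    """Definition 1: the rule covers the example when every active premise
    is true for the example's attribute values.

    rule    : list of (attribute_index, predicate) pairs (assumed encoding)
    example : sequence of attribute values
    """
    return all(pred(example[i]) for i, pred in rule)
```

For example, a rule with the active premises x0 > 1.0 and x2 < 5.0 covers the pattern (2.0, 9.9, 3.0) regardless of the value of the inactive attribute x1.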
The comparison of the classification results is the basis for the evaluation
of each rule, expressed by the value of a fitness function.
An evolutionary algorithm working in this way will look for the best rules,
covering as many patterns as possible. There is a risk that some patterns
would never be covered by any rule. To solve this problem, a niching
mechanism is implemented in GEX. The final set of rules is created from the
best rules found by the evolutionary algorithm, but some heuristics are also
applied in order to optimize it. The details of the evolutionary algorithm and
the above-mentioned heuristics are presented in the following subsections.
Figure 20 shows the general scheme of the genotype in GEX. It is composed
of chromosomes corresponding to the inputs of the neural network and
a single conclusion gene. A chromosome consists of a flag gene and genes
encoding the premise, which are specific to the type of the attribute the
premise refers to.
The flag allows rules of different lengths, because a premise is included in
the body of the rule only if its flag is set to 1. The chromosome is designed
according to the type of the attribute (Fig. 21), in order to reflect the
condition in the premise. For a real attribute the chromosome consists of the
code of the relation operator and two values determining the limits of the
range (Fig. 21c). For a nominal attribute there is an operator code and
a value (Fig. 21b). Fig. 21a shows the chromosome for a binary attribute:
besides the flag gene, it consists of one gene referring to the value of the
attribute,
where ximax and ximin are, respectively, the maximal and minimal values of
the i-th attribute, and rch is a parameter defining how much the limits of the
range can be changed. For a discrete attribute a new value is chosen at random
from the values defined for this type. The fitness function is defined as the
weighted average of the following parameters: accuracy (acc), classCovering
(classCov), inaccuracy (inacc), and comprehensibility (compr):
Fun = (A ∗ acc + B ∗ inacc + C ∗ classCov + D ∗ compr) / (A + B + C + D)   (23)
Weights (A, B, C, D) are implemented as the parameters of the application.
Accuracy measures how well the rule mimics the knowledge contained in the
neural network. It is defined by (24).
acc = correctFires / totalFiresCount   (24)
inacc is a measure of incorrect classification made by the rule. It is expressed
by Eq. (25).
inacc = missingFires / totalFiresCount   (25)
Parameter classCovering, abbreviated as classcov, gives the fraction of all
patterns from a given class that are covered by the evaluated rule. It is
formally defined by Eq. (26):
classcov = correctFires / classExamplesCount,   (26)
where classExamplesCount is the number of patterns from a given class. The
last parameter, comprehensibility, abbreviated as compr, is calculated on the
basis of Eq. (27):
compr = (maxConditionCount − ruleLength) / (maxConditionCount − 1),   (27)
where ruleLength is the number of premises of the rule and maxConditionCount
is the maximal number of premises in a rule, i.e., the number of inputs of the
neural network.
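Equations (23) to (27) combine into a single score. The sketch below follows the formulas as printed; note that Eq. (23) adds the inaccuracy term, so in practice a negative weight B (or a subtraction) would presumably be needed for inaccuracy to act as a penalty. The default weights here are arbitrary placeholders, since A, B, C, D are application parameters.

```python
def gex_fitness(correct_fires, missing_fires, total_fires,
                class_examples, rule_length, max_conditions,
                A=1.0, B=1.0, C=1.0, D=1.0):
    """Weighted-average fitness of Eq. (23) built from Eqs. (24)-(27)."""
    acc = correct_fires / total_fires                       # Eq. (24)
    inacc = missing_fires / total_fires                     # Eq. (25)
    classcov = correct_fires / class_examples               # Eq. (26)
    compr = (max_conditions - rule_length) / (max_conditions - 1)  # Eq. (27)
    return (A * acc + B * inacc + C * classcov + D * compr) / (A + B + C + D)
```

With 8 correct and 2 missing fires out of 10, a class of 16 patterns, and a 3-premise rule over 5 inputs, the four parameters are 0.8, 0.2, 0.5 and 0.5, so the unweighted average is 0.5.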
During the evolution the set of rules is updated: some rules are added and
some are removed. In each generation, individuals with accuracy and
classCovering greater than minAccuracy and minClassCovering are the
candidates to update the set of rules. The values minAccuracy and minClass-
Covering are parameters of the method. Rules are added to the set of rules
when they are more general than the rules currently in the set, according to
Definition 2.
Definition 2 Rule r1 is more general than rule r2 when the set of examples
covered by r2 is a subset of the set of examples covered by r1 . In the case
where r1 and r2 cover the same examples, the rule with the higher fitness
value is taken as more general.
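Definition 2 can be checked by comparing the sets of covered examples. Representing a rule as a predicate over examples and passing the two fitness values explicitly are illustrative assumptions.

```python
def more_general(r1, r2, examples, f1, f2):
    """Definition 2: r1 is more general than r2 when the examples covered
    by r2 form a subset of those covered by r1; if both cover exactly the
    same examples, the rule with the higher fitness (f1 vs f2) wins.

    r1, r2 : predicates example -> bool (assumed representation)
    """
    cov1 = {i for i, e in enumerate(examples) if r1(e)}
    cov2 = {i for i, e in enumerate(examples) if r2(e)}
    if cov1 == cov2:
        return f1 > f2          # tie on coverage: compare fitness
    return cov2 <= cov1         # subset test on the covered-example sets
```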
Rule Extraction from Neural Networks 203
Furthermore, the less general rules are removed. After all patterns have been
presented, usability is calculated for each rule according to Eq. (28).

usability = usabilityCount / examplesCount   (28)

All rules with usability less than minUsability, which is a parameter set by
the user, are removed from the set of rules. The optimization of the set of
rules thus consists in removing less general and rarely used rules and
replacing them with more general rules from the current generation.
The following statistics characterize the quality of the set of rules. The
value covering defines the percentage of classified examples among all
examples used in the evaluation of the set of rules (Eq. 29).

covering = classifiedCount / examplesCount   (29)
Fidelity, expressed in Eq. (30), describes the percentage of examples
classified correctly (according to the neural network answer) among all
examples classified by the set of rules.

fidelity = correctClassifiedCount / classifiedCount   (30)
Covering and fidelity are two measures of the quality of the acquired set of
rules that describe its accuracy and generalization. Additionally, performance
(Eq. 31) is defined, giving the percentage of correctly classified examples
among all examples used in the evaluation process.

performance = correctClassifiedCount / examplesCount   (31)
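The three statistics can be computed together. Note that Eqs. (29)-(31) imply performance = covering × fidelity, which is a convenient consistency check.

```python
def rule_set_stats(classified, correctly_classified, total):
    """Quality statistics of a rule set per Eqs. (29)-(31).

    classified           : examples classified by the rule set
    correctly_classified : of those, examples agreeing with the network
    total                : all examples used in the evaluation
    """
    covering = classified / total                   # Eq. (29)
    fidelity = correctly_classified / classified    # Eq. (30)
    performance = correctly_classified / total      # Eq. (31)
    return covering, fidelity, performance
```

For example, 80 of 100 examples classified, 72 of them in agreement with the network, gives covering 0.8, fidelity 0.9 and performance 0.72.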
Table 1 shows the results of experimental studies on benchmark data files.
The aim of the experiments was to test the quality of rule extraction by GEX
for data sets with different types of attributes. The procedure was as follows.
First, the data set was divided into 10 parts. Each time, 9 parts were used for
training and the remaining part for testing. This was repeated 25 times
(equivalent to 2.5 runs of 10-fold cross-validation). The results were then
averaged, and the mean and standard deviation were calculated. The 15 tested
data sets come from the UCI repository. They represent a spectrum of data
sets with different types of attributes. The files Agaricus lepiota, Breast
cancer, Tictactoe and Monk1 have nominal attributes only. WDBC, Vehicle,
Wine and Liver are examples of files with continuous attributes. Dermatology
has discrete attributes, and the Zoo data file logical ones. The files
Australian credit, Heart and Ecoli have mixed attributes.
In these experiments the values of parameters were set as follows:
• mutation = 0.2,
• crossover = 0.5,
204 U. Markowska-Kaczmar
Table 1. The results of experiments with GEX in terms of: the number of generations
(Ngenerations), the number of rules (Nrules), covering and fidelity, for data sets
from the UCI repository with different types of attributes, using 10-fold cross-
validation (average value ± standard deviation is shown)
Other tests, made with the following data sets from UCI: SAT (36 contin-
uous attributes, 4435 patterns), Artificial characters (16 discrete attributes,
16,000 patterns), Sonar (60 continuous attributes, 208 patterns), Animals
(72 nominal and continuous attributes, 256 patterns), German Credit (21
nominal, discrete and continuous attributes, 1000 patterns) and Shuttle (9
continuous attributes, 43,500 patterns), confirmed that the method is scalable.
The comprehensibility of the rules acquired by GEX was tested as well. In
this case the LED data set was chosen, because its patterns are easily
visualized. The simple version of this data file contains seven attributes
representing the segments of a LED display, as shown in Fig. 22. In the applied
file these 7 attributes were extended with 17 additional attributes set at
random to 0 or 1. To keep the task nontrivial, 1% of noise was introduced into
the proper attributes representing the LED segments (1% of the values of the
proper attributes were flipped to the opposite value). This problem was not
easy for the neural network to classify either; it was trained to a performance
of 80%.
An example of the set of rules extracted after several runs of the GEX
application with different parameter values is presented in Table 2. It can
be noticed that only the rules for digit 3 contain a premise referring to an
additional, artificial attribute. For the remaining digits the rules are perfect.
They are visualized in Fig. 23 (except the rules for digit 3). During the tests
one could observe that the time of rule extraction and the final result (rules for all
Table 2. The example of the set of rules for LED data set
Fig. 23. Visualization of LED data set - the first row, visualization of rules extracted
for LED data set - the second row
classes and the number of rules for each class) was strongly dependent on the
first found rule describing one of the similar digits (6, 9, 8). During these
experiments the crucial parameters were minAccuracy and minUsability.
For noisy data, minAccuracy equal to 0.9 or even 0.8 gave good results. The
greater the value of minUsability, the more general the rules in the final set
of rules.
Table 3 presents a comparison of the results obtained by GEX and other
methods of rule extraction from neural networks. The results do not allow one
to state that GEX is always better than its counterpart methods (e.g., see the
Iris data set for Santos's method), but it can be observed that in most cases
GEX surpasses the other methods. Besides, it delivers rules for all classes
without the need for a default rule.
6 Conclusion
Neural networks are very efficient in solving various problems, but they lack
the ability to explain their answers and to present the gathered knowledge in
a comprehensible way. In recent years numerous methods of rule extraction
have been developed. Two main approaches are used, namely the global one
that treats a network as a black box and the local one that examines its
structure. The purpose of these algorithms is to produce a set of rules that
would describe a network’s performance with the highest fidelity possible,
taking into account its ability to generalise, in a comprehensible way.
Because the problem of rule extraction is very complex, the use of evolutionary
algorithms for this purpose, especially for networks that solve classification
problems, is more and more popular. In this chapter examples of both kinds of
usage were presented: local rule extraction as well as global. Effective
application of the local approach depends on the number of neurons in the
network, so existing methods tend to limit the architecture of the neural
network. An evolutionary algorithm can be applied to cluster the activations
of hidden neurons and to substitute one neuron for a cluster of neurons. In
other applications it optimizes the structure of the neural network. Global
methods treat a neural network as a black box; as such, the network delivers
the training examples for the rule-acquiring method. In this group three
different methods based on evolutionary algorithms were presented. The
experimental studies presented in this chapter show that methods of rule
extraction from neural networks based on evolutionary algorithms are not only
a promising idea, but can deliver fuzzy or crisp rules that are accurate and
comprehensible for humans in acceptable time. Because of the limited space of
this chapter, methods using multiobjective optimization in the Pareto sense in
the global group are not presented here. Interested readers are referred
to [24] and [22].
References
1. Alexander JA, Mozer M (1999) Template-based procedures for neural network
interpretation. Neural Netw 12:479–498
2. Andrews R, Diederich J, Tickle A (1995) Survey and critique of techniques
for extracting rules from trained artificial neural networks. Knowl-Based Syst
8(6):373–389
3. Andrews R, Geva S (1994) Rule extraction from constrained error back propa-
gation MLP. In: Proceedings of 5th Australian conference on neural networks,
Brisbane, Queensland, pp 9–12
4. Blake CC, Merz C (1998) UCI Repository of Machine Learning Databases.
University of California, Irvine, Department of Information and Computer
Sciences.
5. Bologna G (2000) A study on rule extraction from neural network applied to
medical databases. In: The 4th European conference on principles and practice
of knowledge discovery
6. Castillo L, González A, Pérez R (2001) Including a simplicity criterion in the
selection of the best rule in a genetic algorithm. Fuzzy Sets Syst 120:309–321
7. Cichocki A, Unbehauen R (1993) Neural networks for optimization and signal
processing. Wiley, London
8. Eiben A, Aarts E, Hee K (1991) Parallel problem solving from nature. Chapter
Global convergence of genetic algorithms: a Markov chain analysis, Springer,
Berlin Heidelberg New York, p 412
27. Omidvar O, Van der Smagt P (eds.) (1997) Neural systems for robotics.
Academic, New York
28. Palade V, Neagu DC, Patton RJ (2001) Interpretation of trained neural net-
works by rule extraction. Fuzzy days 2001, LNCS 2206, pp 152–161
29. Reil T, Husbands P (2002) Evolution of central pattern generators for
bipedal walking in a real-time physics environment. IEEE Trans Evol Comput
6(2):159–168
30. Robert C, Gaudy JF, Limoge A (2002) Electroencephalogram processing using
neural network. Clin Neurophysiol 113(5):694–701
31. Santos R, Nievola J, Freitas A (2000) Extracting comprehensible rules from
neural networks via genetic algorithms. In: Symposium on combinations of
evolutionary computation and neural networks 1:130–139
32. Setiono R, Leow WK, Zurada J (2002) Extraction of rules from artificial neural
networks for nonlinear regression. IEEE Trans Neural Netw 13(3):564–577
33. Siponen M, Vesanto J, Simula O, Vasara P (2001) An approach to automated
interpretation of SOM. In: Proceedings of workshop on self-organizing map 2001
(WSOM2001), pp 89–94
34. Taha I, Ghosh J (1999) Symbolic interpretation of artificial neural networks.
IEEE Trans Knowl Data Eng 11(3):448–463
35. Thrun SB (1995) Advances in neural information processing systems. MIT, San
Mateo, CA
36. Van der Zwaag B-J (2001) Handwritten digit Recognition: a neural network
demo. In: Computational intelligence: theory and applications, Vol. 2206 of
Springer LNCS. Dortmund, Germany, pp 762–771
37. Widrow B, Rumelhart DE, Lehr M (1994) Neural networks: applications in
industry, business and science. Commun ACM 37(3):93–105
Cluster-wise Design of Takagi and Sugeno
Approach of Fuzzy Logic Controller
α Constant
β Threshold for similarity
γ Threshold for determining a valid cluster
µ Membership function value
Takagi and Sugeno Approach of FLC 213
1 Introduction
We, human beings, have a natural thirst to gather input-output relationships
of a process, which are necessary, particularly for on-line control of the same.
Several attempts were made in the past, to capture the above relationships
for a number of processes, by using the tools of traditional mathematics (e.g.,
differential equations and their solutions), statistical analysis (e.g, regression
analysis based on some experimental data collected in a particular fashion,
say full factorial design, fractional factorial design, central composite design,
and others), and others. Most of the real-world problems are so complex that
it might be difficult to formulate them, in the form of differential equations.
Moreover, even if it is possible to determine the differential equation, it could
be difficult to get its solution. The situation becomes worse, when the input
and output variables are associated with imprecision and uncertainty. A fuzzy
logic controller (FLC), which works based on Zadeh’s fuzzy set theory [1],
could be a natural choice, to solve the above problem, as it is a potential tool
for dealing with imprecision and uncertainty. The FLCs are becoming more
and more popular nowadays, due to their simplicity, ease of implementation
and ability to tackle complex real-world problems.
To design an FLC for controlling a process, we try to model the human rea-
soning used to solve the said problem, artificially. The variables are expressed
in terms of some linguistic terms (such as VN–very near, N–near, and others)
and the degree of belongingness of an element to a class is expressed by its
membership function value (which lies between 0 and 1). The rules generally
express the relationships among the input and output variables of a process.
The performance of an FLC depends on its knowledge base (KB), which con-
sists of both data base (i.e., membership function distributions) as well as
rule base. Two basic approaches of FLC, namely Mamdani Approach [2] and
Takagi and Sugeno Approach [3], are generally available in the literature. An
FLC developed based on Mamdani Approach may not be accurate enough but
it is interpretable. On the other hand, Takagi and Sugeno Approach can yield
a more accurate controller compared to that designed based on the Mamdani
Approach but at the cost of interpretability. In Mamdani Approach, the crisp
value corresponding to the fuzzified output of the controller is determined by
using a defuzzification method, which could be computationally expensive,
whereas in Takagi and Sugeno Approach, the output is directly expressed as
a function of the input variables.
214 Tushar and D.K. Pratihar
Attempts were also made to optimize the Takagi and Sugeno type of FLC.
Smith and Comer used the Least Mean Square (LMS) learning algorithm to
tune a general Takagi and Sugeno type FLC [17]. The co-efficients of the
output functions in the Takagi-Sugeno type FLC had been optimized using a
GA by Kim et al. [18]. The structure of ANFIS (adaptive-network-based fuzzy
inference system) was developed by Jang [19], in which the main aim was to
design an optimal Takagi-Sugeno type FLC by using a neural network.
The present chapter deals with GA-based (off-line) tuning of the FLCs
working based on Takagi and Sugeno approach. After the training of the
FLCs is over, their performances have been tested on two physical problems.
The rest of the text is organized as follows: Section 2 explains the principle
of Takagi and Sugeno approach of FLC. Section 3 gives a brief introduction
to the GA. The issues related to entropy-based fuzzy clustering and cluster-
wise linear regression are discussed in Section 4. The proposed method for
cluster-wise design of the FLCs working based on Takagi and Sugeno approach
and their GA-based tuning are explained in Section 5. The performances of
the developed approaches are compared among themselves on two physical
problems in Section 6. Some concluding remarks are made in Section 7 and
the scope for future work is indicated in Section 8.
3 Genetic Algorithm
Genetic algorithm (GA) is a population-based probabilistic search and opti-
mization technique, which works based on the mechanism of natural genetics
and Darwin’s principle of natural selection (i.e., survival of the fittest) [4].
The concept of GA was introduced by Prof. John Holland of the University
of Michigan, Ann Arbor, USA, in the year 1965, but his seminal book was
published in the year 1975 [20]. This book lays the foundation of genetic al-
gorithms. It is basically a heuristic search technique, which works using the
concept of probability. The working principle of a GA can be explained briefly
with the help of Fig. 1.
• A GA starts with a population of initial solutions, chosen at random.
• The fitness/goodness value (i.e., objective function value in case of a max-
imization problem) of each solution in the population is calculated.
• The population of solutions is then modified by using different operators,
namely reproduction, crossover, mutation, and others.
• All the solutions in a population may not be equally good in terms of
their fitness values. An operator named reproduction is used to select
the good solutions by using their fitness information. Thus, reproduction
forms a mating pool, which hopefully consists of good solutions.
Fig. 1. Flowchart of a GA: initialize a population of solutions and set Gen = 0;
assign fitness to all solutions in the population; apply reproduction, crossover
and mutation; increment Gen; repeat until Gen > Max_gen
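The loop of Fig. 1 can be sketched as below. The operator signatures, the use of fitness-proportionate selection for reproduction, and the pairwise application of crossover are illustrative assumptions; the chapter does not fix these details.

```python
import random

def genetic_algorithm(pop, fitness, crossover, mutate,
                      max_gen=100, pc=0.9, pm=0.01, rng=None):
    """Skeleton GA: evaluate, reproduce (fitness-proportionate selection
    into a mating pool), apply crossover and mutation, repeat until the
    generation counter exceeds max_gen, then return the best solution."""
    rng = rng or random.Random(0)
    for _ in range(max_gen):
        fits = [fitness(ind) for ind in pop]
        # reproduction: mating pool that hopefully consists of good solutions
        pool = rng.choices(pop, weights=fits, k=len(pop))
        nxt = []
        for i in range(0, len(pool) - 1, 2):
            a, b = pool[i], pool[i + 1]
            if rng.random() < pc:           # crossover with probability pc
                a, b = crossover(a, b, rng)
            nxt += [mutate(a, pm, rng), mutate(b, pm, rng)]
        if len(pool) % 2:                   # keep odd-sized populations intact
            nxt.append(pool[-1])
        pop = nxt
    return max(pop, key=fitness)
```

A quick sanity check on a onemax problem (maximize the number of 1s in a bit string) shows the loop climbing toward all-ones.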
(i.e., xi1 , xi2 , . . . , xiR ). The following steps are followed to determine entropy
of each point.
• Arrange the data set in rows and columns. Thus, there are N rows and
R columns.
• Determine the Euclidean distance between the points i and j as follows:

Dij = ( Σ_{k=1}^{R} (xik − xjk)² )^{1/2} .   (3)
From the above equation, it can be observed that entropy E becomes equal
to 0.0, for a value of S = 0 and S = 1.0. Moreover, entropy E takes the
maximum value of 1.0, corresponding to a value of S = 0.5. Thus, entropy
of a point with respect to another point varies between 0.0 and 1.0.
• Calculate total entropy value at a data point xi with respect to all other
data points by using the expression given below.
Ei = − Σ_{j∈X, j≠i} ( Sij log2 Sij + (1 − Sij ) log2 (1 − Sij ) )   (7)
It is also important to note that during clustering, the point having the
minimum total entropy may be selected as the cluster center, because
(1 − E) indicates the probability of a point being selected as a cluster
center.
Takagi and Sugeno Approach of FLC 219
Clustering Algorithm
Let us suppose that [T ] is the data set containing N data points and each
data point has R dimensions. Thus, [T ] is an N × R matrix. The clustering
algorithm consists of the following steps.
• Step 1: Calculate entropy Ei for each data point xi lying in [T ].
• Step 2: Locate xi , which has the minimum Ei value and select it
(xi,minimum ) as the cluster center.
• Step 3: Put xi,minimum and the data points whose similarity to
xi,minimum is greater than β (a threshold value of similarity) into a cluster,
and remove them from [T ].
• Step 4: Check if [T ] is empty. If yes, terminate; otherwise go to Step 2.
In the above algorithm, entropy has been defined in such a way that a data
point far away from the rest of the data may also be selected as a cluster
center, because a very distant point may also have a low value of entropy.
To overcome this, another parameter γ (in %) is introduced: a threshold used
to declare a cluster valid. After the clustering is over, the number of data
points in each cluster is counted, and if this number is greater than or equal
to γ% of the total number of data points, the cluster is declared valid.
Otherwise, the data points that could not form a valid cluster are declared
outliers.
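Steps 1 to 4 of the algorithm, together with the γ-based validity check, can be sketched as follows. The similarity form S_ij = exp(−α·D_ij) is an assumption: it is the usual choice in entropy-based fuzzy clustering, and the chapter's symbol list names α only as a constant.

```python
import math

def euclidean(a, b):
    """Eq. (3): Euclidean distance between two R-dimensional points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def entropy_clustering(T, beta, gamma, alpha=1.0):
    """Entropy-based clustering of the N x R data set T.

    beta  : similarity threshold for joining a cluster
    gamma : minimum cluster size as a percentage of all points
    alpha : constant in the assumed similarity S_ij = exp(-alpha * D_ij)
    """
    def sim(i, j):
        return math.exp(-alpha * euclidean(T[i], T[j]))

    def total_entropy(i, points):
        # Eq. (7): total entropy of point i w.r.t. the remaining points
        e = 0.0
        for j in points:
            s = sim(i, j)
            if j != i and 0.0 < s < 1.0:
                e -= s * math.log2(s) + (1.0 - s) * math.log2(1.0 - s)
        return e

    points = list(range(len(T)))
    clusters = []
    while points:
        # Steps 1-2: the point with minimum total entropy is the centre
        centre = min(points, key=lambda i: total_entropy(i, points))
        # Step 3: the centre and all points similar beyond beta form a cluster
        members = [j for j in points if j == centre or sim(centre, j) > beta]
        clusters.append(members)
        points = [j for j in points if j not in members]   # Step 4: repeat
    # gamma check: clusters smaller than gamma% of the data become outliers
    valid = [c for c in clusters if len(c) >= gamma / 100.0 * len(T)]
    outliers = [j for c in clusters if c not in valid for j in c]
    return valid, outliers
```

On a toy set with two tight groups and one far-away point, the distant point is indeed picked early (low entropy), forms a singleton cluster, and is then rejected by the γ check, illustrating the problem and remedy described above.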
yij = a0ij + a1ij x1 + a2ij x2 + . . . + akij xk + . . . + apij xp ,   (8)

where a0ij , a1ij , . . . , apij are the coefficients to be tuned properly during
the GA-based learning.
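Eq. (8) is a plain linear function of the inputs; flattening the coefficient vector as (a0, a1, ..., ap) for one rule/output pair is the only assumption in this sketch.

```python
def ts_output(a, x):
    """Eq. (8): Takagi-Sugeno rule output as a linear function of the
    inputs. a = (a0, a1, ..., ap) holds the coefficients for one rule j
    and one output i; x = (x1, ..., xp) is the input vector."""
    return a[0] + sum(ak * xk for ak, xk in zip(a[1:], x))
```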
Let us also suppose that each of the input variables (i.e.,
x1 , x2 , . . . , xp ) is expressed by using three linguistic terms – L (Low), M
(Medium) and H (High). The membership function distributions of the input
variables are assumed to be linear in Approach 2, whereas they are expressed
by third-order polynomial functions [25] in Approach 3 (refer to
Fig. 2). It is to be noted that the parameter gk (k = 1, 2, . . . , p) can be
varied while carrying out optimization by using a GA.
Fig. 2. Membership function distributions L, M, H (with µ between 0.0 and 1.0)
of an input variable xk , parameterized by gk and starting from xk,min :
(a) linear (Approach 2); (b) third-order polynomial (Approach 3)
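A minimal sketch of the linear L/M/H distributions of Approach 2 is given below. It assumes the three memberships span [xk,min, xk,min + 2gk] with half-base gk, which is what Fig. 2a appears to show; the exact placement of the triangles is an assumption, not the chapter's specification.

```python
def tri(x, a, b, c):
    """Triangular membership with corners a <= b <= c."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

def memberships(x, x_min, g):
    """Assumed linear L/M/H distributions of Approach 2: L falls from 1 at
    x_min to 0 at x_min + g, M peaks at x_min + g, H rises from x_min + g
    and saturates at x_min + 2g."""
    L = 1.0 if x <= x_min else max(0.0, 1.0 - (x - x_min) / g)
    M = tri(x, x_min, x_min + g, x_min + 2.0 * g)
    H = min(1.0, max(0.0, (x - (x_min + g)) / g))
    return L, M, H
```

Under these assumptions, at x = x_min + g the memberships are (0, 1, 0), so M is fully active exactly where L and H vanish.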
[Figure: the knowledge base of the FLC is tuned off-line by the GA and then
used on-line]
[Figure: binary GA string encoding the parameters g1 , . . . , gk , . . . , gp and
the coefficients h111 , . . . , hijk , . . . , hqnp ]
The performances of the above three developed approaches have been tested
and compared among themselves on two different problems – one is related
to Abrasive Flow Machining (AFM) [26] and the other deals with Tungsten
Inert Gas (TIG) welding [27], which are discussed below.
[Figure: (a) number of clusters vs. β; (b) number of outliers vs. β]
Results of Approach 1
Fig. 5 shows the comparison between the target values (determined by using the
empirical relationships) and the calculated values (obtained by utilizing the
above regression equations) of the two outputs, MRR and Ra , for 50 randomly
generated test cases. The belongingness of a test case to a particular cluster
is decided by considering the minimum of its Euclidean distances from the
four cluster centers obtained above. The model is able to make reasonably
good predictions of M RR for a number of test cases (corresponding to which
the points are lying on the ideal y = x line) but not all (refer to Fig. 5(a)).
The above figure shows that the model has under-estimated the M RR values
for a few test cases. It is interesting to notice from Fig. 5(b) that the model
has predicted surface roughness values almost accurately for most of the test
cases but not all. It is also important to note that the points are found to lie
on both sides of the ideal y = x line. Thus, reasonably good predictions of
both the outputs have been made by Approach 1, for most of the random test
cases. The actual input-output relationships of the above process may be non-
linear in nature. Moreover, the degree of non-linearity may not be the same
throughout the entire input-output space. As the clustering is done based on
the concept of similarity, the similar points are expected to form a cluster.
Here, the non-linear input-output space has been divided into four clusters
based on the similarity and at each cluster, the input-output relationships
have been determined using the linear regression analysis. Thus, a non-linear
space has been divided into four regions and at each region, the input-output
relationships have been approximated as the linear ones. The deviations in
predictions as shown in Fig. 5 could be due to the above reason.
[Fig. 5: target vs. calculated values of (a) MRR and (b) Ra ]
Results of Approach 2
Fuzzy logic controllers (FLCs) have been designed based on Takagi and
Sugeno’s approach cluster-wise, in which each of the outputs has been
expressed as a linear function of the inputs, as obtained above. The Knowledge
Base (KB) of the FLCs has been tuned by a GA to improve their performance.
As the performance of the GA depends on its parameters, a systematic
study is conducted to determine the optimal GA-parameters, in which only
one parameter has been varied at a time after keeping the others fixed. Fig. 6
shows the results of the above parametric study. It is important to note that
Fig. 6. Results of the GA-parametric study: (a) fitness vs. pc ; (b) fitness vs. pm
Fig. 6. (Continued): (c) fitness vs. population size; (d) fitness vs. no. of
generations
the best results are obtained with the following GA-parameters: probability
of crossover pc = 0.87, probability of mutation pm = 0.00223, population size
P = 190 and maximum number of generations G = 450. In this problem, there
are four input variables (i.e., p = 4), two outputs (i.e., q = 2) and the number
of clusters n is set equal to 4. Thus, a total of p +q × n × p = 4 + 2 × 4 × 4 = 36
values are to be varied within their respective ranges (refer to Table 1), during
Fig. 7. Performance testing of the FLC having linear membership function distri-
butions – AFM data
Results of Approach 3
A parametric study has been carried out for this approach by following the
same procedure explained above, and the following GA-parameters – pc = 0.93,
pm = 0.00133, P = 190 and G = 430 – are found to yield the best results. The
optimal values of gk and hijk obtained by using this approach are shown in
Table 1. The values of MRR and Ra, predicted by using Approach 3, have
been compared with their respective target values in Fig. 8. It is interesting
to note that Approach 3 has predicted both MRR and Ra values with almost
the same accuracy level as that obtained by Approach 2.
Comparisons
In the present work, five inputs, namely welding speed (C1), wire feed rate
(C2), % cleaning (C3), arc gap (C4) and welding current (C5), and four outputs,
Fig. 8. Target vs. calculated values of MRR and Ra (Approach 3)
namely weld bead front height (O1 ), front width (O2 ), back height (O3 ), back
width (O4 ), have been considered to model the TIG welding process. The
ranges of different input parameters have been set as follows: C1 (24.0 to
46.0 cm/min), C2 (1.5 to 2.5 cm/min), C3 (30.0 to 70.0), C4 (2.4 to 3.2 mm)
and C5 (80.0 to 110.0 amp). To establish input-output relationships of this
(a) Output 1: MRR – % deviation in prediction over the test cases ('CWLRMRR', 'LRFLCMRR', 'NONLRFLCMRR')
(b) Output 2: Surface roughness Ra – % deviation in prediction over the test cases ('CWLRRa', 'LRFLCRa', 'NONLRFLCRa')
process by using statistical regression analysis, the data collected as per a full
factorial design of experiments (as there are five input variables, the number
of experiments will be equal to 2^5 = 32) [27], have been used. Table 2 shows
the above set of 32 data involving input-output relationships of the process.
The following response equations have been obtained by using the MINITAB-14 software package on the above 32 data.
O1 = − 17.2504 + 0.6202C1 + 4.6762C2 + 0.0866C3 + 7.4479C4 + 0.0431C5
− 0.1870C1 C2 − 0.0058C1 C3 − 0.2210C1 C4 − 0.0029C1 C5 + 0.0018C2 C3
− 1.8396C2 C4 + 0.0191C2 C5 − 0.0586C3 C4 + 0.0018C3 C5 − 0.0352C4 C5
+ 0.014C1 C2 C3 + 0.0623C1 C2 C4 + 0.0002C1 C2 C5 + 0.0022C1 C3 C4
− 0.0070 × 10^-3 C1 C3 C5 + 0.0011C1 C4 C5 + 0.0061C2 C3 C4
− 0.0014C2 C3 C5 − 0.0030C2 C4 C5 − 0.0003C3 C4 C5 − 0.0004C1 C2 C3 C4
+ 0.0189 × 10^-3 C1 C2 C3 C5 − 0.0460 × 10^-3 C1 C2 C4 C5 − 0.0009C1 C3 C4 C5
− 0.0004C2 C3 C4 C5 − 0.0069C1 C2 C3 C4 C5 , (23)
Table 2. Data (as per full factorial design of experiments) used to carry out
regression analysis
One thousand data have been generated artificially by using the above re-
gression equations, and those are grouped into a number of clusters based on
similarity by utilizing the entropy-based fuzzy clustering algorithm. The best
set of four clusters with zero outliers is obtained, corresponding to a threshold
of similarity β = 0.43. Three approaches have been developed with the best
set of clusters obtained above, the results of which are explained below.
Results of Approach 1
Cluster 1:
Cluster 2:
Cluster 3:
Cluster 4:
outputs. The points in Fig. 10 are seen to lie either on the ideal y = x line or
on both sides of it. The scatter of a few points about the ideal y = x line could
be due to the fact that, within each cluster, the responses have been determined
as linear functions of the input process parameters, while their interaction
and nonlinear terms have been neglected for simplicity.
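The cluster-wise linear regression idea described above can be sketched as follows; the data, the least-squares fit and the nearest-centre assignment rule are all illustrative assumptions, not the chapter's exact formulation.

```python
import numpy as np

def fit_clusterwise(X, y, labels):
    """Fit one linear model (with intercept) per cluster."""
    models, centres = {}, {}
    for c in np.unique(labels):
        idx = labels == c
        A = np.hstack([X[idx], np.ones((idx.sum(), 1))])  # append intercept column
        coef, *_ = np.linalg.lstsq(A, y[idx], rcond=None)
        models[c] = coef
        centres[c] = X[idx].mean(axis=0)
    return models, centres

def predict(x, models, centres):
    """Predict with the linear model of the nearest cluster centre."""
    c = min(centres, key=lambda k: np.linalg.norm(x - centres[k]))
    return np.append(x, 1.0) @ models[c]

# toy data: two clusters, each with its own linear response
rng = np.random.default_rng(0)
X = np.vstack([rng.uniform(0, 1, (50, 2)), rng.uniform(5, 6, (50, 2))])
labels = np.array([0] * 50 + [1] * 50)
y = np.where(labels == 0, X @ [1.0, 2.0], X @ [3.0, -1.0])
models, centres = fit_clusterwise(X, y, labels)
print(predict(np.array([0.5, 0.5]), models, centres))  # ≈ 1.5
```

Each cluster keeps its own linear model, so mild nonlinearity is approximated piecewise while interaction terms are still ignored, as noted above.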
(a) Target vs. calculated values of output 1
(b) Target vs. calculated values of output 2
Fig. 10. Performance testing of cluster-wise linear regression – TIG data
(c) Target vs. calculated values of output 3
(d) Target vs. calculated values of output 4
Results of Approach 2
In this approach, FLCs have been developed cluster-wise by following the Takagi and Sugeno approach and utilizing the linear response equations obtained
Results of Approach 3
Comparisons
Table 3. (Continued)
(a) Target vs. calculated values of output 1
(b) Target vs. calculated values of output 2
Fig. 11. Performance testing of the FLC having linear membership function distributions – TIG data
(c) Target vs. calculated values of output 3
(d) Target vs. calculated values of output 4
Approaches 2 and 3 are found to perform better than Approach 1, i.e., GA-tuned FLCs (de-
veloped cluster-wise) are seen to outperform the cluster-wise linear regression
analysis. It could be due to the fact that the input-output relationships of
the above processes are not exactly linear in nature, although those have
(a) Target vs. calculated values of output 1
(b) Target vs. calculated values of output 2
Fig. 12. Performance testing of the FLC having nonlinear membership function
distributions – TIG data
(c) Target vs. calculated values of output 3
(d) Target vs. calculated values of output 4
A close watch on the above results indicates that Approach 3 has performed
slightly better than Approach 2. It might be due to the reason that the linear
membership function distributions of the input variables used in Approach 2
have been replaced by nonlinear ones in Approach 3. Thus, Approach 3
(a) Output 1: Weld bead front height ('CWLR1', 'LRFLC1', 'NLRFLC1')
(b) Output 2: Weld bead front width ('CWLR2', 'LRFLC2', 'NLRFLC2')
Fig. 13. Comparison of three approaches in terms of % deviation in prediction of
different outputs – TIG
(c) Output 3: Weld bead back height ('CWLR3', 'LRFLC3', 'NLRFLC3')
(d) Output 4: Weld bead back width ('CWLR4', 'LRFLC4', 'NLRFLC4')
Fig. 13. (Continued)
7 Concluding Remarks
To establish input-output relationships of two physical processes, three ap-
proaches (one is a statistical regression analysis and the other two deal with
GA-tuned FLCs) have been developed cluster-wise and their performances
are tested on 50 randomly-generated new cases. From the above study, the
following conclusions have been drawn:
1. GA-tuned FLCs (both Approach 2 and Approach 3) developed based on
the Takagi and Sugeno approach have outperformed the linear regression
analysis, i.e., Approach 1, for the random test cases. It might be because
adaptability has been injected into the FLCs during their training carried
out by using a GA, whereas Approach 1 has no provision to gain such a
property.
2. Approach 3 has yielded better results compared to Approach 2. It could
be due to the fact that the membership function distributions of the input
variables have been assumed to be nonlinear in Approach 3, whereas those
in Approach 2 are linear in nature. Thus, Approach 3 is able to capture the
nonlinearity of the processes more effectively. In fact, Approach 3
is found to be the best of all.
3. Performances of the approaches are problem-dependent.
References
1. Zadeh LA (1965) Fuzzy Sets, Information and Control, 8(3):338–353
2. Mamdani EH, Assilian S (1975) An experiment in linguistic synthesis with a
fuzzy logic controller. Int J Man Mach Stud 7:1–13
3. Takagi T, Sugeno M (1985) Fuzzy identification of systems and its application
to modeling and control. IEEE Trans Syst Man Cybern, SMC-15:116–132
24. Yao J, Dash M, Tan ST (2000) Entropy-based fuzzy clustering and fuzzy mod-
eling. Fuzzy Set Syst 113:381–388
25. Nandi AK, Pratihar DK (2004) Design of a genetic-fuzzy system to predict
surface finish and power requirement in grinding. Fuzzy Set Syst 148:487–504
26. Jain RK, Jain VK (2000) Optimum selection of machining conditions in abrasive
flow machining using neural networks. J Mater Process Technol 108:62–67
27. Juang SC, Tarng YS, Lii HR (1998) A comparison between the back-propagation
and counter-propagation networks in the modeling of the TIG welding process.
J Mater Process Technol 75:54–62
Evolutionary Fuzzy Modelling for Drug
Resistant HIV-1 Treatment Optimization
Summary. Fuzzy relational models for genotypic drug resistance analysis in Human
Immunodeficiency Virus type 1 (HIV-1) are discussed. Fuzzy logic is introduced to
model high-level medical language and viral and pharmacological dynamics. In-vitro
experiments on genotype/phenotype pairs and in-vivo clinical data bases form the
basis for the knowledge mining. Fuzzy evolutionary algorithms and fuzzy evaluation
functions are proposed to mine resistance rules, to improve computational perfor-
mance and to select relevant features.
1 Introduction
1.1 Artificial Intelligence in Medicine
Recent years have seen medicine and artificial intelligence cross paths and
proceed together: statistical analysis has long been a valid support for epidemiology
and diagnosis, but after the first genomic regions were sequenced and interpreted,
the scenario and the needs became more complex, involving biology and bio-
chemistry. Today computer science - through machine learning and intelligent
systems - integrates medicine and biology in several fields, from sequence anal-
ysis to protein structure and function prediction, to gene regulatory network
modelling, to molecular design, to medical diagnosis. Medical and biological
data bases have started to grow and assume standard structures. Biological
systems are complex systems; medical measures are extremely variable even
under the same conditions and are only indirect indicators of the real processes.
One way to handle them is to use the uncertainty and vagueness concepts of
Fuzzy Logic, which is thereby a suitable modelling framework. In the HIV
treatment optimization scenario, the virus develops drug resistance through
genomic variation under drug pressure: the high mutation rate determines a
huge state variable space. There is a large number of different drugs that attack
different viral target genes and have to be combined in order to maintain viral
suppression and control the chance of resistance arising. Moreover, in the human body
M. Prosperi and G. Ulivi: Evolutionary Fuzzy Modelling for Drug Resistant HIV-1 Treatment
Optimization, Studies in Computational Intelligence (SCI) 82, 251–287 (2008)
www.springerlink.com
© Springer-Verlag Berlin Heidelberg 2008
252 M. Prosperi and G. Ulivi
The first sections (2 and 3) of this chapter are intended to provide the necessary
introduction and literature references to probe further: in detail, biological
background on antiviral treatments, resistance development, viral sequence
analysis and data collection is given in section 2. The following section 3 is
an overview of the current machine learning approaches for drug resistant
HIV-1 treatment optimization. Section 4 introduces fuzzy modelling for med-
ical science: starting from previous studies in the field, it goes on to define the
fuzzy relational system for in-vitro and in-vivo modelling. In section 5 optimiza-
tion techniques are discussed, describing Fuzzy Genetic Algorithms, Random
Searches and Fuzzy Feature Selection criteria. Section 6 presents application
results for phenotype and in-vivo clinical prediction, with conclusions and
future perspectives.
amino acids, which are encoded by blocks of three adjacent nucleotides in the
genome, called codons.
Genomic sequences are the building blocks of biological mechanisms: com-
puter science is today necessary to investigate genes and their functions;
even simple organisms like viruses are characterized by long character
sequences. The base for sequence analysis is [5], which is also a complete and
generic guide to the whole set of derived subtasks.
In the virus life cycle (when it reproduces) the genome string has to be copied
from one generation to the next. Soon after HIV enters the body, the virus
begins reproducing at a rapid rate and billions of new viruses are produced
every day. In the process, HIV produces both perfect copies of itself (wild
type) and copies containing errors (mutants): copying errors occur frequently.
Mutations can change the virus structure or functions and thus modify its inter-
action with the environment: the high mutation rate of HIV (combined with the
fact that it attacks the immune system) leads to difficulties in the design of a
vaccine, and to rapid selection of mutant strains resistant to drugs. At present
three classes of drugs are approved by the FDA (Food and Drug Administration
of the USA) as antiviral treatment against HIV: these are Reverse Transcriptase
inhibitors (RTi), Protease inhibitors (PRi) and Fusion inhibitors (Fi); each
class acts against a step of the viral replication process, and there are around
15 different molecules on the market.
The viral genotype is an RNA sequence over a 4-character alphabet, from
which mutations are usually extracted by comparing the sequence with the
wild type. Usually mutations are identified by a number representing a codon
(a position in the genomic sequence), preceded by a letter that indicates the
amino acid present in the wild type (i.e. the standard virus, without mutations)
and followed by another letter that denotes the amino acid replacing it in the
mutant. For instance, a mutation that usually confers resistance to Lamivudine
(3TC) is M184V: it indicates that at codon 184 the amino acid Methionine (M)
has been replaced by Valine (V).
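The mutation naming convention is easy to parse mechanically; the sketch below handles only simple substitution codes such as M184V (insertions and mixtures would need a richer grammar).

```python
import re

def parse_mutation(code):
    """Split a mutation code like 'M184V' into (wild-type amino acid, codon, mutant amino acid)."""
    m = re.fullmatch(r"([A-Z])(\d+)([A-Z])", code)
    if m is None:
        raise ValueError(f"not a simple substitution code: {code}")
    return m.group(1), int(m.group(2)), m.group(3)

print(parse_mutation("M184V"))  # ('M', 184, 'V')
```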
During infection there is no single virus in the body, but a large population
of mixed viruses called quasispecies. The wild type virus is the one naturally
evolved with the highest replicative capacity: before therapy is started, it is the
most abundant in the body and dominates all other quasispecies. Some mutant
variants are too weak to survive and/or cannot reproduce. Others are strong
enough to reproduce but still cannot compete with the fitter wild type; as a
result, their numbers in the body are lower than those of the wild type.
A drug usually works by blocking a key step in the virus life cycle. Some
variants have mutations that allow the virus to partly, or even fully, resist an
antiretroviral drug. Under constant therapy, mutant resistant strains can become
dominant in the patient (though having lower replicative capacity, or fitness).
This is called selective resistance, because the mutant is selected by the drug.
If it is not recognized, treatment loses its efficacy. Selected mutants are more
challenging to treat because therapy options are reduced. If the drug regimen is
changed, new mutations can be selected; furthermore there are mutations
(such as the insertion at codon 69 in the Reverse Transcriptase gene) that
cause cross-resistance to a whole class of antiretrovirals.
Treatment interruptions have shown that in a couple of months HIV reverts
to the wild type, but maintains low concentrations of resistant mutants, so if
a heavily experienced drug is reused, resistance arises shortly.
Combined therapies that involve multiple drugs are an approach to avoid
resistance. If the virus changes to resist one drug, but is inhibited by many
different others, it can be suppressed to undetectable levels (even though
complete eradication is not possible). Combined treatments can contain from
three to five different drugs, but often lead to tolerability and toxicity prob-
lems. Such therapies (usually two or three RTis and at least a PRi) are called
HAARTs (Highly Active Anti Retroviral Therapies) or cARTs (combined Anti
Retroviral Therapies): usually HAARTs produce a substantial reduction of viral
load within three to four weeks, and can be sustained over a long time window.
Unfortunately mutations occur also under HAARTs, even though at a lower
rate.
Before being approved and commercialized by the FDA, drugs follow a
long process in which their efficacies are tested through different phases (namely,
from phase I to phase IV): first they are designed, synthesized and put in viral
cultures; once proven effective in-vitro, they start to be tested for
absorption levels and toxicity in-vivo, until they are judged to be relatively
safe for the human body and effective in viral eradication. In-vitro studies,
however, are always carried on - even after commercialization - in order to
point out further resistance development.
In-vitro and in-vivo studies are the data available for modelling.
In-vitro studies are collections of experiments that measure how a mutated
virus responds in a culture to inhibition by a single drug, compared with the
replication of the wild type under the same drug pressure. The phenotype
is a numeric indicator of viral replication power, expressed as the Fold Change
of the drug concentration needed to inhibit 50% of the viral replication as
compared to a wild type drug-susceptible reference viral strain; the data sets
are pairs of genotype sequences and Fold Change values. At present, for each
drug there are thousands of such pairs freely available, and the quality - given
fixed environmental conditions and repeatability - is fairly high. These tests are
expensive compared to the cost of sequencing a viral strain, so a first attempt
is to define models that give phenotype predictions from genotype sequences.
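As a deliberately simple genotype-to-phenotype baseline (not the fuzzy predictor developed in this chapter), mutations can be encoded as binary indicators and the log Fold Change regressed linearly; the mutation panel and training pairs below are made up.

```python
import numpy as np

# hypothetical panel of resistance-relevant mutations
PANEL = ["M41L", "K65R", "M184V", "T215Y"]

def encode(mutation_list):
    """Binary indicator vector: 1 if the mutation is present in the genotype."""
    return np.array([1.0 if m in mutation_list else 0.0 for m in PANEL])

# toy training set: (observed mutations, log Fold Change) pairs
train = [(["M184V"], 2.0), (["M41L", "T215Y"], 1.5),
         ([], 0.0), (["K65R", "M184V"], 2.5)]
X = np.vstack([encode(m) for m, _ in train])
y = np.array([fc for _, fc in train])
w, *_ = np.linalg.lstsq(X, y, rcond=None)  # least-squares weight per mutation

print(float(encode(["M184V"]) @ w))  # predicted log Fold Change for M184V alone
```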
In-vivo studies usually are data bases collecting patients’ Follow Ups, i.e.
analyses carried out before and after a therapy switch: usually a therapy
Evolutionary Fuzzy HIV Modelling 255
is stopped and considered failing when the Viral Load1 is detectable in the
patient's blood and/or the CD4+ T2 cell counts are very low; when the virus
is detectable, it can also be sequenced, so it is possible to find which mutations
have been selected. Unfortunately, it is not always possible to obtain
clean and large data sets: prospective cohorts (called clinical trials - studies on
precise therapeutic protocols led by a team of physicians, in which pa-
tients are monitored weekly) are the most reliable ones, but not always free
and not so large; retrospective cohorts (collections of clinical reports from
the hospitals) are larger in size, but suffer from noise such as time delays, miss-
ing data and non-adherent patients. In addition, in-vivo measures are biased
by the instruments' systematic errors: viral load measures are reliable within 1
Log and cannot detect copies under certain limits (500 or 50 cp/ml); genotype
sequencing methods have an accuracy of 90% in revealing mutations, but per-
formance decreases when using plasma samples with low viral concentrations.
Even input errors are not negligible: data bases are not automated and have bad
relational structures and implementations; mostly, data are recorded manually
from paper clinical reports to spreadsheets. There are thousands of instances
available, but the variability of in vivo data is extremely high and the space of
investigation is huge: in theory there are about 20^400 possible genotypes (in the
sequenced region) and (15 choose 1) + · · · + (15 choose 5) = 4943 therapeutic combinations.
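The count of therapeutic combinations quoted above can be checked directly:

```python
from math import comb

# number of cART combinations of one to five drugs out of 15
total = sum(comb(15, k) for k in range(1, 6))
print(total)  # 4943
```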
cARTs with three or more antiretroviral drugs have led to significant de-
creases in HIV-related morbidity and mortality by reducing HIV replication.
The goal of therapy is to suppress the plasma viral load as much as possible
for as long as possible. Given the fact that viral eradication is not feasible
with the current treatment armamentarium, HIV mutants ultimately de-
velop, with different degrees of decreased susceptibility to the ongoing treat-
ment regimen and of cross-resistance to other agents. This results in virologic
rebound and eventually disease progression: for these patients, the aim is to
build a therapy optimization tool that will explore efficacies among a set of
possible cARTs and assure the best subsequent Viral Load reduction, taking
into account input attributes among the viral genotype, Baseline Viral Load
(virion counts in the plasma taken at a therapy switch), CD4+ T cell counts,
pharmacodynamics-kinetics, viral drug resistance mechanisms, et cetera.
One of the first studies published on drug resistant HIV treatment opti-
mization was the CTSHIV (Customized Treatment Strategy for HIV, see [18]):
the system operated on a predetermined fuzzy-like set of in vivo viral
drug resistance rules applied to a patient's viral genotype and nearby mutants.
1 Viral Load is the virion count in the plasma.
2 CD4+ T cells are immune cells targeted and infected by the virus, so their count
is a measure of the immune response to the infection.
Advantages of the fuzzy modelling come from the strong flexibility gained us-
ing linguistic variables, i.e. the possibility of modelling hypotheses in complex
natural language, like the medical one.
However, due to the multi-dimensional and heterogeneous characteristics
of the input attribute space (discrete-character viral genotype mutational
sequences, cARTs and real-valued plasma analyses), defining linguistic variables
is complex, and mining them is even harder.
Before introducing the HIV models, it is worth citing here the CADIAG
system, a fuzzy inference relational model proposed by Sanchez, Adlassnig
and Gupta [9, 10], developed as an automated tool for medical diagnosis: this
study in fact inspired the HIV model. In the Sanchez approach the medical
knowledge is represented as a fuzzy relation between symptoms and diseases.
So, given the fuzzy set A of the symptoms observed in the patient3 and the
fuzzy relation R = (A → B) representing the medical knowledge that relates
symptoms s ∈ S and diseases d ∈ D, a fuzzy set B of the patient's
possible diseases can be calculated through
B = A ◦ R (1)
or equivalently
T = Q ◦ R (3)
3 Here membership functions are set up on healthy/ill distributions from real-valued
analyses.
M ◦ W = R (4)
[ m1,1 · · · m1,M ]   [ w1,AZT · · · w1,3TC ]   [ r1,AZT · · · r1,3TC ]
[  ...  ...  ...  ] ◦ [  ...    ...    ...  ] = [  ...    ...    ...  ]
[ mN,1 · · · mN,M ]   [ wM,AZT · · · wM,3TC ]   [ rN,AZT · · · rN,3TC ]
Define furthermore another weight matrix W′, in which there are the de-
grees of susceptibility (i.e. how much a mutation increases the power of a
drug), obtaining a result matrix S:
M ◦ W′ = S (5)
The loss compares the phenotype predictions derived from these compositions with the measured phenotypes P:
L(f(R = M ◦ W, S = M ◦ W′), P) = L(P̂, P)
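A common choice for the composition ∘ in (4) and (5) is the max-min composition; the sketch below uses it with made-up membership degrees and weights (the actual operator and weight values are what the optimization has to determine).

```python
import numpy as np

def max_min(M, W):
    """Max-min composition: result[n, d] = max over m of min(M[n, m], W[m, d])."""
    N, D = M.shape[0], W.shape[1]
    return np.array([[np.max(np.minimum(M[n], W[:, d])) for d in range(D)]
                     for n in range(N)])

# 2 viral strains x 3 mutations (made-up membership degrees)
M = np.array([[1.0, 0.0, 0.6],
              [0.2, 0.9, 0.0]])
# 3 mutations x 2 drugs (made-up resistance weights; columns: AZT, 3TC)
W = np.array([[0.8, 0.1],
              [0.3, 0.9],
              [0.5, 0.4]])
R = max_min(M, W)
print(R)  # [[0.8 0.4] [0.3 0.9]]
```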
Facing in vivo drug resistance and viral outcome prediction is a harder task.
The best way probably would be to have a predetermined rule system and
optimize its parameters: the problem is that the rules have to be found too.
Relying on the existing rules would avoid the input space partition search
problem; however the optimization would be biased and would not permit
finding new associations. Moreover, data bases collecting in-vivo analyses are still
fragmented, small in size and incomplete. The following models are a com-
promise due to this lack of training data. Two methods will be presented: the
first one - already known in the literature - uses the in-vitro predictor to infer con-
clusions for in-vivo treatments; the second is an extension of the in-vitro fuzzy
relational predictor introduced above, modified to capture in-vivo dynamics, but
it does not need in-vitro knowledge and results.
Existing Model
This model, proposed by Beerenwinkel in [7], is one of the few complete and
well documented studies in the literature that can be compared and/or integrated
with the system presented in the previous subsection. It is worth citing here
because it is not a standard machine learning technique and has many points
in common with the modelling assumptions of the fuzzy system.
The author assumes that in vitro genotype→phenotype predictions are a
reliable starting point for predicting in vivo cART efficacies. This is an arbitrary
hypothesis, and statistical studies between phenotype and clinical response
still do not give precise confidences today, but this application nevertheless
revealed promising results.
Assume a genotype→phenotype in vitro predictor is available: for instance
the fuzzy one described above, a Linear Regressor or a Support Vector Machine
regressor (the latter was the one chosen in [7]).
Analysis of phenotype predictions (among naive and treated patients)
shows large differences in range, location and deviation across models, but in
general reveals a bimodal nature of the distributions (resistance and susceptibil-
ity) across the whole set of drugs (even though for some drugs less clearly), as
can be seen in Figure 1. Thus Beerenwinkel models the probability density of
predicted phenotypes (y) using a two-component gaussian mixture model for
each drug:
αφ(y, µ1, σ1) + (1 − α)φ(y, µ2, σ2)
where φ(y, µi, σi) is the density of the normal distribution i with mean µi and
standard deviation σi, and α is the mixing parameter. Parameters are estimated
by the Expectation Maximization (EM) algorithm.
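A minimal EM fit of such a two-component mixture, on synthetic bimodal data; the quantile-based initialization is an implementation choice, not necessarily the one used in [7].

```python
import numpy as np

def em_two_gaussians(y, iters=200):
    """EM for the 1-D mixture alpha*N(mu1, s1) + (1 - alpha)*N(mu2, s2)."""
    mu1, mu2 = np.quantile(y, 0.25), np.quantile(y, 0.75)  # init from quantiles
    s1 = s2 = y.std()
    alpha = 0.5
    pdf = lambda x, m, s: np.exp(-0.5 * ((x - m) / s) ** 2) / (s * np.sqrt(2 * np.pi))
    for _ in range(iters):
        w1 = alpha * pdf(y, mu1, s1)
        w2 = (1 - alpha) * pdf(y, mu2, s2)
        g = w1 / (w1 + w2)                              # E-step: responsibilities
        alpha = g.mean()                                # M-step: update parameters
        mu1 = (g * y).sum() / g.sum()
        mu2 = ((1 - g) * y).sum() / (1 - g).sum()
        s1 = np.sqrt((g * (y - mu1) ** 2).sum() / g.sum())
        s2 = np.sqrt(((1 - g) * (y - mu2) ** 2).sum() / (1 - g).sum())
    return alpha, mu1, s1, mu2, s2

# synthetic bimodal phenotypes: susceptible mode near 0, resistant mode near 4
rng = np.random.default_rng(1)
y = np.concatenate([rng.normal(0.0, 0.5, 300), rng.normal(4.0, 0.5, 700)])
alpha, mu1, s1, mu2, s2 = em_two_gaussians(y)
print(mu1, mu2)  # component means recovered near 0 and 4
```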
Then a log-likelihood ratio is defined to decide whether a given phenotype
was more likely belonging to the resistant or susceptible subpopulation:
l(y) = log [ Pr(res|y) / Pr(sus|y) ]   (6)
l(y) = log [ (Pr(y|res) Pr(res) / Pr(y)) / (Pr(y|sus) Pr(sus) / Pr(y)) ]   (7)
l(y) = log [ φ(y, µ1, σ1) / φ(y, µ2, σ2) ] + log [ Pr(res) / Pr(sus) ]   (8)
Then l(y) is approximated by its tangent l′(y) at its zero y0.
Finally the probability score ps is introduced as the logistic function of l′(y)
for a given genotype with respect to a drug:
ps = 1 / (1 + e^(−l′(y))) ≈ Pr(res|y)   (9)
A scoring function can then be defined as an estimate of the activity of a
treatment against a given viral strain:
activity(d, x) = 1 / (1 + e^(l′(fd(x))))   (10)
where d is a drug, x is a genotype sequence and yd = fd(x) is the Log
Fold Change phenotype (prediction) for sequence x and drug d, having
activity(d, x) ≈ Pr(sus|yd).
To calculate a score that takes drug combinations into account, it is im-
portant to note that treatments with drugs coming from different drug classes
(at present there are the NRTi, NNRTi, PRi and Fi classes) benefit from synergic
effects, while combinations restricted to a single drug class are in general less
potent. Thus the score is taken as the maximum over drugs in the same class
and is additive across classes: for a mono-class restricted drug combination
Ci = {d1 . . . dn} ⊂ Di, where Di is the set of all available drugs in class i,
the overall score of a cART C composed of the class-restricted combinations Ck is
activity(C, x) = Σk activity(Ck, x)   (12)
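The whole scoring scheme (logistic single-drug activity from the linearized log-odds l′, maximum within a drug class, additive across classes) can be sketched as follows; the l′ values and the class grouping are made up.

```python
from math import exp

def activity_drug(l_prime):
    """Single-drug activity from the linearized log-likelihood ratio l'(f_d(x)), equation (10)."""
    return 1.0 / (1.0 + exp(l_prime))  # ~ Pr(sus | predicted phenotype)

def activity_cart(l_by_class):
    """Combination score: maximum within each drug class, additive across classes."""
    return sum(max(activity_drug(l) for l in ls) for ls in l_by_class.values())

# hypothetical l' values per drug, grouped by drug class
l_by_class = {"NRTi": [-2.0, 1.0], "PRi": [0.0]}
print(activity_cart(l_by_class))  # best NRTi activity plus the PRi activity
```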
progression was significantly shifted in time: in fact the more drugs one
keeps, the fewer chances the virus has to develop resistance, because it is
attacked at different replicative steps (depending on the drug class) and must
select cross-resistance mutations while also being in low concentrations.
Roughly speaking, drugs have different power: an old drug like Zidovudine
(AZT) is able to bring the viral load of a naive patient (i.e. with wild type virus)
down by 1 Log from the baseline after three months of single-drug therapy, while
Lopinavir (LPV) can provide a 2 Log reduction. Drug power is closely related
to pharmacodynamics and pharmacokinetics, but no precise values can be
obtained: think about Saquinavir (SQV), a Protease inhibitor that shows high
viral load reduction in vitro, but is badly absorbed by the human body; its power
however increases when taken together with small doses of Ritonavir (RTV).
A combined therapy of AZT+LPV for a naive patient will not lead to a
simple 1 + 2 Log reduction, but rather to a slightly lower value.
The goal is to obtain a unique indicator of cART activity taking into
account genotypic resistance and susceptibility, drug powers and combination
effects.
This leads again to a fuzzy formula:
aD = ∗d∈D ((¬rd ∧ pd) ∨ sd)   (13)
where aD is the overall activity for the cART D, d is a drug included in the
cART, rd is the viral genotypic resistance to drug d (its negation has the
meaning of an efficacy), pd is the power of the single drug d and sd is the viral
genotypic susceptibility to drug d. Clearly r and s are given by the above
matrix compositions. The power is coupled first with the efficacy and then
with the susceptibility, in order to take into account the fact that a drug can
be more powerful (thus more efficient) than usual in the presence of hyper-
susceptibility mutations.
The ∧ and ∨ operators are the algebraic norm and conorm, while the
aggregation ∗ is the Hamacher conorm:
⊥Hamacher(a, b, γ) = (a + b − ab − (1 − γ)ab) / (1 − (1 − γ)ab)   (14)
where a, b are two generic single-drug activities and γ > 0 is a parameter that
takes into account drug synergies (and if γ < 1 then ⊥Hamacher < ⊥Algebraic ).
The γ and pd parameters were not estimated, but provided by physicians and
virologists according to their experimental evidence.
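Equations (13) and (14) can be sketched as follows; iterating the conorm pairwise over the drugs is an implementation assumption, and γ = 1 recovers the algebraic conorm.

```python
def hamacher_conorm(a, b, gamma):
    """Hamacher conorm of equation (14); gamma > 0 tunes drug synergy."""
    return (a + b - a * b - (1 - gamma) * a * b) / (1 - (1 - gamma) * a * b)

def cart_activity(single_drug_activities, gamma=1.0):
    """Aggregate the per-drug terms ((not r ∧ p) ∨ s) of equation (13) with the Hamacher conorm."""
    acc = 0.0
    for a in single_drug_activities:
        acc = hamacher_conorm(acc, a, gamma)
    return acc

print(cart_activity([0.6, 0.5]))  # gamma = 1 reduces to the algebraic sum: 0.8
```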
Now there is a unique indicator in [0,1]. The next step is to transform it (as
was done for the phenotype prediction) into a Viral Load value: help comes
from [19], in which a system of differential equations - parameterized by drug
activities - models HIV-1 replication in the human body. This again permits
defining a loss function and optimizing it.
Fig. 2. Schematic summary of the dynamics of HIV-1 infection in vivo: shown
in the center is the cell-free virion population sampled in the plasma - image taken
from [46]
HIV-1 replication in the human body follows the process described in figure 2.
Drugs target different replication steps, as can be seen in figure 3. Several
mathematical models have been proposed; the one coming from Perelson [19] is:
dT/dt = s + pT(1 − T/Tmax) − dT T − kVi T   (15)
dT*/dt = (1 − ηRT) kVi T − δT*   (16)
dVi/dt = (1 − ηPR) NδT* − cVi   (17)
dVni/dt = ηPR NδT* − cVni   (18)
where T are uninfected CD4+ T cells, T* are infected CD4+ T cells, Vi
and Vni are infectious and non-infectious virions respectively, and the ηi are
the drug class efficacies. Cytotoxic response (CD8), latent and long-lived cells
are not included in the model, nor is the presence of multiple viral strains.
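Equations (15)-(18) can be integrated numerically with a simple forward-Euler scheme; all parameter values below are illustrative placeholders, not the estimates of [19].

```python
def simulate(eta_rt, eta_pr, days=30.0, dt=0.01):
    """Forward-Euler integration of equations (15)-(18); parameter values are illustrative only."""
    s, p, Tmax, d_T = 10.0, 0.03, 1500.0, 0.01   # CD4+ source, growth, carrying capacity, death
    k, delta, N, c = 2.4e-5, 0.5, 1000.0, 3.0    # infection rate, infected-cell death, burst size, clearance
    T, Ts, Vi, Vni = 1000.0, 10.0, 1e4, 0.0      # T, T*, infectious and non-infectious virions
    for _ in range(int(days / dt)):
        dT = s + p * T * (1 - T / Tmax) - d_T * T - k * Vi * T
        dTs = (1 - eta_rt) * k * Vi * T - delta * Ts
        dVi = (1 - eta_pr) * N * delta * Ts - c * Vi
        dVni = eta_pr * N * delta * Ts - c * Vni
        T, Ts, Vi, Vni = T + dt * dT, Ts + dt * dTs, Vi + dt * dVi, Vni + dt * dVni
    return T, Ts, Vi, Vni

# a fully effective protease inhibitor (eta_PR = 1) makes all new virions
# non-infectious, so the infectious pool Vi decays at the clearance rate c
T, Ts, Vi, Vni = simulate(eta_rt=0.0, eta_pr=1.0)
print(T, Ts, Vi, Vni)
```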
Unfortunately, information about T cells is too often missing. A simplified
model is:
dV/dt = (1 − α)cVeq − cV   (19)
where α is the overall cART activity, c is the viral clearance rate (corresponding
to a clearance time of about three days) and Veq is the Steady State Viral Load.
Solution of this equation is V(t) = (1 − α)Veq + (V(0) − (1 − α)Veq) e^(−ct), i.e. the viral load decays exponentially towards the new steady state (1 − α)Veq.
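The exponential solution of (19) can be checked against a direct numerical integration; taking c = 1/3 per day is an assumption matching the "about three days" clearance quoted above.

```python
from math import exp

def viral_load(t, V0, Veq, alpha, c=1.0 / 3.0):
    """Closed-form solution of dV/dt = (1 - alpha)*c*Veq - c*V (equation 19)."""
    Vinf = (1 - alpha) * Veq  # new steady state under the cART
    return Vinf + (V0 - Vinf) * exp(-c * t)

# forward-Euler check over 30 days
V, dt, alpha, Veq, c = 1e5, 0.001, 0.9, 1e5, 1.0 / 3.0
for _ in range(int(30 / dt)):
    V += dt * ((1 - alpha) * c * Veq - c * V)
print(V, viral_load(30.0, 1e5, 1e5, 0.9))  # numerical and analytic values agree closely
```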
5 Optimization Techniques
Simple optimization algorithms are often limited to regular convex func-
tions. Actually, most real problems lead to facing multi-modal, discontinuous,
non-differentiable functions. To optimize such functions, traditional
techniques use gradient-based algorithms [4], while newer approaches rely on
stochastic mechanisms: the latter base the search for the next point on
stochastic decision rules, rather than on deterministic processes.
Genetic Algorithms, Simulated Annealing and Random Searches (see again
[4]) are among these, and often are used either when the problems are difficult
to define, or when "comfortable" properties - such as differentiability - are missing.
There are two possible ways to integrate Fuzzy Logic and Genetic Algorithms.
The first involves the application of GAs to solve optimization and search
problems concerning fuzzy sets and rule systems; the second uses fuzzy tools
to model different components of a GA in order to improve its computational
performance. In this chapter both ways are discussed: a GA is used to mine
fuzzy rules, and the same GA is optimized through fuzzy tools.
Fuzzy Logic can be integrated into GAs at several levels:
Chromosome Representation: classical binary or real-valued representations can
be generalized to fuzzy sets and membership functions.
Crossover Operators: fuzzy operators can be used to design crossover operators
able to guarantee an adequate, parameterizable level of diversity in the
population, in order to avoid premature convergence.
Evaluation Criteria: uncertainty, vagueness, belief and plausibility measures
can be introduced to define more powerful fitness functions.
F, S, M Crossover Operators
E ≤ A ≤ Z ≤ ΥA ≤ ΥE ≤ ΥZ ≤ ⊥Z ≤ ⊥A ≤ ⊥E
Table 3. F family
Table 4. S family
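The operator definitions in Tables 3 and 4 are not reproduced here; as a hedged sketch, the F, S and M families can be read as gene-wise application of a t-norm, a t-conorm and an averaging function, respectively, to real-coded parents in [0, 1], in the spirit of the fuzzy-connective crossover of [17]. The resulting offspring are then ordered gene-wise, F-children ≤ M-children ≤ S-children:

```python
# Gene-wise fuzzy-connective crossover for chromosomes in [0, 1].
# F family: a t-norm (here the algebraic product); S family: the dual
# t-conorm (probabilistic sum); M family: a parameterized average.

def f_cross(x, y):
    return [a * b for a, b in zip(x, y)]

def s_cross(x, y):
    return [a + b - a * b for a, b in zip(x, y)]

def m_cross(x, y, lam=0.5):
    return [lam * a + (1 - lam) * b for a, b in zip(x, y)]

p1, p2 = [0.2, 0.8, 0.5], [0.6, 0.4, 0.5]
f_child, m_child, s_child = f_cross(p1, p2), m_cross(p1, p2), s_cross(p1, p2)
```

Using all three families together yields offspring spread between an "exploitative" child (F) and an "explorative" one (S), which is one way such operators control diversity in the population.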
Other approaches to crossover operators that are worth citing have been
presented: in [16], Soft Genetic Operators are introduced, while in [15]
Sanchez proposes the Template Fuzzy Crossover.
FGAs have proven more efficient than GAs in several scenarios. Performance
tests on the minimization of non-linear and non-differentiable problems for
GAs and FGAs are shown in [4, 16, 17].
In GAs, when the mutation rate is high (above 0.1), performance approaches
that of a primitive random search. The advantage of using such an optimization
algorithm is that it remains derivative-free and does not need to keep alive a
large population of solutions: just one is iteratively modified and evaluated.
As described in [4], Random Search explores the parameter space of an
objective function by sequentially adding random values to the solution vector
in order to find an optimal point that minimizes or maximizes the objective
function. Despite its simplicity, it has been proved to converge to the global
optimum.
Let f(x) be an objective function to be minimized and x the current vector
point. The basic algorithm iterates the following steps:
1. x is the current starting point
2. add a random vector dx to x and evaluate f (x + dx)
3. IF f (x + dx) < f (x) THEN set x = x + dx
4. IF (optimal f (x) or maximum number of iterations is reached) THEN
stop ELSE go to 2
Improvements to this primitive version are suggested in [4].
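The four steps above translate almost directly into code; the objective used below is a hypothetical non-differentiable test function, not one from the chapter:

```python
import random

def random_search(f, x0, step=0.5, iters=2000, seed=42):
    """Basic Random Search (steps 1-4 above): perturb x with a random
    vector dx and keep the move whenever it improves the objective."""
    rng = random.Random(seed)
    x, fx = list(x0), f(x0)
    for _ in range(iters):
        cand = [xi + rng.uniform(-step, step) for xi in x]
        fc = f(cand)
        if fc < fx:  # step 3: accept only improving moves
            x, fx = cand, fc
    return x, fx

# A non-differentiable test objective: f(x) = sum_i |x_i - 1|
best, fbest = random_search(lambda v: sum(abs(vi - 1) for vi in v), [5.0, -3.0])
```

Note that only an evaluation of f is needed, never a gradient, which is exactly why the method applies to discontinuous loss functions.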
5.3 Implementation
In the previous sections the models for in-vitro and in-vivo prediction were
defined. The aim of this section is the estimation of their parameters. For
each model there is a loss function to be minimized:
\mathcal{L}\big(f(R = M \circ W,\; S = M \circ \bar{W}),\, P\big) = \mathcal{L}(\hat{P}, P)
For the in-vitro model, f = f(R, S) = f(W, W̄) and p̂_{i,d} = tanh(r_{i,d} − s_{i,d}).
The same holds for the in-vivo model: there are the relational compositions
M ∘ W = R and M ∘ W̄ = S; then R and S are combined through equation 13 (to
get the overall cART activity α) and through equation 21 to get Viral Loads.
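As a sketch of the relational machinery, the classical sup-min (max-min) composition can combine a mutation relation M (genotypes by mutations) with a weight relation W (mutations by drugs); the actual composition operator used in the chapter is the one fixed in section 4.2, and the matrices below are invented toy data:

```python
# Sup-min (max-min) relational composition R = M ∘ W.

def max_min_compose(M, W):
    rows, inner, cols = len(M), len(W), len(W[0])
    return [[max(min(M[i][k], W[k][j]) for k in range(inner))
             for j in range(cols)] for i in range(rows)]

M = [[1.0, 0.0, 1.0],   # two genotypes over three mutations (toy data)
     [0.0, 1.0, 0.0]]
W = [[0.9, 0.1],        # per-mutation weights for two drugs (toy data)
     [0.3, 0.7],
     [0.2, 0.5]]
R = max_min_compose(M, W)   # genotype-by-drug relation
```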
The parameters to be estimated are, for both models, the values in the
relations W and W̄. The idea is to use either an FGA or an RS to minimize the
loss function (embedded in the fitness function) and to compare the different
approximate solutions obtained across runs. Unlike a derivative-based approach
(for instance Gradient Descent), in which the solutions can get stuck in local
optima, an evolutionary algorithm can in this scenario explore a wider set of
solutions with fewer constraints, while still keeping computational times
reasonable.
To define the chromosome coding, note that the W and W̄ relations are matrices
of real numbers in [0, 1]: these matrices can be used directly as chromosomes,
making it possible to apply fuzzy crossover operators directly; the FGA thus
searches for solutions within a population of weight matrices.
The loss functions described above tend to minimize the error or maximize the
correlation between observed and predicted vectors. They do not account for
the number of parameters used. In usual engineering scenarios the parameters
to be optimized are few and related to significant variables such as position,
speed and acceleration; in this biological framework, instead, there is a huge
number of variables (all the mutations in the viral genotype) and a
correspondingly large parameter space. Many of the input variables may not be
significant for the model, and many parameters mean that the system can easily
be overfitted. Feature Selection is closely related to Occam's principle,
probabilistically interpreted as the Minimum Description Length (MDL)
principle (see [6]), by which models that use fewer parameters are preferred
when prediction performance is equal. This is useful when dealing with
high-dimensional data sets, where many input attributes may be irrelevant or
redundant with respect to the dependent variables and act just as noise. By
allowing learning algorithms to focus only on highly predictive variables,
their accuracy can even be improved.
Feature Selection methods can be classified into two groups: Filter and
Wrapper methods (see again [6]). Filter methods usually rank each attribute
individually by some quality criterion (for example the p-value of a t-test,
mutual information values, et cetera) and then select the subset of
best-ranked attributes. Wrapper methods evaluate the performance of the
learning algorithm on different subsets of the attributes: an exhaustive
search, requiring 2^n variable subsets to be evaluated, is clearly not
feasible, so search algorithms use (greedy) heuristics to steer towards
promising subsets. One such search method is the Genetic Algorithm.
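A minimal sketch of one such greedy wrapper heuristic is forward selection: grow the subset one attribute at a time, keeping the addition that most improves the score. The scoring function and mutation names below are invented placeholders, not the chapter's evaluation procedure:

```python
def forward_select(features, score, max_size=3):
    """Greedy forward wrapper: add the attribute with the best score gain,
    stopping when no addition improves the (placeholder) score."""
    selected, best = [], score([])
    while len(selected) < max_size:
        gains = [(score(selected + [f]), f) for f in features if f not in selected]
        if not gains:
            break
        top, f = max(gains)
        if top <= best:
            break
        selected.append(f)
        best = top
    return selected, best

# Toy score: rewards a hypothetical "informative" pair, penalizes subset size
informative = {"m41L", "m184V"}
toy_score = lambda s: len(informative & set(s)) - 0.1 * len(s)
subset, s = forward_select(["m41L", "m70R", "m184V", "m215Y"], toy_score)
```

This evaluates O(n · k) subsets for a subset of size k instead of the 2^n of exhaustive search, at the price of possibly missing interacting attribute groups.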
The idea of Fuzzy Feature Selection arises from the AIC definition, extended
in fuzzy terms. While the AIC formula is fixed and selects variables only on
the basis of their statistical significance, families of parameterized fuzzy
functions taking into account both the number of parameters and the loss
function can be designed, in order to decide more flexibly how simple the
model has to be (i.e. how many parameters are included) jointly with its
goodness of fit. The fuzzy formulae are set up to select models with a high
squared correlation ρ² (or a low (Mean) Squared Error SE) and few parameters:
where v is related to the number of active parameters. The fuzzy set for v
can be defined as a parameterized function of the variable weights: the
closer they are to zero, the better, because then they do not participate in
the model. Excluding non-interactive operators (like min or the drastic
product), every fuzzy t-norm is admissible: the algebraic product was the
choice for this application. Two simple examples of membership functions are:
\mu_v(w) = 1 - \frac{1}{M} \sum_{i=1}^{M} e^{-\frac{w_i^2}{2\sigma^2}} \quad (23)

\mu_v(w) = 1 - \frac{\left|\{ w \in W : |w| < \sigma \}\right|}{M} \quad (24)
where | · | denotes the cardinality of the set. A reasonable value for σ is
0.01, as used in the application. The first function is smoother, while the
second cuts variables regardless of their weight, simply fixing a bound. For
SE, an analogous definition holds:
\mu_{Error}(SE) = 1 - \frac{1}{N} \sum_{i=1}^{N} e^{-\frac{(\hat{x}_i - x_i)^2}{2\sigma^2}} \quad (25)
where x̂_i and x_i are the predicted and observed values respectively; here σ
can be set to 0.5 (a choice that arose from the variability of plasma analyses
within ±0.5 Log). Finally, ρ² is by itself a goodness-of-fit indicator defined
in [0, 1]. The results in section 6 will show the advantages gained with the
Fuzzy Feature Selection.
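The membership functions (23)-(25) implement directly, with the σ values quoted in the text:

```python
import math

def mu_v_smooth(w, sigma=0.01):
    # eq. (23): one minus the mean Gaussian bump around zero
    return 1 - sum(math.exp(-wi ** 2 / (2 * sigma ** 2)) for wi in w) / len(w)

def mu_v_count(w, sigma=0.01):
    # eq. (24): one minus the fraction of weights inside the band |w| < sigma
    return 1 - sum(1 for wi in w if abs(wi) < sigma) / len(w)

def mu_error(residuals, sigma=0.5):
    # eq. (25) with residuals r_i = x_hat_i - x_i
    return 1 - sum(math.exp(-r ** 2 / (2 * sigma ** 2)) for r in residuals) / len(residuals)
```

Both parsimony memberships evaluate to 0 when all weights are (near) zero and approach 1 as more weights become active, so combining them with the error membership through a t-norm penalizes large, badly fitting models.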
6 Application
6.1 Phenotype Prediction
Genotype/phenotype pairs were collected from the public Stanford data base
[37] and from the VIRCO Laboratories [38]. Available data set sizes ranged
from 700 to 1000 pairs for the whole set of drugs {AZT, DDI, DDC, 3TC, TDF,
NVP, ABC, NFV, EFV, SQV, IDV, LPV, APV, D4T, DLV, RTV}, except for {TPV, ATV,
FTC}, for which sizes were in the order of 70 to 300. Data sets were split
into training (90%) and validation (10%) parts in order to assess the
robustness of the results. Viral nucleotide sequences were aligned to the
consensus B wild-type viral reference strain with a global alignment algorithm
(CLUSTALW), using high gap penalties for insertions and deletions. Mutations
were then extracted, also handling ambiguous sequencing.
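The global alignment step can be sketched with a minimal Needleman-Wunsch scoring recursion, the dynamic-programming family CLUSTALW builds on; the short amino-acid strings and the scoring parameters below are illustrative, not the chapter's actual settings:

```python
# Minimal Needleman-Wunsch global alignment score, with a high gap
# penalty as described in the text.

def global_align(a, b, match=2, mismatch=-1, gap=-5):
    n, m = len(a), len(b)
    # F[i][j]: best score aligning a[:i] with b[:j]
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = i * gap
    for j in range(1, m + 1):
        F[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s,   # (mis)match
                          F[i - 1][j] + gap,     # deletion in b
                          F[i][j - 1] + gap)     # insertion in b
    return F[n][m]

score = global_align("MKVLW", "MKLW")  # toy sequences
```

A high gap penalty, as used in the chapter, makes the aligner prefer substitutions over indels, which keeps mutation positions comparable to the reference strain.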
The M relation was filled according to the definition given in section 4.2.
All mutational positions were included in the system (550 in the RT gene and
330 in the PR gene), but only positions related to the corresponding drug
target were allowed to have a weight, i.e. only mutations in the Protease gene
were considered for Protease inhibitors, and likewise for Reverse
Transcriptase; this was done to respect the real biological mechanisms. The
Fuzzy Feature Selection function used in conjunction with either the FGA or
the RS was (Error is low) ∧ (v is low), as proposed in section 5.4.
Results
In order to assess performance, the Fuzzy System was trained and validated on
the whole set of commercial drugs for which phenotypic tests are available.
The system was then compared with a Linear Regressor, a literature standard
for genotype→phenotype prediction. Given the huge number of input variables,
the Linear Regressor was enhanced in two ways: first, a Singular Value
Decomposition (SVD) cut the quasi-collinear attributes; secondly, a stepwise
selection heuristic (starting from an input variable subset, variables are
added or removed according to the Akaike Information Criterion) was used to
reduce the number of variables. The performance indicator was the squared
linear correlation ρ² between the predicted and observed vectors, a widely
used measure in biology. While the validation performances of the three models
were not significantly different (Kruskal-Wallis rank sum test), i.e. the
models have the same predictive power, the Fuzzy Feature Selection method
selected a significantly lower number of input variables (p < 4 · 10⁻⁷,
Wilcoxon rank sum test). Table 5 summarizes the results. The SVD LR produced
robust models, but the interpretation of the weights is difficult, because
(even though cut in the decomposition) they are re-projected into the original
attribute space. The stepwise heuristic reduced the input attribute space, but
the resulting models possessed an order of magnitude more variables than the
Fuzzy engine. A simpler model has the advantage of being more understandable
and of pointing out features that can have real biological meaning. In fact,
for all the drugs, the Fuzzy model optimized with the Fuzzy Feature Selection
yielded a set of weights that matches medical hypotheses with high accuracy.
Table 6 shows estimated weights for three drugs that completely agree with the
list of resistance/susceptibility mutations approved by IAS/USA [36]. Note
that the Fuzzy Feature Selection function independently selected these among
more than 500 input variables. Usually Machine Learners are trained using only
the IAS/USA list.
Table 6. Estimated weights for AZT, IDV and EFV (EFV column shown; the AZT
and IDV columns are not reproduced here)

EFV
RT mutation   weight   effect
100I          0.85     resistance
101E          0.55     resistance
101P          0.9      resistance
101Q          0.5      resistance
103N          0.9      resistance
103S          0.8      resistance
108I          0.45     resistance
181C          0.35     resistance
184V          0.05     resistance
188L          0.95     resistance
190A          0.85     resistance
190S          0.95     resistance
221Y          0.3      resistance
225H          0.7      resistance
Fig. 4. Log Fold Change regression - Fuzzy Relational System + Feature Selection
- AZT Validation Set
Fig. 5. Log Fold Change regression - Fuzzy Relational System + Feature Selection
- ATV Validation Set
The data bases available were the five clinical trials {GART, HAVANA,
ARGENTA, ACTG 320, ACTG 364} and the retrospective cohort ARCA (taken from
[37, 39]): 1329 instances were selected, according to the following
constraints:
• Viral Load Equilibrium was the maximum viral load value ever observed
in a patient
Fig. 6. Log Fold Change regression - Fuzzy Relational System + Feature Selection
- SQV Validation Set
Fig. 7. Log Fold Change regression - Fuzzy Relational System + Feature Selection
- IDV Validation Set
• Baseline Viral Load had to be collected in the interval [−15, 7] days from
the therapy switch date
• Viral Genotype sequenced in the interval [−90, 30] days from therapy
switch date
• 12-Weeks Viral Load taken from 8 to 16 weeks after the therapy switch
date
For the clinical trials, the equilibrium viral load was not entirely reliable,
because patients were enrolled while already under treatment and little
information about each patient's history was available. Retrospective cohorts,
on the other hand, were often missing the baseline measure. Mutations were
extracted by aligning each patient's viral genotype with the consensus B
wild-type reference strain, as for the in-vitro tests. 12 ARVs were included
in the model: {AZT, 3TC, D4T, DDI, ABC, EFV, NVP, NFV, SQV, LPV, RTV, IDV}.
Data were split into training (90%) and validation (10%) sets. Furthermore, a
blind-validation set of 42 instances coming from different clinics was
considered.
Unlike the in-vitro scenario, in-vivo clinical data sets possess high
variability: this is due to patients' non-adherence, different drug absorption
levels, different boundary conditions (psychological state, co-infections,
...), and even systematic instrument errors and wrong data insertions.
Furthermore, Viral Loads (and CD4+ counts) are just a surrogate of the real
disease progression in the body, because they reflect only the viral strains
present in the peripheral blood (while the infection mostly acts in the
lymphatic tissues). Input attributes are chosen among a set of variables that
have been shown to be relevant in-vitro and in limited in-vivo statistical
tests, so they may not be the most predictive ones, or predictive at all. For
instance, an important piece of information such as the therapy history is
rarely recorded, and thus cannot be contemplated in a model. Before presenting
the results, it is useful to show the high variance in the follow-up Viral
Loads among observations that share the same input attributes. Figures 8 and 9
clearly illustrate the situation for two selected observation subsets: the
outcome distributions for patients that are wild-type (possess no mutations),
have the same Baseline Viral Load and take the same cARTs are almost flat.
Whatever Machine Learning technique is used, for such biased data the results
will be poor.
Fig. 8. 12-Weeks Viral Load Log distribution for 6 Wild Type patients under
AZT+3TC+SQV with a Baseline Viral Load of 5 Log
Fig. 9. 12-Weeks Viral Load Log distribution for 4 Wild Type patients under
AZT+3TC with a Baseline Viral Load of 4.75 Log
Results
The Fuzzy system was trained and validated under different parametric con-
ditions:
• zero-resistance/susceptibility model: a null model in which mutations are
assumed not to contribute to resistance or susceptibility (useful to compare
performances of the others)
• input variables taken from IAS/USA [36] list of relevant mutations (with
or without Fuzzy Feature Selection)
• input variables on the entire RT + PR genes (all mutational positions, all
amino acid substitutions) with Fuzzy Feature Selection
All the results are summarized in Table 8.

Fig. 10. In Vivo 12-Weeks Viral Load Log regression - Fuzzy model, no Feature
Selection - Mutations from IAS/USA - Validation Set - observed values on x
axis, predictions on y axis

The zero-resistance/susceptibility model, which assumes perfect inhibition
regardless of mutational profiles and relies just on drug powers and the
exponential Viral Load decrease, was the poorest predictor: the Fuzzy system
explains the data better. Figure 10 depicts the validation results (real
outcomes vs. predictions) after training the system on the IAS/USA list
without Feature Selection: the validation ρ² was poor, only 0.2488 (training
yielded ρ² = 0.36), and the weight matrices were quite unstable when executing
the algorithms several times with different starting points. Note that the
perfectly aligned bunch of points is due to saturation at the undetectability
threshold. Results improved when using the Fuzzy Feature Selection with the
IAS/USA list: Figures 11, 12 and 13 show the training and validation
performances (the latter also using the blind-validation set coming from the
different clinics), for which ρ² was always above 0.37. Different runs with
perturbed starting points yielded slightly different weight matrices and
included variables, but with low variability. Furthermore, the weights often
resembled
Fig. 11. In Vivo 12-Weeks Viral Load Log regression - Fuzzy model + Feature
Selection - Mutations from IAS/USA - Training Set - observed values on x axis,
predictions on y axis
Fig. 12. In Vivo 12-Weeks Viral Load Log regression - Fuzzy model + Feature
Selection - Mutations from IAS/USA - Validation set - observed values on x
axis, predictions on y axis
Fig. 13. In Vivo 12-Weeks Viral Load Log regression - Fuzzy model + Feature
Selection - Mutations from IAS/USA - Validation set from different clinics -
observed values on x axis, predictions on y axis
Table 9. Weight Matrices for the Fuzzy System. NRT stands for mutations in
Reverse Transcriptase targeted by Nucleoside/Nucleotide analogues, NNRT for
Non-Nucleoside RT inhibitors, PR for Protease inhibitors. Weights were forced
to zero for mutations in regions not targeted by the corresponding drug class,
in order to resemble physical behaviors.
medical hypotheses. Table 9 reports a small set of weights (amino acid
substitutions are not shown for clarity) that achieves ρ² = 0.39 in
validation; the emphasized terms disagree with medical hypotheses. Differently
from the phenotype prediction, the FGA did not manage to improve upon and
quickly escape from the zero-resistance/susceptibility model, remaining stuck
in this local optimum, while the RS was able to find better solutions in a
shorter time. The last test was made using the complete mutational regions in
RT and PR, forcing only mutations in RT not to interact with PR drugs and
vice-versa, and relying on the Fuzzy Feature Selection function for the
feature selection. The parameter search space was huge: around 400 mutations
and 3000 weights to be estimated (not counting amino acid substitutions, and
with just a thousand training examples). Training performance was optimal,
yielding ρ² = 0.6672, but the validation results did not improve, yielding
ρ² = 0.3553. The system was obviously over-parameterized and, even though the
Feature Selection was used, the weights in the relational matrices were
unstable, changing their values largely across different executions of the FGA
and RS algorithms. A final comparison was made with Beerenwinkel's model
described in section 4.3, which uses an in-vitro predictor: this system was
tested on a set of 96 therapy switch episodes (not coming from the sets used
here), using the overall activity score as a predictor of the 4-weeks
(28 ± 10 days) virological response (a very short-term response). Linear
least-squares regression analysis gave ρ² = 0.368. The validation settings
being different, it is not possible to compare the models directly: however,
since this model was trained with in-vitro data, it is at least an indication
that the two experimental settings are related.
6.3 Conclusions
The Fuzzy Relational system has been shown to be accurate, robust and compact
for in-vitro prediction: differently from other models such as Linear
Regressors or SVMs, it has the advantage of providing a meaningful explanation
of biological mechanisms and, joined with the Fuzzy Feature Selection
function, it selects the best models according to Occam's principle. Its
extension to in-vivo cART optimization and Viral Load prediction gives
encouraging results while still providing a compact model: moreover, its
derivation from the in-vitro framework and the comparison with Beerenwinkel's
model emphasize the relationship between in-vitro tests and in-vivo
treatments. However, for the purpose of pure therapy optimization, the
correlation results are still not satisfactory: the variance estimation, in
any case, shows how biased the clinical data are, due either to limited
attribute recording or to intrinsic variability in the human body.
The Fuzzy Relational system described in this chapter is designed for a
limited-data scenario. For instance, the differential equation system that
models viral reproduction had to be simplified by ignoring the contribution of
CD4+ T cells, because they were too often missing in the data bases. Mutations
were treated separately, because a preliminary clustering did not work due to
the large number of therapy combinations compared to the small number of
training instances. Therapeutic history, which could have a crucial role in
the learning process, is missing as well. However, the public availability of
clinical data bases is increasing today, as confirmed by the EuResist data
base [40] (a European project that aims to integrate several clinical data
bases on HIV and build a treatment decision system), as is the quality and
additional attribute recording in the data sets. In this perspective, it is
possible to design more complex models, while still aiming to model biological
mechanisms meaningfully and to handle uncertainty and vagueness. The Fuzzy
Relational system can be modified and extended to produce a rule set capable
of inferring predictions, at least eliminating the noise produced by
(previously) unseen significant attributes. An appropriate rule base for an
in-vivo HIV resistance/susceptibility FIS (Mamdani, ...), which still waits to
be trained with a sufficient amount of data, is defined in Table 10.
7 Acknowledgements
We want to thank the physician Andrea De Luca (Institute of Clinical
Infectious Diseases - Catholic University of Rome - UCSC) and the virologist
Maurizio Zazzi (University of Siena, Italy), who actively collaborated in this
study. Furthermore, we want to thank the ARCA consortium [39], which provided
the in-vivo retrospective data sets, Stanford University [37] for the in-vivo
clinical trials, and VIRCO [38] for the in-vitro genotype/phenotype data sets.
References
1. Klir G (1988) Fuzzy sets, uncertainty and information. Prentice-Hall, Englewood
Cliffs, NJ
2. Bandemer H (1992) Fuzzy data analysis. Kluwer Academic, Dordrecht
3. Michalewicz Z (1994) Genetic algorithms + data structures = evolution pro-
grams. AI Series. Springer, Berlin Heidelberg New York
4. Jang JSR, Sun CT, Mizutani E (1997) Neuro-fuzzy and soft computing. Prentice
Hall, Englewood Cliffs, NJ
5. Brunak S, Baldi P (2001) Bioinformatics: the machine learning approach. MIT,
Cambridge, MA
6. Witten IH, Frank E (2005) Data mining: practical machine learning tools and
techniques. Morgan Kaufmann, Los Altos, CA
7. Beerenwinkel N (2003) Computational analysis of HIV drug resistant data. PhD
Thesis, MPS-MPI for Informatics, University of Saarland, Saarbruecken, Ger-
many
8. Sanchez E (1977) Solutions in composite fuzzy relation equations. In: Gupta,
Saridis, Gaines (eds) Fuzzy automata and decision processes. North-Holland,
New York, pp 221–234
9. Sanchez E (1979) Medical diagnosis and composite fuzzy relations. In: Gupta,
Ragade, Yager (eds) Advances in fuzzy set theory and applications. North-
Holland, New York, pp 437–444
10. Adlassnig K, Kolarz G (1982) CADIAG-2: Computer-assisted medical diagnosis
using fuzzy subsets. In: Gupta, Sanchez (eds) Approximate reasoning in decision
analysis. North-Holland, New York, pp 203–217, 219–247
11. Zadeh L (1965) Fuzzy sets. Inf Control 8:338–353
12. Sanchez E (1984) Solution of fuzzy equations with extended operations. Fuzzy
Sets Syst 12:237–248
13. Mizumoto M (1989) Pictorial representations of fuzzy connectives, part II: cases
of compensatory operators and self-dual operators. Fuzzy Sets Syst 32:45–79
14. Pedrycz W (1993) Fuzzy relational equations. Fuzzy Sets Syst 59:189–195
15. Sanchez E (1993) Fuzzy genetic algorithms in soft computing environment.
Invited Plenary Lecture at the Fifth IFSA World Congress, Seoul
16. Voigt HM (1995) Soft genetic operators in evolutionary algorithms. In: Banzhaf
W, Eeckman FH (eds). Evolution and biocomputation. Lecture notes in com-
puter science, vol 899. Springer, Berlin Heidelberg New York, pp 123–121
17. Herrera F (1996) Dynamic and heuristic fuzzy connective based crossover oper-
ators for controlling the diversity and the convergence of real coded algorithms.
Int J Intell Syst 11:1013–1041
18. Lathrop R, Pazzani MJ (1999) Combinatorial optimization in rapidly mutating
drug-resistant viruses. J Combinatorial Optimiz 3:301–320
19. Perelson AS, Nelson PW (1999) Mathematical analysis of HIV-1 dynamics in
vivo. SIAM Rev 41:3–44
20. Beerenwinkel N, Schmidt B, Walter H, Kaiser R, Lengauer T, Hoffmann D,
Korn K, Selbig J (2001) Geno2pheno: interpreting genotypic HIV drug resistance
tests. IEEE Intell Syst Biol 16(6):35–41
1 Introduction
Evolutionary algorithms (EAs) are models based on the natural evolution of
individuals in a defined environment. They are especially useful for complex
optimization problems where the number of parameters is large and analytical
solutions are difficult to obtain.
Reference and survey texts in the field of evolutionary algorithms are
[8, 36], and recent advances in evolutionary computation are described in
[60]; these approaches attract increasing interest from both academia and
industry.
EAs can help to find the globally optimal solution over a domain, although
convergence to the global optimum may only be guaranteed in probability,
provided certain rather mild assumptions are met [1]. They have been applied
in different areas such as fuzzy control, path planning, modeling and
classification, etc. Their strength is essentially due to their updating a
whole population of possible solutions at each iteration; this is equivalent
to carrying out parallel explorations of the overall search space of a
problem.
New evolving systems, neuro-genetic systems, have become a very impor-
tant topic of study in evolutionary computation. As indicated by Yao et al.
A. Azzini and A.G.B. Tettamanzi: A New Genetic Approach for Neural Network Design, Studies
in Computational Intelligence (SCI) 82, 289–323 (2008)
www.springerlink.com c Springer-Verlag Berlin Heidelberg 2008
2 Evolving ANNs
There are several approaches to evolving ANNs and EAs are used to perform
various tasks, such as connection weight training, architecture design, learning
rule adaptation, connection weight initialization, rule extraction from ANNs,
etc. Three of them are considered as the most popular approaches at these
levels:
• Connection weights: this approach concentrates on weight optimization alone,
assuming that the architecture of the network is static. The evolution of
connection weights introduces an adaptive and global approach to training,
especially in the reinforcement learning and recurrent network learning
paradigms, where gradient-based training algorithms often experience great
difficulties.
• Learning rules: this approach can be regarded as a process of learning to
learn in ANNs, where the adaptation of learning rules is achieved through
evolution. It can also be regarded as an adaptive process of automatic
discovery of novel learning rules.
Weight Optimization
Several standard learning rules govern the speed and the accuracy with which
the network is trained and, when there is little knowledge about the most
suitable architecture for a given problem, the possibly dynamic adaptation of
the learning rules becomes very useful. Examples of such parameters are the
learning rate and the momentum, which can be difficult to assign by hand and
therefore become good candidates for evolutionary adaptation.
One of the first studies in this field was conducted by Chalmers [13]. The
aim of his work was to see if the well-known delta rule, or a fitter variant, could
be evolved automatically by a genetic algorithm. With a suitable chromosome
encoding and using a number of linearly separable mappings for training,
Chalmers was able to evolve a rule analogous to the delta rule, as well as
some of its variants. Although this study was limited to somewhat constrained
network and parameter spaces, it paved the way for further progress.
Several studies have been carried out in this direction: Merelo and
colleagues [35] present a search for the optimal learning parameters of
multilayer competitive-learning neural networks. Another work, based on
simulated annealing, is proposed by Castillo et al. in [12].
Architecture Optimization
There are two major ways in which EAs have been used to search for network
topologies: either all aspects of a network architecture are encoded into an
individual, or a compressed description of the network is evolved. The first
case defines a direct encoding, while the second leads to an indirect
encoding.
• Direct Encoding specifies each parameter of the neural network, and little
effort in decoding is required, since a direct transformation of genotypes
into phenotypes is defined. Several examples of this approach are shown in the
literature, as in [37, 56] and in [61], in which the direct encoding scheme is
used to represent ANN architectures and connection weights (including biases).
EP-Net [61] is based on evolutionary programming with several different
sophisticated mutation operators.
• Indirect Encoding requires a considerable effort for decoding the neural
network but, in some cases, the network can be pre-structured, using
restrictions to rule out undesirable architectures, which makes the search
space much smaller. Some sophisticated encoding methods are implemented based
on network parameter definitions. These parameters may represent the number of
layers, the size of the layers (i.e., the number of neurons in each layer),
the bias of each neuron and the connections among them. This kind of encoding
is an interesting idea that has been further pursued by other researchers,
such as Filho and colleagues [19] and Harp and colleagues [23]. Their method
is aimed at the choice of the architecture and connections, and uses a
representation which describes the main components of the networks, dividing
them into two classes, i.e., parameter and layer sections.
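A minimal sketch of a direct encoding follows: each potential connection of a fixed feed-forward net is one gene, a (connectivity bit, weight) pair, and decoding is a direct genotype-to-phenotype mapping. The 2-2-1 layer sizes and the genome values are invented for illustration:

```python
import math

SHAPE = [2, 2, 1]  # inputs, hidden units, outputs (illustrative)

def decode(genome):
    """Split the flat genome into per-layer matrices of (bit, weight) genes."""
    net, idx = [], 0
    for n_in, n_out in zip(SHAPE, SHAPE[1:]):
        layer = [[genome[idx + r * n_in + c] for c in range(n_in)]
                 for r in range(n_out)]
        idx += n_in * n_out
        net.append(layer)
    return net

def forward(net, x):
    """Evaluate the decoded network; a zero connectivity bit disables a link."""
    for layer in net:
        x = [math.tanh(sum(b * w * xi for (b, w), xi in zip(row, x)))
             for row in layer]
    return x

genome = [(1, 0.5), (0, 2.0), (1, -0.7), (1, 0.3),   # input -> hidden genes
          (1, 1.2), (1, -0.4)]                       # hidden -> output genes
out = forward(decode(genome), [1.0, -1.0])
```

Because the genome is a flat list, standard genetic operators can act on it directly, mutating connectivity bits (topology) and weight values independently.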
The design of an optimal NN architecture can be formulated as a search problem
in the architecture space, where each point represents an architecture. As
pointed out by Yao [59, 61, 62], given some performance (optimality) criteria
about architectures, e.g., minimum error, learning speed, lower complexity,
etc., the performance levels of all architectures form a surface in the design
space. Determining the optimal architecture design is then equivalent to
finding the highest point on this surface. There are several arguments which
make the case for using EAs to search for the best network topology [37, 53].
Pattern classification approaches [54] can also be found to design the network
structure, and constructive and destructive algorithms can be implemented
[59]. A constructive algorithm starts with a small network: hidden layers,
nodes, and connections are added to expand the network dynamically [61]. A
destructive algorithm starts with a large network: hidden layers, nodes, and
connections are then deleted to contract the network dynamically [41].
Stanley and Miikkulainen [53] presented a neuro-evolutionary method through
augmenting topologies (NEAT). This algorithm outperforms other solutions by
employing a principled method of crossover of different topologies, protecting
structural innovation using speciation, and incrementally growing from minimal
structure.
One of the most important forms of deception in ANN structure opti-
mization arises from the many-to-one and one-to-many mappings from
genotypes in the representation space to phenotypes in the evaluation space.
The existence of functionally equivalent networks with different encodings
makes the evolution inefficient. This problem is usually termed the
permutation problem [22] or the competing conventions problem [51]. It is clear
that the evolution of pure architectures has difficulties in evaluating fitness
accurately. As a result, the evolution would be very inefficient.
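The permutation problem can be demonstrated in a few lines: permuting the hidden neurons of a network changes the genotype but not the function computed (a minimal sketch; the shapes and the seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(3, 2))    # input -> hidden weights (3 hidden, 2 inputs)
W2 = rng.normal(size=(1, 3))    # hidden -> output weights

def forward(x, W1, W2):
    # One-hidden-layer network, biases omitted for brevity.
    return W2 @ np.tanh(W1 @ x)

perm = [2, 0, 1]                # reorder the hidden neurons
W1p, W2p = W1[perm, :], W2[:, perm]

x = np.array([0.5, -1.0])
# The two genotypes differ, yet the phenotypes compute the same function.
same = np.allclose(forward(x, W1, W2), forward(x, W1p, W2p))
```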
One solution to decrease the noisy fitness evaluation problem in ANNs struc-
ture optimization is to consider a one-to-one mapping between genotypes and
phenotypes of each individual. This is possible by considering a simultane-
ous evolution of the architecture and the network weights. The advantage of
combining these two basic elements of a NN is that a completely functioning
network can be evolved without any intervention by an expert.
Some methods that evolve both the network structure and connection
weights were proposed in the literature.
The ANNA ELEONORA algorithm [34] introduced a new genetic operator, called
GA-simplex [9], and an encoding procedure, called granularity encoding [32, 33],
which allows the algorithm to autonomously identify an appropriate
length of the coding string. Each gene consists of two parts:
the connectivity bits and the connection weight bits. The former indicate the
absence or presence of a link, and the latter encode the value of the
weight of that link. This approach employs four genetic operators (reproduc-
tion, crossover, mutation, and GA-simplex) and has been presented in both
parallel and sequential versions.
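A gene of this form can be decoded as sketched below; the field widths and the weight range are assumptions for illustration, since in ANNA ELEONORA the granularity encoding adapts the string length:

```python
def decode_gene(bits, weight_bits=8, w_min=-1.0, w_max=1.0):
    """Decode one gene: a leading connectivity bit followed by weight bits.

    Returns None for an absent link, otherwise the decoded weight.
    (Field widths and weight range are illustrative assumptions.)
    """
    connect, rest = bits[0], bits[1:1 + weight_bits]
    if connect == 0:
        return None                       # no link between the two neurons
    level = int("".join(map(str, rest)), 2)
    return w_min + (w_max - w_min) * level / (2 ** weight_bits - 1)

w = decode_gene([1, 1, 1, 1, 1, 1, 1, 1, 1])   # all-ones weight field
```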
296 A. Azzini and A.G.B. Tettamanzi
3 Neuro-Genetic Approach
This work presents an approach to the design of NNs based on EAs, whose
aim is both to find an optimal network architecture and to train the network
on a given data set.
The approach is designed to be able to take advantage of BP if that is
possible and beneficial; however, it can also do without it. The basic idea
is to exploit the ability of the EA to find a solution close enough to the
global optimum, together with the ability of the BP algorithm to finely tune
a solution and reach the nearest local minimum.
As indicated in Section 1, this research was primed by an industrial applica-
tion [4, 5], in which it is required to design neural engine controllers to be imple-
mented in hardware, with particular attention to reduced power consumption
and silicon area occupation. The validity of the resulting approach, however, is by
no means limited to hardware implementations of NNs. A second application
concerns brain wave signal processing [7], in particular a classification
algorithm for the analysis of the P300 Evoked Potential. Finally, a third appli-
cation considers a financial problem, whereby a factor model capturing the
mutual relationships among several financial instruments is sought [6].
The attention is restricted to a specific subset of the NN architectures,
namely the Multi-Layer Perceptron (MLP). MLPs are feedforward NNs with
one layer of input neurons, one layer of one or more output neurons and zero
or more ‘hidden’ (i.e., internal) layers of neurons in between; neurons in a
layer can take inputs from the previous layer only.
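This layer-to-layer restriction makes evaluation a simple chain of weighted sums and activations; a minimal sketch (the logistic activation and the example shapes are assumptions):

```python
import numpy as np

def mlp_forward(x, weights, biases):
    """Evaluate an MLP: each layer feeds only on the previous layer's output."""
    a = np.asarray(x, dtype=float)
    for W, b in zip(weights, biases):
        a = 1.0 / (1.0 + np.exp(-(W @ a + b)))   # logistic activation
    return a

# A [3, 1] phenotype (one hidden layer of 3 neurons, 1 output), 2 inputs.
W = [np.ones((3, 2)), np.ones((1, 3))]
b = [np.zeros(3), np.zeros(1)]
y = mlp_forward([0.0, 0.0], W, b)
```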
A peculiar aspect of this approach is that BP is not used as a genetic
operator, as is the case in some related work [11]. Instead, the EA optimizes
both the topology and the weights of the networks; BP is optionally used
to decode a genotype into a phenotype NN. Accordingly, it is the genotype
which undergoes the genetic operators and which reproduces itself, whereas
the phenotype is used only for calculating the genotype’s fitness.
The idea proposed in this work is close to the solution presented in EPNet
[61], an evolutionary system for evolving feedforward ANNs that puts the
emphasis on evolving ANN behaviors. Like EPNet, this neuro-genetic approach
evolves ANN architectures and connection weights simultaneously, in order
to reduce the noise in fitness evaluation.
A close behavioral link between parent and offspring is also maintained by
applying different techniques, such as weight mutation and partial training, in
order to reduce behavioral disruption. The first technique is attempted before
any architectural mutation; the second is employed after each architectural
mutation. Moreover, a hidden node is not added to an existing ANN at ran-
dom, but through splitting an existing node. In our work we carry out
weight mutation before topology mutation, since we want to perturb the con-
nection weights of the neurons in a neural network, and then carry out a
weight control in order to delete neurons whose contribution is negligible
with respect to the overall network output. This allows one to implement, if it is
The fitness of an individual depends both on its accuracy (i.e., its mse) and
on its cost. Although it is customary in EAs to assume that better individuals
have higher fitness, the convention that a lower fitness means a better NN
is adopted in this work. This maps directly to the objective function of the
genetic problem, which is a cost-minimization problem. The fitness is therefore
proportional to the value of the mse and to the cost of the considered
network. It is defined as
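A lower-is-better fitness combining mse and cost can be sketched as follows; the linear-combination form and the weight `lam` are assumptions, not the chapter's exact formula:

```python
def fitness(mse, cost, lam=0.5):
    """Lower-is-better fitness, increasing in both error and network cost.

    The convex-combination form and the weight lam are illustrative
    assumptions; any monotone combination of mse and cost would do.
    """
    return lam * cost + (1.0 - lam) * mse

# A cheaper network with the same accuracy gets the better (lower) fitness.
f_small = fitness(mse=0.10, cost=20)
f_large = fitness(mse=0.10, cost=80)
```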
3.4 Selection
In the evolution process, two important and strongly related issues are pop-
ulation diversity and selective pressure: an increase in the selective
pressure decreases the diversity of the population, and vice versa. As indi-
cated by Michalewicz [36], it is important to strike a balance between these
two factors, and sampling mechanisms attempt to achieve this goal. As
observed in that work, many of the parameters used in the genetic search
affect these factors. In this sense, as selective pressure is increased, the search
3.5 Mutation
with

    τ = 1 / (2 N_syn)^(1/3),          (6)
    τ = 1 / (4 (2 N_syn)^(1/3)).      (7)
In this work, two different kinds of threshold are considered and alternately
applied to the weight perturbation. The first is a fixed threshold, simply
defined as a parameter set before execution. The following pseudocode is
implemented in the mutation operator, applying a comparison between
that parameter and all weight-matrix values.
for i = 1 to l − 1 do
    if N_i > 1 then
        for j = 1 to N_i do
            if ||W_j^(i)|| <
the (i − 1)th layer and the (i + 1)th layer (to become the ith layer)
are rewired as follows: since W^(i−1) is a row vector and W^(i) is a column
vector, the result of the product of their transposes is an N_{i+1} × N_{i−1} matrix.
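For illustration, the rewiring that removes a one-neuron hidden layer can be sketched with NumPy; composing the two weight maps this way preserves only the layer's linear part, so the operator is approximate for nonlinear units (shapes and values below are arbitrary):

```python
import numpy as np

# Incoming weights of the layer being removed (1 neuron, 3 inputs) and its
# outgoing weights (2 neurons in the next layer). Values are illustrative.
W_in = np.array([[0.2, -0.5, 1.0]])      # shape (1, 3): one row per neuron
W_out = np.array([[0.7], [-0.3]])        # shape (2, 1): one column per neuron

# Rewire layer i-1 directly to layer i+1: the composition of the two maps,
# yielding an N_{i+1} x N_{i-1} matrix as described in the text.
W_new = W_out @ W_in                     # shape (2, 3)
```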
c) Insertion of a neuron: with probability p+_neuron, the jth neuron in
hidden layer i is randomly selected for duplication. A copy of it is
inserted into the same layer i as the (N_i + 1)th neuron; the weight
matrices are then updated as follows:
i. a new row is appended to W^(i−1), which is a copy of the jth row of
W^(i−1);
ii. a new column W^(i)_{N_i+1} is appended to W^(i), where

    W^(i)_j ← (1/2) W^(i)_j,          (9)
    W^(i)_{N_i+1} ← W^(i)_j.          (10)
The rationale for halving the output weights from both the jth neuron
and its copy is that, by doing so, the overall network behavior remains
unchanged, i.e., this kind of mutation is neutral.
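The neutrality claim can be checked numerically: duplicating a hidden neuron and halving the outgoing weights of both copies leaves the output unchanged (a sketch with arbitrary shapes and tanh units):

```python
import numpy as np

def forward(x, W_in, W_out):
    return W_out @ np.tanh(W_in @ x)     # one hidden layer, tanh units

rng = np.random.default_rng(1)
W_in = rng.normal(size=(3, 2))           # 3 hidden neurons, 2 inputs
W_out = rng.normal(size=(1, 3))
x = rng.normal(size=2)
y_before = forward(x, W_in, W_out)

j = 1                                    # neuron selected for duplication
W_in2 = np.vstack([W_in, W_in[j]])       # copy the jth row of incoming weights
col = W_out[:, [j]] / 2.0                # halve the outgoing weights ...
W_out2 = np.hstack([W_out, col])         # ... and give the copy the same half
W_out2[:, j] = col[:, 0]
y_after = forward(x, W_in2, W_out2)

neutral = np.allclose(y_before, y_after) # the mutation is behavior-preserving
```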
All three topology mutation operators are designed so as to minimize their
impact on the behavior of the network; in other words, they are designed to
be as little disruptive (and as much neutral) as possible.
3.6 Recombination
As indicated in [34], there has been some debate in the literature about the
opportunity of applying crossover to ANN evolution, based on the disruptive
effects it could introduce into the neural model. In this approach, two kinds of
crossover are independently implemented: the first is a kind of single-point
crossover with different cutting points; the second implements a kind of ver-
tical crossover, defining a merge operator between the topologies and weight
matrices of two parents in order to create the offspring.
Single-Point Crossover
(Figure: schematic of the single-point crossover, showing the input-weight
matrices iw and layer-weight matrices lw{i, j} of the two parents exchanged
at the cutting point.)
Vertical Crossover
(Figure: schematic of the vertical crossover, merging the input-weight
matrices iw and layer-weight matrices lw{i, j} of parents (a) and (b).)
The merged layer-weight matrix is block diagonal:

⎡ lw(a)_{1,1}  lw(a)_{1,2}  0            0           ⎤
⎢ lw(a)_{2,1}  lw(a)_{2,2}  0            0           ⎥
⎣ 0            0            lw(b)_{1,1}  lw(b)_{1,2} ⎦
The rationale for halving the weights is that, if both parents were ‘good’ networks, they would both supply the
appropriate input to the output layer; without halving it, the contribution
from the two subnetworks would add and yield an approximately double input
to the output layer. Therefore, halving the weights helps to make the operator
as little disruptive as possible.
The main problem in the crossover implementation is to maintain a
structural alignment between the neurons of each parent when a new offspring
is created; without alignment, disruptive effects could be introduced into
the neural model. Another important open issue, which also arises in
the two approaches implemented in this work, regards the initialization of
the connection-weight values at the merging point between the two selected
parents. In the first, single-point crossover operator, the new weight matrices
at the merging point of the offspring are initialized from a normal distribution,
while the corresponding variance matrices are set to matrices of all ones.
In the vertical crossover, the new weight matrices at the merging point are
defined as the block-diagonal matrix of the matrix of
the first parent and the matrix of the second parent. Also in this case, the new
connections will be initialized from a normal distribution.
In any case, the frequency and the size of crossover changes must be care-
fully selected in order to assure the balance between exploring and exploiting
abilities.
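The merge of two parents' layer-weight matrices can be sketched as a block-diagonal composition (a minimal illustration; `merge_layer` is a hypothetical name):

```python
import numpy as np

def merge_layer(Wa, Wb):
    """Vertical-crossover merge: block-diagonal composition of the two
    parents' layer-weight matrices (zeros leave the subnetworks unlinked)."""
    ra, ca = Wa.shape
    rb, cb = Wb.shape
    W = np.zeros((ra + rb, ca + cb))
    W[:ra, :ca] = Wa
    W[ra:, ca:] = Wb
    return W

Wa = np.array([[1.0, 2.0], [3.0, 4.0]])   # parent (a), 2x2 layer matrix
Wb = np.array([[5.0, 6.0]])               # parent (b), 1x2 layer matrix
W = merge_layer(Wa, Wb)                   # 3x4 block-diagonal matrix
```

The zero blocks keep the two parent subnetworks independent until subsequent mutation or training introduces cross-connections.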
4 Real-World Applications
As indicated in the previous sections, this approach has been applied to three dif-
ferent real-world problems. Section 4.1 describes an industrial application to
the design of neural engine controllers. The second application, presented in Section
4.2, concerns a neural classification algorithm for brain wave signal processing,
in particular the analysis of the P300 Evoked Potential. Finally, the third applica-
tion, presented in Section 4.3, considers a financial modeling problem, whereby a factor
model capturing the mutual relationships among several financial instruments
is sought.
Fig. 3. The experimental setup used for the fault diagnosis application: an AC
motor drive (rectifier, DC link, PWM inverter) monitored through a Hall-effect
transducer, feeding an A/D acquisition board, a wavelet processing stage, and a
GA-tuned ANN diagnosis stage
information about the output circuits of the drive and the motor. Instead, it
was proved that the operating condition of the AC motor will appear on the
AC side as a transient phenomenon or a sudden variation in the load power.
The presence of this electrical transient in the current suggests an approach
based on time-frequency or, better, time-scale analysis. In particular, the
Discrete Wavelet Transform (DWT) [52] can be used efficiently in the
process of separating the information.
The genetic approach involves the analysis of the signal—the load
current—through wavelet series decomposition. The decomposition results in
a set of coefficients, each carrying local time-frequency information. An or-
thogonal basis function is chosen, thus avoiding redundancy of information
and allowing for easy computation.
The computation of the wavelet series coefficients can be performed effi-
ciently with the Mallat algorithm [31]. The coefficients are computed by pro-
cessing the samples of the signal with a filter bank, where the coefficients of the
filters are peculiar to the chosen wavelet family. Figure 4 shows
the bandpass filtering, which is implemented as a lowpass g_i(n) and highpass
h_i(n) filter pair with mirrored characteristics.
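One analysis level of the filter bank can be sketched as convolution followed by downsampling by two; the Haar pair is used here for brevity in place of the 6-coefficient Daubechies wavelet of the application:

```python
import numpy as np

def dwt_level(signal, g, h):
    """One analysis level of the Mallat filter bank: lowpass g and highpass h,
    each followed by downsampling by 2 (approximation and detail coefficients)."""
    lo = np.convolve(signal, g)[1::2]    # keep every other sample
    hi = np.convolve(signal, h)[1::2]
    return lo, hi

s2 = 1.0 / np.sqrt(2.0)
g = np.array([s2, s2])                   # Haar lowpass
h = np.array([s2, -s2])                  # Haar highpass (mirrored pair)

x = np.array([1.0, 1.0, 2.0, 2.0])
approx, detail = dwt_level(x, g, h)      # constant pairs give zero detail
```

Iterating `dwt_level` on the approximation output yields the coarser decomposition levels whose maximum coefficients (w1 to w8) feed the classifier.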
In particular, in this application the 6-coefficient Daubechies wavelet [15]
was used. In Table 3 the filter coefficients of the utilized wavelet are reported.
The wavelet coefficients allow a compact representation of the signal and
the features of the normal operating or faulty conditions are condensed in the
wavelet coefficients. Conversely, the features of given operating modes can be
Fig. 5. A depiction of the logical structure of a case of study in the fault diagnosis
problem. The elements w1 to w8 are the maximum coefficients for each level of
wavelet decomposition; C indicates whether the case is faulty or not
recognized in the wavelet coefficients of the signal and the operating mode
can be identified.
Employing the wavelet analysis, both the identification of the drive operat-
ing conditions (faulty or normal operation) and the identification of the significant
parameters for the specific condition have been obtained.
Figure 5 depicts the logical structure of the data describing a case of study.
Each vector is known to have been originated in fault or non-fault conditions,
so it can be associated with a fault index C equal to 1 (fault condition), or 0
(normal condition).
This problem has already been approached with a neuro-fuzzy network,
whose structure was defined a priori and trained with BP [14], as indicated in
Figure 6.
Experiments
In this approach, both the network structure (topology) and the weights have
to be determined through evolution at the same time, as depicted in Figure 7.
The proposed model has to look for networks with 8 input attributes (the
features w1 to w8 of Figure 5, corresponding to the maximum coefficients
for each level of wavelet decomposition) and 1 output, the diagnosis C, which
will be interpreted as an estimate of the fault probability: zero means
a fault is not expected at all, whereas one means certainty that a fault is about
to occur.
The data used for learning have been obtained from a Virtual Test Bed
(VTB) model simulator of a real engine.
Several settings of five parameters (backpropagation bp, population size
n, and the three mutation probabilities relevant to structural mutation,
p+_layer, p−_layer, and p+_neuron) have been explored in order to assess
the robustness of the approach and to determine an optimal set-up. The
pcross parameter, which defines the crossover probability, is set to 0 for all
runs, because neither single-point crossover nor merge crossover gives
satisfactory results for this problem. All the other parameters are set to the
default values shown in Table 2.
All runs were allocated the same fixed amount of neural network exe-
cutions, to allow for a fair comparison between the cases with and without
backpropagation. The results are respectively summarized in Table 4 and in
Table 5. Ten runs are executed for each setting, of which the average and
standard deviation for the best solutions found are reported.
Results
A first comment can be made regarding the size of the population. In most
cases it is possible to observe that the solutions found with a larger population
are better than those found with a smaller population. With bp = 1, 15 settings
out of 27 give better results with n = 60, while with bp = 0, 19 settings out
of 27 give better results with the larger population.
The best solutions, on average, have been found with p+_layer = 0.1,
p−_layer = 0.2, and p+_neuron = 0.05, and there is a clear tendency for the
runs using backpropagation (bp = 1) to consistently obtain better-quality solutions.
The best model over all runs performed is a multi-layer perceptron with a
phenotype of type [3, 1], here specified without input layer.
Table 4. Experimental results for the engine fault diagnosis problem with BP = 1
setting  p+_layer  p−_layer  p+_neuron  n = 30 (avg, stdev)  n = 60 (avg, stdev)
1 0.05 0.05 0.05 0.11114 0.0070719 0.106 0.0027268
2 0.05 0.05 0.1 0.10676 0.003172 0.10606 0.0029861
3 0.05 0.05 0.2 0.10776 0.0066295 0.10513 0.0044829
4 0.05 0.1 0.05 0.10974 0.0076066 0.10339 0.0036281
5 0.05 0.1 0.1 0.1079 0.0067423 0.10696 0.0050514
6 0.05 0.1 0.2 0.10595 0.0035799 0.10634 0.0058783
7 0.05 0.2 0.05 0.10332 0.0051391 0.10423 0.0030827
8 0.05 0.2 0.1 0.10723 0.0097194 0.10496 0.0050782
9 0.05 0.2 0.2 0.10684 0.007072 0.1033 0.0031087
10 0.1 0.05 0.05 0.10637 0.0041459 0.10552 0.0031851
11 0.1 0.05 0.1 0.10579 0.0050796 0.10322 0.0035797
12 0.1 0.05 0.2 0.10635 0.0049606 0.10642 0.0042313
13 0.1 0.1 0.05 0.10592 0.0065002 0.10889 0.0038811
14 0.1 0.1 0.1 0.10814 0.0064667 0.10719 0.0054168
15 0.1 0.1 0.2 0.10851 0.0051502 0.11015 0.0055841
16 0.1 0.2 0.05 0.10267 0.005589 0.10318 0.0085395
17 0.1 0.2 0.1 0.10644 0.0045312 0.10431 0.0041649
18 0.1 0.2 0.2 0.10428 0.004367 0.10613 0.0052063
19 0.2 0.05 0.05 0.10985 0.0059448 0.10757 0.0045103
20 0.2 0.05 0.1 0.10593 0.0048254 0.10643 0.0056578
21 0.2 0.05 0.2 0.10714 0.0043861 0.10884 0.0049295
22 0.2 0.1 0.05 0.10441 0.0051143 0.10789 0.0046945
23 0.2 0.1 0.1 0.1035 0.0030094 0.1083 0.0031669
24 0.2 0.1 0.2 0.10722 0.0048851 0.1069 0.0050953
25 0.2 0.2 0.05 0.10285 0.0039064 0.1079 0.0045474
26 0.2 0.2 0.1 0.10785 0.008699 0.10768 0.0061734
27 0.2 0.2 0.2 0.10694 0.0052523 0.10652 0.0050768
Table 5. Experimental results for the engine fault diagnosis problem with BP = 0
setting  p+_layer  p−_layer  p+_neuron  n = 30 (avg, stdev)  n = 60 (avg, stdev)
1 0.05 0.05 0.05 0.14578 0.013878 0.13911 0.0086825
2 0.05 0.05 0.1 0.1434 0.011187 0.13573 0.013579
3 0.05 0.05 0.2 0.13977 0.014003 0.13574 0.010239
4 0.05 0.1 0.05 0.14713 0.0095158 0.13559 0.011214
5 0.05 0.1 0.1 0.14877 0.010932 0.13759 0.014255
6 0.05 0.1 0.2 0.14321 0.0095505 0.1309 0.012189
7 0.05 0.2 0.05 0.14304 0.014855 0.13855 0.0089141
8 0.05 0.2 0.1 0.13495 0.015099 0.13655 0.0079848
9 0.05 0.2 0.2 0.14613 0.010733 0.14165 0.013385
10 0.1 0.05 0.05 0.13939 0.013532 0.13473 0.0085242
11 0.1 0.05 0.1 0.13781 0.0094961 0.13991 0.012132
12 0.1 0.05 0.2 0.13692 0.017408 0.13143 0.012919
13 0.1 0.1 0.05 0.13348 0.009155 0.1363 0.013102
14 0.1 0.1 0.1 0.13785 0.013465 0.13836 0.0094587
15 0.1 0.1 0.2 0.14076 0.01551 0.13994 0.011786
16 0.1 0.2 0.05 0.1396 0.0098416 0.13719 0.016372
17 0.1 0.2 0.1 0.13597 0.012948 0.14091 0.014344
18 0.1 0.2 0.2 0.14049 0.013535 0.13665 0.011426
19 0.2 0.05 0.05 0.13486 0.0079435 0.14068 0.013874
20 0.2 0.05 0.1 0.13536 0.0112 0.12998 0.013489
21 0.2 0.05 0.2 0.13328 0.0087402 0.1314 0.0088282
22 0.2 0.1 0.05 0.13693 0.0096481 0.13456 0.012431
23 0.2 0.1 0.1 0.13771 0.015971 0.13939 0.0092643
24 0.2 0.1 0.2 0.13204 0.010325 0.1378 0.01028
25 0.2 0.2 0.05 0.14062 0.012129 0.14005 0.011195
26 0.2 0.2 0.1 0.14171 0.008802 0.13877 0.0094973
27 0.2 0.2 0.2 0.14216 0.015659 0.13965 0.015732
Brain-Computer Interfaces
signal processing that can be used to analyze brain waves. During the first
international meeting on BCI technology, Jonathan R. Wolpaw formalized the
definition of the BCI systems as follows:
A brain-computer interface (BCI) is a communication or control sys-
tem in which the user’s messages or commands do not depend on the
brain’s normal output channels. That is, the message is not carried by
nerves and muscles, and, furthermore, neuromuscular activity is not
needed to produce the activity that does carry the message [57].
According to this definition, BCI systems appear as a possible and some-
times unique mode of communication for people with severe neuromuscular
disorders such as spinal cord injury or cerebral paralysis. Exploiting the residual
functions of the brain may allow those patients to communicate.
The human brain has an intense chemical and electrical activity, partially
characterized by peculiar electrical patterns, which occur at specific times and
at well-localized brain sites. All of that is observable with a certain level of
repeatability under well-defined environmental conditions. These simple phys-
iological issues can lead to the development of new communication systems.
Problem Description
One of the electrical activities of the brain most utilized for BCI is the so-called
P300 Evoked Potential. This wave is a late-appearing component of an Event-
Related Potential (ERP), which can be auditory, visual, or somatosensory. It
has a latency of about 300 ms and is elicited by rare or significant stimuli
when these are interspersed with frequent or routine stimuli. Its amplitude is
strongly related to the unpredictability of the stimulus: the more unexpected
the stimulus, the higher the amplitude. This particular wave has been used
to make a subject choose between different stimuli [17, 18].
The general idea of Donchin's solution is that the patient is able to generate
this signal without any training. This is due to the fact that the P300 is
the brain's response to an unexpected or surprising event and is generated
naturally. Donchin developed a BCI system able to detect an elicited P300 by
signal-averaging techniques (to reduce the noise) and used a specific method
to speed up the overall performance. Donchin's idea has been adopted and
further developed by Beverina and colleagues at ST Microelectronics [10].
In this application, the neuro-genetic approach described in Section 3 has
been applied to the same dataset of P300 evoked potential used by Beverina
and colleagues for their approach on brain signal analysis based on support
vector machines.
Experiments
Results
The results obtained with the settings defined above are presented
in Table 6.
Due to the way the training set and the test set are used, it is not surprising
that error rates on the test sets look better than error rates on the training
sets. That happens because, in the case of bp = 1, the performance of a
network on the test set is used to calculate its fitness, which is used by the
evolutionary algorithm to perform selection. Therefore, it is only networks
whose performance on the test set is better than average which are selected
for reproduction. The best solution has been found by the algorithm using
backpropagation and is a multi-layer perceptron with one hidden layer with 4
Table 6. Error rates of the best solutions found by the neuro-genetic approach with
and without the use of backpropagation, averaged over 100 runs
bp  training FP (avg, stdev)  training FN (avg, stdev)  test FP (avg, stdev)  test FN (avg, stdev)
0 93.28 38.668 86.14 38.289 7.62 3.9817 7.39 3.9026
1 29.42 14.329 36.47 12.716 1.96 1.4697 2.07 1.4924
neurons, which gives 22 false positives and 29 false negatives on the training
set, while it commits no classification error on the test set.
The results obtained by the neuro-genetic approach, without any specific
tuning of the parameters, appear promising. To provide a reference, the
average number of false positives obtained by Beverina and colleagues with
support vector machines is 9.62 on the training set and 3.26 on the test set,
whereas the number of false negatives is 21.34 on the training set and 4.45
on the test set [43].
Experiments
In this application, the training and test sets are created by considering daily
returns for the period from 2 January 2001 to 30 November 2005. All data
are divided into two datasets, with 1000 cases for the training set and 231
cases for the test set, respectively. The validation set consists of the daily
returns for the period from 1 December 2005 to 13 January 2006.
All the parameters are set to the default values shown in Table 2, while
the pcross parameter is set to 0, because the two types of crossover defined give
unsatisfactory results in this application. This is probably due to the fact that,
whereas evolutionary algorithms are known to be quite effective in exploring
the search space, they are in general quite poor at closing in on a local optimum;
backpropagation, which is essentially a local optimization algorithm, appears
to complement the evolutionary approach well.
The time series of the training, test, and validation sets are preprocessed by
subtracting the 20-day moving average from all the components. Several runs
of this approach have been carried out in order to find optimal settings of the
genetic parameters p+_layer, p−_layer, and p+_neuron. For each run of the evolutionary
algorithm, up to 100,000 network evaluations (i.e., simulations of the network
on the whole training set) have been allowed, including those performed by
the backpropagation algorithm.
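The detrending step can be sketched as follows (a minimal illustration; the trailing-window convention and dropping the first window − 1 samples are assumptions):

```python
import numpy as np

def detrend(series, window=20):
    """Subtract the trailing moving average (20 days, as in the text) from a
    series; the first window - 1 values, lacking a full window, are dropped."""
    kernel = np.ones(window) / window
    ma = np.convolve(series, kernel, mode="valid")   # trailing averages
    return series[window - 1:] - ma

prices = np.arange(100, dtype=float)      # toy linearly rising series
resid = detrend(prices)                   # constant residual for a linear trend
```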
All simulations have been carried out with bp = 1; cases with bp = 0 have
not been considered, since no optimal results were obtained from those
simulations. The results are presented in Table 8, which reports the average
and the standard deviation of the test fitness values of the best solutions
found for each parameter setting over all runs.
The best solutions, on average, have been found with p+_layer = 0.2, p−_layer =
0.2, and p+_neuron = 0.2, although they do not differ significantly from other
solutions found with bp = 1.
The best model over all runs performed has been found by the algorithm
using backpropagation, and it is a multi-layer perceptron with a phenotype of
type [2, 1], specified without input layer, which obtained a mean square error
of 0.39 on the test set.
Results
One observation is that the approach is substantially robust with respect to
the setting of parameters other than bp.
Fig. 8. Comparison between the daily closing prices predicted by the best model
(dashed line) and actual daily closing prices (solid line) of the DJIA on the valida-
tion set.
where the w_i are the same weight values as those of the best solution.
The predictions obtained by the linear regression model are compared with
our best solution, as shown in Figure 9. The neuro-genetic solution
obtained with our approach has an mse of 1291.7, a better result compared to
the mse of 1320.5 of the prediction based on linear regression on the same
validation dataset.
The usefulness of such a model is evaluated with a paper simulation of a
very simple statistical arbitrage strategy, carried out on the same validation
set as the financial modeling. The strategy is described in detail in [6], and
the results show that the information given by a neural network obtained with
this approach would enable an arbitrageur to gain a significant profit.
Fig. 9. Comparison between the daily closing prices predicted by the best model
(dashed line), those predicted by the linear regression (dash-dotted line), and the
actual daily closing prices (solid line) of the DJIA on the validation set.
Future work will consider the study of the efficiency and robustness of
this approach when the input data are affected by uncertainty due to
errors introduced by measurement instrumentation. A further improve-
ment could come from the elimination of algorithm parameters, even though
this approach has been shown to be robust with respect to parameter
tuning.
Further study of new crossover designs, aimed at being as little disruptive
as possible, could improve the genetic algorithm implementation. The new
merge crossover implemented in this work seems to be a promising step in
that direction, even though, in its present form, its use did not boost the
performance of the algorithm significantly.
References
1. Aarts EHL, Eiben AE, van Hee KM (1989) A general theory of genetic
algorithms. Computing Science Notes. Eindhoven University of Technology,
Eindhoven
2. Abraham A (2004) Meta learning evolutionary artificial neural networks. Neuro-
computing 56:1–38
3. Angeline PJ, Saunders GM, Pollack JB (1994) An evolutionary algorithm that
constructs recurrent neural networks. IEEE Trans Neural Netw 5:54–65
4. Azzini A, Cristaldi L, Lazzaroni M, Monti A, Ponci F, Tettamanzi AGB (2006)
Incipient fault diagnosis in electrical drives by tuned neural networks. In: Pro-
ceedings of the IEEE instrumentation and measurement technology conference,
IMTC 2006, Sorrento, Italy. IEEE, April, 24–27
5. Azzini A, Lazzaroni M, Tettamanzi AGB (2005) A neuro-genetic approach to
neural network design. In: Sartori F, Manzoni S, Palmonari M (eds) AI*IA 2005
23. Harp S, Samad T, Guha A (1991) Towards the genetic synthesis of neural net-
works. Fourth international conference on genetic algorithms, pp 360–369
24. Harris L (2003) Trading and exchanges: market microstructure for practitioners.
Oxford University Press, Oxford
25. Holland JH (1975) Adaptation in natural and artificial systems. The University
of Michigan Press, Ann Arbor, MI
26. De Jong KA (1993) Evol Comput. MIT, Cambridge, MA
27. Keesing R, Stork DG (1991) Evolution and learning in neural networks: the
number and distribution of learning trials affect the rate of evolution. Adv Neu-
ral Inf Process Syst 3:805–810
28. Knerr S, Personnaz L, Dreyfus G (1992) Handwritten digit recognition by neural
networks with single-layer training. IEEE Trans Neural Netw 3:962–968
29. Koza JR (1994) Genetic programming. The MIT, Cambridge, MA
30. Leung EHF, Lam HF, Ling SH, Tam PKS (2003) Tuning of the structure and
parameters of a neural network using an improved genetic algorithm. IEEE
Trans Neural Netw 14(1):54–65
31. Mallat S (1999) A wavelet tour of signal processing. Academic, San Diego, CA
32. Maniezzo V (1993) Granularity evolution. In: Proceedings of the fifth interna-
tional conference on genetic algorithm and their applications, p 644
33. Maniezzo V (1993) Searching among search spaces: hastening the genetic evo-
lution of feedforward neural networks. In: International conference on neural
networks and genetic algorithms, GA-ANN’93, pp 635–642
34. Maniezzo V (1994) Genetic evolution of the topology and weight distribution of
neural networks. IEEE Trans Neural Netw 5(1):39–53
35. Merelo JJ, Patón M, Cañas A, Prieto A, Morán F (1993) Optimization of a com-
petitive learning neural network by genetic algorithms. IWANN93. Lect Notes
Comp Sci 686:185–192
36. Michalewicz Z (1996) Genetic algorithms + data structures = evolution programs.
Springer, Berlin Heidelberg New York
37. Miller GF, Todd PM, Hegde SU (1989) Designing neural networks using genetic
algorithms. In: Schaffer JD (ed) Proceedings of the third international confer-
ence on genetic algorithms, pp 379–384
38. Moller MF (1993) A scaled conjugate gradient algorithm for fast supervised
learning. Neural Netw 6(4):525–533
39. Montana D, Davis L (1989) Training feedforward neural networks using genetic
algorithms. In: Proceedings of the eleventh international conference on artificial
intelligence. Morgan Kaufmann, Los Altos, CA, pp 762–767
40. Mordaunt P, Zalzala AMS (2002) Towards an evolutionary neural network for
gait analysis. In: Proceedings of the 2002 congress on evolutionary computation,
vol 2. pp 1238–1243
41. Mozer MC, Smolensky P (1989) Using relevance to reduce network size automat-
ically. Connect Sci 1(1):3–16
42. Mühlenbein H, Schlierkamp-Voosen D (1993) The science of breeding and its
application to the breeder genetic algorithm (bga). Evol Comput 1(4):335–360
43. Palmas G (2005) Personal communication, November 2005
44. Palmes PP, Hayasaka T, Usui S (2005) Mutation-based genetic neural network.
IEEE Trans Neural Netw 16(3):587–600
45. Pedrajas NG, Martinez CH, Prez JM (2003) Covnet: a cooperative coevolu-
tionary model for evolving artificial neural networks. IEEE Trans Neural Netw
14(3):575–596
A New Genetic Approach for Neural Network Design 323
1 Introduction
General purpose neural network (NN) models such as multi-layer perceptrons
(MLPs) and radial basis function networks (RBFNs) have been applied to
many real-world problems. Although these models have very general utility,
the construction of a quality network can be time consuming. Practical prob-
lems faced by the modeller include the selection of model inputs, the selection
of model form, and the selection of appropriate parameters for the model
such as weights. The use of evolutionary algorithms (EAs) such as the ge-
netic algorithm provides scope to automate one or more of these decisions.
Traditional methods of combining EA and NN methodologies typically entail
the encoding of aspects of the NN model using a fixed-length binary or real-
valued chromosome. The EA is then applied to a population of chromosomes,
where each member of this population encodes a specific NN structure. The
population of chromosomes is evolved over time so that better NN structures
are uncovered. A drawback of this method is that the use of a fixed length
chromosome places a restriction on the nature of the NN models that can be
evolved by the EA. This study adopts an alternative approach using a novel
hybrid algorithm where evolutionary computation, in the form of grammatical
genetic programming, is used to generate an RBFN. This approach employs
a variable length chromosome which implies that the structure of the RBFN
is not determined a priori but rather is uncovered by means of an evolution-
ary process. This study represents the first application of a grammar-based genetic programming approach to the generation of RBFNs.
2 Grammatical Evolution
Grammatical Evolution (GE) is an evolutionary algorithm that can evolve
computer programs in any language [12–16] and it can be considered a form
of grammar-based genetic programming. GE has enjoyed particular success in
the domain of Financial Modelling [2] amongst numerous other applications
including Bioinformatics, Systems Biology, Combinatorial Optimisation and
Design [3, 4, 9, 11]. Rather than representing the programs as parse trees, as
in GP [1, 5–8], a linear genome representation is used. A genotype-phenotype
mapping is employed such that each individual’s variable length binary string
contains in its codons (groups of 8 bits) the information to select production
rules from a Backus Naur Form (BNF) grammar. The grammar allows the
generation of programs (or in this study, RBFN forms) in an arbitrary lan-
guage that are guaranteed to be syntactically correct. As such, it is used as a
generative grammar, as opposed to the classical use of grammars in compilers
to check syntactic correctness of sentences. The user can tailor the grammar
to produce solutions that are purely syntactically constrained, or they may in-
corporate domain knowledge by biasing the grammar to produce very specific
forms of sentences.
BNF is a notation that represents a language in the form of production
rules. It comprises a set of non-terminals that can be mapped to elements
of the set of terminals (the primitive symbols that can be used to construct the
output program or sentence(s)), according to the production rules. A simple
example of a BNF grammar is given below, where <expr> is the start symbol
from which all programs are generated. These productions state that <expr>
can be replaced with either one of <expr><op><expr> or <var>. An <op> can
become either +, -, or *, and a <var> can become either x, or y.
<expr> ::= <expr><op><expr> (0)
| <var> (1)
<op> ::= + (0)
| - (1)
| * (2)
<var> ::= x (0)
| y (1)
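As an illustration of the genotype-phenotype mapping described above, the following sketch (an assumed implementation, not taken from this chapter; the function name ge_map and the codon values are ours) repeatedly selects a production rule for the leftmost non-terminal using the current codon value modulo the number of available rules, wrapping the genome if it is exhausted:

```python
# Illustrative sketch of GE's genotype-to-phenotype mapping over the
# example grammar above. Each codon selects a production rule for the
# leftmost non-terminal via "codon mod number-of-choices".
GRAMMAR = {
    "<expr>": [["<expr>", "<op>", "<expr>"], ["<var>"]],
    "<op>":   [["+"], ["-"], ["*"]],
    "<var>":  [["x"], ["y"]],
}

def ge_map(codons, start="<expr>", max_wraps=2):
    """Map a list of integer codons to a sentence of the grammar."""
    seq, used, wraps = [start], 0, 0
    while True:
        # find the leftmost non-terminal; if none remain, mapping is complete
        i = next((k for k, s in enumerate(seq) if s in GRAMMAR), None)
        if i is None:
            return " ".join(seq)
        if used == len(codons):          # wrap the genome when it runs out
            used, wraps = 0, wraps + 1
            if wraps > max_wraps:
                raise ValueError("mapping did not terminate")
        choices = GRAMMAR[seq[i]]
        seq = seq[:i] + choices[codons[used] % len(choices)] + seq[i + 1:]
        used += 1
```

For instance, the codon sequence [4, 3, 2, 5, 9, 7] maps to the sentence x * y under this grammar.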
A Grammatical Genetic Programming Representation 327
A detailed description of GE can be found in O'Neill and Ryan (2003) [12]. Some more recent developments are covered in Brabazon & O'Neill (2005) [2].
Fig. 2. A radial basis function network. The output from each hidden node (H0 is
a bias node, with a fixed input value of 1) is obtained by measuring the distance
between each input pattern and the location of the hidden node, and applying the
radial basis function to that distance. The final output from the network is obtained
by taking the weighted sum (using w0, w1 and w5) of the outputs from the hidden
layer and from H0
Each hidden node transforms the distance between the input pattern and its centre using a radial basis function. This value represents the quality of the match
between the input vector and the location of that centre in the input space.
Each hidden node, therefore, can be considered as a local detector in the
input data space. The most commonly used radial basis function is a Gaussian
function. This produces an output value of one if the input and weight vectors
are identical, falling towards zero as the distance between the two vectors gets
large. A range of alternative radial basis functions exists, including the inverse
multi-quadratic function and the spline function.
The second phase of the model construction process is the determination
of the value of the weights on the connections between the hidden layer and
the output layer. In training these weights, the output value for each input
vector will be known, as will the activation values for that input vector at each
hidden layer node, so a supervised learning method can be used. The simplest
transfer function for the node(s) in the output layer is a linear function where
the network’s output is a linearly weighted sum of the outputs from the hidden
nodes. In this case, the weights on the arcs to the output node(s) can be found
using linear regression, with the weight values being the regression coefficients.
Sometimes it may be preferable to implement a non-linear transfer function
at the output node(s). For example, when the RBFN is acting as a binary
classifier it would be useful to use a sigmoid transfer function to limit outputs
to the range 0 → 1. In this case, the weights between the hidden and output
layer could be determined using the backpropagation algorithm.
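The sigmoid output option just described can be sketched as follows (a minimal illustration; the function names are ours, not the authors'):

```python
# Minimal sketch of a sigmoid output node for a binary-classifier RBFN:
# the linearly weighted sum of hidden-layer activations (plus the bias
# node H0, whose input is fixed at 1) is squashed into the range (0, 1).
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def rbfn_output(activations, weights, bias_weight):
    # weighted sum over the hidden-node activations, plus the bias term
    z = bias_weight + sum(w * a for w, a in zip(weights, activations))
    return sigmoid(z)
```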
Once the RBFN has been constructed using a training set of input-output
data vectors it can then be used to classify or to predict outputs for new
input data vectors, for which an output value is not known. The new input
data vector is presented to the network, and an activation value is calculated
for each hidden node. Assuming that a linear transfer function is used in the
output node(s), the final output produced by the network is the weighted sum
of the activation values from the hidden layer, where these weights are the
coefficient values obtained in the linear regression step during training. The
basic algorithm for the canonical RBFN is as follows:
i. Select the initial number of centres (m).
ii. Select the initial location of each of the centres in the data space.
iii. For each input data vector/centre pairing calculate the activation value
φ(||x − y||), where φ is a radial basis function and ||...|| is a distance
measure between input vector x and a centre y in the data space. As an
example, let d = ||x − y||. The value of a Gaussian RBF is then given by y = exp(−d²/(2σ²)), where σ is a modeller-selected parameter which determines the size of the region of input space a given centre will respond to.
iv. Once all the activation values for each input vector have been obtained,
calculate the weights for the connections between the hidden and output
layers using linear regression.
v. Go to step (iii) and repeat until a stopping condition is reached.
330 I. Dempsey et al.
vi. Improve the fit of the RBFN to the training data by adjusting some or
all of the following: the number of centres, their location, or the width of
the radial basis functions.
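Steps (i)-(iv) above can be sketched as follows (a pure-Python illustration under assumed data layouts, not the authors' implementation): Gaussian activations are computed for each input/centre pairing, and the output weights, including the bias-node weight, are fitted by linear regression via the normal equations:

```python
# Sketch of the canonical RBFN: fixed Gaussian centres, hidden-layer
# activations phi(d) = exp(-d^2 / (2 sigma^2)), and output weights
# obtained by least squares (the linear-regression step iv).
import math

def activations(x, centres, sigma):
    """Hidden-layer outputs for one input vector, with bias node H0 = 1."""
    out = [1.0]
    for c in centres:
        d2 = sum((xi - ci) ** 2 for xi, ci in zip(x, c))
        out.append(math.exp(-d2 / (2.0 * sigma ** 2)))
    return out

def solve(A, b):
    """Gaussian elimination with partial pivoting (for the normal equations)."""
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for k in range(col, n + 1):
                M[r][k] -= f * M[col][k]
    w = [0.0] * n
    for r in range(n - 1, -1, -1):
        w[r] = (M[r][n] - sum(M[r][k] * w[k] for k in range(r + 1, n))) / M[r][r]
    return w

def train_rbfn(X, t, centres, sigma):
    """Least-squares output weights: solve (H^T H) w = H^T t."""
    H = [activations(x, centres, sigma) for x in X]
    p = len(H[0])
    HtH = [[sum(h[i] * h[j] for h in H) for j in range(p)] for i in range(p)]
    Htt = [sum(h[i] * ti for h, ti in zip(H, t)) for i in range(p)]
    return solve(HtH, Htt)

def predict(x, centres, sigma, w):
    return sum(wi * hi for wi, hi in zip(w, activations(x, centres, sigma)))
```

Steps (v)-(vi), the outer loop that adjusts the number, location and width of the centres, would wrap these functions.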
As the number of centres increases, the predictive ability of the RBFN on the
training data will tend to increase, possibly leading to overfit and poor out-
of-sample generalisation. Hence, the object is to choose a sufficient number
of hidden layer nodes to capture the essential features in the training data,
without overfitting it.
4 GE-RBFN Hybrid
Despite the apparent dissimilarities between GE and RBFN methodologies,
the methods can complement each other. A practical problem in utilising
RBFNs is the selection of model inputs and model form. By defining an appro-
priate grammar, GE is capable of automatically generating a range of RBFN
forms. Hence, a combined GE-RBFN hybrid can be considered as embedding
both hypothesis generation and hypothesis optimisation components.
The basic operation of the GE-RBFN methodology is as follows. Initially,
a population of binary strings is randomly created. In turn, each of these is mapped to an RBFN structure using a grammar which has been constructed
specifically for the task of generating RBFNs (see next subsection). The qual-
ity of each resulting RBFN is then assessed using the training data. Based
on this information, the binary strings resulting in higher quality networks
are preferentially selected for survival and reproduction. Over successive iter-
ations, the quality of the networks encoded in the population of binary strings
improves.
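The generate-map-evaluate-select loop just described can be sketched as follows (illustrative only: map_to_rbfn and fitness are placeholders for the grammar mapping and the training-data evaluation, and the truncation selection, one-point crossover and single-codon mutation are simplifications, not the canonical GE operators):

```python
# Hedged sketch of the overall GE-RBFN loop: binary genomes (here, lists
# of 8-bit codons) are mapped to networks, scored, and the better genomes
# preferentially survive and reproduce.
import random

def evolve(pop_size=20, genome_len=32, generations=10,
           map_to_rbfn=None, fitness=None):
    pop = [[random.randint(0, 255) for _ in range(genome_len)]
           for _ in range(pop_size)]
    for _ in range(generations):
        scored = sorted(pop, key=lambda g: fitness(map_to_rbfn(g)),
                        reverse=True)
        parents = scored[:pop_size // 2]          # truncation selection
        children = []
        for _ in range(pop_size - len(parents)):
            a, b = random.sample(parents, 2)      # one-point crossover
            cut = random.randrange(1, genome_len)
            child = a[:cut] + b[cut:]
            i = random.randrange(genome_len)      # mutate one codon
            child[i] = random.randint(0, 255)
            children.append(child)
        pop = parents + children
    return max(pop, key=lambda g: fitness(map_to_rbfn(g)))
```

With placeholder arguments (e.g. the identity mapping and codon sum as fitness) the loop demonstrably increases population quality over the generations.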
4.1 Grammar
There are multiple grammars that could be defined in order to generate
RBFNs depending on exactly what the modeller wishes to evolve. For ex-
ample, if little was known about which inputs would be useful for the RBFN,
the grammar could be written so that GE selected which inputs to use, in
addition to selecting the form of the RBFN itself (the number of hidden layer
nodes, their associated weight vectors, the form of their associated radial basis
functions and so on).
In this study we define a grammar which permits GE to construct RBFNs
with differing numbers of centres. GE is also used to decide where to locate
those centres in the input space. The Backus Naur Form grammar for this is
as follows.
<RBFN> ::= 1 / (1 + exp(-<HL>))
Fig. 3. An output radial basis function network in the form of a derivation tree. Tree (A) represents the common structure of all RBFNs generated by the example grammar. Trees (B) and (C) represent the two possible sub-trees that can replace the <HL> non-terminal in (A). (B) represents the case where an <HL> becomes a single node, and (C) represents the case where <HL> becomes at least two nodes
Table 1. Problem instance statistics and the training and test set partition sizes in
each case
Dataset Training Test #variables #classes
Wisconsin 559 140 9 2
Pima 614 154 8 2
Thyroid 172 43 5 3
Australian 552 138 6 2
Bupa 276 69 6 2
Table 2. Results for GE/RBFN including average fitnesses for both in and out of
sample data sets along with standard deviation for the out of sample data
The developed GE-RBFN hybrid was applied to five benchmark instances from the UCI Machine Learning repository with encouraging results.
Substantial scope exists to further develop the RBFN-GE methodology
outlined in this chapter. In this initial study we did not include the selection
of inputs, or the selection of the form of the RBFs in the evolutionary process.
However, the RBFN grammar could be easily adapted in order to incorporate
these steps if required. The use of the GE methodology also opens up a variety
of other research avenues. The methodology applied in this study is based
on the canonical form of the GE algorithm. As already noted, a substantial
literature exists on GE covering such issues as the use of alternative search
engines for the algorithm, and the use of alternatives to the strict left-to-right
mapping of the genome (piGE). Future work could usefully examine the utility
of these GE variants for the purposes of evolving RBFNs.
References
1. Banzhaf W, Nordin P, Keller RE, Francone FD (1998) Genetic programming –
an introduction. On the automatic evolution of computer programs and its ap-
plications. Morgan Kaufmann, Los Altos, CA
2. Brabazon A, O’Neill M (2005) Biologically inspired algorithms for financial mod-
elling. Springer, Berlin Heidelberg New York
3. Cleary R, O’Neill M (2005) An attribute grammar decoder for the 01 multiCon-
strained knapsack problem. In: LNCS 3448 Proceedings of evolutionary com-
putation in combinatorial optimization EvoCOP 2005, Lausanne, Switzerland.
Springer, Berlin Heidelberg New York, pp 34–45
4. Hemberg M, O’Reilly U-M (2002) GENR8 – using grammatical evolution in a
surface design tool. In: Proceedings of the first grammatical evolution workshop
GEWS2002, New York City, New York, US. ISGEC, pp 120–123
5. Koza JR (1992) Genetic programming. MIT, Cambridge, MA
6. Koza JR (1994) Genetic programming II: automatic discovery of reusable pro-
grams. MIT, Cambridge, MA
7. Koza JR, Andre D, Bennett III FH, Keane M (1999) Genetic programming 3:
Darwinian invention and problem solving. Morgan Kaufmann, Los Altos, CA
8. Koza JR, Keane M, Streeter MJ, Mydlowec W, Yu J, Lanza G (2003) Genetic
programming IV: routine human-competitive machine intelligence. Kluwer
Academic, Dordrecht
9. Moore JH, Hahn LW (2004) Systems biology modeling in human genetics using
petri nets and grammatical evolution. In: LNCS 3102 Proceedings of the genetic
and evolutionary computation conference GECCO 2004, Seattle, WA, USA,
Springer, Berlin Heidelberg New York, pp 392–401
10. O’Neill M, Brabazon A (2004) Grammatical swarm. In: LNCS 3102 Proceedings
of the genetic and evolutionary computation conference GECCO 2004, Seattle,
WA, USA. Springer Berlin Heidelberg New York, pp 163–174
11. O’Neill M, Adley C, Brabazon A (2005) A grammatical evolution approach to
eukaryotic promoter recognition. In: Proceedings of Bioinformatics INFORM
2005, Dublin City University, Dublin, Ireland
1 Introduction
1.1 Artificial Neural Networks in Engineering
Artificial Neural Networks (ANNs) are amongst the most successful empir-
ical processing technologies to be used in engineering applications. ANNs
serve as an important function for engineering purposes such as modelling
and predicting the evolution of dynamic systems.
Hagan et al. [1] noted that the pioneering work in neural networks commenced in 1943, when McCulloch and Pitts [2] postulated a simple mathematical model to explain how biological neurons work; this is generally considered the first significant publication on the theory of artificial neural networks. ANN models have been widely applied to various engineering problems. For example, the prediction of water quality parameters
D. Cha et al.: A Neural-Genetic Technique for Coastal Engineering: Determining Wave-
induced Seabed Liquefaction Depth, Studies in Computational Intelligence (SCI) 82, 337–351
(2008)
www.springerlink.com © Springer-Verlag Berlin Heidelberg 2008
[3], generation of wave equations based on hydraulic data [4], soil dynamic
amplification analysis [5], tide forecasting using artificial neural networks
[6], prediction of settlement of shallow foundations [7], earthquake-induced
liquefaction [8], and ground settlement by deep excavation [9].
Unlike conventional approaches based on engineering mechanics, the main
requirement for accurate prediction using ANN models is an appropriate
database. With sufficient quality training data, ANNs can provide accurate
predictions for various engineering problems.
Generally speaking, Genetic Algorithms (GAs) are one of the various Com-
putational Intelligence (CI) technologies, which also include ANNs and Fuzzy
Logic.
Fundamental theories of GAs were established by Holland [10] in the early
1970s. Holland [10] was amongst the first to put computational evolution on
a firm theoretical footing. The GA’s main role is numerical optimisation in-
spired by natural evolution. GAs can be applied to an extremely wide range of
problems. The basic components of GAs are strings of binary values (sometimes real values) called chromosomes. A GA operates on a population of individuals (chromosomes), each representing a possible solution to a given problem. Each chromosome is assigned a fitness value based on a fitness function, and the GA's operation is based on selection, crossover and mutation.
Fig. 1. Comparison of a traditional ANN model (a) and an ANN model trained by
GAs (b) (MSE: Mean Squared Error)
Bjerrum [16] was possibly the first author to consider wave-induced liquefaction occurring in saturated seabed sediments. Later, Nataraja et al. [17] suggested a simplified procedure for ocean-based, wave-induced liquefaction analysis. Recently, Rahman [18] established the relationship between liquefaction and the characteristics of wave and soil. He concluded that liquefaction potential increases with a decrease in the degree of saturation and with an increase in wave period. Jeng [19] examined a wave-induced liquefied state for several different cases, together with Zen and Yamazaki's [20] field data. He found that
no liquefaction occurs in a saturated seabed, except in very shallow water,
for large waves and a seabed with very low permeability. For more advanced
poro-elastic models for the wave-induced liquefaction potential, the readers
can refer to Sassa and Sekiguchi [21] and Sassa et al. [22]. All aforementioned
investigations have been reviewed by Jeng [23]. However, most previous inves-
tigations for the wave-induced liquefaction potential in a porous seabed have
been based on various assumptions of engineering mechanics, which limits the application of these models to realistic engineering problems.
CI models for the estimation of the wave-induced liquefaction apply differ-
ent techniques to investigate coastal engineering problems, as compared to the
traditional engineering approach. Traditional engineering methods for wave-
induced liquefaction prediction always use deterministic models, which involve
output layer of neurons. In the present study, GAs are utilised to optimise the
weights of the network and adjust the interconnections to minimise its output
error. The approach can be applied to any network that has at least one hidden layer and is fully connected between successive layers. The goal of this procedure is to
obtain a desired output when certain inputs are given. The general network
error E is shown in (1), where Dx and Ox are the desired and actual output
values, respectively.
E(x) = (1/n) Σ_{x=1}^{n} (Dx − Ox)²    (1)
Since the error is the difference between the actual output and the target
output, the error depends on the weights, so we employed this error function
in the GA fitness function for optimising the weights instead of a standard,
iterative, gradient descent-based training method (2),
f(x) = Emax − E(x)    (2)
where, f (x) is the corresponding fitness and Emax is the maximum perfor-
mance error.
For the GA selection function, we used the Roulette wheel method
developed by Holland [10]. It is defined as
Pi = Fi / Σ_{i=1}^{S} Fi    (3)
where, Pi is the probability for each individual chromosome, S is the popula-
tion size, and Fi is the fitness value of each chromosome.
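Equations (1)-(3) can be sketched in code as follows (an illustrative implementation; the function names are ours, not the authors'):

```python
# Sketch of the network error (1), the derived fitness (2), and
# roulette-wheel selection (3): a chromosome index is drawn with
# probability proportional to its fitness.
import random

def network_error(desired, actual):                 # equation (1)
    n = len(desired)
    return sum((d - o) ** 2 for d, o in zip(desired, actual)) / n

def fitness(error, e_max):                          # equation (2)
    return e_max - error

def roulette_select(fitnesses, rng=random):         # equation (3)
    total = sum(fitnesses)
    r = rng.uniform(0.0, total)
    acc = 0.0
    for i, f in enumerate(fitnesses):
        acc += f
        if r <= acc:
            return i
    return len(fitnesses) - 1
```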
In GAs, two basic operators are employed and modified in the search mechanism: crossover and mutation. For example, the basic two-point crossover operator takes two parent chromosomes and produces two new individuals, whilst mutation alters one individual to produce a single new solution. Further discussion and details pertaining to
Crossover and Mutation settings in this research are presented in the results
section.
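The two operators can be sketched as follows (illustrative, assuming a binary encoding; these are not the specific settings used in this research):

```python
# Two-point crossover swaps the segment between two cut points of two
# parent chromosomes; mutation flips a single gene of one individual.
import random

def two_point_crossover(parent_a, parent_b, rng=random):
    i, j = sorted(rng.sample(range(1, len(parent_a)), 2))
    child_a = parent_a[:i] + parent_b[i:j] + parent_a[j:]
    child_b = parent_b[:i] + parent_a[i:j] + parent_b[j:]
    return child_a, child_b

def mutate(chromosome, rng=random):
    i = rng.randrange(len(chromosome))
    mutated = list(chromosome)
    mutated[i] = 1 - mutated[i]        # flip one bit (binary encoding)
    return mutated
```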
Poro-elastic model
Then substituting (9) into (4) and (6), the governing equation can be
rewritten as
(Kf/n)(εii + wii),i = ρf üi + (ρf/n) ẅi + (ρf g/kz) ẇi    (10)
If the acceleration terms are neglected in the above equation, it be-
comes the consolidation equation, which has been used in previous work [19].
Based on the wave-induced soil and fluid displacements, we can obtain the
wave-induced pore pressure, effective stresses and shear stresses. Detailed
information of the above solution can be found in [27].
Estimation of liquefaction
It has generally been accepted that when the vertical effective stress van-
ishes, the soil will be liquefied. Thus, the soil matrix loses its strength to carry any load, which consequently causes seabed instability. Based on the con-
cept of excess pore pressure, Zen and Yamazaki [20] proposed a criterion of
Training model
Number of inputs: 5
Number of output neurons: 1
Number of hidden neurons: 1 to 5
Learning rate: 0.5
Momentum factor: 0.2
Wave characteristics
Wave period (T): 8 s to 12.5 s
Wave height (H): 7.5 m to 10.5 m
Water depth (d): 50 m to 100 m
Soil characteristics
Soil permeability (Kz): 10^-4 and 5 × 10^-4 m/s
Seabed thickness (h): 10 to 80 m
Degree of saturation (S): 0.95 to 0.99
the training procedure, and the remaining data for validating the prediction
capability using the best run of each case.
In this paper, we not only use a correlation value (R²) for comparison between the database and the neural-genetic prediction but we also use the
where LAi and LPi are the liquefaction depths from the ANN model and the poro-elastic model, respectively; N is the total number of liquefaction depth data.
slightly better than those shown in Fig. 6, because the GA operation settings were based on those in Fig. 6(b). It is clearly shown that better results may be produced by varying the GA settings, with specific attention to increasing the number of generations and adjusting the crossover and mutation parameters.
Fig. 7. Comparisons of the wave-induced liquefaction depth by the approach versus
the poro-elastic model (Soil Permeability: 5 × 10−4 m/sec). (a) 6000 Generations.
(b) 8000 Generations
These results indicate that the performance of the neural-genetic model for the prediction of maximum wave-induced seabed liquefaction compares favourably with the results of previous authors [29]. In this study, three crossover and mutation functions were adopted in the neural-genetic model.
4 Conclusions
In this study, we adopted the concept of GA-based training of ANN models
in an effort to overcome the problems inherent in some ANN training pro-
cedures (i.e. gradient-based techniques) whilst providing accurate results for
determining maximum liquefaction depth in a real-world application.
Unlike the conventional engineering mechanics approaches, the neural-genetic technique is data-driven: it is based on statistical theory, can be built given a quality database, and can save time in configuring and adjusting the settings of an ANN model during the supervised training process.
In the proposed neural-genetic model, based on a physical understanding
of wave-induced seabed liquefaction, several important parameters, includ-
ing wave period, water depth, wave height, seabed thickness and the degree
of saturation, were used as the input parameters with constant soil perme-
ability, whilst the maximum liquefaction depth was the output parameter.
Experimental results demonstrate that the neural-genetic model is a successful
technique in predicting the wave-induced maximum liquefaction depth.
References
1. Hagan MT, Demuth HB, Beale M (1996) Neural network design. PWS, Boston,
MA
2. McCulloch WS, Pitts W (1943) A logical calculus of the ideas immanent in nervous activity. Bull Math Biophys 5:115–133
3. Maier HR, Dandy GC (1997) Modeling cyanobacteria (blue-green algae) in the
River Murray using artificial neural networks. Math Comput Simulation 43:
377–386
4. Dibike YB, Minns AW, Abbott MB (1999) Applications of artificial neural net-
works to the generation of wave equations from hydraulic data. J Hydraulic Res
37(1):81–97
5. Hurtado JE, Londono JE, Meza JE (2001) On the applicability of neural
networks for soil dynamic amplification analysis. Soil Dyn Earthquake Eng
21(7):579–591
6. Lee TL, Jeng DS (2002) Application of artificial neural networks in tide
forecasting. Ocean Eng 29(9):1003–1022
7. Mohamed AS, Holger RM, Mark BJ (2002) Predicting settlement of shallow
foundations using neural networks. J Geotech Geo Envir Eng 128(9):785–793
8. Jeng DS, Lee TL, Lin C (2003) Application of artificial neural networks in
assessment of Chi–Chi earthquake-induced liquefaction. Asian J Inf Technol
2(3):190–198
9. Leo SS, Lo HS (2004) Neural network based regression model of ground surface settlement induced by deep excavation. Automation in Construction 13:279–289
10. Holland J (1975) Adaptation in natural and artificial systems. University of
Michigan Press. (Second edition: MIT, Cambridge, MA, 1999)
11. Yao X (1993) Evolving artificial neural networks. Int J Neural Syst 4(3):203–222
1 Introduction
A Personal Communication Network is a wireless communication network which integrates various services such as voice, video and electronic mail, accessible from a single mobile terminal and for which the subscriber receives a single invoice. These services are offered in an area called the coverage zone, which is divided into cells. A base station installed in each cell manages all the communications within that cell. In the coverage zone, cells are connected to special units called switches, which are located in mobile switching centers (MSCs). When a user in communication moves from one cell to another, the base station of the new cell has the responsibility of relaying this communication by allotting a new radio channel to the user. Supporting the transfer of a communication from one base station to another is called handoff. This mechanism, which primarily involves the switches, occurs when the level of the signal received by the user reaches a certain threshold. We distinguish two types of handoff. In the case of Figure 1 for example, when a user moves from cell A to cell B, this is referred to as a soft handoff because these two cells are connected to the same switch. The MSC which supervises the two cells remains the same
A. Quintero and S. Pierre: On the Design of Large-scale Cellular Mobile Networks Using Multi-
population Memetic Algorithms, Studies in Computational Intelligence (SCI) 82, 353–377 (2008)
www.springerlink.com © Springer-Verlag Berlin Heidelberg 2008
Fig. 1. Assignment of cells to switches: cells A and B are attached to switch 1 and cell C to switch 2; a handoff between cells A and C is a complex handoff
and the induced cost is low. On the other hand, when the user moves from cell A to cell C, there is a complex handoff. The induced cost is high because both switches 1 and 2 remain active during the handoff procedure and the database containing information on subscribers must be updated.
The total operating cost of a cellular network includes two components: the
cost of the links between the cells (base station) and the switches to which they
are joined, and the cost generated by the handoffs between cells. It therefore appears intuitively preferable to connect cells A and C to the same switch if the frequency of handoffs between them is high. The problem
of assigning cells to switches essentially consists of finding the configuration
that minimizes the total operating cost of the network. The resolution of
this problem by an exhaustive search method would entail a combinatorial
explosion, and therefore an exponential growth of execution times.
Since assigning cells to switches in cellular mobile networks is an NP-hard problem, enumerative search methods are practically inappropriate for solving large-sized instances of this problem [1, 40]. Because they exhaustively examine the entire search space in order to find the optimal solution, they are only efficient for the small search spaces corresponding to small-sized instances of the problem. For example, for a network with m switches and n cells, m^n solutions would have to be examined.
Merchant and Sengupta [27] proposed the first heuristic to solve this problem. Their algorithm starts from an initial solution, which they attempt to improve through a series of greedy moves, while avoiding becoming trapped in a local minimum. The moves used to escape a local minimum explore a very limited set of options. These moves depend on the initial solution and do not necessarily lead to a good final solution. Other heuristic approaches have been developed for this kind of problem [1, 2, 7, 23, 39, 49].
yij = Σ_{k=1}^{m} zijk    for i, j = 1, . . . , n and i ≠ j.
The first term of the equation represents the link or cabling cost. The
second term takes into account the complex handoffs cost and the third, the
cost of simple handoffs. We should keep in mind that the cost function is
quadratic in xik , because yij is a quadratic function of xik . Let’s mention that
an eventual weighting could be taken into account directly in the link and
handoff costs definitions.
The capacity of a switch k is denoted Mk. If λi denotes the number of calls per unit of time directed to cell i, the limited capacity of the switches imposes
the following constraint:
Σ_{i=1}^{n} λi xik ≤ Mk    for k = 1, . . . , m    (3)
according to which the total load of all cells assigned to switch k must not exceed the capacity Mk of that switch. Finally, the constraints of the
problem are completed by:
(1), (3) and (4) are constraints of transport problems. In fact, each cell i could
be likened to a factory which produces a call volume λi . The switches are then
considered as warehouses of capacity Mk where the cells' production could be stored. Therefore, the problem is to minimize (2) under (1), and (3) to (6). When the problem is formulated in this way, it cannot be solved with a polynomial-time algorithm such as linear programming because constraint
(5) is not linear. Merchant and Sengupta [27, 28] replaced it by the following
equivalent set of constraints:
hij refers to the reduced cost per time unit of a complex handoff between cells
i and j. Relation (2) is then re-written as follows:
f = Σ_{i=1}^{n} Σ_{k=1}^{m} cik xik + Σ_{i=1}^{n} Σ_{j=1, j≠i}^{n} hij (1 − yij) + Σ_{i=1}^{n} Σ_{j=1, j≠i}^{n} Hij,
where the last (underbraced) term is constant,
subject to (1), (3), (4) and (7) to (10). In this form, the assignment problem
could be solved by usual programming methods such as integer programming.
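For illustration, the total cost of a candidate assignment and its capacity feasibility, following the formulation above, can be evaluated as follows (an assumed data layout, not the authors' code; the constant handoff term is omitted since it does not affect the optimisation):

```python
# Sketch evaluating a candidate assignment of cells to switches:
# cabling cost plus complex-handoff cost (handoffs between cells
# attached to different switches), with a capacity feasibility check.
def assignment_cost(assign, cable, handoff, load, capacity):
    """assign[i] = switch of cell i; cable[i][k] = link cost of cell i to
    switch k; handoff[i][j] = reduced handoff cost per time unit between
    cells i and j; load[i] = call volume of cell i; capacity[k] = M_k."""
    m, n = len(capacity), len(assign)
    # capacity constraint (3): total load per switch must not exceed M_k
    used = [0.0] * m
    for i, k in enumerate(assign):
        used[k] += load[i]
    if any(used[k] > capacity[k] for k in range(m)):
        return float("inf")               # infeasible candidate
    cabling = sum(cable[i][assign[i]] for i in range(n))
    complex_handoff = sum(handoff[i][j]
                          for i in range(n) for j in range(n)
                          if i != j and assign[i] != assign[j])
    return cabling + complex_handoff
```

A heuristic or evolutionary search would minimise this function over candidate assignment vectors.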
The total cost includes two types of cost, namely cost of handoff between
two adjacent cells, and cost of cabling between cells and switches. The design
is to be optimized subject to the constraint that the call volume of each
switch must not exceed its call handling capacity. This kind of problem is
NP-hard, so enumerative searches are practically inappropriate for moderate-
and large-sized cellular mobile networks [27].
Merchant and Sengupta [27, 28] studied this assignment problem. Their algorithm starts from an initial solution, which they attempt to improve through a series of greedy moves, while avoiding becoming stranded in a local minimum. The moves used to escape a local minimum explore only a very limited set of options; they depend on the initial solution and do not necessarily lead to a good final solution.
In [46], an engineering cost model has been proposed to estimate the cost
of providing personal communications services in a new residential develop-
ment. The cost model estimated the costs of building and operating a new
PCS using existing infrastructure such as the telephone, cable television and
cellular networks. In [14], economic aspects of configuring cellular networks
are presented. Major components of costs and revenues as well as the ma-
jor stakeholders were identified and a model was developed to determine the
system configuration (e.g., cell size, number of channels, link cost, etc.). For
example, in a large cellular network, it is impossible for a cell located in east
America to be assigned to a switch located in west America. In this case,
the variable link cost is ∞. The geographical relationships between cells and
switches are considered in the value of the cost of linking, so that the base
station of a cell is generally assigned to a neighbouring switch rather than to
a distant one [60]. In [15], different methods have been proposed to estimate the
handoff rate in PCS and the economic impacts of mobility on system configu-
ration decisions (e.g., annual maintenance and operations, channel cost, etc.).
The cost model used in this chapter is based on [14, 15, 46].
On the Design of Large-scale Cellular Mobile Networks 359
3 Memetic Approach
In the field of combinatorial optimization, it has been shown that com-
bining evolutionary algorithms with problem-specific heuristics can lead to
highly effective approaches [10, 48]. These hybrid evolutionary algorithms
combine the advantages of efficient heuristics incorporating domain knowl-
edge and population-based search approaches [32, 34]. For further details on
evolutionary algorithms, see [55, 56].
Genetic algorithms (GA) are robust search techniques based on natural selection
and genetic reproduction mechanisms. GAs perform a search by evolving
a population of candidate solutions through non-deterministic operators and
by incrementally improving the individual solutions forming the population
using mechanisms inspired from natural genetics and heredity (e.g., selection,
crossover and mutation). In many cases, especially with problems character-
ized by many local optima (graph coloring, travelling salesman, network design
problems, etc.), traditional optimization techniques fail to find high quality
solutions. GAs can be considered as an efficient and interesting option [22, 52].
GAs [18] are composed of a first step (initialization of the population)
and a loop. In each loop step, called generation, the population is altered
through selection and variation operators, then the resulting individuals are
evaluated. It is hoped that the last generation will contain a good solution,
but this solution is not necessarily the optimum [9].
Crossover is a process by which two chosen string genes are interchanged.
To execute the crossover, strings of the mating pool are coupled at random.
The crossover of a string pair of length l is performed as follows: a position i
is chosen uniformly between 1 and (l − 1), then two new strings are created
by exchanging all values between positions (i + 1) and l of each string of the
pair considered.
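The crossover just described can be sketched in Python (an illustrative sketch; the function name and string encoding are our own assumptions):

```python
import random

def one_point_crossover(a, b, i=None):
    """One-point crossover: choose a cut position i uniformly in
    [1, l-1], then exchange all values between positions i+1 and l
    of the two strings, creating two new strings."""
    assert len(a) == len(b)
    if i is None:
        i = random.randint(1, len(a) - 1)  # 1 <= i <= l-1
    return a[:i] + b[i:], b[:i] + a[i:]
```

For instance, crossing 2111321 and 1321212 at position i = 3 yields 2111212 and 1321321.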
Mutation is the process by which a randomly chosen bit in a chromosome
is flipped. It is employed to introduce new information into the population
and also to prevent the population from becoming saturated with similar
chromosomes (premature convergence). Large mutation rates increase the
probability that good schemata are destroyed, but they also increase population
diversity. A schema is a subset of chromosomes which are identical in certain fixed
positions [18, 22].
The next generation of chromosomes is generated from the present population
by selection and reproduction. The selection process is based on the fitness
of the present population, such that the fitter chromosomes contribute more
to the reproductive pool; typically this is also done probabilistically. For a
selection, inheritance is required; the offspring must retain at least some of
the features that made their parents fitter than average [18].
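Fitness-proportional selection is typically realized as a roulette wheel; the following sketch (our own minimal version, assuming positive fitness values) draws a chromosome with probability proportional to its fitness:

```python
import random

def roulette_select(population, fitness):
    """Roulette-wheel selection: spin a wheel whose slots are sized
    by fitness; fitter chromosomes are drawn more often."""
    r = random.uniform(0, sum(fitness))
    acc = 0.0
    for chrom, f in zip(population, fitness):
        acc += f
        if acc >= r:
            return chrom
    return population[-1]  # guard against floating-point round-off
```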
360 A. Quintero and S. Pierre
These mechanisms are subject to many variations, which gives tabu search its
meta-heuristic nature. The tabu list is not always a list of solutions, but can
be a list of forbidden moves/perturbations [16, 45].
Tabu search is a hill-climber endowed with a tabu list (a list of solutions or
moves) [54]. Let Xi denote the current point and let N (Xi ) denote all admissible
neighbors of Xi , where Y is an admissible neighbor of Xi if Y is obtained
from Xi through a single move not in the tabu list, and Y does not itself belong
to the tabu list. The tabu list is updated as Xi is replaced with the best point
in N (Xi ); the search stops after nbmax steps or if N (Xi ) is empty.
Other mechanisms of tabu search are intensification and diversification:
by the intensification mechanism, the algorithm does a more comprehensive
exploration of attractive regions which may lead to a local optimal point;
by the diversification mechanism, on the other hand, the search is moved to
previously unvisited regions, something that is important in order to avoid
getting trapped into local minimum points [16].
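The scheme above can be condensed into a short skeleton (an illustrative sketch over an abstract neighborhood function, not the exact procedure used later in this chapter):

```python
from collections import deque

def tabu_search(x0, neighbors, cost, nbmax=100, tabu_len=7):
    """Hill-climber with a tabu list: move to the best admissible
    neighbor, forbid returns to recently visited points, and stop
    after nbmax steps or when no admissible neighbor remains."""
    tabu = deque([x0], maxlen=tabu_len)
    best = current = x0
    for _ in range(nbmax):
        candidates = [y for y in neighbors(current) if y not in tabu]
        if not candidates:          # N(Xi) is empty
            break
        current = min(candidates, key=cost)
        tabu.append(current)
        if cost(current) < cost(best):
            best = current
    return best
```

Note that the best admissible neighbor is accepted even when it is worse than the current point, which is precisely what lets the search leave a local minimum.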
Canonical genetic algorithms are powerful and perform well on a broad class
of problems. However, part of the biological and cultural analogies used to
motivate a genetic algorithm search is inherently parallel.
One approach is the partitioning of the population into several subpopulations
(multi-population approach) [57]. The evolution of each subpopulation
is handled independently of the others, which helps maintain genetic diversity.
Diversity is the term used to describe the relative uniqueness of each individ-
ual in the population. From time to time, there is however some interchange
of genetic material between different subpopulations. This exchange of indi-
viduals is called migration [37]. Sometimes a topology is introduced on the
population, so that individuals can only interact with nearby chromosomes in
their neighborhood [20].
The parallel implementation of the migration model not only shows a
speedup in computation time, but also needs fewer objective function evaluations
when compared to a single-population algorithm for some classes of
problems [55]. Cohoon et al. [5] present results in which parallel algorithms
with migration found better solutions than a sequential GA for optimization
problems, and Lienig [26] indicates that parallel genetic algorithms in isolated
evolving subpopulations with migrations may offer advantages over sequential
approaches.
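A single migration step of such an island model can be sketched as follows; the ring topology, the minimization convention and the replacement of the worst individuals are assumptions of this illustration:

```python
def migrate(islands, rate=1):
    """Ring-topology migration: each island sends clones of its best
    'rate' individuals to the next island, which drops its worst ones.
    Individuals are plain numbers here; lower is better."""
    clones = [sorted(isl)[:rate] for isl in islands]     # emigrants stay home too
    migrated = []
    for i, isl in enumerate(islands):
        incoming = clones[(i - 1) % len(islands)]        # from the previous island
        kept = sorted(isl)[:len(isl) - rate]             # drop the worst
        migrated.append(kept + incoming)
    return migrated
```

Because the emigrants are clones, the sending island keeps its best individuals, matching the behaviour used later in this chapter.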
The migration algorithm is controlled by many parameters that affect its
efficiency and accuracy. Among other things, one must decide the number and
the size of the populations, the rate of the migration, the migration interval
and the destination of the migrants. The migration interval is the number of
generations between each migration, and the migration rate is the number
of individuals selected for migration. An important property of the architec-
ture used between the demes is its degree, which is the number of neighbors
4 Implementation Details
4.1 Memetic Algorithm Implementation
[Mutation changes one randomly chosen gene: A = 2111321 becomes A′ = 2113321.
Crossover cuts A = 211|1321 and B = 132|1212 after position 3 and exchanges
the tails, giving A′ = 2111212 and B′ = 1321321.]
Fig. 2. An example of mutation and crossover
The choice of the candidates is based on the evaluation function given by:
f = \sum_{i=1}^{n}\sum_{k=1}^{m} c_{ik}\,x_{ik} \;+\; \sum_{i=1}^{n}\sum_{\substack{j=1\\ j\neq i}}^{n} h_{ij}\,(1-y_{ij}) \qquad (12)
[Fig. 3: migration between subpopulations; the Pr worst individuals are replaced
by immigrant clones (before/after the exchange).]
This section presents the implementation details of the local refinement strategy
used to improve the individuals representing solutions provided by genetic
algorithms: tabu search.
To solve the assignment problem with tabu search, we have chosen a
search domain free from capacity constraints on the switches, but respect-
ing the constraints of unique assignment of cells to switches. We associate
with each solution two values: the first one is the intrinsic cost of the solution,
which is calculated from the objective function; the second is the evaluation
of the solution, which takes into account the cost and the penalty for not
respecting the capacity constraints. At each step, the solution that has the
best evaluation is chosen. Once an initial solution is built from the problem
Initialize_population(gen)
Evaluate_population(gen)
for each individual i ∈ gen do
    i = tabu_search(i)
end for
while not terminated do
    repeat
        select two individuals i, j ∈ gen
        apply Crossover(i, j), giving children c
        c = tabu_search(c)
        add children c to newgen
    until crossover = false
    for each individual i ∈ gen do
        if probability_mutation then
            i = tabu_search(Mutation(i))
            add i to newgen
        end if
    end for
    gen = Select_elitist(newgen)
    begin migration
        if migration appropriate then
            choose emigrants from population(gen)
            send clones of emigrants
        end if
        if immigrants available then
            Im = receive immigrants
        end if
    end migration
    gen = Select_elitist(Im ∪ gen)
end while
Fig. 4. Multi-population memetic algorithm with migration and elitism
data, the short-term memory component attempts to improve it, while avoiding
cycles. The middle-term memory component seeks to intensify the search
in specified neighbourhoods, while the long-term memory aims at diversifying
the exploration area.
The neighbourhood N (S) of a solution S is defined by all the solutions that
are accessible from S by applying a move a → b to S. a → b is defined as re-
assignment of cell a to switch b. To evaluate the solutions in the neighbourhood
N (S), we define the gain GS (a, b) associated to the move a → b and to the
solution S by:
G_S(a, b) =
\begin{cases}
\displaystyle\sum_{\substack{i=1\\ i\neq a}}^{n} (h_{ai}+h_{ia})\,x_{ib_0} \;-\; \sum_{\substack{i=1\\ i\neq a}}^{n} (h_{ai}+h_{ia})\,x_{ib} \;+\; c_{ab} - c_{ab_0} & \text{if } b \neq b_0 \\[6pt]
M & \text{otherwise}
\end{cases} \qquad (13)
where:
• hij refers to the handoff cost between cells i and j;
• b0 is the switch of cell a in solution S, that is, before the application of
move a → b;
• xik takes value 1 if cell i is assigned to switch k, 0 otherwise;
• cik is the cost of linking cell i to switch k;
• M is an arbitrary large number.
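With these definitions, the gain of Eq. (13) can be evaluated directly; representing S as a cell-to-switch array and h, c as matrices is an assumption of this sketch:

```python
def gain(S, a, b, h, c, M=10**9):
    """Gain G_S(a, b) of re-assigning cell a from its current switch
    b0 = S[a] to switch b. h[i][j] is the handoff cost between cells
    i and j, c[i][k] the cost of linking cell i to switch k."""
    b0 = S[a]
    if b == b0:
        return M                       # the move changes nothing
    g = 0
    for i in range(len(S)):
        if i == a:
            continue
        w = h[a][i] + h[i][a]
        if S[i] == b0:                 # x_{i,b0} = 1
            g += w
        if S[i] == b:                  # x_{i,b} = 1
            g -= w
    return g + c[a][b] - c[a][b0]
```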
The short-term memory moves iteratively from one solution to another,
by applying moves, while prohibiting a return to the k latest visited solutions.
It starts with an initial solution, obtained simply by assigning each cell to the
closest switch, according to a Euclidean distance metric. The objective of this
memory component is to improve the current solution, either by diminishing
its cost or by diminishing the penalties.
The middle-term memory component tries to intensify the search in
promising regions. It is introduced after the end of the short-term memory
component and allows a return to solutions we may have omitted. It mainly
consists in defining the regions of intensified search, and then choosing the
types of move to be applied.
To diversify the search, we use a long-term memory structure in order to
guide the search towards regions that have not been explored. This is often
done by generating new initial solutions. In this case, a table n×m (where n is
the number of cells and m the number of switches) counts, for each link (a, b),
the number of times this link appears in the visited solutions. A new initial
solution is generated by choosing, for each cell a, the least visited link (a, b).
Solutions visited during the intensification phase are not taken into account
because they result from different types of moves than those applied in the
short- and long-term memory components.
distance separating both [40]. The call rate γi of a cell i follows a gamma
distribution with mean and variance equal to one. The call durations inside the
cells are distributed according to an exponential law with parameter equal to
1 [8]. If a cell j has k neighbors, the [0,1] interval is divided into k + 1 sub-
intervals by choosing k random numbers distributed evenly between 0 and 1.
At the end of the service period in cell j, the call could be either transferred
to the ith neighbour (i = 1, . . . , k) with a handoff probability rij equal to the
length of ith interval, or ended with a probability equal to the length of the
k + 1th interval. To find the call volumes and the rates of coherent handoff,
the cells are considered as M/M/1 queues forming a Jackson network [25].
The incoming rates αi in the cells are obtained by solving the following system:

\alpha_i \;-\; \sum_{j=1}^{n} \alpha_j\, r_{ij} \;=\; \gamma_i, \qquad i = 1, \ldots, n
If the incoming rate αi is greater than the service rate, the distribution is
rejected and chosen again. The handoff rate hij is defined by:

h_{ij} = \lambda_i\, r_{ij}

The capacity M of each switch is set to:

M = \frac{1}{m}\left(1 + \frac{K}{100}\right) \sum_{i=1}^{n} \lambda_i

where K is uniformly chosen between 10 and 50, which ensures a global excess
of 10 to 50% of the switches' capacity compared to the cells' call volume.
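The system above is linear and can be solved directly; a sketch with NumPy, where identifying λi with the incoming rate αi is our assumption:

```python
import numpy as np

def incoming_rates(R, gamma):
    """Solve alpha_i - sum_j alpha_j * r_ij = gamma_i,
    i.e. the linear system (I - R) alpha = gamma."""
    return np.linalg.solve(np.eye(len(gamma)) - R, gamma)

def handoff_rates(alpha, R):
    """h_ij = lambda_i * r_ij, taking lambda_i = alpha_i."""
    return alpha[:, None] * R
```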
In the second step, we generate an initial population of 100 chromosomes.
In the third step, we evaluate each chromosome with the objective function,
which allows us to deduce its capacity value. Finally, in the last step, the cycle
of generations of the populations begins, each new population replacing the
previous one. The number of generations is initially set to 400.
To determine the number of subpopulations in parallel, MA was executed
over a set of 600 cases with 3 instances of problem in series of 20 runs for each
assignment pattern, with a number of populations varying between 1 and 10.
This experiment shows that MA converges to good solutions with a number
of populations varying between 7 and 10, as shown in Figure 5.
To define the population size, MA was executed over a set of 600 cases
with 3 instances of problem in series of 20 runs for each assignment pattern
with 8 populations. This experiment shows that MA converges to good
solutions with a population size varying between 80 and 140.
The values used by MA are: the number of generations is 400; the pop-
ulation size is 100; the number of populations is 8 for MA; the crossover
probability is 0.9; the mutation probability is 0.08; the migration interval
(Pm ) is 0.1; the migration rate (Sm ) is 0.4 and the emigrants accepted (Pr )
[Fig. 5: evaluation cost as a function of the number of populations in parallel
(1 to 10), for instances with 5 switches/100 cells, 6 switches/150 cells and
7 switches/200 cells.]
[Fig. 6: fitness value as a function of the number of generations (1 to 401),
for 0, 20, 40, 60 and 80 migrants.]
Merchant and Sengupta [27] have designed a heuristic, which we call H, for
solving the cell assignment problem. Pierre and Houéto [40] have used tabu
search (TS) for solving the same problem.
We compare TS and heuristic H with MA. For the experiments, we used
a number of cells varying between 100 and 1000, and a number of switches
varying between 5 and 10, which means the search space size is between 5^100
and 10^1000. In all our tests, the total number of evaluations remained the
same.
The three heuristics always find feasible solutions. However, this only
establishes the feasibility of the obtained results, without demonstrating
whether these solutions are among the best. Figure 7 shows the results
obtained for 5 different instances of the problem: 5 switches and 100 cells,
6 switches and 150 cells, 7 switches and 200 cells, 8 switches and 500 cells,
and 10 switches and 1000 cells. For each instance, we tested 16 different cases
whose evaluation costs represent the average over 100 runs of each algorithm.
[Fig. 7, panels (a)-(c): evaluation cost comparison of the memetic algorithm,
tabu search and heuristic H over 16 test cases; panel (c) corresponds to the
instance with 7 switches and 200 cells.]
[Fig. 7(d): evaluation cost comparison of the memetic algorithm, tabu search
and heuristic H over 16 test cases, for the instance with 8 switches and 500
cells.]
[Fig. 7(e): evaluation cost comparison of the memetic algorithm, tabu search
and heuristic H over 16 test cases, for the instance with 10 switches and 1000
cells.]
The three heuristics always find feasible solutions with objective values
close to the optimum solution. In all the considered series of tests, MA yields
an improvement in the cost function in comparison with the other two
heuristics. In terms of evaluation fitness, MA provides better results than
tabu search and heuristic H. Table 2 summarizes the results. Nevertheless,
given the initial link, handoff and annual maintenance costs for large-sized
cellular mobile networks (in the order of hundreds of millions of dollars),
this small improvement represents a large reduction in costs over a 10-year
period, in the order of millions of dollars. For example, in a cellular network
composed of 300 cells, with an initial link and handoff cost of $350,000 for each
cell, an improvement of 2% in the cost function represents an approximate
saving of $2M over 10 years.
In terms of CPU time, for a large number of cells, TS is a bit faster
than heuristic H. Conversely, for problems of smaller size, TS is a bit slower.
MA is slower than heuristic H and TS. However, this is not an important
fact, because this heuristic is used in the design and planning phase of cellular
mobile networks.

Table 3. Relative distances between the MA solutions and the lower bound

Instance of problem      5 switches  6 switches  7 switches  8 switches  10 switches
                         100 cells   150 cells   200 cells   500 cells   1000 cells
MA's mean distance (%)   1.35        1.94        2.19        3.97        4.84
which is the link cost of the solution obtained by assigning each cell i to the
nearest switch k. This lower bound does not take handoff costs into account.
In fact, we suppose that the capacity constraint is relaxed and that all cells
could be assigned to a single switch. Thus, we have a lower bound whatever
the values of Mk and λi . Table 3 summarizes the results.
MA gives good solutions in comparison with the lower bound. Note that
the lower bound does not include handoff costs and therefore no solution could
equal the lower bound.
6 Conclusion
In this chapter, we proposed a multi-population memetic algorithm with
elitism (MA) to design large-scale cellular mobile networks, and specifically
to solve the problem of assigning cells to switches in cellular mobile networks.
To select the offspring of the new generation, we have used the concept of
elitism; according to this concept, only the lowest-ranked string is deleted and
the best string is automatically kept in the population. Also, the migrants
are clones, and they are not deleted from their original populations. The local
refinement strategy used with our memetic algorithm is tabu search. Exper-
iments have been conducted to measure the quality of solutions provided by
this algorithm.
To evaluate the performance of this approach, we defined two lower bounds
for the global optimum, which are used as references to judge the quality of
the obtained solutions. Generally, the results are sufficiently close to the global
optimum.
References
[1] Beaubrun R, Pierre S, Conan J (1999) An efficient method for optimizing the
assignment of cells to MSCs in PCS networks. In: Proceedings of the eleventh
international conference on wireless communication, wireless 99, vol 1. Calgary
(AB), July 1999, pp 259–265
[2] Bhattacharjee P, Saha D, Mukherjee A (1999) Heuristics for assignment of cells
to switches in a PCSN: a comparative study. In: International conference on
personal wireless communications, Jaipur, India, February 1999, pp 331–334
[3] Cantu-Paz E (2000) Efficient and accurate parallel genetic algorithms. Kluwer
Academic, Dordrecht
[4] Ching-Hung W, Tzung-Pei H, Shian-Shyong T (1998) Integrating fuzzy knowl-
edge by genetic algorithms. IEEE Trans Evol Comput 2(4):138–149
[5] Cohoon J, Martin W, Richards D (1991) A multi-population genetic algorithm
for solving the K-partition problem on hyper-cubes. In: Proceedings of the
fourth international conference on genetic algorithms, pp 244–248
[6] Costa D (1995) An evolutionary Tabu Search algorithm and the NHL scheduling
problem. INFOR 33(3):161–178
[7] Demirkol I, Ersoy C, Caglayan MU, Delic H (2001) Location area planning
in cellular networks using simulated annealing. In: Proceedings of IEEE-
INFOCOM 2001, vol 1, 2001, pp 13–20
[8] Fang Y, Chlamtac I, Lin Y (1997) Modeling PCS networks under general call
holding time and cell residence time distributions. IEEE/ACM Trans Network
5(6):893–905
[9] Fogel D (1995) Evolutionary computation. Piscataway, NJ
[10] Fogel D (1995) Evolutionary computation: toward a new philosophy of machine
intelligence. IEEE, New York
[11] Fogel D (1999) An overview of evolutionary programming. Springer-Verlag,
Berlin Heidelberg New York, pp 89–109
[12] Fogel D (1999) An introduction to evolutionary computation and some
applications. Wiley, Chichester, UK
[13] Forrest S, Mitchell M (1999) What makes a problem hard for a genetic al-
gorithm? Some anomalous results and their explanation. Machine Learning
13(2):285–319
[14] Gavish B, Sridhar S (1995) Economic aspects of configuring cellular networks.
Wireless Netw 1(1):115–128
[15] Gavish B, Sridhar S (2001) The impact of mobility on cellular network
configuration. Wireless Netw 7(1):173–185
[59] Vavak F, Fogarty T (1996) Comparison of steady state and generational genetic
algorithms for use in non stationary environments. In: Proceedings of IEEE
international conference on evolutionary computation, pp 192–195
[60] Wheatly C (1995) Trading coverage for capacity in cellular systems: a system
perspective. Microwave J 38(7):62–76
A Hybrid Cellular Genetic Algorithm for the
Capacitated Vehicle Routing Problem
Summary. Cellular genetic algorithms (cGAs) are a kind of genetic algorithm (GA)
– a population-based heuristic – with a structured population, so that each individual
can only interact with its neighbors. The existence of small overlapped neighbor-
hoods in this decentralized population provides both diversity and opportunities
for exploration, while the exploitation of the search space is strengthened inside
each neighborhood. This balance between intensification and diversification makes
cGAs naturally suitable for solving complex problems. In this chapter, we solve a
large benchmark (composed of 160 instances) of the Capacitated Vehicle Routing
Problem (CVRP) with a cGA hybridized with a problem customized recombination
operation, an advanced mutation operator integrating three mutation methods, and
two well-known local search algorithms for routing problems. The studied test-suite
contains almost every existing instance for CVRP in the literature. In this work, the
best-so-far solution is found (or even improved) in 80% of the tested instances (126
out of 160), and in the other cases (20%, i.e. 34 out of 160) the deviation between our
best solution and the best-known one is always very low (under 2.90%). Moreover,
9 new best-known solutions have been found.
1 Introduction
Transportation plays an important role in logistic tasks of many companies
since it usually accounts for a high percentage of the value added to goods.
Therefore, the use of computerized methods in transportation often results in
significant savings of up to 20% of the total costs (see Chap. 1 in [1]).
A distinguished problem in the field of transportation consists in finding
the optimal routes for a fleet of vehicles which serve a set of clients. In this
problem, an arbitrary set of clients must receive goods from a central depot.
This general scenario presents many chances for defining (related) problem
scenarios: determining the optimal number of vehicles, finding the shortest
routes, and so on, all of them are subject to many restrictions like vehicle
capacity, time windows for deliveries, etc. This variety of scenarios leads to a
plethora of problem variants in practice. Some reference case studies where the
E. Alba and B. Dorronsoro: A Hybrid Cellular Genetic Algorithm for the Capacitated Vehicle
Routing Problem, Studies in Computational Intelligence (SCI) 82, 379–422 (2008)
www.springerlink.com
c Springer-Verlag Berlin Heidelberg 2008
380 E. Alba and B. Dorronsoro
view of this problem not biased by any ad hoc selection of individual instances.
The included instances are characterized by many different features: instances
from real world, theoretical ones, clustered, not clustered, with homogeneous
and heterogeneous demands on customers, with the existence of drop times
or not, etc.
In recent VRP history, there has been a steady evolution in the quality
of the solution methodologies used for this problem, borrowed both from the
exact and the heuristic fields of research. However, due to the difficulty of
the problem, no known exact method is capable of consistently solving to op-
timality instances involving more than 50 customers [1, 5]. In fact, it is also
clear that, as for most complex problems, non-customized heuristics would
not compete in terms of the solution quality with present techniques like the
ones described in [1, 6]. Moreover, the power of exploration of some modern
techniques like Genetic Algorithms or Ant Systems is not yet fully explored,
specially when combined with an effective local search step. All these con-
siderations could allow us to refine solutions to optimality. Particularly, we
present in this chapter a Cellular Genetic Algorithm (cGA) [7] hybridized with
specialized recombination and mutation operators and with a local search al-
gorithm for solving CVRP.
Genetic Algorithms (GAs) are heuristics based on an analogy with biolog-
ical evolution. A population of individuals representing tentative solutions is
maintained. New individuals are produced by combining members of the pop-
ulation, and they replace existing individuals according to some given policy.
In order to induce a lower selection pressure (larger exploration) compared to
that of panmictic GAs (here panmictic means that an individual may mate
with any other in the population – see Fig. 2a), the population can be decen-
tralized by structuring it in some way [8] (see Figs. 2b and 2c). Cellular GAs
are a subclass of GAs in which the population is structured in a specified topol-
ogy (usually a toroidal mesh of dimensions d = 1, 2 or 3), so that individuals
may only interact with their neighbors (see Fig. 2c). The pursued effect is to
improve on the diversity and exploration capabilities of the algorithm (due to
the presence of overlapped small neighborhoods) while still admitting an easy
Fig. 2. (a) Panmictic GA has all its individual black points in the same population.
Structuring the population usually leads to distinguish between (b) distributed, and
(c) cellular GAs
\mathrm{Cost}(R_i) = \sum_{j=0}^{k} c_{j,j+1} + \sum_{j=0}^{k} \delta_j, \qquad (1)

F_{\mathrm{CVRP}}(S) = \sum_{i=1}^{m} \mathrm{Cost}(R_i). \qquad (2)
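The costs of Eqs. (1) and (2) can be sketched as follows, assuming each route lists the depot as its first and last stop and that the drop times δ are given per visited node (conventions of this illustration, not the chapter):

```python
def route_cost(route, c, drop):
    """Cost of one route (Eq. 1): travel costs c[j][j+1] along
    consecutive stops plus the drop times of the visited nodes."""
    travel = sum(c[route[j]][route[j + 1]] for j in range(len(route) - 1))
    return travel + sum(drop[v] for v in route)

def fcvrp(routes, c, drop):
    """Total cost of a solution (Eq. 2): sum over its m routes."""
    return sum(route_cost(r, c, drop) for r in routes)
```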
tournament selection (BT) inside this neighborhood (line 6), while the other
parent will be the current individual itself (line 7). The two parents can be
the same individual (selection with replacement) or not. The algorithm iteratively
considers as “current” each individual in the grid. Genetic operators are applied
to the individuals in lines 8 and 9 to increasingly improve on the average
fitness of individuals in the grid also on a neighborhood basis (explained in
Sects. 3.2 and 3.3). We add to this basic cGA a local search technique in line
10 consisting in applying 2-Opt and 1-Interchange, which are well-known lo-
cal optimization methods (see Sect. 3.4). After applying these operators, we
evaluate the fitness value of the new individual (line 11), and insert it in the
new (auxiliary) population – line 12 – only if its fitness value is larger than
that of the parent located at that position in the current population (elitist
replacement).
After applying the above-mentioned operators to the individuals, we replace
the old population with the new one at once (line 15), and we then calculate
some statistics (line 16). It can be noticed that new individuals replace
the old ones en bloc (synchronously) and not incrementally (see [22] for other
replacement techniques). The algorithm stops when an optimal solution is
found or when an a priori predetermined maximum number of generations is
reached.
The fitness value assigned to every individual is computed as follows
[14, 15]:
f (S) = FCVRP (S) + λ · overcap(S) + µ · overtm(S), (3)
feval (S) = fmax − f (S). (4)
The objective of our algorithm is to maximize feval (S) (4) by minimizing
f (S) (3). The value fmax must be larger than or equal to that of
the worst feasible solution for the problem. Function f (S) is computed by
adding the total costs of all the routes (FCVRP (S) – see (2) –), and penalizes
the fitness value only in the case that the capacity of any vehicle and/or
A Hybrid cGA for the CVRP 385
the time of any route are exceeded. Functions ‘overcap(S)’ and ‘overtm(S)’
return the overhead in capacity and time of the solution (respectively) with
respect to the maximum allowed value of each route. These values returned
by ‘overcap(S)’ and ‘overtm(S)’ are weighted by multiplying them by factors
λ and µ, respectively. In this work we have used λ = µ = 1000 [23].
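The penalized fitness of Eqs. (3) and (4) can be sketched as follows; the per-route computation of load and duration is an assumption of this illustration:

```python
def fitness(routes, c, drop, capacity, max_time, demand,
            lam=1000, mu=1000, f_max=10**6):
    """f(S) = route costs + lam * overcap(S) + mu * overtm(S);
    the returned value is feval(S) = f_max - f(S), to be maximized."""
    cost = overcap = overtm = 0
    for r in routes:
        travel = sum(c[r[j]][r[j + 1]] for j in range(len(r) - 1))
        service = sum(drop[v] for v in r)
        cost += travel + service
        load = sum(demand[v] for v in r)
        overcap += max(0, load - capacity)             # capacity overhead
        overtm += max(0, travel + service - max_time)  # time overhead
    return f_max - (cost + lam * overcap + mu * overtm)
```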
In Sects. 3.1 to 3.4 we proceed to explain in detail the main features that
characterize our algorithm (JCell). The algorithm itself can be applied with
all the mentioned operations and also applying only some of them to analyze
their separate contribution to the performance of the search.
3.2 Recombination
3.3 Mutation
The mutation operator we use in our algorithms will play an important role
during the evolution since it is in charge of introducing a considerable degree
of diversity in each generation, counteracting in this way the strong selective
pressure which is a result of the local search method we plan to use. The
mutation consists in applying Insertion, Swap or Inversion operations to each
gene with equal probability (see Algorithm 3.2).
These three mutation operators (see Fig. 5) are well-known methods
found in the literature, and typically applied sooner than later in routing
problems. Our idea here has been to merge these three in a new com-
bined operator. The Insertion operator [25] selects a gene (either customer
or route splitter) and inserts it in another randomly selected place of the
same individual. Swap [26] consists in randomly selecting two genes in a
solution and exchanging them. Finally, Inversion [27] reverses the visiting
order of the genes between two randomly selected points of the permuta-
tion. Note that the induced changes might occur in an intra or inter-route
way in all the three operators. Formally stated, given a potential solution
S = {s1 , . . . , sp−1 , sp , sp+1 , . . . , sq−1 , sq , sq+1 , . . . , sn }, where p and q are ran-
domly selected indexes, and n is the sum of the number of customers plus the
number of route splitters (n = c + k), then the new solution S obtained after
applying each of the different proposed mechanisms is shown below:
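As an illustration of the three mechanisms (with our own parameterization; passing p and q explicitly makes the moves deterministic):

```python
import random

def insertion(s, p=None, q=None):
    """Remove the gene at position p and re-insert it at position q."""
    s = list(s)
    p = random.randrange(len(s)) if p is None else p
    q = random.randrange(len(s)) if q is None else q
    s.insert(q, s.pop(p))
    return s

def swap(s, p=None, q=None):
    """Exchange the genes at positions p and q."""
    s = list(s)
    p = random.randrange(len(s)) if p is None else p
    q = random.randrange(len(s)) if q is None else q
    s[p], s[q] = s[q], s[p]
    return s

def inversion(s, p=None, q=None):
    """Reverse the visiting order of the genes between p and q."""
    s = list(s)
    p = random.randrange(len(s)) if p is None else p
    q = random.randrange(len(s)) if q is None else q
    if p > q:
        p, q = q, p
    return s[:p] + s[p:q + 1][::-1] + s[q + 1:]

def mutate(s):
    """Combined operator: one of the three, with equal probability."""
    return random.choice([insertion, swap, inversion])(s)
```

All three preserve the multiset of genes, so a mutated individual remains a permutation of customers and route splitters.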
It is clear from the existing literature on VRP that the use of a
local search method is almost mandatory to achieve results of high quality [14,
17, 28]. This is why we envisioned from the beginning the application of two of
the most successful techniques in recent years. In effect, we will add a local
refining step to some of our algorithms consisting in applying 2-Opt [29] and
1-Interchange [30] local optimization to every individual.
Fig. 6. 2-Opt works into a route (a), while λ-Interchange affects two routes (b)
On the one hand, the simple 2-Opt local search method works inside each
route. It randomly selects two non-adjacent edges (say, (a, b) and (c, d)) of a
single route, deletes them, thus breaking the tour into two parts, and then
reconnects those parts in the only other possible way: (a, c) and (b, d) (Fig. 6a).
Hence, given a route R = {r1, ..., ra, rb, ..., rc, rd, ..., rn}, with (ra, rb) and
(rc, rd) two randomly selected non-adjacent edges, the new route R′ obtained
after applying the 2-Opt method to the two considered edges will be R′ =
{r1, ..., ra, rc, ..., rb, rd, ..., rn}.
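The 2-Opt move therefore amounts to reversing the segment between the two removed edges. A minimal index-based sketch (with i and j marking the first endpoint of each removed edge) could look like:

```python
def two_opt_move(route, i, j):
    """2-Opt: remove edges (route[i], route[i+1]) and (route[j], route[j+1]),
    then reconnect by reversing the segment between them."""
    # the two edges must be non-adjacent and inside the route
    assert 0 <= i and i + 1 < j < len(route) - 1
    return route[:i + 1] + route[i + 1:j + 1][::-1] + route[j + 1:]

# example: removing edges (2, 3) and (4, 5) yields [1, 2, 4, 3, 5, 6]
print(two_opt_move([1, 2, 3, 4, 5, 6], 1, 3))
```

The full local search would try this move for all pairs of non-adjacent edges in every route and keep any improving result.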
On the other hand, the λ-Interchange local optimization method that we
use is based on the analysis of all the possible combinations of up to λ
customers between pairs of routes (Fig. 6b). Hence, this method results in
customers either being shifted from one route to another, or being exchanged
between routes. The mechanism can be described as follows. A solution to the
problem is represented by a set of routes S = {R1, ..., Rp, ..., Rq, ..., Rk},
where Ri is the set of customers serviced in route i. New neighboring solutions
can be obtained by applying λ-Interchange between a pair of routes Rp and
Rq: we replace each subset of customers S1 ⊆ Rp of size |S1| ≤ λ with any
other subset S2 ⊆ Rq of size |S2| ≤ λ. This way, we obtain two new routes
R′p = (Rp − S1) ∪ S2 and R′q = (Rq − S2) ∪ S1, which are part of the
new solution S′ = {R1, ..., R′p, ..., R′q, ..., Rk}.
Hence, 2-Opt searches for better solutions by modifying the order in which
customers are visited inside a route, while the λ-Interchange method results
in customers either being shifted from one route to another or exchanged
between routes. This local search step is applied to an individual after the
recombination and mutation operators, and returns the best solution among
those found by 2-Opt and 1-Interchange, or the current one if it is better
(see the pseudocode in Algorithm 3.3). In the local search step, the algorithm
applies 2-Opt to all the pairs of non-adjacent edges in every route, and
1-Interchange to all the subsets of up to one customer between every pair
of routes.
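For λ = 1 (the 1-Interchange used here), the neighborhood of a pair of routes can be enumerated as follows. This is an illustrative sketch in which the empty subset encodes a pure shift of a customer from one route to the other; capacity feasibility checks are omitted:

```python
def _subsets_up_to_one(route):
    """All subsets of a route with size <= 1 (the empty tuple and singletons)."""
    return [()] + [(c,) for c in route]

def one_interchange_neighbors(Rp, Rq):
    """Yield all neighbor route pairs obtained by moving or exchanging
    subsets of size <= 1 between routes Rp and Rq."""
    for S1 in _subsets_up_to_one(Rp):
        for S2 in _subsets_up_to_one(Rq):
            if not S1 and not S2:
                continue  # both subsets empty: no change
            new_p = [c for c in Rp if c not in S1] + list(S2)
            new_q = [c for c in Rq if c not in S2] + list(S1)
            yield new_p, new_q
```

For example, routes [1, 2] and [3] yield five neighbors: two shifts out of the first route, one shift out of the second, and two exchanges.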
In summary, the functioning of JCell is quite simple: in each generation,
an offspring is obtained for every individual by applying the recombination
operator (ERX) to its two parents (selected from its neighborhood).
The offspring are mutated with the special combined mutation, and then a
local post-optimization step is applied to the mutated individuals. This local
search step consists of applying two different local search methods (2-Opt and
1-Interchange) to the individual, and returns the best individual from among
the input individual and the outputs of 2-Opt and 1-Interchange. The
population of the next generation is composed of the current one after
replacing each individual with its offspring whenever the offspring is better.
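The generation just described can be summarized in schematic Python. This is a sketch under simplifying assumptions (minimization, and parent selection reduced to picking the two best neighbors), not the exact JCell implementation:

```python
def jcell_generation(pop, neighbors, recombine, mutate, local_search, fitness):
    """One synchronous generation of a cellular GA: for each individual,
    recombine two parents from its neighborhood, mutate, locally optimize,
    and replace the individual only if the offspring is better (minimization).
    `neighbors(i)` returns the indices forming individual i's neighborhood."""
    next_pop = []
    for i, ind in enumerate(pop):
        # simplified parent selection: the two best individuals in the neighborhood
        cands = sorted(neighbors(i), key=lambda j: fitness(pop[j]))
        p1, p2 = pop[cands[0]], pop[cands[1]]
        child = local_search(mutate(recombine(p1, p2)))
        # elitist replacement: keep the current individual unless the child is better
        next_pop.append(child if fitness(child) < fitness(ind) else ind)
    return next_pop
```

With the permutation operators sketched earlier plugged in as `recombine`, `mutate` and `local_search`, this loop reproduces the overall structure described above.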
For these experiments we selected the three smallest instances of the
benchmark, with and without drop times and maximum route length
restrictions (CMT-n51-k5, CMT-n76-k10 and CMT-n101-k8, and CMTd-n51-k6,
CMTd-n76-k11 and CMTd-n101-k9). In order to obtain a more complete
benchmark (and thus more reliable results), we additionally selected the
clustered instances CMT-n101-k10 and CMTd-n101-k11. The latter
(CMTd-n101-k11) is obtained by adding drop times and maximum route
lengths to CMT-n101-k10.
The algorithms studied in this section are compared by means of the average
total travelled length of the solutions found, the average number of
evaluations needed to reach those solutions, and the hit rate (percentage of
successful runs) over 30 independent runs. Hence, we are comparing them in
terms of accuracy, efficiency, and efficacy. To check the statistical reliability
of the comparisons, ANOVA tests are performed. The parameters used for all
the algorithms in this work are essentially those of Table 1, with some
exceptions mentioned in each case. The average values in the tables are shown
along with their standard deviation, and the best ones are shown in bold.
In this section we compare the behavior of our cellular GA, JCell2o1i, with a
generational panmictic one (genGA), in order to justify our decision to use a
structured cellular model for solving the CVRP. The same algorithm is run in
the two cases, the only difference being the neighborhood structure of the
cGA. As we will see, structuring the population is useful for improving the
algorithm in many domains, since population diversity is better maintained
than in a non-structured population. The results of this comparison are shown
in Table 2. As can be seen, the genGA is not able to find the optimum for half
of the tested instances (CMT-n76-k10, CMT-n101-k8, CMTd-n51-k6, and
CMTd-n101-k9), while JCell2o1i always finds it, with larger hit rates and
lower effort.
In Fig. 7 we graphically compare the two algorithms by plotting the dif-
ference of the values obtained (genGA minus cGA) in terms of the accuracy
–average solution length– (Fig. 7a), efficiency –average evaluations– (Fig. 7b),
[Fig. 7: per-instance differences (genGA minus cGA) in (a) average solution length, (b) average number of evaluations, and (c) hit rate (%)]
and efficacy –hit rate– (Fig. 7c). As can be seen, JCell2o1i outperforms the
generational GA on all three compared metrics (8 × 3 = 24 tests), except for
the hit rate on CMTd-n76-k11. The histogram of Fig. 7b is incomplete because
the genGA found no solutions for some instances (see the second column of
Table 2). After applying t-tests to the results obtained for the number of
evaluations and the solution distances, we conclude that the differences
between the two algorithms are always statistically significant (p-values
below 0.05).
These algorithms differ from each other only in the mutation method applied:
inversion (JCell2o1iinv), insertion (JCell2o1iins), swap (JCell2o1isw), or all
three combined (JCell2o1i) —for details on the mutation operators refer to
Sect. 3.3.
An interesting observation is that using a single mutation operator does not
yield a clear overall winner, i.e., no best performance of any of the three basic
cellular variants can be concluded. For example, using the inversion mutation
(JCell2o1iinv) we obtain very accurate solutions for most instances, although
it is less accurate for some others (i.e., instances CMT-n51-k5, CMT-n101-k8,
and CMTd-n76-k11). The same behavior is observed in terms of the average
number of evaluations and the hit rate; it is thus clear that, when a single
mutation operator is used, the best algorithm depends on the instance being
solved. Hence, as in the work of Chellapilla [31], whose algorithm
(evolutionary programming) was improved by combining two different
mutations, we decided to develop a new mutation composed of the three
proposed ones (called combined mutation) in order to exploit the behavior of
the best of them for each instance. As we can see in Table 3, this idea works:
JCell2o1i stands out as the best of the four compared methods. In terms of
the average number of evaluations, it is the best algorithm for all the tested
instances. Moreover, in terms of accuracy and hit rate, JCell2o1i is the
algorithm that finds the best values for the largest number of instances (see
the bold values for the three metrics).
In Fig. 8 we show a graphical evaluation of the four algorithms in terms
of accuracy, efficiency, and efficacy. JCell2o1i is, in general, more accurate
than the other three algorithms, although differences are very small (Fig. 8a).
In terms of the average evaluations made (efficiency), the algorithm using
the combined mutation largely outperforms the other three (Fig. 8b), while
the obtained values are slightly worse in the case of the hit rate only for
a few instances (Fig. 8c). After applying the ANOVA test to the results of
Fig. 8. Comparison of the behavior of the cGA using the three different mutations
independently and together in terms of (a) accuracy, (b) effort, and (c) efficacy
Table 4. Analysis of the effects of using local search on the JCell with combined
mutation
Inst.          | entries are: Avg. Dist. ± s.d. / Avg. Evals. ± s.d. / Hit (%)
               | JCell (no LS)       | JCell2o             | JCell1i                      | JCell2o1i                     | JCell2o2i
CMT-n51-k5     | 576.1±14.4 / — / 0  | 551.6±9.4 / — / 0   | 529.9±6.7 / 1.0E5±3.7E4 / 50 | 525.2±1.5 / 2.6E4±6.4E3 / 53  | 526.1±2.9 / 1.6E5±6.8E4 / 77
CMT-n76-k10    | 956.4±20.5 / — / 0  | 901.8±15.8 / — / 0  | 851.5±6.3 / — / 0            | 842.8±4.6 / 7.2E4±0.0 / 3     | 844.7±5.3 / 7.5E5±0.0 / 3
CMT-n101-k8    | 994.4±29.9 / — / 0  | 867.2±14.5 / — / 0  | 840.6±6.5 / — / 0            | 832.4±0.0 / 1.8E4±0.0 / 3     | 831.9±7.2 / 8.2E5±0.0 / 3
CMT-n101-k10   | 1053.6±43.5 / — / 0 | 949.6±29.0 / — / 0  | 822.8±5.9 / 2.3E5±4.2E5 / 10 | 820.9±4.9 / 7.1E4±1.3E4 / 90  | 823.3±7.6 / 1.2E5±2.7E4 / 17
CMTd-n51-k6    | 611.5±15.2 / — / 0  | 571.9±13.4 / — / 0  | 561.7±4.0 / 6.9E4±5.9E3 / 3  | 558.3±2.1 / 2.8E4±6.1E3 / 23  | 558.6±2.5 / 3.7E5±1.8E5 / 13
CMTd-n76-k11   | 1099.1±26.1 / — / 0 | 1002.6±21.7 / — / 0 | 924.0±8.2 / 2.0E5±0.0 / 3    | 918.50±7.4 / 6.5E4±2.6E4 / 7  | 917.7±9.1 / 7.1E5±3.5E5 / 13
CMTd-n101-k9   | 1117.5±34.8 / — / 0 | 936.5±15.1 / — / 0  | 882.2±10.4 / 3.8E5±0.0 / 13  | 876.94±5.9 / 9.2E4±0.0 / 3    | 873.7±4.9 / 2.87E5±1.2E5 / 80
CMTd-n101-k11  | 1110.1±58.8 / — / 0 | 1002.8±40.4 / — / 0 | 869.5±3.5 / 1.8E5±6.7E4 / 13 | 867.4±2.0 / 3.3E5±1.3E5 / 70  | 867.9±3.9 / 3.5E5±1.4E5 / 83
The differences among these three algorithms in the average number of
evaluations are also (with minor exceptions) statistically significant.
Regarding the two algorithms with the overall best results (those using 2-Opt
and λ-Interchange), JCell2o1i always obtains better values than JCell2o2i
(see the bold values) in terms of the average number of evaluations, with
statistical significance. Hence, applying 1-Interchange is faster than applying
2-Interchange (as can be expected, since it is a lighter operation), while also
achieving greater accuracy and efficiency (fewer computational resources).
We summarize in Fig. 9 the results of the compared algorithms. JCell and
JCell2o do not appear in Figs. 9b and 9c because these two algorithms were
not able to find the best-known solution for any of the tested instances. In
terms of the average distance of the solutions found (Fig. 9a), it can be seen
that the three algorithms using the λ-Interchange method obtain similar
results, outperforming the rest in all cases. JCell2o1i is the algorithm that
needs the lowest number of evaluations to reach the solution in almost all
cases (Fig. 9b). In terms of the hit rate (Fig. 9c), JCell1i is, in general, the
worst algorithm (among the three using λ-Interchange) for all the tested
instances.
To summarize in one example the studies made in this subsection, we plot in
Fig. 10 the evolution of the best (10a) and the average (10b) solutions found
during a run of the five algorithms studied in this section on CMT-n51-k5.
All the algorithms implement the combined mutation, composed of the three
methods studied in Sect. 3.3: insertion, inversion and swap. In Figs. 10a and
10b we can see a broadly similar behavior for all the algorithms: the
population quickly evolves towards solutions close to the best-known one
(value 524.61) in all cases. Zooming in on the graphics, we can see more
clearly that both JCell and JCell2o converge more slowly than the others and
finally get stuck in local optima. JCell2o maintains diversity for longer than
the other algorithms. Although the algorithms with λ-Interchange converge
faster than JCell and JCell2o, they are able to find
Fig. 9. Comparison of the behavior of the cGA without local search and four cGAs
using different local search techniques in terms of (a) accuracy, (b) effort, and (c)
efficacy
the optimal value, escaping from the local optima thanks to the λ-Interchange
local search method.
Fig. 10. Evolution of the best (a) and average (b) solutions for CMT-n51-k5 with
some algorithms differing only in the local search applied
one so far for the instance (in percentage), and the previously best-known
solution for each instance. The deviation between our best solution (sol ) and
the best-so-far one (best) is calculated by Equation 8.
∆(sol) = ((best − sol) / best) ∗ 100 .    (8)
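As a direct transcription of Equation 8:

```python
def deviation(sol, best):
    """Percentage deviation (Equation 8) between our best solution `sol`
    and the previously best-known one `best`."""
    return (best - sol) / best * 100.0
```

For instance, a solution of length 98 against a best-known length of 100 gives a deviation of 2%.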
In Sections 5.1, 5.2, 5.3, 5.5, and 5.8, distances between customers have
been rounded to the closest integer value, following the TSPLIB
convention [37]. In the other sections no rounding has been done. We discuss
each benchmark in a separate section.
Set A.
Fig. 11. Success rate for the benchmark of Augerat et al., set A
Fig. 12. Success rate for the benchmark of Augerat et al., set B
The algorithm finds it increasingly hard to reach the optimal solution as the
number of customers grows, since the hit rate decreases as the size of the
instance increases. Indeed, the hit rate is under 8% for instances involving
more than 59 customers, although it is also low for smaller instances like
A-n39-k5 and A-n45-k6.
Set B.
The instances in this set are mainly characterized by clustered customers.
This set does not seem to pose a real challenge to our algorithm, since it is
able to find the optimal values for all of them (Table 15). There is just one
exception (instance B-n68-k9), but the deviation between our best found
solution and the best-known one is really low (0.08%). In Fig. 12 we show the
success rate obtained for every instance. The hit rate is lower than 5% for
only a few instances (B-n50-k8, B-n63-k10, and B-n66-k9).
Set P.
The instances in class ‘P’ are modified versions of other instances taken from
the literature. In Table 16 we summarize our results. The reader can see that
our algorithm is also able to find all the optimal solutions for this benchmark.
The success rates are shown in Fig. 13, and the same trend as in the two
previous sets appears: low hit rates for the largest instances (more than 60
customers).
To end this section, we point out the remarkable properties of the algorithm,
which is competitive with many other algorithms across all three benchmarks.
Fig. 13. Success rate for the benchmark of Augerat et al., set P
clusters, cones, ...) and the depot (central, inside, outside) in the space, time
windows, pick-up and delivery, or heterogeneous demands. Van Breedam also
proposed a reduced set of instances (composed of 15 problems) drawn from
the initial one, and solved it with many different heuristics [33] (there exist
more recent works using this benchmark, but no new best solutions were
reported [39, 40]). This reduced set of instances is the one we study in this
chapter.
All the instances of this benchmark have the same number of customers
(n = 100), and the total demand of the customers is always the same, 1000
units. If we adopted the nomenclature used so far in this chapter, instances
using the same vehicle capacity would share the same name; to avoid
repeating names for different instances, we therefore use a special
nomenclature for this benchmark, numbering the problems from Bre-1 to
Bre-15. Problems Bre-1 to Bre-6 are constrained only by vehicle capacity.
The demand at each stop is 10 units. The vehicle capacity is 100 units for
Bre-1 and Bre-2, 50 units for Bre-3 and Bre-4, and 200 units for Bre-5 and
Bre-6. Problems Bre-7 and Bre-8 are not studied in this work because they
include pick-up and delivery constraints. A specific feature of this benchmark
is the use of homogeneous demands at the stops, i.e. all the customers of an
instance require the same amount of goods. The exceptions are problems
Bre-9 to Bre-11, for which the customer demands are heterogeneous. The
remaining four problems, Bre-12 to Bre-15, are not studied in this work
because they include time window constraints (out of our scope).
Our first conclusion is that JCell2o1i improves on the best-so-far solutions
for eight out of the nine instances. Furthermore, the average solution found
over all the runs is better than the previously best-known solution for all
these eight improved problems (see Table 17), which is an indication of the
high accuracy and stability of our algorithm. As we can see in Fig. 14, the
algorithm finds the new best solutions in a high percentage of the runs for
some instances (Bre-1, Bre-2, Bre-4, and Bre-6), while for some others the
new best solutions were found in several runs (Bre-5, Bre-9, Bre-10, and
Bre-11). In Fig. 15 we show the deviations of the solutions found (the
symbols mark
[Fig. 14: hit rates for instances Bre-1 to Bre-11]
[Fig. 15: deviation (%) from the previous best-known solution for Bre-1 to Bre-11; * marks a new best solution]
the improved instances). The new best solutions found for this benchmark are
shown in Tables 5 to 12 in Appendix A.
We have not found common patterns in the different instances (neither in the
distribution of customers nor in the location of the depot) that would allow
us to predict the accuracy of the algorithm when solving them. We noticed
that the four instances with the lowest hit rates (Bre-5, Bre-9, Bre-10, and
Bre-11) share common features, such as a non-centered depot and the
presence of a single customer located far away from the rest, but these two
features can also be found in instances Bre-2 and Bre-6. However, a common
aspect of instances Bre-9 to Bre-11 (not present in the other instances of this
benchmark) is their heterogeneous demands at the stops. Hence, we can
conclude that the use of heterogeneous demands in this benchmark reduces
the number of runs in which the algorithm obtains the best found solution
(hit rate). Finally, in the case of Bre-3 (clusters uniformly distributed on a
circle and a non-centered depot), the best-so-far solution could not be
improved, but it was found in every run.
The instances composing this benchmark range from easy problems of just
21 customers to more difficult ones of 100 customers. We can distinguish two
different kinds of instances in the benchmark. In the smaller instances (up to
Fig. 16. Success rate for the benchmark of Christofides and Eilon
32 customers), the depot and all the customers (except one) lie in a small
region of the plane, with a single customer located far away from the region
containing all the others (as in some instances of the Van Breedam
benchmark). In the remaining instances of this benchmark, the customers are
randomly distributed in the plane and the depot is either at the center or
near it.
In order to compare our results with those found in the literature, distances
between customers have been rounded to the closest integer value, as we
previously did for the three sets of instances of Augerat et al. in Sect. 5.1.
We show the percentage of runs in which the optimal solution is found in
Fig. 16. It can be seen there that the algorithm has some difficulty solving
the largest instances (over 50 customers, with customers distributed over the
whole problem area), for which the hit rate ranges from 1% to 8%. For the
smaller instances (32 customers or less), the solution is found in 100% of the
runs, except for E-n31-k7 (67%). The deviations of the solutions found (∆)
are not plotted because JCell2o1i is able to find the optimal solutions
(∆ = 0.0) for all the instances in this benchmark (see Table 18).
The other seven instances (named CMTd-nXX-kXX) use the same customer
locations as the previous seven, but they are subject to additional constraints,
such as limited route lengths and drop times (the time needed to unload
goods) at each customer. These instances are unique among all the
benchmarks studied in this work because they have drop times associated
with the customers. These drop times are homogeneous for all the customers
of an instance, i.e. all the customers spend the same time unloading goods.
Conversely, the demands at the stops are heterogeneous, so different
customers can require distinct amounts of goods.
The four clustered instances have a non-centered depot, while in the other
ones the depot is centered (with the exceptions of CMT-n51-k5 and
CMTd-n51-k6).
The reader can see in Table 19 that these instances are, in general, harder
than all the previously studied ones, since for the first time in our study
JCell2o1i is not able to find the optimal solution for some of them (see
Fig. 17). Specifically, the best-known solution cannot be found by our
algorithm for six instances (the largest ones). Nevertheless, it can be seen in
Fig. 18 that the
Fig. 17. Success rate for the benchmark of Christofides, Mingozzi and Toth
Fig. 18. Deviation rates for the benchmark of Christofides, Mingozzi and Toth
difference (∆) between our best solution and the best-known one is always
lower than 2%. The best-known solution has been found for all the clustered
instances except CMTd-n121-k11 (∆ = 0.16). Notice that the two
non-clustered instances with a non-centered depot were solved in 51%
(CMT-n51-k5) and 23% (CMTd-n51-k6) of the runs. Hence, the best-known
solution was found for all instances having a non-centered depot, with the
exception of CMTd-n121-k11.
This benchmark is composed of just three instances, taken from real-life
vehicle routing applications. Problems F-n45-k4 and F-n135-k7 represent a
day of grocery deliveries from the Peterboro and Bramalea, Ontario,
terminals, respectively, of National Grocers Limited. The other problem
(F-n72-k4) concerns the delivery of tires, batteries and accessories to gasoline
service stations, and is based on data obtained from Exxon (USA). The depot
is not centered in any of the three instances. Since this benchmark also
belongs to the TSPLIB, distances between customers are rounded to the
nearest integer value. Once again, all the instances have been solved to
optimality by our algorithm (see Table 20), with high hit rates for two of the
three instances (the hit rate is 3% for F-n135-k7). In Fig. 19 we show the
obtained hit rates. This figure clearly shows how the difficulty of finding the
optimal solution grows with the size of the instance. Once more, the
deviations (∆) are not plotted because they are 0.00 for the three instances.
[Fig. 19: hit rates for F-n45-k4, F-n72-k4 and F-n135-k7]
[Figure: deviation rates (%) for the benchmark of Golden et al.]
There is also one pseudo-real problem involving 385 cities, which was
generated as follows: each city is the most important town or village of the
smallest political entity (commune) of the canton of Vaud, in Switzerland.
The census of inhabitants of each commune was taken at the beginning of
1990. A demand of 1 unit per 100 inhabitants (but at least 1 for each
commune) was considered, and vehicles have a capacity of 65 units.
As we can see in Table 22, JCell2o1i is able to obtain the best-known
solutions for the four smaller instances (those with 75 customers) and also
for ta-n101-k11b (see Table 21 for the hit rates). Moreover, for instance
ta-n76-k10b our algorithm has improved on the previously best-known
solution by 0.0015% (new solution = 1344.62); this solution is shown in
Table 13 of Appendix A. Note that this best solution is quite close in length
to the previously existing one, but the two represent very different solutions
in terms of the resulting routes. JCell2o1i found the previous best solution
for this instance in 30% of the runs made. For the instances in which the
known best solution was found but not improved, the hit rate is over 60%.
The deviations in the other instances, as in the previous section, are very low
(under 2.90%), as can be seen in Fig. 22, i.e., the algorithm is very stable.
[Figure: hit rates for the benchmark of Taillard]
Fig. 22. Deviation rates for the benchmark of Taillard (* marks the new best solution found by JCell2o1i)
Fig. 23. Success rate for the benchmark of translated instances from TSP
it has been able to improve on the best-known solution for nine of the tested
instances, which represents an important achievement in current research.
Hence, we can say that the performance of JCell2o1i is similar, or even
superior, to that of the best algorithm for each instance. Besides, our
algorithm is quite simple, since we have designed a canonical cGA with three
mutations widely used in the literature for this problem, plus two well-known
local search methods.
As future work, it may be interesting to test the behavior of the algorithm
with other local search methods, e.g., 3-Opt. A further step is to adapt the
algorithm to other variants of the studied problem, such as the VRP with
time windows (VRPTW), multiple depots (MDVRP), or backhauls (VRPB).
Finally, it will also be interesting to study the behavior of JCell2o1i when
the local search step is applied in a less exhaustive form, thus yielding a
faster algorithm.
7 Acknowledgement
This work has been partially funded by MCYT and FEDER under
contract TIN2005-08818-C04-01 (the OPLINK project: http://oplink.lcc.
uma.es).
References
1. Toth P, Vigo D (2001) The vehicle routing problem. Monographs on discrete
mathematics and applications. SIAM, Philadelphia
2. Dantzig GB, Ramser JH (1959) The truck dispatching problem. Manag Sci
6:80–91
3. Christofides N, Mingozzi A, Toth P (1979) The vehicle routing problem. In:
Combinatorial optimization. Wiley, New York, pp 315–338
4. http://neo.lcc.uma.es/radi-aeb/WebVRP/index.html
5. Golden B, Wasil E, Kelly J, Chao IM (1998) The impact of metaheuristics on
solving the vehicle routing problem: algorithms, problem sets, and computa-
tional results. In: Fleet management and logistics. Kluwer, Boston, pp 33–56
6. Cordeau JF, Gendreau M, Hertz A, Laporte G, Sormany JS (2005) New heuris-
tics for the vehicle routing problem. In: Langevin A, Riopel D (eds.) Logistics
systems: design and optimization. Kluwer Academic, Dordrecht, Springer Verlag
NY, pp. 279–297
7. Manderick B, Spiessens P (1989) Fine-grained parallel genetic algorithm. In:
Schaffer J (ed.) Proceedings of the third international conference on genetic
algorithms – ICGA89, Morgan-Kaufmann, Los Altos, CA, pp 428–433
8. Alba E, Tomassini M (2002) Parallelism and evolutionary algorithms. IEEE
Trans Evol Comput 6:443–462
9. Sarma J, Jong KD (1996) An analysis of the effect of the neighborhood size
and shape on local selection algorithms. In: Voigt H, Ebeling W, Rechenberg I,
Schwefel H (eds.) Parallel problem solving from nature (PPSN IV). Volume 1141
of lecture notes in computer science. Springer, Berlin Heidelberg New York, pp
236–244
B Results
The tables containing the results obtained by JCell2o1i when solving all the
instances proposed in this work are shown in this appendix. The values
included in Tables 14 to 23 are obtained from 100 independent runs (for
statistical significance), except for the benchmark of Golden et al. (Table 21),
where only 30 runs were made due to its difficulty.
The values in the tables of this section correspond to the best solution found
in our runs for each instance, the average number of evaluations made to
find that solution, the success rate, the average value of the solutions found
over all the independent runs, the deviation (∆) between our best solution
and the previously best-known one (in percentage), and the known best
solution for each instance. The execution times of our tests range from
153.50 hours for the most complex instance to 0.32 seconds for the simplest
one.
Jeff Achtnig
1 Introduction
1.1 Outline
The PSO algorithm was first proposed by Kennedy and Eberhart [1] as a
new method for function optimization. The inspiration for PSO came from
observing social behavior in nature, such as that of a flock of birds,
a swarm of bees, or a school of fish.
J. Achtnig: Particle Swarm Optimization with Mutation for High Dimensional Problems,
Studies in Computational Intelligence (SCI) 82, 423–439 (2008)
www.springerlink.com
© Springer-Verlag Berlin Heidelberg 2008
424 J. Achtnig
that problem space, often caused the swarm to quickly converge on relatively
poor solutions.
PSO’s difficulty with high dimensional problems can be understood as
follows. A problem with 1000 variables, for example, would require 2000 it-
erations in order for a given particle to explore the effects of changing just a
single variable at a time (i.e. two iterations per variable would be needed to
explore both a positive and a negative change in direction from its current
position). Exploring the effects of simultaneous variable changes would require
even more iterations. In that time, the swarm would have already begun con-
verging on one of the particles. Once the swarm converges the search becomes
much more local, and the ability to explore large-scale changes in each of the
variables diminishes.
2 PSO Modifications
Our initial goal was to examine the ability of PSO in high dimensional problem
spaces, and see if there might be a simple way of improving its performance on
those types of problems. To that end, our research was narrowed to two tech-
niques that did not require any large or computationally expensive changes to
the PSO algorithm. The first was the inclusion of a mutation operator. The
second involved using a random constriction coefficient instead of a fixed one.
2.1 Mutation
The basic PSO algorithm is modelled around the concept of swarming, and
therefore does not make use of an explicit mutation operator. In typical evo-
lutionary algorithms, however, mutation is an important operation that helps
to maintain the diversity of the population. It was therefore hypothesized that
combining the basic PSO algorithm with a mutation operator would help the
swarm avoid premature convergence in high dimensional problems.
One possible approach for creating a mutation operator for PSO is to
give each problem dimension of every particle a random chance of getting
mutated. For a 1,600 dimensional problem with 100 particles, however, this
would result in 160,000 random numbers generated in each iteration of the
swarm. To reduce the number of random number calculations required, it was
decided instead to give each particle a chance of getting selected for mutation.
If selected, then a second random variable would be generated to determine
which single dimension/variable for that particle would get mutated. This
approach considerably reduces the number of random number calculations
required.
Note that limiting the mutation to a single dimension/variable does not
preclude the ability to explore the effects of multiple variable changes. A parti-
cle can still be selected for mutation multiple times in succession, thus allowing
potentially more than one variable in each particle to be affected by mutation.
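The per-particle selection scheme described above can be sketched as follows. The mutation probability and the choice to re-randomize the selected variable within its bounds are assumptions for illustration; the chapter specifies the selection scheme but not these details:

```python
import random

MUTATION_PROB = 0.05  # assumed per-particle mutation probability

def mutate_swarm(positions, lower, upper, rng=random):
    """Give each particle a chance of being selected for mutation; if
    selected, draw a second random number to pick a single dimension of
    that particle and mutate it.

    positions: list of position vectors (one per particle).
    Only two random draws are needed per mutated particle, instead of
    one draw per dimension per particle.
    """
    for pos in positions:
        if rng.random() < MUTATION_PROB:          # chance of being selected
            d = rng.randrange(len(pos))           # which single dimension
            pos[d] = rng.uniform(lower, upper)    # re-randomize that variable
    return positions
```

For a 1,600-dimensional problem with 100 particles, this scheme needs at most 200 random draws per iteration rather than 160,000.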
426 J. Achtnig
A common variant of PSO uses a linearly decreasing inertia weight [4]
instead of the constriction coefficient, χ, of equation 1. Using the inertia
weight in this way, however, requires knowing in advance how many iterations
the PSO algorithm will run for, so that the weight can be decreased
accordingly. Rather than deciding on a fixed number of iterations arbitrarily,
or through trial and error, we chose Clerc's version of PSO (which uses a
static constriction coefficient) for our experiments.
However, in an attempt to improve some of our initial results with using a
mutation operator, we also experimented with a number of other PSO modi-
fications in combination with the mutation operator. After a few initial tests,
we settled on utilizing a random constriction coefficient in place of a static one.
In this approach, a uniform random number in the range [0.2, 0.9] was
generated for each particle in the swarm, at every iteration. This random
number was then used as the value of the constriction coefficient for that
particle, and was applied to every dimension/variable of that particle.
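A sketch of the resulting velocity update is shown below. The constricted update form and the acceleration coefficients c1 = c2 = 2.05 are assumptions taken from the standard PSO literature, not from this chapter; only the χ ~ U[0.2, 0.9] draw per particle per iteration comes from the text:

```python
import random

C1 = C2 = 2.05  # acceleration coefficients (assumed, standard values)

def update_velocity(v, x, pbest, gbest, rng=random):
    """Constricted PSO velocity update with a *random* constriction
    coefficient: chi ~ U[0.2, 0.9] is drawn once per particle per
    iteration and applied to every dimension of that particle."""
    chi = rng.uniform(0.2, 0.9)
    return [
        chi * (v[j]
               + C1 * rng.random() * (pbest[j] - x[j])
               + C2 * rng.random() * (gbest[j] - x[j]))
        for j in range(len(x))
    ]
```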
3 Experimental Settings
velocity for that particle at that time step was set equal to vmax , where vmax
was determined according to the following equation:
vmax = 0.50 · (FunctionUpperLimit − FunctionLowerLimit) / 2.    (3)
The performance of each algorithm was compared on each test function
in 400, 800, and 1600 dimensions (the dimensionality of the neural
network problems is roughly the same). With the exception of the neural
network problems of 800 and 1600 dimensions, each test comprised 50 runs and
the average fitness was used in our comparisons. For those neural network
problems, 25 runs were averaged due to time constraints. For
each run, each of the four PSO algorithms being tested was initialized with
the same random seed, ensuring that the initial positions of the particles
would be the same. The population size for all tests was fixed at 100 particles,
while the number of iterations varied depending on the test function and its
dimensionality.
We also experimented with larger populations of particles, equal to the
number of dimensions of the problem, but the few tests we ran suggested
that running 100 particles for more iterations produced equal or better
solutions. Since we were also looking to reduce the computation time
involved, we decided that using fewer particles would best meet that goal. All
particles in our tests were connected in the standard “star” (fully connected)
topology.
Finally, we decided to compare the PSO algorithms with another evolu-
tionary algorithm: Differential Evolution (DE) [12]. After some preliminary
tests with a few DE variants and settings, we chose the DE/rand/1/best strat-
egy with C = 0.80 and F = 0.50 for our comparisons. As in the PSO variants,
the population size was fixed at 100 particles.
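The chapter names the strategy DE/rand/1/best with C = 0.80 and F = 0.50 but does not spell the variant out; the sketch below shows the closely related standard DE/rand/1/bin trial-vector construction, treating C as the crossover rate. Both of those readings are assumptions, so this illustrates generic DE rather than the authors' exact implementation:

```python
import random

F = 0.50   # differential weight (from the chapter)
CR = 0.80  # crossover rate (the chapter's C = 0.80, assumed meaning)

def de_trial(pop, i, rng=random):
    """Build a DE/rand/1/bin trial vector for population member i:
    mutant = a + F * (b - c) for three distinct random members (all
    different from i), then binomial crossover with the target vector."""
    idx = [k for k in range(len(pop)) if k != i]
    a, b, c = (pop[k] for k in rng.sample(idx, 3))
    target = pop[i]
    jrand = rng.randrange(len(target))  # ensure at least one mutant gene
    return [
        a[j] + F * (b[j] - c[j]) if (rng.random() < CR or j == jrand)
        else target[j]
        for j in range(len(target))
    ]
```

In the selection step, the trial vector replaces the target only if it has equal or better fitness.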
The four numerical optimization problems used in our tests are the unimodal
Sphere and Rosenbrock functions, and the multimodal Rastrigin and Schwefel
functions.
Sphere = Σ_{i=1}^{n} x_i^2                                          (4)

Rosenbrock = Σ_{i=1}^{n−1} [100 · (x_{i+1} − x_i^2)^2 + (1 − x_i)^2]    (5)

Rastrigin = 10n + Σ_{i=1}^{n} (x_i^2 − 10 cos(2πx_i))               (6)

Schwefel = 418.9829 · n + Σ_{i=1}^{n} (−x_i sin(√|x_i|))            (7)
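Read directly from equations (4) to (7), the four benchmarks can be implemented as:

```python
import math

def sphere(x):
    # Unimodal; global minimum 0 at the origin.
    return sum(v * v for v in x)

def rosenbrock(x):
    # Unimodal (for these dimensions); global minimum 0 at all-ones.
    return sum(100.0 * (x[i + 1] - x[i] ** 2) ** 2 + (1.0 - x[i]) ** 2
               for i in range(len(x) - 1))

def rastrigin(x):
    # Multimodal; global minimum 0 at the origin.
    return 10.0 * len(x) + sum(v * v - 10.0 * math.cos(2.0 * math.pi * v)
                               for v in x)

def schwefel(x):
    # Multimodal; global minimum near x_i = 420.9687 in every dimension.
    return 418.9829 * len(x) + sum(-v * math.sin(math.sqrt(abs(v)))
                                   for v in x)
```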
428 J. Achtnig
For the feed-forward network, Kramer’s nonlinear PCA neural network [9]
was used. As shown in Figure 2, this is a five layer auto-associative neural
network with a bottleneck layer in the middle to reduce the dimensionality of
the input variables (the number of nodes in the bottleneck layer is less than
the number of nodes in the input/output layers).
The following network configurations (nodes per layer) were tested:
1) 12–11–5–11–12, which – including the bias weights – results in 413
weights to optimize;
2) 18–15–8–15–18, which results in 836 weights to optimize; and
3) 24–24–11–24–24, which results in 1763 weights to optimize.
Each network was trained on 200 test cases, and the average error of
all of the test cases between the input and output values was used as our
fitness function. The test cases were created such that there would be some
correlation between the input variables – a non-linear correlation and a linear
one. The following scheme was used to generate the inputs for each test case:
1) input[1] = random variable uniformly chosen over the range [0.0, 1.0];
2) input[2] = (input[1])^3; and
3) input[3] = 0.75 · input[1].
This pattern was repeated for the remainder of the inputs (4..6, 7..9, 10..12,
etc.) for each test case. As such, the number of inputs was always chosen to
be a multiple of three.
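A sketch of the test-case generator is given below. Taking input[2] as the cube of input[1] is our reading of the non-linear correlation mentioned in the text, and the function name is illustrative:

```python
import random

def make_test_case(n_inputs, rng=random):
    """Generate one test case with n_inputs inputs (n_inputs must be a
    multiple of 3). Each group of three holds one free uniform variable,
    a non-linear function of it, and a linear function of it."""
    assert n_inputs % 3 == 0
    case = []
    for _ in range(n_inputs // 3):
        u = rng.uniform(0.0, 1.0)
        case.extend([u, u ** 3, 0.75 * u])  # free, non-linear, linear
    return case
```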
The recurrent neural network consisted of a three layer network with the pre-
vious values of both the middle and output layers fed back into the input layer,
similar to the one shown in Figure 3. Three external inputs were connected
to the network. The first two of these inputs were generated from a chaotic
time series (the logistic map with r = 4), while the third input was a non-
linear combination of the first two inputs. The outputs of the network were
the future values of each of the three input values (ranging from two future
values per input, to four future values per input depending on the network
problem). The three inputs were:
1) input[1] = logistic map series, with initial value 0.3;
2) input[2] = logistic map series, with initial value 0.5; and
3) input[3] = (input[1])^2 + (input[2])^2.
The logistic map, with r = 4, is: x_{t+1} = 4 · x_t · (1 − x_t).
The following network configurations (nodes per layer) were tested:
1) (3 + 20) − 14 − 6, which – including the bias weights – results in 426
weights to optimize;
2) (3 + 29) − 20 − 9, which results in 849 weights to optimize; and
3) (3 + 40) − 28 − 12, which results in 1580 weights to optimize.
The number of input nodes was always 3 plus the number of middle and
output layer nodes. Each network was trained on 200 successive values of the
input series.
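The three input series can be sketched as follows; the function names are illustrative, and the series length is a parameter since the text fixes only the 200 training values:

```python
def logistic_series(x0, n):
    """Iterate the logistic map x_{t+1} = 4 * x_t * (1 - x_t)."""
    xs = [x0]
    for _ in range(n - 1):
        xs.append(4.0 * xs[-1] * (1.0 - xs[-1]))
    return xs

def make_rnn_inputs(n):
    """Three input series for the recurrent network: two logistic-map
    series and a non-linear combination (sum of squares) of them."""
    s1 = logistic_series(0.3, n)
    s2 = logistic_series(0.5, n)
    s3 = [a * a + b * b for a, b in zip(s1, s2)]
    return s1, s2, s3
```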
Rosenbrock (10,000 iterations):  46,719,745 (σ=8,543,113)   843 (σ=108)   918 (σ=122)   27,606,006 (σ=5,388,505)   413 (σ=139)
Schwefel (15,000 iterations):    81,092 (σ=5,543)   133,310 (σ=4,605)   22,715 (σ=1,070)   90,801 (σ=6,555)   21,825 (σ=1,317)
NN−PCA∗ (20,000 iterations):     14.85 (σ=1.99)   13.42 (σ=0.57)   10.80 (σ=1.41)   14.08 (σ=1.92)   6.56 (σ=2.63)
RNN∗ (20,000 iterations):        17.08 (σ=3.65)   21.86 (σ=4.43)   14.91 (σ=3.82)   18.32 (σ=4.04)   13.76 (σ=5.03)
compared. There are no local minima to get stuck in; there is only one
globally optimal solution, centered at the origin, and, regardless of where a
particle is positioned in the search space, following the gradient leads
directly to that solution. Yet, despite these favorable conditions, the bPSO
algorithm still gets stuck in a non-optimal solution. PSO's
problem of premature convergence appears to have a much more noticeable
effect in higher dimensions.
Adding a small mutation to the PSO algorithm, however, seems to give the
swarm a needed push that helps to keep it from getting stuck. While the bPSO
algorithm converges prematurely in each of the Sphere tests, the PSO.mut al-
gorithm continues to improve its fitness throughout each of the iterations. The
random constriction coefficient (PSO.rnd) by itself didn’t offer much improve-
ment over the bPSO algorithm, but combining it with the mutation operator
(PSO.cmb) resulted in the best performance.
The relative performance of the algorithms was found to be similar on all
of the other test problems – including the two neural network problems. In
each case the bPSO algorithm performed relatively poorly, while the algo-
rithms that performed the best were those with the mutation operator. The
combination of the mutation operator along with the random constriction
coefficient performed the best in all instances.
PSO with Mutation 433
All of the PSO algorithms tested still had a tendency, albeit to varying
degrees, to converge to non-optimal solutions. This was most noticeable
for the Schwefel function in all dimensions, and in general it became more
noticeable for all functions as the dimensionality of the problem increased.
Even the PSO.cmb algorithm appeared to have moments of difficulty on the
Sphere function in 1600 dimensions (although it did manage to recover).
The final set of graphs plot the diversity of the swarm for the Sphere, Ras-
trigin, Rosenbrock, and Schwefel functions in 400 dimensions. The diversity
of the swarm is given by:
Diversity = (1 / (|S| · D)) · Σ_{i=1}^{|S|} √( Σ_{j=1}^{N} (p_ij − p̄_j)^2 ),    (8)
where |S| is the swarm size, D is the length of the largest diagonal in the
search space, N is the number of dimensions of the problem, p_ij is the j-th
value of the i-th particle, and p̄_j is the j-th value of the average point p̄.
In all cases,
the PSO.rnd algorithm quickly achieves the lowest diversity compared with all
of the other algorithms, suggesting that the random constriction coefficient
has the effect of speeding up the convergence of the swarm. The PSO.mut
algorithm, on the other hand, has a diversity greater than or roughly equal
to the bPSO algorithm in all cases. This is to be expected, as the random
mutations should increase the diversity of the swarm.
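Equation 8, read as the standard distance-to-average-point measure of Riget and Vesterstroem [11], can be computed as follows; D, the length of the search space's largest diagonal, is passed in:

```python
import math

def diversity(swarm, diag):
    """Mean Euclidean distance of the particles from the swarm's average
    point, normalized by the length of the search space's largest
    diagonal (diag). swarm: list of position vectors."""
    n = len(swarm[0])
    avg = [sum(p[j] for p in swarm) / len(swarm) for j in range(n)]
    total = sum(math.sqrt(sum((p[j] - avg[j]) ** 2 for j in range(n)))
                for p in swarm)
    return total / (len(swarm) * diag)
```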
The best performing variant – the PSO.cmb algorithm – was subjected
both to forces trying to increase its diversity (i.e. the mutation operator),
and to those trying to decrease its diversity (i.e. the random constriction
coefficient), with results somewhere in-between depending on the problem. In
all cases the diversity of the PSO.cmb algorithm very quickly decreased in the
first few iterations – much more so than the bPSO algorithm, and generally
matching the decreased diversity of the PSO.rnd algorithm. After the first few
iterations of the Rastrigin and Schwefel functions, however, the PSO.cmb’s
diversity started to match that of the PSO.mut algorithm. Conversely, on
the Sphere and Rosenbrock functions the PSO.cmb algorithm continued to
decrease its diversity similar to the PSO.rnd for an extended period of time,
before levelling off (eventually, near the end of the iterations for both of these
latter cases, the bPSO algorithm’s diversity continued to decrease until it fell
below that of the PSO.cmb algorithm).
The DE algorithm performed quite well on its own, beating the “standard”
bPSO algorithm on most – but not all – test functions. Note especially the
Rastrigin function, where the DE algorithm in 800 dimensions performed bet-
ter than it did on the same problem in 400 dimensions. However, the PSO.cmb
algorithm still performed the best out of all of the algorithms on all of the test
problems. The PSO.mut algorithm was also better than the DE algorithm on
all but two problems (Rosenbrock in 400 dimensions, and the Rastrigin func-
tion in 1600 dimensions).
In general, the two PSO variants with a mutation operator (PSO.mut
and PSO.cmb) performed well across all of the problems tested, whereas
the DE algorithm performed relatively well on some problems but not on
others (such as the Schwefel function). Also, on problems such as the
Rosenbrock function, the DE algorithm performed noticeably worse as the
dimensionality of the problem increased, while the PSO.mut and PSO.cmb
algorithms handled the increasing dimensionality better.
References
1. Kennedy J, Eberhart R (2001) Swarm intelligence. Morgan Kaufmann, San
Francisco, CA
2. Clerc M, Kennedy J (2002) The particle swarm: explosion, stability, and conver-
gence in a multidimensional complex space. IEEE Trans Evol Comput 6:58–73
3. Kennedy J (1999) Small worlds and mega-minds: effects of neighborhood topol-
ogy on particle swarm performance. In: Proceedings of the 1999 Congress of
Evolutionary Computation, vol 3. IEEE, pp 1931–1938
4. Eberhart R, Shi Y (1998) A modified particle swarm optimizer. IEEE Int Conf
Evol Comput. Anchorage, Alaska
5. Bellman R (1961) Adaptive control processes: a guided tour. Princeton Univer-
sity Press
6. Fogel D, Beyer H-G (1996) A note on the empirical evaluation of intermediate
recombination. Evol Comput 3(4):491–495
7. Angeline PJ (1998) Evolutionary optimization versus particle swarm optimiza-
tion. In: Porto VW, Saravanan N, Waagen D, Eiben AE (eds) Evolutionary
programming VII. Lecture notes in computer science, vol 1447. Springer, Berlin
Heidelberg New York, pp 601–610
8. Vesterstroem J, Riget J, Krink T (2002) Division of labor in particle swarm
optimisation. In: Proceedings of the IEEE congress on evolutionary computa-
tion (CEC 2002), Honolulu, Hawaii
9. Kramer MA (1991) Nonlinear principal component analysis using autoassocia-
tive neural networks. AIChE J 37(2):233–243
10. Riget J, Vesterstroem J, Krink T (2002) In: Proceedings of the congress
on evolutionary computation (CEC '02), vol 2, pp 1474–1479
11. Riget J, Vesterstroem J (2002) A diversity-guided particle swarm optimizer –
the arPSO. In: EVALife, Department of Computer Science, University of
Aarhus, Denmark
12. Lampinen J, Storn R (2004) Differential evolution. In: Onwubolu GC, Babu BV
(eds) New optimization techniques in engineering. Studies in fuzziness and soft
computing, vol 141. Springer, Berlin Heidelberg, New York, pp 123–166