Yaochu Jin (Ed.)

Multi-Objective Machine Learning

Studies in Computational Intelligence, Volume 16
Prof. Janusz Kacprzyk
Systems Research Institute
Polish Academy of Sciences
ul. Newelska 6
01-447 Warsaw
E-mail: kacprzyk@ibspan.waw.pl

Further volumes of this series can be found on our homepage: springer.com

Vol. 1. Tetsuya Hoya
Artificial Mind System – Kernel Memory Approach, 2005
ISBN 3-540-26072-2

Vol. 2. Saman K. Halgamuge, Lipo Wang (Eds.)
Computational Intelligence for Modelling and Prediction, 2005
ISBN 3-540-26071-4

Vol. 3. Bożena Kostek
Perception-Based Data Processing in Acoustics, 2005
ISBN 3-540-25729-2

Vol. 4. Saman K. Halgamuge, Lipo Wang
Classification and Clustering for Knowledge Discovery, 2005
ISBN 3-540-26073-0

Vol. 5. Da Ruan, Guoqing Chen, Etienne E. Kerre, Geert Wets (Eds.)
Intelligent Data Mining, 2005
ISBN 3-540-26256-3

Vol. 6. Tsau Young Lin, Setsuo Ohsuga, Churn-Jung Liau, Xiaohua Hu, Shusaku Tsumoto (Eds.)
Foundations of Data Mining and Knowledge Discovery, 2005
ISBN 3-540-26257-1

Vol. 7. Bruno Apolloni, Ashish Ghosh, Ferda Alpaslan, Lakhmi C. Jain, Srikanta Patnaik
Machine Learning and Robot Perception, 2005
ISBN 3-540-26549-X

Vol. 8. Srikanta Patnaik, Lakhmi C. Jain, Spyros G. Tzafestas, Germano Resconi, Amit Konar (Eds.)
Innovations in Robot Mobility and Control
ISBN 3-540-26892-8

Vol. 9. Tsau Young Lin, Setsuo Ohsuga, Churn-Jung Liau, Xiaohua Hu (Eds.)
Foundations and Novel Approaches in Data Mining, 2005
ISBN 3-540-28315-3

Vol. 10. Andrzej P. Wierzbicki, Yoshiteru
Creative Space, 2005
ISBN 3-540-28458-3

Vol. 11. Antoni Ligeza
Logical Foundations for Rule-Based Systems, 2006
ISBN 3-540-29117-2

Vol. 12. Jonathan Lawry
Modelling and Reasoning with Vague Concepts, 2006
ISBN 0-387-29056-7

Vol. 13. Nadia Nedjah, Ajith Abraham, Luiza de Macedo Mourelle (Eds.)
Genetic Systems Programming, 2006
ISBN 3-540-29849-5

Vol. 14. Spiros Sirmakessis (Ed.)
Adaptive and Personalized Semantic Web
ISBN 3-540-30605-6

Vol. 15. Lei Zhi Chen, Sing Kiong Nguang, Xiao Dong Chen
Modelling and Optimization of Biotechnological Processes, 2006
ISBN 3-540-30634-X

Vol. 16. Yaochu Jin (Ed.)
Multi-Objective Machine Learning, 2006
ISBN 3-540-30676-5

Yaochu Jin

Multi-Objective Machine Learning


Dr. Yaochu Jin
Honda Research Institute
Europe GmbH
Carl-Legien-Str. 30
63073 Offenbach
E-mail: yaochu.jin@honda-ri.de

Library of Congress Control Number: 2005937505

ISSN print edition: 1860-949X
ISSN electronic edition: 1860-9503
ISBN-10 3-540-30676-5 Springer Berlin Heidelberg New York
ISBN-13 978-3-540-30676-4 Springer Berlin Heidelberg New York

This work is subject to copyright. All rights are reserved, whether the whole or part of the material is
concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting,
reproduction on microfilm or in any other way, and storage in data banks. Duplication of this publication
or parts thereof is permitted only under the provisions of the German Copyright Law of September 9,
1965, in its current version, and permission for use must always be obtained from Springer. Violations are
liable for prosecution under the German Copyright Law.
Springer is a part of Springer Science+Business Media
© Springer-Verlag Berlin Heidelberg 2006
Printed in The Netherlands
The use of general descriptive names, registered names, trademarks, etc. in this publication does not imply,
even in the absence of a specific statement, that such names are exempt from the relevant protective laws
and regulations and therefore free for general use.
Typesetting: by the authors and techbooks using a Springer LaTeX macro package
Cover design: design & production GmbH, Heidelberg
Printed on acid-free paper SPIN: 11399346 89/techbooks 543210

To Fanhong, Robert and Zewei


Preface

Feature selection and model selection are two major elements in machine learning.
Both are inherently multi-objective optimization problems in which more than one
objective has to be optimized. In feature selection, for example, minimizing the
number of features and maximizing feature quality are two common objectives that
are likely to conflict with each other. It is also widely recognized that model
selection has to deal with the trade-off between approximation or classification
accuracy and model complexity.
Traditional machine learning algorithms try to satisfy multiple objectives by
combining them into a scalar cost function. A good example is the training of
neural networks, where the main target is to minimize a cost function that accounts
for the approximation or classification error on given training data. However,
reducing the training error often leads to overfitting, which means that the error
on unseen data becomes very large even though the neural network performs perfectly
on the training data. To improve the generalization capability of neural networks,
i.e., their ability to perform well on unseen data, a regularization term, e.g., the
complexity of the neural network weighted by a hyper-parameter (the regularization
coefficient), has to be included in the cost function. One major challenge in
applying the regularization technique is choosing the regularization coefficient
appropriately, which is non-trivial for most machine learning problems.
Several other examples exist in machine and human learning where a trade-off
between conflicting objectives has to be taken into account. In object recognition,
for example, a learning system should learn as many details of a given object as
possible; on the other hand, it should also be able to abstract the general features
of different objects. Another example is the stability and plasticity dilemma, where
a learning system has to balance learning new information against retaining old
information. From the viewpoint of multi-objective optimization, there is no single
learning model that can satisfy different objectives at the same time. In this
sense, Pareto-based multi-objective optimization is the only way to deal with the
conflicting objectives in machine learning. However, this seemingly straightforward
idea was not implemented in machine learning until the late 1990s. Liu and
Kadirkamanathan [1] considered three criteria in designing neural networks for
system identification. They used a genetic algorithm to minimize the maximum of the
three normalized objectives. Similar work has been presented in [2]. Kottathra and
Attikiouzel [3] employed a branch-and-bound search algorithm to determine the
structure of neural networks by trading off the mean square error and the number of
hidden nodes. The trade-off between the sum of squared errors and the norm of the
weights of neural networks was reported in [4], whereas the trade-off between
training error and test error has been considered in [5]. An interesting work on
Pareto-based neural network learning is reported by Kupinski and Anastasio [6],
where a multi-objective genetic algorithm is implemented to generate the receiver
operating characteristic curve of neural network classifiers. In generating fuzzy
systems, Ishibuchi et al. [7] used a multi-objective genetic algorithm to minimize
the classification error and the number of rules. Gomez-Skarmeta et al. suggested
the idea of using Pareto-based multi-objective genetic algorithms to optimize
multiple objectives in fuzzy modeling [8], though no simulation results were
provided. In [9], a genetic algorithm is used to minimize approximation error,
complexity, sensitivity to noise and continuity of rules.
With the boom of research on evolutionary multi-objective optimization,
Pareto-optimality based multi-objective machine learning has gained new impetus.
Compared to the early works, not only are more sophisticated multi-objective
algorithms used, but new research areas are also being opened up. For example, it
has been found that one can determine the number of clusters by analyzing the shape
of the Pareto front obtained by a multi-objective clustering method [10]; see
Chapter 2 for more detailed results. In [11], it is found that interpretable rules
can be extracted from the simple, Pareto-optimal neural networks. Further research
reveals that neural networks with good generalization capability can also be
identified by analyzing the Pareto front, as reported in Chapter 13 of this book.
Our most recent work suggests that multi-objective machine learning provides an
efficient approach to addressing catastrophic forgetting [12]. Besides, Pareto-based
multi-objective learning has proven particularly powerful in generating ensembles,
support vector machines (SVMs), and interpretable fuzzy systems.
This edited book presents a collection of the most representative research work on
multi-objective machine learning. The book is structured into five parts. Part I
discusses multi-objective feature extraction and selection, such as rough set based
feature selection, clustering and cluster validation, supervised and unsupervised
feature selection for ensemble based handwritten digit recognition, and edge
detection in image processing. In the second part, multi-objective model selection
is presented for improving the performance of single objective learning algorithms
in generating various machine learning models, including linear and nonlinear
regression models, multi-layer perceptrons (MLPs), radial-basis-function networks
(RBFNs), support vector machines (SVMs), decision trees, and learning classifier
systems. Multi-objective model selection for creating interpretable models is
described in Part III. Generating interpretable learning models plays an important
role in data mining and knowledge extraction, where the preference is put on gaining
insights into unknown systems. From the work presented, the reader can see how
understandable symbolic or fuzzy rules can be extracted from trained neural
networks, or from data directly, within the framework of multi-objective
optimization. The merit of multi-objective optimization is fully demonstrated in
Part IV, which is concerned with techniques for generating ensembles of machine
learning models. Diverse members of neural network or fuzzy system ensembles can be
generated by trading off between training and test errors, between accuracy and
complexity, or between accuracy and diversity. Compared to single objective based
ensemble generation methods, diversity is imposed more explicitly in multi-objective
learning so that structural or functional diversity of ensemble members can be
guaranteed. To conclude the book, Part V presents a number of successful
applications of multi-objective machine learning, such as multi-class receiver
operating characteristic analysis, mobile robot navigation, docking maneuver of
automated guided vehicles, information retrieval and object detection.
I am confident that by reading this book, the reader will be able to take home a
complete view of this emerging research area and to gain hands-on experience of a
variety of multi-objective machine learning approaches. Furthermore, I do hope that
this book, which to the best of my knowledge is the first book dedicated to
multi-objective machine learning, will inspire more creative ideas to further
promote research in this area.
I would like to thank all contributors who prepared excellent chapters for
this book. Many thanks go to Prof. Janusz Kacprzyk for his interest in this
book. I would also like to thank Dr. Thomas Ditzinger and Ms. Heather King
at Springer for their kind assistance. I am most grateful to Mr. Tomohiko
Kawanabe, Prof. Dr. Edgar Körner, Dr. Bernhard Sendhoff and Mr. Andreas
Richter at the Honda Research Institute Europe for their full understanding
and kind support.

Offenbach am Main
October 2005 Yaochu Jin

[1] G.P. Liu and V. Kadirkamanathan. Learning with multi-objective criteria. In
IEE Conference on Artificial Neural Networks, pages 53–58, 1995.
[2] S. Park, D. Nam, and C.H. Park. Design of a neural controller using
multi-objective optimization for nonminimum phase systems. In Proceedings of 1999
IEEE Int. Conf. on Fuzzy Systems, volume I, pages 533–537, 1999.
[3] K. Kottathra and Y. Attikiouzel. A novel multicriteria optimization algorithm
for the structure determination of multilayer feedforward neural networks.
Journal of Network and Computer Applications, 19:135–147, 1996.
[4] R. de A. Teixeira, A.P. Braga, R.H.C. Takahashi, and R.R. Saldanha. Improving
generalization of MLP with multi-objective optimization. Neurocomputing,
35:189–194, 2000.
[5] H.A. Abbass. A memetic Pareto approach to artificial neural networks. In
Proceedings of the 14th Australian Joint Conference on Artificial Intelligence,
pages 1–12, 2001.
[6] M.A. Kupinski and M. Anastasio. Multiobjective genetic optimization of
diagnostic classifiers with implementations for generating receiver operating
characteristic curves. IEEE Transactions on Medical Imaging, 18(8):675–685, 1999.
[7] H. Ishibuchi, T. Murata, and B. Turksen. Single-objective and two-objective
genetic algorithms for selecting linguistic rules for pattern classification
problems. Fuzzy Sets and Systems, 89:135–150, 1997.
[8] A. Gomez-Skarmeta, F. Jimenez, and J. Ibanez. Pareto-optimality in fuzzy
modeling. In Proc. of the 6th European Congress on Intelligent Techniques and
Soft Computing, pages 694–700, 1998.
[9] T. Suzuki, T. Furuhashi, S. Matsushita, and H. Tsutsui. Efficient fuzzy
modeling under multiple criteria by using genetic algorithm. In IEEE Conf. on
Systems, Man and Cybernetics, volume 5, pages 314–319, 1999.
[10] J. Handl and J. Knowles. Exploiting the trade-off – the benefits of multiple
objectives in data clustering. In Evolutionary Multi-Criteria Optimization, LNCS
3410, pages 547–560. Springer, 2005.
[11] Y. Jin, B. Sendhoff, and E. Körner. Evolutionary multi-objective optimization
for simultaneous generation of signal-type and symbol-type representations. In
Evolutionary Multi-Criteria Optimization, LNCS 3410, pages 752–766. Springer, 2005.
[12] Y. Jin and B. Sendhoff. Avoiding catastrophic forgetting via multi-objective
learning. In Congress on Evolutionary Computation, 2006.


Contents

Part I Multi-Objective Clustering, Feature Extraction
and Feature Selection

1 Feature Selection Using Rough Sets
Mohua Banerjee, Sushmita Mitra, Ashish Anand

2 Multi-Objective Clustering and Cluster Validation
Julia Handl, Joshua Knowles

3 Feature Selection for Ensembles Using the Multi-Objective
Optimization Approach
Luiz S. Oliveira, Marisa Morita, Robert Sabourin

4 Feature Extraction Using Multi-Objective
Genetic Programming
Yang Zhang, Peter I. Rockett

Part II Multi-Objective Learning for Accuracy Improvement

5 Regression Error Characteristic Optimisation of Non-Linear
Jonathan E. Fieldsend

6 Regularization for Parameter Identification Using
Multi-Objective Optimization
Tomonari Furukawa, Chen Jian Ken Lee, John G. Michopoulos

7 Multi-Objective Algorithms for Neural Networks Learning
Antônio Pádua Braga, Ricardo H. C. Takahashi, Marcelo Azevedo
Costa, Roselito de Albuquerque Teixeira

8 Generating Support Vector Machines Using Multi-Objective
Optimization and Goal Programming
Hirotaka Nakayama, Yeboon Yun

9 Multi-Objective Optimization of Support Vector Machines
Thorsten Suttorp, Christian Igel

10 Multi-Objective Evolutionary Algorithm for Radial Basis
Function Neural Network Design
Gary G. Yen

11 Minimizing Structural Risk on Decision Tree Classification
DaeEun Kim

12 Multi-objective Learning Classifier Systems
Ester Bernadó-Mansilla, Xavier Llorà, Ivan Traus

Part III Multi-Objective Learning for Interpretability Improvement

13 Simultaneous Generation of Accurate and Interpretable
Neural Network Classifiers
Yaochu Jin, Bernhard Sendhoff, Edgar Körner

14 GA-Based Pareto Optimization for Rule Extraction
from Neural Networks
Urszula Markowska-Kaczmar, Krystyna Mularczyk

15 Agent Based Multi-Objective Approach to Generating
Interpretable Fuzzy Systems
Hanli Wang, Sam Kwong, Yaochu Jin, and Chi-Ho Tsang

16 Multi-objective Evolutionary Algorithm
for Temporal Linguistic Rule Extraction
Gary G. Yen

17 Multiple Objective Learning for Constructing
Interpretable Takagi-Sugeno Fuzzy Model
Shang-Ming Zhou, John Q. Gan

Part IV Multi-Objective Ensemble Generation

18 Pareto-Optimal Approaches to Neuro-Ensemble Learning
Hussein Abbass

19 Trade-Off Between Diversity and Accuracy in Ensemble
Arjun Chandra, Huanhuan Chen, Xin Yao

20 Cooperative Coevolution of Neural Networks and
Ensembles of Neural Networks
Nicolás García-Pedrajas

21 Multi-Objective Structure Selection for RBF Networks
and Its Application to Nonlinear System Identification
Toshiharu Hatanaka, Nobuhiko Kondo, Katsuji Uosaki

22 Fuzzy Ensemble Design through Multi-Objective Fuzzy
Rule Selection
Hisao Ishibuchi, Yusuke Nojima

Part V Applications of Multi-Objective Machine Learning

23 Multi-Objective Optimisation for Receiver Operating
Characteristic Analysis
Richard M. Everson, Jonathan E. Fieldsend

24 Multi-Objective Design of Neuro-Fuzzy Controllers
for Robot Behavior Coordination
Naoyuki Kubota

25 Fuzzy Tuning for the Docking Maneuver Controller
of an Automated Guided Vehicle
J.M. Lucas, H. Martinez, F. Jimenez

26 A Multi-Objective Genetic Algorithm for Learning
Linguistic Persistent Queries in Text Retrieval Environments
María Luque, Oscar Cordón, Enrique Herrera-Viedma

27 Multi-Objective Neural Network Optimization
for Visual Object Detection
Stefan Roth, Alexander Gepperth, Christian Igel

Index

Part I

Multi-Objective Clustering, Feature Extraction
and Feature Selection

1 Feature Selection Using Rough Sets

Mohua Banerjee1, Sushmita Mitra2, and Ashish Anand1

1 Indian Institute of Technology, Kanpur, India
  {mohua,aanand}@iitk.ac.in
2 Indian Statistical Institute, Kolkata, India
  sushmita@isical.ac.in

Summary. Feature selection refers to the selection of input attributes that are most
predictive of a given outcome. This is a problem encountered in many areas such as
machine learning, signal processing, and recently bioinformatics/computational
biology. Feature selection is one of the most important and challenging tasks when
it comes to dealing with large datasets with tens or hundreds of thousands of
variables. Areas of web-mining and gene expression array analysis provide examples,
where selection of interesting and useful features determines the performance of
subsequent analysis. The intrinsic nature of noise, uncertainty, and incompleteness
of data makes extraction of hidden and useful information very difficult. The
capability of handling imprecision, inexactness and noise has attracted researchers
to use rough sets for feature selection. This article provides an overview of recent
literature in this direction.

1.1 Introduction

Feature selection techniques aim at reducing the number of irrelevant and redundant
variables in the dataset. Unlike other dimensionality reduction methods, feature
selection preserves the original features after reduction and selection. The
benefits of feature selection are manifold: it improves subsequent analysis by
removing the noisy data and outliers, makes post-analysis faster and more
cost-effective, makes data visualization easier, and provides a better understanding
of the underlying process that generated the data.

Here, we will consider an example which will serve as an illustration throughout
the chapter. Consider gene selection from microarray data. In this problem, the
features are expression levels of genes corresponding to the abundance of mRNA in a
sample (e.g. a particular time point of development or treatment), for a number of
patients and replicates. A typical analysis task is to find genes which are
differentially expressed in different cases or can classify different classes with
high accuracy. Usually very few data samples are available altogether for testing
and training, but the number of features (genes) ranges from 10,000 to 15,000.

M. Banerjee et al.: Feature Selection Using Rough Sets, Studies in Computational
Intelligence (SCI) 16, 3–20 (2006)
www.springerlink.com © Springer-Verlag Berlin Heidelberg 2006

Rough set theory (RST) [13, 14] was developed by Pawlak as a tool to deal with
inexact and incomplete data. Over the years, RST has become a topic of great
interest to researchers and has been applied to many domains, in particular to
knowledge databases. This success is due in part to the following aspects of the
theory:
• only the facts hidden in data are analyzed,
• no additional information about the data is required,
• minimal knowledge representation is obtained.

Consider an information system consisting of a domain U of objects/observations and
a set A of attributes/features. A induces a partition (classification) of U by
grouping together objects having identical attribute values. But the whole set A may
not always be necessary to define the classification/partition of U. Many of the
attributes may be redundant, and we may find minimal subsets of attributes which
give the same classification as the whole set A. These subsets are called reducts in
RST, and correspond to the minimal feature sets that are necessary and sufficient to
represent a correct decision about classification. Thus RST provides a methodology
for addressing the problem of relevant feature selection that could be applied,
e.g., to the case of microarray data described earlier.

The task of finding reducts is reported to be NP-hard [15]. The high complexity of
this problem has motivated investigators to apply various approximation techniques
to find near-optimal solutions. There are some studies reported in the literature,
e.g., [17, 3], where genetic algorithms [9] have been applied to find reducts.

Genetic algorithms (GAs) provide an efficient search technique in a large solution
space, based on the theory of evolution. A population of chromosomes is made to
evolve over generations by optimizing a fitness function, which provides a
quantitative measure of the fitness of individuals in the pool. When there are two
or more conflicting characteristics to be optimized, the single-objective GA often
requires an appropriate formulation of the single fitness function in terms of an
additive combination of the different criteria involved. In such cases
multi-objective GAs (MOGAs) [7] provide an alternative, more efficient, approach to
search for optimal solutions.

In this article, we present various attempts at using GAs (both single- and
multi-objective) in order to obtain reducts, and hence provide some solution to the
challenging task of feature selection. The rest of the chapter is organized as
follows. Section 1.2 introduces the preliminaries of rough sets and genetic
algorithms. Section 1.3 deals with feature selection and the role of rough sets.
Section 1.4 and Section 1.5 describe recent literature on single- and
multi-objective feature selection using rough sets. Section 1.7 concludes the
chapter.

1.2 Preliminaries

In this section we discuss the preliminaries of rough sets and genetic algorithms,
with emphasis on multi-objective GAs. The issues relevant to this chapter are
explained briefly. For detailed discussion, pointers to references are given at the
appropriate places.

1.2.1 Rough sets

The theory of rough sets deals with uncertainty that arises from granularity in the
domain of discourse, the latter represented formally by an indiscernibility relation
(typically an equivalence) on the domain. The intention is to approximate a rough
(imprecise) concept in the domain of discourse by a pair of exact concepts, that are
determined by the indiscernibility relation. These exact concepts are called the
lower and upper approximations of the rough concept. The lower approximation is the
set of objects definitely belonging to the rough concept, whereas the upper
approximation is the set of objects possibly belonging to the same. Fig. 1.1
illustrates a rough set with its approximations. The small squares represent
equivalence classes induced by the indiscernibility relation on the domain. Lower
approximation of the set X is shown as the shaded region and upper approximation
consists of all the elements inside the thick line. The formal definitions of the
above notions and others required for the present work are given below.

[Figure: Set X, with its lower approximation and upper approximation marked]
Fig. 1.1. Lower and upper approximations of a rough set

Definition 1. An Information System $\mathcal{A} = (U, A)$ consists of a non-empty,
finite set $U$ of objects (cases, observations, etc.) and a non-empty, finite set
$A$ of attributes $a$ (features, variables), such that $a : U \to V_a$, where $V_a$
is a value set.

Often, the attribute set $A$ consists of two parts $C$ and $D$, called condition and
decision attributes respectively. In that case the information system $\mathcal{A}$
is called a decision table. Decision tables are termed consistent if, whenever
objects $x, y$ are such that for each condition attribute $a$, $a(x) = a(y)$, then
$d(x) = d(y)$, for any $d \in D$.
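The consistency condition on decision tables can be checked mechanically. The following is a minimal sketch, assuming a decision table stored as a list of dictionaries; the attribute names `g1`, `g2`, `cls` and the toy values are hypothetical and only serve as illustration.

```python
def is_consistent(rows, condition, decision):
    """True if objects that agree on every condition attribute
    also agree on every decision attribute."""
    seen = {}
    for row in rows:
        key = tuple(row[a] for a in condition)   # values on C
        val = tuple(row[d] for d in decision)    # values on D
        # remember the first decision seen for this condition pattern
        if seen.setdefault(key, val) != val:
            return False
    return True


# Hypothetical toy decision table: two gene-expression attributes, one class label.
table = [
    {"g1": "high", "g2": "low", "cls": "tumour"},
    {"g1": "high", "g2": "low", "cls": "tumour"},
    {"g1": "low", "g2": "low", "cls": "normal"},
]
clash = {"g1": "low", "g2": "low", "cls": "tumour"}  # same conditions, other class

consistent = is_consistent(table, ["g1", "g2"], ["cls"])
inconsistent = is_consistent(table + [clash], ["g1", "g2"], ["cls"])
```

Adding the `clash` row makes two objects indiscernible on the condition attributes while carrying different decisions, which is exactly the situation the definition rules out.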

Definition 2. Let $B \subseteq A$. A $B$-indiscernibility relation $IND(B)$ is
defined as

\[ IND(B) = \{(x, y) \in U \times U : a(x) = a(y),\ \forall a \in B\}. \qquad (1.1) \]

It is clear that $IND(B)$ partitions the universe $U$ into equivalence classes

\[ [x_i]_B = \{x_j \in U : (x_i, x_j) \in IND(B)\}, \quad x_i \in U. \]

Definition 3. The $B$-lower and $B$-upper approximations of a given set
$X (\subseteq U)$ are defined, respectively, as follows:

\[ \underline{B}X = \{x \in U : [x]_B \subseteq X\}, \qquad
   \overline{B}X = \{x \in U : [x]_B \cap X \neq \emptyset\}. \qquad (1.2) \]

Reducts and Core

Reducts are the basic attributes which induce the same partition on universe $U$ as
the whole set of attributes. These are formally defined below for both a general
information system $(U, A)$, and a decision table $(U, C \cup D)$.

In an information system, there may exist many reducts. For a given information
system $\mathcal{A} = (U, A)$, an attribute $b \in B \subseteq A$ is dispensable if
$IND(B) = IND(B - \{b\})$, otherwise $b$ is said to be indispensable in $B$. If all
attributes in $B$ are indispensable, then $B$ is called independent in
$\mathcal{A}$. Attribute set $B \subseteq A$ is called a reduct, if $B$ is
independent in $\mathcal{A}$ and $IND(B) = IND(A)$.

In a decision table $\mathcal{A} = (U, C \cup D)$, one is interested in eliminating
redundant condition attributes, and relative ($D$)-reducts are computed. Let
$B \subseteq C$, and consider the $B$-positive region of $D$, viz.,

\[ POS_B(D) = \bigcup_{[x]_D} \underline{B}[x]_D. \]

An attribute $b \in B (\subseteq C)$ is $D$-dispensable in $B$ if
$POS_B(D) = POS_{B \setminus \{b\}}(D)$, otherwise $b$ is $D$-indispensable in $B$.
$B$ is said to be $D$-independent in $\mathcal{A}$, if every attribute from $B$ is
$D$-indispensable in $B$.

Definition 4. $B (\subseteq C)$ is called a $D$-reduct in $\mathcal{A}$, if $B$ is
$D$-independent in $\mathcal{A}$ and $POS_C(D) = POS_B(D)$.

If a consistent decision table has a single decision attribute $d$, then
$U = POS_C(d) = POS_B(d)$, for any $d$-reduct $B$.

The core is the set of essential attributes of any information system.
Mathematically, $core(\mathcal{A}) = \bigcap reduct(\mathcal{A})$, i.e., the set
consists of those attributes which are members of all reducts.

Discernibility Matrix

$D$-reducts can be computed with the help of $D$-discernibility matrices [15]. Let
$U = \{x_1, \cdots, x_m\}$. A $D$-discernibility matrix $M_D(\mathcal{A})$ is
defined as an $m \times m$ matrix of the information system $\mathcal{A}$ with the
$(i, j)$th entry $c_{ij}$ given by:

\[ c_{ij} = \{a \in C : a(x_i) \neq a(x_j),\ \text{and}\ (x_i, x_j) \notin IND(D)\},
   \quad i, j \in \{1, \cdots, m\}. \qquad (1.3) \]
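The definitions above translate fairly directly into set operations. The sketch below is an illustration under stated assumptions rather than the authors' implementation: objects are referred to by their row index, the table and attribute names `a`, `b`, `d` are hypothetical, and $D$-reducts are found by brute-force enumeration, which is only workable for a handful of attributes (the problem is NP-hard in general). It also assumes the decision is not constant, so the empty attribute set is never a reduct.

```python
from itertools import combinations


def partition(rows, attrs):
    """Equivalence classes of IND(attrs), as sets of object indices."""
    classes = {}
    for i, row in enumerate(rows):
        classes.setdefault(tuple(row[a] for a in attrs), set()).add(i)
    return list(classes.values())


def lower(rows, attrs, X):
    """B-lower approximation: union of classes fully contained in X."""
    return {i for c in partition(rows, attrs) if c <= X for i in c}


def upper(rows, attrs, X):
    """B-upper approximation: union of classes that intersect X."""
    return {i for c in partition(rows, attrs) if c & X for i in c}


def positive_region(rows, B, D):
    """POS_B(D): union of the B-lower approximations of the D-classes."""
    pos = set()
    for d_class in partition(rows, D):
        pos |= lower(rows, B, d_class)
    return pos


def d_reducts(rows, C, D):
    """Minimal subsets B of C with POS_B(D) == POS_C(D) (brute force)."""
    target = positive_region(rows, C, D)
    found = []
    for k in range(1, len(C) + 1):
        for B in combinations(C, k):
            if any(set(r) <= set(B) for r in found):
                continue  # a subset is already a reduct, so B is not minimal
            if positive_region(rows, list(B), D) == target:
                found.append(B)
    return found


def discernibility_matrix(rows, C, D):
    """Entries c_ij of (1.3) for j < i; empty when x_i, x_j share a decision."""
    M = {}
    for i in range(len(rows)):
        for j in range(i):
            same = all(rows[i][d] == rows[j][d] for d in D)
            M[(i, j)] = set() if same else {a for a in C if rows[i][a] != rows[j][a]}
    return M


# Hypothetical table; objects 0 and 1 are indiscernible but differ in decision.
rows = [
    {"a": 1, "b": 0, "d": "yes"},
    {"a": 1, "b": 0, "d": "no"},
    {"a": 0, "b": 0, "d": "no"},
    {"a": 0, "b": 1, "d": "yes"},
    {"a": 1, "b": 1, "d": "yes"},
]
X = {i for i, r in enumerate(rows) if r["d"] == "yes"}
low, up = lower(rows, ["a", "b"], X), upper(rows, ["a", "b"], X)
pos = positive_region(rows, ["a", "b"], ["d"])
reducts = d_reducts(rows, ["a", "b"], ["d"])
M = discernibility_matrix(rows, ["a", "b"], ["d"])
```

Because the positive region can only grow when attributes are added, a subset that is minimal with respect to preserving $POS_C(D)$ is also $D$-independent, which is why the enumeration by increasing size is enough for this toy case.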

A variant of the discernibility matrix, viz. the distinction table [17], is
generally used in many applications to enable faster computation.

Definition 5. A distinction table is a binary matrix with dimensions
$\frac{(m^2 - m)}{2} \times N$, where $N$ is the number of attributes in $A$. An
entry $b((k, j), i)$ of the matrix corresponds to the attribute $a_i$ and pair of
objects $(x_k, x_j)$, and is given by

\[ b((k, j), i) =
   \begin{cases}
     1 & \text{if } a_i(x_k) \neq a_i(x_j), \\
     0 & \text{if } a_i(x_k) = a_i(x_j).
   \end{cases} \qquad (1.4) \]

The presence of a '1' signifies the ability of the attribute $a_i$ to discern
between the pair of objects $(x_k, x_j)$.

1.2.2 Genetic Algorithms

Genetic algorithms [9] are heuristic techniques applied to solve complex search and
optimization problems. They are motivated by the principles of natural genetics and
natural selection. Unlike classical optimization methods, GAs deal with a population
of solutions/individuals. With the basic genetic/evolutionary operators, like
selection, crossover and mutation, new solutions are generated. A population of
chromosomes, representing solutions, is made to evolve over generations by
optimizing a fitness function, which provides a quantitative measure of the fitness
of individuals in the pool. The selection operator selects better solutions to
participate in crossover. The crossover operator is responsible for creating new
solutions from the old ones. Mutation also creates new solutions, but only in the
vicinity of old solutions. The mutation operator plays a great role in the case of
multi-modal problems.

Multi-Objective GAs

As the name suggests, a multi-objective optimization problem deals with more than
one objective function. When there are two or more conflicting objectives to be
optimized, often a weighted sum of the objectives is taken to convert them into a
single-objective problem. In such cases multi-objective GAs provide an alternative,
more efficient approach to searching for optimal solutions. In contrast to
single-objective problems, multiple objective problems give rise to a set of optimal
solutions, known as Pareto-optimal solutions [5]. Over the past decade, a number of
multi-objective genetic algorithms have been suggested. The basic advantage of
multi-objective GAs over classical optimization methods is their ability to find
multiple Pareto-optimal solutions in one single simulation run. Detailed discussion
about multi-objective genetic algorithms can be found in [7].
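The difference between a weighted sum and a Pareto-optimal set can be seen on a handful of points. The sketch below assumes both objectives are minimised; the candidate values (number of selected features, classification error) and the weights are hypothetical.

```python
def dominates(p, q):
    """p dominates q (minimisation): no worse in every objective
    and strictly better in at least one."""
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))


def pareto_front(points):
    """Points that are not dominated by any other point."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


def weighted_sum_best(points, weights):
    """Single solution returned by scalarising the objectives."""
    return min(points, key=lambda p: sum(w * v for w, v in zip(weights, p)))


# Hypothetical candidates: (number of selected features, classification error).
candidates = [(2, 0.30), (3, 0.20), (5, 0.10), (4, 0.25), (6, 0.10)]

front = pareto_front(candidates)
few_features = weighted_sum_best(candidates, (0.2, 1.0))    # heavier size penalty
low_error = weighted_sum_best(candidates, (0.02, 1.0))      # lighter size penalty
```

Each choice of weights yields one point of the front, so the trade-off has to be fixed before the search; the Pareto set keeps all non-dominated trade-offs available for a later decision.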

Combine parent and children population and do non-dominated sorting. i. . Initially.e. 2. Replace the parent population by the best members of the combined pop- ulation. Rank the population using the dominance criteria. Among the different multi-objective algorithms. is that which makes size of the new parent population same as size of the old one. NSGA- II [8]. as discussed in Section 1. Calculate the fitness. ri = rj but solution i is less densely located in the search space. When it is not possible to accommodate all the members of a particular front. when they are compared with respect to all objectives. i. It uses the concept of non-domination to select the better individuals. Do crossover and mutation to generate children population. 1. members of lower fronts replace the parent population. NSGA-II not only tries to converge to Pareto-front but also tries to have diverse solution on the front. 5. Initialize the population.8 M.e.5. 6. To get an estimate of the density around the solution i. NSGA-II uses crowding distance to find the population density near each individual. The number of individuals selected on the basis of higher crowding distance.e. • a local crowding distance di . The NSGA-II algorithm can be summarized as follows. ri < rj . 7. it has been observed that NSGA-II has the features required for a good multi-objective GA. Banerjee et al. 3. average distance of two solutions on either side of solution i along each of the objectives is taken. The algorithm assumes each solution in the population has two characteristics: • a non-domination rank ri . Do selection using crowding selection operator. 8. Crowding distance metric is defined in [8] such that the solution which resides in less crowded region will get higher value of crowding distance. Calculate the crowding distance. di > dj . Crowded tournament selection operation is described below. the main features of Non-dominated Sorting Genetic Algorithm. 
By using crowded tournament selection operator. Thereby NSGA-II tries to maintain the diversity among the non-dominated solutions. which is one of the frequently used multi-objective genetic algorithms. 4. This has been used in the studies of multi-objective feature selection using rough sets. that front is sorted according to the crowding distance. (ii) both the solutions are in the same front. viz. Definition 6. It is said that solution i wins tournament with another solution j if any one of the following is true: (i) solution i has better rank i.
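The crowding distance and the crowded tournament rule of Definition 6 can be sketched as follows. This is a minimal illustration rather than the reference NSGA-II implementation of [8]: the `Solution` class and the function names are hypothetical, and the per-objective normalisation by the range of the front is an assumption of this sketch.

```python
import math
from dataclasses import dataclass


@dataclass
class Solution:
    objectives: tuple      # objective values of this solution
    rank: int = 0          # non-domination rank r_i (lower is better)
    crowding: float = 0.0  # crowding distance d_i (higher is less crowded)


def assign_crowding_distance(front):
    """Set d_i for every solution of one non-domination front."""
    for s in front:
        s.crowding = 0.0
    if not front:
        return
    for k in range(len(front[0].objectives)):
        ordered = sorted(front, key=lambda s: s.objectives[k])
        lo, hi = ordered[0].objectives[k], ordered[-1].objectives[k]
        # Boundary solutions are always kept: give them infinite distance.
        ordered[0].crowding = ordered[-1].crowding = math.inf
        if hi == lo:
            continue
        # Interior solutions: normalised gap between the two neighbours.
        for left, mid, right in zip(ordered, ordered[1:], ordered[2:]):
            mid.crowding += (right.objectives[k] - left.objectives[k]) / (hi - lo)


def wins_tournament(i, j):
    """Definition 6: i wins if it has better rank, or equal rank and larger d."""
    if i.rank < j.rank:
        return True
    return i.rank == j.rank and i.crowding > j.crowding
```

With three solutions on one front at (0, 1), (0.5, 0.5) and (1, 0), the two end points receive infinite distance and the middle one receives 2.0, so either end point wins a tournament against the middle solution.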

1.3 Feature Selection and Rough Sets

Feature selection plays an important role in data selection and preparation for subsequent analysis. It reduces the dimensionality of a feature space, and removes redundant, irrelevant, or noisy data. It enhances the immediate effects for any application by speeding up subsequent mining algorithms, improving data quality and thereby performance of such algorithms, and increasing the comprehensibility of their output. In this section we highlight the basics of feature selection followed by the role of rough sets in this direction.

1.3.1 Feature Selection

It is a process that selects a minimum subset of M features from an original set of N features (M ≤ N), so that the feature space is optimally reduced according to an evaluation criterion. Finding the best feature subset is often intractable or NP-hard. Feature selection typically involves the following steps:
• Subset generation: For N features, the total number of candidate subsets is $2^N$. This makes an exhaustive search through the feature space infeasible, even with a moderate value of N. Often heuristic and non-deterministic strategies are found to be more practical.
• Subset evaluation: Each generated subset needs to be evaluated by a criterion, and compared with the previous best subset.
• Stopping criterion: The algorithm may stop when any of the following holds.
  – a pre-defined number of features are selected,
  – a pre-defined number of iterations are completed,
  – addition or deletion of any feature does not produce a better subset, or
  – an optimal subset is obtained according to the evaluation criterion.
• Validation: The selected best feature subset needs to be validated with different tests.

Search is a key issue in feature selection, involving search starting point, search direction, and search strategy. One also needs to measure the goodness of the generated feature subset. Feature selection can be supervised as well as unsupervised, depending on class information availability in data. The algorithms are typically categorized under filter and wrapper models [18], with different emphasis on dimensionality reduction or accuracy enhancement.

1.3.2 Role of Rough Sets

Rough sets provide a useful tool for feature selection. We explain its role with reference to the bioinformatics domain. A basic issue addressed in many practical applications, such as the gene expression analysis example discussed in Section 1.1, is that the whole set of attributes/features is not always necessary to define an underlying partition/classification. Many of the features may be superfluous, and minimal subsets of attributes may give the same classification as the whole set of attributes. For example, only a few genes in a microarray gene expression study are supposed to define the underlying process, and hence working with all genes only reduces the quality and significance of analysis. In rough set terminology, these minimal subsets of features are just the reducts, and correspond to the minimal feature sets that are necessary and sufficient to represent the underlying classification.

The high complexity of the reduct finding problem has motivated investigators to apply various approximation techniques to find near optimal solutions. There are some studies reported in literature, e.g., [17, 3], where genetic algorithms (GAs) have been applied to find reducts. Each of the studies in [17, 3] employs a single-objective function to obtain reducts. The essential properties of reducts are:
• to classify among all elements of the universe with the same accuracy as the starting attribute (feature) set, and
• to be of small cardinality.

A close observation reveals that these two characteristics are of a conflicting nature. Hence the determination of reducts is better represented as a bi-objective problem. The idea was first presented in [1], and a preliminary study was conducted. Incorporating some modifications in this proposal, [2] investigates the multi-objective feature selection criteria for classification of cancer microarray data. We will first discuss the single-objective feature selection approach using rough sets in Section 1.4, and then pass on to the multi-objective approach in Section 1.5.

1.4 Single-objective Feature Selection Approach

Over the past few years, there has been a good amount of study in effectively applying GAs to find minimal reducts. First we will discuss algorithms proposed by Wroblewski [17].

Wroblewski's Algorithms

Wroblewski has proposed three heuristic-based approaches for finding minimal reducts. While the first approach is based on classical GAs, the other two are permutation-based greedy approaches.

Method 1

Solutions are represented by binary strings of length N, where N is the number of attributes (features). In the bit representation '1' means that the attribute is present and '0' means that it is not. The following fitness function is considered for each individual:

$$F_1(\nu) = \frac{N - L_\nu}{N} + \frac{C_\nu}{(m^2 - m)/2}, \qquad (1.5)$$

where $\nu$ is a reduct candidate, N is the number of available attributes, $L_\nu$ is the number of 1's in $\nu$, $C_\nu$ is the number of object combinations that $\nu$ can discern, and m is the number of objects. The first part of the fitness function gives the candidate credit for containing fewer attributes (few 1's) and the second part determines the extent to which the candidate can discern among objects.

When the fitness function has been calculated for each individual, the selection process begins. A particular selection operator, 'Roulette Wheel', is used. One-point crossover is used with crossover probability $P_c = 0.7$. The probability $P_m$ of mutation on a single position of an individual is taken as 0.05. Mutation of one position means replacement of '1' by '0' or '0' by '1'. The complexity of the algorithm is governed by that of the fitness calculation, and it can be shown that the latter complexity is $O(N m^2)$.

Method 2

This method uses greedy algorithms to generate the reducts. Here the aim is to find the proper order of attributes. We can describe this method as follows:

Step 1 : Generate an initial set of random permutations of attributes $\tau(a_1, \ldots, a_N)$, each of them representing an ordered list of attributes, i.e., $(b_1, \ldots, b_N) = \tau(a_1, \ldots, a_N)$.
Step 2 : For each ordered list, start with the empty reduct $R = \emptyset$ and set $i \leftarrow 0$.
Step 3 : Check whether R is a reduct. If R is a reduct, stop.
Step 4 : Else, add one more element from the ordered list of attributes, i.e. define $R := R \cup b_{i+1}$, and set $i \leftarrow i + 1$.
Step 5 : Go to step 3.

The result of this algorithm will be either a reduct or a set of attributes containing a reduct as a subset. GAs help to find reducts of different order.

Genetic Operators: Different permutations represent different individuals. The fitness function of an individual $\tau$ is defined as:

$$F(\tau) = \frac{1}{L_\tau}, \qquad (1.6)$$

where $L_\tau$ is the length of the subset R found by the greedy algorithm.
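The fitness of Eq. (1.5) in Method 1 can be sketched on a small binary distinction table. The toy objects, the helper names `distinction_table` and `fitness_method1`, and the choice to build the table over all object pairs are assumptions made for illustration only.

```python
from itertools import combinations


def distinction_table(objects):
    """One row per object pair; an entry is 1 where the attribute values differ."""
    return [[int(a != b) for a, b in zip(x, y)]
            for x, y in combinations(objects, 2)]


def fitness_method1(candidate, table):
    """Eq. (1.5): reward few attributes plus the fraction of pairs discerned.

    candidate is a 0/1 list of length N; table has (m^2 - m)/2 rows.
    """
    n = len(candidate)
    l_nu = sum(candidate)  # number of 1's in the candidate
    # C_nu: pairs for which at least one selected attribute has a 1.
    c_nu = sum(1 for row in table
               if any(bit and sel for bit, sel in zip(row, candidate)))
    return (n - l_nu) / n + c_nu / len(table)


# Hypothetical data: m = 3 objects described by N = 3 attributes.
objects = [(0, 0, 1), (0, 1, 1), (1, 1, 0)]
table = distinction_table(objects)
print(fitness_method1([1, 1, 0], table))  # discerns all 3 pairs with 2 attributes
```

Counting $C_\nu$ scans every row once per attribute, which is consistent with the $O(N m^2)$ cost of fitness evaluation stated above.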

The same selection methods are used as in method 1. But different mutation and crossover operators are used. Although one can choose any order-based crossover method, Wroblewski [17] has suggested the use of PMX (Partially Mapped Crossover [9]), with an interchange of two randomly chosen attributes being done in mutation with some probability.

Method 3

This method again uses greedy algorithms to generate reducts. We can describe this method as follows:

Step 1 : Generate an initial set of random permutations of attributes $\tau(a_1, \ldots, a_N)$, each of which represents an ordered list of attributes, i.e. $(b_1, \ldots, b_N) = \tau(a_1, \ldots, a_N)$.
Step 2 : For each ordered list, define reduct R as the whole set of attributes.
Step 3 : Set $i \leftarrow 1$ and let $R := R - b_i$.
Step 4 : Check whether R is a reduct. If it is not, then undo step 3 and set $i \leftarrow i + 1$. Go back to step 3.

All genetic operators are chosen as in method 2. The result of this algorithm will always be a reduct, the proof of which is discussed in [17]. However, a disadvantage of this method is its high complexity.

'Rough Enough' Approach to Calculating Reducts

Bjorvand [3] has proposed another variant of finding reducts using GAs. In his approach, a different fitness function is used. The notations used are the same as those for the previous methods.

$$F_1(\nu) = \begin{cases} \left( \dfrac{N - L_\nu}{N} + \dfrac{C_\nu}{(m^2 - m)/2} \right)^2 & \text{if } C_\nu < (m^2 - m)/2, \\[2ex] \left( \dfrac{N - L_\nu}{N} + \left( \dfrac{C_\nu}{(m^2 - m)/2} + \dfrac{1}{2} \right) \times \dfrac{1}{2} \right)^2 & \text{if } C_\nu = (m^2 - m)/2. \end{cases}$$

Bjorvand argues that by squaring the fitness values it becomes easy to separate the different values. In case of total coverings (the candidate is possibly a reduct), the second part of the fitness value is added and then multiplied by 1/2 to avoid getting low fitness values as compared to the candidates almost covering all objects and also having a low number of attributes. Instead of a constant mutation rate, Bjorvand uses adaptive mutation. If there are many individuals with the same fitness value, a higher mutation rate is chosen to avoid premature convergence or getting stuck at local minima. To give more preference to finding shorter reducts, a higher mutation probability is chosen for mutation from 1 to 0 than for the reverse direction.
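The two permutation-based greedy procedures (Methods 2 and 3) can be sketched for a single permutation as below. This is one possible reading of the steps, not Wroblewski's code: the function names are hypothetical, attributes are 0-based column indices, and the "is R a reduct" test is replaced by the assumption that it suffices to check whether R discerns every row of a distinction table with no all-zero row.

```python
def discerns_all(attrs, table):
    """True if every object pair (row) is discerned by some attribute in attrs."""
    return all(any(row[a] for a in attrs) for row in table)


def greedy_add(order, table):
    """Method 2 (sketch): add attributes in permutation order until all pairs
    are discerned. The result may contain a reduct as a proper subset."""
    chosen = []
    for a in order:
        if discerns_all(chosen, table):
            break
        chosen.append(a)
    return chosen


def greedy_drop(order, table):
    """Method 3 (sketch): start from all attributes and try to drop each one in
    permutation order, undoing any removal that breaks discernibility."""
    kept = list(order)
    for a in order:
        trial = [b for b in kept if b != a]
        if discerns_all(trial, table):
            kept = trial
    return kept


def permutation_fitness(subset):
    """Eq. (1.6): shorter subsets found by the greedy pass score higher."""
    return 1.0 / len(subset)


# Hypothetical distinction table: 3 object pairs, 3 attributes.
TABLE = [[0, 1, 0], [1, 1, 1], [1, 0, 1]]
print(greedy_add([0, 1, 2], TABLE), greedy_drop([0, 1, 2], TABLE))
```

In the GA, each individual is a permutation, and its fitness is obtained by running one of these greedy passes and applying Eq. (1.6) to the resulting subset.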

Other Approaches

In literature, there are some more approximation approaches to calculate reducts. Among these algorithms are Johnson's algorithm [12] and the hitting set approach by Vinterbo et al. [16]. Johnson's algorithm is a greedy approach to find a single reduct. In the hitting set approach, non-empty elements of the discernibility matrix are chosen as elements of a multiset $\mathcal{S}$. The minimal hitting sets of $\mathcal{S}$ are exactly the reducts. Since finding minimal hitting sets is again an NP-hard problem [16], GAs are used for finding the approximate hitting sets (reducts). Here the fitness function again has two parts, and a weighted sum of the two parts is taken. The following fitness function is defined for each candidate solution $\nu$:

$$F(\nu) = (1 - \alpha) \times \frac{\mathrm{cost}(A) - \mathrm{cost}(\nu)}{\mathrm{cost}(A)} + \alpha \times \min\left\{ \varepsilon, \frac{|[S \in \mathcal{S} \mid S \cap \nu \neq \emptyset]|}{|\mathcal{S}|} \right\}. \qquad (1.7)$$

In the above equation, $\alpha$ lies between 0 and 1. A is the set containing elements of the discernibility matrix. Candidate solutions $\nu (\subset A)$ are found through evolutionary search algorithms. The first term of the above equation rewards the shorter element and the second term tries to ensure that hitting sets get reward. The cost function in the above definition specifies the cost of an attribute subset. Means of defining the cost function are discussed in Rosetta [12]. One can trivially define a cost function as "the cardinality of the candidate $\nu$, $|\nu|$". The parameter $\varepsilon$ signifies a minimal value for the hitting fraction. Rosetta describes all the required parameters in brief, and a detailed description of the fitness function and all parameters can be found in [16].

1.5 Multi-objective Feature Selection Approach

All the algorithms discussed above concentrate more on finding the minimal reducts and thus use variations of different single fitness functions. In many applications, such as gene expression analysis, a user may not like to just have a minimal set of genes, but explore a range of different sets of features and the relations among them. Multi-objective criterion has been used successfully in many engineering problems, as well as in feature selection algorithms [11]. Here, the basic idea is to give freedom to the user to choose features from a wide spectrum of trade-off features, which will be useful for them. As discussed in Section 1.3.2, a reduct exhibits a conflicting nature of having small cardinality and ability to discern among all objects. Combining this conflicting nature of reducts with MOGAs may give the desired set of trade-off solutions. This motivated the work in [1], to use multi-objective fitness functions for finding reducts of all lengths.

In this section we first discuss the multi-objective reduct finding algorithm proposed initially in [1]. This is followed by a discussion on a modification and implementation of this proposal, for the gene expression classification problem, in [2].

1.5.1 Finding Reducts Using MOGA – I

In [1] the fitness function proposed by Wroblewski (eqn. (1.5)) was split into two parts, to exploit the basic properties of reducts as two conflicting objectives. The two fitness functions $F_1$ and $F_2$ are as follows:

$$F_1(\nu) = \frac{N - L_\nu}{N}, \qquad (1.8)$$

$$F_2(\nu) = \frac{C_\nu}{(m^2 - m)/2}. \qquad (1.9)$$

Hence, in this case, the first fitness function gives the solution credit for containing fewer attributes (few 1's) and the second fitness function determines the extent to which the solution can discern among objects.

Non-domination sorting brings out the difference between the proposed algorithm and the earlier algorithms. The algorithm makes sure that
• the true reducts come to the best non-domination front, and
• two different candidates also come into the same non-domination front, if they do not violate the superset criteria, i.e. one solution is not a superset of the other.

For two solutions i and j, the non-domination procedure is outlined as follows:

    if F2_i = F2_j and F1_i ≠ F1_j
        if one (i, say) is superset of other (j)
            then put i at inferior domination level
        else put both in same domination level
    else do regular domination checking with respect to two fitness values.

Remark 1. If the two solutions discern the same number of pairs of objects, then their non-domination level is determined by the first objective. In this way, we make sure that candidates with different cardinality can come to the same non-domination level, if one is not the superset of the other and the two can discern between the same number of objects. Explicit checking of superset is intended to ensure that only true reducts come into the best non-domination level.

The representation scheme of solutions discussed in Section 1.4, and the usual crowding binary tournament selection as suggested in Section 1.2, are used. The complete algorithm can be summarized in the following steps:

Step 1 : A random population of size n is generated.

Step 2 : The two fitness values for each individual are calculated.
Step 3 : Non-domination sorting is performed, to identify different fronts.
Step 4 : Crowding sort is performed to get a wide spread of the solution.
Step 5 : Offspring solution is created using crowded tournament selection, crossover and mutation operators.
Step 6 : Steps 2 to 5 are repeated for a pre-specified number of generations.

An advantage of the multi-objective approach can be shown by taking an example. Consider two solutions (a, b) and (c, d, e) giving the same classification. Then the earlier approach will give less preference to the second solution, whereas the proposed algorithm puts both solutions in the same non-domination level. Thus the probability of selecting reducts of larger cardinalities is the same as that of smaller cardinalities. The explicit check of superset also increases the probability of getting only true reducts.

This algorithm was implemented on some simple data sets. However, there are complexity problems when faced with large data.

1.5.2 Finding Reducts Using MOGA – II

For a decision table A with N condition attributes and a single decision attribute d, the problem of finding a d-reduct is equivalent to finding a minimal subset of columns $R (\subseteq \{1, 2, \cdots, N\})$ in the distinction table, satisfying

$$\forall (k, j)\ \exists i \in R : b((k, j), i) = 1, \quad \text{whenever } d(x_k) \neq d(x_j).$$

So, in effect, the distinction table consists of N columns, and rows corresponding to only those object pairs $(x_k, x_j)$ such that $d(x_k) \neq d(x_j)$. We call this shortened distinction table a d-distinction table. Note that, as A is taken to be consistent, there is no row with all 0 entries in a d-distinction table.

In [2], NSGA-II is modified to effectively handle large datasets. We focus on two-class problems. An initial redundancy reduction is done to generate a reduced attribute value table $A_r$. From this we form the d-distinction table consisting of N columns, with rows corresponding to only those object pairs $(x_k, x_j)$ such that $d(x_k) \neq d(x_j)$. As object pairs corresponding to the same class do not constitute a row of the d-distinction table, there is a considerable reduction in its size, thereby leading to a decrease in computational cost. Fitness functions of eqns. (1.8)-(1.9), adapted to the case of two-class problems, are used.

The modified feature selection algorithm is implemented on microarray data consisting of three different cancer samples, as summarized in Table 1.1 [2]. After the initial redundancy reduction, the feature sets are reduced:
• Colon dataset: 1102 attributes for the normal and cancer classes,
• Lymphoma dataset: 1867 attributes for normal and malignant lymphocyte cells, and
• Leukemia dataset: 3783 attributes for classes ALL and AML.

Table 1.1. Usage details of the two-class microarray data

Data used    # Attributes   Classes            # Samples
Colon        2000           Colon cancer       40
                            Normal             22
Lymphoma     4026           Other type         54
                            B-cell lymphoma    42
Leukemia     7129           ALL                47
                            AML                25

The algorithm is run on the d-distinction table, with different population sizes, to generate reducts upon convergence. Results indicate convergence to 8, 2, 2 attributes respectively, for the minimal reduct on the three sets of two-class microarray gene expression data after 15,000 generations.

On the other hand, feature selection (without rough sets) in microarray gene expression analysis has also been reported in literature [10, 4, 6]. Huang [10] uses a probabilistic neural network for feature selection, based on correlation with class distinction. In case of Leukemia data, a reduction to a ten-genes set is reported. For Colon data, a ten-genes set is generated. Chu et al. [6] employ a t-test based feature selection with a fuzzy neural network. A five-genes set is generated for Lymphoma data.

1.6 Example

In this section we will explain single-objective and multi-objective based feature selection through an example. For illustration, a sample data set, viz. the Cleveland data set, is taken from Rosetta [12]. The data has 14 attributes and it does not contain any decision variable. We have removed all objects with missing values and hence there are 297 objects. A part of the data, with 10 attributes and 4 objects, is shown in Table 1.2. To illustrate the single-objective feature selection approach, we have used Wroblewski's algorithm implemented in Rosetta, to find reducts. For the multi-objective feature selection approach, the algorithm proposed in [1] is used on the same data set.

Table 1.2. Part of Cleveland data, taken from Rosetta

age  sex  cp                trestbps  chol  fbs  restecg         thalach  exang  oldpeak
63   M    Typical angina    145       233   T    LV hypertrophy  150      N      2.3
67   M    Asymptomatic      160       286   F    LV hypertrophy  108      Y      1.5
67   M    Asymptomatic      120       229   F    LV hypertrophy  129      Y      2.6
37   M    Non-anginal pain  130       250   F    Normal          187      N      3.5

1.6.1 Illustration for the Single-objective Approach

Wroblewski's algorithm searches for reducts using GA until either it has exhausted the search space, or a pre-defined number of reducts has been found. Three different parameters can be chosen to control the thoroughness and speed of the algorithm. Rosetta gives an option of finding reducts based on a complete set of objects or a subset of objects. In this example, we have chosen discernibility based on all objects. In case of a decision table, one can select modulo decision to avoid discerning between objects belonging to the same decision class. Thus the resultant distinction table consists of object pairs belonging to different classes only. Rosetta also provides options to users for selecting parameters such as number of reducts, seed to start with different random populations, and calculation speed to choose one of the different versions of Wroblewski's algorithm discussed in Section 1.4. We have chosen normal calculation speed and number of reducts = 50. It may be remarked that other approaches (e.g., Vinterbo's method [16]) for finding reducts have also been implemented in Rosetta.

Results

On running Wroblewski's algorithm implemented in Rosetta, we obtain 25 reducts. Table 1.3 summarizes the results. All reducts were tested for reduct membership [15] and found to be true reducts.

Table 1.3. Results of Rosetta

Reduct Length   # Reducts
3               10
4               10
5               2
6               3

1.6.2 Illustration for the Multi-objective Approach

Solutions or chromosomes are binary strings of 1 and 0, of length equal to the total number of attributes. 1 indicates that the particular attribute is present. For example, 10001000000001 means that the 1st, 5th and 14th attributes are present. A random population of size n is generated. Each individual is nothing but a binary string, as just explained. Fitness functions are calculated and non-domination ranking is done, as discussed in Section 1.5.1. For illustration, let us take 4 individuals with the following strings and assume that all 4 individuals can discern among all the objects, i.e. the second fitness function of all individuals is 1.0.

Individual 1: 10001000100000
Individual 2: 01001000010000
Individual 3: 11001000010000
Individual 4: 01001001000100

Since individual 3 is a superset of individual 2, individual 2 dominates it. But individuals 1 and 2 are non-dominated, and are put in the same front. Though individual 4 has cardinality four, which is more than the cardinality of individual 1 and individual 2, it is still kept in the same non-dominated front as individuals 1 and 2. Offspring solutions are created using crowding selection, crossover and mutation operators.

Results

The following parameters are used to run the multi-objective algorithm.

Population Size = 50
Number of Generations = 500
Crossover Probability = 0.6
Mutation Probability = 0.08

Table 1.4 summarizes the results. Again, all reducts were tested for reduct membership [15] and found to be true reducts. A comparison with the results obtained using the single-objective algorithm indicates a greater effectiveness of the multi-objective approach.

Table 1.4. Results of multi-objective implementation

Reduct Length   # Reducts
3               10
4               17
5               14
6               7
7               2
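The four-individual illustration of Section 1.6.2 can be checked with a small sketch of the superset-aware comparison outlined in Section 1.5.1. The function names and the tuple layout are hypothetical, and only the first front is extracted; this is an illustration of the rule, not the implementation used in [1].

```python
N_ATTRIBUTES = 14


def attribute_set(bits):
    """Positions (1-based) of the attributes present in a binary string."""
    return {k for k, ch in enumerate(bits, start=1) if ch == "1"}


def dominates(a, b):
    """True if a must be placed at a better non-domination level than b.

    a and b are (attribute_set, F1, F2) tuples, both objectives maximised.
    """
    set_a, f1_a, f2_a = a
    set_b, f1_b, f2_b = b
    if f2_a == f2_b and f1_a != f1_b:
        # Same discernibility, different cardinality: only a strict superset
        # is pushed to an inferior level; otherwise both share a level.
        return set_a < set_b
    # Regular domination check on the two fitness values.
    return f1_a >= f1_b and f2_a >= f2_b and (f1_a > f1_b or f2_a > f2_b)


strings = {
    "Individual 1": "10001000100000",
    "Individual 2": "01001000010000",
    "Individual 3": "11001000010000",
    "Individual 4": "01001001000100",
}
population = {}
for name, bits in strings.items():
    attrs = attribute_set(bits)
    f1 = (N_ATTRIBUTES - len(attrs)) / N_ATTRIBUTES  # Eq. (1.8)
    f2 = 1.0  # assumed: every individual discerns all object pairs
    population[name] = (attrs, f1, f2)

first_front = [name for name, sol in population.items()
               if not any(dominates(other, sol) for other in population.values())]
print(first_front)
```

Under these assumptions only individual 3 is excluded from the first front, because its attribute set strictly contains that of individual 2; individual 4 stays in the front despite its larger cardinality.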

1.7 Conclusion

In this article we have provided a study on the use of rough sets for feature selection. Handling of high-dimensional data requires a judicious selection of attributes. Feature selection is hence very important for such data analysis. Reducts in rough set theory prove to be relevant for this task. However, reduct computation is a hard problem. It is found that evolutionary algorithms, particularly multi-objective GAs, are useful in computing optimal reducts. Application to microarray data is described. An illustrative example is also provided, to explain the single- and multi-objective approaches.

Identifying the essential features amongst the non-redundant ones also appears to be important in feature selection. The notion of core (the common part of all reducts) of an information system could be relevant in this direction, and it may be a worthwhile future endeavor to conduct an investigation into its role.

Acknowledgment

Ashish Anand is financially supported by DBT project no. DBT/BSBE/20030360. We thank the referees for their suggestions.

References

[1] A. Anand. Representation and learning of inexact information using rough set theory. Master's thesis, Department of Mathematics, Indian Institute of Technology, Kanpur, India, 2002.
[2] M. Banerjee, S. Mitra, and H. Banka. Evolutionary-rough feature selection in gene expression data. IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews, 2005. Accepted.
[3] A. T. Bjorvand. 'Rough Enough' – A system supporting the rough sets approach. In Proceedings of the Sixth Scandinavian Conference on Artificial Intelligence, pages 290–291, Helsinki, Finland, 1997.
[4] L. Cao, H. P. Lee, C. K. Seng, and Q. Gu. Saliency analysis of support vector machines for gene selection in tissue classification. Neural Computing and Applications, 11:244–249, 2003.
[5] V. Chankong and Y. Y. Haimes. Multiobjective Decision Making Theory and Methodology. North-Holland, 1983.
[6] F. Chu, W. Xie, and L. Wang. Gene selection and cancer classification using a fuzzy neural network. In Proceedings of 2004 Annual Meeting of the North American Fuzzy Information Processing Society (NAFIPS 2004), volume 2, pages 555–559, 2004.
[7] K. Deb. Multi-Objective Optimization using Evolutionary Algorithms. John Wiley & Sons, London, 2001.

[8] K. Deb, S. Agarwal, A. Pratap, and T. Meyarivan. A fast and elitist multi-objective genetic algorithm: NSGA-II. IEEE Transactions on Evolutionary Computation, 6:182–197, 2002.
[9] D. E. Goldberg. Genetic Algorithms for Search, Optimization, and Machine Learning. Addison-Wesley, Reading, 1989.
[10] C.-J. Huang. Class prediction of cancer using probabilistic neural networks and relative correlation metric. Applied Artificial Intelligence, 18:117–128, 2004.
[11] R. Jensen. Combining rough and fuzzy sets for feature selection. PhD thesis, School of Informatics, University of Edinburgh, 2004.
[12] J. Komorowski, A. Øhrn, and A. Skowron. The rosetta rough set software system. In W. Klasgen and J. Zytkow, editors, Handbook of Data Mining and Knowledge Discovery, chapter D.2.3. Oxford University Press, 2002.
[13] Z. Pawlak. Rough sets. International J. Comp & Inf. Sc., 1982.
[14] Z. Pawlak. Rough Sets, Theoretical Aspects of Reasoning about Data. Kluwer Academic Publishers, Dordrecht, 1991.
[15] A. Skowron and C. Rauszer. The discernibility matrices and functions in information systems. In R. Slowinski, editor, Handbook of Applications and Advances of the Rough Set Theory, pages 331–362. Kluwer Academic Publishers, Dordrecht, 1992.
[16] S. Vinterbo and A. Øhrn. Minimal approximate hitting sets and rule templates. International Journal of Approximate Reasoning, pages 123–143, 2000.
[17] J. Wroblewski. Finding minimal reducts using genetic algorithms. In Second Annual Joint Conference on Information Sciences, pages 186–189, 1995.
[18] L. Yu and H. Liu. Efficient feature selection via analysis of relevance and redundancy. Journal of Machine Learning Research, 5:1205–1224, 2004.

microarray experiments in bioinformatics. physical simulations in scientific computing and many more) give rise to large ‘warehouses’ of data. the choice of the best clustering solution out of a set of different alternatives. Knowles: Multi-Objective Clustering and Cluster Validation. J. which require the presence of training data. This chapter is concerned with unsupervised classification. the analysis of data sets for which no (or very little) training data is available. and investigate whether a multiobjective approach may also be beneficial vis-a-vis more traditional validation criteria.1 Unsupervised Classification The increasing volume of data arising in the fields of document retrieval and bioinformatics are two prominent examples of a trend in a wide range of dif- ferent research areas. jknowles@manchester.springerlink. that is. and the identification of groups (or clusters) of homogeneous data items — a process commonly referred to as cluster analysis. In this chapter. Handl and J. A variety of such criteria exist. Second. The effi- cient analysis and the generation of new knowledge from these masses of data requires the extensive use of data-driven inductive approaches.uk. The main goals in this data-driven type of analysis are the discovery of a data set’s underlying structure. 2. Data-driven techniques are also referred to as unsupervised techniques.manchester. that is. clustering can therefore be considered as an intrinsically multiobjective optimization problem. we consider two steps in the clustering process that may benefit from the use of multiple objectives. and stand in con- trast to supervised techniques. Studies in Com- putational Intelligence (SCI) 16. Clustering relies on the use of certain criteria that attempt to capture those as- pects that humans perceive as the properties of a good clustering solution. First. 21–47 (2006) www. which can only be handled and processed by means of computers. we consider the problem of model selection. 
we consider the generation of clustering solutions and show that the use of two complementary clustering objectives results in an improved and more robust performance vis-a-vis single-objective clustering algorithms.2 Multi-Objective Clustering and Cluster Validation Julia Handl and Joshua Knowles University of Manchester jhandl@postgrad.uk Summary. many of which are partially complementary or even conflict- ing.ac. and may favour different types of solutions.ac. Novel technologies (such as the Internet in document retrieval. Like many other machine learning problems.com c Springer-Verlag Berlin Heidelberg 2006 .

1 Clustering Informally. Formally. while those within different clusters should be dissimilar. j) ∈ . 10.1.1: The clustering problem INSTANCE: A finite set X. Knowles that is. 17. a distance measure δ(i. Handl and J. This type of method is also referred to as clustering or unsupervised classification [7. 19]. clustering is concerned with the division of data into homogeneous subgroups. that is. it must be hoped that a distance measure or a reduced feature space can be identified under which related data items cluster together in data space. 2. subject to the following two aims: data items within one cluster should be similar to each other. a (sufficiently large) set of data samples for which the correct clas- sification is known. algorithms rely on the presence of distinct structure in the data. the clustering problem can be defined as an optimization problem1 [3]: Definition 1. In the absence of training data.22 J.

$\mathbb{R}_0^+$ for i, j ∈ X, a positive integer K, and a criterion function J(C, δ(·, ·)) on a K-partition C = {C1, ..., CK} of X and measure δ(·, ·).
OPTIMIZATION: Find the partition of X into disjoint sets C1, ..., CK that maximises the expression J(C, δ(·, ·)).¹

¹ Without loss of generality, we here assume maximization.

The number R(K, N) of possible solutions for the division of a data set of size N into K partitions is given by the Stirling number of the second kind:

\[ R(K, N) = \frac{1}{K!} \sum_{i=1}^{K} (-1)^{K-i} \binom{K}{i} \, i^{N} \approx \frac{K^{N}}{K!} \]

Hence, even with a fixed number of partitions K, the search space for the clustering problem grows exponentially and cannot be scanned exhaustively even for medium sized problems. Indeed, the clustering problem is known to be NP-hard in many of its forms [3].

Definitions of the clustering problem vary in the optimization criterion J and the distance function δ(·, ·) used. Unfortunately, choosing an optimization criterion, that is, grasping the intuitive notion of a cluster by means of an explicit definition, is one of the fundamental dilemmas in clustering [9, 20]. While there are several valid properties that may be ascribed to a good clustering, these may be partially conflicting and/or inappropriate in a particular context (see Figure 2.1). Yet, most existing clustering methods attempt, explicitly or otherwise, to optimize just one such property, and through this choice they make a priori assumptions on the type of clusters that can later be identified. This confinement to a particular clustering criterion is one of the reasons for the fundamental discrepancies observable between the solutions produced by different algorithms and will cause a clustering method to fail in a context where the criterion employed is inappropriate. In practical data-mining scenarios, researchers attempt to circumvent this problem through the application and comparison of multiple clustering algorithms [16], or through the a posteriori combination of different clustering results by means of ensemble methods [28, 31].

There are several valid properties that may be ascribed to a good partitioning, but these are partly in conflict and are generally difficult to express in terms of objective functions. Despite this, existing clustering criteria/algorithms do fit broadly into three fundamental categories:

(1) Compactness. This concept is generally implemented by keeping intra-cluster variation small. This category includes algorithms like K-means [21], average link agglomerative clustering [32] or model-based clustering approaches [22]. The resulting methods tend to be very effective for spherical or well-separated clusters, but they may fail to detect more complicated cluster structures [7, 10, 17, 19].

(2) Connectedness. This is a more local concept of clustering based on the idea that neighbouring data items should share the same cluster. Algorithms implementing this principle are density-based methods [1, 8] and methods such as single link agglomerative clustering [32]. They are well-suited for the detection of arbitrarily shaped clusters, but can lack robustness when there is little spatial separation between clusters.

(3) Spatial separation. Spatial separation on its own is a criterion that gives little guidance during the clustering process and can easily lead to trivial solutions. It is therefore usually combined with other objectives, most notably measures of compactness or balance of cluster sizes. The resulting clustering objectives can be tackled by general-purpose meta-heuristics (such as simulated annealing, tabu search and evolutionary algorithms [2, 25]).

[Figure 2.1: three example data sets, A, B and C.]
Fig. 2.1. Examples of data sets exhibiting compactness, connectedness and spatial separation, respectively, are shown above. Clearly, connectedness and spatial separation are related (albeit opposite) concepts. In principle, the cluster structure in the data sets B and C can be identified by a clustering algorithm based on either connectedness or on spatial separation, but not by one based on compactness.
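The growth of the search space given by the Stirling number of the second kind can be checked numerically. The following is a minimal sketch using only the Python standard library; the function names `stirling2` and `approx_partitions` are illustrative and not part of the chapter.

```python
from math import comb, factorial

def stirling2(n: int, k: int) -> int:
    """Exact number of ways to split n items into k non-empty clusters."""
    total = sum((-1) ** (k - i) * comb(k, i) * i ** n for i in range(1, k + 1))
    return total // factorial(k)  # the sum is always divisible by k!

def approx_partitions(n: int, k: int) -> float:
    """The approximation K^N / K! quoted alongside the exact formula."""
    return k ** n / factorial(k)

if __name__ == "__main__":
    for n in (10, 20, 40):
        print(n, stirling2(n, 4), f"{approx_partitions(n, 4):.3e}")
```

Even for K = 4 the count grows by many orders of magnitude as N increases, which is the reason an exhaustive scan is ruled out.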

2.1.2 Model Selection

Determining the most appropriate number of clusters K for a given data set is a second important challenge in clustering. Automated approaches to the determination of the number of clusters frequently work by selecting the most appropriate clustering from a range of partitionings with different Ks, a process often referred to as model selection (see Figure 2.2). Here, the partitioning that is most appropriate is again defined by means of a clustering criterion.

[Figure 2.2: Silhouette Width plotted against the number of clusters for K-means, average link and single link.]
Fig. 2.2. If the structure of a data set and the best number of clusters is not known, a typical approach is to run several clustering algorithms for a range of numbers of clusters. Here, K-means, average link and single link agglomerative clustering have been run on a four-cluster data set (Square3) for K ∈ [1, 20]. The solutions are evaluated under a validation measure, here, the Silhouette Width [26], which takes both cluster compactness and cluster separation into account. For each algorithm, clear local maxima within the corresponding curve (here, at K = 4 for K-means and average link and at K = 2 for single link) are considered as good solutions. The algorithm resulting in the highest validation values (here, K-means) may be selected as the most suitable clustering method.

This way of choosing one partitioning from a set of different partitionings is frequently referred to as model selection. Formally, the process of model selection considered in this chapter can be defined as follows:²

Definition 1.2: Model selection
INSTANCE: A finite set X, a distance measure δ(i, j) ∈ $\mathbb{R}_0^+$ for i, j ∈ X, a set of M partitionings {P1, P2, ..., PM}, and a criterion function J(P, δ(·, ·)) on a partition P of X and measure δ(·, ·).
OPTIMIZATION: Find the partition P ∈ {P1, P2, ..., PM} for which the expression J(P, δ(·, ·)) is maximal.

² Without loss of generality, we here assume maximization.
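Definition 1.2 amounts to an argmax over a finite set of candidate partitionings. The sketch below illustrates this with the Silhouette Width as the criterion J. It assumes the standard Rousseeuw definition of the Silhouette Width, Euclidean distance, and partitionings given as lists of cluster labels; the names `silhouette_width` and `select_model` are hypothetical.

```python
from math import dist  # Euclidean distance (Python 3.8+)

def silhouette_width(labels, points, delta=dist):
    """Mean silhouette value; 0.0 is used for singletons and one-cluster partitions."""
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    if len(clusters) < 2:
        return 0.0
    total = 0.0
    for i, lab in enumerate(labels):
        own = clusters[lab]
        if len(own) == 1:
            continue  # singleton: contributes 0
        # a: mean distance to the other members of the item's own cluster
        a = sum(delta(points[i], points[j]) for j in own if j != i) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(delta(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items() if other != lab
        )
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(labels)

def select_model(partitionings, points, criterion=silhouette_width):
    """Return the candidate partitioning that maximises the criterion."""
    return max(partitionings, key=lambda labels: criterion(labels, points))

if __name__ == "__main__":
    pts = [(0, 0), (0, 1), (10, 0), (10, 1)]
    candidates = [[0, 0, 0, 0], [0, 1, 0, 1], [0, 0, 1, 1]]
    print(select_model(candidates, pts))
```

Any other criterion with the same call signature could be substituted for `silhouette_width`, which is exactly where the dependence on the chosen criterion enters.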

Again, fundamentally different clustering objectives can be used in this process, and the final result may vary strongly depending on the specific objective employed. There is a consensus that good partitionings tend to perform well under multiple complementary clustering criteria and the best clustering is therefore commonly determined using linear or non-linear combinations of different clustering criteria.

2.1.3 Scope of This Work

Evidently, the choice of an appropriate clustering criterion is of major importance in both clustering and model selection. Unfortunately, it is also clear that a secure choice requires knowledge of the underlying data distribution, which is, in most cases, not known a priori. In this chapter, we contend that the framework of Pareto optimization can provide a principled way to deal with this issue. Multiobjective Pareto optimization allows the simultaneous optimization of a set of complementary clustering objectives, which abolishes the need for multiple runs of different clustering algorithms, and is more general than the fixed linear or non-linear combination of individual objectives (see Figure 2.3). We therefore expect the direct optimization of multiple objectives to be beneficial both in the context of clustering and model selection.

In the following, we provide evidence to support this hypothesis in several steps. In Section 2.2, we describe a multiobjective clustering algorithm, MOCK, which simultaneously optimizes two different complementary clustering objectives. Experimental results show that the algorithm has a significantly improved classification performance compared to traditional single-objective clustering methods, and is highly robust over a range of different data properties. In Section 2.3, we consider model selection and discuss how the output of MOCK can be directly used to assess the quality of individual clustering results. Experimental results demonstrate the high performance of the resulting multiobjective validation approach, and we discuss performance differences with respect to single-objective validation techniques. Finally, Section 2.4 concludes.

2.2 Multiobjective Clustering

In order to develop a clustering algorithm that simultaneously considers several complementary aspects of clustering quality, we embrace the framework of Pareto optimization. Specifically, we employ a multiobjective evolutionary algorithm (MOEA) to optimize several clustering objectives, and to obtain a set of trade-off solutions, which represent a good approximation to the Pareto front. The resulting multiobjective clustering algorithm is named "MultiObjective Clustering with automatic K-determination" (MOCK, [13, 14, 15]).

[Figure 2.3: approximation set in objective space, with axes "overall deviation (minimize)" and "connectivity (minimize)" and three highlighted solutions labelled "Locally connected solution", "'Correct' solution" and "Compact solution".]
Fig. 2.3. Clustering can be considered as a multiobjective optimization problem. Instead of finding a single clustering solution, the aim in a multiobjective clustering scenario is to find an approximation to the Pareto front, that is, the set of partitionings that are Pareto optimal with respect to the objectives optimized. Here, two objectives are to be minimized and the approximation set is represented by the corresponding attainment surface, which is the boundary in the objective space separating those points that are dominated by or equal to at least one of the data points in the approximation set from those that no data point in the approximation set dominates or equals. The figure highlights three solutions within this approximation set. The solutions to the top left tend to be locally connected but not compact, whereas those to the bottom right tend to be highly compact. The correct solution corresponds to a compromise between the two objectives, and can therefore only be found through the optimization of both objectives. For sake of clarity, the approximation set in this example only contains solutions for K = 3. More generally, the number of clusters can also be kept dynamic: in this case an approximation set is obtained in which the number of clusters varies along the Pareto front.
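The notion of a Pareto-optimal set used in the caption above can be made concrete with a few lines of code. This is a minimal sketch for minimised objectives, with solutions represented only by their objective vectors; the function names are illustrative.

```python
def dominates(a, b):
    """True if objective vector a Pareto-dominates b (all objectives minimised)."""
    no_worse = all(x <= y for x, y in zip(a, b))
    strictly_better = any(x < y for x, y in zip(a, b))
    return no_worse and strictly_better

def nondominated(points):
    """Keep only the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]

if __name__ == "__main__":
    # (overall deviation, connectivity) pairs; the values are made up for illustration
    candidates = [(10, 0), (6, 2), (7, 3), (3, 5), (3, 6)]
    print(nondominated(candidates))
```

In the made-up example, (7, 3) is dominated by (6, 2) and (3, 6) by (3, 5), so three trade-off solutions remain.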

2.2.1 PESA-II

MOCK is based on an existing MOEA, PESA-II. PESA-II [5, 6] is a well-known algorithm in the evolutionary multiobjective optimization literature, and has been used in comparison studies by several researchers. A high-level description is given in Algorithm 1. Two populations of solutions are maintained: an internal population, IP, of fixed size, and an external population, EP, of non-fixed but limited size. The internal population's job is to explore new solutions, and it achieves this by the standard EA processes of reproduction and variation (i.e., recombination and mutation). The purpose of EP is to exploit good solutions: to this end it implements elitism by maintaining a large and diverse set of nondominated solutions. Selection occurs at the interface between the two populations, primarily in the update of EP.

The solutions in EP are stored in 'niches', implemented as a hypergrid in the objective space. A tally of the number of solutions that occupy each niche is kept and this is used to encourage solutions to cover the whole objective space, rather than bunch together in one region. To this end, nondominated solutions that try to enter a full EP can only do so if they occupy a less crowded niche than some other solution (lines 36 and 37 of Algorithm 1). Moreover, when the internal population of each generation is constructed from EP (lines 9–12), solutions are selected uniformly from among the populated niches; thus highly populated niches do not contribute more solutions than less populated ones. An important advantage of PESA-II is that this niching policy uses an adaptive range equalization and normalization of the objective function values. This means that difficult parameter tuning is avoided, and objective functions that have very different ranges can be readily used. PESA-II can also handle any number of objective functions. For further details on PESA-II, the reader is referred to [4].

2.2.2 Details of MOCK

The application of PESA-II to the clustering problem requires the choice of
• two or more objective functions,
• a suitable genetic encoding of a partitioning,
• one or more genetic variation operators (e.g. mutation and/or crossover),
• and an effective initialization scheme.

These choices are non-trivial and are crucial for the performance and particularly the scalability of the algorithm: many encodings work well for data sets with a few hundred data points, but their performance breaks down rapidly for larger data sets. The design of an effective EA for clustering requires a close harmonization of the encoding, the operators and the objective functions, which makes it possible to narrow down the search space and guide the search effectively.

Algorithm 1 PESA-II (high-level pseudocode)
 1: procedure PESA-II(ipsize, epmaxsize, pc ∈ [0, 1], pm ∈ [0, 1], #gens)
 2:   IP := ∅, EP := ∅
 3:   for each i in 1 to ipsize do                 /* INITIALIZATION */
 4:     si := initialize_solution(i)
 5:     evaluate(si)
 6:     UpdateEP(EP, si, epmaxsize)                /* Procedure defined in line 30 */
 7:   end for
 8:   for gen in 1 to #gens do                     /* MAIN LOOP */
 9:     for each i in 1 to ipsize do
10:       select a populated niche n uniformly at random from EP
11:       select a solution si uniformly at random from n
12:       IP := IP ∪ {si}
13:     end for
14:     i := 1
15:     while i < ipsize do
16:       if random deviate R(0, 1) < pc then
17:         si, si+1 := crossover(si, si+1)
18:       end if
19:       si := mutate(si, pm), si+1 := mutate(si+1, pm)
20:       i := i + 2
21:     end while
22:     for each i in 1 to ipsize do
23:       evaluate(si)
24:       UpdateEP(EP, si, epmaxsize)
25:     end for
26:     IP := ∅
27:   end for
28:   return EP, a set of nondominated solutions
29: end procedure
30: procedure UpdateEP(EP, si, epmaxsize)          /* Update EP with solution si */
31:   if ∃s ∈ EP, si dominates s then
32:     EP := EP ∪ {si} \ {s ∈ EP, si dominates s}
33:   else if si is nondominated in EP then
34:     if |EP| < epmaxsize then
35:       EP := EP ∪ {si}
36:     else if ∃s ∈ EP, si is in a less crowded niche than s then
37:       EP := EP ∪ {si} \ {s, a solution from a most-crowded niche}
38:     end if
39:   end if
40:   update all niche counts
41: end procedure
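The niche-based selection in lines 10 and 11 of Algorithm 1 can be sketched as follows. This is an illustration only, not the PESA-II reference implementation: solutions are represented by their objective vectors alone, the hypergrid is rebuilt from the current range of EP on every call, and the names `niche_of`, `build_niches`, `select_parent` and the parameter `divisions` are assumptions made for this example.

```python
import random

def niche_of(obj, lows, highs, divisions):
    """Map an objective vector to a hypergrid cell (tuple of bin indices)."""
    cell = []
    for value, lo, hi in zip(obj, lows, highs):
        span = (hi - lo) or 1.0  # avoid division by zero on a flat objective
        cell.append(min(int((value - lo) / span * divisions), divisions - 1))
    return tuple(cell)

def build_niches(ep, divisions=4):
    """Group the external population by grid cell, using its own objective ranges."""
    dims = range(len(ep[0]))
    lows = [min(s[d] for s in ep) for d in dims]
    highs = [max(s[d] for s in ep) for d in dims]
    niches = {}
    for s in ep:
        niches.setdefault(niche_of(s, lows, highs, divisions), []).append(s)
    return niches

def select_parent(ep, divisions=4, rng=random):
    """Pick a populated niche uniformly, then a solution uniformly within it."""
    niches = build_niches(ep, divisions)
    cell = rng.choice(sorted(niches))
    return rng.choice(niches[cell])

if __name__ == "__main__":
    ep = [(0.0, 10.0), (0.1, 9.9), (0.2, 9.8), (10.0, 0.0)]
    picks = [select_parent(ep, divisions=2) for _ in range(1000)]
    print(picks.count((10.0, 0.0)) / len(picks))
```

Because the niche is chosen before the solution, the single solution in the sparsely populated cell is selected about as often as the three crowded solutions together.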

Following extensive experiments using a range of different encodings, operators and objective functions, an effective combination of encoding, operators and objective functions was derived, the details of which are explained in the following.

Objective Functions

For the clustering objectives, we are interested in selecting optimization criteria that reflect fundamentally different aspects of a good clustering solution. From the groups identified in Figure 2.1, we therefore select two types of complementary objectives: one based on compactness, the other one based on connectedness of clusters. We refrain from using a third objective based on spatial separation, as the concept of spatial separation is intrinsic (opposite) to that of connectedness of clusters.

In order to express cluster compactness we compute the overall deviation of a partitioning. This is simply computed as the overall summed distances between data items and their corresponding cluster centre:

\[ \mathrm{Dev}(C) = \sum_{C_k \in C} \sum_{i \in C_k} \delta(i, \mu_k), \]

where C is the set of all clusters, μk is the centroid of cluster Ck and δ(·, ·) is the chosen distance function. As an objective, overall deviation should be minimized. The criterion is strongly biased towards spherically shaped clusters.

As an objective reflecting cluster connectedness, we use a measure, connectivity, which evaluates the degree to which neighbouring data-points have been placed in the same cluster. It is computed as

\[ \mathrm{Conn}(C) = \sum_{i=1}^{N} \sum_{j=1}^{L} x_{i,\,nn_{ij}}, \qquad x_{r,s} = \begin{cases} \frac{1}{j} & \text{if } \nexists\, C_k : r \in C_k \wedge s \in C_k \\ 0 & \text{otherwise,} \end{cases} \]

where nn_ij is the jth nearest neighbour of datum i, and L is a parameter determining the number of neighbours that contribute to the connectivity measure. As an objective, connectivity should be minimized. This criterion captures local densities; it can therefore detect arbitrarily shaped clusters, but is not robust toward overlapping clusters.

Connectivity (like our mutation operator, see below) requires the one-off computation of the nearest neighbour lists of all data items in the initialization phase. Subsequently, both objectives, overall deviation and connectivity, can be efficiently computed in linear time.

An important aspect in the choice of these objective functions is their potential to balance each other's tendency to increase or decrease the number of clusters.
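The two objectives defined above can be written down directly. The sketch below assumes Euclidean distance, points given as tuples and partitionings given as label lists; the brute-force `nearest_neighbours` helper is only meant for small examples and is not the chapter's implementation.

```python
from math import dist  # Euclidean distance (Python 3.8+)

def overall_deviation(labels, points):
    """Dev(C): summed distance of every item to the centroid of its cluster."""
    clusters = {}
    for lab, p in zip(labels, points):
        clusters.setdefault(lab, []).append(p)
    dev = 0.0
    for members in clusters.values():
        centroid = tuple(sum(coord) / len(members) for coord in zip(*members))
        dev += sum(dist(p, centroid) for p in members)
    return dev

def nearest_neighbours(points, L):
    """One-off computation: nn[i] lists the L nearest neighbours of item i, closest first."""
    nn = []
    for i, p in enumerate(points):
        others = [j for j in range(len(points)) if j != i]
        others.sort(key=lambda j: dist(p, points[j]))
        nn.append(others[:L])
    return nn

def connectivity(labels, nn):
    """Conn(C): penalty 1/j whenever the j-th nearest neighbour lies in another cluster."""
    conn = 0.0
    for i, neighbours in enumerate(nn):
        for j, neighbour in enumerate(neighbours, start=1):
            if labels[i] != labels[neighbour]:
                conn += 1.0 / j
    return conn

if __name__ == "__main__":
    pts = [(0, 0), (0, 1), (10, 0), (10, 1)]
    nn = nearest_neighbours(pts, L=1)
    for labels in ([0, 0, 0, 0], [0, 0, 1, 1], [0, 1, 2, 3]):
        print(labels, overall_deviation(labels, pts), connectivity(labels, nn))
```

The three partitionings in the usage example show the opposing trivial optima: singletons minimise overall deviation, a single cluster minimises connectivity, and the two-cluster solution is good under both.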

While the objective value associated with overall deviation necessarily improves with an increasing number of clusters, the opposite is the case for connectivity. The interaction of the two is important in order to explore sensible parts of the solution space, and not to converge to trivial solutions (which would be N singleton clusters for overall deviation and only one cluster for connectivity).

Encoding

For the encoding, we employ the locus-based adjacency representation proposed in [23]. In this graph-based representation, each individual g consists of N genes g1, ..., gN, where N is the size of the clustered data set, and each gene gi can take allele values j in the range {1, ..., N}. Thus, a value of j assigned to the ith gene is interpreted as a link between data items i and j: in the resulting clustering solution they will be in the same cluster. The decoding of this representation requires the identification of all subgraphs. All data items belonging to the same subgraph are then assigned to one cluster. Note that, using a simple backtracking scheme, this decoding step can be done in linear time [13].

The locus-based adjacency encoding scheme has several major advantages for our application. Most importantly, there is no need to fix the number of clusters in advance, as it is automatically determined in the decoding step. Hence, we can evolve and compare solutions with different numbers of clusters in just one run of the GA. Furthermore, the representation is well-suited for use with standard crossover operators such as uniform, one-point or two-point crossover.

[Figure 2.4: graph structures and genotypes of the parents A and B and of the child C obtained by uniform crossover.
A:    2 3 4 2 4 5 8 7
B:    3 1 2 3 6 5 6 7
mask: 0 1 0 0 1 1 0 0
C:    2 1 4 2 6 5 8 7]
Fig. 2.4. Two parent partitionings, A and B, are shown, together with their graph structure and their respective genotypes. A standard uniform crossover of the genotypes yields the child C, which has inherited much of its structure from its parents, but differs from both of them.
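Decoding a locus-based adjacency genotype means finding the connected components of the link graph. The sketch below uses a union-find structure for this, which is a substitute for the backtracking scheme mentioned in the text, not the authors' implementation; it is applied to the genotypes shown in Figure 2.4 (1-based allele values).

```python
def decode(genotype):
    """Return a cluster label per item; genotype[i-1] = j links item i to item j."""
    n = len(genotype)
    parent = list(range(n))

    def find(x):
        # Follow parent pointers to the root, shortening the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, j in enumerate(genotype):
        root_i, root_j = find(i), find(j - 1)  # alleles are 1-based
        if root_i != root_j:
            parent[root_i] = root_j

    # Relabel the roots as 0, 1, 2, ... in order of first appearance.
    labels = {}
    return [labels.setdefault(find(i), len(labels)) for i in range(n)]

if __name__ == "__main__":
    A = [2, 3, 4, 2, 4, 5, 8, 7]
    B = [3, 1, 2, 3, 6, 5, 6, 7]
    C = [2, 1, 4, 2, 6, 5, 8, 7]
    for name, g in (("A", A), ("B", B), ("C", C)):
        print(name, decode(g))
```

The number of clusters is never specified: it falls out of the decoding, so parents with two clusters each can produce a child with three.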

In more traditional encodings for clustering these straightforward crossover operators are usually highly disruptive and therefore detrimental for the clustering process. In a link-based encoding, in contrast, they effortlessly implement merging and splitting operations on individual clusters, while maintaining the remainder of the partitioning (see Figure 2.4).

Uniform Crossover

We choose the uniform crossover [29] in favour of one- or two-point crossover because it is unbiased with respect to the ordering of genes and can generate any combination of alleles from the two parents (in a single crossover event) [33].

Neighbourhood-biased Mutation Operator

While the encoding results in a very large search space with N^N possible combinations, a suitable mutation operator can be employed to significantly reduce the size of the search space. We use a restricted nearest neighbour mutation where each data item can only be linked to one of its L nearest neighbours. Hence, gi ∈ {nn_i1, ..., nn_iL}, where nn_il denotes the lth nearest neighbour of data item i. This reduces the extent of the search space to just L^N. Note that the nearest neighbour list can be precomputed in the initialization phase of the algorithm.

The properties of the encoding can additionally be employed to bias the mutation probabilities of individual genes. Intuitively, 'longer links' in the encoding are expected to be less favourable, that is, a link i → j with j = nn_il (item j is the lth nearest neighbour of item i) may be preferred over a link i → j* with j* = nn_ik (item j* is the kth nearest neighbour of item i) if l < k. This can be used to bias the mutation probability of individual links i → j, which we now define as

\[ p_m = \frac{1}{N} + \left( \frac{l}{N} \right)^{2}, \]

where j = nn_il and N is the size of the data set.

Initialization

Our initialization routine is based on the observation that different clustering algorithms tend to perform better (find better approximations) in different regions of the Pareto front [14]. In particular, algorithms consistent with connectivity tend to generate close to optimal solutions in those regions of the Pareto front where connectivity is low, whereas algorithms based on compactness perform well in the regions where overall deviation has significantly decreased.
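The uniform crossover and the neighbourhood-biased mutation described above can be sketched as follows. The crossover reproduces the example of Figure 2.4. For the mutation, items are indexed from 0, and a link whose target is not in the neighbour list is treated as if it had rank L + 1; that fallback is an assumption made for this sketch, not something the text specifies.

```python
import random

def uniform_crossover(parent_a, parent_b, mask):
    """Take each gene from parent_b where the mask is 1, otherwise from parent_a."""
    return [b if m else a for a, b, m in zip(parent_a, parent_b, mask)]

def mutation_probability(rank, n):
    """p_m = 1/N + (l/N)^2 for a link to the l-th nearest neighbour."""
    return 1.0 / n + (rank / n) ** 2

def mutate(genotype, nn, rng=random):
    """Re-link item i to one of its L nearest neighbours with probability p_m.

    nn[i] lists the L nearest neighbours of item i (0-based), closest first.
    """
    n = len(genotype)
    child = list(genotype)
    for i, target in enumerate(genotype):
        if target in nn[i]:
            rank = nn[i].index(target) + 1
        else:
            rank = len(nn[i]) + 1  # assumption: links outside the list count as rank L + 1
        if rng.random() < mutation_probability(rank, n):
            child[i] = rng.choice(nn[i])
    return child

if __name__ == "__main__":
    A = [2, 3, 4, 2, 4, 5, 8, 7]
    B = [3, 1, 2, 3, 6, 5, 6, 7]
    mask = [0, 1, 0, 0, 1, 1, 0, 0]
    print(uniform_crossover(A, B, mask))
```

Since longer links get a larger mutation probability, links to distant neighbours are replaced more often than links to close ones.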

We therefore use an initialization based on two different single-objective algorithms, in order to obtain a good initial spread of solutions and a close initial approximation to the Pareto front. Solutions performing well under connectivity are generated using minimum spanning trees (MSTs). The use of MSTs has the advantage that the links present in a given MST can be directly translated as the encoding of individual solutions. Solutions performing well under overall deviation are generated using the K-means algorithm. The translation of these flat (that is, non-hierarchical) partitionings to the encoding of individual solutions is more involved, as described below.

Generation of Interesting MST Solutions

For a given data set, we first compute the complete MST using Prim's algorithm [34]. As the complete MST corresponds to a one-cluster solution, we are then interested in obtaining a range of good solutions with different numbers of clusters. Simply removing the largest links from this MST does not yield the desired results: in the absence of spatially separated clusters the method tends to isolate outliers so that many of the solutions generated are highly similar. In order to avoid this effect, we employ a definition of interestingness that distinguishes between 'uninteresting' links whose removal leads to the separation of outliers, and 'interesting' links whose removal leads to the discovery of real cluster structures.

Definition 1.3: A link i → j is considered as interesting, iff i = nn_jl ∧ j = nn_ik ∧ l > L ∧ k > L, where L is a parameter. Its degree of interestingness is d = min(l, k).

Informally, this means that a link between two items i and j is considered interesting if neither of them is a part of the other item's set of L nearest neighbours.

[Figure 2.5: three panels showing the same 13 data items and the corresponding genotypes.
Position:                                1 2 3 4 5 6 7 8 9 10 11 12 13
Genotype (original MST):                 1 5 2 2 1 3 6 7 7  3 10 10 12
Genotype (boundary-crossing links removed): 1 5 - 2 1 3 6 7 7  - 10 10 12
Genotype (missing links replaced):       1 5 8 2 1 3 6 7 7 13 10 10 12]
Fig. 2.5. Construction of an MST-similar solution from a given K-means solution. Starting from the original MST solution, all links that cross cluster boundaries (defined by the three-cluster K-means solution indicated here by the ellipses) are removed. A missing link emanating from data item i is then replaced by a (randomly determined) link to one of its L nearest neighbours.
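Definition 1.3 can be tested link by link once the full neighbour ranking of every item is known. The sketch below assumes Euclidean distance and 0-based item indices; `neighbour_ranks` and `interestingness` are illustrative names, and the brute-force ranking is only suitable for small examples.

```python
from math import dist  # Euclidean distance (Python 3.8+)

def neighbour_ranks(points):
    """ranks[i][j] = l such that item j is the l-th nearest neighbour of item i."""
    ranks = []
    for i, p in enumerate(points):
        others = [j for j in range(len(points)) if j != i]
        others.sort(key=lambda j: dist(p, points[j]))
        ranks.append({j: l for l, j in enumerate(others, start=1)})
    return ranks

def interestingness(i, j, ranks, L):
    """Degree d = min(l, k) if the link i -> j is interesting, otherwise None."""
    k = ranks[i][j]  # j is the k-th nearest neighbour of i
    l = ranks[j][i]  # i is the l-th nearest neighbour of j
    return min(l, k) if l > L and k > L else None

if __name__ == "__main__":
    # Two well-separated groups of three items each.
    pts = [(0, 0), (1, 0), (2, 0), (10, 0), (11, 0), (12, 0)]
    ranks = neighbour_ranks(pts)
    print(interestingness(2, 3, ranks, L=2))  # link bridging the two groups
    print(interestingness(0, 1, ranks, L=2))  # link inside a group
```

In the usage example only the link bridging the two groups qualifies, because neither endpoint is among the other's two nearest neighbours.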

Definition 1.4: A clustering solution C is considered as interesting, if it can be deduced from the full MST through the removal of interesting links only.

For a given data set, a set of interesting MST-derived solutions can then be constructed as follows. In a first step, all I interesting links from the MST are detected and are sorted by their degree of interestingness. Using this sorted list, a set of clustering solutions is then constructed: for n ∈ [0, min(I, 0.5 × fsize)], where fsize is the total number of initial solutions, clustering solution Cn is generated by removing the first n interesting links. The missing links are then replaced by a link to a randomly chosen neighbour j with j = nn_il ∧ l ≤ L.

Generation of K-means Solutions

Next, we consider the generation of K-means solutions. We start by running the K-means algorithm (for 10 iterations) for different numbers of clusters K ∈ [2, fsize − (min(I, 0.5 × fsize) + 1)]. The resulting partitionings are then converted to MST-based genotypes as illustrated in Figure 2.5. The preservation of a high degree of MST information (within clusters) at this stage is crucial for the quick convergence of the algorithm. Note that the numbers of clusters obtained as the final phenotypes are not pre-defined and can increase or decrease depending on the structure of the underlying MST.

2.2.3 Experimental Setup

MOCK generates a set of different clustering solutions, which correspond to different trade-offs between the two objectives and to different numbers of clusters. While this provides a large set of solutions to choose from (typically more than one solution for each K), this experimental section primarily focuses on MOCK's 'peak' performance at generating high-quality clustering solutions. In order to assess this, we evaluate the quality of the best solution present in the final Pareto set. In order to put MOCK's results into perspective, we compare to three single-objective algorithms, which are run for a range of different numbers of clusters K ∈ [1, 50]. Again, only the best solution within this range is selected (note that, on many data sets, the number of clusters of the best solution may be quite different from the known 'correct' number of clusters, e.g. because of the isolation of outliers).

We compare these algorithms on a range of data sets exhibiting different difficult data properties including cluster overlap (the Square series), unequally sized clusters (the Sizes series) and elongated clusters of arbitrary shape (the Long, Smile and Triangle series). See Figure 2.6 for an illustration of some of these data sets, and [12] for a detailed description.

Clustering quality is objectively evaluated using the Adjusted Rand Index, an external measure of clustering quality. External indices evaluate a partitioning by comparing to a 'gold standard', that is, the true class labels.
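For readers who want to reproduce the evaluation, the Adjusted Rand Index can be computed from the contingency table of the two labelings. The sketch below assumes the standard Hubert and Arabie form of the index and at least two data items; whether the chapter's Appendix uses exactly this form is an assumption.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_true, labels_pred):
    """Chance-corrected agreement between two labelings of the same items."""
    n = len(labels_true)
    contingency = Counter(zip(labels_true, labels_pred))
    sum_cells = sum(comb(c, 2) for c in contingency.values())
    sum_rows = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_cols = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = sum_rows * sum_cols / comb(n, 2)
    maximum = (sum_rows + sum_cols) / 2
    if maximum == expected:
        return 1.0  # degenerate case, e.g. both labelings are trivial
    return (sum_cells - expected) / (maximum - expected)

if __name__ == "__main__":
    truth = [0, 0, 1, 1]
    print(adjusted_rand_index(truth, [1, 1, 0, 0]))  # same partition, relabelled
    print(adjusted_rand_index(truth, [0, 1, 0, 1]))  # maximally conflicting split
```

The index depends only on which items are grouped together, so relabelling the clusters leaves it unchanged, and partitions with different numbers of clusters can be compared directly.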

Compared to other external indices, the Adjusted Rand Index has the strong advantage that it takes into account biases introduced due to the distribution of class sizes and differences in the number of clusters. This is of particular importance, as we compare clustering results across a range of different numbers of clusters. Implementation details for the individual algorithms and the Adjusted Rand Index are given in the Appendix.

[Figure 2.6: six scatter plots of example data sets.]
Fig. 2.6. Examples of data sets exhibiting different properties. Top row (from left to right): Square1, Square3 and Sizes3. Bottom row (from left to right): Long1, Triangle1 and Smile1. The clusters in these data sets (with the exception of two clusters in Smile) are described by Normal Distributions, and, in every run, a new instance is sampled from these distributions (hence, all instances differ).

2.2.4 Experimental Results

Experimental results confirm that MOCK is indeed robust towards the range of different data properties studied, and it clearly outperforms traditional single-objective clustering algorithms in this respect. Table 2.1 shows the results of the comparison of MOCK to K-means, average link and single link. The results demonstrate that all single-objective algorithms perform well for certain subsets of the data, but that their performance breaks down dramatically for those data sets violating the assumptions made by the employed clustering criterion. Here, frequently, the failure of an algorithm becomes evident not only through a low Adjusted Rand Index, but also through a dramatic increase in the number of clusters of the best solution. As can be expected, K-means and agglomerative clustering perform very well for spherically shaped clusters, such as the Sizes and Square series, but fail to detect clusters of elongated shapes. Single link detects well-separated clusters of arbitrary shape, but fails for 'noisy' data sets. MOCK, in contrast, shows a consistently good performance across the entire range of data properties. It is the best performer on ten out of fourteen of the data sets.

On the remaining four data sets it is the second best performer with only little difference to the best performer K-means. The number of clusters associated with the best solution is usually very close to the real number of clusters in the data set. For additional experimental results, including a comparison to an advanced ensemble technique, and results on complex high-dimensional and real test data with large numbers of clusters, the reader is referred to [12, 15]. All of these experiments confirm the high performance of the multiobjective clustering approach.

Table 2.1. Number of clusters and quality of the best solution (in terms of the Adjusted Rand Index) generated by MOCK, K-means, average link and single link on a number of artificial data sets exhibiting different cluster properties (averages over 50 runs). The best and second best performer are highlighted in bold and italic face respectively.
Columns: Problem (Name, K), followed by K and Rand for each of MOCK, K-means, Average link and Single link.
Problems (true K): Square1 (4), Square2 (4), Square3 (4), Sizes1 (4), Sizes2 (4), Sizes3 (4), Long1 (2), Long2 (2), Long3 (2), Smile1 (4), Smile2 (4), Smile3 (4), Triangle1 (4), Triangle2 (4).

2.3 Multiobjective Model Selection

After dealing with the generation of clustering solutions in the previous section, we now turn our interest to model selection, that is, the identification of one or several promising solutions from a large set of given clustering solutions. Model selection is particularly relevant for multiobjective clustering, as the algorithm does not return a single solution, but a set of solutions representing an approximation to the Pareto front. The individual partitionings in this approximation set correspond to different trade-offs between the two objectives but also consist of different numbers of clusters. While this may be a very useful feature under certain circumstances (e.g. human experts may find it preferable to have the opportunity to choose from a set of clustering solutions, to analyze several alternative solutions and to bring to bear any specialized domain expertise available), other applications may require the automatic selection of just one 'best' solution. In this section, we therefore introduce a method for identifying the most promising clustering solutions in the candidate set. We will also show how this methodology, originally developed for MOCK, can be applied to analyze the output of other clustering algorithms.

2.3.1 Related Work

The approach we have developed is inspired by Tibshirani et al's Gap statistic [30], a statistical method to determine the number of clusters in a data set. The Gap statistic is based on the expectation that the most suitable number of clusters shows in a significant 'knee' when plotting the performance of a clustering algorithm (in terms of a selected internal validation measure) as a function of the number of clusters. As internal validation measures are generally biased by the number of clusters (they show an increasing/decreasing trend that is solely due to a change in the number of clusters), the 'knee' can be best identified in a normalized plot, that is, a performance plot that takes out the bias resulting purely from a change in the number of clusters. Tibshirani et al realize this by generating a number of reference partitionings for random data. From the normalized performance curve they then identify the smallest number of clusters for which the gain in performance is not higher than would be expected for random data.

2.3.2 Proposed Approach

Several aspects of this idea can be carried over to our case of two objectives. Intuitively, we equally expect the structure of the data to be reflected in the shape of the Pareto front. From the two objectives employed, overall deviation decreases with an increasing number of clusters, whereas connectivity increases. Hence, generally, when considering two solutions with K = k and K = k + 1 respectively (where k ∈ [1, N], and N is the size of the data set), we can say that we gain an improvement in overall deviation δ_overall deviation at the cost of a degradation in connectivity δ_connectivity. For a number of clusters smaller than the true number K, we expect the ratio

\[ R = \frac{\delta_{\text{overall deviation}}}{\delta_{\text{connectivity}}} \]

to be large: the separation of two clusters will trigger a great decrease in overall deviation, with only a small increase in connectivity. When we surpass the correct number of clusters this ratio will diminish: the decrease in overall deviation will be less significant but come at a high cost in terms of connectivity (because a true cluster is being split).

Using this knowledge, let us consider a plot of the Pareto front. Due to the natural bias of both measures, the solutions are approximately ordered by the number of clusters they contain: K gradually increases from left to right.
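The behaviour of the ratio R can be illustrated on a small hand-made front. The sketch below assumes one representative solution per K, given as (K, overall deviation, connectivity) and sorted by K; the numbers are invented purely for illustration and are not results from the chapter.

```python
def tradeoff_ratios(front):
    """Ratio R between consecutive solutions: gain in deviation per unit loss in connectivity.

    front: list of (K, overall_deviation, connectivity), sorted by K.
    Returns a dict mapping k to the ratio for the step from K = k to the next entry.
    """
    ratios = {}
    for (k, dev, conn), (_, dev_next, conn_next) in zip(front, front[1:]):
        gain = dev - dev_next    # improvement in overall deviation
        cost = conn_next - conn  # degradation in connectivity
        ratios[k] = gain / cost if cost > 0 else float("inf")
    return ratios

if __name__ == "__main__":
    # Made-up front for a data set with four true clusters.
    front = [(1, 100, 0), (2, 60, 1), (3, 30, 2), (4, 10, 3), (5, 9, 8), (6, 8, 14)]
    for k, r in tradeoff_ratios(front).items():
        print(f"K={k} -> K={k + 1}: R = {r:.2f}")
```

In the invented example the ratio stays large up to the step that reaches K = 4 and collapses immediately afterwards, which is the 'knee' the approach looks for.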

K gradually increases from left to right. The distinct change in R occurring for the correct number of clusters can therefore be seen as a ‘knee’. In order to help us correctly determine this knee, we can again use random reference data distributions. Clustering a number of such distributions using MOCK provides us with a set of ‘reference fronts’ (see Figure 2.7). Unfortunately, a normalization of the original ‘solution front’ using the ‘reference fronts’ is not as straightforward as the normalization of the performance curve for the Gap statistic. This is because both solution and control fronts contain not just one, but a set of solutions for every value of K, and it is therefore not clear how individual points in the solution front should be normalized. We overcome this problem by a heuristic approach, described next.

[Figure 2.7: two panels. Left: overall deviation against connectivity, showing the solution front and the control fronts. Right: attainment score against number of clusters.]

Fig. 2.7. (left) Solution and control reference fronts for a run of MOCK on the Square1 data set. The solution with the largest minimum distance to the reference fronts is indicated by the arrow and corresponds to the desired K = 4 cluster solution. (right) Attainment scores for the Square1 data set. Plot of the scores as a function of K. The global maximum at K = 4 is clearly visible.

Control Data

The reference distributions are obtained using a Poisson model in eigen-space as suggested by Sarle [27]. Specifically, a principal component analysis is applied to the covariance matrix of the original data. The eigenvectors and eigenvalues obtained are then used for the definition of a uniform distribution: the data is generated within a hyperbox in eigenspace, where each side of the hyperbox is proportional in length to the size of the eigenvalue corresponding to this dimension. The resulting data is then back-transformed to the original data space.

Alignment of Solution and Control Fronts

Given both solution and reference fronts, we set Kmin = 1 and identify Kmax, the highest number of clusters shared by all fronts. Subsequently, we restrict the analysis to solutions with a number of clusters K ∈ [Kmin, Kmax]. Solution points that are dominated by any reference point are also excluded from further consideration. For each front, we then determine the minimum and maximum value of both overall deviation and connectivity, and use these to scale all objective values to lie within the region [0, 1] × [0, 1]. We further transform the objective values by taking the square root of each objective, a step motivated by the observation that both overall deviation and connectivity show a non-linear development with respect to K. Overall deviation decreases very rapidly for the first few K, while changes for higher numbers of clusters are far less marked. Connectivity, in contrast, rises very quickly for higher numbers of clusters, while initial changes in the degree of connectivity are rather small. This results in an uneven sampling of the range of objective values, which, in a plot of the Pareto front, shows in a high density of points at the tails, with fewer solution points in the centre. Taking the square root of the objective values is an attempt to reduce this ‘squeezing’ effect, and give a higher degree of emphasis to small (but distinct) changes in the objectives. By this means, the algorithm becomes more precise at identifying solutions situated in all parts of the Pareto front, in particular those at the tails which may correspond e.g. to partitionings with elongated cluster shapes or a high number of clusters.

Attainment Scores

For both solution and reference fronts, we subsequently compute the attainment surfaces [11]. The attainment surface of a Pareto front is uniquely defined by the points in the front. It is the boundary in the objective space separating those points that are dominated by or equal to at least one of the data points in the front from those that no data point in the front dominates or equals (see Figure 2.7). For each point in the solution front we then compute its distance to the attainment surfaces of each of the reference fronts, and we refer to this distance as the ‘attainment score’. For a given solution point p, we compute its attainment score as the Euclidean distance between p and the closest point on the reference attainment surface. Finally, we plot the attainment scores as a function of the number of clusters K (see Figure 2.7). The global maximum in this plot may be considered as a hypothesis as to the best solution. All solutions corresponding to the local optima in the resulting plot are considered as promising solutions.

2.3.3 Experimental Setup

In this experimental section, we investigate the performance of MOCK’s attainment method as a general tool for model selection. To make the investigation more meaningful, the performance of the attainment method is compared to an existing cluster validation technique, the Silhouette Width, which is highly regarded in the clustering literature as a means of model selection.
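The alignment and scoring steps of Section 2.3.2 can be illustrated with a small sketch. It is not MOCK's implementation: the function names are hypothetical, both objectives are assumed to be minimised, the scaling bounds are assumed to be taken jointly over all fronts, and the score of a solution point is assumed to be its smallest distance over all reference attainment surfaces (the ‘largest minimum distance’ reading of Figure 2.7). The restriction to K ∈ [Kmin, Kmax] is left out for brevity.

```python
import math


def dominates(a, b):
    """True if a is no worse than b in both (minimised) objectives and differs from b."""
    return a[0] <= b[0] and a[1] <= b[1] and tuple(a) != tuple(b)


def normalise(fronts):
    """Scale all points jointly to [0, 1] x [0, 1], then take the square root
    of each objective (assumption: one set of bounds shared by all fronts)."""
    points = [p for front in fronts for p in front]
    lo = [min(p[i] for p in points) for i in range(2)]
    hi = [max(p[i] for p in points) for i in range(2)]

    def transform(p):
        return tuple(
            math.sqrt((p[i] - lo[i]) / (hi[i] - lo[i])) if hi[i] > lo[i] else 0.0
            for i in range(2)
        )

    return [[transform(p) for p in front] for front in fronts]


def distance_to_attainment_surface(p, front):
    """Euclidean distance from p to the attainment surface of `front`.

    For a point that no member of the front dominates, the closest point of
    the surface is the closest point of some quadrant {x >= r}, r in front."""
    return min(
        math.hypot(max(r[0] - p[0], 0.0), max(r[1] - p[1], 0.0)) for r in front
    )


def attainment_scores(solution_front, reference_fronts):
    """Map each retained solution point (as given) to its attainment score."""
    scaled = normalise([solution_front] + list(reference_fronts))
    solution, references = scaled[0], scaled[1:]
    scores = {}
    for original, p in zip(solution_front, solution):
        # Points dominated by any reference point are excluded.
        if any(dominates(r, p) for ref in references for r in ref):
            continue
        scores[original] = min(
            distance_to_attainment_surface(p, ref) for ref in references
        )
    return scores
```

Points are passed as (connectivity, overall deviation) tuples, or in the opposite order, as long as the order is the same in every front. The point with the largest score is the candidate best solution.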

To get a broad and detailed picture of how both methods perform, we apply them to the clustering results from four different clustering algorithms, and do this across the full range of data sets introduced in Section 2.2.3. Specifically, we run MOCK and three single-objective algorithms (the latter run for a range of different numbers of clusters K ∈ [1, 50]). On each data set, and for each algorithm, the two validation techniques are then used to perform model selection, that is, to select one or more partitionings from those generated by the clustering algorithm, as candidate ‘best’ solutions. The selected solutions are then objectively evaluated using the Adjusted Rand Index, an external measure of clustering quality. To isolate more clearly the performance of the validation techniques from the performance of the clustering algorithms, we also give the evaluation of the overall best solution returned by the clustering algorithm. In the following, we briefly describe the Silhouette Width (see also the Appendix) and state how the two validation techniques are applied to our data.

The Silhouette Width is a popular validation technique commonly used for model selection in cluster analyses. Notably, it can be considered a (fixed) nonlinear combination of two objectives, as it takes both intra-cluster distances and inter-cluster distances into account. The procedure for its use in model selection is as follows: for each K ∈ [1, 50], the Silhouette Width of the corresponding clustering solution is computed. In a plot of the Silhouette Width as a function of the number of clusters, the global maximum and all local maxima are identified. The global maximum is considered the estimated best solution. All local maxima are considered as alternative, potentially interesting solutions.

Similarly, MOCK’s attainment method can be adapted for model selection on the clustering results of any single-objective algorithm. For this purpose, clustering results for a range of K ∈ [1, 50] are obtained both on the original data set and random control data (which is generated in the same way as described in Section 2.3.2). The resulting sequences are then subjected to the procedure detailed in Section 2.3.2, in order to obtain attainment scores for all solutions. In a plot of the attainment score as a function of the number of clusters, the global maximum and all local maxima are identified. The global maximum is considered the guess for the best solution. All local maxima are considered as alternative, potentially interesting solutions.

2.3.4 Experimental Results

The Silhouette Width assumes compact, spherically shaped clusters. We would therefore expect it to perform well for data sets exhibiting this type of structure, but to perform poorly for data containing arbitrarily shaped, elongated clusters. In contrast, we would expect MOCK’s attainment method to perform more uniformly well for data of different structures. Table 2.2, Table 2.3, Table 2.4 and Table 2.5 show the results obtained for model selection on the results returned by the four different algorithms.
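The selection rule used for both validation techniques in the experimental setup (take the global maximum as the best guess and keep all local maxima as alternatives) can be sketched as follows. The function name and the treatment of the two end points of the curve are assumptions made for this illustration; the text does not say how boundary values of K or ties are handled.

```python
def select_models(score_by_k):
    """Given a mapping K -> validation score (Silhouette Width or attainment
    score), return (K at the global maximum, list of K at local maxima).

    Assumption: an end point counts as a local maximum if it exceeds its
    single neighbour; plateaus (ties with a neighbour) are not reported."""
    ks = sorted(score_by_k)
    best_k = max(ks, key=lambda k: score_by_k[k])
    local_maxima = []
    for i, k in enumerate(ks):
        left = score_by_k[ks[i - 1]] if i > 0 else float("-inf")
        right = score_by_k[ks[i + 1]] if i + 1 < len(ks) else float("-inf")
        if score_by_k[k] > left and score_by_k[k] > right:
            local_maxima.append(k)
    return best_k, local_maxima
```

For example, with scores {2: 0.3, 3: 0.5, 4: 0.8, 5: 0.4, 6: 0.6, 7: 0.2} the function returns K = 4 as the estimated best solution and K = 4 and K = 6 as the local maxima.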

37415 Long3 2.85653 4 0.99965 2.39329 Triangle1 4 1 4 1 4.04 0. and.02 0.38018 Smile1 4 1 4 1 14.56 0.04 0.92438 4.95919 4 0.08 0.96868 Sizes3 4. The overall best solution found (in terms of the Adjusted Rand Index).55059 4 1 10.86251 4. the additional flexibility of the attainment score is not rewarded.2 indicate that this advantage can be largely maintained when using the at- tainment score for model selection.96113 4 0. Values presented are averages over 50 sample data sets.42 0.98652 4.99984 2.08 0. respectively.24 0.02 0.02 0. as the K-means algorithm only generates spherically shaped.92661 4.85653 4 0.99748 6.96314 4. it seems to cope better with the data sets containing elongated data sets.92 0.31094 2.97692 4 0.99680 Triangle2 4.02 0. In Table 2. it seems that the utility of multiobjective cluster validation strongly depends on the algorithm used for the generation of the clustering solutions.5 0.48 0. for single link. Knowles Table 2.06 0.02 0.32214 4 1 11.48 0.97165 We have seen before.97109 4 0.12 0.40 J.96342 4.2. but instead results in the introduction of noise. compact clusters.44 0.92438 Square3 4.04 0. Here.9411 4 0.2 0.24 0. Max A-score’ and ‘Max S-score’ refer to the solution corresponding to the global maximum in the plot of the attainment score and the Silhouette Width.93692 7 0. Overall.5 a slight advantage of the attainment method can be observed.32103 2.72 0.96878 3.04 0.1 0.3.96465 4.99748 7. In general.97692 4 0.97575 3.12 0.96113 4 0.14 0. Problem Best Solution Max A-score Max S-score Best local A-score Best local S-score K Rand K Rand K Rand K Rand K Rand Square1 4. All of the partitionings generated by the algorithm are therefore in concordance with the assumptions made by the Silhouette Width. In the context of multiobjective clustering.99680 4 1 4.04 0.99745 7.36782 Long2 2 0. respectively. While the Silhouette Width performs (as expected) very well for model selection on the Square and the Sizes data sets. 
This is intuitively plausible.96184 4.91790 4 0.2 0.21439 2.76 0.95919 4 0.96727 4 0. In Table 2.96 0.99745 6.96724 4 0.24 0.44 0.97394 4. no advantage can be observed for MOCK’s attainment score method.16 0.96314 Sizes2 4.97394 Long1 2 0.86767 4. its estimate of the best clustering solution is generally of much higher quality.63436 Smile3 4 1 4 1 15.02 0. where the full extent of . its performance breaks down drastically for those data with elongated cluster shapes.02 0.96184 Square2 4.02 0.72711 4 1 10. ‘Best local A-score’ and ‘Best local S-score’ refer to the best (in terms of the Adjusted Rand Index) out of the set of solutions corresponding to local maxima in the plot of the attainment score and the Silhouette Width.04 0.91546 4 0.77839 Smile2 4 1 4 1 13 0. Results for MOCK.97165 4. and the Silhouette Width can therefore be expected to perform well. and the solutions identified using the Silhouette Width and the attainment score are evaluated using the Adjusted Rand Index (Rand) and the number of clusters (K). Handl and J.68 0. that MOCK is the only one of the three algorithms that perform robustly across the entire range of data sets.99992 2.86251 Sizes1 4. In such a scenario. if the Silhouette Width is used.4 and Table 2.97110 4. The results in Table 2.02 0. but is strongly reduced.96868 4.

21694 13.48 0.88432 Sizes1 4 0.20836 8.4 Conclusion In this chapter we have identified two subtasks in clustering.28 0. Max A-score’ and ‘Max S-score’ refer to the solution corresponding to the global maximum in the plot of the attainment score and the Silhouette Width.1 0.69363 30. for traditional clustering algorithms such as K-means.11241 4.93568 4 0.84 0.96500 4 0.96532 Sizes3 4 0.96648 4 0.96648 4 0.88432 4.04 0.88259 4 0.96336 4 0.74356 Smile2 25. In contrast.65296 41. 2.18235 8.32 0.04 0.96500 4 0. The resulting algorithms have been evaluated across a range of different data sets.58 0.53681 24.935684 4 0.55347 12.92 0. that is.96336 Long1 5.02 0.86 0. the generation of clustering solutions.77700 4 0.22 0.55340 Smile3 34. that is.96532 4 0.54 0.53460 7.96648 4 0.88575 the true Pareto front is approximated.90565 4 0.18 0.33693 13.08793 6.77339 4 0.33803 12.96532 4 0.12 0. Results for K-means.27396 8.16 0.53044 35. Here.02 0.96336 4 0.95805 5.73858 14. a clear advantage to the multiobjective approach can be observed.21598 Smile1 30.90414 4 0.96648 4 0. This idea has been illustrated through the development of a multiobjective clustering algorithm and a multiobjec- tive scheme for model selection.88432 4.96532 4 0.96336 4 0.29356 Long2 4. We have argued that that the simultaneous consideration of several ob- jectives may be a way to obtain a more consistent performance (across a range of data properties) for both subtasks. ‘Best local A-score’ and ‘Best local S-score’ refer to the best (in terms of the Adjusted Rand Index) out of the set of solutions corresponding to local maxima in the plot of the attainment score and the Silhouette Width.64 0. which are based on very specific cluster models.27602 7. the selection of the best clustering out of a set of given partitionings. and (2) model selection.35500 12.02 0.88259 4 0. Values presented are averages over 50 sample data sets. 
Both processes require the definition of a clustering criterion.8 0.24886 7.96500 4 0.3 0. the use of an analogous validation technique based on identical assumptions may be preferable.93568 4 0.29083 Long3 4. The overall best solution found (in terms of the Adjusted Rand Index). respectively.3.96532 4 0.96500 4 0.74397 9. 2 Multi-Objective Clustering and Cluster Validation 41 Table 2.66 0.02 0.62 0.58 0. and results have been compared to those . and the algo- rithms’ capability to deal with different data properties highly depends on this choice. Problem Best Solution Max A-score Max S-score Best local A-score Best local S-score K Rand K Rand K Rand K Rand K Rand Square1 4 0.93568 4 0.29340 43.29585 32.34 0.95805 Triangle2 4 0.96648 Sizes2 4 0.38 0.34815 Triangle1 4 0.88575 5. in which the use of a multiobjective framework may be beneficial: (1) the actual process of clus- tering.42 0. and the solutions identified using the Silhouette Width and the attainment score are evaluated using the Adjusted Rand Index (Rand) and the number of clusters (K).18584 11.94 0.02 0. respectively.88575 5.36 0.34886 13.95805 5.78 0.25925 8.93568 Square3 4 0.46 0.96336 4 0.64 0.2 0.96500 Square2 4 0.6 0.

95089 4 0.84184 Sizes2 35.24 0.9 0.98863 4 0.94079 4 0.08 0.22 0.5.17631 11.62 0.84 0.7 0.06 0.46 0.32 0.1 0.28 0.56 0. ‘Best local A-score’ and ‘Best local S-score’ refer to the best (in terms of the Adjusted Rand Index) out of the set of solutions corresponding to local maxima in the plot of the attainment score and the Silhouette Width.94079 Sizes2 4.14 0.99979 4.06 0.32 0.7 0.99138 7.92 0.99366 9.74898 17.52 0.48 0.42 J.94079 4 0.33951 38.26364 4.95089 4 0.8 0.84291 14.94234 Table 2.76 0.24133 9.64 0.2 0.07145e-06 12.94283 4. Max A-score’ and ‘Max S-score’ refer to the solution corresponding to the global maximum in the plot of the attainment score and the Silhouette Width.37855 16.78 0.88220 4.54517 Smile3 11.98493 Triangle2 5.7 0.95990 4.58 0.08473 5.82921 Sizes1 4.47172 14.94 0.69273 16.082558 15.95089 Sizes3 4.94832 .99979 4.32341 2.02944e-07 10.14 0.93671 4 0.06 0.45713 Square3 41.42 0.06 0.37684 Long2 7.12 0.18 0.32 0.45803 11.14 0.08 0.64 0.46 0.82400 Square2 40.82555 14.82932 4.14 0. Values presented are averages over 50 sample data sets.10770 Sizes1 36.00570 43.72 0.8 0.38 0.21911 7.06 0.75607 4.87194 12.3 0. and the solutions identified using the Silhouette Width and the attainment score are evaluated using the Adjusted Rand Index (Rand) and the number of clusters (K).20012 4 0.06 0.68408 17.28632 9.74 0.58 0. The overall best solution found (in terms of the Adjusted Rand Index).99595 4.06 0.99612 4.98 0.58 0.95990 4 0.24 0.88 0.30426 12.82921 4.08 0.04 0.32 0.99570 4.42990 2.72330 11.98881 2. respectively.55600 18.14 0.7 0.18893 6.4 0.94117 4 0.42 0.28 0.3 0.09140 10. respectively. Max A-score’ and ‘Max S-score’ refer to the solution corresponding to the global maximum in the plot of the attainment score and the Silhouette Width.97057 Long3 3.08 0. 
Problem Best Solution Max A-score Max S-score Best local A-score Best local S-score K Rand K Rand K Rand K Rand K Rand Square1 4.97394 Long2 3.38256 10.54 0.99259 4.40410 36.4 0.89661 4 0.36 0.63951 5.935819 4.99979 4.06278 15.04 0.12544 12.82803 Sizes3 34.30000 4.45719 15.82921 4.02 0.82 0. respectively.97182 8.22 0.29776 13.80083 8.06 0.9886 4 0.99073 2.47064 35.94079 4 0.94234 4.6 0.95990 Long1 7.98 0.1 0. Values presented are averages over 50 sample data sets.99992 4 1 4.36842 Long3 7.26 0.93671 4 0.84 0.93671 4 0.78 0.38 0.99991 4 1 Triangle1 4.5 0.4 0.64 0.30292 18 0.68 0.95103 4 0.53626 35.08107 2 8. ‘Best local A-score’ and ‘Best local S-score’ refer to the best (in terms of the Adjusted Rand Index) out of the set of solutions corresponding to local maxima in the plot of the attainment score and the Silhouette Width.1 0.02 0.82147 4.89664 4 0.94 0. Handl and J. The overall best solution found (in terms of the Adjusted Rand Index). Results for average link. and the solutions identified using the Silhouette Width and the attainment score are evaluated using the Adjusted Rand Index (Rand) and the number of clusters (K).95089 4 0.24 0.87094 Long1 3.93671 Square2 4.84 0.92 0.31562 Triangle1 4.28 0.042279 15.99979 4. Here.08 0.99991 3.89661 Square3 4. Knowles Table 2.99992 4 1 Smile3 4 1 4.08108 40.02 0.37947 17.82 0.32 0.94976 11.00568 2 1.95990 4 0.72 0.4. respectively.3 0. Problem Best Solution Max A-score Max S-score Best local A-score Best local S-score K Rand K Rand K Rand K Rand K Rand Square1 37.64 0.52529 11.72458 Smile2 10.58 0.03742 14.89661 4.54 0.96008 4 0.02102 15.89713 4 0.954345 4.98 0.12 0.04 0.10803 11.98493 4.97434 Smile1 4 1 4 1 4 1 4 1 4 1 Smile2 4 1 4. Here.82880 13.16 0.38400 3. Results for single link.46 0.93701 4 0.16 0.99979 Triangle2 24.97150 2.25754 Smile1 11.49804 2.42 0.12 0.72440 24.

[5] D. and J. Ankerst. 2001. Stork. References [1] M. W. S. ACM SIGKDD Explorations Newsletter Archive. and J. pages 584– 593. Breunig. In Proceedings of the Fourth International Conference on Parallel Problem Solving from Nature. PESA- II: Region-based selection in evolutionary multiobjective optimization. [8] M. Estivill-Castro. J. Everitt. Jerram. M. Fleming. Morgan Kaufmann Publishers. Knowles. pages 49–60. Fonseca and P. 4:65–75. Manlik. ACM Press. Bandyopadhyay and U. H. and M. [4] D. AIII Press. G. Corne. Technical Report TR-97-018. CA. In Proceedings of the Fifth Conference on Parallel Problem Solving from Nature. [9] V. Duda.-J.. O. Im. Oates. W. John Wiley and Son Ltd. Cluster Analysis. and D. Sander. D. In Proceedings of the 1999 International Conference on Management of Data. Sander. Oates. 1997. 2001. pages 226–231. In Proceedings of the Genetic and Evolutionary Computation Conference. 1996. Knowles. [3] J. PESA-II: region-based selection in evolutionary multiobjective optimization. and Martin J. 1993. W. Kriegel. E. H. J. Kriegel. 1996. University of California. Vahdat. D. pages 283–290. Germany. pages 839–848. Hart. Ester. 31:120–125. Springer-Verlag. 2 Multi-Objective Clustering and Cluster Validation 43 obtained using traditional single-objective approaches. On the performance assessment and com- parison of stochastic multiobjective optimizers. and M. [7] R. A density-based algorithm for dis- covering clusters in large spatial databases with noise. [2] S. [6] D. Joshua Knowles is supported by a David Phillips Fellowship from the Biotechnology and Biological Sciences Research Council (BBSRC). W. UK. and E. M. Why so many clustering algorithms: A position paper. Corne. . 1999.-P. Knowles.and Karl Benz-Foundation. J. OPTICS: Ordering points to identify clustering structure. Man and Cybernetics. Berkeley. 2001. 2002. Oates. P. Experimental results indicate some general advantages to the multiobjective approach. [10] B. Hsu. Second edition. 
IEEE Transactions on Systems. pages 283– 290. Joshua D. Nick R. Edward Arnold. Bilmes. In Pro- ceedings of the Genetic and Evolutionary Computation Conference. J. 2001. J. Corne. P. Nonparametric genetic clustering: compar- ison of validity indices. 2000. The Pareto envelope-based selection algorithm for multiobjectice optimization. Acknowledgements Julia Handl gratefully acknowledges support of a scholarship from the Gottlieb Daimler. A. [11] C. In Proceedings of the Second International Conference on Knowledge Discovery and Data-Mining. Pattern Classification. Empirical observations of prob- abilistic heuristics for the clustering problem. International Computer Science Institute.

An empirical comparison of four initialization methods for the k-means algorithm. 31:264–323. and T. Strehl and J. 20:53–65. Smith. 2:193–198. N. Journal of Computational and Applied Mathematics. Comparing partitions. I. [21] L. 1999. WI. M. C. J. Estimating the number of clusters in a dataset via the Gap statistic. Pattern Recognition Letters. A. The EM Algorithm and Extensions. University of California Press. [15] J. pages 568–575. Data clustering: a review. Evolutionary multiobjective clustering. The elements of statistical learning: data mining. Knowles. M. Handl and J. [17] T. IEEE Press. Handl and J. McLachlan and T.-S. [24] J. pages 281–297. [20] J. Technical report. UMIST. Madison. pages 632–639. Kell. Tibshirani. Journal of the Royal Statistical Society: Series B (Statistical Methodology). [30] R. J. J. Park and M. Sarle. and J. 1989. Morgan Kaufmann. In IEEE Congress on Evolutionary Computation. Knowles [12] J. J. 1999. An impossibility theorem for clustering. Murty. [13] J. K. Osman. In Proceedings of the 15th Conference on Neural Information Processing Systems. A genetic algorithm for clustering problems. D. Cubic clustering criterion.44 J. Bioinformatics. G. Flynn. NC: SAS Institute Inc. Journal of Classification. In Proceedings of the Third Annual Conference on Genetic Programming. 2005. and P. 2004. SAS Technical Report A-108. and G. UK. [18] A. inference and prediction. Springer-Verlag. Pena. B. [29] G. pages 1081–1091. R. [23] Y. 2001. Ghosh. Morgan Kaufmann Publishers.-J. 2002. [28] A. Handl. Manchester. Exploiting the trade-off: the benefits of multiple ob- jectives in data clustering. Friedman. In Proceedings of the Third International Conference on Evolutionary Multicriterion Optimization. 1997. Handl and J. Larranaga. Computational cluster validation in post-genomic data analysis. 20:1027–1040. 2002. 2005. [25] V. Handl and J. Tibshirani. [14] J. Hastie. Journal on Machine Learning Research. 
Some methods for classification and analysis of multivariate observations. S. Hubert. 1996. Knowles. [27] W. 1985. R. Krishman. Jain. 2005. Silhouettes: a graphical aid to the interpretation and valida- tion of cluster analysis. Improvements to the scalability of multiobjective clustering. Technical Report TR-COMPSYSBIO-2004-02. 2001. 21:3201–3212. [26] P. Reeves. Song. Multiobjective clustering with automatic determina- tion of the number of clusters. Modern Heuristic Search Methods. John Wiley and Son Ltd. and D. 1967. Knowles. Knowles. Cluster ensembles — a knowledge reuse framework for combining multiple partitions. ACM Computing Surveys. [19] A. pages 2–9. pages 547–560. 3:583– 617. Kleinberg. The Internet. John Wiley and Son Ltd. Cary. MacQueen. In Proceed- ings of the Eighth International Conference on Parallel Problem Solving from Nature. . 1983. 63:411–423. and P. [16] J. J. Springer-Verlag. 1987. Knowles. Springer-Verlag. In Proceedings of the Third International Conference on Genetic Algorithms. Rayward-Smith. 1998. Hastie. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability. Lozana. H. Handl and J. Rousseeuw. Uniform crossover in genetic algorithms. Walther. [22] G. Syswerda. 2004.

Graphs: An Introductory Approach: A First Course in Discrete Mathematics. 1985. J. 1994. Jain. 2004. J. Topchy. [33] D. Watkins. . A genetic algorithm tutorial. John Wiley and Sons. Clustering ensembles: Models of consen- sus and weak partitions. K. 1990. 2 Multi-Objective Clustering and Cluster Validation 45 [31] A. Submitted to IEEE Transactions on Pattern Analysis and Machine Intelligence. Wilson and J. [34] R. Department of Computer Science. 4:65–85. Cornell University. Vorhees. A. Whitley. Statistics and Computing. [32] E. Punch. PhD thesis. The effectiveness and efficiency of agglomerative hierarchical clus- tering in document retrieval. and W.

2.5 Appendix

2.5.1 Evaluation Functions

Silhouette Width

The Silhouette Width [26] for a partitioning is computed as the average Silhouette value over all data items. The Silhouette value for an individual data item i, which reflects the confidence in this particular cluster assignment, is computed as

\[
S(i) = \frac{b_i - a_i}{\max(b_i, a_i)},
\]

where a_i denotes the average distance between i and all data items in the same cluster, and b_i denotes the average distance between i and all data items in the closest other cluster (which is defined as the one yielding the minimal b_i). The Silhouette Width returns values in the interval [−1, 1] and is to be maximized.

Adjusted Rand Index

In all experiments, clustering quality is objectively evaluated using the Adjusted Rand Index, an external measure of clustering quality, which is a generalization of the Rand Index. The Rand indices are based on counting the number of pair-wise co-assignments of data items. The Adjusted Rand Index additionally introduces a statistically induced normalization in order to yield values close to 0 for random partitions. Using a representation based on contingency tables, the Adjusted Rand Index [18] is given as

\[
R(U, V) =
\frac{\sum_{l,k} \binom{n_{lk}}{2}
      - \left[ \sum_{l} \binom{n_{l\cdot}}{2} \cdot \sum_{k} \binom{n_{\cdot k}}{2} \right] \Big/ \binom{n}{2}}
     {\frac{1}{2} \left[ \sum_{l} \binom{n_{l\cdot}}{2} + \sum_{k} \binom{n_{\cdot k}}{2} \right]
      - \left[ \sum_{l} \binom{n_{l\cdot}}{2} \cdot \sum_{k} \binom{n_{\cdot k}}{2} \right] \Big/ \binom{n}{2}},
\]

where n_{lk} denotes the number of data items that have been assigned to both cluster l (in partitioning U) and cluster k (in partitioning V), n_{l·} and n_{·k} are the corresponding row and column sums of the contingency table, and n is the total number of data items. The Adjusted Rand Index returns values in the interval [∼0, 1] and is to be maximized.

2.5.2 Algorithms

K-means

Starting from a random partitioning, the K-means algorithm repeatedly (i) computes the current cluster centres (that is, the average vector of each cluster in data space) and (ii) reassigns each data item to the cluster whose centre is closest to it. It terminates when no more reassignments take place. By this means, the intra-cluster variance, that is, the sum of squares of the differences between data items and their associated cluster centres, is locally minimized.

Our implementation of the K-means algorithm is based on the batch version of K-means, that is, cluster centres are only recomputed after the reassignment of all data items. To reduce suboptimal solutions, K-means is run repeatedly (100 times) using random initialisation (which is known to be an effective initialization method [24]) and only the best result in terms of intra-cluster variance is returned.

Hierarchical Clustering

In general, agglomerative clustering algorithms start with the finest partitioning possible (that is, singletons) and, in each iteration, merge the two least distant clusters. They terminate when the target number of clusters has been obtained. Alternatively, the entire dendrogram can be generated and be cut at a later point.

Single link and average link agglomerative clustering only differ in the linkage metric used. For the linkage metric of average link, the distance between two clusters Ci and Cj is computed as the average dissimilarity between all possible pairs of data elements i and j with i ∈ Ci and j ∈ Cj. For the linkage metric of single link, the distance between two clusters Ci and Cj is computed as the smallest dissimilarity between all possible pairs of data elements i and j with i ∈ Ci and j ∈ Cj.

Parameter Settings for MOCK

Parameter settings for MOCK are given in Table 2.6 and are kept constant over all experiments.

Table 2.6. Parameter settings for MOCK, where N is data set size.

Parameter                      Setting
Number of generations          500
External population size       1000
Internal population size       10
#(Initial solutions) f_size    100
Initialization                 Minimum spanning tree and K-means (L = 20)
Mutation type                  L nearest neighbours (L = 20)
Mutation rate pm               pm = 1/N + (l/N)^2
Recombination                  Uniform crossover
Recombination rate pc          0.7
Objective functions            Overall deviation and connectivity (L = 20)
#(Reference distributions)     3
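The two evaluation functions of Section 2.5.1 translate directly into code. The following is a minimal pure-Python sketch rather than the implementation used in the experiments; the function names are hypothetical, Euclidean distance is assumed as the dissimilarity, a_i is averaged over the other members of the item's cluster, and items in singleton clusters are assumed to contribute a Silhouette value of 0. Note that, computed this way, the Adjusted Rand Index can fall slightly below 0 for partitions that agree less than expected by chance.

```python
import math
from collections import Counter


def silhouette_width(points, labels):
    """Average of S(i) = (b_i - a_i) / max(b_i, a_i) over all data items."""
    clusters = {}
    for index, label in enumerate(labels):
        clusters.setdefault(label, []).append(index)

    def mean_distance(i, members):
        return sum(math.dist(points[i], points[j]) for j in members) / len(members)

    total = 0.0
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        if not own or len(clusters) < 2:
            continue  # assumption: S(i) = 0 for singletons or a single cluster
        a = mean_distance(i, own)
        b = min(mean_distance(i, members)
                for other, members in clusters.items() if other != label)
        if max(a, b) > 0.0:
            total += (b - a) / max(a, b)
    return total / len(labels)


def adjusted_rand_index(u, v):
    """Adjusted Rand Index of two label sequences via their contingency table."""
    def pairs(counts):
        return sum(math.comb(c, 2) for c in counts)

    sum_lk = pairs(Counter(zip(u, v)).values())  # cells n_lk
    sum_l = pairs(Counter(u).values())           # row sums n_l.
    sum_k = pairs(Counter(v).values())           # column sums n_.k
    expected = sum_l * sum_k / math.comb(len(u), 2)
    maximum = 0.5 * (sum_l + sum_k)
    if maximum == expected:
        return 1.0  # both partitions trivial (and therefore identical)
    return (sum_lk - expected) / (maximum - expected)
```

Points are given as coordinate tuples and labels as any hashable cluster identifiers; relabelling the clusters of either partition does not change the Adjusted Rand Index.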

3

Feature Selection for Ensembles Using the Multi-Objective Optimization Approach

Luiz S. Oliveira1, Marisa Morita2, and Robert Sabourin3

1 Pontifical Catholic University of Paraná, Curitiba, PR, Brazil
  soares@ppgia.pucpr.br
2 HSBC Bank Brazil, Curitiba, PR, Brazil
  marisa.morita@hsbc.com.br
3 École de Technologie Supérieure, Montreal, Canada
  robert.sabourin@etsmtl.ca

Summary. Feature selection for ensembles has been shown to be an effective strategy for ensemble creation due to its ability to produce good subsets of features, which make the classifiers of the ensemble disagree on difficult cases. In this paper we present an ensemble feature selection approach based on a hierarchical multi-objective genetic algorithm. The underpinning paradigm is the “overproduce and choose”. The algorithm operates in two levels. Firstly, it performs feature selection in order to generate a set of classifiers and then it chooses the best team of classifiers. In order to show its robustness, the method is evaluated in two different contexts: supervised and unsupervised feature selection. In the former, we have considered the problem of handwritten digit recognition and used three different feature sets and multi-layer perceptron neural networks as classifiers. In the latter, we took into account the problem of handwritten month word recognition and used three different feature sets and hidden Markov models as classifiers. Experiments and comparisons with classical methods, such as Bagging and Boosting, demonstrated that the proposed methodology brings compelling improvements when classifiers have to work with very low error rates.

L.S. Oliveira et al.: Feature Selection for Ensembles Using the Multi-Objective Optimization Approach, Studies in Computational Intelligence (SCI) 16, 49–74 (2006)
www.springerlink.com © Springer-Verlag Berlin Heidelberg 2006

3.1 Introduction

Ensemble of classifiers has been widely used to reduce model uncertainty and improve generalization performance. Developing techniques for generating candidate ensemble members is a very important direction of ensemble of classifiers research. It has been demonstrated that a good ensemble is one where the individual classifiers in the ensemble are both accurate and make their errors on different parts of the input space (there is no gain in combining identical classifiers) [11, 17, 31]. In other words, an ideal ensemble consists of good classifiers (not necessarily excellent) that disagree as much as possible on difficult cases.

The literature has shown that varying the feature subsets used by each member of the ensemble should help to promote this necessary diversity [12, 31, 24, 35]. Traditional feature selection algorithms aim at finding the best trade-off between features and generalization. On the other hand, ensemble feature selection has the additional goal of finding a set of feature sets that will promote disagreement among the component members of the ensemble. The Random Subspace Method (RMS) proposed by Ho in [12] was one early algorithm that constructs an ensemble by varying the subset of features. More recently some strategies based on genetic algorithms (GAs) have been proposed [31]. All these strategies claim better results than those produced by traditional methods for creating ensembles such as Bagging and Boosting. In spite of the good results brought by GA-based methods, they still can be improved in some aspects, e.g., avoiding classical methods such as the weighted sum to combine multiple objective functions. It is well known that when dealing with this kind of combination, one should deal with problems such as scaling and sensitivity towards the weights.

It has been demonstrated that feature selection through multi-objective genetic algorithm (MOGA) is a very powerful tool for finding a set of good classifiers, since GA is quite effective in rapid global search of large, non-linear and poorly understood spaces [30]. Besides, it can overcome problems such as scaling and sensitivity towards the weights. Kudo and Sklansky [18] have compared several algorithms for feature selection and concluded that GAs are suitable when dealing with large-scale feature selection (number of features is over 50). This is the case of most of the problems in handwriting recognition, which is the test problem in this work.

In this light, we propose an ensemble feature selection approach based on a hierarchical MOGA. The underlying paradigm is the “overproduce and choose” [32, 9]. The algorithm operates in two levels. The former is devoted to the generation of a set of good classifiers by minimizing two criteria: error rate and number of features. The latter combines these classifiers in order to find an ensemble by maximizing the following two criteria: accuracy of the ensemble and a measure of diversity.

Recently, the issue of using diversity to build ensemble of classifiers has been widely discussed. Several works have demonstrated that there is a weak correlation between diversity and ensemble performance [23]. In light of this, some authors have claimed that diversity brings no benefits in building ensemble of classifiers [33], on the other hand, others suggest that the study of diversity in classifier combination might be one of the lines for further exploration [19]. In spite of the weak correlation between diversity and performance, we argue that diversity might be useful to build ensembles of classifiers. We demonstrated through experimentation that using diversity jointly with performance to guide selection can avoid overfitting during the search.

In order to show robustness of the proposed methodology, it was evaluated in two different contexts: supervised and unsupervised feature selection. In the former, we
have considered the problem of handwritten digit recognition and used three
different feature sets and multi-layer perceptron (MLP) neural networks as
classifiers. In such a case, the classification accuracy is supplied by the MLPs
in conjunction with the sensitivity analysis. This approach makes it feasible
to deal with huge databases in order to better represent the pattern recogni-
tion problem during the fitness evaluation. In the latter, we took into account
the problem of handwritten month word recognition and used three different
feature sets and hidden Markov models (HMM) as classifiers. We demonstrate
that it is feasible to find compact clusters and complementary high-level rep-
resentations (codebooks) in subspaces without using the recognition results
of the system. Experiments and comparisons with classical methods, such as
Bagging and Boosting, demonstrated that the proposed methodology brings
compelling improvements when classifiers have to work with very low error
rates.
The remainder of this paper is organized as follows. Section 3.2 presents a
brief review of the methods for ensemble creation. Section 3.3 provides an
overview of the strategy. Section 3.4 briefly introduces the multi-objective
genetic algorithm we are using in this work. Section 3.5 describes the classifiers
and feature sets for both supervised and unsupervised contexts. Section 3.6
introduces how we have implemented both levels of the proposed methodology
and Section 3.7 reports the experimental results. Finally, Section 3.8 discusses
the reported results and Section 3.9 concludes the paper.

3.2 Related Works

Assuming the architecture of the ensemble as the main criterion, we can
distinguish among serial, parallel, and hierarchical schemes, and depending on
whether or not the classifiers of the ensemble are selected by the ensemble
algorithm, we can divide them into selection-oriented and combiner-oriented
methods [20]. Here we are more interested in the first class, which tries to
improve the overall accuracy of the ensemble by directly boosting the accuracy
and the diversity of the experts of the ensemble. Basically, these methods can
be divided into resampling methods and feature selection methods.
Resampling techniques can be used to generate different hypotheses. For
instance, bootstrapping techniques [6] may be used to generate different train-
ing sets and a learning algorithm can be applied to the obtained subsets of
data in order to produce multiple hypotheses. These techniques are effective
especially with unstable learning algorithms, which are algorithms very sen-
sitive to small changes in the training data. In bagging [1] the ensemble is
formed by making bootstrap replicates of the training sets, and then multiple
generated hypotheses are used to get an aggregated predictor. The aggrega-
tion can be performed by averaging the outputs in regression or by majority
or weighted voting in classification problems.
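The bagging procedure described above can be sketched in a few lines. This is only an illustration under stated assumptions: the one-dimensional toy data, the `train_stump` base learner, and all function names are hypothetical and do not come from the cited works; only the bootstrap replicate and the majority vote follow the text.

```python
import random
from collections import Counter

def bootstrap_sample(data, rng):
    """Draw len(data) items uniformly at random, with replacement."""
    return [rng.choice(data) for _ in data]

def train_stump(sample):
    """Hypothetical base learner: threshold halfway between the two class means."""
    xs0 = [x for x, y in sample if y == 0]
    xs1 = [x for x, y in sample if y == 1]
    if not xs0 or not xs1:
        only = 0 if xs0 else 1          # the replicate holds a single class
        return lambda x: only
    t = (sum(xs0) / len(xs0) + sum(xs1) / len(xs1)) / 2.0
    return lambda x: 1 if x > t else 0

def majority_vote(labels):
    """Most frequent label among the base predictions."""
    return Counter(labels).most_common(1)[0][0]

def bagging(data, n_models, seed=0):
    """Train one base learner per bootstrap replicate and aggregate by voting."""
    rng = random.Random(seed)
    models = [train_stump(bootstrap_sample(data, rng)) for _ in range(n_models)]
    return lambda x: majority_vote([m(x) for m in models])

# Toy data (assumption): (feature value, class label)
data = [(0.1, 0), (0.3, 0), (0.4, 0), (0.6, 1), (0.8, 1), (0.9, 1)]
ensemble = bagging(data, n_models=11)
```

An odd number of base learners is used here so that a two-class vote cannot tie.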

While in bagging the samples are drawn with replacement using a uni-
form probability distribution, in boosting methods [7] the learning algorithm
is called at each iteration using a different distribution or weighting over the
training examples. This technique places the highest weight on the examples
most often misclassified by the previous base learner: in this manner the clas-
sifiers of the ensemble focus their attention on the hardest examples. Then
the boosting algorithm combines the base rules taking a weighted majority
vote of the base rules.
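To make the re-weighting idea concrete, the sketch below performs one round of the discrete AdaBoost update. This is an assumption on our side: the text does not name a specific boosting variant, and the function name `adaboost_round` is hypothetical.

```python
import math

def adaboost_round(weights, correct):
    """One re-weighting step (assumed: discrete AdaBoost).

    weights: current example weights, summing to 1
    correct: booleans, True where the base learner classified the example correctly
    Returns the new normalized weights and the vote weight alpha of the base rule.
    """
    err = sum(w for w, c in zip(weights, correct) if not c)
    if not 0.0 < err < 0.5:
        raise ValueError("weighted error must lie strictly between 0 and 0.5")
    alpha = 0.5 * math.log((1.0 - err) / err)
    # Misclassified examples are scaled up, correctly classified ones scaled down.
    raw = [w * math.exp(-alpha if c else alpha) for w, c in zip(weights, correct)]
    total = sum(raw)
    return [w / total for w in raw], alpha

new_weights, alpha = adaboost_round([0.25] * 4, [True, True, True, False])
```

After the update, the single misclassified example carries half of the total weight, so the next base learner is pushed towards the hardest example.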
The second class of methods regards those strategies based on feature se-
lection. The concept behind these approaches consists in reducing the number
of input features of the classifiers, a simple method to fight the effects of the
classical curse of dimensionality problem. For instance, the random subspace
method [12, 35] relies on a pseudorandom procedure to select a small number
of dimensions from a given feature space. In each pass, such a selection is
made and a subspace is fixed. All samples are projected to this subspace, and
a classifier is constructed using the projected training samples. In the classi-
fication a sample of an unknown class is projected to the same subspace and
classified using the corresponding classifier. In the same vein of the random
subspace method lies the input decimation method [37], which reduces the
correlation among the errors of the base classifiers, by decoupling the classi-
fiers by training them with different subsets of the input features. It differs
from the random subspace as for each class the correlation between each fea-
ture and the output of the class is explicitly computed, and the classifier is
trained only on the most correlated subset of features.
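A minimal sketch of the random subspace idea follows. The nearest-mean base classifier, the toy data, and every function name are assumptions made for illustration; only the steps of drawing a subspace, projecting the samples, and classifying in the same subspace come from the description above.

```python
import random

def random_subspace(n_features, k, rng):
    """Pick k distinct feature indices pseudorandomly (one subspace)."""
    return sorted(rng.sample(range(n_features), k))

def project(x, subspace):
    """Keep only the coordinates of x that belong to the subspace."""
    return [x[i] for i in subspace]

def nearest_mean_classifier(samples, labels):
    """Hypothetical base learner: assign the class with the closest mean vector."""
    groups = {}
    for x, y in zip(samples, labels):
        groups.setdefault(y, []).append(x)
    means = {y: [sum(col) / len(g) for col in zip(*g)] for y, g in groups.items()}
    def predict(x):
        return min(means, key=lambda y: sum((a - b) ** 2 for a, b in zip(x, means[y])))
    return predict

def build_ensemble(samples, labels, n_classifiers, k, seed=0):
    """One classifier per pass, each trained on its own fixed subspace."""
    rng = random.Random(seed)
    members = []
    for _ in range(n_classifiers):
        sub = random_subspace(len(samples[0]), k, rng)
        clf = nearest_mean_classifier([project(x, sub) for x in samples], labels)
        members.append((sub, clf))
    return members

def ensemble_predict(members, x):
    """Project x into each member's subspace and take a plurality vote."""
    votes = [clf(project(x, sub)) for sub, clf in members]
    return max(set(votes), key=votes.count)

X = [[0.0, 0.0, 0.0, 0.0], [0.1, 0.2, 0.0, 0.1], [1.0, 1.0, 1.0, 1.0], [0.9, 0.8, 1.0, 0.9]]
y = [0, 0, 1, 1]
members = build_ensemble(X, y, n_classifiers=5, k=2)
```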
Recently, several authors have investigated GAs to design ensembles of
classifiers. Kuncheva and Jain [21] suggest two simple ways to use a genetic
algorithm to design an ensemble of classifiers. They present two versions of
their algorithm. The former uses just disjoint feature subsets while the latter
considers (possibly) overlapping feature subsets. The fitness function employed
is the accuracy of the ensemble; however, no measure of diversity is considered.
A more elaborate method, also based on GA, was proposed by Opitz [31]. In
his work, he stresses the importance of a diversity measure by including it
in the fitness calculation. The drawback of this method is that the objective
functions are combined through the weighted sum. It is well known that when
dealing with this kind of combination, one should deal with problems such as
scaling and sensitivity towards the weights. More recently Gunter and Bunke
[10] have applied feature selection in conjunction with floating search to
create ensembles of classifiers for the field of handwriting recognition. They
used handwritten words and HMMs as classifiers to evaluate their algorithm. The
feature set was composed of nine discrete features, which simplifies the
feature selection process. A drawback of this method is that one must set a
priori the number of classifiers in the ensemble.

3.3 Methodology Overview

In this section we outline the hierarchical approach proposed. As stated be-
fore, it is based on an “overproduce and choose” paradigm where the first
level generates several classifiers by conducting feature selection and the sec-
ond one chooses the best ensemble among such classifiers. Figure 3.1 depicts
the proposed methodology. Firstly, we carry out feature selection by using a
MOGA. It takes as inputs a trained classifier and its respective data set. Since
the algorithm aims at minimizing two criteria during the search¹, it will produce
at the end a 2-dimensional Pareto-optimal front, which contains a set of
classifiers (trade-offs between the criteria being optimized). The final step of
this first level consists in training such classifiers.


[Figure 3.1 diagram. First level: Trained Classifier and Data Set; MOGA Population; Pareto-optimal front (Error Rate versus Number of Features); Training Classifiers (C0, C1, C2, ..., CL). Second level: "Classifiers"; MOGA Population; Performance of the Ensemble; Best Ensemble.]

Fig. 3.1. An overview of the proposed methodology.

Once the set of classifiers has been trained, the second level picks the
members of the team which are most diverse and accurate. Let
A = {C1, C2, ..., CL} be a set of L classifiers extracted from the
Pareto-optimal front and B a chromosome of size L of the population. The
relationship between A and B is straightforward, i.e., gene i of chromosome
B represents classifier Ci from A. Thus, if a chromosome has all bits
selected, all classifiers of A will be included in the ensemble. The
algorithm will therefore produce a 2-dimensional Pareto-optimal front which is
composed of several ensembles (trade-offs between accuracy and diversity). In
order to choose the best one, we use a validation set, which points out the
most diverse and accurate team among all. Later in this paper, we will discuss
the issue of using diversity to choose the best ensemble. In both cases, the
MOGAs are based on bit representation, one-point crossover, and bit-flip
mutation. In our experiments, the MOGA used is a modified version of the
Non-dominated Sorting Genetic Algorithm (NSGA) [4] with elitism.

¹ Error rate and number of features in the case of supervised feature selection,
and a clustering index and the number of features in the case of unsupervised
feature selection.
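The mapping between a chromosome and an ensemble, together with the two genetic operators named above, can be sketched as follows. The function names and the placeholder classifier labels are hypothetical; this is an illustration of the bit representation, not the implementation used in the experiments.

```python
import random

def decode_ensemble(chromosome, classifiers):
    """Keep classifier Ci whenever gene i of the chromosome is set."""
    if len(chromosome) != len(classifiers):
        raise ValueError("chromosome and classifier set must have the same size L")
    return [c for bit, c in zip(chromosome, classifiers) if bit]

def one_point_crossover(a, b, rng):
    """Swap the tails of two parents after a random cut point."""
    cut = rng.randrange(1, len(a))
    return a[:cut] + b[cut:], b[:cut] + a[cut:]

def bit_flip_mutation(chromosome, p_mut, rng):
    """Flip each gene independently with probability p_mut."""
    return [1 - g if rng.random() < p_mut else g for g in chromosome]

A = ["C1", "C2", "C3", "C4"]   # placeholder names standing in for trained classifiers
rng = random.Random(0)
child1, child2 = one_point_crossover([1, 1, 1, 1], [0, 0, 0, 0], rng)
```

The same representation serves the first level if each gene stands for a feature instead of a classifier.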

3.4 Multi-Objective Genetic Algorithm

Since the concept of multi-objective genetic algorithm (MOGA) will be
explored in the remainder of this paper, this section briefly introduces it.
A general multi-objective optimization problem consists of a number of ob-
jectives and is associated with a number of inequality and equality constraints.
Solutions to a multi-objective optimization problem can be expressed math-
ematically in terms of nondominated points, i.e., a solution is dominant over
another only if it has superior performance in all criteria. A solution is said
to be Pareto-optimal if it cannot be dominated by any other solution avail-
able in the search space. In our experiments, the algorithm adopted is the
Non-dominated Sorting Genetic Algorithm (NSGA) with elitism proposed by
Srinivas and Deb in [4, 34].
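As an illustration of nondominated points, the sketch below filters a small set of (error rate, number of features) pairs, both to be minimized. It assumes the commonly used form of Pareto dominance (no worse in every criterion and strictly better in at least one), which is slightly weaker than the all-criteria wording above; the sample values are invented for the example.

```python
def dominates(a, b):
    """True if a dominates b when every objective is minimized.

    Assumed form: a is no worse than b everywhere and strictly better somewhere.
    """
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def nondominated(points):
    """Return the points that no other point in the list dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]

# Hypothetical (error rate, number of features) pairs
candidates = [(0.05, 40), (0.03, 60), (0.04, 70), (0.10, 20), (0.10, 30)]
front = nondominated(candidates)
```

Here (0.04, 70) is dominated by (0.03, 60), and (0.10, 30) by (0.10, 20), so three trade-offs remain on the front.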
The idea behind NSGA is that a ranking selection method is applied to
emphasize good points and a niche method is used to maintain stable subpop-
ulations of good points. It varies from simple GA only in the way the selection
operator works. The crossover and mutation remain as usual. Before the se-
lection is performed, the population is ranked on the basis of an individual’s
nondomination. The nondominated individuals present in the population are
first identified from the current population. Then, all these individuals are
assumed to constitute the first nondominated front in the population and
assigned a large dummy fitness value. The same fitness value is assigned to
give an equal reproductive potential to all these nondominated individuals.
In order to maintain the diversity in the population, these classified individ-
uals are made to share their dummy fitness values. Sharing is achieved by
performing selection operation using degraded fitness values obtained by di-
viding the original fitness value of an individual by a quantity proportional
to the number of individuals around it. After sharing, these nondominated
individuals are ignored temporarily to process the rest of population in the
same way to identify individuals for the second nondominated front. This
new set of points is then assigned a new dummy fitness value which is kept
smaller than the minimum shared dummy fitness of the previous front. This
process is continued until the entire population is classified into several fronts.
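The ranking and sharing steps above can be sketched as follows. Several details are assumptions, not taken from the text: distances are measured in objective space, the sharing function is 1 - (d / sigma_share)^2 inside the niche radius, the first dummy fitness equals the population size, and each new front starts at 99% of the smallest shared fitness of the previous one.

```python
import math

def dominates(a, b):
    """Pareto dominance for minimization (assumed weak form)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def sharing(d, sigma_share):
    """Assumed sharing function: 1 at distance 0, falling to 0 at the niche radius."""
    return 1.0 - (d / sigma_share) ** 2 if d < sigma_share else 0.0

def nsga_dummy_fitness(objs, sigma_share):
    """Split the population into fronts and assign shared dummy fitness values."""
    remaining = set(range(len(objs)))
    fitness = [0.0] * len(objs)
    fronts = []
    dummy = float(len(objs))                      # large value for the first front
    while remaining:
        front = sorted(i for i in remaining
                       if not any(dominates(objs[j], objs[i]) for j in remaining))
        for i in front:
            # The niche count includes the individual itself, so it is at least 1.
            niche = sum(sharing(math.dist(objs[i], objs[j]), sigma_share) for j in front)
            fitness[i] = dummy / niche
        dummy = 0.99 * min(fitness[i] for i in front)   # keep the next front below
        fronts.append(front)
        remaining -= set(front)
    return fronts, fitness

objs = [(1, 5), (2, 3), (4, 1), (3, 4), (5, 5)]     # invented objective vectors
fronts, fitness = nsga_dummy_fitness(objs, sigma_share=0.5)
```

With such a small niche radius no two individuals share a niche, so every member of the first front keeps the full dummy fitness.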
Thereafter, the population is reproduced according to the dummy fitness
values. A stochastic remainder proportionate selection is adopted here. Since
individuals in the first front have the maximum fitness value, they get more
copies than the rest of the population. The efficiency of NSGA lies in the way
multiple objectives are reduced to a dummy fitness function using nondomi-
nated sorting procedures. More details about NSGA can be found in [4].
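One common reading of stochastic remainder proportionate selection is sketched below: each individual first receives the integer part of its expected number of copies, and the remaining slots are filled at random in proportion to the fractional parts. Filling the remainder with replacement is an assumption (variants without replacement exist), and the function name is hypothetical.

```python
import random

def stochastic_remainder_selection(fitness, rng):
    """Return the indices selected for reproduction (same size as the population)."""
    n = len(fitness)
    mean = sum(fitness) / n
    expected = [f / mean for f in fitness]          # expected number of copies
    selected = []
    for i, e in enumerate(expected):
        selected.extend([i] * int(e))               # deterministic integer part
    fractions = [e - int(e) for e in expected]
    remaining = n - len(selected)
    if remaining > 0:
        # Assumed variant: remaining slots drawn with replacement.
        selected.extend(rng.choices(range(n), weights=fractions, k=remaining))
    return selected

picked = stochastic_remainder_selection([4.0, 2.0, 1.0, 1.0], random.Random(0))
```

With these dummy fitness values the expected copies are 2, 1, 0.5, and 0.5, so the first individual always appears twice, the second once, and one of the last two fills the final slot.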

3.5 Classifiers and Feature Sets
As stated before, we have carried out experiments in both supervised and
unsupervised contexts. The remainder of this section describes the feature
sets and classifiers we have used.

3.5.1 Supervised Context

To evaluate the proposed methodology in the supervised context, we have used
three base classifiers trained to recognize handwritten digits of NIST SD19.
Such classifiers were trained with three well-known feature sets: Concavities
and Contour (CCsc), Distances (DDDsc), and Edge Maps (EMsc) [29].
All classifiers here are MLPs trained with the gradient descent applied
to a sum-of-squares error function. The transfer function employed is the fa-
miliar sigmoid function. In order to monitor the generalization performance
during learning and terminate the algorithm when there is no longer an im-
provement, we have used the method of cross-validation. Such a method takes
into account a validation set, which is not used for learning, to measure the
generalization performance of the network. During learning, the performance
of the network on the training set will continue to improve, but its
performance on the validation set will only improve up to the point where the
network starts to overfit the training set, at which point the learning
algorithm is terminated. All networks have one hidden layer where the units of
input and output are fully connected with units of the hidden layer, and the
number of hidden units was determined empirically (see Table 3.1). The learning rate and the
momentum term were set at high values in the beginning to make the weights
quickly fit the long ravines in the weight space, then these parameters were re-
duced several times according to the number of iterations to make the weights
fit the sharp curvatures.
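The validation-based stopping rule can be illustrated with a short sketch. The `patience` parameter (how many epochs without improvement are tolerated) and the error values are assumptions introduced for the example; the text only states that learning stops when the validation performance no longer improves.

```python
def early_stopping_epoch(val_errors, patience=3):
    """Return the epoch whose weights would be kept.

    val_errors: validation error measured after each training epoch
    patience:   hypothetical number of non-improving epochs tolerated before stopping
    """
    best_error, best_epoch = float("inf"), -1
    for epoch, err in enumerate(val_errors):
        if err < best_error:
            best_error, best_epoch = err, epoch
        elif epoch - best_epoch >= patience:
            break                                   # validation error stopped improving
    return best_epoch

# Invented validation curve: improves until epoch 2, then the network overfits.
kept = early_stopping_epoch([0.30, 0.20, 0.15, 0.16, 0.17, 0.18, 0.10], patience=3)
```

Training stops at epoch 5, so the later value 0.10 is never seen; a larger patience trades extra training time for a lower risk of stopping too early.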
Among the different rejection strategies we have tested, the one proposed
by Fumera et al. [8] provided the best error-reject trade-off for our
experiments. Basically, this technique suggests the use of multiple reject
thresholds for the different data classes (T0, ..., Tn) to obtain the optimal
decision and reject regions. In order to define such thresholds we have
developed an iterative algorithm, which takes into account a decreasing function
of the threshold variables R(T0, ..., Tn) and a fixed error rate Terror. We start
from all threshold values equal to 1, i.e., the error rate equal to 0 since all
images are rejected.
Then, at each step, the algorithm decreases the value of one of the thresholds
in order to increase the accuracy until the error rate exceeds Terror .
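A greedy reading of this iterative procedure is sketched below. The step size, the rule for choosing which threshold to lower (the one giving the highest recognition rate while the error stays within the budget), the strict acceptance test, and the toy samples are all assumptions; the text only states that one threshold is decreased per step until the error rate would exceed Terror.

```python
def rates(samples, thresholds):
    """samples: (predicted_class, confidence, true_class) triples.

    A sample is accepted when its confidence exceeds the threshold of the
    predicted class. Returns (recognition rate, error rate) as fractions.
    """
    n = len(samples)
    rec = sum(1 for p, c, t in samples if c > thresholds[p] and p == t)
    err = sum(1 for p, c, t in samples if c > thresholds[p] and p != t)
    return rec / n, err / n

def tune_thresholds(samples, n_classes, t_error, step=0.05):
    """Lower one class threshold per step while the error rate stays <= t_error."""
    thresholds = [1.0] * n_classes                  # everything rejected at the start
    while True:
        best = None
        for k in range(n_classes):
            if thresholds[k] <= 0.0:
                continue
            trial = list(thresholds)
            trial[k] = max(0.0, round(trial[k] - step, 10))
            rec, err = rates(samples, trial)
            if err <= t_error and (best is None or rec > best[0]):
                best = (rec, trial)
        if best is None:                            # any further decrease exceeds t_error
            return thresholds
        thresholds = best[1]

samples = [(0, 0.95, 0), (0, 0.90, 0), (0, 0.62, 1), (0, 0.55, 0),
           (1, 0.99, 1), (1, 0.80, 1), (1, 0.72, 0), (1, 0.66, 1)]
tuned = tune_thresholds(samples, n_classes=2, t_error=0.0)
```

On this toy set each threshold stops just above the confidence of the first misclassified sample of its class, so half of the samples are recognized with no error.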
The training (TRDBsc) and validation (VLDB1sc) sets are composed of
195,000 and 28,000 samples from hsf 0123 series respectively while the test
set (TSDBsc) is composed of 30,089 samples from the hsf 7. We consider also
a second validation set (VLDB2sc), which is composed of 30,000 samples of
hsf 7. This data is used to select the best ensemble of classifiers. Figure 3.2
shows the performance on the test set of all classifiers for error rates varying
from 0.10 to 0.50%, while Table 3.1 reports the performance of all classifiers at
zero-rejection level.

Table 3.1. Description and performance of the classifiers on TSDB (zero-rejection
level).

Feature Set   Number of Features   Units in the Hidden Layer   Rec. Rate (%)
CCsc          132                  80                          99.13
DDDsc         96                   60                          98.17
EMsc          125                  70                          97.04

The curves depicted in Figure 3.2 are much more meaningful
when dealing with real applications since they describe the recognition rate
in relation to a specific error rate, including implicitly a corresponding reject
rate. This rate also allows us to compute the reliability of the system for a
given error rate. It can be done by using Equation 3.1.

$$\text{Reliability} = \frac{\text{Rec. Rate}}{\text{Rec. Rate} + \text{Error Rate}} \times 100 \tag{3.1}$$
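As a worked example of Equation 3.1, the sketch below uses the figures reported in this section for the best classifier: 99.13% recognition at zero rejection, and 91.83% recognition when the error rate is fixed at 0.1%. The function names are hypothetical, and the reject rate is assumed to be what remains after recognition and error.

```python
def reliability(rec_rate, error_rate):
    """Equation 3.1, with both rates given in percent."""
    return rec_rate / (rec_rate + error_rate) * 100.0

def reject_rate(rec_rate, error_rate):
    """Assumed relation: recognition, error, and reject rates sum to 100%."""
    return 100.0 - rec_rate - error_rate

at_zero_rejection = reliability(99.13, 100.0 - 99.13)   # nothing rejected
at_low_error = reliability(91.83, 0.10)                  # error fixed at 0.1%
rejected = reject_rate(91.83, 0.10)
```

At zero rejection the reliability equals the recognition rate; at 0.1% error it rises to roughly 99.89%, at the price of rejecting 8.07% of the samples.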
Figure 3.2 corroborates that recognition of handwritten digits is still an
open problem when very low error rates are required. Consider for example
our best classifier, which reaches 99.13% at zero-rejection level on the test set.
If we allow an error rate of 0.1%, i.e., just one error in 1,000, the recognition
rate of such classifier drops from 99.13% to 91.83%. This means that we have
to reject 8.07% to get 0.1% of error (Figure 3.2). We will demonstrate that
the ensemble of classifiers can significantly improve the performance of the
classifiers for low error rates.



[Figure 3.2 plot: Recognition Rate (%) versus Error Rate (%), for error rates from 0.10 to 0.50; legend entry recovered: Edge Maps.]

Fig. 3.2. Performance of the classifiers on the test set for error rates varying from
0.10 to 0.50%.

3.5.2 Unsupervised Context

To evaluate the proposed methodology in the unsupervised context, we have used
three HMM-based classifiers trained to recognize handwritten Brazilian month
words (“Janeiro”, “Fevereiro”, “Março”, “Abril”, “Maio”, “Junho”, “Julho”,
“Agosto”, “Setembro”, “Outubro”, “Novembro”, “Dezembro”). The training
(TRDBuc), validation (VLDB1uc), and testing (TSDBuc) sets are composed
of 1,200, 400, and 400 samples, respectively. In order to increase the training
and validation sets, we have also considered 8,300 and 1,900 word images, re-
spectively, extracted from the legal amount database. This is possible because
we are considering character models. We consider also a second validation set
(VLDB2uc) of 500 handwritten Brazilian month words [14]. Such data is used
to select the best ensemble of classifiers.
Given a discrete HMM-based approach, each word image is transformed
as a whole into a sequence of observations by the successive application of
preprocessing, segmentation, and feature extraction. Preprocessing consists
of correcting the average character slant. The segmentation algorithm uses
the upper contour minima and some heuristics to split the date image into
a sequence of segments (graphemes), each of which consists of a correctly
segmented, an under-segmented, or an over-segmented character. A detailed
description of the preprocessing and segmentation stages is given in [28].
The word models are formed by the concatenation of appropriate ele-
mentary HMMs, which are built at letter and space levels. The topology of
space model consists of two states linked by two transitions that encode a
space or no space. Two topologies of letter models were chosen based on the
output of our grapheme-based segmentation algorithm which may produce a
correct segmentation of a letter, a letter under-segmentation or a letter over-
segmentation into two, three, or four graphemes depending on each letter. In
order to cope with these configurations of segmentations, we have designed
topologies with three different paths leading from the initial state to the final
state.
Considering uppercase and lowercase letters, we need 42 models since the
legal amount alphabet is reduced to 21 letter classes and we are not
considering the unused ones. Thus, regarding the two topologies, we have 84 HMMs
which are trained using the Baum-Welch algorithm with cross-validation.
Since no information on the writing style (uppercase, lowercase) is available
during recognition, the word model consists of two letter HMMs in parallel
and four space HMMs linked by four transitions: two uppercase-letters (UU),
two lowercase-letters (LL), one uppercase letter followed by one lowercase-
letter (UL), and one lowercase letter followed by one uppercase-letter (LU).
The probabilities of these transitions are estimated by their frequency of oc-
currence in the training set. In the same manner, the probabilities of beginning
a word by an uppercase-letter (0U) or a lowercase letter (0L) are also
estimated in the training set. This architecture handles the problem of
mixed handwritten words by implicitly detecting the writing style during
recognition through the backtracking of the Viterbi algorithm.
The feature set that feeds the first classifier is a mixture of concavity and
contour features (CCuc) [29]. In this case, each grapheme is divided into two
equal zones (horizontal) where for each region a concavity and contour feature
vector of 17 components is extracted. Therefore, the final feature vector has
34 components. The other two classifiers make use of a feature set based on
distances [28]. The former uses the same zoning discussed before (two equal
zones), but in this case, for each region a vector of 16 components is extracted.
This leads to a final feature vector of 32 components (DDD32uc). For the latter
we have tried a different zoning. The grapheme is divided into four zones using
the reference baselines; hence, we have a final feature vector composed of 64
components (DDD64uc). Table 3.2 reports the performance of all classifiers
on the test set at zero-rejection level. Figure 3.3 shows the performance of all
classifiers for error rates varying from 1% to 4%. The strategy for rejection
used in this case is the one discussed previously. We have chosen higher error
rates in this case due to the size of the database we are dealing with.

Table 3.2. Performance of the classifiers on the test set.

Feature Set   Number of Features   Codebook Size   Rec. Rate (%)
CCuc          34                   80              86.1
DDD32uc       32                   40              73.0
DDD64uc       64                   60              64.5

It can be observed from Figure 3.3 that the recognition rates with error
fixed at 1% are very poor; hence, the number of rejected patterns is very
high. We will see in the next sections that the proposed methodology can
improve these results considerably.

3.6 Implementation
This section introduces how we have implemented both levels of the proposed
methodology. First we discuss the supervised context and then the unsupervised
context.
3.6.1 Supervised Context

Supervised Feature Subset Selection

The feature selection algorithm used here was introduced in [30]. To make
this paper self-contained, a brief description is included in this section.

Feature selection algorithms can be classified into two categories
based on whether or not feature selection is performed independently
of the learning algorithm used to construct the classifier. If feature selection
is done independently of the learning algorithm, the technique is said to fol-
low a filter approach. Otherwise, it is said to follow a wrapper approach [13].
While the filter approach is generally computationally more efficient than the
wrapper approach, its major drawback is that an optimal selection of features
may not be independent of the inductive and representational biases of the
learning algorithm that is used to construct the classifier. On the other hand,
the wrapper approach involves the computational overhead of evaluating can-
didate feature subsets by executing a given learning algorithm on the database
using each feature subset under consideration.
As stated elsewhere, the idea of using feature selection is to promote di-
versity among the classifiers. To tackle such a task we have to optimize two
objective functions: minimization of the number of features and minimization
of the error rate of the classifier. Computing the first one is simple, i.e., the
number of selected features. The problem lies in computing the second one,
i.e., the error rate supplied by the classifier. In a wrapper approach,
in each generation, evaluation of a chromosome (a feature subset) requires
training the corresponding neural network and computing its accuracy. This
evaluation has to be performed for each of the chromosomes in the population.
Since such a strategy is not feasible due to the limits imposed by the learning
time of the huge training set considered in this work, we have adopted the
strategy proposed by Moody and Utans in [26], who use the sensitivity of
the network to estimate the relationship between the input features and the
network performance.



[Figure 3.3 plot: Recognition Rate (%) versus Error Rate (%), for error rates from 1 to 4; legend entry recovered: CC.]

Fig. 3.3. Performance of the classifiers on the test set for error rates varying from
1 to 4%.

The sensitivity of the network model to variable β is defined as:

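$$S_\beta = \frac{1}{N} \sum_{j=1}^{N} \left[ \mathrm{ASE}(\bar{x}_\beta) - \mathrm{ASE}(x_\beta) \right] \tag{3.2}$$

$$\bar{x}_\beta = \frac{1}{N} \sum_{j=1}^{N} x_{\beta j} \tag{3.3}$$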

where $x_{\beta j}$ is the βth input variable of the jth exemplar. $S_\beta$ measures the
effect on the training ASE (average square error) of replacing the βth input
$x_\beta$ by its average $\bar{x}_\beta$. Moody and Utans show that removing variables with
small sensitivity values with respect to the network outputs does not
influence the final classification. So, in order to evaluate a given feature subset
we replace the unselected features by their averages. In this way, we avoid
training the neural network and hence make the wrapper approach feasible for
our problem. We call this strategy modified-wrapper. Such a scheme has also been
employed by Yuan et al. in [38], and it makes it feasible to deal with huge
databases in order to better represent the pattern recognition problem during
the fitness evaluation². Moreover, it can accommodate multiple criteria such
as the number of features and the accuracy of the classifier, and generate the
Pareto-optimal front in the first run of the algorithm. Figure 3.4 shows the
evolution of the population in the objective plane and its respective Pareto-
optimal front.
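The modified-wrapper evaluation can be sketched as follows: the classifier is trained once, and a candidate feature subset is scored by replacing every unselected feature with its training-set average before classification. The fixed `predict` rule, the toy data, and the function names are assumptions made for illustration only.

```python
def feature_means(samples):
    """Average of each input variable over the exemplars."""
    n = len(samples)
    return [sum(col) / n for col in zip(*samples)]

def mask_features(x, chromosome, means):
    """Keep selected features; replace unselected ones by their averages."""
    return [v if bit else m for v, bit, m in zip(x, chromosome, means)]

def evaluate_subset(predict, samples, labels, chromosome, means):
    """Return the two objectives: (error rate, number of selected features)."""
    errors = sum(1 for x, y in zip(samples, labels)
                 if predict(mask_features(x, chromosome, means)) != y)
    return errors / len(samples), sum(chromosome)

# Hypothetical already-trained classifier: relies mostly on feature 0.
def predict(x):
    return 1 if x[0] + 0.1 * x[2] > 0.5 else 0

samples = [[0.1, 5.0, 0.2], [0.2, 1.0, 0.1], [0.9, 3.0, 0.3], [0.8, 7.0, 0.2]]
labels = [0, 0, 1, 1]
means = feature_means(samples)
```

Keeping only feature 0 leaves the error at zero, whereas dropping it makes all exemplars look alike to this classifier, which is the behaviour the sensitivity measure is meant to expose. No retraining is needed in either case.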
It can be observed in Figure 3.4b that the Pareto-optimal front is composed
of several different classifiers. In order to get a better insight about them, they
were classified into 3 different groups: weak, medium, and strong. It can be
observed that among all those classifiers there are very good ones. To find
out which classifiers of the Pareto-optimal front compose the best ensemble,
we carried out a second level of search. Since we did not train the models
during the search (the training step is replaced by the sensitivity analysis),
the final step of feature selection consists of training the solutions provided
by the Pareto-optimal front (Figure 3.1).

Choosing the Best Ensemble

As defined in Section 3.3, each gene of the chromosome is represented by a
classifier produced in the previous level. Therefore, if a chromosome has all
bits selected, all classifiers of A will compose the team. In order to find the best
ensemble of classifiers, i.e., the most diverse set of classifiers that brings a
good generalization, we have used two objective functions during this level
of the search, namely, maximization of the recognition rate of the ensemble
and maximization of a measure of diversity. We have tried different measures
such as overlap, entropy [22], and ambiguity [17]. The results achieved with
ambiguity and entropy were very similar. In this work we have used ambiguity
as the diversity measure.

[Figure 3.4 plots, panels (a) and (b): Error Rate (%) versus Number of Features.]

Fig. 3.4. Supervised feature selection using a Pareto-based approach: (a) Evolution
of the population in the objective plane, (b) Pareto-optimal front and its different
classes of classifiers.

² If small databases are considered, then a full-wrapper could replace the proposed
modified-wrapper.

The ambiguity is defined as follows:

$$a_i(x_k) = [V_i(x_k) - \bar{V}(x_k)]^2 \tag{3.4}$$

where $a_i$ is the ambiguity of the ith classifier on the example $x_k$, randomly
drawn from an unknown distribution, while $V_i$ and $\bar{V}$ are the ith classifier
and the ensemble predictions, respectively. In other words, it is simply the
variance of the ensemble around the mean, and it measures the disagreement
among the classifiers on input x. Thus the contribution to diversity of an
ensemble member i as measured on a set of M samples is:

$$A_i = \frac{1}{M} \sum_{k=1}^{M} a_i(x_k) \tag{3.5}$$

and the ambiguity of the ensemble is

$$\bar{A} = \frac{1}{N} \sum_{i=1}^{N} A_i \tag{3.6}$$

where N is the number of classifiers. So, if the classifiers implement the same
functions, the ambiguity $\bar{A}$ will be low, otherwise it will be high. In this
scenario the error from the ensemble is

$$E = \bar{E} - \bar{A} \tag{3.7}$$

where $\bar{E}$ is the average error of the single classifiers and $\bar{A}$ is the ambiguity of
the ensemble. Equation 3.7 expresses the trade-off between bias and variance
in the ensemble, but in a different way than the common bias-variance relation
in which the averages are over possible training sets instead of ensemble
averages. If the ensemble is strongly biased the ambiguity will be small, because
the classifiers implement very similar functions and thus agree on inputs even
outside the training set [17].
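The relation in Equation 3.7 can be checked numerically. The sketch below assumes scalar outputs, squared error, and an ensemble prediction equal to the unweighted mean of the members; the predictions and targets are invented, and the function name is hypothetical.

```python
def ensemble_terms(preds, targets):
    """preds[i][k]: output of classifier i on sample k; targets[k]: desired output.

    Returns (ensemble error E, average member error, ensemble ambiguity),
    all averaged over the M samples.
    """
    n, m = len(preds), len(targets)
    v_bar = [sum(p[k] for p in preds) / n for k in range(m)]          # ensemble output
    a_i = [sum((p[k] - v_bar[k]) ** 2 for k in range(m)) / m for p in preds]
    a_bar = sum(a_i) / n                                              # ambiguity
    e_bar = sum(sum((p[k] - targets[k]) ** 2 for k in range(m)) / m
                for p in preds) / n                                   # mean member error
    e_ens = sum((v_bar[k] - targets[k]) ** 2 for k in range(m)) / m   # ensemble error
    return e_ens, e_bar, a_bar

preds = [[0.9, 0.2, 0.6], [0.7, 0.4, 0.8], [0.8, 0.1, 0.3]]
targets = [1.0, 0.0, 1.0]
e_ens, e_bar, a_bar = ensemble_terms(preds, targets)
```

Under these assumptions the ensemble error equals the average member error minus the ambiguity, so for a fixed average error a more diverse team has a lower ensemble error; identical members give zero ambiguity and no gain.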
At this level of the strategy we want to maximize the generalization of the
ensemble, therefore, it will be necessary to use a way of combining the outputs
of all classifiers to get a final decision. To do this, we have used the average,
which is a simple and effective scheme of combining predictions of the neural
networks [16]. Other combination rules such as product, min, and max have
been tested but the simple average has produced slightly better results. In
order to evaluate the objective functions during the search described above
we have used the validation set VLDB1sc.
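The four combination rules mentioned above can be sketched as follows. The score matrix is invented, and the function name `combine` is hypothetical; each row holds the per-class outputs of one network.

```python
import math

def combine(outputs, rule="average"):
    """outputs[i][c]: score given by classifier i to class c.

    Fuses the scores class by class with the chosen rule and returns the
    index of the winning class.
    """
    rules = {
        "average": lambda scores: sum(scores) / len(scores),
        "product": math.prod,
        "min": min,
        "max": max,
    }
    fused = [rules[rule](scores) for scores in zip(*outputs)]
    return fused.index(max(fused))

# Invented scores: one network is very confident in class 0, the others are not.
outputs = [[0.9, 0.5], [0.1, 0.5], [0.2, 0.5]]
```

On this example the average, product, and min rules agree on class 1, while the max rule follows the single confident network and picks class 0, which shows how sensitive some rules are to one outlying output.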

3.6.2 Unsupervised Context

Unsupervised Feature Subset Selection

A lot of work done in the field of handwritten word recognition takes into
account discrete HMMs as classifiers, which have to be fed with a sequence of
discrete values (symbols). This means that before using a continuous feature
vector, we must convert it to discrete values. A common way to do that is
through clustering. The problem is that for most real-life situations we
do not know the best number of clusters, which makes it necessary to explore
different numbers of clusters using traditional clustering methods such as the
K-means algorithm and its variants. In this light, clustering can become a
trial-and-error work. Besides, its result may not be very promising especially
when the number of clusters is large and not easy to estimate.
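The conversion from continuous feature vectors to discrete symbols can be illustrated with a plain K-means codebook. The number of clusters, the initial codewords, and the toy vectors are assumptions that must be supplied by hand here, which is exactly the trial-and-error aspect discussed above.

```python
def quantize(vectors, codebook):
    """Map each vector to the index (symbol) of its nearest codeword."""
    def nearest(v):
        return min(range(len(codebook)),
                   key=lambda c: sum((a - b) ** 2 for a, b in zip(v, codebook[c])))
    return [nearest(v) for v in vectors]

def kmeans(vectors, initial_codebook, iterations=10):
    """Basic K-means: alternate assignment and centroid update."""
    codebook = [list(c) for c in initial_codebook]      # do not modify the input
    for _ in range(iterations):
        symbols = quantize(vectors, codebook)
        for c in range(len(codebook)):
            members = [v for v, s in zip(vectors, symbols) if s == c]
            if members:                                  # keep empty clusters unchanged
                codebook[c] = [sum(col) / len(members) for col in zip(*members)]
    return codebook

vectors = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
codebook = kmeans(vectors, [[0.0, 0.0], [10.0, 10.0]])
symbols = quantize(vectors, codebook)
```

The resulting symbol sequence is what a discrete HMM would receive as observations.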
Unsupervised feature selection emerges as a clever solution to this prob-
lem. The literature contains several studies on feature selection for supervised
learning, but only recently, the feature selection for unsupervised learning
has been investigated [5, 15]. The objective in unsupervised feature selection
is to search for a subset of features that best uncovers “natural” groupings
(clusters) from data according to some criterion. In this way, we can avoid
the manual process of clustering and find the most discriminative features at
the same time. Hence, we will have at the end a more compact and robust
high-level representation (symbols).
In the above context, unsupervised feature selection also presents a multi-
criterion optimization function, where the objective is to find compact and
well separated hyper-spherical clusters in the feature subspaces. Unlike
supervised feature selection, here the criteria optimized by the algorithm
are a validity index and the number of features [27].

[Figure 3.5 plots: (a) No. of Clusters versus No. of Features; (b) Recognition Rate (%) versus No. of Features.]

Fig. 3.5. (a) Relationship between the number of clusters and the number of features
and (b) Relationship between the recognition rate and the number of features.

In order to measure the quality of clusters during the clustering process,
we have used the Davies-Bouldin (DB)-index [3] over 80,000 feature vectors
extracted from the training set of 9,500 words. To make such an index suitable
for our problem, it must be normalized by the number of selected features. This
is due to the fact that it is based on geometric distance metrics and therefore,
it is not directly applicable here because it is biased by the dimensionality of
the space, which is variable in feature selection problems.
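A sketch of the Davies-Bouldin index evaluated in a feature subspace is given below. The text does not specify how the normalization is carried out, so dividing the index by the number of selected features is an assumption, as are the function names and the toy clusters. The sketch also assumes at least two clusters with distinct centroids.

```python
import math

def davies_bouldin(vectors, labels):
    """Standard DB index: lower values mean more compact, better separated clusters."""
    clusters = {}
    for v, l in zip(vectors, labels):
        clusters.setdefault(l, []).append(v)
    cents = {l: [sum(col) / len(m) for col in zip(*m)] for l, m in clusters.items()}
    scatter = {l: sum(math.dist(v, cents[l]) for v in m) / len(m)
               for l, m in clusters.items()}
    keys = list(clusters)
    worst = [max((scatter[i] + scatter[j]) / math.dist(cents[i], cents[j])
                 for j in keys if j != i) for i in keys]
    return sum(worst) / len(keys)

def normalized_db(vectors, labels, chromosome):
    """DB index in the selected subspace, divided by the number of selected features.

    The division is an assumed form of the normalization mentioned in the text.
    """
    selected = [i for i, bit in enumerate(chromosome) if bit]
    if not selected:
        raise ValueError("at least one feature must be selected")
    projected = [[v[i] for i in selected] for v in vectors]
    return davies_bouldin(projected, labels) / len(selected)

vectors = [[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]]
labels = [0, 0, 1, 1]
```

In this toy case the second feature only adds scatter inside the clusters, so selecting the first feature alone gives a lower index than using both.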
We have noticed that the value of the DB index decreases as the number of
features increases. We have related this effect to the normalization of the
DB index by the number of features. In order to compensate for this, we have
considered as second objective the minimization of the number of features. In
this case, at least one feature must be selected. Figure 3.5 depicts the relationship
between the number of clusters and number of features and the relationship
between the recognition rate on the validation set and the number of features.
As in the supervised context, here we also divided the classifiers of the
Pareto-optimal front into classes. In this case, we have observed that those
classifiers with very few features are not selected to compose the ensemble, and
therefore, just the classifiers with more than 10 features were used in the second level
of search. In Section 3.7.2 we discuss this issue in more detail. The way of
choosing the best ensemble is exactly the same as introduced in Section 3.6.1.

3.7 Experimental Results

All experiments in this work were based on a single-population master-slave
MOGA. In this strategy, one master node executes the genetic operators
(selection, crossover and mutation), and the evaluation of fitness is distributed
among several slave processors. We used a Beowulf cluster with 17 PCs (one
master and 16 slaves; 1.1 GHz CPU, 512 MB RAM) to execute our experiments.

The following parameter settings were employed in both levels: population size = 128, number of generations = 1000, probability of crossover = 0.8, probability of mutation = 1/L (where L is the length of the chromosome), and niche distance (σshare) = [0.25, 0.45]. The length of the chromosome in the first level is the number of components in the feature set (see Table 3.1), while in the second level it is the number of classifiers picked from the Pareto-optimal front in the previous level.

In order to define the probabilities of crossover and mutation, we used the one-max problem, which is probably the most frequently used test function in research on genetic algorithms because of its simplicity [2]. This function measures the fitness of an individual as the number of bits set to one on the chromosome. We used a standard genetic algorithm with single-point crossover and a maximum of 1000 generations. The crossover and mutation rates were kept fixed within a run, and we tested combinations of crossover rates (among them 0.8 and 1.0) and mutation rates of 0.1/L, 1/L and 10/L, where L is the length of the chromosome. The best results were achieved with Pc = 0.8 and Pm = 1/L. Such results confirmed the values reported by Miki et al. in [25]. The parameter σshare was tuned empirically.

3.7.1 Experiments in the Supervised Context

Once all parameters have been defined, the first step, as described in Section 3.6.1, consists of performing feature selection for a given feature set. As depicted in Figure 3.4, this procedure produces quite a large number of classifiers, which should be trained for use in the second level. After some experiments, we found that the second level always chooses "strong" classifiers to compose the ensemble. Thus, in order to speed up the training process and the second level of search as well, we decided to train and use in the second level just the "strong" classifiers. This decision was made after we realized that in our experiments the "weak" and "medium" classifiers did not cooperate with the ensemble at all. To train such classifiers, the same databases reported in Section 3.3.1 were used. Table 3.3 summarizes the "strong" classifiers produced by the first level for the three feature sets we have considered.

Table 3.3. Summary of the classifiers produced by the first level.

Feature Set   No. of Classifiers   Range of Features   Range of Rec. Rates (%)
CCsc          81                   24-125              90.5 - 99.1
DDDsc         54                   30-84               90.6 - 98.1
EMsc          78                   35-113              90.5 - 97.0
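The one-max tuning experiment used in Section 3.7 to set Pc and Pm can be sketched roughly as below. This is a minimal sketch, not the authors' implementation: binary tournament selection, the small default population and generation counts, and the names `onemax`, `run_ga` and `tune` are all assumptions made for illustration (the text specifies only a standard GA with single-point crossover, fixed rates per run, and mutation rates expressed as multiples of 1/L).

```python
import random

def onemax(bits):
    """One-max fitness: the number of bits set to one on the chromosome."""
    return sum(bits)

def run_ga(length=32, pop_size=20, generations=50, pc=0.8, pm=None, seed=0):
    """Run a simple generational GA on one-max and return the best final fitness."""
    rng = random.Random(seed)
    pm = 1.0 / length if pm is None else pm
    pop = [[rng.randint(0, 1) for _ in range(length)] for _ in range(pop_size)]

    def pick():
        # Binary tournament selection (an assumption; the text does not name the scheme).
        a, b = rng.sample(pop, 2)
        return a if onemax(a) >= onemax(b) else b

    for _ in range(generations):
        nxt = []
        while len(nxt) < pop_size:
            p1, p2 = pick()[:], pick()[:]
            if rng.random() < pc:
                # Single-point crossover.
                cut = rng.randint(1, length - 1)
                p1, p2 = p1[:cut] + p2[cut:], p2[:cut] + p1[cut:]
            for child in (p1, p2):
                # Bit-flip mutation with per-bit probability pm.
                nxt.append([b ^ 1 if rng.random() < pm else b for b in child])
        pop = nxt[:pop_size]
    return max(onemax(ind) for ind in pop)

def tune(pcs=(0.8, 1.0), pm_factors=(0.1, 1.0, 10.0), length=32):
    """Try each (crossover rate, mutation factor) pair; the mutation rate is factor / L."""
    return {(pc, f): run_ga(length=length, pc=pc, pm=f / length)
            for pc in pcs for f in pm_factors}
```

Calling `tune()` returns the best one-max fitness reached for each parameter pair, so the pairs can be ranked; the settings reported in the text (population 128, 1000 generations) would simply replace the small defaults used here.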

Fig. 3.6. The Pareto-optimal front produced by the second-level MOGA: (a) S1 and (b) S4. Both panels plot the recognition rate of the ensemble (%) against ambiguity, for the Pareto front and the validation curve.

Considering for example the feature set CCsc, the first level of the algorithm provided 81 "strong" classifiers which have the number of features ranging from 24 to 125 and recognition rates ranging from 90.5% to 99.1% on TSDBsc. This shows the great diversity of the classifiers produced by the feature selection method. Based on the classifiers reported in Table 3.3 we define four sets of base classifiers as follows: S1 = {CCsc0, ..., CCsc80}, S2 = {DDDsc0, ..., DDDsc53}, S3 = {EMsc0, ..., EMsc77}, and S4 = {S1 ∪ S2 ∪ S3}. All these sets could be seen as ensembles, but in this work we reserve the word ensemble to characterize the results yielded by the second level of the algorithm. In order to assess the objective functions of the second level of the algorithm (generalization of the ensemble and diversity) we used the validation set (VLDB1sc).

Like the first level, the second one also generates a set of possible solutions which are the trade-offs between the generalization of the ensemble and its diversity. Thus the problem now lies in choosing the most accurate ensemble among all. Due to limited space, Figure 3.6 only depicts the variety of ensembles yielded by the second level of the algorithm for S1 and S4. The number over each point stands for the number of classifiers in the ensemble.

In order to decide which ensemble to choose we validate the Pareto-optimal front using VLDB2sc, which was not used so far. Since we are aiming at performance, the direct choice will be the ensemble that provides better generalization on VLDB2sc. Table 3.4 summarizes the best ensembles produced for the four sets of base classifiers and their performance at zero-rejection level on the test set. For convenience, we reproduce in this table the results of the original classifiers. We can notice from Table 3.4 that the ensembles and base classifiers have very similar performance at zero-rejection level.
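The selection step described above (keep the trade-off solutions, then pick the one that generalizes best on a second validation set) can be sketched as follows. This is an illustrative sketch only: the names `dominates`, `pareto_front` and `select_ensemble` are hypothetical, both objectives are assumed to be maximized, and the numbers in the usage note are made up for the example rather than taken from the experiments.

```python
def dominates(a, b):
    """True if objective vector a is at least as good as b everywhere and better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def pareto_front(points):
    """Indices of the non-dominated points (all objectives maximized)."""
    return [
        i for i, p in enumerate(points)
        if not any(dominates(q, p) for j, q in enumerate(points) if j != i)
    ]

def select_ensemble(objectives, second_validation_acc):
    """Among non-dominated candidates, return the index with the best accuracy
    on the second validation set (hypothetical stand-in for VLDB2sc)."""
    front = pareto_front(objectives)
    return max(front, key=lambda i: second_validation_acc[i])
```

For example, with candidate ensembles scored as (accuracy on the first validation set, ambiguity) = `[(99.2, 0.1), (99.1, 0.5), (99.0, 0.4), (98.9, 0.9)]`, the third candidate is dominated by the second and is dropped before the second validation set is consulted.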

Table 3.4. Improvements yielded by the ensembles.

Feature Set   Number of Classifiers   Rec. Rate (%), zero-rejection level   Rec. Rate (%), Original Classifiers
S1            4                       99.22                                99.13
S2            4                       98.18                                98.17
S3            7                       97.10                                97.04
S4            24                      99.25                                -

On the other hand, Figure 3.7 shows that the ensembles respond better than single classifiers for error rates fixed at very low levels. The most expressive result was achieved for S3, whose original classifier attains a reasonable performance at zero-rejection level but performs very poorly at low error rates. In such a case, the ensemble of classifiers brought an improvement of about 8%. We noticed that the ensemble reduces the high outputs of some outliers, so that the threshold used for rejection can be reduced and consequently the number of samples rejected is reduced. Thus, when aiming for a small error rate, we have to consider the important role of the ensemble.

Regarding the ensemble S4, we can notice that it achieves a performance similar to S1 at zero-rejection level (see Table 3.4). However, it is composed of 24 classifiers, against four of S1. The fact worth noting, though, is the performance of S4 at low error rates. For the error rate fixed at 1% it reached 95.0% against 93.5% of S1. S4 is composed of 14, 6, and 4 classifiers from S1, S2, and S3, respectively. This emphasizes the ability of the algorithm to find good ensembles when more original classifiers are available.

Fig. 3.7. Performance of the ensembles on the test set. The plot shows Recognition Rate (%) against Error Rate (%) for Combination, Concavities, Distances and Edge Maps.
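The error-reject trade-off discussed above can be illustrated with a small sketch that finds the best recognition rate achievable at a fixed error rate by rejecting low-confidence samples. The definitions used here are assumptions rather than the authors' exact protocol: both rates are taken as percentages of all samples (rejected ones included), a single global confidence threshold is used, and the function name `recognition_at_error` is hypothetical.

```python
def recognition_at_error(confidences, correct, max_error):
    """Best recognition rate (%) such that the error rate (%) stays <= max_error.

    Samples are accepted in decreasing order of confidence; samples sharing the
    same confidence are accepted together, since one threshold cannot split them.
    """
    n = len(confidences)
    order = sorted(range(n), key=lambda i: confidences[i], reverse=True)
    best = 0.0
    n_ok = n_err = 0
    i = 0
    while i < n:
        level = confidences[order[i]]
        j = i
        while j < n and confidences[order[j]] == level:
            if correct[order[j]]:
                n_ok += 1
            else:
                n_err += 1
            j += 1
        # Threshold placed just below `level`: everything seen so far is accepted.
        if 100.0 * n_err / n <= max_error:
            best = max(best, 100.0 * n_ok / n)
        i = j
    return best
```

Sweeping `max_error` over a range of values produces a curve of the kind plotted in Figure 3.7; a combiner that lowers the confidence of misclassified outliers lets more correct samples be accepted before the error budget is spent.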

3.7.2 Experiments in the Unsupervised Context

The experiments in the unsupervised context follow the same vein as the supervised ones. As discussed in Section 3.6.2, the main difference lies in the way the feature selection is carried out. In spite of that, we can observe that the number of classifiers produced during unsupervised feature selection is quite large as well. In light of this, we applied the same strategy of dividing the classifiers into groups (see Figure 3.5). After some experiments, we found that the second level always chooses "strong" classifiers to compose the ensemble. Thus, in order to speed up the training process and the second level of search as well, we decided to train and use in the second level just "strong" classifiers. To train such classifiers, the same databases reported in Section 3.3.2 were considered. Table 3.5 summarizes the "strong" classifiers (after training) produced by the first level for the three feature sets we have considered.

Table 3.5. Summary of the classifiers produced by the first level.

Feature Set   Number of Classifiers   Range of Features   Range of Codebook   Range of Rec. Rates (%)
CCuc          15                      10-32               29-39               68.1 - 88.6
DDD32uc       21                      10-31               20-30               71.7 - 78.0
DDD64uc       50                      10-64               52-80               60.6 - 78.2

Considering for example the feature set CCuc, the first level of the algorithm provided 15 "strong" classifiers which have the number of features ranging from 10 to 32 and recognition rates ranging from 68.1% to 88.6% on VLDB1uc. This shows the great diversity of the classifiers produced by the feature selection method. Based on the classifiers reported in Table 3.5 we define four sets of base classifiers as follows: F1 = {CCuc0, ..., CCuc14}, F2 = {DDD32uc0, ..., DDD32uc20}, F3 = {DDD64uc0, ..., DDD64uc49}, and F4 = {F1 ∪ F2 ∪ F3}.

Again, due to limited space, Figure 3.8 only depicts the variety of ensembles yielded by the second level of the algorithm for F2 and F4. The number over each point stands for the number of classifiers in the ensemble. As in the previous experiments, the second validation set (VLDB2uc) was used to select the best ensemble. After selecting the best ensemble, the final step is to assess it on the test set. Table 3.6 summarizes the performance of the ensembles on the test set. For the sake of comparison, we reproduce in Table 3.6 the results presented in Table 3.2. Figure 3.8b shows the performance of the ensembles generated with all base classifiers available, i.e., ensemble F4. As in the previous experiments (supervised context), the result achieved by the ensemble F4 shows the ability of the algorithm to find good ensembles when more base classifiers are considered.

Fig. 3.8. The Pareto-optimal front (and validation curves where the best solutions are highlighted with an arrow) produced by the second-level MOGA: (a) F2 and (b) F4. Both panels plot Recognition Rate (%) against Ambiguity.

Table 3.6. Comparison between ensembles and original classifiers.

Base Classifiers   Number of Classifiers   Rec. Rate (%)   Original Feature Set   Rec. Rate (%)
F1                 10                      89.2            CC                     86.1
F2                 15                      80.2            DDD32                  73.0
F3                 36                      80.7            DDD64                  64.5
F4                 45                      90.2            -                      -

The ensemble F4 is composed of 9, 11, and 25 classifiers from F1, F2, and F3, respectively.

In light of this, we decided to introduce a new feature set, which, based on our experience, has a good discrimination power when combined with other features such as concavities. This feature set, which we call "global features", is composed of primitives such as ascenders, descenders, and loops. The combination of these primitives plus a primitive that determines whether a grapheme does not contain ascender, descender, and loop produces a 20-symbol alphabet. For more details, see Ref. [28]. In order to train the classifier with this feature set, we used the same databases described in Section 3.3.2. The recognition rates at zero-rejection level are 86.1% and 87.2% on validation and testing sets, respectively. This performance is comparable to that of the CCuc classifier.

Since we have a new base classifier, our sets of base classifiers must be modified to cope with it. Thus, F1G = {F1 ∪ G}, F2G = {F2 ∪ G}, F3G = {F3 ∪ G}, and F4G = {F1 ∪ F2 ∪ F3 ∪ G}, where G stands for the classifier trained with global features. Table 3.7 summarizes the ensembles found using these new sets of base classifiers. The reduction in the size of the teams is worth remarking.

Table 3.7. Performance of the ensembles with global features.

Base Classifiers   Number of Classifiers   Rec. Rate (%) Testing
F1G                2                       92.2
F2G                2                       89.7
F3G                7                       85.5
F4G                23                      92.0

This reduction shows the ability of the algorithm to find not just diverse but also uncorrelated classifiers to compose the ensemble [36]. Besides, it corroborates our claim that the classifier G, when combined with other features, brings an improvement in performance.

In Figure 3.9 we compare the error-reject trade-offs for some ensembles reported in Table 3.7. Like the results at zero-rejection level, the improvements observed here are also quite impressive. Table 3.7 shows that F1G and F4G reach similar results on the test set at zero-rejection level; however, F1G contains just two classifiers against 23 of F4G. On the other hand, the latter features a slightly better error-reject trade-off in the long run (Figure 3.9b).

Based on the experiments reported so far we can affirm that unsupervised feature selection is a good strategy to generate diverse classifiers. This is made very clear in the experiments regarding the feature set DDD64. In such a case, the original classifier has a poor performance (about 65% on the test set), but when it is used to generate the set of base classifiers, the second-level MOGA was able to produce a good ensemble by maximizing the performance and the ambiguity measure. Such an ensemble of classifiers brought an improvement of about 15% in the recognition rate at zero-rejection level.

Fig. 3.9. Improvements yielded by the ensembles: (a) F1 and (b) comparison among all ensembles. Both panels plot Recognition Rate (%) against Error Rate (%).

To better evaluate our results, we used two traditional ensemble methods (Bagging and Boosting) in the supervised context. Figure 3.10 reports the results. As we can see, the proposed methodology achieved better results, especially when considering very low error rates.

Fig. 3.10. Comparison among feature selection for ensembles, bagging, and boosting for the two feature sets used in the supervised context: (a) CCsc and (b) EMsc. Both panels plot Recognition Rate (%) against Error Rate (%) for Ensemble Feature Selection, the Original System, Boosting and Bagging.

3.8 Discussion

The results obtained here attest that the proposed strategy is able to generate a set of good classifiers in both supervised and unsupervised contexts. Diversity is an issue that deserves some attention when discussing ensembles of classifiers. As we have mentioned before, some authors have argued that diversity does not help at all. In our experiments, most of the time, the best ensembles of the Pareto-optimal front were also the best for the unseen data. This could lead one to agree that diversity is not important when building ensembles, since even when using a validation set the selected team is always the most accurate one, with less diversity. However, if we look carefully at the results, we will observe that there are cases where the validation curve does not have the same shape as the Pareto-optimal front. In such cases diversity is very useful to avoid selecting overfitted solutions.

One can argue that, using a single GA and considering the entire final population, solutions similar to those found in the Pareto-optimal front produced by the MOGA might be present. To show that this does not happen, we carried out some experiments with a single GA where the fitness function was the maximization of the ensemble's accuracy. Since a single-objective optimization algorithm searches for an optimum solution, it is natural to expect that it will converge towards the fittest solution.

Hence, the diversity of solutions present in the Pareto-optimal front is not found in the final population of the single genetic algorithm. To illustrate this, we present the results obtained using a GA to find an ensemble in F2 (unsupervised context). The parameters used here are the same as those used for the MOGA (Section 3.7). Figure 3.11a plots all the classifiers found in the final population of the genetic algorithm. For the sake of comparison we reproduce Figure 3.8a in Figure 3.11b. As we can see, the population is very homogeneous and it converged, as expected, towards the most accurate ensemble.

Some attempts in this direction were made by Opitz [31], who combined accuracy and diversity through the weighted-sum approach. As noted elsewhere in the literature, this kind of combination raises problems such as scaling and sensitivity towards the weights. We believe that our strategy offers a clever way to find the ensemble using genetic algorithms.

Fig. 3.11. Benefits of using diversity: (a) population (classifiers) of the final generation of the GA and (b) classifiers found by the MOGA. The panels plot Recognition Rate (%) for the final GA population and, against Ambiguity, for the MOGA Pareto front, each with its validation curve.

3.9 Conclusion

We have described a methodology for ensemble creation based on the "overproduce and choose" paradigm. It takes two levels of search, where the first level overproduces a set of classifiers by performing feature selection while the second one chooses the best team of classifiers. The feasibility of the strategy was demonstrated through comprehensive experiments carried out in the context of handwriting recognition. The idea of generating classifiers through feature selection proved to be successful in both supervised and unsupervised contexts.

The results attained in both situations, using different feature sets and base classifiers, demonstrated the efficiency of the proposed strategy in finding powerful ensembles, which succeed in improving the recognition rates for classifiers working at very low error rates. Such results compare favorably to traditional ensemble methods such as Bagging and Boosting. Finally, we have addressed the issue of using diversity to build ensembles. As we have seen, using diversity jointly with the accuracy of the ensemble as selection criterion might be very helpful to avoid choosing overfitted solutions. Our results certainly bring some contribution to the field, but this is still an open problem.

References

[1] L. Breiman. Stacked regressions. Machine Learning, 24(1):49–64, 1996.
[2] E. Cantu-Paz. Efficient and Accurate Parallel Genetic Algorithms. Kluwer Academic Publishers, 2000.
[3] D. L. Davies and D. W. Bouldin. A cluster separation measure. IEEE Trans. on Pattern Analysis and Machine Intelligence, 1(2):224–227, 1979.
[4] K. Deb. Multi-Objective Optimization using Evolutionary Algorithms. John Wiley and Sons Ltd, 2nd edition, April 2002.
[5] J. G. Dy and C. E. Brodley. Feature subset selection and order identification for unsupervised learning. In Proc. 17th International Conference on Machine Learning, 2000.
[6] B. Efron and R. Tibshirani. An introduction to the Bootstrap. Chapman and Hall, 1993.
[7] Y. Freund and R. Schapire. Experiments with a new boosting algorithm. In Proc. of 13th International Conference on Machine Learning, pages 148–156, Bari, Italy, 1996.
[8] G. Fumera, F. Roli, and G. Giacinto. Reject option with multiple thresholds. Pattern Recognition, 33(12):2099–2101, 2000.
[9] G. Giacinto and F. Roli. Design of effective neural network ensemble for image classification purposes. Image Vision and Computing Journal, 9-10:697–705, 2001.
[10] S. Gunter and H. Bunke. Creation of classifier ensembles for handwritten word recognition using feature selection algorithms. In Proc. of 8th IWFHR, pages 183–188, Niagara-on-the-Lake, Canada, 2002.
[11] S. Hashem. Optimal linear combinations of neural networks. Neural Networks, 10(4):599–614, 1997.
[12] T. K. Ho. The random subspace method for constructing decision forests. IEEE Trans. on Pattern Analysis and Machine Intelligence, 20(8):832–844, 1998.
[13] G. John, R. Kohavi, and K. Pfleger. Irrelevant features and the subset selection problems. In Proc. of 11th International Conference on Machine Learning, pages 121–129, 1994.
[14] J. J. Oliveira Jr., J. M. Carvalho, C. O. A. Freitas, and R. Sabourin. Evaluating NN and HMM classifiers for handwritten word recognition. In Proceedings of the 15th Brazilian Symposium on Computer Graphics and Image Processing, pages 210–217. IEEE Computer Society, 2002.

[15] Y. Kim, W. N. Street, and F. Menczer. Feature selection in unsupervised learning via evolutionary search. In Proc. 6th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 365–369, 2000.
[16] J. Kittler, M. Hatef, R. P. W. Duin, and J. Matas. On combining classifiers. IEEE Trans. on Pattern Analysis and Machine Intelligence, 20(3):226–239, 1998.
[17] A. Krogh and J. Vedelsby. Neural networks ensembles, cross validation, and active learning. In G. Tesauro et al., editor, Advances in Neural Information Processing Systems 7, pages 231–238. MIT Press, 1995.
[18] M. Kudo and J. Sklansky. Comparison of algorithms that select features for pattern classifiers. Pattern Recognition, 33(1):25–41, 2000.
[19] L. I. Kuncheva. That elusive diversity in classifier ensembles. In Proc. of ibPRIA, LNCS 2652, pages 1126–1138, Mallorca, Spain, 2003.
[20] L. I. Kuncheva, J. C. Bezdek, and R. P. W. Duin. Decision templates for multiple classifier fusion: An experimental comparison. Pattern Recognition, 34(2):299–314, 2001.
[21] L. I. Kuncheva and L. C. Jain. Designing classifier fusion systems by genetic algorithms. IEEE Trans. on Evolutionary Computation, 4(4):327–336, 2000.
[22] L. I. Kuncheva and C. J. Whitaker. Ten measures of diversity in classifier ensembles: limits for two classifiers. In Proc. of IEE Workshop on Intelligent Sensor Processing, pages 1–10, 2001.
[23] L. I. Kuncheva and C. J. Whitaker. Measures of diversity in classifier ensembles. Machine Learning, 51:181–207, 2003.
[24] M. Last, H. Bunke, and A. Kandel. A feature-based serial approach to classifier combination. Pattern Analysis and Applications, 5:385–398, 2002.
[25] M. Miki, T. Hiroyasu, K. Kaneko, and K. Hatanaka. A parallel genetic algorithm with distributed environment scheme. In Proc. of International Conference on System, Man, and Cybernetics, volume 1, pages 695–700, 1999.
[26] J. Moody and J. Utans. Principled architecture selection for neural networks: Application to corporate bond rating prediction. In J. Moody, S. J. Hanson, and R. P. Lippmann, editors, Advances in Neural Information Processing Systems 4. Morgan Kaufmann, 1991.
[27] M. Morita, R. Sabourin, F. Bortolozzi, and C. Y. Suen. Unsupervised feature selection using multi-objective genetic algorithms for handwritten word recognition. In Proceedings of the 7th International Conference on Document Analysis and Recognition, pages 666–670. IEEE Computer Society, 2003.
[28] M. Morita, R. Sabourin, F. Bortolozzi, and C. Y. Suen. Segmentation and recognition of handwritten dates: An HMM-MLP hybrid approach. International Journal on Document Analysis and Recognition, 6:248–262, 2003.
[29] L. S. Oliveira, R. Sabourin, F. Bortolozzi, and C. Y. Suen. Automatic recognition of handwritten numerical strings: A recognition and verification strategy. IEEE Trans. on Pattern Analysis and Machine Intelligence, 24(11):1438–1454, 2002.
[30] L. S. Oliveira, R. Sabourin, F. Bortolozzi, and C. Y. Suen. A methodology for feature selection using multi-objective genetic algorithms for handwritten digit string recognition. International Journal of Pattern Recognition and Artificial Intelligence, 17(6):903–930, 2003.
[31] D. W. Opitz. Feature selection for ensembles. In Proc. of 16th International Conference on Artificial Intelligence, pages 379–384, 1999.

[32] D. Partridge and W. B. Yates. Engineering multiversion neural-net systems. Neural Computation, 8(4):869–893, 1996.
[33] D. Ruta. Multilayer selection-fusion model for pattern classification. In Proceedings of the IASTED Artificial Intelligence and Application Conference, Innsbruck, Austria, 2004.
[34] N. Srinivas and K. Deb. Multiobjective optimization using nondominated sorting in genetic algorithms. Evolutionary Computation, 2(3):221–248, 1995.
[35] A. Tsymbal, S. Puuronen, and D. W. Patterson. Ensemble feature selection with the simple Bayesian classification. Information Fusion, 4:87–100, 2003.
[36] K. Tumer and J. Ghosh. Error correlation and error reduction in ensemble classifiers. Connection Science, 8(3-4):385–404, 1996.
[37] K. Tumer and N. C. Oza. Input decimated ensembles. Pattern Analysis and Applications, 6:65–77, 2003.
[38] H. Yuan, S. S. Tseng, W. Gangshan, and Z. Fuyan. A two-phase feature selection method using both filter and wrapper. In Proc. of IEEE International Conference on Systems, Man, and Cybernetics, volume 2, pages 132–136, 1999.

4 Feature Extraction Using Multi-Objective Genetic Programming

Yang Zhang and Peter I Rockett

Department of Electronic and Electrical Engineering, University of Sheffield, Sheffield, S1 3JD, UK
yang.zhang, p.rockett@shef.ac.uk

Summary. A generic, optimal feature extraction method using multi-objective genetic programming (MOGP) is presented. This methodology has been applied to the well-known edge detection problem in image processing and detailed comparisons made with the Canny edge detector. We show that the superior performance from MOGP in terms of minimizing the misclassification is due to its effective optimal feature extraction. Furthermore, to compare different evolutionary approaches, two popular techniques, PCGA and SPGA, have been extended to genetic programming as PCGP and SPGP, and applied to five datasets from the UCI database. Both of these evolutionary approaches provide comparable misclassification errors within the present framework but PCGP produces more compact transformations.

4.1 Introduction

4.1.1 Feature Extraction

Over recent decades, the effort made on designing ever-more sophisticated classifiers has almost eclipsed the importance of feature extraction, typically the pre-processing step in a pattern classification system (see Figure 4.1). Indeed many elegant results have been obtained in the area of classification since the 1970s. Nonetheless, feature extraction maintains a key position in the field since it is well-known that feature extraction can enhance the performance of a pattern classifier via appropriate pre-processing of the raw measurement data [2]. To utilize the potential information in a dataset to its maximum extent has always been highly desirable, and subset selection, dimensionality reduction and transformation of features (feature extraction) have all been applied to patterns in the past before they are labeled by a classifier. Often though, the feature selection and/or extraction stages are omitted or are implicit in the recognition paradigm; a multi-layer perceptron (MLP) is a good example, where distinct feature selection/extraction stages are not readily identifiable.

Y. Zhang and P. I Rockett: Feature Extraction Using Multi-Objective Genetic Programming, Studies in Computational Intelligence (SCI) 16, 75–99 (2006)
www.springerlink.com © Springer-Verlag Berlin Heidelberg 2006

In many application domains, such as medical diagnosis or credit scoring, however, the interpretability of the features used in classification may contain much important information. Popular supervised learning network systems such as multi-layer perceptrons, radial basis functions and LVQs, coupled with powerful training algorithms, can provide model-free or semi-parametric methods to design non-linear mappings between the input and classification spaces from a set of training examples. In such paradigms, the three processing steps shown in Figure 4.1 are merged into a single, indivisible stage. Nonetheless, optimality is hard to guarantee with such methods. For example, with multi-layer perceptrons, critical quantities such as the best number of hidden neurons are often determined empirically using cross-validation or other techniques.

Generally speaking, there are two main streams of feature extraction: linear and non-linear. The former is exemplified by principal component analysis (PCA) and related methods, which are mainly applied to reduce the dimensionality of the original input space by projecting the data down into the sub-space with the greatest amount of data variability; this does not necessarily equate to optimal class separability. Obtaining the optimal (possibly non-linear) transformation x → y from the input vector, x, to the decision space vector, y, where:

y = f(x)    (4.1)

is a more challenging task. A generic and domain-independent methodology for generating feature extraction stages has hitherto been an unattained goal in pattern recognition. Normally, feature extraction approaches are hand-crafted using domain-specific knowledge and optimality is hard to guarantee; in fact the subject of optimality is rarely even addressed. Indeed, much of image processing research, for example, has been devising what are really feature extraction algorithms to detect edges, corners, etc. Ideally, we require some measure of class separability in the transformed decision space to be maximized, but with hand-crafted methods this is usually hard to assess. Most importantly, no generic and domain-independent methodology exists to automate the process of creating or searching for good feature extractors for classification tasks where domain knowledge either does not exist or is incomplete.

Following the above insight that the feature extraction pre-processing stage is a mapping from input space to a decision space which can be represented as a sequence of transformations, we seek the mapping which maximizes the class separability in decision space and hence the classification performance. Thus the problem reduces to one of finding an optimal sequence of operations, subject to some criterion.

4.1.2 Genetic Programming

Genetic programming (GP) is an evolutionary problem-solving method which has been extensively used to evolve programs, or sequences of operations [1].
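As a rough illustration of how a mapping y = f(x) can be encoded as a sequence of operations of the kind GP evolves, the sketch below represents one candidate feature extractor as a small expression tree and evaluates it on an input vector. The tuple-based encoding, the operator set in `OPS` and the names `evaluate` and `size` are hypothetical choices made for this example; they are not taken from the chapter.

```python
import operator

# Hypothetical function set; a real GP system would define its own.
OPS = {"add": operator.add, "sub": operator.sub, "mul": operator.mul}

def evaluate(tree, x):
    """Evaluate an expression tree on input vector x.

    A tree is ('x', index) for an input feature, ('const', value) for a
    constant, or (op_name, left_subtree, right_subtree) for a binary node.
    """
    kind = tree[0]
    if kind == "x":
        return x[tree[1]]
    if kind == "const":
        return tree[1]
    return OPS[kind](evaluate(tree[1], x), evaluate(tree[2], x))

def size(tree):
    """Number of nodes in the tree, one possible measure of its complexity."""
    if tree[0] in ("x", "const"):
        return 1
    return 1 + size(tree[1]) + size(tree[2])
```

For instance, the tree `("add", ("mul", ("x", 0), ("x", 1)), ("const", 2.0))` encodes the one-dimensional feature y = x0 * x1 + 2; an evolutionary search would vary such trees and score each by how well the resulting y separates the classes.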

The discriminability. 4 Feature Extraction Using Multi-Objective Genetic Programming 77 Raw Input Feature Space Feature selection Sub-space of Input Feature Space Feature extraction Transformed Feature Space Classification Classification Space Fig. adding these one-at-a-time to a k -nearest neighbor (k -NN) classifier until the newly added feature fails to improve the classification performance by more than a predefined amount. giving a poor overall performance or vice versa. Pipelined image processing operations to transform multi-spectral input synthetic aperture radar (SAR) image planes into a new set of image planes were evolved by Harvey et al. Bot [4] has used GP to evolve new features in a decision space. Kotani et al. however. Indeed. Training data were used to derive a Fisher linear discriminant and GP was applied to find a threshold to reduce the output to a binary image. Koza [1] has evolved character detectors using genetic programming while Tackett [9] evolved a symbolic expression for image classification based on image features. GP has been used before to design feature extraction. Prototypical pattern recognition system Typically. but also depends on the classifiers in an opaque way such that there is a potential risk that the evolved pre-processing can be excellent but the classifier can be ill-matched to the task in hand. Generalized linear machine. Sherrah et al. [29]. [14] proposed an Evolutionary Pre-Processor (EPrep) system which used GP to evolve a good feature (or feature vector) by minimizing misclassification error. A conventional supervised classifier was employed to classify the transformed features. [23] used GP to determine the polynomial combination of raw features to pass to a k -NN classifier. Krawiec [33] proposed a new method of . The misclassification error over the training set was used as a raw fitness for the individuals in the evolutionary population. This approach not only has a large search space in which to work.1. 
k -nearest neighbor (k -NN) and maximum likelihood classifiers were selected randomly and trained in conjunction with the search for feature extractors. Bot’s method is a greedy algorithm and therefore sub-optimal. is constrained in the discriminant-finding phase and the GP only used as a one- dimensional search tool to find a threshold. in fact. 4. a prospective solution in GP is represented as a parse tree which can be interpreted straightforwardly as a sequence of operations.

using GP to design a fixed-length decision-space vector which protects ‘useful’ blocks during the evolution. Unfortunately, the protection mechanism actually contributes to the over-fitting which is evident in his experiments.

Significantly, all previous work on GP feature extraction has used a single objective comprising either a raw misclassification score or an aggregation of this with a tree complexity measure - see also [25]. The single objective function used in this previous work has a number of shortcomings; indeed, many real-world problems naturally involve the simultaneous optimization of multiple, often conflicting objectives.

4.1.3 Multi-objective Optimization

As we have noted, there are often competing multiple objectives in real world design problems. For a pattern classification system, the obvious design objective is to minimize the misclassification error (over some finite size training set). Unfortunately, unless specific measures are taken to prevent it, the trees in a GP optimization tend to carry on growing in complexity, a phenomenon known as bloat. This results in excessive computational demands, typically accompanied by a degradation in the generalization performance of the trained classifiers. Effectively, bloat is an over-parameterization of the problem. To counter this, various heuristic techniques have been tried to suppress bloat, for example, adding a further objective to penalize more complex individuals. Such strategies stem from Occam’s Razor - for a given error, the simpler individual is always preferred. Ekárt and Németh [18] have shown that embedding tree complexity within a multi-objective GP tends to prevent bloat by exerting selective pressure in favor of trees of smaller size [7]. Thus a multi-objective framework based on Pareto optimality [16] is presented here.

As far as genetic algorithms are concerned, many methods have been proposed to apply multi-objective optimization: e.g. MOGA [19], PCGA [20], SPEA-II [7, 17]. Also see [39] for a detailed review of multi-objective evolutionary algorithms. Both Strength Pareto GP (SPGP) and Pareto Converging GP (PCGP) are applied in the present framework to find the Pareto set for our multi-objective optimization problem - identifying the (near-)optimal sequence of transformations which map input patterns to decision space with maximized separability between the classes. As far as the authors are aware, this is the first report of the comparison between these two evolutionary techniques (generational vs. steady-state) applied to multi-objective genetic programming.

The rest of this chapter is organized as follows: Our generic framework for evolving optimal feature extractors will be presented in Section 4.2. In Section 4.3, we will describe the principal application domain to which the methodology will be applied - the edge detection problem in image processing. In Section 4.4, quantitative comparisons are made with the ‘gold-standard’ Canny edge detector to investigate the effectiveness of the present method on the USF real world dataset [24]. In addition, comparisons between two multi-objective evolutionary strategies - SPGP and PCGP - have been made on five

datasets from UCI Machine Learning database [8]. We offer conclusions and thoughts on future work in Section 4.5.

4.2 Methodology

There are two ways of applying evolutionary optimization techniques in the pattern classification domain - one could be regarded as putting the whole system inside the evolutionary learning loop to directly output class labels. This is a tempting route due to its ‘simplicity’. Unfortunately, it is not as simple as it first appears since the search space is very large, comprising the space of all feature extractors and the space of all classifiers. The other option - and the one we favor - is to treat the feature extraction as a stage distinct from classification. Our target is to invest all available computational effort into evolving the feature extraction and to utilize established knowledge from the relatively well-understood area of classifier design. Consequently, we adopt the approach of evolving the optimal feature extraction and performing the classification task using a standard, simple and fast-to-train classifier. The “fast-to-train” requirement on the classifier comes from the fact that we still have to use classification performance as an individual’s fitness and thus the classifier has to be trained anew within the evolutionary loop for every fitness evaluation. In addition, evolving a distinct feature extraction stage retains the possibility of human interpretation of the evolved meta-features and therefore the use of our method as a data exploration technique.

Our target is to obtain a generic, domain-independent method to fully automate the process of producing optimal feature transformations. Hence we make no assumptions about the statistical distributions of the original data, nor indeed about the dimensionality of the input space. In the context of optimality, features in the transformed space should not only be much easier to separate than in the original pattern space but should also be optimal in the decision space with the classifier employed. If the optimization is effective, we conjecture that our method should yield a classification performance which is at least as good as the best of the set of all classifiers on a particular dataset [6].

4.2.1 Multi-objective Genetic Programming

In order to implement the method, multi-objective genetic programming (MOGP) is used as the search tool driven by Pareto optimality to search for the optimal sequence of transformations. The two evolutionary strategies mentioned above - SPGP and PCGP - are both implemented here on five UCI datasets. The SPGP approach uses a generational strategy with tournament selection to maintain two sets of individuals during evolution, one representing the current population and the other containing the current approximation to the Pareto set. Ranking is done by calculating the strength, or fitness, of each individual in both sets - when

calculating an individual’s strength, we use the method proposed in SPEA-II [7]. In our modified binary tournament selection method, two individuals from the union of the population and the non-dominated set are randomly selected. If both trees have been drawn from the same set we compare the normalized fitness to determine the winner; if not, the raw fitness vector is used to decide which should be chosen. Detailed information can be found in [5].

For the steady-state PCGP strategy, we followed the method designed in [20] as PCGA with straightforward modifications for use with GP trees. The Pareto Converging Genetic Algorithm (PCGA) ensures population advancement towards the Pareto-front by naturally sampling the solution space. The motivation for this approach is aimed particularly at solving difficult (real-world) problems with local minima. For more detailed information see [20].

We have used a fairly standard chromosomal tree structure for the MOGP implementation. There are two types of nodes in each individual - function nodes and terminal nodes. Details of node types are summarized in Table 4.1. Non-destructive, depth-dependent crossover [13] and a depth-dependent mutation operator were used to avoid the breaking of building blocks. Here a set of candidate sub-trees is chosen to mutate based on their depth in the tree using the depth-fair operator [13]. Then, one particular sub-tree from this set is selected, biased by its complexity (i.e. the number of nodes), using roulette wheel selection; that is, we favor mutating more complex trees.

The population size for SPGP in all experiments presented here was 500 while a population of only 200 individuals was used for PCGP. The stopping criterion for SPGP is that any of the following is met:
• The maximum number of generations is exceeded (500 in present case)
• The evolution process stops as adjudged by the Bayes error of the best individual failing to improve for 0.04 × the maximum number of generations (the 0.04 value is an empirical value determined by experiment)
• The misclassification error = 0.

For PCGP, the termination condition was set simply as exceeding the maximum number of generations, 2000 in present work. In comparing the two evolutionary strategies, we are aiming to focus on the issue of the search convergence. Note that when fully converged, all the non-dominated individuals in the PCGP population comprise the Pareto-front. For detailed information see [20].

4.2.2 Multiple Objectives

Within the multi-objective framework, our three-dimensional fitness vector of objectives comprises: tree complexity, misclassification error and Bayes error, as follows:

Table 4.1. MOGP Settings

Terminal set          Input pattern vector elements
Constant              10 floating point numbers 0.0, . . . , 1.0
Raw fitness vector    Bayes error, misclassification error, number of nodes
Standardized fitness  Strength-based fitness
Original population   Half full-sized trees, half random trees
Original tree depth   5
Probabilities         0.3 mutation, 0.7 crossover

Tree Complexity Measurement

As pointed out above, tree bloat in GP can produce trees with extremely small classification errors over the training set but a very poor error estimated over an independent validation set. This is clearly an example of the familiar over-fitting phenomenon and, grounded on the principle of Occam’s Razor, we use the node count of a tree as a straightforward measure of tree complexity. Within the concept of Pareto optimality in which all objectives are weighted equally, this complexity measure exerts a selective pressure which favors small trees.

Misclassification Error

In addition to the tree complexity measure, the fraction of misclassified patterns counted over the training set is used as a second objective [33]. With the aid of the two competing objectives, we are aiming to maximize the class separability in the decision space using the simplest possible mapping. Under Pareto optimality [16], all objectives are treated as equally important since we are aiming to explore the trade-off between the complexity and the misclassification error (over the training set). Since the GP tree chromosomes used here naturally lead to an n-to-1 mapping into the one-dimensional decision space, a simple threshold in this decision space is adapted during the determination of fitness value to obtain the minimum misclassification error. This means we are trying to evolve an optimal feature extractor conditioned on a thresholding classifier operating in the decision space.

Nonetheless, during our early experiments, we found that learning is often very slow and sometimes the optimization fails to converge at all. After detailed investigation into the searching process, we found that in the initial stages, when all the randomly-created individuals possessed roughly the same (very high) misclassification error, there was insufficient selective pressure to pick individuals with slightly more promise than the irredeemably poor performers. Thus the search stagnated. Consequently, the third objective of Bayes error was investigated as an additional measure of inter-class separability.
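The thresholding step described under Misclassification Error can be sketched as follows. This is only an illustration under stated assumptions, not the authors' implementation: the function name `min_threshold_error`, the use of NumPy, the 0/1 label encoding and the choice to try both polarities of the threshold are all hypothetical.

```python
import numpy as np


def min_threshold_error(z, labels):
    """Smallest training error achievable by one threshold on 1D outputs.

    z      : outputs of a (hypothetical) GP tree, one per training pattern
    labels : 0/1 class labels for the same patterns
    Returns (error_fraction, threshold). Both polarities are tried, i.e.
    class 1 may lie above or below the threshold.
    """
    z = np.asarray(z, dtype=float)
    labels = np.asarray(labels, dtype=int)
    n = z.size
    order = np.argsort(z, kind="stable")
    z_sorted = z[order]
    y_sorted = labels[order]

    # ones_left[k]: number of class-1 patterns among the k smallest outputs
    ones_left = np.concatenate(([0], np.cumsum(y_sorted)))
    zeros_left = np.arange(n + 1) - ones_left
    total_zeros = n - ones_left[-1]

    # Rule "class 1 if z > t": class-1 on the left and class-0 on the right
    # of the cut are both errors. The opposite polarity makes the rest wrong.
    err_above = ones_left + (total_zeros - zeros_left)
    err_below = n - err_above

    # A cut is only realizable between two distinct output values
    valid = np.ones(n + 1, dtype=bool)
    valid[1:n] = z_sorted[1:] != z_sorted[:-1]
    errors = np.where(valid, np.minimum(err_above, err_below), n + 1)

    k = int(np.argmin(errors))
    if k == 0:
        threshold = z_sorted[0] - 1.0
    elif k == n:
        threshold = z_sorted[-1] + 1.0
    else:
        threshold = 0.5 * (z_sorted[k - 1] + z_sorted[k])
    return errors[k] / n, float(threshold)
```

Because the outputs are sorted once and the error counts are accumulated, every candidate threshold is evaluated in a single pass, which matters when the fitness has to be recomputed for every individual in every generation.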

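A histogram-based estimate of the Bayes error objective introduced above might look like the sketch below. It is an assumption-laden illustration rather than the procedure actually used: the function name `histogram_bayes_error`, the default of 50 bins, and the use of training-set class frequencies as priors are all hypothetical choices.

```python
import numpy as np


def histogram_bayes_error(z, labels, bins=50):
    """Estimate the overlap of two class-conditioned densities in 1D.

    z      : outputs of a (hypothetical) GP tree, one per training pattern
    labels : 0/1 class labels for the same patterns
    bins   : number of equal-width histogram bins (an assumed setting)
    In each bin the smaller of the two class counts is taken as the number
    of patterns a Bayes-optimal rule would still get wrong.
    """
    z = np.asarray(z, dtype=float)
    labels = np.asarray(labels, dtype=int)
    lo, hi = z.min(), z.max()
    if lo == hi:
        # Degenerate output: widen the range so that bin edges are distinct
        lo, hi = lo - 0.5, hi + 0.5
    edges = np.linspace(lo, hi, bins + 1)
    counts0, _ = np.histogram(z[labels == 0], bins=edges)
    counts1, _ = np.histogram(z[labels == 1], bins=edges)
    return np.minimum(counts0, counts1).sum() / z.size
```

An estimate of this kind can be zero for interleaved outputs that no single threshold separates, for example `histogram_bayes_error([0, 1, 2, 3], [0, 1, 0, 1], bins=4)`, so it measures overlap only at the resolution of the bins.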
Bayes Error

The search performance of all evolutionary algorithms is critically dependent on the use of appropriate fitness functions. Our motive for choosing the Bayes error as an objective is because it is the fundamental lower bound on classification performance, independent of the class distributions and classifier. In a two class problem, we map the n-dimensional input pattern space into the 1D decision space forming two class-conditioned probability density functions (PDFs) in the decision space and the overlapping region(s) of these two class-conditioned PDFs can be used to estimate the Bayes error with a simple histogramming procedure.

If the Bayes error is used as a direct replacement for the misclassification error, the optimization converges rapidly - clearly the Bayes error exerts more sensitive selective pressure in the situation where the population is far from convergence. Unfortunately, the subsequently estimated validation error was disappointingly high. Further investigation revealed that in minimizing the overlap between two class-conditioned PDFs, the GP often achieved this goal by producing two PDFs with non-coincident, ‘comb’-like features as illustrated in Figure 4.2. The Bayes error is indeed small when calculated over the training set but this does not generalize to an independent validation set. Clearly what was desired were two well-separated PDFs although the GP was meeting this goal in an unintended and unhelpful way - such opportunistic behavior has been observed previously in evolutionary algorithms.

As a consequence, we employed both misclassification error and the Bayes error estimate in a three-dimensional fitness vector. The Bayes error allows the evolutionary search to make rapid initial progress after which the misclassification error objective appears to provide selective pressure which separates the transformed distributions. We have also observed that on those occasions when the optimization has got temporarily stuck in a local minimum, it is the Bayes error which improves first and appears to ‘lead’ the algorithm out of the local minimum.

4.3 Application Domain

4.3.1 Edge Detection in Image Processing

Edge detection is an important image processing technique in many applications, such as object segmentation, shape recognition, etc. Edge detection is a well-researched field which should be suitable for assessing the effectiveness of the evolved feature extraction method in comparison to an established and conventional method, e.g. the Canny edge detector. The Canny algorithm is widely held to be a ‘gold standard’ among edge detection algorithms [21, 22, 24], and so we chose this well-understood problem as the starting point to demonstrate our methodology. Harris [34] used GP with a single objective

to evolve an ‘optimal’ edge detector but terminated the evolution when he obtained a performance comparable to the Canny detector. See also [10, 11].

Fig. 4.2. Example of ‘comb’-like class-conditioned densities evolved using the Bayes error metric alone (two panels, Class 1 and Class 2; horizontal axis: GP Transformed Value; vertical axis: Likelihoods)

We used a synthetic dataset for training on the edge detection task because Chen et al. [30] concluded that hand-labeling a real image training set did not adequately sample the pattern space, leading to deficient learning. We have followed a very similar approach of synthesizing a training set from a physically realistic model of the edge imaging process. Further details can be found in [30, 31, 32]. Three distinct types of patterns are identified: edges, non-obvious non-edges and uniform patches. We have employed an image patch size of 13 × 13 in the generated training set - probably larger than is needed - deliberately to investigate whether the most useful features would be selected by the objective of minimizing the tree size, and by implication, the number of raw input features used. The training set comprised 10,000 samples and

the realistic figure of 0.05 is chosen for the prior probability of the edge class.

In order to investigate the labeling performance on real image data, we have applied the GP-generated edge detector to images taken from the ground-truth labeled USF dataset [24] and drawn comparison with the Canny edge detector with and without NMS; the results are presented in Section 4.4. Detailed comparison using synthetic datasets between the MOGP edge detector and the Canny algorithm has been made elsewhere [5]. See also [6].

The Canny algorithm includes a significant number of post-processing steps, such as non-maximal suppression (NMS), which appear to be responsible in large measure, if not completely, for its superiority over other conventional edge detectors [32]. The NMS step is a heuristic sequential and spatially-localized classification stage, which serves to reduce the fraction of false positives (FPs) with a slight attendant sacrifice in the fraction of true positives (TPs). Our interest here is a domain-independent methodology and a step such as NMS is very much a heuristic specific to the image processing domain; nonetheless we have included NMS in our comparisons since it is held to be an integral part of the Canny algorithm.

Whereas our GP detector has no adjustable parameters (once trained), the edge labelling threshold is a user-defined tuning parameter in the Canny algorithm. Thus fair comparison is not completely straightforward. The principled basis we have chosen for comparison between the GP-generated and Canny edge detectors is the operating points which minimize the Bayes risk for both detectors. The operating point which naturally emerges from our GP algorithm is that which minimizes the misclassification error over the training set which was constructed with edge prior of 0.05. For the Canny algorithm we can locate the corresponding decision threshold by plotting the Bayes risk versus threshold to locate the optimal operating point (i.e. the minimum Bayes risk) [15]. The Bayes risk can be written as:

R = P × (1 − TP) + (1 − P) × FP    (4.2)

where P is the prior of edge, and TP and FP are the fractions of true and false positives, respectively. Generally, both TP and FP will vary with threshold and other operating conditions. Through locating the minimum of the risk-threshold plot we can identify the ‘optimal’ operating point of the detector at the given prior.

Note we have used the assumption of equal costs. In fact, cost ratios are always subjectively chosen and vary from application to application, biasing the classifier operating point. For example, in medical image processing, the cost of a false negative may be unacceptably high and so a suitable cost would be used. In other applications, false negatives resulting in line fragmentation may be tolerable in order to keep processing times below some limit. Here we adopt a neutral position of using a cost ratio of unity since we have no basis for regarding one sort of error as more or less important than any other.
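The risk-versus-threshold search described above can be sketched in a few lines. This is a minimal illustration assuming equal costs as in Eq. (4.2); the function names `bayes_risk` and `min_risk_threshold`, the convention that a pixel is labeled an edge when its score exceeds the threshold, and the explicit grid of candidate thresholds are assumptions made for the example.

```python
import numpy as np


def bayes_risk(tp, fp, prior):
    """Equal-cost Bayes risk of Eq. (4.2): R = P(1 - TP) + (1 - P)FP."""
    return prior * (1.0 - tp) + (1.0 - prior) * fp


def min_risk_threshold(scores, is_edge, prior, thresholds):
    """Scan candidate thresholds and return (risk, threshold, TP, FP).

    scores     : detector response per pixel (hypothetical input)
    is_edge    : ground-truth booleans; both classes must be present
    prior      : assumed prior probability of the edge class
    thresholds : iterable of candidate thresholds to evaluate
    """
    scores = np.asarray(scores, dtype=float)
    is_edge = np.asarray(is_edge, dtype=bool)
    best = None
    for t in thresholds:
        detected = scores > t
        tp = detected[is_edge].mean()    # fraction of true edges detected
        fp = detected[~is_edge].mean()   # fraction of non-edges detected
        risk = bayes_risk(tp, fp, prior)
        if best is None or risk < best[0]:
            best = (float(risk), float(t), float(tp), float(fp))
    return best
```

With a prior of 0.05, a detector that labels nothing as an edge already achieves a risk of 0.05, so a useful operating point has to fall below that value.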

4.3.2 UCI data

In addition to the edge detection task, we have applied our method to five other datasets from the UCI Machine Learning databases [35] in Section 4.4.2 where we make comparison between the two evolutionary strategies - SPGP and PCGP - to investigate the relative merits of generational and steady-state evolutionary techniques. The datasets used in the current work are:

• BUPA Liver Disorders (BUPA): To predict whether a patient has a liver disorder. There are two classes, six numerical attributes and 345 records.
• Wisconsin Diagnostic Breast Cancer (WDBC): This dataset has been discussed before by Mangasarian et al. [26]. 569 examples with thirty numerical attributes.
• Pima Indians Diabetes (PID): All records with missing attributes were removed. This dataset comprises 532 complete examples with seven attributes.
• Wisconsin Breast Cancer (WBC): Sixteen instances with missing values were removed. 683 out of original 699 instances have been used here. Each record comprises ten attributes.
• Thyroid (THY): This dataset includes 7200 instances with 21 attributes (15 are binary, 6 are continuous) from 3 classes. It has been reconfigured as a two-class problem with 166 instances from class1 and the remaining 7034 instances from non-class1.

Datasets from this list have been used previously in [27] and discussed by [36]. For convenience, the details of the datasets used in the current work are summarized in Table 4.2.

Table 4.2. Five UCI Datasets

Name   Number of Features   Size and Distributions
BUPA   6                    345 = 200 (Benign) + 145 (Malignant)
WDBC   30                   569 = 357 (Benign) + 212 (Malignant)
PID    7                    532 = 355 + 177 (Diabetic)
WBC    10                   699 = 458 (Benign) + 241 (Malignant)
THY    21                   7200 = 166 (Class1) + 7034 (Others)
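One way to read the class distributions in Table 4.2 is to compute, for each dataset, the error of a trivial rule that always predicts the larger class, since the datasets differ widely in balance. The sketch below uses only the counts printed in the table (so the WBC row uses the original 699 instances); the dictionary `TABLE_4_2` and the function `majority_baseline_error` are hypothetical names introduced for illustration.

```python
# Class counts copied from Table 4.2 as (first class, second class)
TABLE_4_2 = {
    "BUPA": (200, 145),
    "WDBC": (357, 212),
    "PID": (355, 177),
    "WBC": (458, 241),
    "THY": (166, 7034),
}


def majority_baseline_error(counts):
    """Error fraction of always predicting the more frequent class."""
    return min(counts) / sum(counts)


if __name__ == "__main__":
    for name, counts in TABLE_4_2.items():
        size = sum(counts)
        baseline = majority_baseline_error(counts)
        print(f"{name}: {size} records, majority-class error {baseline:.3f}")
```

For the heavily imbalanced THY data this baseline is already small, which is worth keeping in mind when comparing misclassification errors across datasets.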

3886 0.0047 It is again straightforward to obtain the labeling performance of the GP detector without NMS and these results are shown in Table 4. As pointed out in our previous work [6].4326 0.4433 0.0048 0.4.0049 Table 4.4.0003 0.[6] which assumes an edge prior of 0.4.4 0. GP [TP. To determine the GP performance with NMS we have devised a special method of carrying- out non-maximal suppression on the output of the GP detector.0466 4.080 0.080 0.0129 0.5 0.4151 0.0003 0.0061 4.6(a).3657 0.066 0.[32] .3 for each of the USF images shown in Figs 4. The labeling performance is summarized in Table 4. we .0083 4.6228 0.0204 0. Canny [TP.3821 0.0298 0.0052 4. fair comparisons turn-out to be somewhat harder than first appear.the GP detector assumes this same prior therefore neither detector is comparatively disadvantaged.6 0.0278 0. Table 4.3415 0.5581 0. FP] Operating Points for USF Test Images Figure Edge Prior Without NMS With NMS TP FP TP FP 4. we have assessed the performance of both detectors with and without the NMS post-processing step.1 Comparisons on USF Datasets In order to examine the performance of the GP edge detector as well as make fair comparison with the Canny algorithm.0246 0.5045 0.00036 4. Zhang and P.3 0. FP] Operating Points for USF Test Images Figure Edge Prior Without NMS With NMS TP FP TP FP 4.5 0.4 0.0695 4.4.0003 0.0063 0.3 0. We have used the same optimal threshold as determined over the synthetic data [5].4 Results 4.211 0.3.have been hand-labeled as belonging to a distinct “don’t care” class.86 Y.087 0.6 0. The principal complication here is that the USF dataset has been subjectively censored since the non-obvious non-edge patterns [24].3388 0.087 0.3580 0. Hence we have used the following methodology: Quantifying the performance of the Canny detector over the USF images is straightforward.0001 0. 
We are trying to compare the GP edge detector (with or without NMS) to the Canny edge detector (with or without NMS) at the detector operating points which minimize the Bayes risk.211 0. First.3(a) .which are the patterns most likely to be confused by any classifier . I Rockett 4.0142 0.066 0.05 .

after NMS.06848 0.5 0.04785 0. The reason seems to be that the noise level in this image is much lower than the others. Bayes Risk Comparisons for USF Test Images Figure GP Canny Without NMS With NMS Without NMS With NMS 4.this turns-out not to be the case since NMS significantly changes the ROC of the Canny edge detector and hence its optimal operating point. an observation consistent with the preceding results on synthetic edge data reported in [6].6 where. The ‘distance’ of a given edge pixel’s response from the decision threshold can be taken as a measure of its edge strength and we perform non-maximal suppression on this quantity. The Canny algorithm. the GP has a slightly higher risk.0871 0. gives much bigger FP values.0786 0.4 0. however. The only exception to this is the image in Fig. the Canny algorithm has a much higher TP fraction although the FP fraction also increases. In particular. Comparisons of the Bayes risk figures are shown in Table 4.6 0.10933 4. From the comparisons we can see that before NMS. the TP values are very low. Hence the success of this algorithm owes very little to the feature extraction (pre-processing) step. It is apparent from Table 4. We then quantize the edge direction into one of the eight principal directions of the compass and examine the decision variable responses of the GP detector for the three pixels centered on the edge and (approximately) normal to the edge direction.04694 With NMS.06596 0.09630 4. The results are again summarized in Table 4. there is a clearer optimal operating point in terms of minimization of Bayes risk. 4.5 from which it can be seen that the risk values of the Canny algorithm are higher than for the GP.4. after NMS the (increased) FP fraction becomes the principal contributor to the Bayes risk. Consistent to the conclusion made in [32] that there is no clear minimum risk operating point for the Canny algorithm without NMS.05768 0. 
The received wisdom in the image processing community would suggest that the difference between the detector output with and without NMS can be explained by the thinning of otherwise thick edges . Table 4. the two algorithms give comparable TP values.2129 0. with or without NMS.06276 0. Indeed. GP gives larger TP values while after NMS.07420 0.13381 4.3 0. a finding which will be a surprise to many in the image processing community.11774 0.3 that the Canny algorithm without NMS performs very poorly.08977 0.5. allowing the Canny algorithm to yield a higher TP value while having an FP value roughly equal to the GP . This is again consistent with the result reported in [32] that after the NMS step. 4 Feature Extraction Using Multi-Objective Genetic Programming 87 estimate the orientation of an edgel using the familiar difference-of-boxes op- erator.05214 0.

detector. In fact, it is not statistically significantly different using t-test under 95% confidence level. The final labeled images in Fig. 4.6 (e, f) for Canny and GP (with NMS) look much more similar than for any of the other images from the USF dataset.

Whole image labeling results for four typical USF images using the Canny and GP detectors (with and without NMS) are shown in Fig. 4.3 to Fig. 4.6. In these figures, (a) denotes the original image from the USF dataset, (b) shows the ground truth data, (c) shows the labeling results from the Canny detector without NMS and (d) shows images labeled with the GP feature extractor, (e) shows images from the Canny edge detector with NMS, and (f) shows the output from the GP feature extractor with the gradient-direction based NMS. In these labeling results, it is striking that Fig. 4.3(c) contains only six labeled pixels in the top right hand corner of the image, and the same poor performance in the other corresponding (c) figures is consistent with the results from Table 4.3. These results further confirm our conclusion that the Canny edge detector’s performance is not due to the feature extraction stage but to the sophisticated post-processing steps coupled with subjectively - and implicitly - set cost ratios.

In the USF dataset, a number of regions have been labeled as “don’t care” - the white regions in Fig. 4.3b, 4.4b, 4.5b, and 4.6b are the distinct “don’t care” class. For the edge detection problem we have concluded that it is these non-obvious non-edge (NONE) patterns labeled as “don’t care” in the USF images which make the classification task difficult. This is consistent with the observation of Konishi et al. [22] that the USF images are easier to label than the Sowerby dataset which these authors also considered and which has no “don’t care” regions.

Discussion

The evolution described here, driven by multiple objectives, is able to generate separated class-conditioned distributions in the 1D decision space. In contrast to the hand-crafted, heuristic post-processing steps of the Canny edge detector [21], we concentrate our computational resource on optimizing the feature extraction stage to yield greater separability in the mapped feature space. Further, compared to the Canny detector which is based on extensive domain knowledge - see Canny’s original work on deducing the ‘optimal’ filter kernel [21] - we did not supply any domain-dependent knowledge to MOGP apart from the carefully constructed training set. This lends evidence to support our conjecture that our method is able to automatically produce (near-)optimal feature extraction stages of a classification system.

Although feature selection was not explicitly intended in our approach, we have deliberately employed an overly large image patch (13 × 13) to investigate how the GP selects input features within this patch given that one of our fitness objectives is to minimize tree size and, therefore, the number

of leaf nodes (i.e. individual pixel intensity values). A histogram of the number of times each pixel was used by the trees in a typical converged Pareto set is shown in Fig. 4.7, where the central pixel of the 13 × 13 patch has the row and column indices of (0, 0). This figure illustrates that the MOGP optimization has a strong tendency to select pixels from around the center of the image patch, a fact which is intuitively pleasing because most of the edge/non-edge discriminatory information can be considered to come from this region. We reiterate that we have not embedded any domain knowledge in this optimization. Thus we believe that feature selection is occurring as a beneficial by-product of the feature extraction process due to the way we have incorporated parsimony into the optimization.

Fig. 4.3. Panels (a) to (f) illustrate the comparisons from the Canny edge detector and GP; for details refer to the text.

4.4.2 Comparison of Generational and Steady-state Evolutionary Algorithms on UCI Datasets

As mentioned in the introduction, evidence to substantiate the generic property of our method has been demonstrated in previous work [6] where we have shown that our methodology, implemented using the SPGP algorithm, exhibited either superior or equivalent performance to nine conventional classifiers. Here we report the results of a preliminary investigation into the efficacy and performance of different evolutionary techniques within the proposed framework. SPGP and PCGP have been compared on five typical problems from the

UCI database [35]. As far as we are aware, there has been little detailed analysis and comparison of multi-objective evolutionary strategies on real world GP problems. We have applied SPGP and PCGP to each of the problems listed in Table 4.2. The GP settings are the same as those listed in Table 4.1 except that the terminal nodes are now the elements in the pattern vectors of the five UCI datasets rather than image pixel values.

Fig. 4.4. Panels (a) to (f) illustrate the comparisons from the Canny edge detector and GP; for details refer to the text.

Fig. 4.5. Panels (a) to (f) illustrate the comparisons from the Canny edge detector and GP; for details refer to the text.

For the generational SPGP algorithm,

we used a population size of 500 and a maximum of 500 generations; the three possible stopping criteria listed in Table 4.1 are reused. For the steady-state PCGP algorithm, only 200 individuals were used in the population, while for simplicity, the stopping criterion used was to run for a fixed number of 2000 generations. This means that PCGP is performing around one-tenth the number of fitness evaluations of the SPGP algorithm and is thus comparatively disadvantaged;

Fig. 4.6. Panels (a) to (f) illustrate the comparisons from the Canny edge detector and GP; for details refer to the text.

Fig. 4.7. Histogram of numbers of pixels used in the GP trees in a typical Pareto set, relative to the 13 × 13 input patch (horizontal axes: Row Index and Column Index, from -6 to 6; vertical axis: Number).

this was an intentional feature of this preliminary study to

determine if approximately equivalent solutions could be obtained with PCGP but with much less computing resource. A sophisticated method of gauging actual convergence of PCGA has been discussed by Kumar & Rockett [20].

The mean error comparisons between the two evolutionary algorithms over 10 repetitions are summarized in Table 4.6, where ME stands for misclassification error from optimal thresholding in the decision space after feature extraction; the corresponding Bayes error estimates (BE) are listed to give an indication of the degree to which the misclassification error approaches its fundamental lower bound. Table 4.6 contains only data from the non-dominated solutions from each of the evolutionary paradigms, albeit at an intermediate stage for PCGP. We have applied Alpaydin's F-test [28] to these data from which we conclude that none of these differences in error is statistically significant. What is a notable and significant difference between these two algorithms is the mean number of nodes required to obtain 'identical' misclassification errors. The mean numbers of nodes for the steady-state PCGP approach are very much smaller than for the generational SPGP algorithm. We have observed that some inactive/redundant sub-trees exist in solutions evolved by SPGP [6] and the fact that we generate significantly smaller trees using PCGP implies that PCGP is much more effective in controlling tree bloat than SPGP. This result saves a lot of computational effort during evolution as well as having a practical implication.

Table 4.6. Mean Error Comparisons Between SPGP and PCGP on Five UCI Datasets
Columns: Datasets; SPGP (BE, ME, Mean Nodes); PCGP (BE, ME, Mean Nodes). Rows: BUPA, PID, THY, WBC, WDBC.
Cell values as extracted (cell order lost): 343 THY 0.019 0.964 ± 43.492 BUPA 0.320 0.633 ± 40.870 ± 1.022 6.272 58.278 7.830 ± 5.185 0.173 0.025 0.401 WBC 0.028 0.417 0.025 24.242 PID 0.133 0.299 0.071 ± 46.928 ± 54.028 36.220 ± 5.028 7.647 WDBC 0.0058 11.203 32.636 0.219 0.0044 0.160 0.014 0.200 3.0044 0.0061 37.941 ± 43.785 ± 5.280 ± 2.

To further explore the differences between the 'quality' of the solutions produced by these two evolutionary paradigms, we have plotted the misclassification error versus the number of nodes for the final solutions produced by the two algorithms on each of the datasets in Figures 4.8 to 4.12. Our aim here is to compare the coverage and sampling of the Pareto front produced by the two algorithms. Although the SPGP results shown are this algorithm's best approximation to the Pareto front, for PCGP we have plotted the whole population which, in practice, contains a number of dominated solutions. Note that these figures do not depict Pareto-optimal sets per se since these plots are a two-dimensional representation of a three-dimensional Pareto front. This in part explains why a number of solutions appear at first glance to be dominated (in the Pareto

3 0.30 0. 4.2 0.6 Training Pareto-front Training Pareto-front Validation Pareto-front Validation Pareto-front 0. 4.25 0.30 Validation Pareto-front Misclassification Error 0.10.15 0.8.30 0.40 Training Pareto-front Training Pareto-front Validation Pareto-front Validation Pareto-front 0.20 0. Number of nodes for the members of the Pareto sets generated by SPGP (left) and PCGP (right) on the WBC dataset .2 0.1 0.4 0.35 Misclassification Error Misclassification Error 0. Number of nodes for the members of the Pareto sets generated by SPGP (left) and PCGP (right) on the BUPA dataset 0.3 0.00 0 10 20 30 40 50 60160 170 180 Nodes Fig.4 0. Misclassification error vs.35 0.35 Training Pareto-front 0.20 0.05 0.5 Misclassification Error Misclassification Error 0.15 0.9. 4.1 0 10 20 30 40 50 60 70 80 90 100 110 120 130 140 150 160 0 10 20 30 40 50140 145 150 155 160 Nodes Nodes Fig.25 0.20 0. Number of nodes for the members of the Pareto sets generated by SPGP (left) and PCGP (right) on the PID dataset 0.5 0. Misclassification error vs.40 0. 4 Feature Extraction Using Multi-Objective Genetic Programming 93 0.25 0. Misclassification error vs.6 0.10 0.15 0 10 20 30 40 50 60 70 80 90 100180 190 200 0 5 10 15 20180 185 190 195 200 Nodes Nodes Fig.

sense) whereas in fact, they are actually non-dominated if one takes the third (Bayes error) objective into account. (Indeed, many of the PCGP solutions actually are dominated within their population.) From Figures 4.8 to 4.12, it is apparent that PCGP produces a solution set which is much more strongly centered on lower node numbers. This preliminary comparison of PCGP is thus extremely promising since even in the state of not being fully converged and only having had around one-tenth the computing resources invested in it, PCGP is able to produce solutions which are significantly smaller than those from SPGP but with indistinguishable error rates.

Figs. 4.11 and 4.12. Misclassification error vs. number of nodes for the members of the Pareto sets generated by SPGP and PCGP, one figure each for the WDBC and THY datasets. Each panel shows the training Pareto-front and the validation Pareto-front. (Axes: Nodes, Misclassification Error.)

Tree Interpretation

By way of example, we show a typical GP-generated tree in Figure 4.13. This individual was a non-dominated population member obtained from a PCGP run on the THY dataset; we present the tree as it was generated without pruning. The leaf nodes in Figure 4.13 are labeled as Xn, for n ∈ [1..N], where N is the number of raw attributes in the input pattern vector; for the THY

dataset here, N = 21 (see Table 4.2). All the function nodes in the graph are elementary functions except pow2, which calculates the child value raised to the power of two, and if else, which returns the second child value if the first child value is 0, otherwise it returns the third child value. Each input pattern will be fed into the tree and will return a scalar value as the extracted feature. The value returned from the tree root, − (minus), is the transformed feature value in the one-dimensional decision space in which the separability between classes is maximized. Again, the set of raw input features has been selected during the optimization process which seeks to minimize the number of tree nodes. Surprisingly, with this dataset only two original elements can be used to extract the optimal new feature to obtain the desirable classification performance. Actually, in the example tree shown in Figure 4.13, the sub-tree containing if else could be simplified as X19. Although it is trivial to remove this obvious redundancy by hand (or even as a non-evolutionary post-processing stage), it appears that even the PCGP evolutionary strategy has the propensity to generate inactive code.

Fig. 4.13. One typical evolved GP feature extractor on THY dataset

4.5 Conclusions and Future Research Work

In the present work we conjecture that efficient feature extraction can yield a classification system, the performance of which is at least as good as the best of all available classifiers on any given problem. In essence, the evolutionary algorithm is 'inventing' the optimal classifier for a given labeling problem/dataset. It is, of course, entirely feasible that the genetic optimization is re-inventing an existing classifier rather than devising a novel algorithm but this is of little practical consequence. When confronted with a new dataset, the common approach among pattern recognition practitioners is to empirically try a

range of classification paradigms since there is no principled method for predicting which classifier will perform best on a given problem; there is always the chance that better performance could be obtained with another, but untried classifier. Our conjecture that multi-objective evolutionary optimization produces the best possible classifier effectively eliminates the risk from this trial-and-select approach. Needless to say, although the results presented here and in previous work [6] support our conjecture, by definition, we cannot offer a definitive proof.

We have demonstrated the use of multi-objective genetic programming (MOGP) to evolve an "optimal" feature extractor which transforms the input patterns into a decision space such that pattern separability in this decision space is maximized. In the present work we have projected the input pattern to a one-dimensional decision space since this transformation naturally arises from a genetic programming tree although potentially, superior classification performance could be obtained by projecting into a multi-dimensional decision space [14]; this is currently an area of active research.

We have presented the application to edge detection, for example, in which our method yields superior results to the very well-established and mature Canny algorithm. Also of considerable importance is the fact that we do not require any domain knowledge of the application. The feature extraction transformation sequence is identified automatically, driven solely by the optimality criteria of the multi-objective algorithm. The generic property of the presented method enables it to be straightforwardly applied to other application domains.

Although the principle of evolutionary optimization is straightforward, exact implementation is still an open research issue. Here we present the preliminary results of applying two different evolutionary paradigms, one generational (SPGP) and one steady-state (PCGP). The differences between the misclassification errors attained with the two types of genetic search are not statistically significant although the complexity of the generated feature transformation sequences is markedly different. Even where the PCGP algorithm has not fully converged, we find that the steady-state method produces much smaller trees which implies that PCGP is more responsive to the objective trying to minimize tree size. The concentration of the solutions at low node numbers also means that the sampling of the Pareto front is better with PCGP. (Previous comparisons between PCGA and other multi-objective GAs imply a similar, fundamental superiority in the steady-state approach [20].) Faster convergence and better control of the tree-bloating make the PCGP approach very promising for practical applications.

Finally, as with all multi-objective optimizations, what arises from our method is a family of equivalent solutions, the Pareto set, which presents the system designer with the trade-off surface between classification performance and the complexity of the feature extractor. Exactly which pre-processing solution is selected for the final application will depend on the generalization properties of the solutions; invariably one would like the solution which is

Systems. Zhang and P. In:Joint Pro- ceedings of the First European Workshops on Evolutionary Image Analysis. Huang. Italy. Predicting generalization performance is very much a major and continuing issue in statistical pattern recognition .A. Evolving a task specific image operator. 2000 [3] D. O. Ito. IEEE Trans. pp. M. Tackett. pp. C. Ebner and A. In: Congress on Evolutionary Computation (CEC 2001). T. Heath. Morgan Kaufmann. Liu. Paris. Proceedings CVPR ’96. Genetic algorithm optimized feature transformation . Genetic programming for feature discovery and image discrim- ination.D. Multiobjective genetic program- ming: Reducing bloat using SPEA2. A comparison of feature extraction and selection techniques. Cambridge. Huang. 2005 (submitted) [7] S. Duda and P. Paris. In: GECCO 2003. 1999 [11] M. Turkey. 303-309. pp. Man. and S. Zhang and P.J. Evolving optimal feature extraction using multi- objective genetic programming: A methodology and preliminary study on edge detection.ultimately we would like to formulate a measure of classifier generalization which potentially. Bot. 212–215. 14-15.A comparison with different classifiers. Rockett. S. A Generic Optimal Feature Extraction Method using Multiobjective Genetic Programming: Methodology and Applications. Thiele. 2003 [4] M. On the evolution of interest operators using genetic programming. Iba. 4 Feature Extraction Using Multi-Objective Genetic Programming 97 best able to predict class of as-yet unseen patterns with the greatest accuracy. Wermter and G. Pei. In: Proceedings of the First European Workshop on Genetic Programming. LNCS 2724. Sweden. 1993 [10] M. In: Proceedings of the International Conference on Artificial Neural Networks. Proceedings of EuroGP’2001.I.Zitzler. In: Late Breaking Papers at EuroGP’98: the First European Workshop on Ge- netic Programming. The MIT Press. and Cybernetics. E.Brack. 2003 [9] W. Wiley-Interscience. France. Ebner. Zell. S. pp. Lake Como. Rockett. 536-543. 
Genetic Programming II.W. I. 2005 [6] Y. Feature extraction for the k-Nearest neighbor classifier with genetic programming. E. Comparison of edge detectors: A methodology and initial study. Arevian. Y. 1994 [2] R. Signal Processing and Telecommunications (EvoIASP’99 and EuroEcTel’99). 2001 [5] Y. Goodman. J. LNCS. References [1] J. L. In: Computer Vision and Pattern Recognition. 1998 [12] M. pp. pp. Hart and D. Istanbul. This remains an area of open research. Springer-Verlag. Göteborg. 74-89. Massachusetts. Automatic Discovery of Reusable Pro- grams. G. In: Proceedings of the Fifth International Conference on Genetic Al- gorithms. In: Genetic Programming. 1996 [13] T. 6-10. Sanocki. and E. Bowyer. pp. Stork. Pattern Classification (2nd Edi- tion). In: GECCO 2005. 1998 . and G. Supplementary Proceedings. pp. Non-destructive depth-dependent crossover for ge- netic programming. Bleuler. Sato.I. pp. 2001 [8] Z. could be optimized along with the other multiple objectives. 256-267. Koza. 143-148. Sarkar. M. 2121-2133. and K. pp. Addison. 795-802.R.

Cancer diagnosis via linear program- ming.. 1999 [17] E. P. USA. J. Mangasarian. A. In: Congress on Evolutionary Computation. 8. vol. R. C. pp. 25.. J. 1997 [15] T. DC. no. J. 40. 2. 679-698. Bloch. vol. Akazawa. 304-312. Multiobjective optimization and multiple con- straint handling with evolutionary algorithms -Part I: A unified formulation. 77-103. The evolutionary pre- processor: Automatic feature extraction for supervised classification using ge- netic programming.1.183-196.C. Fleming.J.R. 1998 [18] A. Zhang and P. pp. Das. Zitzler and L. pp. Coello. 2001 [19] C. Swiss Federal Institute of Technology (ETH).26-37. Coughlan. Kotani.E. 1999 [24] K. R. pp. M. 283-314. 1-18. Breast cancer diagnosis and prognosis via linear programming. Mangasarian and W. Wolberg. vol. Kranenburg. IEEE Transactions on Evolutionary Computation.P. Washington. Technical Report.L. Edge detector evaluation using empirical ROC curves. 11. vol. Rockett. Receiver operating characteristic curves and optimal Bayesian operating points. Street and W. Galassi and A. CA. Theiler. 393-404. Statistical edge detec- tion: Learning and evaluating edge cues.R. Wolberg. A novel approach to design classifiers using genetic programming. Sherrah. pp. In: Genetic Programming 1997: Proceedings of the Second Annual Conference. Pattern Analysis and Machine Intelligence. pp. Zhu. 43. 570-577. Pattern Analysis and Machine Intelligence. W.H. Feature extraction using evolutionary computation. D. vol. 28. pp. Németh. M. Kumar and P.84. Bowyer. Fonseca and P. and K. pp. Nakai. pp. Evolutionary Computation. Canny. and S.M. 1986 [22] S. Bogner. 1990 [28] E. 2002 [21] J. Pal.N. pp. 1998 [20] R.3. no. 2003 [23] M. Computer En- gineering and Communication Networks Lab (TIK). 256-259. Operations Research. S. 1999 [29] N. Improved sampling of the Pareto-Front in multi- objective genetic optimization by Steady-State evolution: A Pareto converging genetic algorithm. vol. pp.Proceedings.L. Ekárt and S. vol. L. J. Young. 
IEEE Trans. Zurich. and J. Stanford University.Z. Konishi. 2001 [25] D.C. Image feature extraction: GENIE vs conventional supervised classification techniques. no. pp.M. 57-74. I Rockett [14] J. 2004 [26] O. Harvey. Alpaydin. An evolutionary algorithm for multiobjective opti- mization: The strength Pareto approach. Washington. and S. R. pp. pp. Switzerland. IEEE Press.B. 1885-1892. 4. 23. IEEE Trans.C. IEEE Transactions on Systems. A computational approach to edge detection. 6. vol.M. 2002 . 61-73. no. Kanungo and R.C. In: International Conference on Image Processing . Thiele. 3. Neural Computation. S. Haralick. vol. 8. no. SIAM News. N. Porter. 3-13. no. IEEE Transactions on Geo- science and Remote Sensing. Man and Cybernetics-Part A: Systems and Humans. H. 43. vol. no. Dougherty. Bouzerdoum. In: Proceedings of the Congress of Evolutionary Computation. no. Yuille. Combined 5 2 cv F-test for comparing supervised classification learning algorithms. Selection based on the Pareto nondomination crite- rion for controlling code growth in genetic programming. Brumby. Computer Vision and Image Understanding. Szymanski.J. 1995 [16] C. and A. 2. 1995 [27] O. pp.A. 8.98 Y.J. vol.I. An updated survey of evolutionary multiobjective optimization techniques: State of the art and future trends. Muni. 2. Genetic Programming and Evolvable Machines. 1230-1236. Perkins. 1.10. vol.

I. Schiffmann.L.1. Performance assessment of feature detection algorithms: A methodology and case study on corner detectors. Univer- sity of Koblenz. Department of Information and Computer Science. IEEE Transactions on Image Processing. An investigation into the application of genetic programming tech- niques to signal analysis and feature detection. 2005 (submitted) [33] K. Krawiec.I. Rockett. . Harris. 1992.329-343. College of London. Sep. 4 Feature Extraction Using Multi-Objective Genetic Programming 99 [30] W. Merz. Image Processing. Irvine. Thacker and P. Image and Signal Processing. Chen. Genetic programming-based construction of features for machine learning and knowledge discovery tasks. pp. [35] C. Genetic Programming and Evolvable Machines. Zhang and P. no. N. CA: University of California. Joost. Comp. UCI Repository of machine learning databases [http://www. 1997. PhD. and R. vol. An adaptive step edge model for self-consistent training of a neural network for probabilistic edge labeling. no. no.uci.11.ics. Sci- ence.12. Blake and C. IEEE Trans. Werner.J. thesis.Vision. 1996 [31] P. vol.A. The Bayesian operating point of the Canny edge detector. Univ. Synthesis and performance analysis of multilayer neural network architectures.3. 1668-1676. vol. 2003 [32] Y. Rockett. Technical Report 16/1992. IEE Proceedings . 2002 [34] C. pp.4. Dept.I.C.html]. 143. M. 1998 [36] W. 41-50. pp. Institute für Physics. Rockett.edu/ mlearn/MLRepository..

Part II Multi-Objective Learning for Accuracy Improvement

5 Regression Error Characteristic Optimisation of Non-Linear Models

Jonathan E. Fieldsend

School of Engineering, Computer Science and Mathematics, University of Exeter, Exeter, EX4 4QF, UK. J.E.Fieldsend@exeter.ac.uk

J.E. Fieldsend: Regression Error Characteristic Optimisation of Non-Linear Models, Studies in Computational Intelligence (SCI) 16, 103–123 (2006). www.springerlink.com © Springer-Verlag Berlin Heidelberg 2006

Summary. In this chapter recent research in the area of multi-objective optimisation of regression models is presented and combined. Evolutionary multi-objective optimisation techniques are described for training a population of regression models to optimise the recently defined Regression Error Characteristic Curves (REC), a method which meaningfully compares across regressors and against benchmark models (i.e. 'random walk' and maximum a posteriori approaches) for varying error rates. Through bootstrapping training data, degrees of confident out-performance are also highlighted. This approach is then extended to encapsulate the complexity of the model as a third objective to minimise. Results are shown for a number of data sets, using multi-layer perceptron neural networks.

5.1 Introduction

When forecasting a time series there are often different measurements of the quality of the signal prediction. These are numerous and often problem specific (a wide range of which are described in [1]). Recent advances in regressor comparison have been concerned not solely with the chosen error measure itself, but also with the distributional properties of a regressor's error; this has been formulated in a methodology called the Regression Error Characteristic (REC) curve, introduced by Bi and Bennett [2]. This curve traces out the proportion of residual errors of a model which lie below a certain error threshold, allowing the comparison of models given different error property preferences.

This chapter proceeds with the introduction of evolutionary computation techniques as a process for generating REC curves for regressor families, instead of for individual regressors, allowing the visualisation of the potential prediction properties for an entire class of method for a problem. This approach is then augmented with the simultaneous optimisation of model complexity, with a similar framework to that introduced by Fieldsend et al. [13]. The highlighting of regions of the REC curve on which we can

confidently outperform the maximum a posteriori (MAP) trained model is also introduced, through bootstrapping the optimised REC curve.

The chapter proceeds as follows: REC curves are formally defined and their properties discussed in Section 5.2. A general model for multi-objective REC optimisation is introduced in Section 5.3. The regression models to be optimised and compared, and empirical results on a range of problems, are presented in Section 5.4, and the methodology is extended to include complexity trade-off in Section 5.5. The chapter ends with a final discussion of results in Section 5.6.

5.2 Regression Error Characteristic Curves

Receiver Operating Characteristic (ROC) curves have proved useful for a number of years as a method to compare classifiers when the costs of misclassification are a priori unknown. In the binary classification case the ROC curve plots the rates of correct classification of one class against the misclassification rates of the other class, typically derived through changing the threshold/cost of a particular parameterised classifier. This formulation allows the user to see what range of classifications they can obtain from a model, and also allows them to compare models where misclassification costs are unknown by using such measures as the area under the curve (AUC) and the Gini coefficient. A more in depth discussion of ROC curves can be found in the chapter on ROC optimisation in this book.

Inspired by this, Bi and Bennett [2] developed the Regression Error Characteristic curve methodology to represent the properties of regression models. In a regression problem, the task is to generate a good estimate of a signal $y_i$ (called the dependent variable), from a transformation of one or more input signals $x_i$ (called independent variables), such that

$$\hat{y}_i = f(x_i, u). \tag{5.1}$$

$\hat{y}_i$ is the regression model prediction of $y_i$, given the data $x_i$ and the model parameters $u$. The optimisation process of regression models typically takes the form of varying $u$ in order to make $\hat{y}_i$ as close to $y_i$ as possible, where closeness is calculated through some error term (like Euclidean distance or absolute error), $\xi = \mathrm{error}(\hat{y}, y)$. This error is calculated for all the $n$ training data points used, and the average error, $\bar{\xi}$, is typically used as the scalar evaluation of the parameters $u$.

In REC, instead of dealing purely with the average error of a regressor, the entire error distribution is of interest. The proportion of points forecast below a certain error threshold is plotted against the error threshold, for a range of error thresholds (from zero to the maximum obtained error for a null regressor on a single point). This effectively traces out an estimate of the cumulative distribution function of the error experienced by a regressor.
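As a compact restatement of the construction just described, the empirical REC curve can be written as below. This is a sketch under two assumptions that are not stated in the text: the threshold symbol $\epsilon$ is introduced here for illustration, and a point is counted when its error is less than or equal to the threshold.

```latex
\mathrm{REC}(\epsilon) = \frac{1}{n}\,\bigl|\{\, i : \xi_i \le \epsilon \,\}\bigr|,
\qquad \xi_i = \mathrm{error}(\hat{y}_i, y_i), \quad \hat{y}_i = f(x_i, u),
\qquad
\bar{\xi} = \frac{1}{n}\sum_{i=1}^{n} \xi_i .
```

Read this way, $\mathrm{REC}(\epsilon)$ is the empirical cumulative distribution function of the errors, whereas $\bar{\xi}$ collapses the same distribution to a single number.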

The curve can be created by ordering $\xi$ in ascending order, and plotting this against the element index divided by $n$. There are many useful properties of this representation, for instance the area over the curve (AOC) is a good estimate of the expected error of the model. However, probably the most useful contribution of the REC formulation is the easy visualisation of error information about a model across the range of data used to train or test it. This gives information to the user which may lead them to select a model other than the one with the lowest average error.

5.2.1 Illustration

Let us start with a toy example where we have two regression models available to us, A and B, which are predicting some time series (for instance the demand of a product in the next month). If model A experiences, on average, an absolute error of 0.99 and model B an absolute error of 1.06, then without any additional information one would typically choose to use model A as it exhibits a lower mean error. Knowing the distributional properties of this error, however, may lead to a different decision. Figure 5.1 shows an (illustrative) error distribution of the absolute errors of models A and B.

Fig. 5.1. Distribution of residual errors of two regression models, a: A and b: B. (Axes: Absolute error, Number of points.)

Figure 5.2 in turn traces out the REC curves for A (solid line) and B (dashed line). This shows that although model A has a lower average error than model B, its largest errors (the top 15%) are proportionally bigger than those of model B, meaning it makes more extreme errors than model B. Given that the cost of extreme errors to the user of the forecast may be proportionally greater than small errors (small under predictions of demand

can be taken up by inventoried goods, large under predictions may result in turning customers away), B may actually be preferred.

Fig. 5.2. REC curves of two regression models, A (solid) and B (dashed). (Axes: Absolute deviation, Accuracy.)

Bi and Bennett [2] used REC to compare different regressors trained to minimise the average error, and so were effectively interested in those models which minimised the AOC.¹ When we are interested in a particular region of the REC curve (distributional property of our residual errors), minimising the AOC of a single model will not necessarily lead us to the best model given our preferences. However minimising the AOC of a set of models can.

¹ It should be noted in the original work by Bi and Bennett they recommended ranking models by the AOC, i.e. an estimate of the mean expected error. However this assumes proportional costs of error, which in many situations is not the case.

Using the previous illustration, if we merge the two REC curves of models A and B, taking only the portions which are in front, we can create an REC curve which illustrates the possible error/accuracy combinations given the available models. The illustration in Figure 5.3 shows this composite REC curve along with the REC curves of two new models, C and D. As the REC of model C lies completely below the composite REC curve, we can see that for any possible error/accuracy combination models A and/or B are better than model C. Model D however is slightly in front of the composite REC for a small range of accuracy, and so would be useful to retain and offer to the end user as a possible model. The final choice of model will depend upon the preferences and costs of the user of the model, which may be difficult to incorporate in an optimisation algorithm when there is no a priori knowledge of the shape of the REC curves possible (see Das and Dennis [5]).
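The REC construction and the AOC summary used in this illustration can be sketched in a few lines of Python. This is a minimal sketch, not the chapter's implementation: the function names and the two residual lists are hypothetical, chosen only so that model A has the lower mean error but the heavier tail, and the AOC is computed from the step-shaped empirical curve.

```python
def rec_curve(errors):
    """REC points (error threshold, accuracy): the k-th smallest error
    is paired with k/n, the fraction of points at or below it."""
    ranked = sorted(errors)
    n = len(ranked)
    return [(e, (k + 1) / n) for k, e in enumerate(ranked)]


def area_over_curve(errors):
    """Area over the step-shaped REC curve, from 0 to the largest error."""
    ranked = sorted(errors)
    n = len(ranked)
    area, previous = 0.0, 0.0
    for k, e in enumerate(ranked):
        # Between consecutive thresholds the curve sits at height k/n.
        area += (e - previous) * (1.0 - k / n)
        previous = e
    return area


def accuracy_at(errors, threshold):
    """Fraction of points whose error does not exceed the threshold."""
    return sum(e <= threshold for e in errors) / len(errors)


# Hypothetical absolute residuals (not data from the chapter).
errors_a = [0.1, 0.1, 0.2, 0.2, 0.3, 0.3, 0.4, 0.4, 3.0, 5.0]
errors_b = [0.8, 0.9, 1.0, 1.0, 1.1, 1.1, 1.2, 1.2, 1.3, 1.4]

# A wins on the AOC (its mean error is lower) ...
print(area_over_curve(errors_a), area_over_curve(errors_b))
# ... but B wins once errors above 2.0 are what matters to the user.
print(accuracy_at(errors_a, 2.0), accuracy_at(errors_b, 2.0))
```

For the step-shaped curve used here the AOC coincides with the mean error, which is why ranking by AOC alone reproduces the usual mean-error ranking and hides the difference in the tails.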

Fig. 5.3. Composite REC of models A (solid) and B (dashed), and REC curves of models C and D. (Axes: Absolute error, Accuracy.)

5.3 Multi-Objective Evolutionary Computation for REC

The generation of a composite REC curve to describe the possible error/accuracy combinations for a problem (given a model family or families) is easily cast in terms of a multi-objective optimisation problem. However the casting itself can be in two different forms, which affect both the representation and the complexity of the optimisation process.

5.3.1 Casting as A Two-Objective Problem

The obvious representation of the REC optimisation problem is as a 2-objective problem, the first objective (error threshold) to be minimised and the second objective (accuracy) to be maximised. In this case a single parameterisation, u, results in n error/accuracy pairs, which need to be compared to the current best estimate of the composite REC curve. Any values on the current best estimate which have a higher error threshold for the same accuracy as u need to be removed as they are dominated, and replaced by the relevant pair(s) from the evaluation of u. Formulated in this fashion, O(n log n) domination comparisons are needed for each parameter vector compared to the current best estimate of the Pareto front/composite REC curve. We can instead however represent the problem as an n objective problem, which actually turns out to be faster.
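The replacement rule of the two-objective casting can be sketched as follows. This is an illustrative sketch only: `empty_front`, `update_front` and the model labels are hypothetical names, every model is assumed to be evaluated on the same n points so that the accuracy levels k/n line up, and ties are left with the incumbent pair.

```python
import math


def empty_front(n):
    """Current best estimate: one (error threshold, owner) pair per
    accuracy level 1/n, 2/n, ..., 1. Nothing has been evaluated yet."""
    return [(math.inf, None)] * n


def update_front(front, errors, label):
    """Replace every pair on the front that has a higher error threshold
    than the new model at the same accuracy. Returns the number replaced."""
    ranked = sorted(errors)
    if len(ranked) != len(front):
        raise ValueError("every model must be evaluated on the same n points")
    replaced = 0
    for k, threshold in enumerate(ranked):
        if threshold < front[k][0]:
            front[k] = (threshold, label)
            replaced += 1
    return replaced


# Hypothetical residuals for three models on four points.
front = empty_front(4)
update_front(front, [1.0, 2.0, 3.0, 10.0], "A")
update_front(front, [2.0, 2.0, 4.0, 5.0], "B")
replaced_by_c = update_front(front, [3.0, 3.0, 5.0, 11.0], "C")

# Models that own at least one point of the composite curve.
contributors = {label for _, label in front}
print(replaced_by_c, sorted(contributors))
```

In this toy run model C replaces nothing and so never appears among the contributors, mirroring the role of model C in the illustration of Section 5.2.1.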

J.E. Fieldsend

5.3.2 Casting as An n-Objective Problem

On calculating the error, ξ, of two parameterisations u and v, and arranging them in ascending order, we can use these n dimensional errors as the fitness vectors for u and v. The accuracy term (the second objective in the previous formulation) always takes on the value of the index (over n) of the elements of ξ, so each element of the ordered ξ for u is coupled with the corresponding element of the ordered ξ of v. If all the error thresholds at each index for the regressor with decision vector (parameters) u are no larger than the error thresholds at the corresponding index for regressor v and at least one threshold is lower, then the regressor parameterised by u is said to strictly dominate that parameterised by v (u ≺ v).

For this formulation a single domination comparison is needed for each parameter vector compared to the current best estimate of the Pareto front/composite REC curve (albeit a comparison across n objectives as opposed to 2). Additionally we know that at most n unique model parameterisations will describe the REC composite curve.

Traditionally, a set F of decision vectors is said to be non-dominated if no member of the set is dominated by any other member:

\[ u \nprec v \quad \forall u, v \in F, \tag{5.2} \]

and this formulation is used to store the archive of the best estimate of the Pareto front found by an optimisation process. In the REC optimisation situation, our decision to add or remove an element to/from F is not quite so straightforward as that in (5.2), because the set F itself traces out the composite REC. If we define REC(F) as the composite curve generated by the elements in F, then we actually want to maintain F such that:

\[ u \nprec \mathrm{REC}(F \setminus u) \quad \forall u \in F, \tag{5.3} \]

that is, every member of F must still contribute to the composite curve traced out by the set.

Interestingly, the points returned by casting the problem as a 2-objective or as an n-objective problem are slightly different: the n objective optimisation of the curve returns an attainment surface representation of the REC curve (along the error axis) [35, 29], whereas the 2-objective optimisation returns a strictly non-dominated front representation (and therefore may have fewer than n elements), as illustrated in Figure 5.4.

5.4 Empirical Illustration

In this section a simple multi-objective evolutionary algorithm (MOEA) is introduced to show the generation of REC curves for a number of different well-known autoregressive problems from the literature, namely the Santa Fe competition suite of problems (see Weigend and Gershenfeld [32]).
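The archive rule in (5.3) can be sketched as follows. This is one possible reading rather than the author's code: each model is represented by its ascending vector of n absolute errors, `composite` is taken to be the element-wise minimum over the archive, and a member is assumed to "contribute" only if it is strictly lower than the composite of the other members at some index. All function names are hypothetical.

```python
def dominates(a, b):
    """True if a is no larger than b at every index and lower at one or more."""
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))


def composite(vectors):
    """Element-wise minimum: the composite REC thresholds of a set of models."""
    return [min(column) for column in zip(*vectors)]


def contributes(vector, others):
    """Assumed test: strictly below the composite of `others` somewhere."""
    if not others:
        return True
    return any(x < y for x, y in zip(vector, composite(others)))


def update_archive(archive, u):
    """Insert u if it improves the composite curve, then prune idle members."""
    if not contributes(u, archive):
        return archive
    kept = list(archive) + [u]
    i = 0
    while i < len(kept):
        rest = kept[:i] + kept[i + 1:]
        if contributes(kept[i], rest):
            i += 1
        else:
            kept.pop(i)  # no longer defines any part of the composite curve
    return kept


# Hypothetical sorted error vectors for four parameterisations (n = 3).
a, b = [1.0, 2.0, 4.0], [0.5, 3.0, 3.5]
c, d = [0.4, 1.5, 3.0], [0.6, 1.6, 3.1]
archive = update_archive(update_archive([], a), b)
```

Under this reading every retained member owns at least one index of the composite curve, which is why the archive can never exceed n members.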

Note that any recent MOEA from the literature could equally be used [4, 6, 7, 9, 17, 31, 36]; however, they would need augmentation to compensate for the problem specific archive update shown in (5.3).

Fig. 5.4. Composite REC curves from casting the REC optimisation problem as a: a 2-objective or b: an n-objective one. Note that all the points in (a) are included in (b). (Axes: absolute error against accuracy.)

5.4.1 The Optimiser

The MOEA used here is based on a simple (1+1)-evolutionary algorithm, a model which has been used extensively in the literature [10, 12, 14, 15, 20, 21, 22, 23]. An overview is provided in Algorithm 2.

Algorithm 2 REC optimising MOEA.
 1: F := initialise()                       Initial front estimate
 2: n := 0
 3: while n < N :                           Loop
 4:    u := evolve(select(F))               Copy and evolve from F
 5:    REC(u)                               Evaluate
 6:    if u ⊀ REC(F)                        If non-dominated
 7:       F := F ∪ u                        Insert in archive
 8:       F := {v ∈ F | v ⊀ REC(F \ v)}     Remove any dominated
 9:    end
10:    n := n + 1
11: end

The process commences with the generation of an initial estimate of the REC curve for the problem (Algorithm 2, line 1). This can typically be provided by the generation of random parameterisations of the model and/or the optimisation of the model with traditional scalar optimisers concerned with the mean error (i.e. back-propagation or scaled conjugate gradient for a neural network (NN) model [27]). These decision vector(s) are stored in F. The algorithm continues by iterating through a number of generations (line 3), and at each generation creating a new model parameterisation, u, through mutation and/or crossover of elements of F (line 4). u is compared at each generation to F (line 6). If it is non-dominated by the composite front defined by F, it is inserted into F (line 7), and any elements in F which no longer contribute to the composite front are removed (line 8). An equivalent formulation would be to simply update F to minimise its AOC, and remove any elements that do not contribute to its minimisation, effectively a set based scalar optimisation. However, as later in this chapter the additional imposition of a complexity minimisation objective will be introduced, the nominally multi-objective formulation will be adhered to here.

The evolve(select(F)) method (line 4) either generates a new solution by single point crossover of two members from F and perturbing the weights, or by copying a single solution from F and perturbing its weights. Parameters used are encoded as real values and crossover occurred at the transform unit level. Crossover was applied with a fixed probability, the weight perturbation probability was 0.8 and the weight perturbation multiplier was 0.01 (perturbation values themselves were drawn from a Laplacian distribution). Different representations and crossover/perturbation/selection methods could equally be used, and an excellent review of those concerned with NN training can be found in Yao [34].

5.4.2 Data

The data used in this chapter to show the properties of REC optimisation is the Santa Fe competition data [32], which is a suite of autoregressive and multi-variate data problems exhibiting different properties. The final series, Series F, is not used here as this exhibits the missing data property, which this chapter is not concerned with. The other 5 series are:

• Series A: Laser generated data.
• Series B: Three sets of physiological data, spaced by 0.5 second intervals. The first set is the heart rate, the second is the chest volume (respiration force), and the third is the blood oxygen concentration (measured by ear oximetry).
• Series C: Tickwise bids for the exchange rate from Swiss Francs to US dollars. Recorded by a currency trading group for 12 days.
• Series D: Computer generated time series.
• Series E: Astrophysical data. A set of measurements of the light curve (time variation of the intensity) of the variable white dwarf star PG1159-035 during March 1989, at 10 second intervals.
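The evolve(select(F)) step of Section 5.4.1 could look roughly like the sketch below. Several things here are assumptions made purely for illustration: a parameter vector is laid out as k hidden units of `unit_len` values each followed by one output bias, both parents have the same length, the default `p_cross` is a placeholder rather than a value taken from the chapter, and the Laplacian sample is produced as the difference of two exponential draws.

```python
import random


def laplace(rng, scale):
    """Laplace(0, scale) sample: the difference of two Exp(1) draws, scaled."""
    return scale * (rng.expovariate(1.0) - rng.expovariate(1.0))


def perturb(weights, rng, p_perturb=0.8, multiplier=0.01):
    """Perturb each weight with probability p_perturb by Laplacian noise."""
    return [w + laplace(rng, multiplier) if rng.random() < p_perturb else w
            for w in weights]


def crossover_units(parent_a, parent_b, unit_len, rng):
    """Single point crossover on a hidden-unit boundary (assumed layout)."""
    k = (len(parent_a) - 1) // unit_len        # number of hidden units
    cut = rng.randrange(1, k) if k > 1 else 0  # boundary between two units
    head = parent_a[:cut * unit_len]
    tail = parent_b[cut * unit_len:k * unit_len]
    return head + tail + [parent_a[-1]]        # output bias taken from parent_a


def evolve(archive, unit_len, rng, p_cross=0.5):  # p_cross: hypothetical value
    """Create one child by crossover or copying, then perturb its weights."""
    if len(archive) > 1 and rng.random() < p_cross:
        parent_a, parent_b = rng.sample(archive, 2)
        child = crossover_units(parent_a, parent_b, unit_len, rng)
    else:
        child = list(rng.choice(archive))
    return perturb(child, rng)


rng = random.Random(0)
child = evolve([[0.0] * 9, [1.0] * 9], unit_len=4, rng=rng)
```

Cutting only on unit boundaries keeps each hidden unit's input weights, bias and output weight together, which is one way to read "crossover occurred at the transform unit level".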

Fig. 5.5. Santa Fe competition series. (Panels labelled B1, B2, B3, A, C, D and E; horizontal axis from 0 to 1000 samples.)

All may be obtained from http://www-psych.stanford.edu/~andreas/Time-Series/SantaFe.html. Figure 5.5 shows the first 1000 samples of each of the series (except for series B, where the second 1000 is shown), 1000 sequential points of each used for training. As can be seen, the series exhibit varying degrees of oscillation and rapid change.

Series A, C, D and E are autoregressive problems (i.e. past values of the same series are used to predict future values). One step ahead predictions are made, and the number of lags (past steps) to use is determined by observing the auto-correlation between steps (shown in Figure 5.6). On inspection of the autocorrelations, 40 lags are used for series A and D, and 10 lags for series E. Series C is highly correlated even with very large lags (see Figure 5.6), but the correlation levels and a priori knowledge of the exchange rate market would indicate that results outperforming a random walk would be surprising. 5 lags were chosen for this series. Series B is a multi-variate problem, and is formulated here as a problem of predicting one of the series at time t given the values of the two other series at t.

In the prediction tasks used here, all series are standardised (the mean of the training data subtracted and divided through by the variance of the training data), however the error reported is based on the prediction transformed back into the original range.
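To make the data preparation concrete, here is a small sketch of building one step ahead training pairs from a lagged series and of the standardise/transform-back round trip. The helper names are hypothetical; the scale passed to `standardise` is left to the caller, and the usage below follows the text in dividing by the training variance.

```python
def lagged_dataset(series, lags):
    """Pair each window of `lags` past values with the value that follows it."""
    inputs = [list(series[t - lags:t]) for t in range(lags, len(series))]
    targets = [series[t] for t in range(lags, len(series))]
    return inputs, targets


def training_statistics(train):
    """Mean and (population) variance of the training portion of a series."""
    mean = sum(train) / len(train)
    variance = sum((v - mean) ** 2 for v in train) / len(train)
    return mean, variance


def standardise(values, mean, scale):
    """Subtract the training mean and divide by the chosen training scale."""
    return [(v - mean) / scale for v in values]


def destandardise(values, mean, scale):
    """Map predictions back into the original range before measuring error."""
    return [v * scale + mean for v in values]


# Toy series with 2 lags; the chapter uses 40, 10 or 5 lags on real data.
inputs, targets = lagged_dataset([1, 2, 3, 4, 5, 6], lags=2)
mean, variance = training_statistics([1.0, 2.0, 3.0])
scaled = standardise([1.0, 2.0, 3.0], mean, variance)
```

Errors for the REC curve would then be computed on `destandardise(predictions, mean, variance)` so that thresholds are expressed in the original units.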

Fig. 5.6. Autocorrelation of series a: A, b: C, c: D and d: E. (Axes: lag against correlation.)

5.4.3 Non-linear Models

In the traditional linear regression model the functional transformation takes the form of

\[ \hat{y} = u_1 x_1 + u_2 x_2 + \ldots + u_{m-1} x_p + u_m \tag{5.4} \]

with correspondingly m = p + 1 model parameters to fit, where p is the number of independent variables. Here however we shall use the non-linear multi-layer perceptron (MLP) neural network regression model. In an MLP, the functional transformation takes the form of a number of parallel and sequential functional transformations. With k parallel transformation functions (known as the hidden layer), and a single hidden layer, they can be represented in their regression form as:

\[ \hat{y} = f_1(x, u_1, \ldots, u_l) + f_2(x, u_{l+1}, \ldots, u_{2l}) + \ldots + f_k(x, u_{(k-1)l+1}, \ldots, u_{kl}) + u_m \tag{5.5} \]

In the case of the MLP transfer, these units take the form of a hyperbolic tangent:

\[ f(x, q) = \tanh\left( q_{p+1} + \sum_{i=1}^{p} q_i x_i \right) q_{p+2} \tag{5.6} \]
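Equations (5.5) and (5.6) translate almost directly into code. The sketch below assumes that each hidden unit owns l = p + 2 consecutive entries of u (p input weights, then the unit bias, then the unit-to-output weight) and that the final entry of u is the output bias u_m; the function names are hypothetical.

```python
import math


def hidden_unit(x, q):
    """Equation (5.6): tanh of a biased weighted sum, scaled by q[p + 1]."""
    p = len(x)
    activation = q[p] + sum(q[i] * x[i] for i in range(p))
    return math.tanh(activation) * q[p + 1]


def mlp_predict(x, u, k):
    """Equation (5.5): sum of k hidden units plus the output bias u[-1]."""
    unit_len = len(x) + 2
    if len(u) != k * unit_len + 1:
        raise ValueError("expected k * (p + 2) + 1 parameters")
    units = (u[j * unit_len:(j + 1) * unit_len] for j in range(k))
    return sum(hidden_unit(x, q) for q in units) + u[-1]


# p = 2 inputs and k = 2 hidden units give 2 * (2 + 2) + 1 = 9 parameters.
prediction = mlp_predict([1.0, 2.0], [1.0, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0, 1.0, 0.5], k=2)
```

Keeping the parameters in one flat vector is convenient for the evolutionary operators, since crossover and perturbation can then act on plain lists.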

The first p elements of q are the weights between the inputs and the hidden unit. The (p+1)th element is the unit bias and the (p+2)th element is the weight between the unit and the output. The model therefore has m = k(p + 2) + 1 parameters.

5.4.4 Results

The first example results are shown here for a 5 hidden unit multi-layer perceptron neural network. The model is initially trained using a scaled conjugate gradient algorithm [27], which acts as an initial F. As series A is highly oscillatory the first difference is used as the dependent variable, and the final prediction reconstructed from this. The null model, which is typically the mean of the data, in this formulation is therefore the more appropriate random walk model (which predicts that ŷ_t = y_{t−1}), as differencing the data gives a mean of zero.

Figure 5.7 shows the composite (evolved) REC curve after 20000 generations, the REC curve of a single model optimised with the scaled conjugate gradient algorithm for 1000 epochs, and the null model. It should be noted that the training of the single neural network and the subsequent run of the MOEA took approximately the same computation time. The composite REC curve is only slightly in front of the single AOC minimising model, however it does completely dominate it. Both curves are well in front of the null model, implying there is indeed information in the series which enables a degree of prediction beyond the most simple formulation.

Fig. 5.7. REC curves of Santa Fe competition series 'A' for a 5 hidden unit MLP. Solid line evolved composite REC, dashed line optimised REC curve of scalar best model and dot-dashed line REC of null model. (Axes: absolute deviation against accuracy.)

Figure 5.8 gives the error histograms of three different points on the composite REC curve, to better illustrate the qualitative difference between the errors made by regressors on different points on the composite REC curve. The bottom left histogram (corresponding to the model with the lowest threshold at the 0.1 level) can be seen to exhibit the greatest number of points with very low absolute error. Conversely, the bottom right histogram shows that of the single minimising AOC model (trained with the scaled conjugate gradient algorithm, the maximum a posteriori model); it exhibits fewer very high errors than the others, although the mean error of the histogram in the top right is pushed higher than the other models shown. Figure 5.9 shows the actual errors corresponding to the histograms provided in Figure 5.8, which shows where these errors are being made.

Fig. 5.8. Histograms of models on evolved composite REC. Top left and top right: lowest thresholds at the 0.5 and 0.9 levels. Bottom left: lowest threshold at 0.1 level. Bottom right: histogram of lowest mean error model. (Axes: absolute deviation against number of data points.)

Fig. 5.9. Errors of models on evolved composite REC. a: lowest threshold at 0.1 level, b: lowest threshold at 0.5 level, c: lowest threshold at 0.9 level, d: histogram of lowest mean error model. (Axes: time step against prediction error.)

Fig. 5.10. a: Probability contours of REC for MLP on series A. 5% and 95% level contours shown around composite REC curve. b: 95% level composite REC contour and REC curve of MAP model (dashed) and null model (dot-dashed). Error in log scale to aid visualisation. (Axes: absolute deviation against accuracy.)

Uncertainty

In [2] the REC curves presented are the means of a number of different runs (on cross validation data), meaning they were an average of a number of different model parameterisations on a number of different datasets, as each model was trained in turn on the different data. What we are interested in here more specifically is the expected variation of a single parameterisation, as in the end we have to decide upon a single model. We can do this effectively by bootstrapping our training data [8], and noting the variation in errors to generate a probability of operating in a particular objective space region [11].

More formally, by bootstrapping the data we are generating a data set of the same size, which is statistically equivalent to the original. If we generate p bootstrap replications, and evaluate our composite REC curve on these p data sets, we are provided with p error thresholds for each accuracy level. This gives us an n × p set of error values Ξ. We can calculate the probability that the regressor defining the REC curve point at accuracy level, i, will have a lower error level than a value ẽ as:

\[ p(\Xi_i < \tilde{e}) = \frac{1}{p} \sum_{j=1}^{p} I(\Xi_{i,j} < \tilde{e}) \tag{5.7} \]

where I(·) is the indicator function.

Figure 5.10a shows the probability contours for the composite REC front shown in Figure 5.7, created from 200 bootstrap resamples of the data, at the 5% and 95% levels. Figure 5.10b in turn shows the 95% REC contour and the REC curve of the single model trained using the conjugate gradient algorithm. (The null model is again set as the random walk model, which is a far better fit than the mean allocation model suggested in [2].) From this we can say not only (from Figure 5.7) that the composite REC curve lies completely in front of that of the single MAP model, but that we are confident (at the 95% level) that it will lie in front of the single REC model for accuracy levels from 0.02 up to 0.85 on statistically equivalent data.

Figure 5.11 shows these fronts for the other four test problems (once more using an MLP with 5 hidden units). Again the 95% composite REC contour (solid line), the REC of the MAP model (dashed line) and the null model (dot-dashed line) are plotted. From these we can see that we are confident (at the 95% level) of the composite REC models outperforming the single MAP model on the accuracy range 0.00-0.5 for series B, 0.01-0.10 for series D and 0.00-0.30 on series E. In the case of series C, the REC of the MAP model (and for most of the null model) lies in front of the 95% composite REC contour, implying there is little or no information in the series beyond a random walk prediction.

Until this point we have been concerned with the uncertainty over the error preferences for a regression model (leading to the optimisation of REC curves), and the uncertainty over the variability of the data used (leading to the use of probability contours).
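The bootstrap estimate in (5.7) can be sketched as below. This is an illustration under stated assumptions, not the chapter's code: `errors_by_level[i]` is assumed to hold the absolute errors, over all data points, of the regressor that defines the composite REC point at accuracy index i, each replication resamples data indices with replacement, and the function names are hypothetical.

```python
import random


def bootstrap_xi(errors_by_level, replications, rng):
    """Build the n x p matrix of bootstrapped error thresholds (Xi)."""
    n_points = len(errors_by_level[0])
    xi = [[0.0] * replications for _ in errors_by_level]
    for j in range(replications):
        # One resample of the data set, shared by every regressor.
        sample = [rng.randrange(n_points) for _ in range(n_points)]
        for i, errors in enumerate(errors_by_level):
            resampled = sorted(errors[t] for t in sample)
            xi[i][j] = resampled[i]  # threshold at accuracy index i
    return xi


def prob_below(xi, i, e_tilde):
    """Equation (5.7): fraction of replications with Xi[i][j] < e_tilde."""
    row = xi[i]
    return sum(1 for value in row if value < e_tilde) / len(row)


# Toy usage: three accuracy levels, each defined by a regressor whose
# absolute error is 1.0 on all three data points.
xi = bootstrap_xi([[1.0, 1.0, 1.0]] * 3, replications=5, rng=random.Random(1))
```

Reading off, for each accuracy index, the value of ẽ at which this probability reaches 0.05 or 0.95 would give contours of the kind described for Figure 5.10a.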

It is also very likely that, although we may have a preferred regression model type, we may not know how complex a model we should use, i.e., in the case of NNs, how many inputs, how many hidden units, what level of connectivity? In the final section of this chapter the previous optimising model will be extended so that a set of REC curves can be returned, in one optimising process, which describe the error/accuracy trade-off possibilities for a range of model complexities.

Fig. 5.11. 95% level composite REC contour and REC curve of MAP model (dashed) and null model (dot-dashed) for series a: B, b: C, c: D and d: E. Error in log scale to aid visualisation. (Axes: absolute deviation/error against accuracy.)

5.5 Complexity as An Additional Objective

An additional problem which is manifest when training regression models is how to specify a model that is sufficient to provide 'good' results to the task at hand without any a priori knowledge of how complex the function you wish to emulate actually is. Too simple a model and the performance will be worse than is actually realisable; too complex a model and one runs the risk of 'overfitting' the model and promoting misleading confidence on the actual error properties of your chosen technique. Depending upon the regressor used, various methods to tackle this problem are routinely in use.

In the neural network domain these take the guise of weight decay regularisation [28, 3], pruning [24], complexity loss functions [33] and cross validation topology selection [30].

A multi-objective formulation of this problem, which recognised the implicit assumptions about the interaction of model complexity and accuracy that penalisation methods make (e.g. [25]), was recently proposed by the author [16], and Jin et al. [19]. This casts the problem of accuracy and complexity as an explicit trade-off, which could be traced out and visualised without imposing any complexity costs a priori.

This method can be applied to the REC optimisation to trace out estimated optimal REC curves for different levels of model complexity. Here complexity shall be cast in terms of the number of model parameters: the larger the parameterisation of a model from a family, the more complex. In the linear regression case this is simply the number of coefficients. In the MLP case this is the number of weights and biases.

5.5.1 Changes to the Optimiser

The evolve(select(F)) method (line 4) of Algorithm 3 is adjusted in this extension to allow the generation and evolutionary interaction of parameterisations of differing complexity (dimensionality). The existing crossover allows the interaction of models with different dimensionality; in addition weight deletion is also incorporated with a probability of 0.1. F now contains a set of models generating a composite REC curve for each complexity level, as such its update is also modified from Algorithm 2.

Algorithm 3 REC and complexity optimising MOEA.
 1: F := initialise()                          Initial front estimate
 2: n := 0
 3: while n < N :                              Loop
 4:    u := evolve(select(F))                  Copy and evolve from F
 5:    REC(u)                                  Evaluate
 6:    F̃− := {v ∈ F : |v| ≤ |u|}               Lower and equal complexity
 7:    if u ⊀ REC(F̃−)                          If non-dominated
 8:       F := F ∪ u                           Insert in archive
 9:       F̃+ := {v ∈ F : |v| ≥ |u|}            Higher and equal complexity
10:       F̃− := {v ∈ F : |v| < |u|}            Lower complexity
11:       F̃+ := {v ∈ F̃+ | v ⊀ REC(F̃+ \ v)}     Remove any dominated
12:       F := F̃+ ∪ F̃−
13:    end
14:    n := n + 1
15: end

Line 6 of Algorithm 3 shows the selection from F of those members with equal or lower complexity to the new parameterisation u into F̃−. u is then compared to the composite REC curve defined by the members of F̃−; if it is non-dominated, then it is inserted into F (line 8) and those members of F with an equal or higher complexity than u are then compared to u, and any dominated members removed (line 11). Apart from these alterations the algorithm is run as previously.

Although the method in [19] uses a MOEA with a constrained archive, Jin et al. also report the occurrence of degrading estimates of the Pareto front as the search progressed (the elite archive containing points dominated by some solutions found earlier in the search). This problem with using constrained archives in multi-objective search has been highlighted previously in [9, 12]. As such the methodology of [16] is maintained here and an unconstrained elite archive of non-dominated solutions is maintained. Although the computational cost is higher with this level of maintenance, there are data structures available to make this process more efficient [9, 12, 26, 18], and we have the additional benefit in this application that we know there is a limit on the archive size (as discussed earlier in Section 5.3).

5.5.2 Empirical Results

Figure 5.12 shows the composite REC curves for various MLP complexities for the Series A problem previously defined, this time for 50000 generations. As can be seen, the REC curve rapidly approaches a stable position with only 10 weights in the network.

Fig. 5.12. REC/complexity surface for problem A. (Axes: absolute deviation, accuracy and complexity.)
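One way to read the complexity-aware archive update is sketched below. It is an interpretation, not the author's implementation: archive members are (complexity, sorted error vector) pairs, a candidate is accepted only if it improves on the composite of members that are no more complex than it, and an existing member of equal or higher complexity is assumed to be dropped once the composite of the no-more-complex members (including the newcomer) matches or beats it everywhere. Function names are hypothetical.

```python
def composite(vectors):
    """Element-wise minimum of a non-empty list of sorted error vectors."""
    return [min(column) for column in zip(*vectors)]


def improves(vector, rivals):
    """True if vector is strictly lower than the rivals' composite somewhere."""
    if not rivals:
        return True
    return any(x < y for x, y in zip(vector, composite(rivals)))


def update_complexity_archive(archive, u):
    """archive: list of (complexity, sorted_errors); u: one such candidate."""
    c_u, e_u = u
    if not improves(e_u, [e for c, e in archive if c <= c_u]):
        return archive  # u adds nothing at its own complexity level
    kept = list(archive) + [u]
    i = 0
    while i < len(kept):
        c_v, e_v = kept[i]
        rivals = [e for j, (c, e) in enumerate(kept) if j != i and c <= c_v]
        if c_v >= c_u and not improves(e_v, rivals):
            kept.pop(i)  # matched or beaten by no-more-complex models
        else:
            i += 1
    return kept


# Hypothetical members: a 5-weight model, then a better 10-weight model.
archive = update_complexity_archive([(5, [1.0, 2.0, 4.0])], (10, [0.5, 1.0, 2.0]))
```

Members that are simpler than the candidate are never touched, which is what lets the archive hold a separate composite curve for each complexity level.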

MLPs with less than 10 weights are seen to suffer from larger absolute deviations at the limit of accuracy, as well as higher deviations for given accuracy levels in the range. Beyond 10 weights the absolute deviations at the limit of accuracy are constant, and only very minimal improvements in deviation for given accuracy are observed.

Figure 5.13 in turn provides these plots for the other 4 problems previously described. Test problem B can also be adequately described with only 10 weights, whereas the REC curves of test problem C are relatively unchanged across all complexities, implying all that is left to model is noise, or is not predictable given the inputs. There are very small adjustments throughout the range of complexities shown for problem D, although good results can be obtained with relatively low complexity. On the other hand 20 weights are needed for problem E before the improvements to the REC curve become relatively small for higher complexity levels.

Fig. 5.13. REC curves of series a: B, b: C, c: D and d: E, for varying complexities. (Axes: absolute deviation, accuracy and complexity.)

5.6 Conclusion

This chapter has introduced the use of MOEAs to optimise a composite REC curve, which describes the (estimated) best possible error/accuracy trade off a regression model can produce for a particular problem.

REC specific properties were also highlighted: the hard limit of the number of different parameterisations possible, the casting as a 2 objective, n objective or even scalar problem, and the different computational complexities of these different formulations.

The optimisation method was extended by the use of bootstrapping to show the probability of performance, across accuracy ranges, on statistically equivalent data. It is also useful to note that the use of bootstrapping is equally applicable for examining the significance of performance of a single model parameterisation, to a baseline model. The problem itself was then expanded so that the complexity of the model was also optimised, producing a REC surface, which makes it easy to identify the minimum level of complexity needed to achieve specific error/accuracy combinations for a particular model family on a particular problem.

Acknowledgements

The author gratefully acknowledges support from the EPSRC grant number GR/R24357/01 during the writing of this chapter. The NETLAB toolbox for MATLAB, available from http://www.ncrg.aston.ac.uk/netlab/index.php, was used as the basis of the NNs used here.

References

[1] J.S. Armstrong and F. Collopy. Error measures for generalizing about forecasting methods: Empirical comparisons. International Journal of Forecasting, 8(1):69–80, 1992.
[2] J. Bi and K.P. Bennett. Regression Error Characteristic Curves. In Proceedings of the Twentieth International Conference on Machine Learning (ICML-2003), pages 43–50, Washington DC, 2003.
[3] C.M. Bishop. Neural Networks for Pattern Recognition. Oxford University Press, 1998.
[4] C.A. Coello Coello. A Comprehensive Survey of Evolutionary-Based Multiobjective Optimization Techniques. Knowledge and Information Systems. An International Journal, 1(3):269–308, 1999.
[5] I. Das and J. Dennis. A closer look at drawbacks of minimizing weighted sums of objectives for pareto set generation in multicriteria optimization problems. Structural Optimization, 14(1):63–69, 1997.
[6] K. Deb. Multi-Objective Optimization Using Evolutionary Algorithms. Wiley, Chichester, 2001.
[7] K. Deb, A. Pratap, S. Agarwal, and T. Meyarivan. Fast and elitist multi-objective genetic algorithm: NSGA–II. IEEE Transactions on Evolutionary Computation, 6(2):182–197, 2002.
[8] B. Efron and R.J. Tibshirani. An Introduction to the Bootstrap. Number 57 in Monographs on Statistics and Probability. Chapman & Hall, New York, 1993.

[9] R.M. Everson, J.E. Fieldsend, and S. Singh. Full Elite Sets for Multi-Objective Optimisation. In I.C. Parmee, editor, Adaptive Computing in Design and Manufacture V, pages 343–354. Springer, 2002.
[10] J.E. Fieldsend and R.M. Everson. ROC Optimisation of Safety Related Systems. In J. Hernández-Orallo, C. Ferri, N. Lachiche, and P. Flach, editors, Proceedings of ROCAI 2004, part of the 16th European Conference on Artificial Intelligence (ECAI), pages 37–44, Valencia, Spain, 2004.
[11] J.E. Fieldsend and R.M. Everson. Multi-objective Optimisation in the Presence of Uncertainty. In Proceedings of the IEEE Congress on Evolutionary Computation (CEC'05), 2005. Forthcoming.
[12] J.E. Fieldsend, R.M. Everson, and S. Singh. Using Unconstrained Elite Archives for Multi-Objective Optimisation. IEEE Transactions on Evolutionary Computation, 7(3):305–323, 2003.
[13] J.E. Fieldsend, J. Matatko, and M. Peng. Cardinality constrained portfolio optimisation. In Z.R. Yang, R. Everson, and H. Yin, editors, Proceedings of the Fifth International Conference on Intelligent Data Engineering and Automated Learning (IDEAL'04), number 3177 in Lecture Notes in Computer Science, pages 788–793. Springer, 2004.
[14] J.E. Fieldsend and S. Singh. A Multi-Objective Algorithm based upon Particle Swarm Optimisation, an Efficient Data Structure and Turbulence. In Proceedings of UK Workshop on Computational Intelligence (UKCI'02), pages 37–44, Birmingham, UK, Sept. 2-4, 2002.
[15] J.E. Fieldsend and S. Singh. Pareto Multi-Objective Non-Linear Regression Modelling to Aid CAPM Analogous Forecasting. In Proceedings of the 2002 IEEE International Joint Conference on Neural Networks, pages 388–393, Hawaii, May 12-17, 2002. IEEE Press.
[16] J.E. Fieldsend and S. Singh. Optimizing forecast model complexity using multi-objective evolutionary algorithms. In C.A.C Coello and G.B. Lamont, editors, Applications of Multi-Objective Evolutionary Algorithms, pages 675–700. World Scientific, 2004.
[17] C.M. Fonseca and P.J. Fleming. An Overview of Evolutionary Algorithms in Multiobjective Optimization. Evolutionary Computation, 3(1):1–16, 1995.
[18] M.T. Jensen. Reducing the run-time complexity of multiobjective EAs: The NSGA-II and other algorithms. IEEE Transactions on Evolutionary Computation, 7(5):503–515, 2003.
[19] Y. Jin, T. Okabe, and B. Sendhoff. Neural network regularization and ensembling using multi-objective evolutionary algorithms. In Proceedings of the IEEE Congress on Evolutionary Computation (CEC'04), pages 1–8. IEEE Press, 2004.
[20] J.D. Knowles and D. Corne. The pareto archived evolution strategy: A new baseline algorithm for pareto multiobjective optimisation. In Proceedings of the 1999 Congress on Evolutionary Computation, pages 98–105, Piscataway, NJ, 1999. IEEE Service Center.
[21] J.D. Knowles and D. Corne. Approximating the Nondominated Front Using the Pareto Archived Evolution Strategy. Evolutionary Computation, 8(2):149–172, 2000.
[22] M. Laumanns, L. Thiele, and E. Zitzler. Running Time Analysis of Multiobjective Evolutionary Algorithms on Pseudo-Boolean Functions. IEEE Transactions on Evolutionary Computation, 8(2):170–182, 2004.

[23] M. Laumanns, L. Thiele, E. Zitzler, E. Welzl, and K. Deb. Running Time Analysis of Multi-objective Evolutionary Algorithms on a Simple Discrete Optimization Problem. In J.J. Merelo Guervós, P. Adamidis, H-G Beyer, J-L Fernández-Villacañas, and H-P Schwefel, editors, Parallel Problem Solving from Nature – PPSN VII, Lecture Notes in Computer Science, pages 44–53. Springer-Verlag, 2002.
[24] Y. LeCun, J. Denker, S. Solla, R.E. Howard, and L.D. Jackel. Optimal brain damage. In D.S. Touretzky, editor, Advances in Neural Information Processing Systems II, pages 598–605, San Mateo, CA, 1990. Morgan Kauffman.
[25] Y. Liu and X. Yao. Towards designing neural network ensembles by evolution. Lecture Notes in Computer Science, 1498:623–632, 1998.
[26] S. Mostaghim, J. Teich, and A. Tyagi. Comparison of Data Structures for Storing Pareto-sets in MOEAs. In Congess on Evolutionary Computation (CEC'2002), volume 1, pages 843–848, Piscataway, New Jersey, May 2002. IEEE Press.
[27] I. Nabney. Netlab: Algorithms for Pattern Recognition. Springer, 2002.
[28] Y. Raviv and N. Intrator. Bootstrapping with noise: An effective regularization technique. Connection Science, 8:356–372, 1996.
[29] K.I. Smith, R.M. Everson, and J.E. Fieldsend. Dominance Measures for Multi-Objective Simulated Annealing. In Proceedings of the IEEE Congress on Evolutionary Computation (CEC'04), pages 23–30. IEEE Press, 2004.
[30] J. Utans and J. Moody. Selecting neural network architectures via the prediction risk: application to corporate bond rating prediction. In Proc. of the First Int. Conf on AI Applications on Wall Street, pages 35–41, Los Alamos, CA, 1991. IEEE Computer Society Press.
[31] D. Van Veldhuizen and G. Lamont. Multiobjective Evolutionary Algorithms: Analyzing the State-of-the-Art. Evolutionary Computation, 8(2):125–147, 2000.
[32] A.S. Weigend and N.A. Gershenfeld, editors. Time Series Prediction: Forecasting the Future and Understanding the Past. Addison-Wesley, Reading, MA, 1994.
[33] D. Wolpert. On bias plus variance. Neural Computation, 9(6):1211–1243, 1997.
[34] X. Yao. Evolving Artificial Neural Networks. Proceedings of the IEEE, 87(9):1423–1447, 1999.
[35] E. Zitzler. Evolutionary Algorithms for Multiobjective Optimization: Methods and Applications. PhD thesis, Swiss Federal Institute of Technology Zurich (ETH), 1999. Diss ETH No. 13398.
[36] E. Zitzler and L. Thiele. Multiobjective Evolutionary Algorithms: A Comparative Case Study and the Strength Pareto Approach. IEEE Transactions on Evolutionary Computation, 3(4):257–271, 1999.

6 Regularization for Parameter Identification Using Multi-Objective Optimization

Tomonari Furukawa¹, Chen Jian Ken Lee² and John G. Michopoulos³

¹ ARC Centre of Excellence in Autonomous Systems, School of Mechanical and Manufacturing Engineering, The University of New South Wales, NSW 2052 Australia, t.furukawa@unsw.edu.au
² School of Science for Open and Environmental Systems, Keio University, 3-14-1 Hiyoshi, Kohoku-ku, Yokohama 223-8522 Japan, ken@noguchi.sd.keio.ac.jp
³ Naval Research Laboratory, Center of Computational Material Science Special Projects Group, Code 6390 Computational Multiphysics Systems Lab, Washington DC 20375 USA, john.michopoulos@nrl.navy.mil

Summary. Regularization is a technique used in finding a stable solution when a parameter identification problem is exposed to considerable errors. However a significant difficulty associated with it is that the solution depends upon the choice of the value assigned onto the weighting regularization parameter participating in the corresponding formulation. This chapter initially and briefly describes the weighted regularization method. It continues by introducing a weightless regularization approach that reduces the parameter identification problem to multi-objective optimization. Subsequently, a gradient-based multi-objective optimization method with Lagrange multipliers is presented. Comparative numerical results with explicitly defined objective functions demonstrate that the technique can search for appropriate solutions more efficiently than other existing techniques. Finally, the technique was successfully applied for the parameter identification of a material model¹.

¹ This work is supported in part by the ARC Centre of Excellence program, funded by the Australian Research Council (ARC) and the New South Wales State Government.

T. Furukawa et al.: Regularization for Parameter Identification Using Multi-Objective Optimization, Studies in Computational Intelligence (SCI) 16, 125–149 (2006)
www.springerlink.com © Springer-Verlag Berlin Heidelberg 2006

6.1 Introduction

Inverse modeling is used in application areas where the system model is derived from its measured response and its corresponding stimulus [23]. In engineering applications, this is achieved via the identification of a continuous parameter set associated with a continuous deterministic mathematical model, based on a set of measured data related to the system's behavior. When the

However. 9. Because of the multiple character of the solution and the possible complexity of the associated objective functions. rather than a point. depending upon the characteristics of the objective function [25. 18]. This term makes the resulting functional smooth. namely the solution space. One of the approaches capable of overcoming this difficulty adds a regular- ization term [4] to the objective function. These meth- ods allow the design parameters to be optimized without weighting factors that depend on design criteria such as weight and energy consumption. Meanwhile. et al. Later. and the generalized cross validation [14]. 20] proposed a technique based on the sensitivity of the regularization term with respect to the regularization parameter. To find the best regularization parameter. [24] also proposed a technique using Singular Value De- composition. some techniques for finding the best value of the weighting factor have been proposed by several researchers. all these techniques result in introduc- ing another parameter that in turn influences the solution and therefore no conclusive result can ever be produced. the obtained solution depends upon the value of the weighting factor. Reginska [31] considered the maximizer of the L-curve. 5]. which is derived from a statistical method. Since the solution of this vector functional formulation is henceforth represented as a space. while Zhang and Zhu [33] developed a multi-time-step method for inverse problems involving systems consisting of partial differential equa- tions. 8. Nevertheless. which obtains a regularization para- meter based on minimization of an error criterion. so that conventional calculus-based optimization techniques can obtain an ap- propriate parameter set in a stable fashion. and most of the research re- ports leave the determination of its values open for further studies. 30]. 22]. as the optimal para- meter. Although improvements have been demonstrated by these techniques. 
Kitagawa [19. A dif- ficulty of this approach is finding a solution when measurement data and/or the model contain large errors [3] because the objective function becomes too complicated to be solved by conventional optimization methods. which normally consists of a function multiplied by a weighting factor that plays the role of the corresponding reg- ularization parameter. multi-objective optimization methods have been proposed to solve multi-objective design optimization problems [13. model is complicated. the corresponding methods attempt to find a set of admissible solutions (points) in the solu- tion space. Furukawa et al. Kubo.126 T. A comparative study of some of the techniques can be found in [21. the fun- damental question remains whether the automatic determination of a single solution through the selection of the best weighting factor is necessary. Conventional techniques include the Morozov discrepancy principle [28]. the corresponding solution . defined by Hansen [15]. Various optimization methods have been used. the parameter identification problem is often converted into an optimization problem where an objective function is formed by the measured behavioral data and the analytically predicted data generated from the partial model is minimized. while they only demonstrate results based on a limited number of a priori selected values [27.

methods have been mostly based on the evolutionary algorithms (EAs) [2]. They execute robust searches from multiple search points for single objective optimization. Many EAs are, however, excessively robust at the expense of efficiency, in contrast to the conventional calculus-based methods, which are efficient for parameter identification problems involving a continuous deterministic objective function with a continuous search space [16]. Moreover, since these algorithms find only a fixed number of solutions in the solution space, the solutions are sparse and are not well distributed in the solution space.

In this chapter, a framework for solving a regularized parameter identification problem without weighting factors [12] is presented first. According to this technique, regularization terms are each formulated as another objective function, and the multi-objective optimization problem is solved by a multi-objective optimization method. Furthermore, a multi-objective gradient-based method with Lagrange multipliers (MOGM-LM) is introduced as an optimization method to find solutions for this class of problems efficiently [26]. The method is also formulated such that its solutions can configure the solution space to be derived.

The next section gives an overview of the parameter identification discipline. The proposed weightless regularized identification approach and a general framework for multi-objective optimization are presented in Section 6.3, whereas MOGM-LM implemented in the framework is formulated in Section 6.4. Section 6.5 presents numerical results demonstrating the superiority of the proposed technique to conventional techniques and its applicability for parameter identification. In the first three subsections, the performance of the proposed technique is tested with explicitly defined objective functions, and the last subsection deals with the parameter identification of a viscoplastic material for practical use. The final section summarizes conclusions.

6.2 Parameter Identification

6.2.1 Formulation

We assume the existence of a set of stimulus and response experimental data $[u_i^*, v_i^*]$, where $u_i^* \in U$ and $v_i^* \in V$. The corresponding model $\hat{v}$ is defined in terms of parameters $x \in X$. The experimental data can be related to the model by

\[ \hat{v}(u_i^*, x) + e_i = v_i^* \tag{6.1} \]

where $e_i$ represents the sum of the model errors and measurement errors:

\[ e_i = e_i^{\mathrm{mod}} + e_i^{\mathrm{exp}}. \tag{6.2} \]

The parameter identification process is typically defined as the activity of identifying a continuous vector $x \in X \subseteq \mathbb{R}^n$, provided a set of continuous experimental data, $U, V \subseteq \mathbb{R}^n$, is available. In order to determine it numerically, the parameter identification problem is often converted into a minimization of a continuous functional:

\[ f(x) \to \min_{x} \tag{6.3} \]

where $f : X \to \mathbb{R}$. The parameter set minimizing such an objective function is found within the bounds

\[ x_{\min} \le x \le x_{\max} \tag{6.4} \]

where $[x_{\min}, x_{\max}] = X$, and may be further subject to equality constraints

\[ h_i(x) = 0, \quad \forall i \in \{1, \dots, m_e\} \tag{6.5} \]

as well as inequality constraints

\[ g_i(x) \le 0, \quad \forall i \in \{m_e + 1, \dots, m\}. \tag{6.6} \]

In this formulation, the solution of the identification problem is said to exist if there is at least one minimum within the search range satisfying constraints (6.4)-(6.6), and the solution is said to be unique if there is only one minimum within the range. As an example approach for determining an objective functional (6.3), one may consider the popular method of least squares, in which the objective function is often represented as

\[ f(x) = \sum_{i=1}^{n} \left\| \hat{v}(u_i^*, x) - v_i^* \right\|^2. \tag{6.7} \]

This equation clearly indicates that the form of the objective function depends upon the modeled data, which depend on the model of the system itself, and on the measured data. The difficulty of the parameter identification is therefore that the objective function can become complex if the model and measured data contain considerable errors. This is more apparent when the number of measured data is small. This gives rise to the need for regularization described in the next section.

6.2.2 Regularization

If an objective function is complex, even a small change of the parameters may lead to a significant change in the value of the objective function. On the other hand, the majority of optimization methods can consistently find a global minimum only if the objective function is nearly convex. Otherwise, the solution may diverge or oscillate depending on the a priori selected initial search point. An approach for overcoming this problem is to introduce an additional term to the objective function in order to make it near-convex. The regularization contributes to the stabilization of such a function. In the Tikhonov
regularization method [32] (arguably the most popular regularization technique), the objective function is reformulated as

\[ \Pi(x) = f(x) + \omega \Lambda(x) \tag{6.8} \]

where $\omega$ and $\Lambda(x)$ are known as the Tikhonov regularization parameter and the Tikhonov regularization function, respectively. If the solution is known to be adjacent to $x^*$, the regularization term may be given by

\[ \Lambda(x) = \left\| K (x - x^*) \right\|^2 \tag{6.9} \]

where $K$ is a weighting matrix. The matrix is most often set simply to the unity matrix unless some prior information is available about the system. A solution of the parameter identification problem is thus obtained by first specifying $\omega$ and subsequently minimizing the function with an optimization method. In other words, the specification of $\omega$ becomes the fundamental issue in the regularized parameter identification, as mentioned in Section 6.1. The next section will describe how to derive a solution which is not influenced by the regularization parameter, thus avoiding the uncertainty introduced by the unknown character of $\omega$.

6.3 Weightless Regularization

6.3.1 Problem Formulation

Instead of attempting to find solutions that depend on the weighting factors, we focus on finding solutions that do not depend on the weighting factors by entirely removing them from our formulation. This results in a multi-objective formulation. With the unity weighting matrix $K$, the Tikhonov regularization parameter $\omega$ is the only weighting factor, and the objective of the problem is thus expressed as

\[ \mathbf{f}(x) = \left[ f(x), \ \| x - x^* \|^2 \right]^{T} \to \min_{x} \tag{6.10} \]

where $\mathbf{f}(x) : X \to \mathbb{R}^2$. If the weighting matrix is diagonal,

\[ \mathbf{f}(x) = \left[ f(x), \ \| x_1 - x_1^* \|^2, \ \dots, \ \| x_n - x_n^* \|^2 \right]^{T} \to \min_{x} \tag{6.11} \]

where $\mathbf{f}(x) : X \to \mathbb{R}^{1+n}$. The regularized parameter identification formulated as a multi-objective optimization problem is conclusively characterized as follows:
• the problem is multi-objective,
• the search space is continuous,
• the objective functions $\mathbf{f}(x)$ are continuous but can be complex, for example non-convex,
• the problem is subject to equality constraints (6.5) and inequality constraints (6.6).
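To make the dependence on $\omega$ in Eq. (6.8) concrete, the following minimal sketch uses the quadratic test function $f(x) = \sum_i x_i^2$ that this chapter employs later in Eq. (6.46), with $K$ set to the unity matrix. The function name `tikhonov_minimizer` and the prior estimate `x_star = [2.0, 4.0]` are illustrative assumptions, not part of the original formulation. For this particular $f$, setting the gradient of $\Pi$ to zero gives the closed form $x = \omega x^* / (1 + \omega)$.

```python
def tikhonov_minimizer(omega, x_star):
    """Minimizer of Pi(x) = sum(x_i^2) + omega * sum((x_i - x_star_i)^2).

    Setting the gradient 2*x + 2*omega*(x - x_star) to zero gives
    x = omega / (1 + omega) * x_star (valid for this quadratic f only).
    """
    scale = omega / (1.0 + omega)
    return [scale * s for s in x_star]


if __name__ == "__main__":
    x_star = [2.0, 4.0]  # assumed prior estimate, for illustration only
    for omega in (0.01, 1.0, 100.0):
        print(omega, tikhonov_minimizer(omega, x_star))
```

Every choice of $\omega$ yields a different point on the segment between the minimizer of $f$ and $x^*$. The weightless formulation (6.10) aims to recover that whole set of trade-off points at once instead of committing to one $\omega$.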

6.3.2 Problem Solution

For the sake of generality, we consider a problem where $m$ objective functions, $f_k : X \to \mathbb{R}$, $\forall k \in \{1, \dots, m\}$, are defined as

\[ \mathbf{f}(x) = \left[ f_1(x), \dots, f_m(x) \right]^{T} \to \min_{x}. \tag{6.12} \]

The solution of this multi-objective optimization problem becomes a space rather than a point, but there is no analytical technique that derives the solution space. Numerical techniques can configure the solution space in an approximate manner by finding points that satisfy the Pareto-optimality, which is the requirement for points to be in the solution space:

Pareto-optimality: A decision vector $x_i \in X$ is said to be Pareto-optimal if and only if there is no vector $x_j \in X$ for which $f_k(x_j)$, $\forall k \in \{1, \dots, m\}$, dominates $f_k(x_i)$, i.e., there is no vector $x_j$ such that

\[ f_k(x_j) \le f_k(x_i), \quad \forall k \in \{1, \dots, m\}, \tag{6.13} \]

with strict inequality for at least one $k$.

To find a good approximate solution space, the numerical technique must find Pareto-optimal solutions which are well distributed to configure the solution space.

6.3.3 A General Framework for Searching Pareto-Optimal Solutions

Figure 6.1 shows the flowchart of the framework of the multi-objective optimization proposed in this chapter. In order to find multiple solutions, the multi-objective optimization searches with $p$ multiple points, i.e.,

\[ X(K) = \{ x_1^K, \dots, x_p^K \} \in (\mathbb{R}^n)^p \tag{6.14} \]

where $x_i^K$ is the $i$th search point at the $K$th iteration. The initial population, $X(0)$, is generated randomly within a specified range $[x_{\min}, x_{\max}]$. Each objective function value $f_j(x_i^K)$ is then calculated with each parameter set $x_i^K$, finally yielding

\[ F(K) = \{ \mathbf{f}(x_1^K), \dots, \mathbf{f}(x_p^K) \}. \tag{6.15} \]

Unlike multi-objective EAs, two scalar criteria are evaluated for each search point in the proposed framework. One is the rank in Pareto-optimality, expressed as usual:

\[ \Theta(K) = \{ \theta(x_1^K), \dots, \theta(x_p^K) \}, \tag{6.16} \]

where $\theta : \mathbb{R}^n \to \mathbb{N}$, and the other is a positive real-valued scalar objective function or a fitness, which is derived by taking into account the rank:

\[ \Phi(K) = \{ \phi(x_1^K), \dots, \phi(x_p^K) \}, \tag{6.17} \]

where $\phi : \mathbb{R}^n \to \mathbb{R}^+$.

[Fig. 6.1. Flowchart of multi-objective evolutionary algorithms.]

Whilst the rank is evaluated to check the degree of Pareto-optimality of each search point, the fitness is used to create the next search point $x_i^{K+1}$. Since the creation depends upon the search methods to be used, the next population is written in canonical form as

\[ X(K+1) = s\left( X(K), \Phi(K), \nabla \Phi(K), \nabla^2 \Phi(K) \right), \tag{6.18} \]

where $s$ is the search operator.

Once the iterative computation has been enabled, we want to find as many effective Pareto-optimal solutions as possible such that the solution space can be configured. Another technique proposed here is a Pareto pooling strategy, where the set of Pareto-optimal solutions created in the past is pooled as $P(K)$ besides the population of search points $X(K)$. The process of the Pareto pooling technique is as follows. All the Pareto-optimal solutions obtained in the first iteration are stored, i.e., $P(0) = X(0)$. From the second iteration, the newly created Pareto-optimal solutions in the optimization loop, $X(K+1)$, are compared to the stored Pareto-optimal solutions $P(K)$, and the new set of Pareto-optimal solutions $P(K+1)$ is saved
in the storage as illustrated in Fig. 6.2. Some Pareto-optimal solutions may be identical or very close to an existing point. The storage of such solutions is simply a waste of memory, so they are discarded if they are closer than the resolution specified a priori. The creations of the new population and the Pareto-optimal solutions are repeated until a terminal condition is satisfied. Let the Pareto-optimal solutions finally obtained be $\bar{x}_i$, $\forall i \in \{1, \dots, q\}$.

[Fig. 6.2. Creation of Pareto-optimal solutions.]

6.3.4 Selection of a Single Solution by Centre-of-Gravity Method (CoGM)

Figure 6.3 illustrates Pareto-optimal solutions where two objective functions $\mathbf{f} = [f_1, f_2]$ are minimized to identify three parameters $x = [x_1, x_2, x_3]$. As two-dimensional function space and three-dimensional parameter space are still easy to visualize, one may incorporate human knowledge into computational knowledge-based techniques such as expert systems and fuzzy logic [17] for automatic selection of a single solution. However, if the numbers of objective functions and parameters are large, the knowledge to be constructed increases exponentially, and such techniques are no longer practically possible. In this case, one prominent way is to select the solution residing in the center of the solution space, since this solution is robust. Here, we propose a technique where the closest solution to the center-of-gravity is chosen as the solution. If each solution is evaluated in a scalar manner, i.e., $\varphi(\bar{x}_i)$, the center-of-gravity is in general given by

\[ \bar{x} = \frac{\sum_{i=1}^{q} \bar{x}_i \cdot \varphi(\bar{x}_i)}{\sum_{i=1}^{q} \varphi(\bar{x}_i)}. \tag{6.19} \]

Since the Pareto-optimal solutions must be evaluated equally, we can consider that all the Pareto-optimal solutions possess the same scalar value, i.e., $\varphi(\bar{x}_1) = \cdots = \varphi(\bar{x}_q)$. No matter what the value is, the center-of-gravity results in the form

\[ \bar{x} = \frac{\sum_{i=1}^{q} \bar{x}_i}{q}. \tag{6.20} \]

The effectiveness of the CoGM has not been proved theoretically, but it is highly acceptable, as it has been commonly used in fuzzy logic [17] to find a solution from the solution space described by fuzzy sets.

[Fig. 6.3. Process of deriving a single solution. Panels: function space ($f_1$, $f_2$) and parameter space ($x_1$, $x_2$, $x_3$).]

6.4 Multi-objective Gradient-based Method with Lagrange Multipliers

6.4.1 Evaluation of Functions

Rank function

Figure 6.4 depicts the process to rank the search points and accordingly derive $\Theta(K)$ in Eq. (6.16). The process is based purely on an elimination rule.

According to the rule, every objective function at every search point, $f_j(x_i^K)$, $\forall i \in \{1, \dots, p\}$, $\forall j \in \{1, \dots, m\}$, is first calculated, and the Pareto-optimal set in the current population is ranked No. 1, i.e., $\theta(x_i^K) = 1$ if the search point $x_i^K$ is in the Pareto-optimal set. The group of search points ranked No. 1 is denoted as G(1) in the figure. The points with rank No. 1 are then eliminated from the population, and the Pareto-optimal set in the remaining population is ranked No. 2, $\theta(x_i^K) = 2$. Ranking is continued in the same fashion until all the points are ranked [13].

[Fig. 6.4. Ranking process.]

Real-valued Function

The evaluation of the real-valued function of each search point starts with finding the maximum and minimum values of each objective function among the population:

\[ (f_{\max})_j(K) = \max \{ f_j(x_i^K) \mid \forall i \in \{1, \dots, p\} \}, \tag{6.21} \]
\[ (f_{\min})_j(K) = \min \{ f_j(x_i^K) \mid \forall i \in \{1, \dots, p\} \}. \tag{6.22} \]
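The elimination rule for the rank function described above can be sketched as follows. This is a minimal illustration under the usual definition of dominance for minimization; the function names `dominates` and `rank_by_elimination` are hypothetical and do not come from the chapter.

```python
def dominates(fa, fb):
    """True if objective vector fa dominates fb (minimization):
    no worse in every objective and strictly better in at least one."""
    return (all(a <= b for a, b in zip(fa, fb))
            and any(a < b for a, b in zip(fa, fb)))


def rank_by_elimination(F):
    """Return the rank theta of each objective vector in F.

    Rank 1 is the Pareto-optimal set of the whole population; those points
    are removed, the Pareto-optimal set of the remainder gets rank 2, etc.
    """
    ranks = [0] * len(F)
    remaining = set(range(len(F)))
    current = 1
    while remaining:
        # Non-dominated points among those not yet ranked.
        front = {i for i in remaining
                 if not any(dominates(F[j], F[i])
                            for j in remaining if j != i)}
        for i in front:
            ranks[i] = current
        remaining -= front
        current += 1
    return ranks


if __name__ == "__main__":
    F = [(1, 4), (2, 2), (4, 1), (3, 3), (5, 5)]
    print(rank_by_elimination(F))
```

The pairwise comparison costs on the order of $p^2 m$ operations per front, which is unproblematic for the small populations (for example $p = 10$) used in this chapter.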

If we temporarily define the real-valued function as

\[ \phi'_j(x_i^K) = \frac{f_j(x_i^K) - (f_{\min})_j(K)}{(f_{\max})_j(K) - (f_{\min})_j(K)}, \tag{6.23} \]

the following normalized conditions can be obtained,

\[ 0 \le \phi'_j(x_i^K) \le 1, \tag{6.24} \]

and this allows the value of each function to be treated on the same scale. The real values of points with the same rank have to be identical, and the true value of each objective function is thus defined as

\[ \phi_j(x_i^K) = \max \{ \phi'_j(x_l^K) \mid \theta(x_l^K) = \theta(x_i^K), \ \forall l \in \{1, \dots, p\}, \ l \ne i \}. \tag{6.25} \]

The value of each search point can be conclusively calculated as

\[ \phi(x_i^K) = \sum_{j=1}^{m} w_j^K \phi_j(x_i^K) \tag{6.26} \]

where $w_j^K = \mathrm{rand}(0, 1)$ contributes to a wide distribution of the resultant Pareto-optimal solutions; $\mathrm{rand}(0, 1)$ refers to a random number between 0 and 1 following a uniform distribution. The real value will appear within the range

\[ 0 \le \phi(x_i^K) \le m. \tag{6.27} \]

6.4.2 Search Method

The introduction of the real-valued function $\Phi(K)$ allows gradient-based methods to be implemented easily without additional formulations. With $\phi(x_i^K)$ calculated for each $x_i^K$, the next state for a search point is given by

\[ x_i^{K+1} = x_i^K + \Delta x_i^K \tag{6.28} \]

where the step $\Delta x_i^K$ of the search point is determined by

\[ \Delta x_i^K = \alpha^K d\left( x_i^K, \phi(x_i^K), \nabla \phi(x_i^K), \nabla^2 \phi(x_i^K) \right). \tag{6.29} \]

In Eq. (6.29), $\alpha^K$ is the search step length iteratively searched as a subproblem by the Armijo rule [29], whereas $d$ maps the direction of the search step. The Armijo rule is governed by three parameters $\xi_1$, $\xi_2$ and $\tau$. Various existing techniques can be used to find the direction of the search step. The most basic technique may be the Steepest Descent (SD) method:

\[ d_{SD}\left( x_i^K, \nabla \phi(x_i^K) \right) = -\nabla \phi(x_i^K). \tag{6.30} \]

The efficient and most popular technique is the Quasi-Newton (QN) method, which is described as

\[ d_{QN}\left( x_i^K, \phi(x_i^K), \nabla \phi(x_i^K) \right) = -\left( A^K \right)^{-1} \nabla \phi(x_i^K) \tag{6.31} \]

where $A^K \approx \nabla^2 \phi(x_i^K)$ is the approximated Hessian of the function defined in (6.26).
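The fitness assignment of Eqs. (6.21)-(6.27) can be sketched as below. Two points are assumptions made for this illustration rather than statements from the chapter: the maximum in Eq. (6.25) is taken over the whole group of equally ranked points (including the point itself, so that the shared value is well defined even for a group of one), and an objective whose range is zero is mapped to 0 to avoid division by zero. The function name `fitness` is hypothetical.

```python
import random


def fitness(F, ranks, rng=random):
    """Scalar fitness phi for each search point.

    F     : list of objective vectors f(x_i), one per search point
    ranks : Pareto ranks theta(x_i), same length as F
    rng   : source of uniform random numbers in [0, 1)
    """
    p, m = len(F), len(F[0])
    f_min = [min(F[i][j] for i in range(p)) for j in range(m)]  # Eq. (6.22)
    f_max = [max(F[i][j] for i in range(p)) for j in range(m)]  # Eq. (6.21)

    def normalised(i, j):
        # Eq. (6.23); a constant objective is mapped to 0 (assumption).
        span = f_max[j] - f_min[j]
        return (F[i][j] - f_min[j]) / span if span > 0 else 0.0

    phi_tmp = [[normalised(i, j) for j in range(m)] for i in range(p)]

    # Eq. (6.25): points of equal rank share the largest normalised value
    # found in their rank group (group includes the point itself here).
    phi_j = [[max(phi_tmp[l][j] for l in range(p) if ranks[l] == ranks[i])
              for j in range(m)] for i in range(p)]

    # Eq. (6.26): random weights, drawn once per evaluation.
    w = [rng.random() for _ in range(m)]
    return [sum(w[j] * phi_j[i][j] for j in range(m)) for i in range(p)]


if __name__ == "__main__":
    F = [(1, 4), (2, 2), (4, 1), (3, 3), (5, 5)]
    print(fitness(F, [1, 1, 1, 2, 3], random.Random(0)))
```

Because the weights are redrawn at every evaluation, repeated calls push the search points towards different parts of the Pareto front, which is the mechanism the text credits for the wide distribution of solutions.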

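6.4.3 Constraint Handling with Lagrange Multipliers

Equality constraints

The introduction of the real-valued function $\Phi(K)$ also contributes to handling constraints using the standard technique of the Lagrange multiplier method [1]. To handle equality constraints with the real-valued function $\Phi(K)$, the Lagrangian function can be defined as

\[ L(x, \lambda) \equiv \phi(x) + h(x)^{T} \lambda \tag{6.32} \]

where $\lambda$ is a set of Lagrange multipliers for equality constraints. Suppose the solution is denoted by $[x^*, \lambda^*]$. The optimality conditions in this case, known as Kuhn-Tucker conditions, are described as

\[ \nabla_x L(x^*, \lambda^*) = \nabla_x \phi(x^*) + \nabla_x h(x^*)^{T} \lambda^* = 0, \tag{6.33} \]
\[ \nabla_\lambda L(x^*, \lambda^*) = h(x^*) = 0. \tag{6.34} \]

Let the parameters and Lagrange multipliers be updated iteratively with $x^{K+1} = x^K + \Delta x^K$ and $\lambda^{K+1} = \lambda^K + \Delta \lambda^K$, respectively. The approximation of the first-order Taylor expansion yields the following equations:

\[ \nabla_x L(x^{K+1}, \lambda^{K+1}) \equiv \nabla_x L(x^K, \lambda^K) + \nabla_{xx} L(x^K, \lambda^K) \Delta x^K + \nabla_{x\lambda} L(x^K, \lambda^K) \Delta \lambda^K, \tag{6.35} \]
\[ \nabla_\lambda L(x^{K+1}, \lambda^{K+1}) \equiv \nabla_\lambda L(x^K, \lambda^K) + \nabla_{\lambda x} L(x^K, \lambda^K) \Delta x^K + \nabla_{\lambda\lambda} L(x^K, \lambda^K) \Delta \lambda^K. \tag{6.36} \]

The substitution of $\nabla_x L(x^K, \lambda^K) = \nabla_x \phi(x^K) + \nabla_x h(x^K)^{T} \lambda^K$ together with

\[ \nabla_{x\lambda} L(x^K, \lambda^K) = \nabla_{\lambda x} L(x^K, \lambda^K)^{T} = \nabla_x h(x^K)^{T}, \tag{6.37} \]
\[ \nabla_{\lambda\lambda} L(x^K, \lambda^K) = 0 \tag{6.38} \]

rewrites Eqs. (6.35) and (6.36) as

\[ \nabla_x L(x^{K+1}, \lambda^{K+1}) = \nabla_x \phi(x^K) + \nabla_x h(x^K)^{T} \lambda^K + \nabla_{xx} L(x^K, \lambda^K) \Delta x^K + \nabla_x h(x^K)^{T} \Delta \lambda^K, \tag{6.39} \]
\[ \nabla_\lambda L(x^{K+1}, \lambda^{K+1}) = h(x^K) + \nabla_x h(x^K) \Delta x^K. \tag{6.40} \]

It remains to determine $[\Delta x^K, \Delta \lambda^K]$ such that the optimality conditions are satisfied at $[x^{K+1}, \lambda^{K+1}]$. Setting (6.39) and (6.40) to zero and using $\lambda^{K+1} = \lambda^K + \Delta \lambda^K$, the resulting equation in matrix-vector form is

\[ \begin{bmatrix} \nabla_{xx} L(x^K, \lambda^K) & \nabla_x h(x^K)^{T} \\ \nabla_x h(x^K) & 0 \end{bmatrix} \begin{bmatrix} \Delta x^K \\ \lambda^{K+1} \end{bmatrix} = \begin{bmatrix} -\nabla_x \phi(x^K) \\ -h(x^K) \end{bmatrix}. \tag{6.41} \]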
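One iteration of the linear system (6.41) can be sketched for a single equality constraint as follows. The test problem (minimize $x_1^2 + x_2^2$ subject to $x_1 + x_2 - 2 = 0$), the small Gaussian-elimination helper `solve` and the function name `kkt_step` are assumptions chosen for illustration. The Hessian is supplied exactly here, whereas the chapter approximates it by the QN method.

```python
def solve(A, b):
    """Solve A y = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, n):
            factor = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= factor * M[c][k]
    y = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(M[r][k] * y[k] for k in range(r + 1, n))
        y[r] = (M[r][n] - tail) / M[r][r]
    return y


def kkt_step(x, hess_L, grad_phi, h, grad_h):
    """One step of Eq. (6.41) with a single equality constraint.

    Returns the updated point x + dx and the new multiplier lambda.
    """
    n = len(x)
    # Assemble [[hess_L, grad_h^T], [grad_h, 0]].
    A = [hess_L[i][:] + [grad_h[i]] for i in range(n)] + [grad_h[:] + [0.0]]
    rhs = [-g for g in grad_phi] + [-h]
    sol = solve(A, rhs)
    dx, lam_next = sol[:n], sol[n]
    return [xi + di for xi, di in zip(x, dx)], lam_next


if __name__ == "__main__":
    x = [3.0, 0.0]                          # infeasible starting point
    hess = [[2.0, 0.0], [0.0, 2.0]]         # Hessian of x1^2 + x2^2
    grad = [2.0 * x[0], 2.0 * x[1]]
    h = x[0] + x[1] - 2.0
    print(kkt_step(x, hess, grad, h, [1.0, 1.0]))
```

For this quadratic objective with a linear constraint, the first-order expansion is exact, so a single step lands on the constrained minimum $[1, 1]$ with $\lambda = -2$. For general $\phi$ and $h$ the step is repeated until the conditions (6.33) and (6.34) hold to tolerance.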

An inversion of the matrix-vector Eq. (6.41) results in equations for evaluating $[\Delta x^K, \lambda^{K+1}]$:

\[ \begin{bmatrix} \Delta x^K \\ \lambda^{K+1} \end{bmatrix} = \begin{bmatrix} \nabla_{xx} L(x^K, \lambda^K) & \nabla_x h(x^K)^{T} \\ \nabla_x h(x^K) & 0 \end{bmatrix}^{-1} \begin{bmatrix} -\nabla_x \phi(x^K) \\ -h(x^K) \end{bmatrix}. \tag{6.42} \]

It is important to note that $\nabla_{xx} L(x^K, \lambda^K)$ is the Hessian of the Lagrange equation (6.32). The computation of the Hessian can be performed by the QN method.

Inequality constraints

Parameter identification may also be subject to inequality constraints. Applying the theory of Lagrange multipliers for equality constraints, a simple selection algorithm is proposed for inequality constraints. The equality constraint $h(x)$ in Eq. (6.32) is replaced with the inequality constraint $g(x)$:

\[ L(x, \lambda) \equiv f(x) + g(x)^{T} \lambda. \tag{6.43} \]

As is evident from Eq. (6.42), a positive value of $g(x^K)$ would lead to an increase in $\Delta x^K$. When $g(x^K) < 0$, the constraint is satisfied, and in that case there is no need for further search. The selection algorithm is exemplified by

    if (g(x^K) < 0)        g'(x^K) = 0
    else if (g(x^K) < 1)   g'(x^K) = -sqrt(g(x^K))
    else                   g'(x^K) = -g(x^K)^2

These relations indicate that, when the constraint is violated, the search is greatly improved by square-rooting the value of $g(x^K)$ if $0 < g(x^K) < 1$ and squaring the value of $g(x^K)$ if $g(x^K) > 1$. This process has several advantages, including keeping in check the ideal solution as explained in the next section.

Equality and inequality constraints

With both constraints involved, the Lagrangian function becomes

\[ L(x, \lambda, \mu) \equiv f(x) + h(x)^{T} \lambda + g(x)^{T} \mu \tag{6.44} \]

where $\lambda$ and $\mu$ are the Lagrange multipliers for equality and inequality constraints, respectively. The solution is now denoted by $[x^*, \lambda^*, \mu^*]$. Every iteration gives $[\Delta x^K, \lambda^{K+1}, \mu^{K+1}]$ by

\[ \begin{bmatrix} \Delta x^K \\ \lambda^{K+1} \\ \mu^{K+1} \end{bmatrix} = \begin{bmatrix} \nabla_{xx} L(x^K, \lambda^K) & \nabla_x h(x^K)^{T} & \nabla_x g(x^K)^{T} \\ \nabla_x h(x^K) & 0 & 0 \\ \nabla_x g(x^K) & 0 & 0 \end{bmatrix}^{-1} \begin{bmatrix} -\nabla_x f(x^K) \\ -h(x^K) \\ -g'(x^K) \end{bmatrix} \tag{6.45} \]

where $g'(x^K)$ is given by the inequality selection algorithm described earlier.
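The selection algorithm for inequality constraints can be written as a small function. The sketch below follows the three branches and the sign convention exactly as they appear in the pseudocode above; the function name `select_inequality` is hypothetical, and scalar constraint values are assumed.

```python
import math


def select_inequality(g):
    """Transformed constraint value g' used in place of g in Eq. (6.45).

    g < 0       : constraint satisfied, contributes nothing
    0 <= g < 1  : small violation, magnitude enlarged by the square root
    g >= 1      : large violation, magnitude enlarged by squaring
    """
    if g < 0:
        return 0.0
    elif g < 1:
        return -math.sqrt(g)
    else:
        return -(g ** 2)


if __name__ == "__main__":
    for g in (-0.5, 0.25, 3.0):
        print(g, select_inequality(g))
```

In both violated branches the magnitude of the returned value is at least as large as the raw violation (for example 0.5 for a violation of 0.25, and 9 for a violation of 3), which is how the rule strengthens the corrective step.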

6.5 Numerical Examples

6.5.1 Regularized Parameter Identification in Two-dimensional Parameter Space

In order to confirm its suitability for finding Pareto-optimal solutions and the increase of solutions over iterations, MOGM-LM was first used to identify two parameters by minimizing a simple objective function where the exact set of solutions is known and can be seen visually in two-dimensional space. In this example, the function is given by a simple quadratic function:

\[ f_1(x) = \sum_{i=1}^{n} x_i^2, \tag{6.46} \]

where $n = 2$, and the set of parameters $x \in \mathbb{R}^2$ is subject to inequality constraint (6.4) with $x_{\min} = [-50, -50]$ and $x_{\max} = [50, 50]$. The solution of the problem is clearly $x^* = [0, 0]$, but, to apply regularization, it is assumed that the solution is known to be adjacent to $[2, 4]$. A Tikhonov regularization term is also added as another objective function:

\[ f_2(x) = \sum_{i=1}^{n} (x_i - z_i)^2, \tag{6.47} \]

where $z = [2, 4] \in \mathbb{R}^2$. The problem therefore becomes to minimize functions (6.46) and (6.47). The exact solution for this problem can be determined analytically and is given by the space

\[ X^* = \{ x \mid x = r z, \ \forall r \in [0, 1] \}. \tag{6.48} \]

We can evaluate the performance of the proposed technique by comparing the computed Pareto-optimal solutions to the exact solution space. Values of major parameters for MOGM-LM used to solve the problem are listed in Table 6.1. $\xi_1$, $\xi_2$ and $\tau$ are parameters used to find the step length $\alpha^K$.

Table 6.1. Parameters for MOGM-LM

    Parameter   Value
    p           10
    ξ1          0.4
    ξ2          0.5
    τ           0.7

Figures 6.5(a)-(d) show the computed Pareto-optimal set in $f_1$-$f_2$ space (left) and $x_1$-$x_2$ space (right) at the 5th, 15th, 25th and 50th iterations respectively, together with the exact solution space. The results indicate that some
good approximate solutions appear after 5 iterations and well distributed solutions have already been obtained in the 15th iteration. The number of Pareto-optimal solutions with respect to the number of iterations can be seen in Figure 6.6. One can see that the number of solutions increases rapidly and saturates around the 15th iteration. The increase is due to the Pareto-pooling technique, and this contributes to visualizing the solution space. The saturation and good distribution of the solutions, on the other hand, are controlled by the input resolution, which avoids delaying the optimization process by pooling too many Pareto-optimal solutions. The final single solution selected by the CoGM was [1.021, 2.008], which is very close to the center of gravity of the exact solution space.

[Fig. 6.5. Pareto-optimal set for Example I: computed and exact solutions in $f_1$-$f_2$ space (left) and $x_1$-$x_2$ space (right) at the (a) 5th, (b) 15th, (c) 25th and (d) 50th iteration.]

[Fig. 6.6. Number of computed solutions vs number of iterations.]

To demonstrate the effectiveness of the proposed method, the same problem was solved using conventional MOEAs [12]. Figure 6.7 shows the resulting Pareto-optimal solutions in $f_1$-$f_2$ space (left) and $x_1$-$x_2$ space (right) at the 50th, 250th and 1000th iteration. Due to the robust but inefficient search of evolutionary algorithms, the result at the 50th iteration is poor with MOEAs. The result at the 250th iteration is still significantly inaccurate and, because of premature convergence, the result continues to be inaccurate at the 1000th iteration.

6.5.2 Regularized Parameter Identification within a General Parameter Space

To see the capability of MOGM-LM in general multi-dimensional parameter space, the second example deals with five parameters ($n = 5$) where the parameter space is constrained by inequality (6.4) with $x_{\min} = [-50, -50, -50, -50, -50]$ and $x_{\max} = [50, 50, 50, 50, 50]$. The objective function and Tikhonov regularization term to be minimized are given by functions (6.46) and (6.47), respectively, where $z = [2, 3, 4, 5, 6]$. The exact solution space is therefore given by Eq. (6.48). Again, the parameters listed in Table 6.1 were used for MOGM-LM.

Figures 6.8(a)-(d) show the Pareto-optimal solutions computed in $f_1$-$f_2$ space and $x_1$-$x_3$ space at the 5th, 15th, 25th and 50th iterations together with the exact solution. Although the search space is much larger than in the previous example, the Pareto-optimal set which well describes the exact solution was already found after 15 iterations. The result at the 25th iteration of this five-dimensional problem is much better than the result of the two-dimensional problem by MOEAs at the 1000th iteration, underscoring the efficacy of MOGM-LM compared to MOEAs.

The Pareto-optimal solutions of the two-objective optimization problem with functions (6.46) and (6.47) obtained by MOGM-LM contain a solution of the single-objective optimization problem with one of the functions. In order to evaluate the performance of MOGM-LM in finding a solution of a single-objective optimization problem through multi-objective optimization, the single-objective optimization problem with function (6.46) was solved using the most popular single-objective optimization method, Sequential Quadratic Programming (SQP), and the result was compared to the Pareto-optimal solution by MOGM-LM. The Pareto-optimal solution used for comparison was the one that produced the minimal value of function (6.46) when both the functions (6.46) and (6.47) were minimized. Figure 6.9 shows the minimal values with respect to iterations by both SQP and MOGM-LM. The figure shows the superiority of SQP to MOGM-LM. This is due to the random selection of MOGM-LM in Eq. (6.26), which does not take place in the single-objective optimization by SQP. However, one may more importantly conclude that the solution by MOGM-LM is also close to the exact solution, the mean square error being of the order of $10^{-2}$. As MOGM-LM can find many other Pareto-optimal solutions, the effectiveness of MOGM-LM is clearer from this result.

6.5.3 Regularized Parameter Identification with a Multimodal Objective Function

The appropriate performance of the proposed technique for identification with a simple objective function having been demonstrated, the identification with a multimodal objective function, which is more likely in real applications, was investigated. In this Example III, the objective function has an additional term to Eq. (6.46) and is given by

\[ f_1(x) = 10 n + \sum_{i=1}^{n} \left( x_i^2 - 10 \cos x_i \right), \tag{6.49} \]

where $n = 5$. The cosine term clearly makes the function multimodal with a number of local minima, so that the function has been used as a good example for a multimodal continuous function [7]. Again, Eq. (6.47) was used as the Tikhonov regularization term, and Table 6.1 as MOGM-LM parameters.
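To illustrate why the multimodality of Eq. (6.49) is troublesome for a single-start gradient search, the sketch below evaluates the function as it is written in this section (cosine of $x_i$, with no additional factor inside the cosine, which is an assumption about the printed form) and runs a plain fixed-step steepest descent from two starting points. The step size, iteration count and function names are illustrative choices, not settings from the chapter.

```python
import math


def f1(x):
    """Eq. (6.49) as written here: 10 n + sum(x_i^2 - 10 cos x_i)."""
    return 10 * len(x) + sum(xi * xi - 10 * math.cos(xi) for xi in x)


def grad_f1(x):
    """Gradient of f1: 2 x_i + 10 sin x_i for each component."""
    return [2 * xi + 10 * math.sin(xi) for xi in x]


def steepest_descent(x, step=0.05, iters=500):
    """Fixed-step steepest descent (no line search, for illustration)."""
    for _ in range(iters):
        g = grad_f1(x)
        x = [xi - step * gi for xi, gi in zip(x, g)]
    return x


if __name__ == "__main__":
    near = steepest_descent([0.5] * 5)   # starts inside the global basin
    far = steepest_descent([5.0] * 5)    # starts inside a local basin
    print(near, f1(near))
    print(far, f1(far))
```

The run started at 0.5 reaches the global minimum at the origin, whereas the run started at 5.0 settles in a neighbouring local minimum with a much larger function value. Searching from multiple points with randomized weights, as MOGM-LM does, is one way of reducing this sensitivity to the starting point.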

[Fig. 6.7. Pareto-optimal set by MOEAs for Example I: computed and exact solutions in $f_1$-$f_2$ space (left) and $x_1$-$x_2$ space (right) at the (a) 50th, (b) 250th and (c) 1000th iteration.]

[Fig. 6.8. Pareto-optimal set for Example II: computed and exact solutions in $f_1$-$f_2$ space (left) and $x_1$-$x_3$ space (right) at the (a) 5th, (b) 15th, (c) 25th and (d) 50th iteration.]

[Fig. 6.9. Minimal value of objective function $f_1$ (logarithmic scale) vs number of iterations for SQP and MOGM-LM.]

Figure 6.10 shows the resultant Pareto-optimal solutions at the 50th iteration. Due to the complexity of $f_1$, it is seen that the Pareto-optimal solutions are spread over the parameter space. The coarse distribution of solutions with small $f_1$ is caused by its complexity compared to $f_2$. The appropriateness of the solutions can be verified by observing that Pareto-optimal solutions are found near the two distinct exact solutions [0, 0] and [2, 4]. The solution of the optimization of such a multimodal objective function by SQP diverges or oscillates. The probabilistic formulation enables MOGM-LM to find solutions with such a function and visualize the solution space.

[Fig. 6.10. Pareto-optimal set for Example III: (a) function space ($f_1$-$f_2$), (b) parameter space ($x_1$-$x_3$).]
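The Pareto pooling of Section 6.3.3 and the CoGM selection of Section 6.3.4, which produce the pooled sets and the single solutions reported for the examples above, can be sketched as follows. The sketch assumes that the a priori resolution is a Euclidean distance measured in parameter space, which the chapter does not specify, and the function names are hypothetical. The usage example reuses the objectives (6.46) and (6.47) of Example I with $z = [2, 4]$.

```python
import math


def dominates(fa, fb):
    """True if fa dominates fb (minimization)."""
    return (all(a <= b for a, b in zip(fa, fb))
            and any(a < b for a, b in zip(fa, fb)))


def update_pool(pool, candidates, objectives, resolution):
    """Merge candidates into the pool P(K) to obtain P(K+1).

    Candidates closer than `resolution` to a stored point are discarded;
    afterwards only mutually non-dominated points are kept.
    """
    merged = list(pool)
    for x in candidates:
        if all(math.dist(x, y) >= resolution for y in merged):
            merged.append(x)
    F = [objectives(x) for x in merged]
    return [x for i, x in enumerate(merged)
            if not any(dominates(F[j], F[i])
                       for j in range(len(merged)) if j != i)]


def centre_of_gravity_solution(pool):
    """Eq. (6.20): unweighted mean of the pool, and the pooled point
    closest to it, which is returned as the single selected solution."""
    q, n = len(pool), len(pool[0])
    cog = [sum(x[k] for x in pool) / q for k in range(n)]
    return min(pool, key=lambda x: math.dist(x, cog)), cog


if __name__ == "__main__":
    z = [2.0, 4.0]

    def objectives(x):
        return (sum(v * v for v in x),
                sum((v - c) ** 2 for v, c in zip(x, z)))

    pool = update_pool([], [[r * 2.0, r * 4.0] for r in (0.0, 0.25, 0.5, 0.75, 1.0)],
                       objectives, 0.01)
    pool = update_pool(pool, [[3.0, 3.0], [1.0, 2.001]], objectives, 0.01)
    print(pool)
    print(centre_of_gravity_solution(pool))
```

For points sampled evenly on the exact solution space (6.48) of Example I, the centre of gravity is $z/2 = [1, 2]$, which is consistent with the CoGM result reported in Section 6.5.1.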

6.5.4 Regularized Parameter Identification of Viscoplastic Material Models

Finally, the proposed technique was applied to a practical parameter identification problem of a material model. Let stress and strain be represented by $\sigma$ and $\epsilon$; the problem in the robust least square formulation is given by

\[ f_1(x) = \sum_i \left\| \hat{\sigma}(\epsilon_i^*, x) - \sigma_i^* \right\|^2 \tag{6.50} \]

where $[\epsilon_i^*, \sigma_i^*]$ are a set of experimental stress-strain data and $\sigma = \hat{\sigma}(\epsilon, x)$ is a material model having x as material parameters. In the mechanical tests of material, stress-strain data can be derived as experimental data. Measurement errors in the mechanical tests are relatively small, but the difficulty of solving this problem is created by the complex description of the material model.

The material model used in the numerical example is the Chaboche model [6], which can describe the major material behaviors of viscosity and cyclic plasticity accurately and is thus used to model a variety of metallic materials. The model under stationary temperature and uniaxial load conditions is of the form:

\[ \dot{\epsilon}^{vp} = \left\langle \frac{|\sigma - \chi| - R}{K} \right\rangle^{n} \mathrm{sgn}(\sigma - \chi) \tag{6.51} \]

\[ \dot{\chi} = H \dot{\epsilon}^{vp} - D \chi \left| \dot{\epsilon}^{vp} \right| \tag{6.52} \]

\[ \dot{R} = h \left| \dot{\epsilon}^{vp} \right| - d R \left| \dot{\epsilon}^{vp} \right| \tag{6.53} \]

where state variables $[\epsilon^{vp}, \chi, R]$ are the viscoplastic strain, kinematic hardening and isotropic hardening, [K, n, H, D, h, d] are inelastic material parameters, and $\langle \cdot \rangle$ is the McCauley bracket [10, 11]. In the model, given strain $\epsilon$ as a control input and the initial condition of state variables $[\epsilon^{vp}|_{t=0}, \chi|_{t=0}, R|_{t=0}] = [\epsilon^{vp}_0, \chi_0, R_0]$, the viscoplastic strain with respect to time can be derived iteratively, and the stress can be ultimately calculated using

\[ \sigma = E \left( \epsilon - \epsilon^{vp} \right) \tag{6.54} \]

where E is the elastic modulus. The stress-strain relationship therefore cannot be explicitly written as described in Eq. (6.50). The parameters often unknown are the inelastic material parameters [K, n, H, D, h, d] plus the initial condition of the isotropic hardening variable R0.

To facilitate the analysis of identification, the numerical example uses pseudo-experimental data of cyclic plasticity shown in Fig. 6.11, created from the Chaboche model with a set of parameters described in Table 6.2. We will hence identify only parameters which influence cyclic plasticity, i.e., x = [R0, K, H], assuming that the others are exactly known. Material models are formulated by considering the effect of each parameter, and therefore we often have a coarse estimate on the values of the material parameters x = [R0, K, H]. This gives the Tikhonov regularization term expressed by Eq. (6.47) as the second objective function f2.

Fig. 6.11. Pseudo-experimental data of cyclic plasticity (stress [MPa] vs strain [%]).

Table 6.2. Parameters for Chaboche model

Parameter         R0    K     n    H      D     h     d
Exact             50    100   3    5000   100   300   0.6
Coarse estimate   45    95    -    4800   -     -     -
Known             -     -     3    -      100   300   0.6

The parameters used for optimization were again those in Table 6.1. Figure 6.12(a) depicts the Pareto-optimal solutions in function space obtained from the parameter identification after 20 iterations. A total of 234 solutions are obtained, and it is easily seen in the figure that the solutions are well distributed. Respectively shown in Figs. 6.12(b), (c) and (d) are the solutions in the R0-K, K-H and H-R0 parameter spaces. All the figures show that the solutions are distributed along the straight line linking the exact values and the initial coarse estimates. This result indicates that the proposed technique could find appropriate Pareto-optimal solutions for this problem.

Fig. 6.12. Pareto-optimal set for material parameter identification: (a) f1 - f2 (mean square error vs Tikhonov regularization), (b) R0 - K, (c) K - H, (d) H - R0.

6.6 Conclusions

A weightless regularized identification technique and a multi-objective optimization method, MOGM-LM, which can search for solutions efficiently for this class of problems, have been proposed. The use of a multi-objective method allows for the derivation of the whole solution set of the problem rather than a single solution to be derived by one optimization. The user can select a single solution later by applying CoGM.

After the Pareto-optimality of solutions derived by MOGM-LM was confirmed with a simple example, the proposed technique was applied to identification problems including material parameter identification, and the technique could find appropriate solutions in all the problems. The searching capability of the technique was also compared to that of a single-objective optimization method, and its superiority has been demonstrated. Conclusively, the overall effectiveness of the proposed technique for parameter identification has been confirmed.

One of the issues that have not been discussed thoroughly is the selection of a final solution. The CoGM described in this chapter is one of the many possible techniques. Because the solution of an inverse problem is never known, many other techniques can be thought of. One of the firm steps to take is to incorporate prior knowledge and solve the problem stochastically. Other issues include the improvement of the technique for high-dimensional problems since the Chaboche model described in this chapter, for instance, contains 23 parameters in its multiaxial formulation.
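As an illustration of how pseudo-experimental stress-strain data can be produced from the uniaxial Chaboche model of Section 6.5.4, the sketch below integrates Eqs. (6.51)-(6.54) with a forward-Euler scheme under a triangular strain wave. It is a sketch under stated assumptions, not the authors' implementation: the elastic modulus, strain amplitude, strain rate, time step and the function name `simulate_chaboche` are all hypothetical choices, and only the "Exact" parameter values come from Table 6.2.

```python
import math


def simulate_chaboche(params, E, strain_amp, strain_rate, cycles=1, dt=1e-4):
    """Forward-Euler integration of Eqs. (6.51)-(6.54), triangular strain input.

    params = (R0, K, n, H, D, h, d). E, strain_amp, strain_rate, cycles and dt
    are illustrative assumptions, not values taken from the chapter.
    Returns the (strain, stress) history and the final isotropic hardening R.
    """
    R0, K, n, H, D, h, d = params
    eps = eps_vp = chi = 0.0          # total strain, viscoplastic strain, kinematic hardening
    R = R0                            # isotropic hardening starts at R0
    direction = 1.0                   # sign of the imposed strain rate
    steps = int(round(4.0 * cycles * strain_amp / strain_rate / dt))
    history = []
    for _ in range(steps):
        sigma = E * (eps - eps_vp)                                  # Eq. (6.54)
        overstress = max((abs(sigma - chi) - R) / K, 0.0)           # McCauley bracket
        rate = overstress ** n * math.copysign(1.0, sigma - chi)    # Eq. (6.51)
        chi_new = chi + dt * (H * rate - D * chi * abs(rate))       # Eq. (6.52)
        R += dt * (h - d * R) * abs(rate)                           # Eq. (6.53)
        chi = chi_new
        eps_vp += dt * rate
        history.append((eps, sigma))
        eps += dt * direction * strain_rate                         # advance the control input
        if abs(eps) >= strain_amp:                                  # reverse at the strain limits
            direction = -math.copysign(1.0, eps)
    return history, R


exact = (50.0, 100.0, 3, 5000.0, 100.0, 300.0, 0.6)   # "Exact" row of Table 6.2
history, R_final = simulate_chaboche(exact, E=200000.0, strain_amp=5e-4, strain_rate=1e-3)
print(len(history), R_final)
```

The objective f1 of Eq. (6.50) would then be the sum of squared differences between such a simulated stress history and the measured one at the same strain points. An explicit scheme is used only for brevity; a stiffer parameter set would call for a smaller step or an implicit integrator.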

5:247-254.H.W. Schwefel. Kitagawa. Discussion and Generalisation.J. 1989 [14] C. Jap J Appl Math. 1993 [9] C. University of Dortmund. Coello. Michigan. P. References [1] M. Dulikravich GS (eds). 1971 [26] C. New York. An Analysis of the Behaviour of a Class of Genetic Adaptive Systems. Yagawa. 40:1071-1090. Goldberg. Morgan Kaufmann. Zehnder. fundamentals and applica- tions of nonlinear programming. Academic Press. The University of Michigan Press. 1998 [25] H. Furukawa. Fleming PJ. Fonseca. Kitagawa. Evol Comp. Yagawa. 1975 [18] Y. Holland.. 1988 [21] T. Furukawa. 2001 [13] D. Yoshimura. Academic Press. University of Michigan.M. T. Honjo. Bard. Furukawa. Sys-1/92. De Jong. Baumeister. P. Kitagawa. 3(1):1-16. 1999 [6] J. New York. 1993 [3] Y. SIAM Review 34(4):561-580. Int J Plast. 1989 [7] K. Braunschweig. Methods in Estimating the Optimal Regularization Parameters. Int J Knowl Info Sys. Int J Numer Meth Eng.P. 1974 [4] J.A. H. Baifu-kan (in Japanese). In: Tanaka M. 1998 [12] T. 1998 [19] T. Inverse Problems in Engineering Mechanics. 1995 [10] T. H. San Mateo. 1992 [17] J. Genetic Algorithms and Evolution Strategies: Simi- larities and Differences. Boston. Elsevier Sci- ence. Bäck.M. N. Characterization of the Tikhonov Regulariza- tion for Numerical Analysis of Inverse Boundary Value Problems by Using the Singular Value Decomposition. Hoffmeister. Springer Verlag. 1987 [5] C. 25-35. In: Yamaguti M (ed). Genetic Algorithms for Multi-objective Opti- misation: Formulation. Ohji. Matching Objective and Subjective Information in Geotech- nical Inverse Analysis Based on Entropy Minimization. Optimization and Machine Learn- ing. Takahashi.C.A. 37-42. Fonseca. C. 263-271. Germany. Ph. Kubo. Lee. G. 1993 [24] S.J. Furukawa. Evol Comp. 1(1):1-23. 52:219-238.D Thesis. Bäck. Kubo. Groetsche. Macmillian. 2005 (in print) . Inverse Problems in Engineering Mechanics. MA.. 1991 [22] T. Pro- ceedings of the Fifth International Conference on Genetic Algorithms. 
Inverse Problems in Mathematical Engineering. In: Tanaka M. 1(1):1-25. 1971 [2] T. CA. Int J Numer Meth Eng. Int J Numer Meth Eng. Introduction to Optimizaion Techniques. Reading. Int J Numer Meth Eng. Adaptation in Natural and Artificial Systems. 1992 [23] S. 1987 [20] T. The Theory of Tikhonov Regularization for Fredholm Integral Equation of the First Kind. Furukawa et al. Hansen. In: Kubo S (ed) Inverse Problems. Technical Report. Du- likravich GS (eds).P. 4:371-379. Pitman. Inverse Problems. K. Kitagawa.148 T. Vieweg. J Info Proc. 43:195-219. A Comparison between Two Classes of the Method for the Op- timal Regularization. Kudo.G. Addison-Wesley. Chaboche. 11:263-270. Tzschach. T. Genetic Algorithms in Search.L. Aoki. Kunzi. S. G. 416-423.J. Fleming. Numerical Methods of Mathematical Optimization. Atlanta Technology Publications. Stable Solution of Inverse Problems. 337-344. T. 1984 [15] P. Nonlinear Parameter Estimation. In: Forrest S (ed). 1975 [8] C. 1997 [11] T. 1992 [16] F.K.

Y. Springer- Verlag. J.T. 2000 [30] W. Inverse Problems in Engineering Mechanics. Principles of Optimal Design.J. 6 Regularization for Parameter Identification 149 [27] R. In: Tanaka M. 1994 [28] V. 137-144.Y. SIAM J Sci Comp. Reginska. 17:740-749.H. Numerical Recipes in C. An Efficient Numerical Algorithm with Adaptive Regulariza- tion for Parameter Estimations. V. Mahnken. In: Bui T. Elsevier Science 299-308. Press. New York. Morozov. 1996 [32] A. Teukolsky.P. E. Inverse Problems in Engi- neering Mechanics. Dulikravich GS (eds). Tanaka M (eds). Stein. Solutions to Ill-posed Problems. Vetterling. W. Zhang. 1984 [29] P. John Willy Sons. S. 1977 [33] X. Flannery. 1988 [31] T. Cambridge Uni- versity Press.A. Papalambros. Cambridge University Press. Zhu. Arsenin. Tikhonov. 1998 .A. Methods for Solving Incorrectly Posed Problems. New York. Wilde. Gradient-based Methods for Parameter Identification of Viscoplastic Materials. B. D.N.

7 Multi-Objective Algorithms for Neural Networks Learning

Antônio Pádua Braga1, Ricardo H. C. Takahashi1, Marcelo Azevedo Costa1, and Roselito de Albuquerque Teixeira2

1 Federal University of Minas Gerais apbraga@cpdee.ufmg.br
2 Eastern University Centre of Minas Gerais roselito@unilestemg.br

Summary. Most supervised learning algorithms for Artificial Neural Networks (ANN) aim at minimizing the sum of the squared error of the training data [12, 11, 5, 10]. It is well known that learning algorithms that are based only on error minimization do not guarantee good generalization performance models. In addition to the training set error, some other network-related parameters should be adapted in the learning phase in order to control generalization performance. The need for more than a single objective function paves the way for treating the supervised learning problem with multi-objective optimization techniques. Although the learning problem is multi-objective by nature, only recently it has been given a formal multi-objective optimization treatment [16]. The problem has been treated from different points of view along the last two decades. In this chapter, an approach that explicitly considers the two objectives of minimizing the squared error and the norm of the weight vectors is discussed. The learning task is carried on by minimizing both objectives simultaneously, using vector optimization methods. This leads to a set of solutions that is called the Pareto-optimal set [2], from which the best network for modeling the data is selected. This method is named MOBJ (for Multi-OBJective training).

7.1 Introduction

The different approaches to tackle the supervised learning problem usually include error minimization and some sort of network complexity control, where complexity can be structural or apparent. Structural complexity is associated to the number of network parameters (weights) and apparent complexity is associated to the network response, regardless of its size. Thus, a large structural complexity network may have a small apparent complexity if it behaves like a lower order model. If the network is over-sized it is possible to control its apparent complexity so that it behaves properly.

A.P. Braga et al.: Multi-Objective Algorithms for Neural Networks Learning, Studies in Computational Intelligence (SCI) 16, 151–171 (2006)
www.springerlink.com © Springer-Verlag Berlin Heidelberg 2006

Structural complexity control can be accomplished by shrinking (pruning) [9, 4, 8] or growing (constructive) methods [24, 25], whereas apparent complexity control can be achieved by cross-validation [26], smoothing (regularization) [17] and by a restricted search into the set of possible solutions. The latter includes the multi-objective approach that will be described in this chapter [16].

The concepts of flexibility and rigidity are also embodied by the notion of complexity. The larger the network complexity, the higher its flexibility to fit the data. On the other extreme, the lower its complexity, the larger its rigidity to adapt itself to the data set. A model that is too rigid tends to concentrate its responses into a limited region, whereas a flexible one spans its possible solutions into a wider area of the solutions space. These concepts are also related to the bias-variance dilemma [27], since a flexible model has a large variance and a rigid one is biased. The essence of our problem is therefore to obtain a proper balance between error and complexity (structural or apparent).

Restrictions to the apparent complexity can be obtained by limiting the value of the norm of the weight vectors for a given network solution. The MOBJ training approach considers two simultaneous objectives: minimizing the squared error and the norm of the weight vectors, using vector optimization methods. This leads to a set of solutions that is called the Pareto-optimal set [2], from which the best network for modeling the data is selected. The step of finding the Pareto-optimal set can be interpreted as a way for reducing the search space to a one-dimensional set of candidate solutions, from which the best one is to be chosen. This one-dimensional set exactly follows a trade-off direction between flexibility and rigidity, which means that it can be used for reaching a suitable compromise solution [16].

7.2 Multi-objective Learning Approach

The proposed MOBJ method is composed of the following steps:

Vector optimization step: From the training data, find the Pareto-optimal solutions.
Decision step: Using a set of validation data, choose the optimal solution from the Pareto-optimal set.

The structure of the problem of ANN learning imposes this two-step procedure, which corresponds to the scheme of multi-objective optimization with a posteriori decision. The main principle that is behind these two steps is: some data is employed in order to adjust the model parameters (the training data). These data cannot be re-used for the purpose of choosing a model that does not over-fit the data (fitting the noise too), since the models fit the data by construction. Therefore, other data (the validation data) must be employed in the model selection step, from a set of candidate models.

The first step when using multi-objective optimization, in an a posteriori decision scheme, is to obtain the Pareto-optimal set [2], which contains the set of efficient solutions X*. Let f1(x) designate the sum of squared errors, and let f2(x) designate the norm of the weighting vector. The set X* is defined as:

\[ X^* = \left\{ x \;\middle|\; \nexists\, \bar{x} \neq x \ \text{such that}\ \big( f_1(\bar{x}) \le f_1(x) \ \text{and}\ f_2(\bar{x}) < f_2(x) \big) \ \text{or}\ \big( f_1(\bar{x}) < f_1(x) \ \text{and}\ f_2(\bar{x}) \le f_2(x) \big) \right\} \tag{7.1} \]

Solutions belonging to the Pareto-optimal set cannot be improved considering both objective functions simultaneously. This is in fact the definition of the Pareto-optimal set. Several algorithms are known in the literature to deal with this multi-objective approach [2]. The second step aims at selecting the most appropriate solution within the Pareto-optimal set, which is used here to obtain a good compromise between the two conflicting objectives: error and norm.

7.2.1 Vector Optimization Step

A variation of the ε-constraint problem, proposed by Takahashi et al. [14], was adopted. The algorithm intrinsically avoids the generation of non-feasible solutions, which increases its efficiency. The Pareto set is obtained by first obtaining its two extremes, which are formed by the underfitted and overfitted solutions. Figure 7.1 shows the two extremes of the Pareto set, denoted by f1* and f2*, which correspond to the overfitted and underfitted solutions, respectively. f1* is obtained by training a network with a standard training algorithm, such as Backpropagation [12]. f2* is trivial, since it is obtained by making all network weights equal to zero. Intermediate solutions obtained by the multiobjective (MOBJ) algorithm are called Pareto-optimal (Fig. 7.1).

Fig. 7.1. Pareto-optimal set (axes: e² and ||w||; the extremes f1* and f2*, the point f* and the MOBJ solutions are marked).
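The definition of the Pareto-optimal set in Eq. (7.1) can be read as a filter over a finite list of candidate solutions: a candidate is kept if no other candidate is at least as good in both objectives and strictly better in one. The sketch below is a minimal illustration of that filter; the function names and the sample (error, norm) pairs are hypothetical and not taken from the chapter.

```python
def dominates(a, b):
    """True if objective vector a dominates b in the sense of Eq. (7.1):
    a is no worse than b in every objective and strictly better in at least one."""
    no_worse = all(ai <= bi for ai, bi in zip(a, b))
    strictly_better = any(ai < bi for ai, bi in zip(a, b))
    return no_worse and strictly_better


def pareto_set(points):
    """Keep only the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical (squared error, weight norm) pairs for four trained networks.
candidates = [(0.1, 9.0), (0.5, 4.0), (0.6, 5.0), (1.0, 0.0)]
print(pareto_set(candidates))
```

For two objectives, "no worse in both and strictly better in one" is exactly the disjunction of the two conditions written in Eq. (7.1). The quadratic scan is adequate for the few tens of candidates considered in this chapter.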

The next step aims at selecting the most appropriate solution within the Pareto-optimal set. Several algorithms are described in the literature to deal with this multi-objective problem [2]. In this work, a variation of the ε-constraint problem, called relaxation method [14], was adopted. Its formulation is presented next.

• $f^* \in \mathbb{R}^m$ is the objective vector corresponding to the "utopian solution" [2] of the problem.
• $\phi_i^*$ is the ith objective optimum.
• $\phi_i$ is the value of the ith objective in any point.
• $f_i^* \in \mathbb{R}^m$, i = 1, ..., m, is the vector formed by the optimal solution of the individual objective i and the values corresponding to the other objective functions.
• C is the cone generated by vectors $(f_i^* - f^*)$, with origin in $f^*$.
• $v_k \in C$ is a vector constructed according to Eq. (7.2), which performs a convex combination of the individual objective vectors.

\[ v_k = f^* + \gamma_k (f_1^* - f^*) + (1 - \gamma_k)(f_2^* - f^*), \tag{7.2} \]

for $0 \le \gamma_k \le 1$. Equation (7.2) results always in a vector within the cone (in the objective space) of feasible solutions. For every $\gamma_k$ there is a vector $v_k$ within this cone that results in a Pareto solution (a vector on the Pareto set). In the simulations presented in this chapter, $\gamma_k$ is initialized with zero and then incremented to one, so that the Pareto set can be generated from subsequent vectors $v_k$.

The objectives "sum of squared errors" and "norm of the weight vector", considered in the optimization problem, are described by Eqs. (7.3) and (7.4), respectively.

\[ f_1(w) = \frac{1}{N} \sum_{j=1}^{N} \left( d_j - y(w, x_j) \right)^2 \tag{7.3} \]

\[ f_2(w) = \| w \| \tag{7.4} \]

where w is the ANN weight vector, N is the training set size, $d_j$ and $y(w, x_j)$ are, respectively, the desired output and current output at iteration j and $x_j$ is the input pattern. The ANN output is denoted by $y(w, x_j)$. Equations (7.5)–(7.9) show how $f_1^*$, $f_2^*$ and $f^*$ can be obtained.

\[ w_1^* = \arg\min f_1 \tag{7.5} \]

\[ f_1^* = \begin{pmatrix} \phi_1^* \\ \phi_2 \end{pmatrix}, \quad \phi_1^* = f_1(w_1^*) \ \text{and}\ \phi_2 = f_2(w_1^*) \tag{7.6} \]

\[ w_2^* = \arg\min f_2 \tag{7.7} \]

\[ f_2^* = \begin{pmatrix} \phi_1 \\ \phi_2^* \end{pmatrix}, \quad \phi_1 = f_1(w_2^*) \ \text{and}\ \phi_2^* = f_2(w_2^*) \tag{7.8} \]

The vector $w_2^*$ can be obtained easily and the vector $w_1^*$ can be obtained by the Backpropagation algorithm. The utopian solution denoted by $f^*$ is a vector formed by two elements according to Eq. (7.9): $\phi_1^*$, the minimum value of the objective function $f_1$, and $\phi_2^*$, the minimum value of the objective function $f_2$.

\[ f^* = \begin{pmatrix} \phi_1^* \\ \phi_2^* \end{pmatrix} \tag{7.9} \]

The multi-objective problem can be redefined now as a single-objective one, by considering the multiple objectives as constraints to the optimization algorithm. The multi-objective problem can now be described by Eqs. (7.10) and (7.11):

\[ w^* = \arg_w \min_{w, \eta} \eta \tag{7.10} \]

\[ \text{subject to:} \quad f_i(w) \le \phi_i^* + \eta\, v_{ki}, \quad i = 1, 2 \tag{7.11} \]

where η is the auxiliary variable. Substituting Eqs. (7.3) and (7.4) into Eq. (7.11) leads to the constrained problem described by Eqs. (7.12) and (7.13):

\[ w^* = \arg_w \min_{w, \eta} \eta \tag{7.12} \]

\[ \text{subject to:} \quad \begin{cases} \dfrac{1}{N} \displaystyle\sum_{j=1}^{N} \left( d_j - y(w, x_j) \right)^2 - \phi_1^* - \eta\, v_{k1} \le 0 \\[2ex] \| w \| - \phi_2^* - \eta\, v_{k2} \le 0 \end{cases} \tag{7.13} \]

The problem can be solved by a constrained optimization method such as the "ellipsoidal algorithm" [13].

As can be observed in the graph of Fig. 7.1, the two extremes of the Pareto set are the two opposite solutions to the minimization problem. At one extreme, the solution $f_1^*$ yields small mean square error with large norm weight vectors, which would result in poor generalization (overfitting). At the other extreme, the solution $f_2^*$ would result in minimum (zero) norm with large error (underfitting). The well balanced solution is picked up on the Pareto set, between the two extremes, via the decision step.
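The construction of the vectors v_k in Eq. (7.2) from the two extremes can be sketched as follows. The sketch only builds the utopian point of Eq. (7.9) and sweeps γ_k from zero to one; it does not solve the constrained problem of Eqs. (7.12) and (7.13). The function names and the numeric values of the extremes are hypothetical.

```python
def utopian_point(f1_star, f2_star):
    """Eq. (7.9): best attainable value of each objective, taken from the two extremes.

    f1_star = (phi1*, phi2) is the minimum-error extreme of Eq. (7.6);
    f2_star = (phi1, phi2*) is the minimum-norm extreme of Eq. (7.8).
    """
    return (f1_star[0], f2_star[1])


def direction_vector(f1_star, f2_star, gamma):
    """Eq. (7.2): v_k = f* + gamma (f1* - f*) + (1 - gamma) (f2* - f*)."""
    f_star = utopian_point(f1_star, f2_star)
    return tuple(fs + gamma * (a - fs) + (1.0 - gamma) * (b - fs)
                 for fs, a, b in zip(f_star, f1_star, f2_star))


def sweep(f1_star, f2_star, n_solutions):
    """gamma_k goes from 0 to 1 in equal steps, one v_k per Pareto candidate."""
    return [direction_vector(f1_star, f2_star, k / (n_solutions - 1))
            for k in range(n_solutions)]


# Hypothetical extremes: (error, norm) of the overfitted and the zero-weight networks.
f1_star = (0.25, 16.0)
f2_star = (1.0, 0.0)
for v in sweep(f1_star, f2_star, 5):
    print(v)
```

Expanding Eq. (7.2) shows that the f* terms cancel, so v_k equals γ_k f1* + (1 - γ_k) f2*: the sweep moves along the segment joining the two extremes, starting at f2* for γ_k = 0 and ending at f1* for γ_k = 1.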

7.2.2 Decision Step

After the Pareto-optimal solutions have been found, the best ANN among them must be chosen. This procedure can be performed via a very simple algorithm that relies on the comparison of the solutions through validation data:

1. Simulate the response of all candidate ANNs for a set of validation data points.
2. Compute the sum of squared errors of each candidate ANN, for these data points.
3. Choose the ANN with smallest error.

7.2.3 A Note on the Two-step Algorithm

The MOBJ algorithm, as presented above, has used a simple schematic procedure that has been divided in two separate steps: the vector optimization and the decision. In fact, a more efficient algorithm can be built, that iteratively takes two points of the Pareto-set and makes one decision step. This algorithm would approximate arbitrarily the best ANN, with possibly less computational effort. See details in [19].

7.2.4 The Sliding Mode Approach

The original multi-objective algorithm [16] fits a neural network model by simultaneous optimization of two cost functions, the sum of squared error and the norm of the weights vector. The Multi-Objective Sliding Mode (SMC-MOBJ) algorithm applies sliding mode control to the network training. Pre-established values for error and norm are set as training targets. The algorithm allows any solution within the solution space to be reached as long as it is previously selected. A particular case occurs when a null error is defined for arbitrary values of norm, which represents Pareto solution targets. Once the Pareto-optimal candidates are found, the decision step is done with validation data as described in section 7.2.2.

The SMC-MOBJ algorithm uses two sliding surfaces to lead the error and norm cost functions to a pre-established coordinate $(E_t, \|w_t\|)$ in the cost functions space. The first surface is defined as the difference between the actual network error function at the k-th iteration and the target error ($S_{E(k)} = E_{(k)} - E_t$). The second surface is the difference between the actual and the target norm functions ($S_{\|w_{(k)}\|} = \|w_{(k)}\|^2 - \|w_t\|^2$). The final weight update equation is based on two descent gradients related to the error and norm functions. Both are controlled by their respective sliding surface signals, leading the network to the target point. The theoretical setting of the gains adds robustness to the convergence and guarantees convergence even from adverse initial conditions. Equation 7.14 defines the weight update formula according to the SMC-MOBJ proposition.

\[ \Delta w_{ji(k)} = -\alpha\, \mathrm{sgn}\!\left( S_{E(k)} \right) \frac{\partial E_{(k)}}{\partial w_{ji(k)}} - \beta\, \mathrm{sgn}\!\left( S_{\|w_{(k)}\|} \right) w_{ji(k)} \tag{7.14} \]

where $S_{E(k)}$ and $S_{\|w_{(k)}\|}$ are the sliding surfaces, α and β are the respective gains calculated with sliding mode theory and sgn(.) is the sign function:

\[ \mathrm{sgn}(S) = \begin{cases} -1 & \text{if } S < 0 \\ 0 & \text{if } S = 0 \\ +1 & \text{if } S > 0 \end{cases} \]

Multiple solutions can be generated from different targets or trajectories in the space. Approximations to the Pareto set are achieved by setting targets with null error cost function ($E_t = 0$) and pre-established norm values, or through arbitrary trajectories that cross the Pareto boundary as shown in Figure 7.2.

Fig. 7.2. Arbitrary trajectory in the plane of objectives (MSE vs ||w||; shown are the estimated Pareto set with the original algorithm, the target trajectory and the resulting trajectory with sliding mode control from $(E_0, \|w_0\|)$ to $(E_t, \|w_t\|)$).

7.3 Constructive and Pruning Methods with Multi-objective Learning

The Multi-objective algorithm for training neural networks [16] is based on the idea of picking up a network solution within the restricted set of the Optimal Pareto set. The solutions generated by the MOBJ algorithm can be tuned to be oversized and yet with good generalization performance. The final solution represents a network whose weight amplitudes were properly set during training. Therefore, the norm of the weights is optimal since it represents a solution with minimal validation error and the network is suitable for pruning techniques. If networks are known to be oversized in advance, their structure can be shrunk. Original pruning algorithms for MOBJ training [22] are based on: linearization of the hidden nodes, similarity of hidden nodes responses, random sampling of weights and, finally, a mixture of some of these methods. Another approach consists of a constructive algorithm based on the MOBJ algorithm.

7.3.1 Network Growing

This constructive algorithm is based on the idea of gradually increasing the number of nodes of a MLP and, for each new neural network, a Pareto set is generated by one of the current multi-objective approaches. As the number of nodes increases, the Pareto set shape in the space of objectives shifts left, converging to a stable form, as can be seen in the sketch presented in Figure 7.3. The Pareto set shape is nearly the same regardless of network complexity. In this case the Pareto shape can be seen as a threshold for hidden layer growing. Based on this principle, training stops as training set error stabilizes.

Fig. 7.3. Pareto set behaviour with increasing number of hidden nodes (axes: MSE and ||W||).

The multi-objective growing algorithm searches for a solution with improved generalization performance within the three-dimensional space defined by the following functions:

1. Error function ($E_{(w)}$)
2. Norm function (||w||)
3. Number of Hidden nodes (H)

In this case, the optimization problem is described according to Equation 7.15:

\[ w^* = \arg\min_{w \in N(H)} \begin{pmatrix} E_{(w)} \\ \| w \| \\ H \end{pmatrix} \tag{7.15} \]

7.3.2 Linearization of Hidden Nodes

A non-linear node with hyperbolic tangent activation function may have a linear output response, for a limited range of the outputs. This means that, when f(−ψ) < f(x) < f(+ψ), the sigmoidal activation function could be replaced by a linear one. For every network node, the output range is calculated and, if it is within the pre-established range, it is substituted by a linear function within that range. The effect of linearizing hidden nodes is that the mapping between network input and output can be performed by a single linear transformation, simplifying network structure. The new network has a mixed structure with non-linear and linear hidden nodes. Equation 7.16 expresses a MLP's output equation for a two layer perceptron network with H hidden nodes, N inputs, p outputs and L linearized hidden nodes (L ≤ H).

\[ y_p = f_p\!\left( \sum_{i=1,\, i \neq t}^{H} w2_{ip}\, f_i\!\left( \sum_{j=1}^{N} x_j\, w1_{ji} + b1_i \right) + \sum_{j=1}^{N} x_j\, wln_{jp} + b2_p \right) \tag{7.16} \]

where $wln_{jp} = \sum_{t=1}^{L} w1_{jt}\, w2_{tp}$, $b2_p = b2_p + \sum_{t=1}^{L} b1_t\, w2_{tp}$, and $f_p$ and $f_i$ are the activation functions.

7.3.3 Similar Nodes Response Simplification

This method works by identifying pairs of hidden nodes whose responses are highly correlated. The nodes could be replaced by a single one. In order to measure the similarity between two hidden nodes r and t, the norm of the difference between their responses to the training set input vectors is calculated, as shown in Equation 7.17.

\[ \| f_r^H - f_t^H \| \approx 0 \tag{7.17} \]

The algorithm consists of finding the pair of hidden nodes with minimum distance between their outputs for the whole training set. The norms of the weight vectors that connect each one of the two nodes to the output layer are then computed, the node with the smallest norm is chosen to be pruned, and the remaining output weights are updated with the sum of the two hidden nodes' output weights. The stop criterion is based on the validation error. Pruning stops when it starts to increase.
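One pruning pass of the similar-nodes method of Section 7.3.3 can be sketched as below, assuming the hidden-node responses over the training set and the hidden-to-output weights are available as plain lists. The function names, the data layout and the sample numbers are hypothetical; the validation-error stop criterion is left to the caller.

```python
import math


def most_similar_pair(responses):
    """responses[r] holds the outputs of hidden node r over the training set.
    Returns (distance, r, t) for the pair minimising the norm in Eq. (7.17)."""
    best = None
    for r in range(len(responses)):
        for t in range(r + 1, len(responses)):
            dist = math.sqrt(sum((a - b) ** 2
                                 for a, b in zip(responses[r], responses[t])))
            if best is None or dist < best[0]:
                best = (dist, r, t)
    return best


def merge_pair(output_weights, r, t):
    """output_weights[i] lists the weights from hidden node i to each output.
    Prunes the node of the pair with the smaller output-weight norm and gives
    the surviving node the sum of both nodes' output weights."""
    def norm(w):
        return math.sqrt(sum(v * v for v in w))

    if norm(output_weights[r]) <= norm(output_weights[t]):
        drop, keep = r, t
    else:
        drop, keep = t, r
    merged = [a + b for a, b in zip(output_weights[keep], output_weights[drop])]
    new_weights = [merged if i == keep else w
                   for i, w in enumerate(output_weights) if i != drop]
    return new_weights, drop


# Three hypothetical hidden nodes evaluated on three training patterns.
responses = [[0.1, 0.9, -0.5], [0.8, -0.2, 0.3], [0.1, 0.9, -0.4]]
output_weights = [[2.0], [1.0], [0.5]]
dist, r, t = most_similar_pair(responses)
print(merge_pair(output_weights, r, t))
```

In a full implementation the pass would be repeated, re-evaluating the validation error after each merge and stopping as soon as that error starts to increase.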

7.3.4 Pruning Randomly Selected Weights

This method consists of pruning randomly selected weights. If the extraction results in a decrease of the validation error, the node is pruned, otherwise it is maintained. Due to its simplicity, this method has a good balance between algorithm complexity and network performance. For large MLPs, the performance may decrease, depending on the number of weights to be pruned, which is a user-defined parameter.

7.3.5 Mixed Method

A combination of the previously described methods is used to simplify network structures. The execution order of the methods is not restricted but the following order is suggested:

1. Linearization.
2. Similar nodes response.
3. Pruning randomly selected weights.

The methods are tested in the sequence above and, for each algorithm, weights and nodes are pruned. The final topology may have linear connections between inputs and outputs and also a reduced number of hidden nodes with pruned connections.

7.4 Classification Problem

In order to test the algorithm's performance in classification tasks, two examples were selected. The first one consists of data sampled from two Gaussian distributions with mean µ1 = (2, 2) (class A) and µ2 = (4, 4) (class B) and variance σ² = 1.5². The training set has 80 input-output patterns with 40 examples from each class. A network with over-estimated size with topology 2–50–1 was used in order to allow complexity control by the learning algorithms. The ANNs with topology 2–50–1 were trained by standard Backpropagation (BP), Weight Decay (WD), SVM [3] algorithms and the proposed multi-objective approach (MOBJ). The decay constant for the weight decay algorithm was chosen equal to 0.0004. For the SVM algorithm, the chosen kernel was RBF with variance equal to 6 and the limit of the Lagrange multipliers was 10. Several trials were carried out in order to obtain appropriate WD and SVM solutions. The MOBJ algorithm was executed in order to generate 20 Pareto-optimal solutions, from which the final MOBJ solution is chosen. Figure 7.4 shows the twenty generated solutions. Figure 7.5 shows the solutions generated by all methods in the objective space and the Pareto-optimal set generated by the MOBJ algorithm. Note that BP and WD solutions are far from the Pareto-optimal set. It means that these solutions are not Pareto-optimal solutions and therefore they could still be minimized considering both objectives. It is also evident that the MOBJ solution has the smaller norm value.

Fig. 7.4. MOBJ classification solutions (classes A and B in the x1-x2 plane, with the BP solution and the MOBJ solutions).

Fig. 7.5. Solutions within the norm versus MSE space (Pareto-optimal set with the BP, WD, minimal norm, utopian, MOBJ and best MOBJ solutions).

Figure 7.6 shows also the decision regions obtained by a 2-50-1 MLP trained with BP, WD, SVM and MOBJ algorithms. Overfitting can be clearly observed in the standard Backpropagation response, since it tends to separate every single component of each class. Several trials were carried out in order to choose an appropriate decay constant for the WD solution. The one presented in the graph was the best one obtained. The SVM approach was also evaluated. The MOBJ and SVM solutions are quite similar but the generation of the SVM solution demanded selecting network and training parameters, such as the limit of the Lagrange multipliers, which was not obtained in a straightforward way. The SVM norm comparison was not provided due to the respective network structure (RBF) that does not have input weights.

Fig. 7.6. Classification problem. BP, WD, ES and MOBJ solutions.

An important property of the multi-objective approach is the characteristic of the decision surfaces in classification problems. Since it reduces the norm of the weight vector, the MOBJ surfaces are smoother than those of other methods, which avoids overfitting. Although in the previous example only the SVM solution had this behavior, it is not guaranteed that any of the other methods will have similar responses to the MOBJ approach, mainly because their generalization control is sensitive to their parameters, initial weights and number of iterations, differently from the MOBJ algorithm, which is insensitive to the initialization process and parameters.

A second example is called the Chess-Board classification problem, which is a two class problem where the data is spread into the bi-dimensional space as a 4x4 chess board. The data distribution is shown in Figure 7.9 and consists of 960 patterns. The classes present a narrow overlapping in the boundaries. ANNs with 2-50-1 topology were trained with BP, WD, OBD (Optimal Brain Damage), ES, CV and MOBJ. Figure 7.10 shows the solutions as well as the approximated Pareto set. In this example the CV, ES, and WD solutions were very close to the Pareto-set. Figure 7.7 and Figure 7.8 show the backpropagation and MOBJ solutions respectively, where the smoothness of the MOBJ solution can be observed.
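A data set with the structure described for the Chess-Board problem (two classes on a 4x4 board, 960 patterns, narrow overlap at the boundaries) could be generated along the lines of the sketch below. The text does not give the board coordinates, the number of patterns per cell or the amount of overlap, so the unit cells, the even split of 60 patterns per cell, the Gaussian jitter and the function name are all assumptions made for illustration.

```python
import random


def chess_board(n_per_cell=60, cells=4, jitter=0.02, seed=0):
    """Two-class chess-board data: the label alternates with the cell parity.

    n_per_cell, the unit cell size, the jitter width and the seed are
    illustrative assumptions; 16 cells x 60 patterns gives 960 patterns.
    """
    rng = random.Random(seed)
    data = []
    for row in range(cells):
        for col in range(cells):
            label = (row + col) % 2          # neighbouring cells get opposite classes
            for _ in range(n_per_cell):
                # uniform position inside the cell plus a small jitter, so that
                # a few points cross the cell boundary (narrow class overlap)
                x1 = col + rng.random() + rng.gauss(0.0, jitter)
                x2 = row + rng.random() + rng.gauss(0.0, jitter)
                data.append(((x1, x2), label))
    rng.shuffle(data)
    return data


patterns = chess_board()
print(len(patterns), sum(label for _, label in patterns))
```

Fixing the seed keeps the set reproducible, which is convenient when the same patterns are repeatedly divided at random into training and validation subsets.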

[Fig. 7.7. BP decision surface.]

[Fig. 7.8. MOBJ decision surface.]

Figures 7.11(a) and 7.11(b) show the decision surfaces generated with the MOBJ and BP algorithms. The BP surface is rougher than the MOBJ surface, which is smoother. The BP algorithm aims at minimizing the MSE, which overfits the data, in contrast with the MOBJ, that takes into account the generalization degree measured as the validation error. In addition to the validation set, the MOBJ approach starts with an underfitted network which is gradually fitted to the data according to the Pareto's principle [18].

Table 7.1 displays the percentage of patterns correctly classified for 100 sets generated from the original set (960 patterns), which was randomly divided into training and validation sets. For multiple tests, all the algorithm

5 1 0. SVM and MOBJ algo- rithms. The MOBJ approach had a slightly better performance. 7. 2. The Chess Board solutions performed good classifications with closer means.5 −0.P.5 0 −0.10.2 M SE Fig.164 A.9.5 Regression Problem A MLP with topology 1-50-1 was trained by BP.5 0 0.5 Class 1 Class 2 2 x2 1. The networks were trained with the 40 noisy patterns sampled from the function described by Eq.7 1. 7. Braga et al.5 x1 Fig. which was generated by adding a zero- . (7.5 1 1. WD. 7.18).5 2 2. The Chess Board data set 70 MOBJ solutions Best MOBJ 60 solution with minimum norm BP (out of the axis) WD OBD(out of the axis) 50 ES CV 40 w 30 20 10 0 0.2 0.

5 2 1 1.3525% MOBJ 77. The decay constant to the weight decay algorithm has been chosen equals to 0.8 and the chosen limit of Lagrange multipliers was 1. when we tried to model the steep region.5 −0. 7.152 to the original curve. Proportions of patterns correctly identified for 100 simulations sets Algorithms Percentage σ Backpropagation 75.5 −0.5 (a) MOBJ solution (b) BP solution Fig.7312% ±1.5 1 0. As can be observed. Figure 7.11.18) (1 + x2 ) The MOBJ algorithm was executed to generate 20 Pareto-optimal solu- tions.2293% Early stopping 77. since the target function has both steep and smooth regions. 7 Multi-Objective Algorithms for Neural Networks Learning 165 1 1 0.5 0. Figure 7.3177% Cross-validation 77.00015.13 shows each algorithm solution. when we changed the parameters in order to model the smooth region.5 −0.5 0.5 2 1. the chose kernel was RBF with variance equals to 0. the BP.5 0 0 0 0 −0.9635% ±1.5 2.5 1.2566% Optimal Brain Damage 77.7604% ±1.5 0.5 −1 −1 2.1.5 0 0 −0.5 2 2 2. For the SVM algorithm. To carry on this approximation is spe- cially difficult for some algorithms.2863% mean normally distributed random noise with variance of 0. Decision surface for BP and MOBJ algorithms Table 7. In the design of the SVM solution.8958% ±1.2031% ±1. on the other hand. the steep one was not well approximated and.7583% ±1.5 −0.5 1 1.9667% ±1. The weight decay and SVM parameters were not obtained easily once many attempts were necessary.3257% SVM 77. WD and SVM solutions presented some degrees of overfitting.3103% Weight decay 77.5 1 0. (x − 2) (2x + 1) f (x) = (7.12 shows the twenty generated solutions.5 2. overfitting .

occurred in the smooth region. The MOBJ algorithm reached a solution with high generalization capacity independently of the user choices.

[Fig. 7.12. MOBJ solutions. Plot "Regression − MOBJ Algorithm": training set, function f(x), BP solution and MOBJ solutions; axes x and y.]

[Fig. 7.13. BP, WD, SVM and MOBJ solution. Plot: training set, function f(x), MOBJ solution, WD solution, BP solution and SVM solution.]

Although the MOBJ algorithm works by reducing norm, the reduction is not carried on with the same magnitude for all the network weights. This differs from weight decay, for example, that applies the same decay constant for all weights, what penalizes mainly the large ones. Figure 7.14 shows that in the multi-objective approach it is possible to reduce the norm of some weight vectors while other vectors have their norm increased. This characteristic of the MOBJ algorithm yields more flexibility to neural networks, since the resulting model can be smooth in some regions and steep in other ones.

[Fig. 7.14. MOBJ and BP weight vectors. Panels "Weights − MOBJ Algorithm" and "Weights − BP Algorithm": w1 and w2 vectors for each algorithm.]

7.6 Pruning Problem

Four real based data sets picked-up from PROBEN [23], consisting of three classification and one function approximation problems, were used to test the Pruning methods with Multi-objective learning. The data sets cancer, card and gene were used for classification and the building data set was used for function approximation. The data sets were divided into training and validation sets with 60% and 40% of the whole data, respectively.

Table 7.2 presents the results for oversized network with the MOBJ algorithm, without pruning. The networks are the best generalization solutions obtained without reducing the network number of weights and nodes. These initial solutions were used further as a reference for the pruning methods. The pruning methods are capable to reduce these initial over-sized networks without loss in validation performance.

Table 7.2. Initial results with the Multi-Objective algorithm

Data set   Training Error   Validation Error   Oversized Topology   ||w||
cancer     98.09%           98.57%             9-15-2               14.263
card       88.89%           88.04%             51-15-2              4.413
gene       98.85%           89.53%             120-15-3             8.831
building   0.0196           0.0228             14-15-3              6.992

For classification problems the training and validation error are the percentage of patterns correctly assigned, and they represent the mean squared error for the function approximation data set.

168 A. since the number of inputs was reduced from 51 to 14.5 shows the results for the Randomly Selected Weights method. without loss in generalization performance. what reduced the network into a single layer network with non-linear nodes. Table 7.57% 9-15-2 9-2-2 9-2 card 88.04% 51-15-2 .4.3.57% 9-15-2 9-6-2 card 88. For the card data set the reduction was of 11 nodes in the hidden layer.0189 0. There was no reduction for the gene data set and reduction only one node for the building data set.53% 120-15-3 120-15-3 building 0. no linearization was possible.89% 88.4. Similar nodes response identification results Data Training Validation Initial Final Set Error Error Topology Topology cancer 96. The network for the building data set had also a significant reduction in size and number of inputs (inputs were reduced to 4).P.85% 89.85% 89. For the card data set. The results for the card data set showed that some input variables are not relevant for solving the problem. For the gene data set.0234 0. weights were pruned throughout the network. all the hidden nodes of the original network were linearized. Table 7.3. Results for the Linearization method are presented in Table 7. Results for Linearization of hidden nodes Data Training Validation Initial Final Topology Set Error Error Topology non-linear linear cancer 96. . The result for the cancer data set reduced the network to a similar structure to the one obtained by linearization and Method 3. Topologies with pruned weights are indicated with an asterisk (*). The network for the cancer data set was reduced to a 9-6-2 topology. what represented a reduction of 9 nodes in the hidden layer.04% 51-15-2 51-4-2 gene 98.18% 98. The number of hidden nodes was reduced to 12. that was reduced from 120 inputs to only 44.42% 98. A similar result was obtained for the gene data set. in contrast with the 15 nodes of the original network.53% 120-15-3 120-15-3 - building 0. 
In addition to the final 9-4-2 topology.0227 14-15-3 14-14-3 Table 7. Braga et al.89% 88.0224 14-15-3 14-7-3 14-3 Similar nodes response identification results are presented in Table 7. 51-2 gene 98. The final network replaced the original two-layers non-linear network. The final network was also reduced in the hidden layer to 9 nodes. The Final Topology presents the number of linear and non-linear nodes between inputs and outputs of the final network.

A training algorithm for optimal margin classifiers. The building data set had also significant reduction in the number of inputs. For the cancer data set.6. Table 7. . Fifth Annual Workshop on Computational Learning Theory.64% 9-4-2* card 85. Results for arranged pruning methods Data Training Validation Initial Final Topology Set Error Error Topology non-linear linear cancer 96.0257 0.0164 14-15-3 4-7-3* 1-2 7. Boser. and V.18% 99.27% 88. 8-2 gene 91.7 Conclusions The multi-objective algorithms described are able to obtain good generaliza- tion solutions for regression and classification problems.18% 91. the best results occurs when the methods are mixed.18% 99. The solution is obtained by a restricted search over the space of objectives norm and error.26% 120-15-3 20-8-3* 1-1 building 0. 1992.0257 0.31% 44-12-3* building 0. This can be justified by the fact that it can take the best features of each one.99% 88. pages 144–152.08% 90.6. The problem could be solved with a single hidden layer with 8 input variables. Current research aim at describing new decision strategies and at extending MOBJ concepts to other machine learning approaches.41% 14-9-2* gene 90.5.64% 9-15-2 9-2-2* 3-2 card 85. Guyon. Vapnik. 7 Multi-Objective Algorithms for Neural Networks Learning 169 Table 7. There was an impressive reduction in the card data set.77% 51-15-2 . The gene data set problem had the best reduction in the number of inputs of all methods: the original 120 variables was reduced to only 20. The algorithms do not demand external user parameters in order to generate the Pareto-optimal set and to choose the best solution. Randomly selected weights Data Training Validation Final Set Error Error Topology cancer 96. 13 hidden nodes were pruned. the final network has also 3 linear hidden nodes. I. References [1] B.0165 4-8-3* As shown in Table 7.

Rumelhart.P. [7] S. Hinton. Haykin. April 1993. Cybernetics. Ferreira. Z. . North-Holland (Elsevier). Artificial Intelligence. Neural Networks: A Comprehensive Foundation. Hinton. 20:273– 279. C. C. 1977. and T. H2/h-infinity multiobjective pid design. pages 586–591. 1896. pages 598–605. Parma. 1995. [12] D. C. [10] Gustavo G. Techni- cal report. G. [16] R. John S. Teixeira. Control Systems of Variable Structure. [19] R. IEEE Control Systems Magazine. IEEE Transactions on Neural Networks. Cut-off method with space extension in convex programming problems. 8th Brazilian Symposium on Neural Networks. Braga. [13] N. New York. Gunn. A. 1989. E. Neurocomputing. Saldanha. editors. Connectionist learning procedures. 38(1):97–98. 1999. 1989. Denker. 2000. [17] G. R. E. November 2004. A. [20] U. Saldanha. [2] V. June 1997. Lausanne. H. Sliding mode algorithm for training multi-layer neural networks. Pittsburg. H. In Proc. G. CA. 323:533–536. Takahashi. Cours D’Economie Politique. Antonio P. Proceedings of the 1988 Connectionist Models Summer School. Nature. Conf. Support vector machines for classification and regression. Mozer and P. J. H. Menezes. Haimes. Prentice Hall. D. and R. [8] Ehud D. Machine Learning. Keter Publishing House Jerusalem LTD. A. 1983. A. Morgan Kaufmann. [3] C. 40:185-234. January 1998. Takahashi. vol. San Francisco.170 A. Rouse. Cortes and V. A direct adaptive method for faster backpropagation learning: The RPROP algorithm. Karnin. [4] Yann Le Cun. P. Touretzky. Uti- lização de seção áurea no cálculo de soluções eficientes para treinamento de redes neurais artificiais através de otimização multi-objetivo. pages 38–51. Sejnowski. 1995. Improv- ing generalization of mlps with multi-objective optimization. 1976. Learning representations by back-propagating errors. P. 17(5):37–47. 1997. A simple procedure for pruning back-propagation trained neural networks. Support vector networks. and Benjamim R. Optimal brain damage. R. IEE Electronics Letters. 
Springer-Verlag. Braga. 35(1–4):189–194. Hinton. volume 8. V. and Sara A. Smolensky. of the IEEE Intl. E. CA. Chankong and Y. and R. R. 1990. R. P. Image Speech and Intelligent Systems Research Group. In D. 1(2):239–242. [5] S. [18] V. [6] S. Fahlman. A. Vapnik. I and II. pages 107–115. In Advances in Neural Information Processing Systems 2. on Neural Networks. [15] V. San Mateo. R. Multiobjective Decision Making: Theory and Methodology. Y. Braga. Braga et al. Shor. Solla. 1990. and P. Takahashi. University of Southampton. A. vols. Teixeira. 1986. 1. [9] M. [14] R. Faster-learning variations on back-propagation: an empirical study. L. Vapnik. The Nature of Statistical Learning Theory. Peres. Skeletonization: A technique for trimming the fat from a network via relevance assessment. Williams. Advances in Neural Information Processing. and R. 12:94–96. 1988. [11] Martin Riedmiller and Heinrich Braun. Pareto. C. Itkis.

Bienenstock and R. [26] Ron Kohavi. Improving neural networks generalization with new constructive and pruning methods. Study of a growth algorithm for a feedforward network. 2003. Neural Networks and the Bias/Variance Dilemma. A.edu/∼mlearn/MLRepository. [23] C. P. 7 Multi-Objective Algorithms for Neural Networks Learning 171 [21] M. [24] S. A. 1990 [25] Jean-Pierre Nadal. [22] M. Blake and C. Journal of Intel- ligent & Fuzzy Systems. G. de Menezes. http://www. The cascade-correlation learning architecture. Geman and E. 10:1-9. Dept. Morgan Kaufmann.edu/105046. Costa. Braga and B.uci. Irvine. A. Teix- eira. Merz. Neural Computation. Lebiere.html. . Fahlman and C. 1998. International Journal of Neural Systems. 1(1):55-59. 51:467-473. In Advances in Neural Information Processing Systems 2 (D. Costa. de Menezes. R. of Information and Computer Sciences. Doursat.ist. Neurocomputing. Editor). P. 1995.L. Training neural networks with a multi-objective sliding mode control algorithm. Braga.ics. and R. A. E. S. A. 1992. B. A Study of Cross-Validation and Bootstrap for Accuracy Es- timation and Model Selection. http://citeseer. [27] S.psu.J. Touretzky. 4(1):1-58. 2003. 1989. R. {UCI} Repository of machine learn- ing databases. html. G. University of California. Parma.

8 Generating Support Vector Machines Using Multi-Objective Optimization and Goal Programming

Hirotaka Nakayama1 and Yeboon Yun2

1 Konan University, Dept. of Information Science and Systems Engineering, 8-9-1 Okamoto, Higashinada, Kobe 658-8501, Japan, nakayama@konan-u.ac.jp
2 Kagawa University, Kagawa 761-0396, Japan, yun@eng.kagawa-u.ac.jp

H. Nakayama and Y. Yun: Generating Support Vector Machines Using Multi-Objective Optimization and Goal Programming, Studies in Computational Intelligence (SCI) 16, 173–198 (2006), www.springerlink.com, © Springer-Verlag Berlin Heidelberg 2006

Summary. Support Vector Machine (SVM) is gaining much popularity as one of effective methods for machine learning in recent years. In pattern classification problems with two class sets, it generalizes linear classifiers into high dimensional feature spaces through nonlinear mappings defined implicitly by kernels in the Hilbert space so that it may produce nonlinear classifiers in the original data space. Linear classifiers then are optimized to give the maximal margin separation between the classes. This task is performed by solving some type of mathematical programming such as quadratic programming (QP) or linear programming (LP). On the other hand, from a viewpoint of mathematical programming for machine learning, the idea of maximal margin separation was employed in the multi-surface method (MSM) suggested by Mangasarian in 1960's. Also, linear classifiers using goal programming were developed extensively in 1980's. This chapter introduces a new family of SVM using multi-objective programming and goal programming (MOP/GP) techniques, and discusses its effectiveness throughout several numerical experiments.

8.1 Introduction

For convenience, we consider pattern classification problems. Let X be a space of conditional attributes. For binary classification problems, the value of +1 or −1 is assigned to each pattern xi ∈ X according to its class A or B. The aim of machine learning is to predict which class newly observed patterns belong to on the basis of the given training data set (xi, yi) (i = 1, . . . , ℓ), where yi = +1 or −1. This is performed by finding a discriminant function f(x) such that f(x) ≥ 0 for x ∈ A and f(x) < 0 for x ∈ B. Linear discriminant functions, in particular, can be expressed by the following linear form

\[
f(x) = w^T x + b
\]

with the property

\[
w^T x + b \geq 0 \ \text{ for } x \in A, \qquad w^T x + b < 0 \ \text{ for } x \in B.
\]

For such a pattern classification problem, artificial neural networks have been widely applied. However, the back propagation method is reduced to nonlinear optimization with multiple local optima, and hence difficult to apply to large scale problems. Another drawback in the back propagation method is in the fact that it is difficult to change the structure adaptively according to the change of environment in incremental learning. Recently, Support Vector Machine (SVM, for short) is attracting interest of researchers, in particular, people who are engaged in mathematical programming, because it is reduced to quadratic programming (QP) or linear programming (LP). One of main features in SVM is that it is a linear classifier with maximal margin on the feature space through nonlinear mappings defined implicitly by kernels in the Hilbert space. The idea of maximal margin in linear classifiers is intuitive, and its reasoning in connection with perceptrons was given in early 1960's (e.g., Novikoff [17]). The maximal margin is effectively applied for discrimination analysis using mathematical programming, e.g., MSM (Multi-Surface Method) by Mangasarian [11]. Later, linear classifiers with maximal margin were formulated as linear goal programming, and extensively studied through 1980's to the beginning of 1990's. The pioneering work was given by Freed-Glover [9], and a good survey can be seen in Erenguc-Koehler [8]. This chapter discusses SVMs using techniques of multi-objective programming (MOP) and goal programming (GP), and proposes several extensions of SVM along MOP/GP.

8.2 Support Vector Machine

Support vector machine (SVM) was developed by Vapnik et al. [6], [22] (see also Cristianini and Shawe-Taylor [7], Schölkopf-Smola [20]) and its main features are

1) SVM maps the original data set into a high dimensional feature space by nonlinear mapping implicitly defined by kernels in the Hilbert space,
2) SVM finds linear classifiers with maximal margin on the feature space,
3) SVM provides an evaluation of the generalization ability using VC dimension.

Namely, in cases where training data set X is not linearly separable, we map the original data set X to a feature space Z by some nonlinear map φ.

Increasing the dimension of the feature space, it is expected that the mapped data set becomes linearly separable. We try to find linear classifiers with maximal margin in the feature space. Letting zi = φ(xi), the separating hyperplane with maximal margin can be given by solving the following problem with the normalization wT z + b = ±1 at points with the minimum interior deviation:

\[
\begin{array}{lll}
\text{minimize} & \|w\| & (\mathrm{SVM}_{\mathrm{hard}})_P \\
\text{subject to} & y_i \left( w^T z_i + b \right) \geq 1, \quad i = 1, \ldots, \ell. &
\end{array}
\]

Several kinds of norm are possible. When ||w||2 is used, the problem is reduced to quadratic programming, while the problem with ||w||1 or ||w||∞ is reduced to linear programming (see, e.g., [12]).

Dual problem of (SVMhard)P with $\frac{1}{2}\|w\|_2^2$ is

\[
\begin{array}{lll}
\text{maximize} & \displaystyle \sum_{i=1}^{\ell} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{\ell} \alpha_i \alpha_j y_i y_j \phi(x_i)^T \phi(x_j) & (\mathrm{SVM}_{\mathrm{hard}})_D \\
\text{subject to} & \displaystyle \sum_{i=1}^{\ell} \alpha_i y_i = 0, & \\
& \alpha_i \geq 0, \quad i = 1, \ldots, \ell. &
\end{array}
\]

Using the kernel function K(x, x′) = φ(x)T φ(x′), the problem (SVMhard)D can be reformulated as follows:

\[
\begin{array}{lll}
\text{maximize} & \displaystyle \sum_{i=1}^{\ell} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{\ell} \alpha_i \alpha_j y_i y_j K(x_i, x_j) & (\mathrm{SVM}_{\mathrm{hard}}) \\
\text{subject to} & \displaystyle \sum_{i=1}^{\ell} \alpha_i y_i = 0, & \\
& \alpha_i \geq 0, \quad i = 1, \ldots, \ell. &
\end{array}
\]

Several kinds of kernel functions have been suggested: among them, the q-polynomial

\[
K(x, x') = (x^T x' + 1)^q
\]

and the Gaussian

\[
K(x, x') = \exp\left( - \frac{\|x - x'\|^2}{r^2} \right)
\]

are most popularly used.
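The two kernels named above can be sketched in a few lines. This is only an illustration under stated assumptions: the function names, the sample points and the explicit feature map `phi_quadratic_1d` (valid only for q = 2 and scalar inputs) are hypothetical and do not come from the chapter.

```python
import math


def dot(x, z):
    """Plain inner product x^T z for two equal-length sequences."""
    return sum(a * b for a, b in zip(x, z))


def polynomial_kernel(x, z, q=2):
    """q-polynomial kernel K(x, z) = (x^T z + 1)^q."""
    return (dot(x, z) + 1.0) ** q


def gaussian_kernel(x, z, r=1.0):
    """Gaussian kernel K(x, z) = exp(-||x - z||^2 / r^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-sq_dist / r ** 2)


def phi_quadratic_1d(x):
    """Explicit feature map for q = 2 and scalar x: (x^2, sqrt(2) x, 1)."""
    return (x * x, math.sqrt(2.0) * x, 1.0)


def gram_matrix(points, kernel):
    """Matrix of kernel values K(x_i, x_j) used in the dual problem."""
    return [[kernel(p, s) for s in points] for p in points]


points = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0)]  # illustrative data only
K_poly = gram_matrix(points, polynomial_kernel)
K_gauss = gram_matrix(points, gaussian_kernel)

# For q = 2 in one dimension the kernel equals an explicit inner product,
# which is the sense in which the mapping is "implicitly defined".
x, z = 1.5, -0.5
kernel_value = polynomial_kernel((x,), (z,), q=2)
explicit_value = dot(phi_quadratic_1d(x), phi_quadratic_1d(z))
```

The dual problem only ever needs the entries of the Gram matrix, so the feature map itself never has to be formed for higher q or for the Gaussian kernel.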

8.3 Review of Multi-objective Programming and Goal Programming

Multi-objective programming (MOP) problems are formulated as follows:

\[
\text{(MOP)} \qquad \text{Maximize } g(x) \equiv (g_1(x), g_2(x), \ldots, g_p(x)) \quad \text{over } x \in X.
\]

The constraint set X may be given by cj(x) ≤ 0, j = 1, . . . , m, and/or a subset of R^n itself.

For the problem (MOP), Pareto solutions are candidates of final decision (x̂ is said Pareto optimal, if there is no better solution x ∈ X other than x̂). In general, there may be many Pareto solutions. The final decision is made among them taking the total balance over all criteria into account. This is a problem of value judgment of decision maker (in abbreviation, DM). The totally balancing over criteria is usually called trade-off. It is important to help DM to trade-off easily in practical decision making problems.

There have been developed several kinds of methods for multi-objective programming (see, e.g., Steuer [21], Chankong-Haimes [4], Sawaragi-Nakayama-Tanino [18], Nakayama [15], Miettinen [14]). Among them, interactive multi-objective programming methods, which were developed remarkably in 1980's, have been observed to be effective in various fields of practical problems. Those methods search a solution in an interactive way with DM while eliciting information on his/her value judgment.

On the other hand, Goal Programming (GP) was developed by Charnes-Cooper [5] much earlier than interactive programming methods. The idea was originated from getting rid of no feasible solution in usual mathematical programming. Namely, many constraints should be regarded as "goal" to be attained, and we try to find a solution which attains those goals as much as possible. For example, suppose that we want to make gi(x) ≥ ḡi, i = 1, . . . , p. Introducing the degree of overattainment (or surplus, or interior deviation) ηi and the degree of unattainment (or slackness, or exterior deviation) ξi, we have the following goal programming formulation:

\[
\begin{array}{lll}
\text{minimize} & \displaystyle \sum_{i=1}^{p} h_i \xi_i & (\mathrm{GP}_0) \\
\text{subject to} & g_i(x) - \bar{g}_i = \eta_i - \xi_i, & \\
& \xi_i, \ \eta_i \geq 0, \quad i = 1, \ldots, p, & \\
& x \in X, &
\end{array}
\]

where hi (i = 1, . . . , p) are positive weighting parameters which are given by DMs.

It should be noted that in order for ηi and ξi in the above formulation to have the meaning of the degree of overattainment and the degree of unattainment, respectively, the relation ξi · ηi = 0 has to be satisfied. The above formulation assures this property due to the following lemma (Lemma 7.3.1 of [18]):

Lemma 1. Let ξ and η be vectors of R^p. Then consider the following problem:

\[
\begin{array}{ll}
\text{minimize} & P(\xi, \eta) \\
\text{subject to} & g_i(x) - \bar{g}_i = \eta_i - \xi_i, \\
& \xi_i, \ \eta_i \geq 0, \quad i = 1, \ldots, p, \\
& x \in X.
\end{array}
\]

Suppose that the function P is monotonically increasing with respect to elements of ξ and η and strictly monotonically increasing with respect to at least either ξi or ηi for each i (i = 1, . . . , p). Then, the solution ξ̂ and η̂ to the preceding problem satisfy ξ̂i η̂i = 0, i = 1, . . . , p.

In the original formulation of goal programming, once a solution which attains every goal is found, no efforts are made for further improvement. Therefore, the obtained solution by goal programming is not necessarily Pareto optimal. This is due to the fact that the idea of goal programming is based on "satisficing" rather than "optimization". In order to overcome this difficulty, we can put the degree of overattainment in the objective function in (GP0) as follows:

\[
\begin{array}{lll}
\text{minimize} & \displaystyle \sum_{i=1}^{p} h_i \xi_i - \sum_{i=1}^{p} k_i \eta_i & (\mathrm{GP}_1) \\
\text{subject to} & g_i(x) - \bar{g}_i = \eta_i - \xi_i, & \\
& \xi_i, \ \eta_i \geq 0, \quad i = 1, \ldots, p, & \\
& x \in X. &
\end{array}
\]

Note that if the relation hi > ki for each i = 1, . . . , p holds, then the relation ξi ηi = 0 for each i = 1, . . . , p is satisfied at the solution. This follows in a similar fashion to Lemma 1 by considering

\[
\sum_{i=1}^{p} h_i \xi_i - \sum_{i=1}^{p} k_i \eta_i = \sum_{i=1}^{p} k_i (\xi_i - \eta_i) + \sum_{i=1}^{p} (h_i - k_i) \xi_i .
\]

Moreover, if ki = hi for each i = 1, . . . , p, then by substituting the right hand side of the equality constraints of (GP1) into the objective function we have


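\[
\begin{array}{lll}
\text{maximize} & \displaystyle \sum_{i=1}^{p} h_i \left( g_i(x) - \bar{g}_i \right) & (\mathrm{MOP/GP}_0) \\
\text{subject to} & x \in X. &
\end{array}
\]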

Since the term −ḡi does not affect the maximization of the objective
function, it can be removed. Namely, the formulation (MOP/GP0) is reduced to
the usual scalarization using the linearly weighted sum in multi-objective
programming.
However, the scalarization by linearly weighted sum has other drawbacks:
e.g., it cannot yield solutions on nonconvex parts of the Pareto frontier. To
overcome this, the formulation of improving the worst level of the objective
functions as much as possible is applied as follows:

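\[
\begin{array}{lll}
\text{maximize} & \eta & (\mathrm{MOP/GP}_1) \\
\text{subject to} & g_i(x) - \bar{g}_i \geq \eta, \quad i = 1, \ldots, p, & \\
& x \in X. &
\end{array}
\]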

The solution to (MOP/GP1 ) is guaranteed to be weakly Pareto optimal.
Further discussion on scalarization functions can be found in the literature
([21], [4], [18], [15], [14]).
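The difference between the two scalarizations can be seen on a tiny finite example. The sketch below is an illustration only: the three candidate objective vectors, the goal levels and the function names are made up for this purpose, and the feasible set is simply enumerated instead of being optimized over.

```python
# Three Pareto optimal objective vectors (both objectives are maximized).
# (0.4, 0.4) lies on a nonconvex part of the frontier: it is below the
# segment joining (1, 0) and (0, 1).
candidates = [(1.0, 0.0), (0.0, 1.0), (0.4, 0.4)]
goal = (0.5, 0.5)  # aspiration levels, chosen for illustration


def weighted_sum_choice(points, h):
    """Linearly weighted sum: maximize sum_i h_i * g_i."""
    return max(points, key=lambda g: sum(hi * gi for hi, gi in zip(h, g)))


def max_min_choice(points, goal_levels):
    """Worst-level improvement: maximize eta = min_i (g_i - goal_i)."""
    return max(points, key=lambda g: min(gi - ai for gi, ai in zip(g, goal_levels)))


# Sweep nonnegative weights that sum to one.
weights = [(k / 20.0, 1.0 - k / 20.0) for k in range(21)]
weighted_picks = {weighted_sum_choice(candidates, h) for h in weights}

max_min_pick = max_min_choice(candidates, goal)
```

In this example no weight vector in the sweep selects (0.4, 0.4), whereas the max-min form with the goal (0.5, 0.5) does, which matches the drawback of the weighted sum described above.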

8.4 MOP/GP Approaches to Pattern Classification

In 1981, Freed-Glover suggested finding a hyperplane that separates two classes
with as few misclassified data as possible by using goal programming [9] (see
also [8]). Let ξi denote the exterior deviation, which is the deviation from the
hyperplane of a point xi improperly classified. Similarly, let ηi denote the
interior deviation, which is the deviation from the hyperplane of a point xi
properly classified. Some of the main objectives in this approach are as follows:

i) Minimize the maximum exterior deviation (decrease errors as
much as possible)
ii) Maximize the minimum interior deviation (i.e., maximize the margin)
iii) Maximize the weighted sum of interior deviation
iv) Minimize the weighted sum of exterior deviation

Although many models have been suggested, the one considering iii) and
iv) above may be given by the following linear goal programming:


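\[
\begin{array}{lll}
\text{minimize} & \displaystyle \sum_{i=1}^{\ell} \left( h_i \xi_i - k_i \eta_i \right) & (\mathrm{GP}) \\
\text{subject to} & y_i \left( x_i^T w + b \right) = \eta_i - \xi_i, & \\
& \xi_i, \ \eta_i \geq 0, \quad i = 1, \ldots, \ell, &
\end{array}
\]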

where, since yi = +1 or −1 according to xi ∈ A or xi ∈ B, the two equations
$x_i^T w + b = \eta_i - \xi_i$ for xi ∈ A and $x_i^T w + b = -\eta_i + \xi_i$ for xi ∈ B can be
reduced to the following one equation

\[
y_i \left( x_i^T w + b \right) = \eta_i - \xi_i .
\]

Here, hi and ki are positive constants. As was stated in the preceding
section, if hi > ki for i = 1, . . . , ℓ, then we have ξi ηi = 0 for every i = 1, . . . , ℓ
at the solution to (GP). Hence, ξi and ηi are assured to have the meaning
of the exterior deviation and the interior deviation, respectively, at the solution.
It should be noted that the above formulation may yield some unacceptable
solutions such as w = 0 and unbounded solution. In the goal programming
approach to linear classifiers, therefore, some appropriate normality condition
must be imposed on w in order to provide a bounded nontrivial optimal
solution. One of such normality conditions is ||w|| = 1.
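Why a normality condition is needed can be checked numerically. The following sketch is an assumption-laden illustration: the four data points, the weights h = 2 and k = 1 and the function names are hypothetical. It evaluates the (GP) objective for a fixed direction w and shows that rescaling (w, b) changes the objective without changing the classifier.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def gp_objective(w, b, X, y, h=2.0, k=1.0):
    """Sum of h*xi_i - k*eta_i, with deviations read off y_i (x_i^T w + b)."""
    total = 0.0
    for x, label in zip(X, y):
        d = label * (dot(x, w) + b)
        eta = max(d, 0.0)   # interior deviation (correct side)
        xi = max(-d, 0.0)   # exterior deviation (wrong side)
        total += h * xi - k * eta
    return total


# Linearly separable toy data, chosen only for illustration.
X = [(2.0, 2.0), (3.0, 1.0), (-1.0, -1.0), (-2.0, 0.0)]
y = [1, 1, -1, -1]
w, b = (1.0, 1.0), 0.0

trivial = gp_objective((0.0, 0.0), 0.0, X, y)   # w = 0 is always feasible
base = gp_objective(w, b, X, y)
scaled = gp_objective(tuple(10.0 * wj for wj in w), 10.0 * b, X, y)

# Imposing ||w|| = 1 removes the free scaling of (w, b).
norm = math.sqrt(dot(w, w))
normalized = gp_objective(tuple(wj / norm for wj in w), b / norm, X, y)
```

Multiplying a separating (w, b) by t > 0 multiplies the objective by t, so without normalization the problem is unbounded below on separable data, while w = 0 gives the uninformative value zero.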
If the classification problem is linearly separable, then using the normal-
ization ||w|| = 1, the separating hyperplane H : wT x + b = 0 with maximal
margin can be given by solving the following problem [3]:

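\[
\begin{array}{lll}
\text{maximize} & \eta & (\mathrm{MOP/GP}_2) \\
\text{subject to} & y_i \left( x_i^T w + b \right) \geq \eta, \quad i = 1, \ldots, \ell, & \\
& \|w\| = 1. &
\end{array}
\]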

However, this normality condition turns the problem into a nonlinear
optimization problem. Instead of maximizing the minimum interior deviation in
(MOP/GP2), we can use the following equivalent formulation with the
normalization xT w + b = ±1 at points with the minimum interior deviation:

\[
\begin{array}{lll}
\text{minimize} & \|w\| & (\mathrm{MOP/GP}_2') \\
\text{subject to} & y_i \left( x_i^T w + b \right) \geq \eta, \quad i = 1, \ldots, \ell, & \\
& \eta = 1. &
\end{array}
\]

This formulation is the same as the one used in SVM.
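The equivalence of the two normalizations can be illustrated for a fixed separating hyperplane. The sketch below is not a solver; the data, the hyperplane (w, b) and the helper names are assumptions made for the example. It only checks that the margin η/||w|| is the same whether one fixes ||w|| = 1 or η = 1.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def functional_margins(w, b, X, y):
    """Values y_i (x_i^T w + b) for every training point."""
    return [label * (dot(x, w) + b) for x, label in zip(X, y)]


X = [(2.0, 2.0), (3.0, 1.0), (-1.0, -1.0), (-2.0, 0.0)]
y = [1, 1, -1, -1]
w, b = (1.0, 1.0), 0.0   # a separating hyperplane, not claimed optimal

# Normalization ||w|| = 1: the margin is eta / ||w||.
eta = min(functional_margins(w, b, X, y))
margin_unit_norm = eta / math.sqrt(dot(w, w))

# Normalization eta = 1: rescale (w, b) so the closest points satisfy
# y_i (x_i^T w + b) = 1; the margin is then 1 / ||w||.
w_scaled = tuple(wj / eta for wj in w)
b_scaled = b / eta
eta_scaled = min(functional_margins(w_scaled, b_scaled, X, y))
margin_unit_eta = 1.0 / math.sqrt(dot(w_scaled, w_scaled))
```

Because both quantities describe the same geometric distance, maximizing η under ||w|| = 1 and minimizing ||w|| under η = 1 pick the same hyperplane, and the second form has a convex objective with linear constraints.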


8.5 Soft Margin SVM
Separating two sets A and B completely is called the hard margin method,
which tends to cause overlearning (overfitting). This implies the hard margin
method is easily affected by noise. In order to overcome this difficulty, the soft
margin method is introduced. The soft margin method allows some slight error which
is represented by slack variables (exterior deviation) ξi (i = 1, . . . , ℓ). Using
the trade-off parameter C between minimizing ||w|| and minimizing $\sum_{i=1}^{\ell} \xi_i$,
we have the following formulation for the soft margin method:

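\[
\begin{array}{lll}
\text{minimize} & \displaystyle \frac{1}{2} \|w\|_2^2 + C \sum_{i=1}^{\ell} \xi_i & (\mathrm{SVM}_{\mathrm{soft}})_P \\
\text{subject to} & y_i \left( w^T z_i + b \right) \geq 1 - \xi_i, & \\
& \xi_i \geq 0, \quad i = 1, \ldots, \ell. &
\end{array}
\]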

Using a kernel function in the dual problem yields


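\[
\begin{array}{lll}
\text{maximize} & \displaystyle \sum_{i=1}^{\ell} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{\ell} \alpha_i \alpha_j y_i y_j K(x_i, x_j) & (\mathrm{SVM}_{\mathrm{soft}}) \\
\text{subject to} & \displaystyle \sum_{i=1}^{\ell} \alpha_i y_i = 0, & \\
& 0 \leq \alpha_i \leq C, \quad i = 1, \ldots, \ell. &
\end{array}
\]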
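For a fixed (w, b), the smallest feasible slack in the soft margin primal is ξi = max(0, 1 − yi(wT zi + b)). The sketch below evaluates the primal objective this way for two values of C. The one-dimensional data (with one deliberately noisy point), the chosen hyperplane and the function name are illustrative assumptions, and no optimization is performed.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def soft_margin_primal(w, b, Z, y, C):
    """Return the primal objective and the minimal slacks for fixed (w, b)."""
    slacks = [max(0.0, 1.0 - label * (dot(w, z) + b)) for z, label in zip(Z, y)]
    objective = 0.5 * dot(w, w) + C * sum(slacks)
    return objective, slacks


# The last point has label -1 but lies on the positive side (noise).
Z = [(2.0,), (1.0,), (-1.0,), (0.5,)]
y = [1, 1, -1, -1]

obj_small_C, slacks = soft_margin_primal((1.0,), 0.0, Z, y, C=1.0)
obj_large_C, _ = soft_margin_primal((1.0,), 0.0, Z, y, C=10.0)
```

Only the noisy point carries a positive slack here, and a larger C makes that single violation dominate the objective, which is the trade-off the parameter C controls.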

It can be seen that the idea of the soft margin method is the same as the
goal programming approach to linear classifiers. This idea was used in an
extension of MSM by Bennett [2]. Not only exterior deviations but also interior
deviations can be considered in SVM. Such MOP/GP approaches to SVM are
discussed by the authors and their co-researchers [1], [16], [23]. When applying
GP approaches, it was pointed out in Section 3 that we need some normality
condition in order to avoid unacceptable solutions.
Glover suggested the following necessary and sufficient condition for avoiding
unacceptable solutions [10]:

\[
\left( - l_A \sum_{i \in I_B} x_i + l_B \sum_{i \in I_A} x_i \right)^T w = 1, \qquad (8.1)
\]

where lA and lB denote the number of data for the category A and B, respectively.
Geometrically, the normalization (8.1) means that the distance between
two hyperplanes passing through centers of data respectively for A and B is
scaled by lA lB.
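The geometric reading of (8.1) can be checked directly, since the vector multiplying w equals lA lB times the difference of the two class centers. The sketch below assumes the condition is read with a transpose on the bracketed vector; the data sets, the direction w and the function names are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def vec_sum(points):
    """Componentwise sum of a list of points."""
    return tuple(sum(c) for c in zip(*points))


def glover_vector(A, B):
    """Vector -l_A * sum_{i in I_B} x_i + l_B * sum_{i in I_A} x_i."""
    l_A, l_B = len(A), len(B)
    s_A, s_B = vec_sum(A), vec_sum(B)
    return tuple(-l_A * sb + l_B * sa for sa, sb in zip(s_A, s_B))


def normalize(w, A, B):
    """Rescale w so that the normalization (8.1) holds."""
    scale = dot(glover_vector(A, B), w)
    if scale <= 0.0:
        raise ValueError("w does not point from the center of B to the center of A")
    return tuple(wj / scale for wj in w)


A = [(2.0, 2.0), (4.0, 2.0)]                 # l_A = 2, center (3, 2)
B = [(0.0, 0.0), (0.0, 2.0), (-3.0, 1.0)]    # l_B = 3, center (-1, 1)

v = glover_vector(A, B)
w_normalized = normalize((1.0, 0.0), A, B)

center_A = tuple(s / len(A) for s in vec_sum(A))
center_B = tuple(s / len(B) for s in vec_sum(B))
center_gap = dot(w_normalized, tuple(a - c for a, c in zip(center_A, center_B)))
```

After normalization the gap between the projected centers equals 1/(lA lB), so w = 0 and arbitrarily large w are both excluded.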
Lately, taking into account the objectives (ii) and (iv) of goal programming
stated in the previous section, Schölkopf et al. [19] suggested the ν-support vector
algorithm:

\[
\begin{array}{lll}
\text{minimize} & \displaystyle \frac{1}{2} \|w\|_2^2 - \nu \rho + \frac{1}{\ell} \sum_{i=1}^{\ell} \xi_i & (\nu\text{−SVM})_P \\
\text{subject to} & y_i \left( w^T z_i + b \right) \geq \rho - \xi_i, & \\
& \rho \geq 0, \ \xi_i \geq 0, \quad i = 1, \ldots, \ell, &
\end{array}
\]

where 0 ≤ ν ≤ 1 is a parameter.
Compared with the existing soft margin algorithm, one of the differences is
that the parameter C for slack variables does not appear, and another differ-
ence is that the new variable ρ appears in the above formulation. The problem
(ν−SVM)P maximizes the variable ρ which corresponds to the minimum inte-
rior deviation (i.e., the minimum distance between the separating hyperplane
and correctly classified points).
The Lagrangian dual problem to the problem (ν−SVM)P is as follows:


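\[
\begin{array}{lll}
\text{maximize} & \displaystyle - \frac{1}{2} \sum_{i,j=1}^{\ell} y_i y_j \alpha_i \alpha_j K(x_i, x_j) & (\nu\text{−SVM}) \\
\text{subject to} & \displaystyle \sum_{i=1}^{\ell} y_i \alpha_i = 0, & \\
& \displaystyle \sum_{i=1}^{\ell} \alpha_i \geq \nu, & \\
& \displaystyle 0 \leq \alpha_i \leq \frac{1}{\ell}, \quad i = 1, \ldots, \ell. &
\end{array}
\]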

8.6 Extensions of SVM by MOP/GP

In this section, we propose various algorithms of SVM considering both slack
variables for misclassified data points (i.e., exterior deviations) and surplus
variables for correctly classified data points (i.e., interior deviations).

8.6.1 Total Margin Algorithm

In order to minimize the slackness and to maximize the surplus, we have the
following optimization problem:

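\[
\begin{array}{lll}
\text{minimize} & \displaystyle \frac{1}{2} \|w\|_2^2 + C_1 \sum_{i=1}^{\ell} \xi_i - C_2 \sum_{i=1}^{\ell} \eta_i & (\mathrm{SVM}_{\mathrm{total}})_P \\
\text{subject to} & y_i \left( w^T z_i + b \right) \geq 1 - \xi_i + \eta_i, & \\
& \xi_i \geq 0, \ \eta_i \geq 0, \quad i = 1, \ldots, \ell, &
\end{array}
\]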


where C1 and C2 are chosen in such a way that C1 > C2 which ensures that at
least one of ξi and ηi becomes zero. The Lagrangian function for the problem
(SVMtotal )P is

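\[
\begin{aligned}
L(w, b, \xi, \eta, \alpha, \beta, \gamma) ={}& \frac{1}{2} \|w\|_2^2 + C_1 \sum_{i=1}^{\ell} \xi_i - C_2 \sum_{i=1}^{\ell} \eta_i \\
& - \sum_{i=1}^{\ell} \alpha_i \left[ y_i \left( w^T z_i + b \right) - 1 + \xi_i - \eta_i \right] \\
& - \sum_{i=1}^{\ell} \beta_i \xi_i - \sum_{i=1}^{\ell} \gamma_i \eta_i ,
\end{aligned}
\]

where αi ≥ 0, βi ≥ 0 and γi ≥ 0.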

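Differentiating the Lagrangian function with respect to w, b, ξ and η
yields the following conditions:

\[
\begin{aligned}
\frac{\partial L(w, b, \xi, \eta, \alpha, \beta, \gamma)}{\partial w} &= w - \sum_{i=1}^{\ell} \alpha_i y_i z_i = 0, \\
\frac{\partial L(w, b, \xi, \eta, \alpha, \beta, \gamma)}{\partial \xi_i} &= C_1 - \alpha_i - \beta_i = 0, \\
\frac{\partial L(w, b, \xi, \eta, \alpha, \beta, \gamma)}{\partial \eta_i} &= -C_2 + \alpha_i - \gamma_i = 0, \\
\frac{\partial L(w, b, \xi, \eta, \alpha, \beta, \gamma)}{\partial b} &= \sum_{i=1}^{\ell} \alpha_i y_i = 0.
\end{aligned}
\]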

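Substituting the above stationary conditions into the Lagrangian function
L and using kernel representation, we obtain the following dual optimization
problem:

$$
\begin{aligned}
\text{maximize} \quad & \sum_{i=1}^{\ell}\alpha_i - \frac{1}{2}\sum_{i,j=1}^{\ell} y_i y_j \alpha_i \alpha_j K(x_i, x_j) \qquad (\mathrm{SVM}_{total}) \\
\text{subject to} \quad & \sum_{i=1}^{\ell} y_i \alpha_i = 0, \\
& C_2 \le \alpha_i \le C_1, \quad i = 1, \ldots, \ell.
\end{aligned}
$$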
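
The substitution step can be spelled out as a short worked derivation. This is a sketch that uses only the stationarity conditions of the Lagrangian stated above, and it assumes the kernel representation K(xi, xj) = zi^T zj:

$$
\begin{aligned}
L &= \frac{1}{2}\|w\|_2^2 - \sum_{i=1}^{\ell}\alpha_i y_i w^T z_i - b\sum_{i=1}^{\ell}\alpha_i y_i + \sum_{i=1}^{\ell}\alpha_i
   + \sum_{i=1}^{\ell}(C_1 - \alpha_i - \beta_i)\xi_i + \sum_{i=1}^{\ell}(-C_2 + \alpha_i - \gamma_i)\eta_i \\
  &= \frac{1}{2}\|w\|_2^2 - \|w\|_2^2 + \sum_{i=1}^{\ell}\alpha_i \\
  &= \sum_{i=1}^{\ell}\alpha_i - \frac{1}{2}\sum_{i,j=1}^{\ell} y_i y_j \alpha_i \alpha_j K(x_i, x_j).
\end{aligned}
$$

The second line uses w = Σ αi yi zi, so that Σ αi yi w^T zi equals the squared norm of w, while the terms in b, ξi and ηi vanish. The bounds on αi come from the multipliers: βi = C1 − αi ≥ 0 gives αi ≤ C1, and γi = αi − C2 ≥ 0 gives αi ≥ C2.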

Let α∗ be the optimal solution to the problem (SVMtotal). Then, the
discrimination function can be written as

$$
f(\phi(x)) = \sum_{i=1}^{\ell}\alpha_i^* y_i K(x, x_i) + b.
$$

The offset b is given as follows: Let n+ be the number of xj with C2 < αj∗ < C1
and yj = +1, and let n− be the number of xj with C2 < αj∗ < C1 and yj = −1.
From the Karush-Kuhn-Tucker complementarity conditions, if
C2 < αj∗ < C1, then βj > 0 and γj > 0. This implies that ξj = ηj = 0. Then,

$$
b^* = \frac{1}{n_+ + n_-}\left[(n_+ - n_-) - \sum_{j=1}^{n_+ + n_-}\sum_{i=1}^{\ell} y_i \alpha_i^* K(x_i, x_j)\right],
$$

where the sum over j runs over the n+ + n− points xj with C2 < αj∗ < C1.
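
The offset formula and the discrimination function can be sketched in a few lines of plain Python. The function names, the tolerance and the two-point example are hypothetical; the example uses a linear kernel on the one-dimensional points x1 = 3 (class +1) and x2 = 1 (class −1), for which α1 = α2 = 0.5 maximizes the (SVMtotal) dual objective when C2 = 0.1 and C1 = 1.

```python
def total_margin_offset(alpha, y, K, c1, c2, tol=1e-9):
    """Offset b* averaged over the points with C2 < alpha_j < C1."""
    n = len(alpha)
    free = [j for j in range(n) if c2 + tol < alpha[j] < c1 - tol]
    if not free:
        raise ValueError("no index with C2 < alpha_j < C1")
    n_plus = sum(1 for j in free if y[j] > 0)
    n_minus = len(free) - n_plus
    # Double sum over free points j and all training points i.
    kernel_sum = sum(y[i] * alpha[i] * K[i][j] for j in free for i in range(n))
    return ((n_plus - n_minus) - kernel_sum) / (n_plus + n_minus)


def decision_value(k_row, alpha, y, b):
    """f = sum_i alpha_i y_i K(x, x_i) + b, where k_row[i] = K(x, x_i)."""
    return sum(a * yi * k for a, yi, k in zip(alpha, y, k_row)) + b


# Linear kernel K_ij = x_i * x_j for x = (3, 1); hypothetical toy data.
K = [[9.0, 3.0], [3.0, 1.0]]
y = [1, -1]
alpha = [0.5, 0.5]
b = total_margin_offset(alpha, y, K, c1=1.0, c2=0.1)
print(b)                                   # -2.0
print(decision_value(K[0], alpha, y, b))   # 1.0
print(decision_value(K[1], alpha, y, b))   # -1.0
```

In this example both training points lie exactly on the margin, so the decision values are +1 and −1, as the condition ξj = ηj = 0 requires.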

8.6.2 µ−SVM
Minimizing the worst slackness and maximizing the sum of surplus, we have
a reverse formulation of ν−SVM. We introduce a new variable σ which repre-
sents the maximal distance between the separating hyperplane and misclassi-
fied data points. Thus, the following problem is obtained:


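$$
\begin{aligned}
\text{minimize} \quad & \frac{1}{2}\|w\|_2^2 + \mu\sigma - \frac{1}{\ell}\sum_{i=1}^{\ell}\eta_i \qquad (\mu\text{−SVM})_P \\
\text{subject to} \quad & y_i\left(w^T z_i + b\right) \ge \eta_i - \sigma, \\
& \sigma \ge 0,\ \eta_i \ge 0, \quad i = 1, \ldots, \ell,
\end{aligned}
$$

where µ is a parameter which reflects the trade-off between σ and the sum of
ηi.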
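The Lagrangian function for the problem (µ−SVM)P is

$$
\begin{aligned}
L(w, b, \eta, \sigma, \alpha, \beta, \gamma) = {} & \frac{1}{2}\|w\|_2^2 + \mu\sigma - \frac{1}{\ell}\sum_{i=1}^{\ell}\eta_i \\
& - \sum_{i=1}^{\ell}\alpha_i\left[y_i\left(w^T z_i + b\right) - \eta_i + \sigma\right] - \sum_{i=1}^{\ell}\beta_i\eta_i - \gamma\sigma,
\end{aligned}
$$

where αi ≥ 0, βi ≥ 0 and γ ≥ 0.
Differentiating the Lagrangian function with respect to w, b, η and σ
yields the following conditions:

$$
\begin{aligned}
\frac{\partial L(w, b, \eta, \sigma, \alpha, \beta, \gamma)}{\partial w} &= w - \sum_{i=1}^{\ell}\alpha_i y_i z_i = 0, \\
\frac{\partial L(w, b, \eta, \sigma, \alpha, \beta, \gamma)}{\partial \eta_i} &= -\frac{1}{\ell} + \alpha_i - \beta_i = 0, \\
\frac{\partial L(w, b, \eta, \sigma, \alpha, \beta, \gamma)}{\partial \sigma} &= \mu - \sum_{i=1}^{\ell}\alpha_i - \gamma = 0, \\
\frac{\partial L(w, b, \eta, \sigma, \alpha, \beta, \gamma)}{\partial b} &= -\sum_{i=1}^{\ell}\alpha_i y_i = 0.
\end{aligned}
$$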

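Substituting the above stationary conditions into the Lagrangian function
L, we obtain the following dual optimization problem:

$$
\begin{aligned}
\text{maximize} \quad & -\frac{1}{2}\sum_{i,j=1}^{\ell}\alpha_i \alpha_j y_i y_j K(x_i, x_j) \qquad (\mu\text{−SVM}) \\
\text{subject to} \quad & \sum_{i=1}^{\ell}\alpha_i y_i = 0, \\
& \sum_{i=1}^{\ell}\alpha_i \le \mu, \\
& \alpha_i \ge \frac{1}{\ell}, \quad i = 1, \ldots, \ell.
\end{aligned}
$$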
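
A necessary condition on µ can be read off from these constraints. The following one-line calculation is a sketch that reads the last constraint as αi ≥ 1/ℓ:

$$
\sum_{i=1}^{\ell}\alpha_i \;\ge\; \ell \cdot \frac{1}{\ell} = 1
\quad\Longrightarrow\quad
\mu \ge 1 \ \text{ is required for the constraint } \sum_{i=1}^{\ell}\alpha_i \le \mu \text{ to be satisfiable.}
$$

This agrees with the values of µ between 1.2 and 2 used in Tables 8.5 and 8.11.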

Let α∗ be the optimal solution to the problem (µ−SVM). To compute the
offset b, we take a set A of points xj with 1/ℓ < αj∗ that contains the same
number n of points from each class. From the Karush-Kuhn-Tucker
complementarity conditions, if 1/ℓ < αj∗, then βj > 0,
which implies ηj = 0. Thus,

$$
b^* = -\frac{1}{2n}\sum_{x_j \in A}\sum_{i=1}^{\ell}\alpha_i^* y_i K(x_i, x_j).
$$

8.6.3 µ−ν−SVM

Applying SVMtotal and µ−SVM, all training points become support vectors
due to the second constraint of the problem (SVMtotal) and the third
constraint of the problem (µ−SVM), which bound every αi from below (by C2
and by 1/ℓ, respectively). In other words, the algorithms (SVMtotal)
and (µ−SVM) lack sparsity of support vectors. In order to overcome this
problem, we suggest the following formulation,
which combines the ideas of ν−SVM and µ−SVM:

$$
\begin{aligned}
\text{minimize} \quad & \frac{1}{2}\|w\|_2^2 - \nu\rho + \mu\sigma \qquad (\mu\text{−}\nu\text{−SVM})_P \\
\text{subject to} \quad & y_i\left(w^T z_i + b\right) \ge \rho - \sigma, \quad i = 1, \ldots, \ell, \\
& \rho \ge 0,\ \sigma \ge 0,
\end{aligned}
$$

where ν and µ are parameters.
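The Lagrangian function to the problem (µ−ν−SVM)P is

$$
\begin{aligned}
L(w, b, \rho, \sigma, \alpha, \beta, \gamma) = {} & \frac{1}{2}\|w\|_2^2 - \nu\rho + \mu\sigma \\
& - \sum_{i=1}^{\ell}\alpha_i\left[y_i\left(w^T z_i + b\right) - \rho + \sigma\right] - \beta\rho - \gamma\sigma,
\end{aligned}
$$

where αi ≥ 0, β ≥ 0 and γ ≥ 0.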

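Differentiating the Lagrangian function with respect to w, b, ρ and σ yields
the four conditions

$$
\begin{aligned}
\frac{\partial L(w, b, \rho, \sigma, \alpha, \beta, \gamma)}{\partial w} &= w - \sum_{i=1}^{\ell}\alpha_i y_i z_i = 0, \\
\frac{\partial L(w, b, \rho, \sigma, \alpha, \beta, \gamma)}{\partial \rho} &= -\nu + \sum_{i=1}^{\ell}\alpha_i - \beta = 0, \\
\frac{\partial L(w, b, \rho, \sigma, \alpha, \beta, \gamma)}{\partial \sigma} &= \mu - \sum_{i=1}^{\ell}\alpha_i - \gamma = 0, \\
\frac{\partial L(w, b, \rho, \sigma, \alpha, \beta, \gamma)}{\partial b} &= -\sum_{i=1}^{\ell}\alpha_i y_i = 0.
\end{aligned}
$$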

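Substituting the above stationary conditions into the Lagrangian function
L and using kernel representation, we obtain the following dual optimization
problem:

$$
\begin{aligned}
\text{maximize} \quad & -\frac{1}{2}\sum_{i,j=1}^{\ell}\alpha_i \alpha_j y_i y_j K(x_i, x_j) \qquad (\mu\text{−}\nu\text{−SVM}) \\
\text{subject to} \quad & \sum_{i=1}^{\ell}\alpha_i y_i = 0, \\
& \nu \le \sum_{i=1}^{\ell}\alpha_i \le \mu, \\
& \alpha_i \ge 0, \quad i = 1, \ldots, \ell.
\end{aligned}
$$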

Letting α∗ be the optimal solution to the problem (µ−ν−SVM), the offset
b∗ can be chosen easily using any i satisfying αi∗ > 0. Alternatively, b∗ can be
obtained in the same way as b∗ is determined in the other algorithms.

8.7 Numerical Examples

In order to investigate the performance of the proposed methods, we compare
the results for the following four data sets. (The data can be downloaded
from http://www.ics.uci.edu/~mlearn/MLSummary.html)
I. MONK’s Problem (all data sets with 7 attributes)
a) case 1
i. training : 124 instances (A : 62 instances, B : 62 instances)
ii. test : 432 instances (A : 216 instances, B : 216 instances)

b) case 2
i. training : 169 instances (A : 64 instances, B : 105 instances)
ii. test : 432 instances (A : 142 instances, B : 290 instances)
c) case 3
i. training : 122 instances (A : 60 instances, B : 62 instances)
ii. test : 432 instances (A : 228 instances, B : 204 instances)
II. Cleveland heart-disease from Long Beach and Cleveland Clinic Founda-
tion : 303 instances (A : 164 instances, B : 139 instances) with 14 attributes
III. BUPA liver disorders from BUPA Medical Research Ltd. : 345 instances
(A : 200 instances, B : 145 instances) with 7 attributes
IV. PIMA Indians diabetes database : 768 instances (A : 268 instances, B :
500 instances) with 9 attributes
In the following numerical experiments, the QP solver of MATLAB was used
for solving the QP problems in the SVM formulations. Gaussian kernels with
r = 1.0 were used, with the data normalized for each sample xi by

$$
\tilde{x}_i^k = \frac{x_i^k - \mu_k}{\sigma_k},
$$

where µk and σk are the mean value and the standard deviation of the k-th
component of the given sample data {x1 , . . . , xp }, respectively. For the
parameters of the GP model, we set h1 = h2 = · · · = hℓ = C1 and k1 = k2 = · · · =
kℓ = C2 .
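
The preprocessing described above can be sketched as follows. This is an illustrative plain-Python version, not the MATLAB code used in the experiments. Two points are assumptions: the standard deviation is taken as the population standard deviation, and the Gaussian kernel is parameterized as exp(−‖u − v‖² / (2r²)); the chapter's own convention for r may differ.

```python
import math
from statistics import mean, pstdev


def standardize(samples):
    """Column-wise normalization (x_i^k - mu_k) / sigma_k.

    Uses the population standard deviation (an assumption); a constant
    column is mapped to zeros to avoid division by zero.
    """
    columns = list(zip(*samples))
    mus = [mean(col) for col in columns]
    sigmas = [pstdev(col) for col in columns]
    return [
        [((v - m) / s if s > 0 else 0.0) for v, m, s in zip(row, mus, sigmas)]
        for row in samples
    ]


def gaussian_kernel(u, v, r=1.0):
    """exp(-||u - v||^2 / (2 r^2)); this parameterization is an assumption."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-sq_dist / (2.0 * r * r))


def kernel_matrix(samples, r=1.0):
    """Kernel matrix K_ij = K(x_i, x_j) as a list of rows."""
    return [[gaussian_kernel(u, v, r) for v in samples] for u in samples]


# Hypothetical two-sample, two-attribute data set.
z = standardize([[1.0, 10.0], [3.0, 10.0]])
K = kernel_matrix(z, r=1.0)
```

Normalizing before applying the kernel keeps attributes with large numeric ranges from dominating the squared distance.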
For dataset I, we used the training data and the test data as provided by
the benchmark web site. Tables 8.1–8.6 compare the classification rates
of the existing algorithms (GP), (SVMsoft) and (ν−SVM) with those of the
proposed algorithms (SVMtotal), (µ−SVM) and (µ−ν−SVM), respectively.
For datasets II and III, we adopt the ‘cross validation test’ method,
which makes 10 trials, each with training data of 70% selected randomly from
the original data set and test data consisting of the remaining 30%.
Tables 8.7–8.18 compare the average (AVE) and the standard deviation (STDV)
of the classification rates of the existing algorithms (GP), (SVMsoft) and
(ν−SVM) with those of the
proposed algorithms (SVMtotal), (µ−SVM) and (µ−ν−SVM), respectively.
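
The evaluation protocol described above (repeated random 70%/30% splits, then AVE and STDV over the trials) can be sketched in plain Python. The names `random_split`, `repeated_holdout` and `dummy_rate` are hypothetical, the fixed seed is only for reproducibility, and the population standard deviation is an assumption, since the text does not say which estimator was used.

```python
import random
from statistics import mean, pstdev


def random_split(n_samples, train_fraction=0.7, rng=None):
    """Return (train_indices, test_indices) for one random split."""
    if rng is None:
        rng = random.Random()
    indices = list(range(n_samples))
    rng.shuffle(indices)
    n_train = round(train_fraction * n_samples)
    return indices[:n_train], indices[n_train:]


def repeated_holdout(n_samples, evaluate, trials=10, seed=0):
    """Run `evaluate(train_idx, test_idx)` on `trials` random 70/30 splits.

    `evaluate` must return a classification rate in percent; the function
    returns the average and the (population) standard deviation.
    """
    rng = random.Random(seed)
    rates = [evaluate(*random_split(n_samples, 0.7, rng)) for _ in range(trials)]
    return mean(rates), pstdev(rates)


def dummy_rate(train_idx, test_idx):
    """Placeholder for 'train a classifier, return its test rate in percent'."""
    return 100.0 * sum(1 for i in test_idx if i % 2 == 0) / len(test_idx)


# 303 instances, as in the Cleveland heart-disease data set.
ave, stdv = repeated_holdout(303, dummy_rate, trials=10)
```

In practice `dummy_rate` would be replaced by a function that trains one of the SVM variants on the training indices and scores it on the test indices.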
For dataset IV, the two classes are imbalanced: A (tested positive for
diabetes) has 268 elements, while B (tested non-positive for diabetes) has
500 elements. We randomly selected 70% of the whole data set as the training
samples and used the remaining 30% as the test samples. We compared the
results of (GP), (SVMsoft) and (ν−SVM) with those of the proposed algorithms
(SVMtotal), (µ−SVM) and (µ−ν−SVM), as seen in Tables 8.19–8.24, respectively.
Table 8.25 shows the rate of support vectors in terms of percentage for
each problem and each method.
Throughout our numerical experiments, it has been observed that, even
though the result depends on the values of the parameters, the family of SVMs
using MOP/GP such as ν−SVM, SVMtotal, µ−SVM and µ−ν−SVM shows
relatively good performance in comparison with the simple SVMsoft.
Unbalanced data sets sometimes make it difficult to predict the category with
fewer samples. In our experiments, MONK (case 2) and PIMA diabetes are of
this kind. It can be seen in those problems that the classification ability for
the class with fewer samples is very sensitive to the value of C in SVMsoft. In
other words, we have to select the value of C in SVMsoft carefully
in order to attain a reasonable classification rate for unbalanced data sets.
SVMtotal and µ−ν−SVM, however, have an advantage over SVMsoft in the
classification rate of the class with fewer elements. In addition, the MONK
data sets do not seem to be linearly separable. In this example, therefore, SVMs
using MOP/GP show much better performance than GP alone.

Table 8.1. Classification Rate by GP for MONK’s Problem

C1                    1      1      1     10     10     10    100    100    100
C2                0.001   0.01    0.1   0.01    0.1      1    0.1      1     10  average

Training  case 1  73.39  73.39  73.39  73.39  73.39  71.77  73.39  73.39  73.39    73.21
          case 2  63.31  63.31  63.31  63.31  63.31  63.91  63.31  63.31  65.09    63.57
            A     53.13  53.13  45.31  51.56  51.56  48.44  51.56  51.56  45.31    50.17
            B     69.52  69.52  74.29  70.48  70.48  73.33  70.48  70.48  77.14    71.75
          case 3  88.52  88.52  88.52  88.52  88.52  88.52  88.52  88.52  88.52    88.52

Test      case 1  66.67  66.67  66.67  66.67  66.67  65.97  66.67  66.67  66.67    66.59
          case 2  58.33  58.33  58.33  59.03  59.03  59.26  59.03  59.03  61.11    59.05
            A     39.44  39.44  35.92  40.14  40.14  37.32  40.14  40.14  35.92    38.73
            B     67.59  67.59  70.69  68.28  68.28  70.00  68.28  68.28  73.45    69.16
          case 3  88.89  88.89  88.89  88.89  88.89  88.89  88.89  88.89  88.89    88.89

Table 8.2. Classification Rate by SVMsoft for MONK’s Problem

C                   0.1      1     10    100  average

Training  case 1  87.90  95.16    100    100    95.77
          case 2  62.13  85.80    100    100    86.98
            A      0.00  64.06    100    100    66.02
            B       100  99.05    100    100    99.76
          case 3  81.15  99.18    100    100    95.08

Test      case 1  78.94  83.80  92.36  92.36    86.86
          case 2  67.13  70.14  79.63  80.09    74.25
            A      0.00  40.14  82.39  83.10    51.41
            B       100  84.83  78.28  78.62    85.43
          case 3  69.44  95.83  91.67  91.67    87.15
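
The rows labeled A and B break the case 2 rate down by class, so the overall rate is the class-size-weighted average of the two class rates. The small check below is an illustrative sketch (the function name is hypothetical). It uses the case 2 class sizes given in Section 8.7 (training: A 64, B 105; test: A 142, B 290) together with the C = 1 column of Table 8.2.

```python
def overall_rate(rate_a, n_a, rate_b, n_b):
    """Overall classification rate (%) from per-class rates (%) and class sizes."""
    return (rate_a * n_a + rate_b * n_b) / (n_a + n_b)


# MONK case 2, C = 1: training set (A: 64, B: 105), test set (A: 142, B: 290).
print(round(overall_rate(64.06, 64, 99.05, 105), 2))   # 85.8, the training rate
print(round(overall_rate(40.14, 142, 84.83, 290), 2))  # 70.14, the test rate
```

The same relation shows why a classifier that assigns every point to B still reaches 62.13% on the case 2 training set: 105 of the 169 training instances belong to B.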

Table 8.3. Classification Rate by ν−SVM for MONK’s Problem

ν                   0.1    0.2    0.3    0.4    0.5    0.6    0.7  average

Training  case 1    100    100    100  99.19  98.39  94.35  91.94    97.70
          case 2    100    100    100  98.82  98.82  95.27  88.17    97.30
            A       100    100    100  96.88  96.88  89.06  70.31    93.30
            B       100    100    100    100    100  99.05  99.05    99.73
          case 3    100  99.18  99.18  99.18  97.54  95.90  94.26    97.89

Test      case 1  92.36  92.13  91.20  88.43  87.04  84.03  80.56    87.96
          case 2  80.09  80.09  79.40  78.70  77.78  74.31  71.06    77.35
            A     83.10  83.10  82.39  80.28  73.94  60.56  45.07    72.64
            B     78.62  78.62  77.93  77.93  79.66  81.03  83.79    79.66
          case 3  91.67  94.44  95.14  96.06  95.60  93.52  92.13    94.08

Table 8.4. Classification Rate by SVMtotal for MONK’s Problem

C1                    1      1      1     10     10     10    100    100    100
C2                0.001   0.01    0.1   0.01    0.1      1    0.1      1     10  average

Training  case 1  95.16  95.16  95.97    100    100    100    100    100  90.38    97.40
          case 2  86.98  87.57  88.76    100    100    100    100    100  80.47    93.75
            A     70.31  71.88  76.56    100    100    100    100    100    100    90.97
            B     97.14  97.14  96.19    100    100    100    100    100  68.57    95.45
          case 3  99.18  99.18  99.18    100    100    100    100    100   95.0    99.18

Test      case 1  84.49  84.26  84.03  92.59  92.59  86.57  92.59  86.57  79.40    87.01
          case 2  69.68  69.91  70.83  77.78  78.01  78.01  77.78  78.01  69.91    74.43
            A     47.18  47.89  50.70  86.62  87.32  89.44  87.32  89.44  85.92    74.65
            B     80.69  80.69  80.69  73.45  73.45  72.41  73.10  72.41  62.07    74.33
          case 3  95.83  95.83  96.06  91.90  91.90  91.90  91.90  91.90  90.51    93.08

Table 8.5. Classification Rate by µ−SVM for MONK’s Problem

µ                   1.3    1.4    1.5    1.6    1.7    1.8    1.9      2  average

Training  case 1  90.32  90.32  90.32  90.32  90.32  90.32  90.32  90.32    90.32
          case 2  71.01  71.01  71.01  71.01  71.01  71.01  71.01  71.01    71.01
            A       100    100    100    100    100    100    100    100      100
            B     53.33  53.33  53.33  53.33  53.33  53.33  53.33  53.33    53.33
          case 3    100    100    100    100    100    100    100    100      100

Test      case 1  77.46  77.46  77.46  77.46  77.46  77.46  77.46  77.46    77.46
          case 2  62.73  62.73  62.73  62.73  62.73  62.73  62.73  62.73    62.73
            A     97.18  97.18  97.18  97.18  97.18  97.18  97.18  97.18    97.18
            B     45.86  45.86  45.86  45.86  45.86  45.86  45.86  45.86    45.86
          case 3  93.52  93.52  93.52  93.52  93.52  93.52  93.52  93.52    93.52

Table 8.6. Classification Rate by µ−ν−SVM for MONK’s Problem

µ                     1      1      1     10     10     10    100    100    100
ν                 0.001   0.01    0.1   0.01    0.1      1    0.1      1     10  average

Training  case 1    100    100    100    100    100    100    100    100    100      100
          case 2    100    100    100    100    100    100    100    100    100      100
            A       100    100    100    100    100    100    100    100    100      100
            B       100    100    100    100    100    100    100    100    100      100
          case 3    100    100    100    100    100    100    100    100    100      100

Test      case 1  95.37  93.06  92.59  92.59  92.59  92.36  92.59  92.36  92.36    92.88
          case 2  80.56  75.69  75.46  75.46  75.46  80.09  75.46  80.09  80.09    77.60
            A     95.77  92.96  92.96  92.96  92.96  83.10  92.96  83.10  83.10    89.98
            B     73.10  67.24  66.90  66.90  66.90  78.62  66.90  78.62  78.62    71.53
          case 3  93.98  93.52  93.52  93.52  93.52  91.67  93.52  91.67  91.67    92.95

Table 8.7. Classification Rate by GP for Cleveland Heart-disease
C1 1

C2 0.001 0.01 0.1

training test training test training test

rate A B rate A B rate A B rate A B rate A B rate A B

AVE 87.93 88.80 86.88 79.11 80.05 78.07 88.03 88.98 86.88 79.11 80.05 78.07 88.22 89.50 86.68 78.89 80.30 77.35

STD 1.07 1.21 2.05 3.61 5.73 5.14 1.03 1.22 2.05 3.61 5.73 5.14 0.93 1.47 1.94 3.02 5.24 4.48

C1 10

C2 0.01 0.1 1

AVE 88.12 89.16 86.88 78.56 78.86 78.44 88.12 89.16 86.88 78.56 78.86 78.44 88.26 89.77 86.47 79.22 80.29 78.20

STD 1.41 1.56 2.17 2.77 5.32 5.45 1.41 1.56 2.17 2.77 5.32 5.45 1.07 1.44 2.23 3.26 5.77 5.28

C1 100

C2 0.1 1 10

AVE 88.12 89.16 86.88 78.56 78.86 78.44 88.12 89.16 86.88 78.56 78.86 78.44 88.22 89.77 86.38 79.00 80.12 77.92

STD 1.41 1.56 2.17 2.77 5.32 5.45 1.41 1.56 2.17 2.77 5.32 5.45 1.08 1.30 1.95 3.08 5.54 5.84

Table 8.8. Classification Rate by SVMsoft for Cleveland Heart-disease

C 0.01 0.1 1.0

training test training test training test

rate A B rate A B rate A B rate A B rate A B rate A B

AVE 53.05 90.00 10.00 53.89 90.00 10.00 53.05 90.00 10.00 53.89 90.00 10.00 99.72 100 99.40 73.89 75.19 74.30

STD 1.67 30.00 30.00 7.36 30.00 30.00 1.67 30.00 30.00 7.36 30.00 30.00 0.23 0.00 0.49 4.31 11.86 16.65

C 10 100

AVE 100 100 100 74.56 74.49 76.28 100 100 100 74.56 74.49 76.28

STD 0.00 0.00 0.00 3.93 10.80 14.14 0.00 0.00 0.00 3.93 10.80 14.14

Table 8.9. Classification Rate by ν−SVM for Cleveland Heart-disease
ν 0.1 0.2 0.3

training test training test training test

rate A B rate A B rate A B rate A B rate A B rate A B

AVE 100 100 100 74.56 74.49 76.28 100 100 100 74.56 74.49 76.28 100 100 100 74.56 74.49 76.28

STD 0.00 0.00 0.00 3.93 10.80 14.14 0.00 0.00 0.00 3.93 10.80 14.14 0.00 0.00 0.00 3.93 10.80 14.14

ν 0.4 0.5 0.6

AVE 100 100 100 74.67 74.87 76.04 100 100 100 74.33 74.45 75.72 99.91 100 99.80 74.33 74.64 75.48

STD 0.00 0.00 0.00 4.18 10.74 14.59 0.00 0.00 0.00 3.83 10.66 14.52 0.19 0.00 0.40 3.99 10.75 14.65

ν 0.7 0.8

AVE 99.86 100 99.70 74.67 75.05 75.76 99.72 100 99.40 73.78 75.02 74.30

STD 0.22 0.00 0.46 4.38 10.75 15.40 0.23 0.00 0.49 4.59 12.21 16.65

Table 8.10. Classification Rate by SVMtotal for Cleveland Heart-disease

C1 1

C2 0.0001 0.001 0.01

training test training test training test

rate A B rate A B rate A B rate A B rate A B rate A B

AVE 99.72 100 99.40 74.44 74.80 75.97 99.72 100 99.40 74.11 74.23 75.97 99.72 100 99.40 74.11 74.23 75.97

STD 0.23 0.00 0.49 4.99 11.80 17.23 0.23 0.00 0.49 4.87 11.99 17.23 0.23 0.00 0.49 4.87 11.99 17.23

C1 10

C2 0.001 0.01 0.1

AVE 100 100 100 74.22 73.26 77.00 100 100 100 74.22 73.26 77.00 100 100 100 74.22 73.05 77.25

STD 0.00 0.00 0.00 3.47 10.62 13.93 0.00 0.00 0.00 3.47 10.62 13.93 0.00 0.00 0.00 3.47 10.56 14.14

C1 100

C2 0.01 0.1 1

AVE 100 100 100 74.22 73.26 77.00 100 100 100 74.22 73.05 77.25 100 100 100 72.56 57.91 91.63

STD 0.00 0.00 0.00 3.47 10.62 13.93 0.00 0.00 0.00 3.47 10.56 14.14 0.00 0.00 0.00 3.37 5.67 5.76

Table 8.11. Classification Rate by µ−SVM for Cleveland Heart-disease
µ 1.2 ··· 1.5

training test training test training test

rate A B rate A B rate A B rate A B rate A B rate A B

AVE 99.81 99.67 100.00 81.00 82.25 79.72 ··· ··· 99.81 99.67 100.00 81.00 82.25 79.72

STD 0.33 0.59 0.00 2.19 3.33 4.02 ··· ··· 0.33 0.59 0.00 2.19 3.33 4.02

µ 1.6 ··· 2.0

AVE 99.81 99.67 100.00 81.00 82.25 79.72 ··· ··· 99.81 99.67 100.00 81.00 82.25 79.72

STD 0.33 0.59 0.00 2.19 3.33 4.02 ··· ··· 0.33 0.59 0.00 2.19 3.33 4.02

Table 8.12. Classification Rate by µ−ν−SVM for Cleveland Heart-disease [table body not recoverable]

Table 8.13. Classification Rate by GP for Liver Disorders [table body not recoverable]

Table 8.14. Classification Rate by SVMsoft for Liver Disorders [table body not recoverable]

Table 8.15. Classification Rate by ν−SVM for Liver Disorders [table body not recoverable]

Table 8.16. Classification Rate by SVMtotal for Liver Disorders [table body not recoverable]

Table 8.17. Classification Rate by µ−SVM for Liver Disorders [table body not recoverable]

Table 8.18. Classification Rate by µ−ν−SVM for Liver Disorders [table body not recoverable]

Table 8.19. Classification Rate by GP for PIMA [table body not recoverable]

Table 8.20. Classification Rate by SVMsoft for PIMA [table body not recoverable]

Table 8.21. Classification Rate by ν−SVM for PIMA [table body not recoverable]

Table 8.22. Classification Rate by SVMtotal for PIMA [table body not recoverable]

Table 8.23. Classification Rate by µ−SVM for PIMA [table body not recoverable]

Table 8.24. Classification Rate by µ−ν−SVM for PIMA [table body not recoverable]

Table 8.25. Rates of Support Vectors (unit : %), with rows AVE and STD for MONK (case 1), MONK (case 2), MONK (case 3), Cleveland Heart-disease, Liver Disorders and PIMA, and columns SVMsoft, ν−SVM, SVMtotal, µ−SVM and µ−ν−SVM [table body not recoverable]

8.8 Concluding Remarks

In this chapter, we introduced various SVM algorithms using MOP/GP. The authors have given a generalization error bound, which is widely applied to many real problems, and proved that the error bound can be decreased by minimizing slack variables and maximizing surplus variables [23]. However, SVMtotal and µ−SVM among the proposed algorithms are inferior to the standard SVM algorithms in terms of sparsity of support vectors. This means that those methods cause some difficulty in computation for large scale data sets. It is observed in our experience, moreover, that some values of µ yield unacceptable solutions in µ−SVM algorithm. However, µ−ν−SVM overcomes the lack of sparsity of support vectors, and does not cause so much difficulty in computation even for large scale data sets. As a total, µ−ν−SVM shows relatively good performance in our experiences. For regression problems, moreover, µ−ν−SVM minimizing the exterior deviation is akin to function approximation using the Tchebyshev error. This is another point for which µ−ν−SVM is promising. The details on regression by µ−ν−SVM will be discussed elsewhere.

Acknowledgement

This research was supported by JSPS.KAKENHI13680540.

References

[1] Asada, T. and Nakayama, H. (2003) SVM using Multi Objective Linear Programming and Goal Programming, in T. Tanino, T. Tanaka and M. Inuiguchi (eds.), Multi-objective Programming and Goal Programming, 93-98
[2] Bennett, K.P. and Mangasarian, O.L. (1992) Robust Linear Programming Discrimination of Two Linearly Inseparable Sets, Optimization Methods and Software, 1, 23-34
[3] Cavalier, T.M., Ignizio, J.P. and Soyster, A.L. (1989) Discriminant Analysis via Mathematical Programming: Certain Problems and their Causes, Computers and Operations Research, 16, 353-362
[4] Chankong, V. and Haimes, Y.Y. (1983) Multiobjective Decision Making Theory and Methodology, Elsevier Science Publishing
[5] Charnes, A. and Cooper, W.W. (1961) Management Models and Industrial Applications of Linear Programming, vol. 1, Wiley
[6] Cortes, C. and Vapnik, V. (1995) Support Vector Networks, Machine Learning, 20, pp. 273–297
[7] Cristianini, N. and Shawe-Taylor, J. (2000) An Introduction to Support Vector Machines and Other Kernel-based Learning Methods, Cambridge University Press
[8] Erenguc, S.S. and Koehler, G.J. (1990) Survey of Mathematical Programming Models and Experimental Results for Linear Discriminant Analysis, Managerial and Decision Economics, 11, 215-225
[9] Freed, N. and Glover, F. (1981) Simple but Powerful Goal Programming Models for Discriminant Problems, European J. of Operational Research, 7, 44-60
[10] Glover, F. (1990) Improved Linear Programming Models for Discriminant Analysis, Decision Sciences, 21, 771-785
[11] Mangasarian, O.L. (1968) Multisurface Method of Pattern Separation, IEEE Transact. on Information Theory, IT-14, 801-807
[12] Mangasarian, O.L. (1999) Arbitrary-Norm Separating Plane, Operations Research Letters 23
[13] Marcotte, P. and Savard, G. (1992) Novel Approaches to the Discrimination Problem, ZOR–Methods and Models of Operations Research, 36, pp. 517-545
[14] Miettinen, K.M. (1999) Nonlinear Multiobjective Optimization, Kluwer Academic Publishers
[15] Nakayama, H. (1995) Aspiration Level Approach to Interactive Multi-objective Programming and its Applications, Advances in Multicriteria Analysis, ed. by P.M. Pardalos, Y. Siskos and C. Zopounidis, Kluwer Academic Publishers, pp. 147-174
[16] Nakayama, H. and Asada, T. (2001) Support Vector Machines formulated as Multi Objective Linear Programming, Proc. of ICOTA2001, 3, pp. 1171-1178
[17] Novikoff, A.B. (1962) On the Convergence Proofs on Perceptrons, in Symposium on the Mathematical Theory of Automata, vol. 12, pp. 615–622, Polytechnic Institute of Brooklyn
[18] Sawaragi, Y., Nakayama, H. and Tanino, T. (1994) Theory of Multiobjective Optimization, Academic Press
[19] Schölkopf, B. and Smola, A.J. (1998) New Support Vector Algorithms, NeuroCOLT2 Technical report Series, NC2-TR-1998-031
[20] Schölkopf, B. and Smola, A.J. (2002) Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond, MIT Press
[21] Steuer, R.E. (1986) Multiple Criteria Optimization: Theory, Computation, and Application, Wiley
[22] Vapnik, V.N. (1998) Statistical Learning Theory, John Wiley & Sons, New York
[23] Yoon, M., Yun, Y.B. and Nakayama, H. (2003) A Role of Total Margin in Support Vector Machines, Proc. IJCNN’03, 2049-2053

9.g. This multi-objective design problem is usually tackled by aggregating the objectives into a scalar function and applying standard methods to the re- sulting single-objective task. It requires finding appropriate trade-offs between several ob- jectives. Studies in Computational Intelligence (SCI) 16. The optimization of split modified radius-margin model selection criteria is demonstrated on benchmark problems. for example between model complexity and accuracy or sensitivity and specificity. The computational complexity of a so- lution can be an additional design objective.e. The MOO approach to SVM design is evaluated on a real-world pattern recognition task. Suttorp and C.9 Multi-Objective Optimization of Support Vector Machines Thorsten Suttorp and Christian Igel Institut für Neuroinformatik Ruhr-Universität Bochum 44780 Bochum. especially between model com- plexity and accuracy on a set of noisy training examples (→ bias vs. and the number of support vectors to reduce the computational complexity. the false negative rate. in particular under real-time constraints. However.springerlink. it is further advisable to consider sensitivity and specificity (i. a linear weighting of empirical T.suttorp@neuroinformatik. empirical risk).rub. 199–220 (2006) www. Germany thorsten. Here the three objectives are the minimization of the false positive rate. a high false alarm rate may be tolerated if the sensitivity is high.igel@neuroinformatik...de Summary. such an approach can only lead to satisfactory solutions if the aggregation (e. capacity vs. namely the real-time detection of pedestrians in infrared images for driver assistance systems. Support vector machines are reviewed from the multi-objective perspective.rub. true positive and true negative rate) separately.com  c Springer-Verlag Berlin Heidelberg 2006 . and different encodings and model selection criteria are described. 
variance.1 Introduction The design of supervised learning systems for classification requires finding a suitable trade-off between several objectives. Designing supervised learning systems is in general a multi-objective optimization problem.de christian. We consider the adaptation of kernel and regularization parameters of support vector machines (SVMs) by means of multi-objective evolutionary optimiza- tion. Igel: Multi-Objective Optimization of Support Vector Machines. In many applications. For example in medical diagnosis.

A better way is to apply "true" multi-objective optimization (MOO) to approximate the set of Pareto-optimal trade-offs and to choose a final solution afterwards from this set. A solution is Pareto-optimal if it cannot be improved in any objective without getting worse in at least one other objective [11, 14, 40].

We consider MOO of support vector machines (SVMs), which mark the state-of-the-art in machine learning for binary classification in the case of moderate problem dimensionality in terms of the number of training patterns [42, 43, 45]. First, we briefly introduce SVMs from the perspective of MOO. In Section 9.3 we discuss MOO model selection for SVMs. We review model selection criteria, optimization methods with an emphasis on evolutionary MOO, and kernel encodings. Section 9.4 summarizes results on MOO of SVMs considering model selection criteria based on radius-margin bounds [25]. In Section 9.5 we present a real-world application of the proposed methods: Pedestrian detection for driver assistance systems is a difficult classification task, which can be approached using SVMs [30, 32, 36]. Here fast classifiers with a small false alarm rate are needed. We therefore propose MOO to minimize the false positive rate, the false negative rate, and the complexity of the classifier.

9.2 Support Vector Machines

Support vector machines are learning machines based on two key elements: a general purpose learning algorithm and a problem specific kernel that computes the inner product of input data points in a feature space. In this section, we concisely summarize SVMs and illustrate some of the underlying concepts. For an introduction to SVMs we refer to the standard literature [12, 42, 45].

9.2.1 General SVM Learning

We start with a general formulation of binary classification. Let $S = ((x_1, y_1), \dots, (x_\ell, y_\ell))$ be the set of training examples, where $y_i \in \{-1, 1\}$ is the label associated with input pattern $x_i \in X$. The task is to estimate a function $f$ from a given class of functions that correctly classifies unseen examples $(x, y)$ by the calculation of $\operatorname{sign}(f(x))$. The only assumption that is made is that the training data as well as the unseen examples are generated independently by the same, but unknown probability distribution $D$.

The main idea of SVMs is to map the input vectors to a feature space $H$, where the transformed data is classified by a linear function $f$. The transformation $\Phi : X \to H$ is implicitly done by a kernel $k : X \times X \to \mathbb{R}$, which computes an inner product in the feature space and thereby defines the reproducing kernel Hilbert space (RKHS) $H$. The kernel matrix $K = (K_{ij})_{i,j=1}^{\ell}$ has the entries $K_{ij} = \langle \Phi(x_i), \Phi(x_j) \rangle$. The kernel has to be positive semi-definite, that is, $v^T K v \ge 0$ for all $v \in \mathbb{R}^{\ell}$ and all $S$.

Fig. 9.1. Two linear decision boundaries separating circles from squares in some feature space. In the right plot, the separating hyperplane maximizes the margin $\gamma$, in the left not. The radius of the smallest ball in feature space containing all training examples is denoted by $R$.

The best function $f$ for classification is the one that minimizes the generalization error, that is, the probability of misclassifying unseen examples $P_D(\operatorname{sign}(f(x)) \neq y)$. Because the example's underlying distribution $D$ is unknown, a direct minimization is not possible. Thus, upper bounds on the generalization error from statistical learning theory are studied that hold with a probability of $1 - \delta$, $\delta \in (0, 1)$.

We follow the way of [43] for the derivation of SVM learning and give an upper bound that directly incorporates the concepts of margin and slack variables. The margin of an example $(x_i, y_i)$ with respect to a function $f : X \to \mathbb{R}$ is defined by $y_i f(x_i)$. If a function $f$ and a desired margin $\gamma$ are given, the example's slack variable $\xi_i(\gamma, f) = \max(0, \gamma - y_i f(x_i))$ measures how much the example fails to meet the margin (Figures 9.1 and 9.2). It holds [43]:

Theorem 1. Let $\gamma > 0$ and $f \in \{f_w : X \to \mathbb{R},\ f_w(x) = w \cdot \Phi(x),\ \|w\| < 1\}$ a linear function in a kernel-defined RKHS with norm at most 1. Let $S = \{(x_1, y_1), \dots, (x_\ell, y_\ell)\}$ be drawn independently according to a probability distribution $D$ and fix $\delta \in (0, 1)$. Then with probability at least $1 - \delta$ over samples of size $\ell$ we have

\[
P_D(y \neq \operatorname{sign}(f(x))) \le \frac{1}{\ell\gamma} \sum_{i=1}^{\ell} \xi_i + \frac{4}{\ell\gamma} \sqrt{\operatorname{tr}(K)} + 3 \sqrt{\frac{\ln(2/\delta)}{2\ell}} ,
\]

where $K$ is the kernel matrix for the training set $S$.
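The slack variables and the right-hand side of Theorem 1 are simple to evaluate once the functional margins $y_i f(x_i)$ and the trace of the kernel matrix are known. The following is a minimal sketch of that arithmetic; the function names and the numerical values are illustrative assumptions and do not come from the chapter.

```python
import math


def slack_variables(margins, gamma):
    """Return xi_i = max(0, gamma - y_i * f(x_i)) for each functional margin."""
    return [max(0.0, gamma - m) for m in margins]


def theorem1_bound(margins, gamma, trace_k, delta):
    """Evaluate the right-hand side of the bound stated in Theorem 1.

    margins -- list of y_i * f(x_i) for the training examples
    trace_k -- trace of the kernel matrix K on the training set
    """
    ell = len(margins)
    xi = slack_variables(margins, gamma)
    return (sum(xi) / (ell * gamma)
            + 4.0 * math.sqrt(trace_k) / (ell * gamma)
            + 3.0 * math.sqrt(math.log(2.0 / delta) / (2.0 * ell)))


# Hypothetical margins of four training examples (illustration only).
example_margins = [1.2, 0.4, -0.1, 2.0]
example_slacks = slack_variables(example_margins, gamma=1.0)
example_bound = theorem1_bound(example_margins, gamma=1.0, trace_k=4.0, delta=0.05)
```

For such a tiny sample the value exceeds 1, so the bound is vacuous there; the point of the sketch is only to show how the slack term, the trace term, and the confidence term enter.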

Fig. 9.2. The concept of slack variables. (Labels in the plot: $R$, $\gamma$, $\xi_i$, image of $x_i$ in $H$.)

The upper bound of Theorem 1 gives a way of controlling the generalization error $P_D(y \neq \operatorname{sign}(f(x)))$. It states that the described learning problem has a multi-objective character with two objectives, namely the margin $\gamma$ and the sum of the slack variables $\sum_{i=1}^{\ell} \xi_i$. This motivates the following definition of SVM learning$^1$:

\[
P_{\text{SVM}} = \begin{cases}
\max\ \gamma \\
\min\ \sum_{i=1}^{\ell} \xi_i \\
\text{subject to } y_i((w \cdot \Phi(x_i)) + b) \ge \gamma - \xi_i ,\quad \xi_i \ge 0 ,\quad i = 1, \dots, \ell ,\quad \text{and } \|w\|^2 = 1 .
\end{cases}
\]

In this formulation $\gamma$ represents the geometric margin due to the fixation of $\|w\|^2 = 1$. For the solution of $P_{\text{SVM}}$ all training patterns $(x_i, y_i)$ with $\xi_i = 0$ have a distance of at least $\gamma$ to the hyperplane.

The more traditional formulation of SVM learning that will be used throughout this chapter is slightly different. It is a scaled version of $P_{\text{SVM}}$, but finally provides the same classifier:

\[
P_{\text{SVM}} = \begin{cases}
\min\ (w \cdot w) \\
\min\ \sum_{i=1}^{\ell} \xi_i \\
\text{subject to } y_i((w \cdot \Phi(x_i)) + b) \ge 1 - \xi_i ,\quad \xi_i \ge 0 ,\quad i = 1, \dots, \ell .
\end{cases}
\tag{9.1}
\]

There are more possible formulations of SVM learning. For example, when considering the kernel as part of the SVM learning process, as it is done in [6], it becomes necessary to incorporate the term $\operatorname{tr}(K)$ of Theorem 1 into the optimization problem.

1 We do not take into account that the bound of Theorem 1 has to be adapted in the case $b \neq 0$.

9.2.2 Classic C-SVM Learning

Until now we have only considered multi-objective formulations of SVM learning. In order to obtain the classic single-objective C-SVM formulation the weighted sum method is applied to (9.1). The factor $C$ determines the trade-off between the margin $\gamma$ and the sum of the slack variables $\sum_{i=1}^{\ell} \xi_i$:

\[
P_{C\text{-SVM}} = \begin{cases}
\min\ \frac{1}{2}(w \cdot w) + C \sum_{i=1}^{\ell} \xi_i \\
\text{subject to } y_i((w \cdot \Phi(x_i)) + b) \ge 1 - \xi_i ,\quad \xi_i \ge 0 ,\quad i = 1, \dots, \ell .
\end{cases}
\]

This optimization problem $P_{C\text{-SVM}}$ defines the soft margin $L_1$-SVM schematically shown in Figure 9.3. It can be solved by Lagrangian methods. The resulting classification function becomes $\operatorname{sign}(f(x))$ with

\[
f(x) = \sum_{i=1}^{\ell} y_i \alpha_i^* k(x_i, x) + b .
\]

The coefficients $\alpha_i^*$ are the solution of the following quadratic optimization problem:

\[
\begin{aligned}
\text{maximize}\quad & W(\alpha) = \sum_{i=1}^{\ell} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{\ell} y_i y_j \alpha_i \alpha_j k(x_i, x_j) \\
\text{subject to}\quad & \sum_{i=1}^{\ell} \alpha_i y_i = 0 , \\
& 0 \le \alpha_i \le C ,\quad i = 1, \dots, \ell .
\end{aligned}
\tag{9.2}
\]

Fig. 9.3. Classification by a soft margin SVM. The learning algorithm is fully specified by the kernel function $k$ and the regularization parameter $C$. Given training data, it generates the coefficients of a decision function.
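Once the coefficients $\alpha_i^*$ and the offset $b$ are available, evaluating the classifier only requires kernel evaluations against the training patterns with non-zero coefficients. The sketch below illustrates this under stated assumptions: it uses a Gaussian kernel with a single width parameter as an example, the coefficients are hand-picked rather than obtained from a solver, and all function names are hypothetical.

```python
import math


def gaussian_kernel(x, z, gamma=1.0):
    """k(x, z) = exp(-gamma * ||x - z||^2); an example kernel choice."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * squared_distance)


def decision_value(x, patterns, labels, alphas, b, kernel=gaussian_kernel):
    """f(x) = sum_i y_i * alpha_i * k(x_i, x) + b."""
    return sum(y_i * a_i * kernel(x_i, x)
               for x_i, y_i, a_i in zip(patterns, labels, alphas)) + b


def classify(x, patterns, labels, alphas, b, kernel=gaussian_kernel):
    """Return sign(f(x)) as +1 or -1 (a value of exactly 0 is mapped to +1)."""
    return 1 if decision_value(x, patterns, labels, alphas, b, kernel) >= 0 else -1


# Hand-picked toy coefficients (not the output of a QP solver).
# They satisfy the equality constraint sum_i alpha_i * y_i = 0.
toy_patterns = [(0.0, 0.0), (2.0, 0.0)]
toy_labels = [-1, 1]
toy_alphas = [1.0, 1.0]
toy_b = 0.0

label_near_positive = classify((1.9, 0.0), toy_patterns, toy_labels, toy_alphas, toy_b)
label_near_negative = classify((0.1, 0.0), toy_patterns, toy_labels, toy_alphas, toy_b)
```

The cost of one classification grows with the number of patterns that carry a non-zero coefficient, which is why that number is a natural complexity measure for the classifier.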

The optimal value for $b$ can then be computed based on the solution $\alpha^*$. The vectors $x_i$ with $\alpha_i^* > 0$ are called support vectors. The number of support vectors is denoted by # SV. The regularization parameter $C$ controls the trade-off between maximizing the margin

\[
\gamma^* = \left( \sum_{i,j=1}^{\ell} y_i y_j \alpha_i^* \alpha_j^* k(x_i, x_j) \right)^{-1/2}
\]

and minimizing the $L_1$-norm of the final margin slack vector $\xi^*$ of the training data, where

\[
\xi_i^* = \max\left( 0,\ 1 - y_i \left( \sum_{j=1}^{\ell} y_j \alpha_j^* k(x_j, x_i) + b \right) \right) .
\]

In the following we give an extension of the classic C-SVM. It is especially important for practical applications, where the case of highly unbalanced data appears very frequently. To realize a different weighting for wrongly classified positive and negative training examples, different cost-factors $C_+$ and $C_-$ are introduced [31] that change the optimization problem $P_{C\text{-SVM}}$ to

\[
P_{\tilde{C}\text{-SVM}} = \begin{cases}
\min\ \frac{1}{2}(w \cdot w) + C_+ \sum_{i \in I^+} \xi_i + C_- \sum_{i \in I^-} \xi_i \\
\text{subject to } y_i((w \cdot \Phi(x_i)) + b) \ge 1 - \xi_i ,\quad \xi_i \ge 0 ,\quad i = 1, \dots, \ell ,
\end{cases}
\]

where $I^+ = \{i \in \{1, \dots, \ell\} \mid y_i = 1\}$ and $I^- = \{i \in \{1, \dots, \ell\} \mid y_i = -1\}$. The quadratic optimization problem remains unchanged, except for constraint (9.2) that has to be adapted to

\[
0 \le \alpha_i \le C_- ,\ i \in I^- , \qquad 0 \le \alpha_j \le C_+ ,\ j \in I^+ .
\]

9.3 Model Selection for SVMs

So far we have considered the inherent multi-objective nature of SVM training. For the remainder of this chapter, we focus on a different multi-objective design problem in the context of C-SVMs. We consider model selection of SVMs, subsuming hyperparameter adaptation and feature selection with respect to different model selection criteria, which are discussed in this section.

Choosing the right kernel for an SVM is crucial for its training accuracy and generalization capabilities as well as the complexity of the resulting classifier. When a parameterized family of kernel functions is considered, kernel adaptation reduces to finding an appropriate parameter vector. These parameters together with the regularization parameter $C$ are called hyperparameters of the SVM.

In the following, we first discuss optimization methods used for SVM model selection with an emphasis on evolutionary multi-objective optimization. Then different model selection criteria are briefly reviewed. Section 9.3.3 deals with appropriate encodings for Gaussian kernels and for feature selection.

9.3.1 Optimization Methods for Model Selection

In practice, the standard method to determine the hyperparameters is grid-search. In simple grid-search the hyperparameters are varied with a fixed step-size through a wide range of values and the performance of every combination is measured. Because of its computational complexity, grid-search is only suitable for the adjustment of very few parameters. Further, the choice of the discretization of the search space may be crucial.

Perhaps the most elaborate techniques for choosing hyperparameters are gradient-based approaches [9, 10, 21, 22, 28]. When applicable, these methods are highly efficient. However, they have some drawbacks and limitations. The most important one is that the score function for assessing the performance of the hyperparameters (or at least an accurate approximation of this function) has to be differentiable with respect to all hyperparameters, which excludes reasonable measures such as the number of support vectors. In some approaches, the computation of the gradient is only exact in the hard-margin case (i.e., for separable data / $L_2$-SVMs) when the model is consistent with the training data. Further, as the objective functions are indeed multi-modal, the performance of gradient-based heuristics may strongly depend on the initialization, because the algorithms are prone to getting stuck in sub-optimal local optima. Evolutionary methods partly overcome these problems.

Evolutionary Algorithms

Evolutionary algorithms (EAs) are a class of iterative, direct, randomized global optimization techniques based on principles of neo-Darwinian evolution theory. In canonical EAs, a set of individuals forming the parent population is maintained, where each individual has a genotype that encodes a candidate solution for the optimization problem at hand. The fitness of an individual is equal to the objective function value at the point in the search space it represents. In each iteration of the algorithm, new individuals, the offspring, are generated by partially stochastic variations of parent individuals. After the fitness of each offspring has been computed, a selection mechanism that prefers individuals with better fitness chooses the new parent population from the current parents and the offspring. This loop of variation and selection is repeated until a termination criterion is met.

In [19, 39], single-objective evolution strategies were proposed for adapting SVM hyperparameters. A single-objective genetic algorithm for SVM feature selection (see below) was used in [17, 20, 27, 29], where in [20] additionally the (discretized) regularization parameter was adapted.
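The variation-selection loop of a canonical EA can be written down in a few lines. The sketch below is one simple instance, assuming real-valued genotypes, Gaussian mutation with a fixed step size, and elitist selection of the best individuals from parents plus offspring; it minimizes a toy function and is not the evolution strategy of [19, 39] or any other algorithm from the cited work. All names and parameter values are illustrative assumptions.

```python
import random


def evolve(fitness, init_population, n_offspring, n_generations, sigma, rng):
    """Minimal single-objective EA sketch; lower fitness is better here.

    Variation: each offspring is a Gaussian-mutated copy of a random parent.
    Selection: the best len(init_population) individuals from parents and
    offspring form the next parent population.
    """
    parents = [list(individual) for individual in init_population]
    mu = len(parents)
    best_history = []
    for _ in range(n_generations):
        offspring = []
        for _ in range(n_offspring):
            parent = rng.choice(parents)
            offspring.append([gene + rng.gauss(0.0, sigma) for gene in parent])
        pool = parents + offspring
        pool.sort(key=fitness)          # prefer individuals with better fitness
        parents = pool[:mu]
        best_history.append(fitness(parents[0]))
    return parents, best_history


def sphere(genotype):
    """Toy objective standing in for a model selection criterion."""
    return sum(gene * gene for gene in genotype)


rng = random.Random(0)
initial = [[rng.uniform(-5.0, 5.0) for _ in range(3)] for _ in range(5)]
final_parents, history = evolve(sphere, initial, n_offspring=20,
                                n_generations=30, sigma=0.3, rng=rng)
```

In a model selection setting, the genotype would hold the hyperparameters and the fitness function would train an SVM and return the chosen criterion, which makes each fitness evaluation far more expensive than in this toy example.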

Multi-objective Optimization

Training accuracy, generalization capability, and complexity of the SVM (measured by the number of support vectors) are multiple, probably conflicting objectives. Therefore, it can be beneficial to treat model selection as a multi-objective optimization (MOO) problem.

Consider an optimization problem with $M$ objectives $f_1, \dots, f_M : X \to \mathbb{R}$ to be minimized. The elements of $X$ can be partially ordered using the concept of Pareto dominance. A solution $x \in X$ dominates a solution $x'$ and we write $x \prec x'$ if and only if $\exists m \in \{1, \dots, M\} : f_m(x) < f_m(x')$ and $\nexists m \in \{1, \dots, M\} : f_m(x) > f_m(x')$. The elements of the (Pareto) set $\{x \mid \nexists x' \in X : x' \prec x\}$ are called Pareto-optimal. Without any further information, no Pareto-optimal solution can be said to be superior to another element of the Pareto set. The goal of MOO is to find in a single trial a diverse set of Pareto-optimal solutions, which provide insights into the trade-offs between the objectives.

When approaching a MOO problem by linearly aggregating all objectives into a scalar function, each weighting of the objectives yields only a limited subset of Pareto-optimal solutions. That is, various trials with different aggregations become necessary. But when the Pareto front (the image of the Pareto set in the $M$-dimensional objective space) is not convex, even this inefficient procedure does not help (cf. [14, 40]). Evolutionary multi-objective algorithms have become the method of choice for MOO [11, 14]. Applications of evolutionary MOO to model selection for neural networks can be found in [1, 2, 26, 46], for SVMs in [25, 33, 44].

9.3.2 Model Selection Criteria

In the following, we list performance indices that have been considered for SVM model selection. They can be used alone or in linear combination for single-objective optimization. In MOO a subset of these criteria can be used as different objectives.

Accuracy on Sample Data

The most straightforward way to evaluate a model is to consider its classification performance on sample data. One can always compute the empirical risk given by the error on the training data. To estimate the generalization performance of an SVM, one monitors its accuracy on data not used for training. In the simplest case, the available data is split into a training and validation set, the first one is used for building the SVM and the second for assessing the performance of the classifier. In $L$-fold cross-validation (CV) the available data is partitioned into $L$ disjoint sets $D_1, \dots, D_L$ of (approximately) equal size. For given hyperparameters, the SVM is trained $L$ times. In the $i$th iteration, all data but the patterns in $D_i$ are used to train the SVM and afterwards the performance on the $i$th validation data set $D_i$ is determined.
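The $L$-fold procedure can be sketched as follows: partition the indices, train on all folds but one, measure the error on the held-out fold, and average the per-fold errors. This is a minimal illustration only; the round-robin partitioning, the `train` and `predict` callables, and the majority-class stand-in for an SVM are assumptions made for the example.

```python
def l_fold_partition(n_patterns, n_folds):
    """Split indices 0..n_patterns-1 into n_folds disjoint folds of nearly equal size.

    Assumes n_folds <= n_patterns so that no fold is empty.
    """
    folds = [[] for _ in range(n_folds)]
    for index in range(n_patterns):
        folds[index % n_folds].append(index)
    return folds


def cross_validation_error(data, labels, n_folds, train, predict):
    """Average validation error over the n_folds train/validate iterations."""
    fold_errors = []
    for held_out in l_fold_partition(len(data), n_folds):
        held_out_set = set(held_out)
        train_idx = [i for i in range(len(data)) if i not in held_out_set]
        model = train([data[i] for i in train_idx], [labels[i] for i in train_idx])
        wrong = sum(predict(model, data[i]) != labels[i] for i in held_out)
        fold_errors.append(wrong / len(held_out))
    return sum(fold_errors) / len(fold_errors)


def train_majority(train_data, train_labels):
    """Placeholder learner (not an SVM): remembers the most frequent label."""
    return max(set(train_labels), key=train_labels.count)


def predict_majority(model, pattern):
    """Placeholder predictor: always returns the remembered label."""
    return model


toy_data = [[float(i)] for i in range(10)]
toy_labels = [1] * 10
toy_cv_error = cross_validation_error(toy_data, toy_labels, 5,
                                      train_majority, predict_majority)
```

Setting `n_folds` equal to the number of patterns yields the leave-one-out variant of the same loop.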

At last, the errors observed on the $L$ validation data sets are averaged yielding the $L$-fold CV error. In addition, the average empirical risk observed in the $L$ iterations can be computed, a quantity we call $L$-fold CV training error. The $\ell$-fold CV (training) error is called the leave-one-out (training) error. The $L$-fold CV error is an unbiased estimate of the expected generalization error of the SVM trained with $\ell - \ell/L$ i.i.d. patterns. Although the bias is low, the variance may not be, in particular for large $L$. Therefore, and for reasons of computational complexity, moderate choices of $L$ (e.g., 5 or 10) are usually preferred [24].

It can be reasonable to split the classification performance into false negative and false positive rate and consider sensitivity and specificity as two separate objectives of different importance. This topic is discussed in detail in Section 9.5.

Number of Input Features

Often the input space $X$ can be decomposed into $X = X_1 \times \dots \times X_m$. The goal of feature selection is then to determine a subset of indices (feature dimensions) $\{i_1, \dots, i_{m'}\} \subset \{1, \dots, m\}$ that yields classifiers with good performance when trained on the reduced input space $X' = X_{i_1} \times \dots \times X_{i_{m'}}$. By detecting a set of highly discriminative features and ignoring non-discriminative, redundant, or even deteriorating feature dimensions, the SVM may give better classification performance than when trained on the complete space $X$. By considering only a subset of feature dimensions, the computational complexity of the resulting classifier decreases. Therefore reducing the number of feature dimensions is a common objective.

Feature selection for SVMs is often done using single-objective [17, 20, 27, 29] or multi-objective [33, 44] evolutionary computing. For example, in [44] evolutionary MOO of SVMs was used to design classifiers for protein fold prediction. The three objective functions to be minimized were the number of features, the CV error, and the CV training error. The features were selected out of 125 protein properties such as the frequencies of the amino acids, polarity, and van der Waals volume. In another bioinformatics scenario [33] dealing with the classification of gene expression data using different types of SVMs, the size of the subset of genes, the leave-one-out error, and the leave-one-out training error were minimized using evolutionary MOO.

Modified Radius-Margin Bounds

Bounds on the generalization error derived using statistical learning theory (see Section 9.2.1) can be (ab)used as criteria for model selection.$^2$ In the following, we consider radius-margin bounds for $L_1$-SVMs as used for example in [25] and Section 9.4.

2 When used for model selection in the described way, the assumptions of the underlying theorems from statistical learning theory are violated and the term "bound" is misleading.

Let $R$ denote the radius of the smallest ball in feature space containing all $\ell$ training examples given by

\[
R = \sqrt{ \sum_{i=1}^{\ell} \beta_i^* K(x_i, x_i) - \sum_{i,j=1}^{\ell} \beta_i^* \beta_j^* K(x_i, x_j) } ,
\]

where $\beta^*$ is the solution vector of the quadratic optimization problem

\[
\begin{aligned}
\underset{\beta}{\text{maximize}}\quad & \sum_{i=1}^{\ell} \beta_i K(x_i, x_i) - \sum_{i,j=1}^{\ell} \beta_i \beta_j K(x_i, x_j) \\
\text{subject to}\quad & \sum_{i=1}^{\ell} \beta_i = 1 , \\
& \beta_i \ge 0 ,\quad i = 1, \dots, \ell ,
\end{aligned}
\]

see [41]. The modified radius-margin bound

\[
T_{\text{DM}} = (2R)^2 \sum_{i=1}^{\ell} \alpha_i^* + \sum_{i=1}^{\ell} \xi_i^*
\]

was considered for model selection of $L_1$-SVMs in [16]. In practice, this expression did not lead to satisfactory results [10, 16]. Therefore, in [10] it was suggested to use

\[
T_{\text{RM}} = R^2 \sum_{i=1}^{\ell} \alpha_i^* + \sum_{i=1}^{\ell} \xi_i^* ,
\]

based on heuristic considerations and it was shown empirically that $T_{\text{RM}}$ leads to better models than $T_{\text{DM}}$.$^3$ Both criteria can be viewed as two different aggregations of the following two objectives

\[
f_1 = R^2 \sum_{i=1}^{\ell} \alpha_i^* \quad\text{and}\quad f_2 = \sum_{i=1}^{\ell} \xi_i^*
\tag{9.3}
\]

penalizing model complexity and training errors, respectively. For example, a highly complex SVM classifier that very accurately fits the training data has high $f_1$ and small $f_2$.

3 Also for $L_2$-SVMs it was shown empirically that theoretically better founded weightings of such objectives (e.g., corresponding to tighter bounds) need not correspond to better model selection criteria [10].
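Written in terms of (9.3), the two criteria are the weighted sums $T_{\text{RM}} = f_1 + f_2$ and $T_{\text{DM}} = 4 f_1 + f_2$, so each of them picks one particular point from a set of non-dominated $(f_1, f_2)$ pairs. The sketch below filters a handful of candidate objective vectors by Pareto dominance and then selects the minimizer of each criterion. The candidate values are invented for illustration and are not results from the chapter; the function names are likewise assumptions.

```python
def dominates(a, b):
    """True if objective vector a Pareto-dominates b (all objectives minimized)."""
    no_worse = all(x <= y for x, y in zip(a, b))
    strictly_better = any(x < y for x, y in zip(a, b))
    return no_worse and strictly_better


def non_dominated(points):
    """Return the points that are not dominated by any other point."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


def t_rm(f1, f2):
    """T_RM = R^2 * sum(alpha) + sum(xi) = f1 + f2."""
    return f1 + f2


def t_dm(f1, f2):
    """T_DM = (2R)^2 * sum(alpha) + sum(xi) = 4 * f1 + f2."""
    return 4.0 * f1 + f2


# Invented (f1, f2) values of four hypothetical SVMs.
candidates = [(10.0, 50.0), (20.0, 20.0), (40.0, 5.0), (45.0, 30.0)]
front = non_dominated(candidates)
best_by_rm = min(front, key=lambda p: t_rm(*p))
best_by_dm = min(front, key=lambda p: t_dm(*p))
```

In this toy example the heavier weight on $f_1$ moves the $T_{\text{DM}}$ choice towards the low-complexity, high-slack end of the front, while $T_{\text{RM}}$ selects a more balanced trade-off.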

Number of Support Vectors

There are good reasons to prefer SVMs with few support vectors (SVs). In the hard-margin case, the number of SVs (#SV) is an upper bound on the expected number of errors made by the leave-one-out procedure (e.g., see [9, 45]). Further, the space and time complexity of the SVM classifier scales with the number of SVs. For example, in [25] the number of SVs was optimized in combination with the empirical risk, see also Section 9.5.

9.3.3 Encoding in Evolutionary Model Selection

For feature selection a binary encoding is appropriate. Here, the genotypes are $n$-dimensional bit vectors $(b_1, \dots, b_n)^T \in \{0, 1\}^n$, indicating that the $i$th feature dimension is used or not depending on $b_i = 1$ or $b_i = 0$, respectively [33, 44].

When a parameterized family of kernel functions is considered, the kernel parameters can be encoded more or less directly. In the following, we focus on the encoding of Gaussian kernels.

The most frequently used kernels are Gaussian functions. General Gaussian kernels have the form

\[
k_A(x, z) = \exp\left( -(x - z)^T A (x - z) \right)
\]

for $x, z \in \mathbb{R}^n$ and $A \in M$, where $M := \{B \in \mathbb{R}^{n \times n} \mid \forall x \neq 0 : x^T B x > 0 \wedge B = B^T\}$ is the set of positive definite symmetric $n \times n$ matrices. When adapting Gaussian kernels, the question of how to ensure that the optimization algorithm only generates positive definite matrices arises. This can be realized by an appropriate parameterization of $A$. Often the search is restricted to $k_{\gamma I}$, where $I$ is the unit matrix and $\gamma > 0$ is the only adjustable parameter. However, allowing more flexibility has proven to be beneficial (e.g., see [9, 19, 21]). It is straightforward to allow for independent scaling factors weighting the input components and consider $k_D$, where $D$ is a diagonal matrix with arbitrary positive entries. This parameterization is used in most of the experiments described in Sections 9.4 and 9.5. However, only by dropping the restriction to diagonal matrices can one achieve invariance against linear transformations of the input space. To allow for arbitrary covariance matrices for the Gaussian kernel, that is, for scaling and rotation of the search space, we use a parameterization of $M$ mapping $\mathbb{R}^{n(n+1)/2}$ to $M$ such that all modifications of the parameters by some optimization algorithm always result in feasible kernels. In [19], a parameterization of $M$ is used which was inspired by the encoding of covariance matrices for mutative self-adaptation in evolution strategies. We make use of the fact that for any symmetric and positive definite $n \times n$ matrix $A$ there exists an orthogonal $n \times n$ matrix $T$ and a diagonal $n \times n$ matrix $D$ with positive entries such that $A = T^T D T$ and

\[
T = \prod_{i=1}^{n-1} \prod_{j=i+1}^{n} R(\alpha_{i,j}) ,
\]

as proven in [38].
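The decomposition $A = T^T D T$ suggests a genotype of $n$ scaling parameters plus $n(n-1)/2$ rotation angles, $n(n+1)/2$ real numbers in total, that always decodes to a feasible kernel matrix. The following sketch shows one way to do the decoding in plain Python. It assumes that each $R(\alpha_{i,j})$ is a plane rotation in the coordinates $i$ and $j$ (cosine on the two diagonal entries, sine and negative sine on the two off-diagonal ones) and that positivity of the diagonal entries is enforced by exponentiating unconstrained parameters; the latter choice and all function names are assumptions of this example, not necessarily the encoding used in [19].

```python
import math


def identity(n):
    return [[1.0 if r == c else 0.0 for c in range(n)] for r in range(n)]


def matmul(a, b):
    return [[sum(a[r][k] * b[k][c] for k in range(len(b)))
             for c in range(len(b[0]))] for r in range(len(a))]


def transpose(a):
    return [list(column) for column in zip(*a)]


def rotation(n, i, j, angle):
    """Unit matrix except for a plane rotation in coordinates i and j."""
    r = identity(n)
    r[i][i] = r[j][j] = math.cos(angle)
    r[j][i] = math.sin(angle)
    r[i][j] = -math.sin(angle)
    return r


def decode(log_scales, angles):
    """Map n + n(n-1)/2 unconstrained reals to A = T^T D T (symmetric, positive definite)."""
    n = len(log_scales)
    t = identity(n)
    k = 0
    for i in range(n - 1):
        for j in range(i + 1, n):
            t = matmul(t, rotation(n, i, j, angles[k]))
            k += 1
    d = [[math.exp(log_scales[r]) if r == c else 0.0 for c in range(n)]
         for r in range(n)]
    return matmul(matmul(transpose(t), d), t)


def general_gaussian_kernel(x, z, a):
    """k_A(x, z) = exp(-(x - z)^T A (x - z))."""
    diff = [xi - zi for xi, zi in zip(x, z)]
    n = len(diff)
    quad = sum(diff[r] * a[r][c] * diff[c] for r in range(n) for c in range(n))
    return math.exp(-quad)


# Hypothetical genotype for n = 3: three log-scales and three angles.
example_a = decode([0.0, 1.0, -0.5], [0.3, -1.2, 2.0])
example_value = general_gaussian_kernel([1.0, 0.0, 2.0], [0.5, 1.0, 1.0], example_a)
```

Because any real-valued genotype decodes to a valid matrix, mutation and recombination operators can act on the parameters without any feasibility check. Setting all angles to zero recovers the diagonal kernel $k_D$.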

The $n \times n$ matrices $R(\alpha_{i,j})$ are elementary rotation matrices. These are equal to the unit matrix except for $[R(\alpha_{i,j})]_{ii} = [R(\alpha_{i,j})]_{jj} = \cos \alpha_{ij}$ and $[R(\alpha_{i,j})]_{ji} = -[R(\alpha_{i,j})]_{ij} = \sin \alpha_{ij}$.

However, this is not a canonical representation. It is not invariant under reordering the axes of the coordinate system, that is, applying the rotations in a different order (as discussed in the context of evolution strategies in [23]). The natural injective parameterization is to use the exponential map

\[
\exp : m \to M , \quad A \mapsto \sum_{i=0}^{\infty} \frac{A^i}{i!} ,
\]

where $m := \{A \in \mathbb{R}^{n \times n} \mid A = A^T\}$ is the vector space of symmetric $n \times n$ matrices, see [21]. However, also the simpler, but non-injective function $m \to M$ mapping $A \mapsto A A^T$ should work.

9.4 Experiments on Benchmark Data

In this section, we summarize results from evolutionary MOO obtained in [25]. In that study, $L_1$-SVMs with Gaussian kernels were considered and the two objectives given in (9.3) were optimized. The evaluation was based on four common medical benchmark datasets breast-cancer, diabetes, heart, and thyroid with input dimensions $n$ equal to 9, 8, 13, and 5, and $\ell$ equal to 200, 468, 170, and 140. The data originally from the UCI Benchmark Repository [7] were preprocessed and partitioned as in [37]. The first of the splits into training and external test set $D_{\text{train}}$ and $D_{\text{extern}}$ was considered.

Figure 9.4 shows the results of optimizing $k_{\gamma I}$ (see Section 9.3.3) using the objectives (9.3). For each $f_1$ value of a solution the corresponding $f_2$, $T_{\text{RM}}$, $T_{\text{DM}}$, and the percentage of wrongly classified patterns in the test data set $100 \cdot \text{CE}(D_{\text{extern}})$ are given. For diabetes, heart, and thyroid, the solutions lie on typical convex Pareto fronts; in the breast-cancer example the convex front looks piecewise linear.

Assuming convergence to the Pareto-optimal set, the results of a single MOO trial are sufficient to determine the outcome of single-objective optimization of any (positive) linear weighting of the objectives. Thus, we can directly determine and compare the solutions that minimizing $T_{\text{RM}}$ and $T_{\text{DM}}$ would suggest.

The experiments confirm the findings in [10] that the heuristic bound $T_{\text{RM}}$ is better suited for model selection than $T_{\text{DM}}$. When looking at $\text{CE}(D_{\text{extern}})$ and the minima of $T_{\text{RM}}$ and $T_{\text{DM}}$, we can conclude that $T_{\text{DM}}$ puts too much emphasis on the "radius-margin part" yielding worse classification results on the external test set (except for breast-cancer where there is no difference on $D_{\text{extern}}$).

[Figure 9.4: four panels (thyroid, heart, diabetes, breast-cancer). Horizontal axis: $f_1 = R^2 \sum_{i=1}^{\ell} \alpha_i^*$. Plotted curves: $f_2 = \sum_{i=1}^{\ell} \xi_i^*$, $T_{\text{RM}} = f_1 + f_2$, $T_{\text{DM}} = 4 \cdot f_1 + f_2$, and $100 \cdot \text{CE}(D_{\text{extern}})$.]

Fig. 9.4. Pareto fronts (i.e., $(f_1, f_2)$ of non-dominated solutions) after 1500 fitness evaluations, see [25] for details. For every solution the values of $T_{\text{RM}}$, $T_{\text{DM}}$, and $100 \cdot \text{CE}(D_{\text{extern}})$ are plotted against the corresponding $f_1$ value, where $\text{CE}(D_{\text{extern}})$ is the proportion of wrongly classified patterns in the test data set. Projecting the minimum of $T_{\text{RM}}$ (for $T_{\text{DM}}$ proceed analogously) along the y-axis on the Pareto front gives the $(f_1, f_2)$ pair suggested by the model selection criterion $T_{\text{RM}}$; this would also be the outcome of single-objective optimization using $T_{\text{RM}}$. Projecting an $(f_1, f_2)$ pair along the y-axis on $100 \cdot \text{CE}(D_{\text{extern}})$ yields the corresponding error on an external test set.

The heart and thyroid results suggest that even more weight should be given to the slack variables (i.e., the performance on the training set) than in $T_{\text{RM}}$. In the MOO approach, degenerated solutions resulting from an inappropriate weighting of objectives become obvious and can be excluded. We indeed observed such solutions in single-objective optimization of SVMs, where there is no chance to change the trade-off afterwards. For example, one would probably not pick the solution suggested by $T_{\text{DM}}$ in the diabetes benchmark. A typical MOO heuristic is to choose a solution that belongs to the "interesting" part of the Pareto front.

[Figure 9.5: thyroid data. Horizontal axis: $f_1 = R^2 \sum_{i=1}^{\ell} \alpha_i^*$. Plotted curves: $f_2$ and $100 \cdot \text{CE}(D_{\text{extern}})$ for $k_{\gamma I}$ and for $k_D$.]

Fig. 9.5. Pareto fronts after optimizing $k_{\gamma I}$ and $k_D$ for objectives (9.3) and thyroid data after 1500 fitness evaluations [25]. For both kernel parameterizations, $f_2$ and $100 \cdot \text{CE}(D_{\text{extern}})$ are plotted against $f_1$.

In case of a typical convex front, this would be the area of highest "curvature" (the "knee", see Figure 9.4). In our benchmark problems, this leads to results on a par with $T_{\text{RM}}$ and much better than $T_{\text{DM}}$ (except for breast-cancer, where the test errors of all optimized trade-offs were the same). Therefore, this heuristic combined with $T_{\text{RM}}$ (derived from the MOO results) is an alternative for model selection based on modified radius-margin bounds.

Adapting the scaling of the kernel (i.e., optimizing $k_D$) sometimes led to better objective values compared to $k_{\gamma I}$, see Figure 9.5 for an example, but not necessarily to better generalization performance.

9.5 Real-world Application: Pedestrian Detection

In this section, we consider MOO of SVM classifiers for online pedestrian detection in infrared images for driver assistance systems. This is a challenging real-world task with strict real-time constraints requiring highly optimized classifiers and a considerate adjustment of sensitivity and specificity. Instead of optimizing a single SVM and varying the bias parameter to get a ROC (receiver operating characteristic) curve [30, 35], we apply MOO to decrease the false positive rate, the false negative rate, as well as the number of support vectors. Reducing the latter directly corresponds to decreasing the capacity and the computational complexity of the classifier. We automatically select the kernel parameters, the regularization parameter, and the weighting of positive and negative examples during training. Gaussian kernel functions with individual scaling parameters for each component of the input are adapted. As neither gradient-based optimization methods nor grid-search techniques are applicable, we solve the problem using the real-valued non-dominated sorting genetic algorithm NSGA-II [14, 15].

9.5.1 Pedestrian Detection

Robust object detection systems are a key technology for the next generation of driver assistance systems. They make major contributions to the environment representation of the ego-vehicle, which serves as a basis for different high-level driver assistance applications. Besides vehicle detection the early detection of pedestrians is of great interest since it is one important step towards avoiding dangerous situations. In this section we focus on the special case of the detection of pedestrians in a single frame. This is an extremely difficult problem, because of the large variety of human appearances, as pedestrians are standing or walking, carrying bags, wearing hats, etc. Another reason making pedestrian detection very difficult is that pedestrians usually appear in urban environments with complex backgrounds (e.g., containing buildings, cars, traffic signs, and traffic lights).

Most of the past work in detecting pedestrians was done using visual cameras. These approaches use a lot of different techniques so we can name only a few. In [48] segmentation was done by means of stereo vision and classification by the use of neural networks. Classification with SVMs that are working on wavelet features was suggested in [34]. A shape-based method for classification was applied in [4]. In [13] a hybrid approach for pedestrian detection was presented, which evaluates the leg-motion and tracks the upper part of the pedestrian.

Recently some pedestrian detection systems have been developed that are working with infrared images, where the color depends on the heat of the object. The advantage of infrared based systems is that they are almost independent of the lighting conditions, so that night-vision is possible. A shape-based method for the classification of pedestrians in infrared images was developed by [3, 5] and an SVM-based one was suggested in [47].

9.5.2 Pedestrian Detection System

In this section we give a description of our pedestrian detection system that is working with infrared images. We keep it rather short because our focus mainly lies on the classification task.

The task of detecting pedestrians is usually divided into two steps, namely the segmentation of candidate regions for pedestrians and the classification of the segmented regions (Figure 9.6). In our system the segmentation of candidate regions for pedestrians is based on horizontal gradient information, which is used to find vertical structures in the image. If the candidate region is at least of size 10 × 20 pixels a feature vector is calculated and classified using an SVM.

Fig. 9.6. Results of our pedestrian detection system on an infrared image; the left picture shows the result of the segmentation step, which provides candidate regions for pedestrians, the image on the right shows the regions that have been labeled as pedestrian.

The calculation of the feature vectors is based on contour points and the corresponding discretized angles, which are obtained using a Canny filter [8]. To make the approach scaling-invariant we put a 4 × 8 grid on the candidate region and determine the histograms of eight different angles for each of these fields. In a last step the resulting 256-dimensional feature vector is normalized to the range $[-1, 1]^{256}$.

9.5.3 Model Selection

In practice the common way for assigning a performance to a classifier is to analyze its ROC curve. This analysis visualizes the trade-off between the two partially conflicting objectives false negative and false positive rate and allows for the selection of a problem specific solution. A third objective for SVM model-selection, which is especially important for real-time tasks like pedestrian detection, is the number of support vectors, because it directly determines the computational complexity of the classifier.

We use an EA for the tuning of many more parameters than would be possible with grid-search, thus making a better adaptation to the given problem possible. Concretely we tune the parameters $C_+$, $C_-$, and $D$, that is, independent scaling factors for each component of the feature vector (see Sections 9.2.2 and 9.3.3).

For the optimization we generated four datasets $D_{\text{train}}$, $D_{\text{val}}$, $D_{\text{test}}$, and $D_{\text{extern}}$, whose use will become apparent in the discussion of the optimization algorithm. Each of the datasets consists of candidate regions (256-dimensional feature vectors) that are manually labeled pedestrian or non-pedestrian. The candidate regions are obtained by our segmentation algorithm to ensure that the datasets are realistic in the sense that all usually appearing critical cases are contained.

are contained. The datasets are obtained from different image sequences, which have been captured on the same day to ensure similar environmental conditions, and no two datasets contain candidate regions from the same sequence. Furthermore, the segmentation algorithm provides many more non-pedestrians than pedestrians, and therefore negative and positive examples in the data are highly unbalanced.

For optimization we use the NSGA-II, where the fitness of an individual is determined on dataset $D_{\mathrm{val}}$ with an SVM that has been trained on dataset $D_{\mathrm{train}}$. This training is done using the individual's corresponding SVM parameterization. To avoid overfitting we keep an external archive of non-dominated solutions, which have been evaluated on the validation set $D_{\mathrm{test}}$ for every individual that has been created by the optimization process. The dataset $D_{\mathrm{extern}}$ is used for the final evaluation (cf. [46]).

For the application of the NSGA-II we choose a population size of 50 and create the initial parent population by randomly selecting non-dominated solutions from a 3D grid search on the parameters $C_+$, $C_-$, and one global scaling factor $\gamma$, that is, $D = \gamma I$. The other parameters of the NSGA-II are chosen as in [15] ($p_c = 0.9$, $p_m = 1/n$, $\eta_c = 20$, $\eta_m = 20$). We carried out 10 optimization trials, each of them lasting for 250 generations.

9.5.4 Results

In this section we give a short overview of the results of the MOO of the pedestrian detection system. The progress of one optimization trial is shown as an example in Figure 9.7. It illustrates the Pareto-optimal solutions in the objective space that are contained in the external archive after the first and after the 250th generation.

Fig. 9.7. Pareto-optimal solutions that are contained in the external archive after the first (left plot) and after the 250th generation (right plot). [Plots: false positive rate, false negative rate, and number of support vectors (# SV).]

The solutions in the archive after the first generation roughly correspond to the solutions that have been found by 3D grid search. The solutions after the

250th generation have obviously improved and clearly reveal the trade-off between the three objectives, thereby allowing for a problem-specific choice of an SVM.

For assessing the performance of a stochastic optimization algorithm it is not sufficient to evaluate a single optimization trial. One possibility for visualizing the outcome of a series of optimization trials is the so-called summary attainment surfaces [18], which provide the points in objective space that have been attained in a certain fraction of all trials. We give the summary attainment curve for the two objectives true positive rate and false positive rate, which are the objectives of the ROC curve. Figure 9.8 shows the points that have been attained by all, 50%, and the best of our optimization trials.

Fig. 9.8. Summary attainment curves for the objectives true positive rate and false positive rate ("ROC curves"). [Plot: true positive rate versus false positive rate; 1st, 5th, and 10th attainment surfaces.]

9.6 Conclusions

Designing classifiers is a multi-objective optimization (MOO) problem. The application of "true" MOO algorithms allows for visualizing trade-offs, for example between model complexity and learning accuracy or sensitivity and specificity, for guiding the model selection process.

We considered evolutionary MOO of support vector machines (SVMs). This approach can adapt multiple hyperparameters of SVMs based on conflicting, non-differentiable criteria.
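The three criteria used in this chapter (false positive rate, false negative rate, and number of support vectors) only induce a partial order on candidate SVMs, which is why an archive of non-dominated solutions is kept. The following Python sketch illustrates Pareto dominance and non-dominated filtering for such objective vectors. It is an illustration only, not the NSGA-II implementation used in the experiments; the function names and the objective values are hypothetical.

```python
import numpy as np

def dominates(a, b):
    """True if objective vector a Pareto-dominates b (all objectives minimized)."""
    a, b = np.asarray(a), np.asarray(b)
    return bool(np.all(a <= b) and np.any(a < b))

def non_dominated(points):
    """Return the indices of all points that no other point dominates."""
    return [i for i, p in enumerate(points)
            if not any(dominates(q, p) for j, q in enumerate(points) if j != i)]

# Made-up objective vectors: (false positive rate, false negative rate, # SV)
candidates = [
    (0.10, 0.20, 300),
    (0.05, 0.30, 250),  # trades a lower FPR for a higher FNR
    (0.12, 0.25, 320),  # worse than the first candidate in every objective
    (0.10, 0.20, 300),  # duplicate of the first; equal vectors do not dominate
]
archive = non_dominated(candidates)
```

Only the third candidate is removed, because it is the only one that is no better than some other candidate in every objective. The remaining candidates represent different trade-offs, and choosing among them is left to the user.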

When optimizing the norm of the slack variables and the radius-margin quotient as two objectives, it turned out that standard MOO heuristics based on the curvature of the Pareto front led to models comparable to those of the corresponding single-objective criteria proposed in the literature. In benchmark problems it appears that the latter should put more emphasis on minimizing the slack variables.

We demonstrated MOO of SVMs for the detection of pedestrians in infrared images for driver assistance systems. Here the three objectives are the false positive rate, the false negative rate, and the number of support vectors. The Pareto front of the first two objectives can be viewed as a ROC curve where each point corresponds to a learning machine optimized for that particular trade-off between sensitivity and specificity. The third objective reduces the model complexity in order to meet real-time constraints.

Acknowledgments

We thank Tobias Glasmachers for proofreading and Aalzen Wiegersma for providing the thoroughly preprocessed pedestrian image data. We acknowledge support from BMW Group Research and Technology.

References

[1] Hussein A. Abbass. An evolutionary artificial neural networks approach for breast cancer diagnosis. Artificial Intelligence in Medicine, 25(3):265–281, 2002.
[2] Hussein A. Abbass. Speeding up backpropagation using multiobjective evolutionary algorithms. Neural Computation, 15(11):2705–2726, 2003.
[3] Massimo Bertozzi, Alberto Broggi, Marcello Carletti, Alessandra Fascioli, Thorsten Graf, Paolo Grisleri, and Michael Meinecke. IR pedestrian detection for advanced driver assistance systems. In Proceedings of the 25th Pattern Recognition Symposium, volume 2781 of LNCS, pages 582–590. Springer-Verlag, 2003.
[4] Massimo Bertozzi, Alberto Broggi, Alessandra Fascioli, and Massimiliano Sechi. Shape-based pedestrian detection. In Proceedings of the IEEE Intelligent Vehicles Symposium 2000, pages 215–220, 2000.
[5] Massimo Bertozzi, Alberto Broggi, Thorsten Graf, Paolo Grisleri, and Michael Meinecke. Pedestrian detection in infrared images. In Proceedings of the IEEE Intelligent Vehicles Symposium 2003, pages 662–667, 2003.
[6] J. Bi. Multi-objective programming in SVMs. In Tom Fawcett and Nina Mishra, editors, Machine Learning, Proceedings of the 20th International Conference (ICML 2003), pages 35–42. AAAI Press, 2003.
[7] C.L. Blake and C.J. Merz. UCI repository of machine learning databases, 1998.
[8] J. Canny. A computational approach to edge detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 8(6):679–698, 1986.
[9] O. Chapelle, V. Vapnik, O. Bousquet, and S. Mukherjee. Choosing multiple parameters for support vector machines. Machine Learning, 46(1):131–159, 2002.

[10] K.-M. Chung, W.-C. Kao, C.-L. Sun, and C.-J. Lin. Radius margin bounds for support vector machines with RBF kernel. Neural Computation, 15(11):2643–2681, 2003.
[11] Carlos A. Coello Coello, David A. Van Veldhuizen, and Gary B. Lamont. Evolutionary Algorithms for Solving Multi-Objective Problems. Kluwer Academic Publishers, 2002.
[12] Nello Cristianini and John Shawe-Taylor. An Introduction to Support Vector Machines and other kernel-based learning methods. Cambridge University Press, 2000.
[13] C. Curio, J. Edelbrunner, T. Kalinke, C. Tzomakas, and W. von Seelen. Walking pedestrian recognition. IEEE Transactions on Intelligent Transportation Systems, 1(3):155–163, 2000.
[14] Kalyanmoy Deb. Multi-Objective Optimization Using Evolutionary Algorithms. Wiley, 2001.
[15] Kalyanmoy Deb, Amrit Pratap, Samir Agrawal, and T. Meyarivan. A fast and elitist multiobjective genetic algorithm: NSGA-II. IEEE Transactions on Evolutionary Computation, 6(2):182–197, 2002.
[16] K. Duan, S. S. Keerthi, and A. N. Poo. Evaluation of simple performance measures for tuning SVM hyperparameters. Neurocomputing, 51:41–59, 2003.
[17] Damian R. Eads, Daniel Hill, Sean Davis, Simon J. Perkins, Junshui Ma, Reid B. Porter, and James P. Theiler. Genetic algorithms and support vector machines for time series classification. In Bruno Bosacchi, David B. Fogel, and James C. Bezdek, editors, Applications and Science of Neural Networks, Fuzzy Systems, and Evolutionary Computation V, volume 4787 of Proceedings of the SPIE, pages 74–85, 2002.
[18] C. M. Fonseca, J. D. Knowles, L. Thiele, and E. Zitzler. A tutorial on the performance assessment of stochastic multiobjective optimizers. Presented at the Third International Conference on Evolutionary Multi-Criterion Optimization (EMO 2005), 2005.
[19] Frauke Friedrichs and Christian Igel. Evolutionary tuning of multiple SVM parameters. Neurocomputing, 64(C):107–117, 2005.
[20] H. Fröhlich, O. Chapelle, and B. Schölkopf. Feature selection for support vector machines using genetic algorithms. International Journal on Artificial Intelligence Tools, 13(4):791–800, 2004.
[21] Tobias Glasmachers and Christian Igel. Gradient-based adaptation of general gaussian kernels. Neural Computation, 17(10):2099–2105, 2005.
[22] Carl Gold and Peter Sollich. Model selection for support vector machine classification. Neurocomputing, 55(1-2):221–249, 2003.
[23] Nikolaus Hansen. Invariance, self-adaptation and correlated mutations and evolution strategies. In Proceedings of the 6th International Conference on Parallel Problem Solving from Nature (PPSN VI), volume 1917 of LNCS, pages 355–364. Springer-Verlag, 2000.
[24] Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer Series in Statistics. Springer-Verlag, 2001.
[25] C. Igel. Multi-objective model selection for support vector machines. In C. A. Coello Coello, E. Zitzler, and A. Hernandez Aguirre, editors, Proceedings of the Third International Conference on Evolutionary Multi-Criterion Optimization (EMO 2005), volume 3410 of LNCS, pages 534–546. Springer-Verlag, 2005.

[26] Yaochu Jin, Tatsuya Okabe, and Bernhard Sendhoff. Neural network regularization and ensembling using multi-objective evolutionary algorithms. In Congress on Evolutionary Computation (CEC'04), pages 1–8. IEEE Press, 2004.
[27] Kees Jong, Elena Marchiori, and Aad van der Vaart. Analysis of proteomic pattern data for cancer detection. In G. R. Raidl, S. Cagnoni, J. Branke, D. W. Corne, R. Drechsler, Y. Jin, C. G. Johnson, P. Machado, E. Marchiori, F. Rothlauf, G. D. Smith, and G. Squillero, editors, Applications of Evolutionary Computing, volume 3005 of LNCS, pages 41–51. Springer-Verlag, 2004.
[28] S. S. Keerthi. Efficient tuning of SVM hyperparameters using radius/margin bound and iterative algorithms. IEEE Transactions on Neural Networks, 13(5):1225–1229, 2002.
[29] M. T. Miller, A. K. Jerebko, J. D. Malley, and R. M. Summers. Feature selection for computer-aided polyp detection using genetic algorithms. In Anne V. Clough and Amir A. Amini, editors, Medical Imaging 2003: Physiology and Function: Methods, Systems, and Applications, volume 5031 of Proceedings of the SPIE, pages 102–110, 2003.
[30] Anuj Mohan, Constantine Papageorgiou, and Thomas Poggio. Example-based object detection in images by components. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(4):349–361, 2001.
[31] Katharina Morik, Peter Brockhausen, and Thorsten Joachims. Combining statistical learning with a knowledge-based approach - a case study in intensive care monitoring. In Proceedings of the 16th International Conference on Machine Learning, pages 268–277. Morgan Kaufmann, 1999.
[32] M. Oren, C. Papageorgiou, P. Sinha, E. Osuna, and T. Poggio. Pedestrian detection using wavelet templates. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 193–199, 1997.
[33] S. Pang and N. Kasabov. Inductive vs. transductive inference, global vs. local models: SVM, TSVM, and SVMT for gene expression classification problems. In International Joint Conference on Neural Networks (IJCNN), volume 2, pages 1197–1202. IEEE Press, 2004.
[34] C. Papageorgiou, T. Evgeniou, and T. Poggio. A trainable pedestrian detection system. In Proceedings of the IEEE International Conference on Intelligent Vehicles Symposium 1998, pages 241–246, 1998.
[35] C. P. Papageorgiou. A trainable system for object detection in images and video sequences. Technical Report AITR-1685, Massachusetts Institute of Technology, Artificial Intelligence Laboratory, 2000.
[36] Constantine Papageorgiou and Tomaso Poggio. A trainable system for object detection. International Journal of Computer Vision, 38(1):15–33, 2000.
[37] G. Rätsch, T. Onoda, and K.-R. Müller. Soft margins for AdaBoost. Machine Learning, 42(3):287–320, 2001.
[38] Günther Rudolph. On correlated mutations in evolution strategies. In R. Männer and B. Manderick, editors, Parallel Problem Solving from Nature 2 (PPSN II), pages 105–114. Elsevier, 1992.
[39] Thomas Philip Runarsson and Sven Sigurdsson. Asynchronous parallel evolutionary model selection for support vector machines. Neural Information Processing – Letters and Reviews, 3(3):59–68, 2004.
[40] Y. Sawaragi, H. Nakayama, and T. Tanino. Theory of Multiobjective Optimization, volume 176 of Mathematics in Science and Engineering. Academic Press, 1985.

[41] B. Schölkopf, C. Burges, and V. Vapnik. Extracting support data for a given task. In U. M. Fayyad and R. Uthurusamy, editors, Proceedings of the First International Conference on Knowledge Discovery & Data Mining, pages 252–257. AAAI Press, Menlo Park, CA, 1995.
[42] B. Schölkopf and A. J. Smola. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, 2002.
[43] John Shawe-Taylor and Nello Cristianini. Kernel Methods for Pattern Analysis. Cambridge University Press, 2004.
[44] S. Y. M. Shi, P. N. Suganthan, and K. Deb. Multi-class protein fold recognition using multi-objective evolutionary algorithms. In IEEE Symposium on Computational Intelligence in Bioinformatics and Computational Biology, pages 61–66, 2004.
[45] Vladimir N. Vapnik. The Nature of Statistical Learning Theory. Springer-Verlag, 1995.
[46] S. Wiegand, C. Igel, and U. Handmann. Evolutionary multi-objective optimization of neural networks for face detection. International Journal of Computational Intelligence and Applications, 4(3):237–253, 2004. Special issue on Neurocomputing and Hybrid Methods for Evolving Intelligence.
[47] F. Xu, X. Liu, and K. Fujimura. Pedestrian detection and tracking with night vision. IEEE Transactions on Intelligent Transportation Systems, 6(1):63–71, 2005.
[48] L. Zhao and C. Thorpe. Stereo- and neural network-based pedestrian detection. In Proceedings of the IEEE International Conference on Intelligent Transportation Systems'99, pages 298–303, 1999.

10

Multi-Objective Evolutionary Algorithm for Radial Basis Function Neural Network Design

Gary G. Yen

School of Electrical and Computer Engineering, Oklahoma State University, Stillwater, OK 74078-5032, USA
gyen@okstate.edu

G.G. Yen: Multi-Objective Evolutionary Algorithm for Radial Basis Function Neural Network Design, Studies in Computational Intelligence (SCI) 16, 221–239 (2006)
www.springerlink.com © Springer-Verlag Berlin Heidelberg 2006

Summary. In this chapter, we present a multiobjective evolutionary algorithm based design procedure for radial-basis function neural networks. A Hierarchical Rank Density Genetic Algorithm (HRDGA) is proposed to evolve the neural network's topology and parameters simultaneously. Compared with traditional genetic algorithm based designs for neural networks, the hierarchical approach addresses several deficiencies highlighted in literature. In addition, the rank-density based fitness assignment technique is used to optimize the performance and topology of the evolved neural network to trade off between the training performance and network complexity. Instead of producing a single optimal solution, HRDGA provides a set of near-optimal neural networks to the designers so that they can have more flexibility for the final decision-making based on certain preferences. In terms of searching for a near-complete set of candidate networks with high performances, the networks designed by the proposed algorithm prove to be competitive, or even superior, to three state-of-the-art designs for radial-basis function neural networks to predict Mackey-Glass chaotic time series.

10.1 Introduction

Neural Networks (NN's) and Genetic Algorithms (GA's) represent two emerging technologies inspired by biologically motivated computational paradigms. NN's are derived from the information-processing framework of a human brain to emulate the learning behavior of an individual, while GA's are motivated by the theory of evolution to evolve a whole population toward better fitness. Although these two technologies seem quite different in the time period of action, number of involved individuals, and the process scheme, their similar dynamic behaviors stimulate research on whether a synergistic combination of these two technologies may provide more problem solving power than either alone [22]. There has been an extensive analysis of different classes of neural networks possessing various architectures and training algorithms. Without a proven

guideline, the design of an optimal neural network for a given problem is an ad hoc process. Given a sufficient number of neurons, more than one neural network structure (i.e., with different weighting coefficients and numbers of neurons) can be trained to solve a given problem within an error bound if given sufficient training time. The decision of "which network is the best" is often decided by which network will better meet the user's needs for a given problem. It is known that the performance of neural networks is sensitive to the number of neurons. Too few neurons can result in underfitting problems (poor approximation), while too many neurons may contribute to overfitting problems. Obviously, achieving a better network performance and simplifying the network topology are two competing objectives. This has promoted research on how to identify an optimal and efficient neural network structure. AIC (Akaike Information Criterion) [19] and PMDL (Predictive Minimum Description Length) [8] are two well-adopted approaches. However, AIC can be inconsistent and has a tendency to overfit a model, while PMDL only succeeded in relatively simple neural network structures and seemed very difficult to extend to a complex NN structure optimization problem. Moreover, the multiobjective trade-off characteristic of neural network design has not been well studied and applied in real-world applications.

Since the 1990's, evolutionary algorithms have been successfully applied to the design of network topologies and the choice of learning parameters [1, 2, 15, 17, 18]. These studies reported some encouraging results that are comparable with conventional neural network design approaches. However, all of these approaches tend to produce a single neural network for each run, offering the designers no alternative choices.

In this chapter, we propose a Hierarchical Rank Density Genetic Algorithm (HRDGA) for neural network design in order to evolve a set of near-optimal neural networks. Without loss of generality, we will restrict our discussions to the radial basis function neural network. In HRDGA, each chromosome is a candidate neural network and is coded by three different gene segments: high level segments have control genes that can determine the status (activated or deactivated) of genes in lower level segments. Hidden layers and neurons are added or deleted by this "on/off" scheme to achieve an optimal structure through a survival of the fittest evolution. Meanwhile, weights and biases are evolved along with the neural network topology. Treating the neural network design as a bi-objective optimization problem, a new rank-density based fitness assignment technique is developed to evaluate the structure complexity and the performance of the evolved neural network. More importantly, instead of a single network, HRDGA produces a set of near-optimal candidate networks with different trade-off traits from which the designers or decision makers can make flexible choices based on their preferences.

The remainder of this chapter is organized as follows. Section 10.2 discusses the neural network design dilemma and the difficulty of finding a single optimal neural network. Section 10.3 reviews various approaches to applying genetic algorithms for neural network design and introduces the proposed hierarchical structure, genetic operators, and multi-fitness measures of the proposed genetic algorithm. Section 10.4 applies hierarchical genotype representation to a radial-basis function neural network design. Section 10.5 introduces the proposed rank-density fitness assignment technique for multiobjective genetic algorithms and describes HRDGA parameters and design flowchart. Section 10.6 presents a feasibility study on the Mackey-Glass chaotic time series prediction using HRDGA evolved neural networks. A time series with chaotic character is trained and the performance is compared with those of the k-nearest neighbors, generalized regression, and orthogonal least square training algorithms. Finally, Section 10.7 provides some concluding remarks along with pertinent observations.

10.2 Neural Network Design Dilemma

To generate a neural network that possesses practical applicability, several essential conditions need to be considered.

1. A training algorithm that can search for the optimal parameters (i.e., weights and biases) for the specified network structure and training task.
2. A rule or algorithm that can determine the network complexity and ensure it to be sufficient for solving the given training problem.
3. A metric or measure to evaluate the reliability and generalization of the produced neural network.

The design of an optimal neural network involves all of these three problems. As given in [6], the ultimate goal of the construction of a neural network with the input-output relation $y = f_{NS}(x, \omega)$ is the minimization of the expectation of a cost function $g_T(f_{NS}(X, \omega), Y)$ as:

$$E[g_T(f_{NS}(X, \omega), Y)] = \iint g_T(f_{NS}(x, \omega), y)\, f_{x,y}(x, y)\, dx\, dy, \tag{10.1}$$

where $f_{x,y}(x, y)$ denotes the joint pdf that depends on the input vector $x$ and the target output vector $y$. $X$ and $Y$ are spaces spanned by all individual training samples, $x$ and $y$.

Given a network structure $NS$, a family of input-output relations $F_{NS} = \{f_{NS}(x, \omega)\}$, parameterized by $\omega$, consisting of all network functions that may be formed with different choices of the weights can be assigned. The structure $NS'$ is said to be dominated by $NS''$ if $F_{NS'} \subset F_{NS''}$. In order to choose the optimal neural network, two problems have to be solved.

1. Determination of the network function $f^*_{NS}(x)$ (i.e., the determination of the respective weights) that gives the minimal cost value within the family $F_{NS}$:

$$f^*_{NS}(x) = f_{NS}(x, \omega^*) = \arg\min_{\omega} E[g_L(f_{NS}(X, \omega), Y)], \tag{10.2}$$

where $g_L(\cdot, \cdot)$ denotes the cost function measuring the performance over the training set.

2. Determination of the network structure $NS^*$ that realizes the minimal cost value within a set of structures $\{NS\}$:

$$NS^* = \arg\min_{NS \in \{NS\}} E[g_T(f^*_{NS}(X), Y)]. \tag{10.3}$$

Obviously, the solutions of both tasks need not result in a unique network. In [6], if several structures $NS_1^*, NS_2^*, \cdots$ meet the criterion as shown in Equation (10.3), the one with the minimal number of hidden neurons is defined as optimal. However, as a neural network can only tune the weights by the given training data sets, and these data sets are always finite, there will be a trade-off between NN learning capability and the number of the hidden neurons. A network with insufficient neurons might not be able to approximate well enough the functional relationship between input and target output. On the other hand, if the number of neurons is excessive, the realized network function will depend greatly on the resulting realization of the given limited training set. This trade-off characteristic implies that a single optimal neural network is very difficult to find, as extracting $f^*_{NS}(x)$ from $F_{NS}$ by using a finite training data set is a difficult task, if not impossible [9]. For this reason, instead of trying to obtain a single optimal neural network, finding a set of near-optimal networks with different network structures seems more feasible. Each individual in this neural network set may provide different training and test performances for different training and test data sets. Moreover, the idea of providing "a set of" candidate networks to the decision makers can offer more flexibility in selecting an appropriate network judged by their own preferences. Therefore, genetic algorithms and multiobjective optimization techniques can be introduced in neural network design problems to evolve network topology along with parameters and present a set of alternative network candidates.

10.3 Neural Network Design with Genetic Algorithm

In the literature of applying genetic algorithms to assist neural network design, several approaches have been proposed for different objectives. These approaches can be categorized into four different areas.

10.3.1 Data Preparation

GA's were primarily used to help NN design by pre-processing data. Kelly and Davis used a genetic algorithm to find the rotation of a data set and scaling factors for each attribute to improve the performance of a KNN classifier [13]. Chang and Lippmann used a genetic algorithm to reduce the dimensionality of a feature set for a KNN classifier in a speech recognition task [3].

10.3.2 Evolving Network Parameters

Belew and his colleagues used a genetic algorithm to identify a good initial weighting configuration for a back-propagation network [1]. Bruce and Timothy used a genetic algorithm to evolve the centers and widths for a radial basis function neural network [2]. Genetic algorithms are used to evolve the weights or biases in a fixed topology neural network. The structure, number of layers and number of neurons, is pre-determined based upon some heuristic judgments.

10.3.3 Evolving Network Topology

This is the most targeted area in which genetic algorithms can be used in neural network design. Miller et al. used a genetic algorithm to evolve an optimally connected matrix to form a neural network [16]. Whitley et al. used a genetic algorithm to find which links could be eliminated in order to achieve a specific learning objective [21]. Lucas proposed a GA based adaptive neural architecture selection method to evolve a back-propagation neural network [15], and Davila applied GA's schema theory to aid the design of genetic coding for NN topology optimization. Three main problems exist in current topology design research, namely network feasibility, one genotype mapping multiple phenotypes and one phenotype mapping different genotypes [23].

10.3.4 Evolving NN Structures together with Weights and Biases

Dasgupta and McGregor proposed an sGA (Structure Genetic Algorithm) to evolve neural networks [5]. But in their work, only a XOR problem and a 4×4 encoder/decoder problem were tested. Since an $N$-neuron neural network must be expressed as a chromosome with a bit string of length $N^2$, a complex phenotype will map to a much more complex genotype. As a result, using sGA to evolve a large neural network is computationally expensive, if not impossible. On the other hand, Zhang and Cho proposed Bayesian evolutionary algorithms to evolve the structures and parameters of neural trees, which are then used to predict a time series [24], but this method can only be used for simple problems. However, both of these algorithms use the connection matrix and from-to units, which had been shown to easily produce the "one phenotype mapping different genotypes" problem.

To avoid this problem, a hierarchical genotype representation is adopted in this study. HGA (Hierarchical Genetic Algorithm) was first proposed by Ke for fuzzy controller design [12]. They used two layer genes to evolve membership functions for a fuzzy logic design, which is relatively simple. Based on this idea, Yen and Lu designed an HGA Neural Network (HGA-NN) [23]. In the HGA-NN, a three-layer HGA is used to evolve a Multi-layer Perceptron (MLP) neural network. The chromosome structure (genotype) is shown in Figure 10.1(a). As shown

in Figure 10.1(a), each candidate chromosome corresponding to a neural network implementation is assumed to have at most four hidden layers (shown in the high-level layer genes), where the first and the third hidden layers are activated (as indicated with binary bits 1) and the second and the fourth hidden layers are deactivated (with binary bits 0). Additionally, we assume at most three neurons in each hidden layer as shown in the space available in neuron gene corresponding to each element in layer gene. The mid-level neuron genes indicate that two out of three neurons in the first hidden layer are activated, while only one neuron in the third hidden layer is activated. Since the second and the fourth layers are deactivated, their neurons are not used. The low-level parameter genes are then used to represent the weighting and bias parameters of each corresponding neuron activated. The active status of one control gene determines whether the parameters of the next level controlled by this gene will be activated or not. As an example, a genetic chromosome (genotype) shown in Figure 10.1(a) corresponds to an individual neural network (phenotype) with two hidden layers and two and one neuron in each layer in Figure 10.1(b). By using this hierarchical genotype, the problem of "one phenotype mapping different genotypes" can be prevented.

Fig. 10.1. Genotype structure of an individual MLP neural network (left) and the corresponding phenotype with layered neural network topology (right).

10.4 HGA Evolved Radial-Basis Function NN

In a similar spirit, HGA is tailored in this chapter to evolve an RBF (Radial-Basis Function) neural network. A radial-basis function can be formed as:

$$f(x) = \sum_{i=1}^{m} \omega_i \exp\left(-\lVert x - c_i \rVert^2\right), \tag{10.4}$$

where $c_i$ denotes the center of the $i$th localized function, $\omega_i$ is the weighting coefficient connecting the $i$th Gaussian neuron to the output neuron, and $m$ is the number of Gaussian neurons in the hidden layer. Without loss of generality, we choose the variance as unity for each Gaussian neuron.
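Equation (10.4) can be evaluated directly. The sketch below is a minimal NumPy illustration, assuming unit-variance Gaussians as stated above; the function name `rbf_output` and the example centers and weights are hypothetical and not taken from the chapter.

```python
import numpy as np

def rbf_output(x, centers, weights):
    """Evaluate f(x) = sum_i w_i * exp(-||x - c_i||^2) for unit-variance Gaussians."""
    x = np.asarray(x, dtype=float)               # input vector, shape (d,)
    centers = np.asarray(centers, dtype=float)   # Gaussian centers c_i, shape (m, d)
    weights = np.asarray(weights, dtype=float)   # output weights w_i, shape (m,)
    sq_dist = np.sum((centers - x) ** 2, axis=1) # squared distances ||x - c_i||^2
    return float(weights @ np.exp(-sq_dist))

# Two hidden neurons in a two-dimensional input space (made-up values)
centers = [[0.0, 0.0], [1.0, 0.0]]
weights = [2.0, -1.0]
y = rbf_output([0.0, 0.0], centers, weights)     # equals 2*exp(0) - 1*exp(-1)
```

The number of rows in `centers` plays the role of $m$, so removing a row together with its weight corresponds to removing one Gaussian neuron from the hidden layer.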

10. weight genes.2. The Pareto front yields many candidate solutions— non-dominated points. the weight genes and center genes are represented by real values. neural network design problems have a multi- objective trade-off characteristic in terms of optimizing network topology and performance. their corresponding weight and center parameters are used.1 Multiobjective Optimization Problems Multiobjective Optimization (MO) is a very important research topic. The lengths of these three layers of genes are the same and specified by the user. It consists of a family of non-dominated points. and center genes. 10. In many optimization problems. there is not even a universally accepted defini- tion of “optimum” as in single-objective optimization [10]. but also many open issues to be answered qualitatively and quantitatively. which describes the trade- off among contradicted objectives. a so-called Pareto front [7]. because the solu- tion to a MO problem is generally not a single point. As a result. The value of each control gene (0 or 1) determines the activation status (off or on) of the corresponding weight gene and center gene. be- cause most real world problems have not only a multiobjective nature. from which we can choose the desired one under different trade-off conditions. third and fifth hidden neurons are activated. 10 Radial Basis Function Neural Network Design 227 Fig. multiobjective genetic algorithm is applied in NN design procedure.5 Multiobjective Genetic Algorithm As discussed in Section 10. 10. where the first. Therefore.3. In most cases. In HGA based RBF neural network design. the Pareto front is on the . Control genes and weight genes are randomly initialized and the center genes are randomly selected from given training data samples.2 shows the genotype and phenotype of a HGA based RBF neural network. On the other hand. Genotype (left) and phenotype (right) of HGA based RBF neural net- work.5. Figure 10. 
genes in the genotype are hierarchically structured into three layers: control genes.

boundary of the feasible range as shown in Figure 10.3.

Fig. 10.3. Graphical illustration of the Pareto optimality.

Considering the NN design dilemma outlined in Section 10.2, a neural network design problem can be regarded as a class of MO problems as minimizing network structure and improving network performance, which are two conflicting objectives. Therefore, searching for a near-complete set of non-dominated and near-optimal candidate networks as the design solutions (i.e., Pareto front) is our goal.

10.5.2 Rank-density Based Fitness Assignment

Since the 1980's, several Multiobjective Genetic Algorithms (MOGAs) have been proposed and applied in multiobjective optimization problems [25]. These algorithms all have almost the same purpose: searching for a uniformly distributed and near-optimal Pareto front for a given MO problem. However, this ultimate goal is far from being accomplished by the existing MOGAs described in literature due to trade-off decisions between homogeneously distributing the computational resources and GA’s strong tendencies to restrict searching efforts (i.e., genetic drift). In this chapter, we propose a new rank-density based fitness assignment technique in a multiobjective genetic algorithm to assist neural network design. Compared to traditional fitness assignment methods, the proposed rank-density based technique possesses the following characteristics: a) simplifying the problem domain by converting high-dimensional multiple objectives into two objectives to minimize the individual rank value and population density value; b) searching for and keeping better-approximated Pareto points by diffusion and elitism schemes; and c) preventing harmful individuals by introducing a “forbidden region” concept. Three essential techniques were applied in this technique.

Automatic Accumulated Ranking Strategy (AARS)

In HRDGA, an Automatic Accumulated Ranking Strategy (AARS) is applied to calculate the Pareto rank value, which represents the dominated

relationship among individuals. In AARS, an individual’s rank value is defined as the summation of the rank values of the individuals that dominate it. Assume at generation $t$, individual $y$ is dominated by $p^{(t)}$ individuals $y_1, y_2, \cdots, y_{p^{(t)}}$, whose rank values are already known as $\mathrm{rank}(y_1, t), \mathrm{rank}(y_2, t), \cdots, \mathrm{rank}(y_{p^{(t)}}, t)$. Its rank value can be computed by

$$ \mathrm{rank}(y, t) = 1 + \sum_{j=1}^{p^{(t)}} \mathrm{rank}(y_j, t). \tag{10.5} $$

Therefore, by AARS, all the non-dominated individuals are still assigned rank value 1, while dominated ones are penalized to reduce the population density and redundancy.

Adaptive Density Value Calculation

To maintain the diversity of the obtained Pareto front, HRDGA adopts an adaptive cell density evaluation scheme as shown in Figure 10.4. The cell width in each objective dimension can be computed as:

$$ d_i = \frac{\max_{x \in X} f_i(x) - \min_{x \in X} f_i(x)}{K_i}, \quad i = 1, \cdots, n, \tag{10.6} $$

where $d_i$ is the width of the cell in the $i$th dimension, $K_i$ denotes the number of cells designated for the $i$th dimension (i.e., in Figure 10.4, $K_1 = 12$ and $K_2 = 8$), and $X$ denotes the decision vector space. As the maximum and minimum fitness values will change with different generations, the cell size will vary from generation to generation to maintain the accuracy of the density calculation. The density value of an individual is defined as the number of the individuals located in the same cell.

Rank-density Based Fitness Assignment

Because rank and density values represent fitness and population diversity, respectively, the new rank-density fitness formulation can convert any multiobjective optimization problem into a bi-objective optimization problem. Here, population rank and density values are designated as the two fitness values for GA to minimize. Before fitness evaluation, the entire population is divided into two subpopulations with equal sizes, and each subpopulation is filled with individuals that are randomly chosen from the current population according to rank and density value, respectively. Afterwards, the entire population is shuffled, and crossover and mutation are then performed. For crossover, the parent selection and replacement schemes are borrowed from Cellular GA [14] to explore the new search area by “diffusion.” For each subpopulation, a fixed number of parents are randomly selected for crossover. Then, each selected parent performs crossover with the best individual (the

one with the lowest rank value) within the same cell and the nearest neighboring cells that contain individuals. If one offspring produces better fitness (a lower rank value or a lower population density value) than its corresponding parent, it replaces its parent. The replacement scheme of the mutation operation is analogous.

Fig. 10.4. Density map and density grid.

Meanwhile, we take the minimization of the population density value as one of the objectives. It is expected that the entire population will move toward an opposite direction to the Pareto front where the population density value is being minimized. Although moving away from the true Pareto front can reduce population density value, obviously, these individuals are harmful to the population to converge to the Pareto front. To prevent “harmful” offspring surviving and affecting the evolutionary direction and speed, a forbidden region concept is proposed in the replacement scheme for the density subpopulation, thereby preventing the “backward” effect. The forbidden region includes all the cells dominated by the selected parent. The offspring located in the forbidden region will not survive in the next generation, and thus the selected parent will not be replaced. As shown in Figure 10.5, suppose our goal is to minimize objectives $f_1$ and $f_2$, and a resulting offspring of the selected parent $p$ is located in the forbidden region. This offspring will be eliminated even if it reduces the population density, because this kind of offspring has the tendency to push the entire population away from the desired evolutionary direction.

Finally, the simple elitism scheme [7] is also applied as the bookkeeping for storing the Pareto individuals obtained in each generation. These individuals are compared to achieve the final Pareto front after the evolution process has stopped.
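The rank, density, and forbidden-region ideas of Section 10.5.2 can be sketched in a few lines. This is only an illustration under stated assumptions: all objectives are minimized, an individual counts itself when its cell density is computed, the individual with the largest objective value is placed in the last cell, and "cells dominated by the selected parent" is read as cells whose grid indices are dominated by the indices of the parent's cell. The function names and sample points are hypothetical, not taken from the chapter.

```python
def dominates(a, b):
    """True if objective vector a Pareto-dominates b (all objectives minimized)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def aars_ranks(objs):
    """Rank of y = 1 + sum of the ranks of all individuals that dominate y (Eq. 10.5)."""
    n = len(objs)
    dominators = [[j for j in range(n) if dominates(objs[j], objs[i])]
                  for i in range(n)]
    ranks = [0] * n
    # Anything that dominates y has strictly fewer dominators than y, so visiting
    # individuals by dominator count guarantees the needed ranks are already known.
    for i in sorted(range(n), key=lambda k: len(dominators[k])):
        ranks[i] = 1 + sum(ranks[j] for j in dominators[i])
    return ranks

def cell_indices(objs, cells_per_dim):
    """Grid cell of each individual, with width d_i = (max f_i - min f_i) / K_i (Eq. 10.6)."""
    dims = range(len(cells_per_dim))
    lows = [min(o[i] for o in objs) for i in dims]
    highs = [max(o[i] for o in objs) for i in dims]
    cells = []
    for o in objs:
        cell = []
        for i in dims:
            width = (highs[i] - lows[i]) / cells_per_dim[i]
            idx = 0 if width == 0 else int((o[i] - lows[i]) / width)
            cell.append(min(idx, cells_per_dim[i] - 1))  # maximum value goes in the last cell
        cells.append(tuple(cell))
    return cells

def densities(cells):
    """Density value = number of individuals located in the same cell."""
    return [cells.count(c) for c in cells]

def in_forbidden_region(parent_cell, child_cell):
    """Assumed reading: the offspring's cell is dominated by the parent's cell."""
    return dominates(parent_cell, child_cell)

# Hypothetical two-objective population.
objs = [(1, 4), (2, 2), (4, 1), (3, 3), (5, 5)]
ranks = aars_ranks(objs)
cells = cell_indices(objs, cells_per_dim=(2, 2))
dens = densities(cells)
```

In this example the three non-dominated points keep rank 1, the point (3, 3) gets rank 2 because only (2, 2) dominates it, and (5, 5) accumulates the ranks of all four dominators. Because the grid bounds are recomputed from the current population, calling `cell_indices` once per generation reproduces the adaptive cell size described above.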

Fig. 10.5. Illustration of the valid range and the forbidden region.

10.5.3 HRDGA for NN Design

To assist RBF network design, HRDGA is applied to carry out the fitness evaluation and mating selection schemes. The HRDGA operators are designed as follows.

Chromosome Representation

In HRDGA, each individual (chromosome) represents a candidate neural network. The control genes are binary bits (0 or 1). For the weight and center genes, real values are adopted as the gene representation to reduce the length of the chromosome. The population size is fixed and chosen ad hoc by the difficulty of the problem to be solved.

Crossover and Mutation

We used one-point crossover in the control gene segments and two-point crossover in the other two gene segments. The crossover points were randomly selected and the crossover rates were chosen to be 0.8, 0.7, and 0.7 for the control, weight, and center genes, respectively. One-point mutation was applied in each segment. In the control gene segment, common binary value mutation was adopted. In the weight and center gene segments, real value mutation was performed by adding a Gaussian(0, 1), which denotes a Gaussian function with zero mean and unit variance. The mutation rates were set to be 0.1, 0.05, and 0.05 for the control, weight, and center genes, respectively.
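The gene-level operators described above can be illustrated with a short sketch. It makes several assumptions that the text does not fix: each segment is a plain Python list, center genes are scalars (one-dimensional inputs), a mutation "rate" is the probability that a segment receives a single one-point mutation, and cut points are passed in explicitly so the behaviour is easy to check. All function names are hypothetical.

```python
import random

def one_point_crossover(a, b, cut):
    """Swap the tails of two segments after position `cut` (used for control genes)."""
    return a[:cut] + b[cut:], b[:cut] + a[cut:]

def two_point_crossover(a, b, lo, hi):
    """Swap the middle slice [lo, hi) of two segments (used for weight and center genes)."""
    return a[:lo] + b[lo:hi] + a[hi:], b[:lo] + a[lo:hi] + b[hi:]

def mutate_control(bits, rate, rng):
    """With probability `rate`, flip one randomly chosen control bit."""
    bits = list(bits)
    if rng.random() < rate:
        i = rng.randrange(len(bits))
        bits[i] = 1 - bits[i]
    return bits

def mutate_real(genes, rate, rng):
    """With probability `rate`, add a Gaussian(0, 1) sample to one randomly chosen gene."""
    genes = list(genes)
    if rng.random() < rate:
        i = rng.randrange(len(genes))
        genes[i] += rng.gauss(0.0, 1.0)
    return genes

def decode(control, weights, centers):
    """Phenotype: keep the (weight, center) pairs whose control gene is switched on."""
    return [(w, c) for g, w, c in zip(control, weights, centers) if g == 1]

rng = random.Random(0)  # seeded only to make the sketch repeatable
control_a, control_b = [1, 0, 1, 0, 1], [0, 1, 0, 1, 0]
child_a, child_b = one_point_crossover(control_a, control_b, cut=2)
w_child, _ = two_point_crossover([1.0, 2.0, 3.0, 4.0, 5.0],
                                 [10.0, 20.0, 30.0, 40.0, 50.0], lo=1, hi=3)
active = decode(control_a, [0.5, 0.1, -0.3, 0.9, 0.2], [0.0, 0.25, 0.5, 0.75, 1.0])
flipped = mutate_control(control_a, rate=1.0, rng=rng)
```

Here `decode` returns the first, third, and fifth neurons, matching the activation pattern used as the example for the hierarchical genotype. In a full loop the cut points would be drawn from `rng` and each operator would be applied with the per-segment rates given in the text.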

Fitness Evaluations and Mating Selection

Since we are trying to use HRDGA to optimize the neural network topology along with its performance, we need to convert them into the rank-density domain. Therefore, the original fitness (network performance and number of neurons) of each individual in a generation is evaluated and ranked, and the density value is calculated. Then the new rank and density fitness values of each individual will be evaluated and the individuals with higher fitness measures will reproduce and crossover with other high fitness individuals with a certain probability. Their offspring replaces the low fitness parents forming a new generation. Mating is then iteratively processed.

Stopping Criteria

When the desired number of generations is met, the evolutionary process stops.

10.6 Mackey-Glass Chaotic Time Series Prediction

Since the proposed HRDGA is designed to evolve the neural network topology together with its best performance, it proves useful in solving complex problems such as time series prediction or pattern classification. For a feasibility check, we use the HRDGA assisted NN design to predict the Mackey-Glass chaotic time series.

10.6.1 Mackey-Glass Time Series

The Mackey-Glass time series is generated by a continuous time-delay differential equation:

$$ \frac{dx(t)}{dt} = \frac{a\,x(t - \tau)}{1 + x^{c}(t - \tau)} - b\,x(t). \tag{10.7} $$

The chaotic behavior of the Mackey-Glass time series is determined by the delay parameter τ. Some examples are listed in Table 10.1. Here we assign a = 0.2, b = 0.1 and c = 10 for Equation (10.7). Larger values of τ produce more chaotic dynamics which are much more difficult to predict. In this study, we used HRDGA evolved neural networks to predict a chaotic Mackey-Glass time series with τ = 150. The network is set to predict x(t + 6) based on x(t), x(t − 6), x(t − 12), and x(t − 18). In the proposed HRDGA, 150 initial center genes are selected, and 150 control genes and 150 weight genes are initially generated as well. Population size was set to be 400. For comparison, we applied three other center selection methods: KNN (K-Nearest Neighbor) [11], GRNN (Generalized Regression

Furthermore. Ideally. we can see.000. We used the first 250 seconds of the data as the training data set.6. For KNN and GRNN types of networks.10. we selected 100 different ρvalues. 10.10 show the same types of results by using the second. and Figure 10. ρ should be larger than. For HRDGA. because the RBF centers of the . The stopping criteria for KNN.000 – 1. GRNN. where σε2 is the variance of the residuals. we generated a group of neural networks with various training performances and numbers of hidden neurons. Figure 10.7(b) shows their corresponding Pareto fronts.8.4 with the step size of 0. the stopping generation is set to be 5.6(a) shows the resulting average training SSEs of neural networks with different number of hidden neurons by four training approaches.000.249 seconds were used as the corresponding test data sets to be predicted by four different approaches. From Figures 10.53 < τ < 13. or the training Sum Square Error (SSE) between two sequential generations is smaller than 0.6(b) shows the approximated Pareto fronts (i. and OLS (Orthogonal Least Square Error) [4] methods on the same time series prediction problem.3 < τ < 16. Each approach runs 30 times with different parameter initializations to obtain the average results. and then the data from 250 – 499.8 Period limit cycle doubles τ > 16. Table 10. and σd2 is the variance of the desired output. Figure 10. comparing to KNN and GRNN. and OLS algorithms is either the epochs exceed 5. which are from 0. the ratio σε2 /σd2 . HRDGA and OLS algorithms have much smaller training and test errors for the same network structures.01 to 0. Figure 10. For the OLS method.e.. A smaller ρ value will produce a neural network with more neuron number. third.7(a) shows the average test SSEs of the resulting networks by using the first test data set for each approach. but very close to.10.1.8 Chaotic attractor characterized by τ Neural Network) [20]. 
Characteristics of Mackey-Glass time series delay parameter τ Chaotic characteristics τ < 4. 750 – 999. Therefore. Each of these networks will be trained by KNN and GRNN methods. whereas a larger ρ value generally results in a network with less number of neurons. For the given Mackey-Glass time series prediction problem. by using different ρ values. and the stopping criterion is the same with the one used in HRDGA. respectively. non-dominated sets) by the selected four approaches.01.2 shows the best training and test performances and their cor- responding numbers of hidden neurons.3 A stable limit cycle attractor 13.53 A stable fixed point attractor 4. KNN trained networks produce the worst performances.01.9 and 10. 10 Radial Basis Function Neural Network Design 233 Table 10. respectively. and fourth test data. the selection of the tolerance parameter ρ determines the trade-off between the performance and complexity of the network. 500 – 749. 70 networks are generated with the neuron numbers increasing from 11 to 80 with the step equals to one. Figures 10. and 1.

Comparison of best performance (SSE) and structure (number of neurons) between KNN.0711 43 2.6. 10.2.2348 37 OLS 2.7216 58 . SSE no.9644 40 3. OLS.234 G.7199 54 HRDGA2.8339 69 3. SSE no.3329 60 2. Table 10.G.3382 68 2.5534 52 2.8586 48 4. (a) Training performances for the resulting neural networks with different number of hidden neurons and (b) The corresponding Pareto fronts (non-dominated sets).5369 37 2. (a) Test performances for the resulting neural networks with different number of hidden neurons and by using test set #1 (b) The corresponding Pareto fronts (non-dominated sets).7720 38 3. Yen (a) (b) Fig.8074 19 GRNN 2.7.5226 48 2.4601 46 2.4520 42 4. GRNN and HRDGA Training Test set Test set Test set Test set set #1 #2 #3 #4 SSE no.4633 47 2. KNN 2. SSE no. (a) (b) Fig.3693 42 3.2901 74 2.5856 50 2. SSE no. 10.
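For readers who want to reproduce the kind of data used in these experiments, Equation (10.7) and the input/target layout of Section 10.6.1 can be sketched as follows. This is an assumption-laden illustration rather than the chapter's procedure: it uses a simple forward-Euler step of one time unit, a constant initial history x = 1.2, and the hypothetical function names `mackey_glass` and `make_dataset`; the chapter does not state its integration scheme or initial condition.

```python
def mackey_glass(n_steps, tau=150, a=0.2, b=0.1, c=10, x0=1.2, dt=1.0):
    """Forward-Euler integration of dx/dt = a*x(t-tau) / (1 + x(t-tau)**c) - b*x(t)."""
    delay = int(round(tau / dt))        # delay expressed in integration steps
    history = [x0] * (delay + 1)        # assumed constant history on [-tau, 0]
    for _ in range(n_steps):
        x_now = history[-1]
        x_delayed = history[-(delay + 1)]
        dx = a * x_delayed / (1.0 + x_delayed ** c) - b * x_now
        history.append(x_now + dt * dx)
    return history[delay:]              # x(0), x(dt), ..., x(n_steps * dt)

def make_dataset(series, lags=(0, 6, 12, 18), horizon=6):
    """Inputs [x(t), x(t-6), x(t-12), x(t-18)] paired with the target x(t+6)."""
    inputs, targets = [], []
    for t in range(max(lags), len(series) - horizon):
        inputs.append([series[t - lag] for lag in lags])
        targets.append(series[t + horizon])
    return inputs, targets

series = mackey_glass(300)
X, y = make_dataset(series)
```

With a unit step, each sample index corresponds to one time unit, so slicing `series` into consecutive blocks of 250 samples would mirror the training and test split described in the text. A smaller `dt` or a higher-order integrator would give a more accurate trajectory.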

KNN algorithm are randomly selected, which make KNN to achieve only a “local optimum” solution. Since GA always seeks “global optimum”, and the orthogonal result is near optimal, the performances of OLS are comparable to HRDGA.

Fig. 10.8. (a) Test performances for the resulting neural networks with different number of hidden neurons and by using test set #2 (b) The corresponding Pareto fronts (non-dominated sets).

Fig. 10.9. (a) Test performances for the resulting neural networks with different number of hidden neurons and by using test set #3 (b) The corresponding Pareto fronts (non-dominated sets).

Moreover, from Figure 10.6, we can see that when the network complexity increases, the training error decreases. This phenomenon can be observed from the results by all of the selected training approaches. However, this phenomenon is only partially maintained for the relationship between the test performances and the network complexity. Before the number of hidden neurons reaches a certain threshold, the test error still decreases as the network complexity increases. After that, the test error has the tendency to fluctuate even when the number of hidden neurons increases. This occurrence can be considered as that the resulting networks are overfitted. The network with the

best test performance before overfitting occurs is called the optimal network and is judged as the final single solution by conventional NN design algorithms. However, from Figures 10.6–10.10 and Table 10.2, since these data sets possess different traits, it is very difficult to find a single optimal network that can offer the best performances for all the test data sets. Therefore, instead of searching for a single optimal neural network, an algorithm that can result in a near-complete set of near-optimal networks can be a more reasonable and applicable option. This is the essential reason that multiobjective genetic algorithms can be justified for this type of neural network design problems.

Fig. 10.10. (a) Test performances for the resulting neural networks with different number of hidden neurons and by using test set #4 (b) The corresponding Pareto fronts (non-dominated sets).

From the simulation results, although KNN and GRNN approaches did not provide better training and test results compared to the other two approaches, they have the advantage that the designer can control the network complexity by increasing or decreasing the neuron numbers at will. On the other hand, although the OLS algorithm always provides near-optimal network solutions with good training and test performance, it also has a serious problem in generating a set of network solutions, in that the designers cannot manage the network structure directly. The trade-off characteristic between network performance and complexity totally depends on the value of tolerance parameter ρ. The same ρ value means completely different trade-off features for different NN design problems. In addition, as shown in Figure 10.11, the relationship between ρ value and network topology is a nonlinear, many-to-one mapping, which may cause a redundant computation effort in order to generate a near-complete neural network solution set. Compared with the other three training approaches, HRDGA does not have problems in designing trade-off parameters, because it treats each objective equally and independently, and its population diversity preserving techniques help it build a near-uniformly distributed non-dominated solution set.
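The argument for returning a set of networks rather than a single one can be made concrete with a small sketch. It filters candidate networks, described here only by a (number of hidden neurons, SSE) pair, down to the non-dominated set and then applies one possible designer preference, namely the smallest network whose error fits a budget. The candidate values, the budget, and the function names are invented for illustration and are not results from the chapter.

```python
def non_dominated(candidates):
    """Keep the (neurons, sse) pairs that no other candidate beats on both objectives."""
    front = []
    for n, e in candidates:
        dominated = any(n2 <= n and e2 <= e and (n2 < n or e2 < e)
                        for n2, e2 in candidates)
        if not dominated:
            front.append((n, e))
    return sorted(front)

def smallest_within_budget(front, max_sse):
    """One possible preference: the fewest neurons among networks meeting an error budget."""
    acceptable = [(n, e) for n, e in front if e <= max_sse]
    return min(acceptable) if acceptable else None

# Made-up candidates: (hidden neurons, test SSE).
candidates = [(10, 5.0), (20, 3.0), (25, 3.5), (40, 2.6), (60, 2.7)]
front = non_dominated(candidates)
choice = smallest_within_budget(front, max_sse=3.2)
```

A different preference, such as the lowest error regardless of size, would pick a different member of the same front, which is the flexibility the discussion above attributes to a Pareto-based design procedure.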

Fig. 10.11. Relationship between ρ values and network complexity.

Therefore, compared to the other three traditional training approaches, the proposed HRDGA algorithm offers several benefits for the neural network design problems in terms of:

1. providing a set of candidate solutions, which is caused by GA’s population-based optimization capability and the definition of Pareto optimality;
2. presenting competitive or even superior individuals with high training and test performances, which results from GA’s feature of seeking “global optimum” and HRDGA’s Pareto ranking technique; and
3. offering a near-complete, non-dominated set, and long-extended Pareto front, which originates from HRDGA’s population diversity keeping design that can be found in AARS, density preserving technique, and the concept of “forbidden region”.

10.7 Conclusions

In this study, we propose a multiobjective genetic algorithm based design procedure for the radial-basis function neural network. A Hierarchical Rank Density Genetic Algorithm (HRDGA) is developed to evolve both the neural network’s topology and parameters simultaneously. Instead of producing a single solution, HRDGA provides a set of near-optimal neural networks from the perspective of Pareto optimality to the designers so that they can have more flexibility for the final decision-making based on certain preferences. From the results presented above, HRDGA shows potential in estimating neural network topology and weighting parameters for complex problems when a heuristic estimation of the neural network structure is not readily available. For the given Mackey–Glass chaotic time series prediction, HRDGA shows competitive, or

A new method for initializing radial basis function classifiers. CS90-174. Conf. Int. Neural Information Processing Systems.L. Tang. Bienenstock . Using genetic algorithms to improve pattern clas- sification performance. 1991 [5] D. Structure optimization of neural networks with the A*-Algorithm.N. 1991 .238 G. P.87–96. Workshop on Combina- tions of Genetic Algorithms and Neural Networks. Springer. Masud. IEEE Trans.377–383. 1998 [13] J. 1997 [7] C. pp. of the IEEE Int. IEEE Int.C. In addition. Dousat. Man. CSE Technical Report. Industrial Electronics. In: Proc. R.M. An overview of evolutionary algorithms in multi- objective optimization. N. Hwang. L. In: Proc. Kaylani. Neural Computation.M.J. pp. Hierarchical genetic fuzzy controller for a solar power plant.J. Doering. Speech signal restoration using an opti- mal neural network structure. feedback.R. Dasgupta. Orthogonal least square learning algorithm for radial basis function networks. 1994 [12] T. H. While we considered radial-basis function neural networks. 2:303–314. 1992 [6] A. pp. or self-organized). Conf. D. Hartimo.O. In: Proc. Geman. as some of the traditional neural network training approaches (i. McGregor. Ke. 3:1–16. Witte. Genetic Algorithms.2584– 2587. Fleming. Gao. Berlin [11] T. McInerney. Neural Networks.S.J. Luk. the proposed hierarchical genetic algorithm may be easily extended to the designs of other neural net- works (i. J. Kelly. Neural networks and the bias/variance dilemma. Systems. Man. S. Ovaska. D.M.Y. feed-forward.584–588. Lippmann. K. 1990 [2] A. 1989 [10] C. 8:1434–1445. Dasgupta. pp. Neural Networks.P. Multiple objective decision making—methods and applications.1841–1846. 1991 [4] S. References [1] R. 1995 [8] X... Evolutionary Computation. Fonseca. University of California at Dan Diego. Z. a hybrid algorithm that synergistically integrates traditional training method with the proposed algorithm can be a promising future work. C.F.e. 
Yen even superior performances comparing with the other three selected training algorithms in terms of searching for a set of non-dominated. Cooperative-competitive genetic evolution of radial basis function centers and widths for time series prediction. 2:302–309. Cowan. Bruce.e. Galicki. Belew. In: Proc.D. Conf. 797–803. S. Davis. Evolving networks: Using genetic algorithms with connectionist learning.M. R. P. Hybridizing the genetic algorithm and the K-nearest neighbors classification algorithm.C. IEEE Transactions Neural Networks. Grant. K. near-complete neural network solutions with high training and test performances. Chang.W. In: Proc.G. IEEE Int. Designing application-specific neural networks using the structured genetic algorithm. Chen.K.S. Symp. A. and Cybernetics. Intl. OLS algorithm) also provide competitive results. M. E. IEEE Trans.F. Schraudolph. Neural Networks. Timothy. 7:869-880. P. 1996 [9] S. 1996 [3] E. pp.

3:257–271. 10 Radial Basis Function Neural Network Design 239 [14] T. C.M. P. Genetic Algorithms. pp. Evolving neural trees for time series prediction using Bayesian evolutionary algorithms. pp. Combination of Evolutionary Computation and Neural Networks. International Journal of Neural Systems. 2001 [18] A. Todd. Specifying intrinsically adaptive architectures. Network information criterion – determin- ing the number of hidden units for an artificial neural network model. In: Proc. 1993 [23] G. In: Proc. Multiobjective evolutionary algorithms: A comparative case study and the strength Pareto approach.224–231. Moon. IEEE Symp. IEEE Symp. Yao. 4:203–222.43–51. Lucas. Parallel Computting. Parameter control using agent based patchwork model. Krink.K. 1994 [20] P. H. S. pp. 1989 [17] S. IEEE Transactions on Neural Networks. 1st IEEE Symp. Kong. Wasserman. Block-based neural network. Combination of Evo- lutionary Computation and Neural Networks. 2000 [19] N.168–175. IEEE Trans.G. Designing neural networks using genetic algorithms. S. In: Proc. Starkweather. 2000 [15] S. Hedge.F. Morales. In: Proc. Advanced Method in Neural Computing. 17–23. 1993 [21] D. IEEE Transactions on Neural Networks. pp. Ursem. D. T. 2000 [25] E. Evolving artificial neural network. Conf. 2000 [16] G.G. 12:307–317. Zhang. 14:347– 361. 5:865–872. Combination of Evolutionary Computation and Neural Networks.W. Genetic algorithms and neural net- works: optimizing connections and connectivity. In: Proc. VNR. Cho. IEEE Congress on Evolutionary Computation. R. 3rd Int. Combination of Evolutionary Computation and Neural Networks. 1999 . S. Whitley.77–83. Non-standard norms in genetically trained neural networks. Amari. Hierarchical genetic algorithm based neural network design. In: Proc.U.K. Lu. Murata. IEEE Symp. Zitzler. New York. Thiele. Miller. pp. 1990 [22] X. 2000 [24] B. Yoshizawa. S.379–384. L. Evolutionary Com- putation.D. Bogart. Yen.

11 Minimizing Structural Risk on Decision Tree Classification

DaeEun Kim
Cognitive Robotics
Max Planck Institute for Human Cognitive & Brain Sciences
Munich, 80799, Germany
daeeun@cbs.mpg.de

D.E. Kim: Minimizing Structural Risk on Decision Tree Classification, Studies in Computational Intelligence (SCI) 16, 241–260 (2006)
www.springerlink.com © Springer-Verlag Berlin Heidelberg 2006

Summary. Tree induction algorithms use heuristic information to obtain decision tree classification. However, there has been little research on how many rules are appropriate for a given set of data, that is, how we can find the best structure leading to desirable generalization performance. In this chapter, an evolutionary multi-objective optimization approach with genetic programming will be applied to the data classification problem in order to find the minimum error rate or the best pattern classifier for each size of decision trees. As a result, we can evaluate the classification performance under various structural complexity of decision trees. Following structural risk minimization suggested by Vapnik, we can determine a desirable number of rules with the best generalization performance. The suggested method is compared with C4.5 application for machine learning data.

11.1 Introduction

The recognition of patterns and the discovery of decision rules from data examples is one of the challenging problems in machine learning. When data points with numerical attributes are involved, the continuous-valued attributes should be discretized with threshold values. Decision tree induction algorithms such as C4.5 build decision trees by recursively partitioning the input attribute space [24]. Thus, a conjunctive rule is obtained by following the tree traversal from the root node to each leaf node. Each internal node in the decision tree has a splitting criterion or threshold for continuous-valued attributes to partition a part of the input space, and each leaf represents a class depending on the conditions of its parent nodes.

The creation of decision trees often relies on heuristic information such as information gain measurement. Yet how many nodes are appropriate for a given set of data has been an open question. Mitchell [22] showed the curve of the accuracy rate of decision trees with respect to the number of nodes over the independent test examples. There exists a peak point of the accuracy rate in a certain size of decision trees; a larger size of decision trees can

where the structure of decision trees is similar to linear regression trees [6].242 D. In our EMO approach. tree size and accuracy rate in data classification. 24. and he suggested the structural risk minimization to find the best structure. that is. Many techniques such as tree growing with stopping criterion. Their method was based on the infor- mation gain measurement. 17]. it followed the C4. . This problem is related to the overfitting problem to increase the generalization error1 . However. The EMO approach was used to minimize the training error and the tree size for decision tree classification [3. An evolutionary approach to decision trees has been studied to obtain op- timal classification performance [16. 3]. the tree size and the training error. 20]. 5] have been studied to reduce the generalization error. Freitas et al. However. Recently a genetic programming approach with evolutionary multi-objective optimization (EMO) was applied to decision trees [4. 15.5 in some data sets. the EMO with genetic programming for two objectives. tree prunning or bagging [25. Kim increase its classification performance on the training samples but reduces the accuracy over the test samples which have not been seen before. Also other works using fitness and size or complexity as objectives have been reported [2. is first used to obtain the Pareto-optimal so- lutions. were considered in the method. Yet there has been no effort so far to find what is the best structure of decision trees to have the minimal generalization error. 21. We will follow the approach and the tree size will be the parameter to control the structure. but had higher test error rates than C4. searching for the best structure of decision trees has not been considered in their works. It can be achieved by exploring the empirical error (training error) and generalization error (test error) for various structure complexity. 
It has been shown that EMO is very effective for optimization of multi- objectives or constraints in continuous range [27. The method succeeded in reducing both error rates and size of decision trees in some data sets. Vapnik [26] showed an analytic study to reduce the generalization error. 4]. 7].E. the minimum training error rate for each size of trees. In this work. Also EMO is a useful tool even when the best performance for each discrete genotype or structure should be determined [19.5 splitting method and selected the attributes with genetic algorithms. a special elitism strategy 1 This is also called test error in this chapter. we can pinpoint the best structure to mini- mize the generalization error. [15] have shown evolutionary multi-objective optimization to obtain both the minimum error rate and minimum size of trees. Two ob- jectives. Then the best tree for each size will be examined to see the generalization perfor- mance for a given set of test data. A new representa- tion of decision trees for genetic programming was introduced [3]. By observing the distribution of the test error rates over the size of trees. They were able to reduce the size of decision trees. and they do not explore every size of trees. since the decision tree based on heuristics is not optimal in structure and performance. the methods are dependent upon a heuristic information or measure to estimate the generalization error. 8].

for discrete structures is applied, and an incremental evolution from small to large structures with Pareto ranking is used. Genetic programming evolves decision trees with variable thresholds and attributes. The suggested method provides the accuracy rate of classification for each size of trees as well as the best structure of decision trees. The approach will be compared with the tree induction algorithm C4.5. A preliminary study of the approach was published in [18].

11.2 Method

11.2.1 Decision Tree Classification

Decision tree learning is a popular induction algorithm. A decision tree classifies data samples by a set of decision classifiers; the classifiers are located in the internal nodes of the tree and for each instance, the tree traversal depending on the decision of the classifiers from the root node to some leaf node determines the corresponding class. The decision trees can be easily represented as a set of decision rules (if-then rules) to assist the interpretation. Inductive learning methods create such decision trees, often based on heuristic information or statistical probability.

One of the most popular learning algorithms is to construct decision trees from the root node to leaf nodes with a top-down greedy method [25, 24]. Each data attribute (with appropriate threshold if the attribute is continuous-valued) is evaluated using the information theory with entropy. This evaluation decides which attribute or what threshold of the selected attribute classifies well a given set of instances.

Let $Y$ be the set of examples and let $C$ be the set of $k$ classes. Let $p(C_i, Y)$ be the probability of the examples in $Y$ that belong to class $C_i$. Then the information gain of an attribute $A$ over a collection of instances $Y$ is defined as

$$ \mathrm{Gain}(Y, A) = \frac{E(Y) - \sum_{i=1}^{m} \frac{|Y_i|}{|Y|} E(Y_i)}{\sum_{j=1}^{m} p(Y_j) \log p(Y_j)} $$

where $E(Y) = \sum_{i=1}^{k} p(C_i, Y) \log p(C_i, Y)$ is an entropy function, $A$ has $m$ partitions, and $Y_j$ is one of $m$ partitions for the attribute $A$. The split probability $p(Y_j)$ in a continuous-valued attribute among $m$ partitions is given as the probability of the examples that belong to the partition $Y_j$ when the range of the attribute is divided into several regions.

It has been shown that information gain measure efficiently selects one of attribute vectors and its thresholds [23]. For a given set of instances, each attribute $A$ has its own threshold $\tau_i$ that produces the greatest information gain, and the threshold $\tau_i$ is one of the best cuts for the attribute $A$ to make good decision boundaries. This threshold is automatically selected by information gain measurement. One decision

One decision boundary divides the parameter space into two non-overlapping subsets; depending on $\tau_i$, it has the dichotomy $A > \tau_i$ and $A \le \tau_i$. With the above process of information gain measurement, the decision tree algorithm finds the best attribute and its threshold. However, multiple intervals of the attribute can improve the classification [10]. More sophisticated algorithms to improve the classification have been developed, but the style of tree induction has not changed much. In our experiments, C4.5 [24] is used for inductive tree classification.

11.2.2 Structural Risk Minimization

According to the statistical estimation theory by Vapnik [26], as the complexity of a model over a given set of data increases, learning algorithms such as decision trees and neural networks can reduce the approximation error, called bias, but increase the variance of the model. Much research is concerned with reducing the generalization error, which is a combination of bias and variance terms [12, 14, 13]. The generalization error is the rate of errors made by the model when it is tested on samples that have not been seen before.

The generalization error varies with respect to a control parameter of the learning algorithm used to model a given set of data. Vapnik [26] referred to this control parameter as the VC (Vapnik-Chervonenkis) dimension. The VC dimension is a measure of the capacity of a set of classification functions. Vapnik showed general bounds for the variance and the generalization error. Fig. 11.1 shows the relationship between the training error and the generalization error.

Fig. 11.1. Relationship between training error and generalization error (horizontal axis: VC dimension (tree size); vertical axis: error; curves: bias (training error), variance, and bias + variance (bound on generalization error))

The number of decision-tree nodes in induction trees can serve as a control parameter related to the generalization error, because increasing the number of nodes can decrease the bias and increase the variance in a classification problem. Here, we are interested in finding a tree structure with minimal generalization error at the expense of an increase in training error.

In structural risk minimization, a hierarchical space of structures is enumerated and then the function that minimizes the empirical risk (training error, bias) for each structure space is found. Among the collection of those functions, we can choose the best model function to minimize the generalization error. The number of leaf nodes (rules) in decision trees corresponds to the VC dimension that Vapnik mentioned in structural risk minimization [26]. The structure of decision trees can be specified by the number of leaf nodes. Thus, we define a set of pattern classifiers as follows:

$$S_k = \{F(x, \beta) \mid \beta \in D_k\}$$

where $x$ is a set of input-output vectors, $F(x, \beta)$ is a pattern classifier with parameter vector $\beta$, $D_k$ is a set of decision trees with $k$ terminal nodes, and $S_k$ is a set of pattern classifiers formed by decision trees with $k$ terminal nodes, for $k = 2, \ldots, n$. Then we have $S_1 \subset S_2 \subset \cdots \subset S_n$.

From the method of structural risk minimization [26], we can simply take the number of leaf nodes as the VC dimension, and the VC dimension of each pattern classifier is finite. In the suggested evolutionary approach, the training error for each set of pattern classifiers, $S_k$, is minimized with the EMO method, and then the generalization error for the selected pattern classifier is identified. The best pattern classifier, or the best structure of pattern classifiers, is the one with the minimum generalization error. In the experiments, 10-fold cross validation is used to estimate the generalization error.

When we wish to have a desirable set of rules over a given set of data, we have no prior knowledge of the number of rules that minimizes the generalization error. Thus, a two-phase algorithm with the EMO method can be applied to general classification problems. First, we apply the EMO method to the whole set of training instances and obtain a set of rules for each tree size. Then the method of finding the best structure with 10-fold cross validation, or another validation process, is applied to the training instances. From this information, we can decide the best VC dimension, that is, the best tree size among a collection of rule sets for the original data set. As a result, we obtain a desirable set of rules that avoids the overfitting problem.

11.2.3 Evolutionary Multiobjective Optimization

We use evolutionary multiobjective optimization to obtain Pareto-optimal solutions that have minimal training error for each tree size. In this chapter, a decision tree is encoded as a genotype chromosome; each internal node specifies one attribute for training instances and its threshold. A terminal node defines a class, depending on the conjunctive conditions of its parent nodes along the tree traversal from the root node to the leaf node. Unlike many genetic programming approaches, the current method encodes only a binary tree classification.

Thus, in the experiments, the only function set is a comparison operator for a variable and its threshold, and the terminal set consists of classes determined by decision rules. The genetic pool in the evolutionary computation handles decision trees as chromosomes. The chromosome size (tree size) is proportional to the number of leaf nodes in a decision tree, that is, the number of rules. Thus, the number of rules is considered as one objective to be optimized. The classification error rate is the second objective. While the evolutionary algorithm creates decision trees of varying size, each decision tree is tested on a given set of data for classification. We are interested in minimizing the two objectives, classification error rate and tree size, in a single evolutionary run.

Continuous-valued attributes require partitioning into a discrete set of intervals. For simple control of the VC dimension, we assume that the decision tree is a binary tree. Thus, the decision tree has a single threshold for every internal node, which partitions the continuous-valued range into two intervals. The threshold is one of the major components that form a pattern classifier.

In multi-objective optimization, the rank cannot be linearly ordered. Pareto scoring has been popular in EMO approaches, and it is applied here to maintain a diverse population over the two objectives. A vector $X = (x_1, x_2, \ldots, x_m)$ for $m$ objectives is said to dominate $Y = (y_1, y_2, \ldots, y_m)$ (written as $X \prec Y$) if and only if $X$ is partially less than $Y$, that is,

$$(\forall i \in 1, \ldots, m,\; x_i \le y_i) \wedge (\exists i \in 1, \ldots, m,\; x_i < y_i)$$

A Pareto optimal set is the set of vectors that are not dominated by any other vector:

$$\{X = (x_1, \ldots, x_m) \mid \neg(\exists Y = (y_1, \ldots, y_m),\; Y \prec X)\}$$

To obtain a Pareto optimal set, a dominating rank method [11] is applied in this work. A dominance rank is thus defined in the Pareto distribution. The highest rank is zero, for an element which has no dominator. Individuals of rank 0 in a Pareto distribution are dominated by no other members, and individuals of rank $n$ are dominated only by individuals of rank $k$ for $k < n$.

In the experiments, a population is initialized with tree chromosomes of random size. In our approach, tournament selection of group size four is used for Pareto optimization. The tournament selection initially partitions the whole population into multiple groups for the fitness comparison; the members of each group are randomly chosen from the population. Inside the tournament group, the Pareto scores of the members are compared with each other and ranked. Fig. 11.2 shows an example of the dominating rank method. Genomes with a better rank in the group have a higher probability of reproducing themselves for the next generation. For each group of four members, the two best chromosomes according to the dominating rank method are first selected and then they reproduce themselves. More than one chromosome may have tied rank scores, and in this case chromosomes are randomly selected among the multiple non-dominated individuals. A subtree crossover over a copy of the two best chromosomes, followed by a mutation operator, produces two new offspring. These new offspring replace the two worst chromosomes in the group. The crossover operator swaps subtrees of the two parent chromosomes, where the crossover point can be specified at an arbitrary branch (see Fig. 11.3).

Fig. 11.2. Dominating rank method in a tournament selection of group size four (x-axis and y-axis represent the number of rules and classification error, respectively; each number represents the dominating rank, and a small rank is a better solution since the two objectives should be minimized)

Fig. 11.3. Crossover operator (arrows: crossover point)
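The dominance test and the ranking inside one tournament group can be sketched in a few lines of Python. This is an illustrative sketch, not the chapter's implementation: it assumes that the dominating rank of a member is the number of group members that dominate it, in the style of the rank method cited as [11], and the function names and the example group are hypothetical. Each member is an objective pair (number of rules, classification error), both minimized.

```python
import random


def dominates(x, y):
    """True if x is no worse than y in every objective and strictly better in one."""
    return all(a <= b for a, b in zip(x, y)) and any(a < b for a, b in zip(x, y))


def dominating_ranks(group):
    """Assumed rank definition: number of group members dominating each member."""
    return [sum(dominates(other, member) for other in group) for member in group]


def two_best(group, rng=random):
    """Indices of the two best-ranked members; ties are broken at random."""
    ranks = dominating_ranks(group)
    order = sorted(range(len(group)), key=lambda i: (ranks[i], rng.random()))
    return order[:2]


# Hypothetical tournament group of size four: (number of rules, error rate).
group = [(2, 0.30), (3, 0.20), (3, 0.25), (5, 0.20)]
ranks = dominating_ranks(group)
parents = two_best(group)
```

In this example the first two members are non-dominated (rank 0) and would be copied, recombined, and mutated; the offspring would then replace the two members of rank 1.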

Fig. 11.4. Mutation operators (arrows: mutation point; panels: operator 1 to operator 5)

The mutation has five different operators, as shown in Fig. 11.4. The first operator deletes a subtree and creates a new random subtree. The subtree to be replaced is randomly chosen in a decision tree. The second operator first picks a random internal node and then changes the attribute or its threshold. This keeps the parent tree and modifies only one node. The third operator chooses a leaf node and then splits it into two nodes. This assists incremental evolution by adding one more decision boundary. The fourth operator selects a branch of a subtree and reduces it to a leaf node with a random class. It has the effect of removing redundant subtrees. The fifth operator sets a random attribute in a node and chooses one of the possible candidate thresholds randomly. The candidate thresholds can be obtained at boundary positions² by sorting the instances according to the selected variable (the threshold that maximizes the information gain is also located at such a boundary position [9]). The last operator has the effect of choosing desirable boundaries based on information gain, but the random selection of the thresholds avoids local optimization based only on information gain. In this work, the last mutation operator³ accelerates the convergence in classification, and the other four operators provide a variety of trees in a population. A crossover rate of 0.6 and a mutation rate of 0.2 were used.

In the initialization of the population or the recombination of trees, we have a limit for the tree size. The minimum number of leaf nodes is 2 and the maximum number of leaf nodes is set to 35 or 25. Some data sets do not need the exploration of as many as 35 nodes, because a small number of leaf nodes is sufficient. If the number of leaf nodes in a new tree exceeds the limit, a new random subtree is generated until the limit condition is satisfied.

A single run of the EMO method over the training examples leads to the Pareto optimal solutions over classification performance and the number of rules. Each non-dominated solution in the discrete space of tree sizes represents the minimized error fitness for each number of decision rules. The elitism strategy has been significantly effective for EMO methods [27]. In this work, an elitist pool is maintained, where each member is the best solution found so far for one size of decision trees. For each generation, every member of the elitist pool is reproduced.

11.2.4 Variable-node C4.5

The well-known tree induction algorithm C4.5 efficiently generates decision trees using information gain. The algorithm can produce decision trees of varying size by controlling the parameter for the minimum number of objects in the branches. When the minimum number of objects in the two branches is increased sequentially from two to one hundred or more, we can collect a set of pairs (classification performance, tree size), one for each parameter value. That is, we have a collection of pattern classifiers, each of which has a different size. Then we extract the best performance as well as the best pattern classifier for each size of trees from the database.

² We first try to find adjacent samples which generate different classification categories, and then the middle point of the adjacent samples by the selected attribute is taken as a candidate threshold.
³ There is a possibility of using only information gain splitting, but we use this method instead to allow more diverse trees in structure and avoid local optimization.
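The candidate thresholds of footnote 2 and the random choice made by the fifth mutation operator can be sketched as follows. This is a hedged illustration rather than the original code: the function names, the column-wise data layout, and the handling of ties (adjacent samples with equal attribute values yield no candidate) are assumptions.

```python
import random


def candidate_thresholds(values, labels):
    """Midpoints between adjacent sorted samples that belong to different classes."""
    pairs = sorted(zip(values, labels))
    candidates = []
    for (v0, c0), (v1, c1) in zip(pairs, pairs[1:]):
        if c0 != c1 and v0 != v1:  # class boundary between two distinct values
            candidates.append((v0 + v1) / 2)
    return candidates


def mutate_operator5(columns, labels, rng=random):
    """Pick a random attribute and one of its candidate thresholds (assumed layout:
    `columns[a]` holds the values of attribute `a` for all training instances)."""
    attribute = rng.randrange(len(columns))
    candidates = candidate_thresholds(columns[attribute], labels)
    threshold = rng.choice(candidates) if candidates else None
    return attribute, threshold


# Hypothetical data: two attributes, four instances, two classes.
columns = [[1.0, 2.0, 3.0, 4.0], [4.0, 1.0, 3.0, 2.0]]
labels = [0, 0, 1, 1]
x_candidates = candidate_thresholds(columns[0], labels)
attribute, threshold = mutate_operator5(columns, labels)
```

Restricting the random choice to class-boundary midpoints keeps the mutation cheap while still allowing the threshold with maximal information gain to be drawn.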

We call this method variable-node C4.5 to distinguish it from the conventional C4.5 method with default parameter setting. If a certain tree size is missing under the given splitting method, we calculate the average classification performance for the available tree sizes. It is assumed that the average performance over a given tree size by variable-node C4.5 represents the estimation over all the best C4.5 decision trees obtained with that tree size. With variable-node C4.5, the best pattern classifiers are obtained by heuristic measurement, while the EMO tries to find the best classifier for a given tree size using an evolutionary search mechanism. When the cross validation process is applied, we estimate the average performance over the training set and the test set with many trials.

11.3 Experiments

11.3.1 EMO with Artificial Data

The EMO method was first tested on a set of artificial data with some noise, as shown in Fig. 11.5. Here we show an application of the EMO method to minimize the training error, not the generalization error, to show how it works; the experiments minimizing the generalization error are provided in Sect. 11.3.2. The data set contains 150 samples with two classes. When the C4.5 tree induction program was applied, it generated 3 rules, as shown in Fig. 11.5(a). It produced a 29.3 % training error rate (44 errors). Evolving decision trees for 1000 generations with a population size of 200 by the EMO approach produced Pareto trees. Fig. 11.6 shows an example of the best tree chromosomes. With only two rules allowed, 43 examples were misclassified (28.7 % error rate), as shown in Fig. 11.5(f), and this was better than the C4.5 method. As the number of rules increases, decision boundaries are added and the training error improves. For reference, a neural network (seven nodes in a hidden layer) trained with the back-propagation algorithm achieves a 14.7 % error rate (22 misclassified examples); however, its performance could be improved with a better parameter setting. We note that in Fig. 11.5(f), six rules were sufficient to obtain the same performance as the neural network with seven hidden nodes.

In many cases, the best boundaries evolved for a small number of rules also belong to the classification boundaries for a large number of rules. Moreover, new boundaries can provide better solutions in some cases. Thus, incremental evolution from small to large structures of decision trees is carried out with the EMO approach. Large trees tend to be evolved sequentially from the base of small trees, that is, from a small number of rules. A small number of rules is relatively easily evolved, but a large number of rules needs a long time to find the best trees, since more rules have more parameters to be evolved. More rules improve the classification performance on the training data by adding decision boundaries, but some rules support only a small number of samples.

Fig. 11.5. Artificial data and EMO (a) data set and C4.5 decision boundaries (◦: class 0, ×: class 1) (b) 3 rules from EMO (c) 4 rules from EMO (d) 6 rules from EMO (e) 9 rules from EMO (f) an example of EMO result (panels (a)-(e): Y versus X; panel (f): training error rate (%) versus number of rules)
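A curve such as the one in panel (f) is what an elitist pool with one slot per tree size yields. The Python sketch below shows one way such a pool could be kept and how the non-dominated (number of rules, error) pairs could be read from it. It is a minimal sketch under assumptions: the tuple layout, the function names, and the string placeholders standing in for trees are hypothetical, and the numbers are invented for illustration.

```python
def update_elitist_pool(pool, population):
    """Keep, for every number of rules, the individual with the lowest error.

    `pool` maps number of rules -> (error, tree); `population` is an iterable of
    (number of rules, error, tree) triples.
    """
    for n_rules, error, tree in population:
        best = pool.get(n_rules)
        if best is None or error < best[0]:
            pool[n_rules] = (error, tree)
    return pool


def non_dominated(pool):
    """Sizes whose error is strictly lower than that of every smaller tree."""
    front, best_error = [], float("inf")
    for n_rules in sorted(pool):
        error = pool[n_rules][0]
        if error < best_error:
            front.append((n_rules, error))
            best_error = error
    return front


# Hypothetical individuals: (number of rules, training error, tree placeholder).
population = [(2, 0.29, "t2"), (3, 0.25, "t3a"), (3, 0.22, "t3b"),
              (4, 0.22, "t4"), (6, 0.15, "t6")]
pool = update_elitist_pool({}, population)
front = non_dominated(pool)
```

Here the four-rule tree stays in the pool as the best of its size, but it is not on the Pareto front because a three-rule tree reaches the same error.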

775 x < 0. 11. In that case.241 y n y n y n y < 0. 11.252 D.595 x < 0. wine. 95% confidence intervals of fitness (test error rate) are measured by assuming t-distribution.775 class 0 n y n y class 1 class 0 class 1 x < 0.5 by default parameter setting. bupa) in the UCI repository [1] and the artificial data in Fig. Fig. For each size of decision trees. The suggested EMO approach takes a population size of 500 and 1000 generations with tournament selection of group size four for each experiment. and we used variable-node C4. These data sets are mostly for machine learning experiments. wpbc. pima.657 y n y n y n y n class 0 class 1 y < 0. Kim x < 0.81 class 0 class 0 class 1 y < 0. the data sample is removed. An example of best chromosomes by EMO of samples. Evolutionary computation was able to attain a hierarchy of structure for classification performance.245 class 0 x < 0. For the C4.595 y < 0.5 run.7(a)-(b) shows that the artificial data have four decision rules as the best structure of decision trees and that the ionosphere data have six 4 Some sets of data include missing attribute values. This is the motivation of our work using structural risk minimization.2 Machine Learning Data For the general case.5 and the EMO approach as well as C4. glass.5. ecoli.595 x < 0. 11. both number of rules and error rate will be examined with t statistic.775 x < 0.241 y n y n x < 0.6. the suggested evolutionary approach has been tested on several sets of data4 (iris. Classification error rates are estimated by running the complete 10-fold cross-validation ten times.241 y < 0.595 y n y n y n class 1 class 0 class 0 class 1 class 0 class 1 class 1 class 0 2 rules 3 rules 4 rules y < 0.E.435 y n class 1 class 0 5 rules 6 rules Fig. The rules may cause over-specialization problem to degrade the generalization performance. There exists the best number of rules to show the minimum generalization error as expected from the structural risk minimiza- tion. ionosphere. 11.3. .775 x < 0.

5 45 C4.5 with default parameters.5 V 45 C4. 11.5 C4.5 C4.5 V EMO 1 EMO 1 40 EMO 2 40 EMO 2 35 35 Error rate (%) Error rate (%) 30 30 25 25 20 20 15 15 10 10 5 5 0 0 5 10 15 20 25 5 10 15 20 25 Number of rules Number of rules (c) (d) 40 50 C4. ◦: EMO result with 1000 generations) (a) artificial data (b) ionosphere data (c) iris data (d) wine data (e) ecoli data (f) bupa data .5 V EMO 1 45 EMO 1 35 EMO 2 EMO 2 30 40 Error rate (%) Error rate (%) 25 35 30 20 25 15 20 5 10 15 20 25 30 35 5 10 15 20 25 30 35 Number of rules Number of rules (e) (f) Fig.5 C4.5 V EMO 1 EMO 1 35 EMO 2 20 EMO 2 Error rate (%) Error rate (%) 30 15 25 10 20 5 15 0 5 10 15 20 25 5 10 15 20 25 30 35 Number of rules Number of rules (a) (b) 50 50 C4.5 with varying number of nodes. : EMO result with 100 generations. *: C4.5 C4.5 C4. Generalization performance with varying number of rules 1 (arrow: C4.5 V C4. 11 Minimizing Structural Risk on Decision Tree Classification 253 40 25 C4.5 V C4.7.

This test validation process can easily determine a desirable number of rules. If the tree size is larger than the best tree size, then the generalization performance degrades or shows no improvement. More generations tend to show a better curve for the best structure of trees. The EMO even with 100 generations is better than the C4.5 induction tree in the test error rates for these two sets of data, the artificial and ionosphere data.

Fig. 11.8. Comparison between C4.5 and EMO method (a) error rate in test data with C4.5 and EMO (EMO 1 and EMO 2 represent the EMO running with 100 generations and 1000 generations, respectively) (b) the number of rules with C4.5 and EMO (the number of rules for EMO is determined by selecting the minimum error rate) (data sets on the horizontal axis: iris, wine, ion, ecoli, pima, wpbc, glass, bupa, artificial)

Table 11.1. Data classification errors in C4.5 and variable-node C4.5 (columns: data, pattern, attr.; C4.5 error (%) and # rules; variable-node C4.5 error (%) and # rules). Data sets (patterns, attributes): artificial (150, 2), ionosphere (351, 34), iris (150, 4), wine (178, 13), ecoli (336, 7), pima (768, 8), wpbc (194, 32), glass (214, 9), bupa (345, 6)

Table 11.2. Data classification errors in the EMO method (columns: data; EMO (100 gen.) error (%) and # rules; EMO (1000 gen.) error (%) and # rules). Data sets: artificial, ionosphere, iris, wine, ecoli, pima, wpbc, glass, bupa

Fig. 11.7 shows that variable-node C4.5 is mostly worse in performance than the EMO method for each number of rules. Variable-node C4.5 does not produce a V-shaped curve (as in Fig. 11.1) for generalization performance against the VC dimension (tree size), but instead an irregular type of performance curve. Its performance is significantly worse than the EMO performance in most cases.

We collected the best model complexity and the corresponding performance into Tables 11.1 and 11.2. For reference, we also show the performance of C4.5 with default parameters. Variable-node C4.5 often finds better performance than the C4.5 result for most of the data sets: among the collection of trees of varying sizes, there exists some tree that performs better than C4.5 with default parameter setting. If we choose the best structure to minimize structural risk, the EMO method outperforms variable-node C4.5 in generalization performance for all cases except the pima and wpbc data.

Fig. 11.9. Generalization performance with varying number of rules 2 (arrow: C4.5 with default parameter, ◦: EMO result with 1000 generations) (a) wpbc data (b) pima data (each panel: error rate (%) versus number of rules)

In our experiments, the EMO with 100 generations outperforms C4.5 in the test error rate for all the data except wine and glass. The EMO with 1000 generations improves both the error rate and the number of rules, and it is better than C4.5 in the test error rate for all the data. Tables 11.1 and 11.2 and Fig. 11.8 show that the EMO is significantly better than C4.5 in error rate at the 95 % confidence level for many of the experimental data sets. The EMO method has a slightly higher number of rules for the wine and artificial data, but this is due to the fact that the EMO finds an integer number of rules in a discrete space. The other experiments show that the best number of rules in decision trees found by the EMO is significantly smaller than the number of rules of the C4.5 induction trees. This confirms that some rules from C4.5 are redundant, and thus C4.5 may suffer from an over-specialization problem.

An interesting result is obtained for the wpbc and pima data (see Fig. 11.9). These two sets of data have a bad prediction performance, regardless of the number of rules. Investing longer training time on the wpbc and pima data does not improve validation performance. The performance of C4.5, whose number of rules is determined by information gain, is worse than or similar to that of two or three evolved rules. It is presumed that more consistent data are required for these two sets of data.

11.4 Discussion

We evaluated generalization performance under various model complexities, with the pattern classifiers found by the EMO method. An algorithm of minimizing structural risk was applied to the Pareto sets of (performance, structure). The generalization performance under a variety of structure complexities can determine the best structure, or best tree size.

For the comparison with C4.5, we tried to provide a variety of model complexities using variable-node C4.5, that is, by controlling the parameter for the minimum number of objects in the branches. The performance of C4.5 with a varying number of nodes is shown in Fig. 11.7; it is mostly worse than our method in classification performance. In some cases variable-node C4.5 finds a small number of rules, but the rule set has much worse classification performance than the best rule set found by the EMO. Generally it is hard to generate a consecutive number of leaf nodes or a large tree size with C4.5. With the parameter control, a specific number of decision rules can be missing. As an alternative, other tree induction algorithms that grow trees by one node can be tested and compared with the EMO method. It would be a better comparison with the suggested approach if a more sophisticated tree induction method with pruning or early stopping of growth could be tested. We leave the comparison between the suggested approach and other methods under the same model complexities to future work.

In our approach, the evolved decision trees have axis-parallel decision boundaries. The pattern classification may be limited in fitting nonlinear sample patterns, although we can find the best number of classifiers. Applying our EMO approach to linear regression trees or neural network classifiers would produce better classification performance.

We showed the performance of the EMO with different numbers of generations, 100 and 1000. In some cases, 100 generations are sufficient to find the best model structure. Yet more generations often produce the result that the best model structure shifts to a smaller tree size. It is believed that better training performance of the EMO can find a better model structure to minimize structural risk, but it is still an open question how much training is needed for a desirable generalization performance.

If one of the objectives in EMO is discrete, elitism can easily be applied to evolutionary computation by keeping a pool of the best solutions for each discrete genotype. In the EMO approach, all members of the elitist pool, where each member corresponds to one tree size, were reproduced every generation. When we tested a varying number of members in the elitist pool for reproduction in a new population, we found that reproducing more members of the elite pool can significantly improve training performance. If chromosomes are linearly ranked by error rate alone instead of by Pareto dominating rank, the method fails to produce uniform Pareto-optimal solutions, since the evolutionary run sticks to a one-objective solution.

Given any arbitrary data set for future prediction, we can apply the cross-validation process to find the best pattern classifier minimizing structural risk. The data set is divided randomly into a training set and a test set, and then the suggested EMO is applied to the training data. This procedure can be repeated over many trials to obtain the average generalization error. The above method was applied to each training set in the 10-fold cross validation, and the best structure followed the result in Table 11.2 in most cases.
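The structure selection step described above can be sketched as follows: after the EMO runs have produced, for every number of rules, a list of test error rates over folds or trials, the size with the lowest average error is chosen. This is a hedged sketch rather than the chapter's procedure in full. The function name and the example error rates are hypothetical, and breaking ties in favour of the smaller tree is an assumption motivated by the preference for low model complexity.

```python
def best_structure(cv_errors):
    """Pick the number of rules with the lowest mean cross-validated test error.

    `cv_errors` maps number of rules -> list of test error rates (%) collected
    over folds or repeated trials. Ties go to the smaller tree (assumption).
    """
    means = {k: sum(errors) / len(errors) for k, errors in cv_errors.items() if errors}
    best_k = min(means, key=lambda k: (means[k], k))
    return best_k, means[best_k]


# Hypothetical test error rates (%) per number of rules.
cv_errors = {
    2: [30.0, 32.0],
    4: [22.0, 21.0],
    6: [22.0, 21.0],
    9: [25.0, 27.0],
}
n_rules, error = best_structure(cv_errors)
```

With these numbers the four-rule structure is selected: it ties with six rules on average error and is the simpler of the two.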

Evolutionary computation requires much more computing time than C4.5 or variable-node C4.5. For example, a single EMO run with a population size of 500 and 100 generations over the pima data takes about 22 seconds, while a single run of the C4.5 application takes only 0.1 second (Pentium computer). A single EMO run for each of the other data sets (iris, wine, ionosphere, ecoli, wpbc, glass, and bupa) took between roughly 3 and 28 seconds. Generally the EMO needs much more computing time to find the best performance for every tree size, but it can improve the classification performance significantly on most of the data sets.

Many researchers have used pima and wpbc in their experiments, but the distribution of error rates over the tree size implies that good prediction cannot be expected for these data.

11.5 Conclusions

In this chapter, we introduced an evolutionary multiobjective optimization with two objectives, classification performance and tree size (number of rules), for decision tree classification. The proposed EMO approach searches for the best classification accuracy for each tree size. By structural risk minimization, we can find a desirable number of rules for a given error bound. The performance of the best rule set is better than that of C4.5, although it takes more computing time. In particular, it can dramatically reduce the model complexity, that is, the tree size. It can also help to evaluate how difficult it is to classify a given set of data examples. In addition, we can indirectly determine whether a given set of data requires more consistent data or whether it includes many noisy samples.

The decision tree evolved in the present work has the simple form of a binary tree. The EMO approach can be extended to more complex trees, such as trees with multiple thresholds or linear regression trees. The result can also be compared with that obtained from neural networks. For future study, the suggested method can be compared with the bagging method [5], which is one of the promising methods to obtain good accuracy rates. The bagging process may also be applied to the best rules obtained from the proposed method.

References

[1] C. Blake, E. Keogh, and C.J. Merz. UCI repository of machine learning databases. In Proc. of the Fifth Int. Conf. on Machine Learning, 1998.
[2] S. Bleuler, M. Brack, L. Thiele, and E. Zitzler. Multiobjective genetic programming: Reducing bloat using SPEA2. In Congress on Evolutionary Computation, pages 536–543. IEEE Press, 27-30 May 2001.
[3] M.C.J. Bot. Improving induction of linear classification trees with genetic programming. In Genetic and Evolutionary Computation Conference, pages 403–410. Morgan Kaufmann, 2000.

[4] M.C.J. Bot and W.B. Langdon. Application of genetic programming to induction of linear classification trees. In Proceedings of the 3rd European Conference on Genetic Programming, pages 247–258, 2000.
[5] L. Breiman. Bagging predictors. Machine Learning, 24(2):123–140, 1996.
[6] L. Breiman, J. Friedman, R. Olshen, and C. Stone. Classification and Regression Trees. Wadsworth International Group, 1984.
[7] C.A. Coello Coello, D.A. van Veldhuizen, and G.B. Lamont. Evolutionary Algorithms for Solving Multi-Objective Problems. Kluwer Academic, 2002.
[8] E.D. de Jong and J.B. Pollack. Multi-objective methods for tree size control. Genetic Programming and Evolvable Machines, 4(3):211–233, 2003.
[9] U.M. Fayyad. On the induction of decision trees for multiple concept learning. Ph.D. dissertation, EECS department, University of Michigan, Ann Arbor, 1991.
[10] U.M. Fayyad and K.B. Irani. Multi-interval discretization of continuous-valued attributes for classification learning. In Proceedings of IJCAI'93, pages 1022–1027. Morgan Kaufmann, 1993.
[11] C.M. Fonseca and P.J. Fleming. Genetic algorithms for multiobjective optimization: Formulation, discussion and generalization. In Proc. of the Fifth Int. Conf. on Genetic Algorithms, pages 416–423. Morgan Kaufmann, 1993.
[12] S. Geman, E. Bienenstock, and R. Doursat. Neural networks and the bias/variance dilemma. Neural Computation, pages 1–58, 1992.
[13] P. Geurts. Contributions to decision tree induction: bias/variance tradeoff and time series classification. Ph.D. thesis, University of Liege, Belgium, 2002.
[14] P. Geurts and L. Wehenkel. Investigation and reduction of discretization variance in decision tree induction. In European Conference on Machine Learning, LNAI 1810, pages 162–170. Springer-Verlag, 2000.
[15] A.A. Freitas, G.L. Pappa, and C.A.A. Kaestner. Attribute selection with a multiobjective genetic algorithm. In Proc. of the 16th Brazilian Symposium on Artificial Intelligence, pages 280–290. Springer-Verlag, 2002.
[16] K.B. Irani and V. Khaminsani. Knowledge based automation of semiconductor manufacturing. In SRC Project Annual Review Report, The University of Michigan, Ann Arbor, 1991.
[17] D. Kim. Evolving internal memory for T-maze tasks in noisy environments. Connection Science, 16(3):183–210, 2004.
[18] D. Kim. Structural risk minimization on decision trees using an evolutionary multiobjective optimization. In Proc. of European Conf. on Genetic Programming, LNCS 3003, pages 338–348. Springer-Verlag, 2004.
[19] D. Kim and J. Hallam. An evolutionary approach to quantify internal states needed for the woods problem. In From Animals to Animats 7, pages 312–322. MIT Press, 2002.
[20] W.B. Langdon. Data structures and genetic programming. In P.J. Angeline and K.E. Kinnear, editors, Advances in Genetic Programming 2, pages 395–414. MIT Press, Cambridge, MA, 1996.
[21] J. Mingers. An empirical comparison of selection measures for decision-tree induction. Machine Learning, 4(2):227–243, 1989.
[22] T.M. Mitchell. Machine Learning. McGraw Hill, 1997.
[23] J.R. Quinlan. Induction of decision trees. Machine Learning, 1(1):81–106, 1986.
[24] J.R. Quinlan. Improved use of continuous attributes in C4.5. Journal of Artificial Intelligence Research, 4:77–90, 1996.
[25] J.R. Quinlan and R.L. Rivest. Inferring decision trees using the minimum description length principle. Information and Computation, 80(3):227–248, 1989.
[26] V.N. Vapnik. The nature of statistical learning theory. Springer Verlag, 1995.
[27] E. Zitzler. Evolutionary Algorithms for Multiobjective Optimization: Methods and Applications. Ph.D. thesis, Swiss Federal Institute of Technology, 1999.

Part III Multi-Objective Learning for Interpretability Improvement

12 Multi-objective Learning Classifier Systems

Ester Bernadó-Mansilla1, Xavier Llorà2, and Ivan Traus3

1 Department of Computer Engineering, Enginyeria i Arquitectura La Salle, Universitat Ramon Llull, Quatre Camins, 2. 08022 Barcelona, Spain
  esterb@salleurl.edu
2 Illinois Genetic Algorithms Lab, University of Illinois at Urbana-Champaign, 104 S. Mathews Ave, Urbana, IL 61801, USA
  xllora@illigal.ge.uiuc.edu
3 Conducive Corp., 55 Broad Street, 3rd Floor, New York, NY 10004, USA
  itraus@conducivecorp.com

E. Bernadó-Mansilla et al.: Multi-objective Learning Classifier Systems, Studies in Computational Intelligence (SCI) 16, 261–288 (2006)
www.springerlink.com © Springer-Verlag Berlin Heidelberg 2006

Summary. Learning concept descriptions from data is a complex multiobjective task. The model induced by the learner should be accurate, so that it can represent precisely the data instances; complete, which means it can be generalized to new instances; and minimum, or easily readable. Learning Classifier Systems (LCSs) are a family of learners whose primary search mechanism is a genetic algorithm. Throughout the intense history of the field, the efforts of the community have been centered on the design of LCSs that solve these goals efficiently, resulting in the proposal of multiple systems. This paper reviews the main LCS approaches and focuses on the analysis of the different mechanisms designed to fulfill the learning goals. Some of these mechanisms include implicit multiobjective learning mechanisms, while others use explicit multiobjective evolutionary algorithms. The paper analyses the advantages of using multiobjective evolutionary algorithms, especially in Pittsburgh LCSs, such as controlling the so-called bloat effect, and offering the human expert a set of concept description alternatives.

12.1 A Multiobjective Motivation

Classification is a central task in data mining and machine learning applications. It consists of inducing a model that describes the target concept represented in a dataset. The dataset is formed by a set of instances, where each instance is described by a set of features and an associated class. The model describes the relation between the features and the available classes, hence it can be used to explain the hidden structure of the dataset and to classify newly collected instances whose associated class is unknown.

Classification may be regarded as an inherently multiobjective task. Such a task requires inducing a knowledge representation that represents the target concept completely and consistently. In many domains, the induced model should also be easily interpretable, which often means a compact representation, and take few computational resources, especially in the exploitation stage but also in the training stage. All these objectives are usually opposed and classification schemes try to balance them heuristically.

Multiobjective learning is also present in learning classifier systems (LCSs). LCSs evolve a model of the target concept by the use of genetic algorithms (GAs). Particularly, genetic algorithms search for a hypothesis or a set of hypotheses representing the target concept, seeking to optimize the multiobjective learning goals. The hypotheses can be represented as rule sets, prototype sets, or decision trees [42, 23, 43, 45]. In many cases [63, 4], the multiobjective goals are dealt with implicitly by the interaction of several components. In other cases [10, 13], the multiobjective goals are directly codified in the fitness function. The latter case benefits from the fact that the search is performed by a genetic algorithm and uses experience gained from the field of multiobjective evolutionary algorithms to get a hypothesis optimizing the learning goals simultaneously. One of the advantages associated with this approach is the improvement of the search in the space of possible hypotheses, thus minimizing the probability of falling into local minima. Besides, the solution obtained is not only a single hypothesis, but a set of compromise solutions representing different hypotheses along the Pareto front. This offers the human expert the possibility of choosing among a set of alternatives. Moreover, this allows exploring the possibilities of combining different hypotheses, which opens the area up to the field of ensemble classifiers [17].

Two main approaches are defined under the LCS framework. The so-called Michigan approach codifies a population of rules, where each individual of the population represents a single rule, i.e., a partial solution. The whole population is needed to represent the whole target concept. The Pittsburgh approach codifies a complete knowledge representation (either a ruleset, instance set, or induction tree) in each individual. Thus, the best individual evolved is the solution to the classification problem. In both cases, genetic algorithms are used as search algorithms.

This paper reviews the main LCS approaches through the prism of multiobjective learning. First, we give a short history of LCSs, summarizing early approaches which were significant and inspired today's competent LCSs. Then, we focus our attention on the most successful systems of both the Michigan and Pittsburgh approaches. We focus on these types of systems separately, since they have different types of difficulties in achieving the learning goals, due to their different architectures. In Michigan LCSs, learning is centered on the evolution of a minimum set of rules where each rule should be accurate and maximally general. We review how the XCS classifier system [63] achieves these goals implicitly, and compare it with MOLeCS [10], a multiobjective learning classifier. In Pittsburgh LCSs, learning implies a search through a space of rulesets, evolving an accurate, general, and minimum ruleset. The use of variable-size individuals causes bloat [60], an unnecessary growth of individuals without fitness improvement. Evolutionary multiobjective algorithms are useful to limit this effect while achieving the learning goals. Finally, we discuss open issues on multiobjective learning and directions for further research.

12.2 Learning Classifier Systems: A Short Overview

12.2.1 Cognition and Environment Interaction

The origins of the learning classifier systems field can be placed within the work by Holland on adaptive systems theory and his early proposals on schemata processors [32]. This laid the basis for the first practical implementation of a classifier system, which was called CS-1 (Cognitive System Level One) [34]. CS-1 was devised as a cognitive system capable of capturing the events from the environment and reacting to them appropriately. Its architecture was composed of a set of receptors collecting messages into a message list. These messages were matched against a set of rules to decide which actions should be sent to the effectors. When reward was received from the environment, an epochal algorithm distributed payoff into the list of active rules that triggered that action. A genetic algorithm was run to discover new rules by means of crossover and mutation.

The system inspired many approaches which were named under the framework of the Michigan approach. The early Michigan approach can be summarized as having the following properties. The system learns incrementally from the interaction with the environment. It usually starts from a random population of rules which is evaluated with a reinforcement learning scheme by means of a credit assignment algorithm. The bucket brigade algorithm [34] was a classical credit assignment algorithm, where the quality (strength) of rules is assigned according to the payoff prediction, i.e., the payoff that the rule would receive from the environment if its action was selected. The GA task is to discover new promising rules, while ensuring the co-evolution of a diverse set of rules. The GA uses the strength calculated by the credit assignment algorithm as the fitness of each rule, thus biasing the search towards highly rewarded rules. The maintenance of different rules along the evolution is addressed by a non-generational scheme and the use of niching techniques [29].

The major problems that arise with traditional Michigan LCSs are the achievement of accurate generalizations and the co-evolution and maintenance of different niches in the population. There is a delicate balance between the competition of individuals, which compete to occupy their place in the population, and the co-operation needed to jointly achieve a set of rules. This balance is difficult to maintain for Michigan LCSs. The Boole classifier system [62] tried to balance this equilibrium by the use of fitness sharing and a GA applied to the niches defined by the action sets (sets of rules with similar conditions and the same action). COGIN [31] avoided competition among rules by restricting new classifiers to cover instances not covered by other classifiers of the population, which may encourage the formation of minimal rulesets.

Despite the known difficulties, Michigan approaches succeeded in many applications. NEWBOOLE [16], which inherits from Boole, was tested in medical datasets and its performance was found competitive with respect to CN2 [22] and backpropagation neural networks [57]. Additionally, NEWBOOLE obtained a compact and interpretable ruleset. Further work on NEWBOOLE led to EpiCS [35], which was adapted to the requirements of epidemiologic databases with mechanisms for risk assessment and unequal distribution of classes, among others.

12.2.2 Genetic Algorithm based Concept Learners

Coexisting with the early developments of Michigan LCSs, a parallel line of LCSs, named the Pittsburgh approach, was also under research. It emerged with the LS-1 classifier system [58], which inspired a main classifier scheme called GABL [23]. The Pittsburgh approach can be characterized as a "classical" genetic algorithm, whose individuals codify a set of rules (in fact, later approaches also introduced other types of codifications). Each individual represents a whole ruleset, thus, a complete solution to the learning problem. The main learning cycle usually operates as in a generational genetic algorithm. The fitness of each individual is evaluated as the performance of its ruleset as a whole, usually under a supervised learning scheme. Fitness usually considers the classification accuracy of the ruleset against the training set of samples.

In Pittsburgh LCSs each individual codifies a complete ruleset, so there is no need for cooperation among individuals as in the Michigan approach. This way, the operation of the GA is simpler: the GA converges to a single solution and niching algorithms are not needed. However, since Pittsburgh LCSs search in the space of possible rulesets, the search space is large and the search usually takes more computational resources than in the Michigan approach. Another difficulty is that little control can be applied at the rule level, e.g., a rule's generalization can hardly be tuned.

The first approaches such as GABL considered only classification accuracy in each individual's fitness. Individuals were codified as fixed length rulesets, where each rule was codified as a binary string. Later versions introduced incremental learning (GABIL) and the use of two specific operators, adding alternatives and dropping condition, to bias the search towards general rules and minimal rulesets respectively. The two additional operators of GABIL were designed with this purpose. GIL [36] was another proposal that included a large set of operators acting at different levels, whose purpose was also gaining control over the type of rules evolved, but at the expense of an increased parameterization. Results are also competitive when compared with other classifier schemes.

The use of variable sized individuals allowed increased flexibility in Pittsburgh LCSs, but caused excessive growth of individuals, with the inclusion of useless rules inside the individuals or the evolution of overspecific rules. This led to suboptimal rulesets that could hardly represent the target concept. Some solutions added parsimony pressures in the fitness function, as in [27], while others addressed these issues from a multiobjective problem perspective [45]. Some recent developments are GALE [39] and GAssist [4]. The latter will be analyzed as an example of an implicit multiobjective learner. We will also study in detail some multiobjective evolutionary learners, such as MOLS-GA and MOLS-ES [45].

12.2.3 Hybrid Approaches

Due to the known different difficulties of the Michigan and the Pittsburgh approaches, several hybrid systems were proposed, such as REGAL and GA-MINER.

REGAL [28] uses a distributed architecture, composed of several local GAs evolving separately a single rule covering a partial set of training examples, and a global GA that forms the complete ruleset. The key point of the system relies on the distribution of examples and the combination of the partial solutions.

GA-MINER [25] is a system designed for pattern discovery in large databases. Each individual in the population has a single rule, but its evaluation is done independently from other rules in the population, so it does not need any credit assignment algorithm. The formation of rulesets is performed using an incremental update strategy based on heuristics. The main goal of GA-MINER is to find some interesting patterns in the dataset, instead of fully covering the database as expected in a classification task. Under this framework, the evolution of a complete ruleset is not necessary and, in fact, this is not guaranteed by the system.

12.2.4 Accuracy-based Approaches

XCS [63, 64] represents a major development of Michigan style LCSs. Previous Michigan LCSs had identified difficulties in achieving accurate generalizations and maintaining a balanced co-evolution of different niches in the population. XCS differs from traditional LCSs in the definition of rule fitness, which is based on the accuracy of the payoff prediction rather than on the prediction itself, and in the use of a niche GA. These aspects have resulted in a strong tendency to evolve accurate and maximally general rules, favoring the achievement of knowledge representations that, besides being complete and accurate, tend to be minimal [37].

Other accuracy-based approaches have been studied recently [11]. Particularly, UCS (sUpervised Classifier System) shares many features with XCS, such as the generalization mechanisms. The main difference is that UCS is designed specifically for supervised learning problems. The experiments showed that UCS is more suitable to classification problems with large spaces, especially those with a high number of classes or with a highly unequal distribution of examples. We will study XCS in more detail and compare it with MOLeCS, a multiobjective Michigan LCS.

12.3 Multiobjective Optimization

Prior to the study of the different multiobjective mechanisms of the Michigan and Pittsburgh LCSs, we briefly describe the notation we will use to refer to multiobjective issues.

In a multiobjective optimization problem (MOP) [61] a solution $x \in \Omega$ is represented as a vector of $n$ decision variables $x = (x_1, \ldots, x_n)$, where $\Omega$ is the decision variable space. We want to optimize $k$ objectives which are defined as $f_i(x)$, with $i = 1, \ldots, k$. These objectives are grouped in a vector function denoted as $F(x) = (f_1(x), \ldots, f_k(x))$, where $F(x) \in \Lambda$. $F$ is a function which maps points from the decision variable space $\Omega$ to the objective function space $\Lambda$:

\[ F : \Omega \longrightarrow \Lambda, \qquad x \longmapsto y = F(x) \tag{12.1} \]

Without loss of generality, we can define a MOP as the problem of minimizing a set of objectives $F(x) = (f_1(x), \ldots, f_k(x))$, subject to some constraints $g_i(x) \le 0$, $i = 1, \ldots, m$. These constraints are necessary for problems where there are invalid solutions in $\Omega$.

A solution that minimizes all the objectives and satisfies all constraints may not exist. Sometimes, the minimization of a certain objective implies a degradation in another objective. Then, there is not a global optimum that minimizes all the objectives simultaneously. In this context, the concept of optimality must be redefined. Vilfredo Pareto [53] introduced the concept of dominance and Pareto optimum to deal with this issue. In general terms, a vector $u$ dominates another vector $v$, written as $u \prec v$, if and only if every component $u_i$ is less than or equal to $v_i$, and there is at least one component in $u$ which is strictly less than the corresponding component in $v$. This can be formulated as follows:

\[ u \prec v \iff \forall i \in \{1, \ldots, k\},\ u_i \le v_i \ \wedge\ \exists i \in \{1, \ldots, k\} : u_i < v_i \tag{12.2} \]

The concept of Pareto optimality is based on the dominance definition. Thus, a solution $x \in \Omega$ is Pareto optimal if there is not any other solution $x' \in \Omega$ whose objective vector $u' = F(x')$ dominates $u = F(x)$. In other words, a solution whose objectives cannot be improved simultaneously by any other solution is Pareto optimum. The set of all solutions whose objective vectors are not dominated by any other objective vector is called the Pareto optimal set $P^*$:

\[ P^* := \{ x_1 \mid \nexists\, x_2 : F(x_2) \prec F(x_1) \} \tag{12.3} \]

Analogously, the set of all vectors $u = F(x)$ such that $x$ belongs to the Pareto optimal set is called the Pareto front $PF^*$:

\[ PF^* := \{ u = F(x) = (f_1(x), \ldots, f_k(x)) \mid x \in P^* \} \tag{12.4} \]

12.4 Multiobjective Learning in Michigan Style LCSs

Michigan style LCSs codify each individual as having a single rule. Thus, learning in Michigan LCSs implies codifying the learning goals at the rule level. The approach is to maximize accuracy and generality in each rule, which consequently could combine into a consistent, complete and minimal ruleset. This section reviews two particular Michigan LCSs, representative of two different approaches. The first one, XCS, uses an accuracy-based fitness, while generalization is achieved mainly by a niche GA applied on a frequency basis. The latter, MOLeCS, defines a multiobjective fitness which directly guides the search process to optimize these goals.

12.4.1 XCS: Implicit Multiobjective Learning

Description of XCS

XCS [63, 64] represents the target concept in a set of rules. This ruleset is incrementally evaluated by means of interacting with the environment, through a reinforcement learning scheme, and is improved by a search mechanism based on a genetic algorithm. Each rule's quality is estimated by a set of parameters. The main parameters are: a) prediction, an estimate of the reward that the rule would receive from the environment, b) accuracy, the accuracy of the prediction, and c) fitness, which is based only on accuracy.

The basic training cycle performs as follows. At each time step, an input $x$ is presented to the system. Given $x$, the system builds a match set [M], which is formed by all the classifiers in the population whose conditions are satisfied by the input example. If no classifiers match, then the covering operator is triggered. It creates new classifiers matching the current input. During training, XCS explores the consequences of all classes for each input. Therefore, given a match set [M], XCS randomly selects one of the classes proposed in [M] and sends it to the environment. All classifiers proposing that class are classified as belonging to the action set [A]. The environment returns a reward which is used to update the parameters of the classifiers in [A], and then the GA may be triggered. If we run XCS in training, the GA is triggered when the average time since the last occurrence of the GA in the action set exceeds a threshold $\theta_{GA}$. In test mode, XCS proposes the best class from those advocated in [M], and the update and search mechanisms are disabled.
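
The training cycle described above can be made concrete with a small sketch. It is only an illustration under several assumptions that the text does not state: conditions are taken to be ternary strings over {0, 1, #}, covering generalizes each position with a hypothetical probability `p_hash`, and the parameter update is a simple Widrow-Hoff style running average with a hypothetical learning rate `beta`. The names `Classifier`, `matches`, `cover` and `training_step` are illustrative, and the accuracy and fitness updates, as well as the GA itself, are left out.

```python
import random
from dataclasses import dataclass


@dataclass
class Classifier:
    condition: str           # assumed ternary encoding: '0', '1', or '#' (wildcard)
    action: int              # class advocated by the rule
    prediction: float = 0.0  # running estimate of the reward
    error: float = 0.0       # running estimate of the prediction error


def matches(condition: str, x: str) -> bool:
    """A condition is satisfied when every non-wildcard position equals the input."""
    return all(c == '#' or c == xi for c, xi in zip(condition, x))


def cover(x: str, action: int, p_hash: float, rng: random.Random) -> Classifier:
    """Covering: create a classifier that matches x (p_hash is a hypothetical parameter)."""
    condition = ''.join('#' if rng.random() < p_hash else xi for xi in x)
    return Classifier(condition, action)


def training_step(population, x, classes, reward_fn, rng, beta=0.2, p_hash=0.33):
    """One exploration step: build [M], pick a class at random, build [A], update [A]."""
    match_set = [cl for cl in population if matches(cl.condition, x)]
    if not match_set:                      # no classifier matches: trigger covering
        new_cl = cover(x, rng.choice(classes), p_hash, rng)
        population.append(new_cl)
        match_set = [new_cl]
    # During training a class proposed in [M] is selected at random.
    action = rng.choice(sorted({cl.action for cl in match_set}))
    action_set = [cl for cl in match_set if cl.action == action]
    reward = reward_fn(x, action)          # feedback from the environment
    for cl in action_set:                  # assumed running-average updates
        cl.error += beta * (abs(reward - cl.prediction) - cl.error)
        cl.prediction += beta * (reward - cl.prediction)
    return action, action_set
```

In this sketch the environment is just a callable `reward_fn(x, action)`, for instance one returning a maximum reward for the correct class and zero otherwise; a full implementation would also derive accuracy and fitness from the error and invoke the niche GA on the action set.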

The task of the reinforcement component is to evaluate the classifier's parameters from the reward received from the environment. For each example, it computes the prediction error of the classifier, gets a measure of accuracy, and then updates fitness based on this computed accuracy. This is done incrementally.

The search component is based on a genetic algorithm. The GA takes place in the action set, rather than in the whole population. It selects two parents from the current [A] with probability proportional to fitness, and applies crossover and mutation. The resulting offspring are introduced into the population. If the population is full, a classifier is selected for deletion. The deletion probability is proportional to the size of the action sets where the classifier has participated. If the classifier is experienced and poorly fit, its deletion probability is increased in inverse proportion to its fitness.

XCS also uses subsumption as an additional mechanism to favor the generalization of rules. Whenever a classifier is created by the GA, it is checked for subsumption with its parents before being inserted into the population. If one of the classifier's parents is sufficiently experienced, accurate, and more general than the classifier, then the classifier is discarded, and the parent's numerosity is increased. This process is called GA subsumption.

Implicit Multiobjective Learning

XCS receives inputs from the instances available in the training set and receives feedback on its classifications in the form of rewards. The environment is designed to give a maximum reward if the system predicts the correct classification and a minimum reward (usually zero) otherwise. XCS's goal is to maximize the rewards received from the environment, and in doing so it tries to get a complete, consistent and minimal representation of the target concept. We analyze the role of each component in achieving this compound goal.

The role of covering is to enforce a complete coverage of the input space. Whenever an input is not covered by any classifier, covering creates an initial classifier from where XCS launches its learning and search mechanisms in that part of the space.

Basing fitness on accuracy makes the genetic algorithm search for accurate rules. Generalization is stressed by the application of the GA to the niches defined by the action sets. This is explained by Wilson's generalization hypothesis [63]: given two accurate classifiers with one of them matching a subset of the input states matched by the other, the more general one will win because it participates in more action sets and thus has more reproductive opportunities. As a consequence, the more general classifier will tend to displace the specific classifier, resulting in compact representations. Subsumption is included to encourage generalization and compactness as well.

The niche GA is also designed to favor the maintenance of a diverse set of rules which jointly represent the target concept. This is achieved through different mechanisms: a) the GA's triggering mechanism, which tries to balance the application of the GA among all the action sets, b) selection, which is applied locally to the action sets, c) crossover, performing a kind of restricted mating, and d) the deletion algorithm, which tends to delete resources from the most numerous action sets. In XCS, the niches are defined by the action sets, which are formed by a set of classifiers matching a common set of input states and a common class.

12.4.2 MOLeCS: Explicit Multiobjective Learning

From Multiobjective Rulesets to Multiobjective Rules

MOLeCS (MultiObjective Learning Classifier System) [10] is a Michigan style LCS that evolves a ruleset describing the target concept by the use of multiobjective evolutionary algorithms. The system's goal is to achieve a complete, consistent and compact ruleset by means of multiobjective evolutionary algorithms. As a Michigan-style LCS, however, each individual is a rule and the whole ruleset is formed by all individuals in the population. Therefore, these goals cannot be directly defined into the GA search, since the GA search is performed at the rule level, not at the ruleset level. Thus, these goals are adapted to the mechanisms of a Michigan style LCS by defining two objectives at the rule level: generalization and accuracy. Each rule should simultaneously maximize generalization and accuracy. The hypothesis was that guiding the search towards general and accurate rules would result in a minimum, complete and accurate set of rules.

If fitness was only based on accuracy, the GA search would be biased towards accurate but too specific rules. This would result in an enlargement of the solution set (i.e., the need for more rules) and also poor coverage of the feature space. On the contrary, basing fitness on generality would result in low performance, in terms of classification accuracy. The solution was to balance these two characteristics at the same time.

Description of MOLeCS

Each individual in MOLeCS codifies a rule (classifier) of type: condition → class. MOLeCS evolves a population of rules which jointly represent the target concept. Each rule $r_i$ is evaluated against the available instances in the training dataset, and two measures are computed:

\[ \mathrm{generalization}(r_i) = \frac{\#\,\text{covered examples}(r_i)}{\#\,\text{examples in the training set}} \tag{12.5} \]

\[ \mathrm{accuracy}(r_i) = \frac{\#\,\text{correctly classified examples}(r_i)}{\#\,\text{covered examples}(r_i)} \tag{12.6} \]
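
As a small illustration of Eqs. (12.5) and (12.6), the two rule-level measures can be computed as in the following sketch. The ternary condition encoding, the function names `matches` and `rule_objectives`, and the convention of returning an accuracy of 0.0 for a rule that covers no example are assumptions made here for the example; the text does not specify them.

```python
def matches(condition: str, x: str) -> bool:
    """Assumed ternary encoding: '#' is a wildcard, other symbols must equal the input."""
    return all(c == '#' or c == xi for c, xi in zip(condition, x))


def rule_objectives(condition: str, predicted_class: int, examples):
    """Return (generalization, accuracy) of a rule 'condition -> predicted_class'.

    examples is a list of (input_string, class_label) pairs (the training set).
    """
    covered = [(x, y) for x, y in examples if matches(condition, x)]
    # Eq. (12.5): fraction of the training set covered by the rule.
    generalization = len(covered) / len(examples)
    # Eq. (12.6): fraction of the covered examples that the rule classifies correctly.
    # A rule covering nothing gets accuracy 0.0 (a convention assumed for this sketch).
    if covered:
        correct = sum(1 for _, y in covered if y == predicted_class)
        accuracy = correct / len(covered)
    else:
        accuracy = 0.0
    return generalization, accuracy


# Toy training set: the class is 1 only for the input '10'.
training_set = [('00', 0), ('01', 0), ('10', 1), ('11', 0)]
for cond, cls in [('1#', 1), ('0#', 0), ('##', 0)]:
    print(cond, cls, rule_objectives(cond, cls, training_set))
```

On this toy data the fully general rule `##` covers everything but misclassifies one example, while `0#` is fully accurate but covers only half of the training set, which is the tension between the two objectives that the text describes.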

Thus, MOLeCS defines the multiobjective problem as the simultaneous maximization of accuracy and generalization. Then, the fitness of each rule is assigned according to a multiobjective evaluation strategy. Several MOEA techniques were explored in MOLeCS, with a method based on lexicographic ordering being the most successful one. Particularly, the approach taken was that of sorting the population according to the accuracy objective and, in case of equally accurate rules, sorting them by the generalization objective. The interpretation was that of searching for accurate rules being as general as possible. Once sorted, fitness was assigned according to the ranking. Other strategies based on Pareto dominance and the aggregating approach were also considered but found less successful.

Once the fitness assignment phase is performed, the GA proceeds to the selection and recombination stages. In each iteration, $G$ individuals are selected with stochastic universal sampling (SUS) [7]. Then, they undergo crossover with probability $p_c$ and mutation with probability $p_m$ per allele. The resulting offspring are introduced into the parental population. The replacement method was designed to achieve a complete coverage of the feature space. In fact, it was argued that promoting general classifiers was not sufficient to reach a complete ruleset. The genetic algorithm could tend, due to genetic drift [30], to a single general and accurate classifier, and usually one classifier does not suffice to represent the whole target concept. Therefore, the system should enforce the co-evolution of a set of diverse fit rules by niching mechanisms. Niching in MOLeCS is performed in the replacement stage. Particularly, deterministic crowding is applied, where the child replaces the most similar parent only if it has greater fitness.

Once the system has learned, it is used under an exploit or test phase. It works as follows. An example coming from the test set is presented. Then, the system finds the matching rules and applies the fittest rule to predict the associated class; in case of equally fit rules, the most accurate rule is chosen.

12.4.3 Results

XCS's generalization hypothesis [64] explains that the accuracy-based fitness coupled with the niche GA favors the evolution of compact rulesets consisting of the most general accurate rules. Additionally, the optimality hypothesis [37] argues that XCS tends to evolve minimum rulesets. These hypotheses are supported by theoretical studies on the pressures induced by the interaction of XCS's components [21, 19]. XCS has also proved to be highly competitive with other machine learning schemes such as induction trees and nearest neighbors, among others, in real world classification problems [65, 13]. Recent studies are investigating the domain of competence of XCS in real world domains, i.e., which kinds of problems XCS is well suited and poorly suited to [12].
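
Returning to the fitness assignment of MOLeCS described in Sect. 12.4.2, the following sketch contrasts the lexicographic ordering (accuracy first, generalization as tie-breaker) with a Pareto-based view of the same rules. It is a minimal illustration, not the authors' implementation: each rule is reduced to a hypothetical pair `(accuracy, generalization)`, the function names are illustrative, only ranks are returned because the text does not say how ranks are converted into fitness values, and the dominance test is the maximization counterpart of Eq. (12.2).

```python
def lexicographic_ranks(objectives):
    """Rank rules by accuracy, breaking ties by generalization (rank 0 is best).

    objectives is a list of (accuracy, generalization) pairs, both to be maximized.
    Rules with identical pairs share the same rank.
    """
    # Tuples compare element by element, so sorting them in descending order
    # orders by accuracy first and by generalization second.
    distinct = sorted(set(objectives), reverse=True)
    rank_of = {obj: rank for rank, obj in enumerate(distinct)}
    return [rank_of[obj] for obj in objectives]


def dominates(u, v):
    """Maximization version of dominance: u is no worse everywhere and better somewhere."""
    return (all(a >= b for a, b in zip(u, v))
            and any(a > b for a, b in zip(u, v)))


def pareto_front(objectives):
    """Objective vectors not dominated by any other vector in the list."""
    return [u for u in objectives
            if not any(dominates(v, u) for v in objectives)]


# Hypothetical rules as (accuracy, generalization) pairs.
rules = [(1.0, 0.25), (1.0, 0.5), (0.75, 1.0), (0.5, 0.5)]
print(lexicographic_ranks(rules))
print(pareto_front(rules))
```

With these toy values the lexicographic ranking places the fully accurate rules ahead of the more general but less accurate rule `(0.75, 1.0)`, whereas that rule is non-dominated and would sit on the same Pareto front as `(1.0, 0.5)`. This is one way to read the design choice reported in the text of preferring accurate rules that are as general as possible.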

MOLeCS was tested in artificial classification problems, of the type of multiplexer and parity problems often used in the LCS community, as well as in real world datasets. The multiobjective evolutionary algorithms let the system evolve accurate and maximally general rules which together formed compact representations. A comparison with a single-objective approach maximizing only each rule's accuracy demonstrated that the multiobjective optimization was necessary to overcome the tendency of evolving too many specific rules, which would lead the system towards suboptimal solutions only partly representing the target concept. Optimizing generality and accuracy simultaneously was a better approach than a single optimization approach. However, giving the same importance to these objectives with techniques such as Pareto ranking caused the system to evolve overgeneral rules, preventing other maximally general rules from being explored and maintained in the population. This consequently resulted in poorly accurate rulesets. The best approach taken was that of introducing the decision preferences a priori, i.e., the tradeoff among accuracy and generalization, during the search process, so that the algorithm could find accurate rules as general as possible. This led to the evolution of nearly optimal rulesets, with results highly competitive with respect to other LCSs such as XCS in real world datasets.

The main difficulty of MOLeCS was identified within the niching mechanisms. Niching mechanisms such as deterministic crowding were only useful for limited datasets which could be easily represented by small rulesets. MOLeCS presented difficulties in stabilizing a niche population and obtaining a diverse set of rules which jointly represented a complete ruleset. Switching towards implicit niching mechanisms such as those of XCS would include an extra generalization pressure, as explained by Wilson's generalization hypothesis. Having both an explicit pressure towards generalization by means of a multiobjective approach and an implicit generalization pressure produced by the niche GA could break the delicate equilibrium of evolutionary pressures inside LCSs and lead to overgeneral rules. However, this remains an unexplored area that would benefit from further research.

12.5 Multiobjective Learning in Pittsburgh Style LCSs

The previous section presented how Michigan classifier systems evolve a population of rules that classify a particular problem. However, the Pittsburgh style classifier systems evolve a population of individuals, each of them a variable-length ruleset that represents a complete solution to the problem. Such an approach greatly simplifies the evolutionary process. For instance, the fitness of each candidate individual can be computed using the accuracy of the ruleset classifying the dataset. Then, the evaluated individuals undergo the traditional selection and recombination mechanisms. Such simplicity, however, comes with a price to pay. As in Michigan approaches, Pittsburgh classifier systems should provide compact and accurate individuals. The rest of this
section reviews the implications of such a goal on Pittsburgh classifier systems and presents some systems and results that exploit multiobjective optimization ideas to achieve it.

12.5.1 A Generalization Race?

Evolution of variable-size individuals causes bloat [60]. The term, coined within the field of genetic programming (GP), refers to the code growth of individuals without any fitness improvement. Langdon [38] attributes bloat to two possible reasons, labeled as "fitness causes bloat" and "natural code is protective". The former refers to the fact that selection does not distinguish between individuals with the same fitness but different size. Thus, for each individual with a given fitness, there is a large (possibly infinite) set of individuals with the same fitness and larger codes. A search guided by fitness becomes a random walk among individuals with different sizes and equivalent fitness. The second cause of bloat considers that neutral code that does not influence fitness tends to be protective, i.e., it reduces the probability that the genetic operators disrupt useful code. Thus, such code does not improve fitness, but its maintenance in the population is favored by the genetic operators.

In Pittsburgh classifier systems, bloat may arise in two different forms: (1) the addition of useless rules, or (2) the evolution of over-specific rules. Without a limit on the number of rules in the ruleset, the search space becomes too large, and solutions are far from being optimal.

Contributions to address this issue have been worked out from both the GP field and the LCSs field. Some of the approaches taken in the GP field consist of imposing a parsimony pressure toward compact individuals (see for example, [59]) by varying fitness or through specially tailored operators. Recently, an approach proposed by Bleuler, Brack, Thiele, and Zitzler [15] uses a multiobjective evolutionary algorithm to optimize two objectives: maximize fitness and minimize size. In their approach they use the SPEA2 algorithm [68, 70, 69]. In Pittsburgh LCSs, some approaches also used a parsimony pressure, while others directly codified a multiobjective evolutionary approach. The next sections detail them.

12.5.2 Implicit Multiobjective Learning

Parsimony Pressure

The classical approach to addressing bloat and multiobjective learning goals was to introduce a parsimony pressure in the fitness function, in such a way that the fitness of larger individuals was decreased [9, 3, 27]. For example, in [27] the bloat is controlled by a step fitness function: when the number of rules of an individual exceeds a certain maximum, its fitness is decreased abruptly. One problem of this approach is to set this threshold value appropriately.
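The step fitness function described for [27] can be illustrated with a minimal sketch. The text only states that fitness drops abruptly once a maximum number of rules is exceeded; the names `step_fitness`, `max_rules`, and `penalty`, and the multiplicative form of the penalty, are assumptions made here for illustration and are not taken from [27].

```python
def step_fitness(accuracy: float, n_rules: int, max_rules: int,
                 penalty: float = 0.5) -> float:
    """Parsimony pressure as a step function (illustrative assumption).

    accuracy  -- raw fitness of the ruleset, e.g. training accuracy in [0, 1]
    n_rules   -- number of rules in the individual
    max_rules -- threshold above which the individual is penalized
    penalty   -- hypothetical multiplicative factor applied past the threshold
    """
    if n_rules > max_rules:
        # Abrupt drop: the individual is still comparable, but clearly worse.
        return accuracy * penalty
    return accuracy


if __name__ == "__main__":
    print(step_fitness(0.9, n_rules=8, max_rules=10))   # below the threshold
    print(step_fitness(0.9, n_rules=12, max_rules=10))  # above the threshold
```

The sketch also makes the stated drawback visible: the outcome depends entirely on how `max_rules` is chosen, since individuals on either side of the threshold are treated very differently regardless of how close they are in size.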

Bacardit and Garrell [3] define a similar fitness function, as well as a set of operators for the deletion of introns⁴ [49] (rules that are not used in the classification) and a tournament-based selection operator that considers the size of individuals. The authors argue that bloat control has an influence on the generalization capability of the solutions. It has been observed that shorter rulesets tend to have more generalization capabilities [14, 3, 49]. Therefore, the use of a parsimony pressure has beneficial effects: it controls the unlimited growth of individuals, increases the efficiency of the search process, and leads to solutions with better generalization. Nevertheless, the parsimony pressure must be balanced appropriately. An excessive pressure toward small individuals could result in premature convergence leading to compact solutions but with suboptimal fitness [49], or even in a total population failure (the population collapses with individuals of minimal size). Soule and Foster [59] showed that the effect of the parsimony pressure can be measured by calculating explicitly the relationship between the size and the performance of individuals within the population. Based on these results, it seems that a multiobjective approach may overcome some of these difficulties. The first step toward introducing multiobjective pressures in Pittsburgh classifier systems is to use a linear combination of the accuracy of a given individual and its generality (understood as inversely proportional to its size). The next section reviews an approach of this type, which is based on the minimum description length principle [5].

⁴ Non-coding segments, the so-called introns. In GP literature this concept has also been termed non-effective code [8].

GAssist

GAssist [5] is a Pittsburgh LCS descendant of GABIL. The system applies a near-standard generational GA that evolves individuals that represent complete problem solutions. An individual consists of an ordered, variable-length ruleset. A special fitness function based on the Minimum Description Length (MDL) principle [56] is used, balancing the accuracy and the complexity of an individual. The MDL formulation used is defined as follows:

$$MDL = W \cdot TL + EL \tag{12.7}$$

where TL stands for the complexity of the ruleset, which considers the number of rules and the number of relevant attributes in each rule, and EL is a measure of the error of the individual on the training set. W is a constant that adjusts the relation between TL and EL. An adaptive mechanism is used to avoid domain-specific tuning of the constant. The system also uses a rule deletion operator to further control the bloat effect. The algorithm deletes the rules that have not been activated by any example. Tournament selection is used.

The authors argue that the growth of introns in the individuals is protective, in the sense that they prevent the crossover operator from disrupting the useful parts of the individual. However, as the code grows, the chances of improving the fitness of the individual by recombination also decrease. Therefore, removing part of this useless code through an appropriate balance is beneficial to the search process.

A comparison of GAssist with XCS [2] showed similar performance in terms of accuracy rate. However, the number of rules evolved by GAssist was much smaller than the number of rules of XCS. XCS was found to overfit in some datasets, where the generalization pressure due to the niche reproduction hardly applied because of data sparsity. The comparison of GAssist with XCS also revealed a certain difficulty of GAssist in handling multiple classes and huge search spaces. Section 12.5.4 gives more details by means of a comparison of these systems in a selected set of classification problems.

12.5.3 Explicit Multiobjective Learning

By using a multiobjective evolutionary approach, we can explicitly search for a set of rules that minimizes the misclassification error and the size (number of rules). Besides controlling the number of rules (bloat) dynamically, this would allow the formation of compromise hypotheses. This explicit tradeoff formation lets us explore the generalization capabilities of the hypotheses that form the Pareto front. In certain environments like data mining, where the extraction of explanatory models is desirable, high quality general solutions (in terms of accuracy out of sample, or compactness of hypotheses) are useful. However, the presence of noise in the dataset may lead to accurate but overfitted solutions. By maintaining a Pareto front of compromise solutions, we can identify the overfitted perturbations of high quality general hypotheses.

Classification and Multiobjective Optimization

Multiobjective evolutionary algorithms can be applied to trade off between two objectives: the accuracy of an individual and its size. Hence, by evolving a set of different compromise solutions between accuracy and generalization, we can postpone the decision of picking the "best ruleset" to the final user (decision maker), or combine them all using some bagging technique [17, 41, 39].

Let us define x as an individual that is a complete solution to the classification problem, D the training dataset for the given problem, |D| the number of instances in D, miss(x, D) the number of instances of D incorrectly classified by x, and finally, size(x) a measure of the current size of x (e.g., the number of rules it contains). Using this notation, a multiobjective approach can be defined as follows:

$$\text{minimize } F(x) = (f_e(x), f_s(x)) \tag{12.8}$$

$$f_e(x) = \frac{miss(x, D)}{|D|} \tag{12.9}$$

$$f_s(x) = size(x) \tag{12.10}$$

The misclassification error corresponds to the number of instances incorrectly classified divided by the size of the training set. Searching for rulesets that minimize the misclassification error means searching for rulesets that correctly cover as many instances as possible. Minimizing the number of rules of the ruleset seeks general and compact rulesets, which moreover avoid the bloat effect. In fact, this is a simple multiobjective approach. Other types of proposals could include more objectives such as measures of generalization of the rulesets, or coverage (number of instances covered by a ruleset).

MOLS-GA

Some approaches have addressed learning rulesets in Pittsburgh LCS architectures from a multiobjective perspective. MOLS-GA and MOLS-ES [45] are two examples. They represent two similar approaches to the multiobjective problem, differing in whether they base the search mechanism on a GA (MOLS-GA) or an evolution strategy (MOLS-ES). Since they do not offer significant differences on the multiobjective approach itself, we will center our study on MOLS-GA for being more representative of Pittsburgh LCSs.

MOLS-GA codifies a population of individuals, where each individual represents a complete representation of the target concept. The available knowledge representations are rulesets or instance sets [39, 42, 43]. If the problem's attributes are nominal, MOLS-GA uses rulesets, represented by the ternary alphabet (0, 1, #) often used in other LCSs [33, 29, 63]. Otherwise, if the problem is defined by continuous-valued attributes, instance sets (based on a nearest neighbor classification) are used.

The GA learning cycle works as follows. First, the fitness of each individual in the population is computed. This is done on a multiobjective basis, taking into account the misclassification error and the size of each individual. The individuals of the population are sorted in equivalence classes. These classes are determined by the Pareto fronts that can be defined among the population. That is, given a population of individuals $I$, the first equivalence class $I^0$ is the set of individuals which belongs to the evolved Pareto optimal set, $I^0 = P^*(I)$. The next equivalence class $I^1$ is computed without considering the individuals in $I^0$, as $I^1 = P^*(I \setminus I^0)$, and so forth. Figure 12.1 shows an example of the different equivalence classes that appear in a population at a given iteration. This plot is obtained with the multiplexer (mux) problem. In this example, the population is classified into nine different fronts. The left front is $I^0$, which corresponds to the non-dominated vectors of the population. The next front to the right represents $I^1$, and so on.
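The sorting of a population into the equivalence classes $I^0, I^1, \ldots$ can be sketched as repeated extraction of the non-dominated set. The following is a minimal illustration under stated assumptions, not the MOLS-GA implementation: each individual is reduced to its objective vector $(f_e, f_s)$, both objectives are minimized, and the function names (`dominates`, `equivalence_classes`) and the sample population are hypothetical.

```python
from typing import List, Sequence, Tuple

Objectives = Tuple[float, float]  # (f_e, f_s): error rate, ruleset size


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """True if a is no worse than b on every objective and strictly
    better on at least one (all objectives are minimized)."""
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))


def equivalence_classes(points: List[Objectives]) -> List[List[int]]:
    """Return index lists [I^0, I^1, ...] where I^0 = P*(I),
    I^1 = P*(I minus I^0), and so forth."""
    remaining = list(range(len(points)))
    fronts: List[List[int]] = []
    while remaining:
        # Non-dominated individuals among those not yet assigned.
        front = [i for i in remaining
                 if not any(dominates(points[j], points[i])
                            for j in remaining if j != i)]
        fronts.append(front)
        remaining = [i for i in remaining if i not in front]
    return fronts


if __name__ == "__main__":
    # Hypothetical population: (misclassification error, number of rules).
    population = [(0.10, 3), (0.05, 6), (0.20, 2), (0.10, 5), (0.25, 4)]
    print(equivalence_classes(population))
```

For this sample population the first class contains the three mutually non-dominated individuals, and the remaining two form the second class. The quadratic dominance check is chosen for readability; faster non-dominated sorting procedures exist but are not needed to show the idea.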

Fig. 12.1. Sorted population fronts at a given iteration of MOLS in the mux problem [45].

Once the population of individuals $I$ is sorted, fitness values are assigned. Since the evolution must bias the population toward non-dominated solutions, we impose the constraint:

$$fitness(I^i) > fitness(I^{i+1}) \tag{12.11}$$

Thus, the evolution will try to guide the population toward the left part of the plot, i.e., the real Pareto front. The fitness of each individual depends on the front where the individual belongs. That is, all the individuals of the same equivalence class $I^i$ receive the same constant value $(n - i)\delta$, where $n$ is the number of equivalence classes and $\delta$ is a constant. Moreover, in order to spread the population along the Pareto front, a sharing function is applied. Thus, the final fitness of an individual $j$ in a given equivalence class $I^i$ is:

$$fitness(I^i_j) = \frac{(n - i)\,\delta}{\sum_{k \in I} \phi(d_{I^i_j I_k})} \tag{12.12}$$

where $\phi(d_{I^i_j I_k})$ is the sharing function [30]. The sharing function is computed using the Euclidean distance between the multiobjective vectors. The radius of the sharing function $\sigma_{sh}$ was set to $\sigma_{sh} = 0.1$.

After individuals are evaluated, selection is applied using a tournament selection algorithm [50, 6] with elitism. Elitism is often applied in evolutionary multiobjective optimization algorithms and it usually consists of keeping the solutions of the Pareto front evolved in each generation [61]. MOLS-GA performs similarly: it keeps all the distinct solutions of the evolved Pareto front, and also 30% of the individuals with the lowest error. This guarantees that the best compromise solutions evolved so far are not lost, as well as the solutions with the lowest error, which are important to drive the evolution toward accurate solutions.

After selection, crossover and mutation are applied. Crossover is adapted from two-point crossover to individuals of variable size. Thus, cut points can occur anywhere inside the ruleset, but they should be equivalent in both parents so that valid offspring can be obtained. The mutation consists in generating a random new gene value.

What is the Purpose of the Pareto Front?

The main purpose of the evolved Pareto front is to keep solutions with different tradeoffs between error and size. By coevolving these compromise solutions, we can delay the need of choosing a solution until the evolution is over and the evolved Pareto front is provided. The decision maker has several hypotheses among which to choose, all provided by the classification front. The decision maker can be a human or an expert system with some knowledge about the problem. However, this decision is critical for achieving a high quality generalization when tested with unseen instances of the target concept.

A typical approach is to select a solution from the evolved Pareto front. Two strategies may be considered, as shown in figure 12.2. The first one (best-accuracy) chooses the solution $x$ of the front with the best accuracy, that is, the one that minimizes $f_e(x)$. On the other hand, the second one (best-compromise) picks the hypothesis of the front that minimizes the objective vector $u = F(x)$. In other words, the selected solution is the one that balances equally both objectives, that is, the solution $x$ that minimizes $|F(x)| = \sqrt{f_e(x)^2 + f_s(x)^2}$.

We could instead benefit from the combination of the multiple hypotheses obtained in the evolved Pareto front. Such a technique is inspired by the bagging technique [17, 41], which consists of training several classifiers and combining them all into an ensemble classifier. The bagging technique tends to reduce the deviation among runs, and the combined solution often improves the generalization capability of the individual solutions [39]. An adaptation of such a strategy to the combination of solutions from the Pareto front could give similar benefits, as shown elsewhere [40].

12.5.4 Experiments and Results

Through a Pareto Front Glass

Figure 12.1 shows an example of how the sorting of a population of candidate solutions produces a clear visualization of the tradeoff between accuracy and generality.
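Looking back at the fitness assignment of Eqs. (12.11) and (12.12) and at the two decision-maker strategies, a small sketch may help make the arithmetic concrete. It is illustrative only and rests on several assumptions: the sharing function is taken to be the common triangular form $\phi(d) = 1 - d/\sigma_{sh}$ for $d < \sigma_{sh}$ and 0 otherwise (the text only cites [30] for it), the objective vectors are used without normalization (the text does not say whether they were scaled), and all function names and sample values are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Objectives = Tuple[float, float]  # (f_e, f_s): error rate, ruleset size


def sharing(d: float, sigma_sh: float = 0.1) -> float:
    """Assumed triangular sharing function with radius sigma_sh."""
    return 1.0 - d / sigma_sh if d < sigma_sh else 0.0


def shared_fitness(points: Sequence[Objectives],
                   fronts: Sequence[Sequence[int]],
                   delta: float = 1.0,
                   sigma_sh: float = 0.1) -> List[float]:
    """Eq. (12.12): class value (n - i) * delta divided by the niche count,
    where the niche count sums the sharing function over the population."""
    n = len(fronts)
    fitness = [0.0] * len(points)
    for i, front in enumerate(fronts):
        for j in front:
            # The individual itself contributes sharing(0) = 1, so the
            # denominator is never zero.
            niche = sum(sharing(math.dist(points[j], q), sigma_sh)
                        for q in points)
            fitness[j] = (n - i) * delta / niche
    return fitness


def best_accuracy(front: Sequence[Objectives]) -> Objectives:
    """Pick the front member with the lowest error f_e."""
    return min(front, key=lambda f: f[0])


def best_compromise(front: Sequence[Objectives]) -> Objectives:
    """Pick the front member closest to the origin of objective space."""
    return min(front, key=lambda f: math.hypot(f[0], f[1]))


if __name__ == "__main__":
    points = [(0.10, 3.0), (0.05, 6.0), (0.20, 2.0), (0.10, 5.0), (0.25, 4.0)]
    fronts = [[0, 1, 2], [3, 4]]  # equivalence classes I^0 and I^1
    print(shared_fitness(points, fronts))
    pareto = [points[i] for i in fronts[0]]
    print(best_accuracy(pareto), best_compromise(pareto))
```

With raw, unnormalized objectives the size term dominates the Euclidean norm, so in this toy example best-compromise simply picks the smallest ruleset. Whether and how the objectives should be rescaled before applying the sharing radius or the compromise criterion is a design decision that this sketch leaves open.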

278 E.3. 12.2. The rupture point indicates the place where the evolved . Bernadó-Mansilla et al. Such Pareto fronts arise an interesting property. 12. Size Size Error Error (a) best-accuracy (b) best-compromise Fig. as shown by Llorà and Goldberg [44]. Decision maker strategies using (a) the solution with the lowest error (best accuracy) and (b) the solution closer to the origin (best compromise). Fig. Evolved Pareto front by MOLS-GA in the LED problem [44]. Such rupture point is achieved around the minimal achievable error (MAE) [44] for the dataset at hand. the Pareto front presents a clear rupture around the maximally general and accurate solutions. In the presence of noise in the dataset.

& Bernadó-Mansilla [45] evaluated the performance of different multiobjective LCSs on nine different datasets. Particu- larly. iris (irs). all the methods seem to perform equivalently. glass (gls). All these problems disappear when we force the theoretical and the empirical MAE to be the same. Goldberg. These datasets contain categorical and numeric attributes. 12 Multi-objective Learning Classifier Systems 279 Pareto front abruptly changes its slope. This has an interesting interpretation. Two of them are artificial datasets. widely used by the LCS community [63]. ionosphere (ion). Results The work by Llorà. . whereas MOLS-ES used best-compromise. primary tumor (prt).2 shows the classification accuracy obtained by each method. we compare MOLS-GA and MOLS-ES with GAssist and XCS. Observe that LCSs in general are highly competitive with the non-evolutionary schemes. MOLS-GA and MOLS-ES offer good classification accuracies when compared with GAssist and XCS. as well as binary and n-ary classification tasks.5 [54. Table 12. which are described in Table 12. On the contrary.3 shows the average and standard deviation of the number of rules obtained by the LCS approaches. Traus. Their algorithms were obtained from the Weka package [67] and ran with their default configuration.1. Instead. If any of these solutions is tested using new instances not previously shown in the training phase. There is not a clear winner in all the datasets. and sonar (son). C4. Moreover. This means that they are learning some misleading noisy pattern or a solution too specific to the current training dataset. All the points that define this segment of the front are over-fitted solutions. The front that appears to the left of the rupture point is the result of the deviation of the empirical MAE from its theoretical value. and PART [26]. the ruleset evolved by Pittsburgh LCSs is that obtained by the best individual of the population. 
Observe that the three Pittsburgh-based methods offer fairly simple rulesets. obtained from the UCI repository [47]. 67] using the different learning algorithms on the selected datasets. Wisconsin breast cancer (bre). be- cause very small (misleading) improvements require a large individual growth. Thus. The remaining datasets also belong to the UCI repository: bupa liver disorders (bpa). Table 12. these solutions are closer to the bloat phenomenon. composed of 3 to 15 rules approximately. Mux is the eleven input multiplexer. Moreover. where two explicit mul- tiobjective Pittsburgh approaches are compared with other schemes. We summarize some of the results published in [45]. Led is the seven-segments problem [18]. they would produce a significant drop in accuracy. 55]. The results were obtained from stratified ten-fold cross-validations runs [48. MOLS- GA used a best-accuracy strategy for the test phase. In fact. XCS gets much larger rulesets. We also include a comparison with non-evolutionary schemes: IB1 [1]. this leads to a reduction of the generalization capa- bilities (in terms of classification accuracy) of the solutions kept in that part of the front.

the front presented in figure 4(b) shows an interesting resemblance to the fronts obtained in the noisy led problem . Bernadó-Mansilla et al. For instance.0 34 . XCS does not use such a hierarchy so that the number of necessary rules is higher. 2 gls Glass 214 0. The first rules tend to cover many examples (the general case).0 9 .0 4 . 6 ion Ionosphere 351 0. For exam- ple. If we consider that GAssist evolves a population of 400 individuals.280 E. Summary of the datasets used in the experiments. the bpa ruleset could be reduced to 90 rules. 24. which corresponds to a better evolved Pareto front. Table 12. 7 10 mux Multiplexer (11 inputs) 2048 0. resulting in higher performance. 51]. Most of the rules are only product of the recent exploration of the algorithm and are not relevant to performance.0 6 . MOLS-GA is in average the method which gets smaller rulesets for similar values of classification accuracy. which is still higher that Pittsburgh LCSs. 2 The Pareto fronts presented in figure 12. they could be compacted. which means that the explicit multiobjective approach is effective to further reduce the size of the ruleset. while the ruleset obtained by XCS considers all the population.4 also suggest another interesting analysis of the results.e. Despite these issues. 17 22 son Sonar 208 0. i. 3 led Led (10% noise) 2000 0. therefore. but more reasonable to human experts. This probably means that MOLS-GA evolves better Pareto fronts than MOLS-ES.0 .. 2 bre Wisconsin Breast Cancer 699 0. the final XCS’s ruleset should be still further processed.4 shows two examples for the bre and the prt datasets. In [51] some reduction algorithms reported a reduction of 97% without a significant degradation in classification accuracy. Figure 12.3 9 . the number of explored rules is equivalent to that evolved by XCS. while the last rules codify the exceptions to the general rule. 
Research on reduction algorithms and further processing of the evolved rules is being conducted by several authors to get a ruleset easily interpretable to human experts [66. 2 irs Iris 150 0.0 60 . The second plot corresponds to the more general case where MOLS-GA obtains a better front.1.9 . See that the first one belongs to the case where MOLS-ES gets better performance. 1 2 prt Primary Tumor 339 3. the rules form a hierarchy going from the most general to the most specific ones. Other rules overlap partially with others. the ruleset of a Pitts- burgh LCSs operates as an activation list. id Data set Size Missing Numeric Nominal Classes values(%) Attributes Attributes bpa Bupa Liver Disorders 345 0. Moreover.0 . MOLS- GA also gets smaller rulesets than GAssist in average.

3±3.2 prt 51.5 96.0 89.3±3.1±0.2 mux 10.0±3.6 Table 12.0 8.2.1 5.5 gls 4.0±0.2 74.9±0.8±8.2 100.7 92.8±6.6±3.9±13.4±6.3±0.6±6.8±6.0±10.0 99.6 11.5±0.6 41.5±10.4±10.7 47. id MOLS-GA MOLS-ES GAsssist XCS bpa 10.2±0.1±4.1 bre 96.5 12.0 7.5 640±45.3 led 74.1 96.5±8.3±5. Pareto fronts achieved in real problems by MOLS-GA and MOLS-ES [45].6±0.3±1.4±1.7±2.0±1.3 74.8±0.5±1.2 75.0±0.0 99.9 irs 2. Comparison of classification accuracy on selected datasets.1 89.5 prt 9.6 92.9 730±75.2 83.6±0.a.8±1.9±3.0 100.7±5.9 65.3 15.5 90.3 son 90.5 68.6 2369.1±0.0 100.0 100.6±12.2±9.9±2.9 ion 6.8 40. 2].7±2.6±9.0±5.5±8.8±2.2±8.3 68.7±0.5±6.1±9.1 39.5±13.2±1. 12. 2].3±3. 12 Multi-objective Learning Classifier Systems 281 Table 12.1 71.3.3 70.5±0.7 8.9±1.8±9.2 95.3±3.9 ion 91.2 95.7 74.2 2377.1 4.8 led 14.4.9±0.1±2.7 94.4 68.8 son 11.8±10.5±114.4 42.3 63.3 95.9 77.3 12.1±4. id MOLS-GA MOLS-ES GAssist XCS C4. Comparison of ruleset sizes obtained by different methods on selected datasets.3±0.9 65.4±7.8 irs 99.5±2.8±0.6±6.7 bre 9.7±1.5 gls 67.5 8.0±0.0±0.5±125.a.4 4745±123.9±0. 74.3±1.2±2.8 14.5 95.5±3.2 95.5±10.4 12. The table shows the mean and standard deviation of each method on a stratified ten-fold cross-validation experiment [45.2±15.7±6.5 73.0±0.5 77.8±3.1±0.0 n.9 11.5±3. .4 2709±98.3±3.3 65.4±1.9±0.4±3.2 Fig.0 66. 102.5 PART IB1 bpa 76.4 69.3±2.4±5.0±0.0 64.9 95.4 n.7 61.3 4360±134.8±1.7±2.8 2.6 71.2 95.1 95.2±9.5 13.7 mux 100.9 90.2±0.0 16.4 41.6±5. The table shows the mean and standard deviation of each method on a stratified ten-fold cross- validation experiment [45.8 1784±45.6 95.3 12.

(see figure 12.3). This clearly points to the presence of inconsistencies in the prt dataset that bound the MAE. However, further work on the usage of the MAE measure is still needed, as well as exploring its connections to probably approximately correct models in the computational learning theory field [48].

12.6 Current Research Trends and Further Directions

Learning classifier systems address the complex multiobjective task either by balancing the learning goals implicitly or by means of evolutionary multiobjective algorithms adapted to the particular architectures. Within the field of Michigan LCSs, the best approach, XCS, does not use explicit multiobjective evolutionary algorithms. The combination of accuracy-based fitness and the niche GA, with the addition of other heuristic components, achieves the appropriate balance to favor the creation and maintenance of maximally general rules. In fact, the equilibrium is delicate and studies on pressures point out how to balance this equilibrium through appropriate parameter settings. Despite this difficult balance, it does not seem feasible to add explicit multiobjective algorithms to further emphasize the generalization of rules, because this probably would favor overgeneral rules. However, this still remains an unexplored research area which may be analyzed to strengthen generalization in cases where this is scarcely favored by the niche GA, such as in problems with unequal distribution of examples per class (see for example [52]).

Explicit multiobjective algorithms are also able to achieve accurate and maximally general rules in other types of Michigan LCSs. The main difficulty is found in the architecture rather than in the multiobjective evaluation itself. The main problem is to stabilize different niches co-operating to form a complete ruleset. More research on niching algorithms is needed, combined with restricted mating methodologies.

A problem common in Michigan LCSs is the difficulty of controlling the number of rules forming the final ruleset. For binary attributes, favoring generalization of rules is enough to achieve a minimum ruleset. But for continuous attributes and real-valued representations, Michigan LCSs tend to produce a high number of rules that overlap partially. This imposes a limitation on their interpretability by human experts. Stressing generalization and favoring removal of useless rules during training could help evolve smaller rulesets. This would also reduce the computational cost, since the system would maintain fewer rules during training. Although some reduction algorithms have been designed in the Michigan framework to get compact rulesets [20], they still provide larger rulesets. Pittsburgh LCSs using some kind of pressure on the maximum number of rules produce rulesets much smaller than those of Michigan LCSs.

Within the Pittsburgh approach, implicit and explicit multiobjective approaches are able to balance accuracy and generalization. Moreover, the explicit usage of multiobjective techniques provides an output with a clear tradeoff among both objectives. A Pareto front provides a way of sorting the final population of candidate solutions. Such a front may be explored for a particular solution, or just combined to form an ensemble of classifiers, as is currently being analyzed in [40].

Another key research area for Pittsburgh classifier systems is the identification of overfitting situations from the evolved Pareto fronts. Such research also needs to face the readability of the final solution. Measures of rule interpretability must be defined. Once the proper measure is ready, it can simply be integrated in the evolutionary process as a third objective to be optimized. However, we should analyze whether the addition of more objectives involves a harder search, and what kind of solutions can be evolved. Multiobjective algorithms can also be appropriate to balance the different types of misclassification errors and provide a set of alternatives. Many domains, such as medical domains, specify the cost of misclassifying each of the classes, instead of using a single measure of error.

However, as pointed out in the chapter, such an approach assumes that the global pressure toward accuracy and generality will lead the rulesets towards the desired learning goals. This is clearly a top-down approach. Recently, the work of Wilson in 1995 [63] and the major shift in the way in which fitness was computed in the Michigan approach have been revisited by researchers. Accuracy has become a central element in the process of computing the fitness of rules (or classifiers). Initial attempts to apply Wilson's ideas to Pittsburgh-style classifier systems are on their way [46]. When such ideas are combined with estimation of distribution algorithms, a bottom-up approach to Pittsburgh classifiers would start to emerge, as the compact classifier system (CCS) has preliminarily shown.

Acknowledgments

The first author acknowledges the support of Enginyeria i Arquitectura La Salle, Ramon Llull University, as well as the support of Ministerio de Ciencia y Tecnología under Grant TIC2002-04036-C05-03 and Generalitat de Catalunya under Grant 2002SGR-00155.

This work was sponsored in part by the Air Force Office of Scientific Research, Air Force Materiel Command, USAF (F49620-03-1-0129), and by the Technology Research, Education, and Commercialization Center (TRECC), at University of Illinois at Urbana-Champaign, administered by the National Center for Supercomputing Applications (NCSA) and funded by the Office of Naval Research (N00014-01-1-0175). The US Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation thereon. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or


13 Simultaneous Generation of Accurate and Interpretable Neural Network Classifiers

Yaochu Jin, Bernhard Sendhoff and Edgar Körner

Honda Research Institute Europe, Carl-Legien-Str. 30, 63073 Offenbach/Main, Germany
yaochu.jin@honda-ri.de

Y. Jin et al.: Simultaneous Generation of Accurate and Interpretable Neural Network Classifiers, Studies in Computational Intelligence (SCI) 16, 291–312 (2006)
www.springerlink.com © Springer-Verlag Berlin Heidelberg 2006

Summary. Generating machine learning models is inherently a multi-objective optimization problem. The two most common objectives are accuracy and interpretability, which are very likely to conflict with each other. While in most cases we are interested only in the model accuracy, interpretability of the model becomes the major concern if the model is used for data mining or if the model is applied to critical applications. In this chapter, we present a method for simultaneously generating accurate and interpretable neural network models for classification using an evolutionary multi-objective optimization algorithm. Lifetime learning is embedded to fine-tune the weights in the evolution that mutates the structure and weights of the neural networks. The efficiency of the Baldwin effect and that of Lamarckian evolution are compared. It is found that the Lamarckian evolution outperforms the Baldwin effect in evolutionary multi-objective optimization of neural networks. Simulation results on two benchmark problems demonstrate that the evolutionary multi-objective approach is able to generate both accurate and understandable neural network models, which can be used for different purposes.

13.1 Introduction

Artificial neural networks are linear or nonlinear systems that encode information with connections and units in a distributed manner, which are more or less of biological plausibility. Over the past two decades, various artificial neural networks have successfully been applied to solve a wide range of tasks, such as regression and classification, among many others. Researchers and practitioners are mainly concerned with the accuracy of neural networks. In other words, neural networks should exhibit high prediction or classification accuracy on both seen and unseen data. To improve the accuracy of neural networks, many efficient learning algorithms have been developed [8, 37]. Nevertheless, there are also many cases in which it becomes essential that human users are able to understand the knowledge that the neural network

has learned from the training data. However, conventional learning algorithms do not take the interpretability of neural networks into account and thus the knowledge acquired by neural networks is often not transparent to human users. To address this problem, it is often necessary to extract symbolic or fuzzy rules from trained neural networks [7, 15, 24, 25]. A common drawback of most existing rule extraction methods is that the rules are extracted after the neural network has been trained, which incurs additional computational costs.

The idea of introducing Pareto-optimality to deal with multiple objectives in constructing neural networks can be traced back to the middle of the 1990s [29]. In this work, two objectives, namely, the mean squared error and the number of hidden neurons, are considered. In another work [39], the error on training data and the norm of the weights of neural networks are minimized. The successful development in evolutionary multi-objective optimization [11, 12] has encouraged many researchers to address the conflicting, multiple objectives in machine learning using multi-objective evolutionary algorithms [2, 3, 4, 10, 16, 26, 27, 30, 32, 41]. Most of the work focuses on improving the accuracy of a single neural network or an ensemble of neural networks. Objectives in training neural networks include accuracy on training or test data, complexity (number of hidden neurons, number of connections, and so on), diversity, and interpretability. It should be pointed out that a trade-off between the accuracy on the training data and the accuracy on test data does not necessarily result in a trade-off between accuracy and complexity. Multi-objective approaches to support vector machines [21] and fuzzy systems [22, 40] have also been studied.

This chapter presents a method for generating accurate and interpretable neural models simultaneously using the multi-objective optimization approach [28]. A number of neural networks that concentrate on accuracy and interpretability to a different degree will be generated using a multi-objective evolutionary algorithm combined with life-time learning, where accuracy and complexity serve as two conflicting objectives. It has been shown that evolutionary multi-objective algorithms are well suited for and very powerful in obtaining a set of Pareto-optimal solutions in one single run of optimization [11, 12]. By analyzing the Pareto front, both accurate neural networks (in the sense of approximation accuracy on test data) and interpretable neural networks can be identified.

Section 13.2 discusses different aspects in generating accurate and interpretable neural networks. General model selection criteria in machine learning are also briefly described. Section 13.3 shows that any formal neural network regularization methods can be treated as multi-objective optimization problems. The details of the evolutionary multi-objective algorithm, together with the life-time learning method, will be provided in Section 13.4. The proposed method is verified on two classification examples in Section 13.5. A short summary of the chapter is given in Section 13.6.

13.2 Tradeoff between Accuracy and Interpretability

13.2.1 Accurate Neural Networks

Theoretically, feedforward neural networks with one hidden layer are universal approximators, see e.g., [18]. However, a large number of hidden neurons is required to approximate a given complex function to an arbitrary degree of accuracy. A problem in creating accurate neural networks is that a very good approximation of the training data can result in a poor approximation of unseen data, which is known as overfitting. To avoid overfitting, it is necessary to control the complexity of the neural network, which in turn will limit the accuracy that can be achieved on the training data. Thus, when we talk about accurate neural network models, we mean that the neural network should have good accuracy on both training data and unseen test data.

13.2.2 Interpretability of Neural Networks

The knowledge acquired by neural networks is distributed on the weights and connections and is therefore not easily understandable to human users. To improve the interpretability of neural networks, the basic approach is to extract symbolic or fuzzy rules from trained neural networks [7, 15, 24]. According to [7], approaches to rule extraction from trained neural networks can be divided into two main categories. The first category is known as the decompositional approach, in which rules are extracted from individual neurons in the hidden and output layer. In the second approach, the pedagogical approach, no analysis of the trained neural network is undertaken and the neural network is treated as a "black-box" that provides input-output data pairs for generating rules.

We believe the decompositional approach is more interesting, particularly if structure optimization, structural regularization or pruning has been performed during learning [23, 25, 36], which often reduces the complexity of neural networks, because the resulting neural network architecture reflects the problem structure in a way. Decompositional approaches to rule extraction from trained neural networks can largely be divided into three steps: neural network training, network skeletonization, and rule extraction [15]. The complexity reduction procedure prior to rule extraction is also termed skeletonization [14].

13.2.3 Model Selection in Machine Learning

It is interesting to notice that we need to control the complexity of neural networks properly, no matter whether we want to generate accurate or interpretable neural network models. However, the complexity for an accurate model and an interpretable model is often different. Usually, the complexity of an interpretable model is lower than that of an accurate model due to the limited

cognitive capacity of human beings. Unfortunately, the complexity of interpretable neural networks may be insufficient for neural networks to achieve a good accuracy. Thus, a trade-off between accuracy and interpretability is often unavoidable.

Model selection is an important topic in machine learning. Traditionally, model selection criteria have been proposed for creating accurate models. The task of model selection is to choose the best model in the sense of accuracy for a set of given data, assuming that a number of models are available. Several criteria have been proposed based on the Kullback-Leibler Information Criterion [9]. The most popular criteria are Akaike's Information Criterion (AIC) and the Bayesian Information Criterion (BIC). For example, model selection according to the AIC is to minimize the following criterion:

$$\mathrm{AIC} = -2 \log\big(L(\theta \mid y, g)\big) + 2K, \tag{13.1}$$

where $L(\theta \mid y, g)$ is the maximized likelihood for data $y$ given a model $g$ with model parameter $\theta$, and $K$ is the number of effective parameters of $g$. The first term of Equation (13.1) reflects how well the model approximates the data, while the second term is the complexity of the model. Usually, the higher the model complexity is, the more accurate the approximation on the training data will be. Obviously, a trade-off between accuracy and model complexity has to be taken into account in model selection.

Model selection criteria have often been used to control the complexity of models to a desired degree in model generation. This approach is usually known as regularization in the neural network community [8]. The main purpose of neural network regularization is to improve the generalization capability of a neural network by controlling its complexity. By generalization, it is meant that a trained neural network should perform well not only on training data, but also on unseen data.

13.3 Multi-objective Optimization Approach to Complexity Reduction

13.3.1 Neural Network Regularization

Neural network regularization can be realized by including an additional term that reflects the model complexity in the cost function of the training algorithm:

$$J = E + \lambda \Omega, \tag{13.2}$$

where $E$ is an error function, $\Omega$ is the regularization term representing the complexity of the network model, and $\lambda$ is a hyperparameter that controls the strength of the regularization. The most common error function in training or evolving neural networks is the mean squared error (MSE):

$$E = \frac{1}{N} \sum_{i=1}^{N} \big(y^{d}(i) - y(i)\big)^{2}, \tag{13.3}$$

where $N$ is the number of training samples, $y^{d}(i)$ is the desired output of the $i$-th sample, and $y(i)$ is the network output for the $i$-th sample. For the sake of clarity, we assume that the neural network has only one output. Refer to [8] for other error functions, such as the Minkowski error or cross-entropy.

Several measures have also been suggested for denoting the model complexity $\Omega$. A most popular regularization term is the squared sum of all weights of the network:

$$\Omega = \frac{1}{2} \sum_{k} w_{k}^{2}, \tag{13.4}$$

where $k$ is an index summing up all weights. This regularization method has been termed weight decay. One weakness of the weight decay method is that it is not able to drive small irrelevant weights to zero when gradient-based learning algorithms are employed, which may result in many small weights [37]. An alternative is to replace the squared sum of the weights with the sum of the absolute values of the weights:

$$\Omega = \sum_{i} |w_{i}|. \tag{13.5}$$

It has been shown that this regularization term is able to drive irrelevant weights to zero [31]. Both regularization terms in equations (13.4) and (13.5) have also been studied from the Bayesian learning point of view, where they are known as the Gaussian regularizer and the Laplace regularizer, respectively.

A more direct measure for model complexity of neural networks is the number of weights contained in the neural network:

$$\Omega = \sum_{i} \sum_{j} c_{ij}, \tag{13.6}$$

where $c_{ij}$ equals 1 if there is a connection from neuron $j$ to neuron $i$, and 0 if not. It should be noticed that the above complexity measure is not generally applicable to gradient-based learning methods.

A comparison of the three regularization terms using multi-objective evolutionary algorithms has been implemented [27]. In contrast to the conclusions reported in [31], where the gradient-based learning method has been used, it has been shown that regularization using the sum of squared weights is able to change (reduce) the structure of neural networks as efficiently as using the sum of absolute weights. Thus, no significant difference has been observed between the Gaussian and Laplace regularizers when evolutionary algorithms are used for regularization. In other words, the difference observed in [31] is mainly due to the gradient learning algorithm, but not the regularizers themselves.
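The error function and the three complexity measures can be written down in a few lines. The sketch below is only an illustration of Equations (13.2) to (13.6); the function names and the sample numbers are assumptions made for this example and do not come from the chapter.

```python
def mse(desired, outputs):
    """Mean squared error over N samples, Eq. (13.3)."""
    n = len(desired)
    return sum((d - y) ** 2 for d, y in zip(desired, outputs)) / n


def gaussian_regularizer(weights):
    """Half the sum of squared weights (weight decay), Eq. (13.4)."""
    return 0.5 * sum(w * w for w in weights)


def laplace_regularizer(weights):
    """Sum of absolute weights, Eq. (13.5)."""
    return sum(abs(w) for w in weights)


def connection_count(connection_matrix):
    """Number of connections, i.e. the sum of all c_ij, Eq. (13.6)."""
    return sum(sum(row) for row in connection_matrix)


def regularized_cost(error, complexity, lam):
    """Single-objective cost J = E + lambda * Omega, Eq. (13.2)."""
    return error + lam * complexity


# Hypothetical data, chosen only to exercise the functions.
desired = [1.0, 0.0, 1.0, 0.0]
outputs = [0.5, 0.5, 1.0, 0.0]
weights = [0.5, -1.0, 0.0, 2.0]
connections = [[0, 0, 1], [1, 1, 0]]

E = mse(desired, outputs)
J = regularized_cost(E, gaussian_regularizer(weights), lam=0.1)
```

Note that the connection count is integer-valued and has no useful gradient with respect to the weights, which is consistent with the remark that it is not generally applicable to gradient-based learning.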

13.3.2 Multi-objective Optimization Approach to Regularization

It is quite straightforward to notice that neural network regularization in equation (13.2) can be reformulated as a bi-objective optimization problem:

$$\min \{f_{1}, f_{2}\} \tag{13.7}$$
$$f_{1} = E, \tag{13.8}$$
$$f_{2} = \Omega, \tag{13.9}$$

where $E$ is defined in equation (13.3), and $\Omega$ is one of the regularization terms defined in equation (13.4), (13.5), or (13.6).

Note that regularization is traditionally formulated as a single objective optimization problem as in Equation (13.2) rather than a multi-objective optimization problem as in equation (13.7). In our opinion, this tradition can be mainly attributed to the fact that traditional gradient-based learning algorithms are not able to solve multi-objective optimization problems.

13.3.3 Simultaneous Generation of Accurate and Interpretable Models

If evolutionary algorithms are used to solve the multi-objective optimization problem in Equation (13.7), multiple solutions with a spectrum of model complexity can be obtained in a single optimization run. Neural network models with a sufficient but not overly high complexity are expected to perform a good approximation on both training and test data. Meanwhile, interpretable symbolic rules can be extracted from those of a lower complexity [28].

An important issue is how to choose models that are of good accuracy. One possibility is to employ the model selection criteria we mentioned above, such as AIC, BIC, cross-validation, and bootstrap. With the Pareto front that trades off between training error and complexity at hand, an interesting question that arises is whether it is possible to pick out the accurate models without resorting to conventional model selection criteria. In this chapter, a first step in this direction is made by analyzing the performance gain with respect to complexity increase of the obtained Pareto-optimal solutions. In order to generate interpretable neural networks, rules are extracted from the simplest Pareto-optimal neural networks based on the decompositional approach. To illustrate the feasibility of the suggested idea, simulations are conducted on two widely used benchmark problems, the breast cancer diagnosis data and the iris data.
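To make the bi-objective view concrete, the following sketch filters a set of candidate models, each described by an (error, complexity) pair, down to its Pareto-optimal subset and then computes the error reduction obtained per unit of added complexity between neighbouring solutions. The ratio used here is only one plausible reading of "performance gain with respect to complexity increase"; the function names and the sample values are assumptions for illustration.

```python
def dominates(a, b):
    """True if a is no worse than b in every objective and strictly
    better in at least one (both objectives are minimized)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def pareto_front(points):
    """Return the non-dominated (error, complexity) pairs, sorted by complexity."""
    front = [p for p in points if not any(dominates(q, p) for q in points)]
    return sorted(set(front), key=lambda p: p[1])


def gain_per_complexity(front):
    """Error reduction per unit of added complexity between neighbours
    on a front that is sorted by increasing complexity."""
    return [
        (e_simple - e_complex) / (c_complex - c_simple)
        for (e_simple, c_simple), (e_complex, c_complex) in zip(front, front[1:])
    ]


# Hypothetical models: (training error f1, number of connections f2).
models = [(0.30, 2), (0.10, 4), (0.09, 8), (0.20, 5), (0.30, 3)]
front = pareto_front(models)
gains = gain_per_complexity(front)
```

In this made-up example the gain drops sharply after the second solution, which is the kind of pattern one would look for when deciding where additional complexity stops paying off.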

13.4 Evolutionary Multi-objective Model Generation

13.4.1 Parameter and Structure Representation of the Network

A connection matrix and a weight matrix are employed to describe the structure and the weights of the neural networks. The connection matrix specifies the structure of the network, whereas the weight matrix determines the strength of each connection. Assume that a neural network consists of $M$ neurons in total, including the input and output neurons; then the size of the connection matrix is $M \times (M + 1)$, where an element in the last column indicates whether a neuron is connected to a bias value. In the matrix, if element $c_{ij}$, $i = 1, ..., M$, $j = 1, ..., M$, equals 1, it means that there is a connection between the $i$-th and $j$-th neuron and the signal flows from neuron $j$ to neuron $i$. If $j = M + 1$, it indicates that there is a bias in the $i$-th neuron. Fig. 13.1 illustrates a connection matrix and the corresponding network structure. It can be seen from the figure that the network has two input neurons, two hidden neurons, and one output neuron. Besides, both hidden neurons have a bias.

The strength (weight) of the connections is defined in the weight matrix. Accordingly, if the $c_{ij}$ in the connection matrix equals zero, the corresponding element in the weight matrix must be zero too.

[Fig. 13.1. A connection matrix and the corresponding network structure.]

13.4.2 Evolution and Life-time Learning

A genetic algorithm has been used for optimizing the structure and weights of the neural networks. Binary coding is adopted for representing the neural network structure and real-valued coding for encoding the weights. Five genetic operations have been introduced in the evolution of the neural networks, four of which mutate the connection matrix (neural network structure) and one of which mutates the weights. The four structural mutation operators are the insertion of a hidden neuron, the deletion of a hidden neuron, the insertion of a connection and the deletion of a connection [19]. A Gaussian-type mutation is applied to mutate the weight matrix. No crossover has been employed in this algorithm.
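The representation can be sketched with plain nested lists. The matrix below is an assumption that matches the verbal description of Fig. 13.1 (two inputs, two hidden neurons with a bias each, one output); the helper names, the mutation step sizes and the random seed are hypothetical, and the code uses 0-based indices where the text uses 1-based ones. Insertion and deletion of hidden neurons are left out to keep the sketch short.

```python
import random

M = 5  # total number of neurons: 2 inputs, 2 hidden, 1 output

# Row i lists the incoming connections of neuron i; the last column marks a bias.
C = [
    [0, 0, 0, 0, 0, 0],  # neuron 1: input
    [0, 0, 0, 0, 0, 0],  # neuron 2: input
    [1, 1, 0, 0, 0, 1],  # neuron 3: hidden, fed by neurons 1 and 2, with bias
    [1, 1, 0, 0, 0, 1],  # neuron 4: hidden, fed by neurons 1 and 2, with bias
    [0, 0, 1, 1, 0, 0],  # neuron 5: output, fed by neurons 3 and 4
]


def masked(C, W):
    """Return a copy of W in which every weight without a connection is zero."""
    return [[w if c else 0.0 for c, w in zip(c_row, w_row)]
            for c_row, w_row in zip(C, W)]


def insert_connection(C, W, i, j, rng, sigma=0.1):
    """Structural mutation: add connection j -> i with a small random weight."""
    C[i][j] = 1
    W[i][j] = rng.gauss(0.0, sigma)


def delete_connection(C, W, i, j):
    """Structural mutation: remove connection j -> i and zero its weight."""
    C[i][j] = 0
    W[i][j] = 0.0


def mutate_weights(C, W, rng, sigma=0.05):
    """Gaussian mutation applied only to weights of existing connections."""
    for i, row in enumerate(C):
        for j, c in enumerate(row):
            if c:
                W[i][j] += rng.gauss(0.0, sigma)


rng = random.Random(0)
W = masked(C, [[rng.gauss(0.0, 0.2) for _ in range(M + 1)] for _ in range(M)])
```

Keeping the weight matrix masked by the connection matrix after every operation enforces the constraint that a zero entry in the connection matrix implies a zero weight.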

After mutation, an improved version of the Rprop algorithm [20] has been employed to train the weights. This can be seen as a kind of life-time learning (the first objective only) within a generation. Both the Baldwin effect and Lamarckian evolution have been examined in this work, which will be discussed further in Section 13.4.3.

The Rprop learning algorithm [34] is believed to be a fast and robust learning algorithm. Let $w_{ij}$ denote the weight connecting neuron $j$ and neuron $i$; then the change of the weight ($\Delta w_{ij}$) in each iteration is as follows:

$$\Delta w_{ij}^{(t)} = -\operatorname{sign}\left(\frac{\partial E^{(t)}}{\partial w_{ij}}\right) \cdot \Delta_{ij}^{(t)}, \tag{13.10}$$

where $\operatorname{sign}(\cdot)$ is the sign function and $\Delta_{ij}^{(t)} \ge 0$ is the step-size, which is initialized to $\Delta_{0}$ for all weights. The step-size for each weight is adjusted as follows:

$$\Delta_{ij}^{(t)} = \begin{cases} \xi^{+} \cdot \Delta_{ij}^{(t-1)}, & \text{if } \dfrac{\partial E^{(t-1)}}{\partial w_{ij}} \cdot \dfrac{\partial E^{(t)}}{\partial w_{ij}} > 0 \\[2ex] \xi^{-} \cdot \Delta_{ij}^{(t-1)}, & \text{if } \dfrac{\partial E^{(t-1)}}{\partial w_{ij}} \cdot \dfrac{\partial E^{(t)}}{\partial w_{ij}} < 0 \\[2ex] \Delta_{ij}^{(t-1)}, & \text{otherwise} \end{cases} \tag{13.11}$$

where $0 < \xi^{-} < 1 < \xi^{+}$. To prevent the step-sizes from becoming too large or too small, they are bounded by $\Delta_{\min} \le \Delta_{ij} \le \Delta_{\max}$.

One exception must be considered. After the weights are updated, it is necessary to check if the partial derivative changes sign, which indicates that the previous step might be too large and thus a minimum has been missed. In this case, the previous weight change should be retracted:

$$\Delta w_{ij}^{(t)} = -\Delta_{ij}^{(t-1)}, \quad \text{if } \frac{\partial E^{(t-1)}}{\partial w_{ij}} \cdot \frac{\partial E^{(t)}}{\partial w_{ij}} < 0. \tag{13.12}$$

Recall that if the weight change is retracted in the $t$-th iteration, $\partial E^{(t)} / \partial w_{ij}$ should be set to 0.

In reference [20], it is argued that the condition for weight retraction in equation (13.12) is not always reasonable. The weight change should be retracted only if the partial derivative changes sign and if the approximation error increases. Thus, the weight retraction condition in equation (13.12) is modified as follows:

$$\Delta w_{ij}^{(t)} = -\Delta_{ij}^{(t-1)}, \quad \text{if } \frac{\partial E^{(t-1)}}{\partial w_{ij}} \cdot \frac{\partial E^{(t)}}{\partial w_{ij}} < 0 \text{ and } E^{(t)} > E^{(t-1)}. \tag{13.13}$$

It has been shown on several benchmark problems in [20] that the modified Rprop (termed Rprop+ in [20]) exhibits consistently better performance than the Rprop algorithm.
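The update rules in Equations (13.10) to (13.13) can be sketched as a single per-iteration function over a flat list of weights. This is a minimal sketch under stated assumptions, not the implementation used in the chapter: retraction is realized by undoing the previous weight change, the default factors 1.2 and 0.5 and the step-size bounds are the values commonly quoted for Rprop rather than the chapter's settings, and `new_state` and `rprop_plus_step` are hypothetical names.

```python
def sign(x):
    """Sign function returning -1, 0 or 1."""
    return (x > 0) - (x < 0)


def new_state(n, step_init=0.1):
    """Per-weight bookkeeping: step sizes, previous gradients and weight changes."""
    return {
        "step": [step_init] * n,
        "prev_grad": [0.0] * n,
        "prev_dw": [0.0] * n,
        "prev_err": float("inf"),
    }


def rprop_plus_step(w, grad, state, err, xi_plus=1.2, xi_minus=0.5,
                    step_min=1e-6, step_max=50.0):
    """Update the weight list w in place, given the gradient and error at w."""
    for k in range(len(w)):
        product = state["prev_grad"][k] * grad[k]
        # Step-size adaptation, Eq. (13.11), with bounds on the step size.
        if product > 0:
            state["step"][k] = min(state["step"][k] * xi_plus, step_max)
        elif product < 0:
            state["step"][k] = max(state["step"][k] * xi_minus, step_min)
        if product < 0 and err > state["prev_err"]:
            # Retraction condition of Eq. (13.13): undo the previous change
            # and store a zero gradient so the next step is a plain update.
            dw = -state["prev_dw"][k]
            stored_grad = 0.0
        else:
            # Regular update, Eq. (13.10).
            dw = -sign(grad[k]) * state["step"][k]
            stored_grad = grad[k]
        w[k] += dw
        state["prev_dw"][k] = dw
        state["prev_grad"][k] = stored_grad
    state["prev_err"] = err


# Toy usage on a quadratic error E(w) = sum_k (w_k - target_k)^2.
target = [1.0, -2.0]
w = [0.0, 0.0]
state = new_state(len(w))
for _ in range(200):
    err = sum((a - b) ** 2 for a, b in zip(w, target))
    grad = [2.0 * (a - b) for a, b in zip(w, target)]
    rprop_plus_step(w, grad, state, err)
final_err = sum((a - b) ** 2 for a, b in zip(w, target))
```

Only the sign of each partial derivative enters the update, so the magnitude of the gradient does not affect the step length; the error value is needed solely to decide whether a sign change should trigger retraction.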

13.4.3 Baldwin Effect and Lamarckian Evolution

When life-time learning is embedded in evolution, two strategies can often be adopted. In the first approach, the fitness value of each individual will be replaced with the new fitness resulting from learning. However, the allele of the chromosome remains unchanged. In the second approach, both the fitness and the allele are replaced with the new values found in the life-time learning. These two strategies are known as the Baldwin effect and the Lamarckian evolution, which are two terms borrowed from biological evolution [5, 6].

13.4.4 Elitist Non-dominated Sorting

In this work, the elitist non-dominated sorting method proposed in the NSGA-II algorithm [13] has been adopted. Assume the population size is $N$. At first, the offspring and the parent populations are combined. Then, a non-domination rank and a local crowding distance are assigned to each individual in the combined population. In selection, all non-dominated individuals (say there are $N_{1}$ non-dominated solutions) in the combined population are passed to the offspring population, and are removed from the combined population. Now the combined population has $2N - N_{1}$ individuals. If $N_{1} < N$, the non-dominated solutions in the current combined population (say there are $N_{2}$ non-dominated solutions) will be passed to the offspring population. This procedure is repeated until the offspring population is filled. It could happen that the number of non-dominated solutions in the current combined population ($N_{i}$) is larger than the number of remaining slots ($N - N_{1} - N_{2} - ... - N_{i-1}$) in the current offspring population. In this case, the $N - N_{1} - N_{2} - ... - N_{i-1}$ individuals with the largest crowding distance from the $N_{i}$ non-dominated individuals will be passed to the offspring generation.

13.5 Simulation Results

13.5.1 Benchmark Problems

To illustrate the feasibility of the idea of generating accurate and interpretable models simultaneously using the evolutionary multi-objective optimization approach, the breast cancer data and the iris data in the UCI repository of machine learning database [33] are studied in this work.

The breast cancer data contains 699 examples, each of which has 9 inputs and 2 outputs. The inputs are: clump thickness ($x_{1}$), uniformity of cell size ($x_{2}$), uniformity of cell shape ($x_{3}$), marginal adhesion ($x_{4}$), single epithelial cell size ($x_{5}$), bare nuclei ($x_{6}$), bland chromatin ($x_{7}$), normal nucleoli ($x_{8}$), and mitosis ($x_{9}$). All inputs are normalized; to be more exact, $x_{1}, ..., x_{9} \in \{0.1, 0.2, ..., 0.8, 0.9, 1.0\}$. The two outputs are complementary binary values, i.e., if the first output is 1, which means "benign", then the second output is 0. Otherwise, the first output is

and ξ − = 0. Thus.2. which might indicate that efficient life-time learning becomes more important for harder learning tasks.2]. The 150 data samples can be divided into 3 classes.2. 13. The standard deviation of the Gaussian mutations applied on the weight matrix is set to 0. In this work. 40 data samples for each class are used for training and the rest 10 data samples are for test.300 Y. The maximum number of hidden nodes is set to 10. The weights of the network are initialized randomly in the interval of [−0. Iris-Setosa (class 1). 13. The iris data has 4 attributes. Jin et al. 13. From these results.2.3 Comparison of Baldwin Effect and Lamarckian Evolution To compare the performance of the Baldwin effect and Lamarkian evolution in the context of evolutionary multi-objective optimization of neural networks. In the Rprop+ algorithm. ξ + = 1. 0. the step-sizes are initialized to 0. 13. Sepal-width. (13. . distribution and spread. The hidden neurons are nonlinear and the output neurons are linear.5.2 Experimental Setup The population size is 100 and the optimization is run for 200 generations. recalling that the breast cancer data has more input attributes and more data samples.5. each class having 50 samples. only the results using the Lamarckian evolution will be considered in the following sections. x g(z) = . the Pareto fronts achieved using the the two strategies are plotted in Fig. The test data are unavailable to the algorithm during the evolution. One of the five mutation operations is randomly selected and performed on each individual. 13. The data samples are divided into two groups: one training data set containing 599 samples and one test data set containing 100 samples. and Iris-Virginica (class 3). Although a non-layered neural network can be generated using the coding scheme described in Section 13. which means “malignant”. The activation function used for the hidden neurons is as follows. 
which are the default values recommended in [20] and 50 iterations are implemented in each local search.0125 and bounded between [0.3.3 for the breast cancer data and in Fig. only feedforward networks with one hidden layer are generated in this research.14) 1 + |x| which is illustrated in Fig. Therefore. The spread of the Pareto-optimal solutions is par- ticularly poor for the breast cancer data. Iris-Versicolor (class 2). namely.2.05. Sepal-length. and the second output is 1. only the first output is considered in this work.4 for the iris data. Petal- length and Petal-width. 0. it can be seen that the solution set from the Lamarkian evolution shows clearly better performance in all aspects including closeness. 50] during the adaptation.
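The elitist selection of Section 13.4.4 (fill the next population front by front, and break ties in the last front by crowding distance) can be sketched as follows. This is a simplified illustration under stated assumptions, not the implementation used in the chapter: both objectives are minimized, fronts are extracted with a plain quadratic-time scan, and the names `dominates`, `non_dominated`, `crowding_distance` and `elitist_select`, as well as the objective values, are hypothetical.

```python
import numpy as np

def dominates(a, b):
    """True if a is no worse than b in every objective and better in at least one."""
    return bool(np.all(a <= b) and np.any(a < b))

def non_dominated(indices, F):
    """Indices (out of `indices`) that no other listed solution dominates."""
    return [i for i in indices
            if not any(dominates(F[j], F[i]) for j in indices if j != i)]

def crowding_distance(front, F):
    """Crowding distance of each solution inside one front."""
    dist = {i: 0.0 for i in front}
    for k in range(F.shape[1]):
        order = sorted(front, key=lambda i: F[i, k])
        dist[order[0]] = dist[order[-1]] = float("inf")  # keep the extremes
        span = F[order[-1], k] - F[order[0], k]
        if span == 0:
            continue
        for prev, cur, nxt in zip(order, order[1:], order[2:]):
            dist[cur] += (F[nxt, k] - F[prev, k]) / span
    return dist

def elitist_select(F, n_keep):
    """Pick n_keep rows of F (combined parents and offspring) front by front."""
    remaining = list(range(len(F)))
    chosen = []
    while len(chosen) < n_keep and remaining:
        front = non_dominated(remaining, F)
        slots = n_keep - len(chosen)
        if len(front) > slots:
            # Last front does not fit: keep the least crowded solutions.
            dist = crowding_distance(front, F)
            front = sorted(front, key=lambda i: dist[i], reverse=True)[:slots]
        chosen.extend(front)
        remaining = [i for i in remaining if i not in front]
    return chosen

# Hypothetical objective values: (training MSE, number of connections).
F = np.array([[0.05, 4], [0.03, 6], [0.02, 7],
              [0.04, 8], [0.03, 9], [0.015, 14]])
selected_five = elitist_select(F, 5)
selected_three = elitist_select(F, 3)
print(selected_five, selected_three)
```

In the first call the whole first front fits and one solution of the second front fills the last slot; in the second call the first front is too large and is truncated by crowding distance.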

Fig. 13.2. The activation function of the hidden nodes (horizontal axis: x; vertical axis: g(x)).

Fig. 13.3. Pareto fronts obtained using the Baldwin effect (circles) and Lamarckian evolution (diamonds) for the breast cancer data (horizontal axis: MSE on training data; vertical axis: number of connections).

13.5.4 Rule Extraction from Simple Neural Networks

To obtain interpretable neural networks, we attempt to extract symbolic rules from simple neural networks. For the breast cancer data, the simplest Pareto-optimal neural network has only 4 connections: 1 input node, one hidden node and 2 biases, see Fig 13.5(a). The mean squared error (MSE) of the network on training and test data are 0.0546 and 0.0324, respectively.

Fig. 13.4. Pareto fronts obtained using the Baldwin effect (circles) and Lamarckian evolution (diamonds) for the iris data (horizontal axis: MSE on training data; vertical axis: number of connections).

Fig. 13.5. The simplest neural network for the breast cancer data. Labels recovered from the figure: x2, y, 8.21, −0.68, −2.33, 0.57, 1.0, 1.0.

Assuming that a case can be decided to be “malignant” if y < 0.25, and “benign” if y > 0.75, we can then derive that a case is “malignant” if x2 > 0.4, which means that x2 ≥ 0.5 because the inputs only take the values 0.1, 0.2, ..., 1.0, and “benign” if x2 < 0.22, i.e., x2 ≤ 0.2. From such a simple network, the following two symbol-type rules can be extracted (denoted as MOO NN1):

R1: If x2 (uniformity) ≥ 0.5, then malignant; (13.15)
R2: If x2 (uniformity) ≤ 0.2, then benign. (13.16)

Based on these two simple rules, only 2 out of 100 test samples will be misclassified, and 4 of them cannot be decided with a predicted value of 0.49, which is very ambiguous. The prediction results on the test data are presented in Fig 13.6(a).

Now let us look at the second simplest network, which has 6 connections in total. The connection and weights of the network are given in Fig. 13.5(b), and the prediction results are provided in Fig. 13.6(b). The MSE of the network on training and test data are 0.0312 and 0.0203, respectively.

Fig. 13.6. The next simplest neural network on the breast cancer data.

In this network, x2, x4 and x6 are present. If the same assumptions are used in deciding whether a case is benign or malignant, then we could extract the following rules (denoted as MOO NN2):

R1: If x2 (uniformity) ≥ 0.6 or
    x6 (bare nuclei) ≥ 0.9 or
    x2 (uniformity) ≥ 0.5 ∧ x6 (bare nuclei) ≥ 0.2 or
    x2 (uniformity) ≥ 0.4 ∧ x6 (bare nuclei) ≥ 0.4 or
    x2 (uniformity) ≥ 0.3 ∧ x6 (bare nuclei) ≥ 0.5 or
    x2 (uniformity) ≥ 0.2 ∧ x6 (bare nuclei) ≥ 0.7, then malignant; (13.17)
R2: If x2 (uniformity) ≤ 0.1 ∧ x6 (bare nuclei) ≤ 0.4 or
    x2 (uniformity) ≤ 0.2 ∧ x6 (bare nuclei) ≤ 0.2, then benign. (13.18)

Compared to the simplest network, with the introduction of two additional features x6 and x4 (although the influence of x4 is too small to be reflected in the rules), the number of cases that are misclassified has been reduced to 1, whereas the number of cases on which no decision can be made remains 4, although the ambiguity of the decision for the four cases did decrease.
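How the thresholds of the MOO NN1 rules follow from the simplest network can be checked with a short sketch. The assignment of the four edge labels printed in Fig. 13.5 to input weight, hidden bias, output weight and output bias is an assumption made here; it is chosen because it reproduces the ambiguous prediction of 0.49 quoted in the text. The function and argument names are hypothetical. Because the printed weights are rounded, inputs close to a decision boundary may come out differently than in the original experiment.

```python
def g(z):
    """Hidden-node activation of Eq. (13.14)."""
    return z / (1.0 + abs(z))

def simplest_net(x2, w_in=8.21, b_hidden=-2.33, w_out=-0.68, b_out=0.57):
    """One input, one hidden node, one linear output (assumed weight assignment)."""
    return b_out + w_out * g(w_in * x2 + b_hidden)

def decide(y, low=0.25, high=0.75):
    """Decision thresholds assumed in the text: y < 0.25 malignant, y > 0.75 benign."""
    if y < low:
        return "malignant"
    if y > high:
        return "benign"
    return "undecided"

for x2 in (0.2, 0.3, 0.5):
    y = simplest_net(x2)
    print(x2, round(y, 2), decide(y))
```

Under these assumptions x2 = 0.5 falls on the malignant side, x2 = 0.2 on the benign side, and x2 = 0.3 yields the undecided output of about 0.49.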

Fig. 13.7. The simplest neural network for the iris data.

Similar analysis has been done on the Iris data. The simplest neural network has 8 connections, see Fig. 13.7, where only one attribute, x3 (Petal-Length), has been selected. By analyzing the structure of the neural network, it can be seen that the network can only distinguish the class Iris-Setosa (class 1) from others, but is still too simple to distinguish between Iris-Versicolor and Iris-Virginica. One rule can be extracted from the neural network:

R1: If x3 (Petal-Length) < 2.2, then Setosa. (13.19)

The second simple Pareto-optimal neural network has 13 connections with 2 inputs, refer to Fig. 13.8. Though the structure is still very simple, the network is able to separate three classes correctly. From the neural network structure, the following three rules can be extracted:

R1: If x3 (Petal-Length) ≤ 2.2 and x4 (Petal-Width) < 1.0, then Setosa; (13.20)
R2: If x3 (Petal-Length) > 2.2 and x4 (Petal-Width) < 1.4, then Versicolor; (13.21)
R3: If x4 (Petal-Width) > 1.8, then Virginica. (13.22)

From the above analyses, we have been successful in extracting understandable symbolic rules from the simple neural networks generated using the evolutionary multi-objective algorithm. Thus, we demonstrate that the idea to generate interpretable neural networks by reducing the complexity of the neural networks is feasible.

Fig. 13.8. The second simplest neural network for the iris data.

13.5.5 Identifying Accurate Neural Network Models

One advantage of the evolutionary multi-objective optimization approach to neural network generation is that a number of neural networks with a spectrum of complexity can be obtained in one single run. With these neural networks that trade off between training error and model complexity, we hope that we can identify those networks that are most likely to have good generalization capability, i.e., perform well both on training data and unseen test data.

In multi-objective optimization, the “knee” of a Pareto front is often considered interesting. The reason is that for the solutions at the knee, a small improvement in one objective will cause large deterioration in the other objective. However, such a knee is not always obvious in some applications. Thus, it is important to define a quantitative measure for the knee according to the problem at hand. In [17], some interesting work has been presented to determine the number of clusters by exploiting the Pareto front in multi-objective data clustering.

If we take a closer look at the Pareto fronts that we obtained for the two benchmark problems, we notice that when the model complexity increases, the accuracy on training data also increases. However, the amplitude in accuracy increase is not always the same when the number of connections increases. Thus, we introduce the concept of “normalized performance gain (NPG)”, which is defined as follows:

$$\mathrm{NPG} = \frac{MSE_i - MSE_j}{N_i - N_j}, \tag{13.23}$$

where $MSE_i$, $MSE_j$ and $N_i$, $N_j$ are the MSE on training data and the number of connections of the i-th and j-th Pareto optimal solutions.
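A minimal sketch of how the NPG of Eq. (13.23) could be evaluated along a Pareto front is given below. Several points are assumptions of this sketch rather than statements of the chapter: the gain is computed between neighbouring solutions sorted by increasing number of connections, its sign is chosen so that a drop in training error gives a positive value, and the knee is read off with a simple hypothetical rule (the last solution reached before the gain first falls below a fraction of the largest gain). The front values are invented for illustration.

```python
def normalized_performance_gain(front):
    """front: list of (n_connections, training_mse) pairs of Pareto-optimal networks.

    Returns (n_connections_of_larger_network, gain) for neighbouring solutions,
    where gain is the decrease in MSE per added connection.
    """
    pts = sorted(front)
    gains = []
    for (n_small, mse_small), (n_large, mse_large) in zip(pts, pts[1:]):
        gains.append((n_large, (mse_small - mse_large) / (n_large - n_small)))
    return gains

def knee_by_threshold(gains, fraction=0.5):
    """Hypothetical knee rule: stop at the first gain below fraction * max gain."""
    threshold = fraction * max(gain for _, gain in gains)
    knee = None
    for n, gain in gains:
        if gain < threshold:
            break
        knee = n
    return knee

# Invented front: (number of connections, MSE on training data).
front = [(4, 0.055), (6, 0.035), (7, 0.025), (14, 0.020), (20, 0.019)]
gains = normalized_performance_gain(front)
knee = knee_by_threshold(gains)
print(gains, knee)
```

With these invented values the gain is large up to 7 connections and then collapses, so the sketch reports the 7-connection network as the knee.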

The normalized performance gain on the breast cancer data is illustrated in Fig. 13.9. We notice that the largest NPG is achieved when the number of connections is increased from 4 to 6 and from 6 to 7. After that, the NPG decreases almost to zero when the number of connections is increased to 14, though there are some small fluctuations. Thus, we denote the Pareto optimal solution with 7 connections as the knee of the Pareto front, see Fig 13.10.

Fig. 13.9. Normalized performance gain for the breast cancer data (horizontal axis: number of connections; vertical axis: normalized performance gain).

Fig. 13.10. Complexity versus training error for the breast cancer data (horizontal axis: number of connections; vertical axis: training error). The five solutions right to the knee are circled.

To investigate if there is any correlation between the knee point and the approximation error on test data, we examine the test error of the neighboring five Pareto-optimal solutions right to the knee, which are circled in Fig. 13.10.

Fig. 13.11. Complexity versus test error for the breast cancer data (horizontal axis: number of connections; vertical axis: test error). The circled solutions are those neighbor solutions right to the knee on the Pareto front.

Very interestingly, we can see from Fig 13.11 that all the five solutions belonging to the cluster also have the smallest error on test data.

We do the same analysis on the iris data. The NPG on the iris data is presented in Fig. 13.12. Similarly, the NPG is large when the number of connections increases from 8 to 13 and from 13 to 17. After that, the NPG decreases dramatically. Thus, the solution with 14 connections is considered as the knee of the Pareto front, see Fig. 13.13.

Fig. 13.12. Normalized performance gain for the iris data (horizontal axis: number of connections; vertical axis: normalized performance gain).

Fig. 13.13. Complexity versus training error for the iris data (horizontal axis: number of connections; vertical axis: training error; the knee is marked).

However, when we look at the test error of all the solutions, refer to Fig. 13.14, we find that no observable overfitting has occurred with the increase in complexity. Actually, the test error shows very similar behavior to the training data with respect to the increase in complexity. In fact, the test errors are even smaller than the training error. If we take a look at the training and test data as shown in Fig. 13.15, this turns out to be natural, because none of the test samples (denoted by filled shapes) are located in the overlapping part of the different classes.

Fig. 13.14. Complexity versus training and test error for the iris data (horizontal axis: number of connections; vertical axis: MSE on training / test data). The training errors are denoted by diamonds and the test error by circles.

Fig. 13.15. Location of the training and test samples in the x3−x4 space of the iris data. Training and test data of class 1 are denoted by triangles and filled triangles, class 2 by diamonds and filled diamonds, and class 3 by triangles down and filled triangles down, respectively.

13.6 Conclusions

Generating accurate and interpretable neural networks simultaneously is very attractive in many applications. This work presents a method for achieving this goal using an evolutionary multi-objective algorithm. The accuracy on the training data and the number of connections of the neural network are taken as the two objectives in optimization. We believe the complexity is a very important objective for learning models in that both generalization performance and interpretability are closely related to the complexity.

By analyzing the achieved Pareto-optimal solutions that trade off between training accuracy and complexity, we are able to choose neural networks with a simpler structure for extracting interpretable symbolic rules. In the meantime, neural networks that have good accuracy on unseen data can also be picked out by choosing those solutions in the neighborhood of the knee of the Pareto front, if overfitting occurs.

Much work remains to be done in the future. First, it is interesting to compare our method with model selection criteria such as cross-validation or bootstrap methods. Second, this method should be verified on more benchmark problems.

References

[1] H.A. Abbass. A memetic Pareto evolutionary approach to artificial neural networks. In Proc. of the 14th Australian Joint Conf. on Artificial Intelligence, pages 1–12, 2001
[2] H.A. Abbass. An evolutionary artificial neural networks approach for breast cancer diagnosis. Artificial Intelligence in Medicine, 25(3):265–281, 2002
[3] H.A. Abbass. Pareto neuro-ensemble: Constructing ensembles of neural networks using multi-objective optimization. In Proc. Congress on Evolutionary Computation, pages 2074–2080, 2003
[4] H.A. Abbass. Speeding up back-propagation using multi-objective evolutionary algorithms. Neural Computation, 15(11):2705–2726, 2003
[5] D.H. Ackley and M.L. Littman. Interactions between learning and evolution. Artificial Life, 2:487–509, 1992
[6] D.H. Ackley and M.L. Littman. A case for Lamarckian evolution. Artificial Life, 3:3–10, 1994
[7] R. Andrews, J. Diederich, and A. Tickle. A survey and critique of techniques for extracting rules from trained artificial neural networks. Knowledge Based Systems, 8(6):373–389, 1995
[8] C.M. Bishop. Neural Networks for Pattern Recognition. Oxford University Press, Oxford, UK, 1995
[9] K.P. Burnham and D.R. Anderson. Model Selection and Multimodel Inference. Springer, New York, second edition, 2002
[10] A. Chandra and X. Yao. Evolutionary framework for the construction of diverse hybrid ensembles. In Proc. of 13th European Symposium on Artificial Neural Networks, pages 253–258, 2005
[11] C. Coello Coello, D. Veldhuizen, and G. Lamont. Evolutionary algorithms for solving multi-objective problems. Kluwer Academic, New York, 2002
[12] K. Deb. Multi-objective Optimization Using Evolutionary Algorithms. Wiley, Chichester, 2001
[13] K. Deb, S. Agrawal, A. Pratap, and T. Meyarivan. A fast elitist non-dominated sorting genetic algorithm for multi-objective optimization: NSGA-II. In Parallel Problem Solving from Nature, volume VI, pages 849–858, 2000
[14] W. Duch, R. Adamczak, and K. Grabczewski. Extraction of logical rules from backpropagation networks. Neural Processing Letters, 7:1–9, 1998
[15] W. Duch, R. Setiono, and J. Zurada. Computational intelligence methods for rule-based data understanding. Proceedings of the IEEE, 92(5):771–805, 2004
[16] J.E. Fieldsend and S. Singh. Optimizing forecast model complexity using multi-objective evolutionary algorithms. In Applications of Multi-objective Evolutionary Algorithms, C. Coello Coello, G.B. Lamont (eds.), pages 675–700. World Scientific, 2004
[17] J. Handl and J. Knowles. Exploiting the trade-off: The benefits of multiple objectives in data clustering. In Evolutionary Multi-Criterion Optimization, LNCS 3410, pages 547–560. Springer, 2005
[18] K. Hornik, M. Stinchcombe, and H. White. Multilayer feedforward networks are universal approximators. Neural Networks, 2(5):359–366, 1989
[19] M. Hüsken, J.E. Gayko, and B. Sendhoff. Optimization for problem classes – Neural networks that learn to learn. In Xin Yao and David B. Fogel, editors, IEEE Symposium on Combinations of Evolutionary Computation and Neural Networks (ECNN 2000), pages 98–109. IEEE Press, 2000
[20] C. Igel and M. Hüsken. Improving the Rprop learning algorithm. In Proc. of the 2nd ICSC Int. Symposium on Neural Computation, pages 115–121, 2000
[21] C. Igel. Multi-objective model selection for support vector machines. In Evolutionary Multi-Criterion Optimization, LNCS 3410, pages 534–546. Springer, 2005
[22] H. Ishibuchi and T. Yamamoto. Evolutionary multiobjective optimization for generating an ensemble of fuzzy rule-based classifiers. In Genetic and Evolutionary Computation Conference, pages 1077–1088, 2003
[23] M. Ishikawa. Rule extraction by successive regularization. Neural Networks, 13:1171–1183, 2000
[24] Y. Jin. Advanced Fuzzy Systems Design and Applications. Springer, Heidelberg, 2003
[25] Y. Jin and B. Sendhoff. Extracting interpretable fuzzy rules from RBF networks. Neural Processing Letters, 17(2):149–164, 2003
[26] Y. Jin, T. Okabe, and B. Sendhoff. Evolutionary multi-objective optimization approach to constructing neural network ensembles for regression. In Applications of Multi-objective Evolutionary Algorithms, C. Coello Coello, G.B. Lamont (eds.), pages 653–673. World Scientific, 2004
[27] Y. Jin, T. Okabe, and B. Sendhoff. Neural network regularization and ensembling using multi-objective evolutionary algorithms. In Congress on Evolutionary Computation, pages 1–8. IEEE, 2004
[28] Y. Jin, B. Sendhoff, and E. Körner. Evolutionary multi-objective optimization for simultaneous generation of signal-type and symbol-type representations. In Evolutionary Multi-Criterion Optimization, LNCS 3410, pages 752–766. Springer, 2005
[29] K. Kottathra and Y. Attikiouzel. A novel multicriteria optimization algorithm for the structure determination of multilayer feedforward neural networks. Journal of Network and Computer Applications, 19:135–147, 1996
[30] M.A. Kupinski and M.A. Anastasio. Multiobjective genetic optimization of diagnostic classifiers with implementations for generating receiver operating characteristic curves. IEEE Transactions on Medical Imaging, 18(8):675–685, 1999
[31] D.A. Miller and J.M. Zurada. A dynamical system perspective of structural learning with forgetting. IEEE Transactions on Neural Networks, 9(3):508–515, 1998
[32] S. Park, D. Nam, and C.H. Park. Design of a neural controller using multiobjective optimization for nonminimum phase. In Proc. of IEEE Int. Conf. on Fuzzy Systems, I:533–537, 1999
[33] L. Prechelt. PROBEN1 - a set of neural network benchmark problems and benchmarking rules. Technical Report 21/94, Fakultät für Informatik, Universität Karlsruhe, 1994
[34] M. Riedmiller and H. Braun. A direct adaptive method for faster backpropagation learning: The RPROP algorithm. In IEEE Int. Conf. on Neural Networks, volume 1, pages 586–591, New York, 1993
[35] R. Setiono. Generating concise and accurate classification rules for breast cancer diagnosis. Artificial Intelligence in Medicine, 18:205–219, 2000
[36] R. Setiono and H. Liu. Symbolic representation of neural networks. IEEE Computer, 29(3):71–77, 1996
[37] R.D. Reed and R.J. Marks II. Neural Smithing. The MIT Press, 1999
[38] I. Taha and J. Ghosh. Symbolic interpretation of artificial neural networks. IEEE Transactions on Knowledge and Data Engineering, 11(3):448–463, 1999
[39] R. de A. Teixeira, A.P. Braga, R.H.C. Takahashi, and R.R. Saldanha. Improving generalization of MLPs with multi-objective optimization. Neurocomputing, 35:189–194, 2000
[40] H. Wang, S. Kwong, Y. Jin, W. Wei, and K.F. Man. Agent-based evolutionary approach for interpretable rule-based knowledge extraction. IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews, 35(2):143–155, 2005
[41] S. Wiegand, C. Igel, and U. Handmann. Evolutionary multi-objective optimization of neural networks for face detection. Int. Journal of Computational Intelligence and Applications, 4:237–253, 2004

14 GA-Based Pareto Optimization for Rule Extraction from Neural Networks

Urszula Markowska-Kaczmar, Krystyna Mularczyk
Wroclaw University of Technology, Institute of Applied Informatics
Wyb. Wyspianskiego 27, 50-370 Wroclaw, Poland
urszula.markowska-kaczmar@pwr.wroc.pl

Summary. The chapter presents a new method of rule extraction from trained neural networks, based on a hierarchical multiobjective genetic algorithm. The problems associated with rule extraction, especially its multiobjective nature, are described in detail, and techniques used when approaching them with genetic algorithms are presented. The main part of the chapter contains a thorough description of the proposed method. It is followed by a discussion of the results of experimental study performed on popular benchmark datasets that confirm the method’s effectiveness.

U. Markowska-Kaczmar and K. Mularczyk: GA-Based Pareto Optimization for Rule Extraction from Neural Networks, Studies in Computational Intelligence (SCI) 16, 313–338 (2006)
www.springerlink.com © Springer-Verlag Berlin Heidelberg 2006

14.1 Introduction

For many years neural networks have been successfully used for solving various complicated problems. The areas of their applications include communication systems, signal processing, the theory of control, pattern and speech recognition, weather prediction and medicine. Neural networks are used especially in situations when algorithms for a given problem are unknown or too complex. In order to solve a problem, a network does not need an algorithm; instead, it must be provided with training examples that enable it to learn the correct solutions. The most commonly used neural networks have the input and output data presented in the form of vectors. Such a pair of vectors is an example of the relationship that a network should learn. The training consists of a certain iterative process of modifying the network’s parameters, so that the answer given by the network for each of the input patterns is as close to the desired one as possible.

The unquestionable advantage of neural networks is their ability of generalization, which consists in giving correct answers for new input patterns that had not been used for training. Their resistance to errors is equally important. This means that networks can produce the right answers also in the case of noisy or missing data. Finally, it should be emphasized that they are fast while processing data and relatively cheap to build.

Nevertheless, a serious disadvantage is that neural networks produce answers in a way that is incomprehensible for humans. Because in certain areas of applications, for instance in medicine, the trustworthiness of a system that supports people’s work is essential, the domain of extracting knowledge from neural networks has been developed since the 90s [1]. This knowledge describes in an understandable way the performance of a neural network that solves a given problem. In this case knowledge extraction from a network could make its behavior more trustworthy. Knowledge extraction may also be perceived as a tool of verifying what has actually been learned by a network.

The discovered knowledge, if presented in a comprehensible way, could be used for building expert systems as a source alternative or supplementary to the knowledge obtained from human experts. Rule extraction from neural networks might be used for this purpose as well as perceived as a method of automatic rule extraction for an expert system. The system itself could be built as a hybrid consisting of a neural network and a set of rules that describe its performance. Such a hybrid would combine the advantages of both components: it would work fast, which is typical of neural networks, and, on the other hand, it would offer a possibility of explaining the way of reaching a particular conclusion, which is the main quality of expert systems. Moreover, if the data changed, the system’s reconstruction could be performed relatively quickly by repeating the process of training the network and extracting knowledge.

Rapid development of data storage techniques has been observed in the last years. Data is presently perceived as a source of knowledge and exploited by a dynamically evolved area of computer science called data mining. Its methods aim at finding new and correct patterns in the processed data. The problems that data mining deals with include classification, regression, grouping and characteristics. The majority of them may be approached by means of neural networks.

Skeptics could ask why knowledge should be extracted from a network and not directly from data. It tends to be much easier to extract knowledge after having the data processed by a network. Networks’ ability to deal with noisy data should be emphasized again. However, if we imagine a network with dozens of inputs with various types, the problem of finding restrictions that describe the conditions of belonging to a given class becomes NP–hard. Searching such a vast space of solutions may be efficiently performed by genetic algorithms – another nature–based technique, which became the basis for a method of rule extraction called MulGEx (Multiobjective Genetic Extractor) that will be presented in this chapter.

where an adaptation of PAC algo- rithm is applied. There are many criteria of the evaluation of the extracted rules’ quality. where the first stage consists of de- scribing the conditions of a single neuron’s activation. [9]. Otherwise differ- ent approaches are developed in order to reduce the architecture of a neural network. i. which testifies not only to the impor- tance of the problem. [10]. However. substituting a cluster of neurons by one neuron. or [15]. Full RE [20]. for example genetic algorithms or decision tree methods. Such rules are created for all neurons and concatenated on the basis of mutual dependencies. Thus we obtain rules that describe the relations between the inputs and outputs for the entire network. The last group encompasses mixed methods that combine the two ap- proaches described above. observing only its inputs and responses produced at the outputs. but also to its complexity. RULEX [2]. Fuzzy rules [14].. The developed approaches may be divided into three main cate- gories: global. M of N [7]. in determining the values of its inputs that produce an output equal to 1. although works concerning rule extraction for the regression task appear as well [17]. It is worth mentioning that in such approaches the architecture of a neural network is insignificant. [5]. [18].e. too. BIO-RE [20] that applies truth tables to extract rules. Most of the methods concern multilayer perceptrons (MLP networks). local and mixed. The majority of methods apply to networks that perform the classification task. a network provides the method with training patterns. The problem of rule extraction by means of the local approach is simple if the network is relatively small. Global methods treat a network as a black box. In other words. based on inversion. first order rules and decision trees are used. Ruleneg [8]. methods dedicated to other networks are developed as well. 
Other algorithms in this group treat the rule extraction as a machine learning task where neural network delivers training patterns. The opposite approach is the local one. other optimize the structure of a neural network introducing a special training procedure or using genetic algorithms [16].. Knowledge acquired from neural networks is represented as crisp prepo- sitional rules. As examples one may mention Partial RE [20].g.2 The State-of-the-Art in Rule Extraction from Neural Networks The research on effective methods of acquiring knowledge from neural net- works has been carried out for 15 years. 14 Pareto Optimization in Rule Extraction 315 14. Some methods group the hidden neurons’ activations. This taxonomy bases on the degree to which a method examines the structure of a network. The examples include VIA [21] that uses a procedure similar to classical sensibility analysis. In this case different machine learning algorithms are used. e. The most frequently used include: .

• fidelity,
• accuracy,
• consistency,
• comprehensibility.

In [1] these four requirements are abbreviated to FACC. These criteria are discussed in detail by Ghosh and Taha in [20]. Fidelity stands for the degree to which the rules reflect the behavior of the network they have been extracted from. Accuracy is determined on the basis of the number of previously unseen patterns that have been correctly classified. Consistency occurs if, during different training sessions, the network produces sets of rules that classify unseen patterns in the same way. Comprehensibility is defined as the ratio of the number of rules to the number of premises in a single rule.

In real applications not all of them may be taken into account and their weight may be different. The main problem in rule extraction is that these criteria, especially fidelity and comprehensibility, tend to be contradictory. The least complex sets, consisting of few rules, cover usually only the most typical cases that are represented by large numbers of training patterns. If we want to improve a given set’s fidelity, we must add new rules that would deal with the remaining patterns and exceptions, thus making the set more complicated and less understandable. Therefore rule extraction requires finding a compromise between different criteria, since their simultaneous optimization is practically infeasible.

A good algorithm of rule extraction should have the following properties [21]:

• independence from the architecture of the network,
• no restrictions on the process of a network’s training,
• guaranteed correctness of obtained results,
• a mechanism of accurate description of a network’s performance.

A method that would meet all these requirements (or at least the vast majority) has not been developed yet. Some of the methods are applicable to enumerable or real data only, some require repeating the process of training or changing the network’s architecture, some require providing a default rule that is used if no other rule can be applied in a given case. The methods differ in computational complexity (that is not specified in most cases). Hence the necessity of developing new methods. This work is an attempt to fill this gap for a network that solves the problem of classification.

14.3 Problem Formulation

The presented method of extracting rules from a neural network belongs to the group of global methods. It treats the network as a black box (Fig. 14.1) that provides it with training patterns. For this reason the architecture of a network, i.e., the way of connecting individual neurons, is insignificant.
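The difference between fidelity and accuracy in the black-box setting can be made concrete with a small sketch. Everything in it is a hypothetical stand-in: `network_predict` plays the role of a trained network that is only queried for its outputs, `rule_predict` is a single hand-written rule, and the patterns and labels are invented. Both criteria are measured here as simple fractions, which is one possible reading of the definitions above.

```python
def fidelity(rule_predict, network_predict, patterns):
    """Fraction of patterns on which the rules agree with the network's answer."""
    agree = sum(rule_predict(x) == network_predict(x) for x in patterns)
    return agree / len(patterns)

def accuracy(rule_predict, patterns, labels):
    """Fraction of (previously unseen) patterns that the rules classify correctly."""
    correct = sum(rule_predict(x) == y for x, y in zip(patterns, labels))
    return correct / len(patterns)

# Stand-in for a trained network: only its input/output behaviour is used.
def network_predict(x):
    return int(x[0] + x[1] > 1.0)

# Stand-in for an extracted rule: IF x[0] > 0.5 THEN class 1 ELSE class 0.
def rule_predict(x):
    return int(x[0] > 0.5)

patterns = [(0.2, 0.3), (0.6, 0.6), (0.9, 0.05), (0.4, 0.9)]
labels = [0, 1, 1, 1]  # invented true classes

rule_fidelity = fidelity(rule_predict, network_predict, patterns)
rule_accuracy = accuracy(rule_predict, patterns, labels)
print(rule_fidelity, rule_accuracy)
```

The two numbers differ because fidelity compares the rule with the network, whereas accuracy compares it with the true classes.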

14 Pareto Optimization in Rule Extraction 317

Fig. 14.1. Neural network as a black box.

Every neuron (Fig. 14.2) in a neural network performs simple operations of addition and multiplication by calculating its total input (for the i-th neuron $net_i = \sum_j x_{ij} w_{ij}$, where $x_{ij}$ is the signal on the j-th input and $w_{ij}$ is the weight of this connection), which is followed by applying the activation function $f(net)$ and producing the neuron's output. Broadly speaking, the knowledge of this problem is encoded in the network's architecture, weights and activation functions of individual neurons. However, because of a large number of neurons in a network and the parallelism of processing, it is difficult to describe clearly how a network solves a problem.

Fig. 14.2. Model of neuron.

Let's assume that a network solves a classification problem. This resolves itself into dividing objects (patterns) into k mutually separable classes $C_1, C_2, \ldots, C_k$. Every p-th pattern will be described as an input vector $x_p = [x_{p1}, x_{p2}, \ldots, x_{pm}]$ presented to the network. After the training the network's output indicates the class that the pattern belongs to. The class is encoded using the

“1 of k” rule, i.e. only one of k outputs may produce the value 1 and the remaining ones must be equal to 0.

Our goal is to describe the performance of such a network in a comprehensible way. Undoubtedly, the most popular way of representing the extracted knowledge is the 0-order logic, i.e. rules of the following form:

$$\text{IF } prem_1 \text{ AND } prem_2 \ldots \text{ AND } prem_n \text{ THEN } class_v. \tag{14.1}$$

The IF..THEN.. notation is very natural and intuitive. One cannot expect every doctor, for instance, to be willing to acquaint themselves with the notation used in the first order logic (the calculus of predicates). This simple reason explains the popularity of the above-mentioned solution. The majority of classification problems, where there are no dependencies between the attributes, can be expressed by means of the 0-order logic.

A single premise $prem_i$ determines the constraints that must be imposed on the i-th input so that the network may classify the pattern into the class included in the conclusion. During the process of classification some of the attributes may prove to be insignificant. Therefore the conjunction of premises does not necessarily have to contain constraints for all attributes. Some of the premises may be eliminated, which indicates that all values of corresponding attributes are accepted by the rule.

Each premise in the left part of the rule corresponds to a constraint imposed on the i-th attribute $X_i$. Depending on the type of the attribute, a constraint is defined in one of the following ways:
• For a real type of attribute (discrete and continuous) it introduces the bounds of the range of acceptable values, i.e. $X_i \in [Value_{min}, Value_{max}]$.
• For enumerable attributes, a constraint represents a subset of permissible values, i.e. $X_i = Value_1$ OR $X_i = Value_2$ OR ... $X_i = Value_k$, where $Value_j$ belongs to the set of all possible values defined within a given type.
• For logical attributes, it determines which of the two values they must take, i.e. $X_i = Value$, where $Value \in \{true, false\}$.

The problem of classification, even in the presence of one class only, is connected with multimodal optimization, since in a general case the patterns belonging to one class cannot be covered by one rule. On the other side, if we recall the criteria that an extracted set of rules should fulfill, the problem of rule extraction turns out to be multiobjective. Therefore the purpose of a rule extraction algorithm is to produce one or more sets of rules, presented in an understandable way, e.g. in the above-mentioned notation, that would meet the criteria defined in the previous section. The most important features are that these sets reflect the performance of the network both for training patterns and the previously unseen ones, and that the number and complexity of the rules they consist of are kept at a relatively low level.
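The rule form of Eq. 14.1 and the three kinds of premises can be illustrated with a minimal Python sketch. The class names (`RealPremise`, `EnumPremise`, `LogicalPremise`, `Rule`) and the example values are assumptions made for illustration only; they are not taken from the chapter.

```python
from dataclasses import dataclass
from typing import Any, Dict, FrozenSet, Sequence, Union


@dataclass(frozen=True)
class RealPremise:
    """X_i in [low, high] (real attribute, discrete or continuous)."""
    low: float
    high: float

    def accepts(self, value: float) -> bool:
        return self.low <= value <= self.high


@dataclass(frozen=True)
class EnumPremise:
    """X_i = Value_1 OR ... OR X_i = Value_k (enumerable attribute)."""
    allowed: FrozenSet[Any]

    def accepts(self, value: Any) -> bool:
        return value in self.allowed


@dataclass(frozen=True)
class LogicalPremise:
    """X_i = Value with Value in {True, False} (logical attribute)."""
    value: bool

    def accepts(self, value: bool) -> bool:
        return value == self.value


Premise = Union[RealPremise, EnumPremise, LogicalPremise]


@dataclass
class Rule:
    # Maps attribute index i to its premise. A missing index means the
    # premise was eliminated, so every value of X_i is accepted.
    premises: Dict[int, Premise]
    conclusion: int  # index v of the class in "THEN class_v"

    def matches(self, pattern: Sequence[Any]) -> bool:
        """True if the pattern satisfies the conjunction of all premises."""
        return all(p.accepts(pattern[i]) for i, p in self.premises.items())


# Hypothetical rule: IF X0 in [4.5, 5.5] AND X2 in {red, green} THEN class 1
rule = Rule(
    premises={0: RealPremise(4.5, 5.5),
              2: EnumPremise(frozenset({"red", "green"}))},
    conclusion=1,
)
print(rule.matches([5.0, True, "red"]))   # attribute 1 is unconstrained
print(rule.matches([6.0, True, "red"]))   # violates the range on attribute 0
```

Storing only the retained premises is one possible design; the chapter's own chromosome keeps every premise and switches it off with a flag.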

14.4 Pareto Optimization

The most common way of dealing with multiobjective problems is weighted aggregation, but its main disadvantage consists in the necessity for choosing weights that determine the relative importance of the criteria. The weights are used for fitness calculation and influence the degree to which particular criteria are optimized, forcing the algorithm to respect certain priorities. For example, in rule extraction one must decide whether rules should be very accurate or rather more concise and comprehensible. This decision rests entirely on the user and must be made before the algorithm is run. The problem is that choosing the correct weights is not necessarily easy and can be done only approximately. A significant role is played by intuition and therefore the results produced by the algorithm may prove not to be satisfactory. Another drawback is that depending on the purpose of rule extraction, different criteria may be important and different sets of weights may be required to meet all the needs. In both cases the algorithm requires rerunning with modified parameters, which, especially for complex problems, tends to be very time-consuming.

Pareto optimization is an alternative to scalarization that enables avoiding the operation of converting the values of criteria into a single value. All criteria are equally taken into account and therefore the purpose of the algorithm is to produce a whole set of solutions with different merits and disadvantages, leaving the final choice of the best solution to the user. Such an algorithm would create complex and accurate rules as well as more general and comprehensible ones.

The method in question is based on the concept of Pareto domination, a relation of quasi order, denoted by the symbol $\prec$, that makes it possible to compare two solutions represented as vectors consisting of the values of individual criteria. Let's assume that there are two solutions, x and y, and a function f that is used for measuring the solutions' quality, such that:

$$f(x) = [f_1(x), f_2(x), \ldots, f_m(x)], \tag{14.2}$$

$$f(y) = [f_1(y), f_2(y), \ldots, f_m(y)]. \tag{14.3}$$

On the assumption that all criteria are minimized, the relation of dominance is defined in the following way:

$$f(x) \prec f(y) \Leftrightarrow (\forall k = 1, \ldots, m)\; f_k(x) \le f_k(y) \;\wedge\; (\exists k)\; f_k(x) < f_k(y). \tag{14.4}$$

This means that f(x) dominates f(y) if two conditions are met, namely: all the criteria values (vector elements) in x are at least as good as the corresponding values in y, and at least one of them is better than in y. It is worth noticing that two solutions may not be bound by this relation at all, for example:

$$f_1(x) < f_1(y) \wedge f_2(x) > f_2(y) \Rightarrow \neg(f(x) \prec f(y)) \wedge \neg(f(y) \prec f(x)). \tag{14.5}$$
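The dominance test of Eq. 14.4 translates almost directly into code. The following is a minimal sketch in Python, assuming that every criterion is minimized and that solutions are given as equal-length sequences of numbers; the function name `dominates` is an illustrative choice.

```python
from typing import Sequence


def dominates(fx: Sequence[float], fy: Sequence[float]) -> bool:
    """Return True if fx Pareto-dominates fy (all criteria minimized)."""
    if len(fx) != len(fy):
        raise ValueError("criteria vectors must have the same length")
    # First condition of Eq. 14.4: fx is nowhere worse than fy.
    no_worse = all(a <= b for a, b in zip(fx, fy))
    # Second condition: fx is strictly better on at least one criterion.
    strictly_better = any(a < b for a, b in zip(fx, fy))
    return no_worse and strictly_better


print(dominates([1.0, 2.0], [1.0, 3.0]))  # True
print(dominates([1.0, 2.0], [1.0, 2.0]))  # False: identical vectors
print(dominates([1.0, 3.0], [2.0, 2.0]))  # False
print(dominates([2.0, 2.0], [1.0, 3.0]))  # False
```

The last two calls reproduce the situation of Eq. 14.5: each vector is better on one criterion and worse on the other, so neither dominates.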

In such a case both solutions are considered equally valuable.

14.5 Multimodality and Multiobjectiveness in Genetic Algorithms

Genetic algorithms aim at finding the global extreme, which is a great advantage, but in some cases, when finding all the optima is the basis of a correct solution, may prove to be undesirable. That's why the mechanism of niching was developed, which enables individuals to remain at the local optima. Several methods are used for approaching such problems [13], among others:
• Iterations - a genetic algorithm is rerun several times; if all optima may be found with equal probability, one gets a chance of finding them due to independent computations.
• Parallel computations - where a set of populations is evolved independently.
• Sharing-based methods - the sharing function determines by how much an individual's fitness is reduced, depending on the number of other individuals in its neighborhood. The original fitness is divided by the value of the sharing function, which is proportional to the number of individuals surrounding a given one and inversely proportional to the distance from them.

The existence of several objectives in the problem to be solved is reflected in the method of evaluating individuals in a genetic algorithm. The objective function takes the form of a vector, therefore it cannot be directly used as the fitness function. Such problems in GA-based methods are solved by applying scalarization, which is the most common approach. In this case the fitness function is a weighted sum of elements, where each represents one of the objectives. The problem becomes how to define the weights. Choosing the right weights may be a problem, because the objectives are usually mutually exclusive and several solutions representing different trade-offs between them may exist. Pareto optimization may also be used in this case. It consists in comparing individuals by means of domination, which allows either rank assignment or performing the tournament selection.

Genetic algorithms are particularly suitable for performing multiobjective optimization, because they operate on a large number of individuals simultaneously, which facilitates optimizing a whole set of solutions. A multiobjective algorithm, unlike a single objective one, aims at finding an entire set of non-dominated solutions that are located as close to the real Pareto front as possible. Besides, an additional requirement is that these solutions should be distributed evenly in the space to ensure that all criteria are equally taken into account. Moreover, due to niching techniques the second condition is met automatically. The first Pareto-based fitness assignment method, proposed by Goldberg in [6], belongs to the most commonly

used ones. It consists in assigning the highest rank to all nondominated solutions. Afterwards, these solutions are temporarily removed from the population and the procedure is repeated for the remaining individuals that receive a lower rank value. The process of rank assignment is carried out as follows:

    Temp := Population
    n := 0
    while (Temp ≠ ∅) do
    {
        Nondominated := {x ∈ Temp | (¬∃y ∈ Temp) f(y) ≺ f(x)}
        (∀x ∈ Nondominated) rank(x) := n
        Temp := Temp \ Nondominated
        n := n + 1
    }

Niching may be used at every stage of this algorithm to help preserve diversity in the population. This optimisation has been introduced into the NSGA algorithm [19].

Another method, proposed by Fonseca and Fleming in 1993 [4], is based on the idea that the rank of an individual depends on the number of other individuals in the population that dominate it. Zitzler and Thiele modified it to develop the Strength Pareto Approach, which solves the problem of niching when elitism is introduced to a multiobjective genetic algorithm [23]. In the case of multiobjective optimisation, elitism requires storing all non-dominated individuals found during the course of evolution in an external set. These individuals participate in reproduction along with the ones from the standard population, but they are not mutated, which prevents them from losing their quality. Fitness assignment in the Strength Pareto Evolutionary Algorithm (SPEA) is performed in the following way (N denotes the size of the population):
• For each individual i in the external set do:
$$f_i = \frac{\mathrm{card}\{j \in Population \mid i \prec j\}}{N + 1}$$
• For each individual j in the population do:
$$f_j = 1 + \sum_{i,\; i \prec j} f_i,$$
i.e. $f_i \in [0, 1)$ and $f_j \in [1, N)$.

Because in this method the lowest fitness values are assigned to the best individuals, the fitness function must be modified so that selection may be carried out.
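Both fitness assignment schemes described above can be sketched in a few lines of Python. This is an illustrative sketch only: it assumes minimized criteria, represents individuals directly by their criteria vectors, omits niching, and the function names (`nondominated_ranks`, `spea_fitness`) are not taken from the cited papers.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def dominates(fx: Vector, fy: Vector) -> bool:
    """fx dominates fy when all criteria are minimized."""
    return (all(a <= b for a, b in zip(fx, fy))
            and any(a < b for a, b in zip(fx, fy)))


def nondominated_ranks(population: List[Vector]) -> List[int]:
    """Rank assignment by successive removal of nondominated fronts:
    rank 0 for the first front, 1 for the next one, and so on."""
    ranks = [-1] * len(population)
    remaining = set(range(len(population)))       # Temp := Population
    n = 0
    while remaining:                              # while Temp is not empty
        front = {
            i for i in remaining
            if not any(dominates(population[j], population[i])
                       for j in remaining)
        }
        for i in front:
            ranks[i] = n
        remaining -= front                        # Temp := Temp \ Nondominated
        n += 1
    return ranks


def spea_fitness(external: List[Vector],
                 population: List[Vector]) -> Tuple[List[float], List[float]]:
    """SPEA fitness (lower is better).
    External i:   f_i = card{j in population : i dominates j} / (N + 1)
    Population j: f_j = 1 + sum of f_i over external i that dominate j"""
    n = len(population)
    strengths = [sum(dominates(e, p) for p in population) / (n + 1)
                 for e in external]
    fitness = [1.0 + sum(s for e, s in zip(external, strengths)
                         if dominates(e, p))
               for p in population]
    return strengths, fitness


pop = [(1, 4), (2, 2), (4, 1), (3, 3), (4, 4)]
print(nondominated_ranks(pop))
print(spea_fitness([(1, 4), (2, 2), (4, 1)], [(3, 3), (4, 4), (2, 5)]))
```

In the ranking example the first three vectors are mutually incomparable and share the best rank, while (3, 3) and (4, 4) fall into later fronts.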

14.6 MulGEx as a New Method of Rule Extraction

MulGEx belongs to black box methods. It does not require any special network architecture nor use the information encoded in the network itself. Moreover, it does not impose any constraints on the types of attributes in the input patterns.

The following criteria have been used for evaluating the produced solutions:
• coverage - the number of patterns classified correctly by a rule or set,
• error - calculated on the basis of the number of misclassified patterns,
• complexity - depending on the number of premises in a single rule and rules in a set.

The first two correspond to fidelity and determine to what extent the answers given on the basis of the extracted rules are identical to those produced by the network. Complexity has been introduced to evaluate the solutions' comprehensibility. The remaining FACC requirements cannot be used as criteria during the process of rule extraction. Accuracy may be measured only after the algorithm has stopped and the final set of rules is applied to previously unseen patterns. Consistency requires running the algorithm several times and comparing the obtained results.

MulGEx is based on the concept of Pareto optimization, but scalarization has been implemented on both levels to allow comparison. In the case of the lower-level algorithm this is less important, for it always produces one solution at a time, but on the upper level the choice influences the algorithm's performance significantly. The difference has been shown by experimental study.

One of the most important decisions that needs to be made before implementing a genetic algorithm concerns the form of an individual, i.e. the way of encoding data in the chromosome. This influences not only the genetic operators, but also the entire algorithm's performance. Two alternative approaches are used in rule extraction, namely Michigan and Pitt [13]. The first one consists in encoding a single rule in each individual. Focusing on individual rules allows adjusting their parameters precisely, but gives no possibility of evaluating how the extracted rules work as a set. Moreover, one must implement one of the niching techniques as well as solve the problem of contradictory rules that might be evolved. In the Pitt approach an individual contains an entire set of rules, which eliminates the drawbacks of the previous method. However, because of the variable length of chromosomes, i.e. the changing number of rules in the evolved sets, more complicated genetic operators have to be implemented.

The proposed method combines both these approaches. Its main idea consists in introducing genetic algorithms working on two levels as in [12]. This allows rules to be optimized efficiently. The main idea is shown in Fig. 14.3. The first step consists in evolving single rules by a low-level genetic algorithm.
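The three criteria listed in this section (coverage, error, complexity) can be computed for a single rule roughly as follows. This is a sketch under stated assumptions: the rule is represented by a hypothetical predicate `matches`, the network's answers are supplied as a list of class indices, and the complexity of a single rule is taken to be its number of active premises.

```python
from typing import Callable, List, Sequence, Tuple

Pattern = Sequence[float]


def rule_criteria(
    matches: Callable[[Pattern], bool],
    conclusion: int,
    n_active_premises: int,
    patterns: List[Pattern],
    network_classes: List[int],
) -> Tuple[int, int, int]:
    """Return (coverage, error, complexity) of one rule.

    The rule is compared with the answers of the network, not with the
    true labels, because coverage and error are meant to reflect fidelity.
    """
    coverage = 0
    error = 0
    for pattern, net_class in zip(patterns, network_classes):
        if not matches(pattern):
            continue                 # the rule does not apply to this pattern
        if net_class == conclusion:
            coverage += 1            # rule agrees with the network
        else:
            error += 1               # rule contradicts the network
    return coverage, error, n_active_premises


# Hypothetical data: one attribute, network answers for four patterns.
patterns = [[0.2], [0.4], [0.6], [0.9]]
network_classes = [1, 1, 0, 1]
print(rule_criteria(lambda x: x[0] <= 0.7, 1, 1, patterns, network_classes))
```

For the toy data the rule applies to the first three patterns, agrees with the network on two of them and contradicts it on one.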

Fig. 14.3. The main idea of MulGEx.

The results are passed to the upper-level algorithm that performs further optimization by working on sets built of the obtained rules and evaluating how well these rules cooperate.

14.7 Details of the Method

The algorithms on both levels will be described in detail in this section. The most significant elements in the design of a GA-based application will be presented, namely the way of encoding individuals, genetic operators and fitness function.

14.7.1 The Lower-level Algorithm

The lower level algorithm delivers the initial pool of rules for the upper level algorithm. It operates on chromosomes representing single rules (the Michigan approach). In order to reduce the complexity of rule extraction, several independently evolved populations are introduced, one for each of the classes defined in the problem. This is a solution to the problem of multimodality that guarantees that the algorithm finds relatively precise rules for all classes, including those represented by a small number of training patterns.

Each of the genetic algorithms on the lower level is run several times and produces one rule at a time. The best rule evolved is stored in an external set (the upper level genetic algorithm's initial pool). Afterwards, all training patterns recognized by this rule are removed from the input set used by the algorithm for fitness calculation (sequential covering approach [22]). Then a new population is created in a random way and the algorithm is rerun. This sequential approach to rule extraction solves the problem of multimodality within the classes.

At this stage the choice of the method of dealing with multiobjectiveness is of little significance, since only one rule is extracted at a time. Both approaches, i.e. scalarization and Pareto optimization, have been implemented

and compared. Pareto optimization tends to be more efficient here, especially for multidimensional solution spaces and large numbers of classes. At an early stage of evolution the algorithm attempts to discover the areas in the space where patterns belonging to a given class are located. The first rules evolved with coverage greater than 0 usually have large error values and should be subsequently refined in order to reduce misclassification. A single objective algorithm combines coverage and error into one value and therefore does not take into account the potential usefulness of such rules. These rules are assigned low fitness values because of high error and may be excluded from reproduction, thus being eliminated from the population. This hinders the process of evolution. A Pareto-based genetic algorithm considers all objectives separately, therefore such rules are very valuable (as non-dominated) and will be optimized by means of genetic operators with a strong probability in the next generations. As a result, the initial error will be reduced. Therefore the number of generations needed to evolve satisfying rules tends to be lower if Pareto optimization is used.

Nevertheless, having stopped the algorithm, the choice of the best rule is not straightforward because of a limited possibility of comparing individuals. In this case, one must temporarily apply certain weights to scalarize the fitness function. Choosing the most appropriate weights is relatively easy, however, because complexity is rarely taken into account at this stage. In most cases it is reasonable to select rules with a very small error value or, if possible, with no error at all. This guarantees high fidelity of the final set evolved by the algorithm, even though the number of rules may be relatively large.

The number of generations created within such an iteration is chosen by the user who may decide to evolve a constant number of generations every time or to remove the best rule when no improvement has been detected for a given period of time. Improvement occurs when an individual with the highest fitness so far appears in a single objective algorithm or when a new non-dominated solution is found in a multiobjective one. The process of extraction for a given class completes when there are no more patterns corresponding to this class, or when the user decides to stop the algorithm after noticing that newly found rules cover too few patterns (which means that the algorithm may have started taking exceptions and noisy data into account). As soon as evolution in all the populations has stopped, all rules stored in the external set are passed to the upper-level algorithm for further processing.

The Form of the Chromosome

The chromosome on this level is presented in Fig. 14.4. An individual consists of a set of premises and a conclusion that is identical for all individuals within a given population. The conclusion contains a single integer - the number of the appropriate class. The form of a single premise depends on the type of

the corresponding input in the neural network and may represent a constraint imposed on a binary, enumerable or real value. The number of premises in a rule is constant and equal to the number of the network's input attributes. Premises are eliminated by deactivating their flags, which indicates that certain constraints are not taken into account when applying the rule to the patterns.

Fig. 14.4. A schema of chromosomes on the lower level.

The appropriate genes designed for all above-mentioned types of attributes are presented in Fig. 14.5a-14.5c. Every premise is accompanied by a logical flag A that indicates whether the constraint is active. If the flag is set to false, all values of a given attribute are accepted by the rule. A constraint defined for a binary attribute has been implemented as a single logical variable (Fig. 14.5a). In order to match a given rule, a pattern must have the same value of this attribute as the premise, unless the flag indicates that this particular constraint is inactive. An enumerable attribute requires an array of logical variables, where every element corresponds to one value within the type (Fig. 14.5b). All elements set to true represent together the subset of accepted values. A given rule can be applied to a pattern if the value of the attribute in the pattern belongs to the subset specified within the premise. A constraint imposed on a real value consists of two real variables representing the minimal and maximal value of the attribute that can be accepted by the rule (Fig. 14.5c).

Fig. 14.5. Genes representing different types of premises.

Genetic Operators

Having defined the form of the chromosome, one must introduce appropriate genetic operators to allow reproduction and mutation. The process of exchanging genetic information between individuals is based on uniform crossover. Every logical or real variable in the chromosome is

copied from one of the offspring's two parents. The choice of the parent is performed randomly, with equal probability. A logical premise is therefore copied entirely from one parent. The subset of values in an enumerable one is a random combination of those encoded in the parents' chromosomes. In the case of real premises, the newly created range may be either copied entirely from one parent or created as a sum or intersection of both parents' ranges. The conclusion of the rule may be copied from either of the individuals, since its value is the same within a given population.

Mutation consists of modifying individual premises in a way that depends on their type. Some of the logical variables in premises corresponding to binary or enumerable attributes, chosen randomly usually with a small probability, may be assigned the opposite value, which changes the set of patterns accepted by the rule. Constraints imposed on real attributes may be altered by modifying one of the bounds of the range of accepted values. This is done by adding a random value to a given real variable within the premise. The algorithm becomes more effective if small changes are more probable than larger ones, which prevents individuals from being modified too rapidly and losing their qualities. Mutation may also influence the flags attached to premises. This results in individual constraints being activated or deactivated and makes rules either more specific and complex or more general. For obvious reasons mutation cannot affect the conclusion that remains constant in the course of evolution.

The Fitness Function and the Method of Selection

The quality of an individual is measured in the following way: The first step consists in calculating the number of patterns classified correctly by the rule (coverage) as well as the number of misclassified patterns (error). This requires checking whether the neural network's response, given for every pattern that the rule can be applied to, matches the conclusion. Additionally, the number of active premises is determined in order to evaluate the complexity of the rule.

These criteria of quality assessment have to be gathered into a single fitness value. In single objective optimization, all of them are multiplied by certain weights and the results are summed up to produce a scalar value. This is the simplest solution, however, the information concerning the values of individual criteria is lost. The weights used for fitness calculation are chosen by the user and the function takes the form presented by Eq. 14.6 (the objectives that are minimized are negated). Since fitness cannot be negative, the obtained values may need to be scaled.

$$Fitness_{scalar} = W_c \cdot coverage - W_e \cdot error - W_x \cdot complexity \tag{14.6}$$

The multiobjective approach requires creating a vector containing all values of the criteria (Eq. 14.7). Comparing two individuals in this case must be based

on the relation of domination. Fitness is calculated on the basis of Goldberg's method, unless the option of retaining the best individuals (elitism) has been chosen. In the latter case, the procedure of fitness assignment proposed in the Strength Pareto Approach is used.

$$Fitness_{Pareto} = [coverage, error, complexity] \tag{14.7}$$

The last element that needs to be defined when designing a genetic algorithm is the method of selection. Various possibilities have been proposed here, but their influence on the quality of evolved solutions is usually of little importance. The roulette wheel technique has been implemented in MulGEx, which implies that all individuals have a chance to be chosen for reproduction, but the probability of them being selected is proportional to their fitness value. Evolution stops after a given number of generations has been created since the beginning or since the last improvement.

14.7.2 The Upper-level Algorithm

The upper-level algorithm in MulGEx is multiobjective in the Pareto sense, although scalarization has also been implemented to allow comparison. It operates on entire sets of rules that are initially created on the basis of the results obtained from the lower-level one. Every individual in the initial population consists of all the rules produced by the lower-level genetic algorithms (Fig. 14.3). This implies that the maximal fidelity achieved at the previous stage is retained and further optimization does not require adding new rules. The purpose of the upper-level algorithm consists mostly in improving comprehensibility by either excluding certain rules from the sets or eliminating individual premises. Other modifications of the rules are also possible at this stage. Naturally, this process should be accompanied by possibly low deterioration of fidelity, i.e. the algorithm must decide which rules are the least significant. Gradual simplification of the sets results in producing various solutions with different qualities, which is the main advantage of a multiobjective algorithm.

The result produced by the algorithm is a set of non-dominated solutions that correspond to sets of rules with different qualities and drawbacks. The final choice of the most appropriate one is left to the user and is usually done by applying weights to the criteria to enable comparing non-dominated solutions. Due to the multiobjective approach the user may examine several solutions at a time without having to rerun the algorithm with different parameters, which is not possible in the case of scalarization, where weights have to be carefully adjusted for the algorithm to produce satisfying results.

The multiobjective approach helps to determine the optimal size of a rule set. Normally, increasing the complexity of a set is accompanied by improved fidelity. However, if the observed improvement is very small after a new rule has been added, this may indicate that this rule covers exceptions that haven't been eliminated by a neural network. This allows identifying noisy data.
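The scalar fitness of Eq. 14.6, roulette wheel selection, and the final weighted choice among non-dominated solutions can be sketched as follows. This is only an illustration: the shift used to keep selection weights positive is one possible scaling (the chapter does not specify which one MulGEx uses), and the function names and the example front are invented.

```python
import random
from typing import List, Sequence, Tuple

Criteria = Tuple[float, float, float]   # (coverage, error, complexity)


def scalar_fitness(coverage: float, error: float, complexity: float,
                   w_c: float = 1.0, w_e: float = 1.0,
                   w_x: float = 0.0) -> float:
    """Eq. 14.6: minimized objectives enter with a negative sign."""
    return w_c * coverage - w_e * error - w_x * complexity


def roulette_select(population: Sequence, fitness: List[float], k: int,
                    rng: random.Random) -> list:
    """Draw k parents with probability proportional to scaled fitness.
    Assumed scaling: shift all values so that the smallest one becomes a
    tiny positive number, which keeps every individual selectable."""
    lowest = min(fitness)
    weights = [f - lowest + 1e-9 for f in fitness]
    return rng.choices(population, weights=weights, k=k)


def choose_solution(front: List[Criteria], w_c: float, w_e: float,
                    w_x: float) -> Criteria:
    """Pick one non-dominated solution by applying user weights."""
    return max(front, key=lambda v: scalar_fitness(v[0], v[1], v[2],
                                                   w_c, w_e, w_x))


# Hypothetical Pareto front of rule sets: (coverage, error, complexity)
front = [(90, 0, 40), (85, 2, 12), (60, 5, 3)]
print(choose_solution(front, 1.0, 1.0, 0.0))   # complexity ignored
print(choose_solution(front, 1.0, 1.0, 1.0))   # moderate penalty
print(choose_solution(front, 1.0, 1.0, 5.0))   # strong penalty

rng = random.Random(0)
parents = roulette_select(["a", "b", "c"], [-5.0, 0.0, 5.0], 4, rng)
print(parents)
```

Changing only the complexity weight moves the choice from the most accurate rule set to the most compact one, without rerunning the evolution.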

As previously, designing the algorithm requires defining the form of an individual, followed by genetic operators, fitness evaluation and the method of selection.

The Form of the Chromosome

An individual takes the form of a set of rules that describes the performance of the network. The form of chromosome on this level is presented in Fig. 14.6. Rules are accompanied by binary flags whose purpose is the same as in the chromosome on the lower level. Namely, setting a flag to false results in the temporary exclusion of a given rule from the set. This allows adjusting the size of the set without having to introduce variable-length chromosomes.

Fig. 14.6. A schema of chromosomes on the upper level.

Genetic Operators

The hierarchical structure of the chromosome requires defining specialized genetic operators. Crossover exchanges corresponding rules between the parents in a uniform way, which implies that rules corresponding to all classes are included in every set. Rules are copied into the offspring entirely, including the flag.

The operator of mutation is more complicated, for it should enable modification on two levels - in relation to entire sets and individual rules. It is worth noticing that in the lower-level algorithm some rules were evolved on the basis of a reduced set of patterns. At this stage the whole set is used for evaluation. Moreover, individual rules cooperate to perform the classification task and are assessed together. Because of these new conditions further optimization of rules might be possible and that is why the operator of mutation in the upper-level algorithm may alter the internal structure of rules that was developed before. Therefore mutation is performed in two steps. First, every flag attached to a rule may change its value with a small probability, which results in adding or removing the rule from the set. Afterwards, premises inside the rule may be modified by applying the mutation operator used on the lower level. This helps to adjust the ranges of accepted values inside the rule or reduce the rule's complexity at the cost of fidelity.
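The fixed-length, flag-based chromosome and the two operators described above can be sketched in Python as follows. The names (`FlaggedRule`, `uniform_crossover`, `mutate`) and the `mutate_rule` callback standing in for the lower-level mutation operator are assumptions made for illustration; rules are treated as opaque objects.

```python
import random
from dataclasses import dataclass
from typing import Any, Callable, List


@dataclass
class FlaggedRule:
    rule: Any            # rule produced by the lower-level algorithm
    active: bool = True  # False temporarily excludes the rule from the set


Individual = List[FlaggedRule]   # same length and rule order for everyone


def uniform_crossover(a: Individual, b: Individual,
                      rng: random.Random) -> Individual:
    """Each position is copied from either parent with equal probability;
    a rule is always copied whole, together with its flag."""
    child = []
    for gene_a, gene_b in zip(a, b):
        src = gene_a if rng.random() < 0.5 else gene_b
        child.append(FlaggedRule(src.rule, src.active))
    return child


def mutate(individual: Individual, p_flag: float,
           mutate_rule: Callable[[Any], Any],
           rng: random.Random) -> Individual:
    """Two-step mutation: (1) flip rule flags with probability p_flag,
    (2) let the lower-level operator modify premises inside each rule."""
    result = []
    for gene in individual:
        active = (not gene.active) if rng.random() < p_flag else gene.active
        result.append(FlaggedRule(mutate_rule(gene.rule), active))
    return result


def active_rules(individual: Individual) -> list:
    """The rule set actually used for classification and evaluation."""
    return [gene.rule for gene in individual if gene.active]


rng = random.Random(42)
parent_a = [FlaggedRule("r1"), FlaggedRule("r2"), FlaggedRule("r3", False)]
parent_b = [FlaggedRule("r1", False), FlaggedRule("r2"), FlaggedRule("r3")]
child = uniform_crossover(parent_a, parent_b, rng)
mutant = mutate(child, p_flag=0.1, mutate_rule=lambda r: r, rng=rng)
print(active_rules(mutant))
```

Keeping every rule in the chromosome and toggling flags means crossover never has to align sets of different sizes, which is the reason the text gives for avoiding variable-length chromosomes.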

Evaluation of Individuals

The criteria used for the evaluation of individuals are calculated on the basis of the criteria introduced for rules on the lower level. Coverage depends on the number of patterns classified correctly by the set as a whole, whereas error is the sum of errors made by each of the active rules. Complexity is defined as the number of active rules increased by the overall number of active premises within them.

Fitness is assigned to individuals depending on the type of the algorithm, according to one of the methods described in the previous subsection. Scalarization implemented for the purpose of comparison consists in applying weights to the objectives so that a single value is obtained. Pareto optimization requires creating a vector of the values of objectives (Eq. 14.7).

The process of creating new generations resembles the one introduced on the lower level. The difference between genetic algorithms on both levels is that at this stage the purpose of the algorithm is to produce a whole set of solutions simultaneously, therefore niching proves to be very useful. To this end the technique of sharing function has been implemented to reduce the fitness of those individuals that are surrounded by many others. However, this is not necessary in the case of the SPEA algorithm, since niching is guaranteed by the method of fitness assignment.

14.8 Experimental Study

The purpose of experiments is to verify the algorithm's effectiveness and versatility. The method should be tested on various data, which aims at checking whether it has the following properties:
• independence of the types of attributes,
• effectiveness in the presence of superfluous attributes,
• resistance to noise,
• scalability.

A good practice is to perform experiments on well-known benchmark data sets, so that the results may be compared with those obtained by means of other methods. MulGEx has been tested on four sets from [3], which are presented in Table 14.1. The data contain different types of attributes, which allows verifying the method's versatility. Resistance to noise was tested on the LED-24 data set, where 2% noise had been added to make the classification task more difficult. Moreover, this data makes it possible to observe whether the algorithm can detect superfluous attributes, since not all of them participate in determining the class that a given pattern belongs to (17 attributes are insignificant and have been chosen randomly). The evaluation of scalability requires patterns with a large number of attributes, hence the Quadrupeds set.

Table 14.1. Data sets used for the tests

Name (examples)             Type of attributes                      Class         No of examples
Iris (150 instances)        3 continuous                            Setosa        50
                                                                    Versicolour   50
                                                                    Virginica     50
LED-24 (1000 instances)     24 binary (17 superfluous), 2% noise    10 classes    about 100 per one class
Monk-1 (432 instances)      6 enumerable                            0             216
                                                                    1             216
Quadrupeds (100 instances)  72 continuous                           Dog           29
                                                                    Cat           21
                                                                    Horse         27
                                                                    Giraffe       23

All sets were processed by a neural network in order to reduce noise and eliminate unusual data. A multi-layer feed-forward network with one hidden layer was provided with the data and trained using the back-propagation algorithm. The results of the training are gathered in Table 14.2. The subsequent experiments performed to evaluate MulGEx were conducted both on raw as well as processed sets, unless the network achieved maximal accuracy.

Table 14.2. Results of the network training

Data set      Number of neurons in the hidden layer   Accuracy %
Iris          2                                       96.7
LED-24        3                                       95
Monk-1        3                                       100
Quadrupeds    2                                       100

The optimal values of the algorithm's parameters depend on the properties of the data set, especially on the number of attributes. During the experiments the best effectiveness of rule extraction was achieved when the probability of crossover was relatively low, for example 0.5. The reason is that even a small modification of a rule may change its coverage and error significantly. Therefore if the fittest individuals exchange genetic information during reproduction, the offspring may lose the most desirable properties. Low probability of crossover guarantees that some of the individuals are copied directly into the new population. Introducing elitism may also be helpful in this case, for it ensures that the best individuals are always present in the next

2± 6. i.8± 2. cover only the most typical examples.4.0± 0.5 12 918.0 5 99. consisting of few rules. The experiments confirm the fact that the criteria used for evaluating solutions are contradictory. The probability of mutation should be inversely proportional to the number of attributes and during the tests was usually set not to exceed 0.0 1 108. depending on the weights chosen by the user. When more rules are added to the set. because the process of calculating coverage and error for each in- dividual is very time-consuming. MulGEx produced sets of non-dominated solutions with various properties. Every test was car- ried out 5 times and the average result as well as the standard deviation were calculated.8 6 147. The results of experiments on raw data Iris LED-24 Monk-1 Quadrupeds Rules F idelity Rules F idelity Rules F idelity Rules F idelity 1 50.0 3 326.6± 0.2 .4 6 608. as well as with differ- ent sets of classes defined for a given problem.3. In the case of Iris.5 15 924. the more potential solutions may be examined and compared at a time.0± 0. The results of the experiments indicate that MulGEx is suitable for solving problems with various numbers and types of attributes.0 3 76.0± 1.8± 2.0± 0. This was not possible for LED-24.0 2 53.0 1 27. On the other hand.0 4 97.0± 5.e. The size of the population influences the efficiency of the algorithm as well.5 9 843. respec- tively. Rule extraction for the above-mentioned data sets was successfully carried out when populations consisted of 20 .4± 5.0± 0. solutions con- taining different numbers of rules were selected and their fidelity. was determined. where the presence of noise resulted in increasing complexity or error. large populations cause deterioration in the algorithm’s performance. 14 Pareto Optimization in Rule Extraction 331 generation. In each case multiobjective algorithms on both levels were stopped af- ter having evolved 100 generations without improvement.4± 0. 
Theoretically.0 6 396.3 and Table 14.50 individuals.01.2± 12. Monk-1 and Quadrupeds the algorithm succeeded in finding a relatively small rule set that covered all training patterns without misclassification.4 2 180. The simplest sets. Table 14.6 4 324.. Therefore a reasonable compromise must be found.4± 0.3± 1.7 7 432. At this point it should be mentioned that MulGEx satisfies the requirement that the produced solutions should be distributed evenly in the space. increasing the number of individuals reduces the number of generations needed to find satisfactory solutions.9 8 149. The results of the experiments performed both on raw data and sets processed by the network are gathered in Table 14. Then.5 2 96. its fidelity is improved at the cost of complexity.4± 2.0± 0.8± 0.0 4 143. The more individuals. the difference between coverage and error.0± 0.8± 4.

7 9 897. Before choosing a solu- tion the user may evaluate which one would be the most satisfactory. Mularczyk Table 14.7 shows the values of fidelity and complexity for rule sets produced during one of the experiments.6± 1. The results obtained for LED-24 for raw and processed data differ significantly. In the presented example very high fidelity may be achieved by a set with complexity equal to 6. it proves to be much more convenient to find a set of solutions in a single run and then apply different weights in search of the most satisfying one. which is very time- consuming.332 U.3 7 149. .4± 4. When the Pareto approach was used.4 6 642.2± 0.1 12 981. it must be emphasized that scalarization requires rerunning the algorithm every time when the user decides to change the weights.6± 4.8± 4.0 6 147. The results (Table 14.0± 3.0± 0. To this end separate tests for the Iris data set have been per- formed.0 3 336. In all cases Pareto optimization was used at this stage and rules with very high error weight were selected. Markowska-Kaczmar and K. The Pareto approach in the upper-level algorithm has been compared with scalarization.0 2 98.9 15 987.6 4 143. it was run only once during each of the tests and a whole set of solutions was created. 14. which facilitates rule extraction. Afterwards the upper-level one was executed several times with different settings. However.0± 0. The results of experiments on data processed by a network Iris LED-24 Rules F idelity Rules F idelity 1 50. Fig.5± 0. If complexity is increased by adding new rules or premises. the improvement of fidelity is relatively low. If we take into account the fact that adjusting weights is not easy. therefore separate runs were necessary to produce results for all indicated weights’ combinations. This is possible due to neural networks’ ability of reducing noise.5) show that for the Iris data set there is hardly any difference in the quality of solutions between the two approaches. 
Then the user could apply weights to allow the algorithm to select the most appro- priate solution. based on the shape of the Pareto front.7 The evolved rule sets represent different trade-offs between the objectives and contain sets with maximal fidelity as well as sets consisting of one rule only and various intermediate solutions. which means that in this case a single rule covers statistically more patterns and misclassifies less.2± 2. The lower-level algorithm was run 5 times and produced initial rule sets. In the case of scalarization weights for the objectives had to be determined beforehand. Another advantage of Pareto optimization is that one may analyze all so- lutions simultaneously.4. so that the initial set could achieve max- imal fidelity. The fidelity of sets containing the same number of rules is higher for data passed through a neural network.
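The practical difference between the two approaches discussed here can be sketched in a few lines of Python: with the Pareto approach the non-dominated set is computed once, and different weights are then applied to it without rerunning the evolution. The solution tuples below are invented for illustration and are not taken from the tables, and the weighted sum `w_cov*coverage - w_err*error - w_comp*complexity` is an assumed form of the scalar criterion, since the chapter does not spell it out.

```python
def dominates(a, b):
    """a, b are (coverage, error, complexity) triples.

    Coverage is maximized; error and complexity are minimized.
    """
    no_worse = a[0] >= b[0] and a[1] <= b[1] and a[2] <= b[2]
    better = a[0] > b[0] or a[1] < b[1] or a[2] < b[2]
    return no_worse and better


def pareto_front(solutions):
    """Keep the solutions that no other solution dominates."""
    return [s for s in solutions if not any(dominates(t, s) for t in solutions)]


def pick(front, w_cov, w_err, w_comp):
    """Select one solution from an existing front for the given weights."""
    return max(front, key=lambda s: w_cov * s[0] - w_err * s[1] - w_comp * s[2])


# Hypothetical rule sets: (coverage, error, complexity).
candidates = [
    (150, 0, 22), (147, 0, 9), (143, 1, 6), (100, 0, 4),
    (50, 0, 2), (140, 3, 8), (143, 2, 6),
]
front = pareto_front(candidates)  # computed once

# The same front serves every weight combination the user wants to try.
for weights in [(10, 10, 1), (1, 1, 10), (1, 10, 1)]:
    print(weights, "->", pick(front, *weights))
```

A scalarized run would have to fix one of these weight triples before evolution starts and be repeated for each of the others.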

0± 0. The size of population was set to 20 and the probabilities of crossover and mutation to 0.9 show an example of rules produced for LED-24.9 6.2± 1.0± 0.0 22.2± 1.2± 1. 14.5 143.0± 0.01.6 1 1 10 143. The presented results were produced by the multiobjective lower-level al- gorithm and the following weights: coverage = 10. respectively. The comparison of Pareto approach and scalarization (Cov stands for coverage.2± 1.9 150. the user may choose the most appropriate weights when selecting one particular solution. Error Comp.6 and Fig.0± 2.2± 1. Evolution was stopped after 100 gen- erations without improvement and the best rule in each of the populations was selected. On the other hand. fidelity deteriorates rapidly. Therefore.9 Fig. 14.0± 0.5 1.0± 1.4± 0. Table 14. The first attributes in LED- 24 represent seven segments of a LED display presented in Fig.0± 2. missing lines stand for segments that are off and thin grey lines correspond to seg- .0 1 10 1 141. if complexity is reduced below 6.7.6 0. error = 10.6 0.0 2. The remaining ones are superfluous and are successfully excluded by the algorithm from participating in classification.0 9. 14 Pareto Optimization in Rule Extraction 333 Table 14.2± 1.8. 14.7 6.6± 1.1 18.4± 3.0 142.0± 0. 14.6± 1. Cov. Error Comp. Error Comp. which suggests that an important rule has been eliminated. having this additional information due to the Pareto approach. The values of the criteria for rules sets obtained for Iris.1 Evaluation of Rules by Visualization The quality of rules produced by MulGEx for LED-24 may be easily evaluated due to the possibility of visualizing the data.0 4. The option of saving the best individuals (elitism) was chosen.6 10 1 1 150.9 and 0. Cov.9 thick black lines denote segments that are lit.0 9.0± 0.0 8.0± 2.8.0 6.6 141.0± 0.0 5.5.0± 1. complexity = 1 were used for selecting the most satisfactory solutions after the algorithm had stopped.0± 0. 
The experiment was carried out for data that had been processed by a network.5 1.0± 0.0 8. 14. 1 1 1 143. In Fig. Comp – for complexity) Weights Pareto Scalarization Cov. This may indicate that more complicated sets cover noisy data and are not valuable for the user.

ments whose state is insignificant when assigning a pattern to a given class (these segments are represented by inactive premises). One may easily notice that the results resemble digits that appear on a LED display. The state of some of the segments proves to be insignificant when a certain pattern is classified as one of the digits. The obtained rules presented in the form of a schema in Fig. 14.9 prove to be very comprehensible for humans. At the same time the algorithm succeeded in reducing the rules' complexity, which confirms the effectiveness of MulGEx.

Fig. 14.8. The segments of LED (Up center, Up left, Up right, Mid center, Down left, Down right, Down center)

Fig. 14.9. Visualization of rules for LED-24
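The claim that the extracted rules resemble LED digits can also be checked mechanically. The sketch below transcribes the ten rules of Table 14.6 and tests them against the conventional noise-free seven-segment shapes of the digits 0 to 9. The `LIT` table is an assumption (the usual seven-segment encoding), and the snake_case segment names are only a renaming of the labels in Fig. 14.8.

```python
SEGMENTS = ("up_center", "up_left", "up_right", "mid_center",
            "down_left", "down_right", "down_center")

# Assumed noise-free seven-segment shapes: digit -> set of lit segments.
LIT = {
    0: {"up_center", "up_left", "up_right", "down_left", "down_right", "down_center"},
    1: {"up_right", "down_right"},
    2: {"up_center", "up_right", "mid_center", "down_left", "down_center"},
    3: {"up_center", "up_right", "mid_center", "down_right", "down_center"},
    4: {"up_left", "up_right", "mid_center", "down_right"},
    5: {"up_center", "up_left", "mid_center", "down_right", "down_center"},
    6: {"up_center", "up_left", "mid_center", "down_left", "down_right", "down_center"},
    7: {"up_center", "up_right", "down_right"},
    8: set(SEGMENTS),
    9: {"up_center", "up_left", "up_right", "mid_center", "down_right", "down_center"},
}

# Rules transcribed from Table 14.6: segment -> required state (True = lit).
RULES = {
    0: {"up_center": True, "up_right": True, "mid_center": False, "down_left": True},
    1: {"up_center": False, "up_left": False, "down_left": False},
    2: {"up_left": False, "down_left": True, "down_right": False, "down_center": True},
    3: {"up_center": True, "up_left": False, "mid_center": True,
        "down_left": False, "down_center": True},
    4: {"up_left": True, "up_right": True, "down_left": False,
        "down_right": True, "down_center": False},
    5: {"up_left": True, "up_right": False, "down_left": False},
    6: {"up_right": False, "down_left": True},
    7: {"up_center": True, "up_left": False, "mid_center": False, "down_left": False},
    8: {"up_right": True, "mid_center": True, "down_left": True,
        "down_right": True, "down_center": True},
    9: {"up_left": True, "up_right": True, "mid_center": True,
        "down_left": False, "down_center": True},
}


def fires(rule, lit):
    """A rule fires when every active premise matches the segment state."""
    return all((seg in lit) == state for seg, state in rule.items())


def matched_digits(rule):
    """Digits whose noise-free shape satisfies the rule."""
    return [d for d, lit in LIT.items() if fires(rule, lit)]


for digit, rule in RULES.items():
    print(digit, "->", matched_digits(rule), "premises:", len(rule))
```

Under the assumed encoding each rule fires on exactly one clean digit, and none of them needs all seven segments, which is the reduction in complexity the text refers to. The nonzero error of the rule for class 5 in Table 14.6 therefore has to come from noisy patterns rather than from the clean digit shapes.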

Table 14.6. Rules produced by the lower-level algorithm for LED-24

Rule                                                                                        Coverage   Error
IF Up center AND Up right AND NOT Mid center AND Down left THEN 0                           114        0
IF NOT Up center AND NOT Up left AND NOT Down left THEN 1                                   83         0
IF NOT Up left AND Down left AND NOT Down right AND Down center THEN 2                      118        0
IF Up center AND NOT Up left AND Mid center AND NOT Down left AND Down center THEN 3        110        0
IF Up left AND Up right AND NOT Down left AND Down right AND NOT Down center THEN 4         82         0
IF Up left AND NOT Up right AND NOT Down left THEN 5                                        92         3
IF NOT Up right AND Down left THEN 6                                                        103        0
IF Up center AND NOT Up left AND NOT Mid center AND NOT Down left THEN 7                    102        0
IF Up right AND Mid center AND Down left AND Down right AND Down center THEN 8              104        0
IF Up left AND Up right AND Mid center AND NOT Down left AND Down center THEN 9             81         0

14.9 Comparison to Other Methods

The vast majority of rule extraction algorithms are based on the scalar approach. Therefore comparing their effectiveness may not be appropriate because of different goals pursued by their authors. Nevertheless two different approaches have been implemented in MulGEx to allow evaluating their performance.

The results of experiments performed for MulGEx may be confronted with those obtained for a different multiobjective rule extraction algorithm, namely GenPar [11]. This is facilitated by the fact that the same data sets were used in both cases. Nevertheless the relative effectiveness of these methods may be evaluated only approximately because of certain differences in the way of performing experiments and the scope of collected information. GenPar was tested on data processed by a neural network only. Moreover, during the course of evolution coverage and error were combined into a single objective, which reduced the number of objectives by one, and only coverage was measured during the tests.

The results for GenPar, presented in [11] and shown in Table 14.7, contain information on the quality of evolved sets expressed in terms of coverage and the number of rules. The results for GenPar indicate that its effectiveness is lower than that of MulGEx. Although fidelity achieved by solutions with the same numbers of rules is similar, in all cases GenPar failed to produce a solution with the maximal fidelity possible. The reason may be that GenPar is based exclusively on the Pitt approach (entire rule sets are encoded in individuals). This implies that sets are evaluated as a whole and the performance of single rules is not taken into account, which prevents rules from being intensively optimized by adjusting values within the premises. The superiority of MulGEx lies in introducing a lower-level algorithm that evolves single rules before sets are formed and allows constant chromosome length on the upper level.

Table 14.7. The results of experiments for GenPar (size: number of rules, fid: fidelity equivalent to coverage) on the basis of [11]

Iris            LED-24          Quadrupeds
size   fid      size   fid      size   fid
1      50       1      87       2      182
2      94       4      324      4      471
4      137      5      473      –      –
5      144      13     928      –      –

14.10 Conclusion

Neural networks, in spite of their efficiency in solving various problems, have no ability of explaining their answers and presenting acquired knowledge in a comprehensible way. This serious disadvantage leads to developing numerous methods of knowledge extraction, especially for networks that solve the problem of classification. Their purpose is to produce a set of rules that would describe a network's performance with the highest fidelity possible, taking into account its ability to generalize, in a comprehensible way. These objectives are contradictory, therefore finding a satisfactory solution requires a compromise.

Because the problem of rule extraction is very complex, genetic algorithms are commonly used for this purpose. Since the fitness function is scalar, standard algorithms can deal with one objective only. Multiobjective problems are approached in many ways; the most common one is to aggregate the objectives into a single value, but this approach has many drawbacks that can be eliminated by introducing proper multiobjective optimization based on domination in the Pareto sense.

The proposed method, MulGEx, combines multiobjective optimization with a hierarchical structure to achieve high efficiency. It is network architecture independent and can be used regardless of the number and types of attributes in the input vectors. Although the algorithm has been designed to extract rules from neural networks, it can be also used with raw data. In this case, however, its efficiency may be deteriorated because the presence of noise in the data requires more complicated rule sets to describe the process of classification.

and S. Carlos A. and C. [7] Hayashi. The lower- level one produces single rules (independently evolved populations correspond to the classes defined for a given classification problem) and passes them to the upper-level one for further optimization.: 1996. ‘Rule extraction from constrained error back propagation MLP’. [8] Hayward. Massachusetts. Addison-Wesley Publishing Company. Merz: 1998. The effectiveness and versatility of MulGEx has been confirmed by experiments and the superiority of the Pareto approach over scalarization has been pointed out. R. The latter algorithm evolves entire sets of rules that describe the performance of a network. 753–758. D. 9–12. and A. Evolutionary Algorithms for Solving Multi-Objective Problems. B. Reading. Van Veldhuizen. Yoshida: 2000. The usefulness of such solutions depends on the purpose of rule extraction. C. D. [2] Andrews. The final choice of the most appropriate solution is left to the user who may select several solutions without the need for rerunning the algorithm. ‘RULENEG: Extract- ing rules from trained ANN by stepwise negation’. References [1] Andrews. and L. and K. C. . [6] Goldberg. J. pp. Journal of Advanced Computational Intelligence 4(4). Nevertheless. R. In: Proceedings Congress on Evolutionary Computation. [4] Coello.. Dept. but achieve high fidelity. of Information and Computer Sciences. E. Due to Pareto optimization MulGEx produces an entire set of solutions with different properties simultaneously. A. In: Proceedings of 5th Australian Conference on Neural Networks.. ‘Rule extraction by genetic algorithms based on a simplified RBF neural network’.: 1989. ‘Survey and critique of techniques for extracting rules from trained artificial neural networks’. Wang: 2001. Neuro- computing Research Centre Queensland University Technology Brisbane. [3] Blake. Knowledge-Based Systems 8(6). In most cases relatively general rules are considered valuable for a classification problem. Technical report. Irvine. Setiono. 
294–301. pp. E. Optimization and ma- chine Learning. [5] Fu. Tickle: 1995. we might be interested rather in more specific rules that cover a smaller number of patterns. University of California. for they provide information about the most important properties that objects belonging to a given class should have. Kluwer Academic Pblisher. Quinsland. New York. and G. Geva: 1994. Brisbane. ‘UCI Repository of Machine Learning Data- bases’. Genetic Algorithms in Search. C. X. Ho-Stuart. varying from complex ones with very high fidelity to the simplest ones that cover only the most typical data. Lamont: 2002. R. 373–389. Y. J.. Diedrich. Old. R. Diedrich. and P. if we attempt to find hidden knowledge by means of data mining. ‘Learning M of N concepts for medical diagnosis using neural networks’. 14 Pareto Optimization in Rule Extraction 337 MulGEx consists of genetic algorithms working on two levels.

R. Mularczyk [9] Jin. Vasara: 2001. 257–271. 564–577. LNC 2206 pp. U. 448– 463. Department of Mechanical Engineering. J. ‘Hierarchical Evolutionary Al- gorithms in the Rule Extraction from neural Networks’. Paredis: 2000. Leow. Sendhoff. V.. ‘Neuro-Fuzzy Rule Generation: Survey in Soft Computing Framework’. and J. [12] Markowska-Kaczmar. Indian Institute of Technology. B. B. Neagu. ‘Rule Induction with Genetic Sequential Covering Algorithm (GeSeCo)’. J. In: Proceedings of the second ICSC Symposium on Engineering of Intelligent Systems. ‘Rule Extraction from Neural Network by Genetic Algorithm with Pareto Optimization’. and P. U. Ghosh: 1999. ‘Extracting comprehensible rules from neural networks via genetic algorithms’. IEEE Transactions on Neural Networks 13(3). pp. and J. Deb: 1993. ‘Interpretation of Trained Neural Networks by Rule Extraction’. [15] Palade. and H. Wnuk-Lipinski: 2004. Zagorski: 2004. Fuzzy Days 2001. Genetic Algorithms + Data Structures = Evolution Pro- grams. and L. [10] Jin. A. Rutkowski (ed. [17] Setiono. 89–94. Portugal. [20] Taha.: 1995. Neural Processing Letters 17(2). [18] Siponen. 748– 768. Technical report. Sendhoff: 2003. [19] Srinivas.. In: The Third International Conference on Evolutionary Multi- Criterion Optimization. ‘Extracting rules from artificial neural networks with dis- tributed representation’. IEEE Transactions on Neural Networks 11(3). pp. and R. K. and A. Kanpur. Tesauro. [21] Thrun. D. Y. Koerner: 2005. ‘Evolutionary multi-objective op- timization for simultaneous generation of signal-type and symbol-type repre- sentations’. ‘Multiobjective Evolutionary Algorithms: A Comparative Case Study and Strength Pareto Approach’. Z. and T. S. Simula.: 1996. C. Vesanto. Symp.. In: G. S. 130–139. Springer Verlag Berlin Heidelberg. . ‘Extracting interpretable fuzzy rules from RBF networks’. IEEE Transaction on Evolutionary Computation 3(4). 149–164. E. N. [22] Weijters. In: L. Leen (eds. and E. R. and R. (EIS 2000). I. J. 752–766. 
on Combinations of Evo- lutionary Computation and Neural Network 1. and P. ‘Extraction of Rules from Artifi- cial Neural Networks for Nonlinear Regression’. [11] Markowska-Kaczmar. ‘Multiobjective optimizatio using nondominated sorting in genetic algorithms’. [16] Santos. ‘Symbolic Interpretation of Artificial Neural Net- works’. Fourth International Symposium on Engineering of Intelligent Systems. India. 152– 161. and J. [13] Michalewicz.. and B. Freitas: 2000. O. pp.338 U. 450–455. Y. Zurada: 2002. EIS. ‘An Approach to Automated Interpretation of SOM’. 245–251. Yoichi: 2000. pp. [14] Mitra. Thiele: 1999. M. IEEE Transactions on Knowledge and Data Engineering 11(3). Markowska-Kaczmar and K. D. and K. Patton: 2001. W. [23] Zitzler.): Artificial Intelligence and Soft Computing. Touretzky. Nievola. In: In Proceedings of Workshop on Self- Organizing Map 2001(WSOM2001).): Advances in Neural Information Processing Systems. Funchal.

15 Agent Based Multi-Objective Approach to Generating Interpretable Fuzzy Systems∗

Hanli Wang1, Sam Kwong1, Yaochu Jin2, and Chi-Ho Tsang1

1 Department of Computer Science, City University of Hong Kong, 83 Tat Chee Avenue, Kowloon, Hong Kong, China
  {wanghl@cs, cssamk@, wilson@cs}.cityu.edu.hk
2 Honda Research Institute Europe, 63073 Offenbach/Main, Germany
  yaochu.jin@honda-ri.de

Summary. Interpretable fuzzy systems are very desirable for human users to study complex systems. To meet this end, an agent based multi-objective approach is proposed to generate interpretable fuzzy systems from experimental data. The proposed approach can not only generate interpretable fuzzy rule bases, but also optimize the number and distribution of fuzzy sets. The trade-off between accuracy and interpretability of fuzzy systems derived from our agent based approach is studied on some benchmark classification problems in the literature.

15.1 Introduction

The fundamental concept of fuzzy reasoning was first introduced by Zadeh [1] in 1973, and over the last thirty years thousands of commercial and industrial fuzzy systems have been successfully developed. One of the most important motivations to build up a fuzzy system is to let users gain a deep insight into an unknown system through the human-understandable fuzzy rules. Another main attraction lies in the characteristics of fuzzy systems: the ability to handle complex, nonlinear, and sometimes mathematically intangible dynamic systems. However, when the fuzzy rules are extracted by the traditional methods, there is often a lack of interpretability in the resulting fuzzy rules due to two main factors: 1) it is difficult to determine the fuzzy sets information such as the number and membership functions; and 2) rule bases are usually of high redundancy, especially when they are generated from the training data. Hence, increasing attention has been paid to improve the interpretability of fuzzy systems [2]-[18]. Therefore, fuzzy systems considering both the interpretability and accuracy are very desirable.

∗ This work is partly supported by the City University of Hong Kong Strategic Grant 7001615.

H. Wang et al.: Agent Based Multi-Objective Approach to Generating Interpretable Fuzzy Systems, Studies in Computational Intelligence (SCI) 16, 339–364 (2006)
www.springerlink.com © Springer-Verlag Berlin Heidelberg 2006

Indeed, the genetic fuzzy system is one of the most successful approaches to hybridize fuzzy systems with the learning and adaptation abilities [19]. In order to study the trade-off between the accuracy and interpretability of fuzzy systems, the multi-objective evolutionary algorithm (MOEA) is very suitable to be applied. A main advantage of using MOEA is that many solutions, each of which represents an individual fuzzy system, can be obtained in a single run. The accuracy and interpretability issues can be incorporated into multiple objectives to evaluate candidate fuzzy systems. Thus, the improvement of interpretability and the trade-off between the accuracy and interpretability can be easily studied.

Recently, more efforts have been made in the MOEA domain to extract interpretable fuzzy systems [4], [17]-[18], [20]-[30]. In [4], a single MOEA is proposed for fuzzy modeling to solve the function approximation problems. This algorithm has a variable-length, real-coded representation, where each individual of the population contains a variable number of rules between 1 and max, which is defined by the decision maker. In our former studies [17]-[18], we have a detailed discussion about some important issues of interpretability and propose an agent based approach to generate interpretable fuzzy systems for function approximation problems. In addition, experimental results have demonstrated that our methods are more powerful than that in [4]. Compared with other methods [20]-[30], where the number of fuzzy sets is predetermined and the distributions of fuzzy sets are not much considered, our agent based approach can not only extract interpretable fuzzy rule bases but also optimize the number and distribution of fuzzy sets through the hierarchical chromosome formulation [31]-[33] and interpretability-based regulation method. In this chapter, more innovative issues about interpretability of fuzzy systems are presented and the agent-based approach is updated accordingly. We also extend our research scope from the function approximation problems to the classification problems based on the proposed agent-based multi-objective approach.

The chapter is organized as follows. Section 15.2 discusses the interpretability issues of fuzzy systems. The proposed agent-based multi-objective approach is discussed in Section 15.3. In Section 15.4, the experimental results are presented on some benchmark classification problems. Finally, we conclude this work and some potential future works are discussed in Section 15.5.

15.2 Interpretability of Fuzzy Systems

The most important motivation to apply fuzzy systems is that they use linguistic rules to infer knowledge similar to the way humans think. Methods for constructing fuzzy systems from training data should not be limited to finding the best approximation of data only. It is also preferable to extract knowledge from training data in the form of fuzzy rules that can be easily understood and interpreted. Interpretability (also called transparency) of fuzzy systems is the criterion that indicates whether a fuzzy system is easily interpreted or not. In our previous work [17] and [18], we have discussed some important interpretability issues in detail. In the following, we will have a quick review about these issues and further propose two more innovative concepts about interpretability: overlapping and relativity.

15.2.1 Completeness and Distinguishability

The discussion of completeness and distinguishability is necessary if fuzzy systems are generated automatically from data. The partitioning of fuzzy sets for each fuzzy variable should be complete and well distinguishable. The completeness means that for each input variable, at least one fuzzy set is fired. The concepts of completeness and distinguishability of fuzzy systems are usually expressed through a fuzzy similarity measure [2], [3], [7], and [34]. In [34], similarity between fuzzy sets is defined as the degree to which the fuzzy sets are equal. In fact, if the similarity of two fuzzy sets is too big, then it indicates that the two fuzzy sets overlap too much and the distinguishability between them is poor.

There are several methods to calculate the similarity. Let $A$ and $B$ be two fuzzy sets of the fuzzy variable $x$ with the membership functions $u_A(x)$ and $u_B(x)$, respectively. We use the following similarity measure between fuzzy sets [34]:

$$S(A, B) = \frac{M(A \cap B)}{M(A \cup B)} = \frac{M(A \cap B)}{M(A) + M(B) - M(A \cap B)} \qquad (15.1)$$

where $M(A)$ denotes the cardinality of $A$, and the operators $\cap$ and $\cup$ represent the intersection and union, respectively. One computationally effective form in [11] and [12] is described in (15.2) on a discrete universe $U = \{x_j \mid j = 1, 2, \cdots, m\}$:

$$S(A, B) = \frac{\sum_{j=1}^{m} [u_A(x_j) \wedge u_B(x_j)]}{\sum_{j=1}^{m} [u_A(x_j) \vee u_B(x_j)]} \qquad (15.2)$$

where $\wedge$ and $\vee$ are the minimum and maximum operators, respectively.

15.2.2 Consistency

Another important issue about interpretability is the consistency among fuzzy rules and the consistency with a priori knowledge. Consistency among fuzzy rules means that if two or more rules are simultaneously fired, then their conclusions should be coherent [7, 5]. The consistency with a priori knowledge means that the fuzzy rules generated from data should not be in conflict with the expert knowledge or heuristics. A definition of consistency and its calculation method is given in [7]. Another important factor about consistency is the inclusion relation [27]. If the antecedents of two fuzzy rules are compatible with an input vector and the antecedents of one rule are included in those of the other rule, the former rule should have a larger weight than the latter to calculate the output value. In [18], we proposed a method to calculate the updated rule weight to represent the inclusion relation.

15.2.3 Utility

Even if the partitioning of fuzzy variables is complete and distinguishable, it is not guaranteed that each of the fuzzy sets be used by at least one rule. We use the term utility to describe such cases. If a fuzzy system is of sufficient utility, then all of the fuzzy sets are utilized as antecedents or consequents by fuzzy rules. A fuzzy system of insufficient utility, in contrast, indicates that there exists at least one fuzzy set that is not utilized by any of the rules. Then the unused fuzzy sets should be removed from the rule base, resulting in a more compact fuzzy system.

15.2.4 Compactness

A compact fuzzy system means that it has the minimal number of fuzzy sets and fuzzy rules. In addition, the number of fuzzy variables is also worth being considered.

15.2.5 Overlapping

From human heuristics, for any input value, the number of maximum membership values (i.e., 1) should not be greater than one. In other words, the regions of maximum membership value of fuzzy sets should not overlap. We use the following two figures to illustrate the overlapping concept. Figure 15.1 shows an appropriate distribution between fuzzy sets A and B because the regions of the maximum membership value of A and B do not overlap, whereas in Fig. 15.2 such regions overlap each other. So regarding the input value 3.5 in Fig. 15.2, the memberships of A and B are the same and equal to 1, i.e., 3.5 completely belongs to both of these two fuzzy sets. This causes a bad interpretability of fuzzy systems, e.g., 3.5 is both completely big and completely small. Thus, the overlapping should be avoided when we build fuzzy systems, especially when the fuzzy system is constructed from sampling data.

Fig. 15.1. Appropriate distribution about overlapping

Fig. 15.2. Inappropriate distribution about overlapping

15.2.6 Relativity

Relativity is associated with the relative bounds of definition domain of two fuzzy sets. We will discuss the relativity issue with the aid of trapezoidal membership functions because other types of membership functions, such as the triangular and Gaussian membership functions, can also be easily transformed to the trapezoidal type. Let the parameter vector $[a_1, a_2, a_3, a_4]$ represent the membership function parameters of fuzzy set A and $[b_1, b_2, b_3, b_4]$ those of B, where $a_1$ is the lower bound of the definition domain of A, $a_2$ is the left center, $a_3$ is the right center and $a_4$ is the upper bound. The relative position of A and B is shown in Fig. 15.3. In this case, it is very hard to interpret the input values in the ranges $[b_1, c_1]$ and $[c_2, a_4]$, where $c_1$ and $c_2$ are the abscissas at which A and B intersect. For instance, A and B are assigned the linguistic terms small and big, and considering the input value 2, the membership degree of small is lower than that of big. Obviously, this interpreted result is in contradiction with the common sense. Therefore, we propose the following condition to guarantee a better interpretability for relativity (Fig. 15.4):

$$\text{If } a_3 < b_2, \text{ then } a_1 \le b_1 \text{ and } a_4 \le b_4 \qquad (15.3)$$

Fig. 15.3. Inappropriate distribution about relativity
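The measures of this section that are stated as formulas lend themselves to a short numerical illustration. The following Python sketch evaluates the discrete similarity (15.2), the overlapping test of Section 15.2.5 and the relativity condition (15.3) for trapezoidal fuzzy sets. The function names and the example parameter vectors are assumptions chosen for illustration and are not taken from the figures.

```python
def trapezoid(x, p):
    """Membership of x in a trapezoidal set p = (p1, p2, p3, p4), p1 <= p2 <= p3 <= p4."""
    p1, p2, p3, p4 = p
    if p2 <= x <= p3:
        return 1.0
    if p1 < x < p2:
        return (x - p1) / (p2 - p1)
    if p3 < x < p4:
        return (p4 - x) / (p4 - p3)
    return 0.0


def similarity(a, b, universe):
    """Eq. (15.2): sum of pointwise minima divided by sum of pointwise maxima."""
    num = sum(min(trapezoid(x, a), trapezoid(x, b)) for x in universe)
    den = sum(max(trapezoid(x, a), trapezoid(x, b)) for x in universe)
    return num / den if den > 0 else 0.0


def cores_overlap(a, b):
    """True if the regions of maximum membership [p2, p3] of a and b intersect."""
    return a[1] <= b[2] and b[1] <= a[2]


def relativity_ok(a, b):
    """Condition (15.3): if a3 < b2 then a1 <= b1 and a4 <= b4."""
    if a[2] < b[1]:
        return a[0] <= b[0] and a[3] <= b[3]
    return True


universe = [0.5 * j for j in range(21)]  # discrete universe 0, 0.5, ..., 10

small = (0, 1, 3, 5)
big = (4, 6, 8, 10)
print(similarity(small, big, universe))              # small value: well distinguishable
print(cores_overlap(small, big))                     # False
print(relativity_ok(small, big))                     # True

# Two sets whose cores both contain 3.5 (inappropriate overlapping).
print(cores_overlap((0, 2, 4, 6), (2, 3.5, 5, 7)))   # True

# "big" starting below "small" violates condition (15.3).
print(relativity_ok((2, 3, 4, 5), (0, 6, 8, 10)))    # False
```

A regulation step could use checks of this kind to merge sets whose similarity is too high or to reject distributions that violate the overlapping and relativity requirements; how the chapter's own regulation method does this is not shown here.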

Fig. 15.4. Appropriate distribution about relativity

15.3 Agent-Based Multi-Objective Approach

As we mentioned earlier, when fuzzy systems are extracted using traditional methods, the interpretability will very likely be lost. First, the numbers of fuzzy sets and rules are usually larger than necessary. Second, the distribution of fuzzy sets is inappropriate. Both of these factors make it hard to interpret the resulting fuzzy systems. Worse, it is almost impossible to have a good understanding of an unknown complex system, so giving the linguistic terms for each fuzzy variable in advance becomes very hard. The generation of fuzzy systems from training data can be described as finding a small number of combinations of fuzzy sets, each of which is used as the antecedent or consequent part of a fuzzy rule. While this task seems simple at first glance, it is in fact very difficult, especially for high-dimensional problems, because the number of possible combinations of linguistic values increases exponentially, not to mention extracting interpretable fuzzy rule bases from the combinations of fuzzy sets. To address this problem, we propose an agent-based multi-objective approach to construct fuzzy systems from training data.

Agents are computational entities capable of autonomously achieving goals by executing the needed actions. Although there seems to be little general agreement on a precise definition of what constitutes an intelligent agent, four common agent characteristics have been proposed in [35]: situatedness, autonomy, adaptability, and sociability. It turns out to be particularly difficult to build a multi-agent system in which the agents do what they should do in the light of the aforementioned characteristics. Several challenging and intertwined issues have been proposed in [36]-[39].

In our agent-based framework (Fig. 15.5), there are two kinds of agents in the multi-agent system: the Arbitrator Agent (AA) and the Fuzzy Set Agent (FSA). The FSAs obtain information from the AA, where the information is expressed in terms of training data. Thus, the situatedness of agents is realized. According to the available training data, the FSA can autonomously determine its own fuzzy set information, such as the number and distribution of fuzzy sets, with the aid of the hierarchical chromosome formulation and the interpretability-based regulation method, and then extract fuzzy rules according to the available fuzzy sets using the Pittsburgh-style approach [40]. So the autonomy and adaptability of agents are achieved. As far as the social behavior is concerned, the FSA is able to cooperate and compete with other FSAs. After the self-evolving of the FSAs, the FSAs cooperatively exchange their fuzzy set information by way of crossover and mutation of the hierarchical chromosome, and generate offspring FSAs. In the proposed approach, the AA uses the NSGA-II [41] algorithm to evaluate the FSAs and to judge which FSAs should survive and be kept for the next population, whereas the obsolete agents die.

Fig. 15.5. Framework of the agent-based approach. [Figure: a multi-agent environment in which the Arbitrator Agent exchanges information with several Fuzzy Set Agents. Each Fuzzy Set Agent applies a fuzzy sets distribution strategy, an interpretability-based regulation strategy and a fuzzy rules generation strategy; the interaction among Fuzzy Set Agents uses an agents recombination strategy.]
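NSGA-II ranks candidates by Pareto dominance before applying its diversity measure. The sketch below illustrates only that first ingredient: it extracts the non-dominated candidates from a list of objective vectors, all to be minimized. It is not the chapter's implementation, and the two objectives used in the example (classification error and number of rules) as well as the numbers are hypothetical. Full NSGA-II additionally sorts the remaining candidates into successive fronts and uses crowding distance, which is omitted here.

```python
from typing import List, Sequence


def dominates(p: Sequence[float], q: Sequence[float]) -> bool:
    """p dominates q if p is no worse in every objective and better in at least one."""
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))


def first_front(scores: List[Sequence[float]]) -> List[int]:
    """Indices of the candidates that no other candidate dominates."""
    return [
        i
        for i, p in enumerate(scores)
        if not any(dominates(q, p) for j, q in enumerate(scores) if j != i)
    ]


# Hypothetical (classification error, number of rules) pairs, one per fuzzy set agent.
scores = [(0.10, 12), (0.15, 5), (0.12, 12), (0.30, 3), (0.15, 7)]
print(first_front(scores))
```

Candidates outside the first front are the ones an arbitrator would consider discarding first when the population has to shrink.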

In [18], the agent-based approach is mainly applied to generate interpretable fuzzy systems for function approximation problems, and the simulation results have demonstrated that the proposed approach is very effective for studying the trade-off between the accuracy and interpretability of fuzzy systems. In this work, we update the interpretability-based regulation strategy according to the newly proposed interpretability issues and apply the agent-based approach to classification problems.

15.3.1 Intra Behavior of Fuzzy Set Agents

In the proposed approach, the fuzzy set agents employ the fuzzy sets distribution strategy, the interpretability-based regulation strategy, and the fuzzy rules generation strategy to generate accurate and interpretable fuzzy systems. The details of these strategies are discussed below.

Fuzzy Sets Distribution Strategy

The hierarchical chromosome formulation is introduced in [31]-[33], where the genes of the chromosome are classified into two types: control genes and parameter genes. The control genes are assigned 1 or 0 to manipulate the corresponding parameter genes, where 1 represents that the corresponding parameter gene is activated to join the evolutionary process, and 0 means it is deactivated. This chromosome formulation enables the number and the distribution of fuzzy sets to be optimized simultaneously.

We apply the Gaussian combinational membership function (abbreviated as Gauss2mf), i.e., a combination of two Gaussian functions, to describe the antecedent fuzzy sets. The Gauss2mf depends on four parameters $\sigma_1$, $c_1$, $\sigma_2$ and $c_2$ as given by:

$$
\mu(x; \sigma_1, c_1, \sigma_2, c_2) =
\begin{cases}
\exp\left(\dfrac{-(x - c_1)^2}{2\sigma_1^2}\right) & : x < c_1 \\
1 & : c_1 \le x \le c_2 \\
\exp\left(\dfrac{-(x - c_2)^2}{2\sigma_2^2}\right) & : c_2 < x
\end{cases}
\tag{15.4}
$$

where $\sigma_1$ and $c_1$ determine the shape of the leftmost curve, and the shape of the rightmost curve is specified by $\sigma_2$ and $c_2$. The Gauss2mf is a smooth membership function, so the resulting model will in general have a high accuracy in fitting the training data. Another characteristic of the Gauss2mf is that the completeness of the fuzzy system is guaranteed because the Gauss2mf covers the universe sufficiently.

For each fuzzy variable $x_i$, we determine the possible maximal number of fuzzy sets $M_i$ so that it can sufficiently represent this fuzzy variable. For N-dimensional problems, there are in total $M_1 + M_2 + \cdots + M_N$ possible fuzzy sets, so there are $M_1 + M_2 + \cdots + M_N$ control genes coded as 0 or 1. When 1 is assigned, the associated parameter gene is activated; otherwise the parameter gene is deactivated. The parameter vector $[\sigma_1, c_1, \sigma_2, c_2]$ is used to represent one parameter gene. An example of the relationship between control genes and parameter genes is given in Fig. 15.6.
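To make (15.4) and the role of the control genes concrete, the following sketch evaluates the Gauss2mf at a point and decodes a chromosome by keeping only the parameter genes whose control gene is 1. The function names and the flat list layout of the chromosome are assumptions made for illustration; the chapter does not prescribe a data structure.

```python
import math
from typing import List, Sequence, Tuple

# One parameter gene: (sigma1, c1, sigma2, c2)
ParameterGene = Tuple[float, float, float, float]


def gauss2mf(x: float, sigma1: float, c1: float, sigma2: float, c2: float) -> float:
    """Two-sided Gaussian membership function of (15.4); assumes c1 <= c2."""
    if x < c1:
        return math.exp(-((x - c1) ** 2) / (2.0 * sigma1 ** 2))
    if x > c2:
        return math.exp(-((x - c2) ** 2) / (2.0 * sigma2 ** 2))
    return 1.0  # plateau between the two centers


def active_fuzzy_sets(
    control_genes: Sequence[int], parameter_genes: Sequence[ParameterGene]
) -> List[ParameterGene]:
    """Keep the parameter genes whose control gene equals 1."""
    return [gene for flag, gene in zip(control_genes, parameter_genes) if flag == 1]


# Hypothetical chromosome for one variable with M_i = 3 candidate fuzzy sets.
genes = [(0.5, 1.0, 0.5, 2.0), (0.4, 3.0, 0.4, 3.5), (0.6, 5.0, 0.6, 6.0)]
selected = active_fuzzy_sets([1, 0, 1], genes)
print(len(selected), gauss2mf(1.5, *selected[0]))
```

Because a deactivated gene stays in the chromosome, a later mutation of its control gene can bring the fuzzy set back without re-creating its parameters.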

Fig. 15.6. Example of control genes and parameter genes

The initializations of control genes and parameter genes are given as follows.

Initialization of Control Genes

The fuzzy set agents initialize control genes using a uniformly random number in the interval [0, 1]. For example, let rand be a uniformly random number in [0, 1]. If rand > 0.5, then the target control gene is assigned 1; otherwise, 0 is assigned.

Initialization of Parameter Genes

As far as the parameter genes are concerned, the initialization step is a little more complicated. In this work, we define a fuzzy set A using the membership function $u_A(x; a_1, a_2, a_3, a_4)$, where $a_1$, $a_2$, $a_3$, and $a_4$ are the lower bound, left center, right center and upper bound of the definition domain, respectively ($a_1 \le a_2 \le a_3 \le a_4$). However, we use the Gauss2mf in (15.4) as the membership function, so it is not as easy to obtain $a_1$ and $a_4$ as it is for the trapezoidal type of membership function. We need to calculate $a_1$ and $a_4$ using a very small number $\varepsilon$ (e.g., 0.001) which is regarded as equal to zero:

$$
u_A(a_1; a_1, a_2, a_3, a_4) = u_A(a_4; a_1, a_2, a_3, a_4) = \varepsilon.
$$

The relationship between the parameter list $[\sigma_1, c_1, \sigma_2, c_2]$ and $[a_1, a_2, a_3, a_4]$ is described as:

$$
\sigma_1 = \sqrt{-\frac{(a_1 - c_1)^2}{2 \ln \varepsilon}} \tag{15.5}
$$

$$
c_1 = a_2 \tag{15.6}
$$

$$
\sigma_2 = \sqrt{-\frac{(a_4 - c_2)^2}{2 \ln \varepsilon}} \tag{15.7}
$$

$$
c_2 = a_3 \tag{15.8}
$$
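The control gene rule and the relations (15.5) to (15.8) can be sketched as follows. The function names are hypothetical, and the conversion assumes $a_1 < a_2$ and $a_3 < a_4$ so that both widths are positive. Since $\ln \varepsilon < 0$ for $\varepsilon < 1$, the quantity under the square root is positive, and $\sigma_1$ reduces to $(a_2 - a_1) / \sqrt{-2 \ln \varepsilon}$.

```python
import math
import random
from typing import Tuple

EPSILON = 0.001  # small membership value regarded as zero, as in the text


def init_control_gene(rng: random.Random) -> int:
    """Return 1 when a uniform draw from [0, 1) exceeds 0.5, otherwise 0."""
    return 1 if rng.random() > 0.5 else 0


def trapezoid_to_gauss2mf(
    a1: float, a2: float, a3: float, a4: float, eps: float = EPSILON
) -> Tuple[float, float, float, float]:
    """Map [a1, a2, a3, a4] to (sigma1, c1, sigma2, c2) using (15.5)-(15.8)."""
    scale = math.sqrt(-2.0 * math.log(eps))  # positive because ln(eps) < 0
    sigma1 = (a2 - a1) / scale  # (15.5) with c1 = a2
    sigma2 = (a4 - a3) / scale  # (15.7) with c2 = a3
    return sigma1, a2, sigma2, a3


sigma1, c1, sigma2, c2 = trapezoid_to_gauss2mf(0.0, 1.0, 2.0, 4.0)
# Membership at the lower bound a1 = 0 should come out as EPSILON.
left_tail = math.exp(-((0.0 - c1) ** 2) / (2.0 * sigma1 ** 2))
print(sigma1, sigma2, left_tail)
```

Substituting the result back into (15.4) at $a_1$ and $a_4$ is a quick way to confirm that the membership there equals $\varepsilon$.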

Suppose A belongs to the variable x with the lower bound lb and upper bound ub. First we initialize the parameters $c_1$ and $c_2$ as:

$$
c_1 = \min\{candi_1, candi_2\} \tag{15.9}
$$

$$
c_2 = \max\{candi_1, candi_2\} \tag{15.10}
$$

in order to guarantee the relation:

$$
c_1 \le c_2 \tag{15.11}
$$

where

$$
candi_1 = (ub - lb) \times rand_1 + lb \tag{15.12}
$$

$$
candi_2 = (ub - lb) \times rand_2 + lb \tag{15.13}
$$

and $rand_1$ and $rand_2$ are two random numbers in [0, 1]. In addition, we also guarantee that $c_1$ is not greater than ub and $c_2$ is not less than lb:

$$
c_1 = \min(c_1, ub) \tag{15.14}
$$

$$
c_2 = \max(c_2, lb) \tag{15.15}
$$

In the next step, the parameters $\sigma_1$ and $\sigma_2$ are derived as:

$$
\sigma_1 = \sqrt{-\frac{[(ub - lb) \times rand_3]^2}{2 \ln \varepsilon}} \tag{15.16}
$$

$$
\sigma_2 = \sqrt{-\frac{[(ub - lb) \times rand_4]^2}{2 \ln \varepsilon}} \tag{15.17}
$$

where $rand_3$ and $rand_4$ are two random numbers in the interval [0, 1].

Interpretability-Based Regulation Strategy

After the fuzzy sets distribution strategy, the interpretability-based regulation strategy is applied to the initial fuzzy sets to obtain a better distribution with respect to interpretability. This strategy includes the following actions.

Merging Similar Fuzzy Sets

A method to calculate the similarity measure between two fuzzy sets is given in (15.2). If the similarity value is greater than a given threshold, then we merge these two fuzzy sets to generate a new one. Considering two fuzzy sets A and B with the membership functions $u_A(x; a_1, a_2, a_3, a_4)$ and $u_B(x; b_1, b_2, b_3, b_4)$, the resulting fuzzy set C with the membership function $u_C(x; c_1, c_2, c_3, c_4)$ is derived from merging A and B by (here $c_1, \ldots, c_4$ denote the bounds and centers of C, not the Gauss2mf centers):

$$
c_1 = \min(a_1, b_1) \tag{15.18}
$$

$$
c_2 = \lambda_2 a_2 + (1 - \lambda_2) b_2 \tag{15.19}
$$

$$
c_3 = \lambda_3 a_3 + (1 - \lambda_3) b_3 \tag{15.20}
$$

e. w