JAMES G. SHANAHAN
Xerox Research Centre Europe (XRCE)
Grenoble Laboratory
6 chemin de Maupertuis
Meylan 38240, France
James.Shanahan@xrce.xerox.com
http://www.xrce.xerox.com/~shanahan/kdbook/
Shanahan, James G.
Soft computing for knowledge discovery: introducing Cartesian granule features /
James Shanahan.
p. cm. - (The Kluwer international series in engineering and computer science; SECS 570)
Includes bibliographical references and index.
ISBN 978-1-4613-6947-9    ISBN 978-1-4615-4335-0 (eBook)
DOI 10.1007/978-1-4615-4335-0
1. Soft computing. 2. Database searching. I. Title. II. Series.
All rights reserved. No part of this publication may be reproduced, stored in a retrieval system or
transmitted in any form or by any means, mechanical, photo-copying, recording, or otherwise,
without the prior written permission of the publisher, Springer Science+Business Media, LLC.
The following webpage has been developed in conjunction with this book:
http://www.xrce.xerox.com/~shanahan/kdbook/. This page provides access to
additional information related to the material presented in this book, various
pedagogical aids, datasets, source code for several algorithms described in this book, an
online bibliography and pointers to other World Wide Web related resources.
To the memory of my parents
Jimmy and Mary
and to my dear friend and mentor
Anca Ralescu.
FOREWORD
Publication of "Soft Computing for Knowledge Discovery: Introducing Cartesian
Granule Features", or KD_CGF for short, is an important event in the development of a
better understanding of human reasoning in an environment of imprecision, uncertainty
and partial truth. It is an important event because KD_CGF is the first and, so far, the
only book to focus on granulation as one of the most basic facets of human cognition, a
facet that plays a pivotal role in knowledge discovery (KD). The author of KD_CGF,
Dr. James Shanahan, has been, and continues to be, in the forefront of research on
computationally-oriented approaches to knowledge discovery, approaches centering on
soft computing rather than on the more traditional methods based on probability theory
and statistics.
Let me elaborate on this point. During the past several years, the ballistic ascent in the
importance of the Internet has propelled knowledge discovery to a position that is far
more important than it had in the past. And yet, much of the armamentarium of
knowledge discovery consists of methods drawn for the most part from probability
theory and statistics. The leitmotif of KD_CGF is that the armamentarium of KD
should be broadened by drawing on the resources of soft computing, which is a
consortium of methodologies centering on fuzzy logic, neurocomputing, evolutionary
computing, probabilistic computing, chaotic computing and machine learning. More
precisely, concepts are captured in terms of Cartesian granule fuzzy sets (where the
underlying granules are represented using fuzzy sets, i.e. f-granular) that are
incorporated into fuzzy logic rules or probabilistic rules. Learning of such models is
achieved using probability theory and genetic programming.
Successes of probability theory have high visibility. So what is the rationale for
moving beyond the confines of traditional probability-based methods?
In a broad sense, granulation involves a decomposition of the whole into parts. More
specifically, granulation of an object A results in a collection of granules of A, with a
granule being a clump of objects (or points) which are drawn together by
indistinguishability, similarity, proximity or functionality.
Modes of information granulation in which the granules are crisp (c-granular) play
important roles in a wide variety of methods, approaches and techniques. Among them
are: interval analysis, quantization, rough set theory, diakoptics, divide and conquer,
Dempster-Shafer theory, machine learning from examples, chunking, qualitative
process theory, qualitative reasoning, decision trees, semantic networks, analog-to-
digital conversion, constraint programming, Prolog, cluster analysis and many others.
Important though it is, crisp information granulation has a major blind spot. More
specifically, it fails to reflect the fact that in much, perhaps most, of human reasoning
and concept formation the granules are fuzzy (f-granular) rather than crisp. In the case
of a human body, for example, the granules are fuzzy in the sense that the boundaries of
the head, neck, arms, legs, etc. are not sharply defined. Furthermore, the granules are
associated with fuzzy attributes, e.g., length, color and texture in the case of hair. In
turn, granule attributes have fuzzy values, e.g., in the case of the fuzzy attribute
length(hair), the fuzzy values might be short, long, very long, etc. The fuzziness of
granules, their attributes and their values is characteristic of the ways in which human
concepts are formed, organized and manipulated. In particular, human perceptions are,
for the most part, f-granular. A point of importance is that f-granularity of perceptions
precludes the possibility of representing their meaning through the use of conventional
methods of knowledge representation.
Fuzzy information granulation has a position of centrality in fuzzy logic. This is the
reason why fuzzy sets and fuzzy logic are treated at length in KD_CGF. But, to
maintain balance, KD_CGF also contains succinct and insightful expositions of
probabilistic computing, evolutionary computing and parts of machine learning theory.
The broad coverage of KD_CGF has the effect of greatly enhancing the capability of
knowledge discovery techniques to come to grips with the complexity of real-world
problems in which decision-relevant information is a mixture of measurements and
perceptions.
Dr. Shanahan's experience in industry has made it possible for him to include in
KD_CGF a chapter dealing with a variety of applications of knowledge discovery tools
based on soft computing and information granulation. The wealth of information
provided in KD_CGF - presented with high expository skill and attention to detail -
makes Dr. Shanahan's book an invaluable resource for anyone who is interested in
applying KD techniques to real-world problems. The author and the publisher deserve
our thanks and congratulations.
Lotfi A. Zadeh
Berkeley, CA
PREFACE
In the age of the Internet, ubiquitous computing and data warehouses, society faces the
challenge of dealing with an ever-increasing data flood. Knowledge discovery is an
area of computer science that attempts to exploit this data flood by uncovering
interesting and useful patterns in these data that permit a computer to perform a task
autonomously or that assist a human to perform a task more successfully or efficiently.
In recent years knowledge discovery has been applied in many fields of business,
engineering and science leading to interesting and useful applications, ranging from
systems that detect fraudulent credit card transactions, to information filtering systems
that learn users' reading preferences, to medical systems that predict the mutagenicity
of chemical compounds. At the same time, there have been important advances in the
theory and algorithms that form the foundation of this field.
The primary goal of this book is to present a self-contained description of the key
theory and algorithms that form the core of knowledge discovery from a soft computing
perspective. Knowledge discovery is inherently interdisciplinary, drawing on concepts
and results from many fields, including artificial intelligence, machine learning, soft
computing, information theory and cognitive science. This book introduces these
concepts, providing a highly readable and systematic exposition of knowledge
representation, machine learning, and the key methodologies that make up the fabric of
soft computing - fuzzy set theory, fuzzy logic, evolutionary computing, and various
theories of probability (point-based approaches such as naïve Bayes and Bayesian
networks, and set-based approaches such as Dempster-Shafer theory and mass
assignment theory).
The approaches presented in this book are further illustrated on a battery of both
artificial and real world problems. Knowledge discovery in real world problems such as
object recognition in outdoor scenes, medical diagnosis and control is described in
detail. These case studies provide a deeper understanding of how to apply the presented
concepts and algorithms to practical problems.
Furthermore, the following webpage has been developed in conjunction with this book:
http://www.xrce.xerox.com/~shanahan/kdbook/. This page provides access to
additional information related to the material presented in this book, pedagogical aids,
datasets, source code for several algorithms described in this book, an online
bibliography and pointers to other World Wide Web related resources.
The book is divided into five main parts and an appendix:
• Part I provides a general introduction to the subject of knowledge discovery.
• Part II introduces the key components of knowledge representation and describes
various soft computing approaches to knowledge representation.
• Part III introduces the basic architecture for learning systems and details many
popular learning algorithms.
• Part IV proposes a new soft computing approach to knowledge discovery based on
Cartesian granule features and corresponding learning algorithms.
• Part V describes applications and comparisons of this approach on both artificial
and real world problems, concluding with some views on what the future may hold
for knowledge discovery in general and for Cartesian granule features in particular.
• The Appendix gives an overview of evolutionary computation.
Target Audience
Because of the interdisciplinary nature of the material, this book makes few
assumptions about the background of the reader. Instead, it introduces basic concepts
from artificial intelligence, probability theory, fuzzy set theory, fuzzy logic, machine
learning, and other disciplines as the need arises, focusing on just those concepts most
relevant for knowledge discovery. The book is intended for advanced undergraduate
and graduate students, as well as a broad audience of professionals and researchers in
computer science, engineering and business information systems who have an interest
in the dynamic fields of knowledge discovery and soft computing.
Acknowledgements
Like all books this too has behind it a story that represents a journey, both physical and
intellectual, which can be traced back to November 1992, when I attended JKAW
(Japanese Knowledge Acquisition Workshop) in Kobe, Japan. As a result of various
discussions at this workshop and ensuing conversations with Anca Ralescu, I became
interested and eventually came to work in soft computing in general, and fuzzy systems
in particular. Ever since, Anca has been a brilliant source of not just inspiration,
encouragement and knowledge but of friendship. I am eternally grateful to her for
providing me with the opportunity to work at LIFE (Laboratory for International Fuzzy
Engineering) in Yokohama, Japan. She has also been an excellent "mentor". Without
her encouragement, in many ways, I wouldn't have started this "journey". Domo
arigato Anca.
A lot of the work presented in this book was accomplished while I was at the
University of Bristol, where I benefited from the inspiration, vision, direction and
enthusiasm provided by Jim Baldwin for which I am deeply grateful. Trevor Martin has
also played a key role in this work, providing much support and direction. Other
members of the crew at the Department of Engineering Mathematics, University of
Bristol also provided much inspiration, explanation and an ambient research
environment, especially Mario DiBernardo, Simon Case, Simeon Earl, Carla Hill,
Martin Homer, Jonathan Lawry, Nigel Mottram, Bruce Pilsworth, Christiane Ponsan,
Jonathan Rossiter, Mehreen Saeed, Athena Tocatlidou and Patrick Woods. This was
paralleled by the crew at the Department of Computer Science, University of Bristol. A
special thanks to Neil Campbell, Angus Clark, Mark Everingham, Dave Gibson, Claire
Kennedy, Katerina Mania, Ann McNamara, Majid Mirmehdi, Jackson Pope, Erik
Reinhard and Barry Thomas.
The work presented in this book was partially funded by the University of Bristol under
a Scholarship Award, by the European Community through a Training and Mobility of
Researchers' grant (Marie Curie Fellowship) and by the DERA (UK) under grant
92W69.
As the founder of fuzzy set theory over thirty years ago, and more recently as the
originator of the concept of soft computing, Professor Lotfi Zadeh continues to be the
source of momentum, direction and inspiration in this highly dynamic field of systems
modelling. His foresight and ingenuity have been proven time and time again over the
years. More recently, his ideas on information granulation, computing with words, and
on computational approaches to perception have inspired most of the work presented in
this book.
Next I thank Professor Toshiro Terano, Hosei University, Japan (former director of
LIFE) who was also instrumental in providing me with the opportunity to work at
LIFE. His sage words and philosophy of science and engineering have inspired and
guided me through my research.
My fellow researchers at the Image Understanding Group (LIFE), and LIFE - "the olde
boys" - and the IU group advisors, including Hirota-sensei (Tokyo Institute of
Technology), Asada-sensei (Osaka University), Minoh-sensei (Kyoto University)
deserve special mention as they provided me with a solid background not only in fuzzy
systems and image understanding but also in doing research in general.
I owe a lot to the people at Mitsubishi, in Tokyo, Akita and Naoshima who provided a
very stimulating environment in knowledge based systems especially Mr. Y. Abe
(CSD), Dr. K. Nishimura (CSD), Mr. Yanagisawa (A.I. Group, MHI), Mr. S. Tsuchino,
Mr. Y. Matsuno, Mr. (Tobi) Ishitobi (all from the Knowledge Industry Centre,
Mitsubishi Materials Corporation).
The writing of this book, like knowledge discovery, has drawn upon the expertise of
many technical experts in the sub-disciplines that make up the field. It became a reality
because of their help. I am deeply indebted to the following people (whom I try to list
in geographical order beginning from Grenoble) who took time out to review chapter
drafts or to provide other technical support:
I am very grateful to the following who have helped in proof reading this book: Clare
Dickinson, Lucy J. Jobson (who both had the privilege of reading the whole book),
Andrew Poulter, Ken Brown, Martin Blackledge, Viktoria Ljungqvist, Samantha Stern
and Wendy Yeo.
I would also like to acknowledge the many useful comments provided by anonymous
referees of workshop, conference and journal papers in which the results synthesised in
this book were first reported.
I thank the instructors and students who have field tested some of the chapters in this
book and who have contributed their suggestions.
On a personal side, I thank (posthumously) my parents, Jimmy and Mary, for providing
me with a loving and supporting family that includes my brothers - The Shanahan Boys
- John, Michael, Timothy, Patrick and Thomas, my sisters-in-law Ann and Eibhlís, my
grandparents, my aunts, my uncles and cousins. A special thanks goes to my
Godmother, Auntie Margaret, and grannie McNamara who have always been there for
me. Viktoria Ljungqvist deserves many thanks for her encouragement and patience. A
special thanks to my friends, far and near, for their unconditional support and
friendship. It would be dangerous to draw up a list.
James G. Shanahan
TABLE OF CONTENTS
FOREWORD ............................................................................................................... IX
PREFACE .................................................................................................................... XI
PART I ........................................................................................................................... 1
The chapter in this part provides a general introduction to the subject of knowledge
discovery (KD). In addition, it briefly describes a new knowledge discovery process
centred on Cartesian granule features and corresponding learning algorithms (an
approach, which integrates various methodologies from soft computing, such as
evolutionary computation, fuzzy set theory, and probability theory). This approach, its
supporting soft computing methodologies, and other popular approaches to knowledge
discovery are presented in detail and compared in the remainder of this book. A
road map of this presentation is provided at the end of Chapter 1.
PART II
KNOWLEDGE REPRESENTATION
Since the introduction of the first operational modern computer (Heath Robinson) in
1940 by Alan Turing's team, scientists and engineers have tried, with varying degrees
of success, to increase its usefulness to mankind through the development of systems
with high MIQ (Machine Intelligence Quotient) [Zadeh 1994b]. This desire to increase
the computers' usefulness to mankind has led to the birth of many computer-related
disciplines. One such discipline is knowledge discovery (KD) whose main emphasis is
on using algorithms that exploit computational power and resources to automatically
discover general properties and principles (knowledge) from historical data (and
background knowledge), that permit a computer to perform a task autonomously or that
assist a human to perform a task more successfully, efficiently or in a more value-added
way.
Since its informal birth in 1989 [Fayad, Piatetsky-Shapiro and Smyth 1996], the field of
knowledge discovery has seen an explosive growth in techniques, applications and
interest. This growth has been driven by the potential that knowledge discovery affords
us as humans, attempting to solve practical problems facing our cyborg society, along
with explaining human learning through "cognitive simulation" [Simon 1983]. For
example, knowledge discovery can contribute in the following ways:
The main goal of this book, after providing a detailed introduction to the key algorithms
and theory that form the core of knowledge discovery from a soft computing
perspective, is to propose a new knowledge discovery process centred on Cartesian
granule features and corresponding learning algorithms. The approach integrates
various methodologies from soft computing, such as evolutionary computation, fuzzy
set theory, and probability theory to address the knowledge discovery criteria outlined
above. This approach is amply illustrated in the context of both benchmark and real
world problems.
This chapter serves as a backdrop against which the rest of the book is developed,
providing an overview of knowledge discovery. It begins with an informal look at the
background and history of knowledge discovery. Section 1.2 formulates the knowledge
discovery process as a multi-step iterative process involving a three-way dialogue
between the domain expert, the knowledge engineer and the computer, in order to
prepare the domain data and background knowledge, to extract the knowledge via
machine learning algorithms, and finally to evaluate and interpret the extracted
knowledge. Section 1.3 reviews some of the successes of knowledge discovery. In
Section 1.4, Cartesian granule features are introduced briefly as a soft computing
approach that overcomes some of the limitations of existing approaches in machine
learning and knowledge discovery. Finally, Section 1.5 describes the organisation of
this book.
Even though the term knowledge discovery, sometimes referred to in the literature as
"knowledge discovery from databases", "advanced data analysis", "data mining" or
simply "machine learning", was only coined in 1989 [Fayad, Piatetsky-Shapiro and
Smyth 1996], the field of knowledge discovery has a long history that derives from its
chief constituent components: knowledge representation, search, feature selection and
discovery, statistics, and machine learning. Each of these components is covered in
detail over the course of this book. To capture the essence of this new field of research
and development the term knowledge discovery was coined. Its motivation was simply
to emphasise the multi-step, inter-disciplinary nature and to broaden the scope and
appeal of knowledge discovery, moving from a process that conducts machine learning
on "perfect" data to a process that exploits alternatives from various fields in order to
deal with the real world of imperfect data in a more effective manner. This resulted in
the confluence of fields, previously disjoint. Since 1989, the field has seen an explosive
growth in techniques, applications and interest. As stated previously, this growth has
been driven by the potential that knowledge discovery affords us as humans, attempting
to solve practical problems facing our cyborg society. Before discussing some of the
Other examples include: the identification of the pattern of use of a credit card to detect
possible fraud; or the detection of a pattern in the documents a user reads, so that when
new documents are published that match this pattern the user is alerted. This definition
is adapted from the various definitions in the knowledge discovery literature [Fayad,
Piatetsky-Shapiro and Smyth 1996; Klosgen and Zytkow 1996]. An alternative
definition is to view knowledge discovery as the process of transforming data (and
background knowledge) into a format (for example, if-then rules) that permits a
computer to perform a task autonomously or that assists a human to perform a task
more successfully or efficiently or in a more value-added way (e.g. decision making or
triggering innovative creativity). Munakata simply defines knowledge discovery as
"computer techniques that, in the broadest sense, automatically find fundamental
properties and principles that are original and useful" [Munakata 1999].
[Figure 1-1 is a scatter plot of NumberOfFreeSpaces against TimeToDestination, with interleaved "p" and "n" points.]
Figure 1-1: Data for the car parking problem, where "p" represents cases where a
customer successfully parked the car and where "n" represents cases where a
customer was unsuccessful.
[Figure 1-2 repeats the scatter plot of Figure 1-1, with a shaded rectangular region bounded by the thresholds T on TimeToDestination and S on NumberOfFreeSpaces.]
Figure 1-2: A possible rule-based model of the car parking problem characterised by
the following rule: "If TimeToDestination < T and NumberOfFreeSpaces > S then
ParkingStatus will be successful OTHERWISE ParkingStatus will be unsuccessful".
The shaded region corresponds to successful parking.
As illustrated in Figure 1-3 the knowledge discovery process is interactive and iterative
involving numerous steps where decisions are made by the knowledge engineer or
experts in the field of application. Some of the basic steps in this process are broadly
outlined below and illustrated for the parking problem presented above:
AffectedStreets, and ParkingStatus. This results in a return to step 3 for the
knowledge engineer and another iteration of the KD process. This iteration
may result in a much more complicated decision tree with an improved
accuracy of 90%. Subsequently, the analysts believe that the discovered model
is too complicated and ask the knowledge engineer to try to come up with a
simpler and more intuitive model. This causes the knowledge engineer to
return to step 2 and select a different form of knowledge representation; say
predicate logic. Consequently, the knowledge engineer has to select a new
induction algorithm, such as the PROGOL algorithm [Muggleton and Buntine
1988]. This iteration may result in a very simple model of the parking problem
with an accuracy of 90%. The resulting model could consist of one rule: if
TimeToDestination < NumberOfFreeSpaces then display "parking is
available" otherwise display "no parking available here, use alternative
parking", as depicted in Figure 1-4. The analysts like the resulting model, and
it is deployed as a means of supporting the downtown drivers.
Each of these steps, as indicated above, will be revisited in detail at various points
throughout the book. The next section shifts focus from an overview of the knowledge
discovery process and illustrative example to a review of some of the real world
applications of knowledge discovery.
[Figure 1-3 is a flow diagram whose steps include data and background knowledge acquisition, selection of knowledge representation, and model interpretation and evaluation, linked by feedback arrows.]
Figure 1-3: An overview of the steps making up the knowledge discovery process.
[Figure 1-4 shows the parking data split by the line TimeToDestination = NumberOfFreeSpaces, with the region corresponding to unsuccessful parking shaded.]
Figure 1-4: A possible rule-based model of the car parking problem characterised by
the rule if TimeToDestination < NumberOfFreeSpaces then display "parking is
available" otherwise display "no parking available here, use alternative parking". The
shaded region corresponds to "no parking available here, use alternative parking".
Knowledge representation and machine learning are two of the most critical
components of knowledge discovery. Both of these have received a lot of attention over
the past decade, leading to the adoption, extension or hybridisation of many traditional
representation schemes and machine learning algorithms such as neural networks,
probabilistic approaches, genetic programming, decision tree induction, inductive logic
programming, rough sets and fuzzy sets in order to deal with the challenging task of
knowledge discovery. The resulting knowledge discovery techniques have led to
practical applications in many areas: within decision support systems applications such
as analysing medical outcomes, detecting credit card fraud, and predicting customer
purchase behaviour; within engineering and manufacturing systems applications such
as autonomous vehicles and process control systems; within game playing applications
such as playing chess at grandmaster level; and within human-computer interaction
applications such as recognising human gestures and user profiling for e-commerce to
mention but a few. For example, Muggleton [Muggleton 1999] illustrates the power of
first order logic techniques for the knowledge discovery of biological functions in
structured domains such as molecular biology, carcinogenicity and pharmacophores.
Fayad et al. [Fayad, Djorgovski and Weir 1996] demonstrate how knowledge discovery
techniques were applied to the classification of celestial objects from the Palomar
Observatory Sky Survey, consisting of terabytes of data. In this case, knowledge was
extracted using decision tree approaches. De Jong [Jong 1999] shows the powers of
evolutionary computation for the discovery of heuristics, tactics and strategies. For
example, in the field of telecommunications he discusses how genetic algorithms were
used to generate alternative network designs that reduce costs by 10 to 20%. Engineers,
by examining the resulting designs, gained some important insights into how these cost
savings were achieved. Glance et al. [Glance, Arregui and Dardenne 1998; Glance, Arregui and
Dardenne 1999] have demonstrated how patterns in user recommendations can be
Even though in recent years many successful knowledge discovery applications have
been developed, as highlighted in the previous section, current approaches to
knowledge discovery suffer from a number of shortcomings such as decomposition
error and transparency. Cartesian granule features and related learning algorithms were
originally introduced to address some of these shortcomings [Baldwin, Martin and
Shanahan 1996; Baldwin, Martin and Shanahan 1997; Shanahan 1998; Shanahan,
Baldwin and Martin 1999]. A Cartesian granule feature is a multidimensional feature,
which is built upon a linguistic partition or discretisation of the base universe. Fuzzy
sets, probability distributions and mass assignments can be naturally and succinctly
expressed in terms of the Cartesian granules (words) that discretise the base universe.
Fuzzy sets are used to represent the granules, thereby overcoming some of the problems
posed by crisp discretisation, such as vulnerability to boundary location, that have
plagued many probabilistic and logic-based approaches to machine learning. For
example, Figure 1-5(a) graphically displays a linguistic partition of the Position
variable, where each word is denoted by a fuzzy set. The variable value of 40 can be
linguistically summarised or described using the Cartesian granule fuzzy set: {Left/0.2 +
Middle/1}. In a similar fashion more general concepts can be summarised. For example,
the concept of car locations in images could be summarised linguistically and
succinctly using the Cartesian granule fuzzy set depicted in Figure 1-5(b) (see Part IV
of this book for more details). This new approach exploits a divide-and-conquer
strategy to representation, capturing knowledge in terms of a rule-based network of
low-order semantically related features - a network of Cartesian granule features.
Cartesian granule features can be incorporated into fuzzy logic rules or probabilistic
rules. Classification, regression and clustering problems can be addressed quite
naturally using Cartesian granule features. Parts IV and V of this book describe
Cartesian granule feature models, corresponding learning algorithms, and the
knowledge discovery of such models in both benchmark and real world problems.
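As a rough illustration of the linguistic summarisation step described above, the sketch below computes the membership of a crisp value in each word of a fuzzy partition. The trapezoidal shapes for Left, Middle and Right are assumptions, chosen so that the value 40 yields {Left/0.2 + Middle/1}, as in the example in the text.

```python
def trapezoid(x, a, b, c, d):
    """Trapezoidal membership function with support (a, d) and core [b, c]."""
    if x <= a or x >= d:
        return 0.0
    if b <= x <= c:
        return 1.0
    return (x - a) / (b - a) if x < b else (d - x) / (d - c)

# Assumed linguistic partition of the Position universe [0, 100]; the exact
# shapes are not given in the text and are chosen here so that the value 40
# is summarised as {Left/0.2 + Middle/1}.
partition = {
    "Left":   lambda x: trapezoid(x, -1, 0, 0, 50),
    "Middle": lambda x: trapezoid(x, 0, 40, 60, 100),
    "Right":  lambda x: trapezoid(x, 50, 100, 100, 101),
}

def linguistic_summary(x):
    """Describe a crisp value as a fuzzy set over the partition words."""
    return {word: mu(x) for word, mu in partition.items() if mu(x) > 0}

print(linguistic_summary(40))   # {'Left': 0.2, 'Middle': 1.0}
```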
Over the course of this book the extremely challenging problem of knowledge
discovery is addressed using Cartesian granule feature modelling, an example of a soft
computing approach that exploits the powers of genetic programming (in order to
discover a good concept language), fuzzy sets (for concept representation), and
probability theory (for learning concepts and reasoning) in order to achieve systems
with high MIQ. Knowledge representation in terms of Cartesian granule features is an
example of exploiting uncertainty, imprecision in this case, in order to achieve
tractability and transparency on the one hand and generalisation on the other.
[Figure 1-5(a) shows a fuzzy partition of the Position universe [0, 100] into the words Left, Middle and Right; (b) shows a concept Cartesian granule fuzzy set over these words.]
Figure 1-5: Concept descriptions in terms of a Cartesian granule fuzzy set: (a)
linguistic partition of the universe of position; (b) a concept Cartesian granule fuzzy
set.
This book provides a self-contained description of the theory and algorithms that form
the core of knowledge discovery from a soft computing perspective. The sections above
have presented a general introduction to knowledge discovery and its applications. This
has set the stage for the rest of the book, which provides a highly readable and
systematic exposition of knowledge representation, machine learning, and the key
methodologies that make up the fabric of soft computing - fuzzy set theory, fuzzy
logic, evolutionary computing, and various theories of probability (point-based
approaches such as naïve Bayes and Bayesian networks, and set-based approaches such
as Dempster-Shafer theory and mass assignment theory). Along with describing well
known approaches, Cartesian granule features and corresponding learning algorithms
are also introduced as a new and intuitive approach to knowledge discovery. This new
approach embraces the synergistic spirit of soft computing, exploiting uncertainty,
imprecision in this case, in order to achieve tractability and transparency on the one
hand and generalisation on the other. In doing so it addresses some of the shortcomings
of existing approaches such as decomposition error and performance-related issues
such as transparency, accuracy and efficiency. Parallels are drawn between this
approach and other well known approaches (such as naïve Bayes, decision trees)
leading to equivalences under certain conditions.
The remainder of this book is divided into four main parts and an appendix: Part II
introduces the key components of knowledge representation and outlines the desiderata
of knowledge representation, along with describing the key algorithms and theory of
various soft computing approaches to knowledge representation (in tutorial style). Part
III introduces the basic architecture for learning systems and its components and details
many popular learning algorithms. Part IV proposes a new soft computing approach to
knowledge discovery based on Cartesian granule features. Applications and
comparisons of this new approach in the context of both artificial and real world
problems are described in Part V.
1.6 SUMMARY
1.7 BIBLIOGRAPHY
" ... when one learns to categorize a subset of events in a certain way, one is
doing more than simply learning to recognize instances encountered. One is
also learning a rule that may be applied to new instances. The concept or
category is basically, this "rule of grouping" and it is such rules that one
constructs in forming and attaining concepts."
The notion of using a rule as an abstract representation of concepts in the human mind,
as mentioned in this excerpt, has since been questioned by many and has created a lot
of debate [Hayes 1999; Holland et al. 1986; Sammut 1993]. Numerous other models of
how humans store and represent knowledge have been examined and proposed but to
date, few have proven adequate [Hayes 1999]. This is somewhat paralleled within
knowledge discovery, where there exists a wide range of possible representations with
no "panacean" approach. From a knowledge discovery perspective, the type of KR
selected determines the nature of learning: the type of learning; what can
be learned; when it can be learned (one-shot or incremental); how long it takes to learn;
and the type of experience required to learn.
This chapter begins by introducing the key parts of knowledge representation: the
observation language, the hypothesis language and the general purpose inference and
decision making mechanisms. The intimate relationship between uncertainty and
knowledge representation is then described. Subsequently, the desiderata of knowledge
representation are presented, paying particular attention to how they affect knowledge
discovery. A taxonomy of knowledge representation approaches, commonly used
within knowledge discovery, is then presented and discussed with respect to these
desiderata.
The nature of knowledge in knowledge-based systems can be split into two broad
categories: specific (domain) and general knowledge. The specific knowledge refers to
the environment and its interpretations such as the observations and the induced
models, whereas the general knowledge refers to the inference and decision making
mechanisms used, which are generally the same across all problem domains. This
section provides a description of the knowledge representation components used to
represent specific knowledge, while the next section introduces general purpose
knowledge.
Decision making is performed on the results of inference and can take many forms. For
example, in a Bayesian classifier decision making could involve taking the class
associated with the maximum posterior probability as the classification of the input
data. Fuzzy inference for predictive problem domains (i.e. continuous valued outputs)
is a type of forward chaining, where activated rules contribute collectively (a process
known as defuzzification) to the point valued solution, i.e. decision making reduces to
selecting a single output value from a set of possible values.
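As a small illustration of these two decision-making styles, the hedged sketch below picks the class with the maximum posterior probability for classification, and computes a centroid defuzzification for a point-valued output. All of the numbers are invented for illustration.

```python
# Classification: choose the class with the maximum posterior probability.
posteriors = {"successful": 0.73, "unsuccessful": 0.27}   # illustrative values
decision = max(posteriors, key=posteriors.get)
print(decision)   # "successful"

# Prediction: centroid defuzzification collapses the collective rule output
# (here a sampled membership function over the output universe) to one point.
xs  = [0, 10, 20, 30, 40]        # sampled output universe (illustrative)
mus = [0.0, 0.2, 1.0, 0.4, 0.0]  # aggregated membership at each sample
centroid = sum(x * m for x, m in zip(xs, mus)) / sum(mus)
print(centroid)   # 21.25, the membership-weighted average
```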
So far as the laws of mathematics refer to reality, they are not certain. And so
far as they are certain, they do not refer to reality.
satisfactory manner. Consider the parking example. A crisp decision tree will require a
lot of leaf nodes to model this problem, whereas using fuzzy sets (a form of
imprecision) to partition each universe will lead to a more transparent model (less
bushy decision tree) with satisfactory performance for this problem. This model,
though not a perfect model of reality (as measured in terms of its accuracy on a test
dataset), may provide satisfactory performance as measured in terms of decreased
downtown pollution. In some cases, uncertainty may increase the generalisation power
of a learnt system (see Chapters 3 and 9). In addition, explicitly managing uncertainty
can increase model transparency and user confidence or credibility in the model.
Fuelled by reality (according to Einstein) and the possibilities that uncertainty can
afford in problem solving, researchers have introduced many new theories of
uncertainty such as fuzzy set theory [Zadeh 1965], Dempster-Shafer theory [Dempster
1967; Shafer 1976], possibility theory [Zadeh 1978], and nonmonotonic logic [Bobrow
1980; McDermott and Doyle 1980]. These new theories have led to many successful
applications in fields where approaches that do not cater explicitly for uncertainty have
failed; these fields include speech recognition [Rabiner 1989] and control [Ralescu and
Hartani 1995; Terano, Asai and Sugeno 1992; Yen and Langari 1998]. Applications in
these fields and in other domains have demonstrated not only the power of uncertainty
but also the necessity of uncertainty for model representation and model learning. The
remaining chapters in this part of the book describe different types of uncertainty and
related approaches; in particular stochastic uncertainty, imprecision, ignorance and
inconsistency. Furthermore, Part IV introduces a new form of knowledge
representation, Cartesian granule features, that exploits uncertainty, in the form of
imprecision, in order to provide more succinct and possibly more natural descriptions
of systems. In addition, imprecision provides improved generalisation when learning
such systems (see Chapter 10 for more details).
In the previous sections it was shown how knowledge representation clearly has a big
influence on the knowledge discovery process. The remainder of this chapter presents a
taxonomy of the commonly used approaches to knowledge representation within the
field of knowledge discovery:
• symbolic;
• probabilistic;
• fuzzy sets and logic;
• mathematical;
• prototype.
case of Bayesian networks, knowledge can be organised into an intuitive and relatively
transparent directed graph structure (sometimes known as a knowledge map), where
each node represents a variable and a probability distribution, and the directed arcs
represent causal relationships between variables. Inference in probabilistic approaches
relies generally on the conditioning operation (such as Bayes' Rule [Bayes 1763]) or
belief revision, which both update existing probabilities given evidence (a numeric
sketch of this conditioning step is given below). Decision
making is based, in general, on choosing a hypothesis that has a maximum (posterior)
probability or utility. A full description of probabilistic approaches to KR is given in
Chapter 5. The advantages of these approaches to KR include:
• Some approaches may suffer from decomposition error when there are
dependencies among problem domain variables. Examples of this type of
problem are presented in Chapter 10.
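To ground the conditioning operation mentioned above, here is a minimal sketch of Bayes' rule updating a prior given evidence. All of the numbers are invented for illustration, loosely themed on the credit-card fraud example from Chapter 1.

```python
# Bayes' rule: P(H | E) = P(E | H) * P(H) / P(E), where P(E) is obtained by
# marginalising over the hypotheses. All probabilities are illustrative.
prior = {"fraud": 0.01, "legitimate": 0.99}
likelihood = {"fraud": 0.90, "legitimate": 0.05}   # P(unusual purchase | H)

evidence = sum(likelihood[h] * prior[h] for h in prior)
posterior = {h: likelihood[h] * prior[h] / evidence for h in prior}
print(posterior)   # {'fraud': 0.1538..., 'legitimate': 0.8461...}
```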
Inference, in general for these approaches, is based upon a similarity measure and
decision making is based upon a nearest neighbour strategy (i.e. the classification of an
unlabelled case is the class of the most similar neighbour in memory). This simple scheme
works well [Langley 1996] and is tolerant to some noise in the data. It can learn from
sparse data and is amenable to updating and extension. Disadvantages of this approach
include:
2.6 SUMMARY
This chapter has introduced the key components of knowledge representation: the
observation language, the hypothesis language and the general purpose inference and
decision making mechanisms. Some desiderata of knowledge representation were
outlined and their effect on knowledge discovery discussed. A taxonomy of KR
approaches commonly used within knowledge discovery was presented and the
constituent categories discussed with respect to the KR desiderata. The remaining
chapters of this part of the book present in greater detail some of the soft computing
approaches to knowledge representation: fuzzy set theory; fuzzy logic; point-based
probability theories; interval-based probability theories. Part IV of the book introduces
a new form of KR based on Cartesian granule features, and associated induction
algorithms.
2.7 BIBLIOGRAPHY
Hertz, J., Anders, K., and Palmer, R. G. (1991). Introduction to the Theory of Neural
Computation. Addison-Wesley, New York.
Holland, J. H., Holyoak, K. J., Nisbett, R. E., and Thagard, P. R. (1986). Induction:
Process of Inference, Learning, and Discovery. MIT Press, Cambridge, Mass.,
USA.
Jolliffe, I. T. (1986). Principal Component Analysis. Springer, New York.
King, R. D., Lewis, R. A., Muggleton, S. H., and Sternberg, M. J. E. (1992). "Drug
design by machine learning: the use of inductive logic programming to model
the structure-activity relationship of trimethoprim analogues binding to
dihydrofolate reductase", Proceedings of the National Academy of Sciences, 89.
Kohonen, T. (1984). Self-Organisation and Associative Memory. Springer-Verlag,
Berlin.
Langley, P. (1996). Elements of Machine Learning. Morgan Kaufmann, San Francisco,
CA, USA.
McDermott, D., and Doyle, J. (1980). "Non-monotonic logic I", Artificial Intelligence,
13(1-2):41-72.
Michie, D., Spiegelhalter, D. J., and Taylor, C. C., eds. (1993). "Machine Learning,
Neural and Statistical Classification", Ellis Horwood, New York, USA.
Minsky, M., and Papert, S. (1969). Perceptrons: An introduction to computational
geometry. MIT Press, Cambridge, MA.
Muggleton, S. (1999). "Scientific knowledge discovery using inductive logic
programming", Communications of the ACM, 42(11):43-46.
Muggleton, S., and Buntine, W. (1988). "Machine invention of first order predicates by
inverting resolution." In the proceedings of Fifth International Conference on
Machine Learning, Ann Arbor, MI, USA, 339-352.
Murthy, S. K., Kasif, S., and Salzberg, S. (1994). "A system for induction of oblique
decision trees", Journal of Artificial Intelligence Research, 2:1-33.
Pearl, J. (1988). Probabilistic reasoning in intelligent systems: networks of plausible
inference. Morgan Kaufmann, San Mateo.
Quinlan, J. R. (1986). "Induction of Decision Trees", Machine Learning, 1(1):81-106.
Quinlan, J. R. (1990). "Learning logical definitions from relations", Machine Learning,
5(3):239-266.
Rabiner, L. R. (1989). "A tutorial on hidden Markov models and selected applications
in speech recognition", Proceedings of the IEEE, 77(2):257-286.
Ralescu, A. L., and Hartani, R. (1995). "Some issues in fuzzy and linguistic
modelling." In the proceedings of Workshop on Linguistic Modelling, FUZZ-
IEEE, Yokohama, Japan.
Ralescu, A. L., and Shanahan, J. G. (1999). "Fuzzy perceptual organisation of image
structures", Pattern Recognition, 32:1923-1933.
Ruspini, E. H. (1969). "A New Approach to Clustering", Information and Control, 15(1):22-32.
Sammut, C. (1993). "Knowledge Representation", In Machine Learning, Neural and
Statistical Classification, D. Michie, D. J. Spiegelhalter, and C. C. Taylor,
eds., 228-245.
Shafer, G. (1976). A Mathematical Theory of Evidence. Princeton University Press.
Shanahan, J. G. (1998). "Cartesian Granule Features: Knowledge Discovery of
Additive Models for Classification and Prediction", PhD Thesis, Dept. of
Engineering Mathematics, University of Bristol, Bristol, UK.
Sugeno, M., and Yasukawa, T. (1993). "A Fuzzy Logic Based Approach to Qualitative
Modelling", IEEE TrailS Oil Fuzzy Systems, I( I): 7-31.
Terano, T" Asai, K., and Sugeno, M. (1992). Applied fuzzy systems. Academic Press,
New York.
Vapnik, V. (1995). The nature of statistical learning theory. Springer-Verlag, Berlin.
Yager, R. R. (1994). "Generation of Fuzzy Rules by Mountain Clustering", Journal of
Intelligent and Fuzzy Systems, 2:209-219.
Yen, J., and Langari, R. (1998). Fuzzy logic: intelligence, control and information.
Prentice Hall, London.
Zadeh, L. A. (1965). "Fuzzy Sets", Journal of Information and Control, 8:338-353.
Zadeh, L. A. (1978). "Fuzzy Sets as a Basis for a Theory of Possibility", Fuzzy Sets and
Systems, 1:3-28.
Zadeh, L. A. (1999). "Some reflections on the relationship between AI and fuzzy logic
(FL) - a heretical view", In Fuzzy Logic in AI (Selected and invited papers from
IJCAI workshop, 1997, Nagoya, Japan), A. L. Ralescu and J. G. Shanahan,
eds., Springer, Tokyo, 1-8.
CHAPTER 3
FUZZY SET THEORY
This chapter presents the fundamental ideas behind fuzzy set theory. It begins with a
review of traditional set theory (commonly referred to as crisp or classical set theory in
the fuzzy set literature) and uses this as a backdrop, against which fuzzy sets and a
variety of operations on fuzzy sets are introduced. Various justifications and
interpretations of fuzzy sets as a form of knowledge granulation are subsequently
presented. Different families of fuzzy set aggregation operators are then examined. The
original notion of a fuzzy set can be generalised in a number of ways leading to more
expressive forms of knowledge representation; the latter part of this chapter presents
some of these generalisations, where the original idea of a fuzzy set is generalised in
terms of its dimensionality, type of membership value and element characterisation.
Finally, fuzzy set elicitation is briefly covered for completeness (Chapter 9 gives a
more detailed coverage of this topic).
The following excerpt provides a very good illustration of the potential use of crisp
sets and their limitation in representing the real world (this limitation will be elaborated
upon in Section 3.2).
"We begin with what seems a paradox. The worLd of experience of qny normaL
man is composed of a tremendous array of discriminabLy different objects,
events, peopLe, impressions... But were we to utilize fully our capacity for
registering the differences in things and to respond to each event encountered
as unique, we wouLd soon be overwheLmed by the compLexity of our
environment... The resoLution of this seeming paradox... is achieved by man's
ability to categorize. To categorize is to render discriminabLy different things
equivaLent, to group objects and events and peopLe around us into classes ... "
[Bruner, Goodnow and Austin 1956]
As noted by Bruner et al. in their landmark work "A study of thinking" [Bruner,
Goodnow and Austin 1956], categories are a necessary abstraction of the real world for
humans in order to survive. Bruner et al. demonstrated how categories or classes could
be mathematically represented as classical sets, where each element in the set has a
common ("equivalent") property.
More formally, in classical set theory, a set A is any collection of definite and distinct
objects that inhabit a universe Ω_X. Ω_X refers to the universe of values x_j that a
variable X can be assigned. Each element of Ω_X either belongs to the set A or not. This
is denoted by the characteristic function

    A(x_j) = \begin{cases} 1 & \text{if } x_j \in A \\ 0 & \text{if } x_j \notin A \end{cases} \qquad \forall x_j \in \Omega_X

where ∀ denotes "for all". A set can be defined either by listing all its members
(enumeration) or by specifying properties that members of a set possess. Enumeration
is, of course, restricted to finite sets and is normally denoted as follows for a set A
defined over a universe Ω_X = {x_1, ..., x_n}:

    A = {x_1, x_2, ..., x_n}
[Figure 3-1 plots two characteristic functions over the universe of heights (ticks at 165, 170, 190, 200): (a) a spike at the point 170, and (b) the indicator of the interval [170, 190], labelled TALL.]
Figure 3-1: Examples of classical sets over the universe of height values expressed in
centimetres: (a) a point valued set; (b) an interval-valued set.
On the other hand, sets characterised by various properties such as P_1, ..., P_n can be
denoted as follows:

    A = {x | P_1(x), P_2(x), ..., P_n(x)}

where each element x_j satisfies each property P_i, that is P_i(x_j) is true for each property
P_i. This latter notation can be used to denote both finite and infinite sets. "|" denotes
such that. Figure 3-1 graphically depicts two sets: (a) corresponds to the height point
value of 170cm; (b) corresponds to the set of tall people, Tall(x_j), i.e. people who
possess the property of having a height in the interval [170, 190].
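A minimal sketch of these two definitional styles, using the height example from the text; the sample universe is an assumption for illustration.

```python
# Enumeration: a finite crisp set is just an explicit collection of elements.
small_universe = {165, 170, 175, 180, 185, 190, 195, 200}   # assumed sample

# Property-based definition: Tall = {x | 170 <= x <= 190}, as in Figure 3-1(b).
def is_tall(x):
    return 170 <= x <= 190

tall = {x for x in small_universe if is_tall(x)}
print(tall)           # {170, 175, 180, 185, 190}
print(is_tall(169))   # False: crisp boundaries admit no partial membership
```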
This section begins by presenting some of the motivations behind the introduction of
fuzzy sets. Subsequently, a fuzzy set is formally defined and concrete examples
provided. Finally, this section describes various interpretations of fuzzy sets.
3.2.1 Motivations
Even though set theory is a well-developed and understood area of mathematics with
numerous applications in engineering and science, its dichotomous nature (either an
object is a member or not of a set) has a number of shortcomings, which can arise when
it is used to model the real world. Borel [Borel 1950] highlights some of these
shortcomings as follows:
One seed does not constitute a pile nor two or three ... from the other side
everybody will agree that 100 million seeds constitute a pile. What therefore is
the appropriate limit? Can we say 325,647 seeds don't constitute a pile but
325,648 do?
This excerpt emphasises two major difficulties with traditional set theoretic approaches:
• Continuity of concepts (or categories) abounds in the real world (see also
Bruner's excerpt in Section 3.1) and often cannot be described precisely,
that is sharp boundaries between categories may not be easy to elicit (pile
versus not pile).
• Furthermore, if "we were to utilize fully our capacity for registering the
differences in things (a pile of 320,000 seeds versus a pile of 325,648 seeds)
and to respond to each event encountered as unique, we would soon be
overwhelmed by the complexity of our environment" [Bruner, Goodnow
and Austin 1956]. Here Bruner et al. refer to human capabilities, but the
same argument holds for computational systems.
B. Russell [Russell 1923] states the first point more eloquently and radically in the
context of traditional logic approaches:
All traditional logic habitually assumes that precise symbols are being
employed. It is therefore not applicable to this terrestrial life but only to an
imagined celestial existence.
Zadeh captures this point in his principle of incompatibility:

As the complexity of a system increases, our ability to make precise and yet
significant statements about its behaviour diminishes until a threshold is
reached beyond which precision and significance (or relevance) become
almost mutually exclusive characteristics.
Uncertainty usually results from the inability to capture a complete (as highlighted
above) and correct model of the problem domain. Real world situations are often very
uncertain in a number of ways. Due to a lack of information, the future state of a system
might not be known completely. This type of uncertainty, often referred to as stochastic
uncertainty, has long been handled by probability theory (see Chapter 5) and statistics.
In these approaches, it is assumed that the events or statements are well defined.
However, there may be situations where it not possible to describe precisely events or
phenomena (due to a lack of information or processing power, human or otherwise).
This lack of definition arising from imprecision is called fuzziness and it abounds in the
real world; for example, in areas such as natural language, engineering, medicine,
meteorology, and manufacturing [Ruspini, Bonissone and Pedrycz 1998; Zimmermann
1996]. Examples of fuzziness include concepts such as tall people, red roses,
creditworthy customers, low volume, where the boundaries between concepts are
blurred.
In order to address these and other shortcomings, and to provide a more natural, and
succinct (and possibly transparent) means of representing the real world in
mathematics, in 1965 Zadeh [Zadeh 1965] introduced the notion of a fuzzy set. A fuzzy
set differs from a classical set by relaxing the requirement that each object be either a
member or a non-member of a set. A fuzzy set is a set with boundaries that are
imprecise, where membership is not a matter of affirmation or denial, but a matter of
degree. As in classical set theory, objects that are members of a fuzzy set can be
represented by a characteristic function, which is called a membership function in fuzzy
set theory.
Though the introduction of fuzzy set theory was initially based on intuitive and
common-sense grounds, in the intervening years since its introduction, numerous
supporting theories and applications have provided fuzzy set theory with well defined
and understood semantics and have demonstrated its usefulness as a very intuitive and
powerful means of handling uncertainty in applications ranging from decision support
to pattern recognition [Ralescu 1995a; Ruspini, Bonissone and Pedrycz 1998; Terano,
Asai and Sugeno 1992; Yen and Langari 1998]. A further, more recent motivation for
using fuzzy sets, arises from its use within the field of machine learning [Baldwin,
Martin and Shanahan 1997a; Sugeno and Yasukawa 1993; Yager 1994] (see also part
IV of this book), where fuzzy sets are shown to be a useful, possibly transparent, and
sometimes necessary abstraction of the world in order to achieve good generalisation
within an inductive reasoning framework. This form of generalisation through
abstraction (fuzzy sets in this case) is more succinctly stated in the principle of
generalisation proposed by Baldwin [Baldwin, Martin and Pilsworth 1995]:
The more closely we observe and take into account the detail, the less we are
able to generalise to similar but different situations...
interval [0, 1] (in contrast to {0, 1} in traditional set theory). This can be expressed
more formally as follows:

    Ã: Ω_X → [0, 1]
As in the case of classical sets, fuzzy sets can be defined by enumerating the objects
that have non-zero membership in the fuzzy set (restricted to finite sets defined on
discrete universes). In crisp set theory, each element of the universe that is listed in a
set is implicitly associated with a membership degree of 1, whereas in fuzzy set theory,
it is necessary to list each element and state explicitly its associated membership value,
since it can have any value in the unit interval [0, 1]. A fuzzy set Ã defined over a
universe Ω_X = {x_1, ..., x_n} is normally denoted as follows:

    Ã = {x_1/Ã(x_1) + ... + x_n/Ã(x_n)}
where each x_i/Ã(x_i) represents an element x_i and its corresponding membership in the
fuzzy set Ã, and "/" is used to avoid confusion. The "+" denotes the union of the
singleton elements x_i/Ã(x_i). Alternatively this can be rewritten in shorthand notation
as follows:

    Ã = \sum_{i=1}^{n} x_i / Ã(x_i)
where Σ should be interpreted as union and should not be confused with the standard
algebraic summation. Consider the following example of a discrete fuzzy set describing
large die numbers. Given the universe of die numbers, Ω_DieNumbers = {1, ..., 6}¹, a
plausible definition of Large could be as follows:

    Large = {4/0.7 + 5/1 + 6/1}

Here the die value of 4 has a membership of 0.7 in the fuzzy set Large, indicating its
degree of compatibility with the concept large die number.
When the universe is continuous, the corresponding fuzzy set is normally denoted as
follows:

    Ã = \int_{\Omega_X} x / Ã(x)

¹ {...} denotes a discrete set. {a, ..., b} denotes a discrete interval such that a ≤ x ≤ b
∀x ∈ {a, ..., b}.
where the integral ∫ denotes the union of fuzzy singletons. For example, real numbers
close to 2 could be represented by the following fuzzy set (as depicted in Figure 3-2):

    About_2(x) = \frac{1}{1 + p(x - 2)^2}                (3-1)

where the parameter p controls the width of the fuzzy set. As the value of p increases,
the graph becomes narrower.
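A quick sketch of this family of membership functions, assuming the bell-shaped form reconstructed in equation (3-1); increasing p visibly narrows the set around 2.

```python
def about_2(x, p=4.0):
    """Membership of x in the fuzzy set of real numbers close to 2.

    Assumes the form in equation (3-1); p controls the width, with larger p
    giving a narrower fuzzy set. The default p is an illustrative choice.
    """
    return 1.0 / (1.0 + p * (x - 2.0) ** 2)

for x in (1.0, 1.5, 2.0, 2.5, 3.0):
    print(x, round(about_2(x), 3))   # peaks at 1.0 when x == 2
```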
For example, the concept Tall, depicted in Figure 3-3(a), can be characterised by the
following membership function:

    Tall(x) = \begin{cases}
    0 & \text{if } x \leq 165 \\
    (x - 165)/5 & \text{if } 165 < x < 170 \\
    1 & \text{if } 170 \leq x \leq 190 \\
    (200 - x)/10 & \text{if } 190 < x < 200 \\
    0 & \text{if } x \geq 200
    \end{cases}
In this case, height values in the interval [170, 190] have a membership value of 1 and
correspond to the core of the fuzzy set. Values in the intervals (165, 170)² and (190,
200) have membership values in the range (0, 1), while other values in the universe
have zero membership in this definition of the concept Tall. Values having
membership greater than zero in a fuzzy set correspond to the support of the fuzzy set.
Since the fuzzy set Tall is characterised by a trapezoidal fuzzy set, it may be viewed as
a fuzzy interval or class. Figure 3-3(b) illustrates a triangular fuzzy set About_170.
Triangular fuzzy sets can be viewed intuitively as fuzzy numbers or fuzzy points, that
is, the core is a singleton.
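The trapezoidal Tall set above translates directly into code; a similar helper yields triangular sets such as About_170 by collapsing the core to a single point. The support chosen for About_170 is an assumption, since the text does not give it.

```python
def tall(x):
    """Trapezoidal membership function for Tall, per the definition above."""
    if x <= 165 or x >= 200:
        return 0.0
    if 170 <= x <= 190:
        return 1.0
    return (x - 165) / 5 if x < 170 else (200 - x) / 10

def about_170(x):
    """Triangular fuzzy number About_170: the core is the single point 170.

    The support (165, 175) is an illustrative assumption.
    """
    if x <= 165 or x >= 175:
        return 0.0
    return (x - 165) / 5 if x <= 170 else (175 - x) / 5

print(tall(185))       # 1.0 (core)
print(tall(195))       # 0.5 (boundary region)
print(about_170(170))  # 1.0 (singleton core)
```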
Since a crisp set can be viewed as a special case of a fuzzy set, the same notation is used
for both. As a further simplification of notation, fuzzy sets will be denoted by a capital
letter (or capitalised word) and the tilde (~) notation, used until now to distinguish
between fuzzy and crisp sets, will be dropped. For example, the fuzzy set
² (165, 170) corresponds to any value x such that the following condition holds: 165 < x
< 170, whereas [165, 170] represents any value x such that 165 ≤ x ≤ 170.
corresponding to large die numbers will be presented simply as Large, without the
tilde. This notational convention for fuzzy sets and membership values will be adopted
for the remainder of this book.
[Figure 3-2: graph of the fuzzy set About_2, defined over the real numbers.]
Figure 3-3: Examples of (a) an interval-valued fuzzy set, and (b) a fuzzy number, both of which are defined over the universe of height values expressed in centimetres.
(i) It is certain that James was born around the end of the sixties (1960s).
(ii) Probably, James was born in 1967.
In the first statement, the year in which James was born is imprecisely stated but
certain, whereas in statement (ii), the year is precisely stated but there is uncertainty
about the statement being true or false. Uncertainty arising from imprecision can be
very naturally modelled using traditional set theory and its generalisation - fuzzy set
theory and various set-based probabilistic theories such as possibility theory (see
Chapter 5). On the other hand, uncertainty arising from beliefs or expectations has been
addressed by various theories of probability (see Chapter 5 for a detailed presentation
of probability theory).
One of the most natural means of interpreting a fuzzy set in terms of human reasoning is the voting model [Baldwin 1991; Gaines 1977; Gaines 1978]. Consider a population of voters, where each voter is asked to describe a value x ∈ Ω, by voting in a "yes or no" fashion on each word w ∈ W (a set of words or vocabulary), on its appropriateness as a label or description of the value x. The membership of x, μw(x), in a fuzzy set characterising the word w is defined to be the proportion of the population who accept w as a description of the value x. Voters are expected to vote consistently and abide by the constant threshold assumption [Baldwin, Martin and Pilsworth 1995], according to which any voter accepting an element will also accept any other element having a higher membership in the concept described by the fuzzy set. That is, for a fuzzy set f defined over the universe Ωx, a voter must accept (vote yes) any xi ∈ Ωx for which μf(xi) ≥ μf(xj) if the voter accepts xj, for any xj ∈ Ωx. The constant threshold assumption
provides a unique voting pattern. Consider a die variable defined over the discrete universe {1, ..., 6}. The meaning of the word small can be generated from the voting patterns of the population for each die value. Table 3-1 presents the voting pattern for a population of ten voters for the word small across all possible die values. The meaning of small consists of the list of die values associated with the proportion of voters who accept small as a description of the respective die values. These proportions correspond to membership values. For example, the die value 1 will have a membership value of 1 in the fuzzy set denoting small since all voters accept small as a suitable linguistic description of the die value of 1. The voting pattern presented in Table 3-1 corresponds to the following fuzzy set description of small: {1/1 + 2/0.2}.
Table 3-1: Voting pattern for ten people corresponding to the interpretation of the linguistic term small die values. Values for which everybody voted "no" (i.e. 3, 4, 5, 6) are not shown.

Die value\Person   1    2    3    4    5    6    7    8    9    10
1                  Yes  Yes  Yes  Yes  Yes  Yes  Yes  Yes  Yes  Yes
2                  Yes  Yes  No   No   No   No   No   No   No   No
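The voting interpretation can be mirrored directly in code. The sketch below (illustrative only; the vote lists simply reproduce Table 3-1) derives the membership values of small as proportions of "yes" votes:

```python
# Deriving memberships from a voting pattern: each die value's
# membership in "small" is the proportion of accepting voters.
votes = {
    1: ["yes"] * 10,
    2: ["yes"] * 2 + ["no"] * 8,
    3: ["no"] * 10, 4: ["no"] * 10, 5: ["no"] * 10, 6: ["no"] * 10,
}

small = {value: sum(v == "yes" for v in vs) / len(vs)
         for value, vs in votes.items()}
print({v: m for v, m in small.items() if m > 0})  # {1: 1.0, 2: 0.2}
```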
Support: The support of a fuzzy set A, Supp(A), is the crisp set of x ∈ Ωx such that all elements have a non-zero membership:

Supp(A) = {x ∈ Ωx | μA(x) > 0}

Core: The core of a fuzzy set A, Core(A), is the crisp set of x ∈ Ωx such that all elements have a membership of 1:

Core(A) = {x ∈ Ωx | μA(x) = 1}
The core of A corresponds to the 1-cut of fuzzy set A, where the α-cut (or α-level set) of A is the crisp set Aα = {x ∈ Ωx | μA(x) ≥ α}. For example, consider the fuzzy set A = {a/0.4 + b/0.6 + c/0.7 + d/1}; then the following is a list of possible α-level sets (α-cuts):

A0.4 = {a, b, c, d}
A0.6 = {b, c, d}
A0.7 = {c, d}
A1 = {d}
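These α-cuts can be computed mechanically, as the following illustrative Python sketch shows:

```python
# Alpha-cuts of the discrete fuzzy set A = {a/0.4 + b/0.6 + c/0.7 + d/1}.
A = {"a": 0.4, "b": 0.6, "c": 0.7, "d": 1.0}

def alpha_cut(fuzzy_set, alpha):
    """Crisp set of elements with membership >= alpha."""
    return {x for x, mu in fuzzy_set.items() if mu >= alpha}

for alpha in (0.4, 0.6, 0.7, 1.0):
    print(alpha, sorted(alpha_cut(A, alpha)))
# The core is the 1-cut ({'d'}); the support is {x : mu(x) > 0}.
```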
This property can also be expressed as follows, using set intersection and union
operations:
Height: The height of a fuzzy set A, denoted by h(A), is the largest membership grade obtained by an element in that set. This is formally denoted as follows:

h(A) = max_{x ∈ Ωx} μA(x)

In the case where the universe Ωx is continuous, Sup (supremum) is used instead of Max.
Normal: A fuzzy set A is normal if the height of A, h(A), is 1, that is, the 1-cut of fuzzy set A is not the empty set, mathematically stated as follows: A1 ≠ ∅. If the height of A, h(A), is less than 1 then fuzzy set A is subnormal.
Other approaches to normalisation are presented later, where the bi-directional mapping from a fuzzy set to a probability distribution is used as a means of generating normalised fuzzy sets (see Section 8.2.2).
Cardinality: Various definitions of fuzzy set cardinality have been proposed in the literature. Some of the more popular measures are described here. The following is one of the simplest definitions of cardinality, generically denoted as |A| for any set A. Given a finite fuzzy set A defined on the universe Ωx, the cardinality of A, denoted by Σcount(A), is defined as follows [Zadeh 1983]:

Σcount(A) = Σ_{x ∈ Ωx} μA(x)

This is commonly referred to as the sigma count. For example, consider Medium = {1/0.6 + 2/0.9 + 3/1 + 4/0.7 + 5/0.3}, the fuzzy set denoting medium die values on a standard 6-faced die, i.e. defined over the universe {1, ..., 6}; the cardinality of Medium, Σcount(Medium), is 3.5. If however Ωx is continuous, Σcount is defined as follows for a fuzzy set A defined over Ωx:

Σcount(A) = ∫_{Ωx} μA(x) dx
This definition of cardinality, though simple, is not very useful. Consider a group of
people and let A be the fuzzy set denoting tall people in this group. The use of sigma
count to characterise the number of tall people brings with it the possibility that a group
of people with low membership grades in A (i.e. small people) will add up to a tall
person. In order to overcome this limitation, alternative definitions of cardinality have
been proposed based upon fuzzy numbers [Klir and Yuan 1995; Zadeh 1983].
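As a quick illustration of the definition above, the sigma count of Medium can be computed as follows (a sketch, not library code):

```python
# Sigma count of Medium = {1/0.6 + 2/0.9 + 3/1 + 4/0.7 + 5/0.3}.
medium = {1: 0.6, 2: 0.9, 3: 1.0, 4: 0.7, 5: 0.3}

def sigma_count(fuzzy_set):
    """Zadeh's sigma count: the sum of all membership grades."""
    return sum(fuzzy_set.values())

print(round(sigma_count(medium), 2))  # 3.5
# Note the limitation discussed above: many small memberships
# (e.g. ten people with membership 0.1 in Tall) also sum to 1.0.
```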
An example of fuzzy cardinality is the FG count proposed by Zadeh [Zadeh 1983]. The FG count of the fuzzy set A, denoted as FGCount_A, is a fuzzy set defined on the non-negative integers where, for each integer i, FGCount_A(i), the membership grade of i in FGCount_A, indicates the truth of the proposition "A contains at least i elements". This membership grade is defined as follows:

FGCount_A(i) = sup{α ∈ [0, 1] | |Aα| ≥ i}

For a more complete presentation of fuzzy cardinality see [Ralescu 1995b; Yager 1998].
(voting models, by modelling humanistic reasoning, can help in modelling this mapping [Baldwin 1991; Gaines 1977; Gaines 1978]). Numerous representations for fuzzy sets have been proposed in the literature, most of which attempt to satisfy the following two criteria: they should provide an accurate and natural reflection of the real world; and they should be computationally tractable (comptractible). Flexibility of representation usually comes in the form of parameterised membership functions, and comptractibility is normally met by choosing simple membership functions with few parameters. Table 3-2 lists typical membership functions and their corresponding graphical representations. In the case of Gaussian, exponential-like, or Γ membership functions the constant k controls the width of the fuzzy set. Triangular and trapezoidal membership functions are used most often for representing fuzzy sets, due to their simple nature from both a computational and an understandability point of view. For example, the triangular membership function can be equivalently written as follows:

μA(x) = max(0, min((x − a)/(b − a), (c − x)/(c − b)))

where a and c delimit the support and b the core (a single point).
More recently, as fuzzy based systems are being learnt from data, more expressive membership functions are being adopted, such as piecewise linear representations, in order to provide a more natural rapport with reality. Chapter 8 of this book introduces a new type of fuzzy set, a Cartesian granule fuzzy set (briefly introduced in Section 3.7.2), that represents fuzzy sets as linguistic summaries, in contrast to the curves or formulas typically used to represent fuzzy sets (see Table 3-2). This type of fuzzy set, while leading to more succinct and more easily describable representations than its mathematical-looking counterparts, also provides high degrees of accuracy (see Chapter 11).
Table 3-2: Typical parameterised membership functions.

Exponential-like:  μA(x) = 1/(1 + k(m − x)²), where k > 1    (Figure 3-2)
Gaussian:          μA(x) = e^(−k(x − m)²)                    (Figure 3-4)
Γ:                 μA(x) = 0 if x ≤ a,
                   1 − e^(−k(x − a)²) if x > a, where k > 0  (Figure 3-5)
Before presenting the main operations on fuzzy sets, a review of the basic operations on classical crisp sets and their properties is presented. The basic set operations considered here are intersection, union, and negation. As crisp sets can be denoted by their characteristic functions, these operations can be conveniently described in terms of these functions. Given two sets A and B defined on universe Ωx, the operations of intersection ∩, union ∪, and complement ¬ can be defined as follows:

(A∩B)(x) = min[A(x), B(x)]
(A∪B)(x) = max[A(x), B(x)]
(¬A)(x) = 1 − A(x)

where (A∩B)(x), (A∪B)(x) and ¬A(x) denote the membership values of each value x in the set resulting from intersection, union and negation. The fundamental properties of these set operations are summarised in Table 3-3. All concepts of classical set theory have their generalised counterparts in fuzzy set theory. But fuzzy counterparts of
classical set-theoretic operations are not unique. Each basic operation on classical sets -
the complement, intersection, and union - is represented by a broad class of operations
in fuzzy set theory. Below a brief overview of these broad classes is presented. This
section begins, however, by presenting the definitions of these set theoretic operations
suggested originally by Zadeh [Zadeh 1965] and subsequently describes the numerous
alternative definitions that have been proposed in the fuzzy set literature. Once again,
operations on fuzzy sets are defined via their membership functions.
Fuzzy set intersection: The membership function characterising the fuzzy set resulting from the intersection of two fuzzy sets A and B defined over universe Ωx may be point-wise defined as follows:

(A∩B)(x) = min[μA(x), μB(x)]   ∀x ∈ Ωx

For example, consider two fuzzy sets, A and B, defined over the discrete universe of die values Ω_DieValues = {1, ..., 6} as follows: A = 2/0.8 + 3/1 + 4/0.3 and B = 3/0.2 + 4/0.8 + 5/1. Then the fuzzy set corresponding to the intersection of A and B, A∩B, is as follows:

A∩B = 3/0.2 + 4/0.3
Fuzzy set union: Similarly, Zadeh defined the union of two fuzzy sets point-wise via the max operator:

(A∪B)(x) = max[μA(x), μB(x)]   ∀x ∈ Ωx

For example, consider the two fuzzy sets, A and B, as defined above; then the fuzzy set corresponding to the union of A and B, A∪B, is as follows:

A∪B = 2/0.8 + 3/1 + 4/0.8 + 5/1
Fuzzy set complement: For ease of presentation, the complement operation will be denoted interchangeably by the characters ¬ and ‾. Zadeh defined the membership function characterising the fuzzy set resulting from the complement of fuzzy set A defined over universe Ωx as follows:

(¬A)(x) = 1 − μA(x)   ∀x ∈ Ωx

For example, consider the fuzzy set, A, as defined above; then the fuzzy set corresponding to the complement of A, ¬A, is given as follows:

¬A = 1/1 + 2/0.2 + 4/0.7 + 5/1 + 6/1
A fuzzy set complement operator is said to satisfy the property of involution if the following holds:

¬(¬A) = A

This means that the degree of non-membership in the fuzzy complement of a fuzzy set is the same as the degree of the membership in the fuzzy set. This property holds for the fuzzy complement as defined above.
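The three standard operations can be sketched in a few lines of Python; the example below (illustrative only) reproduces the intersection, union and complement results derived above:

```python
# Zadeh's min/max/complement operations on the discrete fuzzy sets
# A = 2/0.8 + 3/1 + 4/0.3 and B = 3/0.2 + 4/0.8 + 5/1 over {1,...,6}.
universe = range(1, 7)
A = {2: 0.8, 3: 1.0, 4: 0.3}
B = {3: 0.2, 4: 0.8, 5: 1.0}

intersection = {x: min(A.get(x, 0), B.get(x, 0)) for x in universe}
union        = {x: max(A.get(x, 0), B.get(x, 0)) for x in universe}
complement_A = {x: 1 - A.get(x, 0) for x in universe}

print({x: m for x, m in intersection.items() if m > 0})  # {3: 0.2, 4: 0.3}
print({x: m for x, m in union.items() if m > 0})
```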
These definitions of intersection and union, proposed by Zadeh [Zadeh 1965] and modelled using the min-operator and max-operator respectively, are often referred to as the "logical and" (or standard and) and "logical or".
Other operators have also been proposed which differ mainly with respect to the generality or adaptability of the operators, as well as the degree to which they are justified. Justification normally comes in the form of intuition (e.g. a voting model interpretation) or through axiomatic or empirical justification. Most fuzzy intersection and union operators proposed to date can be classified as belonging to one of two classes: axiomatic-based operators (for intersection and union); and hybrid operators. The operators within both of these classes can be further sub-divided into operators that are parameterised and non-parameterised. Below a brief overview of these families of operators is presented. Regarding the complement operator, alternative definitions to Zadeh's original definition have also been proposed, including threshold-based complements and Sugeno's parametric λ complement (however, these are not presented here). See [Klir and Yuan 1998] for a comprehensive treatment of complement operators.
Fuzzy intersection can be represented by a well-established class of functions called triangular norms (t-norms). A t-norm, denoted by ⊗, is a binary function on the unit interval, that is, a function of the form

⊗: [0, 1] × [0, 1] → [0, 1]

that satisfies the axioms outlined below. For every element x of the universal set Ωx, this function takes as its argument the pair consisting of the membership grades in fuzzy sets A and B, both of which are defined over Ωx, and yields the membership grade in the fuzzy set constituting the intersection of A and B, A∩B. Thus,

(A∩B)(x) = ⊗[μA(x), μB(x)]

This can also be written as follows: μA(x) ∧ μB(x). The following axioms need to be satisfied in order for a function to qualify as a t-norm: for any a, b, d ∈ [0, 1], corresponding to membership grades such as μA(x) and μB(x) for any element x of the universal set Ωx:

(i) a ⊗ 1 = a (boundary condition)
(ii) b ≤ d implies a ⊗ b ≤ a ⊗ d (monotonicity)
(iii) a ⊗ b = b ⊗ a (commutativity)
(iv) a ⊗ (b ⊗ d) = (a ⊗ b) ⊗ d (associativity)

A frequently applied parameterised family of t-norms is the Schweizer-Sklar class:

⊗_p(a, b) = [max(0, a^p + b^p − 1)]^(1/p),  p ≠ 0

The following special cases for this class of t-norm are amongst the most commonly used t-norms, where the subscript attached to the symbol ⊗ indicates the value or the limit of the parameter p:

⊗_{−∞}(a, b) = min(a, b)
⊗_0(a, b) = ab
⊗_1(a, b) = max(0, a + b − 1)
⊗_∞(a, b) = a if b = 1, b if a = 1, and 0 otherwise (drastic)
Figure 3-6: Examples of t-norms in the Schweizer-Sklar class: (a) min(a, b); (b) product, i.e. ab; (c) drastic min; (d) bounded difference, i.e. max(0, a+b−1).
Furthermore, it can be shown that ⊗_{−∞} is the largest t-norm (i.e. fuzzy intersection operator) and that ⊗_∞ is the smallest t-norm. More succinctly:

⊗_∞(a, b) ≤ max(0, a + b − 1) ≤ ab ≤ min(a, b)

See [Klir and Yuan 1998] for a corresponding proof. However, since the Schweizer and Sklar class of t-norms is defined by a particular format, it does not cover all possible t-norms.
Some of the t-norms presented above possess other desirable properties such as idempotency (a ⊗ a = a). For example, it is easy to show that min(a, b) is the only idempotent t-norm.
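The following Python sketch illustrates the Schweizer-Sklar family as reconstructed above; the parameter values used (p → 0 approximated by p = 1e-6) are assumptions made purely for the purposes of the illustration:

```python
# A sketch of the Schweizer-Sklar t-norm family, assuming the form
# t_p(a, b) = max(0, a**p + b**p - 1) ** (1/p) for p != 0,
# with min, product and the drastic t-norm recovered in the limits.
def schweizer_sklar(a, b, p):
    return max(0.0, a ** p + b ** p - 1.0) ** (1.0 / p)

a, b = 0.6, 0.8
print(min(a, b))                              # limit p -> -infinity
print(round(schweizer_sklar(a, b, 1e-6), 3))  # ~ a*b (limit p -> 0)
print(schweizer_sklar(a, b, 1.0))             # bounded difference max(0, a+b-1)
```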
Next the t-conorm function (corresponding to fuzzy set union, also known as s-norm in the literature), the logical dual of the t-norm, is described. Like fuzzy intersection, fuzzy union can be represented by a well-established class of functions that are called triangular conorms, also known as t-conorms. T-conorms, denoted by ⊕, represent a family of binary functions on the unit interval; that is, a function of the form

⊕: [0, 1] × [0, 1] → [0, 1]

that satisfies the axioms outlined below. For every element x of the universal set Ωx, this function takes as its argument the pair consisting of the membership grades in fuzzy sets A and B, both of which are defined over the universe Ωx, and yields the membership grade in the fuzzy set constituting the union of A and B, A∪B. Thus,

(A∪B)(x) = ⊕[μA(x), μB(x)]

This can also be written as follows: μA(x) ∨ μB(x). The following axioms need to be satisfied in order for a function to qualify as a t-conorm, for any a, b, d ∈ [0, 1], corresponding to membership grades such as μA(x) and μB(x) for any element x of the universal set Ωx:

(i) a ⊕ 0 = a (boundary condition)
(ii) b ≤ d implies a ⊕ b ≤ a ⊕ d (monotonicity)
(iii) a ⊕ b = b ⊕ a (commutativity)
(iv) a ⊕ (b ⊕ d) = (a ⊕ b) ⊕ d (associativity)
The following special cases for this class of t-conorm are amongst the most commonly used t-conorms, where the subscript attached to the symbol ⊕ indicates the value or the limit of the parameter p:

⊕_{−∞}(a, b) = max(a, b)
⊕_0(a, b) = a + b − ab
⊕_1(a, b) = min(1, a + b)
⊕_∞(a, b) = a if b = 0, b if a = 0, and 1 otherwise (drastic)

These t-conorms (depicted in Figure 3-7), namely the standard t-conorm, the algebraic sum, the bounded sum, and the drastic t-conorm, satisfy the following ordering for any values of a and b:

max(a, b) ≤ a + b − ab ≤ min(1, a + b) ≤ ⊕_∞(a, b)
Furthermore, it can be shown that ⊕_{−∞} is equivalent to the smallest t-conorm (i.e. fuzzy union operator) and that ⊕_∞ is the largest t-conorm. See [Klir and Yuan 1998] for a corresponding proof. However, since the Schweizer and Sklar class of t-conorms is defined by a particular format, it does not cover all possible t-conorms. Some of the t-conorms presented above possess other desirable properties such as idempotency (a ⊕ a = a). For example, it is easy to show that max(a, b) is the only idempotent t-conorm.
Figure 3-7: Examples of t-conorms in the Schweizer-Sklar class: (a) max(a, b); (b) algebraic sum, i.e. a+b−ab; (c) drastic max; (d) bounded sum, i.e. min(1, a+b).
The previous paragraphs have presented one general purpose parameterised family of t-norms and t-conorms that has been frequently applied; for a presentation of other parameterised families see [Zimmermann 1996].
Non-parameterised t-norms are mentioned here for completeness and one example is presented: the Hamacher product [Hamacher 1978; Zimmermann 1996]. The Hamacher product t-norm is mathematically defined as follows:

(a ⊗_H b) = ab/(a + b − ab)

and its dual t-conorm, the Hamacher sum, is defined as:

(a ⊕_H b) = (a + b − 2ab)/(1 − ab)
In classical set theory, the operations of intersection and union are dual with respect to the complement in the sense that they satisfy the DeMorgan laws. In the case of fuzzy set theory, Bonissone and Decker [Bonissone and Decker 1986] have shown that for any involutive fuzzy complement (i.e. one satisfying ¬¬a = a), dual pairs of t-norms and t-conorms satisfy the following generalisation of DeMorgan's laws:

¬(a ⊗ b) = ¬a ⊕ ¬b

and

¬(a ⊕ b) = ¬a ⊗ ¬b

A triple <⊗, ⊕, ¬>, denoting a t-norm, t-conorm and fuzzy complement, satisfying the above laws is commonly known as a DeMorgan triple [Klir and Yuan 1998]. Examples of DeMorgan triples include: <min, max, 1 − a>, <ab, a + b − ab, 1 − a>, and <max(0, a + b − 1), min(1, a + b), 1 − a>.
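A DeMorgan triple can be checked numerically. The sketch below (illustrative; the grid resolution is an arbitrary choice) verifies the first law for the dual pair <product, algebraic sum> with the standard complement:

```python
# Numerically checking a generalised DeMorgan law for the dual
# pair <product, algebraic sum> with the standard complement 1 - a.
import itertools

t_norm   = lambda a, b: a * b            # product
t_conorm = lambda a, b: a + b - a * b    # algebraic sum
neg      = lambda a: 1.0 - a             # involutive complement

grid = [i / 10 for i in range(11)]
ok = all(abs(neg(t_norm(a, b)) - t_conorm(neg(a), neg(b))) < 1e-12
         for a, b in itertools.product(grid, grid))
print(ok)  # True: not(a (x) b) == not(a) (+) not(b)
```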
Aggregation operations combine several membership values into one; formally, an aggregation operation on n arguments is a function

h: [0, 1]^n → [0, 1]

that satisfies the properties listed below. In the following presentation, the membership values a1, a2, ..., an ∈ [0, 1] denote μA1(x), μA2(x), ..., μAn(x) respectively, for any element x of the universal set Ωx:

(i) h(a, a, ..., a) = a (idempotency)
(ii) h is monotonic increasing in each argument (monotonicity)
(iii) h is a continuous function (continuity)
(iv) h is a symmetric function in all its arguments (symmetry)

Properties (i) and (ii) are required for averaging operators, whereas properties (iii) and (iv) are highly desirable along with other properties. Within this group of aggregation operators, the weighted generalised means and OWA (ordered weighted averaging) operators [Yager 1988] are most prevalently used.
The weighted generalised means is formally defined as follows [Klir and Yuan 1998]:

h_w(a1, ..., an) = (Σ_{i=1}^n w_i·a_i^α)^(1/α)

for any a_i ∈ [0, 1], i ∈ [1, n], α ∈ ℝ (α ≠ 0); and the weight vector w = <w1, ..., wn> satisfies the following constraint:

Σ_{i=1}^n w_i = 1

and each w_i ≥ 0.
On the other hand, the OWA operators consist of a weight vector w = <w1, w2, ..., wn> that is used in the following way to aggregate:

h_w(a1, a2, ..., an) = Σ_{i=1}^n w_i·b_i

where <b1, b2, ..., bn> is a reordering of <a1, a2, ..., an> such that b1 ≥ b2 ≥ ... ≥ bn. Various weight vectors lead to intuitive OWA operators. For example, if the weight vector w = <0, 0, ..., 0, 1> is used then the min operator is recovered; if the weight vector w = <1, 0, ..., 0, 0> is used then the max operator is recovered.
There are many families of averaging operators; for a detailed listing see [Klir and
Yuan 1998].
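The behaviour of OWA operators is easy to see in a short sketch (illustrative only); note how the extreme weight vectors recover min and max:

```python
# A sketch of Yager's OWA operator: weights are applied to the
# arguments after sorting them into decreasing order.
def owa(weights, args):
    ordered = sorted(args, reverse=True)   # b1 >= b2 >= ... >= bn
    return sum(w * b for w, b in zip(weights, ordered))

values = [0.3, 0.9, 0.5]
print(owa([0, 0, 1], values))        # 0.3 -> min is recovered
print(owa([1, 0, 0], values))        # 0.9 -> max is recovered
print(owa([1/3, 1/3, 1/3], values))  # arithmetic mean
```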
Let ¬ be an involutive complement and γ be a parameter in the unit interval [0, 1]. Formally, compensative operators can be defined as functions taking n arguments:

c: [0, 1]^n → [0, 1]

One of the most commonly used compensative operators is the γ-operator originally introduced by Zimmermann and Zysno [Zimmermann and Zysno 1980]. It is defined as follows:

c_γ(a1, ..., an) = (Π_{i=1}^n a_i)^(1−γ) · (1 − Π_{i=1}^n (1 − a_i))^γ
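A sketch of the γ-operator as reconstructed above; the argument values chosen are arbitrary illustrations:

```python
# A sketch of the Zimmermann-Zysno gamma operator, assuming the form
# c(a1..an) = (prod ai)**(1-gamma) * (1 - prod (1-ai))**gamma.
from math import prod

def gamma_operator(args, gamma):
    and_part = prod(args)                      # conjunctive (product)
    or_part = 1 - prod(1 - a for a in args)    # disjunctive (algebraic sum)
    return and_part ** (1 - gamma) * or_part ** gamma

args = [0.4, 0.7]
for g in (0.0, 0.5, 1.0):   # gamma sweeps from pure "and" to pure "or"
    print(g, round(gamma_operator(args, g), 3))
```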
Matching two fuzzy sets plays a key role in many contexts including inductive reasoning and approximate reasoning (presented in the remaining chapters of this part of the book). At this juncture, two popular and relatively straightforward approaches for matching two fuzzy sets are described, while Chapter 5 presents a third approach based upon conditional probabilities (namely, semantic unification, see Section 5.3.3.1) that exploits the formal relationship between fuzzy sets and set-based probabilities. Two of the most commonly used approaches for matching two fuzzy sets are based upon possibility and necessity measures [Dubois and Prade 1988; Zadeh 1978]. The possibility measure of two fuzzy sets A and D defined over the universe Ωx, where A is viewed as a reference fuzzy set (part of a model) and D is viewed as a given piece of data, reduces to calculating the intersection of both fuzzy sets, and then taking the height of the resulting fuzzy set. This is more succinctly stated as follows:

Pos(A, D) = sup_{x ∈ Ωx} min(μA(x), μD(x))
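The following sketch computes possibility (and, assuming the usual dual definition Nec(A, D) = inf_x max(μA(x), 1 − μD(x)), necessity) for two small discrete fuzzy sets; the sets A and D are invented for illustration:

```python
# A sketch of possibility/necessity matching for discrete fuzzy sets.
def pos(A, D, universe):
    return max(min(A.get(x, 0), D.get(x, 0)) for x in universe)

def nec(A, D, universe):
    return min(max(A.get(x, 0), 1 - D.get(x, 0)) for x in universe)

universe = range(1, 7)
A = {2: 0.8, 3: 1.0, 4: 0.3}   # reference fuzzy set (part of a model)
D = {3: 0.6, 4: 1.0}           # observed data fuzzy set
print(pos(A, D, universe), nec(A, D, universe))  # 0.6 0.3
```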
The previous sections presented the basic definition of a fuzzy set and various families
of operations that accept single and multiple fuzzy set values as inputs. This section
describes generalisations of fuzzy sets that have been developed. This presentation
details multidimensional fuzzy sets, relations, Cartesian granule fuzzy sets, and higher-
order fuzzy sets such as interval-valued fuzzy sets and type-2 fuzzy sets.
M = Σ μ_M(x1, ..., xn)/<x1, ..., xn>

where the sum ranges over all tuples in Ω1 × ... × Ωn, and where tuples <x1, ..., xn> may have varying degrees of membership. The membership grade of a tuple in a multidimensional fuzzy set, as in the one-dimensional case, indicates the degree of similarity between it and the imprecise concept characterised by the fuzzy set.
Pos(A, D) = 0.6
Nec(A, D) = 0
Figure 3-8: Examples of necessity and possibility measures.
For example, consider the definition of a hypothetical fuzzy set corresponding to people who could be potentially overweight, which is characterised in terms of two variables, height and weight, defined over universes Ω_Height and Ω_Weight respectively. A possible definition for this fuzzy set could be as follows:
This is a shorter representation of the relation that exploits various properties that this particular relation possesses, including symmetry and anti-reflexivity. Fuzzy relations form a large area of study in fuzzy set theory, playing a key role in areas such as approximate reasoning (which will be covered in the next chapter).
[Figure: membership surface of the potentially overweight fuzzy relation, plotted over the weight and height universes.]
where tuple y is a sub-sequence of tuple x. The max (or sup in continuous case)
CHAI'TEK 3: FUlLY SETTHEOKY 60
operation is used since many tuples in R will lead to the same tuple in [R!X-Y] with
different membership values.
For example, consider the relation R defined in Table 3-5. The projection [R↓X1], denoting the projection of R onto a new relation consisting of X1 only, results in the following fuzzy set:

[R↓X1] = {x1/1 + x2/0.8 + x3/1}

Table 3-5: The fuzzy relation R defined over the universes X1 and Y1.

μR(x, y)   y1    y2    y3    y4
x1         0.1   0.7   0.5   1
x2         0.7   0.3   0.8   0.2
x3         0.1   0.11  1     0
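The projection operation can be sketched directly from Table 3-5 (illustrative code only):

```python
# Projecting the fuzzy relation R of Table 3-5 onto X1: each x keeps
# the maximum membership found across all tuples containing it.
R = {
    ("x1", "y1"): 0.1, ("x1", "y2"): 0.7, ("x1", "y3"): 0.5, ("x1", "y4"): 1.0,
    ("x2", "y1"): 0.7, ("x2", "y2"): 0.3, ("x2", "y3"): 0.8, ("x2", "y4"): 0.2,
    ("x3", "y1"): 0.1, ("x3", "y2"): 0.11, ("x3", "y3"): 1.0, ("x3", "y4"): 0.0,
}

projection = {}
for (x, _y), mu in R.items():
    projection[x] = max(projection.get(x, 0.0), mu)
print(projection)  # {'x1': 1.0, 'x2': 0.8, 'x3': 1.0}
```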
Another operation on multidimensional fuzzy sets, which can be viewed as the inverse of projection, is called the cylindrical extension. A cylindrical extension can be formally defined as follows: consider multidimensional universes, X and Y, as defined previously. Let R be a multidimensional fuzzy set defined on Y and let [R↑X] denote the cylindrical extension of R into the multidimensional universe X. Then

[R↑X](x) = R(y)

where y is the sub-sequence of tuple x lying in Y. This operation produces the largest fuzzy set (in the sense of tuple membership grades of the extended Cartesian product) that is compatible with the given projection. It is interesting to note that the Cartesian product of uni-variate projections (i.e. [R↓Ωi]) of a fuzzy relation R does not result in the original relation but rather its upper estimate:

R ⊆ [R↓Ω1] × [R↓Ω2] × ... × [R↓Ωm]
A value x from a universe Ωx that has been partitioned by fuzzy sets A1, ..., Am can be described linguistically by the following granule fuzzy set:

LD_x = Σ_{j=1}^m A_j/μ_{A_j}(x)

For example, the position value 40 could be summarised linguistically as:

40 = {Bottom/0.2 + Middle/1}
Granule fuzzy sets can conveniently be used to describe concepts. For example, the position of sky regions in digital images could be described using the following granule fuzzy set:
A fuzzy set, as presented so far, is a mapping from the universe of discourse Ωx onto the unit interval [0, 1]. This type of fuzzy set is also known as a type-1 fuzzy set, and is by far the most commonly researched and applied type of fuzzy set to date. However, over the years, other generalisations of fuzzy sets have been developed. Two of these generalisations are covered here: interval-valued fuzzy sets [Klir and Yuan 1995]; and type-2 fuzzy sets [Mizumoto and Tanaka 1976].
In the case of interval-valued fuzzy sets, rather than restricting the membership value to a single value in the interval [0, 1], the membership value is generalised to a closed interval of real numbers in [0, 1]. Interval-valued fuzzy sets are more formally defined as follows:

A: Ωx → ℰ([0, 1])

where ℰ([0, 1]) denotes the family of all closed intervals of real numbers in [0, 1]; note that

ℰ([0, 1]) ⊂ P([0, 1])

where P([0, 1]) denotes the power set of elements in the interval [0, 1]. Figure 3-11 graphically depicts an interval-valued fuzzy set where the membership value μA(x) of each element x is represented by an interval [α_{x,L}, α_{x,U}], denoting the lower and upper bounds for membership values.
Type-2 fuzzy sets [Mizumoto and Tanaka 1976] are a further generalisation of interval-valued fuzzy sets, where every element in the universe is mapped onto a type-1 fuzzy set. Type-2 fuzzy sets are more formally defined as follows:

A: Ωx → F([0, 1])

where F([0, 1]) denotes the family of type-1 fuzzy sets that can be defined on the interval [0, 1]. F([0, 1]) is commonly referred to as the fuzzy power set. Figure 3-12 graphically depicts a type-2 fuzzy set where the membership value μA(x) of each element x is represented by a fuzzy set that characterises its membership; in this case the membership value of each value x is characterised by a trapezoidal fuzzy set.
Other types of generalisations of fuzzy sets also exist, such as probabilistic fuzzy sets [Hirota 1981], and intuitionistic fuzzy sets [Atanassov 1986]. Overall, fuzzy sets other than type-1 fuzzy sets, fuzzy relations and Cartesian granule fuzzy sets are still the subject of research and have not been applied extensively in real world problems. Even though these generalisations of a type-1 fuzzy set, such as type-2 fuzzy sets, provide more expressivity, this comes at added computational cost and hence they have seen few applications to date.
Figure 3-12: An example of a type-2 fuzzy set and a type-1 fuzzy set membership value (characterised by a trapezoidal fuzzy set) for x.
Chapters 7 and 9 describe approaches that elicit membership functions from example data, in particular for Cartesian granule fuzzy sets.
3.9 SUMMARY
This chapter serves as a concise introduction to fuzzy sets. Along with presenting the basic definition of a fuzzy set, it also presents various properties and operations that can be performed on fuzzy sets, such as aggregation and matching. Various justifications and interpretations of fuzzy sets as a form of knowledge granulation were presented. Generalisations of the fuzzy set, including Cartesian granule fuzzy sets and relations, were also described, illustrating the potential power and flexibility of the fuzzy set. Membership function elicitation was briefly discussed, but will be explored in detail in Chapters 7 and 9.
3.10 BIBLIOGRAPHY
Atanassov, K. T. (1986). "Intuitionistic fuzzy sets", Fuzzy Sets and Systems, 20:87-96.
Baldwin, J. F. (1991). "A Theory of Mass Assignments for Artificial Intelligence", In
IJCAI '91 Workshops on Fuzzy Logic and Fuzzy Control, Sydney, Australia,
Lecture Notes in Artificial Intelligence, A. L. Ralescu, ed., 22-34.
Baldwin, J. F., and Lawry, J. (2000). "A fuzzy c-means algorithm for prototype
induction." In the proceedings of IPMU, Madrid, To appear.
Baldwin, J. F., Martin, T. P., and Pilsworth, B. W. (1995). FRIL - Fuzzy and Evidential
Reasoning in A.I. Research Studies Press(Wiley Inc.), ISBN 0863801595.
Baldwin, J. F., Martin, T. P., and Shanahan, J. G. (1997a). "Fuzzy logic methods in
vision recognition." In the proceedings of Fuzzy Logic: Applications and
Future Directions Workshop, London, UK, 300-316.
Baldwin, J. F., Martin, T. P., and Shanahan, J. G. (1997b). "Modelling with words
using Cartesian granule features." In the proceedings of FUZZ-IEEE,
Barcelona, Spain, 1295-1300.
Bonissone, P. P., and Decker, K. S. (1986). "Selecting uncertainty calculi and
granularity: An experiment in trading-off precision and complexity", In
Uncertainty in Artificial Intelligence, L. N. Kanal and J. F. Lerner, eds., North-
Holland, Amsterdam, 217-247.
Borel, E. (1950). Probabilité et certitude. Presses Universitaires de France, Paris.
Bruner, J. S., Goodnow, J. J., and Austin, G. A. (1956). A Study of Thinking. Wiley,
New York.
Dubois, D., and Prade, H. (1983). "Unfair coins and necessity measures: towards a possibilistic interpretation of histograms", Fuzzy Sets and Systems, 10:15-20.
Dubois, D., and Prade, H. (1988). An approach to computerised processing of
uncertainty. Plenum Press, New York.
Gaines, B. R. (1977). "Foundations of Fuzzy Reasoning", In Fuzzy Automata and
Decision Processes, M. Gupta, G. Saridis, and B. R. Gaines, eds., Elsevier,
North-Holland, 19-75.
This chapter introduces fuzzy logic as the basis for a collection of techniques for representing knowledge in terms of natural-language-like sentences, and as a means of manipulating these sentences in order to perform inference using reasoning strategies that are approximate rather than exact. It was first introduced in the early 1970s by
Zadeh in order to provide a better rapport with reality [Klir and Yuan 1995; Zadeh
1973]. Fuzzy logic can be viewed as a means of formally performing approximate
reasoning about the value of a system variable given vague information about the
values of other variables, and knowledge about the dependence relations between them
(that is typically represented as IF-THEN rules expressed as fuzzy relations). For
example, if knowledge is expressed in terms of IF-THEN rules, such as IF X is A THEN
Y is B, and if the fact X is A' is known, then the deductive process needs to derive Y is
B' as a logical consequence. In an approximate reasoning setting, in contrast to a
classical logic setting, where inference is performed by manipulating symbols,
inference is performed at a semantic level by numeric manipulation of membership
functions that characterise the symbols.
The chapter is organised into three main sections: knowledge representation; fuzzy
inference; and fuzzy decision making. It begins by introducing the main forms of
domain specific knowledge representation in fuzzy logic: linguistic variables, linguistic
hedges, fuzzy facts and fuzzy if-then rules. Subsequently, it presents the main modes of
inference in fuzzy logic, some of which are derived from multi-valued logic. Decision
making processes are then described (known as defuzzification in fuzzy logic parlance).
A simple example, illustrating the potential of fuzzy logic as an accurate and
transparent modelling technique is also presented. Finally, real world applications of
fuzzy logic are overviewed.
4.1 FUZZY RULES AND FACTS
Fuzzy propositions are typically expressed in linguistic terms. In this book fuzzy propositions of the following types are considered:

p: 'X is A'
r: 'IF X is A THEN Y is B'

where each proposition of the type p or r, as defined above, is associated with a point or interval probability. Qualified fuzzy propositions will be considered in more detail in Chapter 6 in the context of the Fril programming environment. The range of truth values for fuzzy propositions is [0, 1], where truth and falsity are expressed by the values 1 and 0 respectively.
4.1.1.1 Partitions
The concept of a partition can be exploited both to reduce the information complexity and also to enhance the interpretability of a system. Partitions facilitate a more natural mapping between the computational representation and the human perception of the world. Partitions achieve a natural and efficient reduction of information complexity by quantising or discretising continuous universes. They can be viewed as a means of carving the attribute space into regions of self-similarity. Zadeh refers to these regions as granules [Zadeh 1994]. Notions of indistinguishability, similarity, proximity and functionality play key roles in determining the extent of these granules. Granules are normally characterised by crisp or fuzzy sets. Consequently, crisp sets or fuzzy sets can be used to partition the universes upon which the problem domain variables are defined, thus leading to crisp and fuzzy partitions.
Definition: Let X = {x1, ..., xn} be a set of given data. A partition P of X is a family of subsets of X denoted by P = {A1, ..., Ac}, that satisfy the following properties:

(i) ∪_{i=1}^c A_i = X
(ii) A_i ∩ A_j = ∅ for all i ≠ j, with each A_i non-empty
When each Ai is a fuzzy set, a fuzzy partition [Ruspini 1969] for X is defined and the following properties, corresponding to (i) and (ii) above, must hold:

Σ_{i=1}^c μ_{A_i}(x_k) = 1   ∀k ∈ {1, ..., n}

For example, if A2 is defined as the complement of A1 (i.e. μ_{A2}(x) = 1 − μ_{A1}(x)), then P = {A1, A2} is a fuzzy partition (or fuzzy 2-partition) of X. It can be easily seen that if P is restricted to crisp sets, then this definition corresponds to the standard definition of a partition as shown above.
0 < Σ_{i=1}^c μ_{A_i}(x_k) ≤ c   ∀k ∈ {1, ..., n}
A fuzzy partition satisfying this relaxed condition, whose fuzzy subsets are associated with linguistic labels, is termed a linguistic partition [Ralescu and Hartani 1995]. It is quite natural to assign linguistic labels (from a predefined dictionary of terms or from a list that an expert has provided) to each fuzzy subset. For example, the universe of a variable Position could be partitioned into three fuzzy subsets that are associated with the words Left, Middle and Right. Variables defined over these fuzzy subset labels are termed linguistic variables [Zadeh 1975a; Zadeh 1975b; Zadeh 1975c]. A linguistic variable takes as its values a finite set of words or labels. Linguistic partitions can be viewed as a lens/filter through which the data can be seen in an intuitive manner. Linguistic partitions permit operations on data, such as learning and reasoning, to be carried out in a more effective and transparent fashion. Examples of these claims are presented in Parts IV and V of this book, where linguistic partitions provide tractability, effectiveness and understandability for the modelling approaches presented.
of the value x. Similarly, other linguistic terms generated by the syntactic rule, such as not w, very w, w1 or w2, can be assigned meaning. For example, the meaning of the compound statement w1 and w2 for a value x ∈ Ωx can be taken as the proportion of the population who say "yes" to each of w1 and w2 as being an appropriate description of x. The meaning of qualified linguistic terms such as very Small could be obtained either by getting voters to vote on each possible qualified linguistic term (this is possibly infinite) or by getting voters to vote on a general definition of each hedge in the context of this linguistic variable. As described here, the voting model process could be used to derive m, the semantic rule of a linguistic variable, in a very natural manner.
Figure 4-1: An example of a linguistic variable defined over the universe Ω_Position.
For example, consider a die variable defined over {1, ..., 6}. Let W, the set of linguistic terms of the corresponding linguistic variable, consist of {Small, Medium, Large}. The meaning of the words Small, Medium, and Large can be generated from the voting patterns of the population for each die value. Table 4-1 presents the voting pattern for a population of ten voters for the die value 3. Similar voting patterns are generated for the other die values. Subsequently, in the case of each word wi, the meaning consists of the list of die values associated with the proportion of voters who accept wi as a description of the respective die value. These proportions correspond to membership values. For example, the value 3 will have a membership value of 1 in the fuzzy set Medium. The voting pattern presented in Table 4-1 can alternatively be viewed as a linguistic description of the die value 3; this description is characterised by the following fuzzy set: {Medium/1 + Small/0.2}.
Table 4-1: Voting pattern for ten people corresponding to the linguistic interpretation of the die value 3.

Word\Person   1    2    3    4    5    6    7    8    9    10
Medium        Yes  Yes  Yes  Yes  Yes  Yes  Yes  Yes  Yes  Yes
Small         Yes  Yes  No   No   No   No   No   No   No   No
Let the Position universe Ω_Position be defined over the range [0, 100]. Then the definitions of the above (using uniformly placed triangular) fuzzy sets (in Fril notation³ [Baldwin, Martin and Pilsworth 1988; Baldwin, Martin and Pilsworth 1995]) could be as follows:

Left [0:1, 50:0]
Middle [0:0, 50:1, 100:0]
Right [50:0, 100:1]

³ A fuzzy set definition in Fril such as Middle [0:0, 50:1, 100:0] can be rewritten mathematically as follows (denoting the membership value of x in the fuzzy set Middle):
μ_Middle(x) = 0              if x ≤ 0
              x/50           if 0 < x ≤ 50
              (100 − x)/50   if 50 < x < 100
              0              if x ≥ 100
This is graphically depicted in Figure 4-2(a). Any position value in Ω_Position has a non-zero membership in at least one, or at most two, of these fuzzy sets. For example, the position 10 will have a membership value of 0.8 in Left and 0.2 in Middle. Linguistic partitions provide a means of giving the data a more anthropomorphic feel, thereby enhancing understandability. In this case the value 10 corresponds to the linguistic description characterised by the following fuzzy set:

Σ_{i=1}^c A_i/μ_{A_i}(10) = {Left/0.8 + Middle/0.2}
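The sketch below (illustrative; the shoulder fuzzy sets Left and Right are modelled as degenerate triangles, an assumption consistent with the Fril definitions above) reproduces the linguistic description of the position value 10:

```python
# Linguistic description of a value under the {Left, Middle, Right}
# triangular partition of the position universe [0, 100].
def triangular(x, a, b, c):
    """Triangular membership with core at b and support (a, c)."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

partition = {
    "Left":   lambda x: triangular(x, -50, 0, 50),   # left shoulder (assumed)
    "Middle": lambda x: triangular(x, 0, 50, 100),
    "Right":  lambda x: triangular(x, 50, 100, 150), # right shoulder (assumed)
}
print({w: f(10) for w, f in partition.items() if f(10) > 0})
# {'Left': 0.8, 'Middle': 0.2}
```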
Triangular fuzzy sets correspond quite naturally to fuzzy numbers. In this example the fuzzy set Middle could also, quite intuitively, be labelled as About_50.
Figure 4-2: Linguistic partition of the variable universe Ω_Position using (a) triangular fuzzy sets and (b) trapezoidal fuzzy sets.
Alternatively, where trapezoidal fuzzy sets are used to partition a universe of discourse, to generate a partition with granularity of n, n + 1 points need to be provided (these n + 1 points include the universe boundary values); this leads to a partitioned universe consisting of n intervals. A trapezoidal fuzzy set can be characterised by four points a, b, c, and d as depicted in Figure 4-2(b). The interval [b, c] characterises the core (i.e. all points in this interval have a membership value of 1) of the fuzzy set, while the interval [a, d] characterises the support of the fuzzy set (i.e. all points in this interval have a membership value > 0). The core [bj, cj] of each fuzzy set is set to the interval [pj, pj+1], while the support [aj, dj] is set to the following interval:
[pj − ((pj − pj−1)/2)·degreeOfOverlap, pj+1 + ((pj+2 − pj+1)/2)·degreeOfOverlap]
Figure 4-3: Partition of the variable universe Ω_Position using four trapezoidal fuzzy sets with varying degrees of overlap; (a) 100% overlap; (b) 50% overlap; (c) no overlap, i.e. crisp sets.
Trapezoidal fuzzy sets correspond quite naturally to classes or intervals in the data. In this example, the fuzzy set Middle could also quite intuitively be labelled as Roughly40_60.
A linguistic hedge can be modelled as a unary operation on membership values:

M: [0, 1] → [0, 1]
Linguistic hedges can be used to modify any fuzzy set. Typical examples of linguistic hedges include very, more or less, slightly, etc., where very is often represented as the unary square operation (that is, M(a) = a², where a corresponds to a membership value) and more or less is often characterised by the unary square root operation (that is, M(a) = √a) [Zadeh 1972]. See Figure 4-4 for a graphic depiction of the linguistic hedges "very" and "more or less". Linguistic hedges can be used to modify the semantics of fuzzy predicates (represented by fuzzy sets as seen here), fuzzy truth-values and fuzzy probabilities. It is important to note that hedges do not exist in classical logic.
Fuzzy inference, despite its imprecise connotations, consists of deductive methods that are sound, rational and not mere heuristic approximations of classical two-valued logical inference. These methods are centred around the Compositional Rule of Inference (CRI) originally introduced by Zadeh [Zadeh 1973]. CRI (often referred to as generalised modus ponens in the literature) views If-Then rules as dependencies, which are characterised by fuzzy relations (see Section 3.7.1.1 for a full presentation of fuzzy relations), and inference reduces to the composition of these relations and membership functions. CRI provides a framework for generalising classical inference processes, based on tautologies such as modus ponens, modus tollens and hypothetical syllogism.
B(y) = max_{x ∈ Ωx} min(A(x), Rxy(x, y))    (4-1)

Figure 4-5: An example of a relationship between variables X and Y characterised by the function f: X → Y; (a) depicts a mapping from value x to y (i.e. y = f(x)); (b) depicts a set mapping from A to B using Equation 4-1.
Zadeh [Zadeh 1973] extended the characterisation of relations between variables from crisp relations to fuzzy relations and thereby paved the way for a new type of inference based upon imprecise concepts represented as fuzzy sets - the compositional rule of inference (CRI). Formally, if Rxy is a fuzzy relation between variables X and Y, and A and B are fuzzy sets defined over Ωx and Ωy respectively, then if it is known that the value of variable X is a fuzzy set A, the fuzzy set B can be inferred as the value of variable Y using Equation 4-1; the key difference being that A and B are fuzzy sets in this case, rather than crisp sets as presented above. The compositional rule of inference in fuzzy logic is often referred to as generalised modus ponens for reasons that will
become obvious over the next couple of paragraphs. Equation 4-1 can be succinctly
written in matrix form as follows:
B = A ∘ Rxy
A graphic example of inference using CRI is presented in Figure 4-7 and is described subsequently. The calculations associated with this example are presented in matrix format. Let A be a discrete fuzzy set A = {x1/0.3, x2/1, x3/0.5}, graphically depicted as a discrete approximation of the fuzzy set A in Figure 4-7. Let Rxy be a fuzzy relation describing a portion, often referred to as a fuzzy patch, of the function y = f(x). This is depicted in Figure 4-7 as a greyscale rectangle with a dashed boundary. This relation Rxy can be generated by a number of means, which are described in subsequent sections. The generation of the relation Rxy used here is described in Section 4.2.1.2 (see Figure 4-9). Each tuple in this discrete fuzzy relation is indicated by a point in Figure 4-7, e.g. <x3, y1>. Let B be the fuzzy set that is inferred from the fuzzy set A and the fuzzy relation (patch) Rxy using the CRI rule as defined above (Equation 4-1). In the following, the fuzzy sets, A and B, and the relation, Rxy, are written mathematically as matrices. The inferred fuzzy set B is calculated as follows:
B = A ∘ Rxy

[0.9  1  0.2] = [0.3  1  0.5] ∘ [0.3  0.3  0.2]
                                [0.9  1    0.2]
                                [0.5  0.5  0.2]

where, for example, the value of 0.9 in matrix B (denoting the membership of y1 in the fuzzy set B) is calculated as follows:

max(min(0.3, 0.3), min(1, 0.9), min(0.5, 0.5)) = max(0.3, 0.9, 0.5) = 0.9.
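The max-min composition above can be reproduced with a few lines of Python (an illustrative sketch, not the book's software):

```python
# Max-min composition B = A o Rxy for the worked example above.
A = [0.3, 1.0, 0.5]                    # memberships of x1, x2, x3 in A
R = [[0.3, 0.3, 0.2],                  # Rxy(x1, y1..y3)
     [0.9, 1.0, 0.2],                  # Rxy(x2, y1..y3)
     [0.5, 0.5, 0.2]]                  # Rxy(x3, y1..y3)

B = [max(min(a, row[j]) for a, row in zip(A, R)) for j in range(3)]
print(B)  # [0.9, 1.0, 0.2]
```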
Figure 4-7: An example of fuzzy inference based on the compositional rule of inference (Equation 4-1); fuzzy set B is inferred by CRI using the fuzzy relation Rxy and the fuzzy set A.
A relation Rxy between any two variables X and Y can characterise different types of relationships. However, in fuzzy logic systems, this relation is restricted to representing the dependency relationship which is expressed using fuzzy conditional unqualified propositions (If-Then rules) of the form "If X is A then Y is B", where both A and B are fuzzy sets. This rule-based format (both crisp and fuzzy) has been commonly and successfully used to describe systems, ranging from controllers to object recognition systems, that lack mathematical models or are difficult to describe [Ralescu 1995; Ruspini, Bonissone and Pedrycz 1998; Terano, Asai and Sugeno 1992]. In fuzzy logic systems, having captured this type of relationship between variables, different types of inference can be performed using CRI. Two types of inference are considered: generalised modus ponens; and generalised modus tollens. Both inference procedures use the relation Rxy. In subsequent subsections, two commonly used approaches for generating such relations are presented: one based upon logical implication; and the other based on conjunction.
r: IF X is A THEN Y is B    (4-2)

where X and Y are variables defined over universes Ωx and Ωy, whose values are fuzzy sets A and B respectively. Here only one variable is used in the conditional part of the rule (also known as the antecedent or body of a rule) and also in the action part of the rule (also known as the consequent or head of the rule); however, any number of variables can be used for each portion of the rule. As alluded to previously, this rule proposition can alternatively be expressed as a fuzzy relation Rxy (i.e. a fuzzy set on the Cartesian universe Ωx × Ωy), where the membership value for each possible tuple <x, y>, for all combinations of x ∈ Ωx and y ∈ Ωy, is determined as follows:

(4-3)
f: X is A'

where A' is a fuzzy set defined on Ωx and potentially different to A, it is possible to infer that Y is B' using CRI (Equation 4-5, a slightly modified version of Equation 4-1). This inference can be succinctly expressed as follows:

Given: r: IF X is A THEN Y is B
And: f: X is A'
Infer: Y is B'    (4-4)

This inference procedure is called generalised modus ponens due to its similarity with the classical modus ponens rule of inference, which states that given a fact f, and logic rule r: if f then g, then the consequent of the rule g can be inferred provided both f and r are true.
The following equation formally defines CRI, and is slightly different to Equation 4-1 so that it can deal with fuzzy propositions that may differ from those expressed in the conditional part of a rule:

B'(y) = sup_{x ∈ Ωx} min[A'(x), Rxy(x, y)]   ∀y ∈ Ωy    (4-5)
A generalised version of modus tollens also exists in fuzzy logic and can be succinctly expressed as follows:

Given: r: IF X is A THEN Y is B
And: f: Y is B'
Infer: X is A'    (4-6)

and the fuzzy relation Rxy is as defined above. When the sets are crisp (i.e. A' = Ā and B' = B̄, where Ā and B̄ refer to the complement of A and B respectively) classical modus tollens is recovered.
As in classical logic, other inference rules are also possible in fuzzy logic such as
hypothetical syllogism and contraposition. For a more complete treatment see [Klir and
Yuan 1995; Zimmermann 1996].
and construct a fuzzy relation Rxy using fuzzy implication functions. This relation can subsequently be used with the CRI inference process described above.

A fuzzy implication function I accepts as input truth values, a and b, of the fuzzy propositions (facts) f and g, and returns the truth value, I(a, b), corresponding to the conditional proposition "if f then g". Fuzzy implication functions possess various mathematical properties, such as monotonicity, continuity and identity (see [Klir and Yuan 1995]), and are in general extensions of classical material implication. In classical logic, various equivalent forms of implication (from a classical truth value perspective) exist. While these are logically equivalent, their extensions in fuzzy logic are not and consequently result in distinct families of fuzzy implication. Extending the classical logic formulas to fuzzy logic leads to different families of fuzzy logic implications that are parameterised by t-norm ⊗, t-conorm ⊕, and fuzzy complement ¬. A selection of these families is described below.
For example, consider the discrete fuzzy sets A = {x1/0.3, x2/1, x3/0.5} and B = {y1/0.9, y2/1, y3/0.2}, graphically depicted as discrete approximations of the fuzzy sets A and B respectively in Figure 4-7. The relation Rxy between the fuzzy sets A and B can be constructed using the Lukasiewicz implication, I(a, b) = min(1, 1 − a + b), as follows:

            Y = B
Rxy         y1/0.9   y2/1   y3/0.2
X = A
x1/0.3      1        1      0.9
x2/1        0.9      1      0.2
x3/0.5      1        1      0.7

One such implication is the Gödel implication:

I(a, b) = 1 if a ≤ b, and b otherwise
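The Lukasiewicz construction of Rxy can be sketched as follows (illustrative code; it reproduces the table above):

```python
# Building Rxy from the rule "IF X is A THEN Y is B" with the
# Lukasiewicz implication I(a, b) = min(1, 1 - a + b).
A = {"x1": 0.3, "x2": 1.0, "x3": 0.5}
B = {"y1": 0.9, "y2": 1.0, "y3": 0.2}

lukasiewicz = lambda a, b: min(1.0, 1.0 - a + b)
R = {(x, y): round(lukasiewicz(ma, mb), 2)
     for x, ma in A.items() for y, mb in B.items()}
for x in A:
    print(x, [R[(x, y)] for y in B])
# x1 [1.0, 1.0, 0.9]; x2 [0.9, 1.0, 0.2]; x3 [1.0, 1.0, 0.7]
```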
Other fuzzy implications have also been introduced which do not fall into any of the
above categories. For further details and comparative studies of fuzzy implications see
[Gaines 1978; Klir and Yuan 1995; Mizumoto and Zimmermann 1982; Ruan and Kerre
1993].
Given: r: IF X is A THEN Y is B
And: f: X is A'
Infer: Y is B' (using CRI and Lukasiewicz's implication)

Figure 4-8: Fuzzy set B' is inferred from the given rule (IF X is A THEN Y is B) and fact (X is A') using the compositional rule of inference (Equation 4-5), where the fuzzy relation Rxy is based upon Lukasiewicz's implication (Equation 4-8).
Given a rule of the form

r: IF X is A THEN Y is B

the goal is to construct a fuzzy relation Rxy, in this case using conjunction. The relation Rxy can be constructed using the following equation:

Rxy(x, y) = μA(x) ⊗ μB(y)    (4-10)
For example, consider the discrete fuzzy sets A = {x1/0.3, x2/1, x3/0.5} and B = {y1/0.9, y2/1, y3/0.2}, graphically depicted as discrete approximations of the fuzzy sets A and B respectively in Figure 4-7. The relation Rxy between the fuzzy sets A and B can be constructed using Equation 4-10, where ⊗, the conjunction operator, is set to min. This calculation is presented in Figure 4-9.
            Y = B
Rxy         y1/0.9   y2/1   y3/0.2
X = A
x1/0.3      0.3      0.3    0.2
x2/1        0.9      1      0.2
x3/0.5      0.5      0.5    0.2
Figure 4-9: Calculating the relation Rxy between the fuzzy sets A and B using Equation 4-10, where ⊗, the conjunction operator, is set to min.
Given: r: IF X is A THEN Y is B
And: f: X is A'
Infer: Y is B' (using CRI and conjunction-based Rxy)

Figure 4-10: Fuzzy set B' is inferred from the given rule (IF X is A THEN Y is B) and fact (X is A') using the compositional rule of inference (Equation 4-5), where the fuzzy relation Rxy is based upon conjunction (Equation 4-10).
As in the implication case, this relation can subsequently be used with equation 4-5 as
part of the CRI inference process. Figure 4-10 presents an example of inference using
CRI with a conjunction-based fuzzy relation. Originally, Mamdani limited the t-norm ⊗ to the min operation, which subsequently became popularised as a means of doing fuzzy control [Terano, Asai and Sugeno 1992]. This type of inference is commonly known as max-min inference (see Figure 4-11 for an example of max-min inference, with further explanation provided in Section 4.3). Over the years max-min inference has
been one of the most popular forms of fuzzy inference. The main reasons for this
popularity include the fact that it works very well in real world applications. It has a
significant advantage in reducing the computational complexity of fuzzy inference and
from a logic perspective, conjunction is somewhat appealing as it expresses a relation
of compatibility.
In the previous section fuzzy inference was presented. Inference was considered from
an individual rule perspective, even though typical rule bases consist of multiple rules.
For example, consider the following knowledge base consisting of n rules and one
unconditional proposition (fact):
r1: IF X1 is A1 AND X2 is B1 AND X3 is C1 THEN Y is D1
...
rn: IF X1 is An AND X2 is Bn AND X3 is Cn THEN Y is Dn
Fact: <x1, x2, x3>
Conclude: Y is Y'
Each rule has three antecedents (conditions) expressed as fuzzy sets Ai, Bi, and Ci defined over the respective universes of discourse Ω_X1, Ω_X2, and Ω_X3. The unconditional proposition is merely a vector of point values <x1, x2, x3> drawn from the universes Ω_X1, Ω_X2, and Ω_X3 respectively. Alternatively, these values could be fuzzy set values or interval values or a mixture of the two. In order to simplify the presentation, point values are chosen. Consequently, each value xi is represented as a fuzzy set with one element xi and an associated membership value of 1 (depicted as straight lines in Figure 4-11). Figure 4-11 illustrates the results of fuzzy inference for this rule base given the vector <x1, x2, x3>. It presents the two rules r1 and r2 which fire (generate a non-empty fuzzy set as a result of applying CRI) and the inferred fuzzy sets Y1' and Y2' (highlighted in grey in Figure 4-11). Here the CRI was based upon a fuzzy relation generated using the conjunction operation min (Equation 4-10). This results in one or more output fuzzy sets being inferred, due to the overlapping nature of the fuzzy sets that populate the input space.
In order for the output of fuzzy inference to be useful in a real world application, it is normally necessary to convert the output fuzzy sets into a crisp number. For example, consider a fuzzy rule base that adjusts the power of a heater based on the current room temperature.
Several defuzzification strategies have been developed over the years for continuous-
valued models in domains ranging from control to financial decision support systems.
Below, a couple of the more prominent approaches to defuzzification are presented.
Figure 4-11: Max-min inference using COG: fuzzy inference using CRI based upon the min relation, and Centre of Gravity decision making.
The centre of gravity (COG) defuzzification procedure is one of the most commonly
used decision making procedures. It is a rather intuitive approach in that the defuzzified
value corresponds to the geometrical centre of mass of the inferred output fuzzy sets.
This is calculated as follows for real-valued fuzzy sets:
ȳ = ∫ y·μ_Y'(y) dy / ∫ μ_Y'(y) dy    (4-11)
For the discrete case in which the universe of Y, Ωy, is defined on a finite set of values {y1, ..., yz}, the defuzzified value is calculated as follows:

ȳ = Σ_{i=1}^z yi·μ_Y'(yi) / Σ_{i=1}^z μ_Y'(yi)
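A sketch of discrete COG defuzzification; the output fuzzy set over heater power levels is hypothetical and chosen purely for illustration:

```python
# Centre of gravity defuzzification for a discrete output fuzzy set.
def centre_of_gravity(memberships):
    """memberships: {y_value: mu_Y'(y_value)} -> crisp output."""
    num = sum(y * mu for y, mu in memberships.items())
    den = sum(memberships.values())
    return num / den if den else 0.0

# A hypothetical inferred output fuzzy set over heater power levels.
Y_prime = {0: 0.0, 25: 0.2, 50: 0.8, 75: 0.4, 100: 0.1}
print(round(centre_of_gravity(Y_prime), 2))  # 56.67
```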
The defuzzification strategies considered so far are suitable for prediction problems
only. Here, however, a decision making mechanism for classification problems is
presented. In classification problems, a canonical rule base takes the following form:
For presentation purposes, rules are assumed to consist of just one condition. The calculation of Rxy is greatly simplified in classification problems, as the output value is no longer a fuzzy set but a singleton. For example, using the Lukasiewicz implication (Equation 4-8) to generate the relation Rxy simply reduces to the input fuzzy set value FSi. Consequently, the fuzzy relation Rxy term in the CRI equation (Equation 4-5) can be replaced by the fuzzy set value FSi. In other words it simplifies to:
When data is presented to the system, fuzzy inference generates a membership value (an activation value) for each rule, which is subsequently associated with the output value Classi for that rule. Decision making then reduces to selecting the output value ClassMax associated with the highest activation value, that is, the classification of the input data tuple is ClassMax. Alternative approximate reasoning strategies, based on support logic (probabilistic reasoning), for both prediction and classification problems, where knowledge is expressed in terms of fuzzy sets and if-then rules, are presented in Chapter 6.
FUZZY RULE BASE
r1: IF X is About_1 THEN Y is About_0
r2: IF X is About_2 THEN Y is About_1
r3: IF X is About_3 THEN Y is About_4
r4: IF X is About_4 THEN Y is About_9

Figure 4-12: A fuzzy rule base that approximates (almost exactly) the function y = (x − 1)², x ∈ [1, 4]. Fuzzy patches (depicted as rectangles) highlight the zones of applicability for rules 2 and 3.
Fuzzy logic has a long and varied application history beginning with the pioneering work of Mamdani [Mamdani 1977; Mamdani and Assilian 1974] in control systems. This led to an avalanche of control applications: in consumer products such as cameras/camcorders (Sanyo, Canon, Minolta), washing machines (AEG, Sharp, Siemens, General Electric), vacuum cleaners (Philips and Siemens), and refrigerators (Whirlpool); in automotive and power generation, such as engine control (Nissan); industrial process control systems such as refining, distillation, cement kiln control, etc.; robotics and manufacturing. Fuzzy logic has also been applied successfully (i.e. many fielded/deployed applications) in decision support systems such as foreign exchange trading [Ralescu 1995], system design, image understanding [Ralescu 1995; Ralescu and Shanahan 1999], and more recently in the fields of machine learning and discovery (described in more detail in Parts III, IV and V of this book). For a more detailed presentation of applications see [Ralescu 1995; Ruspini, Bonissone and Pedrycz 1998; Terano, Asai and Sugeno 1992; Yen and Langari 1998].
4.6 SUMMARY
This chapter has presented the fundamentals behind fuzzy logic, introducing the main forms of knowledge representation and approximate reasoning within the fuzzy logic framework. It began by introducing fuzzy propositions, linguistic variables and linguistic hedges as a means of representing knowledge in terms of natural language statements. The principal rule of inference in fuzzy logic, the compositional rule of inference, was introduced along with the various interpretations that have been developed over the years. Finally, some of the decision making strategies that exist in fuzzy logic for prediction and classification problem domains were described. A simple example illustrated the potential of fuzzy logic as an accurate and transparent modelling technique. Real world applications of fuzzy logic were also overviewed. Some of the concepts presented here, such as linguistic variables and approximate reasoning, will be revisited in Part IV of this book in the context of Cartesian granule features.
CHAPTER 5: PROBABILITY THEORY

Consider the following two statements:
(i) It is certain that James was born around the end of the sixties (1960s).
(ii) Probably, James was born in 1967.
In the first statement, the year in which James was born is imprecisely stated, but certain, whereas in statement (ii), the year is precisely stated, but there is uncertainty about the statement being true or false. Both aspects can coexist, but are distinct. Uncertainty arising from imprecision can be very naturally modelled using traditional set theory and its generalisation, fuzzy set theory, and various set-based probabilistic theories such as possibility theory. On the other hand, uncertainty arising from beliefs or expectations has been addressed by various theories of probability. These and other types of uncertainty, such as ignorance (facilitated by set-based probability theories such as Baldwin's mass assignment theory and Dempster-Shafer theory) and inconsistency (facilitated by mass assignment theory), will be discussed over the course of this chapter.
This chapter focuses on probability theory, and its various generalisations and
specialisations, as a means of representing stochastic uncertainty and imprecision. The
first section reviews the fundamentals of probability theory. Subsequently, three point-
based generalisations and specialisations of probability theory are presented: fully
specified joint probability distributions; naive Bayes classifiers; and Bayesian
networks. This is followed by a presentation of set-based probabilistic techniques:
Dempster-Shafer theory; possibility theory; and mass assignment theory. These set-
based approaches provide semantically richer formalisms than point-based probability
theories, catering not only for uncertainty, but also for ignorance and inconsistency. For
each approach, the respective calculus of operations (inference, decision making,
conjunction, negation, etc.) is described, and the relationships between these modes of
uncertainty representation and fuzzy set theory are also explored. These relationships
facilitate more powerful and expressive forms of knowledge representation and
reasoning, very much in the true synergistic spirit of soft computing. The bi-directional
transformation from a membership value to a point probability is subsequently
described in detail. An intuitive justification and interpretation of this relationship
based on human reasoning (the voting model) is also described. This transformation
forms the basis for new learning algorithms presented in Part IV.
Probability theory has been commonly used to represent and reason with uncertainty
since the 17th century. Various generalisations and specialisations of probability theory
have been developed in the intervening years. Work in the field of probability theory
can be crudely categorised into one of two schools: the objective school; and the
subjective school. Other interpretations of probability also exist such as the logical
perspective, but are not of interest here. The interested reader is referred to [Smithson
1989]. The objective school of thought takes the view that probability is about events
that one can count, i.e. events directly linked to the world. This school uses a frequentist definition of probability, defining it as the proportion of times the event occurs out of all possible
events. For example, the probability of a coin showing heads is the proportion of times
that a tossed coin landed heads up out of all tosses (for a sufficiently long sequence of
repeated events). For the subjective school of thought, on the other hand, probabilities
are linked directly to one's opinions about the exact nature of the world derived from
the information available. This school of probability is often referred to as Bayesian or
personal probability. The probability of a hypothesis (for example, a tossed coin
showing heads) is a measure of a person's belief in that hypothesis given the available
evidence (that the coin is fair, in this example). The subjective view of probability is
normally defended in terms of rational betting behaviours [deFinetti 1937]. The degree of belief in a hypothesis should correspond to the odds at which a rational person would be indifferent to betting for or against that hypothesis. For
example, a person provides you with the odds 2-to-1 that on tossing a coin, heads
comes up (that is, for every franc you bet on heads coming up, you can win 2). If the ratio 2:1 does not accurately reflect the world (tossing coins), then one party (either you
or the person offering the bet) will be guaranteed to lose money over a series of coin
tosses. Thus, subjective probability theory can be given a rationality in terms of betting
behaviour.
The remainder of this section introduces the basic forms of knowledge representation in
probability theory along with basic axioms and assumptions. In probability theory, from
a knowledge representation perspective, domain specific knowledge is captured in
terms of conditional and unconditional probabilistic propositions, while general
knowledge is represented using inference mechanisms based upon conditioning and
various decision making strategies. Typically, probabilistic propositions take two
formats:
where X and Y are random variables taking values x_i and y_j from their respective universes Ω_X and Ω_Y. The vertical line "|" is read as "given"; thus, the conditional proposition can be interpreted as follows: the probability of "variable X having a value x_i, given that all that is known is that variable Y has a value y_j" is prob. Once again prob corresponds to a value in the unit interval [0, 1]. A point probability Pr(x_i | y_j) is associated with each possible combination of values from the universes Ω_X and Ω_Y. These conditional distributions are denoted as follows for any two variables X and Y: Pr(X | Y).    (5-1)
Below, to keep the presentation lucid, definitions are described in terms of a minimal number of variables, X_i, X_j, X_k, etc., and by and large for the discrete case. For a more general and detailed presentation of probability theory, the reader is referred to [DeGroot 1989; Jensen 1996]. Conditional probabilities can be defined in terms of unconditional probabilities as follows:

    Pr(X_i | X_j) = Pr(X_i, X_j) / Pr(X_j)    (5-2)

Rearranging this definition yields the product rule:

    Pr(X_i, X_j) = Pr(X_i | X_j) · Pr(X_j)    (5-3)
Bayes' rule [Bayes 1763], which is derived via the product rule, is defined as follows:

    Pr(X_i | X_j) = Pr(X_j | X_i) · Pr(X_i) / Pr(X_j)
Independence holds for any two variables X_i and X_j, i ≠ j, if the following conditions hold:

    Pr(X_i | X_j) = Pr(X_i)  and  Pr(X_j | X_i) = Pr(X_j)

In other words, knowing the value of variable X_i does not provide us with any information as to the value of variable X_j and vice versa. Consequently, the product rule (Equation 5-3) simplifies to

    Pr(X_i, X_j) = Pr(X_i) · Pr(X_j)
A fully specified joint probability distribution assigns a probability value to each possible combination of variable values <x_1, ..., x_n>.
Inference for systems defined in terms of joint probability distribution (known as the
prior distribution since it is specified prior to inference), and in general for probabilistic
systems, is performed using a conditioning or updating operation. Here, when new
evidence, such as X_k = x_k, becomes available, inference can be performed using
Equation 5-2 in order to get an updated probability, known as the posterior probability,
for the events that may be of interest or relevant.
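To make the conditioning operation concrete, here is a small Python sketch; the joint distribution and its numbers are illustrative assumptions, not taken from the book:

```python
# A fully specified joint distribution over three binary variables,
# stored as a dict from value tuples (x1, x2, x3) to probabilities.
joint = {
    (0, 0, 0): 0.30, (0, 0, 1): 0.10, (0, 1, 0): 0.05, (0, 1, 1): 0.05,
    (1, 0, 0): 0.10, (1, 0, 1): 0.10, (1, 1, 0): 0.10, (1, 1, 1): 0.20,
}

def posterior(joint, query_var, evidence):
    """Pr(query_var | evidence) by conditioning (Equation 5-2): sum the
    joint over entries consistent with the evidence and renormalise."""
    dist = {}
    for assignment, p in joint.items():
        if all(assignment[var] == val for var, val in evidence.items()):
            dist[assignment[query_var]] = dist.get(assignment[query_var], 0.0) + p
    total = sum(dist.values())  # Pr(evidence), the normalising constant
    return {value: p / total for value, p in dist.items()}

# Posterior of variable 0 given that variable 2 takes the value 1
print(posterior(joint, query_var=0, evidence={2: 1}))  # {0: 1/3, 1: 2/3}
```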
Decision making, in the discrete case, can be achieved using a number of mechanisms.
Having inferred a posterior probability for each possible outcome given the evidence
(or just reading the probabilities directly from the joint when no evidence is available),
one decision making approach could be to choose the hypothesis that has the highest
associated (posterior) probability. An alternative approach would be to multiply each
posterior probability by the utility value of the respective outcome and simply choose
the outcome that maximises the resulting expected utility [Lindley 1985]. Various
alternatives exist for decision making in the context of prediction problems (i.e. the
output or dependent variable is continuous). These are discussed in Section 6.3 in the
context of probabilistic reasoning in the Fril programming environment.
Bayes' rule and independence assumptions reduce the size of the joint probability distribution that is required for inference using Equation 5-2. Different types of independence further simplify the representation and inference process in probabilistic systems by reducing the number of dependent variables. Several approaches to representing uncertainty using point-based probabilities, which exploit Bayes' theorem and independence, have been developed; these are presented next.
The naive Bayes classifier assumes that the input variables X_1, ..., X_n are conditionally independent given the class variable Y, i.e.

    Pr(X_1, ..., X_n | Y) = ∏_{j=1}^{n} Pr(X_j | Y)

so that the model is fully specified by the prior Pr(Y) and the conditionals Pr(X_i | Y). Thus, inference (calculation of the posterior probabilities given evidence) using Bayes' theorem simplifies to the following:

    Pr(Y = y_i | X_1 = x_1, ..., X_n = x_n) = [ ∏_{j=1}^{n} Pr(X_j = x_j | Y = y_i) ] · Pr(Y = y_i) / Pr(X_1 = x_1, ..., X_n = x_n)    (5-6)
Decision making consists of taking the classification value y_max whose corresponding posterior probability is the maximum amongst all posterior probabilities Pr(y_i | <x_1, ..., x_n>) for all values y_i ∈ Ω_Y. This is mathematically stated as follows:

    y_max = argmax_{y_i ∈ Ω_Y} Pr(y_i | <x_1, ..., x_n>)

Since, in this decision making strategy, the denominator in Equation 5-6 is common to all posterior probabilities, it can be dropped from the inference process. This further simplifies the reasoning process (and the representation also) to the following:

    y_max = argmax_{y_i ∈ Ω_Y} [ ∏_{j=1}^{n} Pr(X_j = x_j | Y = y_i) ] · Pr(Y = y_i)
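The following Python sketch illustrates this decision rule; the priors and conditionals are illustrative numbers, not drawn from the book:

```python
def naive_bayes_classify(priors, conditionals, observation):
    """Naive Bayes decision rule (Equation 5-6 with the common denominator
    dropped): pick the class y maximising Pr(y) * prod_j Pr(x_j | y).

    priors       -- dict class -> Pr(y)
    conditionals -- dict (feature_index, value, class) -> Pr(x_j | y)
    observation  -- tuple of feature values (x_1, ..., x_n)
    """
    scores = {}
    for y, prior in priors.items():
        score = prior
        for j, xj in enumerate(observation):
            score *= conditionals[(j, xj, y)]
        scores[y] = score
    return max(scores, key=scores.get), scores

# Illustrative two-class problem with two binary features
priors = {"pos": 0.6, "neg": 0.4}
conditionals = {
    (0, 0, "pos"): 0.2, (0, 1, "pos"): 0.8,
    (1, 0, "pos"): 0.3, (1, 1, "pos"): 0.7,
    (0, 0, "neg"): 0.7, (0, 1, "neg"): 0.3,
    (1, 0, "neg"): 0.6, (1, 1, "neg"): 0.4,
}
label, scores = naive_bayes_classify(priors, conditionals, (1, 1))
print(label, scores)  # pos: 0.6*0.8*0.7 = 0.336 vs neg: 0.4*0.3*0.4 = 0.048
```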
[Figure omitted: a five-node belief network with the conditional probability table attached to each node.]
Figure 5-1: An example of a belief network adapted from [Russell and Norvig 1995].
A belief network provides a complete description of the domain, i.e. every entry in the joint probability distribution <x_1, ..., x_n> can be calculated from the information in the network. This follows from the fact that the joint distribution can be rewritten in terms of a conditional probability and a smaller conjunction using the product rule:

    Pr(x_1, ..., x_n) = ∏_{i=1}^{n} Pr(x_i | x_{i-1}, ..., x_1)    (5-8)
Equation 5-8 is commonly referred to as the chain rule in the literature. Probabilities encoded in the nodes of a Bayesian network denote conditionals of the form Pr(X_i | Parents(X_i)), where the nodes are suitably labelled in any order consistent with the partial ordering implicit in the graph structure. Thus, the conditionals in Equation 5-8 can be substituted by the conditionals explicitly represented in the Bayesian network:

    Pr(x_1, ..., x_n) = ∏_{i=1}^{n} Pr(x_i | Parents(X_i))

Thus, inference reduces to the product of conditionals (in their prior or posterior states, depending on whether the conditionals are dependent on the evidence or not). Bayesian networks further exploit independence in the following ways, thereby simplifying the inference process:
• Consider nodes numbered 1, 3, and 5 in Figure 5-1. In the case of this sub-graph, variables 1 and 5 are conditionally independent given variable 3. Thus Pr(5 | 1, 3) reduces to Pr(5 | 3).
• The anterior nodes of a node are the set of nodes that cannot be reached via a directed path. For example, the anterior nodes of node 5 in Figure 5-1 are {1, 2, 3, 4}. The probability of a node given its parents is independent of its anteriors. For example, Pr(5 | 1, 2, 3, 4) = Pr(5 | 3).
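The following Python sketch illustrates how the joint probability of a full assignment factorises into the network's conditionals; the three-node network and its numbers are illustrative assumptions, not the network of Figure 5-1:

```python
# Pr(assignment) is the product of each node's conditional given its
# parents (the chain rule of Equation 5-8 under the network's
# independence assumptions).
parents = {"A": [], "B": [], "C": ["A", "B"]}
cpt = {
    # (node, own value, tuple of parent values) -> probability
    ("A", 1, ()): 0.3, ("A", 0, ()): 0.7,
    ("B", 1, ()): 0.6, ("B", 0, ()): 0.4,
    ("C", 1, (1, 1)): 0.9, ("C", 0, (1, 1)): 0.1,
    ("C", 1, (1, 0)): 0.5, ("C", 0, (1, 0)): 0.5,
    ("C", 1, (0, 1)): 0.4, ("C", 0, (0, 1)): 0.6,
    ("C", 1, (0, 0)): 0.1, ("C", 0, (0, 0)): 0.9,
}

def joint_probability(assignment):
    """Pr(assignment) = prod_i Pr(x_i | Parents(x_i))."""
    p = 1.0
    for node, value in assignment.items():
        parent_values = tuple(assignment[q] for q in parents[node])
        p *= cpt[(node, value, parent_values)]
    return p

print(joint_probability({"A": 1, "B": 0, "C": 1}))  # 0.3 * 0.4 * 0.5 = 0.06
```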
Bayesian networks support four patterns of reasoning - diagnostic, causal, intercausal and mixed - which are depicted in Figure 5-2. See [Russell and Norvig 1995] for a more complete description of these reasoning patterns. As a result of this
flexibility, the general problem of inference in Bayesian networks has been shown to be
NP-hard, but recent work has resulted in efficient algorithms that allow exact inference
[Lauritzen and Spiegelhalter 1988; Pearl 1986; Pearl 1988] (exploiting research results
in graph theory and mathematical representations of probability distributions) and
approximate inference based upon simulations and bounding techniques which sacrifice
precision for efficiency [Russell and Norvig 1995]. The main ideas for inference in
Bayesian networks have been described here, but due to space limitations the
technicalities of the various exact and approximate inference algorithms are not
presented. However, the interested reader is referred to [Jensen 1996; Krause and Clark
1993; Russell and Norvig 1995] for excellent tutorial level presentations of these
inference algorithms.
Decision making in the case of Bayesian networks is similar to that for naive Bayes and
fully specified joint probability distributions, in that it reduces to taking the hypothesis
associated with the maximum posterior probability or maximum expected utility. See
Section 5.2.1 for more details.
For a long time point-based probability theory was the only way of expressing
uncertainty. As seen in the previous section, probability theory is a form of knowledge
representation that allows uncertainties to be represented by associating a probability
with all possible values for a variable (or group of variables), indicating that one value is more likely than another. Inference is based upon conditioning. Point-based
probability theory, while being an intuitive way of representing uncertainty, does not
cater for other areas of incompleteness in knowledge representation such as ignorance
and inconsistency. In order to address these forms of uncertainty, generalisations of
point-based probability theory from point functions to set functions have been
developed. This has resulted in the introduction of Dempster-Shafer theory [Dempster
1967; Shafer 1976], possibility theory [Zadeh 1978], and mass assignment theory
[Baldwin 1991b]. Subsequently, this section presents an overview of Dempster-Shafer
theory, possibility theory, and mass assignment theory. Formal connections between
fuzzy set theory and these theories are described and illustrated with examples as part
of this overview.
[Figure omitted: example networks illustrating the reasoning patterns, including intercausal and mixed inference.]
Figure 5-2: Examples of reasoning patterns that can be handled by Bayesian networks. E represents an evidence variable and Q is a query variable.
In Dempster-Shafer theory, belief mass can be assigned to sets of propositions without there being a necessary requirement to distribute the mass with finer granularity among the individual propositions in the set. This allows a form of ignorance. For example, consider that it is known for certain that a six-faced die after being rolled has a value which is even, whilst being totally ignorant as to which of the set of possible even numbers {2, 4, 6} it is. In Dempster-Shafer theory this information would be represented as a probability distribution over the elements of the power set of the frame of discernment. In terms of the die example, this probability assignment would consist of a single element {2, 4, 6} and an associated probability mass of 1. This is denoted as follows: <{2, 4, 6}:1>. Suppose that an available expert subsequently testifies that with 90% certainty the die is fair (i.e. he is 90% sure that the probability of the die showing any value is 0.1667). Then Dempster-Shafer theory gives the following updated probability assignment: <{2, 4, 6}:0.1, {2}:0.3, {4}:0.3, {6}:0.3>, where the belief mass has been redistributed according to the expert's information.
Basic probability assignment: Let X be a variable defined on the universe Ω_X. A basic probability assignment (bpa) m is a function from P(X), the power set of Ω_X, to the unit interval [0, 1]:

    m: P(X) → [0, 1]

such that
    (i) m(∅) = 0
    (ii) Σ_{A ∈ P(X)} m(A) = 1
Every set A ∈ P(X) for which m(A) > 0 is called a focal element of m. Basic probability assignments are denoted with the letter m qualified by the associated name (e.g. the basic probability assignment for the concept even is denoted by m_even) and, when Ω_X (the universe on which a bpa m is defined) is finite, m can be fully characterised by a list of its focal elements A_i with the corresponding belief masses m(A_i) as follows: <A_i : m(A_i)>. For example, the bpa for large die numbers could be m_Large = <{5, 6}:0.8, {3, 4, 5, 6}:0.2>. Alternatively, bpas can be functionally denoted (for both discrete and continuous universes). Consider a frame of discernment Ω_X = {x_1, x_2, x_3, x_4, x_5}. A bpa m, representing a mass for each A ∈ P(X), can be written as follows:
(5-10)
Assigning mass to a set carries no commitment as to how that mass should be assigned to the individual elements, the singleton sets. The union of the focal elements forms the core of the bpa. Contrary to the axioms of point-based probability theory, the probability of the negation of a proposition cannot be derived from the probability of the proposition, i.e.

    Pr(¬A) ≠ 1 - Pr(A)

A bpa can be viewed as a form of knowledge representation that expresses lower and upper probability measures for every set A ∈ P(X), i.e. a probability interval. These lower and upper probability measures are known as belief and plausibility measures respectively.
Belief measure: Given a bpa m for a variable X defined over the universe Ω_X, a unique belief measure for every set A ∈ P(X) can be determined as follows:

    Bel(A) = Σ_{B | B ⊆ A} m(B)    (5-11)

Belief measures are super-additive with regard to set union, which is a weaker version of the additive property of point-based probability measures.

Plausibility measure: Given a bpa m, a unique plausibility measure for every set A ∈ P(X) can be determined as follows:

    Pl(A) = Σ_{B | B ∩ A ≠ ∅} m(B)

Plausibility measures are sub-additive with regard to set union. Plausibility measures are duals of belief measures since:

    Pl(A) = 1 - Bel(¬A)
Belief or plausibility measures can be calculated from a bpa m as shown above. The inverse is also possible. Given, for example, a belief measure Bel, the corresponding unique bpa m is determined for all A ∈ P(X) by the following formula:

    m(A) = Σ_{B | B ⊆ A} (-1)^{|A-B|} Bel(B)

Total ignorance is represented by the vacuous bpa, m(Ω_X) = 1, i.e. all mass is assigned to the set of values made up of the frame of discernment Ω_X and all other A ∈ P(X) are assigned a zero mass.
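The following Python sketch computes belief and plausibility measures from a bpa, using the updated die bpa from the example above:

```python
def belief(bpa, A):
    """Bel(A) = sum of masses of focal elements contained in A."""
    return sum(m for B, m in bpa.items() if set(B) <= set(A))

def plausibility(bpa, A):
    """Pl(A) = sum of masses of focal elements intersecting A."""
    return sum(m for B, m in bpa.items() if set(B) & set(A))

# The updated die bpa from the text: <{2,4,6}:0.1, {2}:0.3, {4}:0.3, {6}:0.3>
bpa = {frozenset({2, 4, 6}): 0.1,
       frozenset({2}): 0.3, frozenset({4}): 0.3, frozenset({6}): 0.3}

print(belief(bpa, {2}), plausibility(bpa, {2}))        # 0.3, ~0.4
print(belief(bpa, {2, 4}), plausibility(bpa, {2, 4}))  # 0.6, ~0.7
```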
Dempster's rule of combination: Given two bpas m_1 and m_2 defined over the same universe of discourse Ω_X and originating from independent sources (e.g. from two experts), the aggregation of these two bpas m_1 and m_2 results in a new bpa m_1,2 where the mass associated with each A ∈ P(X) is calculated using Dempster's rule of combination as follows:

    (m_1 ⊕ m_2)(A) = [ Σ_{B ∩ C = A} m_1(B)·m_2(C) ] / [ 1 - Σ_{B ∩ C = ∅} m_1(B)·m_2(C) ]    (5-13)
The numerator in Equation 5-13 corresponds to the sum over all conjunctions of arguments (intersections) that support A. The mass associated with each argument, m_1(B) and m_2(C), is combined using the product. This is exactly the same way in which the joint probability distribution is calculated from two independent marginals (point-based probability theory); consequently, it is justified on the same grounds. The denominator is the normalisation coefficient obtained from the mass assigned to the null set or contradictory information. This normalisation coefficient has been a contentious issue, sometimes leading to undefined results (in the case when the cores of m_1 and m_2 are disjoint) and sometimes to counter-intuitive results, especially when the two pieces of evidence m_1 and m_2 are highly conflicting [Zadeh 1986]. The use of normalisation corresponds to applying the closed world assumption: the truth must lie somewhere in the Boolean algebra of propositions derived from the frame of discernment Ω_X. A deeper problem with Dempster's rule is its discontinuous nature in the neighbourhood of total conflict [Krause and Clark 1993].
The combined body of evidence m_Small ⊕ m_About_2, where m_Small = <{1}:0.4, {1, 2}:0.5, {1, 2, 3}:0.1> and m_About_2 = <{2}:0.4, {1, 2, 3}:0.6>, is calculated using Equation 5-13 and the following matrix:

    m_Small \ m_About_2    {2}: 0.4           {1, 2, 3}: 0.6
    {1}: 0.4               ∅ = 0.16           {1} = 0.24
    {1, 2}: 0.5            {2} = 0.2          {1, 2} = 0.3
    {1, 2, 3}: 0.1         {2} = 0.04         {1, 2, 3} = 0.06
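The following Python sketch implements Equation 5-13 and reproduces the combination of m_Small and m_About_2 from the matrix above:

```python
from itertools import product

def dempster_combine(m1, m2):
    """Dempster's rule of combination (Equation 5-13): intersect focal
    elements pairwise, multiply their masses, then renormalise by the
    mass not assigned to the empty set."""
    combined, conflict = {}, 0.0
    for (B, mB), (C, mC) in product(m1.items(), m2.items()):
        A = B & C
        if A:
            combined[A] = combined.get(A, 0.0) + mB * mC
        else:
            conflict += mB * mC
    if conflict == 1.0:
        raise ValueError("total conflict: combination undefined")
    return {A: m / (1.0 - conflict) for A, m in combined.items()}

m_small = {frozenset({1}): 0.4, frozenset({1, 2}): 0.5, frozenset({1, 2, 3}): 0.1}
m_about_2 = {frozenset({2}): 0.4, frozenset({1, 2, 3}): 0.6}

for A, m in sorted(dempster_combine(m_small, m_about_2).items(), key=lambda x: len(x[0])):
    print(sorted(A), round(m, 4))
# Conflict mass is 0.16 ({1} vs {2}); after normalising by 1 - 0.16 = 0.84:
# {1}: 0.2857, {2}: 0.2857, {1, 2}: 0.3571, {1, 2, 3}: 0.0714
```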
Various evidence conditioning and belief revision operations have been proposed within the Dempster-Shafer theory of uncertainty [Kalvi 1993; Krause and Clark 1993; Kruse, Schwecke and Heinsohn 1991]. They allow the updating of probability masses in the light of some new information which becomes available and that is certain, i.e. the evidence is absolutely reliable, but imprecise. This evidence corresponds to a bpa with one focal element with an associated belief mass of 1. The conditioning operation for updating a bpa m given new evidence E is commonly defined as follows:

    m(A | E) = m(A) / Bel_m(E)    if A ⊆ E
    m(A | E) = 0                  otherwise
where both A and E are elements of P(X), the power set of the frame of discernment Ω_X, and Bel_m(E) denotes the belief of E based on m and Equation 5-11. For example, consider a bpa m that constrains the values of a variable X defined on the universe Ω_X = {a, b, c, d, e}; upon receiving "certain" information that the value of X lies in the subset E = {a, b, c}, the conditional mass distribution m(· | E) is calculated using the definition above. This approach to updating has the effect of transferring the mass (rescaled) to the focal elements of the original bpa that are subsumed by the new evidence. The resulting mass assignment can then be used to calculate corresponding belief and plausibility measures, or they can alternatively be calculated directly from the evidence as follows:

    Bel_{m(·|E)}(A | E) = Bel_m(A ∩ E) / Bel_m(E)

and the corresponding plausibility measure follows from the duality Pl(A | E) = 1 - Bel(¬A | E).
When a decision needs to be made, belief mass can be redistributed evenly among the propositions that make up the focal elements [Smets 1990]. Consequently, the point-valued probability Pr(A) associated with a proposition A (i.e. A is a singleton) is the sum of the probabilities that were assigned to A as a result of A being a part of a focal element in bpa m. In other words:

    Pr({x}) = Σ_{B | x ∈ B} m(B) / |B|

where |·| denotes set cardinality. This transformation from belief masses, referred to as the credal level by Smets, to point probabilities, termed the pignistic level by Smets, plays an integral role in the transferable belief model proposed by Smets [Smets 1990; Smets 1994].
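A minimal Python sketch of this pignistic transformation, applied to the die bpa used earlier in this section:

```python
def pignistic(bpa):
    """Smets' pignistic (credal -> point probability) transformation:
    each focal element's mass is shared equally among its elements,
    Pr(x) = sum over focal elements B containing x of m(B) / |B|."""
    prob = {}
    for B, m in bpa.items():
        share = m / len(B)
        for x in B:
            prob[x] = prob.get(x, 0.0) + share
    return prob

# The die example again: <{2,4,6}:0.1, {2}:0.3, {4}:0.3, {6}:0.3>
bpa = {frozenset({2, 4, 6}): 0.1,
       frozenset({2}): 0.3, frozenset({4}): 0.3, frozenset({6}): 0.3}
print(pignistic(bpa))  # each even value gets 0.3 + 0.1/3 = 0.3333...
```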
A possibilistic body of evidence is a bpa

    m: P(X) → [0, 1]

such that
    (i) m(∅) = 0
    (ii) Σ_{A ∈ P(X)} m(A) = 1
    (iii) the focal elements are nested: the focal elements A ∈ P(X) are linearly ordered according to the subset relationship ⊂, so that for a bpa of the form <A_1:m_1, A_2:m_2, ..., A_n:m_n> the following ordering between focal elements holds: A_1 ⊂ A_2 ⊂ ... ⊂ A_n.
For example, consider two bpas, m_1 and m_2, that are defined over the universe {a, b, c, d, e}. As a consequence of the nested nature of the focal elements that make up a body of evidence in possibility theory, the following properties of necessity and possibility measures hold for any two sets A and B ∈ P(X) [Klir and Yuan 1995]:

    Nec(A ∩ B) = min(Nec(A), Nec(B))
    Pos(A ∪ B) = max(Pos(A), Pos(B))
A possibility distribution π is a function

    π: Ω_X → [0, 1]

defined by

    π(x) = Pos({x})    (5-16)

from which the possibility measure of any set A can be recovered as

    Pos(A) = max_{x ∈ A} π(x)    (5-17)

When the frame of discernment is infinite, sup is used in place of max in Equation 5-17. Consequently, given a nested body of evidence m, it is possible to directly generate the corresponding possibility distribution using Equation 5-16. This is described subsequently.
Within possibility theory, the focal elements of a body of evidence are nested, i.e. the focal elements A ∈ P(X) are linearly ordered according to the subset relationship ⊂. This ordering amongst focal elements permits the representation of basic probability assignments on a finite frame of discernment in a convenient form, as an n-tuple. The tuple, written as m = <m_1, m_2, ..., m_n>, represents <m(A_1), m(A_2), ..., m(A_n)>.
Since π(x_i) = Pos({x_i}) (see Equation 5-16), π(x_i) can be simply calculated as follows:

    π(x_i) = Pos({x_i}) = Σ_{k=i}^{n} m(A_k) = Σ_{k=i}^{n} m_k    (5-18)
This permits the calculation of π(x_i) from a basic probability assignment. The reverse is also possible; given a possibility distribution, it is possible to determine uniquely the associated basic probability assignment. This becomes obvious when Equation 5-18 is expanded for each x_i in the frame of discernment Ω_X. Solving these equations for each m_i (the belief mass for each focal element) reduces the calculation of the basic probability assignment from the corresponding possibility distribution to the following:

    m_i = π(x_i) - π(x_{i+1})    where π(x_{n+1}) = 0    (5-19)
For example, consider the possibilistic bpa m = <{a}:0.3, {a, b}:0.5, {a, b, c, d}:0.2> defined over the universe {a, b, c, d, e}. Its corresponding possibility distribution π can be calculated by firstly writing the basic probability assignment in tuple format (over the nested sets {a}, {a, b}, {a, b, c}, {a, b, c, d}, {a, b, c, d, e}) as follows:

    m = <0.3, 0.5, 0, 0.2, 0>

Applying Equation 5-18 to this tuple results in the possibility distribution <π(a), π(b), π(c), π(d), π(e)> = <1, 0.7, 0.2, 0.2, 0>; for example,

    π(b) = m_2 + ... + m_5 = 0.5 + 0 + 0.2 + 0 = 0.7

Possibility measures Pos(A) can be calculated easily from the possibility distribution using Equation 5-17. Considering the previous example, Pos({a, b}) can be calculated as follows:

    Pos({a, b}) = max(π(a), π(b)) = max(1, 0.7) = 1
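A small Python sketch of Equation 5-18 and its inverse, reproducing the example above:

```python
def possibility_from_bpa(masses):
    """Given a nested bpa in tuple format <m_1, ..., m_n> (m_i is the mass
    of the i-th nested focal element A_i = {x_1, ..., x_i}), Equation 5-18
    gives pi(x_i) = sum_{k=i..n} m_k."""
    n = len(masses)
    return [sum(masses[k] for k in range(i, n)) for i in range(n)]

def bpa_from_possibility(pi):
    """Inverse transformation (Equation 5-19):
    m_i = pi(x_i) - pi(x_{i+1}), with pi(x_{n+1}) = 0."""
    padded = list(pi) + [0.0]
    return [padded[i] - padded[i + 1] for i in range(len(pi))]

# The example bpa <{a}:0.3, {a,b}:0.5, {a,b,c,d}:0.2> in tuple format
m = [0.3, 0.5, 0.0, 0.2, 0.0]
pi = possibility_from_bpa(m)
print(pi)                        # [1.0, 0.7, 0.2, 0.2, 0.0]
print(bpa_from_possibility(pi))  # recovers m (up to float rounding)
```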
As in Dempster-Shafer theory, inference in possibility theory comes in terms of belief revision and updating. A logic-based approach to inference has also been developed [Dubois and Prade 1988]. The definitions of belief revision and updating are summarised here for both the necessity measure Nec and possibility measure Pos. See [Kruse, Schwecke and Heinsohn 1991] for a more detailed presentation and discussion of these operations. A commonly used approach to updating a necessity measure Nec and possibility measure Pos given new evidence E, E ∈ P(X), is defined as follows:
    Nec(A | E) = Nec(A ∩ E) / Nec(E)    if A ∩ E ≠ ∅
    Nec(A | E) = 0                      otherwise

    Pos(A | E) = (Pos(A ∪ Ē) - Pos(Ē)) / (1 - Pos(Ē))

where Nec(E) ≠ 0 (i.e. Pos(E) = 1) and Ē denotes the complement of E. These updating operations are concerned with redistributing mass such that Nec(E | E) = 1. Belief revision, on the other hand, is concerned with revising a body of evidence such that it is consistent with the truth lying in E. This is achieved as follows for Nec_E(A) and Pos_E(A), the respective revised necessity and possibility measures:

    Nec_E(A) = (Nec(A ∪ Ē) - Nec(Ē)) / (1 - Nec(Ē))
    Pos_E(A) = Pos(A) / Pos(E)
• α-cuts are nested: a fuzzy set can be viewed as a family of nested sets - the α-cuts (see Section 3.3). For example, consider the following fuzzy set A = {0.4/a + 0.6/b + 0.7/c + 1/d}. The following are the α-cuts A_α of A:

    A_1.0 = {d}
    A_0.7 = {c, d}
    A_0.6 = {b, c, d}
    A_0.4 = {a, b, c, d}

The membership function of a fuzzy set A can thus be numerically equated with a possibility distribution:

    μ_A(x) = π(x | A)
Consequently, it is possible to represent systems using both possibility theory and fuzzy set theory, or to translate one into the other. For example, fuzzy sets provide a very high level representation of possibilistic bodies of evidence, so one could transform these bodies of evidence to fuzzy sets and if-then rules and use fuzzy reasoning in order to perform inference and decision making, which is, in general, much more transparent and efficient. Alternatively, in other situations such as inductive reasoning, it is possible to measure the degree of match of fuzzy events exploiting possibility and necessity measures as introduced above and in Section 3.6. These measures could potentially highlight uncertainty that might otherwise go unnoticed; for example, they could identify a model or data deficiency [Klir and Yuan 1995].
Mass assignment: Let X be a variable defined on the universe Ω_X. A mass assignment m, defined over the universe Ω_X, is a function from P(X), the power set of Ω_X, to the unit interval [0, 1]:

    m: P(X) → [0, 1]

such that
    (i) Σ_{A ∈ P(X)} m(A) = 1

Every set A ∈ P(X) for which m(A) > 0 is called a focal element of m. Notice here that the condition m(∅) = 0 has been dropped. In other words, mass can be allocated to the null set, which enables the modelling of inconsistency in a mass assignment. Mass assignments are denoted with the letters MA qualified by the associated name (e.g. the mass assignment for the concept even is denoted by MA_even) and can be written using a list (<A_i : m(A_i)>) or functional format (as is the case for basic probability assignments).
A mass assignment can be viewed as a form of knowledge that expresses upper and lower probabilities for the individual elements of the frame of discernment. As in Dempster-Shafer theory, a probability interval can be calculated for every set A ∈ P(X) using the necessity and possibility measures. Given a mass assignment m, a unique necessity measure for every set A ∈ P(X) is determined as follows:

    Nec(A) = Σ_{B | B ⊆ A} m(B)

and a unique possibility measure is determined for every set A ∈ P(X) as follows:

    Pos(A) = Σ_{B | B ∩ A ≠ ∅} m(B)
For example, a mass assignment over degree classifications can be written functionally as follows:

    MA_class(A) = 0.3    if A = {pass}
                  0.4    if A = {pass, second-class-honours}
                  0.3    if A = {pass, second-class-honours, first-class-honours}
                  0      otherwise

yielding the following probability intervals for the individual classifications:

    0.3 ≤ Pr(pass) ≤ 1
    0 ≤ Pr(second-class) ≤ 0.7
    0 ≤ Pr(first-class) ≤ 0.3
The least prejudiced distribution (LPD) of a mass assignment redistributes the mass of each focal element among its constituent elements:

    Pr(x_i | MA) = Σ_{A ∈ P(X) | x_i ∈ A} MA(A) · Pr_A(x_i)

where P(X) denotes the power set of Ω_X, MA(A) denotes the mass associated with focal element A in the mass assignment MA, Pr_A(x_i) is the probability distribution on a focal element A, and Pr(x_i | MA) is the updated or posterior probability distribution obtained when the mass assignment MA is provided. Pr_A(x_i) is a local probability distribution or selection rule [Ralescu 1997] for each focal element A in the mass assignment MA and is defined as follows:

    Pr_A(x_i) = Pr(x_i) / Σ_{x_j ∈ A} Pr(x_j)    if x_i ∈ A, and 0 otherwise    (5-21)

where Pr is a prior distribution on Ω_X.
Notice that the least prejudiced distribution is a more general version of the pignistic distribution introduced by Smets as part of the transferable belief model [Smets 1990; Smets 1994]. The transformation of a mass assignment to a point probability distribution can be simply viewed as the updated point probability distribution obtained when a prior is conditioned on that mass assignment, i.e. Pr(X = x_i | MA). This relationship is further considered during the presentation of the bi-directional transformation of a fuzzy set to a probability distribution in Section 5.4.
Point Semantic Unification: Let m: <M_i : m_i> and d: <D_j : d_j> be two mass assignments specified in terms of their focal elements M_i and D_j ∈ P(X) (the power set of the frame of discernment Ω_X) and their associated masses respectively. The point semantic unification calculates the point probability resulting from the conditioning of m on evidence d (defined here for the discrete case; see Section 6.2.1 for the continuous case [Baldwin, Lawry and Martin 1996]) as follows:

    Pr(m | d) = Σ_{i,j=1}^{n} m_i·d_j · ( Σ_{x ∈ M_i ∩ D_j} Pr(x) ) / ( Σ_{x ∈ D_j} Pr(x) )    (5-23)

where Pr is the prior distribution on Ω_X.
The point semantic unification of MA_Small given the evidence MA_About_2, Pr(MA_Small | MA_About_2), and a uniform prior, is calculated using the following matrix:

    MA_Small \ MA_About_2    {2}: 0.4            {1, 2, 3}: 0.6
    {1}: 0.4                 0                   1/3(0.4 · 0.6) = 0.08
    {1, 2}: 0.5              0.5 · 0.4 = 0.2     2/3(0.5 · 0.6) = 0.2
    {1, 2, 3}: 0.1           0.1 · 0.4 = 0.04    0.1 · 0.6 = 0.06

giving Pr(MA_Small | MA_About_2) = 0.08 + 0.2 + 0.2 + 0.04 + 0.06 = 0.58.
Interval semantic unification computes lower and upper bounds on this probability instead of a point value. Each cell of the matrix contributes a mass m_i·d_j, and the truth of each cell is classified as follows:

    T(M | D) = t (true)         if D ⊆ M
               f (false)        if D ∩ M = ∅
               u (uncertain)    otherwise

The lower bound sums the masses of the true cells, while the upper bound additionally includes the masses of the uncertain cells.
For example, consider the same two mass assignments as were used in the point semantic unification example above. The interval semantic unification of MA_Small given the evidence MA_About_2 is calculated using the same matrix, classifying each cell as true, false or uncertain:

    Pr(MA_Small | MA_About_2) = [0.2 + 0.04 + 0.06, 0.2 + 0.04 + 0.06 + 0.24 + 0.3]
                              = [0.3, 0.84]
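A Python sketch of point semantic unification under a uniform prior (exact rational arithmetic is used so the result matches the worked example; all names are illustrative):

```python
from fractions import Fraction

def point_semantic_unification(m, d):
    """Point semantic unification (Equation 5-23) under a uniform prior:
    Pr(m | d) = sum_ij m_i * d_j * |M_i n D_j| / |D_j|  (with a uniform
    prior the prior-probability ratios reduce to cardinality ratios)."""
    total = Fraction(0)
    for M, mi in m.items():
        for D, dj in d.items():
            total += Fraction(mi) * Fraction(dj) * Fraction(len(M & D), len(D))
    return total

MA_small = {frozenset({1}): Fraction(2, 5),          # 0.4
            frozenset({1, 2}): Fraction(1, 2),       # 0.5
            frozenset({1, 2, 3}): Fraction(1, 10)}   # 0.1
MA_about_2 = {frozenset({2}): Fraction(2, 5),        # 0.4
              frozenset({1, 2, 3}): Fraction(3, 5)}  # 0.6

p = point_semantic_unification(MA_small, MA_about_2)
print(p, float(p))  # 29/50 = 0.58, matching 0.08 + 0.2 + 0.2 + 0.04 + 0.06
```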
This bi-directional transformation forms the basis for the learning algorithms proposed in Part IV of this book.
As seen in Section 5.3.2.1, using the possibilistic principle, a fuzzy set A can be
transformed into a corresponding possibility distribution 1t by simply equating the
membership of a value with the possibility, that is, μ_A(x) = π(x | A). Subsequently,
methods that transform probabilities to possibilities can be used to generate
membership functions. Several researchers have investigated the relationship between
possibility distributions and probability distributions [Baldwin 1991b; Dubois and
Prade 1983; Klir 1990; Sudkamp 1992]. This research has been guided by Zadeh's
possibility/probability consistency principle [Zadeh 1978], which states the
following:
If a variable X can take values x_1, ..., x_n with respective possibility and probability distributions π = <π_1, ..., π_n> and Pr = <p_1, ..., p_n>, then the degree of consistency of the probability distribution Pr with the possibility distribution π is given by the following:

    Consistency(Pr, π) = Σ_{i=1}^{n} π_i·p_i
Alternative definitions of consistency also exist [Dubois and Prade 1980], however the
importance of this measure is that it serves as an "approximate formalisation of the
heuristic observation that a lessening of the possibility of an event tends to lessen its
probability - but not vice versa" [Zadeh 1978]. The possibility/probability consistency
principle provides a basis for the calculation of a possibility distribution from a
corresponding probability distribution.
Step 1: Fuzzy set ⇔ mass assignment: As seen in Section 5.3.2, a fuzzy set can formally be transformed into a nested body of evidence or mass assignment via its corresponding possibility distribution π. Consider that a variable X has a fuzzy set value f, where f is a fuzzy set defined on the discrete universe Ω_X = {x_1, ..., x_n}, whose support corresponds to Ω_X (for convenience). This is written more succinctly as follows:

    f = Σ_{i=1}^{n} x_i / μ_f(x_i)

The proposition that "X has a fuzzy set value f" induces a possibility distribution over the values of X such that the membership values of x_i are numerically equated with possibility, i.e.

    π_f(x_i) = μ_f(x_i)
Suppose f is a normal fuzzy set where the elements are ordered such that

    μ_f(x_1) ≥ μ_f(x_2) ≥ ... ≥ μ_f(x_n)

then each nested set A_i = {x_1, ..., x_i} receives the mass π_f(x_i) - π_f(x_{i+1}), with π_f(x_{n+1}) = 0. This leads to the following mass assignment corresponding to the fuzzy set f:

    MA_f = <{x_1, ..., x_i}: π_i - π_{i+1}>    with π_{n+1} = 0 and ∀i ∈ {1, ..., n}
Consider the following example, where a fuzzy set f = {a/1 + b/0.5 + c/0.5 + d/0.2} is transformed into its corresponding probability distribution. A uniform prior probability distribution is assumed here. This results in the following calculations:

    π_f = <1, 0.5, 0.5, 0.2>
    MA_f = <{a}:0.5, {a, b, c}:0.3, {a, b, c, d}:0.2>
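A minimal Python sketch of this transformation, reproducing MA_f for the example fuzzy set:

```python
def fuzzy_set_to_mass_assignment(fuzzy_set):
    """Transform a (normal) fuzzy set into its mass assignment: sort the
    elements by descending membership, equate membership with possibility,
    then take m(A_i) = pi_i - pi_{i+1} over the nested sets A_i of the
    i highest-membership elements."""
    items = sorted(fuzzy_set.items(), key=lambda kv: -kv[1])
    ma = {}
    for i, (_, pi_i) in enumerate(items):
        pi_next = items[i + 1][1] if i + 1 < len(items) else 0.0
        if pi_i - pi_next > 0:
            A = frozenset(x for x, _ in items[: i + 1])
            ma[A] = pi_i - pi_next
    return ma

f = {"a": 1.0, "b": 0.5, "c": 0.5, "d": 0.2}
for A, m in fuzzy_set_to_mass_assignment(f).items():
    print(sorted(A), round(m, 3))
# {a}: 0.5, {a, b, c}: 0.3, {a, b, c, d}: 0.2 -- matching MA_f in the text
```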
When f is not a normal fuzzy set (i.e. its maximum membership, π_1, is less than 1), the corresponding mass assignment takes the form

    MA_f = <{x_1, ..., x_i}: π_i - π_{i+1}, ∅: 1 - π_1>    with π_{n+1} = 0 and ∀i ∈ {1, ..., n}

such that a non-zero mass is assigned to the null set ∅; in this case, the mass assignment is said to be incomplete. To transform this mass assignment into a probability distribution, the mass associated with the null set ∅ needs to be redistributed amongst the other focal elements. Section 8.2.2 discusses a couple of distribution policies and the effect these distributions have on the resulting probability distributions.
Step 3: This fuzzy set f induces a possibility distribution π_f, which in turn induces a mass assignment of the form:

    MA_f = <{x_1, ..., x_i}: π_f(x_i) - π_f(x_{i+1})>

Step 4: Letting A_i = {x_1, ..., x_i} ∀i ∈ {1, ..., n}, and since MA_f(A_i) = π_f(x_i) - π_f(x_{i+1}) (according to Equation 5-19), the equation

    Pr'(x_i) = Σ_{A ∈ P(X), x_i ∈ A} MA_f(A) · Pr_A(x_i)

can be simplified, for a uniform prior, to

    Pr'(x_i) = Σ_{k=i}^{n} (π_f(x_k) - π_f(x_{k+1})) / k

such that π_f({x_n}) can be solved for directly. The remaining values of π_f({x_i}) (i.e. i ∈ {1, ..., n-1}) can be solved for by direct substitution of π_f({x_{i+1}}). This leads to the following general equation for obtaining a possibility π_f({x_i}) corresponding to the probability Pr({x_i}):

    π_f(x_i) = Σ_{k=i}^{n} k · (Pr'(x_k) - Pr'(x_{k+1}))    with Pr'(x_{n+1}) = 0
This probability distribution corresponds to the fuzzy set in the sense that a prior probability distribution was conditioned on the fuzzy set, that is Pr(X | f), resulting in the above probability distribution. Assuming the prior distribution was uniform leads to the following fuzzy set using Equation 5-25, where μ_f(w) is calculated through its associated possibility (in the following calculations "·" denotes product):
Table 5-1 presents the voting pattern of a population of ten people when asked to vote on the appropriateness of these words for the die value of 5. Similar voting patterns are generated for the other die values. All voters accept the word Large as an appropriate description for the die value of 5, while 7 people (70%) accept Medium as an appropriate description and 1 person accepts the word Small. These proportions correspond to membership values. For example, the word Large will have a membership value of 1 in the fuzzy set linguistic summary of the die value 5. In short, the voting pattern presented in Table 5-1 corresponds to a linguistic description of the die value 5 described in terms of the following fuzzy set: {Large/1 + Medium/0.7 + Small/0.1}. Reinterpreting the voting patterns in another way, 10% of the voters voted yes for the words in {Small, Medium, Large}, while 60% voted for the words in {Medium, Large} and 30% voted exclusively for Large. This interpretation corresponds to a probability distribution on the power set of the words used to describe die values. This probability distribution corresponds to the following mass assignment:

    MA = <{Small, Medium, Large}: 0.1, {Medium, Large}: 0.6, {Large}: 0.3>
To get a probability distribution associated with this voting pattern, the voters could be asked to restrict their descriptions of values to one word, i.e. each voter is asked to vote yes for one word only when describing a value. However, in the case where the voters are not available to make such a decision, it is possible to uniformly distribute probabilities amongst the words a voter chose to label a value. This results in the following probability distribution (which is equivalent to assuming a uniform prior):

    Pr(Small) = 0.1/3 ≈ 0.033
    Pr(Medium) = 0.1/3 + 0.6/2 ≈ 0.333
    Pr(Large) = 0.1/3 + 0.6/2 + 0.3 ≈ 0.633

This example of the voting model illustrates intuitively the relationship between fuzzy sets, mass assignments and probability distributions.
Table 5-1: A voting pattern for 10 people defining the linguistic description of a die having a value of 5. This corresponds to the fuzzy set {Small/0.1 + Medium/0.7 + Large/1}.

    Word\Person   1    2    3    4    5    6    7    8    9    10
    Large         Yes  Yes  Yes  Yes  Yes  Yes  Yes  Yes  Yes  Yes
    Medium        Yes  Yes  Yes  Yes  Yes  Yes  Yes  No   No   No
    Small         Yes  No   No   No   No   No   No   No   No   No
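The following Python sketch derives the fuzzy set, mass assignment and point probability distribution from the votes in Table 5-1:

```python
# Votes from Table 5-1: the word set each of the ten voters accepted
# for the die value 5.
votes = [{"Large", "Medium", "Small"}] + [{"Large", "Medium"}] * 6 + [{"Large"}] * 3
n = len(votes)

# Fuzzy set: membership = proportion of voters accepting each word
memberships = {w: sum(w in v for v in votes) / n for w in ("Small", "Medium", "Large")}
print(memberships)  # {'Small': 0.1, 'Medium': 0.7, 'Large': 1.0}

# Mass assignment: proportion of voters choosing each exact word set
ma = {}
for v in votes:
    ma[frozenset(v)] = ma.get(frozenset(v), 0.0) + 1 / n
print({tuple(sorted(A)): round(m, 2) for A, m in ma.items()})
# {S,M,L}: 0.1, {M,L}: 0.6, {L}: 0.3

# Point probabilities: distribute each voter's choice uniformly over words
prob = {}
for A, m in ma.items():
    for w in A:
        prob[w] = prob.get(w, 0.0) + m / len(A)
print({w: round(p, 3) for w, p in prob.items()})
# Small: 0.033, Medium: 0.333, Large: 0.633
```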
The voting model thus establishes a relationship between membership values and point probabilities. This relationship permits the calculation of the probability of a fuzzy event and of the conditional probability of fuzzy events and thus, probabilistic reasoning (see Chapter 6 for more details). Zadeh [Zadeh 1968] proposed alternative definitions that allow the calculation of the probability of fuzzy events directly from the underlying fuzzy sets. He defined the probability of a fuzzy event as follows. Suppose f is a fuzzy set defined on the discrete universe Ω_X and Pr is a probability distribution defined on Ω_X; then the probability of f is defined as follows:

    Pr(f) = Σ_{x ∈ Ω_X} μ_f(x) · Pr(x)

The conditional probability of fuzzy events follows as:

    Pr(f | g) = Pr(f ∩ g) / Pr(g)

where the fuzzy set intersection operator ∩ denotes multiplication. This definition plays a similar role to semantic unification but leads to different and more limited results (i.e. point values only).
CHAPTER 6: FRIL - A SUPPORT LOGIC PROGRAMMING ENVIRONMENT
This chapter begins by describing how to represent domain specific knowledge in terms
of Fril propositions. Subsequently, the general purpose inference and decision making
strategies in Fril are described, that is, the support logic calculus. This presentation
focuses on the reasoning aspects of Fril that are used by Cartesian granule features
models, which are subsequently proposed in Part IV of this book.
Fril provides a very rich and expressive set of propositional forms that facilitate the
modelling of systems in a linguistic and natural way. Currently, propositions of the
following types are accommodated:
The next section presents conjunctive rules, evidential logic rules, and causal relational rules in more detail. For a full description of Fril rules see [Baldwin, Martin and
Pilsworth 1988; Baldwin, Martin and Pilsworth 1995]. In this book, the hypothesis
language is currently limited to two of these rule structures: the conjunctive; and
evidential logic rule structures. Classification and prediction problems can be modelled
generically by viewing classification problems as crisp instances of prediction
problems. In other words, prediction is the continuous version of classification. This
view arises from the fact that in classification problems the values of the output
variables are discrete or crisp values. Conversely, the values of output variables in
prediction problems are continuous values that are reinterpreted linguistically (thereby,
giving them a discrete nature). Therefore, the values of output variables in the
prediction case reduce to linguistic values characterised by the fuzzy sets which
discretise the output variable's universe. Consequently, the values of output variables in classification problems can be viewed as crisp sets consisting of single elements, whereas the corresponding values in prediction problems are linguistic labels that denote fuzzy subsets of the output variable's universe.
This rule states that there is a high probability (i.e. between 0.9 and 1.0) that Object (corresponding to a region in an image) can be labelled Summer_sky if the position of Object is near the top of the image and if the colour is sky-blue. In this case, both Near_top and Summer_sky are fuzzy sets defined elsewhere in the knowledge base.
Note that for evidential logic rules, each body term is associated with a weight w_i and that each rule is associated with a filter term. The weight term w_i indicates the relative importance of feature F_i for the rule's conclusion. The filter is seen as a function that linguistically quantifies the number of features that need to be satisfied in order to draw a reasonable conclusion. Evlog is a built-in predicate (BIP) that takes care of inference in evidential reasoning. A more detailed presentation of the rule filters and weights is given in Section 9.4, where they are learned from data. The semantics of this BIP are presented below in Section 6.2. Consider the following concrete example of an evidential logic rule:
This rule states that there is a high probability (i.e. between 0.9 and 1.0) that Object
(corresponding to a region in an image) can be labelled Summer_sky if most of the
weighted features in the body of the rule are satisfied. The term most is a fuzzy set that
can model the expression of optimism or pessimism. Evidential logic rules have the
added value that not all evidence is needed in order to reason. This can prove vital in
some problem domains where, for example, a sensor or remote resource is unavailable
but regardless, partial reasoning is possible due to the weighted sum nature of the
evidential logic rule.
The conjunctions B_i in the body of an extended rule are assumed to be mutually exclusive and exhaustive, i.e.

    (i) Σ_{i=1}^{n} Pr(B_i) = 1

Consider the following extended rule example:
((Object is Summer_sky)
((Position of Object is Top) (Colour of Object is Blue))
((Position of Object is Middle) (Colour of Object is Blue))
((Position of Object is Bottom) (Colour of Object is Blue))
((Position of Object is Top) (Colour of Object is Not_blue))
((Position of Object is Middle) (Colour of Object is Not_blue))
((Position of Object is Bottom) (Colour of Object is Not_blue))
) : (0.9 1) (0.5 1) (0 0.1) (0 0) (0 0) (0 0)
where Position and Colour are linguistic variables (i.e. their values are characterised by fuzzy sets) with the possible fuzzy set values {Top, Middle, Bottom} and {Blue, Not_blue} respectively (resulting in six conjunctions B_i). The conjunctive rule can be viewed as a special case of the causal rule, i.e. it corresponds to an extended rule representing two conditionals, Pr(Head | Body) and Pr(Head | ¬Body). This rule structure is a very high level means of representing conditional probabilities (that are normally represented in tables or lists). The main difference here is that the propositions are imprecise (specified in terms of fuzzy sets) and not crisp.
[Figure omitted: the general rule schema, showing the head, the body/antecedents B_1, ..., B_m with associated weights, and the rule supports ((u_1 v_1) ... (u_m v_m)).]
6.2 INFERENCE
Inference in Fril occurs at three different levels: at the body proposition level; at the body level; and at the rule level. At all three levels, inference is based upon conditionalisation (except in the case of the body level of the evidential logic rule). Since the conjunctive rule is a simplified version of the extended rule (and only different to the evidential logic rule in terms of inference at the body level), the inference process is presented from the conjunctive rule perspective. In Fril, it is possible to perform inference in point-valued or interval-valued modes; however, the new approaches to knowledge discovery introduced in this book are currently limited to point-valued inference. Future work could harness the more expressive interval-valued representation and inference. Consequently, the presentation is limited for the most part to point-valued inference.
Though semantic unification comes in two flavours - interval and point-valued - the work presented in this book has been limited to point-valued semantic unification. Point semantic unification can be quite efficiently thought of as corresponding to the expected value of the membership of a fuzzy set f given the least prejudiced distribution (LPD) of fuzzy set g [Baldwin, Lawry and Martin 1996]. This is expressed more succinctly as follows for the discrete case:

    Pr(f | g) = Σ_{i=1}^{n} μ_f(x_i) · LPD_g(x_i)    (6-1)

where both fuzzy sets f and g are defined over the discrete universe Ω_X = {x_1, x_2, ..., x_n}. The continuous case is as follows:

    Pr(f | g) = ∫_{x ∈ Ω_X} μ_f(x) · lpd_g(x) dx
In the case of conjunctive rules, the body support Body is calculated as the product of the per-proposition supports:

    Body = ∏_{i=1}^{m} Pr(FS_i^CLASS | Data_i)

On the other hand, in the case of evidential logic rules the body support Body is calculated in two steps as follows:

    Body' = Σ_{i=1}^{m} Pr(FS_i^CLASS | Data_i) · w_i

where w_i is the weight of importance associated with feature i. The second step involves taking the intermediate value Body' and passing it through the filter function, which yields the body support Body as follows:

    Body = filter(Body')

The filter step can be bypassed by setting it to the identity function, i.e. filter(x) = x.
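A minimal Python sketch of the two body-support calculations; the support values and weights are illustrative assumptions:

```python
def conjunctive_body_support(term_supports):
    """Conjunctive rule: body support is the product of the per-term
    supports Pr(FS_i | Data_i)."""
    body = 1.0
    for p in term_supports:
        body *= p
    return body

def evidential_body_support(term_supports, weights, filter_fn=lambda x: x):
    """Evidential logic rule: weighted sum of per-term supports, passed
    through the (optional) filter function; the default filter is the
    identity, which bypasses the filter step."""
    body_prime = sum(w * p for p, w in zip(term_supports, weights))
    return filter_fn(body_prime)

supports = [0.9, 0.6, 0.8]   # illustrative Pr(FS_i | Data_i) values
weights = [0.5, 0.3, 0.2]    # importance weights summing to 1
print(conjunctive_body_support(supports))          # 0.432
print(evidential_body_support(supports, weights))  # 0.79
```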
A rule (where, for simplicity, the associated intervals are reduced to points) can be viewed, from a probabilistic perspective, as denoting conditional probabilities Pr(Head | Body_i). The probabilities inferred for a specific data instance obj are specific to the knowledge relating to obj and are not necessarily related to the prior probabilities Pr(Body_i). Given this new information about obj, how can the probability of the Head proposition, denoted by Prob'(Head), be updated? Jeffrey's rule facilitates the update of the probability of a proposition using the theorem of total probabilities when new information becomes available about a specific instance. This is formally accomplished as follows:

    Prob'(Head) = Σ_{i=1}^{n} Pr(Head | Body_i) · Pr'(Body_i)
In terms of inference for conjunctive rules, the support for the Head proposition is
calculated as follows:
Querying this knowledge base with the query qs((Classification of region1 is WHAT)) results in the following inference steps:

The result of the inference is that region1 is a Summer_sky with a probability of 0.66.
In the knowledge discovery approaches introduced in this book, the representation rules are restricted to equivalence rules, i.e. the support pairs associated with each rule are of the form (1 1) (0 0). Consequently, the calculation of the support for the Head proposition of a rule is simply the support for the body, i.e. Pr(Head) = Pr(Body).
The previous section has described how general inference in Fril is performed in three stages. The decision making processes used within the Fril framework of knowledge representation are described presently. From a knowledge discovery perspective, decision making in classification problems reduces to selecting the class whose rule has the highest support.
On the other hand, in the case of prediction problems, the prediction of the value of the output variable associated with the input data vector is achieved using a process known as defuzzification. Here a similar strategy as used in fuzzy logic could be applied, i.e. any of the standard defuzzification procedures such as centre of area (COA) or centre of gravity (COG), replacing the fuzzy rule activation with the rule support (see Section 4.3 for details). However, here, a procedure that incorporates the spirit of mass assignment theory is chosen. The result of inference is a collection of rule hypotheses of the form ((Classification of Object is CLASS_i)) : (α_i) which have non-zero supports α_i. In this case CLASS_i is a fuzzy set defined over the universe of the output variable. The defuzzification procedure selected here involves firstly calculating the expected value v_i of the least prejudiced distribution associated with each fuzzy set CLASS_i via the mass assignment associated with CLASS_i. This yields a collection of values v_i and supports from their respective head clauses as follows: {(v_i) : (α_i)}. Then taking the expected value of these values yields a point value, the result of reasoning. In other words, the inferred point value is calculated as follows:

    v = Σ_{i=1}^{c} v_i · α_i

where v_i is the expected value of the least prejudiced distribution associated with the fuzzy set CLASS_i and c denotes the number of class rules in the rule base. The value v corresponds to the predicted output value for the system.
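A minimal Python sketch of this defuzzification procedure, assuming a uniform prior for the least prejudiced distributions; the output fuzzy sets and supports are illustrative:

```python
def lpd_expected_value(fuzzy_set):
    """Expected value of the least prejudiced distribution of a fuzzy set
    (uniform prior): convert the fuzzy set to a mass assignment, share each
    focal element's mass equally among its elements, then take the mean."""
    items = sorted(fuzzy_set.items(), key=lambda kv: -kv[1])
    lpd = {}
    for i, (_, pi_i) in enumerate(items):
        pi_next = items[i + 1][1] if i + 1 < len(items) else 0.0
        mass = pi_i - pi_next
        if mass > 0:
            elems = [x for x, _ in items[: i + 1]]
            for x in elems:
                lpd[x] = lpd.get(x, 0.0) + mass / len(elems)
    return sum(x * p for x, p in lpd.items())

def defuzzify(class_fuzzy_sets, supports):
    """Point prediction: v = sum_i alpha_i * v_i, where v_i is the expected
    value of the LPD of CLASS_i and alpha_i its inferred rule support."""
    return sum(a * lpd_expected_value(fs) for fs, a in zip(class_fuzzy_sets, supports))

# Two illustrative output fuzzy sets on a numeric universe, with supports
small = {1: 1.0, 2: 0.5}
large = {3: 0.5, 4: 1.0}
print(defuzzify([small, large], [0.7, 0.3]))  # 0.7*1.25 + 0.3*3.75 = 2.0
```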
6.4 SUMMARY
Mass assignment theory and related theories of uncertainty and imprecision form the
basis of knowledge representation and reasoning ,for the Fril support logic programming
SOFT COMPUTING FOR KNOWLEDGE DISCOVERY: INTRODUCING CARlliSIAN GRANULE FEATURES 139
6.5 BIBLIOGRAPHY
Baldwin, J. F., Lawry, J., and Martin, T. P. (1996). "Efficient Algorithms for Semantic Unification." In the proceedings of IPMU, Granada, Spain, 527-532.
Baldwin, J. F., Martin, T. P., and Pilsworth, B. W. (1988). FRIL Manual. FRIL Systems Ltd, Bristol, BS8 1QX, UK.
Baldwin, J. F., Martin, T. P., and Pilsworth, B. W. (1995). FRIL - Fuzzy and Evidential Reasoning in A.I. Research Studies Press (Wiley Inc.), ISBN 0863801595.
Jeffrey, R. C. (1983). The Logic of Decision. University of Chicago Press, Chicago and London.
Lawry, J. (1996). "Knowledge representation course notes", Course Notes, Department of Engineering Maths, University of Bristol, UK.
Lindley, D. V. (1985). Making Decisions. John Wiley, Chichester.
PART III
MACHINE LEARNING
The discussion so far in this book has focused on knowledge representation and
different soft computing realisations. It has partly assumed that a programmer has built
in all the intelligence in a system. In general, and certainly in the case of complex
systems, solving problems through programming computers using these forms of
knowledge representation is a mammoth task, often beyond human specification; in
short manual prcgramming is not necessarily the best approach for the program or the
programmer. Whenever a software program (model) has incomplete knowledge of the
problem domain in which it operates, learning is often the only way a program can
acquire what it needs to know. Learning thus provides autonomy but more importantly
it provides a powerful way to tackle problems, which previously were considered
beyond the scope of human programming. Examples of these problems include the
recognition of human motion, or the classification of protein types based on the DNA
sequence from which they were generated. This part of the book consists of one chapter
that covers the field of machine learning - a subfield of AI concerned with programs
that learn from experience. It introduces the basic architecture and components of
learning systems. In addition, it provides an overview of the three broad categories of
machine learners, namely, supervised learners, reinforcement learners and
unsupervised learners. This chapter focuses in particular on supervised learning, as one
of the main goals of this book is to introduce new supervised learning algorithms for
Cartesian granule feature models (see Part IV). Popular induction algorithms including
the C4.5 decision tree induction algorithm, the naive Bayes classifier induction
algorithm and the fuzzy data browser are also described.
CHAPTER 7
MACHINE LEARNING
The ability to learn is considered the conditio sine qua non of intelligence, which makes
it an important concern for both cognitive psychology and artificial intelligence. The
field of machine learning (ML), which crosses these disciplines, studies the
computational processes that underlie learning in both humans and machines. The
field's main objects of study are the artefacts [Langley 1996], specifically algorithms
that improve their performance at some task with experience. The goal of this chapter is
to introduce techniques designed to acquire knowledge in this manner and to provide a
framework for understanding the relationships among such methods, and in particular
the machine learning approaches proposed and presented later in this book.
The chapter begins with a brief overview of the somewhat roller-coaster history of
machine learning. Various strategies for learning (from a human perspective) are
subsequently introduced, which leads to the most prevalent form of computational
learning; inductive learning - constructing a description of a function from a set of
input/output examples. Formal definitions of machine learning are subsequently
provided before the three main categories of machine learning, namely, supervised
learning, reinforcement learning and unsupervised learning, are described, focusing in particular on supervised learning, as one of the main goals of this book is to introduce new supervised learning algorithms for Cartesian granule feature models (see Part IV). As part of this focus, popular induction algorithms including the C4.5 decision tree induction algorithm, the naive Bayes classifier induction algorithm and the fuzzy data browser are described and illustrated using the car parking problem from Chapter 1.
The presentation of each category is supplemented with a taxonomy of associated
learning algorithms. Inductive learning is then described in detail, viewing induction as
a search process in the space of possible hypotheses (induced computational models) in
which factors such as generalisation, model performance measures, inductive bias and
knowledge representation play important roles. The chapter finishes by looking at some
of the goals, accomplishments and open issues in machine learning.
Many of the central concepts in machine learning have a long history. For example,
Hume [Hume 1748] describes induction, a fundamental notion in "generalisation
learning", but it was not until the 1950s that an interest in computational approaches to
learning really developed with the birth of artificial intelligence (AI) and cognitive
science. From the outset both areas addressed a varied and ambitious agenda, with
topics including game playing, letter recognition, abstract concepts and verbal memory.
Learning was viewed as a central feature of intelligent systems and work on both
learning and performance was concerned with developing general methods for
cognition, perception and action. Since the 1950s, computer scientists have tried, with
varying degrees of success, to give computers the ability to learn. This period can be
divided conveniently into three periods of activity [Shavlik and Dietterich 1990a]. Among the landmark developments of this era, decision tree induction was introduced, with its effectiveness illustrated on learning chess end-game rules [Quinlan 1983].
1983]. Even though Bryson and Ho introduced the back propagation algorithm for
training neural networks in 1969 [Bryson and Ho 1969], it was largely ignored until the
resurgence of interest in neural networks in the mid-eighties.
Notwithstanding their success, expert systems turned out to be brittle and to have difficulty handling inputs that are novel or noisy. This, coupled with the introduction of many practical learning algorithms, and convincing demonstrations on real world problems, helped shift the attention from the static question of how to represent knowledge to the dynamic question of how to acquire it. As a result, in the late 1970s, a new interest in ML
emerged within the AI community that grew rapidly over the course of a few years.
This interest was further motivated by the frustration with the encyclopaedic flavour
and domain-specific emphasis of expert systems, and the opportunity of returning to
general principles afforded by machine learning.
By the early 1980s, machine learning was recognised as a distinct scientific discipline, branching out from the traditional areas of concept induction and language acquisition
to areas of machine discovery and problem solving. The past two decades have seen an
explosion of research directions in theory, algorithms and applications within the field
of machine learning. Many new methods have been proposed, older techniques
revisited, such as neural networks, along with the development of a host of inter-
disciplinary approaches such as soft computing based learning techniques. Whereas
traditional AI researchers focused on abstract toy-world problems (e.g. blocks world),
ML researchers, especially in the recent past, have become more serious about the real
world potential of learning algorithms and this has led to the development of new fields
such as knowledge discovery in databases (KDD), and text mining. This phenomenon
is also depicted in this book, where the proposed knowledge discovery process (centred
on the constructive induction of Cartesian granule feature models) is applied to a
variety of real-world problems with very encouraging results (see Chapters 10 and 11).
Human learning, according to [Agency 1995; Honey and Mumford 1992], is the
acquisition over time of a variety of skills, knowledge, experience or attitudes by the
individual. Learning can be seen as a "change in human disposition or capability,
which can be retained, and which is not simply ascribable to the process of growth".
Learning can be measured and observed via these changes in behaviour. In every
learning situation the learner transforms information provided by a teacher (or
environment) into some new form in which it is stored for future use. The nature of the
transformation determines the type of learning strategy used. Several basic categories
exist, such as rote learning, learning from instructions, deductive learning, learning by
analogy and inductive learning [Honey and Mumford 1992]. This list is ordered from
the strategies that require the least inference on the part of the learner to those that
require the most.
If learning were restricted to the approaches presented so far, then people would be
hopelessly restricted in the conclusions they could draw. Often there is a need to go
beyond the information given, i.e. to generalise to unseen scenarios. This leads to
inductive learning, where the transformation process involves generalisation of the
input information and selection of the most desirable result, that is, generalised
knowledge is inferred from particular examples. The process that derives new
generalised knowledge from particular examples is known as inductive inference. The
price one pays for this ability is the loss of the guarantee (introduction of uncertainty)
that the conclusions follow from the information given. Finally, learning by analogy is
a mixture of both inductive and deductive reasoning. The following are examples of
inductive learning taken from [Holland et al. 1986], covering most inferential processes
that expand knowledge in the face of uncertainty:
Machine learning, like most cognitive phenomena, is a very ambiguous term with
definitions abounding in the literature. Some of the more succinct and less ambiguous
definitions include Simon's useful characterisation [Simon 1983]:
"Learning denotes changes in the system that are adaptive in the sense that
they enable the system to do the same task or tasks from the same population
more effectively the next time. "
"a computer program that improves its performance at some task through
experience"
The three main categories of machine learning algorithms are:
• supervised learning;
• reinforcement learning;
• and unsupervised learning.
In general, a learning system can be characterised in terms of the following components:
• a set of examples (for example, the car parking success/failure data from
Section 1.2.1) described in a certain instance/observation language;
• background knowledge (and feature construction operators);
• a hypothesis language to represent the learnt computer models;
• a search mechanism;
• general purpose inference and decision making procedures (which could
possibly be learned also);
• and a performance evaluation function.
Inductive learning lies at the core of most machine learning approaches. Inductive
learning takes specific examples and exploits background knowledge in performing a
search in the model or hypotheses space (sometimes in terms of operations such as
generalisation and specialisation) to form general-purpose hypotheses or models that
cover (represent/summarise/explain) the examples in the training set and other cases
beyond. This inductive learning process is crudely depicted in Figure 7-2 as a search
through the hypotheses (model) space that is guided by background knowledge and a
performance evaluator. The main components of a general inductive learning process
are presented in detail later in Section 7.8, but the above definition is sufficient for now,
in order to provide an overview of the main categories of machine learning algorithms
in the next three sections.
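To make this search view of induction concrete, the following minimal Python sketch
(illustrative only; the function names and the toy threshold-rule hypothesis space are
assumptions introduced here, not constructs defined in this book) frames a learner as a
search through an enumerable model space guided by a performance evaluator:

def induce(hypothesis_space, training_data, performance, budget=1000):
    """Search a (possibly generated) hypothesis space for the model that
    scores best on the training data, within a fixed search budget."""
    best, best_score = None, float("-inf")
    for i, h in enumerate(hypothesis_space):
        if i >= budget:          # plain enumeration stands in for any search
            break                # strategy (hill climbing, genetic search, ...)
        score = performance(h, training_data)
        if score > best_score:
            best, best_score = h, score
    return best

# Toy usage: hypotheses are threshold rules on NumberOfFreeSpaces.
data = [({"NumberOfFreeSpaces": 60}, "successful"),
        ({"NumberOfFreeSpaces": 10}, "unsuccessful")]

def make_rule(t):
    return lambda x: "successful" if x["NumberOfFreeSpaces"] > t else "unsuccessful"

def accuracy(model, examples):
    return sum(model(x) == y for x, y in examples) / len(examples)

best = induce((make_rule(t) for t in range(0, 100, 5)), data, accuracy)
print(accuracy(best, data))   # 1.0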
Figure 7-2: A simplifying view of machine learning in terms of search through the
space of possible computer models (programs). This search is guided by a performance
component (fitness or cost function) and background knowledge.
This section introduces supervised learning, to date by far the most widely applied and
researched category of machine learning. It begins by describing the general
characteristics of supervised learning, which are subsequently illustrated on a
handwritten character classification problem. Popular learning algorithms, namely the
C4.5 decision tree induction algorithm, the naïve Bayes classifier induction algorithm
and the fuzzy data browser, are then described and illustrated on the car parking
problem from Section 1.2.1. Finally, a taxonomy of supervised learning algorithms is
presented.
The supervised learning task can be more formally defined in terms of input variables,
X1, ..., Xn, that describe the situation or event, and an output variable, Y, that describes
the outcome. The task of the learner is to model the dependence of the output variable
Y (discrete or continuous) on one or more input (predictor) variables, X1, ..., Xn, given
N example data {<xi, yi>} for i = 1, ..., N, and possibly background knowledge. This
results in a model function f,

Y = f(X1, ..., Xn) + e

over the domain (X1, ..., Xn) ∈ D ⊆ ℝⁿ containing the data. The single-valued
deterministic function f, of its n-dimensional argument, captures the joint predictive
relationship of Y on X1, ..., Xn. The additive component e usually reflects the
dependence of Y on quantities other than X1, ..., Xn that are neither controlled nor
observed, which can lead to models that are deficient; models may be incomplete,
imprecise, fragmentary, not fully reliable, vague, contradictory or deficient in some
other way. In general, these types of deficiencies may result in different types of model
uncertainty. As presented in Part II, some forms of knowledge representation explicitly
incorporate techniques for handling some of these types of uncertainty, thereby
providing a more realistic model of reality. However, to date no panacea approach
exists for this general problem.
Typical tasks that fall under the category of supervised learning include: diagnosis
problems such as whether a patient suffers from diabetes or not; regression problems
such as predicting foreign exchange rates for tomorrow; computer vision problems such
as handwritten character recognition, gesture recognition and automatic vehicle
navigation; natural language and speech processing; control problems such as
controlling a furnace or the docking of a spacecraft; and decision support systems such
as predicting customer activity (e.g. will a customer default on a bank loan). The task of
learning to classify handwritten characters from classified examples is chosen as an
illustrative problem for supervised learning and is subsequently described.
The next three subsections answer these questions using three popular supervised
learning algorithms: the C4.5 decision tree learning algorithm; the naïve Bayes
classifier induction algorithm; and the data browser. This is supplemented in the
following section (Section 7.5.3) with a taxonomy of supervised learning approaches.
To learn decision trees, the C4.5 algorithm is presented with a database of examples in
spreadsheet format (see Table 7-1). The database is split into two smaller databases:
one for training and one for testing. Using the training examples, the task of C4.5 is to
determine the nodes in the tree and the tests associated with the non-terminal nodes.
C4.5 searches the space of decision trees through a constructive search. It first
considers all trees consisting of only a single root node and chooses the best one. A
number of measures have been proposed for evaluating the best feature, including
entropy, which measures the information content or purity of a feature. Consider the
expanded car parking problem (Section 1.2.1.1), which consists of four input features,
TimeToDestination, NumberOfFreeSpaces, OccurrenceOfAPublicEvent,
affectedStreets, and an output feature parkingStatus. For this problem, C4.5 determines
that the NumberOfFreeSpaces feature provides the most information and is thus
selected as the root node (as depicted in Figure 7-4). A condition is then associated with
the root node. Once again, various approaches have been proposed in the literature for
generating this condition, including entropy-based ones; that is, a condition point in the
feature's universe is chosen so as to minimise the entropy. The resulting condition
generates a binary partition of the data (only binary trees are considered here)
corresponding to two subsets of the training data.
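The entropy-based choice of a condition point sketched above can be made concrete.
The following Python fragment is a minimal sketch with illustrative data, not the full
C4.5 algorithm; it selects the binary split point on a numeric feature that minimises the
weighted entropy of the resulting partition:

import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a collection of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def best_split_point(values, labels):
    """Condition point on a numeric feature minimising the weighted
    entropy of the induced binary partition of the data."""
    best_t, best_e = None, float("inf")
    for t in sorted(set(values))[:-1]:
        left = [l for v, l in zip(values, labels) if v <= t]
        right = [l for v, l in zip(values, labels) if v > t]
        e = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
        if e < best_e:
            best_t, best_e = t, e
    return best_t

# e.g. NumberOfFreeSpaces values with parking outcomes (p/n):
print(best_split_point([10, 20, 55, 70], ["n", "n", "p", "p"]))   # -> 20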
[Figure 7-4: a decision tree for the expanded car parking problem; node tests include
NumberOfFreeSpaces > 50, a TimeToDestination condition, and NumberOfFreeSpaces < 30.]
The learning algorithm estimates the class conditional probabilities and the class
probabilities from a training dataset, where the class conditionals correspond to
Pr(Xi | Y = yj) and the class probabilities to Pr(Y). The class probability Pr(Y = yj) is
simply the fraction of class yj in the training dataset.
Each class conditional Pr(Xi | Y = yj) can be estimated for discrete universes
(originally discrete or discretised continuous universes) using the m-estimate as follows
[Mitchell 1997]:
Pr(Xi = xk | Y = yj) = (nxk + m·p) / (nc + m)     (6-1)
where nc is the number of training examples (sample size) whose target value is yj, nxk
is the number of examples whose target value is yj and whose Xi value is xk, p is the
prior estimate of the probability being determined here, and m denotes a constant
called the equivalent sample size, which determines how to weight p relative to the
observed data. Note that if m is zero, the m-estimate is equivalent to the fraction
nxk/nc. If both nc and m are nonzero, then the observed fraction nxk/nc and the prior p
will be combined according to the weight m. m is called the equivalent sample size as it
can be interpreted as augmenting the nc actual observations by an additional m virtual
samples distributed according to the prior. A typical way of choosing p in the absence
of other information is to assume a uniform prior; that is, if an attribute Xi has w
possible values then p = 1/w. One interesting difference between naïve Bayes and other
induction algorithms, such as C4.5, is that there is no explicit search through the space
of possible models. Instead, the model is formed using all available features.
Performance of the induced model is evaluated based on the classification accuracy of
the model on the test dataset.
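A direct transcription of Equation 6-1 into Python may help fix the roles of the
quantities involved (the concrete numbers below are illustrative only, not values taken
from the parking dataset):

def m_estimate(n_xk, n_c, p, m):
    """m-estimate of Pr(Xi = xk | Y = yj): n_xk examples of class yj with
    Xi = xk, out of n_c examples of class yj, smoothed towards the prior
    estimate p with equivalent sample size m (Equation 6-1)."""
    return (n_xk + m * p) / (n_c + m)

# With m = 0 this reduces to the raw fraction n_xk / n_c; with a uniform
# prior over w possible attribute values, p = 1 / w.
print(m_estimate(n_xk=3, n_c=10, p=1 / 5, m=5))   # (3 + 1) / 15 ~= 0.267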
Consider the original car parking problem (Section 1.2.1), which consists of two input
features, TimeToDestination and NumberOfFreeSpaces, and an output feature
parkingStatus. The class conditionals for this problem could be calculated using
Equation 6-1, after both the ΩTimeToDestination and ΩNumberOfFreeSpaces universes
were discretised (for example, into uniform intervals or bins of size 5). The value of m,
the equivalent sample size, could be set to the total number of training examples. The
resulting (hypothetical) class probability densities (after interpolating the midpoints of
the bins) are presented in Figure 7-5.
The induced model takes an input vector <x1, ..., xn> and predicts the target value y by
performing approximate reasoning as described in Section 6.2, while Section 6.3
describes approximate reasoning when Y is continuous. The data browser estimates
univariate class conditional fuzzy sets from a training dataset via their corresponding
probabilistic class conditionals Pr(Xi | Y = yj), ∀ i ∈ {1, ..., n}.
Figure 7-5: This figure shows the resulting probability density functions using a naïve
Bayes approach for the parking problem (class conditionals for unsuccessfulParking
and successfulParking).
As is the case for the induction of naïve Bayes classifiers, there is no explicit search
through the space of possible conjunctive models. Instead, the model is formed using
all available features. For evidential models, feature selection is performed by
eliminating features associated with low weights, which are calculated via semantic
discrimination analysis (see Section 9.4). Performance of the induced model is
evaluated by the classification accuracy of the model on the test dataset. The class
conditional fuzzy sets fXi|yj for the parking problem are presented in Figure 7-6.
Various extensions to the data browser have been proposed including the extraction of
knowledge in terms of decision trees [Baldwin, Lawry and Martin 1997] and the
extraction of knowledge over multidimensional linguistic variables known as Cartesian
granule features [Baldwin, Martin and Shanahan 1996; Baldwin, Martin and Shanahan
1997; Shanahan 1998], which forms the basis for the learning algorithms described in
Part IV of this book. In addition, alternative feature selection algorithms have been
proposed based upon genetic programming [Baldwin, Martin and Shanahan 1998] (see
Chapter 9).
Figure 7-6: This figure shows the resulting class conditional fuzzy sets using the data
browser approach for the parking problem.
Supervised learning algorithms can be grouped into the following broad categories:
• symbolic learning;
• evolutionary computing;
• connectionist learning;
• probabilistic learning;
• fuzzy-based learning;
• and case-based learning.
In the following subsections, for completeness, each category is briefly described and
references to the literature are provided. This section can be skipped on a first read
without loss of continuity (i.e. resume reading at Section 7.6).
or similar logical knowledge structures. Some of the more popular approaches here
include decision tree algorithms such as ID3 (more recently C4.5) [Quinlan 1983;
Quinlan 1993], and CART [Breiman et al. 1984]. The C4.5 decision tree learning
algorithm is described in Section 7.5.2.1. Decision trees have a long history within
machine learning, having their roots in EPAM [Feighenbaum 1961], a cognitive
simulation of human concept learning. CLS [Hunt, Marin and Stone 1966] used a
heuristic lookahead method to construct decision trees, while ID3 added the crucial idea
of using information content as a means of specialising hypotheses. Other symbolic
approaches include rule induction techniques such as AQ [Michalski and Chilausky
1980], and predicate logic approaches such as FOIL [Quinlan 1990] and CIGOL
[Muggleton and Buntine 1988]. Though most symbolic induction algorithms use a hill-
climbing search strategy to search the possible model space, recently evolutionary
search techniques have been illustrated as a successful alternative [Banzhaf et al. 1999;
Wong and Leung 1995], avoiding problems such as local optima that can occur using
hill-climbing strategies. Performance is generally measured in terms of the model
accuracy on a test dataset.
In G_DACG, genetic programming has been employed more in a search role, in
contrast to an induction role.
with immediate feedback stating that the model has made the correct classification or
not. This decision has no effect on subsequent decisions that the model may take.
Conversely, in reinforcement learning, each decision that the model takes affects
subsequent decisions. For example, consider an autonomous robot attempting to
navigate a maze from a starting point to an end point (goal). At each point in time, the
robot must decide whether to move forward, left, right, or backward. Each decision
changes the location of the robot, so the decision will depend on previous decisions.
After each decision, the supervisor provides feedback to the robot in terms of a reward
that reflects the long-term potential of taking that move. For example, if a move leads
to the robot getting to the goal, then the feedback is positive (a reward is given),
otherwise the robot is penalised (for example, when the move leads to a dead-end). The
goal of the robot is to choose sequences of actions to maximise the long-term reward.
This differs from supervised learning where each classification decision is independent
of other decisions. Credit assignment, that is, determining which decisions resulted in a
good outcome (the robot reaching the goal state), plays a key role in reinforcement
learning, where the impact of a decision cannot, in general, be measured immediately
(feedback is not direct).
The final category of learning discussed is unsupervised learning, where the learner is
given a collection of observations or events and searches, without the supervision of a
teacher, for regularities and general rules explaining all or most of the observations, e.g.
conceptual clustering. The goal of unsupervised learning is to get some understanding
of the process that generated the data. This can be achieved by cluster analysis, or by
examining the associated fuzzy sets or probability densities.
In this book supervised learning approaches are proposed for both classification and
prediction based on Cartesian granule feature models. In learning these models, it is
shown also how unsupervised learning approaches such as clustering can be used to
discover structure in the data (unsupervised discretisation of feature universes, resulting
in a fuzzy partition) that can subsequently lead to more transparent and accurate model
abstractions. A similar approach was adopted in [Ralescu and Hartani 1994; Sugeno
and Yasukawa 1993], where unsupervised approaches (fuzzy clustering) were used to
identify the class structure for supervised learning problems.
The previous sections presented a definition of machine learning and various categories
of learning algorithms. At this point, a detailed presentation of the key components that
make up inductive learning (limited to a supervised learning perspective) is provided
(see Figure 7-1):
• search algorithm;
• performance measures that guide the search;
• knowledge representation of both observations and hypotheses;
• and the inductive bias introduced by the search and knowledge
representation techniques used.
Before describing each of these components in detail, the stage is set by introducing
two important operations in learning: generalisation; and specialisation.
Figure 7-7: Examples of the generalisation and specialisation operations for the
parking problem as applied from the perspective of successful parking.
Generalisation could be enhanced here by dropping one of the conditions for successful
car parking. Here short and many denote crisp intervals defined over
ΩTimeToDestination and ΩNumberOfFreeSpaces. For example, dropping the
TimeToDestination condition results in the following more general rule:
If NumberOfFreeSpaces is many
then ParkingStatus is successful
The former could be achieved by extending the boundaries of the word short (and
correspondingly shortening the interval associated with medium), while the latter could
be achieved by the learner by adding extra values of the TimeToDestination variable
such as the word medium, such that the generalised rule would look like this:
If TimeToDestination is short or
TimeToDestination is medium and
NumberOfFreeSpaces is many
then parkingStatus is successful
For other learning algorithms, generalisation can be accommodated in many ways, for
example, in modelling with Cartesian granule features, generalisation is further
enhanced due to the multidimensional nature of the features. This generalisation can
prove very useful in certain problem domains. Section 10.4 presents the L classification
problem, a difficult problem (with no perfect solution), on which many popular
learning algorithms fail, but where Cartesian granule feature approaches succeed. This
success is due largely to the multidimensional nature of Cartesian granule features,
which provides extra generalisation power. Alternatively, in other forms of knowledge
representation, generalisation arises from the inference and decision making
mechanisms used; e.g. some case-based learning approaches, which simply store
training instances in memory. In this case, generalisation occurs at retrieval time, with
the power residing in the indexing scheme, the similarity metric used to identify
relevant cases, and the method for adapting cases to new situations.
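The condition-dropping and disjunct-adding operations used in the parking example
above are simple enough to state programmatically. In the sketch below, the encoding
of a rule as a dictionary of variable-to-word-set conditions is an assumption made
purely for illustration:

def drop_condition(rule, variable):
    """Generalise a rule by dropping one of its conditions."""
    return {v: words for v, words in rule.items() if v != variable}

def add_disjunct(rule, variable, word):
    """Generalise a rule by allowing an extra word for a variable,
    e.g. TimeToDestination is short or medium."""
    new = {v: set(words) for v, words in rule.items()}
    new.setdefault(variable, set()).add(word)
    return new

rule = {"TimeToDestination": {"short"}, "NumberOfFreeSpaces": {"many"}}
print(drop_condition(rule, "TimeToDestination"))   # {'NumberOfFreeSpaces': {'many'}}
print(add_disjunct(rule, "TimeToDestination", "medium"))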
language, there should exist a set of models that is consistent with the training data and
background knowledge (i.e. covers the training data and possibly unseen data); this is
termed the version space (the plausible model versions) by Mitchell [Mitchell 1982].
The hypothesis space can be partially ordered using the restriction operator (a form of
subset); thus, the version space can be viewed as an interval in this hypothesis space.
Consequently, learning could be viewed as finding the interval in this hypothesis space,
corresponding to the version space, a form of constraint-based programming. In
practice however, this method is awkward to implement, since the details of the partial
order depend on the particular language employed to represent the hypotheses. Also in
the worst case, the size of the version space representation can grow exponentially with
the number of observed training examples [Haussler 1989]. However, the version space
perspective has been extremely helpful in clarifying the nature of inductive learning.
A key part of any search paradigm is the cost or evaluation function, essentially, how
effective is a particular model in performing a specific task. Most induction methods
emphasise the ability to perform well on training or validation data, a behavioural-
based approach, but this can prove to be computationally expensive. In this book
however, a novel cost function is proposed based upon the semantic separation of the
concepts learned, thus avoiding expensive behavioural-based testing (on a control
dataset). This forms an integral part of the G_DACG constructive induction algorithm.
Other factors may also be taken into account to augment such decisions such as model
parsimony (simplicity). Empirical evidence (for example, see the L classification
problem in Section 10.4) tends to suggest that models that are simple, but no simpler
than necessary, tend to find the right balance between over-generalisation and
overfitting. Inductive bias, which is
subsequently presented in Section 7.8.5, also plays a key role in model discovery and is
closely intertwined with model evaluation.
Finally, in any search technique the issue of termination is very important, as the
search for a model may never truly halt. For non-incremental (one-shot) learning
approaches the simplest approach is to search until no further progress occurs or until a
prescribed level of performance is attained. For the approaches proposed in this book,
search for a model is carried out for a prescribed amount of effort commensurate with
On the other hand, for prediction problems the accuracy is calculated based on the
RMS error (root mean squared error) as follows:

RMS = ( sqrt( (1/N) Σ(i=1..N) (yi - ŷi)² ) / |ΩY| ) × 100     (6-3)

where yi and ŷi correspond to the actual output value (of the test input tuple) and the
model predicted value respectively, and |ΩY| denotes the size of the universe of the
output variable Y.
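In code, Equation 6-3 amounts to the following minimal sketch (the function name is
introduced here for illustration):

import math

def rms_error(actual, predicted, universe_size):
    """Normalised RMS error of Equation 6-3: the root mean squared
    difference between actual (yi) and predicted (y^i) outputs, expressed
    as a percentage of the size |OmegaY| of the output universe."""
    n = len(actual)
    mse = sum((y - yhat) ** 2 for y, yhat in zip(actual, predicted)) / n
    return math.sqrt(mse) / universe_size * 100

print(rms_error([1.0, 2.0, 3.0], [1.1, 1.9, 3.2], universe_size=10))   # ~= 1.4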
Alternatively, when the availability of data is severely limited (less than 1000 examples
[Michie, Spiegelhalter and Taylor 1993]), another popular method of accuracy-based
evaluation is n-fold cross validation [Stone 1974]. Here, the provided dataset is
partitioned into n approximately equal-sized subsets. The system then trains on n-1
subsets and evaluates the performance of the induced model by testing on the remaining
subset. This process is repeated for each of the n subsets that is omitted from training,
and the resulting model accuracies are averaged over all n results. Such a procedure
allows the use of a high proportion of the available data to train, while also making use
of all data points in evaluating the cross-validation error. Typical choices of n tend to
be less than 10, with the limiting case (n equal to the number of examples) known as
the leave-one-out method. The disadvantage of such an approach is that it requires the
inductive inference process (training) to be performed n times, which in some
circumstances could lead to large computational requirements.
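The procedure can be stated compactly in Python. The sketch below assumes
caller-supplied train and evaluate functions and uses a simple interleaved split (a real
implementation would typically shuffle and stratify the data first):

def n_fold_cross_validation(dataset, n, train, evaluate):
    """Partition the data into n roughly equal folds, train on n-1 of
    them, test on the held-out fold, and average the n accuracies.
    With n equal to len(dataset) this is the leave-one-out method."""
    folds = [dataset[i::n] for i in range(n)]
    scores = []
    for i in range(n):
        held_out = folds[i]
        training = [x for j, fold in enumerate(folds) if j != i for x in fold]
        model = train(training)
        scores.append(evaluate(model, held_out))
    return sum(scores) / n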
In this book, the holdout estimate of the error rate is adopted as a measure of model
generalisation, as all the problems examined benefit from sufficiently large datasets.
This measure is used only in the parameter identification phase of the induced models,
with the less-computationally intensive evaluation process of semantic separation used
in the language identification phase (see Section 9.3.2 for details).
Clearly, a system that learns concepts from examples must somehow be guided through
the space of inductive generalisations (model space) not solely by the training
instances. The machine learning literature often refers to this as inductive bias [Mitchell
1982]. Rendall [Rendall 1986] makes a further important distinction by defining both a
representational bias and a search bias. Representational bias restricts the space of
possible models by limiting the language. For example, in additive Cartesian granule
feature modelling, the allowed dimensionality of the features plays an important role in
the search for a model, rendering it computationally tractable or not.
A more flexible approach incorporates the notion of search bias, which considers all
possible concept descriptions, but examines some earlier than others in the search
process. Most learning algorithms, if the choice is afforded to them, will prefer to
search simpler hypotheses before more complex ones. For example, the ID3 learning
algorithm proceeds from very general concepts (simple) to more specific (detailed). In
the learning approaches presented in this book (the G_DACG algorithm), a genetic
search based upon the genetic programming paradigm is used in the selection of
possible models. A search bias is encoded in the fitness function, where model
parsimony (simplicity) and high model performance, which is estimated using a cheap
measure based upon the semantic separation of the concepts learned, are promoted.
The past ten years have seen applications of machine learning within new fields such
as knowledge discovery and knowledge discovery in databases (KDD). KDD is a
derivative of knowledge discovery that exploits machine learning algorithms to analyse
or discover patterns in very large databases. In these fields machine learning is viewed
as one step in the discovery process that is supplied with data by a previous step, in
contrast to being "externally supplied" (a traditional view of machine learning).
Chapter 1 presented an overview of this process, along with some of its successes. The
remainder of this book explores knowledge discovery from a Cartesian granule feature
perspective and demonstrates the process on real world problems.
7.11 SUMMARY
Induction can be seen as learning a function from input/output pairs. This function can
be represented using logical sentences, polynomials, belief networks, neural networks
and others. This chapter has provided an overview of machine learning, a field that cuts
across artificial intelligence and cognitive science. Formal definitions of machine
learning were provided, and the three main categories of machine learning (supervised
learning, reinforcement learning and unsupervised learning) were described. Inductive
learning, an integral part of most computational learning algorithms, was presented in
detail. Popular induction algorithms for decision trees, naïve Bayes classifiers and
fuzzy classifiers were described and illustrated. In subsequent chapters (in Part IV),
new approaches to machine learning are presented in the context of Cartesian granule
feature models. These approaches are subsequently (in Part V) demonstrated on both
real world and artificial problems.
7.12 BIBLIOGRAPHY
Duda, R., Gaschnig, J., and Hart, P. (1979). "Model design in the Prospector consultant
system for mineral exploration", In Expert systems in the microelectronic age,
D. Michie, ed., Edinburgh University Press, Edinburgh, 153-167.
Feighenbaum, E. A. (1961). "The simulation of verbal learning." In the proceedings of
Western joint computer conference (reprinted in Readings in Machine
Learning (1990), Eds.: Shavlik and Dietterich, Morgan Kaufmann Publishers),
Los Angeles, 121-132.
Fiesler, E., and Beale, R. (1997). Handbook of Neural Computation. Institute of Physics
Publishing Ltd. and Oxford University Press, Bristol, UK.
Fogel, L. J., Owens, A. J., and Walsh, M. J. (1966). Artificial intelligence through
simulated evolution. John Wiley, New York.
Friedberg, R. (1958). "A learning machine, part 1", IBM Journal of Research and
Development, 2:2-13.
Friedberg, R., Dunham, B., and North, T. (1959). "A learning machine, part 2", IBM
Journal of Research and Development, 3:282-287.
Goldberg, D. E., and Deb, K. (1991). "A comparative analysis of selection schemes
used in genetic algorithms", In Foundations of Genetic Algorithms, G.
Rawlins, ed., Morgan Kaufmann, San Francisco.
Grabisch, M., and Nicolas, J. (1994). "Classification by fuzzy integral: Performance
and tests", Fuzzy Sets and Systems, 65:255-271.
Harris, C. J., Wu, Z. Q., and Feng, M. (1997). "Aspects of the Theory and Application
of Intelligent Modelling, Control and Estimation." In the proceedings of 2nd
Asian Control Conference (invited lecture), Seoul, Korea, 1-10.
Haussler, D. (1989). "Learning conjunctive concepts in structural domains", Machine
Learning, 4(1):7-40.
Hebb, D. O. (1949). The organisation of behaviour. Wiley, New York.
Hertz, J., Anders, K., and Palmer, R. G. (1991). Introduction to the Theory of Neural
Computation. Addison-Wesley, New York.
Holland, J. H. (1975). Adaptation in Natural and Artificial Systems. University of
Michigan Press, Michigan.
Holland, J. H. (1986). "Escaping brittleness: the possibilities of general purpose
learning algorithms applied to parallel rule-based systems", In Machine
Learning: An Artificial Intelligence Approach (Vol. 2), R. S. Michalski, J. G.
Carbonell, and T. M. Mitchell, eds., Morgan Kaufman, San Francisco.
Holland, J. H., Holyoak, K. J., Nisbett, R. E., and Thagard, P. R. (1986). Induction:
Process of Inference, Learning, and Discovery. MIT Press, Cambridge, Mass.,
USA.
Honey, P., and Mumford, A. (1992). The Manual of Learning Styles. Peter Honey.
Hume, D. (1748). An inquiry concerning human understanding. Reprinted 1955.
Liberal Arts Press, New York.
Hunt, E. B., Marin, J., and Stone, P. J. (1966). Experiments in induction. Academic
Press, New York.
Ishibuchi, H., Nozaki, K., Yamamoto, N., and Tanaka, H. (1995). "Selecting fuzzy if-
then rules for classification problems using genetic algorithms", IEEE
Transactions on Fuzzy Systems, 3(3):260-270.
James, W. (1892). Briefer Psychology. Harvard University Press, Cambridge.
Kibler, D., and Aha, D. E. (1987). "Learning representative exemplars of concepts." In
the proceedings of International Workshop on Machine Learning, 24-30.
Kohonen, T. (1984). Self-Organisation and Associative Memory. Springer-Verlag,
Berlin.
Kononenko, I. (1993). "Inductive and Bayesian learning in medical diagnosis",
Artificial Intelligence, 7:317-337.
Koza, J. R. (1992). Genetic Programming. MIT Press, Massachusetts.
Koza, J. R. (1994). Genetic Programming II. MIT Press, Massachusetts.
Langley, P. (1996). Elements of Machine Learning. Morgan Kaufmann, San Francisco,
CA, USA.
Langley, P., Simon, H. A., and Bradshaw, G. L. (1987). "Heuristics for empirical
discovery", In Computational models of learning, L. Bolc, ed., Springer-
Verlag, Berlin.
Lenat, D. B. (1977). "The ubiquity of discovery", Artificial Intelligence, 9:257-285.
McCulloch, W. S., and Pitts, W. (1943). "A logical calculus of the ideas immanent in
neural nets", Bulletin of Mathematical Biophysics, 5:115-137.
McDermott, J. (1982). "R1: A rule-based configurer of computer systems", Artificial
Intelligence, 19(1):39-88.
Michalski, R. S., Bratko, I., and Kubat, M., eds. (1998). "Machine Learning and Data
Mining", Wiley, New York.
Michalski, R. S., and Chilausky, R. L. (1980). "Learning by being told and by
examples", International Journal of Policy Analysis and Information Systems,
4:125-160.
Michie, D., and Chambers, R. A. (1968). "BOXES: An experiment in adaptive
control", In Machine Intelligence, E. Dale and D. Michie, eds., Oliver and
Boyd, London, 125-133.
Michie, D., Spiegelhalter, D. J., and Taylor, C. C., eds. (1993). "Machine Learning,
Neural and Statistical Classification", Ellis Horwood, New York, USA.
Mill, J. S. (1843). A system of logic, ratiocinative and inductive: being a connected
view of the principles of evidence, and methods of scientific investigation. J.
W. Parker, London.
Minsky, M., and Papert, S. (1969). Perceptrons: An introduction to computational
geometry. M.I.T. Press, Cambridge, MA.
Mitchell, T. M. (1982). "Generalization as search", Artificial Intelligence, 18:202-226.
Mitchell, T. M. (1997). Machine Learning. McGraw-Hill, New York.
Moller, M. F. (1993). "A scaled conjugate gradient algorithm for fast supervised
learning", Neural Networks, 6:525-533.
Muggleton, S., and Buntine, W. (1988). "Machine invention of first order predicates by
inverting resolution." In the proceedings of Fifth International Conference on
Machine Learning, Ann Arbor, MI, USA, 339-352.
Narazaki, H., and Ralescu, A. L. (1999). "Translation and extraction problems for
neural and fuzzy systems: bridging over distributed knowledge representation
in multilayered neural networks and local knowledge representation in fuzzy
systems", In Fuzzy theory, systems, techniques, and applications (Volume 2),
C. T. Leondes, ed., Academic Press, New York, 917-935.
Pearl, J. (1988). Probabilistic reasoning in intelligent systems: networks of plausible
inference. Morgan Kaufmann, San Mateo.
Quinlan, J. R. (1983). "Learning efficient classification procedures and their application
to chess endgames", In Machine Learning: An Artificial Intelligence
Approach, R. S. Michalski, J. G. Carbonell, and T. M. Mitchell, eds., Springer-
Verlag, Berlin, 150-176.
Quinlan, J. R. (1986). "Induction of Decision Trees", Machine Learning, 1(1):86-106.
So far this book has been concerned with knowledge discovery, presenting it as a multi-
step process that discovers useful and valid knowledge from data, where knowledge
representation and machine learning play pivotal roles. Knowledge representation
influences knowledge discovery in many ways, including what can be discovered, how
it can be learned, when it can be learned, the understandability and tractability of the
discovered model and so on. Current approaches to knowledge discovery suffer from
one or more shortcomings that stem from the type of knowledge representation
employed. These include decomposition error, and performance issues such as
transparency, accuracy and efficiency.
The main focus of this part is to introduce a new form of knowledge representation and
corresponding learning algorithms, centred on Cartesian granule features. This
approach addresses some of the shortcomings of other knowledge discovery techniques
outlined above. Chapter 8 describes Cartesian granule features, and shows how fuzzy
sets and probability distributions can be defined over these features and how these can
be incorporated into both fuzzy logic and probabilistic models. Chapter 9 describes
induction algorithms for Cartesian granule feature models for both classification and
prediction problems. These algorithms are analysed and illustrated in Part V.
CHAPTER 8
CARTESIAN GRANULE FEATURES
Current approaches to knowledge discovery suffer from one or more shortcomings that
stem from the type of knowledge representation employed. This chapter introduces a
new form of knowledge representation centred on Cartesian granule features, with
corresponding induction algorithms being presented in the next chapter. This approach
to knowledge representation and related induction algorithms, while not being a
panacea for knowledge discovery, do address some of the shortcomings of other
knowledge discovery techniques such as decomposition error, and performance issues
such as transparency, accuracy and efficiency.
This chapter begins by providing basic definitions and examples of Cartesian granule
features and related concepts. Subsequently, it looks at the different possibilities for
aggregation within the context of individual Cartesian granule features based upon
fuzzy set theory and probability theory. Finally, it is shown how Cartesian granule
features can be incorporated into evidential logic (additive) and fuzzy logic models.
This results in a slightly modified approximate reasoning process for both fuzzy logic
and support logic reasoning, which is also described.
Cartesian granule features [Baldwin, Martin and Shanahan 1996; Baldwin, Martin and
Shanahan 1997; Shanahan 1998] were originally introduced to overcome some of the
shortcomings of existing forms of knowledge representation such as decomposition
error and also to enable the paradigm modelling with words through related learning
algorithms. In addition, this approach addresses other shortcomings of knowledge
discovery techniques as outlined above. A Cartesian granule feature can be
multidimensional in nature and is built upon a linguistic partition of the base universe.
This new approach exploits a divide-and-conquer strategy to representation, capturing
knowledge in terms of a network of low-order semantically related features - a network
of Cartesian granule features. The universes of these multidimensional features are
abstractly partitioned or discretised by Cartesian words, known as Cartesian granules.
This section begins by providing some basic definitions and examples of Cartesian
granule features and related concepts. It then provides a more complete presentation of
Definition: A Cartesian granule is an expression of the form w1 × ... × wm, where each
wi is a granule defined over the universe Ωi and where "×" denotes the Cartesian
product. A Cartesian granule can be intuitively visualised as a clump of elements in an
n-dimensional universe.
the Cartesian product. More concretely, given a set of features {F1, ..., Fm} defined over
the universes {Ω1, Ω2, ..., Ωm} and corresponding linguistic partitions {P1, ..., Pm},
where each Pi consists of labelled fuzzy sets {wi1, wi2, ..., wici}, a Cartesian granule
universe ΩP1×P2×...×Pm can be formed by taking the cross product of the words
making up each linguistic partition Pi as follows:

ΩP1×P2×...×Pm = {w1j1 × w2j2 × ... × wmjm | wiji ∈ Pi}

where each Cartesian granule is merely a string concatenation of the individual fuzzy
set labels wij and each ci denotes the granularity of partition Pi.
Consider the following example, where a two-dimensional Cartesian granule universe
is formed using the example problem features Position and Size. To construct a
Cartesian granule universe, the universe of each feature is linguistically partitioned
arbitrarily as follows:

PPosition = {Left, Middle, Right}     PSize = {Small, Medium, Large}

The Cartesian granule universe, ΩPPosition×PSize, will then consist of the following
discrete elements (Cartesian granules):

{Left×Small, Left×Medium, Left×Large, Middle×Small, Middle×Medium,
Middle×Large, Right×Small, Right×Medium, Right×Large}
is a linguistic partition of the respective universe Ωi for all i ∈ {1, ..., m}. A Cartesian
granule feature can intuitively be viewed as a multidimensional linguistic variable. For
example, considering the problem features of Position and Size presented above, the
Cartesian granule feature CGPosition×Size could denote a feature defined over the
Cartesian granule universe ΩPPosition×PSize as defined in Figure 8-1.
Figure 8-1: The Cartesian granule universe ΩPPosition×PSize defined in terms of the
linguistic partitions of the universes ΩSize and ΩPosition.
Definition: A Cartesian granule fuzzy set CGFSF1×F2×...×Fm is a discrete fuzzy set
defined over a Cartesian granule universe ΩP1×P2×...×Pm, where each Fi is a domain
feature and each Pi is a linguistic partition of the respective universe Ωi for all i ∈ {1,
..., m}. Each Cartesian granule is associated with a membership value, which is
calculated by combining the individual granule membership values that the individual
feature values have in the fuzzy sets that characterise the granules. For example,
consider the Cartesian granule w11 × ... × wm1, where each wi1 is the word associated
with the first fuzzy subset in each linguistic partition Pi. The membership value
associated with this Cartesian granule w11 × ... × wm1 for a data tuple <x1, ..., xm> is
calculated as follows:

μw11×...×wm1(<x1, ..., xm>) = μw11(x1) ∧ ... ∧ μwm1(xm)

where xi is the feature value associated with the i-th feature within the data vector. Here
the aggregation operator ∧ can be interpreted as any t-norm (see Section 3.5.1), such as
product or min. The choice of conjunction operator is considered in Section 8.2.
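The combination step in this definition is small enough to state directly. The following
Python sketch (the function names are illustrative assumptions) computes a Cartesian
granule membership from the individual granule memberships under a chosen t-norm:

from operator import mul

def cartesian_granule_membership(memberships, tnorm=min):
    """Conjunctively combine the memberships of the individual feature
    values in the granules w1, ..., wm; any t-norm may be used, with min
    and product being the common choices (see Section 8.2)."""
    result = 1.0
    for mu in memberships:
        result = tnorm(result, mu)
    return result

# e.g. mu_Middle(60) = 0.8 and mu_Medium(80) = 0.4 from the example below:
print(cartesian_granule_membership([0.8, 0.4], tnorm=min))   # 0.4
print(cartesian_granule_membership([0.8, 0.4], tnorm=mul))   # 0.32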
Extending the example presented above, if the universes ΩPosition and ΩSize are
defined as [0, 100] and [0, 100] respectively, then possible definitions of the fuzzy sets
in the partitions PPosition and PSize (in Fril notation [Baldwin, Martin and Pilsworth
1995])4 could be:

Left:[0:1, 50:0]                Small:[0:1, 50:0]
Middle:[0:0, 50:1, 100:0]       Medium:[0:0, 50:1, 100:0]
Right:[50:0, 100:1]             Large:[50:0, 100:1]
Linguistic partitions provide a means of giving the data a more anthropomorphic feel,
thereby enhancing understandability. In essence, when generating a Cartesian granule
fuzzy set corresponding to a data tuple, the single attribute values are first fuzzified (or
reinterpreted). Returning to the example, the attribute values for Position and Size are
reinterpreted in terms of the words that partition the respective universes, that is, a
linguistic description of the data is generated. Taking a sample data tuple (of the form
<Position, Size>) <60, 80> (denoted as <x, y> in Figure 8-1), each data value is
individually linguistically summarised in terms of two fuzzy sets, {Middle/0.8 +
Right/0.2} and {Medium/0.4 + Large/0.6}. Subsequently, taking the Cartesian product
of these fuzzy data yields the following fuzzy set in the Cartesian granule universe:
4 A fuzzy set definition in Fril such as Middle:[0:0, 50:1, 100:0] can be rewritten
mathematically as follows (denoting the membership value of x in the fuzzy set
Middle):

μMiddle(x) = 0             if x ≤ 0
             x/50          if 0 < x ≤ 50
             (100 - x)/50  if 50 < x < 100
             0             if x ≥ 100
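The footnote's piecewise reading of Fril-style breakpoint lists generalises directly. The
sketch below (the representation of a Fril definition as a list of (value, membership)
pairs and the helper names are assumptions made for illustration) reproduces the
linguistic descriptions of the tuple <60, 80> used above:

def fril_membership(points, x):
    """Membership of x in a fuzzy set given as Fril-style breakpoints,
    e.g. Middle:[0:0, 50:1, 100:0] -> [(0, 0), (50, 1), (100, 0)],
    linearly interpolated between breakpoints (cf. footnote 4)."""
    if x <= points[0][0]:
        return points[0][1]
    for (x0, m0), (x1, m1) in zip(points, points[1:]):
        if x <= x1:
            return m0 + (m1 - m0) * (x - x0) / (x1 - x0)
    return points[-1][1]

position_partition = {"Left":   [(0, 1), (50, 0)],
                      "Middle": [(0, 0), (50, 1), (100, 0)],
                      "Right":  [(50, 0), (100, 1)]}
size_partition =     {"Small":  [(0, 1), (50, 0)],
                      "Medium": [(0, 0), (50, 1), (100, 0)],
                      "Large":  [(50, 0), (100, 1)]}

def linguistic_description(partition, value):
    """Reinterpret a numeric value as a fuzzy set of words."""
    desc = {w: fril_membership(pts, value) for w, pts in partition.items()}
    return {w: m for w, m in desc.items() if m > 0}

print(linguistic_description(position_partition, 60))  # {'Middle': 0.8, 'Right': 0.2}
print(linguistic_description(size_partition, 80))      # {'Medium': 0.4, 'Large': 0.6}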
The approach proposed here tries to fulfil both desires by discovering models that are
not only accurate but also understandable. This is enabled by the use of Cartesian
granule features - multidimensional features built on words. Learning Cartesian granule
feature models reduces to a simple probabilistic counting of linguistic interpretations of
data (numerical or otherwise). For example, consider Figure 8-2(b), which graphically
displays a linguistic partition of the Position variable. The variable value of 40 can be
linguistically summarised or described using the following fuzzy set: {Left/0.2 +
Middle/1}. Consequently, due to the linguistic nature of Cartesian granule features,
modelling with Cartesian granule features enables the paradigm modelling with words,
where words, characterised by fuzzy granules, provide tractability, transparency and
generalisation. Similarly, reasoning in a Cartesian granule feature context can be
viewed as computing with words [Zadeh 1996]. As a result, Cartesian granule feature
models can facilitate a more natural interaction between the human and the computer.
In a sense, a knowledge discovery process centred on Cartesian granule features
discovers knowledge by letting your data speak (literally!). Part V of this book gives
concrete examples of this in terms of real world problems. Learning Cartesian granule
feature models is presented in detail in Chapter 9.
[Figure 8-2: (a) a fuzzy partition of the universe ΩPosition; (b) a linguistic partition of
ΩPosition by the words Left, Middle and Right, with the value 40 marked.]
Figure 8-3: An example of modelling the car parking problem using approaches based
upon total decomposition. This figure shows the resulting probability density functions
(class conditional) using the naïve Bayes approach (see Sections 5.2.2 and 7.5.2.2 for a
further explanation).
As mentioned previously, when constructing Cartesian granule fuzzy sets, there are
infinite ways of generating the membership values associated with the individual
Cartesian granules. Fuzzy and probabilistic approaches for generating these values are
examined. In the case of the fuzzy approaches, two commonly used operators - min and
product - are investigated and justified from a voting model perspective. This section is
included here for completeness and can be skipped on a first reading of this chapter
without any loss of continuity.
where xi is the feature value associated with the i-th feature in the data vector x.
Within fuzzy logic algebra, any of the functions that satisfy the t-norm axioms (see
Section 3.5.1) can be used as a conjunction operator, such as the min operator:
Both conjunction operators are commonly used in fuzzy logic and fuzzy control
applications [Baldwin, Martin and Pilsworth 1995; Klir and Yuan 1995]. In a more
general setting, averaging operations such as Yager's OWA operators [Yager 1993] or
parameterised aggregators such as Zimmermann's γ operator [Zimmermann and Zysno
1980] could be used (see Sections 3.5.2 and 3.5.3 for details of these aggregation
operators). The next subsection justifies the use of product and min as combination
operators from a human reasoning perspective using voting model semantics.
The definition of conjunction as the min operator is consistent with the voting model
interpretation of fuzzy sets, provided the voters vote consistently on all concepts that
are combined conjunctively [Baldwin 1991], i.e. the constant threshold assumption is
extended to cover all concepts. This use of min is justified with the following example.
Consider two die variables, both defined over the universe of values {1, 2, 3, 4, 5, 6}.
The two dice are thrown, resulting in die1 having a value of 5 and die2 having a value
of 6. A representative population of voters is then asked to vote on the appropriateness
of the words Small, Medium, and Large as a description of each die value. This voting
is performed independently for each of the die values, resulting in a voting pattern for
"die1 having a value of 5" presented in Table 8-1 and a voting pattern for "die2 having
a value of 6" presented in Table 8-2.
Table 8-1: A voting pattern for 10 people defining the linguistic description of die1
having the value of 5. This corresponds to the fuzzy set {Small/0.1 + Medium/0.7 +
Large/1}.

Word\Person   1   2   3   4   5   6   7   8   9   10
Large        Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes
Medium       Yes Yes Yes Yes Yes Yes Yes No  No  No
Small        Yes No  No  No  No  No  No  No  No  No
Assuming that the voters who were optimistic in voting for the linguistic description of
die1 having a value 5 share the same optimism when voting for the linguistic
description of die2 having a value 6, the voters in both voting patterns can be directly
matched. This leads to a voting pattern for the conjunction of both linguistic
descriptions that is presented in Table 8-3. In this case, the cells containing Yes
correspond to voters who accept the Cartesian granules as appropriate descriptions of
both die values. This resulting voting pattern generates the following fuzzy set:

{d1MediumANDd2Medium/0.7 +
d1MediumANDd2Large/0.7 +
d1LargeANDd2Medium/0.8 +
d1LargeANDd2Large/1}

which coincides with the fuzzy set generated by using the min rule for the conjunction
of the individual granule memberships. This example illustrates that using min as a
granule conjunction operator is intuitive.
Table 8-2: A voting pattern defining the linguistic description of die2 having the value
of 6. This corresponds to the fuzzy set {Medium/0.8 + Large/1}.

Word\Person   1   2   3   4   5   6   7   8   9   10
Large        Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes
Medium       Yes Yes Yes Yes Yes Yes Yes Yes No  No
A similar argument can be used to justify the use of product as the conjunction operator
of granule memberships. In this case, the constant threshold assumption is dropped, that
is, a voter's degree of optimism/pessimism is allowed to vary across voting for different
concepts. Once again, the justification for using product conjunction is illustrated using
an example. This justification uses the linguistic descriptions generated by a voting
population for "die1 = 5" and "die2 = 6" that are presented in Table 8-1 and Table 8-2
respectively (i.e. using the same patterns that were used in justifying the min operator).
As a result of dropping the constant threshold assumption, the voters labelled 1 to 10 in
the voting patterns for die1 may not correspond to the voters labelled 1 to 10 in the
voting patterns for die2. In other words, there is no correlation between the voters in the
voting pattern for die1 and the voters in the voting pattern for die2, that is, the voter
labelled 1 in Table 8-1 may not correspond to the voter labelled 1 in Table 8-2. This is
depicted in Table 8-4 for the linguistic description of "die2 = 6", where each VPj
denotes a voter variable that can be assigned any of the ten voters. Table 8-5 depicts
one possible voting pattern for the die2 value. Consequently, this results in the voting
pattern for the conjunction of the voting patterns for the linguistic descriptions of the
die1 value (Table 8-1) and the die2 value (Table 8-2) that is presented in Table 8-6.
However, there are many possible instantiations for the voter variables VPj, each
resulting in a different overall voting pattern for the die2 value. This in turn results in a
different voting pattern for the conjunction of both patterns and subsequently a different
corresponding fuzzy set. No voter instantiation is preferable to another; consequently,
all voting patterns are equally likely for linguistic descriptions of "die2 = 6". Since all
voting patterns are equally likely, the expected fuzzy set can be taken as the fuzzy set
corresponding to the conjunction of the linguistic descriptions of the individual die
instantiations. This results in the following Cartesian granule fuzzy set:

{d1MediumANDd2Medium/0.56 +
d1MediumANDd2Large/0.7 +
d1LargeANDd2Medium/0.8 +
d1LargeANDd2Large/1}
Table 8-3: A voting pattern for 10 people corresponding to the linguistic description of
"die1 = 5 and die2 = 6", which denotes the fuzzy set {d1MediumANDd2Medium/0.7 +
d1MediumANDd2Large/0.7 + d1LargeANDd2Medium/0.8 + d1LargeANDd2Large/1}.

Cartesian word\Person      1   2   3   4   5   6   7   8   9   10
d1LargeANDd2Large         Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes
d1LargeANDd2Medium        Yes Yes Yes Yes Yes Yes Yes Yes No  No
d1MediumANDd2Medium       Yes Yes Yes Yes Yes Yes Yes No  No  No
d1MediumANDd2Large        Yes Yes Yes Yes Yes Yes Yes No  No  No
This fuzzy set coincides with the fuzzy set generated by using the product conjunction
of the individual granule memberships. This example illustrates that using product as a
granule conjunction operator is intuitive.
Table 8-4: A "general" voting pattern defining the linguistic description of die2 having
the value of 6. This corresponds to the fuzzy set {Medium/0.8 + Large/1}.

Word\Person  VP1 VP2 VP3 VP4 VP5 VP6 VP7 VP8 VP9 VP10
Large        Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes
Medium       Yes Yes Yes Yes Yes Yes Yes Yes No  No
Table 8-5: A possible voting pattern for the linguistic description of die2 having the
value of 6. This corresponds to the fuzzy set {Medium/0.8 + Large/1}.

Word\Person   4   2   3   7   10  8   1   5   9   6
Large        Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes
Medium       Yes Yes Yes Yes Yes Yes Yes Yes No  No
Table 8-6: A possible voting pattern for the conjunction of the linguistic description of
"die1 = 5" (as presented in Table 8-1) and the linguistic description of "die2 = 6" (as
presented in Table 8-5). This corresponds to the fuzzy set
{d1MediumANDd2Medium/0.5 + d1MediumANDd2Large/0.5 +
d1LargeANDd2Medium/0.7 + d1LargeANDd2Large/1}.

Cartesian word\Person      1   2   3   4   5   6   7   8   9   10
d1LargeANDd2Large         Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes
d1LargeANDd2Medium        Yes Yes Yes Yes Yes No  Yes Yes No  No
d1MediumANDd2Medium       Yes Yes Yes Yes No  No  Yes No  No  No
d1MediumANDd2Large        Yes Yes Yes Yes No  No  Yes No  No  No
The previous paragraphs have justified from a voting model perspective the
applicability of both product and min as conjunction operators for combining the
individual granule memberships. However, the use of the product operator is preferred
as it gives more discrimination between different data values, whereas the min operator
can exhibit a plateau (non-discriminating) behaviour. Furthermore, when mutually
exclusive partitions are used to partition the base variable universes, using the product
maintains this property of mutual exclusiveness in the Cartesian granule universe.
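The two voting-model arguments can be checked mechanically. The sketch below
conjoins the dice linguistic descriptions of Tables 8-1 and 8-2 under min and under
product, reproducing the two Cartesian granule fuzzy sets derived above (the Small
granule of die1 is omitted, as in the text; the helper names are illustrative):

from itertools import product as cross

die1 = {"Medium": 0.7, "Large": 1.0}   # from {Small/0.1 + Medium/0.7 + Large/1}
die2 = {"Medium": 0.8, "Large": 1.0}   # from {Medium/0.8 + Large/1}

def conjoin(fs1, fs2, tnorm):
    """Cartesian granule fuzzy set from two linguistic descriptions."""
    return {f"d1{w1}ANDd2{w2}": tnorm(m1, m2)
            for (w1, m1), (w2, m2) in cross(fs1.items(), fs2.items())}

print(conjoin(die1, die2, min))                  # memberships 0.7, 0.7, 0.8, 1.0
print(conjoin(die1, die2, lambda a, b: a * b))   # memberships 0.56, 0.7, 0.8, 1.0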
For presentation purposes, this approach to forming Cartesian granule fuzzy sets is
described using an illustrative example. Consider a two-dimensional Cartesian granule
consisting of two features F1 and F2 defined over the set ℝ of real numbers, with
corresponding linguistic partitions where the granules are characterised by (mutually
exclusive) triangular fuzzy sets as depicted in Figure 8-4. In the general case, where the
variable F1 is assigned a data value, i.e. F1 = Data, it can be reinterpreted as a linguistic
assignment as shown in Figure 8-4. The linguistic description of Data will take the
form of the following fuzzy set:

LDData = {w2/x + w3/y}

where 0 ≤ y ≤ x ≤ 1.
Figure 8-4: Generating a linguistic description of data using the linguistic partition of
the variable universe ΩF, which is characterised by mutually exclusive triangular fuzzy
sets (granules w3, w4, w5).
This linguistic fuzzy set LDData corresponds to the following mass assignment:

MAData = ({w2}: x - y, {w2, w3}: y, ∅: 1 - x)

The mass associated with the null set ∅ (arising from the subnormal fuzzy set LDData)
is redistributed amongst each element in the core of the mass assignment according to a
renormalised prior (assume a uniform prior for this presentation). Other distributions
are also possible and are considered in Section 8.2.2.1. This leads to the following
revised mass assignment:

MAData = ({w2}: x - y + (1 - x)/2, {w3}: (1 - x)/2, {w2, w3}: y)
As a result of the mutual exclusive nature of the fuzzy partition in this case and the
strategy used to redistribute the null set mass, the probability associated with each of
the words in the least prejudiced distribution LPD Data coincides with membership value
associated with the word in the original linguistic fuzzy set. On the other hand,
assigning a data value to variable F2 results in similar fuzzy descriptions, mass
assignments and least prejudiced distributions. Subsequently, the joint probability
distribution is formed over these words by associating the product of the individual
probabilities with the Cartesian granules. The use of product here is justified on the
grounds that the linguistic partitions of the base features were generated independently
of each other. In terms of the example, let the least prejudiced distributions
corresponding to the two data values of variables F J and F2 be defined as follows:
Combining these least prejudiced distributions leads to the following joint probability
distribution:
This joint least prejudiced distribution can be converted to the corresponding unique
fuzzy set via its mass assignment using the membership-to-probability transformation
(see Chapter 5). This yields the following Cartesian granule fuzzy set:
In this case, due to the mutually exclusive nature of the underlying linguistic partitions
of the variable universes, the resulting Cartesian granule fuzzy set coincides with the
Cartesian granule fuzzy set obtained when individual linguistic descriptions (fuzzy sets)
are combined using the product operation.
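To make the aggregation concrete, the following is a minimal Python sketch (not the book's code; the LPD values and granule names are illustrative assumptions) of the pipeline just described: per-feature least prejudiced distributions are combined by product into a joint distribution over Cartesian granules, which is then converted back into a Cartesian granule fuzzy set using the probability-to-membership transformation under a uniform prior (see Chapter 5).

# A minimal sketch, assuming mutually exclusive partitions so that each
# feature's least prejudiced distribution (LPD) is available directly.
from itertools import product as cart_prod

def joint_lpd(lpd_f1, lpd_f2):
    """Combine per-feature LPDs into a joint distribution over Cartesian
    granules using product (the partitions were generated independently)."""
    return {(w1, w2): p1 * p2
            for (w1, p1), (w2, p2) in cart_prod(lpd_f1.items(), lpd_f2.items())}

def lpd_to_fuzzy_set(lpd):
    """Probability-to-membership transformation under a uniform prior:
    with probabilities ordered p1 >= ... >= pn, the membership of the
    i-th granule is sum_{j >= i} j * (p_j - p_{j+1}), taking p_{n+1} = 0."""
    items = sorted(lpd.items(), key=lambda kv: -kv[1])
    probs = [p for _, p in items] + [0.0]
    return {w: sum((j + 1) * (probs[j] - probs[j + 1])
                   for j in range(i, len(items)))
            for i, (w, _) in enumerate(items)}

# Illustrative LPDs for F1 = Data1 and F2 = Data2 (assumed values):
lpd_f1 = {"w2": 0.7, "w3": 0.3}
lpd_f2 = {"w2'": 0.6, "w3'": 0.4}
cg_fuzzy_set = lpd_to_fuzzy_set(joint_lpd(lpd_f1, lpd_f2))
print(cg_fuzzy_set)

Since the granule probabilities sum to one, the granule with the highest joint probability always receives a membership of one, so the resulting Cartesian granule fuzzy set is normal.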
This is an incomplete mass assignment [Baldwin 1992] and does not correspond to a
family of probability distributions. Instead it corresponds to a non-normalised family of
probability distributions whose total probability mass sums to x (i.e. to one minus the
mass associated with the null set ∅).
On the other hand, by redistributing the mass associated with the null set ∅, normal
probability distributions can be generated. There are infinitely many ways of
redistributing the mass associated with the null set ∅, thus leading to many different
families of probability distributions. Using a voting model interpretation of this mass
assignment, two ways of distributing the mass associated with the null set ∅ amongst
the other focal elements can be justified: the mass can be distributed amongst the other
active domain elements (i.e. elements of the core of this mass assignment) according to
the prior; or alternatively the mass can be distributed amongst the other focal elements
in proportion to their associated masses [Baldwin, Martin and Pilsworth 1995]. The
first approach to redistributing the mass associated with the null set ∅ using the
renormalised prior
(assume a uniform prior here) amongst the domain elements yields the following mass
assignment and least prejudiced distribution:

MA_Data = ({w2} : x − y/2, {w3} : y/2, {w2, w3} : y)
LPD_Data = w2 : x, w3 : y

The probabilities in the least prejudiced distribution LPD_Data coincide with the
membership values in the original fuzzy set.
Alternatively, the mass can be redistributed amongst the other focal elements in
proportion to their associated masses, thereby increasing the mass associated with {w2}
by (x − y)(1 − x)/x and the mass associated with {w2, w3} by y(1 − x)/x. This results in
the following mass assignment and least prejudiced distribution:

MA_Data = ({w2} : (x − y)/x, {w2, w3} : y/x)
LPD_Data = w2 : 1 − y/(2x), w3 : y/(2x)
The least prejudiced distribution LPD_Data in this case corresponds to normalising the
fuzzy set before transforming it into its corresponding mass assignment and least
prejudiced distribution. The two methods result in different least prejudiced
distributions but are equally justifiable.
So far, this section has focused on the generation of Cartesian granule fuzzy sets via the
least prejudiced distributions associated with the feature linguistic descriptions where
the underlying partitions are mutually exclusive. However, a similar approach could be
taken where the underlying partitions are not mutually exclusive. In this case, the
resulting linguistic descriptions may be normal for all domain values, thereby
simplifying the aggregation process.
In this book, Cartesian granule fuzzy sets corresponding to data vectors are generated
using the approach presented in Section 8.1, whereby the individual granule memberships
are combined using the product conjunction operator. This preserves truth functionality,
and is a more efficient and simpler (involving fewer steps) way of generating
Cartesian granule fuzzy sets. Empirical evidence to date suggests there is little
difference between the aggregation approaches considered here when they are
employed in a machine learning context [Baldwin, Martin and Shanahan 1997].
In modelling a problem domain, since Cartesian granule features can assume fuzzy set
or probabilistic values, they can be quite naturally incorporated into conjunctive,
evidential and causal relational rule structures, thus enabling reasoning using support
logic [Baldwin, Martin and Pilsworth 1995]. Alternatively, Cartesian granule features
can be incorporated into fuzzy logic rules, thus enabling approximate reasoning based
on CRI as presented in Chapter 4. Even though it is possible to combine all the features
(or base variables) of a problem into one Cartesian granule feature, it may not always
be desirable. For example, in Section 10.4, the need for discovering structural
decomposition of input feature spaces into lower order feature spaces is motivated by
the L problem. In general, decomposition is required not only on generalisation grounds
but also from knowledge transparency and tractability perspectives [Baldwin, Martin
and Shanahan 1998; Shanahan 1998]. This partial decomposition can be viewed as a
form of decomposition of the problem domain into low order relationships between
small clusters of semantically related variables, similar in spirit to Bayesian networks
[Pearl 1986], where a Cartesian granule feature represents each cluster of semantically
related variables (variables that have dependencies, such as functional or probabilistic
dependencies). This correlation between Bayesian belief networks and Cartesian
granule features is further discussed in Section 10.4.1.2. As a result of this
decomposition, a means of aggregating the individual Cartesian granule features is
required. In this book, the evidential logic rule is chosen as a natural mechanism for
representing this type of decomposed approach to systems modelling [Shanahan 1998].
This type of model is referred to as an additive model. On the other hand, using the
conjunctive rule to aggregate the individual Cartesian granule features results in a
product model. In Section 10.2.2.1, the use of product and additive models is
compared on the ellipse dataset. As mentioned earlier (Chapter 6) in the context of
evidential rules, additive models permit partial reasoning (i.e. they tolerate missing
values), which can be an attractive facet in very uncertain problem domains.
More concretely stated, an additive model will consist of an evidential logic rule
corresponding to each class in the problem domain. An evidential logic rule structure is
reviewed here from a Cartesian granule feature perspective (see Section 6.1.2 for a
complete description) and is depicted in Figure 8-5. Here CLASS can be viewed as a
fuzzy set consisting of a single crisp value, in the case of classification type problems,
or as a fuzzy set characterising part of the output variable universe in the case of
prediction problems. Each rule characterises the relationship between input and output
data for a particular region of the output space i.e. a concept. The body (conditional
part) of each rule consists of a collection of Cartesian granule features F;, whose values
CGFSicLASS correspond to fuzzy sets defined over respective universes 0; that
correspond to the output variable value CLASS (in probabilistic terms this can be
viewed as the class conditional Pr(F; = CGFS;cLASS I Classification = CLASS)). Each
feature F; is associated with a weight term w; that reflects the importance of this feature
to CLASS;.
This section considers the support logic approximate reasoning process from an
additive Cartesian granule feature model perspective. As described in detail in the
previous section, each rule consists of a body of Cartesian granule features and their
corresponding fuzzy set values. The first step in the inference process consists of
generating Cartesian granule fuzzy sets from the incoming data vector xi corresponding
to each (CG) feature (as described in Section 8.1). This results in a Cartesian granule
fuzzy set description CGDi of xi for each feature Fi. Subsequently, for each feature Fi
the first level of inference (as described in Section 6.2.1) is performed. That is, a fuzzy
set match is performed using semantic unification between each class fuzzy set
CGFSiCLASS and the corresponding data fuzzy set CGDi as follows:

SU(CGFSiCLASS | CGDi)

where CGDi corresponds to the Cartesian granule fuzzy set description of xi. Then
evidential reasoning proceeds as described in Section 6.2. Decision making is as
presented in Section 6.3.
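The first level of inference and the subsequent decision making can be sketched as follows (illustrative Python, not the book's code; in particular, point semantic unification is computed here as the expected membership of the class fuzzy set under the data's least prejudiced distribution, an assumed reading of the Chapter 5 machinery, and all names are placeholders).

# A sketch of additive-model inference (assumed names throughout).
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple

@dataclass
class EvidentialRule:
    class_label: str
    # one (feature name, class fuzzy set CGFSiCLASS, weight wi) per body term
    body: List[Tuple[str, Dict[tuple, float], float]]
    filter_fn: Optional[Callable[[float], float]] = None  # None = identity filter

def point_semantic_unification(class_fs, data_lpd):
    """Pr(class fuzzy set | data fuzzy set), computed here as the expected
    membership of the class fuzzy set under the data's LPD (an assumption
    consistent with the mass assignment framework of Chapter 5)."""
    return sum(mu * data_lpd.get(g, 0.0) for g, mu in class_fs.items())

def rule_support(rule, data_lpds):
    """Weighted sum of per-feature semantic unifications, passed through
    the rule's filter S (filters are discussed in Section 9.5.2)."""
    s = sum(w * point_semantic_unification(cgfs, data_lpds[f])
            for f, cgfs, w in rule.body)
    return rule.filter_fn(s) if rule.filter_fn else s

def classify(rules, data_lpds):
    # Decision making: the class whose rule yields the highest support wins.
    return max(rules, key=lambda r: rule_support(r, data_lpds)).class_label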
The previous sections have shown how Cartesian granule features can be incorporated
into evidential logic and conjunctive rule structures and how probabilistic reasoning can
be carried out in this context. As an alternative, Cartesian granule features can also be
incorporated into fuzzy rules in the fuzzy logic sense as described in Chapter 4.
Consequently, as was the case in probabilistic reasoning, the first step in the inference
process consists of generating a Cartesian granule fuzzy set CGD; for each (CG) feature
Fi from the incoming data vector xi . Subsequently, reasoning is performed using the
compositional rule of inference (CRI) in conjunction with defuzzification procedures as
presented in Chapter 4. In this book, the presentation and results are limited to
probabilistic reasoning; however, the use of fuzzy logic in conjunction with Cartesian
granule features will form part of future work.
8.6 SUMMARY
Overall, Cartesian granule features open up a new and exciting avenue in uncertainty
modelling which permits not only computing with words but also modelling with words.
The next chapter describes a constructive induction algorithm that facilitates the
automatic extraction of Cartesian granule feature models from example data
(modelling with words) for both classification and prediction problems.
8.7 BIBLIOGRAPHY
Having introduced Cartesian granule feature models and related induction algorithms,
Part V shifts its attention to applications of Cartesian granule features within the more
general context of knowledge discovery. Chapter 10, for the purposes of illustration and
analysis, applies this approach to artificial problems in both classification and
prediction. Chapter 11 focuses on practical applications of knowledge discovery of
Cartesian granule feature models in the real world domains of computer vision and
diabetes diagnosis and control, while also comparing this approach with other
techniques such as neural networks, decision trees, naive Bayes and various fuzzy
induction algorithms. Chapter 11 finishes by summarising knowledge discovery from a
Cartesian granule feature perspective and gives some views on what the future may
hold for knowledge discovery in general and for Cartesian granule features.
CHAPTER 9: LEARNING CARTESIAN GRANULE FEATURE MODELS
In the previous chapter, it was shown how Cartesian granule feature models exploit a
divide-and-conquer strategy to representation, capturing knowledge in terms of a
network of low-order semantically related features. Both classification and prediction
problems can be modelled quite naturally in terms of these models. This chapter
describes a constructive induction algorithm, G_DACG (Genetic Discovery of Additive
Cartesian Granule feature models), which facilitates the learning of such models from
example data [Shanahan 1998; Shanahan, Baldwin and Martin 1999]. This involves two
main steps: language identification (identification of the low-order semantically related
features in terms of Cartesian granule features); and parameter identification of class
fuzzy sets and rules. The G_DACG algorithm achieves this by embracing the
synergistic spirit of soft computing, using genetic programming to discover the
language (structure) of the model, fuzzy sets and evidential rules for knowledge
representation, while relying on the well-developed probability theory for learning the
parameters of the model.
This chapter begins by introducing the G_DACG constructive induction algorithm. The
algorithm is presented from both a classification (Section 9.1) and a prediction problem
(Section 9.1.4) perspective. Feature selection and discovery play important roles in the
induction of Cartesian granule feature models, and consequently, in Section 9.2 a
literature review of existing approaches is given. Section 9.3 describes, in detail, the
language identification (feature discovery) component of G_DACG. It is a population-
based search algorithm, centred on genetic programming [Koza 1992; Koza 1994],
where each node in the search space is a Cartesian granule feature that is characterised
by its constituent features and their abstractions (linguistic partitions). A couple of
novel fitness functions are presented in Section 9.3.2, including fitness based upon the
semantic separation of learnt concepts and parsimony promotion. Sections 9.1.2, 9.4,
and 9.5 present the main steps in parameter identification and optimisation -
identification of class fuzzy sets, evidential weights and rule filters respectively.
Section 9.6 proposes an alternative approach to parameter identification that exploits
neural network learning algorithms. For illustration purposes, in Section 9.7, G_DACG
is applied to a small artificial problem - the ellipse classification problem. The use of
different types of fitness function is examined in the context of this problem. Further
applications (real world) of G_DACG are provided in Chapters 10 and 11.
The induction of additive Cartesian granule feature models falls into the category of
supervised learning algorithms. Within this framework, problem domains are generally
described by a database of example cases (training examples), each labelled with a
corresponding output value.
The goal of supervised learning is to generate a model from the training examples, in
this case an additive Cartesian granule feature model, that covers (classifies correctly)
not only training examples, but also examples that have not been seen during training
i.e. that generalises well. Subsequent paragraphs describe the main steps in learning an
additive Cartesian granule feature model from example data using the G_DACG
constructive induction algorithm (Genetic Discovery of Additive Cartesian Granule
feature models). Since the induction of additive Cartesian granule feature models
involves the construction of new features, the G_DACG algorithm can be categorised
as a constructive induction algorithm [Dietterich and Michalski 1983].
G_DACG can be viewed abstractly in terms of the following two steps (see Figure 9-1
for a schematic overview of G_DACG from a knowledge discovery perspective):
• Language identification (steps 1 and 2 in G_DACG): this determines which
Cartesian granule features, and which abstractions of their base features, make
up the model. Since language identification is done outside the main phase of
the induction method but uses the induction method as the evaluation function,
the feature selection and discovery component of G_DACG is classified as a
wrapper approach [Kohavi and John 1997].
• Parameter identification (steps 3 to 5 in G_DACG): Having identified the
language of the model, parameter identification then estimates the class fuzzy
sets and class aggregation rules. Setting up the class aggregation rules is
further divided into the tasks of estimating the weights associated with the
individual Cartesian granule features (sub-models) and with identifying the
rule filters.
Figure 9-1: A schematic overview of the knowledge discovery process built around
G_DACG: data selection, preprocessing and transformation feed the constructive
induction step (G_DACG), which comprises language identification followed by
parameter identification.
Step 1: Setup datasets. Split the database of examples into a training database
Dtrain, a control database Dcontrol and a testing database Dtest.
Step 2: Language identification. Select which features fj should be combined to
form Cartesian granule features Fi. This step is taken care of by an
automatic, near optimal, feature discovery algorithm that discovers which
Cartesian granule features and their abstractions (i.e. the linguistic
partition Pfi of each problem feature universe) are necessary to model a
problem effectively. It outputs a set of Cartesian granule features {F1, ...,
Fi, ..., Fm}. These features are subsequently incorporated into evidential
logic rules of the form depicted in Figure 9-2. This algorithm and related
material are presented in Sections 9.2 and 9.3.
Step 3: Learn the class Cartesian granule fuzzy sets. This step extracts the fuzzy
set values CGFSiClass of each class-rule feature. For each class Class in
{CLASS1, ..., CLASSc}, extract a fuzzy set CGFSiClass defined over each
Cartesian granule feature universe ΩFi using the procedure outlined in
Section 9.1.2.
Step 4: Identify rule weights and filters. The Cartesian granule features {F1, ...,
Fi, ..., Fm} and corresponding fuzzy set values are incorporated into
evidential logic rules of the form depicted in Figure 9-2. This step
estimates the weights associated with each Cartesian granule feature Fi
using semantic discrimination analysis (see Section 9.4) and sets each
class filter to the identity filter. Using the estimated weights, generate the
corresponding ACGF model, ACGF_SDA.
Step 5: Optimise rule weights and filters. This step is optional but can improve the
performance of the learnt additive model in some cases. It optimises the
rule weights and filters using Powell's direction set optimisation algorithm
(presented in detail in Section 9.5). The macro-level details of this step are
as follows:
• Take the model ACGF_SDA generated in step 4, and optimise the
filters using Powell's direction set algorithm [Powell 1964].
Regenerate the ACGF model with the newly optimised filters
and SDA-based weights, yielding the model ACGF_OptFilters_SDA.
• Using the model ACGF_OptFilters_SDA generated above, optimise the
weights using Powell's direction set algorithm. Regenerate the
ACGF model with the optimised filters and optimised weights,
yielding ACGF_OptFilters_OptWeights.
• Re-optimise the filters of the model ACGF_OptFilters_OptWeights using
Powell's direction set algorithm. Regenerate the ACGF model
with the re-optimised filters and optimised weights, yielding
ACGF_OptFilters_OptWeights_2.
• Calculate the accuracy of each of the generated models on the
control dataset. Select the ACGF model with the highest
accuracy on the control set as the learnt model.
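The five steps can be summarised in the following illustrative sketch; every helper referenced through ops (split, discover_features_gp, and so on) is an assumed stand-in for the corresponding section of this chapter rather than actual book code.

# Illustrative top-level sketch of G_DACG; `ops` bundles assumed helper
# functions standing in for the sections cited in the comments.
def g_dacg(examples, ops):
    d_train, d_control, d_test = ops.split(examples)                 # Step 1
    features = ops.discover_features_gp(d_train, d_control)          # Step 2 (Sections 9.2-9.3)
    class_fuzzy_sets = {c: ops.learn_class_fuzzy_sets(features, d_train, c)
                        for c in ops.classes(d_train)}               # Step 3 (Section 9.1.2)
    model = ops.build_acgf_model(features, class_fuzzy_sets,         # Step 4 (Section 9.4)
                                 weights=ops.sda_weights(class_fuzzy_sets),
                                 filters="identity")
    candidates = [model] + ops.powell_optimise(model, d_control)     # Step 5 (Section 9.5)
    return max(candidates, key=lambda m: ops.accuracy(m, d_control)) # best on control wins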
A Cartesian granule fuzzy set CGFSiClass for the concept Class over feature universe ΩFi
is learned from example data tuples as follows:
Step F1. Initialise a frequency distribution DISTiClass defined over all the
Cartesian granules in the Cartesian granule feature universe ΩFi, that
is, set each Cartesian granule count to zero.
Step F2. For each class training tuple TtClass perform the following (Section 5.4
presents the membership-to-probability bi-directional transformation
that exists between fuzzy set theory and probability theory, which is
used extensively below):
• Construct the corresponding Cartesian granule fuzzy set (i.e.
linguistic description of the data vector) CGFStClass
corresponding to the training tuple TtClass using the approach
outlined in Section 8.1.
• Subsequently, the fuzzy set CGFStClass is transformed into its
corresponding least prejudiced distribution LPDtClass.
• Update the overall frequency distribution DISTiClass with this
least prejudiced distribution LPDtClass.
Step F3. This frequency distribution DISTiClass corresponds to the least
prejudiced distribution LPDiClass, which can then be transformed into
the Cartesian granule fuzzy set CGFSiClass (using the bi-directional
transformation). In the absence of any other information, a uniform
prior distribution over the Cartesian granules is assumed for this
transformation.
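Steps F1-F3 admit a compact rendering (a sketch under the assumption that data_vector_to_lpd implements the Section 8.1 procedure and lpd_to_fuzzy_set the Chapter 5 transformation, as sketched in Section 8.1's illustration; both are passed in as placeholders).

from collections import defaultdict

def learn_class_cg_fuzzy_set(class_tuples, data_vector_to_lpd, lpd_to_fuzzy_set):
    dist = defaultdict(float)                      # Step F1: zeroed frequency distribution
    for t in class_tuples:                         # Step F2: accumulate each tuple's LPD
        for granule, p in data_vector_to_lpd(t).items():
            dist[granule] += p
    total = sum(dist.values())
    lpd = {g: f / total for g, f in dist.items()}  # Step F3: frequencies -> LPD
    return lpd_to_fuzzy_set(lpd)                   # uniform-prior transformation (Chapter 5)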
This linguistic partition is depicted in Figure 9-3. The main steps in extracting a
Cartesian granule fuzzy set for this simple example are graphically presented in Figure
9-4. The process begins by taking examples of car positions in images and generating
corresponding Cartesian granule fuzzy sets and least prejudiced distributions. The top
left table corresponds to examples of car positions, corresponding linguistic
descriptions (in this case, the Cartesian granule fuzzy sets are equivalent to the
linguistic descriptions due to the one-dimensional nature of the CG feature) and least
prejudiced distributions. The top middle graph corresponds to the initial Cartesian
granule frequency distribution. The top right graph depicts the Cartesian granule
frequency distribution after updating with the LPD corresponding to the value of 40.
The right middle graph shows the Cartesian granule frequency distribution after
updating with the LPD corresponding to the value of 60. The bottom right graph
displays the Cartesian granule frequency distribution after counting all the LPDs
corresponding to the example car positions. Finally, the bottom left graph depicts the
corresponding Cartesian granule fuzzy set for car positions in images i.e. a linguistic
summary of car positions in images in terms of the words Left, Middle and Right. Here,
for presentation purposes, the Cartesian granule feature is one dimensional in nature,
however, multidimensional features can be accommodated in a similar fashion.
Figure 9-3: The linguistic partition of the car position universe [0, 100], characterised
by the words Left, Middle and Right.
Step F2: For each training tuple TtClass that satisfies Pr(TtClass | OutputValue) > 0,
perform the following:
• Construct the corresponding Cartesian granule fuzzy set (i.e.
linguistic description of the data vector) CGFStClass
corresponding to the training tuple TtClass.
• Transform the fuzzy set CGFStClass into its corresponding least
prejudiced distribution LPDtClass.
• Update the overall frequency distribution DISTiClass with this
least prejudiced distribution LPDtClass in proportion with
Pr(TtClass | OutputValue).
#     Position    Linguistic Description      LPD
1     40          Left/.4 + Middle/1          Left/.2, Middle/.8
2     60          Middle/.8 + Right/1         Middle/.4, Right/.6
...
N     45          Left/.6 + Middle/1          Left/.3, Middle/.7

CarPosition = {Left/0.3 + Middle/1 + Right/0.25}

Figure 9-4: Induction of the Cartesian granule fuzzy set, {Left/0.3 + Middle/1 +
Right/0.25}, corresponding to car positions in images (lower left graph) from example
car positions (top left table).
9.2 FEATURE DISCOVERY
The feature discovery component of G_DACG is a population-based search algorithm,
centred on genetic programming [Koza 1992; Koza 1994], where each node in the
search space is a Cartesian granule feature that is characterised by its constituent
features and their abstractions (linguistic partitions).
Before describing in detail feature discovery using G_DACG, a brief review of other
feature discovery and selection approaches in the literature is given. This section on
feature discovery and selection approaches and constituent subsections can be omitted
on a first reading without loss of continuity.
One can view the task of feature discovery as a search problem; for example, the
discovery of Cartesian granule features can be cast as a search in which each state
in the search space specifies a possible Cartesian granule feature. This task can
be viewed as both a feature selection and construction process. There has been
substantial work on feature discovery and selection in various fields such as pattern
recognition, statistics, information theory, machine learning theory and computational
learning theory. Numerous feature selection algorithms exist. Kohavi and John [Blum
and Langley 1997; Kohavi and John 1997] characterise the various approaches as
follows: those that embed the selection within the basic induction algorithm; those that
use feature selection to filter features passed to induction; and those that treat feature
selection as a wrapper around the induction process. Since feature selection plays a
critical role in feature discovery, the various approaches to feature selection are
examined using these categories.
These embedded techniques, due to the search mechanisms employed, are very
vulnerable to starting points and local minima [Blum and Langley 1997; Bossley 1997;
Kalvi 1993; Kohavi and John 1997]. These search techniques work well in domains
where there is little interaction amongst the relevant features. However, the presence of
attribute interactions can cause significant problems for these techniques. Parity
concepts constitute the most extreme example of this situation, but it also arises in other
target concepts. Embedded selection methods that rely on greedy search cannot
distinguish between relevant and irrelevant features early in the search, although
combining forward selection and backward elimination in concept construction may
help to overcome this problem. A better alternative may be to rely on a more random
search such as simulated annealing, or a more random and diverse search technique
such as genetic algorithms or genetic programming.
Some feature construction approaches take the form of a filter that constructs
higher-order features, orders them and selects the best such features. These features are
then passed on to the induction algorithm. Filter approaches, while interesting and
useful, totally ignore the demands and capabilities of the induction algorithm and thus
can introduce an entirely different inductive bias to that of the induction algorithm
[Kohavi and John 1997]. This leads to the argument that the induction method planned
for use with the selected features should provide a better estimate of accuracy than a
separate measure that has an entirely different inductive bias; this leads to the wrapper
technique for feature selection.
Due to the constructive nature of Cartesian granule features, the discovery of good,
highly discriminating, and parsimonious Cartesian granule features (i.e. the feature
subsets and the feature universe abstractions) is an exponential search problem that
forms one of the most critical and challenging tasks in model identification. An additive
model composed of Cartesian granule features that are too simple or too inflexible to
represent the data will have a large bias, while one which has too much flexibility (i.e.
redundant structure) may fit idiosyncrasies found in the training set, producing models
that generalise poorly; in this case the model's variance is too high. This is an example
of the classical bias/variance dilemma presented in [Geman, Bienenstock and Doursat
1992]. Bias and variance are complementary quantities, and the best generalisation is
obtained when the model provides the best compromise between the conflicting
requirements of small bias and small variance.
Bias: This represents how the average model (often referred to as the best model)
differs from the true system f(x). If the extracted model converges to the true system,
the model is said to be unbiased, i.e. well matched to the system. This type of bias
differs from the inductive bias presented in Section 7.8.5, which refers to the bias used
in the discovery or search for a model and not the bias associated with the discovered
model, as is the case here.
Variance: This represents how sensitive the model is to different datasets by measuring
the expected error between the average model and a model identified on a single
dataset.
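For completeness, these two definitions correspond to the terms of the standard decomposition of the expected squared error between a learnt model y(x; D) (trained on dataset D) and the true system f(x), in the spirit of [Geman, Bienenstock and Doursat 1992] (a textbook identity, stated here rather than quoted from the text):

\[
\underbrace{E_D\big[(y(x;D)-f(x))^2\big]}_{\text{expected error}}
= \underbrace{\big(E_D[y(x;D)]-f(x)\big)^2}_{\text{bias}^2}
+ \underbrace{E_D\big[(y(x;D)-E_D[y(x;D)])^2\big]}_{\text{variance}}
\]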
In order to find the optimum balance between bias and variance, a means of controlling
the effective complexity of the model is required. This trade-off is incorporated directly
into the G_DACG (Genetic Discovery of Additive Cartesian Granule feature models)
discovery algorithm at two levels: one in terms of a fitness function for the individual
Cartesian granule features, and the other at the aggregate model level, where Cartesian
granule features of low significance (based on their weights) are eliminated. In the case
of additive Cartesian granule feature models, both the bias and the variance can be drawn
towards their minimum by adding, removing, or altering (granularities, granule
characterisations) the constituent Cartesian granule features, thereby generating models
which tend to generalise better and have a simpler model structure; i.e. Occam's razor,
where, all things being equal, the simplest is most likely to be the best.
The search algorithm plays a big part in the discovery of good Cartesian granule
features. It can influence what parts of the space are or are not evaluated and can be
vulnerable to local minima, starting states and computational constraints. Each state in
the parameter space corresponds to a feature subset and the granularity of the individual
base features, that is, the feature selection and feature abstraction steps are combined.
The size of the finite space of all possible Cartesian granule features for any problem
given a finite number of base features is given by the following equation [Baldwin,
Martin and Shanahan 1998]:
Σ_{gran=MinGran}^{MaxGran} Σ_{dim=1}^{MaxDim} C(NumOfFeat, dim) · (gran)^dim

where C(NumOfFeat, dim) denotes the number of ways of choosing dim base features
from the NumOfFeat available.
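As a worked instance of this formula (the parameter values below are illustrative assumptions, not taken from the text):

from math import comb

def num_cg_features(num_feat, max_dim, min_gran, max_gran):
    """Size of the Cartesian granule feature search space: sum over
    granularities and dimensionalities of C(NumOfFeat, dim) * gran**dim."""
    return sum(comb(num_feat, dim) * gran ** dim
               for gran in range(min_gran, max_gran + 1)
               for dim in range(1, max_dim + 1))

# e.g. 10 base features, dimensionality up to 3, granularities 2..10:
print(num_cg_features(10, 3, 2, 10))   # 380,700 candidate features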
9.3.2 Fitness
The most important and difficult step of genetic programming is the determination of
the fitness function. The fitness function dictates how well a discovered program is able
to solve the problem. The output of the fitness function is used as the basis for selecting
which individuals get to procreate and contribute their genetic material to the next
generation. The structure of the fitness function will vary greatly from problem to
problem. In the case of Cartesian granule feature identification, the fitness function
needs to find Cartesian granule features that provide good class separation (class
corresponds to specific areas of the output variable universe) and that are parsimonious.
Parsimonious features, while providing better transparency, also avoid over-fitting the
data. As a result, when used in fuzzy modelling, these features should yield high
classification accuracy with low computational overhead along with transparent
reasoning. Two functions are proposed for evaluating the fitness of Cartesian granule
feature individuals: fitness based on the semantic separation of concepts and parsimony
promotion; and fitness based on the accuracy of the resulting model on a control dataset
and parsimony promotion.
Discrimination = Min_{k=1..c} [ 1 − Max_{j=1..c, j≠k} Pr(CGFSk | CGFSj) ]        (9-1)

where c corresponds to the number of classes in the current system and Pr(·|·) denotes
point semantic unification.
Parsimony is measured in terms of the dimensionality of the individual and the size
(cardinality) of the individual's universe of discourse. The dimensionality factor
corresponds to the number of base features making up a Cartesian granule feature. The
cardinality of a Cartesian granule feature universe is simply the number of Cartesian
granules in the corresponding universe. During the process of evolution it is important
to promote individuals that have high discrimination, low dimensionality and small
universe size. The latter two desiderata are expressed linguistically using the fuzzy
sets depicted in Figure 9-5. The individual measures are combined in the following
manner:

Fitness = WDis · Discrimination + WDim · SmallDim(Dimensionality) + WUSize · SmallUniv(UniverseSize)        (9-2)

where WDis, WDim and WUSize take values in the range [0, 1] and all weights must sum to
one. Since Cartesian granule features of high discrimination are desirable regardless of
other criteria, WDis tends to take values in the range [0.7, 0.9]. The remaining weight is
split evenly amongst WDim and WUSize. The weights and parsimony-promoting fuzzy sets
(depicted in Figure 9-5) are determined heuristically from trial runs.
Figure 9-5: Fuzzy sets corresponding (a) to small dimensionality and (b) to the small
size of feature universes.
The second fitness function combines the accuracy, on a control dataset, of a model
built using the candidate Cartesian granule feature with the same parsimony measures:

Fitness = WAcc · Accuracy + WDim · SmallDim(Dimensionality) + WUSize · SmallUniv(UniverseSize)        (9-3)

where WAcc, WDim and WUSize take values in the range [0, 1] and all weights must sum to
one. Since Cartesian granule features of high accuracy are desirable regardless of other
criteria, WAcc tends to take values in the range [0.7, 0.9]. The remaining weight is split
evenly amongst WDim and WUSize. The weights and parsimony-promoting fuzzy sets
(depicted in Figure 9-5) are determined heuristically from trial runs.
9.3.4 Reproduction
The reproduction operator is asexual in that it operates on only one individual in the
current population and produces only one individual/offspring in the next generation.
The reproduction operator consists of two steps. First an individual is selected from the
current population according to some selection mechanism based on fitness.
Subsequently, the selected individual is copied, without alteration, from the current
population into the new population. There are many different selection methods based
on fitness. Three of the commonly used methods are fitness-proportionate selection
[Holland 1975; Koza 1992], k-tournament selection and rank selection [Goldberg and
Deb 1991]. To date in this work, both the fitness-proportionate and k-tournament
selection mechanisms have been used; both are described here. The fitness-
proportionate approach uses a selection probability based on the fitness of the
individual. If f(si(t)) is the fitness of an individual si in the population at time t, then,
under fitness-proportionate selection, the probability that the individual si will be
selected (and thus copied into the next generation) is

f(si(t)) / Σ_{j=1}^{M} f(sj(t))
where the denominator corresponds to the sum of all the individual fitnesses in the
current population. K-tournament selection involves selecting k individuals from the
current population on a fitness-proportionate basis; the individual with the highest
fitness amongst the k selected individuals is then copied into the next generation.
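Both selection schemes can be sketched in a few lines (illustrative Python; note that the k-tournament variant described above draws its k entrants fitness-proportionately, which is reproduced as stated).

import random

def fitness_proportionate(population, fitness):
    """Select one individual with probability f(s_i) / sum_j f(s_j)."""
    weights = [fitness(s) for s in population]
    return random.choices(population, weights=weights, k=1)[0]

def k_tournament(population, fitness, k=3):
    """Draw k individuals fitness-proportionately; keep the fittest."""
    entrants = [fitness_proportionate(population, fitness) for _ in range(k)]
    return max(entrants, key=fitness)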
If two individuals contain exactly the same base features and cardinality, then the
Cartesian granule feature with the higher fitness is selected.
Having determined the language of the model (step (vii) of the language identification
algorithm above), the parameters of the model are determined using steps 3-5 in the
G_DACG algorithm. Step 5 of G_DACG can further simplify the language of the
model by removing superfluous Cartesian granule features (through weights learning),
thereby decreasing the additive model's bias and variance. In addition, lowly
contributing features (associated with low weights) can be removed using backward
elimination [Devijver and Kittler 1982], whereby the worst feature in a rule (in terms
of lowest rule weight) is removed. This process is repeated for each rule until the
elimination of a feature results in a model with a severely degraded performance.
Having identified the language and parameters of a model, an additional step, that of
determining the optimal granule characterisation, can further boost the performance
of a model. Section 9.3.6 looks at different ways of generating linguistic partitions. In
addition, Chapter 10 analyses the effect of granule characterisation in the context of
artificial problems. The overall G_DACG algorithm is depicted in block format in
Figure 9-6.
A further step in language identification is that of determining the language of the
output variable, i.e. the linguistic partition of the output variable. This can be
discovered automatically or provided by the expert in the
domain. In the case of the former, the linguistic partition of the output variable's
universe is determined in an iterative manner beginning with a conservative number of
words and iteratively increasing until no improvement in generalisation is achieved.
There are a variety of ways of deciding on characterisations of the granules (as is the
case for input granules); these are examined next.
Any of these clustering techniques will take a training dataset as input and search the
data for structure. The discovered structure is expressed in terms of a list of cluster
centres represented as vectors. These cluster centres can be viewed as corresponding to
concepts in the data. The number of clusters, in this case, is provided by the feature
discovery component of G_DACG, however, this could instead be set by the user or
could autonomously be resolved by the clustering algorithm itself. These centres can
then be used to generate partitions on the variable universes. Figure 9-7 illustrates an
example of how the cluster centres generated by any of the above algorithms could be
utilised in generating a mutually exclusive triangular partition. In this case, cluster
centres x1 and x2 defined over the variable universes ΩPosition and ΩSize are used to
partition each of the universes. Each cluster centre xi corresponds to a vector, where
each vector element xij is a cluster centre on the corresponding universe Ωj. Given the
cluster centres for a particular universe, the methods described in Section 4.1.1.4 can be
used to generate the corresponding partitions and thus the associated linguistic variable.
In the case of discrete universes, values can be grouped together into subsets and
labelled as previously described, or each discrete value can be used directly to form a
partition.
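The construction suggested by Figure 9-7 can be sketched as follows (illustrative Python; the treatment of the universe end-points is a simplifying assumption, since the exact construction is deferred to Section 4.1.1.4).

def triangular_partition(centres, lo, hi):
    """Turn sorted cluster centres on a universe [lo, hi] into mutually
    exclusive triangular fuzzy sets: each centre is a peak whose feet sit
    on the neighbouring centres (or the universe bounds at the ends), so
    adjacent memberships sum to one between peaks."""
    pts = [lo] + sorted(centres) + [hi]
    fuzzy_sets = []
    for i in range(1, len(pts) - 1):
        left, peak, right = pts[i - 1], pts[i], pts[i + 1]
        def mu(x, l=left, p=peak, r=right):
            if l < x <= p:
                return (x - l) / (p - l)
            if p < x < r:
                return (r - x) / (r - p)
            return 0.0
        fuzzy_sets.append(mu)
    return fuzzy_sets

# e.g. centres discovered by clustering on the Position universe [0, 100]:
left_mu, middle_mu, right_mu = triangular_partition([10, 50, 90], 0, 100)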
The formation of partitions has been the subject of many research areas. Some of the
more interesting approaches to generating partitions include ID3 [Quinlan 1986] and its
many fuzzy versions [Jang 1994], which rely on entropy measures to partition the
feature universes. In [Bouchon-Meunier, Marsala and Ramdani 1997] an interesting
partitioning approach is proposed where the morphological operations (from image
processing) of dilation and erosion are employed to grow and shrink regions of input
feature universes that correspond to the same class. These partitioning approaches
would not, however, prove useful in the generation of partitions for Cartesian granule
features, where the words that define the partition structure are used as a means of
describing a class as opposed to being used as part of a decision making rule.
Discrimination_FiClass = 1 − Max_{j=1..C, j≠Class} Pr(CGFSiClass | CGFSj)        (9-4)
wiClass = Discrimination_FiClass / Σ_{j=1}^{m} Discrimination_FjClass        (9-5)
where m corresponds to the number of features in the class rule (this can vary from
class to class, see next section for details).
WiClass = Fitness_Fi / Σ_{j=1}^{m} Fitness_Fj        (9-6)
where Fitness_F; is calculated using either Equation 9-2 or 9-3, and m corresponds to
the number of features in the class rule (this can vary from class to class).
The previous section has shown how to estimate the weights associated with each class
rule feature based on the semantic separation of class fuzzy sets or on fitness. In
addition, the filter was set to the identity filter, but the filter can provide an extra degree
of freedom that can sometimes boost the performance and transparency of a learnt
model. This section shows how optimisation techniques can identify rule weights and
filters that can boost the performance of learnt models, while also shedding new light
on the understandability of the model. The next section introduces an alternative
parameter identification technique based upon the Mass Assignment Neuro Fuzzy
(MANF) framework, where neural network learning algorithms can be applied to learn
the aggregation rules.
The weights identification problem is encoded as follows: each class rule weight Wi is
viewed as a variable that satisfies the constraint 0 ≤ Wi ≤ 1.
The approach begins with estimating the weights by measuring the semantic separation
of the inter class fuzzy sets using semantic discrimination analysis (Section 9.4). Then a
constrained Powell's direction set minimisation (see Figure 9-8 for an outline of
Powell's direction set minimisation technique) is carried out for p iterations or until the
function stops decreasing. Each iteration involves N direction sets (where N = R * W,
and R corresponds to the number of rules in the knowledge base and W denotes the
number of feature weights in the body of each rule Ri), where the initial directions are set to
the unit directions. Note in this case it is assumed that each class rule has equal
numbers of weights W, however, this can vary for each class. In order to evaluate the
cost function for a set of weights, the corresponding additive Cartesian granule feature
model is constructed. The class rule weights are set to the normalised Powell variable
values i.e. the constituent weights for a class rule are normalised so that the weights for
a rule sum to one. The constructed model is then evaluated on the validation dataset. In
this case, the class filters are set to the identity function. Following Powell
minimisation, the weight values, whose corresponding model yielded the lowest error,
are taken to be the result of the optimisation.
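This optimisation step can be sketched using SciPy's implementation of Powell's method (illustrative code; build_model and error are assumed stand-ins for constructing the additive model from a weight vector and measuring its error on the validation dataset).

import numpy as np
from scipy.optimize import minimize

def optimise_weights(initial_weights, build_model, error, validation_data, p=10):
    def cost(w):
        w = np.clip(w, 0.0, 1.0)
        w = w / w.sum() if w.sum() > 0 else w       # rule weights sum to one
        return error(build_model(weights=w), validation_data)
    res = minimize(cost, np.asarray(initial_weights, dtype=float),
                   method="Powell", options={"maxiter": p})
    best = np.clip(res.x, 0.0, 1.0)
    return best / best.sum()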
The filter plays the role of a linguistic quantifier within the evidential logic rule:
determining the conjunctive and disjunctive nature of the rule. This can be more clearly
seen within an evidential logic rule where there are equally weighted terms. Consider
the following filter:
S(x) = { 1   if x = 1
       { 0   otherwise
In this case, the filter yields a rule body that is equivalent to a logic conjunction of
terms i.e. all terms must be satisfied.
Consider instead the following filter:

S(x) = { 1   if x ≥ 1/n
       { 0   otherwise
where n corresponds to the number of body terms. In this case, the filter yields a rule
body that is equivalent to a logic disjunction of terms i.e. only one term is required to
be satisfied. When the weights are not equal, then these interpretations can be modified
to represent weighted conjunction and weighted disjunction interpretations.
In the case of the work presented here, the filter structure is limited to two degrees of
freedom and is canonically defined as follows (see also Figure 9-9):
S(x) = { 0                  if x ≤ a
       { (x − a)/(b − a)    if a < x < b
       { 1                  otherwise

where 0 ≤ a ≤ b ≤ 1.
Figure 9-9: An S-function filter for the evidential logic rule.
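In code, the canonical two-parameter filter is a direct transcription of the definition above:

def s_filter(x, a, b):
    """S-function filter with 0 <= a <= b <= 1: 0 up to a, a linear ramp
    on (a, b), and 1 from b onwards."""
    if x <= a:
        return 0.0
    if x < b:
        return (x - a) / (b - a)
    return 1.0

Setting a = 0 and b = 1 yields S(x) = x, i.e. the identity filter used as the default in step 4 of G_DACG.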
The problem is encoded as follows: each filter degree of freedom (the ai and bi filter
points) is viewed as a variable in the range [0, 1] that satisfies the constraint
0 ≤ ai ≤ bi ≤ 1.
The initial filters are set to the identity filter. Then a constrained Powell's direction set
minimisation (see Figure 9-8 for an outline of Powell's direction set minimisation
technique) is carried out for p iterations (empirical evidence suggests a range of [1, 10]
[Shanahan 1998]) or until the function stops decreasing. Each iteration involves N
(where N = C * 2) direction sets (corresponding to the number of filter variables), where the
initial directions are set to the unit directions. In order to evaluate the cost function for a
set of filters the corresponding additive Cartesian granule feature model is constructed
and evaluated on the validation dataset. Following Powell minimisation, the values
associated with each of the variables, whose corresponding model yielded the lowest
error, are taken as the result of the optimisation and are used to generate the respective
class rule filters.
The more general category of MANF networks may exhibit some of the following
characteristics (the list is not exhaustive):
• Data inputs are fuzzy sets (including Cartesian granule fuzzy sets), results of
semantic unifications, or raw feature data or combinations of these;
• Outputs are fuzzy numbers;
• Weights are fuzzy numbers;
• Weighted inputs of each neuron are aggregated by some other aggregation
operator (evidential logic aggregator, fuzzy integral [Grabisch and Nicolas
1994; Klir and Yuan 1995] etc.) besides summation;
• Probabilistic neurons;
• The network can be represented as a (feed forward) neural network whose
parameters are learned using an algorithm such as the backward propagation
learning algorithm (or a fuzzified version of it).
Some of the above could prove interesting as future directions for the work presented
here. Figure 9-11 depicts a typical architecture of a MANF network utilised in this
work. In general MANF networks accept raw feature values as inputs. Subsequently, in
the case of Cartesian granule features, it linguistically interprets the raw data values and
performs a match (semantic unification) between previously learned classes (expressed
in terms of Cartesian granule fuzzy sets) and the linguistic data value. The results of
semantic unification are then taken as input to the neural net. The neural network then
classifies based on these input values. In the case of a feed forward net, the
classification value corresponding to maximum output node activation is deemed to be
the classification of the input data tuple. In the case of prediction problems, the output
layer would correspond to a single node whose value denoted the output of the network.
Figure 9-12 outlines the main steps involved in learning a MANF network (such as the
network depicted in Figure 9-11) from classified training data. Step 1 (in Figure 9-12)
corresponds to structure identification of Cartesian granule feature models as outlined
in the previous sections of the G_DACG algorithm, while step 2 is equivalent to
parameter identification. Step 3 is a data transformation step and essentially replaces
the raw database values for each training tuple with the results of semantic unification,
which are generated by taking the semantic unification of each class Cartesian granule
fuzzy set given the corresponding data fuzzy set. Step 4 trains the neural network using
the transformed data. Empirical evidence to date suggests that single-layer feed
forward neural networks are sufficient to model complex real world problems [Baldwin,
Martin and Shanahan 1997a; Baldwin, Martin and Shanahan 1997c]. For these
problems a conjugate gradient descent learning algorithm [Moller 1993] was used.
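Step 4 can be illustrated under simplifying assumptions: plain batch gradient descent on a softmax layer stands in below for the scaled conjugate gradient algorithm of [Moller 1993], and X is assumed to already hold the semantic-unification values produced by step 3 (Y holds one-hot class targets).

import numpy as np

def train_single_layer(X, Y, lr=0.1, epochs=500):
    """X: (n, d) semantic-unification inputs; Y: (n, c) one-hot classes.
    Returns the weights and biases that, per Figure 9-13, can be mapped
    back onto evidential logic rules."""
    rng = np.random.default_rng(0)
    W = rng.normal(scale=0.01, size=(X.shape[1], Y.shape[1]))
    b = np.zeros(Y.shape[1])
    for _ in range(epochs):
        z = X @ W + b
        p = np.exp(z - z.max(axis=1, keepdims=True))   # stable softmax
        p /= p.sum(axis=1, keepdims=True)
        grad = (p - Y) / len(X)                        # cross-entropy gradient
        W -= lr * X.T @ grad
        b -= lr * grad.sum(axis=0)
    return W, b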
The weights extracted by the MANF learning algorithm could be used to aggregate the
features in the evidential logic rule and the filter function could be set to the activation
function used by the MANF network (see Figure 9-13). Each evidential logic rule
needs to take into account the biases of the individual neurons by adding an extra
feature that is always satisfied. This mapping allows the use of the MANF framework
for parameter identification of Cartesian granule feature rule based models. This
permits the use of well known and proven algorithms in the field of machine learning
for parameter identification, although the resulting knowledge representation tends to
be less intuitive due to the presence of negative weights and bias terms. However,
recent work in neural networks addressing restricted weights ranges has shown the
usefulness of neural networks as induction algorithms for logic rule based systems
[Bishop 1995; Fletcher and Hinde 1995; Hertz, Anders and Palmer 1991; Hinde 1997].
These ideas should prove useful in extracting more intuitive models using the MANF
framework and should be investigated in future work.
Figure 9-13: The correspondence of evidential logic rules with neural network
neurons. (a) A partially evaluated evidential logic rule; (b) a MANF neuron where the
input values are the probabilities associated with the semantic unifications in the
partially evaluated evidential logic rule (i.e. Pr(CGFSiCLASS | CGFSiDATA)).
The previous sections have described the G_DACG constructive induction algorithm
as a means of learning additive Cartesian granule feature (ACGF) models from data.
Here, for the purposes of illustration, the G_DACG algorithm is applied to a small
artificial problem: the ellipse classification problem. The use of both fitness functions
proposed above is examined.
Figure 9-14: An ellipse in Cartesian space (both axes ranging over [-1.5, 1.5]). Points
in the lightly shaded region satisfy the ellipse inequality and thus are classified as
Legal. Points in the darker region are classified as Illegal.
Table 9-2: G_DACG tableau for the ellipse problem, where fitness was measured using
semantic discrimination and parsimony.
The language identification phase of the G_DACG algorithm was allowed to iterate for
31 generations, or halted earlier if the stopping criterion was satisfied. The stopping
criterion in this case specified that if the best-of-generation model had a classification
accuracy of 100% on the control dataset, then language identification would halt. The
model language was then set to the language of the model, chosen from the best-of-
generation and overall-best models, which had the highest performance on the control
dataset. The parameters of the corresponding model were then determined using steps 3-5
in G_DACG, along with an investigation of which granule characterisation gave the
best accuracy. The use of triangular and trapezoidal (with different degrees of overlap)
fuzzy sets was examined. As alluded to previously, both fitness functions were
5 Here the fuzzy set SmallUniv : [3000:1, 6000:0] can be rewritten mathematically as
follows (denoting the membership value of x in the fuzzy set SmallUniv):
compared in the context of the ellipse problem: fitness based on the semantic separation
of concepts and parsimony promotion; and fitness based on accuracy on the control
dataset and parsimony promotion. The results for both approaches and a brief discussion
are presented subsequently.
Table 9-3: Confusion matrix for the ellipse model presented in Figure 9-15 on the test
dataset. This model yields an accuracy of 98.8% on the test dataset.

Actual\Predicted    Legal    Illegal    Total    %Accuracy
Legal               495      5          500      99.0
Illegal             7        493        500      98.6
Taking a closer look at step 5 of G_DACG, the weights and filter optimisation step, it
can be seen that the performance of the discovered ellipse model improved as a result
of optimisation. Table 9-4 displays the effects of optimisation in terms of the
accuracies of the resulting models on the training, control and test datasets (columns 3,
4, 5 respectively). The results, expressed in terms of model accuracies on training,
control and test datasets, presented in each row correspond to the following models:
(Row 2) for models where the weights were determined using fitness measures (that
were based on semantic discrimination analysis and parsimony) and the filters were set
to the identity filter; (Row 3) for models where the weights were determined using
fitness measures and the tilters were determined using the filter optimisation algorithm
(presented in Section 9.5.2); (Row 4) for models where the filters were set to those
determined in Row 3, and where the weights were optimised using Powell's algorithm
(presented in Section 9.5.1); (Row 5) for models where the weights were set to those
determined in Row 4, and where the filters were re-optimised using Powell's algorithm.
Table 9-4: Model accuracies expressed in terms of training, control and test datasets
at various stages of filter and weights optimisation. The model resulting from
optimising the filters (row 3, in bold) was selected as the output of the G_DACG
algorithm for the ellipse problem on the basis of its superior performance on the
control dataset.
Figure 9-16: Fitness curves for the ellipse problem, where fitness is based on semantic
discrimination analysis and parsimony.
Figure 9-17: Percentage of Cartesian granule features for ellipse problem that were
revisited in each generation on a G_DACG run, where fitness is based on semantic
discrimination analysis and parsimony.
On the other hand, Figure 9-17 presents the variety, by generation, of the evolutionary
search for the G_DACG run that resulted in the above model. This figure shows, by
generation, the progress of one G_DACG run of the ellipse problem between
generations 0 and 30, using two plots: the percentage of new Cartesian granule features
visited in each generation, though the curve (labelled % oj Chromosomes Revisited) is
plotted from the perspective of the number of features that are revisited; and the second
curve displays the chromosome variety in the current population, but this can be
ignored here, since duplicates are not allowed within a population. The number of novel
features in each population decreases steadily as a result of the small scale of this
problem's search space (and population count) and also because of the evolutionary
nature of the search.
Table 9-5: The test confusion matrix for the ellipse model presented in Figure 9-15.
This corresponds to an accuracy of 98.8% on the test dataset.

Actual\Predicted    Legal    Illegal    Total    %Accuracy
Legal               495      5          500      99.0
Illegal             3        497        500      99.4
Figure 9-19: Fitness curves for the ellipse problem over 31 generations, where fitness
is based on control dataset accuracy and parsimony.
9.8 DISCUSSION
The discovery of good, highly discriminating and parsimonious Cartesian granule
features is an exponential search problem that forms one of the most critical and
challenging tasks in the identification of Cartesian granule feature models. Most of this
exponential effort is spent in evaluating individuals, in other words, evaluating the
fitness function. As a result, the fitness function dictates the efficiency of the algorithm.
In this chapter, two fitness functions were proposed, which both promote parsimony,
but differ on the second measure used: one corresponds to feature accuracy
(incorporated in a rule-based model) on a control dataset; and the other on the semantic
separation of classes (concepts) expressed in terms of fuzzy sets over this feature space.
The latter is computationally far more efficient than the former. This becomes obvious
after examining the steps involved in computing both. The fitness function measured in
terms of the accuracy of the model on a control dataset involves the following steps: (i)
learn the class Cartesian granule fuzzy sets for the candidate feature from the training
data; (ii) incorporate these fuzzy sets into a rule-based model; and (iii) evaluate the
model on the control dataset.
Figure 9-20: Percentage of Cartesian granule features for ellipse problem that were
revisited in each generation on a G_DACG run, where fitness is based on control
dataset accuracy and parsimony.
On the other hand, the fitness function measured in terms of the semantic separation of
concepts involves the following steps: (i) learn the class Cartesian granule fuzzy sets for
the candidate feature from the training data; and (ii) compute the semantic separation of
the class fuzzy sets using Equation 9-1.
Step (ii) for the accuracy-based fitness function can be considered almost negligible,
thus both approaches differ in terms of their final steps. From a computational
perspective, the effort involved in calculating the semantic separation of concepts, will
CHAPTER 9: LEARNING CARTESIAN GRANULE FEATURE MODELS 234
in general, be a fraction of the effort required to reason about all the examples in a
control dataset. Computing the semantic separation of c classes requires c·(c−1)
semantic unifications. This is in contrast to N·c semantic unifications
for evaluating a model on a control dataset consisting of N examples. Despite the fact
that the computational effort required to calculate one semantic unification for the
semantic separation-based fitness function will be greater than that of the accuracy-
based approach, the total computational effort of this approach will be much less. This
claim is corroborated by the results on this ellipse problem, where the computational
time required for the G_DACG algorithm run with the semantic separation-based
fitness function is less than that required for a G_DACG run where fitness was
based on accuracy. In both runs all other parameters were the same.
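A back-of-the-envelope comparison of the two costs, counting only the semantic unifications discussed above (the function names are illustrative):

# Sketch: count the semantic unifications needed per individual by each
# fitness function (pure arithmetic; no fuzzy machinery involved).

def unifications_accuracy_based(n_examples, n_classes):
    # Each control example is matched against every class fuzzy set.
    return n_examples * n_classes

def unifications_separation_based(n_classes):
    # Every ordered pair of distinct class fuzzy sets is unified once.
    return n_classes * (n_classes - 1)

# Ellipse problem: c = 2 classes and a control dataset of N = 300 examples.
print(unifications_accuracy_based(300, 2))  # 600 unifications per individual
print(unifications_separation_based(2))     # 2 unifications per individual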
9.9 SUMMARY
In subsequent chapters, the G_DACG algorithm is applied to various artificial and real
world problems. It is also compared to other well known learning techniques and
parallels are drawn between these approaches from knowledge representation and
learning points of view.
9.10 BIBLIOGRAPHY
Almuallim, H., and Dietterich, T. G. (1991). "Learning with irrelevant features." In the
proceedings of AAAI-91, Anaheim, CA, 547-552.
Baldwin, J. F. (1991). "Combining evidences for evidential reasoning", International
Journal of Intelligent Systems, 6(6):569-616.
Baldwin, J. F. (1993). "Evidential Support Logic, FRIL and Case Based Reasoning", Int.
J. of Intelligent Systems, 8(9):939-961.
Baldwin, J. F., Martin, T. P., and Pilsworth, B. W. (1995). FRIL - Fuzzy and Evidential
Reasoning in A.I. Research Studies Press (Wiley Inc.), ISBN 0863801595.
Baldwin, J. F., Martin, T. P., and Shanahan, J. G. (1997a). "Fuzzy logic methods in
vision recognition." In the proceedings of Fuzzy Logic: Applications and
Future Directions Workshop, London, UK, 300-316.
Baldwin, J. F., Martin, T. P., and Shanahan, J. G. (1997b). "Modelling with words
using Cartesian granule features." In the proceedings of FUZZ-IEEE,
Barcelona, Spain, 1295-1300.
Baldwin, J. F., Martin, T. P., and Shanahan, J. G. (1997c). "Structure identification of
fuzzy Cartesian granule feature models using genetic programming." In the
proceedings of IJCAl Workshop on Fuzzy Logic in Artificial Intelligence,
Nagoya, Japan, 1-11.
Baldwin, J. F., Martin, T. P., and Shanahan, J. G. (1998). "System Identification of
Fuzzy Cartesian Granule Feature Models using Genetic Programming", In
IJCAI Workshop on Fuzzy Logic in Artificial Intelligence, Lecture notes in
Artificial Intelligence (LNAI 1566) - Fuzzy Logic in Artificial Intelligence, A.
L. Ralescu and J. G. Shanahan, eds., Springer, Berlin, 91-116.
Kira, K., and Rendell, L. (1992). "A practical approach to feature selection." In the
proceedings of 9th Conference in Machine Learning, Aberdeen, Scotland,
249-256.
Klir, G. J., and Yuan, B. (1995). Fuzzy Sets and Fuzzy Logic, Theory and Applications.
Prentice Hall, New Jersey.
Kohavi, R., and John, G. H. (1997). "Wrappers for feature selection", Artificial
Intelligence, 97:273-324.
Kohonen, T. (1984). Self-Organisation and Associative Memory. Springer-Verlag,
Berlin.
Kononenko, I., and Hong, S. J. (1997). "Attribute selection for modelling", FGCS
Special Issue in Data Mining (Fall):34-55.
Koza, J. R. (1992). Genetic Programming. MIT Press, Massachusetts.
Koza, J. R. (1994). Genetic Programming II. MIT Press, Massachusetts.
Lawrence, S., Burns, I., Back, A., Tsoi, A. C., and Giles, C. L. (1999). "Neural network
classification and prior probabilities", In Tricks of the trade, Lecture notes in
computer science, G. Orr, K. R. Muller, and R. Caruana, eds., Springer-
Verlag, New York, 20-36.
Ljung, L. (1987). System identification: theory for the user. Prentice Hall, Englewood
Cliffs, New Jersey, U.S.A.
Minsky, M., and Papert, S. (1969). Perceptrons: An introduction to computational
geometry. M.I.T. Press, Cambridge, MA.
Moller, M. F. (1993). "A scaled conjugate gradient algorithm for fast supervised
learning", Neural Networks, 6:525-533.
Powell, M. J. D. (1964). "An efficient method for finding the minimum of a function of
several variables without calculating derivatives", The Computer Journal,
7:155-162.
Quinlan, J. R. (1986). "Induction of Decision Trees", Machine Learning, 1(1):86-106.
Shanahan, J. G. (1998). "Cartesian Granule Features: Knowledge Discovery of
Additive Models for Classification and Prediction", PhD Thesis, Dept. of
Engineering Mathematics, University of Bristol, Bristol, UK.
Shanahan, J. G., Baldwin, J. F., and Martin, T. P. (1999). "Constructive induction of
fuzzy Cartesian granule feature models using Genetic Programming with
Applications." In the proceedings of Congress of Evolutionary Computation
(CEC), Washington D.C., 218-225.
Syswerda, G. (1989). "Uniform crossover in genetic algorithms", In Third Int'l
Conference on Genetic Algorithms, J. D. Schaffer, ed., Morgan Kaufmann,
San Francisco, USA, 989-995.
Tackett, W. A. (1995). "Mining the Genetic Program", IEEE Expert, 6:28-28.
CHAPTER 10
ANALYSIS OF CARTESIAN GRANULE FEATURE MODELS
This chapter is organised as follows: The first section describes the format for the
experiments and analyses that are described in subsequent sections. Sections 10.2 and
10.3 provide a detailed analysis of Cartesian granule feature modelling for a
classification problem and for a prediction problem respectively. Section 10.4 describes
the application of Cartesian granule feature modelling to a noisy and sparse problem -
the L problem. Finally, an overall discussion on the application of Cartesian granule
features to these artificial problems is presented in Section 10.5.
The example problems to follow, namely the ellipse problem (Section 10.2) and the
sin(x*y) problem (Section 10.3) contain two base (problem) input variables, namely X
and Y, and one output (predicted or dependent) variable. All problems are sufficiently
small, permitting the examination of a significant portion of the possible Cartesian
granule feature models. The purpose of these experiments is to investigate the impact of
different decision variables on the induced Cartesian granule feature model, most of
which lie within the feature discovery process of the G_DACG algorithm. Models
consisting of Cartesian granule features with various levels of granulation, granule
characterization and feature dimensionality are manually and systematically sampled.
Due to resource constraints (time and computing power), the analysis is limited to the
Cartesian granule features where the underlying abstractions of all base feature
universes (within a single Cartesian granule feature) are equivalent; though for the
investigation into data-driven approaches to partitioning, this assumption is dropped.
The examined model sample space represents only a very small proportion of the
infinite abyss of possible models.
In the case of both problems, the use of both one and two dimensional Cartesian
granule features formed over the problem input features X and Y is examined. The
granularity of the partitions is varied from coarse (few fuzzy sets) to very fine (many
fuzzy sets). The finer the granularity, the better the powers of prediction, although
empirical evidence tends to suggest that there is a threshold on the number of fuzzy
sets, above which no significant gains are made in terms of model accuracy. This
threshold will vary from problem to problem. For the results presented here,
granularities in the interval [2, 20] were considered, bearing in mind that if the
partitioning is too fine, model generalisation will suffer. This is more succinctly stated
in the principle of generalisation [Baldwin 1995]: "The more closely we observe and
take into account the detail, the less we are able to generalise to similar but different
situations... ". The effect of the following granule characterisations is observed:
triangular fuzzy sets; crisp sets; and trapezoidal fuzzy sets with differing degrees of
overlap. As presented previously, different rule structures lead to different Cartesian
granule feature models. Evidential logic rules lead to additive models and conjunctive
rules lead to product models. Both rule structures are examined here. Table 10-1
summarises the decision variables and their respective values that are investigated.
The analysis of the results takes place at two levels: firstly assorted Cartesian granule
features models are compared amongst themselves; and secondly Cartesian granule
features models are compared with the results of other learning approaches, such as
decision trees, neural networks and fuzzy models. This analysis provides a useful platform for
understanding learning algorithms that may or may not explicitly manipulate fuzzy
events or probabilities.
For both problems, the experimental results are presented as follows: first the use of
different two-dimensional feature models is examined (in terms of linguistic partitions
where the fuzzy sets are characterized by triangular, crisp and trapezoidal fuzzy sets);
then the use of various one dimensional feature models is studied; subsequently the
results of other learning approaches are contrasted with those of Cartesian granule
feature modelling; finally each problem section finishes with a discussion of the results.
In the case of the ellipse problem, the impact of using alternative approaches to
generating linguistic partitions is also investigated.
Table 10-1: Decision variables (and possible choices) analysed in the context of
Cartesian granule feature model construction for three artificial problems.
Before presenting the results of the analysis, the ellipse problem is presented in brief
again for convenience. The ellipse problem is a binary classification problem based
upon artificially generated data from the real universe R x R. Points satisfying the
ellipse inequality, x² + y² ≤ 1, are classified as Legal, while all other points are
classified as Illegal. The two domain input features, X and Y, are defined over the
universes Ωx = [-1.5, 1.5] and Ωy = [-1.5, 1.5] respectively. Different training, control
(validation) and test datasets, consisting of 1000, 300 and 1000 data vectors
respectively, were generated using a pseudo-random number stream. An equal number
of data samples for each class were generated. Each data sample consists of a triple <X,
Y, Class>, where Class adopts the value Illegal or Legal.
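A minimal sketch that regenerates datasets of this shape is given below; the book's exact pseudo-random stream is not specified, so a numpy generator is assumed:

# Sketch: generate ellipse-problem data as specified above. The original
# pseudo-random stream is unknown; numpy's default generator stands in.
import numpy as np

def make_ellipse_dataset(n_per_class, rng):
    """Return <x, y, class> triples with equal numbers of Legal
    (x^2 + y^2 <= 1) and Illegal points drawn from [-1.5, 1.5]^2."""
    samples = {"Legal": [], "Illegal": []}
    while min(len(s) for s in samples.values()) < n_per_class:
        x, y = rng.uniform(-1.5, 1.5, size=2)
        label = "Legal" if x * x + y * y <= 1.0 else "Illegal"
        if len(samples[label]) < n_per_class:
            samples[label].append((x, y, label))
    return samples["Legal"] + samples["Illegal"]

rng = np.random.default_rng(0)
train = make_ellipse_dataset(500, rng)    # 1000 training vectors
control = make_ellipse_dataset(150, rng)  # 300 control vectors
test = make_ellipse_dataset(500, rng)     # 1000 test vectors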
The linguistic partitions of the universes of the input variables X and Y are defined in Figure
10-2 (a corresponding graphic depiction is presented in Figure 10-1).
(iv) Generate rule set: These Cartesian granule features and learnt class
fuzzy sets are then incorporated directly into the body of the
respective classification rules. In this case since the model only
consists of one feature, the conjunctive rule and the evidential logic
rule will have equivalent behaviour. However, the evidential logic
rule has another degree of freedom, made available through the filter,
which could allow a more accurate modelling of a problem domain.
For this experiment however, the filter is set to the identity filter, i.e. f(x) = x.
The generated rule set for this problem is presented in Figure
10-3.
[Figure: trapezoidal fuzzy sets (xAround_... granules) plotted over the universe Ωx = [-1.5, 1.5].]
Figure 10-1: A linguistic partition of the variable universe Ωx, where the granules are
characterised by trapezoidal fuzzy sets with 50% overlap.
Figure 10-2: Linguistic partitions Px and Py of the variable universes Ωx and Ωy
respectively, where each granule is characterised by trapezoidal fuzzy sets with 50%
overlap.
unseen test data are plotted in Figure 10-6. For convenience, the top right hand corner
of Figure 10-6 (and of subsequent result graphs) is used to denote the problem being
addressed and the type of Cartesian granule feature model being used to solve it. In the
case of Figure 10-6, the graph presents results for the ellipse problem where the
underlying models consist of one two-dimensional Cartesian granule feature. The
horizontal axis corresponds to the granularity of the base universes and is expressed in
terms of the number of fuzzy sets used. The vertical axis represents the level of
accuracy obtained by the corresponding model. To avoid repetition, it is assumed for
the remainder of this chapter, unless otherwise stated, that result graphs of this type
follow this presentation format. Figure 10-7 shows the ellipse decision boundaries that
were achieved using models where the granularities of the underlying base features
were varied from two to ten. At a granularity level of seven (see Figure 10-7 (f)), the
extracted model starts to fit the ellipse but it is not until a granularity level of about nine
that a good fit is achieved, with an error rate of about 4.8%. Notice that the model
accuracies oscillate (especially in the lower levels). This oscillation is primarily due to
the "lucky fit" of the triangular sets, which have broader support for lower levels of
granularity. This "luck); fit" is more apparent in the case of crisp granules that are
presented subsequently.
Figure 10-3: A possible rule set for the ellipse problem in terms of two-dimensional
Cartesian granule features. See Figure 10-4 for a close-up version of the fuzzy sets in
this model.
Next the use of words that are characterised by trapezoidal fuzzy sets is examined, as a
means of partitioning the base feature universes. This type of linguistic partition is not
mutually exclusive. Again the use of one two-dimensional Cartesian granule feature
formed over the base input features X and Y is explored. The trapezoidal fuzzy sets
SOFf COMPUTING FOR KNOWLEDGE DISCOVERY: INTRODUCING CARTESIAN GRANULE FEATURES 247
were positioned uniformly over the base universes, varying the trapezoidal overlap
factor from 100% overlap to 0% (0% overlap corresponds to a crisp partition). Figure
10-8 depicts the results obtained using linguistic partitions generated by trapezoidal
fuzzy sets with the following degrees of overlap: 100% overlap (curve named T=1.0),
50% overlap (curve named T=0.5) and no overlap (curve named crisp, i.e. T = 0.0).
Again the granularity of the base input feature universes was varied from 2 to 20 fuzzy
sets.
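One plausible construction of such a partition is sketched below, assuming that the overlap factor T shrinks each trapezoid's flat core and spreads the remainder as shoulders shared with its neighbours, so that T = 0 recovers a crisp partition; this parameterisation is an illustration, not necessarily the book's exact one:

# Sketch: uniformly placed trapezoidal fuzzy sets over a universe [lo, hi].
# Assumed parameterisation: overlap T in [0, 1]; T = 0 gives crisp cells,
# larger T widens the sloping shoulders shared with neighbouring granules.

def trapezoidal_partition(lo, hi, granularity, overlap):
    """Return a list of membership functions, one per granule."""
    width = (hi - lo) / granularity
    slope_half = 0.5 * overlap * width           # half-width of each shoulder
    mfs = []
    for i in range(granularity):
        left, right = lo + i * width, lo + (i + 1) * width
        a, b = left - slope_half, left + slope_half    # rising edge
        c, d = right - slope_half, right + slope_half  # falling edge
        def mf(x, a=a, b=b, c=c, d=d):
            if x <= a or x >= d:
                return 0.0
            if b <= x <= c:
                return 1.0                        # flat top of the trapezoid
            return (x - a) / (b - a) if x < b else (d - x) / (d - c)
        mfs.append(mf)
    return mfs

# Example: granularity 6 over the ellipse universe [-1.5, 1.5], 50% overlap.
partition = trapezoidal_partition(-1.5, 1.5, 6, 0.5)
print([round(mf(0.1), 2) for mf in partition])   # memberships sum to 1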
[Figure: membership surfaces, panels (a) and (b); membership values range over [0, 1].]
Figure 10-4: Graphic representation of (a) Legal and (b) Illegal Cartesian granule
fuzzy sets where each grid point corresponds to a Cartesian granule and its associated
membership.
[Figure: decision boundary plot; trapezoidal granules, T = 0.5, granularity 6.]
Figure 10-5: Decision boundary for the ellipse problem using a two-dimensional
Cartesian granule feature model, where the base feature universes were partitioned
using six uniformly placed trapezoidal fuzzy sets with 50% overlap.
[Figure: % accuracy (60-100) against granularity in terms of fuzzy sets (2-20); curve: Triangular.]
Figure 10-6: Classification results for the ellipse problem using one 2D Cartesian
granule feature, where triangular fuzzy sets were used to partition the base features.
In general, the use of fuzzy sets as a means of linguistically quantising the base feature
universes gives better results than obtained using crisp sets. The results shown in Figure
10-8 empirically support this claim. The decision boundaries of models using crisp
Cartesian granule features lie along the boundaries of the linear crisp granules and thus
it becomes more difficult to model problems other than those with a stepwise linear
decision boundary. Decision tree approaches (ID3/C4.5 [Quinlan 1986]) yield similar
piecewise linear boundaries. This is clearly depicted in Figure 10-9 where the decision
boundaries of various learnt models that use crisp granules are presented. Nevertheless,
as the granularity increases, the Cartesian granules will better fit the surface boundary
for the problem, thereby reducing the model error. But with this increased model
accuracy comes a high complexity cost, which may prove intractable in more complex
systems, and may lead to overfitting.
[Figure 10-7: decision boundaries for the ellipse problem at base feature granularities from two to ten, panels (a)-(i).]
[Figure: % accuracy (40-90) against granularity in terms of fuzzy sets (2-20); curves include T=0.5.]
Figure 10-8: Classification results for the ellipse problem using one 2D Cartesian
granule feature, where the base feature universes have been partitioned using
trapezoidal fuzzy sets with various degrees of overlap.
[Figure: stepwise decision boundaries, panels (a) and (b).]
Figure 10-9: (a) Decision boundary for the ellipse problem using a two-dimensional
Cartesian granule feature model, where the base feature universes were partitioned
using 3 crisp sets; and (b) with 10 crisp sets.
Figure 10-11 and Figure 10-12 present the model classification accuracies obtained
using different two-dimensional Cartesian granule features with varying base feature
granularities where the granules are characterised by trapezoidal fuzzy sets with
different degrees of overlap, ranging from 0% overlap (curve named crisp i.e. T = 0.0)
to 100% overlap (curve named T = 1.0). In Figure 10-14, graphs (a)-(k) illustrate the
effect of the overlap rate on the decision boundary. These are contrasted with the
decision boundary generated by a model where the granule characterisation is a
triangular fuzzy set as depicted in Figure 10-14(l). In general, for the ellipse problem,
granules characterised by trapezoids with overlapping degrees of between 50% and
70% yield models that fit the ellipse adequately (i.e. error rates in terms of misclassified
area of around 3%) with very few words (five words) used in the linguistic partition of
the base feature universes. Figure 10-10 depicts a model with accuracy of 98% using
seven words that are characterised by trapezoidal fuzzy sets with an overlap degree of
60%. As Figure 10-10 depicts, the misclassified areas correspond to false positive areas
for the ellipse class. This is one of the best results obtained using relatively
parsimonious/succinct linguistic partitions (well inside Miller's magic number of
7 ± 2 concepts [Miller 1956]). Furthermore, when compared with triangular-based
partitions, the use of trapezoidal-based partitions tends to yield models which are more
parsimonious and which better fit the problem. Figure 10-13 contrasts the results
obtained using models that use trapezoidal-based partitions with overlap rates of 0%
(crisp case) and 50%, with models that use triangular-based partitions.
Figure 10-10: Decision boundary for the ellipse problem using a two-dimensional
Cartesian granule feature model, where the base feature universes were partitioned
using 7 trapezoidal fuzzy sets with an overlap rate of 60%.
[Figure: % accuracy (40-100) against granularity in terms of fuzzy sets (2-20); curves include T=0.2, T=0.1 and Crisp.]
Figure 10-11: Classification results for the ellipse problem using two-dimensional
Cartesian granule features where the base feature universes are partitioned with
trapezoidal fuzzy sets with various degrees of overlap, ranging from 0% (curve named
crisp) to 50% (curve named T=0.5).
The two-dimensional features presented here represent only a very small proportion of
the abyss of possible two-dimensional features. For example, it is possible to use
features in which the base attribute universes could have been partitioned with different
types of fuzzy set, different numbers of fuzzy sets, and data centred partitioning.
[Figure: % accuracy (40-100) against granularity in terms of fuzzy sets (2-20); curves include T=0.7.]
Figure 10-12: Classification results for the ellipse problem using two-dimensional
Cartesian granule features where the base feature universes were partitioned with
trapezoidal fuzzy sets of various overlapping degrees (from 50% to 100%).
[Figure: % accuracy against granularity in terms of fuzzy sets (2-20); curves: T=0.5, Crisp and FuzzyTri.]
Figure 10-13: Comparison of classification results for the ellipse problem using two-
dimensional Cartesian granule features where the base feature universes are
partitioned with triangular and trapezoidal fuzzy sets.
[Figure: decision boundary panels, including (c) Trapezoid - 0.2 and (d) Trapezoid - 0.3.]
Figure 10-14: A montage of decision boundaries for the ellipse problem using an
assortment of two-dimensional Cartesian granule feature models, where the base
feature universes were partitioned with a granularity of five as follows: (a) - (k)
Trapezoidal fuzzy sets where the degree of overlap varies from 0 to 100% in steps of
10%; (l) Triangular fuzzy sets.
[Figure: % accuracy (40-100) against granularity in terms of fuzzy sets (2-20); curves include Triang and Crisp.]
Figure 10-15: Comparison of classification results for the ellipse problem using two
one-dimensional Cartesian granule features where the base feature universes are
partitioned with triangular and trapezoidal fuzzy sets. The Cartesian granule features
are combined using the evidential logic rule.
The use of two one-dimensional Cartesian granule features where the underlying
granules are characterised by trapezoidal fuzzy sets is now examined. The trapezoidal
fuzzy sets were distributed uniformly over the base universes, varying the trapezoidal
overlap factor from 100% overlap to 0% (i.e. a crisp partition). A granularity range of
[2, 20] was investigated with uniformly positioned trapezoidal fuzzy sets with varying
overlap. A subset of the results is presented and is restricted to the following types of
granules: trapezoids with the best overlap rate; crisp granules; and trapezoidal granules
with 100% overlap. This should give some indication of the accuracies attainable with
different degrees of overlap. Figure 10-15 presents results where the evidential logic
rule structure was used as a means of combining the supports of the individual
Cartesian granule features. The classification results plotted correspond to models
where the underlying granules were characterised by trapezoidal fuzzy sets with an
overlap degree of 100% (curve named T=1.0), with an overlap degree of 30% (curve
named T=0.3), and no overlap (curve named crisp i.e. T = 0.0). The extracted
evidential logic rule models once again outperform their conjunctive counterparts,
yielding results in the low 90s (see Figure 10-18 for a comparison and the next
paragraph for an explanation). This is due primarily to the fact that the Y based
Cartesian granule feature is more discriminating than the X based feature, as the ellipse
is horizontally oblong. Consequently, this Cartesian granule feature is given a higher
weight (via semantic discrimination analysis) within the evidential reasoning process,
resulting in better model accuracies than the conjunctive rule structure that treats the
features as equally important. Figure 10-17 gives an indication of the nature of the
decision boundary generated by evidential and conjunctive rule structures in this case.
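The contrast between the two rule structures can be sketched as follows, with the weights standing in for those produced by semantic discrimination analysis (the numbers below are made up for illustration):

# Sketch: combining per-feature supports for a class under the two rule
# structures (additive vs product). Weights and supports are illustrative.
import math

def evidential_support(feature_supports, weights):
    # Additive model: weighted sum of the individual feature supports.
    return sum(w * s for w, s in zip(weights, feature_supports))

def conjunctive_support(feature_supports):
    # Product model: all features treated as equally important.
    return math.prod(feature_supports)

# Ellipse example: the Y-based feature is the more discriminating one,
# so it receives the higher weight.
supports = [0.6, 0.9]                               # supports from CGF_X, CGF_Y
print(evidential_support(supports, [0.35, 0.65]))   # 0.795
print(conjunctive_support(supports))                # 0.54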
Figure 10-16: Decision boundaries when (a) the evidential logic rule and (b) the
conjunctive rule are used as a means of combining one-dimensional Cartesian granule
features for the ellipse problem. A granularity level of 11 was used on each base
feature universe. The granules were characterised by triangular fuzzy sets.
[Figure: decision boundary plots, panels (a) and (b).]
Figure 10-17: (a) Decision boundary when an evidential logic rule is used as a means
of combining one-dimensional Cartesian granule features for the ellipse problem. A
granularity level of 10 was used on each base feature universe. The granules were
characterised by trapezoidal fuzzy sets with an overlap degree of 30%. (b) Decision
boundary when a conjunctive rule is used and the granules were characterised by
trapezoidal fuzzy sets with an overlap degree of 20%.
[Figure: % accuracy (40-100) against granularity in terms of fuzzy sets (2-20); curves: EL T=0.3, Con T=0.2, EL Triang and Con Triang.]
Figure 10-18: A comparison of using the evidential logic rule vs. the conjunctive rule
as a means of combining one-dimensional Cartesian granule features for the ellipse
problem, where the base feature universes are partitioned with triangular and
trapezoidal fuzzy sets.
[Figure: % accuracy (40-100) against granularity in terms of fuzzy sets (2-20); curves: Uniform T=0.5, Crisp, Triang and T=1.0.]
The use of data-centred clustering approaches to partitioning the base feature universes is examined here. The cluster centres for each class were
generated independently (homogeneous clustering) using the FCM clustering
algorithm. These cluster centres were then used to generate mutually exclusive
triangular based partitions of the base universes. Table 10-2 presents some of the more
interesting results obtained using two-dimensional Cartesian granule features where the
base feature universes were partitioned as described above. The performance of the
models using multidimensional clustering compares very favourably to models that use
uniformly partitioned features (compare columns 3 and 4 in Table 10-2). Forming
Cartesian granule features using multi-dimensional clustering can lead to close-lying
cluster centres when the multi-dimensional cluster centres are projected on the
individual universes. Consequently, the next variation in the approach is to merge
close-lying cluster centres.
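A sketch of this variation is given below; the projected cluster centres are assumed to come from an FCM run performed elsewhere, and the merge threshold is illustrative:

# Sketch: build a mutually exclusive triangular partition from projected
# cluster centres, merging centres that lie closer than a threshold.

def merge_close_centres(centres, min_gap):
    centres = sorted(centres)
    merged = [centres[0]]
    for c in centres[1:]:
        if c - merged[-1] < min_gap:
            merged[-1] = 0.5 * (merged[-1] + c)   # replace the pair by its midpoint
        else:
            merged.append(c)
    return merged

def triangular_partition(centres, lo, hi):
    """Triangles peak at each centre and reach zero at the neighbouring
    centres, so memberships sum to one between adjacent peaks."""
    pts = [lo] + sorted(centres) + [hi]
    def make_mf(l, peak, r):
        def mf(x):
            if x <= l or x >= r:
                return 0.0
            return (x - l) / (peak - l) if x <= peak else (r - x) / (r - peak)
        return mf
    return [make_mf(pts[i - 1], pts[i], pts[i + 1])
            for i in range(1, len(pts) - 1)]

# Centres -0.1 and 0.0 lie within the gap and are merged before partitioning.
mfs = triangular_partition(merge_close_centres([-0.9, -0.1, 0.0, 0.8], 0.2),
                           lo=-1.5, hi=1.5)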
[Figure: % accuracy (86-100) for each type of fuzzy set used; curves: Uniform, 1D HeteroClustering and 1D HomogClustering.]
Figure 10-20: Ellipse classification using Cartesian granule features where the
underlying feature partitions are generated using uniform and various clustering
approaches. The granularity of the feature universe partition was fixed at seven.
Other forms of pruning are also possible, but are not discussed here, such as logical
merging of granules. For example, neighbouring granules (in the projected one-
dimensional sense) that exhibit similar membership levels could be merged. Similarly,
modified entropy algorithms as used in decision tree pruning [Quinlan 1993] could be
used here to logically merge neighbouring granules. Pruning in this way is an example
of how to exploit the tolerance for imprecision and uncertainty, while achieving
tractability, robustness and low solution cost, one of the guiding principles of soft
computing [Zadeh 1994].
Table 10-3: Classification results using Cartesian granule features CGFxy based upon
two-dimensional clustering partitioning (after pruning).
yield higher levels of accuracy (99%) and transparency. See Section 9.7 for more
details.
Table 10-4: Classification results using the data browser on the ellipse problem.
Fril Rule Type    % Accuracy    Decision Surface Figures
Conjunctive       93.5          Similar to Figure 10-21 (a)
Evidential        94            Figure 10-21 (a)
[Figure: decision boundary plots, panels (a) and (b).]
Figure 10-21: (a) Ellipse decision boundary using data browser generated rules and
fuzzy sets with no smoothing; (b) ellipse decision boundary using data browser
generated rules and fuzzy sets with smoothing.
vector. The decision boundaries generated by the neural network models presented in
Table 10-5 are depicted in a series of graphs; the details of which are given in the
column entitled "Decision boundary figures". The neural network performs very well
in modelling this problem but it does require at least three hidden nodes in order to
yield good classification accuracy.
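For reproduction purposes, a comparable network can be sketched with scikit-learn's MLPClassifier standing in for the simulator used in the book; the training setup below is illustrative, not the original:

# Sketch: a small feedforward network for the ellipse problem, using
# scikit-learn's MLPClassifier as a stand-in for the networks used here.
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X = rng.uniform(-1.5, 1.5, size=(1000, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 <= 1.0).astype(int)   # 1 = Legal

# At least three hidden nodes are needed for a good fit (see above);
# fewer nodes cannot bend the decision boundary around the ellipse.
net = MLPClassifier(hidden_layer_sizes=(3,), max_iter=2000, random_state=0)
net.fit(X, y)
print(net.score(X, y))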
[Figure: neural network decision boundaries, panels (a), (b) and (c).]
In the case of the Cartesian granule features paradigm, models were constructed
automatically (using the G_DACG algorithm) and semi-automatically (the language
space was sampled manually and the model parameters identified automatically). The
latter formed a basis for evaluating models consisting of Cartesian granule features with
different levels of granulation, granule characterisation and feature dimensionality. Due
to resource constraints (time and computing power), this analysis was limited to
Cartesian granule features where the underlying abstractions of the base feature
universes were equivalent, though for the investigation into data-driven approaches to
partitioning, this assumption was dropped. This sample space represents only a very
small proportion of the infinite abyss of possible models. The following are the main
findings of these experiments on the ellipse problem:
Despite the uncomplicated nature of the ellipse classification problem, it does serve to
illustrate some of the key differences between the Cartesian granule feature approach
and other supervised machine learning techniques. All of the approaches examined here
do very well in modelling the ellipse problem. Table 10-6 presents a summary of some
of the best results achieved using these approaches. From a generalisation perspective,
the composed approaches, such as the multidimensional Cartesian granule feature
models, MATI models, and neural network models, perform better than the approaches
that rely on total decomposition, such as the single dimensional Cartesian granule
feature approaches and data browser approaches. From a model complexity perspective,
the Cartesian granule feature models and the associated reasoning and inference
procedure are glass-box/transparent in nature, and relatively easily interpreted. The data
browser and MATI algorithms provide similar transparency of representation and
inference. The multi-layer perceptron based models, in addition to their high degree of
parameterisation, also have the disadvantage that the mapping they approximate is
embodied in the weights and biases matrices and thus, the approximation may not be
amenable to inspection or analysis except in simple cases.
Using this example, it is easy to see the parallels between product Cartesian granule
feature models and naïve Bayes classifiers (see Sections 5.2.2 and 7.5.2.2 for an
overview of naïve Bayes). The use of crisp one-dimensional Cartesian granule features
incorporated into product rules yields a model that is equivalent to a naïve Bayes
classifier under certain conditions, even though at the surface level, the models and
inference strategies look very different, with Cartesian granule feature models being
represented by fuzzy sets and probabilistic rules, and naïve Bayes classifiers being
represented by conditional probabilities and a class prior. Both models yield the same
results when the class priors are uniform and the distribution of data amongst Cartesian
granules is uniform. See [Shanahan 2000] for further details of this comparison. A
possible new approach to learning is to use a naïve Bayesian approach where the events
are no longer precise but fuzzy granular.
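The correspondence can be sketched numerically: with crisp granules, the product rule's class score is exactly the naïve Bayes likelihood term, so under uniform priors the two classifiers rank classes identically (the probabilities below are toy values):

# Sketch: product Cartesian granule feature model vs naive Bayes on crisp
# 1-D granules. Toy P(granule | class) values; names are illustrative.
p = {
    "Legal":   {"x_mid": 0.7, "y_mid": 0.8},
    "Illegal": {"x_mid": 0.2, "y_mid": 0.1},
}
priors = {"Legal": 0.5, "Illegal": 0.5}        # uniform class priors

def product_cgf_score(cls, granules):
    score = 1.0
    for g in granules:
        score *= p[cls][g]                     # product of granule probabilities
    return score

def naive_bayes_score(cls, granules):
    return priors[cls] * product_cgf_score(cls, granules)

obs = ["x_mid", "y_mid"]                       # granules the data point falls in
for cls in p:
    # With uniform priors, each naive Bayes score is half the CGF score,
    # so both models select the same class.
    print(cls, product_cgf_score(cls, obs), naive_bayes_score(cls, obs))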
The previous section has examined and compared the effectiveness of Cartesian granule
features in modelling classification systems, that is, systems where the dependent
output variable is discrete in nature. Here, however, prediction problems are addressed
where the dependent output variable is continuous in nature. This study investigates the
effectiveness with which Cartesian granule features can model a non-linear static
system; in this case, in terms of a small artificial problem - the function sin(X * Y). The
sin(X * Y) function (nicknamed the swan's neck) has two base input variables, X and Y,
and is graphically depicted in Figure 10-23. The considered domain for both the X and
Y variables is [0, 3]. Different training, control (validation) and test datasets, consisting
of 529 (in grid fashion), 600 (generated randomly) and 900 (in grid fashion) data
vectors respectively, were generated. Each data sample consists of a triple <X, Y,
sin(X * Y)>.
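A sketch that regenerates datasets of these shapes follows (23 x 23 = 529 and 30 x 30 = 900 grid points; the random stream for the control set is assumed):

# Sketch: regenerate datasets of the stated shapes for the sin(X * Y)
# problem. The book's random stream is unspecified; numpy stands in.
import numpy as np

def grid_dataset(n_per_axis, lo=0.0, hi=3.0):
    xs = np.linspace(lo, hi, n_per_axis)
    X, Y = np.meshgrid(xs, xs)
    return np.column_stack([X.ravel(), Y.ravel(),
                            np.sin(X.ravel() * Y.ravel())])

rng = np.random.default_rng(0)
train = grid_dataset(23)                        # 529 vectors, grid fashion
ctrl_xy = rng.uniform(0.0, 3.0, size=(600, 2))  # 600 random control vectors
control = np.column_stack([ctrl_xy, np.sin(ctrl_xy[:, 0] * ctrl_xy[:, 1])])
test = grid_dataset(30)                         # 900 vectors, grid fashion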
[Figure 10-23: surface plot of sin(X * Y) over [0, 3] x [0, 3]; z-axis from -1 to 1.]
Firstly, the results obtained using two-dimensional Cartesian granule features are
examined. Figure 10-24 summarises the results obtained when the output universe was
partitioned with five uniformly placed mutually exclusive triangular fuzzy sets, and the
input space consisted of a two-dimensional Cartesian granule feature, where the
underlying granules are characterised by the following types of fuzzy sets: triangular
fuzzy sets (curve named Triang); trapezoidal fuzzy sets with an overlap rate of 10%
(curve named T=0.1); trapezoidal fuzzy sets with an overlap rate of 40% (curve named
T=0.4); and trapezoidal fuzzy sets with an overlap rate of 100% (curve named T=1.0).
Other degrees of overlap were also investigated (20%, 30%, 50%, 60%, 70%, 80% and
90%), however, an overlap of 40% gave the best results in terms of accuracy and
transparency (granularity). More specifically, the use of granules characterised by
trapezoidal fuzzy sets with an overlap rate of 40% outperformed the other granule
characterisations examined.
[Figure: % RMS error (0-30) against granularity in terms of fuzzy sets (2-20); curves: T=1.0, Triang, T=0.1 and T=0.4; panels (a) and (b).]
Figure 10-26 summarises the results obtained, where the output universe is partitioned
with six uniformly placed mutually exclusive triangular fuzzy sets and the input space
consists of a two-dimensional Cartesian granule feature, where the underlying granules
are characterised by the following types of membership functions: triangular fuzzy sets
(curve named Triang); trapezoidal fuzzy sets with an overlap rate of 10% (curve named
T=0.1); trapezoidal fuzzy sets with an overlap rate of 40% (curve named T=0.4); and
trapezoidal fuzzy sets with an overlap rate of 100% (curve named T=1.0). Overall, the
use of granules characterised by trapezoidal fuzzy sets with an overlap degree of 40%
for two-dimensional Cartesian granule features outperformed other types of granule,
yielding an RMS error level which tends towards 2.75% as the granularity is increased
to twenty.
[Figure: % RMS error (0-30) against granularity in terms of fuzzy sets (2-20); curves: T=1.0 and Triang.]
Figure 10-27 gives an overall summary of the results obtained where the output
universe was partitioned using uniform and percentile base approaches with different
levels of granularity. In this graph, only the type of input partition (characterized by the
shape of the fuzzy set used) yielding the best results (best average accuracy for
granularities in range [2,20]) for the corresponding output partition type are presented.
The curves graphed here correspond to the following types of Cartesian granule feature
model: the output space was partitioned with 5 uniformly placed triangular fuzzy sets
and input granules were characterised by trapezoidal fuzzy sets with an overlap rate of
40% (curve named T=0.4(5, UT)); the output space was partitioned with 6 uniformly
placed triangular fuzzy sets and input granules were characterised by trapezoidal fuzzy
sets with an overlap rate of 40% (curve named T=0.4(6, UT)); the output space was
partitioned on a percentile basis with mutually exclusive triangular fuzzy sets and input
granules were characterised by triangular fuzzy sets (curve named Triang=(6, PT));
and the output space was partitioned with 7 uniformly placed triangular fuzzy sets and
input granules were characterised by trapezoidal fuzzy sets with an overlap rate of 40%
(curve named T=0.4(7, UT)). The use of trapezoidal fuzzy sets to partition the output
universe was also examined but does not yield any significant performance
improvement over the results presented previously for triangular based fuzzy sets. The
two-dimensional model, where the output space was partitioned with 6 uniformly
placed triangular fuzzy sets and input granules were characterised by trapezoidal fuzzy
sets with an overlap rate of 40% (curve named T=0.4(6, UT)), performed best overall in
modelling the sin(X * Y) problem (in this battery of experiments).
[Figure: % RMS error (0-30) against granularity in terms of fuzzy sets (2-20); curves: Triang(6, PT), T=0.4(6, UT) and T=0.4(7, UT).]
datasets that were used in Cartesian granule feature modelling. This is followed in the
next section by a discussion where these approaches are analysed and compared with
the use of Cartesian granule feature based models. The sin(X * Y) problem proves to be
a particularly difficult prediction problem for most supervised learning approaches.
Table 10-7: RMS error results using the data browser on the sin(X * Y) problem.

Fril Rule Type    Granularity of Output Variable    % RMS Error    Decision Surface Figures
Conjunctive       6                                 24             Similar to Figure 10-28
Evidential        6                                 23.6           Similar to Figure 10-28
Evidential        9                                 23.48          Figure 10-28
Evidential        12                                23.6           Similar to Figure 10-28
Figure 10-28: Sin(X * Y) decision surface generated by a data browser induced model
using nine percentile-positioned triangular fuzzy sets in the output universe. The RMS
error of this model is 23.48%.
problem. Table 10-8 presents the results obtained when perceptrons with hidden layers
of different sizes were used to model the sin(X * Y) problem. In this case, the output
node corresponds to the predicted sin(X * Y) value. The two-layered neural network
performs very well in modelling the sin(X * Y) problem, attaining RMS errors of less
than 2% with eight hidden nodes. The number of training epochs (one epoch
corresponds to presentation of all the training data) is also presented in Table 10-8.
Table 10-8: RMS error for various feedforward neural networks for Sin(X * Y).
In the case of the Cartesian granule features modelling, models consisting of input
Cartesian granule features with various levels of granulation, granule characterisation
and feature dimensionality were systematically sampled. Various linguistic partitions of
the output universe were also investigated. The following are the main findings of these
experiments on the sin(X * Y) problem:
Overall, modelling approaches which use total decomposition, including the one-
dimensional Cartesian granule feature models and the data browser, suffer from large
decomposition errors. A results summary of each of the examined approaches is
presented in Table 10-9. On the whole, additive Cartesian granule feature models lead
to relatively high levels of accuracy for this problem while also providing moderate
levels of model transparency. For prediction problems in general, using data-centred
partitioning (e.g. clustering) may yield a more natural partition of the base feature
universes by focussing on where the data lies rather than covering the whole universe
uniformly.
Table 10-9: Results summary for Sin(X * Y) prediction problem using various
supervised learning approaches.
Step 2 of the G_DACG algorithm (Section 9.1.1) is concerned with decomposing the
input feature space into low order relationships between small clusters of semantically
related variables (dependent variables), which are subsequently modelled with
Cartesian granule features. This decomposition is necessary on a number of grounds
including generalisation, transparency and computational tractability. This section provides a
motivational example as to why decomposition is necessary from a generalisation
perspective.
quadruplet <A, B, C, D> corresponds to the mask elements (Figure 10-30) and each
element takes the value 1 if the corresponding pixel is black and 0 if the pixel is white.
These patterns are illustrated in Figure 10-32 and Figure 10-33. The training set
consists of patterns that are not entirely sufficient to discriminate between positive and
negative examples; some patterns occur as both positive and negative examples. Within
this problem domain it is known that errors in data communication can occur, thus
resulting in patterns of the form presented in Table 10-10 and Figure 10-34. These
patterns correspond to patterns that can only occur as a result of communication errors.
The classification of these patterns is unknown. The patterns with known classifications
are used to train models that provide generalised classification for these unclassified
cases.
[Figure: the L mask.]
Figure 10-30: L mask used for sensing the environment.
[Figure: a sender transmitting 4-bit mask patterns to a receiver/L-classifier.]
Figure 10-31: L-problem definition.
dimensionality are not mixed in the same model. In each case, the Cartesian granule
features were combined using the evidential logic rule and the weights were determined
using semantic discrimination analysis (see Section 9.4). Table 10-12 presents the
results achieved using these various models. The test results are compared with a
Bayesian model generated using the assumption that an error model exists [Baldwin
1996b]. The following observations regarding these results can be made:
• Although the results are not presented here, mixing Cartesian granule features
of different dimensionality in a model (which includes either 2D or 3D
features) does not improve on those achieved by the three-dimensional model.
• The MATI probabilistic decision tree induction algorithm gives similar results
to the three-dimensional Cartesian granule feature model [Baldwin 1996a].
In this section, the L example problem has highlighted the need for discovering
structural decomposition in order to generate Cartesian granule feature models that
provided good generalisation and knowledge transparency. This approach to structural
decomposition parallels other modelling techniques such as Bayesian networks [Good
1961; Lauritzen and Spiegelhalter 1988] and ANOVA [Efron and Stein 1981; Friedman
1991]. The Cartesian granule feature decomposition can be viewed as a linguistic
functional decomposition [Shanahan 1998].
Table 10-12: L problem classification results using various learning algorithms, where
G corresponds to Good, U to Uncertain, B to Bad and N/A to not applicable. The
training results are presented as a triple consisting of the number of correctly classified
tuples, the number of tuples classified as uncertain (i.e. each classification rule returns
the same level of support), and the number of misclassified examples. The test results
are presented as a comparison with the results given by a Bayes classifier.
5    0110    G  U  U  G
6    1100'   G  U  U  U  U  U  U
7    0100'   G  U  G  G  U  G  U
8    1000'   G  U  B  B  U  B  U
9    0111    B  U  B  B  B  B  B
10   0010    B  U  B  B  B  B  B
11   0010    B  U  B  B  B  B  B
12   0001    U  G  G  G  G  G
13   1001    U  U  B  B  B  B  B
14   1100'   B  U  U  U  U  U  U
15   1000'   B  U  B  B  U  B  U
16   0100'   B  U  G  G  U  G  U

Test Set
1    1010    U  U  G  G  U  G  G
2    0101    U  U  B  B  U  B  B
3    1101    U  U  B  B  U  B  B
4    1110    U  U  G  G  U  G  G
5    1111    U  U  U  U  U  U  U
6    0011    U  U  U  U  U  U  U

Train accuracy (Correct/Uncertain/Incorrect)
N/A   0/16/0   8/4/4   10/2/4   8/6/2   10/2/4   8/2/6

Test Results (Correspondence with Bayes)
N/A   33%   100%   100%   33%   100%   N/A
The previous sections have demonstrated the application of Cartesian granule feature
modelling in the context of artificial classification and prediction problems. This
section details some general comments on the use of Cartesian granule features.
In summary, the results presented in this chapter support the following argument:
approaches that rely on total decomposition, that is, ignore the problem structure (such
as one-dimensional Cartesian granule features, the data browser and naïve Bayes) will
not, in general, perform as well as approaches that focus on modelling the problem
structure (multidimensional Cartesian granule feature models, neural networks and
Bayesian networks). Cartesian granule feature modelling, as personified by the
G_DACG algorithm, searches for structure in terms of a network of low-order
semantically related or dependent features.
Fuzzy sets are a more desirable characterisation of granules than crisp sets. Firstly,
models which employ fuzzy set characterisations of granules will, in general, require a
lower granularity. This lower granularity will tend to lead to better generalisation.
Secondly, fuzzy set based models, due to the interpolative nature of smooth fuzzy sets,
give a much more flexible decision boundary/surface (i.e. not piecewise linear),
whereas the use of crisp sets or fairly crisp sets (fuzzy sets with low degrees of overlap)
yields decision boundaries which are stepwise in nature. Thirdly, models based upon
crisp granules tend to be very sensitive to the location of granule boundaries,
sometimes yielding discontinuous behaviour when the boundaries are changed,
whereas the use of fuzzy granules tends to be more robust in this respect. Finally,
empirical evidence presented here corroborates that granules which are characterised
by fuzzy sets give accurate models that are more succinct than their crisp counterparts.
The use of fuzzy granules facilitates the expression of both classification and prediction
induction algorithms in a single, coherent framework, such that classification problems
can be viewed as a special case of the more general prediction problem, where each
output classification value is interpreted as a crisp classification.
This chapter has concentrated mainly on the analysis of learnt Cartesian granule feature
models, for both classification and prediction problems, under various conditions. For
the selected problems, the space of possible models was systematically sampled
examining the effect of the following on the resulting model: different linguistic
partitions of input variable universes; the feature dimensionality of the Cartesian
granule features; the type of rule used to aggregate; and different linguistic partitions of
the output variable's universe (in the case of prediction problems). This analysis
provides insights on how to model a problem using Cartesian granule features. It also
serves as a means of comparing this approach with other well known learning
paradigms. In general, the learnt Cartesian granule feature based models performed as
well and in some cases outperformed other well-known learning approaches.
Furthermore, this chapter has provided a useful platform for understanding many other
learning algorithms that may or may not explicitly manipulate fuzzy events or
probabilities. For example, it was shown how a naïve Bayes classifier is equivalent to
crisp Cartesian granule feature classifiers under certain conditions. Other parallels were
also drawn between learning approaches such as decision trees and the data browser.
As a result of this analysis, an extension to the naïve Bayesian approach from crisp
events to fuzzy events is proposed.
Overall, Cartesian granule features open up a new and exciting avenue in probabilistic
fuzzy systems modelling which allows not only the ability to compute with words but
also to model with words. The use of Cartesian granule features facilitates the paradigm
of modelling with words, yielding anthropomorphic knowledge descriptions that are
effective in modelling classification and prediction systems. The next chapter presents
applications of Cartesian granule features to real world problems in the fields of
medical decision support, computer vision and control.
10.7 BIBLIOGRAPHY
Miller, G. A. (1956). ''The magical number seven, plus or minus two: some limits on
our capacity to process information", Psychological Review, 63:81-97.
Moller, M. F. (1993). "A scaled conjugate gradient algorithm for fast supervised
learning", Neural Networks, 6:525-533.
Quinlan, J. R. (1986). "Induction of Decision Trees", Machine Learning, 1(1):86-106.
Quinlan, J. R. (1993). C4.5: Programs for Machine Learning. Morgan Kaufmann, San
Mateo, CA.
Shanahan, J. G. (1998). "Cartesian Granule Features: Knowledge Discovery of
Additive Models for Classification and Prediction", PhD Thesis, Dept. of
Engineering Mathematics, University of Bristol, Bristol, UK.
Shanahan, J. G. (2000). "A comparison between naive Bayes classifiers and product
Cartesian granule feature models", Report No. In preparation, XRCE.
Silverman, B. W. (1986). Density estimation for statistics and data analysis. Chapman
and Hall, New York.
Sugeno, M., and Yasukawa, T. (1993). "A Fuzzy Logic Based Approach to Qualitative
Modelling", IEEE Trans on Fuzzy Systems, 1(1):7-31.
Weiss, S. M., and Indurkhya, N. (1998). Predictive data mining: a practical guide.
Morgan Kaufmann.
Zadeh, L. A. (1994). "Soft computing", LIFE Seminar, LIFE Laboratory, Yokohama,
Japan (February, 24), published in SOFT Journal, 6:1-10.
Zell, A., Mamier, G., Vogt, M., and Mache, N. (1995). SNNS (Stuttgart Neural Network
Simulator) Version 4.1. Institute for Parallel and Distributed High
Performance Systems (IPVR), Applied Computer Science, University of
Stuttgart, Stuttgart, Germany.
CHAPTER 11
APPLICATIONS
Having illustrated Cartesian granule feature modelling on artificial problems in the
previous chapter, the focus in this chapter switches to real world applications. Four
applications are considered in the domains of computer vision, diabetes diagnosis and
control. Both classification and regression applications are investigated. Knowledge
discovery of Cartesian granule feature models in these problem domains is contrasted
with other techniques such as neural networks, decision trees, naïve Bayes and various
fuzzy induction algorithms using a variety of performance criteria such as accuracy,
understandability and efficiency.
The next section begins by reviewing existing approaches to object recognition and
discussing some of the motivations behind applying the Cartesian granule feature
knowledge discovery process to image understanding. Section 11.1.2 overviews the
main knowledge discovery steps from an object recognition application perspective,
while also providing a task oriented breakdown of the rest of this section.
11.1.1 Motivations
Fischler and Firschein [Fischler and Firschein 1987] list learning, and representation
and indexing as two of the problems and open issues in computer vision.
Representation and indexing in this context refers to the design of representations for
the visual description of complex scenes that are also suitable for reasoning and
indexing into a large database of stored knowledge. However, in the interim both of
these areas have received much attention from various groups. This attention has
mainly been motivated by the following:
counterparts. Some of the more interesting symbolic learning work in the field of image
understanding (IU) includes Winston's [Winston 1975] landmark work: a high level
approach using semantic nets to learn object structures from examples and counter-
examples (Winston's near-misses). This approach, while providing understandable
models, largely ignores the lower and intermediate levels of image processing and
understanding. Other symbolic approaches to learning and representation have tended
to focus on small and limited problem domains. These include [Shepherd 1983] where
learnt decision trees were used in the classification of chocolates. Other approaches
based upon learning semantic net representations include the classification of hammers
and overhead views of commercial aircraft [Connell and Brady 1987]. Michalski et al.
[Michalski et al. 1998] provide some interesting results using a battery of learning
approaches: rule-based learning using AQ [Michalski and Chilausky 1980]; neural
network learning; and a hybrid of AQ and neural networks. The application domains
considered by Michalski et al. of outdoor image classification, detection of blasting
caps in X-ray images of luggage, and action recognition in motion video though
somewhat interesting were limited to rather simple uncluttered scenarios [Michalski et
al. 1998]. Ralescu and Shanahan [Ralescu and Shanahan 1995; Ralescu and Shanahan
1999] propose a novel approach to learning the rules of perceptual organisation using
fuzzy modelling techniques (resulting in intuitive and transparent models). This
approach was limited to the perceptual organisation of edge images, however it could
very easily be extended to other forms of perceptual organisation including that of
regions, objects and scenes. The resulting high-level structures could then be used to
compare with object models and thus, lead to object recognition.
Overall, successful approaches to learning in vision are still predominantly black box in
nature. Here a new approach to image understanding is proposed, based upon Cartesian
granule features, that not only provides high levels of accuracy, but also facilitates
understanding due to the transparent and succinct nature of the knowledge
representation used. The approach is illustrated on a road recognition problem. Image
representation and indexing, though not addressed directly in this paper, can benefit
from image recognition and understanding approaches that provide transparency.
images. In Section 11.1.8, the results obtained with Cartesian granule feature models
are compared with standard machine learning approaches. Finally, Section 11.1.9
finishes with some specific conclusions for the vision problem, while more general
conclusions about knowledge discovery using Cartesian granule features are presented
in Section 11.5.
[Figure: pipeline from region selection and region feature value generation over an image database, through classifier/knowledge generation, to regions classified as background or road.]
Figure 11-1: Three stages in classifier generation for the vision problem: stage 1 -
feature value generation, stage 2 - classifier generation, stage 3 - system evaluation.
Note that these stages are iterative.
Figure 11-2: Typical image of an outdoor scene in the Bristol image database.
The luminance L(x, y) of a pixel at location (x, y) is defined as a weighted sum of
R(x, y), G(x, y) and B(x, y), the colours of the pixel, scaled in [0, 1].
coefficients approximate the contribution of the three colour separations to luminance.
To achieve a 3D psychophysically plausible definition of colour [Glassner 1995; Valois
and Valois 1993], the features R-G and Y-B are also considered and are defined as
follows:
R-G = (R(x, y) - G(x, y) + 1) / 2

and Y-B is formed analogously from the difference between the yellow component,
taken as the mean of R(x, y) and G(x, y), and B(x, y).
These are the opponent red/green and yellow/blue colour difference signals. The region
value for each colour feature is taken as the average over the corresponding region pixel
values.
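A sketch of these colour features follows; the luminance coefficients are not reproduced above, so the standard Rec. 601 weighting is assumed, and the yellow component is taken as the mean of R and G:

# Sketch: per-pixel colour features and their region averages. Luminance
# weights and the yellow component are assumptions, as noted above.
import numpy as np

def colour_features(rgb):
    """rgb: array of shape (n_pixels, 3), channels scaled into [0, 1]."""
    R, G, B = rgb[:, 0], rgb[:, 1], rgb[:, 2]
    luminance = 0.299 * R + 0.587 * G + 0.114 * B   # assumed coefficients
    r_minus_g = (R - G + 1.0) / 2.0                 # opponent red/green
    y_minus_b = ((R + G) / 2.0 - B + 1.0) / 2.0     # opponent yellow/blue
    return luminance, r_minus_g, y_minus_b

def region_colour_values(region_pixels):
    # The region value of each feature is the mean over its pixels.
    return [f.mean() for f in colour_features(region_pixels)]

print(region_colour_values(np.array([[0.8, 0.4, 0.2], [0.6, 0.5, 0.3]])))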
between the actual region and the approximated region (the 32-sided polygon); the
difference in size between the actual region and the approximated region generated using
the PCA eigenvectors. All size differences are normalised by the size of the actual
region. These features provide added discrimination between the polygonal
approximation and all possible region boundaries that can lead to this polygonal
approximation.
Figure 11-5: Region shape description: (a) A typical region boundary; (b) radii from
the region's centroid to its boundary; (c) polygon approximation (dashed line) overlaid
on the original region boundary.
For the purposes of this work, the channels are represented with a bank of Gabor filters, which are designed to sample the entire frequency domain of an image by varying the shape, bandwidth, centre frequency and orientation parameters. In the spatial domain, a Gabor filter takes the form of a complex sinusoidal grating oriented in a particular direction, modulated by a two-dimensional Gaussian. The convolution filter located at (x0, y0), with centre frequency w0, orientation with respect to the x-axis θ0, and scales of the Gaussian's major and minor axes σu and σv, is defined in [Bovik, Clark and Geisler 1990]. The filter has a modulation of (u0, v0) such that w0 = √(u0² + v0²) and the orientation of the filter is θ0 = tan⁻¹(v0/u0).
Thus, each Gabor filter is tuned to detect only a specific local sinusoidal pattern of frequency w0, orientated at angle θ0 in the image plane. The frequency and orientation selective properties of a Gabor filter are more explicit in its frequency domain representation, where it corresponds to a Gaussian bandpass filter, centred at a distance w0 from the origin, with its minor axis at an angle θ0 to the u-axis.
In the experiments described here, a variety of Gabor filters were considered: regular, angular and isotropic. Regular Gabor filters correspond to Gabor filters as described above. Thirty-two Gabor filters were used, positioned on the centre frequencies of 2, 4, 8, 16, 32, 64, 128 and 256 and at orientations of 0°, 45°, 90° and 135°. Angular Gabor filters consider texture as a function of angle while ignoring frequency, and correspond to the mean response magnitude of all filters at a centre angle. In order to reduce complexity, the angular Gabor filter responses are reduced to (1) an integer value indicating which angular Gabor filter provided the highest response, and (2) the sine and cosine of the angle of the highest-response angular Gabor filter. Isotropic filters, on the other hand, view texture as a function of frequency while ignoring the orientation, and correspond to the mean response magnitude of all filters at a centre frequency. The texture measure, in the case of all three filter types, for a pixel in the spatial domain corresponds to the magnitude of the complex filter output. The region value for each texture feature is taken as the average over the corresponding region pixel values.
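To make the filter bank concrete, the following Python sketch computes the magnitude response of a single regular Gabor filter via frequency-domain convolution. The function name, the exact Gaussian/grating parameterisation and the FFT route are assumptions; the book's own implementation is not reproduced here:

import numpy as np

def gabor_response(image, w0, theta0, sigma_u, sigma_v):
    """Magnitude response of one complex Gabor filter on a grey-level image."""
    rows, cols = image.shape
    y, x = np.indices((rows, cols))
    x = x - cols // 2
    y = y - rows // 2
    # Rotate coordinates so the Gaussian's axes align with orientation theta0.
    xr = x * np.cos(theta0) + y * np.sin(theta0)
    yr = -x * np.sin(theta0) + y * np.cos(theta0)
    gaussian = np.exp(-0.5 * ((xr / sigma_u) ** 2 + (yr / sigma_v) ** 2))
    grating = np.exp(2j * np.pi * (w0 / cols) * xr)  # sinusoid at centre frequency w0
    kernel = gaussian * grating
    # Convolve via the FFT; the texture measure is the complex output's magnitude.
    response = np.fft.ifft2(np.fft.fft2(image) * np.fft.fft2(np.fft.ifftshift(kernel)))
    return np.abs(response)

# A regular bank as described above: centre frequencies 2, 4, ..., 256 (cycles per
# image) at orientations 0, 45, 90 and 135 degrees gives 32 filters; an isotropic
# measure averages the magnitudes across orientations at one centre frequency.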
Table 11-1: Object classifications for each region and corresponding sample counts.

Table 11-2: The 63 original features computed for each region.

No.         Feature
0           Luminance
1           R-G
2           Y-B
3           Size
4, 5        Centroid (x, y)
6, 7        Orientation
8-15        Shape: principal modes
16, 17, 18  Shape error
19-26       Isotropic Gabor: G2, G4, ..., G256
27, 28      Largest Gabor directional responses in sine and cosine
29-61       Regular Gabor filters G(2,0), G(2,45), G(2,90), ..., G(256,135)
Table 11-3: The 10 selected features for each region that are considered for learning.

No.  Feature
0    Luminance
1    R-G
2    Y-B
3    X
4    Y
5    Orientation 1
6    Orientation 2
7    Shape 1 (principal mode)
8    Texture G128 (high frequency, isotropic)
9    Texture G256 (high frequency, isotropic)
Here, a filter feature selection algorithm is proposed based upon neural networks. The proposed approach, apart from being computationally very efficient (requiring a single trained network), benefits from relying on widely available software and well-understood learning algorithms: neural networks. Furthermore, neural networks can generally handle high-dimensional problems and, contrary to their blackbox nature, can provide very useful behavioural indicators of the appropriateness of deploying various features in modelling a problem domain. The approach, while relying on (and thus vulnerable to) an accuracy measure that has an entirely different inductive bias than the induction method planned for use with the selected features, benefits from maintaining the original data distribution, albeit in a scrambled format.
Repeat
  1. Select a feature j.
  2. Scramble the values of feature j in the test dataset, i.e. the new test
     dataset consists of n-1 features with their original values and one
     feature, j, whose values are randomly sampled from the original
     feature values.
  3. Test the trained network TrainNN using the new test dataset.
Until all features have been processed
Subsequently, the k features that give the most degraded performance are selected as
the reduced feature set that will be considered during induction.
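The scrambling procedure can be sketched as follows in Python; the function name and the generic model object with a predict method (standing in for the trained network TrainNN) are assumptions:

import numpy as np

def scramble_feature_ranking(model, X_test, y_test, k):
    """Rank features by the accuracy drop when each is scrambled in turn."""
    # Baseline accuracy of the already-trained network on the untouched test set.
    baseline = (model.predict(X_test) == y_test).mean()
    rng = np.random.default_rng(0)
    drops = []
    for j in range(X_test.shape[1]):
        X_scrambled = X_test.copy()
        # Permute feature j only: the other n-1 features keep their original
        # values, and feature j keeps its original marginal distribution.
        X_scrambled[:, j] = rng.permutation(X_scrambled[:, j])
        accuracy = (model.predict(X_scrambled) == y_test).mean()
        drops.append(baseline - accuracy)
    # The k features whose scrambling degrades performance most are selected.
    return np.argsort(drops)[::-1][:k]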
The G_DACG algorithm iterated for fifty generations and at the end of each generation,
five of the best Cartesian granule features were selected from the current population.
The discovered features were then used to form additive Cartesian granule feature rule-based models. Backward elimination was also employed, eliminating extraneous, lowly contributing features. Table 11-5 presents the results of some of the more interesting additive Cartesian granule feature models that were discovered using G_DACG. The models presented in Table 11-5 were constructed using equal numbers of examples of Road and Not-Road for training. By equalising the example count across classes, a slight improvement in test case accuracy (of less than 1%) was achieved over learning from the original skewed training set.
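A minimal sketch of the backward elimination step mentioned above is given below; the evaluate callback, which is assumed to build and score an additive model from a feature subset, stands in for machinery described elsewhere in the book:

def backward_elimination(features, evaluate):
    """Greedily drop the feature whose removal least hurts model accuracy.

    features -- list of candidate Cartesian granule features
    evaluate -- callable: feature subset -> validation accuracy of the
                additive model built from that subset (assumed given)
    """
    selected = list(features)
    best = evaluate(selected)
    improved = True
    while improved and len(selected) > 1:
        improved = False
        for f in list(selected):
            trial = [g for g in selected if g is not f]
            score = evaluate(trial)
            if score >= best:      # an extraneous, lowly contributing feature
                selected, best = trial, score
                improved = True
                break
    return selected, best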
[Two bar charts, one per class (Road, NotRoad), showing membership values over the granules vLow, low, medium, high and vHigh.]
Figure 11-6: A linguistic summary, in the form of Cartesian granule fuzzy sets, of luminance for the Road and NotRoad classes.
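One plausible reading of how such a linguistic summary could be formed is sketched below: granule memberships are accumulated over a class's example values and normalised so that the peak granule has membership one. The function and the normalisation choice are illustrative assumptions, not the book's exact mass-assignment-based procedure:

def linguistic_summary(values, granules, membership):
    """Summarise a feature for one class as a fuzzy set over granules.

    values     -- feature values for the class's training examples
    granules   -- labels such as ["vLow", "low", "medium", "high", "vHigh"]
    membership -- callable (value, granule) -> degree in [0, 1]
    """
    counts = {g: 0.0 for g in granules}
    for v in values:
        for g in granules:
            counts[g] += membership(v, g)   # accumulate granule memberships
    peak = max(counts.values()) or 1.0
    # Normalise so the most supported granule has membership one.
    return {g: counts[g] / peak for g in granules}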
Figure 11-7: Screendump of a Java applet that displays the original image (top left
quadrant), k-means segmented image (top right quadrant) and the results of region
classification using a rule-based ACGF model (bottom left quadrant). The regions
classified as road are highlighted in grey and the non-road regions are displayed in
black.
Table 11-6: The confusion matrix generated by the discovered 2D optimised model detailed in Table 11-5.

Actual\Predicted   NotRoad   Road   Total   Class % Accuracy
NotRoad            1767      30     1797    98.3
Road               39        210    249     84.3
Figure 11-8: An additive Cartesian granule feature model for road classification.

Potential applications of this approach include autonomous vehicle navigation systems, medical image analysis, and landmine detection.
The problem posed here is to discover patterns that predict whether a patient would test positive or negative for diabetes according to the World Health Organisation criteria, given a number of physiological measurements and medical test results. The dataset was originally donated by Vincent Sigillito, Applied Physics Laboratory, Johns Hopkins University, Laurel, MD 20707, and was constructed by a constrained selection from a larger database held by the National Institute of Diabetes and Digestive and Kidney Diseases [Smith et al. 1988]. It is publicly available from the machine learning repository at UCI [Merz and Murphy 1996]. All the patients represented in this dataset are females, at least 21 years old, of Pima Indian heritage, living near Phoenix, Arizona, USA. There are eight input features and one output or dependent feature, the diabetes diagnosis, which is discrete, taking one of two values: "positive for diabetes" or "negative for diabetes". These input-output features and their corresponding feature numbers (used for convenience) are listed in Table 11-8. There are 268 positive examples and 500 negative examples.
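For readers wishing to reproduce the setup, a minimal loading sketch follows; the file name and column layout reflect the commonly distributed UCI format and are assumptions here:

import csv

def load_pima(path="pima-indians-diabetes.csv"):
    """Load the UCI Pima diabetes data: 8 numeric inputs, binary diagnosis.

    Assumes the common CSV layout with the class label (0 = negative,
    1 = positive) in the last column; the path and layout are illustrative.
    """
    X, y = [], []
    with open(path) as f:
        for row in csv.reader(f):
            if not row:
                continue
            X.append([float(v) for v in row[:8]])   # the eight input features
            y.append(int(row[8]))                   # the diagnosis
    return X, y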
At the end of each generation, the best Cartesian granule features were selected from the current population to form the best-of-generation model.
Table 11-8: Base features and their corresponding feature numbers for the Pima diabetes problem.

No.  Feature
0    Number of times pregnant
1    Plasma glucose concentration in an oral glucose tolerance test
2    Diastolic blood pressure (mm Hg)
3    Triceps skin fold thickness (mm)
4    2-hour serum insulin (μU/ml)
5    Body mass index (kg/m²)
6    Diabetes pedigree function
7    Age (years)
8    Classification
Table 11-9: G_DACG parameter tableau for the Pima diabetes detection problem.
The best discovered ACGF model for the diabetes problem was generated by taking the five best Cartesian granule features that were visited during the genetic search phase. Table 11-10 shows the results of the backward elimination process that was used to arrive at this model. During the genetic search process, the granule characterisations were set to trapezoidal fuzzy sets with 50% overlap. However, after further investigation on the best model, a trapezoidal fuzzy set with 70% overlap was determined to be the best granule characterisation. The best discovered model, from both a model accuracy and a simplicity perspective, consists of two Cartesian granule features, yielding a model accuracy on the test data of 79.7% (see Table 11-10). An evidential logic rule corresponding to the positive class, along with the model filters, is presented in Figure 11-9. The negative class rule filter in this case is more disjunctive, or optimistic, in nature than its positive counterpart. This optimism may arise from the fact that a single feature may be adequate to model this class. Models with other granule characterisations give similar or slightly higher accuracies but require more Cartesian granule features. For example, a model with triangular fuzzy set granule characterisations gives an accuracy of 79.18% but requires all five Cartesian granule features, resulting in a rather complex model.
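The following sketch shows one plausible construction of a uniform trapezoidal linguistic partition with a given overlap; the exact geometry implied by the 50% and 70% overlaps used above may differ, so the parameterisation is an assumption:

import numpy as np

def trapezoidal_partition(lo, hi, granularity, overlap):
    """Uniformly place `granularity` trapezoidal fuzzy sets over [lo, hi].

    overlap -- fraction of each granule's support shared with a neighbour
               (e.g. 0.5 or 0.7); one plausible parameterisation only.
    """
    centres = np.linspace(lo, hi, granularity)
    spacing = (hi - lo) / (granularity - 1)   # distance between granule centres
    core = spacing * (1.0 - overlap)          # width of the flat top
    # Each granule is (left foot, left shoulder, right shoulder, right foot).
    return [(c - spacing, c - core / 2, c + core / 2, c + spacing) for c in centres]

def membership(x, trap):
    """Degree to which x belongs to a trapezoidal granule."""
    a, b, c, d = trap
    if b <= x <= c:
        return 1.0
    if a < x < b:
        return (x - a) / (b - a)
    if c < x < d:
        return (d - x) / (d - c)
    return 0.0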
Table 11-10: Model accuracies before and after filter optimisation for various ACGF models (see Table 11-11 for the feature key). The underlying granule characterisations are trapezoidal fuzzy sets with 70% overlap.
                   Before Filter Optimisation    After Filter Optimisation
# of CG Features   Train %   Valid %   Test %    Train %   Valid %   Test %
1                  76.52     69.83     66.15     81.96     75        74.5
2                  76.04     73.3      69.79     82.17     79.31     79.69
3                  78.04     77.59     73.96     82.17     78.45     77.08
4                  78.48     76.72     75.5      79.57     80.17     77.08
5                  77.83     75        75        81.74     78.45     78.12
Table 11-11: The Cartesian granule feature sets used in the ACGF models presented in Table 11-10, where, for example, the Cartesian granule feature ((0 8) (1 10) (2 2) (3 12)) denotes the following: the feature consists of four base features (pregnancyCount, glucoseConcentration, etc.), and the universe of each base feature is abstracted by a linguistic partition with the indicated granularity. For example, the universe of pregnancyCount is abstracted by a partition of granularity 8.
Figure 11-10 contains the fitness curves for this run. This figure shows, by generation, the progress of one run of the Pima diabetes problem between generations 0 and 50, using three plots: the fitness of the best-of-generation individual (Cartesian granule feature) in the population, the fitness of the worst-of-generation individual in the population, and the average fitness of all the individuals in the population. On the other hand, Figure 11-11 presents the variety, by generation, of the evolutionary search for the G_DACG run that resulted in the above model. This figure shows, by generation, the progress of one G_DACG run on the diabetes problem between generations 0 and 50, using two plots: the percentage of new Cartesian granule features visited in each generation, though the curve (labelled % of Chromosomes Revisited) is plotted from the perspective of the number of features that are revisited; and the chromosome variety in the current population, which can be ignored here, since duplicates are not allowed within a population. The number of novel features in each population decreases steadily over time, mainly because of the evolutionary nature (low mutation rate, and a relatively small population) of the search.

The Pima diabetes problem is a notoriously difficult machine learning problem. Part of this difficulty arises from the fact that the dependent output variable is really a binarised form of another variable, which itself is highly indicative of certain types of diabetes but does not have a one-to-one correspondence with the condition of being diabetic [Michie, Spiegelhalter and Taylor 1993a]. To date, only one other machine learning approach has obtained an accuracy higher than 78% [Merz and Murphy 1996]; this is the mass assignment based induction of decision trees, the MATI algorithm [Baldwin, Lawry and Martin 1997]. The discovered ACGF models yield an equivalent accuracy of 79.7% (see Table 11-12).
Figure 11-10: Fitness curves for the Pima problem for a G_DACG run.
Figure 11-11: Percentage of Cartesian granule features that were revisited in each
generation for the Pima diabetes problem. Pool variety is 100% (since duplicates are
not allowed).
Table 11-12: Classification accuracies of various approaches on the Pima diabetes problem.

Approach                                                % Accuracy
Additive Cartesian granule feature model                79.7
MATI decision trees [Baldwin, Lawry and Martin 1997]    79.7
Oblique decision trees [Cristianini 1998]               78.5
Neural networks (normalised data)                       78
C4.5 [Michie, Spiegelhalter and Taylor 1993b]           73
Neural networks (unnormalised data)                     67
Data browser                                            70
This application deals with the widely used benchmark problem of modelling a gas furnace (an example of a dynamical process), which was first presented by Box and Jenkins [Box and Jenkins 1970]. The modelled system consists of a gas furnace in which air and methane are combined to form a mixture of gases containing CO2 (carbon dioxide). The air feed to the furnace is kept constant, while the methane feed rate can be varied in any desired manner. The furnace output, the CO2 concentration, is measured in the exhaust gases at the outlet of the furnace.
The dataset here corresponds to a time series consisting of 296 successive pairs of observations of the form (u(t), y(t)), where u(t) represents the methane gas feed rate at time step t and y(t) represents the concentration of CO2 in the gas outlets. The sampling time interval is nine seconds. Using a time-discrete formulation, the dynamics of the system are represented by a relationship that links the predicted system state y(t+1) to the previous input states u(ti) and the previous output states y(ti); that is, y(t+1) is a function of the previous input and output states, i.e. y(t+1) = f(u(t1), u(t2), ..., u(tn), y(t1), y(t2), ..., y(tn)).

The goal of knowledge discovery here is to detect patterns in the time series data that facilitate automatic control. After a few iterations of the knowledge discovery process, the value of n was set to five. Consequently, ten input variables were considered and the database reduces to 291 data tuples of the form (u(t), u(t-1), ..., u(t-4), y(t), y(t-1), ..., y(t-4), y(t+1)).
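The window construction can be sketched as follows; the function name is illustrative, but the input-output layout mirrors the tuples described above (296 samples with n = 5 yield the 291 tuples quoted):

def make_lagged_dataset(u, y, n=5):
    """Build (u(t), ..., u(t-n+1), y(t), ..., y(t-n+1)) -> y(t+1) tuples.

    u, y -- equal-length sequences: methane feed rate and CO2 concentration
    """
    data = []
    for t in range(n - 1, len(u) - 1):
        # Ten inputs for n = 5: five lagged feed rates and five lagged outputs.
        inputs = [u[t - i] for i in range(n)] + [y[t - i] for i in range(n)]
        data.append((inputs, y[t + 1]))   # target is the next furnace output
    return data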
G_DACG was applied to this problem, searching a multi-million node feature space. The k-tournament selection parameter k was set to 4 for this problem. The output universe was uniformly partitioned using eight triangular fuzzy sets. The G_DACG algorithm iterated for fifty generations (or halted earlier if the stopping criterion, arbitrarily set at a mean square error (MSE) of less than 0.05, was satisfied). As a result of the G_DACG process, an additive Cartesian granule feature model in which each rule consists of two Cartesian granule features was deemed the most suitable model. The model consists of eight rules, and a trapezoidal fuzzy set with 50% overlap was determined to be the best input feature granule characterisation. The performance accuracy of the model was measured based upon the mean square error (MSE) between the actual data outputs (yi) and the model outputs (ŷi), calculated as follows:
MSE = (1/N) Σ(i=1..N) (yi - ŷi)²
The discovered model yields a relatively low MSE of 0.128. In Figure 11-12, the model performance is compared with the original data. The horizontal axis corresponds to time, while the vertical axis denotes the furnace output, the CO2 concentration. A sample rule in this model is presented in Figure 11-13, describing a fuzzy class, namely Small, in the output space. Increasing the granularity of the output universe (and consequently, the number of rules) can lead to models with a lower MSE; however, this also leads to more complex models. For example, if the granularity of the output universe is increased to ten, the MSE of the model drops to 0.11.
Figure 11-12: ACGF model predictions versus the actual data for the gas furnace problem.
This application deals with modelling the operation of a chemical plant. This problem and the corresponding dataset were originally presented by Sugeno and Yasukawa [Sugeno and Yasukawa 1993]. The chemical plant produces a polymer through a process of monomer polymerisation. Since the start-up of the plant is very complicated, a human operator is required to control the plant manually.

The dataset consists of 70 observations taken from actual plant operation. Each observation consists of five input variables (see Table 11-14 for details) and an output variable corresponding to the set point for monomer flow rate. The human operator determines the set point for the monomer flow rate and gives this information to a PID controller, which calculates the actual monomer flow rate for the plant.
Table 11-14: Input and output base features for the chemical plant control problem.

No.   Feature
0     Monomer concentration
1     Change of monomer concentration
2     Monomer flow rate
3, 4  Local temperatures inside the plant
5     Set point for monomer flow rate
Figure 11-15: ACGF model predictions versus the human operator for the chemical plant.
The discovered ACGF model has a high complexity for this problem and may be suffering from the uniform partitioning of the input feature universes. A more efficient, and possibly lower dimensional, Cartesian granule feature may result from a data-centred approach to partitioning.
11.5 DISCUSSION
This section presents a more general discussion of the results presented above, evaluating Cartesian granule feature models on performance criteria such as transparency, efficiency and accuracy.
Regarding the other problems considered in the chapter, the resulting Cartesian granule feature models were not as simple, but they do, however, decompose the problem domain into lower-order dependencies between semantically related features. The granularity of these features in most cases is high. This may result from the uniform characterisation of each granule. Using a more data-centred approach to partitioning, such as clustering, may lead to Cartesian granule features with lower granularities.
Model transparency can also suffer, for example, in the case of bushy decision trees. On the other hand, the G_DACG constructive induction algorithm is computationally intensive, which is due to the global, population-based search approach used, but it avoids local minima. The determination of neural network topologies is also computationally intensive.
In the case of additive Cartesian granule feature models, the system identification step is not just concerned with identifying a model that provides high performance accuracy (the goal of most other induction algorithms), but is also concerned with identifying a model that is glassbox in nature. The extra computational cost of identifying glassbox models is compensated for by the identification of accurate models that facilitate understanding.
The focus and motivation behind the new approach presented in the latter parts of this book has been the development of a knowledge discovery process that leads to models that are ultimately understandable not only by computers but also by experts in the domain of application, and that perform effectively. This has resulted in the development of a new form of knowledge representation, Cartesian granule feature models, and a corresponding constructive induction algorithm, G_DACG. The book has highlighted that "much of the power comes not from the specific induction method, but from proper formulation of the problems and from crafting the representation to make learning more tractable" [Langley and Simon 1998]. Cartesian granule features incorporated into additive models tolerate and exploit uncertainty in order to achieve tractability and transparency on the one hand and generalisation on the other. This approach has been demonstrated on various real-world problems, attaining the goals of understandability and effectiveness to a great extent. Overall, soft computing approaches (including Cartesian granule feature modelling), through tolerating and exploiting uncertainty, provide a very powerful means of attaining transparent and effective inductive inference and will be a key player in the knowledge discovery processes of the future.
• Incremental learning
By nature, the world is dynamic, continuously changing and evolving. Knowledge, one of the artefacts of mankind, is no different. For example, the concept of granny's image will change over time, and for an image retrieval system to successfully retrieve her image over time, its representation will need to evolve with granny. This type of learning lies in the domain of incremental learning [Utgoff 1989]. Incremental learning refers to learning where the observations are presented one (or a few) at a time to the learning algorithm. Incremental learning is seen as a means of tackling concept drift [Schlimmer and Granger 1986] and is becoming an important area in knowledge discovery due to the embryonic nature of the information world. Although the results presented in this book are the result of one-shot learning, the proposed approaches, due to their probabilistic nature, can facilitate an incremental approach to learning, which is the subject of current work.
• Distributed learning
To date, knowledge discovery has mainly been centralised in nature, that is,
the discovered knowledge corresponds to a single model that is determined
from a single database of examples. However, as organisations and their
database management systems become more decentralised, knowledge
discovery systems that offer decentralised model building and deployment
capabilities will become more essential and prominent. Alternatively,
individual entities such as banks may pool decision support systems, such as
fraud detection systems, thereby providing more powerful and trustworthy
support (a software parallel of human committees). Distributed knowledge
discovery can be realised by a number of techniques including bagging and
boosting. Future work will investigate the use of Cartesian granule feature
models in this distributed context. An alternative to the aforementioned
approaches to distributed knowledge discovery, in the case of Cartesian
granule feature models, is to merge the individual models into one overall
model. This is the subject of current work.
11.8 SUMMARY
This chapter has described how additive Cartesian granule feature modelling has been applied to a number of real-world problems, including diabetes detection in Pima Indians and region classification in vision understanding. The G_DACG constructive induction algorithm was used to discover these models. The discovered models perform very well, in some cases yielding simpler, more transparent models with accuracies higher than other well-known machine learning techniques. Various ways of extending and improving the ACGF modelling approach were suggested, especially in the context of the region classification problem. Some overall conclusions regarding the knowledge discovery of Cartesian granule feature models were drawn. Several future avenues of research were identified, for knowledge discovery in general and for Cartesian granule feature modelling in particular.
11.9 BIBLIOGRAPHY
Almuallim, H., and Dietterich, T. G. (1991). "Learning with irrelevant features." In the proceedings of AAAI-91, Anaheim, CA, 547-552.
Baldwin, J. F., Lawry, J., and Martin, T. P. (1997). "Mass assignment fuzzy ID3 with applications." In the proceedings of Fuzzy Logic: Applications and Future Directions Workshop, London, UK, 278-294.
Baldwin, J. F., Martin, T. P., and Shanahan, J. G. (1997). "Structure identification of fuzzy Cartesian granule feature models using genetic programming." In the proceedings of IJCAI Workshop on Fuzzy Logic in Artificial Intelligence, Nagoya, Japan, 1-11.
Baldwin, J. F., Martin, T. P., and Shanahan, J. G. (1999). "Controlling with words using automatically identified fuzzy Cartesian granule feature models", International Journal of Approximate Reasoning (IJAR), 22:109-148.
Bastian, A. (1995). "Modelling and Identifying Fuzzy Systems under Varying User Knowledge", PhD Thesis, Meiji University, Tokyo.
Bellman, R. E. (1961). Adaptive Control Processes. Princeton University Press.
Blum, A. L., and Langley, P. (1997). "Selection of relevant features and examples in machine learning", Artificial Intelligence, 97:245-271.
Bovik, A. C., Clark, M., and Geisler, W. S. (1990). "Multichannel texture analysis using localised spatial filters", IEEE Transactions on PAMI, 12(1):55-73.
Box, G. E., and Jenkins, G. M. (1970). Time Series Analysis, Forecasting and Control. Holden Day, San Francisco, CA.
Brooks, R. A. (1987). "Model-based three-dimensional interpretations of two-dimensional images", In Readings in Computer Vision: Issues, Problems, Principles, and Paradigms, M. A. Fischler and O. Firschein, eds., Kaufmann Publishers, Inc., Los Altos, CA, USA, 360-370.
Caelli, T., and Reye, D. (1993). "On the classification of image regions by colour, texture and shape", Pattern Recognition, 26(4):461-470.
Campbell, F. W., and Robson, J. G. (1968). "Application of Fourier analysis to the visibility of gratings", Journal of Physiology, 197:551-566.
Campbell, N. W., Mackeown, W. P. J., Thomas, B. T., and Troscianko, T. (1997). "Interpreting Image Databases by Region Classification", Pattern Recognition, 30(4):555-563.
Campbell, N. W., Thomas, B. T., and Troscianko, T. (1997). "Automatic segmentation and classification of outdoor images using neural networks", International Journal of Neural Systems, 8(1):137-144.
Connell, J. H., and Brady, M. (1987). "Generating and generalising models of visual objects", Artificial Intelligence, 34:159-183.
Cootes, T. F., and Taylor, C. J. (1995). "Combining point distributions with shape models based on finite-element analysis", Image Vision Computation, 13(5):403-409.
Cootes, T. F., Taylor, C. J., Cooper, D. H., and Graham, J. (1992). "Training models of shape from sets of examples." In the proceedings of British Machine Vision Conference, Leeds, UK, 9-18.
Cristianini, N. (1998). "Application of oblique decision trees to Pima diabetes problem", Personal Communication, Department of Engineering Mathematics, University of Bristol, UK.
Michalski, R. S., Rosenfeld, A., Duric, Z., Maloof, M., and Zhang, Q. (1998). "Learning patterns in images", In Machine Learning and Data Mining, R. S. Michalski, I. Bratko, and M. Kubat, eds., Wiley, New York, 241-268.
Michie, D., Spiegelhalter, D. J., and Taylor, C. C. (1993a). "Dataset Descriptions and Results", In Machine Learning, Neural and Statistical Classification, D. Michie, D. J. Spiegelhalter, and C. C. Taylor, eds., 131-174.
Michie, D., Spiegelhalter, D. J., and Taylor, C. C., eds. (1993b). Machine Learning, Neural and Statistical Classification. Ellis Horwood, New York, USA.
Mirmehdi, M., Palmer, P. L., Kittler, J., and Dabis, H. (1999). "Feedback control strategies for object recognition", IEEE Transactions on Image Processing, 8(8):1084-1101.
Moller, M. F. (1993). "A scaled conjugate gradient algorithm for fast supervised learning", Neural Networks, 6:525-533.
Mukunoki, M., Minoh, M., and Ikeda, K. (1994). "Retrieval of images using pixel based object models." In the proceedings of IPMU, Paris, France, 1127-1132.
Murase, H., and Nayar, S. K. (1993). "Learning and recognition of 3D objects from appearance." In the proceedings of IEEE 2nd Qualitative Vision Workshop, New York, NY, 39-50.
Murthy, S. K., Kasif, S., and Salzberg, S. (1994). "A system for induction of oblique decision trees", Journal of Artificial Intelligence Research, 2:1-33.
Nakoula, Y., Galichet, S., and Foulloy, L. (1997). "Identification of linguistic fuzzy models based on learning", In Fuzzy Model Identification, H. Hellendoorn and D. Driankov, eds., Springer, Berlin, 281-319.
Pedrycz, W. (1984). "An identification algorithm in fuzzy relational systems", Fuzzy Sets and Systems, 13:153-167.
Quinlan, J. R. (1993). C4.5: Programs for Machine Learning. Morgan Kaufmann, San Mateo, CA.
Ralescu, A. L., and Shanahan, J. G. (1995). "Line structure inference in fuzzy perceptual grouping." In the proceedings of NSF Workshop on Computer Vision, Islamabad, Pakistan, 225-239.
Ralescu, A. L., and Shanahan, J. G. (1999). "Fuzzy perceptual organisation of image structures", Pattern Recognition, 32:1923-1933.
Rumelhart, D. E., Hinton, G. E., and Williams, R. J. (1986). "Learning internal representations by error propagation", In Parallel Distributed Processing (Volume 1), D. E. Rumelhart and J. L. McClelland, eds., MIT Press, Cambridge, USA.
Schlimmer, J. C., and Granger, R. H. (1986). "Beyond incremental processing: tracking concept drift." In the proceedings of Fifth National Conference on Artificial Intelligence, Philadelphia, 502-507.
Shanahan, J. G. (1998). "Cartesian Granule Features: Knowledge Discovery of Additive Models for Classification and Prediction", PhD Thesis, Dept. of Engineering Mathematics, University of Bristol, Bristol, UK.
Shanahan, J. G. (2000). "A comparison between naive Bayes classifiers and product Cartesian granule feature models", Report No. in preparation, XRCE.
Shanahan, J. G., Baldwin, J. F., Campbell, N., Martin, T. P., Mirmehdi, M., and Thomas, B. T. (1999). "Transitioning from recognition to understanding in vision using additive Cartesian granule feature models." In the proceedings of North American Fuzzy Information Processing Society (NAFIPS), New York, USA, 710-714.
• Fitness functions and reproduction operators may differ slightly but will
share the same philosophy.
One of the most important and difficult concepts of genetic programming is the
determination of the fitness function. The fitness function determines how well a
program is able to solve the problem. The output of the fitness function is used as the
basis for selecting which individuals get to procreate and contribute their genetic
material to the next generation. The structure of the fitness function will vary greatly
from problem to problem. See Section 9.3.2 for example fitness functions.
Three primary operations are used to adapt the chromosomes in genetic programming:
• reproduction;
• crossover;
• and mutation.
The reproduction operator selects an individual from the current population according to some fitness-based selection mechanism (such as k-tournament selection, see Section 9.3.4) and copies it, without alteration, from the current population into the new population. The crossover operation creates variation in the population by producing new offspring that combine genetic material from two parents. This operation consists of selecting two parents from the current population. Subsequently, a node is selected randomly within each of the selected parents. The sub-trees rooted at these nodes are then swapped, resulting in two new offspring that are inserted into the next generation. The mutation operator introduces random changes in individuals in the current population. Once again, an individual is selected. Then a node is randomly selected within this individual. The sub-tree rooted at this node is replaced by a newly generated random sub-tree. The mutated individual is subsequently inserted into the next generation. These operations are described, in more detail, in the context of learning in Section 9.3. A sketch of these operators is given after Figure A-1.
      +
     / \
    f5  -
       / \
      f6  f3
Figure A-1: An example chromosome structure that denotes the program "f5 + (f6 - f3)".
GLOSSARY OF MAIN SYMBOLS
∫X μA(x)/x    The fuzzy set A defined over the continuous universe ΩX
Aα            An α-cut of fuzzy set A
A = B         Set equality
A ⊆ B         Set inclusion
A ⊂ B         Proper set inclusion (i.e. A ⊂ B iff A ⊆ B and A ≠ B)
¬A            The complement of set A
Ā             The complement of set A
A ∪ B         The union of the sets A and B
A ∩ B         The intersection of sets A and B
A ⊗ B         Fuzzy intersection or t-norm
Σ(i=1..n) xi  The sum x1 + x2 + ... + xn
SUBJECT INDEX

Symbols
¬ complement, 49
‾ complement, 49
↑ cylindrical extension, 60
∈ element of, 36
∀ for all, 36
∩ intersection, 48
μ membership function, 40
↓ projection, 59
⊕ t-conorm, 51
⊗ t-norm, 50
∪ union, 49
Σ union notation, 39, 40
Ω universe, 35
α-cut, 43
γ-operator, 56, 187
(...), 40
/ membership separator, 39
[...], 40
{...}, 39
|, 36, 43, 95
+ union notation, 39
<...>, 97, 98, 104, 114

A
accuracy, 244
ACGF model. See additive Cartesian granule feature model
additive Cartesian granule feature model, 194, 228, 241, 242, 293
additive model. See additive Cartesian granule feature model
antecedent, 80
applications, 241, 281
arity, 316

B
background knowledge, 27
basic probability assignment, 104
Bayes' rule, 96
Bayesian network, 99, 278
belief measure, 105
belief network. See Bayesian network
bias, 209
bias/variance dilemma, 209
bijective transformation, 119
body of a rule, 80
body of evidence, 110
bootstrap procedure, 167
Box and Jenkins gas furnace problem, 301
bpa. See basic probability assignment
Bristol image database, 283, 285

C
C4.5, 152. See ID3
car parking problem. See parking problem
Cartesian granule, 180
Cartesian granule features, 179, 243, 245
  approximate reasoning, 194
  definition, 180
  fuzzy logic, 195
  fuzzy set, 181
  fuzzy set induction, 203
  product models, 194
  rules, 193
Cartesian granule fuzzy set, 61, 181, 190, 203
F
fuzziness, 38
fuzzy C-means, 258
fuzzy complement, 54
fuzzy decision making. See defuzzification
fuzzy implication, 81
fuzzy inference, 76
  conjunction based, 83
  implication based, 81
fuzzy integrals, 159, 223
fuzzy interval, 40
fuzzy logic, 67
  applications, 89
  defuzzification, 85
  inference, 76
  learning, 159
fuzzy measures, 159
fuzzy modifiers, 68. See linguistic hedges
fuzzy mutually exclusive partition, 70
fuzzy non-mutually exclusive partition, 70
fuzzy number, 40, 45, 131
fuzzy partition, 69
  fuzzy mutually exclusive partition, 69, 70
  linguistic partition, 71
fuzzy patch, 78
fuzzy predicate, 68
fuzzy probabilities, 68
fuzzy relation, 58, 80, 81
  cylindrical extension, 60
  projection, 59
Fuzzy Relational Inference Language, 129
fuzzy set, 38
  alpha-cut, 43
  averaging operators, 54
  cardinality, 45
  complement, 47
  core, 43
  degree of membership, 38
  example, 39
  generalisations, 57
  height, 44
  interpretations, 41
  intersection, 47
  involutive complement. See involutive complement
  matching, 56
  membership, 38
  membership function, 38
  normal, 44
  normalisation, 44
  notation, 40
  operations, 47
  possibility theory, 113
  properties, 43
  representation, 45
  semantic unification, 116
  support, 43
  transformation to probability distributions, 119
  trapezoidal, 73
  triangular, 73
  type-1 fuzzy set, 62
  type-2 fuzzy sets, 62
  union, 47
  voting model interpretation, 42
fuzzy set theory, 35, 37
  learning, 159
  motivations, 37
fuzzy truth values, 68

G
G_DACG, 200
  applications, 241, 281
  chromosome structure, 210
  detailed example, 226
  for classification problems, 202
  for prediction problems, 204
  wrapper-based, 208
G_DACG Algorithm. See G_DACG
Gabor filters, 288
generalisation, 162, 164, 183
generalised modus ponens, 78
generalised modus tollens, 80
generation gap, 214
genetic algorithms, 157, 315
genetic programming, 157, 210, 214, 315
glassbox models. See model transparency
granularity, 73, 210, 243
granule, 180
granule characterisation, 215
granule fuzzy set, 61
H
head of rule, 80
holdout estimate, 167
human learning, 145
hypothesis. See model
hypothesis language, 24

K
knowledge
  taxonomy, 27
knowledge stability, 27, 186
k-tournament selection, 213, 317

L
L problem, 273
language identification, 200, 205, 208

P
PAC learning, 169
parameter identification, 201, 202, 217, 218, 222
parking problem, 6, 152, 154, 155, 162
parsimony, 211
partition, 69
percentile-based partitions, 257
performance, 166
pignistic distribution, 109, 116

Q
Q-learning, 160
QL-implications, 82
qualified propositions, 68

R
regression. See prediction
reinforcement learning, 159
relation, 58
RELIEF, 207, 291
S
  a taxonomy of approaches, 156
support logic, 129
support pairs, 130
support vector machines, 159
symbolic learning, 156

T
Takagi-Sugeno-Kang model, 85
t-conorm, 51
  non-parameterised, 53

W
weighted generalised means, 55
weights identification, 218
wrapper-based feature selection, 208

Z
Zadeh implication, 82