
THE KLUWER INTERNATIONAL SERIES
IN ENGINEERING AND COMPUTER SCIENCE


SOFT COMPUTING FOR
KNOWLEDGE DISCOVERY
INTRODUCING CARTESIAN GRANULE FEATURES

JAMES G. SHANAHAN
Xerox Research Centre Europe (XRCE)
Grenoble Laboratory
6 chemin de Maupertuis
Meylan 38240, France
James.Shanahan@xrce.xerox.com
http://www.xrce.xerox.com/~shanahan/kdbook/

Springer Science+Business Media, LLC


Library of Congress Cataloging-in-Publication

Shanahan, James G.
Soft computing for knowledge discovery : introducing Cartesian granule features /
James Shanahan.
p. cm. - (The Kluwer international series in engineering and computer science; SECS 570)
Includes bibliographical references and index.
ISBN 978-1-4613-6947-9    ISBN 978-1-4615-4335-0 (eBook)
DOI 10.1007/978-1-4615-4335-0
1. Soft computing. 2. Database searching. I. Title. II. Series.

QA76.9 .S63 S53 2000


006.3--dc21
00-056160

Copyright © 2000 Springer Science+Business Media New York


Originally published by Kluwer Academic Publishers in 2000

All rights reserved. No part of this publication may be reproduced, stored in a retrieval system or
transmitted in any form or by any means, mechanical, photo-copying, recording, or otherwise,
without the prior written permission of the publisher, Springer Science+Business Media, LLC.

Printed on acid-free paper.


NOTE TO THE READER

The following webpage has been developed in conjunction with this book:
http://www.xrce.xerox.com/~shanahan/kdbook/. This page provides access to
additional information related to the material presented in this book, various
pedagogical aids, datasets, source code for several algorithms described in this book, an
online bibliography and pointers to other World Wide Web related resources.
To the memory of my parents
Jimmy and Mary
and to my dear friend and mentor
Anca Ralescu.
FOREWORD
Publication of "Soft Computing for Knowledge Discovery: Introducing Cartesian
Granule Features", or KD_CGF for short, is an important event in the development of a
better understanding of human reasoning in an environment of imprecision, uncertainty
and partial truth. It is an important event because KD_CGF is the first and, so far, the
only book to focus on granulation as one of the most basic facets of human cognition, a
facet that plays a pivotal role in knowledge discovery (KD). The author of KD_CGF,
Dr. James Shanahan, has been, and continues to be, in the forefront of research on
computationally-oriented approaches to knowledge discovery, approaches centering on
soft computing rather than on the more traditional methods based on probability theory
and statistics.

Let me elaborate on this point. During the past several years, the ballistic ascent in the
importance of the Internet has propelled knowledge discovery to a position that is far
more important than it had in the past. And yet, much of the armamentarium of
knowledge discovery consists of methods drawn for the most part from probability
theory and statistics. The leitmotif of KD_CGF is that the armamentarium of KD
should be broadened by drawing on the resources of soft computing, which is a
consortium of methodologies centering on fuzzy logic, neurocomputing, evolutionary
computing, probabilistic computing, chaotic computing and machine learning. More
precisely, concepts are captured in terms of Cartesian granule fuzzy sets (where the
underlying granules are represented using fuzzy sets, i.e. f-granular) that are
incorporated into fuzzy logic rules or probabilistic rules. Learning of such models is
achieved using probability theory and genetic programming.

Successes of probability theory have high visibility. So what is the rationale for
moving beyond the confines of traditional probability-based methods?

What is not widely recognized is that successes of probability theory mask a
fundamental limitation, namely the inability to operate on what may be called
perception-based information. Such information is exemplified by the following.
Assume that I look at a box containing balls of various sizes and form the perceptions:
(a) there are about twenty balls; (b) most are large; and (c) a few are small. The
question is: What is the probability that a ball drawn at random is neither large nor
small? Probability theory cannot answer this question because there is no mechanism
within the theory to represent the meaning of perceptions in a form that lends itself to
computation. The same problem arises in the example: Usually Robert returns from
work at about 6 pm. What is the probability that Robert is home at 6:30 pm?

A concept that plays a particularly important role in soft computing is that of
granulation. In KD_CGF, granulation is a cornerstone of the foundations of knowledge
discovery.

In a broad sense, granulation involves a decomposition of the whole into parts. More
specifically, granulation of an object A results in a collection of granules of A, with a
granule being a clump of objects (or points) which are drawn together by
indistinguishability, similarity, proximity or functionality.

Granulation is ubiquitous because it reflects a fundamental limitation on the ability of
the human mind to resolve detail and store information. Furthermore, granulation may
be viewed as a way of exploiting the tolerance for imprecision to achieve tractability,
robustness and low solution cost. In this sense, information granulation may be viewed
as a form of lossy data compression.

Modes of information granulation in which the granules are crisp (c-granular) play
important roles in a wide variety of methods, approaches and techniques. Among them
are: interval analysis, quantization, rough set theory, diakoptics, divide and conquer,
Dempster-Shafer theory, machine learning from examples, chunking, qualitative
process theory, qualitative reasoning, decision trees, semantic networks, analog-to-
digital conversion, constraint programming, Prolog, cluster analysis and many others.

Important though it is, crisp information granulation has a major blind spot. More
specifically, it fails to reflect the fact that in much, perhaps most, of human reasoning
and concept formation the granules are fuzzy (f-granular) rather than crisp. In the case
of a human body, for example, the granules are fuzzy in the sense that the boundaries of
the head, neck, arms, legs, etc. are not sharply defined. Furthermore, the granules are
associated with fuzzy attributes, e.g., length, color and texture in the case of hair. In
turn, granule attributes have fuzzy values, e.g., in the case of the fuzzy attribute
length(hair), the fuzzy values might be short, long, very long, etc. The fuzziness of
granules, their attributes and their values is characteristic of the ways in which human
concepts are formed, organized and manipulated. In particular, human perceptions are,
for the most part, f-granular. A point of importance is that f-granularity of perceptions
precludes the possibility of representing their meaning through the use of conventional
methods of knowledge representation.

Fuzzy information granulation has a position of centrality in fuzzy logic. This is the
reason why fuzzy sets and fuzzy logic are treated at length in KD_CGF. But, to
maintain balance, KD_CGF also contains succinct and insightful expositions of
probabilistic computing, evolutionary computing and parts of machine learning theory.
The broad coverage of KD_CGF has the effect of greatly enhancing the capability of
knowledge discovery techniques to come to grips with the complexity of real-world
problems in which decision-relevant information is a mixture of measurements and
perceptions.

Dr. Shanahan's experience in industry has made it possible for him to include in
KD_CGF a chapter dealing with a variety of applications of knowledge discovery tools
based on soft computing and information granulation. The wealth of information
provided in KD_CGF - presented with high expository skill and attention to detail -
makes Dr. Shanahan's book an invaluable resource for anyone who is interested in
applying KD techniques to real-world problems. The author and the publisher deserve
our thanks and congratulations.

Lotfi A. Zadeh
Berkeley, CA
PREFACE

In the age of the Internet, ubiquitous computing and data warehouses, society faces the
challenge of dealing with an ever-increasing data flood. Knowledge discovery is an
area of computer science that attempts to exploit this data flood by uncovering
interesting and useful patterns in these data that permit a computer to perform a task
autonomously or that assist a human to perform a task more successfully or efficiently.
In recent years knowledge discovery has been applied in many fields of business,
engineering and science leading to interesting and useful applications, ranging from
systems that detect fraudulent credit card transactions, to information filtering systems
that learn users' reading preferences, to medical systems that predict the mutagenicity
of chemical compounds. At the same time, there have been important advances in the
theory and algorithms that form the foundation of this field.

The primary goal of this book is to present a self-contained description of the key
theory and algorithms that form the core of knowledge discovery from a soft computing
perspective. Knowledge discovery is inherently interdisciplinary, drawing on concepts
and results from many fields, including artificial intelligence, machine learning, soft
computing, information theory and cognitive science. This book introduces these
concepts, providing a highly readable and systematic exposition of knowledge
representation, machine learning, and the key methodologies that make up the fabric of
soft computing - fuzzy set theory, fuzzy logic, evolutionary computing, and various
theories of probability (point-based approaches such as naïve Bayes and Bayesian
networks, and set-based approaches such as Dempster-Shafer theory and mass
assignment theory).

A secondary goal of this book is to present state-of-the-art soft computing approaches


to knowledge discovery. Along with describing well known approaches, Cartesian
granule features and corresponding learning algorithms are also introduced as a new
and intuitive approach to knowledge discovery. This new approach embraces the
synergistic spirit of soft computing, exploiting uncertainty, imprecision in this case, in
order to achieve tractability and transparency on the one hand and generalisation on the
other. In doing so it addresses some of the shortcomings of existing approaches such as
decomposition error and performance-related issues such as transparency, accuracy and
efficiency. Parallels are drawn between this approach and other well known approaches
(such as naïve Bayes, decision trees) leading to equivalences under certain conditions.

The approaches presented in this book are further illustrated on a battery of both
artificial and real world problems. Knowledge discovery in real world problems such as
object recognition in outdoor scenes, medical diagnosis and control is described in
detail. These case studies provide a deeper understanding of how to apply the presented
concepts and algorithms to practical problems.

Furthermore, the following webpage has been developed in conjunction with this book:
http://www.xrce.xerox.com/~shanahan/kdbook/. This page provides access to
additional information related to the material presented in this book, pedagogical aids,

datasets, source code for several algorithms described in this book, an online
bibliography and pointers to other World Wide Web related resources.
The book is divided into five main parts and an appendix:

• Part I (Chapter 1) provides a general introduction to the subject of knowledge


discovery (KD) and highlights some of the limitations of current approaches.
• Part II (Chapters 2, 3, 4, 5, and 6) begins by (Chapter 2) introducing the key
components of knowledge representation and outlines the desiderata of
knowledge representation from a knowledge discovery perspective. In
addition, it describes in detail various soft computing approaches to
knowledge representation. Chapter 3 presents the fundamental ideas of fuzzy
set theory. Chapter 4 introduces fuzzy logic as the basis for a collection of
techniques for representing knowledge in terms of natural language like
sentences and as a means of manipulating these sentences in order to perform
inference using reasoning strategies that are approximate rather than exact.
Chapter 5 describes various language-like approaches of representing
uncertainty and imprecision including point-based probability theory, set-
based probabilistic approaches such as Dempster-Shafer theory, possibility
theory, and mass assignment theory. Formal links between these theories and
fuzzy set theory are also presented. These links form the basis for the learning
algorithms proposed in Chapter 9. Chapter 6 details a development
environment that supports the aforementioned forms of knowledge
representation.
• Machine learning algorithms play a key role in enabling the discovery of
patterns in datasets. Part III (Chapter 7) introduces the basic architecture for
learning systems and its components. It provides an overview of the three
broad categories of machine learners, namely, supervised learners,
reinforcement learners and unsupervised learners, which are supplemented
with a taxonomy of associated learning algorithms. Popular induction
algorithms including the C4.5 decision tree induction algorithm, the naïve
Bayes classifier induction algorithm and the fuzzy data browser are also
described.
• The main focus of Part IV (Chapters 8 and 9) is to introduce Cartesian granule
features as a new form of knowledge representation and corresponding
learning algorithms. This approach addresses some of the shortcomings of
other knowledge discovery techniques. Chapter 8 describes Cartesian granule
features, and shows how fuzzy sets and probability distributions can be
defined over these features and how these can be incorporated into both fuzzy
logic and probabilistic models. Chapter 9 describes learning algorithms for
Cartesian granule feature models for both classification and prediction
problems.
• Part V (Chapters 10 and 11) shifts attention to applications of Cartesian
granule features within the more general context of knowledge discovery.
Chapter 10, for the purposes of illustration and analysis, applies this approach
to artificial problems in both classification and prediction. Chapter 11 focuses
on the knowledge discovery of Cartesian granule feature models in the real
world domains of computer vision, diabetes diagnosis and control, while also
comparing this approach with other techniques such as neural networks,
decision trees, naïve Bayes and various fuzzy approaches. Chapter 11 finishes

with some views on what the future may hold for knowledge discovery in
general and for Cartesian granule features in particular.
• The Appendix gives an overview of evolutionary computation.

In addition, each chapter comes with an extensive bibliography.

Target Audience
Because of the interdisciplinary nature of the material, this book makes few
assumptions about the background of the reader. Instead, it introduces basic concepts
from artificial intelligence, probability theory, fuzzy set theory, fuzzy logic, machine
learning, and other disciplines as the need arises, focusing on just those concepts most
relevant for knowledge discovery. The book is intended for advanced undergraduate
and graduate students, as well as a broad audience of professionals and researchers in
computer science, engineering and business information systems who have an interest
in the dynamic fields of knowledge discovery and soft computing.

Acknowledgements
Like all books this too has behind it a story that represents a journey, both physical and
intellectual, which can be traced back to November 1992, when I attended JKAW
(Japanese Knowledge Acquisition Workshop) in Kobe, Japan. As a result of various
discussions at this workshop and ensuing conversations with Anca Ralescu, I became
interested and eventually came to work in soft computing in general, and fuzzy systems
in particular. Ever since, Anca has been a brilliant source of not just inspiration,
encouragement and knowledge but of friendship. I am eternally grateful to her for
providing me with the opportunity to work at LIFE (Laboratory for International Fuzzy
Engineering) in Yokohama, Japan. She has also been an excellent "mentor". Without
her encouragement, in many ways, I wouldn't have started this "journey". Domo
arigato Anca.

A lot of the work presented in this book was accomplished while I was at the
University of Bristol, where I benefited from the inspiration, vision, direction and
enthusiasm provided by Jim Baldwin for which I am deeply grateful. Trevor Martin has
also played a key role in this work, providing much support and direction. Other
members of the crew at the Department of Engineering Mathematics, University of
Bristol also provided much inspiration, explanation and an ambient research
environment, especially Mario DiBernardo, Simon Case, Simeon Earl, Carla Hill,
Martin Homer, Jonathan Lawry, Nigel Mottram, Bruce Pilsworth, Christiane Ponsan,
Jonathan Rossiter, Mehreen Saeed, Athena Tocatlidou and Patrick Woods. This was
paralleled by the crew at the Department of Computer Science, University of Bristol. A
special thanks to Neil Campbell, Angus Clark, Mark Everingham, Dave Gibson, Claire
Kennedy, Katerina Mania, Ann M'Namara, Majid Mirmehdi, Jackson Pope, Erik
Reinhard and Barry Thomas.

The work presented in this book was partially funded by the University of Bristol under
a Scholarship Award, by the European Community through a Training and Mobility of
Researchers' grant (Marie Curie Fellowship) and by the DERA (UK) under grant
92W69.

I express my sincere thanks to the management at Xerox Research Centre Europe
(XRCE) for providing me with the precious resources of time and computing in
completing part of the work presented here and also in writing this book. Christer
Fernstrom (Coordination Technology's area manager, XRCE) as always, encouraged,
and asked "Is it finished yet?" at just the right times. Gregory Grefenstette (principal
scientist, XRCE) provided much needed direction and enthusiasm right through the
writing of this book. I thank Roger Mohr (Laboratory Director, XRCE Grenoble) for
his constant support and many interesting discussions regarding the proposed
approaches.

As the founder of fuzzy set theory over thirty years ago, and more recently as the
originator of the concept of soft computing, Professor Lotfi Zadeh continues to be the
source for momentum, direction and inspiration in this highly dynamic field of systems
modelling. His foresight and ingenuity have been proven time and time again over the
years. More recently, his ideas on information granulation, computing with words, and
on computational approaches to perception have inspired most of the work presented in
this book.

Next I thank Professor Toshiro Terano, Hosei University, Japan (former director of
LIFE) who was also instrumental in providing me with the opportunity to work at
LIFE. His sage words and philosophy of science and engineering have inspired and
guided me through my research.

My fellow researchers at the Image Understanding Group (LIFE), and LIFE - "the olde
boys" - and the IU group advisors, including Hirota-sensei (Tokyo Institute of
Technology), Asada-sensei (Osaka University), Minoh-sensei (Kyoto University)
deserve special mention as they provided me with a solid background not only in fuzzy
systems and image understanding but also in doing research in general.

I owe a lot to the people at Mitsubishi, in Tokyo, Akita and Naoshima who provided a
very stimulating environment in knowledge based systems especially Mr. Y. Abe
(CSD), Dr. K. Nishimura (CSD), Mr. Yanagisawa (A.I. Group, MHI), Mr. S. Tsuchino,
Mr. Y. Matsuno, Mr. (Tobi) Ishitobi (all from the Knowledge Industry Centre,
Mitsubishi Materials Corporation).

The writing of this book, like knowledge discovery, has drawn upon the expertise of
many technical experts in the sub-disciplines that make up the field. It became a reality
because of their help. I am deeply indebted to the following people (whom I try to list
in geographical order beginning from Grenoble) who took time out to review chapter
drafts or to provide other technical support:

Nicola Cancedda, Boris Chidlovskii, Andreas Eisele, Natalie Glance, Gregory
Grefenstette, Antonietta Grasso, Irene Maxwell, Jean-Luc Meunier,
Christopher Thompson (my colleagues at Xerox Research Centre Europe,
Grenoble Lab, France), Jim Baldwin, Jon Lawry, Trevor Martin, Bruce
Pilsworth, (A. I. Group, Department of Engineering Maths, University of
Bristol, UK), Neil Campbell, Angus Clark, Mark Everingham, Majid
Mirmehdi, Barry Thomas (Computer Vision Group, Department of Computer

Science, University of Bristol, UK), Chris Hinde (Loughborough
University), Athena Tocatlidou (University of the Aegean, Greece), and Anca
Ralescu (University of Cincinnati, USA).

I am very grateful to the following who have helped in proofreading this book: Clare
Dickinson, Lucy J. Jobson (who both have the privilege of reading the whole book),
Andrew Poulter, Ken Brown, Martin Blackledge, Viktoria Ljungqvist, Samantha Stern
and Wendy Yeo.

Alex Greene, Kluwer Academic Publishers, provided encouragement and expertise in
all phases of this project. I thank Patricia Lincoln, Kluwer Academic Publishers, for her
patience and her many solutions during the final editing of this book.

I would also like to acknowledge the many useful comments provided by anonymous
referees of workshop, conference and journal papers in which the results synthesised in
this book were first reported.

I thank the instructors and students who have field tested some of the chapters in this
book and who have contributed their suggestions.

On a personal side, I thank (posthumously) my parents, Jimmy and Mary, for providing
me with a loving and supporting family that includes my brothers - The Shanahan Boys
- John, Michael, Timothy, Patrick and Thomas, my sisters-in-law Ann and Eibhlís, my
grandparents, my aunts, my uncles and cousins. A special thanks goes to my
Godmother, Auntie Margaret, and grannie McNamara who have always been there for
me. Viktoria Ljungqvist deserves many thanks for her encouragement and patience. A
special thanks to my friends, far and near, for their unconditional support and
friendship. It would be dangerous to draw up a list.

James G. Shanahan
TABLE OF CONTENTS

NOTE TO THE READER .......................................................................................... V

FOREWORD ............................................................................................................... IX

PREFACE .................................................................................................................... XI

PART I ........................................................................................................................... 1

1 KNOWLEDGE DISCOVERY ........................................................................... 3


1.1 KNOWLEDGE DISCOVERY BACKGROUND AND HISTORY ..................................... 5
1.2 THE KNOWLEDGE DISCOVERY PROCESS ............................................................. 6
1.2.1 A simple illustrative example problem .................................................... 6
1.2.2 Knowledge discovery ............................................................................... 6
1.3 SUCCESSES OF KNOWLEDGE DISCOVERY ......................................................... 12
1.4 CARTESIAN GRANULE FEATURES IN BRIEF ..................................................... 13
1.4.1 Soft computing for knowledge discovery ............................................... 13
1.5 THE STRUCTURE OF THIS BOOK ........................................................................ 14
1.6 SUMMARY ....................................................................................................... 17
1.7 BIBLIOGRAPHY ................................................................................................ 17
PART II ........................................................................................................................ 21

2 KNOWLEDGE REPRESENTATION ............................................................ 23


2.1 REPRESENTATION OF OBSERVATIONS AND HYPOTHESES ................................. 24
2.2 GENERAL PURPOSE KNOWLEDGE - INFERENCE AND DECISION MAKING ........... 25
2.3 UNCERTAINTY AND KNOWLEDGE REPRESENTATION ......................................... 25
2.4 DESIDERATA OF KNOWLEDGE REPRESENTATION ............................................. 26
2.5 A TAXONOMY OF KNOWLEDGE REPRESENTATION ........................................... 27
2.5.1 Symbolic-based approaches .................................................................. 28
2.5.2 Probability-based approaches............................................................... 28
2.5.3 Fuzzy-based approaches ....................................................................... 29
2.5.4 Mathematical-based approaches........................................................... 30
2.5.5 Prototype-based approaches ................................................................. 31
2.6 SUMMARY ....................................................................................................... 32
2.7 BIBLIOGRAPHY .......................................................•........................................ 32
3 FUZZY SET THEORY ..................................................................................... 35
3.1 CLASSICAL SET THEORY .................................................................................. 35
3.2 FUZZY SET THEORY ......................................................................................... 37
3.2.1 Motivations ............................................................................................ 37
3.2.2 Fuzzy sets............................................................................................... 38
3.2.3 Notation convention .............................................................................. 40
3.2.4 Interpretations of fuzzy sets ................................................................... 41

3.3 PROPERTIES OF FUZZY SETS •....•......................................•........................•....... 43


3.4 REPRESENTATION OF FUZZY SETS ...........•............•....•........•.........•..•..............•. 45
3.5 FUZZY SET OPERATIONS ................................................................................... 47
3.5.1 Axiomatic-based operators - t-norms and t-conorms ........................... 50
3.5.2 Averaging operators .............................................................................. 54
3.5.3 Compensative operators ........................................................................ 55
3.6 MATCHING FUZZY SETS ...•...................•......•.........•.•.•...........•.....•....•.••••.......... 56
3.7 GENERALISATIONS OF FUZZY SETS •••........•......•.••............................................ 57
3.7.1 Multi dimensional fuzzy sets .................................................................. 57
3.7.2 Cartesian granule fuzzy sets .................................................................. 61
3.7.3 Higher order fuzzy sets .......................................................................... 61
3.8 CHOOSING MEMBERSHIP FUNCTIONS ................................................................ 63
3.9 SUMMARY .•..............•.....•..•••.•.•.........•........•.........•.....•.•••.•......••......•...•..•••.•.... 64
3.10 BIBLIOGRAPHY ...•....•.......•••.••..••.•.••...•.......••••......•.•.....••.•••.....••••.....••......•.•.•.•. 64
4 FUZZY LOGIC ................................................................................................. 67
4.1 FUZZY RULES AND FACTS ................................................................................. 67
4.1.1 Linguistic partitions, variables and hedges........................................... 69
4.1.2 Linguistic hedges................................................................................... 76
4.2 FUZZY INFERENCE ............................................................................................ 76
4.2.1 Compositional rule of inference (CRI) .................................................. 77
4.3 FUZZY DECISION MAKING FOR PREDICTION - DEFUZZIFICATION ....................... 85
4.3.1 Centre of gravity (COG) method........................................................... 86
4.3.2 Maximum height method ....................................................................... 87
4.4 FUZZY DECISION MAKING FOR CLASSIFICATION ................................................ 87
4.5 APPLICATIONS OF FUZZY LOGIC ........................................................................ 89
4.6 SUMMARY ...................••..........••..•................................••...........................••..•. 89
4.7 BIBLIOGRAPHY •....•..••.•.•.....••••..•.••••...............••...•.........•.•.•.......................•.•..•• 89
5 PROBABILITY THEORY ............................................................................... 93
5.1 FUNDAMENTALS OF PROBABILITY THEORY ....................................................... 94
5.2 POINT-BASED PROBABILITY THEORY ............................................................... 97
5.2.1 Joint probability distributions ............................................................... 97
5.2.2 Naive Bayes ........................................................................................... 98
5.2.3 Bayesian networks ................................................................................. 99
5.3 SET-BASED PROBABILITY THEORY .•...............•............•.•••.•..............•........•.... 102
5.3.1 Dempster-Shafer theory ...................................................................... 103
5.3.2 Possibility theory ................................................................................. 109
5.3.3 Mass assignment theory ...................................................................... 113
5.4 FROM FUZZY SETS TO PROBABILITY DISTRIBUTIONS ...................................... 118
5.4.1 Transforming fuzzy sets into probability distributions ........................ 119
5.4.2 From memberships to probabilities - a voting model justification ..... 123
5.4.3 Zadeh's probability of fuzzy events ..................................................... 124
5.5 SUMMARY ....•...........•...................•.....................•..............•......................•.... 125
5.6 BIBLIOGRAPHY ........•....•......•.•..•..•...•...............•.•.••.••..........•.•........................ 126
6 FRIL - A SUPPORT LOGIC PROGRAMMING ENVIRONMENT ........ 129
6.1 FRIL RULES AND FACTS .................................................................................. 129
6.1.1 Conjunctive rule .................................................................................. 131

6.1.2 Evidential logic rule ............................................................................ 131


6.1.3 Causal relational rule ......................................................................... 132
6.2 INFERENCE.......••.............•....•...............•.........•......•.........•...•..............•.•.•....... 133
6.2.1 Inference at the body proposition level ............................................... 134
6.2.2 Inference at the rule body level ........................................................... 135
6.2.3 Inference at the rule level .................................................................... 135
6.3 DECISION MAKING •.•.•...••..•....••....•........•..•.•....••.............................•.............•. 137
6.4 SUMMARY ............•.•.........•...................••..•.................•..................•....•....•...•. 138
6.5 BIBLIOGRAPHY .........•••............................•.........................•.....•••.•.....•........... 139
PART III ..................................................................................................................... 141

7 MACHINE LEARNING ................................................................................. 143


7.1 HISTORY OF MACHINE LEARNING .................................................................... 143
7.2 HUMAN LEARNING .......................................................................................... 145
7.3 MACHINE LEARNING ....................................................................................... 147
7.4 CATEGORIES OF MACHINE LEARNING .............................................................. 148
7.5 SUPERVISED LEARNING ................................................................................... 149
7.5.1 Learning to recognise handwritten characters.................................... 151
7.5.2 Examples of supervised learning algorithms....................................... 151
7.5.3 A taxonomy of supervised learning algorithms ................................... 156
7.6 REINFORCEMENT LEARNING .........•..•...•..........•.•...•....................•................... 159
7.6.1 Popular reinforcement learning algorithms ........................................ 160
7.7 UNSUPERVISED LEARNING ....................•...••................•..............•.........•.•....... 160
7.7.1 Clustering and discovery algorithms ................................................... 161
7.8 COMPONENTS OF INDUCTIVE LEARNING ALGORITHMS ..•.•..............•..............• 162
7.8.1 Learning through inductive generalisation ......................................... 162
7.8.2 Generalisation as search ..................................................................... 164
7.8.3 Performance measures ........................................................................ 166
7.8.4 Knowledge representation ................................................................... 168
7.8.5 Inductive bias ...................................................................................... 168
7.9 COMPUTATIONAL LEARNING THEORY .............................................................. 169
7.10 GOALS AND ACCOMPLISHMENTS OF MACHINE LEARNING ................................ 169
7.11 SUMMARY ....................................................................................................... 170
7.12 BIBLIOGRAPHY ................................................................................................ 170
PART IV ..................................................................................................................... 177

8 CARTESIAN GRANULE FEATURES ........................................................... 179


8.1 CARTESIAN GRANULE FEATURES ...........•..••...................•....•......•.....•.........•... 179
8.1.1 Why Cartesian granule features? ........................................................ 182
8.1.2 Other usages of Cartesian granules .................................................... 186
8.2 CHOICE OF COMBINATION OPERATOR ............•......•..................•..................... 186
8.2.1 Generating Cartesian granule fuzzy sets via fuzzy approaches........... 187
8.2.2 Generating Cartesian granule fuzzy sets via probability theory ......... 190
8.3 CARTESIAN GRANULE FEATURE RULES .......................................................... 193
8.4 APPROXIMATE REASONING USING CARTESIAN GRANULE FEATURE MODELS ..... 194
8.5 CARTESIAN GRANULE FEATURES AND FUZZY LOGIC .•.....•.•.................•.......... 195
8.6 SUMMARY .......•.............••..•..........................•.......•..............................•......... 196

8.7 BIBLIOGRAPHY ...........•.............................................................................•.... 196


9 LEARNING CARTESIAN GRANULE FEATURE MODELS .................. 199
9.1 LEARNING USING THE G_DACG ALGORITHM ....•.......................................... 199
9.1.1 G_DACG Algorithm ............................................................................ 202
9.1.2 Learning Cartesian granule feature fuzzy sets from data .................... 203
9.1.3 Cartesian granule fuzzy set induction example ................................... 203
9.1.4 G_DACG algorithm from a prediction perspective............................. 204
9.2 FEATURE DISCOVERY ................•.......................•...................•....................... 205
9.2.1 Feature selection and discovery .......................................................... 206
9.3 FEATURE DISCOVERY IN THE G_DACG ALGORITHM ....................................... 208
9.3.1 Chromosome structure ........................................................................ 210
9.3.2 Fitness ................................................................................................. 210
9.3.3 Modified crossover and mutation ........................................................ 213
9.3.4 Reproduction ....................................................................................... 213
9.3.5 Feature discovery algorithm in G_DACG........................................... 214
9.3.6 Generating linguistic partitions .......................................................... 216
9.4 PARAMETER IDENTIFICATION IN G_DACG •...................•....•......................•.. 217
9.5 PARAMETER OPTIMISATION IN G_DACG ...•.................................................. 218
9.5.1 Feature weights identification using Powell's algorithm .................... 218
9.5.2 Filter identification using Powell's algorithm .................................... 219
9.6 A MASS ASSIGNMENT-BASED NEURO-FUZZY NETWORK ................................... 222
9.7 A DETAILED EXAMPLE RUN OF G_DACG ........................................................ 226
9.7.1 Ellipse classification problem ............................................................. 226
9.7.2 Using G_DACG to learn ellipse classifiers ......................................... 226
9.8 DISCUSSION ...................................................................•............................... 232
9.9 SUMMARY ...................................................•.•.••...............................•...........• 234
9.10 BIBLIOGRAPHY .................................................................................•............ 235
PART V ...................................................................................................................... 239

10 ANALYSIS OF CARTESIAN GRANULE FEATURE MODELS ..................... 241


10.1 EXPERIMENT VARIABLES AND ANALYSIS ....................................................... 241
10.2 ELLIPSE CLASSIFICATION PROBLEM ............................................................... 243
10.2.1 An example of ACGF modelling for the ellipse problem ..................... 243
10.2.2 Ellipse classification using 2D Cartesian granule features ................ 245
10.2.3 Data centred Cartesian granule features ............................................ 256
10.2.4 A G_DACG run on the ellipse problem ............................................... 260
10.2.5 Ellipse results comparison .................................................................. 261
10.2.6 Ellipse problem discussion and summary ........................................... 263
10.3 SIN(X * Y) PREDICTION PROBLEM ................................................................. 265
10.3.1 ACGF modelling of the Sin(X * Y) problem ........................................ 266
10.3.2 A Comparison with other inductive learning techniques..................... 269
10.3.3 Sin (X * Y) problem discussion and summary ...................................... 271
10.4 WHY DECOMPOSED CARTESIAN GRANULE FEATURE MODELS? ...................... 272
10.4.1 L classification problem ...................................................................... 273
10.5 OVERALL DISCUSSION ................................................................................... 277
10.6 SUMMARY AND CONCLUSIONS ........................................................................ 278
10.7 BIBLIOGRAPHY .............................................................................................. 279

11 APPLICATIONS ............................•..•...........................•.......................•..••••... 281


11.1 REGION CLASSIFICATION IN IMAGE UNDERSTANDING .................................... 281
11.1.1 Motivations .......................................................................................... 282
11.1.2 Knowledge discovery in image understanding .................................... 283
11.1.3 Vision problem description .................................................................. 284
11.1.4 Vision dataset ...................................................................................... 285
11.1.5 Description of region features ............................................................. 286
11.1.6 Region datasets ................................................................................... 289
11.1.7 ACGF modelling of the vision problem ............................................... 292

11.1.8 Vision problem results comparison ..................................................... 295


11.1.9 Vision problem conclusions ................................................................. 295
11.2 MODELLING PIMA DIABETES DETECTION PROBLEM ....................................... 296
11.2.1 ACGF modelling of Pima diabetes problem ........................................ 296
11.2.2 Pima diabetes problem results comparison ......................................... 299
11.3 MODELLING THE BOX-JENKINS GAS FURNACE PROBLEM .....................•......... 301
11.3.1 ACGF modelling of the gas furnace problem ...................................... 301
11.3.2 Gas furnace results comparison .......................................................... 303
11.4 MODELLING THE HUMAN OPERATION OF A CHEMICAL PLANT CONTROLLER .. 303
11.4.1 ACGF modelling of the chemical plant problem ................................. 304
11.4.2 Chemical Plant Results Comparison ................................................... 305
11.5 DISCUSSION ................................................................................................... 305
11.6 GENERAL CONCLUSIONS ................................................................................ 308
11.7 CURRENT AND FUTURE WORK DIRECTIONS .................................................... 308
11.8 SUMMARY ....................................................................................................... 310
11.9 BIBLIOGRAPHY .............................................................................................. 310
APPENDIX: EVOLUTIONARY COMPUTATION ............................................. 315

GLOSSARY OF MAIN SYMBOLS ........................................................................ 319

SUBJECT INDEX ..................................................................................................... 321


PART I
KNOWLEDGE DISCOVERY

The chapter in this part provides a general introduction to the subject of knowledge
discovery (KD). In addition, it briefly describes a new knowledge discovery process
centred on Cartesian granule features and corresponding learning algorithms (an
approach which integrates various methodologies from soft computing, such as
evolutionary computation, fuzzy set theory, and probability theory). This approach, its
supporting soft computing methodologies, and other popular approaches to knowledge
discovery are presented in detail and compared in the remainder of this book. A
road map of this presentation is provided at the end of Chapter 1.
PART II
KNOWLEDGE REPRESENTATION

Knowledge representation is primarily concerned with accommodating the expression
of problem domain knowledge in a computer-tractable form that allows accurate
modelling of the domain (i.e. the solving of a problem or task to a useful level of
performance), and in a fashion that is amenable to human understanding (not always
required). From a knowledge discovery perspective, the type of knowledge
representation selected determines the nature of learning: it determines the type of
learning; what can be learned; when it can be learned (one-shot or incremental); how
long it takes to learn; the type of experience required to learn. This part of the book
presents an overview of knowledge representation (Chapter 2) and describes in detail
the following soft computing approaches to knowledge representation: fuzzy sets
(Chapter 3); fuzzy logic (Chapter 4); probability theory (Chapter 5); and Fril, a
development environment that supports the aforementioned forms of knowledge
representation (Chapter 6).
CHAPTER 1
KNOWLEDGE DISCOVERY
We are drowning in information, but starving for knowledge.
- John Naisbett
An ounce of knowledge is worth a ton of data.
- Brian R. Gaines, 1989

Since the introduction of the first operational modern computer (Heath Robinson) in
1940 by Alan Turing's team, scientists and engineers have tried, with varying degrees
of success, to increase its usefulness to mankind through the development of systems
with high MIQ (Machine Intelligence Quotient) [Zadeh 1994b]. This desire to increase
the computers' usefulness to mankind has led to the birth of many computer-related
disciplines. One such discipline is knowledge discovery (KD) whose main emphasis is
on using algorithms that exploit computational power and resources to automatically
discover general properties and principles (knowledge) from historical data (and
background knowledge), that permit a computer to perform a task autonomously or that
assist a human to perform a task more successfully, efficiently or in a more value-added
way.

Since its informal birth in 1989 [Fayad, Piatetsky-Shapiro and Smyth 1996], the field of
knowledge discovery has seen an explosive growth in techniques, applications and
interest. This growth has been driven by the potential that knowledge discovery affords
us as humans, attempting to solve practical problems facing our cyborg society, along
with explaining human learning through "cognitive simulation" [Simon 1983]. For
example, knowledge discovery can contribute in the following ways:

• Exploiting data overload: In the age of the Internet, ubiquitous computing,
digital media, and data warehouses, society faces the challenge of dealing
with an ever-increasing data flood. Frawley et al. [Frawley, Piatetsky-
Shapiro and Matheus 1991] note that "It has been estimated that the
amount of data in the world doubles every 20 months ... earth
observation satellites planned for the 1990s are expected to generate one
terabyte (10^15 bytes) of data every day". On a relatively smaller, but still
enormous scale, Walmart built an 11 terabyte database of customer
transactions in 1998 [Piatetsky-Shapiro 1999]. It is clearly infeasible for
humans to trawl such data in search of interesting and potentially useful
patterns or relationships. Knowledge discovery can be seen as a means of
detecting these patterns, thereby summarising the data in a useful and
succinct format that can help automate, simplify or enhance an application
domain, while also permitting the disposal of original data. For example,


Kononenko [Kononenko 1993] references 24 papers where inductive
learning systems were applied in the medical domain, and notes that
"typically the automatically generated diagnostic rules slightly
outperformed the diagnostic accuracy of physician specialists".
• Tackling complex problems: Difficult problems that evade human
programming can be addressed by knowledge discovery. For example,
problem domains that evolve or that are highly complex, such as image
understanding, classification of protein types based on DNA sequences, or
natural language understanding.
• Overcoming the software lag: As computers play an even bigger role in
everyday life, the demand for software also increases, leading potentially
to a software lag. Knowledge discovery provides the possibility of helping
bypass this programming bottleneck.

Current approaches to knowledge discovery can be differentiated on many fronts such
as scalability and computing requirements. Here, however, the discussion is limited to
comparing the approaches based on the discovered models using the following criteria:

• effectiveness (accuracy of model on unseen data);
• understandability (to the user or expert in the domain);
• and uncertainty management.

Most current approaches satisfy understandability or effectiveness, but not
simultaneously, while tending not to cater for uncertainty. For example, most
approaches that have focused on the symbolic representation (transparent) of processes
have had only mild success in terms of performance accuracies compared to their
mathematical (generally opaque) counterparts. In this book, many examples are
presented that support this, including a diabetes diagnosis system (see Section 11.2 for
full details) where a symbolic learning approach such as ID3 [Quinlan 1986] is applied
to model this diagnosis process. Similarly, a mathematically-derived approach such as
neural networks is applied. The resulting neural network outperforms the symbolic
approach in terms of accuracy. However, the induced decision tree, which can be
viewed to some extent as the neural network's linguistic counterpart, outperforms in
terms of model transparency and understandability. When modelling the real world,
regardless of whether the models are manually or automatically programmed,
uncertainty abounds. This uncertainty arises generally as a result of any of the
following: the inability to express fully a model of the problem domain (due to its
encyclopaedic nature or computational limitations or a limitation on the expressiveness
of the knowledge representation used); theoretical ignorance in a domain (e.g. cancer
diagnosis); and data or variable deficiencies (e.g. when data from a sensor is not
available). These can lead to models that are deficient in some ways including: models
that may be incomplete, imprecise, fragmentary, not fully reliable, vague, or
contradictory. Many approaches to knowledge discovery, such as logic and neural
network approaches, result in models that do not explicitly accommodate uncertainty
and consequently, do not provide a natural rapport with reality. Facilitating uncertainty
in knowledge discovery can, as shown in this book, lead to models with improved
transparency and tractability on the one hand and that provide better generalisation on
the other.

The main goal of this book, after providing a detailed introduction to the key algorithms
and theory that form the core of knowledge discovery from a soft computing
perspective, is to propose a new knowledge discovery process centred on Cartesian
granule features and corresponding learning algorithms. The approach integrates
various methodologies from soft computing, such as evolutionary computation, fuzzy
set theory, and probability theory to address the knowledge discovery criteria outlined
above. This approach is amply illustrated in the context of both benchmark and real
world problems.

This chapter serves as a backdrop against which the rest of the book is developed,
providing an overview of knowledge discovery. It begins with an informal look at the
background and history of knowledge discovery. Section 1.2 formulates the knowledge
discovery process as a multi-step iterative process involving a three-way dialogue
between the domain expert, the knowledge engineer and the computer, in order to
prepare the domain data and background knowledge, to extract the knowledge via
machine learning algorithms, and finally to evaluate and interpret the extracted
knowledge. Section 1.3 reviews some of the successes of knowledge discovery. In
Section 1.4, Cartesian granule features are introduced briefly as a soft computing
approach that overcomes some of the limitations of existing approaches in machine
learning and knowledge discovery. Finally, Section 1.5 describes the organisation of
this book.

1.1 KNOWLEDGE DISCOVERY BACKGROUND AND HISTORY

Knowledge discovery is a multi-faceted research area, drawing on methods, algorithms,
and techniques from diverse fields such as knowledge representation, machine learning,
pattern recognition, cognitive science, artificial intelligence, databases, statistics,
probability, knowledge acquisition for expert systems and data visualisation. The
unifying goal of these areas is the discovery of models from data and background
knowledge that can automate, simplify or enhance an application domain.

Even though the term knowledge discovery, sometimes referred to in the literature as
"knowledge discovery from databases", "advanced data analysis", "data mining" or
simply "machine learning", was only coined in 1989 [Fayad, Piatetsky-Shapiro and
Smyth 1996], the field of knowledge discovery has a long history that derives from its
chief constituent components: knowledge representation, search, feature selection and
discovery, statistics, and machine learning. Each of these components is covered in
detail over the course of this book. To capture the essence of this new field of research
and development the term knowledge discovery was coined. Its motivation was simply
to emphasise the multi-step, inter-disciplinary nature and to broaden the scope and
appeal of knowledge discovery, moving from a process that conducts machine learning
on "perfect" data to a process that exploits alternatives from various fields in order to
deal with the real world of imperfect data in a more effective manner. This resulted in
the confluence of fields, previously disjoint. Since 1989, the field has seen an explosive
growth in techniques, applications and interest. As stated previously, this growth has
been driven by the potential that knowledge discovery affords us as humans, attempting
to solve practical problems facing our cyborg society. Before discussing some of the

applications and approaches of knowledge discovery, an overview of the process is
presented.

1.2 THE KNOWLEDGE DISCOVERY PROCESS

Prior to formal definitions, a simple example is introduced which is subsequently used
to make some of the concepts behind knowledge discovery more intuitive and concrete.
In addition, throughout this book, this problem will be used to highlight some of the
challenges and open research issues for knowledge discovery.

1.2.1 A simple illustrative example problem


An artificial car parking problem is selected (following [Driankov and Hellendoorn
1995; Zadeh 1994b]), where the problem posed is to predict whether a particular car
park in a city centre will be full, or whether there will be vacancies when a customer
arrives on a weekend afternoon. On the way to the car park the customer receives
various notifications regarding the current number of car park vacancies either through
the radio or notice boards located throughout the city. Recently, due to congestion in
the downtown area, the city council has called in analysts to investigate how to improve
the situation. As part of their investigation they collect data. This is presented in graph
format in Figure 1-1. Only a representative sample of the data is displayed to avoid
clutter. Each data point represents an example of an occasion in the past when a
customer, having received information with regard to space availability, proceeded to
the car park. The horizontal axis represents the time at which the availability
information was received in terms of the number of minutes to the car park. The
vertical axis represents the availability information in terms of the number of free
spaces. The data has been classified into two classes: the p's (positive) represent the
occasions when the customer successfully managed to get a parking space, while the
n's correspond to situations when the customer was unsuccessful in getting a parking
space. Thus, this simple problem dataset could represent a historical dataset, where the
task of the knowledge discovery process is to discover patterns or regularities
(knowledge) in the data that could increase customer satisfaction by assisting them with
parking and also to reduce traffic congestion. Typical applications of knowledge
discovery involve data of much higher dimensions (from tens to thousands), with many
more data points (up to millions) and possibly background knowledge. The purpose
here is to illustrate basic ideas of knowledge discovery on a small problem in two-
dimensional space and also to highlight some of the difficulties and open research
issues that face the field.

1.2.2 Knowledge discovery


Knowledge discovery is commonly viewed as the non-trivial general process of
discovering valid, novel, understandable, and ultimately useful knowledge about an
application domain from observation data and background knowledge, in which the
discovered knowledge is implicit or previously unknown. For example, in terms of the
car parking problem, it is the identification of patterns that lead to successful parking.

Other examples include: the identification of the pattern of use of a credit card to detect
possible fraud; or the detection of a pattern in the documents a user reads, so that when
new documents are published that match this pattern the user is alerted. This definition
is adapted from the various definitions in the knowledge discovery literature [Fayad,
Piatetsky-Shapiro and Smyth 1996; Klosgen and Zytkow 1996]. An alternative
definition is to view knowledge discovery as the process of transforming data (and
background knowledge) into a format (for example, if-then rules) that permits a
computer to perform a task autonomously or that assists a human to perform a task
more successfully or efficiently or in a more value-added way (e.g. decision making or
triggering innovative creativity). Munakata simply defines knowledge discovery as
"computer techniques that, in the broadest sense, automatically find fundamental
properties and principles that are original and useful" [Munakata 1999].


Figure 1-1: Data for car parking problem where "p" represents cases where a
customer successfully parked the car and where "n" represents cases where a
customer was unsuccessful.

Knowledge, in this context, is taken to correspond to a model or computer program that is expressed in some hypothesis language. The format of the model can vary from if-
then rules to knowledge maps (such as Kohonen maps [Kohonen 1984]). For example,
for the parking problem, a model may consist of a set of rules such as "if
TimeToDestination < T and NumberOfFreeSpaces > S then parking will be
successful" (see Figure 1-2). The application domain may be represented in terms of
observation data or prior knowledge (along with other performance evaluation
functions outlined presently). The car parking problem domain is represented in terms
of a database of cases, where each case contains values for TimeToDestination,
NumberOfFreeSpaces and ParkingStatus. The discovered models should be valid on
new data to a certain degree and this can be captured by performance evaluation
functions. For example, in the case of the car-parking problem, validity could be
measured as a function of the number of correct classifications (space availability
classifications) achieved by the model depicted in Figure 1-2. The discovered models
should potentially lead to some useful actions, as measured by a utility function. For
example, in the car parking this function could be expected to decrease downtown noise
and traffic pollution and increase customer and pedestrian satisfaction.
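
To make the flavour of such a discovered model concrete, the threshold rule of Figure 1-2 can be expressed in a few lines of code. The sketch below is purely illustrative and is not taken from the book; the Python function and the threshold values T and S are hypothetical stand-ins for what a learning algorithm would estimate from the historical parking data.

```python
# Illustrative sketch only: the threshold rule of Figure 1-2 as a tiny classifier.
# T and S are hypothetical thresholds; in practice they would be learned.

T = 10   # hypothetical limit on TimeToDestination (minutes)
S = 20   # hypothetical minimum NumberOfFreeSpaces

def parking_status(time_to_destination, number_of_free_spaces):
    """Return 'p' (parking expected to succeed) or 'n' (expected to fail)."""
    if time_to_destination < T and number_of_free_spaces > S:
        return "p"
    return "n"

# A driver 8 minutes away who has just seen 35 free spaces advertised:
print(parking_status(8, 35))   # -> 'p'
```

Validity, in the sense used above, would then simply be the fraction of historical cases that such a rule classifies correctly.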

An important goal of KD is to make the models amenable to human inspection and understanding. This may facilitate a better understanding of the underlying problem
area. While this is difficult to measure precisely, especially in difficult problem areas,
one frequently used substitute is the simplicity measure. Several forms of simplicity
measure exist, ranging from purely syntactic (e.g. the length of the rules) to semantic
(e.g. easy for humans to understand). A further requirement of the KD process is that
the discovered patterns should be novel; for example, discovering a new pattern in the
data, such as the following simple rule "if TimeToDestination < NumberOfFreeSpaces
then parking will be successful", could be considered novel, if the previously used
model was a rule base consisting of several if-then rules, none of which captured the
knowledge of this rule.


Figure 1-2: A possible rule-based model of the car parking problem characterised by
the following rule: "If TimeToDestination < T and NumberOfFreeSpaces is > S then
ParkingStatus will be successful OTHERWISE ParkingStatus will be unsuccessful".
The shaded region corresponds to successful parking.

1.2.2.1 The knowledge discovery process


Knowledge discovery is usually a multi-step iterative process, involving a three-way
dialogue between the domain expert, the knowledge engineer and the computer,
comprising the following core steps:

• developing an understanding of the application domain;
• determination of knowledge representation;
• selection, preparation, and transformation of data and prior knowledge;
• knowledge extraction (machine learning);
• model evaluation and refinement.

As illustrated in Figure 1-3 the knowledge discovery process is interactive and iterative
involving numerous steps where decisions are made by the knowledge engineer or
experts in the field of application. Some of the basic steps in this process are broadly
outlined below and illustrated for the parking problem presented above:

1. Developing an understanding of the application domain: One of the first tasks confronting the knowledge engineer is to develop an understanding of the
application domain in terms of the available data, background knowledge,
potential to get new data, the requirements and the goals of the system. This
information is obtained by a discussion with the domain experts, and potential
end-users. This results in the definition of a task that the knowledge discovery
process should model and various criteria that this model should meet. In the
case of the parking problem, the knowledge engineer talks to the analysts who
gathered the car parking data. These analysts explain that their goals are to
reduce downtown congestion, increase customer satisfaction and make a profit
through the display of better quality information at their information points
(possibly not just the number of vacancies). Following further discussion, both
the analysts and knowledge engineer reach a consensus as to how to model the
problem. The knowledge engineer is charged with discovering useful patterns
in the data that could inform the driver of a car to proceed to the car park or to
look for an alternative car park. The knowledge engineer is also given further
criteria the model should meet, such as the model should be understandable
and small.

2. Selection of knowledge representation: Having gained an understanding of the problem domain and the requirements of the customer, the knowledge
engineer needs to determine the type of knowledge representation that should
be used to model the problem domain. A tremendous variety of approaches
to knowledge representation exists, ranging from quantitative to symbolic. The
type of knowledge representation selected will have a major influence on what
can be learned, how it can be learned, when it can be learned, what type of
data is required to learn it, and other performance related issues such as
understandability, effectiveness and so on. Part II of this book gives a detailed
presentation of knowledge representation and will not be discussed further
here. Part IV introduces a new form of knowledge representation based upon
probabilistic rules and fuzzy sets defined over Cartesian granule features
[Baldwin, Martin and Shanahan 1997; Shanahan 1998; Shanahan, Baldwin
and Martin 1999]. Returning to the parking problem, the customer has
requested a simple, transparent model. Consequently, the knowledge engineer
opts to represent the model in terms of if-then rules.

3. Data selection, pre-processing and transformation: This step begins with selecting the features, observation data and background knowledge from the
domain that will be considered during subsequent phases. This is mainly an
automatic process that results in the selection of a subset of available features,
while also possibly highlighting data deficiencies. Various statistical and
machine learning techniques can help with this process, supplemented with
domain knowledge provided by the domain experts. Chapter 9 presents in
detail feature selection and discovery (creation of new features). Subsequently,
the selected data is prepared for the learning step. For example, data may
be missing (such as an attribute value is not known or does not exist). There
are several possible reasons why a value is missing, such as: it was not
measured; there was an instrument malfunction; or the attribute does not
apply. How one deals with missing data values is very much tied to the
domain of application and to the learning algorithm used to do the discovery. In some problem domains, it may require human input, whereas in others, the
missing value can be replaced with a single value, such as the class mean
value for this feature, or with an imprecise value such as the fuzzy set value
(that could be estimated using techniques presented in Part IV), corresponding
to class values for a feature. Raw data cannot be used directly by most learning
algorithms. They require various transformations. For example, in the case of
decision tree learners [Quinlan 1983; Quinlan 1993], learning can be greatly
speeded up if the feature universes are discretised prior to learning. Neural
networks can find it difficult to learn, if all feature values are not normalised.
Other feature transformations may be required in some domains, where for
example, there are too many input features. This is discussed in greater detail
in Chapters 9 and 11. For the car parking problem, there is no missing data and
the knowledge engineer selects all data features and the provided data samples, and
subsequently splits the data samples into two datasets: a dataset for training
and a dataset for testing.
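
As a concrete illustration of the kind of preparation just described, the sketch below (not the book's own code) replaces missing values of a numeric feature with the mean of that feature within the same class and then splits the samples into training and test sets. The file name, column names and the use of the pandas and scikit-learn libraries are all hypothetical conveniences.

```python
# Hypothetical data-preparation sketch for the parking example.
import pandas as pd
from sklearn.model_selection import train_test_split

data = pd.read_csv("parking.csv")   # hypothetical historical dataset

# Class-mean imputation: fill missing NumberOfFreeSpaces values with the mean
# of that feature computed within the same ParkingStatus class.
data["NumberOfFreeSpaces"] = data["NumberOfFreeSpaces"].fillna(
    data.groupby("ParkingStatus")["NumberOfFreeSpaces"].transform("mean")
)

# Split the prepared samples into a training set and a test set.
train, test = train_test_split(data, test_size=0.3, random_state=0)
```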

4. Knowledge extraction: The knowledge engineer selects an appropriate learning algorithm to extract knowledge from the prepared data and
background knowledge. The knowledge engineer, possibly in conjunction with
domain experts, chooses performance evaluation criteria to guide the
discovery process. The selected learning algorithm searches through the space
of possible models, guided by background knowledge and the performance
evaluation criteria, for a satisfactory model. This is one of the most critical
steps in the knowledge discovery process. A detailed insight into machine
learning and the different paradigms of learning, along with performance
evaluation criteria is given in Part III of this book, while Part IV introduces new
approaches to machine learning based upon Cartesian granule features.
Returning to the parking problem, the knowledge engineer selects the C4.5
decision tree induction algorithm [Quinlan 1993] for learning, while
specifying a performance evaluation function based on the accuracy of the
induced model on the test dataset of unseen samples.
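
Continuing the hypothetical sketch from the previous step, the knowledge-extraction phase might look as follows. C4.5 itself is not reproduced here; scikit-learn's CART-style decision tree is used purely as a stand-in, with accuracy on the unseen test set as the performance evaluation function described above.

```python
# Hypothetical knowledge-extraction sketch (CART decision tree standing in for
# C4.5), evaluated by accuracy on the unseen test samples from the earlier split.
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

features = ["TimeToDestination", "NumberOfFreeSpaces"]

model = DecisionTreeClassifier(max_depth=4, random_state=0)
model.fit(train[features], train["ParkingStatus"])

predictions = model.predict(test[features])
print("Test accuracy:", accuracy_score(test["ParkingStatus"], predictions))
```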

5. Knowledge interpretation and evaluation: Though knowledge interpretation and evaluation criteria can also be used to guide knowledge extraction, this
step mainly focuses on interpreting the results of knowledge discovery, and it
can lead to a return to any of the earlier steps. Assume that in the case of the
parking problem, the discovered decision tree model yields an accuracy of
80% on the test dataset and leads to a rule base with several rules
(hypothetical). The knowledge engineer presents the results to the analysts.
The analysts are unhappy with the results and another discussion ensues as to
why the discovered model does not perform well and why it is so complicated.
The knowledge engineer conjectures that there are some outliers in the data,
suggesting noisy (bad) data sampling or the lack of information. The analysts
ponder for a while and conclude that some of the outliers could be explained
by adverse weather conditions or the occurrence of public events that caused
the closure of some roads and consequently required a much longer time to
reach the car park. The analysts then agree to generate a new dataset consisting
of TimeToDestination, NumberOfFreeSpaces, OccurrenceOfAPublicEvent,

AffectedStreets, and ParkingStatus. This results in a return to step 3 for the
knowledge engineer and another iteration of the KD process. This iteration
may result in a much more complicated decision tree with an improved
accuracy of 90%. Subsequently, the analysts believe that the discovered model
is too complicated and ask the knowledge engineer to try to come up with a
simpler and more intuitive model. This causes the knowledge engineer to
return to step 2 and select a different form of knowledge representation; say
predicate logic. Consequently, the knowledge engineer has to select a new
induction algorithm: such as the PROGOL algorithm [Muggleton and Buntine
1988]. This iteration may result in a very simple model of the parking problem
with an accuracy of 90%. The resulting model could consist of one rule "if
TimeToDestination < NumberOfFreeSpaces then display "parking is
available" otherwise display "no parking available here, use alternative
parking" as depicted in Figure 1-4. The analysts like the resulting model, and
it is deployed as a means of supporting the downtown drivers.

Each of these steps, as indicated above, will be revisited in detail at various points
throughout the book. The next section shifts focus from an overview of the knowledge
discovery process and illustrative example to a review of some of the real world
applications of knowledge discovery.

[Process diagram omitted: boxes for Data and Background Knowledge Acquisition, Select Knowledge Representation, and Model Interpretation and Evaluation, linked by arrows indicating the iterative flow between the steps.]

Figure 1-3: An overview of the steps making up the knowledge discovery process.


Figure 1-4: A possible rule-based model of the car parking problem characterised by
the rule if TimeToDestination < NumberOfFreeSpaces then display "parking is
available" otherwise display "no parking available here, use alternative parking". The
shaded region corresponds to "no parking available here, use alternative parking".

1.3 SUCCESSES OF KNOWLEDGE DISCOVERY

Knowledge representation and machine learning are two of the most critical
components of knowledge discovery. Both of these have received a lot of attention over
the past decade, leading to the adoption, extension or hybridisation of many traditional
representation schemes and machine learning algorithms such as neural networks,
probabilistic approaches, genetic programming, decision tree induction, inductive logic
programming, rough sets and fuzzy sets in order to deal with the challenging task of
knowledge discovery. The resulting knowledge discovery techniques have led to
practical applications in many areas: within decision support systems applications such
as analysing medical outcomes, detecting credit card fraud, and predicting customer
purchase behaviour; within engineering and manufacturing systems applications such
as autonomous vehicles and process control systems; within game playing applications
such as playing chess at grandmaster level; and within human-computer interaction
applications such as recognising human gestures and user profiling for e-commerce to
mention but a few. For example, Muggleton [Muggleton 1999] illustrates the power of
first order logic techniques for the knowledge discovery of biological functions in
structured domains such as molecular biology, carcinogenicity and pharmacophores.
Fayad et al. [Fayad, Djorgovski and Weir 1996] demonstrate how knowledge discovery
techniques were applied to the classification of celestial objects from the Palomar
Observatory Sky Survey, consisting of terabytes of data. In this case, knowledge was
extracted using decision tree approaches. De Jong [Jong 1999] shows the powers of
evolutionary computation for the discovery of heuristics, tactics and strategies. For
example, in the field of telecommunications he discusses how genetic algorithms were
used to generate alternative network designs that reduce costs by 10 to 20%. Engineers,
by examining the resulting designs, gained some important insights into how these cost
savings were achieved. Glance et al. [Glance, Arregui and Dardenne 1998; Glance, Arregui and
Dardenne 1999] have demonstrated how patterns in user recommendations can be

extracted and represented using probabilistic techniques to ease and increase information sharing within corporate intranet-based communities. Ralescu and Hartani
[Ralescu and Hartani 1994] demonstrate the extraction of insightful facial expression
patterns from image data using fuzzy clustering techniques. For other examples of
recent knowledge discovery applications, see [Fayad et al. 1996; Munakata 1999].

1.4 CARTESIAN GRANULE FEATURES IN BRIEF

Even though in recent years many successful knowledge discovery applications have
been developed, as highlighted in the previous section, current approaches to
knowledge discovery suffer from a number of shortcomings such as decomposition
error and a lack of transparency. Cartesian granule features and related learning algorithms were
originally introduced to address some of these shortcomings [Baldwin, Martin and
Shanahan 1996; Baldwin, Martin and Shanahan 1997; Shanahan 1998; Shanahan,
Baldwin and Martin 1999]. A Cartesian granule feature is a multidimensional feature,
which is built upon a linguistic partition or discretisation of the base universe. Fuzzy
sets, probability distributions and mass assignments can be naturally and succinctly
expressed in terms of the Cartesian granules (words) that discretise the base universe.
Fuzzy sets are used to represent the granules, thereby overcoming some of the problems
posed by crisp discretisation, such as vulnerability to boundary location, that have
plagued many probabilistic and logic-based approaches to machine learning. For
example, Figure 1-5(a) graphically displays a linguistic partition of the Position
variable, where each word is denoted by a fuzzy set. The variable value of 40 can be
linguistically summarised or described using the Cartesian granule fuzzy set: {Left/0.2 +
Middle/1}. In a similar fashion, more general concepts can be summarised. For example,
the concept of car locations in images could be summarised linguistically and
succinctly using the Cartesian granule fuzzy set depicted in Figure 1-5(b) (see Part IV
of this book for more details). This new approach exploits a divide-and-conquer
strategy to representation, capturing knowledge in terms of a rule-based network of
low-order semantically related features - a network of Cartesian granule features.
Cartesian granule features can be incorporated into fuzzy logic rules or probabilistic
rules. Classification, regression and clustering problems can be addressed quite
naturally using Cartesian granule features. Parts IV and V of this book describe
Cartesian granule feature models, corresponding learning algorithms, and the
knowledge discovery of such models in both benchmark and real world problems.
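
The following sketch, which is illustrative rather than the book's own code, shows how a single value is described linguistically over a fuzzy partition such as that of Figure 1-5(a). The trapezoidal membership functions are hypothetical shapes, chosen only so that Position = 40 reproduces the Cartesian granule fuzzy set quoted above, {Left/0.2 + Middle/1}.

```python
# Hypothetical fuzzy partition of the Position universe [0, 100]; the shapes
# below are illustrative choices, not the book's definitions.

def trapezoid(x, a, b, c, d):
    """Membership of x in a trapezoidal fuzzy set with support [a, d] and core [b, c]."""
    if x <= a or x >= d:
        return 0.0
    if b <= x <= c:
        return 1.0
    if x < b:
        return (x - a) / (b - a)
    return (d - x) / (d - c)

partition = {
    "Left":   lambda x: trapezoid(x, -1, 0, 0, 50),    # a triangle peaking at 0
    "Middle": lambda x: trapezoid(x, 20, 40, 60, 80),
    "Right":  lambda x: trapezoid(x, 50, 80, 100, 101),
}

def linguistic_description(value):
    """Granules with non-zero membership, e.g. {'Left': 0.2, 'Middle': 1.0}."""
    return {word: round(mu(value), 2) for word, mu in partition.items() if mu(value) > 0}

print(linguistic_description(40))   # -> {'Left': 0.2, 'Middle': 1.0}
```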

1.4.1 Soft computing for knowledge discovery


Soft computing, a term originally coined by Zadeh [Zadeh 1994a; Zadeh 1994b; Zadeh
1994c], is the integration of advanced problem solving methodologies such as fuzzy
systems, neural networks, evolutionary computing, and various theories of probability
to solve extremely challenging problems. It differs from conventional (hard) computing
in that, unlike hard computing, it is tolerant of imprecision, uncertainty and partial
truth. Zadeh states that the guiding principle of soft computing is to "exploit the
tolerance for imprecision, uncertainty and partial truth to achieve tractability,
robustness and low solution cost". It is important to note that soft computing is not just

a collection of problem solving strategies, but rather, it is a partnership in which each of the partners contributes a distinct methodology for addressing problems in its domain.
In this perspective, the principal contributions of fuzzy systems, neural networks,
evolutionary computation and probability theory are complementary rather than
competitive.

Over the course of this book the extremely challenging problem of knowledge
discovery is addressed using Cartesian granule feature modelling, an example of a soft
computing approach that exploits the powers of genetic programming (in order to
discover a good concept language), fuzzy sets (for concept representation), and
probability theory (for learning concepts and reasoning) in order to achieve systems
with a high machine intelligence quotient (MIQ). Knowledge representation in terms of Cartesian granule features is an
example of exploiting uncertainty, imprecision in this case, in order to achieve
tractability and transparency on the one hand and generalisation on the other.

[Plots omitted: (a) fuzzy sets Left, Middle and Right partitioning the Position universe [0, 100]; (b) a Cartesian granule fuzzy set defined over these granules.]

Figure 1-5: Concept descriptions in terms of a Cartesian granule fuzzy set (a)
linguistic partition of the universe of position; (b) a concept Cartesian granule fuzzy
set.

1.5 THE STRUCTURE OF THIS BOOK

This book provides a self-contained description of the theory and algorithms that form
the core of knowledge discovery from a soft computing perspective. The sections above
have presented a general introduction to knowledge discovery and its applications. This
has set the stage for the rest of the book, which provides a highly readable and
systematic exposition of knowledge representation, machine learning, and the key
methodologies that make up the fabric of soft computing - fuzzy set theory, fuzzy
logic, evolutionary computing, and various theories of probability (point-based

approaches such as naive Bayes and Bayesian networks, and set-based approaches such
as Dempster-Shafer theory and mass assignment theory). Along with describing well
known approaches, Cartesian granule features and corresponding learning algorithms
are also introduced as a new and intuitive approach to knowledge discovery. This new
approach embraces the synergistic spirit of soft computing, exploiting uncertainty,
imprecision in this case, in order to achieve tractability and transparency on the one
hand and generalisation on the other. In doing so it addresses some of the shortcomings
of existing approaches such as decomposition error and performance-related issues
such as transparency, accuracy and efficiency. Parallels are drawn between this
approach and other well known approaches (such as naive Bayes, decision trees)
leading to equivalences under certain conditions.

The remainder of this book is divided into four main parts and an appendix: Part II
introduces the key components of knowledge representation and outlines the desiderata
of knowledge representation, along with describing the key algorithms and theory of
various soft computing approaches to knowledge representation (in tutorial style). Part
III introduces the basic architecture for learning systems and its components and details
many popular learning algorithms. Part IV proposes a new soft computing approach to
knowledge discovery based on Cartesian granule features. Applications and
comparisons of this new approach in the context of both artificial and real world
problems are described in Part V.

The following provides a more detailed overview of each chapter:

• PART II - Chapter 2: One of the most crucial decisions in the knowledge discovery process is that of selecting the method of knowledge representation.
This chapter is concerned with issues relevant to representation of experience
(the input to knowledge discovery) and to the representation and organisation
of acquired knowledge (the output of knowledge discovery). The key
components of knowledge representation are first introduced and subsequently
the desiderata of knowledge representation from a knowledge discovery
perspective are presented. Various forms of knowledge representation
commonly used within machine learning and knowledge discovery are briefly
described and discussed with respect to these desiderata.
• Chapters 3, 4, 5: Soft computing provides a variety of ways of representing
knowledge. These three chapters examine the techniques that form the basis of
the approaches proposed in this book. Chapter 3 presents the fundamentals of
fuzzy set theory [Zadeh 1965]. Chapter 4 introduces fuzzy logic [Zadeh 1973]
as the basis for a collection of techniques for representing knowledge in terms
of natural language like sentences and also as a means of manipulating these
sentences in order to perform inference using reasoning strategies that are
approximate rather than exact. Chapter 5 describes various language-like
theories of representing uncertainty and imprecision including point-based
probability theory such as naive Bayes and Bayesian networks, set-based
probabilistic approaches such as Dempster-Shafer theory [Dempster 1967;
Shafer 1976], possibility theory [Zadeh 1978], and mass assignment theory
[Baldwin 1991]. Formal links between these theories and fuzzy set theory are
also presented. These links form the basis for the learning algorithms proposed
in Chapter 9.

• Chapter 6: This chapter gives details of a programming environment that enables soft computing - FRIL (Fuzzy Relational Inference Language)
[Baldwin, Martin and Pilsworth 1988]. Essentially, Fril is an efficient general
logic programming language with special structures to handle uncertainty and
imprecision using the soft computing techniques presented in earlier chapters.
This chapter describes the different forms of knowledge representation and
probabilistic reasoning mechanisms in Fril, which are subsequently used by
the Cartesian granule feature knowledge discovery process, described in Parts
IV and V of this book.
• PART III - Chapter 7: The field of machine learning is concerned with
software programs that improve their performance with experience. This
chapter overviews learning from human and computer perspectives. A formal
definition of machine learning is provided and inductive learning (the most
prevalent form of machine learning) is described in detail, viewing induction
as a search process in the space of possible hypotheses (induced computational
models) in which factors such as generalisation, model performance measures,
inductive bias and knowledge representation play important roles. These
notions are concretely illustrated in the context of popular induction
algorithms for decision trees, naive Bayes classifiers and fuzzy classifiers. In
addition, the three main categories of machine learning, namely supervised
learning, reinforcement learning and unsupervised learning, are described;
these are supplemented with a taxonomy of associated learning algorithms.
• PART IV - Chapter 8: This chapter introduces a new form of knowledge
representation centred on Cartesian granule features. It provides basic
definitions and examples of Cartesian granule features and related concepts,
such as Cartesian granule fuzzy sets and additive models. It also looks at
different possibilities for aggregation within individual Cartesian granule
features based upon fuzzy set theory and probability theory. Finally, it shows
how Cartesian granule features can be incorporated into evidential logic
(additive) and fuzzy logic models.
• Chapter 9: The identification of good parsimonious Cartesian granule feature
models is a super-exponential search process through the space of possible
models. This chapter describes a constructive induction algorithm, G_DACG
(Genetic Discovery of Additive Cartesian Granule feature models), which
facilitates the discovery of Cartesian granule feature models from example
data. This involves two main steps: language identification (feature selection,
abstraction and discovery) in terms of Cartesian granule features; and
parameter identification of class fuzzy sets and rules. A novel and inexpensive
fitness function based on the semantic separation of the extracted fuzzy set
concepts and parsimony is also proposed.
• PART V - Chapter 10: In this chapter for the purposes of illustration and
analysis, Cartesian granule feature modelling is applied in the context of
artificial problems, in both the classification and prediction domains. Even
though the G_DACG algorithm can automatically learn models from example
data, here the language of the model is determined manually, while the model
parameters are determined automatically through learning. This allows a close
analysis of the impact of various decisions, primarily in the language
identification phase of learning, on the resulting Cartesian granule feature
models. This analysis provides insights on how to model a problem using

Cartesian granule features. Furthermore, it provides a useful platform for understanding many other learning algorithms that may or may not explicitly
manipulate fuzzy events or probabilities. For example, it shows that a naive
Bayesian classifier is equivalent to crisp Cartesian granule feature classifiers
under certain conditions. Other parallels are also drawn between learning
approaches such as decision trees [Quinlan 1986; Quinlan 1993] and the data
browser [Baldwin and Martin 1995; Baldwin, Martin and Pilsworth 1995].
• Chapter 11: In this chapter the G_DACG constructive induction algorithm is
deployed in the knowledge discovery of additive Cartesian granule feature
models in the real world domains of medical decision support, computer vision
and control. The results obtained are compared with those achieved using
other standard induction approaches such as neural nets, naive Bayes, and
various decision tree approaches. The approaches examined are evaluated
using criteria such as model transparency, performance accuracy, and the
efficiency of the learning algorithm. This chapter draws some conclusions
about the proposed Cartesian granule feature approach to knowledge discovery
while also offering some suggestions, challenges and new work directions for
both knowledge discovery and the Cartesian granule feature modelling
approach.
• Appendix: This appendix presents an overview of the evolutionary
computation paradigm.
• Glossary of main symbols: This glossary provides a brief description of the
main symbols used in this book.

1.6 SUMMARY

Knowledge discovery can be viewed as the process of transforming data (and background knowledge) into a format (for example, if-then rules) that permits a computer to perform a task autonomously or that assists a human to perform a task more successfully or efficiently. It is a multi-faceted field drawing on techniques from
fields such as machine learning, knowledge representation, and statistics. Knowledge
discovery is seen as an important way of overcoming many of the computational
problems facing our cyborg society such as software lag and data overload, while also
extending the range of problems that can currently be addressed by a computer. Some
limitations of existing techniques were identified and a new approach to knowledge
discovery based on Cartesian granule features, which addresses some of these
limitations, was briefly described.

1.7 BIBLIOGRAPHY

Baldwin, J. F. (1991). "A Theory of Mass Assignments for Artificial Intelligence", In
IJCAI '91 Workshops on Fuzzy Logic and Fuzzy Control, Sydney, Australia,
Lecture Notes in Artificial Intelligence, A. L. Ralescu, ed., 22-34.

Baldwin, J. F., and Martin, T. P. (1995). "Fuzzy Modelling in an Intelligent Data
Browser." In the proceedings of FUZZ-IEEE, Yokohama, Japan, 1171-1176.
Baldwin, J. F., Martin, T. P., and Pilsworth, B. W. (1988). FRIL Manual. FRIL Systems
Ltd, Bristol, BS8 1QX, UK.
Baldwin, J. F., Martin, T. P., and Pilsworth, B. W. (1995). FRIL - Fuzzy and Evidential
Reasoning in A.I. Research Studies Press (Wiley Inc.), ISBN 0863801595.
Baldwin, J. F., Martin, T. P., and Shanahan, J. G. (1996). "Modelling with Words using
Cartesian Granule Features", Report No. ITRC 246, Dept. of Engineering
Maths, University of Bristol, UK.
Baldwin, J. F., Martin, T. P., and Shanahan, J. G. (1997). "Modelling with words using
Cartesian granule features." In the proceedings of FUZZ-IEEE, Barcelona,
Spain, 1295-1300.
Dempster, A. P. (1967). "Upper and Lower Probabilities Induced by Multivalued
Mappings", Annals of Mathematical Statistics, 38:325-339.
Driankov, D., and Hellendoorn, H. (1995). "Chaining of IF-THEN rules: some
problems." In the proceedings of Second International Fuzzy Engineering
Symposium/FUZZ-IEEE, Yokohama, 103-8.
Fayad, U. M., Djorgovski, S. G., and Weir, N. (1996). "Automating the analysis and
cataloging of sky surveys", In Advances in Knowledge Discovery and Data
Mining, U. M. Fayad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy,
eds., AAAI Press/MIT Press, London, England, 471-493.
Fayad, U. M., Piatetsky-Shapiro, G., and Smyth, P. (1996). "From Data Mining to
Knowledge Discovery", In Advances in Knowledge Discovery and Data
Mining, U. M. Fayad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy,
eds., AAAI Press/MIT Press, London, England, 1-36.
Fayad, U. M., Piatetsky-Shapiro, G., Smyth, P., and Uthurusamy, R., eds. (1996).
"Advances in Knowledge Discovery and Data Mining", AAAI Press/MIT
Press, London, England.
Frawley, W. J., Piatetsky-Shapiro, G., and Matheus, C. J. (1991). "Knowledge
Discovery in Databases: An Overview", In Knowledge Discovery in
Databases, G. Piatetsky-Shapiro and W. J. Frawley, eds., AAAI Press/MIT
Press, Cambridge, Mass, USA, 1-27.
Glance, N., Arregui, D., and Dardenne, M. (1998). "Knowledge Pump: Supporting the
Flow and Use of Knowledge", In Information Technology for Knowledge
Management, U. Borghoff and R. Pareschi, eds., Springer-Verlag, New York,
35-45.
Glance, N., Arregui, D., and Dardenne, M. (1999). "Making Recommender Systems
Work for Organizations." In the proceedings of PAAM, London, UK.
Jong, K. A. d. (1999). "Evolutionary computation for discovery", Communications of
the ACM, 42(11):31-53.
Klosgen, W., and Zytkow, J. M. (1996). "Knowledge Discovery in Databases
Terminology", In Advances in Knowledge Discovery and Data Mining, U. M.
Fayad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, eds., AAAI
Press/MIT Press, London, England, 1-36.
Kohonen, T. (1984). Self-Organisation and Associative Memory. Springer-Verlag,
Berlin.
Kononenko, I. (1993). "Inductive and Bayesian learning in medical diagnosis",
Artificial Intelligence, 7:317-337.

Muggleton, S. (1999). "Scientific knowledge discovery using inductive logic
programming", Communications of the ACM, 42(11):43-46.
Muggleton, S., and Buntine, W. (1988). "Machine invention of first order predicates by
inverting resolution." In the proceedings of Fifth International Conference on
Machine Learning, Ann Arbor, MI, USA, 339-352.
Munakata, T. (1999). "Knowledge discovery", Communications of the ACM,
42(11):26-29.
Piatetsky-Shapiro, G. (1999). "The data-mining industry coming of age", IEEE
Intelligent Systems, 14(6):32-34.
Quinlan, J. R. (1983). "Learning efficient classification procedures and their application
to chess endgames", In Machine Learning: An Artificial Intelligence
Approach, R. S. Michalski, J. G. Carbonell, and T. M. Mitchell, eds., Springer-
Verlag, Berlin, 150-176.
Quinlan, J. R. (1986). "Induction of Decision Trees", Machine Learning, 1(1):86-106.
Quinlan, J. R. (1993). C4.5: Programs for Machine Learning. Morgan Kaufmann, San
Mateo, CA.
Ralescu, A. L., and Hartani, R. (1994). "Modelling the perception of facial expressions
from face photographs." In the proceedings of The 10th Fuzzy Systems
Symposium, Osaka, Japan, 554-557.
Shafer, G. (1976). A Mathematical Theory of Evidence. Princeton University Press.
Shanahan, J. G. (1998). "Cartesian Granule Features: Knowledge Discovery of
Additive Models for Classification and Prediction", PhD Thesis, Dept. of
Engineering Mathematics, University of Bristol, Bristol, UK.
Shanahan, J. G., Baldwin, J. F., and Martin, T. P. (1999). "Modelling with words using
Cartesian granule features", Submitted to Fuzzy Sets and Systems:40 pages.
Simon, H. A. (1983). "Why should machines learn?", In Machine Learning: An
Artificial Intelligence Approach, R. S. Michalski, J. G. Carbonell, and T. M.
Mitchell, eds., Springer-Verlag, Berlin, 25-37.
Zadeh, L. A. (1965). "Fuzzy Sets", Journal of Information and Control, 8:338-353.
Zadeh, L. A. (1973). "Outline of a New Approach to the Analysis of Complex Systems
and Decision Process", IEEE Trans. on Systems, Man and Cybernetics,
3(1):28-44.
Zadeh, L. A. (1978). "Fuzzy Sets as a Basis for a Theory of Possibility", Fuzzy Sets and
Systems, 1:3-28.
Zadeh, L. A. (1994a). "Fuzzy Logic, Neural Networks and Soft Computing",
Communications of the ACM, 37(3):77-84.
Zadeh, L. A. (1994b). "Soft computing", LIFE Seminar, LIFE Laboratory, Yokohama,
Japan (February, 24), published in SOFT Journal, 6:1-10.
Zadeh, L. A. (1994c). "Soft Computing and Fuzzy Logic", IEEE Software, 11(6):48-56.
Zadeh, L. A. (1997). "Toward a theory of fuzzy information granulation and its
centrality in human reasoning and fuzzy logic", Fuzzy Sets and Systems,
90(2):111-127.
CHAPTER 2
KNOWLEDGE REPRESENTATION

Knowledge representation (KR) is primarily concerned with accommodating the expression of problem domain knowledge in a computer tractable form that allows
accurate modelling of the domain (i.e. the solving of a problem or task to a useful level
of performance), and in a fashion that is amenable to human understanding (not always
required). Much of the work in knowledge representation is motivated by engineering
concerns, with little interest in psychological and linguistic plausibility [Hayes 1999].
Philosophers and psychologists have long pondered and debated how humans and other
animals represent knowledge. Several hypotheses have been put forward. For example,
in 1956 a series of experiments, which were directed towards understanding a human's
ability to represent and reason about categories, were reported in a landmark book
entitled "A study of thinking" [Bruner, Goodnow and Austin 1956]:

" ... when one learns to categorize a subset of events in a certain way, one is
doing more than simply learning to recognize instances encountered. One is
also learning a rule that may be applied to new instances. The concept or
category is basically, this" rule of grouping" and it is such that one constructs
informing and attaining concepts."

The notion of using a rule as an abstract representation of concepts in the human mind,
as mentioned in this excerpt, has since been questioned by many and has created a lot
of debate [Hayes 1999; Holland et al. 1986; Sammut 1993]. Numerous other models of
how humans store and represent knowledge have been examined and proposed but to
date, few have proven adequate [Hayes 1999]. This is somewhat paralleled within
knowledge discovery, where there exists a wide range of possible representations with
no "panacean" approach. From a knowledge discovery perspective, the type of KR
selected determines the nature of learning: it determines the type of learning; what can
be learned; when it can be learned (one-shot or incremental); how long it takes to learn;
the type of experience required to learn.

This chapter begins by introducing the key parts of knowledge representation: the
observation language, the hypothesis language and the general purpose inference and
decision making mechanisms. The intimate relationship between uncertainty and
knowledge representation is then described. Subsequently, the desiderata of knowledge
representation are presented, paying particular attention to how they affect knowledge
discovery. A taxonomy of knowledge representation approaches, commonly used
within knowledge discovery, is then presented and discussed with respect to these
desiderata.


2.1 REPRESENTATION OF OBSERVATIONS AND HYPOTHESES

The nature of knowledge in knowledge-based systems can be split into two broad
categories: specific (domain) and general knowledge. The specific knowledge refers to
the environment and its interpretations such as the observations and the induced
models, whereas the general knowledge refers to the inference and decision making
mechanisms used, which are generally the same across all problem domains. This
section provides a description of the knowledge representation components used to
represent specific knowledge, while the next section introduces general purpose
knowledge.

The input to the knowledge discovery process consists of descriptions of observations from an environment (problem domain) and, in the case of supervised or reinforcement
learning, an output value is associated with the example (see Chapter 7 for more
details). An observation language describes these input-output experiences. No matter
which form of KR is used, descriptions of objects in the real world must ultimately rely
on measurements or perceptions [Zadeh 1999] of some properties. These may be
physical properties such as size, colour etc. or derived properties such as the difference
in share price, or perceptions such as happy or long. The accuracy and reliability of the
induced concept depends heavily on the accuracy and reliability of the measurements
and perceptions and their corresponding representation in the observation language. In
most knowledge discovery approaches, the observation language is an attribute-value
list where the values of the attributes are in most cases necessarily restricted to
numerical or symbolic quantities. As a result, the learner is often biased or limited by
its observation language. However, recently, many relatively new (and not so new)
forms of KR have been developed and exploited to increase the expressivity of the
observation language, to deal with not just precise data, but also imprecise and
uncertain data, thereby providing a more accurate and natural reflection of the real
world. These include fuzzy set and probabilistic approaches that accommodate both
measurements and perceptions [Zadeh 1999].
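
By way of illustration (this sketch is not from the book), the same parking observation can be recorded in a purely numeric and symbolic attribute-value form, or in a richer form in which an attribute may also take an imprecise, perception-style value such as a fuzzy set over granules of its universe; the granule names used here are hypothetical.

```python
# Two hypothetical encodings of one parking observation.

crisp_observation = {
    "TimeToDestination": 12,        # a measurement, in minutes
    "NumberOfFreeSpaces": 35,
    "ParkingStatus": "p",
}

imprecise_observation = {
    # a perception ("a fairly short trip"), expressed as a fuzzy set over granules
    "TimeToDestination": {"Short": 0.7, "Medium": 0.3},
    "NumberOfFreeSpaces": 35,
    "ParkingStatus": "p",
}
```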

The main goal of the knowledge discovery process is to learn a general-purpose hypothesis or model (commonly known as the knowledge base in the AI literature) that
covers the training observations and that generalises to new unseen observations. A
hypothesis language, generally more expressive than the observation language, is used
as a means of expressing such models. The hypothesis language describes both the
internal state and the output of the machine learning algorithm, which corresponds to its
theory of the concepts or patterns that exist in the observations. Once again the
hypothesis language introduces a bias or limit on what may or may not be learned. For
example, in the language of attributes and values (such as propositional logic),
relationships can prove difficult to represent, whereas a more expressive language such
as first-order logic or second-order logic can easily be used to describe such
relationships.
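
The contrast can be made concrete with the parking example. The sketch below (illustrative only, not from the book) compares a propositional rule, which may only test each attribute against constants, with a relational rule that compares the two attributes to each other, as the simple rule "if TimeToDestination < NumberOfFreeSpaces then parking is available" requires; the thresholds in the propositional version are hypothetical.

```python
# Hypothetical illustration of hypothesis-language expressiveness.

def propositional_rule(time_to_destination, free_spaces, t=10, s=20):
    """Attribute-value tests only: each attribute is compared with a constant."""
    return time_to_destination < t and free_spaces > s

def relational_rule(time_to_destination, free_spaces):
    """A single test relating the two attributes to each other."""
    return time_to_destination < free_spaces

# A case captured directly by the relational rule but missed by this particular
# propositional rule; a propositional learner can only approximate the diagonal
# decision boundary with many axis-parallel tests.
print(relational_rule(15, 18))      # -> True
print(propositional_rule(15, 18))   # -> False
```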

2.2 GENERAL PURPOSE KNOWLEDGE - INFERENCE AND DECISION MAKING

A further aspect of knowledge representation is the inference engine and decision making mechanisms, a portion of knowledge that tends to be domain independent in all
forms of KR. It normally encodes general methods of reasoning and operates on the
concepts, induced from the observations, producing inference on new observations.
Inference can be seen as the process that derives new information or inferences from a
given set of observations using a model or knowledge base. Within AI-based
approaches to KR, inference normally takes the form of meta-knowledge known as an
interpreter (inference engine), represented in the formalism of predicate calculus, where
the inference is normally deductive in nature. Within probability theory, the inference
mechanism is based upon a conditioning or a belief updating mechanism such as
Bayes' rule. Inference engines can play important roles in KR and sometimes hold the
key to generalisation, e.g. in case-based reasoning (competitive-based generalisation).

Decision making is performed on the results of inference and can take many forms. For
example, in a Bayesian classifier decision making could involve taking the class
associated with the maximum posterior probability as the classification of the input
data. Fuzzy inference for predictive problem domains (Le. continuous valued outputs)
is a type of forward chaining, where activated rules contribute collectively (a process
known as defuzzification) to the point valued solution Le. decision making reduces to
selecting a single output value from a set of possible values.
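
As a concrete illustration of these two decision-making schemes, the sketch below (not from the book) picks the class with the maximum posterior probability and, for a predictive output, collapses the contributions of activated fuzzy rules into a single value using a simple weighted-average defuzzification; all of the numbers are hypothetical.

```python
# Hypothetical decision-making sketches.

def max_posterior_decision(posteriors):
    """posteriors: dict mapping class label -> posterior probability."""
    return max(posteriors, key=posteriors.get)

def weighted_average_defuzzification(rule_outputs):
    """rule_outputs: list of (activation degree, rule output value) pairs."""
    total = sum(degree for degree, _ in rule_outputs)
    return sum(degree * value for degree, value in rule_outputs) / total

print(max_posterior_decision({"p": 0.7, "n": 0.3}))                  # -> 'p'
print(weighted_average_defuzzification([(0.8, 10.0), (0.2, 30.0)]))  # -> 14.0
```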

2.3 UNCERTAINTY AND KNOWLEDGE REPRESENTATION

Knowledge representation is intimately related to uncertainty. In general, when modelling the real world, uncertainty abounds. This uncertainty usually results from an
incomplete and incorrect model of the problem domain. This could be caused by any or
all of the following (note: list is not exhaustive): the inability to express fully a model
of the problem domain, due to the encyclopaedic nature of the domain, a limitation on
the expressiveness of the knowledge representation used or computational limitations;
theoretical ignorance in a domain (e.g. cancer diagnosis) and data or variable
deficiencies (e.g. when data from a sensor is not available). These deficiencies lead to
different types of uncertainty such as stochastic uncertainty, imprecision, ignorance,
vagueness, ambiguity, and inconsistency amongst others. The harsh reality of systems
modelling was very eloquently stated by Albert Einstein in 1921 as follows:

So far as the laws of mathematics refer to reality, they are not certain. And so
far as they are certain, they do not refer to reality.

Nevertheless, uncertainty can in some situations be a valuable commodity. This is contrary to traditional scientific belief, where uncertainty was viewed as undesirable
and to be avoided at all costs. For example, allowing more uncertainty (such as
imprecision) may reduce the complexity of a system and permit its solution in a

satisfactory manner. Consider the parking example. A crisp decision tree will require a
lot of leaf nodes to model this problem, whereas using fuzzy sets (a form of
imprecision) to partition each universe will lead to a more transparent model (less
bushy decision tree) with satisfactory performance for this problem. This model,
though not a perfect model of reality (as measured in terms of its accuracy on a test
dataset), may provide satisfactory performance as measured in terms of decreased
downtown pollution. In some cases, uncertainty may increase the generalisation power
of a learnt system (see Chapters 3 and 9). In addition, explicitly managing uncertainty
can increase model transparency and user confidence or credibility in the model.

Fuelled by reality (according to Einstein) and the possibilities that uncertainty can
afford in problem solving, researchers have introduced many new theories of
uncertainty such as, fuzzy set theory [Zadeh 1965], Dempster-Shafer theory [Dempster
1967; Shafer 1976], possibility theory [Zadeh 1978], and nonmonotonic logic [Bobrow
1980; McDermott and Doyle 1980]. These new theories have led to many successful
applications in fields where approaches that do not cater explicitly for uncertainty have
failed; these fields include speech recognition [Rabiner 1989] and control [Ralescu and
Hartani 1995; Terano, Asai and Sugeno 1992; Yen and Langari 1998]. Applications in
these fields and in other domains have demonstrated not only the power of uncertainty
but also the necessity of uncertainty for model representation and model learning. The
remaining chapters in this part of the book describe different types of uncertainty and
related approaches; in particular stochastic uncertainty, imprecision, ignorance and
inconsistency. Furthermore, Part IV introduces a new form of knowledge
representation, Cartesian granule features, that exploits uncertainty, in the form of
imprecision, in order to provide more succinct and possibly more natural descriptions
of systems. In addition, imprecision provides improved generalisation when learning
such systems (see Chapter 10 for more details).

2.4 DESIDERATA OF KNOWLEDGE REPRESENTATION

The method of knowledge representation plays a crucial role in knowledge discovery and in AI in general. The following describes how the method of representation
influences the process of knowledge discovery and ultimately model performance:

1. As mentioned previously, uncertainty management may prove to be a useful commodity in some problem domains, not just for representing systems but
also for learning such systems.
2. KR determines the observation language. The accuracy and reliability of a
learned concept depends heavily on the accuracy and reliability of the
observed instances. Most learning algorithms to date require observations to
be numerical or symbolic, which in some cases compromises what can be
learned.
3. KR determines the concepts that an algorithm can or cannot learn. An
extreme example of this is given in [Minsky and Papert 1969], where a two-
input perceptron could not be trained to recognise when the values of these
inputs were different.

4. KR affects the speed of learning. Some representations lend themselves to more efficient implementation than others. For example, in training neural
networks, the training data is presented many times to the learning algorithm,
thus requiring heavy computational effort. In contrast, some probabilistic
approaches have very low computational requirements for training, often
requiring only one presentation of the data.
5. KR determines whether background knowledge can be incorporated directly
into the learning process.
6. KR determines the transparency or understandability of the induced
concept description. As noted in [Sammut 1993], a representation that is
opaque to the user may allow the program to learn, but a representation that is
transparent allows the user to learn also.
7. KR determines how the learning algorithm interacts with the problem domain.
For example, it dictates whether learning has to take place incrementally or
non-incrementally (one shot learning), thereby facilitating model update or
extension.
8. KR determines whether learning is possible from sparse data. This may be
important in areas where data is expensive or impossible to get, or new areas
of application (typically known as the cold start). For example, in an image
database system one might like to learn the concept of Jimi's great-
grandmother Shanahan, but only a few examples (photographs) may exist.
9. KR determines whether learning can take place in a decentralised manner,
such as agent-based learning, an aspect of learning algorithms that is becoming
very much a requirement in an age of decentralised and mobile computing and
also where discretion in these matters is of the utmost importance.
10. KR determines whether hierarchical concepts can be learned. This can prove
to be critical in many domains of application, for example, modelling
perception [Ralescu and Shanahan 1999].
11. KR determines whether classification, regression and unsupervised learning
problems can be dealt with in a homogeneous manner.
12. KR determines the stability of the information in the induced model, i.e.
minor changes in training observations can lead to totally different induced
models.

2.5 A TAXONOMY OF KNOWLEDGE REPRESENTATION

In the previous sections it was shown how knowledge representation clearly has a big
influence on the knowledge discovery process. The remainder of this chapter presents a
taxonomy of the commonly used approaches to knowledge representation within the
field of knowledge discovery:

• symbolic;
• probabilistic;
• fuzzy sets and logic;
• mathematical;
• prototype.

A brief description of each of these broad categories of KR is provided. In addition, each category is evaluated from the KR desiderata perspective. Subsequent chapters
present, in more detail, the main forms of KR in soft computing: fuzzy set theory
(Chapter 3); fuzzy logic (Chapter 4); various theories of probability (Chapter 5); and a
new approach to KR based on Cartesian granule features [Baldwin, Martin and
Shanahan 1997a; Shanahan 1998] (Chapter 8).

2.5.1 Symbolic-based approaches


Symbolic machine learning approaches learn concepts by constructing symbolic
descriptions. Commonly used forms of KR in this category include propositional logic
representation (example machine learning approaches include decision tree algorithms
such as ID3 [Quinlan 1986], see Section 7.5.2.1 for a description), and first-order logic
representation (examples include inductive logic programming (ILP) approaches such
as FOIL [Quinlan 1990] and CIGOL [Muggleton and Buntine 1988]). Inference is
performed in a deductive manner. Advantages of using this form of KR include:

• The induced models tend to be very transparent, glass-box in nature and can provide insight into the problem domain [King et al. 1992].
• Induced models can be updated and extended quite easily.
• Decentralised learning can be facilitated.
• Problem domains with sparse data can be handled quite effectively
[Michie, Spiegelhalter and Taylor 1993].
• Background knowledge can be easily facilitated by the learner especially
in the ILP case [Muggleton 1999].

Disadvantages of using this form of KR include:

• Decision boundaries tend to be box-like (piecewise linear) in nature, thereby requiring representation at a much more detailed level in order to
overcome this problem. Consequently, this leads to more complex models
and a possible loss in transparency (and the risk of overfitting). Examples
of this type of behaviour are presented in Chapter 10, where crisp
Cartesian granule feature models, which exhibit classification behaviours
similar to decision trees, are demonstrated.
• Information stability can be a problem, especially in the decision tree-
based algorithms (such as C4.5, FOIL) for classification problems, where
tiny changes in universe partitions can lead to decision trees with very
different behaviours.
• Most of the proposed approaches ignore regression problems or do not
perform very well [Michie, Spiegelhalter and Taylor 1993].
• In general, uncertainty management is not facilitated.

2.5.2 Probability-based approaches


Probabilistic approaches represent knowledge in terms of probability distributions that
may be conditional or unconditional, and point-based (e.g. Bayesian networks [Pearl
1988]) or set-based (e.g. Dempster-Shafer theory [Shafer 1976]). For example, in the

case of Bayesian networks, knowledge can be organised into an intuitive and relatively
transparent directed graph structure (sometimes known as a knowledge map), where
each node represents a variable and a probability distribution, and the directed arcs
represent causal relationships between variables. Inference in probabilistic approaches
relies generally on the conditioning operation (such as Bayes' Rule [Bayes 1763]) or
belief revision, both of which update existing probabilities given evidence. Decision
making is based, in general, on choosing a hypothesis that has a maximum (posterior)
probability or utility; a minimal illustration of this decision rule is sketched at the end
of this subsection. A full description of probabilistic approaches to KR is given in
Chapter 5. The advantages of these approaches to KR include:

• Representation of the domain knowledge and reasoning mechanisms can


be intuitive and transparent to the human.
• Can manage uncertainty of various forms such as stochastic, imprecision
and ignorance.
• Information deficiencies can be detected easily.
• These algorithms can produce quite accurate classification and regression
models, even in complex problem domains [Barrett and Woodall 1997;
Heckerman 1991].
• Background knowledge can easily be facilitated and exploited by the
learner.
• Extending and updating models, while still an open research issue for
some of the approaches, can be facilitated easily.
• Learning in a decentralised manner is possible.

Disadvantages of using this form of KR include:

• The specification of probability distributions is a very difficult and
time-consuming task. Though learning approaches have been developed to
alleviate this problem, they can run into the curse of dimensionality
problem [Bellman 1961].
• Inference in some probabilistic frameworks can be NP hard.
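
As a minimal illustration of the Bayesian decision rule described above, the following
Python sketch conditions a prior on a piece of evidence via Bayes' rule and picks the
maximum a posteriori hypothesis; the hypotheses, priors and likelihoods are invented
for illustration and are not drawn from the text.

# Minimal sketch of Bayes' rule conditioning and maximum-posterior decision
# making; the priors and likelihoods below are invented for illustration only.

def posterior(priors, likelihoods, evidence):
    """Return Pr(h | evidence) for each hypothesis h via Bayes' rule."""
    joint = {h: priors[h] * likelihoods[h][evidence] for h in priors}
    total = sum(joint.values())
    return {h: p / total for h, p in joint.items()}

priors = {"spam": 0.3, "ham": 0.7}
likelihoods = {"spam": {"contains_offer": 0.8, "no_offer": 0.2},
               "ham":  {"contains_offer": 0.1, "no_offer": 0.9}}

post = posterior(priors, likelihoods, "contains_offer")
decision = max(post, key=post.get)      # maximum a posteriori hypothesis
print(post, decision)                   # spam ~ 0.77, ham ~ 0.23 -> 'spam'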

2.5.3 Fuzzy-based approaches


Fuzzy set based approaches learn concepts by constructing rule-based symbolic
descriptions, where attribute values correspond to fuzzy sets. In general, one concept
(in the case of classification problems) corresponds to one rule, thus the concept rule
consists of attributes whose values correspond to fuzzy set descriptions of that concept
over that attribute universe. Inference is normally based on fuzzy reasoning (see
Chapter 4). Examples of learning approaches using this type of KR include the
mountain method [Yager 1994], the data browser (see Section 7.5.2.3) [Baldwin and
Martin 1995] and FCM-based approaches such as [Sugeno and Yasukawa 1993]. A
relatively new form of knowledge representation based on Cartesian granule features
permits the representation of knowledge in terms of a rule-based network of low-order
semantically related features. This form of KR helps overcome problems such as
decomposition error and knowledge transparency (see Part IV of this book).

Advantages of these approaches include:



• The models tend to be transparent and qualitative in nature.
• Models can be updated and extended, although this remains a topic of
further research.
• Decentralised learning can be facilitated.
• Can manage uncertainty of various forms such as imprecision.
• Problem domains with sparse data can be handled quite effectively
[Sugeno and Yasukawa 1993].
• Background knowledge can be facilitated and exploited by the learner
[Baldwin and Martin 1995].
• Observation language can include both number and linguistic values.
• Approaches can handle both classification and regression problems
[Baldwin and Martin 1995; Sugeno and Yasukawa 1993].

Disadvantages of using this form of KR include:

• Some approaches may suffer from decomposition error when there are
dependencies among problem domain variables. Examples of this type of
problem are presented in Chapter 10.

2.5.4 Mathematical-based approaches


Approaches in this category rely mainly on mathematical forms of knowledge
representation. Generally, knowledge is represented in terms of systems of equations
and functions. Typical approaches in this area include statistical, connectionist
approaches [Hertz, Anders and Palmer 1991] and oblique decision tree approaches
[Murphy, Kasif and Salzburg 1994]. For example, in the case of feed-forward neural
networks, knowledge is represented as a multi-layer weighted directed graph
(perceptrons) of threshold nodes and weighted arcs that link nodes. Inference is
performed by the nodes that spread activation from input feature nodes through internal
nodes to output nodes. Weights on the links determine how much activation is passed
on in each case. Decision making in the case of prediction problems corresponds to the
result of inference, whereas in the case of classification decision making corresponds to
taking the class associated with the output node that has the highest activation value.
Because of the low level at which knowledge is represented in mathematical
approaches, it is quite difficult to program them manually - hence learning is an
essential component of these approaches. Researchers and practitioners in this area tend to be
more concerned with generating accurate concept boundaries than with model
transparency. Advantages of these KR approaches that rely on mathematical-based KR
include:

• Algorithms exist which have nice mathematical properties, such as
convergence.
• These algorithms can produce quite accurate classification and regression
models, even in complex problem domains [Baldwin, Martin and
Shanahan 1997b; Michie, Spiegelhalter and Taylor 1993].

Disadvantages of using this form of KR include:



• Observation language is usually restricted to vectors of numbers.


• The induced models are not normally amenable to human inspection. For
example, in the case of neural networks, the hypothesis language is
usually an array of real numbers.
• Background knowledge cannot be easily facilitated by the learner.
• In general, these approaches do not provide explicit mechanisms for
managing uncertainty.
• Extending and updating models for most approaches is still an open
research issue.
• Learning in a decentralised manner may prove difficult.

2.5.5 Prototype-based approaches


Instance-based approaches can be used in both a supervised (input and output
variables are provided) and an unsupervised learning (only input variables are provided)
context. Generally, in a supervised context, instance-based approaches represent
concepts in terms of prototypical examples. These instances are directly stored in
memory (a lookup table) using the observation language (or a transformed observation
language, e.g. using principal components analysis [Jolliffe 1986]) and do not construct
abstract representations (hypotheses). Unsupervised approaches, generally, search for
structure in the data and then use prototype instances to represent this structure.
Classifications or labels can then be assigned to these instances, thus enabling inference
and decision making as presented subsequently. Typical approaches in this genre
include Kohonen networks [Kohonen 1984], and Fuzzy C-means [Bezdek 1981;
Ruspini 1969].

Inference, in general for these approaches, is based upon a similarity measure and
decision making is based upon a nearest neighbour strategy (i.e. the classification of an
unlabelled case is the class of its most similar neighbour in memory); a minimal sketch
of this scheme is given after the list below. This simple scheme works well [Langley
1996] and is tolerant to some noise in the data. It can learn from sparse data and is
amenable to updating and extension. Disadvantages of this approach include:

• It can require a large amount of storage capacity.


• For this style of representation it is necessary to define a similarity metric
(upon which generalisation greatly depends) for objects in the universe.
This can prove to be difficult when the objects are quite complex.
• This form of knowledge representation is not human readable.
• It cannot be used easily or intuitively to learn hierarchical concepts.
• Uncertainty management is not facilitated.
• Background knowledge cannot be accommodated easily.
• Domain of application is typically restricted to classification problems.
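
The following Python sketch illustrates the nearest neighbour inference and decision
scheme mentioned above, using negated Euclidean distance as the similarity measure;
the stored instances and the query are illustrative values only.

# A minimal sketch of prototype-based inference: nearest-neighbour decision
# making over stored instances, using Euclidean distance as the (dis)similarity
# measure. The instances below are invented for illustration.
import math

memory = [((170.0, 65.0), "medium"),
          ((185.0, 90.0), "tall"),
          ((160.0, 50.0), "small")]

def classify(query):
    """Return the class of the stored instance most similar to the query."""
    def distance(a, b):
        return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))
    nearest_features, label = min(memory, key=lambda item: distance(item[0], query))
    return label

print(classify((182.0, 85.0)))   # -> 'tall'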

Support vector machines [Cristianini and Shawe-Taylor 2000; Vapnik 1995], a
mathematical approach to machine learning, can be viewed as a special type of
prototype-based KR, where the stored instances correspond to instances that exist close
to concept boundaries, in contrast to prototypical instances.

2.6 SUMMARY

This chapter has introduced the key components of knowledge representation: the
observation language, the hypothesis language, and the general purpose inference and
decision making mechanisms. Some desiderata of knowledge representation were
outlined and their effect on knowledge discovery discussed. A taxonomy of KR
approaches commonly used within knowledge discovery was presented and the
constituent categories discussed with respect to the KR desiderata. The remaining
chapters of this part of the book present in greater detail some of the soft computing
approaches to knowledge representation: fuzzy set theory; fuzzy logic; point-based
probability theories; interval-based probability theories. Part IV of the book introduces
a new form of KR based on Cartesian granule features, and associated induction
algorithms.

2.7 BIBLIOGRAPHY

Baldwin, J. F., and Martin, T. P. (1995). "Fuzzy Modelling in an Intelligent Data


Browser." In the proceedings of FUZZ-IEEE, Yokohama, Japan, 1171-1176.
Baldwin, J. F., Martin, T. P., and Shanahan, J. G. (1997a). "Modelling with words
using Cartesian granule features." In the proceedings of FUZZ-IEEE,
Barcelona, Spain, 1295-1300.
Baldwin, J. F., Martin, T. P., and Shanahan, J. G. (1997b). "Structure identification of
fuzzy Cartesian granule feature models using genetic programming." In the
proceedings of IJCAI Workshop on Fuzzy Logic in Artificial Intelligence,
Nagoya, Japan, 1-11.
Barrett, J. D., and Woodall, W. H. (1997). "A probabilistic alternative to fuzzy logic
controllers", lIE Transactions, 29:459-467.
Bayes, T. (1763). "An essay towards solving a problem in the doctrine of chances",
Philosophical transactions of the Royal Society of London, 53:370-418.
Bellman, R. E. (1961). Adaptive Control Processes. Princeton University Press.
Bezdek, J. C. (1981). Pattern Recognition with Fuzzy Objective Function Algorithms.
Plenum Press, New York.
Bobrow, D. G. (1980). "Special issue on non-monotonic logics", Artificial Intelligence,
13.
Bruner, J. S., Goodnow, J. J., and Austin, G. A. (1956). A Study of Thinking. Wiley,
New York.
Cristianini, N., and Shawe-Taylor, J. (2000). An introduction to Support Vector
Machines. Cambridge University Press, Cambridge, UK.
Dempster, A. P. (1967). "Upper and Lower Probabilities Induced by Multivalued
Mappings", Annals of Mathematical Statistics, 38:325-339.
Hayes, P. (1999). "Knowledge Representation", In The MIT Encyclopedia of the
Cognitive Sciences, R. A. Wilson and F. C. Keil, eds., MIT Press, Cambridge,
Mass., 432-433.
Heckerman, D. (1991). Probabilistic similarity networks. MIT Press, Cambridge, Mass.

Hertz, J., Anders, K., and Palmer, R. G. (1991). Introduction to the Theory of Neural
Computation. Addison-Wesley, New York.
Holland, J. H., Holyoak, K. J., Nisbett, R. E., and Thagard, P. R. (1986). Induction:
Process of Inference, Learning, and Discovery. MIT Press, Cambridge, Mass.,
USA.
Jolliffe, I. T. (1986). Principal Component Analysis. Springer, New York.
King, R. D., Lewis, R. A., Muggleton, S. H., and Sternberg, M. J. E. (1992). "Drug
design by machine learning: the use of inductive logic programming to model
the structure-activity relationship of trimethoprim analogues binding to
dihydofolate reductase", Proceedings of the National Academy of Science, 89.
Kohonen, T. (1984). Self-Organisation and Associative Memory. Springer-Verlag,
Berlin.
Langley, P. (1996). Elements of Machine Learning. Morgan Kaufmann, San Francisco,
CA, USA.
McDermott, D., and Doyle, J. (1980). "Non-monotonic logic I", Artificial Intelligence,
13(1-2):41-72.
Michie, D., Spiegelhalter, D. J., and Taylor, C. C., eds. (1993). "Machine Learning,
Neural and Statistical Classification", Ellis Horwood, New York, USA.
Minsky, M., and Papert, S. (1969). Perceptrons: An introduction to computational
geometry. MIT Press, Cambridge, MA.
Muggleton, S. (1999). "Scientific knowledge discovery using inductive logic
programming", Communications of the ACM, 42(11):43-46.
Muggleton, S., and Buntine, W. (1988). "Machine invention of first order predicates by
inverting resolution." In the proceedings of Fifth International Conference on
Machine Learning, Ann Harbor, MI, USA, 339-352.
Murphy, S. K., Kasif, S., and Salzburg, S. (1994). "A system for induction of oblique
decision trees", Journal of Artificial Intelligence Research, 2:1-33.
Pearl, J. (1988). Probabilistic reasoning in intelligent systems: networks of plausible
inference. Morgan Kaufmann, San Mateo.
Quinlan, J. R. (1986). "Induction of Decision Trees", Machine Learning, 1(1):81-106.
Quinlan, J. R. (1990). "Learning logical definitions from relations", Machine Learning,
5(3):239-266.
Rabiner, L. R. (1989). "A tutorial on hidden Markov models and selected applications
in speech recognition", Proceedings of the IEEE, 77(2):257-286.
Ralescu, A. L., and Hartani, R. (1995). "Some issues in fuzzy and linguistic
modelling." In the proceedings of Workshop on Linguistic Modelling, FUZZ-
IEEE, Yokohama, Japan.
Ralescu, A. L., and Shanahan, J. G. (1999). "Fuzzy perceptual organisation of image
structures", Pattern Recognition, 32:1923-1933.
Ruspini, E. H. (1969). "A New Approach to Clustering", Information and Control, 15(1):22-32.
Sammut, C. (1993). "Knowledge Representation", In Machine Learning. Neural and
Statistical Classification, D. Michie, D. J. Spiegelhalter, and C. C. Taylor,
eds., 228-245.
Shafer, G. (1976). A Mathematical Theory of Evidence. Princeton University Press.
Shanahan, J. G. (1998). "Cartesian Granule Features: Knowledge Discovery of
Additive Models for Classification and Prediction", PhD Thesis, Dept. of
Engineering Mathematics, University of Bristol, Bristol, UK.
Sugeno, M., and Yasukawa, T. (1993). "A Fuzzy Logic Based Approach to Qualitative
Modelling", IEEE TrailS Oil Fuzzy Systems, I( I): 7-31.

Terano, T., Asai, K., and Sugeno, M. (1992). Applied fuzzy systems. Academic Press,
New York.
Vapnik, V. (1995). The nature of statistical learning theory. Springer-Verlag, Berlin.
Yager, R. R. (1994). "Generation of Fuzzy Rules by Mountain Clustering", J.
Intelligent and Fuzzy Systems, 2:209-219.
Yen, J., and Langari, R. (1998). Fuzzy logic: intelligence, control and information.
Prentice Hall, London.
Zadeh, L. A. (1965). "Fuzzy Sets", Journal of Information and Control, 8:338-353.
Zadeh, L. A. (1978). "Fuzzy Sets as a Basis for a Theory of Possibility", Fuzzy Sets and
Systems, 1:3-28.
Zadeh, L. A. (1999). "Some reflections on the relationship between AI and fuzzy logic
(FL) - a heretical view", In Fuzzy logic in AI (Selected and invited papers from
IJCAI workshop, 1997, Nagoya, Japan), A. L. Ralescu and J. G. Shanahan,
eds., Springer, Tokyo, 1-8.
CHAPTER 3
FUZZY SET THEORY
This chapter presents the fundamental ideas behind fuzzy set theory. It begins with a
review of traditional set theory (commonly referred to as crisp or classical set theory in
the fuzzy set literature) and uses this as a backdrop, against which fuzzy sets and a
variety of operations on fuzzy sets are introduced. Various justifications and
interpretations of fuzzy sets as a form of knowledge granulation are subsequently
presented. Different families of fuzzy set aggregation operators are then examined. The
original notion of a fuzzy set can be generalised in a number of ways leading to more
expressive forms of knowledge representation; the latter part of this chapter presents
some of these generalisations, where the original idea of a fuzzy set is generalised in
terms of its dimensionality, type of membership value and element characterisation.
Finally, fuzzy set elicitation is briefly covered for completeness (Chapter 9 gives a
more detailed coverage of this topic).

3.1 CLASSICAL SET THEORY

The following excerpt provides a very good illustration of the potential use of crisp
sets and their limitation in representing the real world (this limitation will be elaborated
upon in Section 3.2).

"We begin with what seems a paradox. The worLd of experience of qny normaL
man is composed of a tremendous array of discriminabLy different objects,
events, peopLe, impressions... But were we to utilize fully our capacity for
registering the differences in things and to respond to each event encountered
as unique, we wouLd soon be overwheLmed by the compLexity of our
environment... The resoLution of this seeming paradox... is achieved by man's
ability to categorize. To categorize is to render discriminabLy different things
equivaLent, to group objects and events and peopLe around us into classes ... "
[Bruner, Goodnow and Austin 1956]

As noted by Bruner et al. in their landmark work "A study of thinking" [Bruner,
Goodnow and Austin 1956], categories are a necessary abstraction of the real world for
humans in order to survive. Bruner et al. demonstrated how categories or classes could
be mathematically represented as classical sets, where each element in the set has a
common ("equivalent") property.

More formally, in classical set theory, a set A is any collection of definite and distinct
objects that inhabit a universe Ω_X. Ω_X refers to the universe of values x_j that a variable X
can be assigned. Each element of Ω_X either belongs to the set A or not. This is denoted

mathematically as x ∈ A (x is a member of set A) and as x ∉ A (x is not a member of set
A). Given a universe Ω_X, a common way of defining any set A that consists of some
objects from Ω_X is to assign the number 1 to each member of Ω_X that is also a member
of A, and to assign the number 0 otherwise. This assignment is referred to as the
characteristic function and can be written mathematically as follows:

A(x_j) = 1 if x_j ∈ A
         0 if x_j ∉ A      ∀ x_j ∈ Ω_X

where ∀ denotes "for all". A set can be defined either by listing all its members
(enumeration) or by specifying properties that members of a set possess. Enumeration
is, of course, restricted to finite sets and is normally denoted as follows for a set A
defined over a universe Ω_X = {x_1, ..., x_n}:

A = {x_i1, x_i2, ..., x_ik}

Figure 3-1: Examples of classical sets over the universe of height values expressed in
centimetres: (a) a point valued set; (b) an interval-valued set.

On the other hand, sets characterised by various properties such as P_1, ..., P_n can be
denoted as follows:

A = {x_j ∈ Ω_X | P_1(x_j), P_2(x_j), ..., P_n(x_j)}

where each element x_j satisfies each property P_i, that is, P_i(x_j) is true for each property
P_i. This latter notation can be used to denote both finite and infinite sets. "|" denotes
such that. Figure 3-1 graphically depicts two sets: (a) corresponds to the height point
value of 170 cm; (b) corresponds to the set of tall people, Tall(x_j), i.e. people who
possess the property of having a height in the interval [170, 190].
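
As a small concrete illustration of the characteristic function, the following Python
sketch encodes the crisp interval-valued set Tall = [170, 190] of Figure 3-1(b); the
sample heights used in the loop are illustrative values only.

# A sketch of the characteristic function of a crisp set: the interval-valued
# set Tall = [170, 190] over the universe of heights (Figure 3-1(b)).

def tall(x):
    """Characteristic function of the crisp set Tall = [170, 190]."""
    return 1 if 170 <= x <= 190 else 0

for height in (165, 170, 180, 190, 195):
    print(height, tall(height))   # 0, 1, 1, 1, 0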

3.2 Fuzzy SET THEORY

This section begins by presenting some of the motivations behind the introduction of
fuzzy sets. Subsequently, a fuzzy set is formally defined and concrete examples
provided. Finally, this section describes various interpretations of fuzzy sets.

3.2.1 Motivations
Even though set theory is a well-developed and understood area of mathematics with
numerous applications in engineering and science, its dichotomous nature (either an
object is a member or not of a set) has a number of shortcomings, which can arise when
it is used to model the real world. Borel [Borel 1950] highlights some of these
shortcomings as follows:

One seed does not constitute a pile nor two or three ... from the other side
everybody will agree that 100 million seeds constitute a pile. What therefore is
the appropriate limit? Can we say 325,647 seeds don't constitute a pile but
325,648 do?

This excerpt emphasises two major difficulties with traditional set theoretic approaches:

• Continuity of concepts (or categories) abounds in the real world (see also
Bruner's excerpt in Section 3.1) and often cannot be described precisely,
that is sharp boundaries between categories may not be easy to elicit (pile
versus not pile).

• Furthermore, if "we were to utilize fully our capacity for registering the
differences in things (a pile of 320,000 seeds versus a pile of 325,648 seeds)
and to respond to each event encountered as unique, we would soon be
overwhelmed by the complexity of our environment" [Bruner, Goodnow
and Austin 1956]. Here Bruner et al. refer to human capabilities, but the
same argument holds for computational systems.

B. Russell [Russell 1923] states the first point more eloquently and radically in the
context of traditional logic approaches:

All traditional logic habitually assumes that precise symbols are being
employed. It is therefore not applicable to this terrestrial life but only to an
imagined celestial existence.

The second point relates directly to Zadeh's principle of incompatibility [Zadeh 1973]:

As the complexity of a system increases, our ability to make precise and yet
significant statements about its behaviour diminishes until a threshold is
reached beyond which precision and significance (or relevance) become
almost mutually exclusive characteristics.

Uncertainty usually results from the inability to capture a complete (as highlighted
above) and correct model of the problem domain. Real world situations are often very
uncertain in a number of ways. Due to a lack of information, the future state of a system
might not be known completely. This type of uncertainty, often referred to as stochastic
uncertainty, has long been handled by probability theory (see Chapter 5) and statistics.
In these approaches, it is assumed that the events or statements are well defined.
However, there may be situations where it is not possible to describe precisely events or
phenomena (due to a lack of information or processing power, human or otherwise).
This lack of definition arising from imprecision is called fuzziness and it abounds in the
real world; for example, in areas such as natural language, engineering, medicine,
meteorology, and manufacturing [Ruspini, Bonissone and Pedrycz 1998; Zimmermann
1996]. Examples of fuzziness include concepts such as tall people, red roses,
creditworthy customers, low volume, where the boundaries between concepts are
blurred.

In order to address these and other shortcomings, and to provide a more natural, and
succinct (and possibly transparent) means of representing the real world in
mathematics, in 1965 Zadeh [Zadeh 1965] introduced the notion of a fuzzy set. A fuzzy
set differs from a classical set by relaxing the requirement that each object be either a
member or a non-member of a set. A fuzzy set is a set with boundaries that are
imprecise, where membership is not a matter of affirmation or denial, but a matter of
degree. As in classical set theory, objects that are members of a fuzzy set can be
represented by a characteristic function, which is called a membership function in fuzzy
set theory.

Though the introduction of fuzzy set theory was initially based on intuitive and
common-sense grounds, in the intervening years since its introduction, numerous
supporting theories and applications have provided fuzzy set theory with well defined
and understood semantics and have demonstrated its usefulness as a very intuitive and
powerful means of handling uncertainty in applications ranging from decision support
to pattern recognition [Ralescu 1995a; Ruspini, Bonissone and Pedrycz 1998; Terano,
Asai and Sugeno 1992; Yen and Langari 1998]. A further, more recent motivation for
using fuzzy sets, arises from its use within the field of machine learning [Baldwin,
Martin and Shanahan 1997a; Sugeno and Yasukawa 1993; Yager 1994] (see also part
IV of this book), where fuzzy sets are shown to be a useful, possibly transparent, and
sometimes necessary abstraction of the world in order to achieve good generalisation
within an inductive reasoning framework. This form of generalisation through
abstraction (fuzzy sets in this case) is more succinctly stated in the principle of
generalisation proposed by Baldwin [Baldwin, Martin and Pilsworth 1995]:

The more closely we observe and take into account the detail, the less we are
able to generalise to similar but different situations...

3.2.2 Fuzzy sets


A fuzzy set Ã (in the literature it is common to use the tilde "~" to distinguish between crisp
and fuzzy sets) can be represented mathematically by a membership function that
maps each element x_j in the universe of discourse Ω_X to a membership value in the unit

interval [0, 1] (in contrast to {0, 1} in traditional set theory). This can be expressed
more formally as follows:

Ã: Ω_X → [0, 1]

The degree of membership indicates the degree of compatibility between an object x_j
and the concept or category represented by Ã. By allowing degrees of membership in
the interval [0, 1], fuzzy sets can easily express gradual transitions from membership to
non-membership. A classical set can be viewed as a specialisation of a fuzzy set where
the range of the membership function is restricted to {0, 1}.

As in the case of classical sets, fuzzy sets can be defined by enumerating the objects
that have non-zero membership in the fuzzy set (restricted to finite sets defined on
discrete universes). In crisp set theory, each element of the universe that is listed in a
set is implicitly associated with a membership degree of 1, whereas in fuzzy set theory, it is
necessary to list the element and state explicitly its associated membership value, since
it can have any value in the unit interval [0, 1]. A fuzzy set Ã defined over a universe
Ω_X = {x_1, ..., x_n} is normally denoted as follows:

Ã = {x_1/Ã(x_1) + ... + x_n/Ã(x_n)}

where each x_i/Ã(x_i) represents an element x_i and its corresponding membership in the
fuzzy set Ã, and "/" is used to avoid confusion. The "+" denotes the union of the
singleton elements x_i/Ã(x_i). Alternatively this can be rewritten in shorthand notation
as follows:

Ã = Σ_{j=1..n} x_j/Ã(x_j)

where Σ should be interpreted as union and should not be confused with the standard
algebraic summation. Consider the following example of a discrete fuzzy set describing
large die numbers. Given the universe of die numbers, Ω_DieNumbers = {1, ..., 6}¹, a
plausible definition of Large could be as follows:

Large = {4/0.7 + 5/0.9 + 6/1.0}

Here the die value of 4 has a membership of 0.7 in the fuzzy set Large, indicating its
degree of compatibility with the concept large die number. When the universe is
continuous, the corresponding fuzzy set is normally denoted as follows:

Ã = ∫_{Ω_X} x/Ã(x)

¹ {...} denotes a discrete set. {a, ..., b} denotes a discrete interval such that a ≤ x ≤ b
∀ x ∈ {a, ..., b}.

where the integral ∫ denotes the union of fuzzy singletons. For example, real numbers
close to 2 could be represented by the following fuzzy set (as depicted in Figure 3-2):

About_2(x) = 1 / (1 + p(x − 2)²)                                    (3-1)

where the parameter p controls the width of the fuzzy set. As the value of p increases,
the graph becomes narrower.

Consider another example: if a variable Height is defined on the universe consisting of
the interval [20, 250], then the fuzzy set Tall could be defined as follows (graphically
depicted in Figure 3-3(a)):

Tall(x) = 0                if x ≤ 165
          (x − 165)/5      if 165 < x < 170
          1                if 170 ≤ x ≤ 190
          (200 − x)/10     if 190 < x < 200
          0                if x ≥ 200

In this case, height values in the interval [170, 190] have a membership value of 1 and
correspond to the core of the fuzzy set. Values in the intervals (165, 170)² and (190,
200) have membership values in the range (0, 1), while other values in the universe
have zero membership in this definition of the concept Tall. Values having
membership greater than zero in a fuzzy set correspond to the support of the fuzzy set.
Since the fuzzy set Tall is characterised by a trapezoidal membership function, it may be viewed as
a fuzzy interval or class. Figure 3-3(b) illustrates a triangular fuzzy set About_170.
Triangular fuzzy sets can be viewed intuitively as fuzzy numbers or fuzzy points, that
is, the core is a singleton.
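
The following Python sketch implements the trapezoidal Tall membership function
defined above together with a generic triangular membership function; the parameters
(165, 170, 175) used for About_170 are an assumption read off Figure 3-3(b) rather than
values stated in the text.

# A sketch of the trapezoidal fuzzy set Tall and a triangular fuzzy number
# About_170; the triangular parameters are assumed from Figure 3-3(b).

def tall(x):
    """Trapezoidal membership function for Tall (core [170, 190])."""
    if x <= 165 or x >= 200:
        return 0.0
    if x < 170:
        return (x - 165) / 5.0
    if x <= 190:
        return 1.0
    return (200 - x) / 10.0

def triangular(x, a, b, c):
    """Triangular membership function with core at b and support (a, c)."""
    return max(min((x - a) / (b - a), (c - x) / (c - b)), 0.0)

print(tall(168), tall(175), tall(195))        # 0.6 1.0 0.5
print(triangular(168, 165, 170, 175))         # 0.6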

3.2.3 Notation convention


In the literature, it is common to denote the membership of an object x_j in the fuzzy set
A (denoted above as A(x_j)) as μ_A(x_j), that is, μ_A(x_j) = A(x_j). In addition, since a
crisp set can be viewed as a special case of a fuzzy set, the same notation is used for
both. As a further simplification of notation, fuzzy sets will be denoted by a capital
letter (or capitalised word) and the tilde (~) notation, used until now to distinguish
between fuzzy and crisp sets, will be dropped. For example, the fuzzy set

² (165, 170) corresponds to any value x such that the following condition holds: 165 < x
< 170, whereas [165, 170] represents any value x such that 165 ≤ x ≤ 170.

corresponding to large die numbers will be presented simply as Large, without the tilde. This
notational convention for fuzzy sets and membership values will be adopted for the
remainder of this book.


Figure 3-2: Fuzzy set corresponding to the concept of About_2 as characterised by
Equation 3-1.


Figure 3-3: Examples of (a) an interval-valued fuzzy set, and (b) a fuzzy number,
both of which are defined over the universe of height values expressed in centimetres.

3.2.4 Interpretations of fuzzy sets


Various interpretations and justifications of fuzzy sets have been proposed in the
literature. Fuzzy sets can be seen as a means of representing imprecise values or
concepts, in contrast to the stochastic uncertainty arising from beliefs or expectations
that is captured by probability theory. To highlight the difference between these two
types of uncertainty, consider the following two statements:

(i) It is certain that James was born around the end of the sixties (1960s).
(ii) Probably, James was born in 1967.

In the first statement, the year in which James was born is imprecisely stated but
certain, whereas in statement (ii), the year is precisely stated but there is uncertainty
about the statement being true or false. Uncertainty arising from imprecision can be
very naturally modelled using traditional set theory and its generalisation - fuzzy set
theory and various set-based probabilistic theories such as possibility theory (see
Chapter 5). On the other hand, uncertainty arising from beliefs or expectations has been
addressed by various theories of probability (see Chapter 5 for a detailed presentation
of probability theory).

One of the most natural means of interpreting a fuzzy set in terms of human reasoning
is the voting model [Baldwin 1991; Gaines 1977; Gaines 1978]. Consider a population
of voters, where each voter is asked to describe a value x ∈ Ω_X by voting in a "yes or
no" fashion on each word w ∈ W (a set of words or vocabulary), on its appropriateness
as a label or description of the value x. The membership of x, μ_w(x), in a fuzzy set
characterising the word w is defined to be the proportion of the population who accept
w as a description of the value x. Voters are expected to vote consistently and abide by
the constant threshold assumption [Baldwin, Martin and Pilsworth 1995], according to
which, any voter accepting an element will also accept any other element having a
higher membership in the concept described by the fuzzy set. That is, for a fuzzy set f
defined over the universe Ω_X, a voter must accept (vote yes) any x_i ∈ Ω_X for which
μ_f(x_i) ≥ μ_f(x_j) if the voter accepts x_j, for any x_j ∈ Ω_X. The constant threshold assumption
provides a unique voting pattern. Consider a die variable defined over the discrete
universe {1, ..., 6}. The meaning of the word small can be generated from the voting
patterns of the population for each die value. Table 3-1 presents the voting pattern for a
population of ten voters for the word small across all possible die values. The meaning
of small consists of the list of die values associated with the proportion of voters who
accept small as a description of the respective die values. These proportions correspond
to membership values. For example, the die value 1 will have a membership value of 1
in the fuzzy set denoting small since all voters accept small as a suitable linguistic
description of the die value of 1. The voting pattern presented in Table 3-1 corresponds
to the following fuzzy set description of small: {1/1 + 2/0.2}.
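
The voting-model interpretation can be made concrete with the short Python sketch
below, which computes membership as the proportion of "yes" votes; the vote counts
mirror the pattern of Table 3-1 and are otherwise illustrative.

# A sketch of the voting-model interpretation: the membership of a die value
# in the fuzzy set Small is the proportion of voters who accept "small" as a
# description of that value (vote counts mirror Table 3-1).

votes = {1: ["yes"] * 10,                 # all ten voters accept 1 as small
         2: ["yes"] * 2 + ["no"] * 8,     # two voters accept 2 as small
         3: ["no"] * 10, 4: ["no"] * 10, 5: ["no"] * 10, 6: ["no"] * 10}

small = {value: pattern.count("yes") / len(pattern)
         for value, pattern in votes.items()
         if pattern.count("yes") > 0}

print(small)   # {1: 1.0, 2: 0.2}, i.e. Small = {1/1 + 2/0.2}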

Elicited degrees of membership can have further interpretations depending on the


context of use. For example, the degree of membership may express the degree of
similarity or compatibility of a value with the concept described by the fuzzy set; this
plays a key role in inductive reasoning (for example, unsupervised learning [Baldwin
and Lawry 2000]) and approximate reasoning (which is presented in detail in the next
chapter) where inference, a generalised version of modus ponens, is based upon
similarity or proximity of concepts. Alternatively, fuzzy sets can be also viewed as
constraints on the value of a variable. Consider the case where you know that the value
of a variable is a fuzzy set. Here, the fuzzy set induces a possibility distribution over the
range of values that this variable can assume. The possibility of a value is simply
equated with the membership of that value [Zadeh 1978]. This equivalence mapping,
and Zadeh's possibility/probability consistency principle, presented in Chapter 5, has
led to work that has resulted on establishing a formal relationship between possibility
distributions and probability distributions [Baldwin 1991; Dubois and Prade 1983; Klir

1990; Sudkamp 1992], thus providing a probabilistic interpretation of fuzzy sets. In
short, knowing that the value of a variable X is a fuzzy set f induces a posterior
probability distribution. This posterior is obtained when a prior probability distribution
is conditioned on that fuzzy set, that is, a posterior probability Pr(X = x_i | f) is induced
for any element x_i of the universe Ω_X. On the other hand, given (denoted by |) a
posterior probability distribution it is possible to uniquely determine the fuzzy set that
was used to condition the prior distribution in order to arrive at the posterior
distribution. In other words, given the prior Pr(X) and the posterior Pr(X|f), where f is a
fuzzy event, it is possible to determine f uniquely. This bi-directional mapping between
fuzzy sets and probability distributions is described in Chapter 5. The knowledge
discovery approaches presented in this book build on an extension of probability
theory, where instead of defining probabilities on well defined precise concepts such as
point values or crisp sets, probabilities are defined on imprecise events (that are
characterised by fuzzy sets).

Table 3-1: Voting pattern for ten people corresponding to the interpretation of the
linguistic term small die values. Values for which everybody voted "no" (i.e. 3, 4, 5, 6)
are not shown.
Die value\Person   1    2    3    4    5    6    7    8    9    10
1                  Yes  Yes  Yes  Yes  Yes  Yes  Yes  Yes  Yes  Yes
2                  Yes  Yes  No   No   No   No   No   No   No   No

3.3 PROPERTIES OF FUZZY SETS

Various properties of fuzzy sets can be identified in terms of their associated


membership functions. This section describes some of these properties that form the
basis for concepts introduced in later sections.

Support: The support of a fuzzy set A, Supp(A), is the crisp set of x ∈ Ω_X such that all
elements have a non-zero membership:

Supp(A) = {x ∈ Ω_X | μ_A(x) > 0}

Core: The core of a fuzzy set A, Core(A), is the crisp set of x ∈ Ω_X such that all
elements have a membership of 1:

Core(A) = {x ∈ Ω_X | μ_A(x) = 1}


α-cut: The α-cut is a generalised version of the core in that it corresponds to the crisp
set of elements that have a membership degree in fuzzy set A of at least the degree α:

A_α = {x ∈ Ω_X | μ_A(x) ≥ α}

The core of A corresponds to the 1-cut of fuzzy set A. For example, consider the fuzzy set
A = {0.4/a + 0.6/b + 0.7/c + 1/d}; then the following is a list of possible α-level sets
(α-cuts):

A_0.4 = {a, b, c, d}
A_0.6 = {b, c, d}
A_0.7 = {c, d}
A_1 = {d}
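
The basic fuzzy set properties of this section are easy to compute for a discrete fuzzy
set; the following Python sketch represents a fuzzy set as a dictionary from elements to
membership grades and reproduces the α-cuts listed above.

# A sketch of support, core, alpha-cut and height for a discrete fuzzy set
# represented as a dictionary mapping elements to membership grades.

A = {"a": 0.4, "b": 0.6, "c": 0.7, "d": 1.0}

def support(fs):
    return {x for x, m in fs.items() if m > 0}

def alpha_cut(fs, alpha):
    return {x for x, m in fs.items() if m >= alpha}

def core(fs):
    return alpha_cut(fs, 1.0)

def height(fs):
    return max(fs.values())

print(alpha_cut(A, 0.4))   # {'a', 'b', 'c', 'd'}
print(alpha_cut(A, 0.7))   # {'c', 'd'}
print(core(A), height(A))  # {'d'} 1.0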

An interesting property of α-cuts, which follows immediately from their definition, is
that the total ordering of values in [0, 1] is inversely preserved by the set inclusion of
the corresponding α-cuts, i.e. all α-cuts of any fuzzy set form a distinct set of nested
crisp sets ordered by the corresponding value of α [Negoita and Ralescu 1975].
Formally stated, for any fuzzy set A and pair α_1, α_2 ∈ [0, 1] of distinct values such that
α_1 < α_2, the following holds:

A_α2 ⊆ A_α1

This property can also be expressed as follows, using set intersection and union
operations:

A_α1 ∩ A_α2 = A_α2   and   A_α1 ∪ A_α2 = A_α1

Height: The height of a fuzzy set A, denoted by h(A), is the largest membership grade
obtained by an element in that set. This is formally denoted as follows:

h(A) = max_{x ∈ Ω_X} A(x)

In the case where the universe Ω_X is continuous, Sup (supremum) is used instead of Max.

Normal: A fuzzy set A is normal if the height of A, h(A), is 1, that is, the 1-cut of fuzzy
set A is not the empty set, mathematically stated as follows: A_1 ≠ ∅. If the height of A,
h(A), is less than 1 then fuzzy set A is subnormal.

Normalisation: Fuzzy normalisation converts a subnormal fuzzy set into a normal
equivalent as follows:

Norm_A(x) = A(x) / h(A)   ∀ x ∈ Ω_X

Other approaches to normalisation are presented later where the bi-directional mapping
from a fuzzy set to probability distribution is used as a means of generating normalised
fuzzy sets (see Section 8.2.2).

Cardinality: Various definitions of fuzzy set cardinality have been proposed in the
literature. Some of the more popular measures are described here. The following is one
of the simplest definitions of cardinality, generically denoted as |A| for any set A. Given
a finite fuzzy set A defined on the universe Ω_X, the cardinality of A, denoted by
Σcount(A), is defined as follows [Zadeh 1983]:

Σcount(A) = Σ_{x ∈ Ω_X} μ_A(x)

This is commonly referred to as the sigma count. For example, consider Medium =
{1/0.6 + 2/0.9 + 3/1 + 4/0.7 + 5/0.3}, the fuzzy set denoting medium die values on a
standard 6-faced die, i.e. defined over the universe {1, ..., 6}; the cardinality of Medium,
Σcount(Medium), is 3.5. If, however, Ω_X is continuous, Σcount is defined as follows for a
fuzzy set A defined over Ω_X:

Σcount(A) = ∫_{x ∈ Ω_X} μ_A(x) dx

This definition of cardinality, though simple, is not very useful. Consider a group of
people and let A be the fuzzy set denoting tall people in this group. The use of sigma
count to characterise the number of tall people brings with it the possibility that a group
of people with low membership grades in A (i.e. small people) will add up to a tall
person. In order to overcome this limitation, alternative definitions of cardinality have
been proposed based upon fuzzy numbers [Klir and Yuan 1995; Zadeh 1983].

An example of fuzzy cardinality is the FG count proposed by Zadeh [Zadeh 1983]. The
FG count of the fuzzy set A, denoted as FGCount_A, is a fuzzy set defined on the
non-negative integers where, for each integer i, FGCount_A(i), the membership grade of i in
FGCount_A, indicates the truth of the proposition "A contains at least i elements". This
membership grade is defined as follows:

FGCount_A(i) = sup{α | card(A_α) ≥ i}

where card(A_α) corresponds to the cardinality of crisp sets, that is, the number of
elements in the crisp set A_α, the α-cut of A. Thus, FGCount_A(i) is the level of the
largest level set having at least i members. Reconsidering the fuzzy set Medium defined
above, the fuzzy cardinality of Medium, FGCount_Medium(i), corresponds to the following:

FGCount_Medium = {0/1 + 1/1 + 2/0.9 + 3/0.7 + 4/0.6 + 5/0.3}

For a more complete presentation of fuzzy cardinality see [Ralescu 1995b; Yager
1998].
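
Both cardinality measures can be computed directly from a dictionary representation of
the Medium fuzzy set above, as in the following Python sketch (results are subject to the
usual floating-point rounding).

# A sketch of the sigma count and the FG count for the discrete fuzzy set
# Medium defined above.

medium = {1: 0.6, 2: 0.9, 3: 1.0, 4: 0.7, 5: 0.3}

def sigma_count(fs):
    return sum(fs.values())

def fg_count(fs):
    """FGCount(i) = sup{alpha | card(alpha-cut of fs) >= i}."""
    grades = sorted(fs.values(), reverse=True)      # i-th largest grade
    return {0: 1.0, **{i + 1: g for i, g in enumerate(grades)}}

print(sigma_count(medium))   # approx. 3.5
print(fg_count(medium))      # {0: 1.0, 1: 1.0, 2: 0.9, 3: 0.7, 4: 0.6, 5: 0.3}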

3.4 REPRESENTATION OF FUZZY SETS

Fuzzy sets form abstractions of universes, providing compact ways of manipulating


groups of objects that share something in common, that is, they represent imprecise
concepts. Sometimes these fuzzy sets map directly onto natural language concepts

(voting models by modelling humanistic reasoning can help in modelling this mapping
[Baldwin 1991; Gaines 1977; Gaines 1978]). Numerous representations for fuzzy sets
have been proposed in the literature, most of which attempt to satisfy the following two
criteria: provide an accurate and natural reflection of the real world; and which are
computationally tractable (comptractible). The flexibility of representation comes
usually in the form of parameterised membership functions and comptractibility is
normally met by choosing simple membership functions with few parameters. Table 3-
2 lists typical membership functions and their corresponding graphical representations.
In the case of Gaussian, exponential-like, or Γ membership functions the constant k
controls the width of the fuzzy set. Triangular and trapezoidal shapes of membership
functions are used most often for representing fuzzy sets, due to their simple nature
both from a computational and understandability point of view. For example, the
triangular membership function can be equivalently written as follows:

A(x; a, b, c) = max(min{(x − a)/(b − a), (c − x)/(c − b)}, 0)

where a < b < c, as depicted in Figure 3-3(b).


Figure 3-4: Fuzzy set About_3 with a Gaussian membership function.


Figure 3-5: Fuzzy set Hot (temperature) characterised by a Γ-membership function.

More recently, as fuzzy based systems are being learnt from data, more expressive
membership functions are being adopted, such as piecewise linear representations, in
order to provide a more natural rapport with reality. Chapter 8 of this book introduces a
new type of fuzzy set, a Cartesian granule fuzzy set (briefly introduced in Section 3.7.2),
that represents fuzzy sets as linguistic summaries, in contrast to the curves or formulas
typically used to represent fuzzy sets (see Table 3-2). This type of fuzzy set, while
leading to more succinct and more easily describable representations than its
mathematical looking counterparts, also provides high degrees of accuracy (see Chapter
11).

Table 3-2: Commonly used membership functions.

Function name      Mathematical description                                        See figure

Trapezoidal        μ_A(x) = 0                  if x ≤ a                             Figure 3-3(a)
                            (x − a)/(b − a)    if a < x < b
                            1                  if b ≤ x ≤ c
                            (d − x)/(d − c)    if c < x < d
                            0                  if x ≥ d

Triangular         μ_A(x) = 0                  if x ≤ a                             Figure 3-3(b)
                            (x − a)/(b − a)    if a < x < b
                            1                  if x = b
                            (c − x)/(c − b)    if b < x < c
                            0                  if x ≥ c

Exponential-like   μ_A(x) = 1/(1 + k(m − x)²), where k > 1                          Figure 3-2

Gaussian           μ_A(x) = e^(−k(x − m)²)                                          Figure 3-4

Γ                  μ_A(x) = 0 if x ≤ a; 1 − e^(−k(x − a)²) if x > a, where k > 0    Figure 3-5

3.5 Fuzzy SET OPERATIONS

Before presenting the main operations on fuzzy sets, a review of the basic operations on
classical crisp sets and their properties is presented. The basic set operations considered
here are intersection, union, and negation. As crisp sets can be denoted by their
characteristic functions, these operations can be conveniently described in terms of
these functions. Given two sets A and B defined on universe Ω_X, the operations of
intersection ∩, union ∪, and complement ¬ can be defined as follows:

(A ∩ B)(x) = min(A(x), B(x)) = A(x) ∧ B(x)   ∀ x ∈ Ω_X
(A ∪ B)(x) = max(A(x), B(x)) = A(x) ∨ B(x)   ∀ x ∈ Ω_X
¬A(x) = 1 − A(x)                             ∀ x ∈ Ω_X

where (A ∩ B)(x), (A ∪ B)(x) and ¬A(x) denote the membership values of each value x
in the set resulting from intersection, union and negation. The fundamental properties
of these set operations are summarised in Table 3-3. All concepts of classical set theory
have their generalised counterparts in fuzzy set theory. But, fuzzy counterparts of

classical set-theoretic operations are not unique. Each basic operation on classical sets -
the complement, intersection, and union - is represented by a broad class of operations
in fuzzy set theory. Below a brief overview of these broad classes is presented. This
section begins, however, by presenting the definitions of these set theoretic operations
suggested originally by Zadeh [Zadeh 1965] and subsequently describes the numerous
alternative definitions that have been proposed in the fuzzy set literature. Once again,
operations on fuzzy sets are defined via their membership functions.

Table 3-3: Fundamental properties of crisp set operations.

Involution               ¬¬A = A
Commutativity            A ∪ B = B ∪ A
                         A ∩ B = B ∩ A
Associativity            (A ∪ B) ∪ C = A ∪ (B ∪ C)
                         (A ∩ B) ∩ C = A ∩ (B ∩ C)
Distributivity           A ∩ (B ∪ C) = (A ∩ B) ∪ (A ∩ C)
                         A ∪ (B ∩ C) = (A ∪ B) ∩ (A ∪ C)
Transitivity             A ⊆ B and B ⊆ C implies that A ⊆ C
Idempotence              A ∪ A = A
                         A ∩ A = A
Absorption               A ∪ (A ∩ B) = A
                         A ∩ (A ∪ B) = A
Identity                 A ∪ ∅ = A
                         A ∩ Ω_X = A
Law of contradiction     A ∩ ¬A = ∅
Law of excluded middle   A ∪ ¬A = Ω_X
De Morgan's laws         ¬(A ∩ B) = ¬A ∪ ¬B
                         ¬(A ∪ B) = ¬A ∩ ¬B

Fuzzy set intersection: The membership function characterising the fuzzy set resulting
from the intersection of two fuzzy sets A and B defined over universe Ω_X may be
point-wise defined as follows:

(A ∩ B)(x) = min(μ_A(x), μ_B(x))   ∀ x ∈ Ω_X

For example, consider two fuzzy sets, A and B, defined over the discrete universe of die
values Ω_DieValues = {1, ..., 6} as follows: A = 2/0.8 + 3/1 + 4/0.3 and B = 3/0.2
+ 4/0.8 + 5/1. Then the fuzzy set corresponding to the intersection of A and B, A ∩ B,
is as follows:

A ∩ B = 3/0.2 + 4/0.3


Fuzzy set union: The membership function characterising the fuzzy set resulting from
the union of two fuzzy sets A and B defined over universe Ω_X may be point-wise
defined as follows:

(A ∪ B)(x) = max(μ_A(x), μ_B(x))   ∀ x ∈ Ω_X

For example, consider the two fuzzy sets, A and B, as defined above; then the fuzzy set
corresponding to the union of A and B, A ∪ B, is as follows:

A ∪ B = 2/0.8 + 3/1 + 4/0.8 + 5/1

Fuzzy set complement: For ease of presentation, the complement operation will be
denoted interchangeably by ¬ or an overbar. Zadeh defined the
membership function characterising the fuzzy set resulting from the complement of
fuzzy set A defined over universe Ω_X as follows:

¬A(x) = 1 − μ_A(x)   ∀ x ∈ Ω_X

For example, consider the fuzzy set, A, as defined above; then the fuzzy set
corresponding to the complement of A, ¬A, is given as follows:

¬A = 1/1 + 2/0.2 + 4/0.7 + 5/1 + 6/1

A fuzzy set complement operator is said to satisfy the property of involution if the
following holds:

¬(¬A)(x) = A(x)   ∀ x ∈ Ω_X

This means that the degree of non-membership in the fuzzy complement of a fuzzy set
is the same as the degree of the membership in the original fuzzy set. This property holds for the
fuzzy complement as defined above.

These definitions, proposed by Zadeh [Zadeh 1965], of intersection and union modelled
using the min-operator and max-operator respectively, are often referred to as the
"logical and" (or standard and) and the "logical or".

Other operators have also been proposed which differ mainly with respect to the
generality or adaptability of the operators as well as the degree to which they are
justified. Justification normally comes in the form of intuition (e.g. a voting model
interpretation) or through axiomatic or empirical justification. Most fuzzy intersection
and union operators proposed to date can be classified as belonging to one of two
classes: axiomatic-based operators (for intersection and union); and hybrid operators.
The operators within both of these classes can be further sub-divided into operators that
are parameterised and non-parameterised. Below a brief overview of these families of
operators is presented. Regarding the complement operator, alternative definitions to
Zadeh's original definition have also been proposed including threshold-based
complements, and Sugeno's parametric λ complement (however, these are not
presented here). See [Klir and Yuan 1998] for a comprehensive treatment of
complement operators.

3.5.1 Axiomatic-based operators - t-norms and t-conorms


The fuzzy intersection operator can alternatively be represented by a well-established
class of functions that are called triangular norms - also known as t-norms [Klir and
Yuan 1995; Schweizer and Sklar 1961]. T-norms, denoted by ⊗, represent a family of
binary functions on the unit interval; that is, a function of the form

⊗: [0, 1] × [0, 1] → [0, 1]

that satisfies the axioms outlined below. For every element x of the universal set Ω_X,
this function takes as its argument the pair consisting of the membership grades in
fuzzy sets A and B, both of which are defined over Ω_X, and yields the membership
grade in the fuzzy set constituting the intersection of A and B, A ∩ B. Thus,

(A ∩ B)(x) = μ_A(x) ⊗ μ_B(x)

This can also be written as follows: μ_A(x) ∧ μ_B(x). The following axioms need to be
satisfied in order for a function to qualify as a t-norm: for any a, b ∈ [0, 1],
corresponding respectively to μ_A(x), μ_B(x) for any element x of the universal set Ω_X:

(i) 1 ⊗ a = a (boundary condition);
(ii) (a ⊗ b) ⊗ c = a ⊗ (b ⊗ c) (associativity);
(iii) a ⊗ b = b ⊗ a (commutativity);
(iv) a ⊗ b ≤ a ⊗ c if b ≤ c (monotonicity).

Commonly used t-norms can be described in terms of a class of parameterised
functions, ⊗_p, which was introduced by Schweizer and Sklar [Schweizer and Sklar
1963]. When p is any non-zero real number, t-norms in this class are defined for any a,
b as follows:

a ⊗_p b = (max(0, a^p + b^p − 1))^(1/p)

The following special cases for this class of t-norm are amongst the most commonly
used t-norms, where the subscript attached to the symbol ⊗ indicates the value or the
limit of the parameter p:

a ⊗_-∞ b = min(a, b)                              (reduces to the standard and/t-norm)
a ⊗_0 b = ab                                      (algebraic product)
a ⊗_1 b = max(0, a + b − 1)                       (bounded difference)
a ⊗_∞ b = a if b = 1; b if a = 1; 0 otherwise     (drastic t-norm)
These t-norms (depicted in Figure 3-6), namely the standard t-norm, the algebraic
product, the bounded difference, and the drastic t-norm, satisfy the following ordering
for any values of a and b:

a ⊗_∞ b ≤ a ⊗_1 b ≤ a ⊗_0 b ≤ a ⊗_-∞ b

Figure 3-6: Examples of t-norms in the Schweizer-Sklar class: (a) min(a, b); (b)
product, i.e. ab; (c) drastic min; (d) bounded difference, i.e. max(0, a + b − 1).

Furthermore, it can be shown that ⊗_-∞ is the largest t-norm (i.e. fuzzy intersection
operator) and that ⊗_∞ is the smallest t-norm. More succinctly,

a ⊗_∞ b ≤ a ⊗ b ≤ a ⊗_-∞ b = min(a, b)   for any t-norm ⊗

See [Klir and Yuan 1998] for a corresponding proof. However, since the Schweizer and
Sklar class of t-norms is defined by a particular format, it does not cover all possible
t-norms.

Some of the t-norms presented above possess other desirable properties such as
idempotency (a ⊗ a = a). For example, it is easy to show that min(a, b) is the only
idempotent t-norm.
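
A short Python sketch of the Schweizer-Sklar family and its limiting special cases is
given below; the sample arguments (0.6, 0.8) and the parameter values are illustrative,
and the function assumes a, b in (0, 1] when p is negative.

# A sketch of the Schweizer-Sklar parameterised t-norm and the special cases
# recovered at the limits of p (min, algebraic product, bounded difference,
# drastic t-norm).

def ss_tnorm(a, b, p):
    """Schweizer-Sklar t-norm: (max(0, a^p + b^p - 1))^(1/p), p != 0."""
    return max(0.0, a ** p + b ** p - 1.0) ** (1.0 / p)

def drastic(a, b):
    return a if b == 1.0 else (b if a == 1.0 else 0.0)

a, b = 0.6, 0.8
print(min(a, b))            # limit p -> -infinity: 0.6
print(ss_tnorm(a, b, -50))  # already close to min: ~0.6
print(a * b)                # limit p -> 0: algebraic product, ~0.48
print(ss_tnorm(a, b, 1))    # p = 1: bounded difference max(0, a+b-1) = 0.4
print(drastic(a, b))        # limit p -> +infinity: 0.0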

Next the t-conorm function (corresponding to fuzzy set union, also known as s-norm in
the literature), the logical dual of the t-norm, is described. Like fuzzy intersection,
fuzzy union can be represented by a well-established class of functions that are called
triangular conorms - also known as t-conorms. T-conorms, denoted by ⊕, represent a
family of binary functions on the unit interval; that is, a function of the form

⊕: [0, 1] × [0, 1] → [0, 1]

that satisfies the axioms outlined below. For every element x of the universal set Ω_X,
this function takes as its argument the pair consisting of the membership grades in
fuzzy sets A and B, both of which are defined over the universe Ω_X, and yields the
membership grade in the fuzzy set constituting the union of A and B, A ∪ B. Thus,

(A ∪ B)(x) = μ_A(x) ⊕ μ_B(x)

This can also be written as follows: μ_A(x) ∨ μ_B(x). The following axioms need to be
satisfied in order for a function to qualify as a t-conorm, for any a, b ∈ [0, 1],
corresponding respectively to μ_A(x), μ_B(x) for any element x of the universal set Ω_X:

(i) 0 ⊕ a = a (boundary condition);
(ii) (a ⊕ b) ⊕ c = a ⊕ (b ⊕ c) (associativity);
(iii) a ⊕ b = b ⊕ a (commutativity);
(iv) a ⊕ b ≤ a ⊕ c if b ≤ c (monotonicity).

As with t-norms, commonly used t-conorms can be described in terms of a class of
parameterised functions, ⊕_p, which was introduced by Schweizer and Sklar [Schweizer
and Sklar 1963]. When p is any non-zero real number, t-conorms in this class are
defined for any a, b as follows:

a ⊕_p b = 1 − (max(0, (1 − a)^p + (1 − b)^p − 1))^(1/p)

The following special cases for this class of t-conorm are amongst the most commonly
used t-conorms, where the subscript attached to the symbol ⊕ indicates the value or the
limit of the parameter p:

a ⊕_-∞ b = max(a, b)                              (reduces to the standard or/t-conorm)
a ⊕_0 b = a + b − ab                              (algebraic sum)
a ⊕_1 b = min(1, a + b)                           (bounded sum)
a ⊕_∞ b = a if b = 0; b if a = 0; 1 otherwise     (drastic t-conorm)

These t-conorms (depicted in Figure 3-7), namely the standard t-conorm, the algebraic
sum, the bounded sum, and the drastic t-conorm, satisfy the following ordering for any
values of a and b:

a ⊕_-∞ b ≤ a ⊕_0 b ≤ a ⊕_1 b ≤ a ⊕_∞ b

Furthermore, it can be shown that ⊕_-∞ is the smallest t-conorm (i.e. fuzzy
union operator) and that ⊕_∞ is the largest t-conorm. More succinctly,

max(a, b) = a ⊕_-∞ b ≤ a ⊕ b ≤ a ⊕_∞ b   for any t-conorm ⊕

See [Klir and Yuan 1998] for a corresponding proof. However, since the Schweizer and

Sklar class of t-conorms is defined by a particular format, it does not cover all possible
t-conorms. Some of the t-conorms presented above possess other desirable properties
such as idempotency (a ⊕ a = a). For example, it is easy to show that max(a, b) is the
only idempotent t-conorm.

Figure 3-7: Examples of t-conorms in the Schweizer-Sklar class: (a) max(a, b); (b)
algebraic sum, i.e. a + b − ab; (c) drastic max; (d) bounded sum, i.e. min(1, a + b).

The previous paragraphs have presented one general purpose parameterised family of
t-norms and t-conorms that has been frequently applied; for a presentation of other
parameterised families see [Zimmermann 1996].

Non-parameterised t-norms are mentioned here for completeness and one example is
presented: the Hamacher product [Hamacher 1978; Zimmermann 1996]. The Hamacher
product t-norm is mathematically defined as follows:

a ⊗_H b = ab / (a + b − ab)

The Hamacher sum t-conorm is an example of a non-parameterised t-conorm


[Hamacher 1978; Zimmermann 1996] and is mathematically defined as follows:

a ⊕_H b = (a + b − 2ab) / (1 − ab)

A more comprehensive treatment of t-norms and t-conorms, both parameterised and


non-parameterised is given in [Klir and Yuan 1995; Ruspini, Bonissone and Pedrycz
1998; Schweizer and Sklar 1963; Zimmermann 1996].

In classical set theory, the operations of intersection and union are dual with respect to
the complement in the sense that they satisfy the DeMorgan laws. In the case of fuzzy
set theory, Bonissone and Decker [Bonissone and Decker 1986] have shown that for
any involutive fuzzy complement (i.e. one satisfying ¬¬a = a), dual pairs of t-norms and
t-conorms satisfy the following generalisation of DeMorgan's laws:

¬(a ⊗ b) = ¬a ⊕ ¬b

and

¬(a ⊕ b) = ¬a ⊗ ¬b

A triple ⟨⊗, ⊕, ¬⟩ denoting a t-norm, t-conorm and fuzzy complement, satisfying the
above laws, is commonly known as a DeMorgan triple [Klir and Yuan 1998].
Examples of DeMorgan triples include:

⟨⊗_-∞, ⊕_-∞, ¬⟩
⟨⊗_0, ⊕_0, ¬⟩
⟨⊗_1, ⊕_1, ¬⟩
⟨⊗_∞, ⊕_∞, ¬⟩
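
A quick numeric check of the De Morgan duality for two of the triples listed above is
sketched below in Python; the 0.1-spaced grid and the tolerance are illustrative choices.

# A small numeric check of the De Morgan duality for the standard pair
# (min, max) and the algebraic pair (product, algebraic sum), each combined
# with the standard complement 1 - a.

import itertools

pairs = {"standard":  (min, max),
         "algebraic": (lambda a, b: a * b, lambda a, b: a + b - a * b)}

def complement(a):
    return 1.0 - a

grid = [i / 10.0 for i in range(11)]
for name, (tnorm, tconorm) in pairs.items():
    ok = all(abs(complement(tnorm(a, b)) - tconorm(complement(a), complement(b))) < 1e-9
             and abs(complement(tconorm(a, b)) - tnorm(complement(a), complement(b))) < 1e-9
             for a, b in itertools.product(grid, grid))
    print(name, "De Morgan triple:", ok)   # True for both pairs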

3.5.2 Averaging operators


Averaging operators, like t-norms and t-conorms, refer to a collection of operators that
combine several fuzzy sets, to produce a single fuzzy set. These operators differ from t-
norms and t-conorms axiomatically: associativity is not a requirement for averaging
operators (though some operators provide associativity) and the boundary conditions
are different. Mathematically, averaging operators can be defined as a function of n
values in the interval [0, 1] as follows [Klir and Yuan 1998]:

h: [0, 1]^n → [0, 1]

that satisfies the properties listed below. In the following presentation, the membership
values a_1, a_2, ..., a_n ∈ [0, 1] denote μ_A1(x), μ_A2(x), ..., μ_An(x) respectively, for any
element x of the universal set Ω_X:

(i) h(0, 0, ..., 0) = 0 and h(1, 1, ..., 1) = 1 (boundary conditions)
(ii) For any pair of membership tuples ⟨a_1, a_2, ..., a_n⟩ and ⟨b_1, b_2, ..., b_n⟩
such that a_i and b_i ∈ [0, 1], if a_i ≥ b_i for all i ∈ [1, n], then
h(a_1, a_2, ..., a_n) ≥ h(b_1, b_2, ..., b_n) (monotonicity)
(iii) The averaging operator h should be continuous (i.e. small changes in any a_i
should result in small changes in the output of h). (continuity)
(iv) For every a ∈ [0, 1], h(a, a, ..., a) = a (idempotency)

Properties (i) and (ii) are required for averaging operators, whereas properties (iii) and
(iv) are highly desirable along with other properties. Within this group of aggregation
operators, the weighted generalised means and OWA (order-weighted aggregation)
operators [Yager 1988] are most prevalently used.

The weighted generalised mean is formally defined as follows [Klir and Yuan 1998]:

$$h_w^{\alpha}(a_1, \ldots, a_n) = \left( \sum_{i=1}^{n} w_i a_i^{\alpha} \right)^{1/\alpha}$$

for any a_i ∈ [0, 1], i ∈ [1, n], α ∈ ℝ (α ≠ 0); and the weight vector w = <w_1, ..., w_n>
satisfies the following constraint:

$$\sum_{i=1}^{n} w_i = 1$$

and each w_i ≥ 0.

On the other hand, an OWA operator consists of a weight vector w = <w_1, w_2, ..., w_n>
that is used in the following way to aggregate:

$$h_w(a_1, a_2, \ldots, a_n) = \sum_{i=1}^{n} w_i b_i$$

where <b_1, b_2, ..., b_n> is a reordering of <a_1, a_2, ..., a_n> such that b_1 ≥ b_2 ≥ ... ≥ b_n.
Various weight vectors lead to intuitive OWA operators. For example, if the weight
vector w = <0, 0, ..., 0, 1> is used then the min operator is recovered:

$$h_w(a_1, \ldots, a_n) = \min(a_1, \ldots, a_n)$$

If the weight vector w = <1, 0, ..., 0, 0> is used then the max operator is recovered:

$$h_w(a_1, \ldots, a_n) = \max(a_1, \ldots, a_n)$$

For w = <1/n, 1/n, ..., 1/n>, the arithmetic mean is recovered:

$$h_w(a_1, \ldots, a_n) = \frac{1}{n} \sum_{i=1}^{n} a_i$$

There are many families of averaging operators; for a detailed listing see [Klir and
Yuan 1998].
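
The following illustrative Python sketch (not from the book; names are hypothetical) implements the two averaging operators just defined and shows the OWA weight vectors that recover min, max and the arithmetic mean.

def weighted_generalised_mean(values, weights, alpha):
    """(sum_i w_i * a_i^alpha)^(1/alpha), with alpha != 0 and weights summing to 1."""
    assert abs(sum(weights) - 1.0) < 1e-9 and alpha != 0
    return sum(w * a ** alpha for w, a in zip(weights, values)) ** (1.0 / alpha)

def owa(values, weights):
    """Order-weighted aggregation: weights applied to values sorted in descending order."""
    assert abs(sum(weights) - 1.0) < 1e-9
    ordered = sorted(values, reverse=True)        # b_1 >= b_2 >= ... >= b_n
    return sum(w * b for w, b in zip(weights, ordered))

a = [0.2, 0.9, 0.5]
print(owa(a, [0.0, 0.0, 1.0]))                                  # 0.2 -> min recovered
print(owa(a, [1.0, 0.0, 0.0]))                                  # 0.9 -> max recovered
print(owa(a, [1/3, 1/3, 1/3]))                                  # ~0.533 -> arithmetic mean
print(weighted_generalised_mean(a, [1/3, 1/3, 1/3], alpha=1))   # arithmetic mean again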

3.5.3 Compensative operators


Compensative operators can be viewed as a hybrid form of aggregation operator
combining t-norms and their dual t-conorms. In the following, let ⊗ be a t-norm, ⊕ be
its dual t-conorm, ¬ be an involutive complement and γ be a parameter in the unit
interval [0, 1]. Formally, compensative operators can be defined as functions taking n arguments

$$c: [0, 1]^n \to [0, 1]$$

characterised by the following general formula [Klir and Yuan 1998], combining the
t-norm and its dual t-conorm with exponents 1 − γ and γ:

$$c_{\gamma}(a_1, \ldots, a_n) = \left( \bigotimes_{i=1}^{n} a_i \right)^{1-\gamma} \cdot \left( \bigoplus_{i=1}^{n} a_i \right)^{\gamma}$$

where at least the following set of axioms hold:

(i) c(0, 0, ..., 0) = 0 and c(1, 1, ..., 1) = 1 (boundary conditions)

(ii) For any pair of membership tuples <a_1, a_2, ..., a_n> and <b_1, b_2, ..., b_n>
such that a_i and b_i ∈ [0, 1], if a_i ≤ b_i for all i ∈ [1, n], then
c(a_1, a_2, ..., a_n) ≤ c(b_1, b_2, ..., b_n) (monotonicity)

(iii) The aggregation operator c should be continuous, i.e. small changes in any a_i
should result in small changes in the output of c (continuity)

One of the most commonly used compensative operators is the γ-operator originally
introduced by Zimmermann and Zysno [Zimmermann and Zysno 1980]. Using the
algebraic product as the t-norm and the algebraic sum as its dual t-conorm, it is defined
as follows:

$$c_{\gamma}(a_1, \ldots, a_n) = \left( \prod_{i=1}^{n} a_i \right)^{1-\gamma} \cdot \left( 1 - \prod_{i=1}^{n} (1 - a_i) \right)^{\gamma}$$

As the value of γ increases, the compensative operator's behaviour changes from
resembling a t-norm to resembling a t-conorm. The γ parameter provides an extra degree of
freedom when fine-tuning a model (during learning).
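
A hedged Python sketch of the γ-operator follows, using the algebraic product and algebraic sum as the t-norm/t-conorm pair (an assumption stated above; function names are illustrative).

from functools import reduce

def gamma_operator(values, gamma):
    """Compensative aggregation: (t-norm part)^(1-gamma) * (t-conorm part)^gamma."""
    assert 0.0 <= gamma <= 1.0
    t_norm_part = reduce(lambda acc, a: acc * a, values, 1.0)                # product of a_i
    t_conorm_part = 1.0 - reduce(lambda acc, a: acc * (1.0 - a), values, 1.0)  # algebraic sum
    return (t_norm_part ** (1.0 - gamma)) * (t_conorm_part ** gamma)

a = [0.4, 0.7]
print(gamma_operator(a, 0.0))   # 0.28  -> pure t-norm behaviour (product)
print(gamma_operator(a, 1.0))   # 0.82  -> pure t-conorm behaviour (algebraic sum)
print(gamma_operator(a, 0.5))   # ~0.479 -> compensatory behaviour in between

As γ moves from 0 to 1 the result slides smoothly from the conjunctive to the disjunctive extreme, which is precisely the extra degree of freedom mentioned above.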

3.6 MATCHING FUZZY SETS

Matching two fuzzy sets plays a key role in many contexts including inductive
reasoning and approximate reasoning (presented in the remaining chapters of this part
of the book). At this juncture, two popular and relatively straightforward approaches
for matching two fuzzy sets are described, while Chapter 5 presents a third approach
based upon conditional probabilities (namely, semantic unification; see Section 5.3.3.1)
that exploits the formal relationship between fuzzy sets and set-based probabilities.

Two of the most commonly used approaches for matching two fuzzy sets are based
upon possibility and necessity measures [Dubois and Prade 1988; Zadeh 1978]. The
possibility measure of two fuzzy sets A and D defined over the universe Ω_X, where A is
viewed as a reference fuzzy set (part of a model) and D is viewed as a given piece of
data, reduces to calculating the intersection of both fuzzy sets and then taking the
maximum of the resulting fuzzy set. This is more succinctly stated as follows:

$$Pos(A, D) = \max_{x \in \Omega_X} \left[ \min(\mu_A(x), \mu_D(x)) \right]$$

The necessity measure is defined as follows:

$$Nec(A, D) = \min_{x \in \Omega_X} \left[ \max(\mu_A(x), 1 - \mu_D(x)) \right]$$

and can intuitively be interpreted as the degree to which D is included in A. Possibility
and necessity measures are illustrated in Figure 3-8, where Pos(A, D) is 0.6 and
Nec(A, D) is 0 for the given fuzzy sets A and D. A detailed presentation of possibility
and necessity measures is given in Section 5.3.2.
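
The following illustrative Python sketch computes both matching measures for discrete fuzzy sets stored as dictionaries mapping universe elements to membership values (missing elements are taken as 0). The example fuzzy sets are hypothetical, and the necessity formula follows the standard definition reconstructed above.

def possibility(A, D):
    """Pos(A, D) = max over x of min(mu_A(x), mu_D(x))."""
    universe = set(A) | set(D)
    return max(min(A.get(x, 0.0), D.get(x, 0.0)) for x in universe)

def necessity(A, D):
    """Nec(A, D) = min over x of max(mu_A(x), 1 - mu_D(x))."""
    universe = set(A) | set(D)
    return min(max(A.get(x, 0.0), 1.0 - D.get(x, 0.0)) for x in universe)

A = {"x1": 0.2, "x2": 0.9, "x3": 1.0}   # hypothetical reference fuzzy set
D = {"x2": 0.6, "x3": 0.3, "x4": 1.0}   # hypothetical data fuzzy set
print(possibility(A, D))   # 0.6
print(necessity(A, D))     # 0.0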

3.7 GENERALISATIONS OF FUZZY SETS

The previous sections presented the basic definition of a fuzzy set and various families
of operations that accept single and multiple fuzzy set values as inputs. This section
describes generalisations of fuzzy sets that have been developed. This presentation
details multidimensional fuzzy sets, relations, Cartesian granule fuzzy sets, and higher-
order fuzzy sets such as interval-valued fuzzy sets and type-2 fuzzy sets.

3.7.1 Multidimensional fuzzy sets


So far in this chapter fuzzy sets have been defined on single universes, i.e. univariate
fuzzy sets. This section introduces multivariate fuzzy sets, which extend fuzzy sets to a
multidimensional setting, i.e. fuzzy sets defined over the Cartesian product of single
universes. A multidimensional fuzzy set M can be defined over the Cartesian product
of the universes of discourse $\times_{i=1}^{n} \Omega_{X_i}$ (that is, $\Omega_{X_1} \times \ldots \times \Omega_{X_n}$) of
variables X_1, ..., X_n via its membership function as follows:

$$\mu_M : \Omega_{X_1} \times \ldots \times \Omega_{X_n} \to [0, 1]$$

and denoted in shorthand as follows:

$$M = \sum \mu_M(x_1, \ldots, x_n) \,/\, (x_1 \times \ldots \times x_n)$$

where tuples <x_1, ..., x_n> may have varying degrees of membership. The membership
grade of a tuple in a multidimensional fuzzy set, as in the one-dimensional case,
indicates the degree of similarity between it and the imprecise concept characterised by
the fuzzy set.

Figure 3-8: Examples of necessity and possibility measures, with Pos(A, D) = 0.6 and Nec(A, D) = 0.

For example, consider the definition of a hypothetical fuzzy set corresponding to people
who could be potentially overweight, which is characterised in terms of two variables,
height and weight, defined over universes Ω_Height and Ω_Weight respectively. A possible
definition for this fuzzy set could be as follows:

$$\mu_{PossiblyOverweight}(h, w) = \begin{cases} 1 & \text{if } h/w < 1.8 \\ \left( e^{\,0.5\,((h/w) - c)} \right)^{-1} & \text{otherwise} \end{cases}$$

This is graphically depicted in Figure 3-9.

3.7.1.1 Fuzzy relations


A special type of multidimensional fuzzy set is a relation. A relation represents the
presence or absence of an association, interaction or interconnectedness between the
elements of two or more sets. Relations can be quite naturally represented as
multidimensional sets. Thus, a fuzzy relation is a fuzzy set defined on the Cartesian
product of universes of discourse $\times_{i=1}^{n} \Omega_{X_i}$, where tuples <x_1, x_2, ..., x_n> may have varying
degrees of membership within the relation. The membership grade of a tuple in a
relation indicates the strength of the relation between the elements of the tuple and as
such has a more refined interpretation than that of multidimensional fuzzy sets in general.
For example, consider a fuzzy relation "very far" which indicates the proximity of two
cities. Here the individual universes of discourse could be defined as follows: Ω_X1 =
{Bristol, Grenoble, Limerick, Tokyo} and Ω_X2 = {Bristol, Grenoble, Limerick, Tokyo}.
The relation VeryFar can be written in matrix form, as presented in Table 3-4. Here
each entry denotes a membership value, reflecting the proximity of the participating
cities (these cities correspond to the row and column labels). For this form of
presentation, all membership values for the relation (including the zero membership
tuples) are listed, however, the relation could alternatively be presented in list format as
follows:
μ_VeryFar(X_1, X_2) = (Bristol × Grenoble)/0.7 +
                      (Bristol × Limerick)/0.5 +
                      (Bristol × Tokyo)/1 + (Grenoble × Limerick)/0.8 +
                      (Grenoble × Tokyo)/1 + (Limerick × Tokyo)/1

This is a shorter representation of the relation that exploits various properties that this
particular relation possesses, including symmetry and anti-reflexivity. Fuzzy relations
form a large area of study in fuzzy set theory, playing a key role in areas such as
approximate reasoning (which will be covered in the next chapter).

Figure 3-9: An example multidimensional fuzzy set for the concept Possibly Overweight
(membership plotted against height and weight).

3.7.1.2 Special operations on multidimensional fuzzy sets


Multidimensional fuzzy sets can also be expressed via their projections onto subsets of
the universes over which they are defined. Consider a multidimensional fuzzy set R defined
over the universe $\times_{i=1}^{n} \Omega_{X_i}$. Let X represent this universe. Let Y represent a
multidimensional universe that consists of a subset of the universes {Ω_X1, Ω_X2, ...,
Ω_Xn} that make up X. The projection of R onto Y, denoted by [R↓X→Y], is defined on
the Cartesian product of the universes making up Y as

$$[R \downarrow X{\to}Y](y) = \max_{x \in X} R(x)$$

where tuple y is a sub-sequence of tuple x. The max (or sup in the continuous case)
operation is used since many tuples in R will lead to the same tuple in [R↓X→Y] with
different membership values.

Table 3-4: A matrix representation of the fuzzy relation "VeryFar".

μ_VeryFar(X_1, X_2)   Bristol   Grenoble   Limerick   Tokyo
Bristol                 0         0.7        0.5        1
Grenoble                0.7       0          0.8        1
Limerick                0.5       0.8        0          1
Tokyo                   1         1          1          0

For example, consider the relation R defined in Table 3-5. The projection [R↓X_1],
denoting the projection of R onto a new relation consisting of X_1 only, results in the
following fuzzy set:

x1/0.7 + x2/0.8 + x3/1

The projection [R↓Y_1] results in the following fuzzy set:

y1/0.7 + y2/0.7 + y3/1 + y4/1

Table 3-5: A matrix representation of the fuzzy relation "R".

μ_R(X_1, Y_1)   y1    y2     y3   y4
x1              0.1   0.7    0.5  1
x2              0.7   0.3    0.8  0.2
x3              0.1   0.11   1    0

Another operation on multidimensional fuzzy sets, which can be viewed as the inverse
of projection, is called the cylindrical extension. A cylindrical extension can be
formally defined as follows: consider multidimensional universes X and Y, as defined
previously. Let R be a multidimensional fuzzy set defined on Y and let [R↑X→Y] denote
the cylindrical extension of R into the multidimensional universe X. Then

$$[R \uparrow X{\to}Y](x) = R(y)$$

This operation produces the largest fuzzy set (in the sense of tuple membership grades
of the extended Cartesian product) that is compatible with the given projection. It is
interesting to note that the Cartesian product of the univariate projections (i.e. [R↓Ω_Xi]) of
a fuzzy relation R does not result in the original relation but rather an upper estimate of it:

$$R \subseteq [R \downarrow \Omega_{X_1}] \times \ldots \times [R \downarrow \Omega_{X_n}]$$
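
The following minimal Python sketch (with a hypothetical two-dimensional relation, not the book's numbers) illustrates projection as a max over one dimension and cylindrical extension as repetition of the projected values; the extended set is an element-wise upper bound of the original relation.

# Hypothetical relation over X = {x1, x2, x3} and Y = {y1, y2}; R[i][j] = mu_R(x_i, y_j).
R = [[0.1, 0.7],
     [0.9, 0.3],
     [0.4, 1.0]]

def project_onto_x(R):
    """[R down X]: maximise over the Y dimension for each x."""
    return [max(row) for row in R]

def project_onto_y(R):
    """[R down Y]: maximise over the X dimension for each y."""
    return [max(row[j] for row in R) for j in range(len(R[0]))]

def cylindrical_extension_of_x(proj_x, n_y):
    """[proj up X x Y]: repeat each projected value across the Y dimension."""
    return [[v] * n_y for v in proj_x]

print(project_onto_x(R))                                  # [0.7, 0.9, 1.0]
print(project_onto_y(R))                                  # [0.9, 1.0]
print(cylindrical_extension_of_x(project_onto_x(R), 2))   # element-wise >= R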

3.7.2 Cartesian granule fuzzy sets


In the previous sections fuzzy sets defined on both discrete and continuous universes
were considered. For example, the universe for height could be defined on the real
interval [20, 300]. This subsection introduces an intermediate level of abstraction of
universes, on which fuzzy sets can be more naturally and succinctly defined; universes
are abstractly represented using fuzzy sets and subsequently fuzzy sets are defined in
terms of these fuzzy sets. In other words, fuzzy sets are used to partition the universes
of discourse into regions of self-similarity or functionality and then higher-order fuzzy
sets (granule fuzzy sets) are defined in terms of the word labels that denote the
partitioning fuzzy sets. More formally, let P = {A_1, ..., A_m} be a fuzzy set partition
(examples of fuzzy partitions are presented below; for more details see Chapter 4) of a
universe Ω_X; then a granule fuzzy set [Baldwin, Martin and Shanahan 1997b; Shanahan
1998] can be defined for each element x_i of Ω_X, as a discrete fuzzy set, as follows:

$$LD_{x_i} = \sum_{j=1}^{m} A_j \,/\, \mu_{A_j}(x_i)$$

Granule fuzzy sets can be seen as linguistic summaries or descriptions (LD) of
individual data values [Baldwin, Martin and Shanahan 1997b; Shanahan 1998].
Consider an example from computer vision: the problem posed is to model the vertical
position of sky in images. The first step in this process is to linguistically partition the
universe of the vertical position of sky in two-dimensional images, Ω_Y_Position. This is
defined for convenience on the interval [0, 100]. The linguistic partition could consist
of the following three words, Bottom, Middle and Top, each of which is characterised by a
trapezoidal fuzzy set as depicted in Figure 3-10. Each value x_i of the Ω_Y_Position
universe can be linguistically summarised using granule fuzzy sets. For example, if the
Y_Position variable has a value of 40 then this value can be linguistically described or
summarised by the following granule fuzzy set:

LD_40 = {Bottom/0.2 + Middle/1}

Granule fuzzy sets can conveniently be used to describe concepts. For example, the
position of sky regions in digital images could be described using the following
granule fuzzy set:

Sky_position = {Bottom/0.1 + Middle/0.7 + Top/1}

This corresponds to a linguistic summary of the positions of sky regions in images.
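
A hedged Python sketch of the granule-fuzzy-set computation follows. The trapezoidal definitions of Bottom, Middle and Top are assumptions chosen only so that the value 40 yields the description {Bottom/0.2 + Middle/1}; they are not the book's exact membership functions.

def trapezoid(x, a, b, c, d):
    """Trapezoidal membership with support [a, d] and core [b, c]."""
    if x <= a or x >= d:
        return 0.0
    if b <= x <= c:
        return 1.0
    if x < b:
        return (x - a) / (b - a)
    return (d - x) / (d - c)

Y_POSITION_PARTITION = {                      # assumed partition of [0, 100]
    "Bottom": lambda x: trapezoid(x, -1, 0, 30, 42.5),
    "Middle": lambda x: trapezoid(x, 30, 40, 60, 70),
    "Top":    lambda x: trapezoid(x, 60, 70, 100, 101),
}

def linguistic_description(value, partition):
    """Granule fuzzy set: each word paired with its membership for the given value."""
    return {word: mu(value) for word, mu in partition.items() if mu(value) > 0}

print(linguistic_description(40, Y_POSITION_PARTITION))
# {'Bottom': 0.2, 'Middle': 1.0}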

Extending the notion of a granule fuzzy set to a multidimensional setting leads to


Cartesian granule fuzzy sets [Baldwin, Martin and Shanahan 1997b; Shanahan 1998].
Cartesian granule features and fuzzy sets are presented in detail in Chapter 8, and their
use within a machine learning context is described in Chapter 9.

3.7.3 Higher order fuzzy sets


So far this chapter has considered fuzzy sets that map every element x of a universe of
discourse Ω_X onto the unit interval [0, 1]. This type of fuzzy set is also known as a type-1
fuzzy set, and is by far the most commonly researched and applied type of fuzzy set to
date. However, over the years, other generalisations of fuzzy sets have been developed. Two
of these generalisations are covered here: interval-valued fuzzy sets [Klir and Yuan 1995] and
type-2 fuzzy sets [Mizumoto and Tanaka 1976].

Figure 3-10: Fuzzy partition of the universe Ω_Y_Position into the trapezoidal fuzzy sets Bottom, Middle and Top.

In the case of interval-valued fuzzy sets, rather than restricting the membership value
to a single value in the interval [0, 1], the membership value is generalised to a closed
interval of real numbers in [0, 1]. Interval-valued fuzzy sets are more formally defined
as follows:

$$A: \Omega_X \to \varepsilon([0, 1])$$

where ε([0, 1]) denotes the family of all closed intervals of real numbers in [0, 1]; note
that

$$\varepsilon([0, 1]) \subset P([0, 1])$$

where P([0, 1]) denotes the power set of elements in the interval [0, 1]. Figure 3-11
graphically depicts an interval-valued fuzzy set where the membership value μ_A(x) of
each element x is represented by an interval [α_{x,L}, α_{x,U}], denoting the lower and upper
bounds for membership values.

Type-2 fuzzy sets [Mizumoto and Tanaka 1976] are a further generalisation of
interval-valued fuzzy sets, where every element in the universe is mapped onto a type-1
fuzzy set. Type-2 fuzzy sets are more formally defined as follows:

$$A: \Omega_X \to F([0, 1])$$

where F([0, 1]) denotes the family of type-1 fuzzy sets that can be defined on the interval
[0, 1]. F([0, 1]) is commonly referred to as the fuzzy power set. Figure 3-12 graphically
depicts a type-2 fuzzy set where the membership value μ_A(x) of each element x is
represented by a fuzzy set that characterises its membership; in this case the
membership value of each value x is characterised by a trapezoidal fuzzy set.

Other generalisations of fuzzy sets also exist, such as probabilistic fuzzy sets
[Hirota 1981] and intuitionistic fuzzy sets [Astanassov 1986]. Overall, fuzzy sets other
than type-1 fuzzy sets, fuzzy relations and Cartesian granule fuzzy sets are still the
subject of research and have not been applied extensively to real world problems. Even
though generalisations of the type-1 fuzzy set, such as type-2 fuzzy sets, provide
more expressivity, this comes at an added computational cost and hence there have been
few applications to date.

Figure 3-11: An example of an interval-valued fuzzy set.

Figure 3-12: An example of a type-2 fuzzy set and a type-1 fuzzy set membership value
(characterised by a trapezoidal fuzzy set) for x.

3.8 CHOOSING MEMBERSHIP FUNCTIONS

Membership functions can be incorporated with various other forms of knowledge
representation, such as if-then rules (see the next chapter on fuzzy logic) and neural
networks, in order to model a domain. The elicitation of membership functions is a
crucial step in this modelling process. There are two popular approaches to eliciting
membership functions: one or more domain experts could provide the membership
functions, or the membership functions could be estimated through machine learning.
For the latter, Parts III and IV of this book describe various learning algorithms that
elicit membership functions from example data, in particular for Cartesian granule
fuzzy sets.

3.9 SUMMARY

This chapter serves as a concise introduction to fuzzy sets. Along with presenting the
basic definition of a fuzzy set, it also presents various properties and operations that can
be performed on fuzzy sets such as aggregation and matching. Various justifications
and interpretations of fuzzy sets as a form of knowledge granulation were presented.
Generalisations of the fuzzy set, including Cartesian granule fuzzy sets and relations are
also described, illustrating the potential power and flexibility of the fuzzy set.
Membership function elicitation was briefly discussed, but will be explored in detail in
Chapters 7 and 9.

3.10 BIBLIOGRAPHY

Astanassov, K. T. (1986). "Intuitionistic fuzzy sets", Fuzzy sets and systems, 20:87-96.
Baldwin, J. F. (1991). "A Theory of Mass Assignments for Artificial Intelligence", In
IJCAI '91 Workshops on Fuzzy Logic and Fuzzy Control, Sydney, Australia,
Lecture Notes in Artificial Intelligence, A. L. Ralescu, ed., 22-34.
Baldwin, J. F., and Lawry, J. (2000). "A fuzzy c-means algorithm for prototype
induction." In the proceedings of IPMU, Madrid, To appear.
Baldwin, J. F., Martin, T. P., and Pilsworth, B. W. (1995). FRIL - Fuzzy and Evidential
Reasoning in A.I. Research Studies Press (Wiley Inc.), ISBN 0863801595.
Baldwin, J. F., Martin, T. P., and Shanahan, J. G. (1997a). "Fuzzy logic methods in
vision recognition." In the proceedings of Fuzzy Logic: Applications and
Future Directions Workshop, London, UK, 300-316.
Baldwin, J. F., Martin, T. P., and Shanahan, J. G. (1997b). "Modelling with words
using Cartesian granule features." In the proceedings of FUZZ-IEEE,
Barcelona, Spain, 1295-1300.
Bonissone, P. P., and Decker, K. S. (1986). "Selecting uncertainty calculi and
granularity: An experiment in trading-off precision and complexity", In
Uncertainty in Artificial Intelligence, L. N. Kanal and J. F. Lerner, eds., North-
Holland, Amsterdam, 217-247.
Borel, E. (1950). Probabilite et certitude. Press universite de France, Paris.
Bruner, J. S., Goodnow, J. J., and Austin, G. A. (1956). A Study of Thinking. Wiley,
New York.
Dubois, D., and Prade, H. (1983). "Unfair coins and necessity measures: towards a
possibilistic interpretation of histograms", Fuzzy sets and systems, 10:15-20.
Dubois, D., and Prade, H. (1988). An approach to computerised processing of
uncertainty. Plenum Press, New York.
Gaines, B. R. (1977). "Foundations of Fuzzy Reasoning", In Fuzzy Automata and
Decision Processes, M. Gupta, G. Saridis, and B. R. Gaines, eds., Elsevier,
North-Holland, 19-75.

Gaines, B. R. (1978). "Fuzzy and Probability Uncertainty Logics", Journal of


Information and Control, 38:154-169.
Hamacher, H. (1978). Über logische Aggregation nicht-binär expliziter
Entscheidungskriterien. Main, Frankfurt.
Hirota, K. (1981). "Concepts of probabilistic sets", Fuzzy sets and systems, 5:31-46.
Klir, G. J., and Yuan, B. (1995). Fuzzy Sets and Fuzzy Logic, Theory and Applications.
Prentice Hall, New Jersey.
Klir, G. J., and Yuan, B. (1998). "Operations on Fuzzy Sets", In Handbook of Fuzzy
Computation, E. H. Ruspini, P. P. Bonissone, and W. Pedrycz, eds., Institute
of Physics Publishing Ltd., Bristol, UK, B2.2:1-15.
Klir, K. (1990). "A principle of uncertainty and information invariance", International
journal of general systems, 17(2, 3):249-275.
Mizumoto, M., and Tanaka, K. (1976). "Some properties of fuzzy sets of type 2",
Information and control, 48(1):30-48.
Negoita, C. V., and Ralescu, D. (1975). "Representation theorems for fuzzy concepts",
Kybernetics, 4(3):169-174.
Ralescu, A. L., ed. (1995a). "Applied Research in Fuzzy Technology", Kluwer
Academic Publishers, New York.
Ralescu, D. (1995b). "Cardinality, quantifiers, and the aggregation of fuzzy criteria",
Fuzzy sets and systems, 69:355-365.
Ruspini, E. H., Bonissone, P. P., and Pedrycz, W., eds. (1998). "Handbook of Fuzzy
Computation", Institute of Physics Publishing Ltd., Bristol, UK.
Russell, B. (1923). "Vagueness", Australasian journal of psychology and philosophy,
1:84-92.
Schweizer, B., and Sklar, A. (1961). "Associative functions and statistical triangle
inequalities", Publ. Math. Debrecen, 8:169-186.
Schweizer, B., and Sklar, A. (1963). "Associative functions and abstract semigroups",
Publ. Math. Debrecen, 10:69-81.
Shanahan, J. G. (1998). "Cartesian Granule Features: Knowledge Discovery of
Additive Models for Classification and Prediction", PhD Thesis, Dept. of
Engineering Mathematics, University of Bristol, Bristol, UK.
Sudkamp, T. (1992). "On probability-possibility transformation", Fuzzy Sets and
Systems, 51:73-81.
Sugeno, M., and Yasukawa, T. (1993). "A Fuzzy Logic Based Approach to Qualitative
Modelling", IEEE Trans on Fuzzy Systems, 1(1): 7-31.
Terano, T., Asai, K., and Sugeno, M. (1992). Applied fuzzy systems. Academic Press,
New York.
Yager, R. (1988). "On ordered weighted averaging aggregation operators in multi-
criteria decision making", IEEE Transactions on Systems Man and
Cybernetics, 18: 183-190.
Yager, R. R. (1994). "Generation of Fuzzy Rules by Mountain Clustering", J.
Intelligent and Fuzzy Systems, 2:209-219.
Yager, R. R. (1998). "Characterisations of fuzzy set properties", In Handbook of Fuzzy
Computation, E. H. Ruspini, P. P. Bonissone, and W. Pedrycz, eds., Institute
of Physics Publishing Ltd., Bristol, UK, B2.5:1-8.
Yen, J., and Langari, R. (1998). Fuzzy logic: intelligence, control and information.
Prentice Hall, London.
Zadeh, L. A. (1965). "Fuzzy Sets", Journal of Information and Control, 8:338-353.

Zadeh, L. A. (1973). "Outline of a New Approach to the Analysis of Complex Systems
and Decision Processes", IEEE Trans. on Systems, Man and Cybernetics, 3(1):28-44.
Zadeh, L. A. (1978). "Fuzzy Sets as a Basis for a Theory of Possibility", Fuzzy Sets and
Systems, 1:3-28.
Zadeh, L. A. (1983). "A Computational Approach to Fuzzy Quantifiers in Natural
Languages", Computational Mathematics Applications, 9: 149-184.
Zimmermann, H. J. (1996). Fuzzy set theory and its applications. Kluwer Academic
Publishers, Boston, USA.
Zimmermann, H. J., and Zysno, P. (1980). "Latent connectives in human decision
making", Fuzzy Sets and Systems, 4(1):37-51.
CHAPTER 4

FUZZY LOGIC
"Classical logic is like a person who comes to a party dressed in black suit,
white starched shirt, a black tie, shiny shoes, and so forth. And fuzzY logic is a
little bit like a person dressed informally, in jeans, tee shirt and sneakers"
[Zadeh 1987J

This chapter introduces fuzzy logic as the basis for a collection of techniques for
representing knowledge in terms of natural language like sentences and as a means of
manipulating these sentences in order to perform inference using reasoning strategies
that are approximate rather than exact. It was first introduced in the early 1970s by
Zadeh in order to provide a better rapport with reality [Klir and Yuan 1995; Zadeh
1973]. Fuzzy logic can be viewed as a means of formally performing approximate
reasoning about the value of a system variable given vague information about the
values of other variables, and knowledge about the dependence relations between them
(that is typically represented as IF-THEN rules expressed as fuzzy relations). For
example, if knowledge is expressed in terms of IF-THEN rules, such as IF X is A THEN
Y is B, and if the fact X is A' is known, then the deductive process needs to derive Y is
B' as a logical consequence. In an approximate reasoning setting, in contrast to a
classical logic setting, where inference is performed by manipulating symbols,
inference is performed at a semantic level by numeric manipulation of membership
functions that characterise the symbols.

The chapter is organised into three main sections: knowledge representation; fuzzy
inference; and fuzzy decision making. It begins by introducing the main forms of
domain specific knowledge representation in fuzzy logic: linguistic variables, linguistic
hedges, fuzzy facts and fuzzy if-then rules. Subsequently, it presents the main modes of
inference in fuzzy logic, some of which are derived from multi-valued logic. Decision
making processes are then described (known as defuzzification in fuzzy logic parlance).
A simple example, illustrating the potential of fuzzy logic as an accurate and
transparent modelling technique is also presented. Finally, real world applications of
fuzzy logic are overviewed.

4.1 FUZZY RULES AND FACTS

Fuzzy logic makes it possible to express knowledge in terms of natural language-like


statements. This is enabled by the use of linguistic expressions, linguistic variables
(introduced in the next subsection) and fuzzy propositions. Linguistic expressions may
contain any of the following fuzzy linguistic terms:


• fuzzy predicates - represented by a fuzzy set defined on the universe of the
variable to which the predicate applies. For example, tall, blue, far, etc.
• fuzzy modifiers (linguistic hedges) - fuzzy set operations that
typically modify the meaning of other linguistic terms in an intuitive
manner. Examples of linguistic hedges include very, quite, and extremely.
See Section 4.1.2 below for a complete presentation.
• fuzzy truth values - represent truth values as fuzzy sets defined over [0, 1]
as opposed to a single value in {0, 1}. Typical examples include true,
false, fairly true, very true, etc.
• fuzzy probabilities - linguistically quantify the probability associated with
a proposition. Examples of fuzzy probabilities include likely, highly likely,
etc.
• fuzzy quantifiers - examples of linguistic quantifiers include most, all,
some, etc. Linguistic quantifiers are described in more detail in Chapters 6
and 9, where they are discussed in the context of evidential reasoning.

Fuzzy propositions are typically expressed in linguistic terms. In this book fuzzy
propositions of the following types are considered:

• Unconditional and unqualified propositions expressed by the canonical form

p:'X is A'

where A is a fuzzy set representation of a fuzzy predicate constraining the


values of the variable X (see the previous chapter for a more detailed
explanation).

• Conditional and unqualified propositions expressed by the following canonical


form

r: 'if X is A then Y is B'

where A and B are fuzzy set representations of fuzzy predicates
constraining the values of variables X and Y respectively.

Other types of fuzzy propositions include qualified propositions such as:

• unconditional and qualified propositions;


• conditional and qualified propositions.

where each proposition of the type p or r, as defined above, is associated with a point or
interval probability. Qualified fuzzy propositions will be considered in more detail in
Chapter 6 in the context of the Fril programming environment. The range of truth
values for fuzzy propositions is [0, 1], where truth and falsity are expressed by the
values 1 and 0 respectively.

4.1.1 Linguistic partitions, variables and hedges


Since its introduction, fuzzy set theory has been used in various guises. This section
introduces the notion of linguistic partitions and linguistic variables, which form the
foundation for numerous applications of fuzzy set theory in fields such as, fuzzy logic,
and application domains such as systems control, machine learning, pattern recognition,
etc. [Ralescu 1995; Ruspini, Bonissone and Pedrycz 1998; Terano, Asai and Sugeno
1992; Yen and Langari 1998]. This section begins by defining crisp and fuzzy
partitions. It then highlights some of the shortcomings of crisp partitions and describes
briefly how fuzzy partitions address these. Subsequently, a linguistic variable, an
abstract variable defined over a fuzzy partition, is introduced and its semantic
behaviour is justified from a human reasoning perspective using the voting model. This
is followed by a description of how to generate fuzzy partitions of variable universes
using triangular and trapezoidal fuzzy sets.

4.1.1.1 Partitions
The concept of a partition can be exploited both to reduce the information complexity
and also to enhance the interpretability of a system. Partitions facilitate a more natural
mapping between the computational representation and the human perception of the
world. Partitions achieve a natural and efficient reduction of information complexity
by quantising or discretising continuous universes. They can be viewed as a means of
carving the attribute space into regions of self-similarity. Zadeh refers to these regions
as granules [Zadeh 1994]. Notions of indistinguishability, similarity, proximity and
functionality play key roles in determining the extent of these granules. Granules are
normally characterised by crisp or fuzzy sets. Consequently, crisp sets or fuzzy sets can
be used to partition the universes upon which the problem domain variables are
defined, thus leading to crisp and fuzzy partitions.

Definition: Let X = {x_1, ..., x_n} be a set of given data. A partition P of X is a family of
subsets of X denoted by P = {A_1, ..., A_c}, that satisfy the following properties:

(i) A_i ∩ A_j = ∅, ∀ i, j ∈ {1, ..., c} and i ≠ j

(ii) $\bigcup_{i=1}^{c} A_i = X$

In other words, P provides a minimal, or most efficient, covering of X. A trivial example
of a partition of X is provided by {A, ¬A | ¬A denotes the complement of A} for any
subset A of X.

When each A_i is a fuzzy set, a fuzzy partition [Ruspini 1969] for X is defined and the
following properties, corresponding to (i) and (ii) above, must hold:

(i) $\min(\mu_{A_i}(x_k), \mu_{A_j}(x_k)) < 1$, ∀ i, j ∈ {1, ..., c} with i ≠ j, and ∀ k ∈ {1, ..., n}

(ii) $\sum_{i=1}^{c} \mu_{A_i}(x_k) = 1$, ∀ k ∈ {1, ..., n}



This type of partition is sometimes known as a fuzzy mutually exclusive partition
[Baldwin, Martin and Pilsworth 1995] for reasons that will become apparent after the
presentation of the voting model interpretation of linguistic variables.

For example, given X = {x_1, x_2, x_3} and

A_1 = x_1/0.6 + x_2/1 + x_3/0.5

A_2 = x_1/0.4 + x_2/0 + x_3/0.5

then P = {A_1, A_2} is a fuzzy partition (or fuzzy 2-partition) of X. It can easily be seen
that if P is restricted to crisp sets, then this definition corresponds to the standard
definition of a partition as shown above.
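
As a quick illustrative check in Python (assuming property (ii) is the sum-to-one condition reconstructed above), the two fuzzy sets do indeed form a fuzzy 2-partition: their memberships sum to 1 for every element of X.

A1 = {"x1": 0.6, "x2": 1.0, "x3": 0.5}
A2 = {"x1": 0.4, "x2": 0.0, "x3": 0.5}

for x in ("x1", "x2", "x3"):
    total = A1[x] + A2[x]
    print(x, total, abs(total - 1.0) < 1e-9)   # each line ends with True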

Relaxing property (ii) for a fuzzy partition as follows:

$$0 < \sum_{i=1}^{c} \mu_{A_i}(x_k) \le c \qquad \forall k \in \{1, \ldots, n\}$$

leads to a fuzzy non-mutually exclusive partition.

Partitions, through their information reducing capability, provide tractability. To date,


most information processing techniques (as in other fields) have relied upon crisp
partitions. These include knowledge representation areas such as semantic networks,
constraint programming, machine learning techniques such as decision tree based
approaches, and statistical techniques such as histogramming. Even though the use of
crisp partitions has led to many successful developments in these fields, crisp
partitions suffer from limitations that ultimately affect the usefulness of these
approaches. For example, in the case of machine learning techniques that rely on crisp
partitions, such as decision tree induction algorithms [Quinlan 1986], the positioning of
boundaries tends to be arbitrary. As a direct consequence of the crisp nature of these
partitions and their location arbitrariness, decision trees can exhibit discontinuous
behaviour about these boundaries, that is, a small change in a variable value can lead
to a totally different outcome. Furthermore, small changes in these boundary definitions
during learning can lead to radically different models. From a human perspective of the
real world, boundaries between concepts tend to be necessarily vague and vary between
people. Consider an example of this phenomenon using the concept of a "Small" die value.
The die values that are considered "small" will vary from person to person, with
some values more prototypical of the concept than others (i.e., in a voting model context,
more people vote for these values). A fuzzy set, through the concept of
graded membership, provides a means of capturing prototypicality and vagueness of
boundary (as seen in Chapter 3). Using fuzzy sets to partition variable universes
provides a natural means of capturing the vagueness of boundary that exists between
concepts, and also insulates the model somewhat from the location of the fuzzy sets
(it is not as sensitive to boundary locations as models built on crisp partitions). In addition,
fuzzy sets, through their imprecise nature, provide better generalisation and enhanced
transparency in the context of machine learning (see Parts IV and V of this book).

To enhance the transparency/understandability of the partitioned universes, words or


labels are associated with each (fuzzy or crisp) subset. This type of partition is termed a

linguistic partition [Ralescu and Hartani 1995]. It is quite natural to assign linguistic
labels (from a predefined dictionary of terms or from a list that an expert has provided)
to each fuzzy subset. For example, the universe of a variable Position could be
partitioned into three fuzzy subsets that are associated with the words Left, Middle and
Right. Variables defined over these fuzzy subset labels are termed linguistic
variables [Zadeh 1975a; Zadeh 1975b; Zadeh 1975c]. A linguistic variable takes as its
values a finite set of words or labels. Linguistic partitions can be viewed as a lens or filter
through which the data can be seen in an intuitive manner. Linguistic partitions permit
operations on data, such as learning and reasoning, to be performed in a more effective
and transparent fashion. Examples of these claims are presented in Parts IV and V of this book, where
linguistic partitions provide tractability, effectiveness and understandability for the
modelling approaches presented.

4.1.1.2 Linguistic variables


A linguistic variable can formally be defined as a quintuple <X, W, Ω_X, g, m> in
which X is the name of the variable, W is the set of linguistic terms or words of X that
form a linguistic partition of the universe of discourse Ω_X, g is a syntactic rule (a
grammar) for generating linguistic terms, and m is a semantic rule that assigns to each
word w ∈ W its meaning m(w), which is a fuzzy set on Ω_X (i.e. m: W → FuzzySet(Ω_X))
[Zadeh 1975a; Zadeh 1975b; Zadeh 1975c]. The fuzzy set m(w) can be interpreted as
encoding the meaning of w such that for each x ∈ Ω_X the membership value μ_{m(w)}(x)
quantifies the suitability or applicability of the word w as a description of the value x.
An example of a linguistic variable, Position, is shown in Figure 4-1. This expresses
the position of a pixel (in the horizontal direction) within a digital image using three
words, Left, Middle and Right, as well as other linguistic terms generated by the syntactic
rule (using linguistic hedges), such as not Left, very Left, Left or Middle (not shown
explicitly in Figure 4-1) and so forth. For the scope of this work the syntactic rule is
restricted to conjunctions and disjunctions of words and avoids the use of hedges
and negation. Each of the words is assigned one of the three fuzzy sets by the semantic
rule, as shown in the figure. The fuzzy sets, whose membership functions have a
trapezoidal shape, are defined on the interval [0, 100] that denotes the universe of the
base variable. Each of them expresses a restriction on the range. A linguistic variable
provides different levels of abstraction of a problem, deriving power of expression from
its symbolic and semantic nature (see Figure 4-1).

4.1.1.3 Voting model interpretation of linguistic variables


The voting model paradigm [Baldwin 1991; Gaines 1977; Gaines 1978; Lawry 1998] is
used to interpret the behaviour of the semantic rule m of a linguistic variable in terms of
the binary "true" or "false" responses of a population of individuals. Consider a
population of voters, where each voter is asked to describe a value x ∈ Ω_X by voting in
a yes or no fashion on each word w ∈ W, on its appropriateness as a label or description of
x. Voters are expected to vote consistently and abide by the constant threshold
assumption. At the time of voting, each voter is aware of all the words w ∈ W.
Requiring each voter to choose one word (exclusively) to describe each value leads to a
fuzzy mutually exclusive partition. Relaxing this constraint leads to a fuzzy non-mutually
exclusive partition. The valuation of x in the characterisation of a word w, i.e.
μ_{m(w)}(x), is defined to be the proportion of the population who accept w as a description
of the value x. Similarly, other linguistic terms generated by the syntactic rule, such as
not w, very w, w_1 or w_2, can be assigned meaning. For example, the meaning of the
compound statement w_1 and w_2 for a value x ∈ Ω_X can be taken as the proportion of the
population who say "yes" to each of w_1 and w_2 as being an appropriate description of x.
The meaning of qualified linguistic terms such as very Small could be obtained
either by getting voters to vote on each possible qualified linguistic term (this is
possibly infinite) or by getting voters to vote on a general definition of each hedge in
the context of this linguistic variable. As described here, the voting model process
could be used to derive m, the semantic rule of a linguistic variable, in a very natural
manner.

Figure 4-1: An example of a linguistic variable defined over the universe Ω_Position, showing the linguistic (abstract) variable, its linguistic values (states), the semantic rule, and the base (detailed) variable defined on [0, 100].

For example, consider a die variable defined over {1, ..., 6}. Let W, the set of linguistic
terms of the corresponding linguistic variable, consist of {Small, Medium, Large}. The
meaning of the words Small, Medium and Large can be generated from the voting
patterns of the population for each die value. Table 4-1 presents the voting pattern for a
population of ten voters for the die value 3. Similar voting patterns are generated for the
other die values. Subsequently, in the case of each word w_i, the meaning consists of the
list of die values associated with the proportion of voters who accept w_i as a description
of the respective die value. These proportions correspond to membership values. For
example, the value 3 will have a membership value of 1 in the fuzzy set Medium. The
voting pattern presented in Table 4-1 can alternatively be viewed as a linguistic
description of the die value 3; this description is characterised by the following fuzzy
set: {Medium/1 + Small/0.2}.

Table 4-1: Voting pattern for ten people corresponding to the linguistic interpretation
of the die value 3.

Word\Person   1    2    3    4    5    6    7    8    9    10
Medium        yes  yes  yes  yes  yes  yes  yes  yes  yes  yes
Small         yes  yes  no   no   no   no   no   no   no   no

4.1.1.4 Triangular and trapezoidal fuzzy set based partitions


When partitioning universes, fuzzy sets represented by simple geometric shapes are
desirable from a computational point of view. In this work, fuzzy set shapes are
restricted to triangles and trapezoids. These shapes provide computational tractability and
transparency, and are also quite effective in modelling real world problems (see Chapter 11). There
are infinitely many ways in which fuzzy sets of these types can be placed over a universe of
discourse. Usually, they are positioned automatically or by an expert in the field. At this
point, the automatic generation of partitions is considered. To generate partitions, the
granularity n of the base universe Ω needs to be determined, i.e. the number of fuzzy
sets that will be used to partition the base universe Ω. This may be determined directly
from the words used to linguistically partition this variable's universe or, in the absence
of such information, can be resolved automatically using the procedures outlined in
Chapter 9. Alternatively, these points could come from simply placing them uniformly
over the universe of discourse. In the case where triangular fuzzy sets are used, n points
(including the endpoints) need to be provided; these points will be used to partition the
universe into n−1 intervals. Subsequently, a triangular fuzzy set is centred about each
point p_j (i.e. p_j is the only normal point of the fuzzy set) whose support ranges from
point p_{j−1} to point p_{j+1}. The shape of the fuzzy sets whose core elements are points p_1 or
p_n will be a right-angled triangle or ramp-like, as depicted in Figure 4-2(a). Words are
then taken from a predefined dictionary and the semantic rule for the linguistic variable
is generated. Conversely, if the term set W exists, it can be used to generate the
semantic rule m. For example, take the variable Position defined over universe Ω_Position,
which could be associated with the following linguistic partition:

{Left, Middle, Right}

Let the Position universe Ω_Position be defined over the range [0, 100]. Then the
definitions of the above (using uniformly placed triangular) fuzzy sets (in Fril notation³
[Baldwin, Martin and Pilsworth 1988; Baldwin, Martin and Pilsworth 1995]) could be

³ A fuzzy set definition in Fril such as Middle [0:0, 50:1, 100:0] can be rewritten
mathematically as follows (denoting the membership value of x in the fuzzy set
Middle):

$$\mu_{Middle}(x) = \begin{cases} 0 & \text{if } x \le 0 \\ \dfrac{x}{50} & \text{if } 0 < x \le 50 \\ \dfrac{100 - x}{50} & \text{if } 50 < x < 100 \\ 0 & \text{if } x \ge 100 \end{cases}$$

Left: [0:1, 50:0]
Middle: [0:0, 50:1, 100:0]
Right: [50:0, 100:1]

This is graphically depicted in Figure 4-2(a). Any position value in Ω_Position has a non-zero
membership in at least one, and at most two, of these fuzzy sets. For example, the
position 10 will have a membership value of 0.8 in Left and 0.2 in Middle. Linguistic
partitions provide a means of giving the data a more anthropomorphic feel, thereby
enhancing understandability. In this case the value 10 corresponds to the linguistic
description characterised by the following fuzzy set:

$$\sum_{i=1}^{c} A_i \,/\, \mu_{A_i}(10) = \{Left/0.8 + Middle/0.2\}$$

Triangular fuzzy sets correspond quite naturally to fuzzy numbers. In this example the
fuzzy set Middle could also, quite intuitively, be labelled as About_50.
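
A hedged Python sketch (not Fril itself; names are illustrative) of the piecewise-linear fuzzy sets defined by the point:value pairs above follows, reproducing the linguistic description of the position value 10.

def piecewise_linear(points):
    """Build a membership function from Fril-style [(x, mu), ...] pairs."""
    def mu(x):
        if x <= points[0][0]:
            return points[0][1]
        if x >= points[-1][0]:
            return points[-1][1]
        for (x0, m0), (x1, m1) in zip(points, points[1:]):
            if x0 <= x <= x1:
                return m0 + (m1 - m0) * (x - x0) / (x1 - x0)
    return mu

POSITION_PARTITION = {
    "Left":   piecewise_linear([(0, 1), (50, 0)]),
    "Middle": piecewise_linear([(0, 0), (50, 1), (100, 0)]),
    "Right":  piecewise_linear([(50, 0), (100, 1)]),
}

print({w: mu(10) for w, mu in POSITION_PARTITION.items() if mu(10) > 0})
# {'Left': 0.8, 'Middle': 0.2}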

Figure 4-2: Linguistic partition of the variable universe Ω_Position using (a) triangular
fuzzy sets and (b) trapezoidal fuzzy sets.

Alternatively, where trapezoidal fuzzy sets are used to partition a universe of discourse,
to generate a partition with granularity n, n+1 points need to be provided (these n+1
points include the universe boundary values); this leads to a partitioned universe
consisting of n intervals. A trapezoidal fuzzy set can be characterised by four points a,
b, c and d, as depicted in Figure 4-2(b). The interval [b, c] characterises the core of the
fuzzy set (i.e. all points in this interval have a membership value of 1), while the
interval [a, d] characterises the support of the fuzzy set (i.e. all points in this interval
have a membership value > 0). The core [b_j, c_j] of each fuzzy set is set to the interval
[p_j, p_{j+1}], while the support [a_j, d_j] is set to the following interval:

$$\left[\; p_j - \frac{p_j - p_{j-1}}{2} \cdot degreeOfOverlap \;,\;\; p_{j+1} + \frac{p_{j+2} - p_{j+1}}{2} \cdot degreeOfOverlap \;\right]$$

Figure 4-3: Partition of the variable universe Ω_Position using four trapezoidal fuzzy sets
with varying degrees of overlap; (a) 100% overlap; (b) 50% overlap; (c) no overlap,
i.e. crisp sets.

Here the degreeOfOverlap indicates the degree of interaction between neighbouring
concepts and is defined in the range [0, 1], where 0 represents no overlap between
concepts, i.e. crisp concepts as depicted in Figure 4-3(c). An overlap degree of 1 (see
Figure 4-3(a)) leads to linguistic descriptions consisting of multiple words for each
element of the universe. The shape of the fuzzy sets whose core elements are points p_1
or p_n will correspond to trapezoids missing the outer ramp, as depicted in Figure
4-2(b). Words can then be associated with their fuzzy set characterisations, thus
completing the definition of the corresponding linguistic variable. Trapezoidal fuzzy
sets correspond quite naturally to classes or intervals in the data. In this example, the
fuzzy set Middle could also quite intuitively be labelled as Roughly40_60.
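
The following hedged Python sketch generates such a trapezoidal partition. The support formula follows the reconstruction given above (half of each neighbouring interval, scaled by degreeOfOverlap), so treat it as an assumption rather than the book's exact procedure.

def trapezoidal_partition(points, degree_of_overlap):
    """points: n+1 sorted values bounding n intervals; returns one (a, b, c, d) per fuzzy set."""
    fuzzy_sets = []
    for j in range(len(points) - 1):
        core_left, core_right = points[j], points[j + 1]
        left_gap = (points[j] - points[j - 1]) / 2 if j > 0 else 0.0
        right_gap = (points[j + 2] - points[j + 1]) / 2 if j + 2 < len(points) else 0.0
        a = core_left - left_gap * degree_of_overlap
        d = core_right + right_gap * degree_of_overlap
        fuzzy_sets.append((a, core_left, core_right, d))
    return fuzzy_sets

# Four fuzzy sets over [0, 100] with 50% overlap (break points are illustrative).
for fs in trapezoidal_partition([0, 25, 50, 75, 100], degree_of_overlap=0.5):
    print(fs)   # each tuple is (a, b, c, d): support [a, d], core [b, c]

With degree_of_overlap set to 0 the supports coincide with the cores, recovering the crisp partition of Figure 4-3(c); setting it to 1 yields the heavily overlapping case of Figure 4-3(a).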

4.1.2 Linguistic hedges


Linguistic hedges are special linguistic terms, characterised as unit-interval functions,
that can modify other linguistic terms that are expressed using fuzzy sets [Zadeh 1972].
Linguistic hedges are sometimes known as fuzzy modifiers. A linguistic hedge M can
be characterised as follows:

$$M: [0, 1] \to [0, 1]$$

Linguistic hedges can be used to modify any fuzzy set. Typical examples of linguistic
hedges include very, more or less, slightly, etc., where very is often represented as the
unary square operation (that is, M(a) = a², where a corresponds to a membership value)
and more or less is often characterised by the unary square root operation (that is, M(a)
= √a) [Zadeh 1972]. See Figure 4-4 for a graphical depiction of the linguistic hedges
"very" and "more or less". Linguistic hedges can be used to modify the semantics of
fuzzy predicates (represented by fuzzy sets as seen here), fuzzy truth values and fuzzy
probabilities. It is important to note that hedges do not exist in classical logic.
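
An illustrative Python sketch of these two hedges, applied point-wise to a discrete fuzzy set (the fuzzy set tall and its membership values are hypothetical), is given below.

import math

def apply_hedge(fuzzy_set, modifier):
    """Return a new fuzzy set with the modifier applied to each membership value."""
    return {x: modifier(mu) for x, mu in fuzzy_set.items()}

very = lambda a: a ** 2                  # concentration: very
more_or_less = lambda a: math.sqrt(a)    # dilation: more or less

tall = {"170cm": 0.25, "180cm": 0.64, "190cm": 1.0}   # hypothetical fuzzy set
print(apply_hedge(tall, very))           # {'170cm': 0.0625, '180cm': 0.4096, '190cm': 1.0}
print(apply_hedge(tall, more_or_less))   # {'170cm': 0.5, '180cm': 0.8, '190cm': 1.0}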

Figure 4-4: Example linguistic hedges corresponding to "very" (denoted by the dashed
curve) and "more or less".

4.2 FUZZY INFERENCE

Fuzzy inference, despite its imprecise connotations, consists of deductive methods that
are sound, rational and not just mere heuristic approximations of classical two-valued
logical inference. They are centred around the Compositional Rule of Inference (CRI)
originally introduced by Zadeh [Zadeh 1973]. CRI (often referred to as generalised
modus ponens in the literature) views If-Then rules as dependencies, which are
characterised by fuzzy relations (see Section 3.7.1.1 for a full presentation of fuzzy
relations), and inference reduces to the composition of these relations and membership
functions. CRI provides a framework for generalising classical inference processes
based on tautologies such as modus ponens, modus tollens and hypothetical syllogism.

4.2.1 Compositional rule of inference (CRI)


CRI is most intuitively presented from the crisp case where variables are related
through a function (i.e. a one-to-one mapping) [Klir and Yuan 1995]. Consider two
variables (in order to simplify the presentation), X and Y, defined over universes Ω_X and
Ω_Y respectively. A functional relationship between X and Y exists if each value x of
variable X (i.e. x ∈ Ω_X) directly maps to a value y of variable Y through a function f, i.e.
y = f(x), as shown in Figure 4-5(a). This can quite easily be extended to the case where
variable X is instantiated to a set of values A. In this case, it is possible to infer that the
value of variable Y is a set B that is characterised as follows: B = {y ∈ Ω_Y | y = f(x), x ∈ A},
as illustrated in Figure 4-5(b). Extending the relationship between variables X and
Y to a relation (i.e. a many-to-many relationship), it is possible to infer that any value x
of variable X can be directly mapped to a set B, characterised as follows: B = {y ∈ Ω_Y |
<x, y> ∈ R_XY}, as illustrated in Figure 4-6(a). As in the functional case, when variable
X is instantiated to a set value A, a set B can be inferred that can be characterised as
follows: B = {y ∈ Ω_Y | <x, y> ∈ R_XY and x ∈ A}, as graphically presented in Figure
4-6(b). This can quite easily be represented in terms of the crisp membership functions
(characteristic functions) μ_A, μ_B and μ_RXY (that characterise the sets A, B and the
relation R_XY respectively) as follows:

$$\mu_B(y) = \max_{x \in \Omega_X} \left[ \min(\mu_A(x), \mu_{R_{XY}}(x, y)) \right] \qquad (4\text{-}1)$$

Figure 4-5: An example of a relationship between variables X and Y characterised by a
function f: X → Y; (a) depicts a mapping from value x to y (i.e. y = f(x)); (b) depicts a set
mapping from A to B using Equation 4-1.

Zadeh [Zadeh 1973] extended the characterisation of relations between variables from
crisp relations to fuzzy relations and thereby paved the way for a new type of inference
based upon imprecise concepts represented as fuzzy sets - the compositional rule of
inference (CRI). Formally, if R_XY is a fuzzy relation between variables X and Y, and A
and B are fuzzy sets defined over Ω_X and Ω_Y respectively, then if it is known that the
value of variable X is a fuzzy set A, the fuzzy set B can be inferred as the value of
variable Y using Equation 4-1; the key difference being that A and B are fuzzy sets in
this case, rather than crisp sets as presented above. The compositional rule of inference
in fuzzy logic is often referred to as generalised modus ponens for reasons that will
become obvious over the next couple of paragraphs. Equation 4-1 can be succinctly
written in matrix form as follows:

$$B = A \circ R_{XY}$$

Figure 4-6: An example of a relationship between variables X and Y characterised by a
relation R_XY; (a) depicts the set B as inferred from the value x using Equation 4-1; (b)
depicts the set B as inferred from the set A using Equation 4-1.

A graphic example of inference using CRI is presented in Figure 4-7 and is described
subsequently. The calculations associated with this example are presented in matrix
format. Let A be a discrete fuzzy set A = {x_1/0.3, x_2/1, x_3/0.5}, graphically depicted as a
discrete approximation of the fuzzy set A in Figure 4-7. Let R_XY be a fuzzy relation
describing a portion, often referred to as a fuzzy patch, of the function y = f(x). This is
depicted in Figure 4-7 as a greyscale rectangle with a dashed boundary. This relation
R_XY can be generated by a number of means, which are described in subsequent
sections. The generation of the relation R_XY used here is described in Section 4.2.1.2
(see Figure 4-9). Each tuple in this discrete fuzzy relation is indicated by a point in
Figure 4-7, e.g. <x_3, y_1>. Let B be the fuzzy set that is inferred from the fuzzy set A and
the fuzzy relation (patch) R_XY using the CRI rule as defined above (Equation 4-1). In the
following, the fuzzy sets A and B, and the relation R_XY, are written mathematically as
matrices. The inferred fuzzy set B is calculated as follows:

$$B = A \circ R_{XY}$$

$$[0.9 \quad 1 \quad 0.2] = [0.3 \quad 1 \quad 0.5] \circ \begin{bmatrix} 0.3 & 0.3 & 0.2 \\ 0.9 & 1 & 0.2 \\ 0.5 & 0.5 & 0.2 \end{bmatrix}$$

where, for example, the value of 0.9 in matrix B (denoting the membership of y_1 in the
fuzzy set B) is calculated as follows:

max(min(0.3, 0.3), min(1, 0.9), min(0.5, 0.5)) = max(0.3, 0.9, 0.5) = 0.9.
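
The following short Python sketch (function name illustrative) carries out the max-min composition of Equation 4-1 on the worked example above and reproduces the inferred fuzzy set B.

A = [0.3, 1.0, 0.5]
Rxy = [[0.3, 0.3, 0.2],
       [0.9, 1.0, 0.2],
       [0.5, 0.5, 0.2]]

def max_min_composition(A, R):
    """mu_B(y_j) = max over i of min(mu_A(x_i), mu_R(x_i, y_j))."""
    n_cols = len(R[0])
    return [max(min(A[i], R[i][j]) for i in range(len(A))) for j in range(n_cols)]

print(max_min_composition(A, Rxy))   # [0.9, 1.0, 0.2]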

Figure 4-7: An example of fuzzy inference based on the compositional rule of inference
(Equation 4-1); the fuzzy set B is inferred by CRI using the fuzzy relation R_XY and the
fuzzy set A.

A relation R_XY between any two variables X and Y can characterise different types of
relationships. However, in fuzzy logic systems this relation is restricted to representing
the dependency relationship, which is expressed using fuzzy conditional unqualified
propositions (If-Then rules) of the form "If X is A then Y is B", where both A and B are
fuzzy sets. This rule-based format (both crisp and fuzzy) has been commonly and
successfully used to describe systems ranging from controllers to object recognition
systems that lack mathematical models or are difficult to describe [Ralescu 1995;
Ruspini, Bonissone and Pedrycz 1998; Terano, Asai and Sugeno 1992]. In fuzzy logic
systems, having captured this type of relationship between variables, different types of
inference can be performed using CRI. Two types of inference are considered:
generalised modus ponens and generalised modus tollens. Both inference procedures
use the relation R_XY. In subsequent subsections, two commonly used approaches for
generating such relations are presented: one based upon logical implication, and the
other based on conjunction.

In the following presentation of approximate reasoning using CRI, fuzzy conditional
propositions are expressed in canonical form as follows:

r: IF X is A THEN Y is B     (4-2)

where X and Y are variables defined over universes Ω_X and Ω_Y, whose values are fuzzy
sets A and B respectively. Here only one variable is used in the conditional part of the
rule (also known as the antecedent or body of the rule) and also in the action part of the
rule (also known as the consequent or head of the rule); however, any number of variables can
be used in each portion of the rule. As alluded to previously, this rule proposition can
alternatively be expressed as a fuzzy relation R_XY (i.e. a fuzzy set on the Cartesian
universe Ω_X × Ω_Y) where the membership value for each possible tuple <x, y>, for all
combinations of x ∈ Ω_X and y ∈ Ω_Y, is determined as follows:

$$\mu_{R_{XY}}(x, y) = I(\mu_A(x), \mu_B(y)) \qquad (4\text{-}3)$$

where I denotes a fuzzy implication (derived logically), which will be presented
subsequently in Section 4.2.1.1. Given the above fuzzy rule proposition r, characterised
as R_XY, and another unconditional fuzzy proposition (fact) f of the form:

f: X is A'

where A' is a fuzzy set defined on Ω_X and potentially different to A, it is possible to infer
that Y is B' using CRI (Equation 4-5, a slightly modified version of Equation 4-1). This
inference can be succinctly expressed as follows:

Given: r: IF X is A THEN Y is B
And:   f: X is A'
Infer: Y is B'     (4-4)

This inference procedure is called generalised modus ponens due to its similarity with
the classical modus ponens rule of inference, which states that given a fact f and a logic
rule r: if f then g, the consequent of the rule, g, can be inferred provided both f and
r are true.

The following equation formally defines CRI, and is slightly different to Equation 4-1
so that it can deal with fuzzy propositions that may differ from those expressed in the
conditional part of a rule:

$$\mu_{B'}(y) = \sup_{x \in \Omega_X} \min\left[ \mu_{A'}(x), \mu_{R_{XY}}(x, y) \right] \qquad \forall y \in \Omega_Y \qquad (4\text{-}5)$$

A generalised version of modus tollens also exists in fuzzy logic and can be succinctly
expressed as follows:

Given: r: IF X is A THEN Y is B
And:   f: Y is B'
Infer: X is A'     (4-6)

In this case, CRI has the following form:



μA'(x) = sup_{y∈Ωy} min[μB'(y), μRxy(x, y)]    (4-7)

and the fuzzy relation Rxy is as defined above. When the sets are crisp (i.e. A' = ¬A and B' = ¬B, where ¬A and ¬B refer to the complements of A and B respectively), classical modus tollens is recovered.

As in classical logic, other inference rules are also possible in fuzzy logic such as
hypothetical syllogism and contraposition. For a more complete treatment see [Klir and
Yuan 1995; Zimmermann 1996].

4.2.1.1 Implication based inference


The presentation of CRI so far has assumed a fuzzy relation that somehow models the
conditional dependency between the condition and action parts of a rule. This section
shows how such a relation can be derived using fuzzy implications, which are based
upon implication operators that were originally introduced for multi-valued logic
[Resher 1969] (i.e. a logic allowing multiple truth values rather than two as used in
traditional bivalent logic). The next section describes an alternative approach to
modelling this dependency relation based upon conjunction. In short, the goal here is to
take a fuzzy conditional proposition such as

r: IF X is A THEN Y is B

and construct a fuzzy relation Rxy using fuzzy implication functions. This relation can
subsequently be used with the CRI inference process described above.

A fuzzy implication can be formally defined as a binary function:

I: [0, 1] × [0, 1] → [0, 1]

which accepts as input the truth values a and b of the fuzzy propositions (facts) f and g and returns the truth value, I(a, b), corresponding to the conditional proposition "if f then g". Fuzzy implication functions possess various mathematical properties, such as monotonicity, continuity and identity (see [Klir and Yuan 1995]), and are in general extensions of classical material implication. In classical logic, various equivalent forms of implication (from a classical truth value perspective) exist such as:

• ¬a ∨ b (material implication)

• max{x ∈ {0, 1} | a ∧ x ≤ b} (i.e. I(a, b) is the greatest x such that a ∧ x ≤ b)

• ¬a ∨ (a ∧ b) (originating from quantum logic)

While these are logically equivalent, their extensions in fuzzy logic are not and
consequently, result in distinct families of fuzzy implication. Extending the above logic
formulas to fuzzy logic, leads to different families of fuzzy logic implications that are
parameterised by a t-norm ⊗, a t-conorm ⊕, and a fuzzy complement ¬. A selection of these
families is described below.

Strong or S-implications: Strong or S-implications are basically extensions of material implication (I(a, b) = ¬a ∨ b) to the more general form ¬a ⊕ b. Specific S-implication functions that were originally introduced for multi-valued logic include the Kleene-Dienes implication, which is defined as I(a, b) = max(1 - a, b). Within fuzzy logic this type of implication has been generalised by replacing the max with ⊕. Another example of an S-implication is the Lukasiewicz implication, which is defined as follows:

I(a, b) = min(1, 1 - a + b)    (4-8)

For example, consider the discrete fuzzy sets A = {x1/0.3, x2/1, x3/0.5} and B = {y1/0.9, y2/1, y3/0.2}, graphically depicted as discrete approximations of the fuzzy sets A and B respectively in Figure 4-7. The relation Rxy between the fuzzy sets A and B can be constructed using Lukasiewicz implication as follows:

                          Y = B
  Rxy             y1/0.9    y2/1    y3/0.2
  X = A  x1/0.3     1         1       0.9
         x2/1       0.9       1       0.2
         x3/0.5     1         1       0.7
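To make this construction concrete, the following minimal Python sketch (assumed helper code, not taken from the book's accompanying software) rebuilds the Lukasiewicz-based relation Rxy above and then applies the compositional rule of inference (Equation 4-5) to a hypothetical fuzzy fact A' introduced purely for illustration.

```python
def lukasiewicz(a, b):
    """Lukasiewicz implication I(a, b) = min(1, 1 - a + b) (Equation 4-8)."""
    return min(1.0, 1.0 - a + b)

def build_relation(A, B, implication):
    """Rxy(x, y) = I(mu_A(x), mu_B(y)) for discrete fuzzy sets A and B (Equation 4-3)."""
    return {(x, y): implication(a, b) for x, a in A.items() for y, b in B.items()}

def cri(A_prime, R, universe_y):
    """Compositional rule of inference: mu_B'(y) = sup_x min(mu_A'(x), Rxy(x, y))."""
    return {y: max(min(A_prime[x], R[(x, y)]) for x in A_prime) for y in universe_y}

A = {"x1": 0.3, "x2": 1.0, "x3": 0.5}
B = {"y1": 0.9, "y2": 1.0, "y3": 0.2}

Rxy = build_relation(A, B, lukasiewicz)
print(Rxy[("x1", "y3")])                      # 0.9, as in the table above

A_prime = {"x1": 0.2, "x2": 0.8, "x3": 1.0}   # hypothetical observed fuzzy fact A'
print(cri(A_prime, Rxy, B))                   # the inferred fuzzy set B'
```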

Figure 4-8 depicts an example of fuzzy inference where Lukasiewicz implication is


used to generate the constituent fuzzy relation Rxy. Lukasiewicz implication is both an
S-implication and an R-implication (introduced next).

The Reichenbach-implication is another example of an S-implication and is defined as


follows:

I(a, b) = 1 - a + ab.

Residual or R-implications: Residual or R-implications extend the intuitionistic implication defined in classical logic as I(a, b) = max{x ∈ {0, 1} | a ∧ x ≤ b} to I(a, b) = sup{x ∈ [0, 1] | a ⊗ x ≤ b} in fuzzy logic. Again numerous examples of such implications exist, such as the Gödel implication, which is defined as follows:

I(a, b) = 1 if a ≤ b, and I(a, b) = b otherwise

Quantum logic or QL-implications: Quantum logic or QL-implications require that the t-norm ⊗ is dual to the t-conorm ⊕ with respect to the fuzzy complement ¬ (i.e. they are required to form a De Morgan triple). They are generically defined as I(a, b) = ¬a ⊕ (a ⊗ b), a generalisation of their quantum logic counterpart ¬a ∨ (a ∧ b). The most commonly used implication operator from within this group of implications is the Zadeh implication, which is defined as follows:

I(a, b) = max(1 - a, min(a, b))    (4-9)

Other fuzzy implications have also been introduced which do not fall into any of the
above categories. For further details and comparative studies of fuzzy implications see

[Gaines 1978; Klir and Yuan 1995; Mizumoto and Zimmermann 1982; Ruan and Kerre
1993].

Given: r: IF X is A THEN Y is B
And:   f: X is A'
Infer: Y is B' (using CRI and Lukasiewicz's implication)

Figure 4-8: Fuzzy set B' is inferred from the given rule (IF X is A THEN Y is B) and fact (X is A') using the compositional rule of inference (Equation 4-5), where the fuzzy relation Rxy is based upon Lukasiewicz's implication (Equation 4-8).

4.2.1.2 Conjunction based inference


The previous section has described how to construct a fuzzy relation from fuzzy If-
Then rules using logic-based implication functions, which could subsequently be
incorporated into fuzzy inference using the compositional rule of inference. An
alternative way of generating this fuzzy relation from If-Then rules is based upon
conjunction (using the t-norm ® operators). This idea originates from Mamdani's work
in the field of control [Mamdani 1977]. Once again, given a fuzzy conditional
proposition such as

r: IF X is A THEN Y is B

the goal is to construct a fuzzy relation Rxy, in this case, using conjunction. The relation
Rxy can be constructed using the following equation:

μRxy(x, y) = μA(x) ⊗ μB(y)    (4-10)

For example, consider the discrete fuzzy sets A = {x1/0.3, x2/1, x3/0.5} and B = {y1/0.9, y2/1, y3/0.2}, graphically depicted as discrete approximations of the fuzzy sets A and B respectively in Figure 4-7. The relation Rxy between the fuzzy sets A and B can be

constructed using Equation 4-10, where ⊗, the conjunction operator, is set to min. This
calculation is presented in Figure 4-9.

                          Y = B
  Rxy             y1/0.9    y2/1    y3/0.2
  X = A  x1/0.3     0.3       0.3     0.2
         x2/1       0.9       1       0.2
         x3/0.5     0.5       0.5     0.2

Figure 4-9: Calculating the relation Rxy between the fuzzy sets A and B using Equation 4-10, where ⊗, the conjunction operator, is set to min.
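The following sketch, reusing the discrete fuzzy sets of Figure 4-7 and a hypothetical fact A', builds the conjunction-based relation of Figure 4-9 (Equation 4-10 with min as the t-norm) and performs max-min inference over it.

```python
A = {"x1": 0.3, "x2": 1.0, "x3": 0.5}
B = {"y1": 0.9, "y2": 1.0, "y3": 0.2}

# Rxy(x, y) = min(mu_A(x), mu_B(y)) -- Equation 4-10 with the min t-norm.
Rxy = {(x, y): min(a, b) for x, a in A.items() for y, b in B.items()}
print(Rxy[("x1", "y1")])                      # 0.3, as in Figure 4-9

# Max-min inference: mu_B'(y) = sup_x min(mu_A'(x), Rxy(x, y)) -- Equation 4-5.
A_prime = {"x1": 0.2, "x2": 0.8, "x3": 1.0}   # hypothetical fact "X is A'"
B_prime = {y: max(min(A_prime[x], Rxy[(x, y)]) for x in A_prime) for y in B}
print(B_prime)
```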

Given: r: IF X is A THEN Y is B
And:   f: X is A'
Infer: Y is B' (using CRI and a conjunction-based Rxy)

Figure 4-10: Fuzzy set B' is inferred from the given rule (IF X is A THEN Y is B) and
fact (X is A') using the compositional rule of inference (Equation 4-5), where the fuzzy
relation Rxy is based upon conjunction (Equation 4-10).

As in the implication case, this relation can subsequently be used with Equation 4-5 as part of the CRI inference process. Figure 4-10 presents an example of inference using CRI with a conjunction-based fuzzy relation. Originally, Mamdani limited the t-norm ⊗ to the min operation, which subsequently became popularised as a means of doing fuzzy control [Terano, Asai and Sugeno 1992]. This type of inference is commonly known as max-min inference (see Figure 4-11 for an example of max-min inference, with further explanation provided in Section 4.3). Over the years max-min inference has been one of the most popular forms of fuzzy inference. The main reasons for this popularity are that it works well in real world applications, it significantly reduces the computational complexity of fuzzy inference and, from a logic perspective, conjunction is appealing in that it expresses a relation of compatibility.

4.2.1.3 Alternative fuzzy models


The Takagi-Sugeno-Kang (TSK) model [Takagi and Sugeno 1985] is an alternative
means of representing fuzzy systems in which rules are made up of fuzzy antecedents
and where the consequent is a linear combination of the input variables. This form of
representation is more mathematical in nature and, while it loses some transparency, it
does permit the modelling of very complex non-linear systems using a small number of
rules. Inference for the TSK model is based upon a t-norm conjunction of activated
input fuzzy predicates. For more details see [Takagi and Sugeno 1985] or [Yen and
Wang 1998].
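As a rough illustration of the usual formulation of TSK inference (a sketch under standard assumptions, not code from [Takagi and Sugeno 1985]), the fragment below conjoins the antecedent memberships of each rule with min and returns the firing-strength-weighted average of the linear consequents; the membership functions and coefficients are invented for illustration only.

```python
def triangular(a, b, c):
    """Return a triangular membership function with support [a, c] and peak at b."""
    def mu(x):
        if x <= a or x >= c:
            return 0.0
        return (x - a) / (b - a) if x <= b else (c - x) / (c - b)
    return mu

# Each rule: (membership functions for x1 and x2, linear consequent coefficients).
rules = [
    ((triangular(0, 2, 4), triangular(0, 3, 6)), (1.0, 0.5, 0.2)),   # y = 1 + 0.5*x1 + 0.2*x2
    ((triangular(2, 5, 8), triangular(3, 6, 9)), (0.0, 1.0, -0.3)),  # y = x1 - 0.3*x2
]

def tsk_infer(x1, x2):
    """Firing-strength-weighted average of the linear rule consequents."""
    num, den = 0.0, 0.0
    for (mu1, mu2), (c0, c1, c2) in rules:
        w = min(mu1(x1), mu2(x2))          # t-norm (min) conjunction of antecedents
        num += w * (c0 + c1 * x1 + c2 * x2)
        den += w
    return num / den if den > 0 else 0.0

print(tsk_infer(3.0, 4.0))
```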

4.3 FUZZY DECISION MAKING FOR PREDICTION - DEFUZZIFICATION

In the previous section fuzzy inference was presented. Inference was considered from
an individual rule perspective, even though typical rule bases consist of multiple rules.
For example, consider the following knowledge base consisting of n rules and one
unconditional proposition (fact):

r1: IF X1 is A1 and X2 is B1 and X3 is C1 THEN Y is Y1
r2: IF X1 is A2 and X2 is B2 and X3 is C2 THEN Y is Y2
...
rn: IF X1 is An and X2 is Bn and X3 is Cn THEN Y is Yn

fact: X1 is x1 and X2 is x2 and X3 is x3

Conclude: Y is Y'

Each rule has three antecedents (conditions) expressed as fuzzy sets Ai, Bi, and Ci defined over the respective universes of discourse Ωx1, Ωx2, and Ωx3. The unconditional proposition is merely a vector of point values <x1, x2, x3> drawn from the universes Ωx1, Ωx2, and Ωx3 respectively. Alternatively, these values could be fuzzy set values or interval values or a mixture of the two. In order to simplify the presentation, point values are chosen. Consequently, each value xi is represented as a fuzzy set with one element xi and an associated membership value of 1 (depicted as straight lines in Figure 4-11). Figure 4-11 illustrates the results of fuzzy inference for this rule base given the vector <x1, x2, x3>. It presents the two rules r1 and r2 which fire (generate a non-empty fuzzy set as a result of applying CRI) and the inferred fuzzy sets Y1' and Y2' (highlighted in grey in Figure 4-11). Here the CRI was based upon a fuzzy relation generated using the conjunction operation min (Equation 4-10). This results in one or more output fuzzy sets being inferred, due to the overlapping nature of the fuzzy sets that populate the input space.

In order for the output of fuzzy inference to be useful in a real world application, it is
normally necessary to convert the output fuzzy sets into a crisp number. For example,
consider a fuzzy rule base that adjusts the power of a heater based on the current room
temperature. Generally, this power component will work in terms of numbers. Consequently, after the fuzzy rule base performs inference given the current room temperature, it needs to decide how much (i.e. a numerical quantity) to increase or decrease the power of the heater. This decision making process for continuous-valued output models is known as defuzzification. In other words, it needs to pick a unique
value from the domain of the output variable that is representative of that variable given
the constraints imposed by the inferred fuzzy sets. More formally, defuzzification is a function that takes the aggregation (based on a t-conorm ⊕) of the inferred fuzzy sets Yi and generates a singleton value y, y ∈ Ωy:

Defuzz(Y1 ⊕ Y2 ⊕ ... ⊕ Yn) → y   where y ∈ Ωy

Several defuzzification strategies have been developed over the years for continuous-
valued models in domains ranging from control to financial decision support systems.
Below, a couple of the more prominent approaches to defuzzification are presented.


Figure 4-11: Max-min inference using COG: Fuzzy inference using CRI based upon
min relation, and Centre of Gravity decision making.

4.3.1 Centre of gravity (COG) method

The centre of gravity (COG) defuzzification procedure is one of the most commonly
used decision making procedures. It is a rather intuitive approach in that the defuzzified
value corresponds to the geometrical centre of mass of the inferred output fuzzy sets.
This is calculated as follows for real-valued fuzzy sets:

y = ∫ y μY'(y) dy / ∫ μY'(y) dy    (4-11)

For the discrete case in which the universe of Y, Ωy, is defined on a finite set of values {y1, ..., yz}, the defuzzified value is calculated as follows:

y = Σ_{i=1..z} yi μY'(yi) / Σ_{i=1..z} μY'(yi)

An example of COG (Equation 4-11) in practice is graphically presented in Figure 4-11: the inferred fuzzy sets Y1' and Y2' are highlighted in grey; the far right of this figure graphically depicts COG generating a single value y. Figure 4-12 (based on an example from [Kasabov 1996]) illustrates the power of fuzzy reasoning using CRI and COG, modelling the function y = (x - 1)^2 in a very precise manner using four fuzzy rules. In a sense, this mathematical function has been replaced by a natural language-like model. The COG procedure, while being computationally expensive to calculate, tends to move smoothly around the output region. The inferred membership associated with each value in the output domain (i.e. μY'(y)) is often interpreted incorrectly as an estimate of the probability of selecting that element. This is addressed again in Chapter 6, where formal techniques for calculating probabilities from the inferred output fuzzy set Y' are presented.
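A minimal sketch of the discrete form of Equation 4-11 is given below; the aggregated output fuzzy set Y' used in the example is hypothetical.

```python
def cog(memberships):
    """Discrete centre of gravity: memberships maps output value y -> mu_Y'(y)."""
    num = sum(y * mu for y, mu in memberships.items())
    den = sum(memberships.values())
    return num / den if den > 0 else None

Y_prime = {0.0: 0.1, 1.0: 0.4, 2.0: 0.8, 3.0: 0.4, 4.0: 0.1}  # illustrative values
print(cog(Y_prime))   # 2.0 for this symmetric example
```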

4.3.2 Maximum height method


One of the most straightforward approaches to defuzzification of the output fuzzy set Y' is to select the value y corresponding to the maximum membership value. This has the advantage of speed, but it may run into problems when there are ties. One commonly used approach to breaking ties is to take the expected (weighted) value of the ties. This can, however, sometimes lead to meaningless values. Another alternative is to randomly select one of the ties using a probability distribution estimated from the membership of the values located in the area of the ties. A detailed presentation of defuzzification procedures is given in [Mizumoto 1998].
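The following sketch illustrates the maximum height method with the simple tie-breaking strategy mentioned above (taking the expected value of the tied elements); the output fuzzy set is again hypothetical.

```python
def max_height(memberships, eps=1e-9):
    """Return the output value with maximal membership; ties are resolved by
    taking the average (expected value) of the tied values."""
    peak = max(memberships.values())
    ties = [y for y, mu in memberships.items() if abs(mu - peak) < eps]
    return sum(ties) / len(ties)

Y_prime = {0.0: 0.2, 1.0: 0.9, 2.0: 0.9, 3.0: 0.3}
print(max_height(Y_prime))   # 1.5: the expected value of the two tied peaks
```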

4.4 FUZZY DECISION MAKING FOR CLASSIFICATION

The defuzzification strategies considered so far are suitable for prediction problems
only. Here, however, a decision making mechanism for classification problems is
presented. In classification problems, a canonical rule base takes the following form:

r1: IF X is FS1 THEN Y is Class1
r2: IF X is FS2 THEN Y is Class2
...
rc: IF X is FSc THEN Y is Classc

For presentation purposes, rules are assumed to consist of just one condition. The
calculation of Rxy is greatly simplified in classification problems, as the output value is
no longer a fuzzy set but a singleton. For example, using Lukasiewicz implication (Equation 4-8) to generate the relation Rxy simply reduces it to the input fuzzy set value FSi. Consequently, the fuzzy relation Rxy term in the CRI equation (Equation 4-5) can be replaced by the fuzzy set value FSi. In other words, it simplifies to:

μB'(y) = sup_{x∈Ωx} min[μA'(x), μFSi(x)]    (4-12)

When data is presented to the system, fuzzy inference generates a membership value (an activation value) for each rule, which is subsequently associated with the output value Classi for that rule. Decision making then reduces to selecting the output value ClassMax associated with the highest activation value; that is, the classification of the input data tuple is ClassMax. Alternative approximate reasoning strategies, based on support logic (probabilistic reasoning), for both prediction and classification problems, where knowledge is expressed in terms of fuzzy sets and if-then rules, are presented in Chapter 6.
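A minimal sketch of this classification-style decision making (Equation 4-12) follows: each rule's activation is the sup-min match between the observed fuzzy fact and the rule's antecedent fuzzy set, and the class of the most activated rule is returned. The rule fuzzy sets and the input below are illustrative only.

```python
rules = {
    "Class1": {"x1": 1.0, "x2": 0.4, "x3": 0.0},   # FS1 (illustrative)
    "Class2": {"x1": 0.0, "x2": 0.6, "x3": 1.0},   # FS2 (illustrative)
}

def classify(A_prime):
    """Activation of each rule via sup-min matching; return the argmax class."""
    activations = {
        label: max(min(A_prime.get(x, 0.0), mu) for x, mu in FS.items())
        for label, FS in rules.items()
    }
    return max(activations, key=activations.get), activations

print(classify({"x1": 0.2, "x2": 0.9, "x3": 0.1}))   # Class2 wins with activation 0.6
```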

FUZZY RULE BASE
r1: IF X is About_1 THEN Y is About_0
r2: IF X is About_2 THEN Y is About_1
r3: IF X is About_3 THEN Y is About_4
r4: IF X is About_4 THEN Y is About_9

Figure 4-12: A fuzzy rule base that approximates (almost exactly) the function y = (x - 1)^2, x ∈ [1, 4]. Fuzzy patches (depicted as rectangles) highlight the zones of applicability for rules 2 and 3.

4.5 APPLICATIONS OF FUZZY LOGIC

Fuzzy logic has a long and varied application history beginning with the pioneering work of Mamdani [Mamdani 1977; Mamdani and Assilian 1974] in control systems. This led to an avalanche of control applications: in consumer products such as cameras/camcorders (Sanyo, Canon, Minolta), washing machines (AEG, Sharp, Siemens, General Electric), vacuum cleaners (Philips and Siemens), and refrigerators (Whirlpool); in automotive and power generation, such as engine control (Nissan); in industrial process control systems such as refining, distillation and cement kiln control; and in robotics and manufacturing. Fuzzy logic has also been applied successfully (i.e. in many fielded/deployed applications) in decision support systems such as foreign exchange trading [Ralescu 1995], system design, image understanding [Ralescu 1995; Ralescu and Shanahan 1999], and more recently in the fields of machine learning and discovery (described in more detail in Parts III, IV and V of this book). For a more detailed presentation of applications see [Ralescu 1995; Ruspini, Bonissone and Pedrycz 1998; Terano, Asai and Sugeno 1992; Yen and Langari 1998].

4.6 SUMMARY

This chapter has presented the fundamentals behind fuzzy logic, introducing the main
forms of knowledge representation and approximate reasoning within the fuzzy logic
framework. It began by introducing fuzzy propositions, linguistic variables and
linguistic hedges as a means of representing knowledge in terms of natural language
statements. The principal rule of inference in fuzzy logic, the compositional rule of
inference, was introduced along with the various interpretations that have been
developed over the years. Finally, some of the decision making strategies that exist in
fuzzy logic for prediction and classification problem domains were described. A simple
example illustrated the potential of fuzzy logic as an accurate and transparent modelling
technique. Real world applications of fuzzy logic were also overviewed. Some of the
concepts presented here, such as linguistic variables and approximate reasoning, will be revisited in Part IV of this book in the context of Cartesian granule features.

4.7 BIBLIOGRAPHY

Baldwin, J. F. (1991). "A Theory of Mass Assignments for Artificial Intelligence", In


IJCAI '91 Workshops on Fuzzy Logic and Fuzzy Control, Sydney, Australia,
Lecture Notes in Artificial Intelligence, A. L. Ralescu, ed., 22-34.
Baldwin, J. F., Martin, T. P., and Pilsworth, B. W. (1988). FRIL Manual. FRIL Systems
Ltd, Bristol, BS8 1QX, UK.
Baldwin, J. F., Martin, T. P., and Pilsworth, B. W. (1995). FRIL - Fuzzy and Evidential
Reasoning in A.I. Research Studies Press(Wiley Inc.), ISBN 0863801595.
Gaines, B. R. (1977). "Foundations of Fuzzy Reasoning", In Fuzzy Automata and
Decision Processes, M. Gupta, G. Saridis, and B. R. Gaines, eds., Elsevier,
North-Holland, 19-75.

Gaines, B. R. (1978). "Fuzzy and Probability Uncertainty Logics", Journal of


Information and Control, 38:154-169.
Kasabov, N. K. (1996). Foundations of neural, fuzzy systems and knowledge
engineering. MIT Press, London.
Klir, G. J., and Yuan, B. (1995). Fuzzy Sets and Fuzzy Logic, Theory and Applications.
Prentice Hall, New Jersey.
Lawry, J. (1998). "A voting mechanism for fuzzy logic", International journal of
approximate reasoning, 19:315-333.
Mamdani, E. H. (1977). "Application of fuzzy logic to approximate reasoning using
linguistic systems", IEEE Trans. on Computing, C-26:1182-91.
Mamdani, E. H., and Assilian, S. (1974). "Applications of fuzzy algorithms for control
of simple dynamic plant", Proc.Institute of Electronic Engineering, 121:1585-
1588.
Mizumoto, M. (1998). "Defuzzification methods", In Handbook of Fuzzy Computation,
E. H. Ruspini, P. P. Bonissone, and W. Pedrycz, eds., Institute of Physics
Publishing Ltd., Bristol, UK, B6.2:1-7.
Mizumoto, M., and Zimmermann, H. J. (1982). "Comparison of fuzzy reasoning
methods", Fuzzy sets and systems, 8:253-83.
Quinlan, J. R. (1986). "Induction of Decision Trees", Machine Learning, 1(1):86-106.
Ralescu, A. L., ed. (1995). "Applied Research in Fuzzy Technology", Kluwer Academic
Publishers, New York.
Ralescu, A. L., and Hartani, R. (1995). "Some issues in fuzzy and linguistic modelling", In the proceedings of Workshop on Linguistic Modelling, FUZZ-IEEE, Yokohama, Japan.
Ralescu, A. L., and Shanahan, J. G. (1999). "Fuzzy perceptual organisation of image structures", Pattern Recognition, 32:1923-1933.
Resher, N. (1969). Many-valued logic. McGraw-Hill, New York.
Ruan, D., and Kerre, E. E. (1993). "Fuzzy implications operators and generalised fuzzy
method of cases", Fuzzy sets and systems, 54(1):23-37.
Ruspini, E. H. (1969). "A New Approach to Clustering", Inform. Control, 15(1):22-32.
Ruspini, E. H., Bonissone, P. P., and Pedrycz, W., eds. (1998). "Handbook of Fuzzy
Computation", Institute of Physics Publishing Ltd., Bristol, UK.
Takagi, T., and Sugeno, M. (1985). "Fuzzy identification of systems and its
applications to modelling and control", IEEE Transactions Systems Man
Cybernetics, 15:116-132.
Terano, T., Asai, K., and Sugeno, M. (1992). Applied fuzzy systems. Academic Press,
New York.
Yen, J., and Langari, R. (1998). Fuzzy logic: intelligence, control and information.
Prentice Hall, London.
Yen, J., and Wang, L. (1998). "Granule-based models", In Handbook of Fuzzy
Computation, E. H. Ruspini, P. P. Bonissone, and W. Pedrycz, eds., Institute
of Physics Publishing Ltd., Bristol, UK, C2.2:1-11.
Zadeh, L. A. (1972). "A fuzzy set theoretical interpretation of hedges", Journal of
Cybernetics, 2:4-34.
Zadeh, L. A. (1973). "Outline of a New Approach to the Analysis of Complex Systems
and Decision Process", IEEE Trans. on Systems, Man and Cybernetics,
3(1):28-44.
Zadeh, L. A. (1975a). "The Concept of a Linguistic Variable and its Application to Approximate Reasoning. Part 1", Information Sciences, 8:199-249.

Zadeh, L. A. (1975b). "The Concept of a Linguistic Variable and its Application to Approximate Reasoning. Part 2", Information Sciences, 8:301-357.
Zadeh, L. A. (1975c). "The Concept of a Linguistic Variable and its Application to Approximate Reasoning. Part 3", Information Sciences, 9:43-80.
Zadeh, L. A. (1987). "Coping with imprecision of the real world: an interview with Lotfi A. Zadeh", In Fuzzy Sets and Applications: Selected Papers by L. A. Zadeh, R. R. Yager, S. Ovchinnikov, R. Tong, and H. T. Nguyen, eds., John Wiley & Sons, New York, 9-28.
Zadeh, L. A. (1994). "Fuzzy Logic, Neural Networks and Soft Computing",
Communications of the ACM, 37(3):77-84.
Zimmermann, H. J. (1996). Fuzzy set theory and its applications. Kluwer Academic
Publishers, Boston, USA.
CHAPTER 5

PROBABILITY THEORY

I consider the word probability as meaning
the state of mind with respect to an assertion,
a coming event, or any other matter on which
absolute knowledge does not exist.
August De Morgan, 1838

As previously noted in Chapter 2, in general, when modelling the real world,


uncertainty abounds. This uncertainty usually results from an incomplete and incorrect
model of the problem domain. Many theories of uncertainty management have been proposed, but one of the oldest is probability theory. Probability theory focuses on
managing uncertainty arising from beliefs or expectations, often referred to as
stochastic uncertainty. Stochastic uncertainty differs from the uncertainty arising from
the imprecision or fuzziness that fuzzy approaches manage (see Chapters 3 and 4 for
details). To highlight the difference between these two types of uncertainty, consider
the following two statements (briefly mentioned in Chapter 3):

(i) It is certain that James was born around the end of the sixties (1960s).
(ii) Probably, James was born in 1967.

In the first statement, the year in which James was born is imprecisely stated, but
certain, whereas in statement (ii), the year is precisely stated, but there is uncertainty
about the statement being true or false. Both aspects can coexist, but are distinct.
Uncertainty arising from imprecision can be very naturally modelled using traditional
set theory and its generalisation - fuzzy set theory and various set-based probabilistic
theories such as possibility theory. On the other hand, uncertainty arising from beliefs
or expectations has been addressed by various theories of probability. These and other
types of uncertainty such as ignorance (facilitated by set-based probability theories such
as Baldwin's mass assignment theory and Dempster-Shafer theory), inconsistency
(facilitated by mass assignment theory) will be discussed over the course of this
chapter.

This chapter focuses on probability theory, and its various generalisations and
specialisations, as a means of representing stochastic uncertainty and imprecision. The
first section reviews the fundamentals of probability theory. Subsequently, three point-
based generalisations and specialisations of probability theory are presented: fully
specified joint probability distributions; naive Bayes classifiers; and Bayesian
networks. This is followed by a presentation of set-based probabilistic techniques:
Dempster-Shafer theory; possibility theory; and mass assignment theory. These set-
based approaches provide semantically richer formalisms than point-based probability


theories, catering not only for uncertainty, but also for ignorance and inconsistency. For
each approach, the respective calculus of operations (inference, decision making,
conjunction, negation, etc.,) is described and the relationships between these modes of
uncertainty representation and fuzzy set theory are also explored. These relationships
facilitate more powerful and expressive forms of knowledge representation and
reasoning, very much in the true synergistic spirit of soft computing. The bi-directional
transformation from a membership value to a point probability is subsequently
described in detail. An intuitive justification and interpretation of this relationship
based on human reasoning (the voting model) is also described. This transformation
forms the basis for new learning algorithms presented in Part IV.

5.1 FUNDAMENTALS OF PROBABILITY THEORY

Probability theory has been commonly used to represent and reason with uncertainty
since the 17th century. Various generalisations and specialisations of probability theory
have been developed in the intervening years. Work in the field of probability theory
can be crudely categorised into one of two schools: the objective school; and the
subjective school. Other interpretations of probability also exist such as the logical
perspective, but are not of interest here. The interested reader is referred to [Smithson
1989]. The objective school of thought takes the view that probability is about events
that one can count, i.e. events directly linked to the world. They use a frequentist definition of probability, defining it as the proportion of times the event occurs out of all possible
events. For example, the probability of a coin showing heads is the proportion of times
that a tossed coin landed heads up out of all tosses (for a sufficiently long sequence of
repeated events). For the subjective school of thought, on the other hand, probabilities
are linked directly to one's opinions about the exact nature of the world derived from
the information available. This school of probability is often referred to as Bayesian or
personal probability. The probability of a hypothesis (for example, a tossed coin
showing heads) is a measure of a person's belief in that hypothesis given the available
evidence (that the coin is fair, in this example). The subjective view of probability is
normally defended in terms of rational betting behaviours [deFinetti 1937]. The degree of belief in a hypothesis should correspond to the odds at which a rational person would be indifferent to betting for or against that hypothesis. For example, a person offers you odds of 2-to-1 that on tossing a coin, heads comes up (that is, for every franc you bet on heads coming up, you can win 2). If the ratio 2:1 does not accurately reflect the world (tossing coins), then one party (either you or the person offering the bet) will be guaranteed to lose money over a series of coin
tosses. Thus, subjective probability theory can be given a rationality in terms of betting
behaviour.

The remainder of this section introduces the basic forms of knowledge representation in
probability theory along with basic axioms and assumptions. In probability theory, from
a knowledge representation perspective, domain specific knowledge is captured in
terms of conditional and unconditional probabilistic propositions, while general
knowledge is represented using inference mechanisms based upon conditioning and
various decision making strategies. Typically, probabilistic propositions take two
formats:

• Unconditional propositions expressed by the canonical form:

p: Pr(X = xi) = prob, hereafter simplified to Pr(xi)

where X is a random variable taking values xi from the universe of values Ωx. This is sometimes known as the frame of discernment. The probability prob corresponds to a value in the unit interval [0, 1]. The value prob expresses a belief or likelihood that the proposition p will be true (i.e. variable X has a value xi) when nothing else is known. A point
probability is associated with each value in the universe. In the literature,
each value assignment corresponds to a proposition, sentence or statement
about the world or problem domain in which X exists. This functional
mapping between domain values and probabilities is called a probability
distribution for discrete variables and a probability density for a
continuous variable. These distributions or densities are denoted as
follows for a variable X: Pr(X).

• Conditional propositions expressed by the canonical form:

r: Pr(X = xi | Y = yj) = prob, hereafter simplified to Pr(xi | yj)

where X and Y are random variables taking values xi and yj from their respective universes Ωx and Ωy. The vertical line "|" is read as "given"; thus, the proposition r can be interpreted as follows: the probability of "variable X having a value xi, given that all that is known is that variable Y has a value yj" is prob. Once again prob corresponds to a value in the unit interval [0, 1]. A point probability Pr(xi | yj) is associated with each possible combination of values from the universes Ωx and Ωy. These conditional distributions are denoted as follows for any two variables X and Y: Pr(X | Y).

Mathematically, probability theory can be defined as a theory of continuous monotonic functions, Pr, such that the following axioms hold:

(i) All probabilities lie in the unit interval [0, 1]:

    0 ≤ Pr(xi) ≤ 1

(ii) Pr(xi) = 1 if and only if event xi is certain.

(iii) The probability of a disjunction of propositions is given by

    Pr(xi ∨ xj) = Pr(xi) + Pr(xj) - Pr(xi ∧ xj)    (5-1)

    where ∨ and ∧ denote the disjunction and conjunction of propositions respectively.

Consequently,

(iv) The probability of the negation of a proposition is given by

    Pr(¬xi) = 1 - Pr(xi)

Below, to keep the presentation lucid, definitions are described in terms of a minimal number of variables, Xi, Xj, Xk, etc., and by and large for the discrete case. For a more general and detailed presentation of probability theory, the reader is referred to [DeGroot 1989; Jensen 1996]. Conditional probabilities can be defined in terms of unconditional probabilities as follows:

Pr(Xi | Xj) = Pr(Xi, Xj) / Pr(Xj)    (5-2)

This can be rewritten as follows, thus leading to the product rule:

Pr(Xi, Xj) = Pr(Xi | Xj) Pr(Xj)    (5-3)

Bayes' rule [Bayes 1763], which is derived via the product rule, is defined as follows:

Pr(Xi | Xj) = Pr(Xj | Xi) Pr(Xi) / Pr(Xj)    (5-4)

Theorem of total probabilities: if the events Xi = x1, ..., Xi = xn are mutually exclusive (that is, only one of these propositions can be true at any one time) with Σ_{k=1..n} Pr(Xi = xk) = 1, then:

Pr(Xj) = Σ_{k=1..n} Pr(Xj | Xi = xk) Pr(Xi = xk)    (5-5)

Independence holds for any two variables Xi and Xj, with i ≠ j, if the following conditions hold:

Pr(Xi | Xj) = Pr(Xi)   and   Pr(Xj | Xi) = Pr(Xj)

In other words, knowing the value of variable Xi does not provide any information as to the value of variable Xj and vice versa. Consequently, the product rule (Equation 5-3) simplifies to

Pr(Xi, Xj) = Pr(Xi) Pr(Xj)

if both variables Xi and Xj are independent.

Two variables Xi and Xj are conditionally independent given a third variable Xk if the following holds:

Pr(Xi | Xj, Xk) = Pr(Xi | Xk)

Conditional independence simplifies Bayesian inference (Equation 5-4) by reducing the number of dependent variables.
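The small numerical check below (using an invented joint distribution over two two-valued variables, not an example taken from the text) verifies Equations 5-2 and 5-4: the conditional probability obtained directly from the joint agrees with the one obtained via Bayes' rule.

```python
joint = {("x1", "y1"): 0.30, ("x1", "y2"): 0.10,
         ("x2", "y1"): 0.20, ("x2", "y2"): 0.40}

def marginal_x(x):
    return sum(p for (xi, _), p in joint.items() if xi == x)

def marginal_y(y):
    return sum(p for (_, yj), p in joint.items() if yj == y)

def cond_x_given_y(x, y):
    """Equation 5-2: Pr(X=x | Y=y) = Pr(X=x, Y=y) / Pr(Y=y)."""
    return joint[(x, y)] / marginal_y(y)

def bayes_x_given_y(x, y):
    """Equation 5-4: recover the same conditional via Bayes' rule."""
    cond_y_given_x = joint[(x, y)] / marginal_x(x)
    return cond_y_given_x * marginal_x(x) / marginal_y(y)

print(cond_x_given_y("x1", "y1"))    # 0.6
print(bayes_x_given_y("x1", "y1"))   # 0.6 as well
```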

5.2 POINT-BASED PROBABILITY THEORY

A number of popular approaches to representing uncertainty using point-based


probabilities have been developed over the years. These include a joint probability
distribution based approach, naive Bayes, and Bayesian networks. This section
overviews each of these approaches, highlighting their strengths, their weaknesses, and
their assumptions.

5.2.1 Joint probability distributions


A special type of unconditional probability distribution, known as the joint probability
distribution or multi-variate distribution, could be defined that assigns a probability to
each possible event for a problem domain. Consider a problem as being represented by
the variables X1, ..., Xn; then a joint probability distribution M can be formally defined over the Cartesian product of the universes of discourse Ωx1 × ... × Ωxn of the variables X1, ..., Xn via its probability function

Pr(X1, ..., Xn): Ωx1 × ... × Ωxn → [0, 1]

where each possible combination of variable values <x1, ..., xn> is assigned a probability value.

Inference for systems defined in terms of joint probability distribution (known as the
prior distribution since it is specified prior to inference), and in general for probabilistic
systems, is performed using a conditioning or updating operation. Here, when new
evidence, such as Xk = x, becomes available, inference can be performed using
Equation 5-2 in order to get an updated probability, known as the posterior probability,
for the events that may be of interest or relevant.

Decision making, in the discrete case, can be achieved using a number of mechanisms.
Having inferred a posterior probability for each possible outcome given the evidence
(or just reading the probabilities directly from the joint when no evidence is available),
one decision making approach could be to choose the hypothesis that has the highest
associated (posterior) probability. An alternative approach would be to multiply each
posterior probability by the utility value of the respective outcome and simply choose
the outcome that maximises the resulting expected utility [Lindley 1985]. Various
alternatives exist for decision making in the context of prediction problems (i.e. the
output or dependent variable is continuous). These are discussed in Section 6.3 in the
context of probabilistic reasoning in the Fril programming environment.

In general, it is not possible to define a complete joint probability distribution for a


problem due to its exponential size; for example, an n boolean variable system requires the specification (and storage) of 2^n - 1 entries in order to define the joint probability distribution. Modern probabilistic systems sidestep or avoid the joint distribution by
using Bayes' rule (Equation 5-4). Bayes' rule [Bayes 1763] enables efficient inference,
in terms of conditional distributions, thus avoiding the mammoth requirement of a joint

probability distribution that is required for inference using Equation 5-2. Different
types of independence further simplify the representation and inference process in
probabilistic systems by reducing the number of dependent variables. Several
approaches to representing uncertainty using point-based probabilities, which exploit
Bayes' theorem and independence, have been developed; these are presented next.

5.2.2 Naive Bayes


The naive Bayes algorithm [Duda and Hart 1973; Good 1965; Langley, Iba and
Thompson 1992] can model both prediction and classification systems. This section
presents the approach from a classification perspective. See [Barrett and Woodall 1997]
for an example of a naive Bayes approach to prediction problems or alternatively, see
Sections 6.2 and 6.3 where a fuzzy set-based system can be rewritten in terms of
probabilities resulting in a naive Bayesian approach. Consider a classification problem,
where the target function Y = f(x) models a dependency between a target variable Y and a set of input variables X1, ..., Xn. The target variable Y is discrete, taking values from the finite set {y1, ..., yc}. The naive Bayes classifier accepts as input a tuple of values <x1, ..., xn> and predicts the target value y, or a classification, for this tuple. It uses Bayes' theorem (Equation 5-4) in order to perform inference. Consequently, the problem is represented in terms of class conditional probability distributions and class probability distributions, where the class conditionals correspond to

Pr(X1, ..., Xn | Y)

and the class probability distribution corresponds to

Pr(Y).

However, within the naive Bayesian framework a simplifying assumption is introduced,


sometimes known as the naive assumption, where the input variables are assumed to be
conditionally independent given the target value. As a result, the class conditionals
reduce to

Pr(Xi | Y).

Thus, inference (calculation of the posterior probabilities given evidence) using Bayes' theorem simplifies from

Pr(Y = yi | <X1 = x1, ..., Xn = xn>) = Pr(<X1 = x1, ..., Xn = xn> | Y = yi) Pr(Y = yi) / Pr(<X1 = x1, ..., Xn = xn>)

to the following:

Pr(Y = yi | <X1 = x1, ..., Xn = xn>) = Pr(Y = yi) Π_{j=1..n} Pr(Xj = xj | Y = yi) / Pr(<X1 = x1, ..., Xn = xn>)    (5-6)

Decision making consists of taking the classification value ymax whose corresponding posterior probability is the maximum amongst all posterior probabilities Pr(yi | <x1, ..., xn>) for all values yi ∈ Ωy. This is mathematically stated as follows:

Class(<x1, ..., xn>) = ymax = argmax_{yi ∈ Ωy} Pr(yi | <x1, ..., xn>)

Since, in this decision making strategy, the denominator in Equation 5-6 is common to all posterior probabilities, it can be dropped from the inference process. This further simplifies the reasoning process (and the representation also) to the following:

Class(<x1, ..., xn>) = argmax_{yi ∈ Ωy} Pr(Y = yi) Π_{j=1..n} Pr(Xj = xj | Y = yi)

As a result of making the naive assumption, the number of class conditional


probabilities that need to be provided reduces from being exponential in the number of
variables to being polynomial. This assumption, while unlikely to be true in most
problems, generally provides a surprisingly high performance that has been shown to be
comparable to other classification systems such as logic (decision trees) and neural
networks [Langley, Iba and Thompson 1992].
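A minimal naive Bayes sketch following Equation 5-6 and the argmax decision rule above is shown below; the class priors and class-conditional tables are invented purely for illustration.

```python
priors = {"y1": 0.6, "y2": 0.4}

# conditionals[class][variable][value] = Pr(Xj = value | Y = class)  (illustrative)
conditionals = {
    "y1": {"X1": {"a": 0.7, "b": 0.3}, "X2": {"s": 0.2, "t": 0.8}},
    "y2": {"X1": {"a": 0.1, "b": 0.9}, "X2": {"s": 0.6, "t": 0.4}},
}

def classify(observation):
    """Return the class maximising Pr(y) * prod_j Pr(xj | y); the common
    denominator of Equation 5-6 is dropped as discussed above."""
    scores = {}
    for y, prior in priors.items():
        score = prior
        for var, val in observation.items():
            score *= conditionals[y][var][val]
        scores[y] = score
    return max(scores, key=scores.get), scores

print(classify({"X1": "a", "X2": "t"}))   # y1 wins: 0.6*0.7*0.8 vs 0.4*0.1*0.4
```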

5.2.3 Bayesian networks


So far, two extreme ways of modelling uncertainty using point-based probability
approaches have been described: one using a fully specified joint probability
distribution; and the other using the naive Bayes approach. While the approach using a
joint probability distribution is very attractive, as it can answer any question about a
domain, it quickly becomes intractably large as the number of variables grows. In
contrast, the naive Bayes approach, while lacking the accuracy of a fully specified joint
probability, is attractive from a tractability perspective due to its compact nature that
results from the conditional independence assumption. This section presents an
intermediate form of knowledge representation based upon a belief network. Belief
networks are also referred to by other terms in the literature including Bayesian
networks and causal networks. Bayesian networks [Good 1961; Lauritzen and
Spiegelhalter 1988; Pearl 1986; Pearl 1988] provide a concise and relatively accurate
means of specifying a joint probability distribution in terms of a directed acyclic graph
of nodes, exploiting different types of independence where possible (but not
indiscriminately assuming it, as was the case in naive Bayes). The main motivation behind the introduction of Bayesian networks was to "make intentional systems operational by making relevance relationships explicit" [Pearl 1988]. In a Bayesian
network framework, the problem of eliciting massive joint probability distribution
tables is reduced to that of eliciting the conceptually much more meaningful conditional
probabilities between semantically related propositions. To quote Pearl:

In a sparsely connected world like ours, it is fairly clear that probabilistic knowledge, in both man and machine, should not be represented as entries in a giant joint distribution table, but rather by a network of low order probabilistic relationships between small clusters of semantically related propositions [Pearl 1986]

A Bayesian network is represented as a directed acyclic graph (i.e. has no directed


cycles), where each node corresponds to a variable in the problem domain, and where
each directed arc between nodes represents a dependency between the corresponding
domain variables. A Bayesian network is formally defined as a tuple <N, E> where N denotes the nodes making up the graph and E is a binary relation on N encoding the edges of the graph. The nodes in N correspond to the variables in a problem domain, i.e. N = {X1, ..., Xn}. Figure 5-1 graphically depicts a possible belief network, where N consists of five binary variables. The directed arcs, denoted by E, represent a causal or probabilistic relationship between the source variable and target variable. For example, consider the nodes Alarm and Earthquake in Figure 5-1, where the variable Alarm is conditionally dependent on the variable Earthquake. Each node in a belief network is associated with a probability distribution - expressed in tabular format in Figure 5-1; this is either an unconditional probability distribution (corresponding to root nodes such as Burglary or Earthquake in Figure 5-1) or a conditional distribution representing
dependencies.

Figure 5-1: An example of a belief network adapted from [Russell and Norvig 1995].

A belief network provides a complete description of the domain, i.e. every entry in the joint probability distribution Pr(x1, ..., xn) can be calculated from the information in the network. This follows from the fact that the joint distribution can be rewritten in terms of a conditional probability and a smaller conjunction using the product rule:

Pr(x1, ..., xn) = Pr(xn | xn-1, ..., x1) Pr(xn-1, ..., x1)

Recursively repeating this process of reducing each conjunctive probability to a conditional probability and a smaller conjunction ultimately leads to the following:

Pr(x1, ..., xn) = Π_{i=1..n} Pr(xi | xi-1, ..., x1)    (5-8)

Equation 5-8 is commonly referred to as the chain rule in the literature. Probabilities
encoded in the nodes of a Bayesian network denote conditionals of the form

Pr(Xi | Parents(Xi)) = Pr(Xi | Xi-1, ..., X1)   where 1 ≤ i ≤ n    (5-9)

where the nodes are suitably labelled in any order consistent with the partial ordering
implicit in the graph structure. Thus, the conditionals in Equation 5-8 can be substituted by the conditionals explicitly represented in the Bayesian network, and inference reduces to the product of conditionals (in their prior or posterior states, depending on whether the conditionals are dependent on the evidence or not). Bayesian networks
further exploit independence in the following ways, thereby simplifying the inference
process:

• Consider the nodes numbered 1, 2, and 3 in Figure 5-1. In order to


simplify the presentation, the node variables are referred to by their
associated number rather than their associated name. In the case of this
sub-graph, variables 1 and 2 are marginally independent, but conditionally dependent given variable 3. Application of the chain rule to the probability distribution Pr(1, 2, 3) gives

Pr(1, 2, 3) = Pr(3 | 1, 2) Pr(1 | 2) Pr(2)

Since variables 1 and 2 are marginally independent, Pr(1 | 2) = Pr(1). Thus, for this sub-graph

Pr(1, 2, 3) = Pr(3 | 1, 2) Pr(1) Pr(2)

• Consider nodes numbered 3, 4, and 5 in Figure 5-1. In the case of this


sub-graph, variables 4 and 5 are conditionally independent given variable
3. Application of the chain rule to the probability distribution Pr(3, 4, 5)
gives

Pr(3, 4, 5) = Pr(5 | 3, 4) Pr(4 | 3) Pr(3)

Since 4 and 5 are conditionally independent given 3, Pr(5 | 3, 4) = Pr(5 | 3). Therefore, for this sub-graph

Pr(3, 4, 5) = Pr(5 | 3) Pr(4 | 3) Pr(3)

• Consider nodes numbered 1, 3, and 5 in Figure 5-1. In the case of this sub-graph, variables 1 and 5 are conditionally independent given variable 3. Thus

Pr(1, 3, 5) = Pr(5 | 1, 3) Pr(3 | 1) Pr(1)

reduces to

Pr(1, 3, 5) = Pr(5 | 3) Pr(3 | 1) Pr(1)

• The anterior nodes of a node are the set of nodes that cannot be reached via a directed path. For example, the anterior nodes of node 5 in Figure 5-1 are {1, 2, 3, 4}. The probability of a node given its parents is independent of its anteriors. For example, Pr(5 | 1, 2, 3, 4) = Pr(5 | 3).

Consequently, the probability distribution of a large set of variables can be represented


by a product of conditional relationships between small clusters of "semantically
related propositions".
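As a sketch of how a network with the structure of Figure 5-1 (two root nodes feeding a common child, which in turn has two children) factorises the joint distribution via Equations 5-8 and 5-9, the fragment below multiplies the node conditionals together; all numerical values are invented placeholders and are not those shown in the figure.

```python
p1 = {True: 0.01, False: 0.99}                      # Pr(1)  (invented)
p2 = {True: 0.02, False: 0.98}                      # Pr(2)  (invented)
p3 = {(True, True): 0.95, (True, False): 0.90,      # Pr(3 = true | 1, 2)  (invented)
      (False, True): 0.30, (False, False): 0.001}
p4 = {True: 0.90, False: 0.05}                      # Pr(4 = true | 3)  (invented)
p5 = {True: 0.70, False: 0.01}                      # Pr(5 = true | 3)  (invented)

def joint(v1, v2, v3, v4, v5):
    """Pr(1, 2, 3, 4, 5) = Pr(1) Pr(2) Pr(3 | 1, 2) Pr(4 | 3) Pr(5 | 3)."""
    pr3 = p3[(v1, v2)] if v3 else 1 - p3[(v1, v2)]
    pr4 = p4[v3] if v4 else 1 - p4[v3]
    pr5 = p5[v3] if v5 else 1 - p5[v3]
    return p1[v1] * p2[v2] * pr3 * pr4 * pr5

print(joint(True, False, True, True, True))   # one entry of the joint distribution
```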

Bayesian networks possess many desirable characteristics, including very flexible reasoning mechanisms. Four types of reasoning are possible in belief networks:

• Diagnostic inference (from effects to cause)


• Causal inference (from cause to effects)
• Intercausal inference (between causes to a common effect)
• Mixed inference (combining two or more of the above)

These four patterns of reasoning are depicted in Figure 5-2. See [Russell and Norvig
1995] for a more complete description of these reasoning patterns. As a result of this
flexibility, the general problem of inference in Bayesian networks has been shown to be
NP-hard, but recent work has resulted in efficient algorithms that allow exact inference
[Lauritzen and Spiegelhalter 1988; Pearl 1986; Pearl 1988] (exploiting research results
in graph theory and mathematical representations of probability distributions) and
approximate inference based upon simulations and bounding techniques which sacrifice
precision for efficiency [Russell and Norvig 1995]. The main ideas for inference in
Bayesian networks have been described here, but due to space limitations the
technicalities of the various exact and approximate inference algorithms are not
presented. However, the interested reader is referred to [Jensen 1996; Krause and Clark
1993; Russell and Norvig 1995] for excellent tutorial level presentations of these
inference algorithms.

Decision making in the case of Bayesian networks is similar to that for naive Bayes and
fully specified joint probability distributions, in that it reduces to taking the hypothesis
associated with the maximum posterior probability or maximum expected utility. See
Section 5.2.1 for more details.

5.3 SET-BASED PROBABILITY THEORY

For a long time point-based probability theory was the only way of expressing
uncertainty. As seen in the previous section, probability theory is a form of knowledge
representation that allows uncertainties to be represented by associating a probability

with all possible values for a variable (or group of variables). This would say that one
value is more likely than another. Inference is based upon conditioning. Point-based
probability theory, while being an intuitive way of representing uncertainty, does not
cater for other areas of incompleteness in knowledge representation such as ignorance
and inconsistency. In order to address these forms of uncertainty, generalisations of
point-based probability theory from point functions to set functions have been
developed. This has resulted in the introduction of Dempster-Shafer theory [Dempster
1967; Shafer 1976], possibility theory [Zadeh 1978], and mass assignment theory
[Baldwin 1991b]. Subsequently, this section presents an overview of Dempster-Shafer
theory, possibility theory, and mass assignment theory. Formal connections between
fuzzy set theory and these theories are described and illustrated with examples as part
of this overview.


Figure 5-2: Examples of reasoning patterns that can be handled by Bayesian networks.
E represents an evidence variable and Q is a query variable.

5.3.1 Dempster-Shafer theory


Dempster-Shafer theory is a relatively new theory of uncertainty developed around the
idea of set-based probability functions [Dempster 1967; Shafer 1976]. Dempster-Shafer
theory is also known as evidence theory or belief theory. It is a semantically richer
formalism than point-based probability theories, catering not only for stochastic
uncertainty but also for ignorance. This richness arises mainly because probability
functions in Dempster-Shafer theory are set-based, rather than point-based or singleton-
based as in traditional probability theory (see previous section). Variables in Dempster-
Shafer theory can assume set values. For example, a variable X could be assigned a set
value X = {Xf, X3, xs} from the possible values that X could assume from the universe
Qx. This assignment could be logically interpreted as a disjunction of elementary
propositions stating that either the proposition "the value of X is XI" is true or "the
value of X is X2" is true or "the value of X is x3"is true. In Dempster-Shafer theory there
is an intuitive correspondence between set operations and logic operations: set union
can be interpreted as a disjunction of propositions; set intersection as conjunction; set
inclusion as implication; and set complement as negation. The universe of discourse of
a variable in Dempster-Shafer theory is referred to as the frame of discernment.

In Dempster-Shafer theory, belief (probability mass) may be assigned to sets of



propositions without there being a necessary requirement to distribute the mass with
finer granularity among the individual propositions in the set. This allows a form of
ignorance. For example, consider that it is known for certain that a six-faced die after
being rolled has a value which is even, whilst being totally ignorant as to which of the
set of possible even numbers {2, 4, 6} it is. In Dempster-Shafer theory this information
would be represented as a probability distribution over the elements of the power set of
the frame of discernment. In terms of the die example, this probability assignment
would consist of a single element {2, 4, 6} and an associated probability mass of 1. This is denoted as follows: <{2, 4, 6}:1>. Suppose that an available expert subsequently testifies that with 90% certainty the die is fair (i.e. he is 90% sure that the probability of the die showing any value is 0.1667). Then Dempster-Shafer theory gives the following updated probability assignment: <{2, 4, 6}:0.1, {2}:0.3, {4}:0.3, {6}:0.3>, where the belief mass has been redistributed according to the expert's information.

The principal mode of knowledge representation in Dempster-Shafer theory is the set-


based probability function known as the basic probability assignment (bpa). A basic
probability assignment is sometimes referred to as a body of evidence.
Mathematically, a basic probability assignment m for a variable X defined over the
universe Ωx is a function from P(X), the power set of Ωx, to the unit interval [0, 1]:

m: P(X) → [0, 1]

that satisfies the following conditions:

(i) m(∅) = 0

(ii) Σ_{A∈P(X)} m(A) = 1

Every set A ∈ P(X) for which m(A) > 0 is called a focal element of m. Basic probability assignments are denoted with the letter m qualified by its associated name (e.g. the basic probability assignment for the concept even is denoted by mEven) and when Ωx (the universe on which a bpa m is defined) is finite, m can be fully characterised by a list of its focal elements Ai with the corresponding belief mass m(Ai) as follows: <Ai:m(Ai)>. For example, the bpa for large die numbers could be mLarge = <{5, 6}:0.8, {3, 4, 5, 6}:0.2>. Alternatively, bpas can be functionally denoted (for both discrete and continuous universes). Consider a frame of discernment Ωx = {x1, x2, x3, x4, x5}. A bpa m, representing a mass for each A ∈ P(X), can be written as follows:

(5-10)

A bpa assigns probabilities at a coarser granularity than traditional probability theory, i.e. probabilities (also known as belief mass) are assigned to sets of values (or disjunctions of propositions), the focal elements, while remaining ignorant of how the

mass should be assigned to the individual elements, the singleton sets. The union of the
focal elements forms the core of the bpa. Contrary to axioms of point-based probability
theory, the probability of the negation of a proposition cannot be derived from the
proposition, i.e.

Pr(¬A) ≠ 1 - Pr(A)

A bpa can be viewed as a form of knowledge representation that expresses upper and lower probability measures for every set A ∈ P(X), i.e. a probability interval. These lower and upper probability measures are known as belief and plausibility measures respectively.

Belief measure: Given a bpa m for a variable X defined over the universe Ωx, a unique belief measure for every set A ∈ P(X) can be determined as follows:

Bel(A) = Σ_{B|B⊆A} m(B)    (5-11)

that has the following properties:

(i) Bel(∅) = 0 and Bel(Ωx) = 1

(ii) Bel(Ai ∪ Aj) ≥ Bel(Ai) + Bel(Aj) - Bel(Ai ∩ Aj)

Property (ii) states that belief measures are super-additive with regards to set union,
which is a weaker version of the additive property of point-based probability measures
(Equation 5-1).

Plausibility measure: A plausibility measure can similarly be defined as follows:

Pl(A) = Σ_{B|B∩A≠∅} m(B)    (5-12)

that has the following properties:

(i) Pl(∅) = 0 and Pl(Ωx) = 1

(ii) Pl(Ai ∩ Aj) ≤ Pl(Ai) + Pl(Aj) - Pl(Ai ∪ Aj)

Property (ii) states that plausibility measures are sub-additive, a weaker version of the additive property of point-based probability measures. Plausibility measures are duals of belief measures since:

Pl(A) = 1 - Bel(¬A)

Belief or plausibility measures can be calculated from a bpa m as shown above. The inverse is also possible. Given, for example, a belief measure Bel, the corresponding unique bpa m is determined for all A ∈ P(X) by the following formula:

m(A) = Σ_{B|B⊆A} (-1)^|A-B| Bel(B)

Total ignorance for a variable X is expressed in terms of the following bpa:

m = <Ωx:1>

i.e. all mass is assigned to the set of values made up of the frame of discernment Ωx and all other A ∈ P(X) are assigned a zero mass.
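A minimal sketch computing the belief and plausibility measures of Equations 5-11 and 5-12 from a bpa is shown below, using the "large die numbers" bpa mLarge quoted earlier in this section.

```python
m_large = {frozenset({5, 6}): 0.8, frozenset({3, 4, 5, 6}): 0.2}

def belief(m, A):
    """Bel(A): sum of m(B) over focal elements B that are subsets of A (Eq. 5-11)."""
    return sum(mass for B, mass in m.items() if B <= A)

def plausibility(m, A):
    """Pl(A): sum of m(B) over focal elements B that intersect A (Eq. 5-12)."""
    return sum(mass for B, mass in m.items() if B & A)

A = frozenset({5, 6})
print(belief(m_large, A), plausibility(m_large, A))   # 0.8 and 1.0
print(plausibility(m_large, frozenset({3})))          # 0.2
```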

Inference within a Dempster-Shafer framework is based upon an evidence combination


operation, belief revision and conditioning operations. The rule of evidence
combination introduced originally by Dempster [Dempster 1967] aggregates evidence
from independent sources, expressed in terms of basic probability assignments, defined
over the same universe of discourse Ox. This aggregation results in a new bpa.

Dempster's rule of combination: Given two bpas m1 and m2 defined over the same universe of discourse Ωx and originating from independent sources (e.g. from two experts), the aggregation of these two bpas m1 and m2 results in a new bpa m1 ⊕ m2, where the mass associated with each A ∈ P(X) is calculated using Dempster's rule of combination as follows:

m1 ⊕ m2(A) = Σ_{B∩C=A} m1(B) m2(C) / (1 - Σ_{B∩C=∅} m1(B) m2(C))    (5-13)

The numerator in Equation 5-13 corresponds to the sum over all conjunctions of
arguments (intersection) that support A. The mass associated with each argument mlB)
and miC) is combined using product. This is exactly the same way in which the joint
probability distribution is calculated from two independent marginals (point-based
probability theory); consequently, it is justified on the same grounds. The denominator
is the normalisation coefficient obtained from the mass assigned to the null set or
contradictory information. This normalisation coefficient has been a contentious issue,
sometimes leading to undefined results (in the case when the cores of m1 and m2 are
disjoint) and sometimes to counter-intuitive results especially when the two pieces of
evidence m1 and m2 are highly conflicting [Zadeh 1986]. The use of normalisation
corresponds to applying the closed world assumption: the truth must lie somewhere in
the Boolean algebra of propositions derived from the frame of discernment Ωx. A
deeper problem with Dempster's rule is its discontinuous nature in the neighbourhood
of total conflict [Krause and Clark 1993].

As an example of the use of Dempster's rule of combination, consider the following two bpas, mSmall and mAbout_2, defined with respect to the universe of die values {1, ..., 6}. These bpas are provided by two die experts independently of each other:

mSmall = <{1}:0.4, {1, 2}:0.5, {1, 2, 3}:0.1>

mAbout_2 = <{2}:0.4, {1, 2, 3}:0.6>

The combined body of evidence mSmall ⊕ mAbout_2 is calculated using Equation 5-13 and the following matrix:

mSmall \ mAbout_2     {2}: 0.4          {1, 2, 3}: 0.6
{1}: 0.4              ∅ = 0.16          {1} = 0.24
{1, 2}: 0.5           {2} = 0.2         {1, 2} = 0.3
{1, 2, 3}: 0.1        {2} = 0.04        {1, 2, 3} = 0.06

This results in the following combined bpa:

mSmall ⊕ mAbout_2 = <{1}:0.285, {2}:0.285, {1, 2}:0.358, {1, 2, 3}:0.072>

where the masses are calculated as follows:

mSmall ⊕ mAbout_2({1}) = 0.24/(1 - 0.16) = 0.24/0.84 = 0.285
mSmall ⊕ mAbout_2({2}) = (0.2 + 0.04)/(1 - 0.16) = 0.24/0.84 = 0.285
mSmall ⊕ mAbout_2({1, 2}) = 0.3/(1 - 0.16) = 0.358
mSmall ⊕ mAbout_2({1, 2, 3}) = 0.06/(1 - 0.16) = 0.072
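The combination above is straightforward to mechanise. The following sketch (in Python, with frozensets standing in for focal elements; the function and variable names are illustrative rather than part of any Fril library) implements Equation 5-13 and reproduces, up to rounding, the masses calculated above.

from itertools import product

def dempster_combine(m1, m2):
    """Combine two bpas (dicts mapping frozenset focal elements to masses)
    using Dempster's rule of combination (Equation 5-13)."""
    combined = {}
    conflict = 0.0
    for (b, mb), (c, mc) in product(m1.items(), m2.items()):
        a = b & c                      # intersection of the two focal elements
        if a:                          # non-empty intersections support A
            combined[a] = combined.get(a, 0.0) + mb * mc
        else:                          # mass assigned to the empty set = conflict
            conflict += mb * mc
    if conflict >= 1.0:
        raise ValueError("total conflict: Dempster's rule is undefined")
    return {a: mass / (1.0 - conflict) for a, mass in combined.items()}

m_small   = {frozenset({1}): 0.4, frozenset({1, 2}): 0.5, frozenset({1, 2, 3}): 0.1}
m_about_2 = {frozenset({2}): 0.4, frozenset({1, 2, 3}): 0.6}

for focal, mass in dempster_combine(m_small, m_about_2).items():
    print(sorted(focal), round(mass, 3))   # approximately {1}:0.286, {2}:0.286, {1,2}:0.357, {1,2,3}:0.071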

From a computational perspective, Dempster's rule quickly becomes intractable when evidence from multiple sources needs to be combined; however, as with Bayesian networks, various efficient approaches (exact and approximate) have been proposed. For an overview of these approaches, see [Krause and Clark 1993].

Various evidence conditioning and belief revision operations have been proposed
within the Dempster-Shafer theory of uncertainty [Kalvi 1993; Krause and Clark 1993;
Kruse, Schwecke and Heinsohn 1991]. They allow the updating of probability masses
in the light of new information that becomes available and that is certain, i.e.
the evidence is absolutely reliable, but imprecise. This evidence corresponds to a bpa
with one focal element with an associated belief mass of 1. The conditioning operation
for updating a bpa m given new evidence E is commonly defined as follows:

m(A | E) = m(A) / Bel_m(E)   if A ⊆ E
m(A | E) = 0                 otherwise

where both A and E are elements of P(X), the power set of the frame of discernment Ωx, and Bel_m(E) denotes the belief of E based on m and Equation 5-11. For example, consider a bpa m that constrains the values of variable X, which is defined on the universe Ωx = {a, b, c, d, e}. m could be defined as follows:

m = <{a}:0.6, {a, b, c}:0.2, {a, c, d}:0.2>

and then upon receiving "certain" information that the value of X lies in the subset {a,
b, c}, the conditional mass distribution m given evidence E is calculated as follows:

Bel_m({a, b, c}) = 0.6 + 0.2 = 0.8 (based on m).

m({a} | {a, b, c}) = 0.6/0.8 = 0.75
m({a, b, c} | {a, b, c}) = 0.2/0.8 = 0.25
m({a, c, d} | {a, b, c}) = 0

resulting in the following updated mass assignment m:

m = <{a}:0.75, {a, b, c}:0.25>

This approach to updating has the effect of transferring the mass (rescaled) to the focal elements of the original bpa that are subsumed by the new evidence. The resulting mass assignment can then be used to calculate the corresponding belief and plausibility measures; alternatively, the conditional belief measure can be calculated directly from the evidence as follows:

Bel_{m(·|E)}(A | E) = Bel_m(A ∩ E) / Bel_m(E)

with an analogous expression for the corresponding plausibility measure.

A full discussion of other inference patterns in Dempster-Shafer theory is presented in


[Kruse, Schwecke and Heinsohn 1991].

There are a number of approaches to decision making in Dempster-Shafer theory. The


most straightforward approach is to convert Dempster-Shafer beliefs into point
probabilities for each proposition. Having done this, one can adapt the approach used in
point probability theory and choose the proposition with the maximum posterior
probability or with the maximum expected utility as described in Section 5.2.1. There
has been a lot of discussion of how one should generate a point-valued probability
distribution from a set-valued probability distribution (i.e. a bpa). However, one of the most commonly used (and simplest) approaches is based upon a generalisation of the
principle of insufficient reason. Simply presented the principle of insufficient reason
[Dubois and Prade 1982] states that in the absence of further information, belief in a set
of mutually exclusive propositions should be evenly distributed amongst those
propositions. A generalised version of this principle, the generalised insufficient reason
principle (applied locally to the focal elements), in the context of a bpa states that
probability masses in a bpa should be evenly distributed amongst the elementary
propositions that make up the focal elements, if a decision needs to be made [Smets
1990]. Consequently, the point-valued probability Pr(A) associated with a proposition
A (i.e. A is a singleton) is the sum of the probabilities that were assigned to A as a result
of A being a part of a focal element in bpa m. In other words:

Pr(A) = Σ_{B | A ⊆ B} m(B) · |A ∩ B| / |B|,   or simply   Pr(A) = Σ_{B | A ⊆ B} m(B) / |B|

where |·| denotes set cardinality. This transformation from belief masses, referred to as
the credal level by Smets, to point probabilities, termed the pignistic level by Smets,
plays an integral role in the transferable belief model proposed by Smets [Smets 1990;
Smets 1994].
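As a minimal illustration of this transformation (a sketch assuming a uniform prior, and reusing the bpa mSmall from the die example above; the function name is illustrative), the mass of each focal element is simply shared equally amongst its members:

def to_point_probabilities(bpa):
    """Distribute each focal element's mass evenly amongst its members
    (generalised insufficient reason / pignistic transformation)."""
    prob = {}
    for focal, mass in bpa.items():
        share = mass / len(focal)
        for x in focal:
            prob[x] = prob.get(x, 0.0) + share
    return prob

m_small = {frozenset({1}): 0.4, frozenset({1, 2}): 0.5, frozenset({1, 2, 3}): 0.1}
print(to_point_probabilities(m_small))   # approximately {1: 0.683, 2: 0.283, 3: 0.033}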

5.3.2 Possibility theory


Possibility theory is based upon a refinement of Dempster-Shafer theory (D-S theory),
where the focal elements of a basic probability assignment are nested [Dubois and
Prade 1988; Zadeh 1978]. As in D-S theory, probabilistic measures and updating
operations are defined. In possibility theory the counterparts of belief and plausibility
measures are the necessity and possibility measures.

In possibility theory, as in D-S theory, the principal mode of representing domain


specific knowledge is the set-based probability function known as the basic probability
assignment (bpa) whose focal elements are nested or consonant. For completeness a
body of evidence, m, is redefined here, from a possibility theory standpoint, for a
variable X defined over the universe Ωx as a function from P(X), the power set of Ωx, to the unit interval [0, 1]

m: P(X) → [0, 1]

that satisfies the following conditions:

(i) m(∅) = 0
(ii) Σ_{A ∈ P(X)} m(A) = 1

(iii) Focal elements are nested. Focal elements A ∈ P(X) are linearly ordered according to the subset relationship ⊂. Consequently, for a bpa of the form <A1:m1, A2:m2, ..., An:mn> the following ordering between focal elements holds: A1 ⊂ A2 ⊂ ... ⊂ An

For example, consider two bpas, m1 and m2, that are defined over the universe {a, b, c, d, e} as follows:

m1 = <{a}:0.3, {a, b}:0.5, {a, b, d, e}:0.2>

m2 = <{a}:0.3, {a, b, c}:0.5, {a, b, d, e}:0.2>

m1 corresponds to a nested body of evidence, whereas m2 does not qualify as a nested


body of evidence since it violates condition (iii) above.

Bodies of evidence in possibility theory can be viewed as a form of knowledge that expresses upper and lower probability measures for the individual elements of the frame of discernment. These upper and lower probability measures, corresponding to the necessity measure, Nec, and the possibility measure, Pos, are defined for any body of evidence, m, as follows (the equations for Nec and Pos are identical to Bel (Equation 5-11) and Pl (Equation 5-12) respectively):

Nec(A) = Σ_{B | B ⊆ A} m(B)   (5-14)

Pos(A) = Σ_{B | B ∩ A ≠ ∅} m(B)   (5-15)

∀A ∈ P(X), the power set of Ωx on which m is defined.

As a consequence of the nested nature of the focal elements that make up a body of
evidence in possibility theory, the following properties of necessity and possibility
measures hold for any two focal elements A and B E P(X) [Klir and Yuan 1995]:

(i) Nec(A ∩ B) = min[Nec(A), Nec(B)]
(ii) Pos(A ∪ B) = max[Pos(A), Pos(B)]
(iii) Nec(A) + Nec(¬A) ≤ 1
(iv) Pos(A) + Pos(¬A) ≥ 1
(v) Nec(A) = 1 - Pos(¬A)

Every possibility measure Pos can be alternatively and uniquely represented by a corresponding point-valued function -- a possibility distribution function. A possibility distribution function π is formally defined as a mapping from each element x ∈ Ωx (the universe of discourse) to the unit interval as follows:

π: Ωx → [0, 1]

A possibility function Pos is related to a possibility distribution function π as follows [Klir and Yuan 1995]:

• π(x) = Pos({x}) for all x ∈ Ωx   (5-16)

• Pos(A) = max_{x ∈ A} π(x)   (5-17)

When the frame of discernment is infinite, sup is used in place of max in Equation 5-17.
Consequently, given a nested body of evidence m, it is possible to directly generate the
corresponding possibility distribution using Equation 5-16. This is described
subsequently.

In possibility theory, as presented above, the focal elements of a probability assignment are nested, i.e. the focal elements A ∈ P(X) are linearly ordered according to the subset relationship ⊂:

A1 ⊂ A2 ⊂ ... ⊂ An, where Ai = {x1, ..., xi}

Consequently,

m(A) = 0 ∀ A ≠ Ai, i ∈ {1, ..., n},

however, it is not required that

m(Ai) ≠ 0 ∀ i ∈ {1, ..., n}.

This ordering amongst focal elements permits the representation of basic probability assignments on a finite frame of discernment in a convenient form, as an n-tuple. The tuple, written as m = <m1, m2, ..., mn>, represents <m(A1), m(A2), ..., m(An)>.
Since π(xi) = Pos({xi}) (see Equation 5-16), π(xi) can be simply calculated as follows:

π(xi) = Pos({xi}) = Σ_{k=i}^{n} m(Ak) = Σ_{k=i}^{n} mk   (5-18)

This permits the calculation of π(xi) from a basic probability assignment. The reverse is also possible; given a possibility distribution, it is possible to determine uniquely the associated basic probability assignment. This becomes obvious when Equation 5-18 is expanded for each xi in the frame of discernment Ωx as follows:

π(x1) = m1 + m2 + ... + mi + mi+1 + ... + mn
π(x2) =      m2 + ... + mi + mi+1 + ... + mn
...
π(xi) =                 mi + mi+1 + ... + mn

Solving these equations for each mi (the belief mass for each focal element) reduces the calculation of the basic probability assignment from the corresponding possibility distribution to the following:

mi = π(xi) - π(xi+1)   ∀ xi ∈ Ωx, with π(xn+1) = 0   (5-19)

For example, consider the possibilistic bpa m = <{a}:0.3, {a, b}:0.5, {a, b, d, e}:0.2> defined over the universe {a, b, c, d, e}. Its corresponding possibility distribution π can be calculated by, firstly, writing the basic probability assignment in tuple format as follows:

<0.3, 0.5, 0, 0.2, 0>

Applying Equation 5-18 to this tuple results in the possibility distribution <π(a), π(b), ..., π(e)> as follows:

<1, 0.7, 0.2, 0.2, 0>

where, for instance, π(b) is calculated as follows:

π(b) = m2 + ... + m5
     = 0.5 + 0 + 0.2 + 0
     = 0.7

Possibility measures Pos(A) can be calculated easily from the possibility distribution using Equation 5-17. Considering the previous example, Pos({a, b}) can be calculated as follows:

Pos({a, b}) = max(π(a), π(b))
            = max(1, 0.7)
            = 1
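Equations 5-18 and 5-19 are easy to apply mechanically. The sketch below (Python; the function names are illustrative) assumes the bpa is already given in tuple format over an ordered universe, as in the example above:

def bpa_to_possibility(masses):
    """Equation 5-18: pi(x_i) is the sum of the masses m_i, ..., m_n, where the
    bpa is given in tuple format <m_1, ..., m_n> over an ordered universe."""
    n = len(masses)
    return [sum(masses[k] for k in range(i, n)) for i in range(n)]

def possibility_to_bpa(pi):
    """Equation 5-19: m_i = pi(x_i) - pi(x_{i+1}), with pi(x_{n+1}) = 0."""
    padded = list(pi) + [0.0]
    return [padded[i] - padded[i + 1] for i in range(len(pi))]

masses = [0.3, 0.5, 0.0, 0.2, 0.0]     # <{a}:0.3, {a,b}:0.5, {a,b,d,e}:0.2> in tuple format
pi = bpa_to_possibility(masses)        # [1.0, 0.7, 0.2, 0.2, 0.0]
print(pi)
print(possibility_to_bpa(pi))          # recovers the masses (up to floating-point rounding)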

As in D-S theory, inference in possibility theory comes in terms of belief revision and
updating. A logic-based approach to inference has also been developed [Dubois and
Prade 1988]. The definitions of the belief revision and updating are summarised here
for both the necessity measure Nec and the possibility measure Pos. See [Kruse, Schwecke
and Heinsohn 1991] for a more detailed presentation and discussion of these operations.
A commonly used approach to updating a necessity measure Nec and a possibility measure Pos, given new evidence E, E ∈ P(X), is defined as follows:

Nec(A | E) = Nec(A ∩ E) / Nec(E)

Pos(A | E) = (Pos(A ∪ ¬E) - Pos(¬E)) / (1 - Pos(¬E))

where Nec(E) ≠ 0 (i.e. Pos(E) = 1). These updating operations are concerned with redistributing mass such that Nec(E | E) = 1. Belief revision, on the other hand, is concerned with revising a body of evidence such that it is consistent with the truth lying in E. This is achieved as follows for NecE(A) and PosE(A), the respective revised necessity and possibility measures:

NecE(A) = (Nec(A ∪ ¬E) - Nec(¬E)) / (1 - Nec(¬E))

PosE(A) = Pos(A ∩ E) / Pos(E)

where Pos(E) ≠ 0.

5.3.2.1 An alternative formulation of possibility theory - fuzzy sets


Thus far, it has been shown how possibility theory can be formulated in terms of nested
bodies of evidence. However, it is also possible to formulate possibility theory in terms
of other higher-level representations - fuzzy sets. This becomes possible for the
following reasons:

• α-cuts are nested: A fuzzy set can be viewed as a family of nested sets -
the α-cuts (see Section 3.3). For example, consider the following fuzzy set
A = {0.4/a + 0.6/b + 0.7/c + 1/d}. The following are the α-cuts Aα of A:

A1 = {d}
A0.7 = {c, d}
A0.6 = {b, c, d}
A0.4 = {a, b, c, d}.

• possibilistic principle: Zadeh [Zadeh 1978] originally introduced the
possibilistic principle, postulating the following equality between
membership values and point possibility values:

μA(x) = π(x | A)

where x ∈ Ωx (the universe of variable X) and A is a fuzzy set defined on
Ωx, μ refers to membership, and π refers to the point possibility
distribution. This equality can best be justified as follows: the degree of
membership μA(x) of an item x ∈ Ωx in a fuzzy set A defined on Ωx can
be viewed as the compatibility of that item (which is well known) to an ill-
defined set. Alternatively, given a proposition stating that the value of
variable "X is A", this proposition can be seen as inducing a possibility
distribution on X where the membership of any value x in A, μA(x), is
seen as the degree of strength of opinion that the exact value of variable X
is x, given that you know that it belongs to an imprecisely defined set (in
this case A).

Consequently, it is possible to represent systems using both possibility theory and fuzzy
sets, or to translate one into the other. For example, fuzzy sets provide a very high level
representation of possibilistic bodies of evidence, so one could transform these bodies
of evidence to fuzzy sets and if-then rules and use fuzzy reasoning in order to perform
inference and decision making, which is, in general, much more transparent and
efficient. Alternatively, in other situations such as inductive reasoning, it is possible to
measure the degree of match of fuzzy events exploiting possibility and necessity
measures as introduced above and in Section 3.6. These measures could potentially
highlight uncertainty that might otherwise go unnoticed; for example, they could
identify a model or data deficiency [Klir and Yuan 1995].

5.3.3 Mass assignment theory


In the late 1980s, Baldwin [Baldwin 1991 b; Baldwin 1992] proposed mass assignment
theory (MAT) as an alternative way to address the shortcomings of probability theory,
when further incompleteness in knowledge exists, namely that a complete probability


distribution over the frame of discernment cannot be given (which corresponds to a
form of ignorance) and where inconsistency is present. In mass assignment theory, the
principal mode of representing domain specific knowledge is the set-based probability
function known as the mass assignment, which corresponds in representation terms to
the basic probability assignment of Dempster-Shafer theory. MAT differs from
previous work in this area by Dempster and Shafer [Dempster 1967; Shafer 1976], by
catering for not only ignorance, but also for inconsistency (allowing mass to be
assigned to the null set) and providing a different and more expressive calculus.

Mass assignment: Let X be a variable defined on the universe Ωx. A mass assignment m, defined over the universe Ωx, is a function from P(X), the power set of Ωx, to the unit interval [0, 1]:

m: P(X) → [0, 1]

that satisfies the following condition:

(i) Σ_{A ∈ P(X)} m(A) = 1

Every set A ∈ P(X) for which m(A) > 0 is called a focal element of m. Notice here that the condition m(∅) = 0 has been dropped. In other words, mass can be allocated to the null set, which enables the modelling of inconsistency in a mass assignment. Mass assignments are denoted by the letters MA qualified by their associated name (e.g. the mass assignment for the concept even is denoted by MAeven) and can be written using a list format (<Ai:m(Ai)>) or a functional format (as is the case for basic probability assignments).

A mass assignment can be viewed as a form of knowledge that expresses upper and lower probabilities for the individual elements of the frame of discernment. As in Dempster-Shafer theory, a probability interval can be calculated for every set A ∈ P(X) using the necessity and possibility measures. Given a mass assignment m, a unique necessity measure for every set A ∈ P(X) is determined as follows:

Nec(A) = Σ_{B | B ⊆ A} m(B)

and a unique possibility measure is determined for every set A ∈ P(X) as follows:

Pos(A) = Σ_{B | B ∩ A ≠ ∅} m(B)

Alternatively, a mass assignment can be viewed as a family of probability distributions,


all of which satisfy the axioms of probability theory and the upper and lower
constraints delimited by the mass assignment. Consequently, although mass
assignments can represent probabilities, they have the added flexibility of being able to
represent uncertain probabilities (second order probabilities). For example, consider a


class of undergraduate students where students can be classified as first-class honours,
second-class honours or as pass. Consider the case where there are 100 students, where
it is known that 30 are pass students, 40 are second-class honours or pass and the
remainder unknown. This can be more succinctly written in mass assignment format as
follows:

MAclass(A) = { 0.3   if A = {pass}
             { 0.4   if A = {pass, second-class-honours}
             { 0.3   if A = {pass, second-class-honours, first-class-honours}
             { 0     otherwise

This mass assignment corresponds to a family of probability distributions satisfying the


following constraints:

0.3 ≤ Pr(pass) ≤ 1
0 ≤ Pr(second-class) ≤ 0.7
0 ≤ Pr(first-class) ≤ 0.3

such that

Pr(pass) + Pr(second-class) + Pr(first-class) = 1.0.

A particular type of probability distribution is obtained by distributing the mass


associated with the non-singleton focal elements according to the prior probability
distribution (which is, unless otherwise stated, assumed to be uniform); this distribution
is termed the least prejudiced distribution (LPD) [Baldwin 1992; Baldwin 1996]. In
the case of MAclass the corresponding LPD, LPDclass, is given as follows:

Pr(pass) = 0.3 + 0.4/2 + 0.3/3 = 0.6
Pr(second-class) = 0.4/2 + 0.3/3 = 0.3
Pr(first-class) = 0.3/3 = 0.1
Formally, the more general transformation of a mass assignment, MA, to a point probability distribution can be defined as follows [Baldwin, Martin and Pilsworth 1995; Ralescu 1997] for a variable X defined on Ωx:

Pr(xi | MA) = Σ_{A ∈ P(X), xi ∈ A} PrA(xi) MA(A)   (5-20)

where P(X) denotes the power set of Ωx, MA(A) denotes the mass associated with focal element A in the mass assignment MA, PrA(xi) is the probability distribution on a focal element A, and Pr(xi | MA) is the updated or posterior probability distribution obtained when the mass assignment MA is provided. PrA(xi) is a "local" probability distribution or selection rule [Ralescu 1997] for each focal element A in the mass assignment MA and is defined as follows:

PrA(xi) = Pr(xi) / Σ_{z ∈ A} Pr(z)   if xi ∈ A, and 0 otherwise   (5-21)

Notice that the least prejudiced distribution is a more general version of the pignistic
distribution introduced by Smets as part of the Transferable Belief Model [Smets 1990;
Smets 1994]. The transformation of a mass assignment to a point probability
distribution can be simply viewed as the updated point probability distribution obtained
when a prior is conditioned on that mass assignment, i.e. Pr(X = xi | MA). This
relationship is further considered during the presentation of the bi-directional
transformation of a fuzzy set to a probability distribution in Section 5.4.
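Equations 5-20 and 5-21 can be read as a simple redistribution procedure. The following sketch (Python; the function name and abbreviated class labels are illustrative) computes the least prejudiced distribution for the student example above under a uniform prior:

def least_prejudiced_distribution(ma, prior=None):
    """Equation 5-20: distribute the mass of each focal element amongst its
    members in proportion to the prior (uniform if no prior is supplied)."""
    universe = set().union(*ma)
    if prior is None:
        prior = {x: 1.0 / len(universe) for x in universe}
    lpd = {x: 0.0 for x in universe}
    for focal, mass in ma.items():
        normaliser = sum(prior[x] for x in focal)      # denominator of Equation 5-21
        for x in focal:
            lpd[x] += mass * prior[x] / normaliser     # Pr_A(x) * MA(A)
    return lpd

ma_class = {
    frozenset({"pass"}): 0.3,
    frozenset({"pass", "second"}): 0.4,
    frozenset({"pass", "second", "first"}): 0.3,
}
print(least_prejudiced_distribution(ma_class))   # pass: 0.6, second: 0.3, first: 0.1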

5.3.3.1 Mass Assignment Calculus


Mass assignments can be combined, corresponding to the conjunction of knowledge statements, aggregated, corresponding to the combination of alternate knowledge statements, and updated, corresponding to forming a posterior mass assignment from a prior mass assignment when given some specific knowledge, also expressed as a mass assignment. A mass assignment can be viewed as a lattice, partially ordered by set inclusion ⊆. Meet operations (corresponding to conjunction), join operations (corresponding to union), restriction operations (corresponding to subset) and conditioning operations have been defined for mass assignments. [Baldwin 1991b;
Baldwin 1992] provides a detailed presentation of the mass assignment calculus (meet,
join, restrictions, conditioning). In this book however, the interest in the MAT calculus
is limited to the conditioning operation, which is used extensively in the proposed
inference and learning algorithms (see Chapters 6 and 9). This conditioning operation,
introduced presently, plays a key role as a means of performing semantic matching
(unification) of concepts represented as fuzzy sets. This probabilistic operation on
fuzzy sets is facilitated by the formal bi-directional relationship that exists between
fuzzy sets and mass assignments (Section 5.4). The conditioning operation in mass
assignment theory is commonly known as semantic unification [Baldwin 1993]
because of its central role in probabilistic reasoning in the unification of concepts
(described in Chapter 6). The maintenance of the uncertain nature of the probabilities in
a mass assignment following conditioning gives rise to two versions of semantic unification: interval and point-valued. Interval semantic unification maintains the
uncertainty of probabilities present in the original mass assignments through its interval
representation, corresponding to lower (necessity) and upper (possibility) bounds of the
conditional probability, whereas point semantic unification generates a point
probability.

Point Semantic Unification: Let m:<Mi:mi> and d:<Dj:dj> be two mass assignments specified in terms of their focal elements Mi and Dj ∈ P(X) (the power set of the frame of discernment Ωx) and their associated masses respectively. The point semantic
unification calculates the point probability resulting from the conditioning of m on
evidence d (defined here for the discrete case; see Section 6.2.1 for the continuous case
[Baldwin, Lawry and Martin 1996]) as follows:

Pr(M | D) = Σ_{i,j=1}^{n} mi dj |Mi ∩ Dj| / |Dj|   (5-22)

In this case, the prior probability distribution for X is assumed to be uniform. If, however, this is not the case, Equation 5-22 becomes slightly more complicated as follows:

Pr(M | D) = Σ_{i,j=1}^{n} mi dj [ Σ_{x ∈ Mi ∩ Dj} Pr(x) ] / [ Σ_{x ∈ Dj} Pr(x) ]   (5-23)

For example, consider the following two mass assignments:

MASmall = <{1}:0.4, {1, 2}:0.5, {1, 2, 3}:0.1>

MAAbout_2 = <{2}:0.4, {1, 2, 3}:0.6>

The point semantic unification of MASmall given the evidence MAAbout_2, Pr(MASmall | MAAbout_2), and a uniform prior, is calculated using the following matrix:

MASmall \ MAAbout_2    {2}: 0.4             {1, 2, 3}: 0.6
{1}: 0.4               0                    1/3(0.4 · 0.6) = 0.08
{1, 2}: 0.5            0.5 · 0.4 = 0.2      2/3(0.5 · 0.6) = 0.2
{1, 2, 3}: 0.1         0.1 · 0.4 = 0.04     0.1 · 0.6 = 0.06

resulting in the following point probability:

Pr(MASmall | MAAbout_2) = 0 + 0.08 + 0.2 + 0.2 + 0.04 + 0.06
                        = 0.58
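The matrix calculation above generalises directly to Equation 5-23. A minimal sketch (Python; function and variable names are illustrative) that reproduces Pr(MASmall | MAAbout_2) = 0.58 under a uniform prior is given below:

def point_semantic_unification(m, d, prior=None):
    """Pr(M | D) under Equation 5-23 (reduces to Equation 5-22 for a uniform prior):
    sum over focal-element pairs of m_i * d_j * Pr(M_i intersect D_j) / Pr(D_j)."""
    universe = set().union(*m, *d)
    if prior is None:
        prior = {x: 1.0 / len(universe) for x in universe}
    total = 0.0
    for mi_set, mi in m.items():
        for dj_set, dj in d.items():
            overlap = sum(prior[x] for x in (mi_set & dj_set))
            total += mi * dj * overlap / sum(prior[x] for x in dj_set)
    return total

ma_small   = {frozenset({1}): 0.4, frozenset({1, 2}): 0.5, frozenset({1, 2, 3}): 0.1}
ma_about_2 = {frozenset({2}): 0.4, frozenset({1, 2, 3}): 0.6}
print(point_semantic_unification(ma_small, ma_about_2))   # approximately 0.58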

Interval semantic unification is defined as follows for two mass assignments m:<Mi:mi> and d:<Dj:dj> as the necessity and possibility interval probability [N, P] resulting from the conditioning of m on evidence d:

Pr(M | D) = [N, N + U]   (5-24)

where

N = Σ_{i,j=1}^{n} mi dj such that T(Mi | Dj) = t

U = Σ_{i,j=1}^{n} mi dj such that T(Mi | Dj) = u

T(M | D) = { t   if D ⊆ M
           { f   if D ∩ M = ∅
           { u   otherwise

For example, consider the following two mass assignments (the same as were used in the point semantic unification example above):

MASmall = <{1}:0.4, {1, 2}:0.5, {1, 2, 3}:0.1>

MAAbout_2 = <{2}:0.4, {1, 2, 3}:0.6>

The interval semantic unification of MASmall given the evidence MAAbout_2, Pr(MASmall | MAAbout_2), is calculated using the following matrix:

MASmall \ MAAbout_2    {2}: 0.4      {1, 2, 3}: 0.6
{1}: 0.4               f = 0.16      u = 0.24
{1, 2}: 0.5            t = 0.2       u = 0.3
{1, 2, 3}: 0.1         t = 0.04      t = 0.06

leading to the following interval probability:

Pr(MASmall | MAAbout_2) = [0.2 + 0.04 + 0.06, 0.2 + 0.04 + 0.06 + 0.24 + 0.3]
                        = [0.3, 0.84]
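The interval version only needs the three-valued test T. The following sketch (Python; names are illustrative) computes [N, N + U] for the same pair of mass assignments:

def interval_semantic_unification(m, d):
    """Equation 5-24: returns [N, N + U], where N sums the mass products of pairs
    with D_j a subset of M_i (t) and U those with only a partial overlap (u)."""
    n = u = 0.0
    for mi_set, mi in m.items():
        for dj_set, dj in d.items():
            if dj_set <= mi_set:            # t: D wholly supports M
                n += mi * dj
            elif dj_set & mi_set:           # u: uncertain, partial overlap
                u += mi * dj
            # f: disjoint focal elements contribute nothing
    return n, n + u

ma_small   = {frozenset({1}): 0.4, frozenset({1, 2}): 0.5, frozenset({1, 2, 3}): 0.1}
ma_about_2 = {frozenset({2}): 0.4, frozenset({1, 2, 3}): 0.6}
print(interval_semantic_unification(ma_small, ma_about_2))   # (0.3, 0.84)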

5.4 FROM FUZZY SETS TO PROBABILITY DISTRIBUTIONS

In the previous sections, different approaches to handling uncertainty within the


framework of probability theory were introduced. In addition, formal relationships
between these approaches have been presented: the relationship between a possibility
distribution and a basic probability assignment (Section 5.3.2); the relationship between
a fuzzy set and a possibility distribution (Section 5.3.2.1); and the relationship between
a mass assignment and a probability distribution (Section 5.3.3). The purpose of this
section is to illustrate how these relations can be linked together to provide a formal
connection between a fuzzy set and a probability distribution. This results in a multi-
step bi-directional approach that transforms probabilities to membership values. This
bi-directional transformation forms the basis for the learning algorithms proposed in
Part IV of this book.

As seen in Section 5.3.2.1, using the possibilistic principle, a fuzzy set A can be
transformed into a corresponding possibility distribution 1t by simply equating the
membership of a value with the possibility, that is, μA(x) = π(x|A). Subsequently,
methods that transform probabilities to possibilities can be used to generate
membership functions. Several researchers have investigated the relationship between
possibility distributions and probability distributions [Baldwin 1991b; Dubois and
Prade 1983; Klir 1990; Sudkamp 1992]. This research has been guided by Zadeh's
possibility/probability consistency principle [Zadeh 1978], which states the
following:

If a variable X can take values x1, ..., xn with respective possibility and
probability distributions π = <π1, ..., πn> and Pr = <p1, ..., pn>, then the degree of
consistency of the probability distribution Pr with the possibility distribution
π is given by the following:

Consistency(Pr, π) = Σ_{i=1}^{n} πi pi

Alternative definitions of consistency also exist [Dubois and Prade 1980], however the
importance of this measure is that it serves as an "approximate formalisation of the
heuristic observation that a lessening of the possibility of an event tends to lessen its
probability - but not vice versa" [Zadeh 1978]. The possibility/probability consistency
principle provides a basis for the calculation of a possibility distribution from a
corresponding probability distribution.

Numerous probability/possibility transformation methods have been proposed in the


literature including Klir's conservation of uncertainty method [Klir 1990], the bijective
transformation [Dubois and Prade 1983] and Baldwin's bi-directional transformation
[Baldwin 1994]. Though the bijective transformation and the bi-directional
transformation are very similar, the motivations behind their introduction were quite
different. Furthermore, the latter transformation is more general, catering for prior
probabilities, and inconsistencies that may arise due to subnormal fuzzy sets. The work
presented in this book is based on Baldwin's bi-directional transformation [Baldwin
1994; Baldwin, Martin and Pilsworth 1995], which is described subsequently.

5.4.1 Transforming fuzzy sets into probability distributions


The bi-directional transformation proposed by Baldwin [Baldwin 1994; Baldwin,
Martin and Pilsworth 1995] permits the transformation, via mass assignment theory, of
a fuzzy set into its corresponding unique probability distribution and vice versa. This
section overviews the main steps in this process.

5.4.1.1 From a fuzzy set to a probability distribution


This section outlines the steps involved in transforming a fuzzy set into a probability distribution.

Step 1: Fuzzy set <=> mass assignment: As seen in Section 5.3.2, a fuzzy set
can formally be transformed into a nested body of evidence or mass
assignment via its corresponding possibility distribution π. Consider
that a variable X has a fuzzy set value f, i.e. X is f, where f is a
fuzzy set defined on the discrete universe Ωx = {x1, ..., xn}, whose
support corresponds to Ωx (for convenience). This is written more
succinctly as follows:

f = Σ_{i=1}^{n} xi / μf(xi)

The proposition that "X has a fuzzy set value f" induces a possibility
distribution over the values of X such that the membership values of xi
are numerically equated with possibility, i.e.

μf(xi) = πf(xi)

Suppose f is a normal fuzzy set where the elements are ordered such
that

μf(x1) ≥ μf(x2) ≥ ... ≥ μf(xn)

then

Pos(A) = Pos({x1, ..., xi}) = πf(x1)   ∀i ∈ {1, ..., n} (see Equation 5-17)

Each subset A corresponds to a level set (α-cut) where α = πf(xi). So,
with the assumption that Pr(A) ≤ Pos(A) for any A, it is possible to find
the belief mass associated with each A as follows, based on Equation 5-19:

m(A) = πf(xi) - πf(xi+1)

where A = {x1, ..., xi} and πf(xn+1) = 0. This leads to the following mass
assignment corresponding to the fuzzy set f:

MAf = <{x1, ..., xi}: πi - πi+1>   with πn+1 = 0 and ∀i ∈ {1, ..., n}

Step 2: Mass assignment <=> probability distribution: The resulting mass
assignment MAf can then be transformed into a probability
distribution by distributing the mass associated with each focal
element amongst the constituent propositions using the prior
distribution (assume a uniform prior if no prior is provided). This leads
to the least prejudiced distribution, described previously in Section
5.3.3. A point probability Pr' is calculated as follows ∀xi ∈ Ωx:

Pr'(xi) = Σ_{A ∈ P(X), xi ∈ A} MAf(A) Pr(xi) / Σ_{z ∈ A} Pr(z)

Consider the following example, where a fuzzy set f = {a/1 + b/0.5 + c/0.5 + d/0.2} is transformed into its corresponding probability distribution. A uniform prior probability distribution is assumed here. This results in the following calculations:

πf = <1, 0.5, 0.5, 0.2>
MAf = <{a}:0.5, {a, b, c}:0.3, {a, b, c, d}:0.2>

LPDf(a) = 0.5 + 0.3/3 + 0.2/4 = 0.65
LPDf(b) = 0.3/3 + 0.2/4 = 0.15
LPDf(c) = 0.3/3 + 0.2/4 = 0.15
LPDf(d) = 0.2/4 = 0.05
LPDf = a:0.65 + b:0.15 + c:0.15 + d:0.05
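Steps 1 and 2 can be chained into a single routine. The sketch below (Python; names are illustrative) assumes a normal fuzzy set over a discrete universe and a uniform prior, and reproduces the example calculation above:

def fuzzy_set_to_lpd(fuzzy_set):
    """Two-step transformation of a (normal) discrete fuzzy set into its least
    prejudiced distribution, assuming a uniform prior."""
    # Step 1: fuzzy set -> mass assignment, via the induced possibility distribution.
    items = sorted(fuzzy_set.items(), key=lambda kv: kv[1], reverse=True)
    elements = [x for x, _ in items]
    pi = [mu for _, mu in items] + [0.0]
    mass_assignment = {}
    for i in range(len(elements)):
        mass = pi[i] - pi[i + 1]                      # Equation 5-19
        if mass > 0:
            mass_assignment[frozenset(elements[:i + 1])] = mass
    # Step 2: mass assignment -> least prejudiced distribution (uniform prior).
    lpd = {x: 0.0 for x in elements}
    for focal, mass in mass_assignment.items():
        for x in focal:
            lpd[x] += mass / len(focal)
    return mass_assignment, lpd

ma, lpd = fuzzy_set_to_lpd({"a": 1.0, "b": 0.5, "c": 0.5, "d": 0.2})
print(ma)    # {a}:0.5, {a,b,c}:0.3, {a,b,c,d}:0.2
print(lpd)   # approximately a:0.65, b:0.15, c:0.15, d:0.05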

Step 1 in the above bi-directional transformation can be extended to non-normal fuzzy sets so that the mass assignment corresponding to the fuzzy set f looks like the following:

MAf = <{x1, ..., xi}: πi - πi+1, ∅: 1 - π1>   with πn+1 = 0, and ∀i ∈ {1, ..., n}

such that a non-zero mass is assigned to the null set ∅; in this case, the mass assignment is said to be incomplete. To transform this mass assignment into a probability distribution, the mass associated with the null set ∅ needs to be redistributed amongst the other focal elements. Section 8.2.2 discusses a couple of distribution policies and the effect these distributions have on the resulting probability distributions.

The transformation of a fuzzy set to a point probability distribution can be simply viewed as the updated point probability distribution obtained when a prior probability distribution is conditioned on that fuzzy set, i.e. Pr(X = xi | f). On the other hand, given a posterior probability distribution, it is possible to uniquely determine the fuzzy set that was used to condition the prior distribution in order to arrive at the posterior distribution. In other words, given the prior Pr(X) and the posterior Pr(X | f), where f is a fuzzy event, f can be determined uniquely using the transformation described above.

5.4.1.2 From a probability distribution to a fuzzy set

The transformation from a probability distribution Pr to a fuzzy set f is now briefly described. Let f and Pr be defined as before over the frame of discernment Ωx = {x1, ..., xn} with supports equal to Ωx, and let P(X) be the power set of Ωx. To simplify the presentation, it is assumed that no two probabilities in Pr are equal and that the prior is uniform. This transformation consists of the following steps:

Step 1: Order the probability distribution Pr such that:

Pr(xi) ≥ Pr(xj) if i < j, ∀ i, j ∈ {1, ..., n}.

Step 2: Since this bi-directional transformation is order preserving, the fuzzy
set f can assume the following form:

μf(xi) ≥ μf(xj) if i < j, ∀ i, j ∈ {1, ..., n}.

Step 3: This fuzzy set f induces a possibility distribution πf, which in turn
induces a mass assignment of the form:

MAf = <{x1, ..., xi}: πf(xi) - πf(xi+1)>   with πf(xn+1) = 0 and ∀i ∈ {1, ..., n}

Step 4: Letting Ai = {x1, ..., xi} ∀ i ∈ {1, ..., n} and since MAf(Ai) = πf(xi) -
πf(xi+1) (according to Equation 5-19), the following equation

Pr'(xi) = Σ_{A ∈ P(X), xi ∈ A} MAf(A) Pr(xi) / Σ_{z ∈ A} Pr(z)

can be simplified (given the uniform prior and the nested focal elements) to

Pr'(xi) = Σ_{k=i}^{n} (πf(xk) - πf(xk+1)) / k

Step 5: For i = n this gives

Pr'(xn) = πf(xn) / n

such that

πf(xn) = n · Pr'(xn)

The remaining values for πf(xi) (i.e. i ∈ {1, ..., n-1}) can be solved
for by direct substitution of πf(xi+1). This leads to the following
general equation for obtaining a possibility πf(xi) corresponding to the
probability Pr'(xi):

πf(xi) = i·Pr'(xi) + Σ_{k=i+1}^{n} Pr'(xk)   ∀i ∈ {1, ..., n}   (5-25)

Step 6: This results in a possibility distribution πf and a corresponding fuzzy
set f.

For example, consider the following probability distribution:

Large: 0.6333 + Medium:0.333 + Small:0.0333.

This probability distribution corresponds to a fuzzy set f such that conditioning a prior probability distribution on f, that is Pr(X | f), results in the above probability distribution. Assuming the prior distribution was uniform leads to the following fuzzy set using Equation 5-25:

f = {Large:1 + Medium:0.7 + Small:0.1}

where μf(w) is calculated through its associated possibility as follows (in the following calculations "·" denotes product):

μf(Large) = πf(Large) = 1·0.6333 + (0.333 + 0.0333)
          = 1
μf(Medium) = πf(Medium) = 2·0.333 + (0.0333)
           = 0.7
μf(Small) = πf(Small) = 3·0.0333
          = 0.1
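Equation 5-25 can be applied directly once the probabilities are sorted. A minimal sketch (Python; the function name is illustrative), assuming a uniform prior and distinct probabilities as above:

def probabilities_to_fuzzy_set(prob):
    """Equation 5-25 (uniform prior, distinct probabilities): recover the fuzzy set
    whose conditioning of the prior produced the given posterior distribution."""
    items = sorted(prob.items(), key=lambda kv: kv[1], reverse=True)
    fuzzy_set = {}
    for i, (x, p) in enumerate(items, start=1):
        tail = sum(q for _, q in items[i:])          # sum of the smaller probabilities
        fuzzy_set[x] = i * p + tail                  # pi(x_i) = i*Pr(x_i) + sum_{k>i} Pr(x_k)
    return fuzzy_set

print(probabilities_to_fuzzy_set({"Large": 0.6333, "Medium": 0.333, "Small": 0.0333}))
# approximately {Large: 1.0, Medium: 0.7, Small: 0.1}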

The extension of this bi-directional transformation to continuous variables is a little


more involved and is achieved by taking alpha cuts of the fuzzy set and proceeding in a
similar fashion as described above with continuous integrals.

5.4.2 From memberships to probabilities - a voting model justification

The semantics of the bi-directional fuzzy-set-to-probability-distribution transformation
can be justified and intuitively understood from a voting model perspective [Baldwin
1991a]. The voting model lends a semantic interpretation of concepts expressed in
terms of fuzzy sets, mass assignments and probability distributions from a human
reasoning perspective. It is based upon a frequentist viewpoint. Consider a die variable,
whose values are drawn from the following universe of values:

ΩDieValues = {1, ..., 6}.

A population of voters (a representative sample of persons) are asked to vote on the


appropriateness of the words Small, Medium, and Large as a description of the die
values. Each voter must vote yes or no for each proposition p, where p denotes the
following: wordi is a suitable description of valuej, where wordi can assume any value in {Small, Medium, Large} and valuej can assume a value in ΩDieValues.
Abstentions are not allowed. Voters are expected to vote consistently and to abide by
the constant threshold assumption. Table 5-1 presents an example voting pattern for a
population of ten people when asked to vote on the appropriateness of these words for the die value of 5. Similar voting patterns are generated for the other die values. All voters accept the word Large as an appropriate description for the die value of 5, while 7 people (70%) accept Medium as an appropriate description and 1 person accepts the word Small. These proportions correspond to membership values. For example, the word Large will have a membership value of 1 in the fuzzy set linguistic summary of the die value 5. In short, the voting pattern presented in Table 5-1 corresponds to a linguistic description of the die value 5 described in terms of the following fuzzy set: {Large/1 + Medium/0.7 + Small/0.1}. Reinterpreting the voting patterns in another way, 10% of the voters voted yes for the words in {Small, Medium, Large}, while 60% voted for the words in {Medium, Large} and 30% voted exclusively for Large. This interpretation corresponds to a probability distribution on the power set of possible die values ΩDieValues. This probability distribution corresponds to the following mass assignment:

<{Small, Medium, Large}:0.1, {Medium, Large}:0.6, {Large}:0.3>

To get a probability distribution associated with this voting pattern, the users could be asked to restrict their descriptions of values to one word, i.e. each voter is asked to vote yes for one word only when describing a value. However, in the case where the users are not available to make such a decision, it is possible to uniformly distribute probabilities amongst the words a voter chose to label a value. This results in the following probability distribution (which is equivalent to assuming a uniform prior):

Small:0.0333 + Medium:0.333 + Large:0.6333

This example of the voting model illustrates intuitively the relationship between fuzzy
sets, mass assignments and probability distributions.

In subjective probability, betting behaviour provides a rationale on how to assess the


probabilities of events. This is paralleled in fuzzy set theory (and probability theory), as
seen above, by the voting model, which provides an operational model to assess the
membership values and also fuzzy set operations such as intersection and union (see
Section 8.2.1 for more details).

Table 5-1: A voting pattern for 10 people defining the linguistic description of a die having a value of 5. This corresponds to the fuzzy set {Small/0.1 + Medium/0.7 + Large/1}.

Word\Person   1    2    3    4    5    6    7    8    9    10
Large         Yes  Yes  Yes  Yes  Yes  Yes  Yes  Yes  Yes  Yes
Medium        Yes  Yes  Yes  Yes  Yes  Yes  Yes  No   No   No
Small         Yes  No   No   No   No   No   No   No   No   No
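The three readings of the voting pattern (memberships, mass assignment and point probabilities) are all simple counting exercises. The sketch below (Python; the data structure and names are illustrative) reproduces the numbers quoted above for the die value 5:

from collections import Counter

# Votes of 10 people on which words describe the die value 5 (Table 5-1).
votes = {
    "Large":  [1] * 10,
    "Medium": [1] * 7 + [0] * 3,
    "Small":  [1] * 1 + [0] * 9,
}
n_voters = 10

# Fuzzy set: proportion of voters accepting each word.
memberships = {w: sum(v) / n_voters for w, v in votes.items()}

# Mass assignment: proportion of voters choosing each set of words.
chosen_sets = Counter(
    frozenset(w for w in votes if votes[w][i]) for i in range(n_voters)
)
mass_assignment = {s: c / n_voters for s, c in chosen_sets.items()}

# Probability distribution: each voter's vote spread evenly over their chosen words.
probabilities = {w: 0.0 for w in votes}
for s, mass in mass_assignment.items():
    for w in s:
        probabilities[w] += mass / len(s)

print(memberships)       # Large 1.0, Medium 0.7, Small 0.1
print(mass_assignment)   # {S,M,L}:0.1, {M,L}:0.6, {L}:0.3
print(probabilities)     # approximately Large 0.633, Medium 0.333, Small 0.033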

5.4.3 Zadeh's probability of fuzzy events


The previous section has described the formal relationship that exists between
membership values and point probabilities. This relationship permits the calculation of
the probability of a fuzzy event and of the conditional probability of fuzzy events and
thus, probabilistic reasoning (see Chapter 6 for more details). Zadeh [Zadeh 1968]
proposed alternative definitions that allow the calculation of the probability of fuzzy
events directly from the underlying fuzzy sets. He defined the probability of a fuzzy event as follows. Suppose f is a fuzzy set defined on the discrete universe Ωx and Pr is a probability distribution defined on Ωx; then the probability of f is defined as follows:

Pr(f) = Σ_{x ∈ Ωx} μf(x) Pr(x)

He proposed the following definition for calculating the conditional probability of fuzzy events. Suppose f and g are two fuzzy sets defined on the universe Ωx and Pr is a probability distribution defined on Ωx; then the conditional probability of f given g is defined as follows:

Pr(f | g) = Pr(f ∩ g) / Pr(g)

where the fuzzy set intersection operator ∩ denotes multiplication. This definition plays a similar role to semantic unification but leads to different and more limited results (i.e. point values only).
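Zadeh's definitions reduce to a weighted sum and a ratio. The sketch below (Python) illustrates them; the fuzzy sets small and about_2 and their membership values are illustrative choices, not taken from the text:

def zadeh_probability(fuzzy_set, prob):
    """Zadeh's probability of a fuzzy event: Pr(f) = sum_x mu_f(x) * Pr(x)."""
    return sum(mu * prob.get(x, 0.0) for x, mu in fuzzy_set.items())

def zadeh_conditional(f, g, prob):
    """Zadeh's conditional probability of fuzzy events, with product intersection."""
    intersection = {x: f.get(x, 0.0) * g.get(x, 0.0) for x in prob}
    return zadeh_probability(intersection, prob) / zadeh_probability(g, prob)

prob    = {1: 1/6, 2: 1/6, 3: 1/6, 4: 1/6, 5: 1/6, 6: 1/6}   # a fair die as prior
small   = {1: 1.0, 2: 0.7, 3: 0.2}                           # illustrative fuzzy sets
about_2 = {1: 0.4, 2: 1.0, 3: 0.4}
print(zadeh_probability(small, prob))                        # 1.9/6, roughly 0.317
print(zadeh_conditional(small, about_2, prob))               # a point-valued result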

5.5 SUMMARY

In this chapter, various language-like approaches of representing uncertainty and


imprecision were presented. The rationales, advantages, and limitations of major
probabilistic approaches to managing and reasoning under uncertainty were described
using worked examples. The chapter began with a review of probability theory, which
was followed by a presentation of three point-based probability theories. Subsequently
three set-based probabilistic approaches were described: Dempster-Shafer theory;
possibility theory; and mass assignment theory. These provide semantically richer
formalisms than point-based probability theories, catering not only for uncertainty, but
also for ignorance and inconsistency. For each approach the respective calculus of
operations (conjunction, negation etc, inference, decision making) was described and
the relationships between these modes of uncertainty representation and fuzzy set
theory were also explored. The bi-directional transformation from a membership value
to a point probability was described in detail. An intuitive justification and
interpretation of the relationship between fuzzy sets and probability distributions based
on human reasoning (the voting model) was also presented. This relationship facilitates
more powerful and expressive forms of knowledge representation and reasoning, very
much in the true spirit of soft computing. This bi-directional transformation plays a vital
role in the learning algorithms proposed in this book, facilitating learning through a
counting approach and subsequent knowledge expression in a transparent/intuitive
fuzzy set format (see Chapter 8). Finally, the definitions proposed by Zadeh for the
probability of fuzzy events were briefly described.

5.6 BIBLIOGRAPHY

Baldwin, J. F. (1991a). "Combining evidences for evidential reasoning", International


Journal of Intelligent Systems, 6(6):569-616.
Baldwin, J. F. (1991b). "A Theory of Mass Assignments for Artificial Intelligence", In
IJCAI '91 Workshops on Fuzzy Logic and Fuzzy Control, Sydney, Australia,
Lecture Notes in Artificial Intelligence, A. L. Ralescu, ed., 22-34.
Baldwin, J. F. (1992). "Fuzzy and Probabilistic Uncertainties", In Encyclopaedia of AI,
2nd ed., Shapiro, ed., 528-537.
Baldwin, J. F. (1993). "Probabilistic, Fuzzy and Evidential Reasoning in FRIL (fuzzy
relational inference language)." In the proceedings of Two Decades of Fuzzy
Control, IEE London, 7/1-7/4.
Baldwin, J. F. (1994). "Evidential logic rules from examples." In the proceedings of
EUFlT, Aachen, Germany, 91-95.
Baldwin, J. F. (1996). "Knowledge from data using Fril and fuzzy methods", In Fuzzy
Logic, J. F. Baldwin, ed., John Wiley & Sons, 34-76.
Baldwin, J. F., Lawry, J., and Martin, T. P. (1996). "A note on the conditional
probability of fuzzy subsets of a continuous domain", Fuzzy sets and systems,
96:211-222.
Baldwin, J. F., Martin, T. P., and Pilsworth, B. W. (1995). FRIL - Fuzzy and Evidential
Reasoning in A.I., Research Studies Press (Wiley Inc.), ISBN 0863801595.
Barrett, J. D., and Woodall, W. H. (1997). "A probabilistic alternative to fuzzy logic
controllers", I1E Transactions, 29:459-467.
Bayes, T. (1763). "An essay towards solving a problem in the doctrine of chances",
Philosophical transactions of the Royal Society of London, 53:370-418.
deFinetti, B. (1937). La prévision: ses lois logiques, ses sources subjectives. Annales de
l'Institut Henri Poincaré, 7, pp 1-68. Translated in: Kyburg, H. and
Smokler, H. (1964) Studies in Subjective Probability. Wiley, New York.
DeGroot, M. H. (1989). Probability and statistics (second edition). Addison-Wesley,
New York.
Dempster, A. P. (1967). "Upper and Lower Probabilities Induced by Multivalued
Mappings", Annals of Mathematical Statistics, 38:325-339.
Dubois, D., and Prade, H. (1980). Fuzzy sets and systems: theory and applications.
Academic, New York.
Dubois, D., and Prade, H. (1982). "On several representations of an uncertain body of
evidence", In Fuzzy information and decision processes, M. M. Gupta and E.
Sanchez, eds., North-Holland, Amsterdam, 167-181.
Dubois, D., and Prade, H. (1983). "Unfair coins and necessary measures: towards a
possibilistic interpretation of histograms", Fuzzy sets and systems, 10: 15-20.
Dubois, D., and Prade, H. (1988). An approach to computerised processing of
uncertainty. Plenum Press, New York.
Duda, R., and Hart, P. (1973). Pattern classification and scene analysis. Wiley, New
York.
Good, I. J. (1961). "A causal calculus", British journal of the philosophy of science,
11 :305-318.
Good, I. J. (1965). The estimation of probabilities: an essay on modern Bayesian
methods. M.I.T. Press.
Jensen, F. V. (1996). An introduction to Bayesian Networks. UCL Press, London.

Kalvi, T. (1993). "ASMOD: an algorithm for Adaptive Spline Modelling of


Observation Data", International Journal of Control, 58(4):947-968.
Klir, G. J., and Yuan, B. (1995). Fuzzy Sets and Fuzzy Logic, Theory and Applications.
Prentice Hall, New Jersey.
Klir, K. (1990). "A principle of uncertainty and information invariance", International
journal of general systems, 17(2, 3):249-275.
Krause, P., and Clark, D. (1993). Representing uncertain knowledge: an artificial
intelligence approach. intellect, Oxford.
Kruse, R., Schwecke, E., and Heinsohn, J. (1991). Uncertainty and vagueness in
knowledge based systems. Springer-Verlag, Berlin.
Langley, P., Iba, W., and Thompson, K. (1992). "An analysis of Bayesian classifiers."
In the proceedings of Tenth National Conference on AI, 223-228.
Lauritzen, S. L., and Spiegelhalter, D. J. (1988). "Local computations with probabilities
on graphical structures and their application to expert systems", Journal of the
Royal Statistical Society, B50(2):157-224.
Lindley, D. V. (1985). Making decisions. John Wiley, Chichester.
Pearl, J. (1986). "A constraint-propagation approach to probabilistic reasoning", In
Uncertainty in AI, L. N. Kanal and J. F. Lemmer, eds., Elsevier Science
Publishers, North-Holland, 357-370.
Pearl, J. (1988). Probabilistic reasoning in intelligent systems: networks of plausible
inference. Morgan Kaufmann, San Mateo.
Ralescu, A. (1997). "Quantitative summarization of numerical data using mass
assignment theory (invited lecture)." In the proceedings of SOFT, Kansai,
Japan, Unpublished manuscript.
Russell, S., and Norvig, P. (1995). Artificial Intelligence: a Modern Approach. Prentice-
Hall, Englewood Cliffs, New Jersey, USA.
Shafer, G. (1976). A Mathematical Theory of Evidence. Princeton University Press.
Smets, P. (1990). "The combination of evidence in the transferable belief model", IEEE
Trans. PAMI, 12:447-458.
Smets, P. (1994). ''The transferable belief model", Artificial Intelligence, 66: 191-234.
Smithson, M. (1989). Ignorance and uncertainty: emerging paradigms. Springer-
Verlag, Berlin.
Sudkamp, T. (1992). "On probability-possibility transformation", Fuzzy Sets and
Systems, 51:73-81.
Zadeh, L. A. (1968). "Probability Measures of Fuzzy Events", Journal of Mathematical
Analysis and Applications, 23:421-427.
Zadeh, L. A. (1978). "Fuzzy Sets as a Basis for a Theory of Possibility", Fuzzy Sets and
Systems, 1:3-28.
Zadeh, L. A. (1986). "A simple view of the Dempster-Shafer theory of evidence and its
implication for the rule of combination", AI Magazine, 7:85-90.
CHAPTER 6: FRIL - A SUPPORT LOGIC PROGRAMMING ENVIRONMENT

The preceding chapters have described various forms of knowledge representation


within soft computing and showed how some of these forms are formally related; for
example, how fuzzy sets are formally related to various probabilistic representations
such as mass assignments and probability distributions. The attention in this chapter
shifts to a programming environment that enables soft computing - FRIL (Fuzzy
Relational Inference Language) [Baldwin, Martin and Pilsworth 1988]. Fril is an
efficient general logic programming language with special structures to handle
uncertainty and imprecision. Mass assignment theory, fuzzy set theory, support logic (a
form of interval based probabilistic reasoning), and related theories of uncertainty and
imprecision form the basis of knowledge representation and reasoning for the Fril
support logic programming environment.

This chapter begins by describing how to represent domain specific knowledge in terms
of Fril propositions. Subsequently, the general purpose inference and decision making
strategies in Fril are described, that is, the support logic calculus. This presentation
focuses on the reasoning aspects of Fril that are used by Cartesian granule features
models, which are subsequently proposed in Part IV of this book.

6.1 FRIL RULES AND FACTS

Fril provides a very rich and expressive set of propositional forms that facilitate the
modelling of systems in a linguistic and natural way. Currently, propositions of the
following types are accommodated:

• Unconditional and unqualified propositions expressed by the canonical


form

p: 'X of Object is A '

where A is a fuzzy set representation constraining the values of the


variable (or feature) X for an object or sample, Object. Alternatively, A
could also be a mass assignment or a probability distribution. For the
remainder of the book it is assumed, unless it is explicitly mentioned, that
propositions will be fuzzy in nature. For example, the fuzzy proposition
"Height of Joe is Talf' is an imprecise statement denoting that the height
of Joe is the fuzzy set Tall.


• Unconditional and qualified propositions expressed by the canonical form

p: 'X of Object is A:(l u)'

where A is a fuzzy set constraining the values of the variable X and (l u) represents an interval probability, which denotes that the probability of X having a value of A lies in the interval (l u), where l and u satisfy the following constraints: 0 ≤ l ≤ u ≤ 1. The following is an example of an
unconditional qualified proposition: "Height of Joe is Tall:(0.9 1.0)",
where Tall is represented by a fuzzy set. This proposition states that there
is a high probability (i.e. between 0.9 and 1.0) that the height of Joe
corresponds to the fuzzy set Tall. Both unconditional qualified and
unqualified propositions correspond to specific knowledge about the
values of object variables (sometimes known as facts).

• Conditional propositions (both qualified and unqualified by interval-


valued probabilities) expressed by the canonical form:

r: '<Head> IF <Body>: <list of support pairs>'

where Head is an unconditional and unqualified fuzzy proposition as


defined above and Body is a list of unconditional and unqualified
propositions that can be matched with specific facts (in the knowledge
base) or the head of another rule. The list of support pairs is a list of
interval-valued probabilities (I u) that probabilistically qualify the rule.

Currently, Fril supports four types of conditional proposition (rules):

• Prolog style rules (i.e. traditional logic programming rule) such as


((append (H|T) List2 (H|List3))
(append T List2 List3));
• conjunctive rules;
• evidential logic rules;
• and causal relational rules.

The next section presents conjunctive rules, evidential logic rules, and causal relational
rules in more detail. For a full description of Fril rules see [Baldwin, Martin and
Pilsworth 1988; Baldwin, Martin and Pilsworth 1995]. In this book, the hypothesis
language is currently limited to two of these rule structures: the conjunctive; and
evidential logic rule structures. Classification and prediction problems can be modelled
generically by viewing classification problems as crisp instances of prediction
problems. In other words, prediction is the continuous version of classification. This
view arises from the fact that in classification problems the values of the output
variables are discrete or crisp values. Conversely, the values of output variables in
prediction problems are continuous values that are reinterpreted linguistically (thereby,
giving them a discrete nature). Therefore, the values of output variables in the
prediction case reduce to linguistic values characterised by the fuzzy sets which
discretise the output variable's universe. Consequently, the values of output variables in
classification problems can be viewed as crisp sets consisting of single elements,
whereas the corresponding values in prediction problems are linguistic labels that
denote a fuzzy subsets of the output variable's universe.

6.1.1 Conjunctive rule


A canonical conjunctive rule structure is presented in Figure 6-1. Here CLASS is the value of variable Classification defined over universe ΩClassification for object Object. It
can be viewed as a fuzzy set consisting of a single crisp value (in the case of
classification type problems) or as a fuzzy number (in the case of prediction problems).
In both the case of classification and prediction problems, the rule characterises the
relationship between input and output data for a particular region of the output space
i.e. a concept. One conjunctive rule is generated for each region in the output variable
universe. In the case of classification problem domains, a rule is generated for each
class in the output space. Correspondingly, in the case of prediction type problems, a
rule is generated for each clump of points (granule) in the output space characterised by
a fuzzy set. The body of each rule consists of information expressed in terms of a list of
problem domain features. In the canonical conjunctive rule, each (Fi of Object is FSi_CLASS) corresponds to a fuzzy proposition where the feature Fi has a fuzzy set value FSi_CLASS. Here each Fi represents a feature or variable, which is either a problem
(application) feature, or a Cartesian granule feature or some other type of derived
feature (see Chapter 8 further for details). The following is an example of a conjunctive
rule:

((Classification of Object is Summer_sky)
 (Position of Object is Near_top)
 (Colour of Object is Sky_blue)): (0.9 1)(0 0)

This rule states that there is a high probability (i.e. between 0.9 and 1.0) that Object (corresponding to a region in an image) can be labelled Summer_sky if the position of Object is near the top of the image and if the colour is sky-blue. In this case, both Near_top and Summer_sky are fuzzy sets defined elsewhere in the knowledge base.

((Classification of Object is CLASS)     /* Given */        Head/Consequent
 (F1 of Object is FS1_CLASS)
 ...
 (Fi of Object is FSi_CLASS)                                 Body/Antecedents
 ...
 (Fm of Object is FSm_CLASS))
 : ((1 1)(0 0))                                              Rule Supports

Figure 6-1: Fril conjunctive rule structure.

6.1.2 Evidential logic rule


The general format of an evidential logic rule is similar to the conjunctive rule and a
canonical form is presented in Figure 6-2. The main difference between the two is that, for evidential logic rules, each body term is associated with a weight wi and each rule is associated with a filter term. The weight term wi indicates the relative importance of feature Fi for the rule's conclusion. The filter is seen as a function that linguistically quantifies the number of features that need to be satisfied in order to draw a reasonable conclusion. Evlog is a built-in predicate (BIP) that takes care of inference in evidential reasoning. A more detailed presentation of the rule filters and weights is
presented in Section 9.4, where they are learned from data. The semantics of this BIP
are presented below in Section 6.2. Consider the following concrete example of an
evidential logic rule:

((Classification of Object is Summer_sky) (Evlog most
 (Position of Object is Near_top) 0.1
 (Size of Object is Big) 0.3
 (Position of Object is NearTheOcean) 0.2
 (Colour of Object is Sky_blue) 0.4)): (0.9 1)(0 0)

This rule states that there is a high probability (i.e. between 0.9 and 1.0) that Object
(corresponding to a region in an image) can be labelled Summer_sky if most of the
weighted features in the body of the rule are satisfied. The term most is a fuzzy set that
can model the expression of optimism or pessimism. Evidential logic rules have the
added value that not all evidence is needed in order to reason. This can prove vital in
some problem domains where, for example, a sensor or remote resource is unavailable
but regardless, partial reasoning is possible due to the weighted sum nature of the
evidential logic rule.

((Classification of Object is CLASS)                           Head/Consequent
 (Evlog filter
  (F1 of Object is FS1CLASS)  w1
  ...
  (Fi of Object is FSiCLASS)  wi                               Body/Antecedents and
  ...                                                          associated weights
  (Fm of Object is FSmCLASS)  wm))
 : ((1 1)(0 0))                                                Rule Supports

Figure 6-2: Fril evidential logic rule structure.

6.1.3 Causal relational rule


A third rule format that facilitates the handling of uncertainty and imprecision in Fril is
the causal rule structure (sometimes known as the extended rule). Its canonical form is
presented in Figure 6-3. Here each body term B_i denotes a conjunction of fuzzy
propositions (C_i1 ∧ C_i2 ∧ ... ∧ C_in). In addition, each [u_i, v_i] corresponds to a support
interval that contains Pr(Head | B_i). The body conjunctions B_i are assumed to be
mutually exclusive and exhaustive. In other words, the rule represents an interval-valued
conditional probability distribution between the head proposition and the
propositions in each body conjunction B_i. Consequently, the following probabilistic
constraints need to be satisfied:
(i)  Σ_{i=1}^{n} Pr(B_i) = 1

(ii) Pr(B_i ∧ B_j) = 0, where i ≠ j

For example, consider the following extended rule characterising Summer_sky in terms
of the Position and Colour variables:

((Object is Summer_sky)
 ((Position of Object is Top)    (Colour of Object is Blue))
 ((Position of Object is Middle) (Colour of Object is Blue))
 ((Position of Object is Bottom) (Colour of Object is Blue))
 ((Position of Object is Top)    (Colour of Object is Not_blue))
 ((Position of Object is Middle) (Colour of Object is Not_blue))
 ((Position of Object is Bottom) (Colour of Object is Not_blue))
) : (0.9 1) (0.5 1) (0 0.1) (0 0) (0 0) (0 0)

where Position and Colour are linguistic variables (i.e. their values are characterised by
fuzzy sets) with the possible fuzzy set values {Top, Middle, Bottom} and
{Blue, Not_blue} respectively (resulting in six conjunctions B_i). The conjunctive
rule can be viewed as a special case of the causal rule, i.e. it corresponds to an extended
rule representing the two conditionals Pr(Head | Body) and Pr(Head | ¬Body). This rule
structure is a very high-level means of representing conditional probabilities (which are
normally represented in tables or lists). The main difference here is that the
propositions are imprecise (specified in terms of fuzzy sets) rather than crisp.
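As a rough illustration, the following Python sketch (not Fril code) stores the Summer_sky extended rule as a list of (body conjunction, support interval) pairs and combines the body probabilities for a given object using the generalised form of Jeffrey's rule that underlies rule-level inference (Section 6.2.3), Pr(Head) = Σ_i Pr(Head | B_i) · Pr(B_i), applied to both interval bounds; the function name and the example body probabilities are assumptions made for illustration.

```python
# Hypothetical sketch: a causal (extended) rule as (body conjunction, support interval)
# pairs; the head support is obtained by weighting each interval by Pr(B_i).
summer_sky_rule = [
    (("Top",    "Blue"),     (0.9, 1.0)),
    (("Middle", "Blue"),     (0.5, 1.0)),
    (("Bottom", "Blue"),     (0.0, 0.1)),
    (("Top",    "Not_blue"), (0.0, 0.0)),
    (("Middle", "Not_blue"), (0.0, 0.0)),
    (("Bottom", "Not_blue"), (0.0, 0.0)),
]

def head_support(rule, body_probs):
    """body_probs maps each body conjunction B_i to Pr(B_i) for the current object;
    the B_i are assumed mutually exclusive and exhaustive."""
    lower = sum(u * body_probs.get(body, 0.0) for body, (u, v) in rule)
    upper = sum(v * body_probs.get(body, 0.0) for body, (u, v) in rule)
    return lower, upper

# An object judged mostly "Top and Blue" yields a support interval for Summer_sky.
print(head_support(summer_sky_rule, {("Top", "Blue"): 0.8, ("Middle", "Blue"): 0.2}))
```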

((Classification of Object is CLASS)                           Head/Consequent
 B1
 ...
 Bi                                                            Body/Antecedents
 ...
 Bm)
 : ((u1 v1) ... (ui vi) ... (um vm))                           Rule Supports

Figure 6-3: Fril causal relational rule structure.

6.2 INFERENCE

The main forms of representing domain knowledge in the Fril programming
environment were described in the previous section. General inference within Fril is
considered here, while the next section considers the decision making processes used
within this framework of knowledge representation.

Inference in Fril occurs at three different levels: at the body proposition level; at the
body level; and at the rule level. At all three levels, inference is based upon

conditionalisation (except in the case of the body level of the evidential logic rule).
Since the conjunctive rule is a simplified version of the extended rule (and differs from
the evidential logic rule only in terms of inference at the body level), the
inference process is presented from the conjunctive rule perspective. In Fril, it is
possible to perform inference in point-valued or interval-valued mode; however, the
new approaches to knowledge discovery introduced in this book are currently limited to
point-valued inference. Future work could harness the more expressive interval-valued
representation and inference. Consequently, the presentation is limited for the most part
to point-valued inference.

6.2.1 Inference at the body proposition level


As seen previously, Fril rules can be decomposed into head and body propositions that
are fuzzy in nature, i.e. they are linguistic variables. When new evidence is presented to the
system, a "match" between the body propositions and the evidence needs to be
performed in order to enable higher-level inference (rule body level inference); that is,
the level of support for a body-level proposition given the new evidence needs to be
calculated. This is achieved by conditioning each fuzzy proposition in the body on the
evidence. In other words, the posterior probability Pr(F_i = FS_iCLASS | Data_i) is
calculated, that is, the probability of the variable F_i having the fuzzy set value FS_iCLASS given
that its current value is Data_i. This is achieved using the mass assignment theory
conditioning operation of semantic unification (see Section 5.3.3.1), which allows the
conditioning of one fuzzy set given another, through their corresponding mass
assignment representations (see Section 5.4.1 for details of this transformation). Unlike
classical reasoning, where unification is performed at a syntactic level (pattern
matching), here, where vague statements are represented by fuzzy sets, unification is
performed at a semantic level, by the numerical manipulation of the corresponding
membership functions via the semantic unification operation. In order to use the
semantic unification operation, the feature fuzzy set value (and the Data if necessary)
needs to be transformed to a mass assignment. This operation provides a very natural
and formal means of measuring the degree of "match" between concepts expressed in
terms of fuzzy sets. In traditional fuzzy set theory this "match" operation can be
achieved using a variety of means (such as the possibilistic match presented in Section
3.6) depending on the underlying assumptions of the calculus used; this can be viewed
as both a weakness (no coherent approach) and an advantage (very expressive) of
traditional fuzzy set theory.

Though semantic unification comes in two flavours - interval and point-valued - the
work presented in this book has been limited to point-valued semantic unification. Point
semantic unification can be thought of as corresponding to the
expected value of the membership of a fuzzy set f given the least prejudiced distribution
(LPD) of a fuzzy set g [Baldwin, Lawry and Martin 1996]. This is expressed more
succinctly as follows for the discrete case:

Pr(f | g) = Σ_{i=1}^{n} μ_f(x_i) · LPD_g(x_i)                    (6-1)

where both fuzzy sets f and g are defined over the discrete universe Ω_x = {x_1, x_2, ..., x_n}.
The continuous case is as follows:

Pr(f | g) = ∫_{x ∈ Ω_x} μ_f(x) · LPD_g(x) dx

where f and g are both defined over the continuous universe Ω_x.
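To make the discrete case concrete, the following Python sketch computes Equation 6-1, assuming a normalised fuzzy set g whose mass assignment is taken from its nested alpha-cuts and converted to a least prejudiced distribution by sharing each mass equally among the elements of the corresponding cut (in the spirit of Section 5.4); the helper names and the example fuzzy sets are illustrative assumptions, not part of Fril.

```python
def lpd(fuzzy_set):
    """Least prejudiced distribution of a discrete fuzzy set {element: membership}.
    Masses come from the nested alpha-cuts of the (assumed normalised) fuzzy set
    and are shared equally among the elements of each cut."""
    levels = sorted(set(fuzzy_set.values()), reverse=True) + [0.0]
    dist = {x: 0.0 for x in fuzzy_set}
    for hi, lo in zip(levels, levels[1:]):
        cut = [x for x, m in fuzzy_set.items() if m >= hi]
        for x in cut:
            dist[x] += (hi - lo) / len(cut)
    return dist

def point_semantic_unification(f, g):
    """Pr(f | g): expected membership of f under the LPD of g (Equation 6-1)."""
    return sum(f.get(x, 0.0) * p for x, p in lpd(g).items())

# Two fuzzy sets defined over the same discrete universe {1, 2, 3}.
small = {1: 1.0, 2: 0.6, 3: 0.2}
about_2 = {1: 0.4, 2: 1.0, 3: 0.4}
print(point_semantic_unification(small, about_2))   # 0.6 for this example
```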

6.2.2 Inference at the rule body level


The previous section described how to calculate the degree of support for a body-level
proposition given new evidence using point-valued semantic unification. This
results in a point probability for each body-level proposition, Pr(FS_iCLASS | Data_i).
Inference at the rule body level is concerned with calculating a support for the overall
collection of body propositions. This calculation varies depending on the rule
structure being utilised. For conjunctive rule structures, the body support Body is
calculated by taking the product of the individual point semantic unifications between
the fuzzy sets FS_iCLASS and the data values Data_i as follows:

Body = Π_{i=1}^{m} Pr(FS_iCLASS | Data_i)

On the other hand, in the case of evidential logic rules the body support Body is
calculated in two steps as follows:

Body' = Σ_{i=1}^{m} Pr(FS_iCLASS | Data_i) · w_i

where w_i is the weight of importance associated with feature i. The second step involves
taking the intermediate value Body' and passing it through the filter function, which
yields the body support Body as follows:

Body = filter(Body')

The filter step can be bypassed by setting it to the identity function, i.e. filter(x) = x.
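Both body-level calculations are straightforward once the per-proposition supports are available; the Python sketch below assumes those supports have already been produced by point semantic unification, and the supports, weights and filter shown are purely illustrative.

```python
from math import prod

def conjunctive_body(prop_supports):
    """Body support for a conjunctive rule: the product of the proposition supports."""
    return prod(prop_supports)

def evidential_body(prop_supports, weights, filter_fn=lambda x: x):
    """Body support for an evidential logic rule: a weighted sum of proposition
    supports passed through a (possibly identity) filter function."""
    body_prime = sum(s * w for s, w in zip(prop_supports, weights))
    return filter_fn(body_prime)

supports = [0.5, 0.8]                        # e.g. Near_top and Sky_blue matches
print(conjunctive_body(supports))            # 0.4
print(evidential_body(supports, [0.3, 0.7])) # 0.71 before any filtering
```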

6.2.3 Inference at the rule level


Having calculated the support for the body of a rule, the support for the rule can be
inferred using Jeffrey's rule of total probabilities [Baldwin, Martin and Pilsworth 1988;
Jeffrey 1983]. This section begins by briefly describing this inference process as an
updating process that uses Jeffrey's rule [Lawry 1996]. This is followed by a
description and concrete example of how Jeffrey's rule is used for inference at the rule
level.

The following generic conjunctive rule:


(Head Body) : (u1 u1)(u2 u2)

(where for simplicity, the associated intervals are reduced to points) can be viewed,
from a probabilistic perspective, as denoting the following probabilities:

• Pr(Head | Body) = u1, the conditional probability of the Head proposition
given the Body;
• and Pr(Head | ¬Body) = u2, the conditional probability of the Head proposition
given the complement of the Body.

To simplify the presentation here, the probabilities Pr(Head | Body) and
Pr(Head | ¬Body) are generically denoted as follows: Pr(Head | Body_i). In a sense, each
conjunctive rule represents general knowledge about a population of objects P.
Consequently, each Pr(Head | Body_i) can be viewed as the proportion of objects from P
satisfying Body_i for which the Head proposition is satisfied (or at least an estimate of
that proportion). Now suppose some knowledge about an object obj in the population P
becomes available. More specifically, this knowledge, which may have been provided
by an expert or found experimentally, consists of the probabilities that obj satisfies
Body_i, denoted as follows:

Prob_obj(Body_i) for i = 1, 2 in this case.

These probabilities are specific to the knowledge relating to obj and are not necessarily
related to the prior probabilities Pr(Body_i). Given this new information about obj, how
can the probability of the Head proposition, denoted by Prob_obj(Head), be updated?
Jeffrey's rule facilitates the update of the probability of a proposition using the theorem
of total probabilities when new information becomes available about a specific
instance. This is formally accomplished as follows:

Prob_obj(Head) = Pr(Head | Body) · Prob_obj(Body) + Pr(Head | ¬Body) · Prob_obj(¬Body)

In terms of inference for conjunctive rules, the support for the Head proposition is
calculated as follows:

Pr(Head) = Pr(Head | Body) · Pr(Body) + Pr(Head | ¬Body) · Pr(¬Body)        (6-2)

where Pr(Head | Body) and Pr(Head | ¬Body) correspond to the interval probabilities
associated with the rule and Pr(Body) is calculated as described in Section 6.2.2. Consider
the knowledge base presented in Figure 6-4, which consists of a rule, subsequently referred
to as Rule 1, describing the concept of a Summer Sky in terms of Position and Colour
attributes, and two facts.

Querying this knowledge base with the query qs((Classification of region1 is WHAT))
results in the following inference steps:

Level 1: The supports for the body propositions of Rule 1, i.e.
Pr(Near_top | 100) and Pr(Sky_blue | 200), need to be calculated.
Hypothetically, these are set as follows: Pr(Near_top | 100) = 0.5
and Pr(Sky_blue | 200) = 0.8. These values would normally be
generated as a result of semantic unification.

Level 2: The support for the body of Rule 1, Pr(Body), is subsequently
calculated. This is done by taking the product of Pr(Near_top | 100)
and Pr(Sky_blue | 200), resulting in the following:

Pr(Body) = 0.5 · 0.8 = 0.4

Level 3: Subsequently, the support for the Head of Rule 1, Pr(Head), is
calculated using Jeffrey's rule (Equation 6-2) as follows:

Pr(Head) = 0.9 · 0.4 + 0.5 · (1 - 0.4) = 0.66

The result of the inference is that region1 is a Summer_sky with probability 0.66.

GIVEN   ((Classification of Object is Summer_sky)
         (Position of Object is Near_top)
         (Colour of Object is Sky_blue)
        ) : (0.9 0.9)(0.5 0.5)

AND     ((Position of region1 is 100)) : (1 1)   // 0 and 255 correspond to
                                                 // the top and the bottom
        ((Colour of region1 is 200)) : (1 1)     // 0 and 255 correspond to
                                                 // the lack or presence of blue

QUERY   qs((Classification of region1 is WHAT))

RESULT  ((Classification of region1 is Summer_sky)) with support 0.66

Figure 6-4: An example of inference in Fril reasoning.

In the knowledge discovery approaches introduced in this book, the representation rules
are restricted to equivalence rules, i.e. the support pairs associated with each rule are of
the form (1 1)(0 0). Consequently, the calculation of the support for the Head
proposition of a rule is simply the support for the body, i.e. Pr(Head) = Pr(Body).
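Putting the three levels together, the following Python sketch reproduces the worked example of Figure 6-4, with the Level 1 supports hard-coded (0.5 and 0.8) in place of real semantic unification; it is an illustration of the calculation rather than Fril itself.

```python
def jeffreys_rule(p_head_given_body, p_head_given_not_body, p_body):
    """Equation 6-2: Pr(Head) via the theorem of total probabilities."""
    return p_head_given_body * p_body + p_head_given_not_body * (1.0 - p_body)

# Level 1: supports for the body propositions (normally from semantic unification).
prop_supports = [0.5, 0.8]          # Pr(Near_top | 100), Pr(Sky_blue | 200)

# Level 2: body support of the conjunctive rule (product of proposition supports).
p_body = 1.0
for s in prop_supports:
    p_body *= s                      # 0.4

# Level 3: head support using the rule supports (0.9 0.9)(0.5 0.5).
print(jeffreys_rule(0.9, 0.5, p_body))   # 0.66
```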

6.3 DECISION MAKING

The previous section has described how general inference in Fril is performed in three
stages. The decision making processes used within the Fril framework of knowledge
representation are described presently. From a knowledge discovery perspective, the

decision making process varies depending on whether the problem domain is


classification-based (discrete output variable) or regression-based (also known as prediction,
i.e. the output variable is continuous in nature). In general, when dealing with systems
where the individual universes are granulated by fuzzy sets, multiple fuzzy sets and
hence multiple fuzzy rules are called upon to deduce an answer for a test sample, i.e.
inference is performed in a data driven manner - forward chaining as in fuzzy logic. For
any particular test case, inference is performed on each rule separately and then the
results of individual rule inference are combined to give a final overall outcome.
Basically, a level of support s_j is calculated for the head of each class rule
(Classification of Object is CLASS_j) using the inference strategies presented above. In
the case of classification problems, the classification of the input data vector (decision
making) is determined as the class CLASSmax associated with the hypothesis with the
highest support. A modified decision making procedure based upon utility theory could
alternatively be used, where the posterior probability Sj (hypothesis support) is
multiplied by the utility value of the respective hypothesis and then the classification of
the input data vector (decision making) is determined as the class CLASSmax associated
with the hypothesis that maximises the resulting expected utility [Lindley 1985].

On the other hand, in the case of prediction problems, the prediction of the value of the
output variable associated with the input data vector is achieved using a process known
as defuzzification. Here a similar strategy to that used in fuzzy logic could be employed, i.e. use
any of the standard defuzzification procedures such as Centre of Area (COA) or Centre of
Gravity (COG), replacing the fuzzy rule activation with the rule support (see Section 4.3
for details). However, here, a procedure that incorporates the spirit of mass assignment
theory is chosen. The result of inference is a collection of rule hypotheses of the form
((Classification of Object is CLASS_j)) : (α_j), which have non-zero supports α_j. In this
case CLASS_j is a fuzzy set defined over the universe of the output variable. The
defuzzification procedure selected here involves firstly calculating the expected value
v_j of the least prejudiced distribution associated with each fuzzy set CLASS_j via the mass
assignment associated with CLASS_j. This yields a collection of values v_j and supports
from their respective head clauses, {(v_j) : (α_j)}. Taking the expected
value of these values then yields a point value, the result of reasoning. In other words, the
inferred point value is calculated as follows:

v = Σ_{j=1}^{c} v_j · α_j

where v_j is the expected value of the least prejudiced distribution associated with the
fuzzy set CLASS_j and c denotes the number of class rules in the rule base. The value v
corresponds to the predicted output value for the system.
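The two decision making modes can be sketched as follows; the class names, supports and expected values are illustrative assumptions, and the prediction function simply applies the weighted sum v = Σ v_j · α_j given above.

```python
def classify(rule_supports):
    """Classification: select the class whose rule head has the highest support."""
    return max(rule_supports, key=rule_supports.get)

def defuzzify(rule_supports, expected_values):
    """Prediction: combine the expected value of each output fuzzy set's least
    prejudiced distribution, weighted by the corresponding rule support."""
    return sum(expected_values[c] * a for c, a in rule_supports.items())

print(classify({"Summer_sky": 0.66, "Cloudy_sky": 0.21}))        # Summer_sky
print(defuzzify({"small": 0.3, "medium": 0.7},
                {"small": 10.0, "medium": 25.0}))                # 20.5
```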

6.4 SUMMARY

Mass assignment theory and related theories of uncertainty and imprecision form the
basis of knowledge representation and reasoning for the Fril support logic programming
environment [Baldwin, Martin and Pilsworth 1988]. Essentially, Fril is an efficient


general logic programming language with special structures to handle uncertainty and
imprecision. The main structures that can be used to represent domain knowledge in
Fril were described and illustrated: unconditional propositions; and conditional
propositions (probabilistic if-then rules). Inference is based upon support logic
(reasoning in terms of probability intervals). Inference in Fril is based primarily upon
conditioning, whereas inference in fuzzy logic is based upon an extension of modus
ponens. The representation and reasoning mechanisms presented here form a
foundation upon which additive Cartesian granule features models are built (see
Chapter 8 for more details).

6.5 BIBLIOGRAPHY

Baldwin, J. F., Lawry, J., and Martin, T. P. (1996). "Efficient Algorithms for Semantic
Unification." In the proceedings of IPMU, Granada, Spain, 527-532.
Baldwin, J. F., Martin, T. P., and Pilsworth, B. W. (1988). FRIL Manual. FRIL Systems
Ltd, Bristol, BS8 1QX, UK.
Baldwin, J. F., Martin, T. P., and Pilsworth, B. W. (1995). FRIL - Fuzzy and Evidential
Reasoning in A.I. Research Studies Press (Wiley Inc.), ISBN 0863801595.
Jeffrey, R. C. (1983). The Logic of Decision. University of Chicago Press, Chicago and
London.
Lawry, J. (1996). "Knowledge representation course notes", Report No. Course Notes,
Department of Engineering Maths, University of Bristol, UK.
Lindley, D. V. (1985). Making decisions. John Wiley, Chichester.
PART III
MACHINE LEARNING

The discussion so far in this book has focused on knowledge representation and
different soft computing realisations. It has partly assumed that a programmer has built
in all the intelligence in a system. In general, and certainly in the case of complex
systems, solving problems through programming computers using these forms of
knowledge representation is a mammoth task, often beyond human specification; in
short, manual programming is not necessarily the best approach for the program or the
programmer. Whenever a software program (model) has incomplete knowledge of the
problem domain in which it operates, learning is often the only way a program can
acquire what it needs to know. Learning thus provides autonomy but more importantly
it provides a powerful way to tackle problems, which previously were considered
beyond the scope of human programming. Examples of these problems include the
recognition of human motion, or the classification of protein types based on the DNA
sequence from which they were generated. This part of the book consists of one chapter
that covers the field of machine learning - a subfield of AI concerned with programs
that learn from experience. It introduces the basic architecture and components of
learning systems. In addition, it provides an overview of the three broad categories of
machine learners, namely, supervised learners, reinforcement learners and
unsupervised learners. This chapter focuses in particular on supervised learning, as one
of the main goals of this book is to introduce new supervised learning algorithms for
Cartesian granule feature models (see Part IV). Popular induction algorithms including
the C4.5 decision tree induction algorithm, the naïve Bayes classifier induction
algorithm and the fuzzy data browser are also described.
CHAPTER 7
MACHINE LEARNING
The ability to learn is considered the conditio sine qua non of intelligence, which makes
it an important concern for both cognitive psychology and artificial intelligence. The
field of machine learning (ML), which crosses these disciplines, studies the
computational processes that underlie learning in both humans and machines. The
field's main objects of study are the artefacts [Langley 1996], specifically algorithms
that improve their performance at some task with experience. The goal of this chapter is
to introduce techniques designed to acquire knowledge in this manner and to provide a
framework for understanding the relationships among such methods, and in particular
the machine learning approaches proposed and presented later in this book.

The chapter begins with a brief overview of the somewhat roller-coaster history of
machine learning. Various strategies for learning (from a human perspective) are
subsequently introduced, which leads to the most prevalent form of computational
learning; inductive learning - constructing a description of a function from a set of
input/output examples. Formal definitions of machine learning are subsequently
provided before the three main categories of machine learning, namely, supervised
learning, reinforcement learning and unsupervised learning, are described; focussing in
particular on supervised learning, as one of the main goals of this book is to introduce
new supervised learning algorithms for Cartesian granule feature models (see Part IV).
As part of this focus, popular induction algorithms including the C4.5 decision tree
induction algorithm, the naïve Bayes classifier induction algorithm and the fuzzy data
browser are described and illustrated using the car parking problem from Chapter 1.
The presentation of each category is supplemented with a taxonomy of associated
learning algorithms. Inductive learning is then described in detail, viewing induction as
a search process in the space of possible hypotheses (induced computational models) in
which factors such as generalisation, model performance measures, inductive bias and
knowledge representation play important roles. The chapter finishes by looking at some
of the goals, accomplishments and open issues in machine learning.

7.1 HISTORY OF MACHINE LEARNING

Many of the central concepts in machine learning have a long history. For example,
Hume [Hume 1748] describes induction, a fundamental notion in "generalisation
learning", but it was not until the 1950s that an interest in computational approaches to
learning really developed with the birth of artificial intelligence (AI) and cognitive
science. From the outset both areas addressed a varied and ambitious agenda, with
topics including game playing, letter recognition, abstract concepts and verbal memory.
Learning was viewed as a central feature of intelligent systems and work on both


learning and performance was concerned with developing general methods for
cognition, perception and action. Since the 1950s, computer scientists have tried, with
varying degrees of success, to give computers the ability to learn. This period can be
divided conveniently into three periods of activity [Shavlik and Dietterich 1990a]:

• exploration (1950s and 1960s);


• development of practical algorithms (1970s);
• explosion of research directions and applications (1980s to present).

Work during the exploration period of machine learning focused on developing


computational analogues of various neuro-physiological, biological and psychological
phenomena. This led to the introduction of various approaches to supervised and
reinforcement learning algorithms. For example, the demonstration of how neuron-like
networks could compute, by McCulloch and Pitts [McCulloch and Pitts 1943], inspired
several groups to work on developing learning algorithms to train them; foremost of
these was Rosenblatt's perceptron [Rosenblatt 1958]. In parallel, various groups
worked on simulating nature; this included the landmark work of Friedberg, where he
attempted to solve simple problems by teaching a computer to write computer programs
[Friedberg 1958; Friedberg, Dunham and North 1959] using a framework similar to
modern day genetic algorithms. On the psychological front, numerous groups tried to
build simple symbolic processing systems to model human learning based on
psychological lab experiments [Feighenbaum 1961; Hunt, Marin and Stone 1966]. For
example, Feighenbaum [Feighenbaum 1961] introduced the EPAM system, a computer
program, using a knowledge form similar to modern day decision trees, that was
designed as a psychological model of how humans memorise nonsense syllables. Other
pioneers during this period included Samuel [Samuel 1959], who coined the phrase
machine learning. His work focused primarily on developing a series of programs for
checkers (draughts) that eventually learned to play the game at tournament-level.

In the mid-1960s, both AI researchers and psychologists realised the importance of


domain knowledge, which resulted in a major paradigm shift, towards the manual
construction of knowledge intensive systems. During this period, most AI
researchers avoided issues of learning, while they attempted to understand the role of
knowledge in intelligent behaviour. Research on knowledge representation, natural
language and expert systems dominated this era, resulting in many successful
applications including the expert systems R1 (computer configuration) [McDermott
1982] and Prospector (mineral exploration) [Duda, Gaschnig and Hart 1979]. This
resulting lull in learning research, especially in neurophysiological modelling, was
somewhat amplified by Minsky and Papert's theoretical work [Minsky and Papert
1969], which illustrated the limitations of the perceptron. However, some work on
learning continued in the background, incorporating the representations and heuristic
methods that had become central to AI. This led to the development of many practical
learning algorithms, along with convincing demonstrations on real world problems.
These include: Winston's work on concept learning and language acquisition in the
early 70s within the blocks world [Winston 1975]; Buchanan and Mitchell
demonstrated their MetaDendral system for learning mass-spectrometry prediction
rules [Buchanan and Mitchell 1978]; Michalski introduced the AQ rule-based
learning algorithms with an application to the diagnosis of disease in soybeans
[Michalski and Chilausky 1980]; and the ID3 decision tree learning algorithm was

introduced with its effectiveness illustrated on learning chess end-game rules [Quinlan
1983]. Even though Bryson and Ho introduced the back propagation algorithm for
training neural networks in 1969 [Bryson and Ho 1969], it was largely ignored until the
resurgence of interest in neural networks in the mid-eighties.

Notwithstanding these successes, expert systems turned out to be brittle and to have difficulty
handling inputs that are novel or noisy. This, coupled with the introduction of many
practical learning algorithms, and convincing demonstrations on real world problems,
helped shift the attention from the static question of how to represent knowledge to the
dynamic quest of how to acquire it. As a result, in the late 1970s, a new interest in ML
emerged within the AI community that grew rapidly over the course of a few years.
This interest was further motivated by the frustration with the encyclopaedic flavour
and domain-specific emphasis of expert systems, and the opportunity of returning to
general principles afforded by machine learning.

By the early 1980s, machine learning was recognised as a distinct scientific discipline,
branching out from the traditional areas of concept induction and language acquisition
to areas of machine discovery and problem solving. The past two decades have seen an
explosion of research directions in theory, algorithms and applications within the field
of machine learning. Many new methods have been proposed, older techniques
revisited, such as neural networks, along with the development of a host of inter-
disciplinary approaches such as soft computing based learning techniques. Whereas
traditional AI researchers focused on abstract toy-world problems (e.g. blocks world),
ML researchers, especially in the recent past, have become more serious about the real
world potential of learning algorithms and this has led to the development of new fields
such as knowledge discovery in databases (KDD), and text mining. This phenomenon
is also depicted in this book, where the proposed knowledge discovery process (centred
on the constructive induction of Cartesian granule feature models) is applied to a
variety of real-world problems with very encouraging results (see Chapters 10 and 11).

Section 7.4 examines the explosive growth of machine learning by presenting an


overview of some of the important machine learning algorithms categories. Before that,
human learning is presented (sometimes considered as a role model for machine
learning), while machine learning is subsequently defined in Section 7.3.

7.2 HUMAN LEARNING

Human learning, according to [Agency 1995; Honey and Mumford 1992] is the
acquisition over time of a variety of skills, knowledge, experience or attitudes by the
individual. Learning can be seen as a "change in human disposition or capability,
which can be retained, and which is not simply ascribable to the process of growth".
Learning can be measured and observed via these changes in behaviour. In every
learning situation the learner transforms information provided by a teacher (or
environment) into some new form in which it is stored for future use. The nature of the
transformation determines the type of learning strategy used. Several basic categories
exist such as rote learning, learning from instructions, deductive learning, learning by
analogy and inductive learning [Honey and Mumford 1992]. This list is ordered - from
shallow learning to deep learning - by the increasing complexity of the transformation
(inference) from the information initially provided to the actual knowledge ultimately
acquired. This order reflects increasing effort on the part of the learner and possibly
correspondingly decreasing effort on the part of the teacher. In any act of human
learning, a mixture of these strategies is usually involved.

In rote learning there is basically no transformation; the information provided by the
teacher is more or less directly accepted and memorised by the learner, and rote learning
could subsequently, quite reasonably, be excluded as a learning strategy. In learning by
instruction (or by being told), the basic transformations performed by the learner are
selection and reformulation (mainly at a syntactic level) of information provided by the
teacher. Deductive learning can be viewed as explanation-based learning, in that the
learner merely draws and stores conclusions (possible explanations) with certainty from
facts known to be true. In a sense, a deduction simply rearranges given information but
does not go beyond it.

If learning were restricted to the approaches presented so far, then people would be
hopelessly restricted in the conclusions they could draw. Often there is a need to go
beyond the information given, i.e. to generalise to unseen scenarios. This leads to
inductive learning, where the transformation process involves generalisation of the
input information and selection of the most desirable result, that is, generalised
knowledge is inferred from particular examples. The process that derives new
generalised knowledge from particular examples is known as inductive inference. The
price one pays for this ability is the loss of the guarantee (introduction of uncertainty)
that the conclusions follow from the information given. Finally, learning by analogy is
a mixture of both inductive and deductive reasoning. The following are examples of
inductive learning, taken from [Holland et al. 1986], covering most inferential processes
that expand knowledge in the face of uncertainty:

"The mother of a four-year-old boy, observing that he has been unusually


cranky and obdurate for several days, decides that he has entered a "phase".
A laboratory rat, busily pressing a lever to obtain food, hears a distinctive
tone, which is followed by an electric shock. The very next time the animal
hears the tone, it hesitates in its lever-pressing activity, waiting, one is tempted
to say, for the other shoe to drop. A nineteenth-century scientist observes the
behaviour of light under several types of controlled conditions and decides
that, like sound, it travels in waves through a medium. "

Understanding such inferential processes is a central concern of philosophy,


psychology, and machine learning (a subfield of AI). However in this book, this interest
is limited to computational models of inductive learning, which is overviewed in the
remainder of this chapter beginning with a definition of machine learning and
description of the main components that make up a machine learning system. This is
followed by an overview of the three main categories of machine learning, namely,
supervised learning, reinforcement learning and unsupervised learning, along with
descriptions of popular learning algorithms.

7.3 MACHINE LEARNING

Machine learning, like most cognitive phenomena, is a very ambiguous term with
definitions abounding in the literature. Some of the more succinct and less ambiguous
definitions include Simon's useful characterisation [Simon 1983]:

"Learning denotes changes in the system that are adaptive in the sense that
they enable the system to do the same task or tasks from the same population
more effectively the next time. "

Mitchell [Mitchell 1997] provides a more specific definition by viewing learning as

"a computer program that improves its performance at some task through
experience"

Broadly speaking, machine learning is concerned with computer programs that


automatically improve with experience. More concretely, "a computer program is said
to learn from experience E with respect to some class of tasks T and performance
measures P, if its performance at tasks in T, as measured by P, improves with
experience" [Mitchell 1997]. This is adapted as a working definition of learning as it
accurately portrays the learning aspects of the work presented in this book. Figure 7-1
depicts the interactions between the main players in this definition of learning: the
learner; the performance evaluator; the model or the component of the computer
program that actually processes inputs (or sensory data, for example, a digital image)
and generates outputs (for example, a command to the navigation system to turn left
45°); and the environment or problem domain that provides experiences (inputs and
outputs).

Figure 7-1: A general model of machine learning systems.

As Figure 7-1 suggests, learning cannot be considered in isolation. A learning system


(learner) always finds itself in some environment about which it attempts to learn. The
learner acquires information from experience, E, of the environment (in terms of
observation data or background knowledge), which it then tries to incorporate into its
model of the environment (knowledge). It always attempts to improve the model's

performance at tasks T, as measured by a performance measure P. Here performance


suggests some quantitative measure of behaviour on a task; this can take on several
forms, such as accuracy (behaviourist view e.g. classification accuracy of diagnosis
system), efficiency (e.g. computation required to generate a plan) and even
understanding (syntactic and semantic measures of knowledge acquired). Learning
involves improvement in performance; thus, learning cannot occur in the absence of a
performance task T. In the ML literature one can partition such tasks simply, as based
upon perception, cognition and action with common areas of application including,
vision, speech, design, natural language, reasoning, planning, game playing, decision
support and control. In this book, machine learning is applied, within the more global
context of knowledge discovery, to problems in vision, medical diagnosis and control.

7.4 CATEGORIES OF MACHINE LEARNING

Machine learning algorithms can be categorised according to the type of training


experience they use to learn. Three broad categories currently exist:

• supervised learning;
• reinforcement learning;
• and unsupervised learning.

A machine learning algorithm can be loosely defined as a computer program that


performs inductive learning when the following are provided:

• a set of examples (for example, the car parking success/failure data from
Section 1.2.1) described in a certain instance/observation language;
• background knowledge (and feature construction operators);
• a hypothesis language to represent the learnt computer models;
• a search mechanism;
• general purpose inference and decision making procedures (which could
possibly be learned also);
• and a performance evaluation function.

Inductive learning lies at the core of most machine learning approaches. Inductive
learning takes specific examples and exploits background knowledge in performing a
search in the model or hypotheses space (sometimes in terms of operations such as
generalisation and specialisation) to form general-purpose hypotheses or models that
cover (represent/summarise/explain) the examples in training set and other cases
beyond. This inductive learning process is crudely depicted in Figure 7-2 as a search
through the hypotheses (model) space that is guided by background knowledge and a
performance evaluator. The main components of a general inductive learning process
are presented in detail later in Section 7.8, but the above definition is sufficient for now,
in order to provide an overview of the main categories of machine learning algorithms
in the next three sections.

Figure 7-2: A simplifying view of machine learning in terms of search through the
space of possible computer models (programs). This search is guided by a performance
component (fitness or cost function) and background knowledge.

7.5 SUPERVISED LEARNING

This section introduces supervised learning, by far to date the most widely applied and
researched category of machine learning. It begins by describing the general
characteristics of supervised learning, which are subsequently illustrated for a
handwritten classification problem. Popular learning algorithms, namely, the C4.5
decision tree induction algorithm, the naïve Bayes classifier induction algorithm and
the fuzzy data browser are then described and illustrated on the car parking problem
from Section 1.2.1. Finally, a taxonomy of supervised learning algorithms is presented.

A supervised learning algorithm is a special type of algorithm that requires direct


feedback, usually in the form of a teacher or expert, in order to learn. That is, the
system predicts that a certain event or situation will have a certain outcome, and the
environment will immediately provide feedback that describes the actual outcome. If
the predicted and actual outcomes are different, the learner adjusts (adapts) its model of
the task so that in future, if such a situation arises, the model will provide the correct
outcome. Computationally, this is normally realised by providing examples to the
learner (learning algorithm) in the form of situation descriptions and outcomes. This is

more formally defined in terms of input variables, X_1, ..., X_n, that describe the situation
or event, and an output variable, Y, that describes the outcome. The task of the learner
is to model the dependence of an output variable Y (discrete or continuous) on one or
more input (predictor) variables, X_1, ..., X_n, given N example data pairs {(x_i, y_i)}, i = 1, ..., N, and
possibly background knowledge. This results in a model function f.

The system that generated the data is presumed to be described by:

Y = f(X_1, ..., X_n) + e

over the domain (X_1, ..., X_n) ∈ D ⊆ ℝ^n containing the data. The single valued
deterministic function f, of its n-dimensional argument, captures the joint predictive
relationship of Y on X_1, ..., X_n. The additive component e usually reflects the
dependence of Y on quantities other than X_1, ..., X_n that are neither controlled nor
observed, which can lead to models which are deficient; models may be incomplete,
imprecise, fragmentary, not fully reliable, vague, contradictory or deficient in some
other way. In general, these types of deficiencies may result in different types of model
uncertainty. As presented in Part II, some forms of knowledge representation explicitly
incorporate techniques for handling some of these types of uncertainty, thereby
providing a more realistic model of reality. However, to date no panacea approach
exists for this general problem.

Problems considered in supervised learning can be further subdivided into supervised


learning for discrete decision making (commonly known as classification) and
supervised learning for continuous prediction (commonly known as prediction or
regression). In the case of classification, the task of the learner is to generate a model
such that, when it is given a description of an event or object, it will output a discrete-
valued quantity. The handwritten character recognition problem, presented
subsequently, is an example of a classification problem, since the learnt model outputs
a discrete value indicating the character that it predicts given an image event. On the
other hand, the task of the learner for continuous prediction is to generate a model
such that, when it is given a description of an event or object, it will output a real-
valued quantity. For example, learning a model that predicts the stock price of Xerox in
twenty-four hours from now given current stock market information.

Typical tasks that fall under the category of supervised learning include: diagnosis
problems such as whether a patient suffers from diabetes or not; regression problems
such as predicting foreign exchange rates for tomorrow; computer vision problems such
as handwritten character recognition, gesture recognition and automatic vehicle
navigation; natural language and speech processing; control problems such as
controlling a furnace or the docking of a space craft; and decision support systems such
as predicting customer activity (e.g. will a customer default on a bank loan). The task of
learning to classify handwritten characters from classified examples is chosen as an
illustrative problem for supervised learning and is subsequently described.

7.5.1 Learning to recognise handwritten characters


The task here is to correctly classify handwritten alphanumeric characters. The input
variables correspond to binary pixels in a digital image that describe a character event.
These images could be scanned in, or generated via touch sensitive pads or displays.
The output variable corresponds to the classification of the character described by the
input event. Figure 7-3 presents an input-output example, where the input event
consists of 25 binary-valued features (also known as attributes, or variables) and the
output feature is the letter described by the input feature values, in this case, the letter
"X". The learning experience consists of a database of input-output example characters,
as presented in Table 7-1. The performance measure, in this case, is the number of
correctly classified characters. Background knowledge for this problem could consist of
a heuristic stating that the diagonal elements of the digital image are sufficient to
recognise each character. This greatly simplifies the learning process, by restricting the
search for a hypothesis, to the subspace of those made up of diagonal features only.
This problem, from a machine learning perspective, is summarised in Table 7-2.
Figure 7-3: An input-output example in the handwritten character recognition
problem.

Table 7-1: Example database in spreadsheet format for the handwritten character
recognition problem.

Example   Pixel1   ...   Pixel_r   ...   Pixel25   Class
1         v_11     ...   v_1r      ...   v_1,25    c_1
...       ...            ...             ...       ...
t         v_t1     ...   v_tr      ...   v_t,25    c_t
...       ...            ...             ...       ...
N         v_N1     ...   v_Nr      ...   v_N,25    c_N

7.5.2 Examples of supervised learning algorithms


Machine learning algorithms search the space of candidate classifiers (hypotheses
space) for one that performs well on the training data and that is expected to generalise
well to new cases. Numerous supervised learning algorithms have been developed and
typically differ in their response to the following questions:

• How are observations and hypotheses or models represented?


• What search algorithm is to be used?
• What performance measures are employed to evaluate candidate
classifiers?

The next three subsections answer these questions using three popular supervised
learning algorithms: the C4.5 decision tree learning algorithm; the naïve Bayes classifier
induction algorithm; and the data browser. This is supplemented in the following
section (Section 7.5.3) with a taxonomy of supervised learning approaches.

Table 7-2: Machine learning tableau for recognising hand-written characters.

ML Tableau for recognising hand-written characters
Task T                   Recognise handwritten characters within images
Performance Measure P    Percentage of characters recognised correctly
Training Experience E    A database of handwritten characters with classifications
Input variables          Pixels in the image
Predicted variable       Classification of a character

7.5.2.1 Learning decision trees


The C4.5 algorithm [Quinlan 1983; Quinlan 1993] learns decision trees from example
data. Figure 7-4 portrays a hypothetical binary decision tree for the car parking
problem presented in Section 1.2.1. To classify a case, the root node (node 1 in Figure
7-4) is tested as a true or false decision point. Depending on the result of the test
associated with the node (i.e. NumberOfFreeSpaces > 50), the case is passed down the
appropriate branch, and the process recursively continues. When a terminal or leaf node
(highlighted in grey in Figure 7-4) is reached, the class value associated with the node
is the answer or classification. For the parking problem two classes are considered:
successful; and unsuccessful. The paths to the leaf nodes are mutually exclusive.

To learn decision trees, the C4.5 algorithm is presented with a database of examples in
spreadsheet format (see Table 7-1). The database is split into two smaller databases:
one for training and one for testing. Using the training examples, the task of C4.5 is to
determine the nodes in the tree and the tests associated with the non-terminal nodes.
C4.5 searches the space of decision trees through a constructive search. It first
considers all trees consisting of only a single root node and chooses the best one. A
number of measures have been proposed for evaluating the best feature, including
entropy, which measures the information content or purity of a feature. Consider the
expanded car parking problem (Section 1.2.1.1), which consists of four input features,
TimeToDestination, NumberOfFreeSpaces, OccurrenceOfAPublicEvent,
affectedStreets, and an output feature parkingStatus. For this problem, C4.5 determines
that the NumberOfFreeSpaces feature provides the most information and is thus
selected as a root node (as depicted in Figure 7-4). A condition is then associated with
the root node. Once again various approaches have been proposed in the literature for
generating this condition, including entropy; that is, a condition point in the feature's
universe is chosen so as to minimise the entropy. The resulting condition generates a
binary partition of the data (only considering binary trees here) corresponding to two

branches. For the parking problem, the condition NumberOfFreeSpaces > 50 is


generated for the root node. Subsequently, C4.5 considers all trees having that root
node and various left children and ultimately chooses the best one. Returning to the
parking problem, C4.5 selects the TimeToDestination feature to expand the left branch
of the root node. This recursive process of expanding the tree and consequently,
partitioning the data continues until the data at a node corresponds to one class or is
dominated by one class (for example, 80% of the data at this node refers to one class).
At which point, a leaf node is generated and assigned the label of the dominating class.
Performance of the induced model is evaluated based on the classification accuracy of
the tree on the test dataset.

Figure 7-4: A decision tree for the car parking problem (the root node tests
NumberOfFreeSpaces > 50; lower nodes test TimeToDestination and
NumberOfFreeSpaces < 30).
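The following Python sketch illustrates the entropy-based choice of a binary split threshold on a single numeric feature, in the spirit of the description above rather than Quinlan's actual C4.5 implementation; the car parking values and labels are invented for illustration.

```python
from math import log2
from collections import Counter

def entropy(labels):
    """Information content (in bits) of a list of class labels."""
    n = len(labels)
    return sum(-(c / n) * log2(c / n) for c in Counter(labels).values()) if n else 0.0

def best_threshold(values, labels):
    """Choose the cut point on one feature that minimises the weighted entropy
    of the two resulting branches (value <= t versus value > t)."""
    best = None
    for t in sorted(set(values))[:-1]:
        left = [l for v, l in zip(values, labels) if v <= t]
        right = [l for v, l in zip(values, labels) if v > t]
        score = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
        if best is None or score < best[1]:
            best = (t, score)
    return best

free_spaces = [80, 60, 55, 40, 20, 10]
status = ["success", "success", "success", "fail", "fail", "fail"]
print(best_threshold(free_spaces, status))   # (40, 0.0): a pure split on this toy data
```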

7.5.2.2 Learning naive Bayes classifiers


Naïve Bayes classifiers (see Section 5.2.2 for a detailed description) can quite easily be
learned from example data. Consider a classification problem, where the target
function, y = f(x), models a dependency between a target value Y and a set of input
variables X_1, ..., X_n. The target variable Y is discrete, taking values from the finite set
{y_1, ..., y_c}. The naïve Bayes classifier accepts as input a tuple of values <x_1, ..., x_n>
and predicts the target value y, or a classification, using Bayes' theorem (Section 5.1) as
follows:

y = argmax_{y_j} Pr(Y = y_j) · Π_{i=1}^{n} Pr(X_i = x_i | Y = y_j)

The learning algorithm estimates the class conditional probabilities and the class
probabilities from a training dataset, where the class conditionals correspond to the
following:

Pr(X_i | Y)   ∀ i ∈ {1, ..., n}

and the class probability distribution corresponds to

Pr(Y).

The class probability Pr(Y = y_j) is simply the fraction of class y_j in the training dataset.
Each class conditional Pr(X_i | Y = y_j) can be estimated for discrete universes
(originally discrete or discretised continuous universes) using the m-estimate as follows
[Mitchell 1997]:

Pr(X_i = x_k | Y = y_j) = (n_xk + m · p) / (n_c + m)                    (6-1)

where n_c is the number of training examples (sample size) whose target value is y_j, n_xk
is the number of examples whose target value is y_j and whose X_i value is x_k, p is the prior
estimate of the probability being determined here, and m denotes a constant
called the equivalent sample size, which determines how to weight p relative to the
observed data. Note that if m is zero, the m-estimate is equivalent to the fraction n_xk / n_c.
If both n_c and m are nonzero, then the observed fraction n_xk / n_c and the prior p will
be combined according to the weight m. m is called the equivalent sample size as it can
be interpreted as augmenting the n_c actual observations by an additional m virtual
samples distributed according to the prior. A typical way of choosing p in the absence
of other information is to assume a uniform prior; that is, if an attribute X_i has w
possible values then p = 1/w. One interesting difference between naïve Bayes and other
induction algorithms, such as C4.5, is that there is no explicit search through the space
of possible models. Instead, the model is formed using all available features.
Performance of the induced model is evaluated based on the classification accuracy of
the model on the test dataset.

Consider the original car parking problem (Section 1.2.1), which consists of two input
features, TimeToDestination and NumberOfFreeSpaces, and an output feature
parkingStatus. The class conditionals for this problem could be calculated using
Equation 6-1, after both the Ω_TimeToDestination and Ω_NumberOfFreeSpaces universes were
discretised (for example, into uniform intervals or bins of size 5). The value of m, the
equivalent sample size, could be set to the total number of training examples. The
resulting (hypothetical) class probability densities (after interpolating the midpoints of
the bins) are presented in Figure 7-5.
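A minimal Python sketch of the m-estimate of Equation 6-1 for one discretised feature is given below; the bin indices, labels and choice of m are illustrative assumptions rather than values taken from the actual car parking dataset.

```python
from collections import Counter

def m_estimate(n_xk, n_c, p, m):
    """Equation 6-1: Pr(X_i = x_k | Y = y_j) = (n_xk + m * p) / (n_c + m)."""
    return (n_xk + m * p) / (n_c + m)

def class_conditionals(feature_bins, labels, target_class, m):
    """Estimate Pr(bin | target_class) for every bin using a uniform prior p = 1/w."""
    bins = sorted(set(feature_bins))
    p = 1.0 / len(bins)
    in_class = [b for b, y in zip(feature_bins, labels) if y == target_class]
    counts = Counter(in_class)
    return {b: m_estimate(counts.get(b, 0), len(in_class), p, m) for b in bins}

# NumberOfFreeSpaces discretised into bins of width 5 (bin indices shown).
bins   = [16, 12, 11, 8, 4, 2]
status = ["success", "success", "success", "fail", "fail", "fail"]
print(class_conditionals(bins, status, "success", m=len(bins)))
```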

7.5.2.3 Data browser


The data browser is an induction system that automatically extracts evidential logic
rules or conjunctive rules with (one-dimensional) fuzzy set values from statistical data
[Baldwin and Martin 1995]. Consider a classification problem, where the target
function, y = f(x), models a dependency between a target value Y and a set of input
variables X_1, ..., X_n. The target variable Y is discrete, taking values from the finite set
{y_1, ..., y_c}. A data browser induced classifier accepts as input a tuple of values <x_1, ...,
x_n> and predicts the target value y by performing approximate reasoning as described in
Section 6.2, while Section 6.3 describes approximate reasoning when Y is continuous.

The data browser estimates univariate class conditional fuzzy sets from a training
dataset via their corresponding probabilistic class conditionals:

Pr(X_i | Y)   ∀ i ∈ {1, ..., n}

This is enabled by the membership-to-probability bi-directional transformation
presented in Section 5.4. Conditional probabilities are constructed on discretised
universes, where the underlying partitions can be crisp or fuzzy. The underlying
partition of a universe Ω_Xi induces a fuzzy set description of each training example,
which is subsequently converted into its corresponding least prejudiced distribution
using the membership-to-probability bi-directional transformation. The resulting
probabilistic event descriptions are then counted for each fuzzy bin and a frequency
distribution is generated for each attribute X_i. Each frequency distribution is then
converted back to a fuzzy set, the class conditional fuzzy set f_XiYj, which approximates
or summarises the description of class y_j over the universe of X_i. The data browser then
generates a rule for each target class value y_j of the following format (conjunctive rule):

Y is y_j IF value of X_1 is f_X1Yj AND ... AND value of X_n is f_XnYj.
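The counting stage of this construction can be sketched as follows, in a deliberately simplified form that accumulates each training value's membership in every fuzzy bin into a class-conditional frequency distribution; the triangular partition and data are illustrative assumptions, and the final conversion of the counts back into a fuzzy set via the transformation of Section 5.4 is omitted.

```python
def triangular(x, left, centre, right):
    """Membership of x in a triangular fuzzy set with the given support and peak."""
    if x <= left or x >= right:
        return 0.0
    return (x - left) / (centre - left) if x <= centre else (right - x) / (right - centre)

# Illustrative fuzzy partition of the NumberOfFreeSpaces universe [0, 100].
partition = {
    "few":  lambda x: triangular(x, -1, 0, 50),
    "some": lambda x: triangular(x, 0, 50, 100),
    "many": lambda x: triangular(x, 50, 100, 101),
}

def class_conditional_counts(values, labels, target_class):
    """Accumulate membership-weighted counts per fuzzy bin for one class and
    normalise them into a frequency distribution."""
    counts = {w: 0.0 for w in partition}
    for x, y in zip(values, labels):
        if y == target_class:
            for w, mu in partition.items():
                counts[w] += mu(x)
    total = sum(counts.values()) or 1.0
    return {w: c / total for w, c in counts.items()}

free_spaces = [80, 60, 55, 40, 20, 10]
status = ["success", "success", "success", "fail", "fail", "fail"]
print(class_conditional_counts(free_spaces, status, "success"))
```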

Figure 7-5: This figure shows the resulting probability density functions using a naïve
Bayes approach for the parking problem (one density for unsuccessfulParking and one
for successfulParking).

As is the case for the induction of naïve Bayes classifiers, there is no explicit search
through the space of possible conjunctive models. Instead, the model is formed using
all available features. For evidential models, feature selection is performed by
eliminating features associated with low weights, which are calculated via semantic
discrimination analysis (see Section 9.4). Performance of the induced model is
evaluated by the classification accuracy of the model on the test dataset. The class
conditional fuzzy sets f_XiYj for the parking problem are presented in Figure 7-6.

Various extensions to the data browser have been proposed including the extraction of
knowledge in terms of decision trees [Baldwin, Lawry and Martin 1997] and the
extraction of knowledge over multidimensional linguistic variables known as Cartesian
granule features [Baldwin, Martin and Shanahan 1996; Baldwin, Martin and Shanahan
1997; Shanahan 1998], which forms the basis for the learning algorithms described in
Part IV of this book. In addition, alternative feature selection algorithms have been
proposed based upon genetic programming [Baldwin, Martin and Shanahan 1998] (see
Chapter 9).

Figure 7-6: This figure shows the resulting class conditional fuzzy sets using the data
browser approach for the parking problem (one fuzzy set for unsuccessfulParking and
one for successfulParking).

7.5.3 A taxonomy of supervised learning algorithms


A taxonomy of supervised learning approaches is now presented, focusing on the more
frequently used computational learning paradigms:

o symbolic learning;
o evolutionary computing;
o connectionist learning;
o probabilistic learning;
o fuzzy-based learning;
o and case-based learning.

In the following subsections, for completeness, each category is briefly described and
references to literature provided. This section can be skipped on a first read without loss
of continuity (i.e. resume reading at Section 7.6).

7.5.3.1 Symbolic learning algorithms


The symbolic learning paradigm has focused on developing and refining practical
algorithms that acquire knowledge in the form of condition-action rules, decision trees
or similar logical knowledge structures. Some of the more popular approaches here
include decision tree algorithms such as ID3 (more recently C4.5) [Quinlan 1983;
Quinlan 1993], and CART [Breiman et al. 1984]. The C4.5 decision tree learning
algorithm is described in Section 7.5.2.1. Decision trees have a long history within
machine learning, having their roots in EPAM [Feighenbaum 1961], a cognitive
simulation of human concept learning. CLS [Hunt, Marin and Stone 1966] used a
heuristic lookahead method to construct decision trees, while ID3 added the crucial idea
of using information content as a means of specialising hypotheses. Other symbolic
approaches include rule induction techniques such as AQ [Michalski and Chilausky
1980], and predicate logic approaches such as FOIL [Quinlan 1990] and CIGOL
[Muggleton and Buntine 1988]. Though most symbolic induction algorithms use a hill-
climbing search strategy to search the possible model space, evolutionary search
techniques have recently been shown to be a successful alternative [Banzhaf et al. 1999;
Wong and Leung 1995], avoiding problems such as local optima that can occur using
hill-climbing strategies. Performance is generally measured in terms of the model
accuracy on a test dataset.
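
To illustrate the idea of using information content as a means of specialising hypotheses, the following Python sketch (invented attribute and class names, not code from ID3 or C4.5 themselves) computes the entropy-based information gain that such decision tree algorithms use to choose a splitting attribute:

from collections import Counter
from math import log2

def entropy(labels):
    # Shannon entropy of a list of class labels
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

def information_gain(rows, attribute, target):
    # Entropy reduction obtained by splitting the rows on the given attribute
    base = entropy([r[target] for r in rows])
    remainder = 0.0
    for v in {r[attribute] for r in rows}:
        subset = [r[target] for r in rows if r[attribute] == v]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder

rows = [{"outlook": "sunny", "play": "no"}, {"outlook": "rain", "play": "yes"},
        {"outlook": "sunny", "play": "no"}, {"outlook": "overcast", "play": "yes"}]
print(information_gain(rows, "outlook", "play"))    # 1.0 for this toy dataset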

7.5.3.2 Evolutionary computing


Evolutionary computation is a branch of soft computing that, by analogy with the
phenomenon of evolution in nature, attempts to solve problems (learning in this case, by
adapting the structure and parameters of a model, in order to optimise some
performance function) through the processes of natural selection and reproduction.
Several versions of evolutionary computing exist including genetic algorithms [Holland
1975], genetic programming [Koza 1992], evolutionary programming [Fogel, Owens
and Walsh 1966], and evolutionary strategies [Schwefel 1995]. All approaches build on
ideas originally presented by Friedberg [Friedberg 1958; Friedberg, Dunham and North
1959] who tried to solve simple problems by teaching a computer to write Fortran
computer programs through simulated evolution. He used a framework similar to
modern genetic algorithms. The presentation here is limited to genetic programming
and genetic algorithms, as they lie at the core of the induction algorithms presented
later in this book (see Appendix for more details on evolutionary computing). Though
the principal ideas behind evolutionary computation originated in the work of
Friedberg, it was not until the mid-seventies [Holland 1975] that genetic algorithms
were accepted and illustrated (both empirically and theoretically) as robust search
techniques in complex spaces, such as hypotheses spaces. Genetic programming was
subsequently introduced in the late eighties by Koza [Koza 1992] as a more flexible
extension of genetic algorithms. In these genetic approaches, knowledge is encoded in
terms of fixed-length or variable-length chromosome structures (either list-like or tree-like in
nature). These chromosome structures represent programs (that perform a task)
expressed in problem variables and functions (typically algebraic and logic based).
These chromosome structures are manipulated by various genetic operators such as
mutation, crossover and reproduction in order to converge on a suitable model of a
problem domain. Recently, genetic based approaches have been applied successfully to
a variety of machine learning problems [Goldberg and Deb 1991; Koza 1992; Koza
1994; Tackett 1995]. Performance is generally measured in terms of the model
accuracy on a test dataset, and model simplicity. Genetic programming forms the
backbone of the G_DACG constructive induction algorithm presented in Chapter 9. Within
G_DACG, genetic programming is employed in a search role rather than in an
induction role.
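
To give a flavour of the genetic operators mentioned above (reproduction, crossover and mutation over fixed-length chromosomes), the following Python sketch evolves a population of bit strings against a user-supplied fitness function; it is a generic illustration only, not the G_DACG algorithm of Chapter 9, which applies genetic programming to tree-structured representations:

import random

def evolve(fitness, length=16, pop_size=30, generations=40,
           p_crossover=0.8, p_mutation=0.02):
    # Minimal generational GA over fixed-length bit-string chromosomes
    pop = [[random.randint(0, 1) for _ in range(length)] for _ in range(pop_size)]
    for _ in range(generations):
        scored = sorted(pop, key=fitness, reverse=True)
        new_pop = scored[:2]                                   # reproduction (elitism)
        while len(new_pop) < pop_size:
            a, b = random.sample(scored[:pop_size // 2], 2)    # select from the fitter half
            if random.random() < p_crossover:                  # one-point crossover
                cut = random.randrange(1, length)
                a = a[:cut] + b[cut:]
            new_pop.append([bit ^ (random.random() < p_mutation) for bit in a])  # mutation
        pop = new_pop
    return max(pop, key=fitness)

# Toy "one-max" problem: maximise the number of 1s in the chromosome
print(evolve(fitness=sum))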

7.5.3.3 Connectionist paradigm


In the late fifties, Rosenblatt [Rosenblatt 1958] suggested a simple device for pattern
recognition purposes, inspired by early mathematical models of biological neurons
[Hebb 1949; James 1892]. He dubbed this device the perceptron. A neural network is
represented as a multilayer weighted directed graph of threshold nodes (perceptrons)
(non-linear functions) that spreads activation from input feature nodes through internal
nodes to output nodes. Weights on the links determine how much activation is passed
on in each case. The activation of output nodes can be translated into discrete
classifications or numeric predictions. Because of the low level at which knowledge is
represented in artificial neural networks, it is quite difficult to program them manually
- hence learning is an essential component of connectionist theories. The learning
strategy of most neural network algorithms is to improve the accuracy of classification
or prediction by modifying the weights associated with each link. Typical learning
algorithms carry out a hill-climbing search through the space of weights, modifying
them in an attempt to minimise the errors the network makes on training data. One of
the most popular learning algorithms is the back propagation algorithm, originally
introduced by Bryson and Ho [Bryson and Ho 1969]. Unfortunately it was not until the
mid-eighties, with the re-introduction of the back propagation learning algorithm
[Rumelhart, Hinton and Williams 1986] (which overcame the limitations identified by
Minsky and Papert [Minsky and Papert 1969]), that work on learning with simulated
neurons underwent a rebirth, generating explosive interest and excitement in the field.
Another, more recent and powerful, algorithm is the scaled conjugate gradient
algorithm [Moller 1993], which is used both as a comparison to the learning approaches
presented in this book and also as part of the MANF inductive learning framework
described in Section 9.6. For a detailed presentation of neural networks see [Bishop
1995; Fiesler and Beale 1997; Hertz, Anders and Palmer 1991].
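
The weight-modification idea described above can be sketched for a single sigmoid unit trained by plain stochastic gradient descent (invented toy data; this is neither back propagation over a multilayer network nor the scaled conjugate gradient algorithm, just the underlying error-reduction step):

import math, random

def train_sigmoid_unit(data, epochs=500, lr=0.5):
    # data: list of (inputs, target) pairs with targets in {0, 1}
    n = len(data[0][0])
    w = [random.uniform(-0.5, 0.5) for _ in range(n)]
    b = 0.0
    for _ in range(epochs):
        for x, t in data:
            a = sum(wi * xi for wi, xi in zip(w, x)) + b
            y = 1.0 / (1.0 + math.exp(-a))            # activation of the output node
            delta = (y - t) * y * (1.0 - y)           # gradient of the squared error
            w = [wi - lr * delta * xi for wi, xi in zip(w, x)]
            b -= lr * delta                           # adjust weights to reduce error
    return w, b

# Learn a noiseless AND function
print(train_sigmoid_unit([((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]))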

7.5.3.4 Probabilistic approaches


Probabilistic approaches are based on the assumption that the quantities of interest are
governed by probability distributions and that optimal decisions can be made by
reasoning about these probabilities and observed data. Probability distributions may be
conditional or unconditional, and point-based (e.g. Bayesian networks [Pearl 1988]) or
set-based (e.g. Dempster-Shafer theory [Shafer 1976]). As for many modelling
approaches, learning helps overcome the "cold start" problem (i.e. providing initial
probabilities) that has traditionally plagued probabilistic approaches. Numerous
learning algorithms exist, including the naive Bayes induction algorithm (described in
Section 7.5.2.2) that learns by estimating the various required probabilities, based on
frequencies derived from the training data, or the gradient descent training algorithm of
Bayesian networks proposed by Russell et al. [Russell et al. 1995], or the possibilistic
network learning algorithm proposed by Borgelt and Kruse [Borgelt and Kruse 1997],
to mention but three. More recently, approaches have been proposed that use genetic
programming to learn both the structure and distributions that make up probabilistic
and possibilistic networks [Banzhaf et al. 1999].

7.5.3.5 Fuzzy based approaches


Fuzzy based approaches represent models in terms of fuzzy sets and if-then rules. The
past decade has seen the introduction of numerous learning algorithms for fuzzy-based
systems. These include machine learning algorithms such as the mountain method
[Yager 1994], the data browser [Baldwin and Martin 1995] (described in Section
7.5.2.3), and clustering approaches such as FCM-based approaches [Sugeno and
Yasukawa 1993]. The work of Grabisch and Nicolas [Grabisch and Nicolas 1994] uses
fuzzy-based learning algorithms that capture models in terms of fuzzy sets, fuzzy
integrals, and fuzzy measures. Hybrid learning approaches include neuro-fuzzy
approaches such as those proposed by [Bossley 1997; Harris, Wu and Feng 1997;
Ishibuchi et al. 1995; Narazaki and Ralescu 1999]. For example, the hybrid approach
proposed by Ishibuchi et al. [Ishibuchi et al. 1995] uses genetic algorithms to determine
a model consisting of weighted fuzzy if-then rules. The approaches presented in this
book (Part IV), namely Cartesian granule feature modelling [Baldwin, Martin and
Shanahan 1996; Baldwin, Martin and Shanahan 1997; Shanahan 1998], are hybrid in
nature, using genetic programming to discover the structure of the model and
probability theory to identify the parameters of the model that is expressed in terms of
fuzzy sets and additive rules.

7.5.3.6 Case-based learning


Another framework for supervised learning known as instance-based or case-based
learning represents knowledge in terms of specific cases or experiences and relies on
flexible matching methods to retrieve these cases and apply them to new situations.
One common approach simply finds the stored nearest neighbour (according to some
distance metric) to the current situation, then uses it for classification or prediction. The
typical case-based learning method stores training instances in memory, while
generalisation occurs at retrieval time, with the power residing in the indexing scheme,
the similarity metric used to identify relevant cases, and the method for adapting cases
to new situations. In its simplest form, case-based reasoning reduces to nearest
neighbour classification and instance based reasoning [Kibler and Aha 1987]. Support
vector machines [Vapnik 1995] are a form of case-based reasoning where only the
cases that define the interclass boundaries, in a newly transformed feature space, are
retained after learning; these become the support vectors upon which inference occurs.
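
In its simplest nearest-neighbour form, this flexible matching can be sketched in a few lines of Python (the stored cases, labels and Euclidean distance metric below are illustrative assumptions; real case-based systems typically add indexing and case adaptation):

import math

def nearest_neighbour(cases, query):
    # cases: list of (feature_vector, label); classify by the closest stored case
    def dist(a, b):
        return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))
    _, label = min(((dist(x, query), y) for x, y in cases), key=lambda t: t[0])
    return label

cases = [((1.0, 1.0), "successful"), ((4.0, 0.5), "unsuccessful")]
print(nearest_neighbour(cases, (1.2, 0.9)))    # "successful"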

7.6 REINFORCEMENT LEARNING

The previous section provided an overview of supervised learning algorithms. This
section focuses on the second category of machine learning algorithms, that is,
reinforcement learning algorithms. Reinforcement learning algorithms [Sutton 1988]
involve learning models where the decision making is dependent on previous decisions.
In contrast, the models generated as a result of supervised learning predict values (make
decisions) that are generally independent of previous or subsequent decisions, that is,
one-shot decision making. For example, recognising a character is a one-shot decision
with immediate feedback stating that the model has made the correct classification or
not. This decision has no effect on subsequent decisions that the model may take.
Conversely, in reinforcement learning, each decision that the model takes affects
subsequent decisions. For example, consider an autonomous robot attempting to
navigate a maze from a starting point to an end point (goal). At each point in time, the
robot must decide whether to move forward, left, right, or backward. Each decision
changes the location of the robot, so the decision will depend on previous decisions.
After each decision, the supervisor provides feedback to the robot in terms of a reward
that reflects the long-term potential of taking that move. For example, if a move leads
to the robot getting to the goal, then the feedback is positive (a reward is given),
otherwise the robot is penalised (for example, when the move leads to a dead-end). The
goal of the robot is to choose sequences of actions to maximise the long-term reward.
This differs from supervised learning where each classification decision is independent
of other decisions. Credit assignment of which decisions resulted in a good result (the
robot reaching the goal state) plays a key role in reinforcement learning, where the
impact of a decision cannot, in general, be measured immediately (feedback is not
direct).

Typical problems that can be tackled by reinforcement learning include robot
navigation and game playing. For example, consider the problem of learning to play
backgammon [Tesauro 1995]. The input variables describe the state of the board. The
predicted variable is the goodness of a reachable state. The performance measure is the
percentage of games won against an opponent. For this problem, the training
experience is playing practice games against oneself. The machine learning tableau for
playing backgammon is presented in Table 7-3.

7.6.1 Popular reinforcement learning algorithms


Early work in this area includes Samuel's checker player [Samuel 1959], Michie and
Chambers' Boxes systems [Michie and Chambers 1968], and Holland's bucket brigade
algorithm [Holland 1986]. However, reinforcement learning did not receive widespread
attention until Sutton's paper [Sutton 1988] on temporal difference learning and
Watkins and Dayan's work on Q-learning [Watkins and Dayan 1992]. For a more
extensive treatment of reinforcement learning see [Langley 1996; Russell and Norvig
1995; Sutton and Barto 1998].
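
As an indication of what such algorithms compute, the following Python sketch implements the standard tabular Q-learning update on an invented corridor world (the environment, the reward of 1 at the goal and the parameter values are illustrative assumptions, not a system from the literature above):

import random
from collections import defaultdict

def q_learning(step, start, actions, episodes=200, alpha=0.5, gamma=0.9, eps=0.3):
    # step(state, action) -> (next_state, reward, done) is an assumed environment interface
    Q = defaultdict(float)
    for _ in range(episodes):
        s, done = start, False
        while not done:
            a = (random.choice(actions) if random.random() < eps
                 else max(actions, key=lambda act: Q[(s, act)]))
            s2, r, done = step(s, a)
            best_next = max(Q[(s2, act)] for act in actions)
            # temporal-difference update towards reward plus discounted future value
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            s = s2
    return Q

# Toy corridor: states 0..3, goal at state 3, reward 1 only on reaching the goal
def corridor(s, a):
    s2 = max(0, min(3, s + (1 if a == "right" else -1)))
    return s2, (1.0 if s2 == 3 else 0.0), s2 == 3

Q = q_learning(corridor, start=0, actions=["left", "right"])
print(max(["left", "right"], key=lambda a: Q[(0, a)]))    # typically "right"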

7.7 UNSUPERVISED LEARNING

The final category of learning discussed is unsupervised learning, where the learner is
given a collection of observations or events and searches, without the supervision of a
teacher, for regularities and general rules explaining all or most of the observations, e.g.
conceptual clustering. The goal of unsupervised learning is to get some understanding
of the process that generated the data. This can be achieved by cluster analysis, or by
examining the associated fuzzy sets or probability densities.

A typical example of unsupervised learning is that of class discovery in large
databases. Consider the example presented by Cheeseman et al. [Cheeseman et al.
1988], where, given a large astronomical dataset, AutoClass (a Bayesian statistical
technique) automatically determined the most probable number of classes and their
probabilistic descriptions. In this case, each group of astronomical objects was
modelled as having a spectrum that was a multivariate normal distribution (centred at a
typical spectrum for a group of objects). The probability distribution Pr(X) describing
the whole collection of astronomical objects was modelled as a mixture of normal
distributions; one for each group of objects. The learning algorithm, AutoClass in this
case, determined the number of groups, the mean, and the covariance matrix of each
multivariate distribution. One of the interesting outcomes of this work is that AutoClass
discovered previously unsuspected classes of astronomical objects. Having fitted a
stochastic model to a collection of objects, such as astronomical objects, that model can
be applied to classify new objects. This happens in the following manner: given a new
astronomical object, it is possible to infer which multivariate Gaussian is most likely to
have generated it, and subsequently assign it to the corresponding cluster of objects.
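
A present-day sketch of this kind of mixture-model clustering and subsequent classification is given below; it assumes the scikit-learn library and synthetic data, and it fixes the number of components in advance, so it only mimics the flavour of AutoClass, which additionally infers the most probable number of classes:

import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# two synthetic "spectral" clusters standing in for groups of astronomical objects
X = np.vstack([rng.normal(0.0, 1.0, size=(100, 3)),
               rng.normal(5.0, 1.0, size=(100, 3))])

gmm = GaussianMixture(n_components=2, covariance_type="full").fit(X)

# classify a new object: assign it to the Gaussian most likely to have generated it
new_object = np.array([[4.8, 5.2, 4.9]])
print(gmm.predict(new_object), gmm.predict_proba(new_object))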

Table 7-3: Machine learning tableau for playing backgammon.

ML Tableau for playing backgammon

Task T                    Playing backgammon
Performance Measure P     Percentage of games won against opponents
Training Experience E     Playing practice games against oneself
Input variables           Board state
Predicted variable        Goodness of a reachable state

In this book supervised learning approaches are proposed for both classification and
prediction based on Cartesian granule feature models. In learning these models, it is
shown also how unsupervised learning approaches such as clustering can be used to
discover structure in the data (unsupervised discretisation of feature universes, resulting
in a fuzzy partition) that can subsequently lead to more transparent and accurate model
abstractions. A similar approach was adopted in [Ralescu and Hartani 1994; Sugeno
and Yasukawa 1993], where unsupervised approaches (fuzzy-clustering) were used to
identify the class structure for supervised learning problems.

7.7.1 Clustering and discovery algorithms


Unsupervised learning encompasses two main techniques: clustering and scientific
discovery. Examples of clustering algorithms include ISODATA, FCM (a fuzzy
clustering approach) [Bezdek 1976; Bezdek 1981], feature maps [Kohonen 1984], and
AutoClass, a probabilistic clustering algorithm [Cheeseman et al. 1988]. Another
widely used unsupervised learning model is the hidden Markov model (HMM). A
HMM is a stochastic finite state machine that generates strings. A typical application of
HMMs is speech recognition, where these strings correspond to speech signals, and one
HMM is trained for each word [Rabiner 1989]. Examples of invention and discovery
systems include Lenat's AM [Lenat 1977] and BACON [Langley, Simon and
Bradshaw 1987]. For a more detailed treatment of unsupervised learning approaches,
see [Bezdek and Pal 1992; Fiesler and Beale 1997].

7.8 COMPONENTS OF INDUCTIVE LEARNING ALGORITHMS

The previous sections presented a definition of machine learning and various categories
of learning algorithms. At this point, a detailed presentation of the key components that
make up inductive learning (limited to a supervised learning perspective) is provided
(see Figure 7-1):

• search algorithm;
• performance measures that guide the search;
• knowledge representation of both observations and hypothesis;
• and the inductive bias introduced by the search and knowledge
representation techniques used.

Before describing each of these components in detail, the stage is set by introducing
two important operations in learning: generalisation; and specialisation.

7.8.1 Learning through inductive generalisation


The notion of generalisation is key in the inductive learning of models. It can be very
simply defined as the ability to extend a concept to cover cases that were not presented
during training. Specialisation is the reverse process of generalisation, where the
concept is cutback or made more specialised. Both operations help define the
boundaries of concepts. These operations were first described in a learning algorithm
introduced by Mill [Mill 1843]. This algorithm learned through maintaining a single
consistent model, and by adjusting it as new examples arrived in order to maintain its
consistency. Figure 7-7 presents examples of generalisation and specialisation in the
context of this algorithm for the car parking problem (from Chapter 1). Figure 7-7(a)
portrays a consistent model resulting from a previous inductive inference, where the
grey region corresponds to the concept for successful parking and the white region to a
concept for unsuccessful parking. In Figure 7-7(b), a false negative (a sample is a false
negative if the model says it should be positive but in fact it is negative) is introduced
which causes the learning algorithm to adapt its concept of parking success. In this
case, the generalisation operation is used and the concept for unsuccessful parking is
extended as depicted in Figure 7-7(c). Figure 7-7(d) portrays the introduction of a false
positive (a sample is a false positive if the hypothesis says it should be negative but in
fact it is positive), causing the learning algorithm to once again adapt its concept of
parking success. In this case, the concept of successful parking is specialised as
depicted in Figure 7-7(e). For this problem, learning is viewed from the perspective of
car parking success concept (highlighted in the grey region) with both generalisation
and specialisation being applied to update the boundaries of the previously induced
model.

From a more general machine learning context, how generalisation or specialisation is
achieved will vary, depending on the underlying knowledge representation and possibly
the learning algorithm used. For symbolic logic learning algorithms, generalisation is
achieved by dropping some condition or attribute from the description of a concept. For
example, the rule for successful car parking could look something like the following:

If TimeToDestination is short and                                (6-2)
   NumberOfFreeSpaces is many
then ParkingStatus is successful


(a): A consistent model. (b): Introduction of a false negative.


(c): Model generalisation. (d): Introduction of a false positive.

(e): Model specialisation.

Figure 7-7: Examples of the generalisation and specialisation operations for the
parking problem as applied from the perspective of successful parking.

Generalisation could be enhanced here by dropping one of the conditions for successful
car parking. Here short and many denote crisp intervals defined over ΩTimeToDestination
and ΩNumberOfFreeSpaces. For example, dropping the TimeToDestination condition results in
the following more general rule:

If NumberOfFreeSpaces is many
then ParkingStatus is successful

This is referred to as concept pruning, a strategy that is often employed in decision
tree approaches to provide better generalisation, in that it cuts out branches that do not
lead to useful dichotomies [Quinlan 1986]. Specialisation is realised for inductive logic
approaches by adding or modifying some condition.

Another form of generalisation is possible by looking at the discretisation of the feature
universes. For example, consider that the universe of TimeToDestination is partitioned
into three equal-sized intervals (for the sake of simplicity, assume crisp intervals). The
intervals are labelled as follows: small; medium; and large. Taking the original rule
(Equation 6-2), one can extend the scope of the rule by modifying the conditional
features (antecedents) in either of the following ways:

• by extending the interval that characterises a word;


• or by adding extra words.

The former could be achieved by extending the boundaries of the word short (and
correspondingly shortening the interval associated with medium), while the latter could
be achieved by the learner by adding extra values of the TimeToDestination variable
such as the word medium, such that the generalised rule would look like this:

If TimeToDestination is short or
TimeToDestination is medium and
NumberOfFreeSpaces is many
then ParkingStatus is successful

For other learning algorithms, generalisation can be accommodated in many ways, for
example, in modelling with Cartesian granule features, generalisation is further
enhanced due to the multidimensional nature of the features. This generalisation can
prove very useful in certain problem domains. Section 10.4 presents the L classification
problem, a difficult problem (with no perfect solution), on which many popular
learning algorithms fail, but where Cartesian granule feature approaches succeed. This
success is due largely to the multidimensional nature of Cartesian granule features,
which provides extra generalisation power. Alternatively, in other forms of knowledge
representation, generalisation arises from the inference and decision making
mechanisms used; e.g. some case-based learning approaches, which simply store
training instances in memory. In this case, generalisation occurs at retrieval time, with
the power residing in the indexing scheme, the similarity metric used to identify
relevant cases, and the method for adapting cases to new situations.

7.8.2 Generalisation as search


Inductive learning can be reformulated as a search or discovery task in the space of
generalisations, where each possible model (generalisation) that could be generated
from the data and hypothesis language is viewed as a node in a model search space. In
this space of generalisations, depending on the expressive power of the hypothesis
language, there should exist a set of models that is consistent with the training data and
background knowledge (i.e. covers the training data and possibly unseen data); this is
termed the version space (the plausible model versions) by Mitchell [Mitchell 1982].
The hypothesis space can be partially ordered using the restriction operator (a form of
subset); thus, the version space can be viewed as an interval in this hypothesis space.
Consequently, learning could be viewed as finding the interval in this hypothesis space,
corresponding to the version space, a form of constraint-based programming. In
practice however, this method is awkward to implement, since the details of the partial
order depend on the particular language employed to represent the hypotheses. Also in
the worst case, the size of the version space representation can grow exponentially with
the number of observed training examples [Haussler 1989]. However, the version space
perspective has been extremely helpful in clarifying the nature of inductive learning.

Currently, practical learning algorithms rely on different point and population-based
search paradigms to discover useful generalisations of the training data, while ignoring,
in general, the theoretical version space. These search paradigms include exhaustive
methods, like breadth-first and depth-first, or greedy ones such as beam search, or
pseudo-random search techniques such as simulated annealing or evolutionary search.
However, due to the large scale of the search involved in model selection, it turns out
in practice that most induction algorithms rely upon greedy or hill-climbing methods
(point-based approaches, i.e. local search around one model), which work well in many
domains [Michalski, Bratko and Kubat 1998]. Chapter 9 gives a more detailed
discussion on this issue of search in model discovery, while also introducing a new
model discovery paradigm based upon population-based search using genetic
programming. This model discovery algorithm (G_DACG) is demonstrated on a variety
of problems in Chapters 10 and 11. The proposed approach avoids many of the pitfalls
of other learning algorithms such as local optima by using an evolutionary-based
search technique.

A key part of any search paradigm is the cost or evaluation function, essentially, how
effective is a particular model in performing a specific task. Most induction methods
emphasise the ability to perform well on training or validation data, a behavioural-
based approach, but this can prove to be computationally expensive. In this book
however, a novel cost function is proposed based upon the semantic separation of the
concepts learned, thus avoiding expensive behavioural-based testing (on a control
dataset). This forms an integral part of the G_DACG constructive induction algorithm.
Other factors may also be taken into account to augment such decisions such as model
parsimony (simplicity). Empirical evidence (for example, see the L classification problem
in Section 10.4) tends to suggest that models that are as simple as possible, but no simpler,
tend to find the right balance between over-generalisation and overfitting. Inductive bias, which is
subsequently presented in Section 7.8.5, also plays a key role in model discovery and is
closely intertwined with model evaluation.

Finally, in any search technique the issue of termination is very important, as the
search for a model may never truly halt. For non-incremental (one-shot) learning
approaches the simplest approach is to search until no further progress occurs or until a
prescribed level of performance is attained. For the approaches proposed in this book,
search for a model is carried out for a prescribed amount of effort commensurate with
the problem domain, while employing an early-stopping strategy if a prescribed level of
performance is attained.

7.8.3 Performance measures


As noted in the previous section, performance evaluation (recall Figure 7-1) is one of
the key factors in any learning system and plays a key role in measuring generalisation
and in model selection. Several forms of performance evaluation exist, including the
following:

• accuracy-based evaluations based upon the behaviour of a model in a
particular environment;
• efficiency of a model;
• and the transparency or understandability of a model.

Accuracy-based evaluations are amongst the most commonly used measures of a
model's goodness, and in particular the generalisation of the model. Accuracy of a
model is measured with respect to one or more datasets. The selection of what data to
include in this dataset(s) is discussed in due course, but first the accuracy measures
used by the learning approaches presented in this book are described. For classification
problems, the accuracy of a model on a dataset is based on the proportion of cases
correctly classified by a model out of all the test cases. A confusion matrix provides a
graphical means of analysing the accuracy of a classifier. It displays the actual and
predicted classifications in a square matrix format of order C the number of classes for
a problem. A canonical class confusion matrix for a C class problem is displayed in
Table 7-4. Each row in the table corresponds to the actual classifications of the test
data, while the columns labelled Class1, ..., ClassC correspond to the model predicted
classes. The cells in the diagonal of the confusion matrix (i.e. cells labelled C11, ...,
CCC) represent the correctly classified test cases for each class. The other non-diagonal
cells denote the misclassified test cases, with the corresponding row label representing
the actual classification, and the corresponding column label representing the model
predicted classification. The other columns, labelled Class Total and Class
%Accuracy, representing the number of test cases for a particular class and the model
accuracy on a per class basis respectively, are provided for convenience. In the
literature, model accuracy is presented in terms of the proportion of correctly classified
cases, corresponding to the model accuracy, or in terms of the proportion of
misclassified cases, corresponding to the model error rate.
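
The layout of Table 7-4 can be computed directly from paired actual and predicted labels; the following Python sketch (with invented labels) builds the matrix and the per-class accuracy column:

from collections import Counter

def confusion_matrix(actual, predicted, classes):
    # rows correspond to the actual class, columns to the predicted class
    cells = Counter(zip(actual, predicted))
    return [[cells[(a, p)] for p in classes] for a in classes]

def per_class_accuracy(matrix):
    # diagonal cell divided by its row total, i.e. the Class %Accuracy column
    return [row[i] / sum(row) if sum(row) else 0.0 for i, row in enumerate(matrix)]

actual    = ["p", "p", "n", "n", "n", "p"]
predicted = ["p", "n", "n", "n", "p", "p"]
m = confusion_matrix(actual, predicted, classes=["p", "n"])
print(m, per_class_accuracy(m))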

On the other hand, for prediction problems the accuracy is calculated based on the
RMS error (root mean squared error) as follows:

RMS = \frac{\sqrt{\frac{1}{N}\sum_{i=1}^{N}(y_i - \hat{y}_i)^2}}{|\Omega_Y|} \times 100        (6-3)
where yi and ŷi correspond to the actual output value (of the test input tuple) and the
model predicted value respectively. |ΩY| denotes the size of the universe of the output
variable Y.
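
A direct transcription of Equation (6-3) into Python, under the reconstructed reading above (the root of the mean squared error, normalised by the size of the output universe and expressed as a percentage), with invented values:

import math

def rms_percent(actual, predicted, output_universe_size):
    # Equation (6-3): normalised root mean squared error as a percentage
    n = len(actual)
    mse = sum((y - y_hat) ** 2 for y, y_hat in zip(actual, predicted)) / n
    return math.sqrt(mse) / output_universe_size * 100.0

print(rms_percent([1.0, 2.0, 3.0], [1.1, 1.8, 3.3], output_universe_size=10.0))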

Table 7-4: A typical class confusion matrix.

Actual\Predicted    Predicted Class1   ...   Predicted ClassC   Class Total           Class %Accuracy
Actual Class1       C11                ...   C1C                C11 + ... + C1C       C11/(C11 + ... + C1C)
...                 ...                ...   ...                ...                   ...
Actual ClassC       CC1                ...   CCC                CC1 + ... + CCC       CCC/(CC1 + ... + CCC)

A model's accuracy can be estimated on the observation data to give a resubstitution
estimate of the error rate, but unless the number of samples is very large and also
representative of the environment, the method does not account for generalisation.
Consequently, the use of resubstitution is not recommended. To get over this problem
the provided data is split randomly into a training dataset, a validation dataset (also
known as the control dataset) if required by the learner, and a test dataset. Usually, 20-
30% of the data are set aside for each of the control and test datasets. The model error
rate on the test data gives an unbiased estimate of the overall error rate of the model; a
useful measure of the model generalisation. This is known as the holdout estimate
error rate.

Alternatively, when the availability of data is severely limited (less than 1000 samples
[Michie, Spiegelhalter and Taylor 1993]), another popular method of accuracy-based
evaluation is n-fold cross validation [Stone 1974]. Here, the provided dataset is
partitioned into n approximately equal-sized subsets. The system then trains on n-1
subsets and evaluates the performance of the induced model by testing on the remaining
subset. This process is repeated for each of the n subsets that is omitted from training,
and the resulting model accuracies are averaged over all n results. Such a procedure
allows the use of a high proportion of the available data to train, while also making use
of all data points in evaluating the cross-validation error. Typical choices of n tend to
be less than 10, with the limiting case known as the leave-one-out method. The disadvantage of
such an approach is that it requires the inductive inference process (training) to be
performed n times which in some circumstances could lead to large computational
requirements.
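
A skeleton of the n-fold procedure just described is given below (plain Python; train_and_score is a placeholder for whatever induction algorithm and accuracy measure are in use):

import random

def n_fold_cross_validation(data, n, train_and_score):
    # partition the data into n roughly equal folds, train on n-1 and test on the rest
    data = list(data)
    random.shuffle(data)
    folds = [data[i::n] for i in range(n)]
    scores = []
    for i in range(n):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        scores.append(train_and_score(train, test))
    return sum(scores) / n            # average accuracy over the n held-out folds

# Toy usage with a dummy scorer that ignores the data
print(n_fold_cross_validation(range(100), n=10, train_and_score=lambda tr, te: 0.9))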

An alternative accuracy-based strategy is the bootstrap procedure, which according to
Michie, Spiegelhalter and Taylor [Michie, Spiegelhalter and Taylor 1993] should be
used more often in cases where datasets are extremely small (of order 100). In general,
the process involves the generation of test and training datasets by randomly sampling
the observed dataset with replacement.

In this book, the holdout estimate error rate is adopted as a measure of model
generalisation as all the problems examined benefit from sufficiently large datasets.
This measure is used only in the parameter identification phase of the induced models,
with the less-computationally intensive evaluation process of semantic separation used
in the language identification phase (see Section 9.3.2 for details).

A useful measure of efficiency of a model varies depending on the environment
(problem domain). For example, in a planning domain the amount of computational
effort required to generate plans could be used as a measure of efficiency. For the
purposes of this book, efficiency is measured in terms of the size of the induced models;
for example the number and dimensionality of Cartesian granule features used to model
concepts. (See Section 9.3.2 for further details).

Useful measures of induced model transparency or understandability are more
difficult to capture, but like model efficiency it is sometimes measured syntactically by
examining the simplicity or parsimony of the model. The approaches presented in this
book are for the main part symbolic in nature and, in general, benefit from their glass-
box nature, thereby enabling human inspection and understanding. In this book, model
parsimony is used as a measure of the transparency of the model. Other inductive
learning approaches, due to the knowledge representation approaches used, provide
little or no transparency for the user and thus, these measures do not apply; for
example, most neural network approaches.

7.8.4 Knowledge representation


Knowledge representation is one of the most crucial dimensions of any learning
algorithm, dictating the what, the how and the when of learning algorithms, and
affecting performance issues such as accuracy, and model transparency. Knowledge
representation encompasses the representation of specific knowledge in terms of the
observation and hypothesis languages, and the more general inference and decision
making mechanisms. Knowledge representation, due to its crucial role in machine
learning, was presented separately, in detail, in Chapter 2, while various soft computing
approaches to knowledge representation were presented in Chapters 3 (fuzzy set
theory), 4 (fuzzy logic), 5 (probability theory), and 6 (a soft computing programming
environment enabling the representation of systems in terms of fuzzy and probabilistic
knowledge forms).

7.8.5 Inductive bias


One of the many important problems facing machine learning using induction is that
generalisation from any set of observations is never logically justified, since there
always exist many models (hypotheses) that could account for the observed data. One
trivial hypothesis is simply the conjunction of the observations. However, this is not the
sort of knowledge structure one desires from a learning method.

Clearly, a system that learns concepts from examples must somehow be guided through
the space of inductive generalisations (model space) not solely by the training
instances. The machine learning literature often refers to this as inductive bias [Mitchell
1982]. Rendall [Rendall 1986] makes a further important distinction by defining both a
representational bias and a search bias. Representational bias restricts the space of
possible models by limiting the language. For example, in additive Cartesian granule
feature modelling, the allowed dimensionality of the features plays an important role in
the search for a model, rendering it computationally tractable or not.
A more flexible approach incorporates the notion of search bias, which considers all
possible concept descriptions, but examines some earlier than others in the search
process. Most learning algorithms, if the choice is afforded to them, will prefer to
search simpler hypotheses before more complex ones. For example, the ID3 learning
algorithm proceeds from very general concepts (simple) to more specific (detailed). In
the learning approaches presented in this book (the G_DACG algorithm), a genetic
search based upon the genetic programming paradigm is used in the selection of
possible models. A search bias is encoded in the fitness function, where model
parsimony (simplicity) and high model performance, which is estimated using a cheap
measure based upon the semantic separation of the concepts learned, are promoted.

7.9 COMPUTATIONAL LEARNING THEORY

A different strand of research within machine learning treats machine learning as an
area of mathematical study. The goal of computational learning theorists is to formulate
and prove theorems about the tractability of entire classes of learning problems. Here
the typical goal involves defining some learning problem, conjecturing that it can or
cannot be solved with a reasonable number of training cases, and then proving that the
conjecture holds under very general conditions. Landmark work in this area centres on
Mitchell's "version space" [Mitchell 1982], and Valiant's PAC learning (probably
approximately correct) [Valiant 1984] amongst others. Computational learning theory
has provided many insightful and surprising theorems about the relative difficulty of
learning tasks and methods for solving them. One interesting strand of possible future
research for the work presented in this book is to explore learnability issues in the
context of Cartesian granule feature models.

7.10 GOALS AND ACCOMPLISHMENTS OF MACHINE LEARNING

Having introduced inductive learning from an ML perspective in the previous sections,
this section presents some of the goals and achievements of ML. The field of ML is
united by its concern with learning, but the literature suggests that researchers focus on
this field for a variety of reasons. One of the main goals of the field is to model the
mechanisms that underlie human learning, referred to as "cognitive simulation" [Simon
1983]. Achieving this can help in discovering how humans work and can potentially
help them be more effective workers, especially in mission critical domains such as
power plant control, and space travel.

Another goal of machine learning is to automatically program computers to perform
some useful task that may be beyond human specification, such as diagnosing patients
for diabetes. In this case, machine learning is similar to the definition of empirical
learning or inductive learning as presented by Shavlik and Dietterich [Shavlik and
Dietterich 1990b]. In their definition, training examples are "externally supplied" in a
ready-to-use format. Viewed in this way, the field of ML has had many fine successes.
For example, Ben-Davis and Mandel [Ben-Davis and Mandel 1995] presented an
empirical study that "provides evidence that machine learning models can provide
better classification accuracy than explicit knowledge acquisition". Kononenko
[Kononenko 1993] references 24 papers where inductive learning systems were actually
applied in medical domains, such as oncology, liver pathology, prognosis of patient
survival in hepatitis, urology, cardiology, and gynaecology amongst many others. He
remarks that "typically, automatically generated diagnostic rules slightly outperformed
the diagnostic accuracy of physician specialists" and in some cases the automatically
programmed systems enhanced human understanding. Michalski et al. [Michalski,
Bratko and Kubat 1998] provide further examples of machine learning success, with
many fielded examples.

The past ten years have seen applications of machine learning within new fields such
as knowledge discovery and knowledge discovery in databases (KDD). KDD is a
derivative of knowledge discovery that exploits machine learning algorithms to analyse
or discover patterns in very large databases. In these fields machine learning is viewed
as one step in the discovery process that is supplied with data by a previous step, in
contrast to being "externally supplied" (a traditional view of machine learning).
Chapter 1 presented an overview of this process, along with some of its successes. The
remainder of this book explores knowledge discovery from a Cartesian granule feature
perspective and demonstrates the process on real world problems.

Finally, other researchers within ML treat it as an area of mathematical study, setting
goals to formulate and prove theorems about the learnability and tractability of learning
problems and the learning approaches designed to solve those problems. See the
following textbooks [Cristianini and Shawe-Taylor 2000; Mitchell 1997] for a more
detailed overview and discussion of this ML perspective.

7.11 SUMMARY

Induction can be seen as learning a function from input/output pairs. This function can
be represented using logical sentences, polynomials, belief networks, neural networks
and others. This chapter has provided an overview of machine learning, a field that cuts
across artificial intelligence and cognitive science. Formal definitions of machine
learning were provided and the three main categories of machine learning (supervised
learning, reinforcement learning, and unsupervised learning) were described. Inductive
learning, an integral part of most computational learning algorithms, was presented in
detail. Popular induction algorithms for decision trees, naive Bayes classifiers and
fuzzy classifiers were described and illustrated. In subsequent chapters (in Part IV),
new approaches to machine learning are presented in the context of Cartesian granule
feature models. These approaches are subsequently (in Part V) demonstrated on both
real world and artificial problems.

7.12 BIBLIOGRAPHY

Agency, F. E. D. (1995). Learning + Styles. Further Education Department Agency,
Citadel Place, Tinworth, London SE11 5EH.


Baldwin, J. F., Lawry, J., and Martin, T. P. (1997). "Mass assignment fuzzy ID3 with
applications." In the proceedings of Fuzzy Logic: Applications and Future
Directions Workshop, London, UK, 278-294.
Baldwin, J. F., and Martin, T. P. (1995). "Fuzzy Modelling in an Intelligent Data
Browser." In the proceedings of FUZZ-IEEE, Yokohama, Japan, 1171-1176.
Baldwin, J. F., Martin, T. P., and Shanahan, J. G. (1996). "Modelling with Words using
Cartesian Granule Features", Report No. ITRC 246, Dept. of Engineering
Maths, University of Bristol, UK.
Baldwin, J. F., Martin, T. P., and Shanahan, J. G. (1997). "Modelling with words using
Cartesian granule features." In the proceedings of FUZZ-IEEE, Barcelona,
Spain, 1295-1300.
Baldwin, J. F., Martin, T. P., and Shanahan, J. G. (1998). "System Identification of
Fuzzy Cartesian Granule Feature Models using Genetic Programming", In
IJCAI Workshop on Fuzzy Logic in Artificial Intelligence, Lecture Notes in
Artificial Intelligence (LNAI 1566) - Fuzzy Logic in Artificial Intelligence, A.
L. Ralescu and J. G. Shanahan, eds., Springer, Berlin, 91-116.
Banzhaf, W., et aI., eds. (1999). "Proceedings of the Genetic and Evolutionary
Computation Conference (GECCO) 1999, Orlando, USA", Morgan Kaufmann,
San Francisco, CA.
Ben-Davis, A., and Mandel, J. (1995). "Classification accuracy: machine learning vs.
explicit knowledge acquisition", Machine Learning, 18:109-114.
Bezdek, J. C. (1976). "A Physical Interpretation of Fuzzy ISODATA", IEEE Trans. on
System, Man, and Cybernetics, 6(5):387-390.
Bezdek, J. C. (1981). Pattern Recognition with Fuzzy Objective Function Algorithms.
Plenum Press, New York.
Bezdek, J. C., and Pal, S. K. (1992). Fuzzy Models for Pattern Recognition. IEEE
Press.
Bishop, C. M. (1995). Neural Networks for Pattern Recognition. Clarendon Press,
Oxford.
Borgelt, C., and Kruse, R. (1997). "Some experimental results on learning probabilistic
and possibilistic networks with different evaluation functions." In the
proceedings of 1st International joint conference on qualitative and
quantitative practical reasoning, ECSQUARU-FAPR, Bad Honnef, Germany,
71-85.
Bossley, K. M. (1997). "Neurofuzzy Modelling Approaches in System Identification",
PhD Thesis, Department of Electrical and Computer Science, Southampton
University, Southampton, UK.
Breiman, L., Friedman, J. H., Olsen, R. A., and Stone, C. J. (1984). "Classification and
Regression Trees", Wadsworth Int. Group, Belmont, California.
Bryson, A. E., and Ho, Y. C. (1969). Applied Optimal Control. Blaisdell, New York.
Buchanan, B. G., and Mitchell, T. M. (1978). "Model directed learning of production
rules", In Pattern directed inference systems, D. A. Waterman and F. Hayes-
Roth, eds., Academic Press, New York.
Cheeseman, P., Kelly, J., Self, M., Stutz, J., Taylor, W., and Freeman, D. (1988).
"AutoClass: A Bayesian classification system." In the proceedings of Fifth
International workshop on machine learning, San Francisco, 54-64.
Cristianini, N., and Shawe-Taylor, J. (2000). An introduction to Support Vector
Machines. Cambridge University Press, Cambridge, UK.

Duda, R., Gaschnig, J., and Hart, P. (1979). "Model design in the Prospector consultant
system for mineral exploration", In Expert systems in the microelectronic age,
D. Michie, ed., Edinburgh University Press, Edinburgh, 153-167.
Feighenbaum, E. A. (1961). "The simulation of verbal learning." In the proceedings of
Western joint computer conference (reprinted in Readings in Machine
Learning (1990), Eds.: Shavlik and Dietterich, Morgan Kaufmann Publishers),
Los Angeles, 121-132.
Fiesler, E., and Beale, R. (1997). Handbook of Neural Computation. Institute of Physics
Publishing Ltd. and Oxford University Press, Bristol, UK.
Fogel, L. J., Owens, A. J., and Walsh, M. J. (1966). Artificial intelligence through
simulated evolution. John Wiley, New York.
Friedberg, R. (1958). "A learning machine, part 1", IBM Journal of Research and
Development, 2:2-13.
Friedberg, R., Dunham, B., and North, T. (1959). "A learning machine, part 2", IBM
Journal of Research and Development, 3:282-287.
Goldberg, D. E., and Deb, K. (1991). "A comparative analysis of selection schemes
used in genetic algorithms", In Foundations of Genetic Algorithms, G.
Rawlins, ed., Morgan Kaufmann, San Francisco.
Grabisch, M., and Nicolas, J. (1994). "Classification by fuzzy integral: Performance
and tests", Fuzzy Sets and Systems, 65:255-271.
Harris, C. J., Wu, Z. Q., and Feng, M. (1997). "Aspects of the Theory and Application
of Intelligent Modelling, Control and Estimation." In the proceedings of 2nd
Asian Control Conference (invited lecture), Seoul, Korea, 1-10.
Haussler, D. (1989). "Learning conjunctive concepts in structural domains", Machine
Learning, 4(1):7-40.
Hebb, D. O. (1949). The organisation of behaviour. Wiley, New York.
Hertz, J., Anders, K., and Palmer, R. G. (1991). Introduction to the Theory of Neural
Computation. Addison-Wesley, New York.
Holland, J. H. (1975). Adaptation in Natural and Artificial Systems. University of
Michigan Press, Michigan.
Holland, J. H. (1986). "Escaping brittleness: the possibilities of general purpose
learning algorithms applied to parallel rule-based systems", In Machine
Learning: An Artificial Intelligence Approach (Vol. 2), R. S. Michalski, J. G.
Carbonell, and T. M. Mitchell, eds., Morgan Kaufman, San Francisco.
Holland, J. H., Holyoak, K. J., Nisbett, R. E., and Thagard, P. R. (1986). Induction:
Process of Inference, Learning, and Discovery. MIT Press, Cambridge, Mass.,
USA.
Honey, P., and Mumford, A. (1992). The Manual of Learning Styles. Peter Honey.
Hume, D. (1748). An inquiry concerning human understanding. Reprinted 1955.
Liberal Arts Press, New York.
Hunt, E. B., Marin, J., and Stone, P. J. (1966). Experiments in induction. Academic
Press, New York.
Ishibuchi, H., Nozaki, K., Yamamoto, N., and Tanaka, H. (1995). "Selecting fuzzy if-
then rules for classification problems using genetic algorithms", IEEE
Transactions on Fuzzy Systems, 3(3):260-270.
James, W. (1892). Briefer Psychology. Harvard University Press, Cambridge.
Kibler, D., and Aha, D. E. (1987). "Learning representative exemplars of concepts." In
the proceedings of International Workshop on Machine Learning, 24-30.
Kohonen, T. (1984). Self-Organisation and Associative Memory. Springer-Verlag,
Berlin.
Kononenko, I. (1993). "Inductive and Bayesian learning in medical diagnosis",
Artificial Intelligence, 7:317-337.
Koza, J. R. (1992). Genetic Programming. MIT Press, Massachusetts.
Koza, J. R. (1994). Genetic Programming II. MIT Press, Massachusetts.
Langley, P. (1996). Elements of Machine Learning. Morgan Kaufmann, San Francisco,
CA, USA.
Langley, P., Simon, H. A., and Bradshaw, G. L. (1987). "Heuristics for empirical
discovery", In Computational models of learning, L. Bolc, ed., Springer-
Verlag, Berlin.
Lenat, D. B. (1977). "The ubiquity of discovery", Artificial Intelligence, 9:257-285.
McCulloch, W. S., and Pitts, W. (1943). "A logical calculus of the ideas immanent in
neural nets", Bulletin of Mathematical Biophysics, 5:115-137.
McDermott, J. (1982). "R1: A rule-based configuration of computer systems", Artificial
Intelligence, 19(1):39-88.
Michalski, R. S., Bratko, I., and Kubat, M., eds. (1998). "Machine Learning and Data
Mining", Wiley, New York.
Michalski, R. S., and Chilausky, R. L. (1980). "Learning by being told and by
examples", International Journal of Policy Analysis and Information Systems,
4:125-160.
Michie, D., and Chambers, R. A. (1968). "BOXES: An experiment in adaptive
control", In Machine Intelligence, E. Dale and D. Michie, eds., Oliver and
Boyd, London, 125-133.
Michie, D., Spiegelhalter, D. J., and Taylor, C. C., eds. (1993). "Machine Learning,
Neural and Statistical Classification", Ellis Horwood, New York, USA.
Mill, J. S. (1843). A system of logic, ratiocinative and inductive: being a connected
view of the principles of evidence, and methods of scientific investigation. J.
W. Parker, London.
Minsky, M., and Papert, S. (1969). Perceptrons: An introduction to computational
geometry. M.I.T. Press, Cambridge, MA.
Mitchell, T. M. (1982). "Generalization as search", Artificial Intelligence, 18:202-226.
Mitchell, T. M. (1997). Machine Learning. McGraw-Hill, New York.
Moller, M. F. (1993). "A scaled conjugate gradient algorithm for fast supervised
learning", Neural Networks, 6:525-533.
Muggleton, S., and Buntine, W. (1988). "Machine invention of first order predicates by
inverting resolution." In the proceedings of Fifth International Conference on
Machine Learning, Ann Harbor, MI, USA, 339-352.
Narazaki, H., and Ralescu, A. L. (1999). "Translation and extraction problems for
neural and fuzzy systems: bridging over distributed knowledge representation
in multilayered neural networks and local knowledge representation in fuzzy
systems", In Fuzzy theory, systems, techniques, and applications (Volume 2),
C. T. Leondes, ed., Academic Press, New York, 917-935.
Pearl, J. (1988). Probabilistic reasoning in intelligent systems: networks of plausible
inference. Morgan Kaufmann, San Mateo.
Quinlan, J. R. (1983). "Learning efficient classification procedures and their application
to chess endgames", In Machine Learning: An Artificial Intelligence
Approach, R. S. Michalski, J. G. Carbonell, and T. M. Mitchell, eds., Springer-
Verlag, Berlin, 150-176.
Quinlan, J. R. (1986). "Induction of Decision Trees", Machine Learning, 1(1):86-106.

Quinlan, J. R. (1990). "Learning logical definitions from relations", Machine Learning,
5(3):239-266.
Quinlan, J. R. (1993). C4.5: Programs for Machine Learning. Morgan Kaufmann, San
Mateo, CA.
Rabiner, L. R. (1989). "A tutorial on hidden Markov models and selected applications
in speech recognition", Proceedings of the IEEE, 77(2):257-286.
Ralescu, A. L., and Hartani, R. (1994). "Modelling the perception of facial expressions
from face photographs." In the proceedings of The 10th Fuzzy Systems
Symposium, Osaka, Japan, 554-557.
Rendall, L. A. (1986). "A general framework for induction and a study of selective
induction", Machine learning, 1:177-226.
Rosenblatt, F. (1958). "The perceptron: a probabilistic model for information storage
and organisation of the brain", Psychological Review, 65:386-408.
Rumelhart, D. E., Hinton, G. E., and Williams, R. J. (1986). "Learning internal
representations by error propagation", In Parallel Distributed Processing
(Volume I), D. E. Rumelhart and J. L. McClelland, eds., MIT Press,
Cambridge, USA.
Russell, S., Binder, J., Koller, D., and Kanazawa, K. (1995). "Local learning in
probabilistic networks with hidden variables." In the proceedings of IJCAI,
Montreal.
Russell, S., and Norvig, P. (1995). Artificial Intelligence: A Modern Approach. Prentice-
Hall, Englewood Cliffs, New Jersey, USA.
Samuel, A. L. (1959). "Some studies in machine learning using the game of checkers II
- Recent progress", IBM Journal of Research and Development, 11(6):601-
617.
Schwefel, H. P. (1995). Evolution and optimum seeking. J. Wiley, Chichester.
Shafer, G. (1976). A Mathematical Theory of Evidence. Princeton University Press.
Shanahan, J. G. (1998). "Cartesian Granule Features: Knowledge Discovery of
Additive Models for Classification and Prediction", PhD Thesis, Dept. of
Engineering Mathematics, University of Bristol, Bristol, UK.
Shavlik, J. W., and Dietterich, T. G. (1990a). "General aspects of machine learning", In
Readings in Machine Learning, J. W. Shavlik and T. G. Dietterich, eds.,
Morgan Kaufmann, San Mateo, CA, USA, 1-10.
Shavlik, J. W., and Dietterich, T. G., eds. (1990b). "Readings in Machine Learning",
Morgan Kaufmann, San Mateo, CA, USA.
Simon, H. A. (1983). "Why should machine learn?", In Machine Learning: An
Artificial Intelligence Approach, R. S. Michalski, J. G. Carbonell, and T. M.
Mitchell, eds., Springer-Verlag, Berlin, 25-37.
Stone, M. (1974). "Cross-validatory choice and assessment of statistical predictions",
Journal of the Royal Statistical Society: 111-147 (including discussion).
Sugeno, M., and Yasukawa, T. (1993). "A Fuzzy Logic Based Approach to Qualitative
Modelling", IEEE Trans on Fuzzy Systems, 1(1): 7-31.
Sutton, R. S. (1988). "Learning to predict by methods of Temporal differences",
Machine Learning, 3:9-44.
Sutton, R. S., and Barto, A. G. (1998). Reinforcement learning: an introduction. MIT
Press, Cambridge, MA.
Tackett, W. A. (1995). "Mining the Genetic Program", IEEE Expert, 6:28-28.
Tesauro, G. (1995). "Temporal difference learning and TD-Gammon",
Communications of the ACM, 38(3):58-68.

Valiant, L. G. (1984). "A theory of the learnable", Communications of the ACM,
27:1134-1142.
Vapnik, V. (1995). The nature of statistical learning theory. Springer-Verlag, Berlin.
Watkins, C. J. C. H., and Dayan, P. (1992). "Q learning", Machine learning, 8:279-292.
Winston, P. H., ed. (1975). "The Psychology of Computer Vision", McGraw-Hill, USA.
Wong, M. L., and Leung, K. S. (1995). "Inducing logic programs using genetic
algorithms", IEEE Expert, 9(5):23-34.
Yager, R. R. (1994). "Generation of Fuzzy Rules by Mountain Clustering", J.
Intelligent and Fuzzy Systems, 2:209-219.
PART IV
CARTESIAN GRANULE FEATURES

So far this book has been concerned with knowledge discovery, presenting it as a multi-
step process that discovers useful and valid knowledge from data, where knowledge
representation and machine learning play pivotal roles. Knowledge representation
influences knowledge discovery in many ways, including what can be discovered, how
it can be learned, when it can be learned, the understandability and tractability of the
discovered model and so on. Current approaches to knowledge discovery suffer from
one or more shortcomings that stem from the type of knowledge representation
employed. These include decomposition error, and performance issues such as
transparency, accuracy and efficiency.

The main focus of this part is to introduce a new form of knowledge representation and
corresponding learning algorithms, centred on Cartesian granule features. This
approach addresses some of the shortcomings of other knowledge discovery techniques
outlined above. Chapter 8 describes Cartesian granule features, and shows how fuzzy
sets and probability distributions can be defined over these features and how these can
be incorporated into both fuzzy logic and probabilistic models. Chapter 9 describes
induction algorithms for Cartesian granule feature models for both classification and
prediction problems. These algorithms are analysed and illustrated in Part V.
CHAPTER 8
CARTESIAN GRANULE FEATURES

Current approaches to knowledge discovery suffer from one or more shortcomings that
stem from the type of knowledge representation employed. This chapter introduces a
new form of knowledge representation centred on Cartesian granule features, with
corresponding induction algorithms being presented in the next chapter. This approach
to knowledge representation and related induction algorithms, while not being a
panacea for knowledge discovery, do address some of the shortcomings of other
knowledge discovery techniques such as decomposition error, and performance issues
such as transparency, accuracy and efficiency.

In brief, a Cartesian granule feature is a multidimensional feature, which is built
upon a linguistic partition of the base universe. Fuzzy sets and probability distributions
can be defined over Cartesian granules that make up the universe of a Cartesian granule
feature. In addition, these features can be incorporated into both fuzzy logic and
probabilistic models for both classification and prediction problems.

This chapter begins by providing basic definitions and examples of Cartesian granule
features and related concepts. Subsequently, it looks at the different possibilities for
aggregation within the context of individual Cartesian granule features based upon
fuzzy set theory and probability theory. Finally, it is shown how Cartesian granule
features can be incorporated into evidential logic (additive) and fuzzy logic models.
This results in a slightly modified approximate reasoning process for both fuzzy logic
and support logic reasoning, which is also described.

8.1 CARTESIAN GRANULE FEATURES

Cartesian granule features [Baldwin, Martin and Shanahan 1996; Baldwin, Martin and
Shanahan 1997; Shanahan 1998] were originally introduced to overcome some of the
shortcomings of existing forms of knowledge representation such as decomposition
error and also to enable the paradigm modelling with words through related learning
algorithms. In addition, this approach addresses other shortcomings of knowledge
discovery techniques as outlined above. A Cartesian granule feature can be
multidimensional in nature and is built upon a linguistic partition of the base universe.
This new approach exploits a divide-and-conquer strategy to representation, capturing
knowledge in terms of a network of low-order semantically related features - a network
of Cartesian granule features. The universes of these multidimensional features are
abstractly partitioned or discretised by Cartesian words, known as Cartesian granules.
This section begins by providing some basic definitions and examples of Cartesian
granule features and related concepts. It then provides a more complete presentation of
the motivations behind the introduction of Cartesian granule features. Finally, it
discusses some previous usages of Cartesian granules.

Definition: A granule is a collection of points, which are labelled by a word. This
collection of points is drawn together as a result of indistinguishability, similarity,
proximity or functionality [Zadeh 1994; Zadeh 1996]. A granule can be characterised
by a number of means such as a fuzzy set or a probability distribution (point or set
based).

Definition: A Cartesian granule is an expression of the form w1 × ... × wm, where each wi is
a granule defined over the universe Ωi and where "×" denotes the Cartesian product. A
Cartesian granule can be intuitively visualised as a clump of elements in an n-
dimensional universe.

Definition: A Cartesian granule universe ΩP1×...×Pm is a discrete universe defined
over the cross product of the partitions P1, ..., Pm, where each Pi is a linguistic partition of
the universe Ωi and where "×" denotes the Cartesian product. More concretely, given a
set of features {F1, ..., Fm} defined over the universes {Ω1, Ω2, ..., Ωm} and corresponding
linguistic partitions {P1, ..., Pm}, where each Pi consists of labelled fuzzy sets as
follows: {wi1, wi2, ..., wici}, a Cartesian granule universe ΩP1×P2×...×Pm can be formed by
taking the cross product of the words making up each linguistic partition Pi as follows:

    ΩP1×P2×...×Pm = {w1j1 × w2j2 × ... × wmjm | wiji ∈ Pi, ji ∈ {1, ..., ci}}

where each Cartesian granule is merely a string concatenation of the individual fuzzy
set labels wiji and each ci denotes the granularity of partition Pi. Consider the following
example, where a two-dimensional Cartesian granule universe is formed using example
problem features of Position and Size. To construct a Cartesian granule universe, the
universe of each feature is linguistically partitioned arbitrarily as follows:

PPosition = {Left, Middle, Right}


and
PSize = {Small, Medium, Large}.

The Cartesian granule universe, ΩPPosition×PSize, will then consist of the following
discrete elements (Cartesian granules):

    ΩPPosition×PSize: {Left.Small, Left.Medium, Left.Large,
                       Middle.Small, Middle.Medium, Middle.Large,
                       Right.Small, Right.Medium, Right.Large}.

This is graphically depicted in Figure 8-1.
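
To make this construction concrete, the cross product of the two linguistic partitions can be formed mechanically; the following illustrative Python sketch (the function name and the '.'-separated label convention are chosen purely for presentation) generates the nine Cartesian granules listed above.

from itertools import product

def cartesian_granule_universe(*partitions):
    # Each partition is a list of word labels; a Cartesian granule is the
    # '.'-joined concatenation of one word taken from each partition.
    return [".".join(words) for words in product(*partitions)]

p_position = ["Left", "Middle", "Right"]
p_size = ["Small", "Medium", "Large"]

print(cartesian_granule_universe(p_position, p_size))
# ['Left.Small', 'Left.Medium', 'Left.Large', 'Middle.Small', ...,
#  'Right.Medium', 'Right.Large']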

Definition: A Cartesian granule feature CGF1×F2×...×Fm is a feature defined over a
Cartesian granule universe ΩP1×P2×...×Pm, where each Fi is a domain feature and each Pi
is a linguistic partition of the respective universe Ωi for all i ∈ {1, ..., m}. A Cartesian
granule feature can intuitively be viewed as a multidimensional linguistic variable. For
example, considering the problem features of Position and Size presented above, the
Cartesian granule feature CGPosition×Size could denote a feature defined over the
Cartesian granule universe ΩPPosition×PSize as defined in Figure 8-1.

"l\IIidde.Large" (Cartesian granule)

Ulrge

Medium

Small

.!?-
~
~

-g
<l)

<l)

~ Merrbership
Middle Right

Figure 8-J: The Cartesian granule universe ilPPosition x PSize defined in terms of the
linguistic partitions of the universes ilSize and ilposition-

Definition: A Cartesian granule fuzzy set CGFSF1×F2×...×Fm is a discrete fuzzy set
defined over a Cartesian granule universe ΩP1×P2×...×Pm, where each Fi is a domain
feature and each Pi is a linguistic partition of the respective universe Ωi for all i ∈ {1,
..., m}. Each Cartesian granule is associated with a membership value, which is
calculated by combining the individual granule membership values that individual
feature values have in the fuzzy sets that characterise the granules. For example,
consider the Cartesian granule w11 × ... × wm1, where each wi1 is the word associated with
the first fuzzy subset in each linguistic partition Pi. The membership value associated
with this Cartesian granule w11 × ... × wm1 for a data tuple <x1, ..., xm> is calculated as
follows:

    w11 × ... × wm1 / (μw11(x1) ∧ ... ∧ μwm1(xm))

where xi is the feature value associated with the i-th feature within the data vector. Here
the aggregation operator ∧ can be interpreted as any t-norm (see Section 3.5.1) such as
product or min. The choice of conjunction operator is considered in Section 8.2.

Extending the example presented above, if the universes ΩPosition and ΩSize are
defined as [0, 100] and [0, 100] respectively, then possible definitions of the fuzzy sets
in partitions PPosition and PSize (in Fril notation [Baldwin, Martin and Pilsworth 1995])4
could be:

    Left:   [0:1, 50:0]              Small:  [0:1, 50:0]
    Middle: [0:0, 50:1, 100:0]       Medium: [0:0, 50:1, 100:0]
    Right:  [50:0, 100:1]            Large:  [50:0, 100:1].

Linguistic partitions provide a means of giving the data a more anthropomorphic feel,
thereby enhancing understandability. In essence, when generating a Cartesian granule
fuzzy set corresponding to a data tuple, it first fuzzifies (or reinterprets) the single
attribute values. Returning to the example, the attribute values for Position and Size are
reinterpreted in terms of the words that partition the respective universes, that is, a
linguistic description of the data is generated. Taking a sample data tuple (of the form
<Position, Size>) <60, 80> (denoted as <x, y> in Figure 8-1), each data value is
individually linguistically summarised in terms of two fuzzy sets {Middle/0.8 +
Right/0.2} and {Medium/0.4 + Large/0.6}. Subsequently, taking the Cartesian product of
these fuzzy data yields the following fuzzy set in the Cartesian granule universe:

    CGFSPosition×Size(60, 80) = {Middle.Medium/0.32 + Middle.Large/0.48 +
                                 Right.Medium/0.08 + Right.Large/0.12}.

Here the combination operator ∧ is interpreted as product.
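
This worked example can be reproduced with the following illustrative Python sketch, which evaluates the Fril-style fuzzy sets by linear interpolation between the listed points (clamping outside them) and combines the granule memberships using product; the helper names are illustrative rather than part of the Fril system.

def fril_membership(fuzzy_set, x):
    # Membership of x in a Fril-style fuzzy set [(point, membership), ...],
    # using linear interpolation and clamping outside the listed points.
    pts = sorted(fuzzy_set)
    if x <= pts[0][0]:
        return pts[0][1]
    if x >= pts[-1][0]:
        return pts[-1][1]
    for (x0, m0), (x1, m1) in zip(pts, pts[1:]):
        if x0 <= x <= x1:
            return m0 + (m1 - m0) * (x - x0) / (x1 - x0)

P_POSITION = {"Left": [(0, 1), (50, 0)], "Middle": [(0, 0), (50, 1), (100, 0)],
              "Right": [(50, 0), (100, 1)]}
P_SIZE = {"Small": [(0, 1), (50, 0)], "Medium": [(0, 0), (50, 1), (100, 0)],
          "Large": [(50, 0), (100, 1)]}

def linguistic_description(partition, value):
    # Re-interpret a numeric value as a fuzzy set over the partition words.
    desc = {w: fril_membership(fs, value) for w, fs in partition.items()}
    return {w: m for w, m in desc.items() if m > 0}

def cg_fuzzy_set(descriptions, combine=lambda a, b: a * b):
    # Cartesian granule fuzzy set from per-feature linguistic descriptions
    # (product conjunction by default; pass combine=min for the min t-norm).
    result = {"": 1.0}
    for desc in descriptions:
        result = {(g + "." + w).lstrip("."): combine(m, mu)
                  for g, m in result.items() for w, mu in desc.items()}
    return result

pos = linguistic_description(P_POSITION, 60)   # {'Middle': 0.8, 'Right': 0.2}
siz = linguistic_description(P_SIZE, 80)       # {'Medium': 0.4, 'Large': 0.6}
print(cg_fuzzy_set([pos, siz]))
# (rounded) {'Middle.Medium': 0.32, 'Middle.Large': 0.48,
#            'Right.Medium': 0.08, 'Right.Large': 0.12}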

8.1.1 Why Cartesian granule features?


There are a number of motivations behind the introduction of Cartesian granule
features. Some of these were alluded to earlier, but here a more complete description is
provided.

8.1.1.1 Modelling and computing with words


One of the primary concerns in connection with intelligent systems is that they should
be able to interact naturally with their environment. An integral part of many domains
(such as the knowledge discovery process itself or the outputs of knowledge discovery)
is the human; consequently, the discovered system needs to interact with the human.
This can be achieved by a variety of means and at many different levels, such as a
graphic display of trend data. However, one of the most natural forms of
communication (and sometimes most effective) is through words. Traditionally, this

4 A fuzzy set definition in Fril such as Middle: [0:0, 50:1, 100:0] can be rewritten
mathematically as follows (denoting the membership value of x in the fuzzy set
Middle):

    μMiddle(x) = 0              if x ≤ 0
                 x/50           if 0 < x ≤ 50
                 (100 - x)/50   if 50 < x < 100
                 0              if x ≥ 100

sort of communication has been implemented through symbolic knowledge


representations. However, to date, most of these approaches have had only mild success
in terms of performance accuracies compared to their mathematical counterparts.

The approach proposed here tries to fulfil both desires by discovering models that are
not only accurate but also understandable. This is enabled by the use of Cartesian
granule features - multidimensional features built on words. Learning Cartesian granule
feature models reduces to a simple probabilistic counting of linguistic interpretations of
data (numerical or otherwise). For example, consider Figure 8-2(b), which graphically
displays a linguistic partition of the Position variable. The variable value of 40 can be
linguistically summarised or described using the following fuzzy set: {Left/0.2 +
Middle/1}. Consequently, due to the linguistic nature of Cartesian granule features,
modelling with Cartesian granule features enables the paradigm modelling with words,
where words, characterised by fuzzy granules, provide tractability, transparency and
generalisation. Similarly, reasoning in a Cartesian granule feature context can be
viewed as computing with words [Zadeh 1996]. As a result, Cartesian granule feature
models can facilitate a more natural interaction between the human and the computer.
In a sense, a knowledge discovery process centred on Cartesian granule features
discovers knowledge by letting your data speak (literally!). Part V of this book gives
concrete examples of this in terms of real world problems. Learning Cartesian granule
feature models is presented in detail in Chapter 9.
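
As a rough sketch of this counting idea only (the precise extraction of class fuzzy sets, including the formal probability-to-fuzzy-set transformation, is described in Chapters 5 and 9), the Cartesian granule descriptions of a class's training examples can be accumulated and rescaled so that the most supported granule receives membership 1; here describe stands for any function returning the Cartesian granule fuzzy set of a data vector.

from collections import defaultdict

def class_cg_fuzzy_set(class_examples, describe):
    # Accumulate granule membership mass over the class's examples and
    # normalise by the largest total, so the most supported granule gets 1.
    counts = defaultdict(float)
    for example in class_examples:
        for granule, membership in describe(example).items():
            counts[granule] += membership
    top = max(counts.values())
    return {granule: total / top for granule, total in counts.items()}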

8.1.1.2 Model transparency


Computational learning approaches which generate models that are opaque to the user,
while facilitating machine learning, may not facilitate human learning. As a
consequence of modelling with words, the learnt models have the potential of becoming
more transparent or glassbox in nature and therefore more amenable to human
inspection and understanding. Transparency is enabled through the succinct and
linguistic nature of Cartesian granule features. For example, consider the probability
density presented in Figure 8-2(a) defined over the universe of the Position variable.
To describe this density requires the specification of a lot of information. However,
using a Cartesian granule feature approach, this density could be summarised by three
words and corresponding membership values, that is, a fuzzy set defined in terms of the
three words, Left, Middle, and Right: {Left/0.33 + Middle/1 + Right/0.17} as depicted in
Figure 8-2(c) (see Chapters 5 and 9 for a more detailed presentation of the formal bi-
directional transformation between probability distributions and linguistic fuzzy sets).
Each word is characterised by a fuzzy set, as depicted in Figure 8-2(b). Even though
there is a drastic reduction in the amount of information to be represented, there is no
significant loss of information. This is an example of exploiting uncertainty, in this case
imprecision, in order to achieve tractability and transparency on the one hand and
generalisation on the other. These claims are empirically supported by the results
presented in Part V of this book, where both real world and benchmark problems are
addressed by Cartesian granule feature models and compared with approaches that rely
on the more traditional forms of representation (such as probability densities). Figure
8-2(c) can be seen as a succinct linguistic description, a summarisation, of the
probability density in Figure 8-2(a).

[Figure 8-2 appears here, with three panels: (a) a concept probability density over
ΩPosition; (b) the linguistic partition of ΩPosition into Left, Middle and Right fuzzy sets;
(c) the concept Cartesian granule fuzzy set {Left/0.33 + Middle/1 + Right/0.17}.]

Figure 8-2: Concept descriptions in terms of a probability density and a Cartesian
granule fuzzy set (a) concept probability density; (b) linguistic partition of the universe
of Position; (c) a concept Cartesian granule fuzzy set.

8.1.1.3 Eliminating the decomposition error


The decomposition error can be defined as the error that arises, when an n-variable
function is expressed as a composition of functions, each of which has less than n
variables. Many learning methodologies synthesise models in which the attributes
(variables) are utilised individually and later combined. For example, naïve Bayesian
[Duda and Hart 1973] approaches rely on total decomposition to model a problem
(using the naïve assumption that input variables are assumed to be conditionally
independent given the target value). Figure 8-3 graphically depicts a possible naïve
Bayes classifier for the parking problem, in terms of the class densities. The class
densities, in this case, are not very discriminating and thus, lead to a model with low
accuracy. The main reason for this lack in performance, high decomposition error, is
due to the decomposed nature of naïve Bayes. The Cartesian granule feature approach
proposed here provides a new form of knowledge representation that focuses on
modelling small clusters of semantically related variables, thereby providing
transparency, while alleviating the error due to the decomposed usage of attributes.
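
To make the notion of total decomposition concrete, a generic naïve Bayes scorer can be sketched as follows (an illustrative outline only; the per-class, per-attribute density callables are an assumed interface): each attribute contributes individually through its class conditional, and the contributions are only combined at the end.

import math

def naive_bayes_classify(x, priors, conditionals):
    # conditionals[c][i] is assumed to be a callable returning P(x_i | class c);
    # total decomposition: score(c) = log P(c) + sum_i log P(x_i | c).
    scores = {c: math.log(p) + sum(math.log(conditionals[c][i](xi))
                                   for i, xi in enumerate(x))
              for c, p in priors.items()}
    return max(scores, key=scores.get)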

8.1.1.4 Comprehensive learning approach


Most symbolic approaches to computational learning tend to avoid or model badly
prediction (regression) problems. A further motivation of the work presented in this
book is to present a comprehensive framework for modelling classification, prediction
and unsupervised problem domains effectively.

[Figure 8-3 appears here: class conditional probability densities for successfulParking
and unsuccessfulParking, estimated under the naïve Bayes assumption, plotted over the
input space together with the positive (p) and negative (n) examples.]

Figure 8-3: An example of modelling the car parking problem using approaches based
upon total decomposition. This figure shows the resulting probability density functions
(class conditional) using the naïve Bayes approach (see Sections 5.2.2 and 7.5.2.2 for a
further explanation).

8.1.1.5 Avoiding local optimum models


Learning can be viewed as a mammoth search in the space of generalisations (models).
Traditional learning approaches such as neural networks or decision trees, to overcome
issues of tractability during learning, have resorted to greedy search techniques such as
gradient descent or hill-climbing. These local approaches to search, while
computationally attractive, are vulnerable to learning models that are locally optimum,
that is, leading potentially to overly complex or overly simplified models that provide
poor generalisation. The proposed induction algorithm (G_DACG, see Chapter 9) for
Cartesian granule feature models is based on a global evolutionary search technique,
and tends to identify near globally optimal models (in terms of model accuracy and
simplicity), and thus avoids some of the pitfalls of other induction algorithms such as
poor feature selection and feature abstraction. For example, in the case of decision tree
induction (using the ID3 algorithm [Quinlan 1986]), concepts are iteratively refined by
adding or removing features from concept definitions based upon entropy measures
(see Section 7.5.2.1). The greedy nature of such search algorithms can result in locally
optimal models.

8.1.1.6 Knowledge stability


Discretisation is a well-known problem in statistics and machine learning, where slightly
different partitions of a domain can lead to significantly different models (distributions,
decision trees etc.) [Baldwin and Pilsworth 1997; Shanahan 1998; Silverman 1986].
Owing to their fuzzy granular nature, Cartesian granule features help avoid this
problem, by allowing a graded transition between granules, in contrast to the sharp
transition in crisp approaches that can lead to sharp discontinuities and consequently
significantly different models. Moreover, the fuzzy nature of the granules in Cartesian
granule features result in models that are more stable, that is, small changes in the fuzzy
partition do not result in significantly different models (see Section 10.2.5.1 for further
details).

8.1.2 Other usages of Cartesian granules


In fuzzy logic, the definition of a Cartesian granule [Zadeh 1996] has been used
extensively in the context of the Cartesian granules that occur in the antecedent portion
of decision making fuzzy rules in the traditional (decomposed) sense. Whereas here, the
usage of Cartesian granules is extended to a new type of feature called a Cartesian
granule feature over which fuzzy sets or probability distributions can be defined. These
fuzzy sets are expressed in terms of Cartesian granules, a higher level of abstraction
than traditional fuzzy logic, where fuzzy sets are defined directly in terms of the
domain values. In other words, Cartesian granule fuzzy sets are defined in terms of the
fuzzy sets that characterise the Cartesian granules. Reasoning within a Cartesian
granule feature context is, in a sense, computing with words (i.e. in terms of linguistic
descriptions of data and not with measurements). In traditional fuzzy logic, reasoning is
performed at the individual rule level (or Cartesian granule level), whereas, for
reasoning in the context of Cartesian granule feature models, the Cartesian granules
make up fuzzy sets defined of these feature spaces, and consequently, are treated in
unison during inference. For example, using the compositional rule of inference (CRI,
see Section 4.2.1) would result in reasoning at the rule level, which directly
manipulates domain values and memberships in the case of traditional fuzzy logic,
while for Cartesian granule feature models, CRI would reason at a higher level of
abstraction, manipulating Cartesian granules and corresponding memberships. In the
case of [Ishibuchi et al. 1995], each Cartesian granule rule is associated with a certainty
factor, however reasoning occurs at the local granule level, in contrast to the global
level in Cartesian granule features. These differences will become more obvious during
the remainder of this chapter, where the reasoning strategies used within Cartesian
granule feature models are presented, and during successive chapters, where the
learning algorithms for such models and real world applications are described.

8.2 CHOICE OF COMBINATION OPERATOR

As mentioned previously, when constructing Cartesian granule fuzzy sets, there are
infinite ways of generating the membership values associated with the individual
Cartesian granules. Fuzzy and probabilistic approaches for generating these values are
examined. In the case of the fuzzy approaches, two commonly used operators - min and
product - are investigated and justified from a voting model perspective. This section is

included here for completeness and can be skipped on a first reading of this chapter
without any loss of continuity.

8.2.1 Generating Cartesian granule fuzzy sets via fuzzy approaches


Consider the Cartesian granule w11 × ... × wm1, where each wi1 is the word associated with
the first fuzzy subset in each linguistic partition Pi. When presented with a data vector
x, the membership value associated with the Cartesian granule w11 × ... × wm1 is
calculated as a function f of the fuzzified individual attribute values:

    w11 × ... × wm1 / f(μw11(x1), ..., μwm1(xm))

where xi is the feature value associated with the i-th feature in the data vector x.
Within fuzzy logic algebra, any of the functions that satisfy the t-norm axioms (see
Section 3.5.1) can be used as a conjunction operator, such as the min operator:

    w11 × ... × wm1 / min(μw11(x1), ..., μwm1(xm))

or the product operator:

    w11 × ... × wm1 / (μw11(x1) · ... · μwm1(xm)).
Both conjunction operators are commonly used in fuzzy logic and fuzzy control
applications [Baldwin, Martin and Pilsworth 1995; Klir and Yuan 1995]. In a more
general setting, the averaging operations such as Yager's OWA operators [Yager 1993]
or parameterised aggregators such as Zimmermann's γ operator [Zimmermann and
Zysno 1980] could be used (see Sections 3.5.2 and 3.5.3 for details of these aggregation
operators). The next subsection justifies the use of product and min as combination
operators from a human reasoning perspective using voting model semantics.

8.2.1.1 Voting model justification of conjunction operators


The voting model lends a semantic interpretation of concepts expressed in terms of
fuzzy sets and mass assignments from a human reasoning perspective. It is based upon
a frequentist viewpoint. This section examines the applicability of product and min as
conjunction operators for combining the individual granule memberships when forming
Cartesian granule fuzzy sets and justifies their use using a voting model.

The definition of conjunction as the min operator is consistent with the voting model
interpretation of fuzzy sets, provided the voters vote consistently on all concepts that
are combined conjunctively [Baldwin 1991], i.e. the constant threshold assumption is
extended to cover all concepts. This use of min is justified with the following example.
Consider two die variables. Both die variables are defined over the following universe
of values:

    ΩDieValues: {1, ..., 6}.



The two dice are thrown resulting in die1 having a value of 5 and die2 having a value
of 6. A representative population of voters are then asked to vote on the appropriateness
of the words Small, Medium, and Large as a description of each die value. This voting
is performed independently for each of the die values resulting in a voting pattern for
"die1 having a value of 5" presented in Table 8-1 and a voting pattern for "die2 having a
value of 6" presented in Table 8-2.

Table 8-1: A voting pattern for 10 people defining the linguistic description of die1
having the value of 5. This corresponds to the fuzzy set {Small/0.1 + Medium/0.7 +
Large/1}.

Word\Person  1   2   3   4   5   6   7   8   9   10
Large Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes
Medium Yes Yes Yes Yes Yes Yes Yes No No No
Small Yes No No No No No No No No No

Assuming that the voters who were optimistic in voting for the linguistic description of
die1 having a value 5 share the same optimism when voting for the linguistic
description of die2 having a value 6, the voters in both voting patterns can be directly
matched. This leads to a voting pattern for the conjunction of both linguistic
descriptions that is presented in Table 8-3 . In this case, the cells containing Yes
correspond to voters who accept the Cartesian granules as appropriate descriptions of
both die values. This resulting voting pattern generates the following fuzzy set:

    {d1MediumANDd2Medium/0.7 +
     d1MediumANDd2Large/0.7 +
     d1LargeANDd2Medium/0.8 +
     d1LargeANDd2Large/1}

which coincides with the fuzzy set generated by using the min rule for the conjunction
of the individual granule memberships. This example illustrates that using min as a
granule conjunction operator is intuitive.

Table 8-2: A voting pattern defining the linguistic description of die2 having the value
of 6. This corresponds to the fuzzy set {Medium/0.8 + Large/1}.

Word\Person  1   2   3   4   5   6   7   8   9   10
Large Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes
Medium Yes Yes Yes Yes Yes Yes Yes Yes No No

A similar argument can be used to justify the use of product as the conjunction operator
of granule memberships. In this case, the constant threshold assumption is dropped, that
is, a voter's degree of optimism/pessimism is allowed to vary across voting for different
concepts. Once again the justification for using product conjunction is illustrated using
an example. This justification uses the linguistic descriptions generated by a voting

population for "diel =5" and "die2 = 6" that are presented in Table 8-1 and Table 8-2
respectively (i.e. using the same patterns that were used in justifying the min operator).
As a result of dropping the constant threshold assumption, the voters labelled 1 to I 0 in
the voting patterns for diel may not correspond to voters labelled 1 to IO in the voting
patterns for die2. In other words, there is no correlation between the voters in the voting
pattern for diel and the voters in the voting pattern for die2, that is the voter labelled I
in Table 8-1 may not correspond to the voter labelled 1 in Table 8-2. This is depicted
in Table 8-4 for the linguistic description of "die2 = 6", where each VP j denotes a voter
variable that can be assigned any of the ten voters. Table 8-5 depicts one possible
voting pattern for the die2 value. Consequently, this results in the voting pattern for the
conjunction of the voting patterns for linguistic descriptions of the dieJ value (Table 8-
1) and the die2 value (Table 8-2) that is presented in Table 8-6. However, there are
many possible instantiations for the voter variables VP j , each resulting in a different
overall voting pattern for the die2 value. This in turn results in a different voting pattern
for the conjunction of both patterns and subsequently a different corresponding fuzzy
set. No voter instantiation is preferable to another. Consequently, all voting patterns are
equally likely for linguistic descriptions of "die2 = 6". Since all voting patterns are
equally likely, the expected fuzzy set can be taken as the fuzzy set corresponding to
conjunction of linguistic descriptions of the individual die instantiations. This results in
the following Cartesian granule fuzzy set:

    {d1MediumANDd2Medium/0.56 +
     d1MediumANDd2Large/0.7 +
     d1LargeANDd2Medium/0.8 +
     d1LargeANDd2Large/1}

Table 8-3: A voting pattern for 10 people corresponding to the linguistic description of
"die1 = 5 and die2 = 6", which denotes the fuzzy set {d1MediumANDd2Medium/0.7 +
d1MediumANDd2Large/0.7 + d1LargeANDd2Medium/0.8 + d1LargeANDd2Large/1}.

Cartesian word\Person    1   2   3   4   5   6   7   8   9   10
d1LargeANDd2Large        Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes
d1LargeANDd2Medium       Yes Yes Yes Yes Yes Yes Yes Yes No  No
d1MediumANDd2Medium      Yes Yes Yes Yes Yes Yes Yes No  No  No
d1MediumANDd2Large       Yes Yes Yes Yes Yes Yes Yes No  No  No

This fuzzy set coincides with the fuzzy set generated by using the product conjunction
of the individual granule memberships. This example illustrates that using product as a
granule conjunction operator is intuitive.
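
The same combination can be carried out mechanically; the following illustrative sketch forms the Cartesian granule fuzzy set of the two die descriptions under min and under product conjunction, reproducing the two fuzzy sets quoted above (the Small/0.1 term of the die1 description is left out, as it is in the quoted Cartesian granule fuzzy sets).

die1 = {"Medium": 0.7, "Large": 1.0}   # linguistic description of die1 = 5 (Small/0.1 omitted)
die2 = {"Medium": 0.8, "Large": 1.0}   # linguistic description of die2 = 6

def conjoin(f1, f2, op):
    # Cartesian granule fuzzy set over the cross product of the two word sets.
    return {"d1" + w1 + "ANDd2" + w2: op(m1, m2)
            for w1, m1 in f1.items() for w2, m2 in f2.items()}

print(conjoin(die1, die2, min))
# {'d1MediumANDd2Medium': 0.7, 'd1MediumANDd2Large': 0.7,
#  'd1LargeANDd2Medium': 0.8, 'd1LargeANDd2Large': 1.0}
print(conjoin(die1, die2, lambda a, b: a * b))
# (rounded) {'d1MediumANDd2Medium': 0.56, 'd1MediumANDd2Large': 0.7,
#            'd1LargeANDd2Medium': 0.8, 'd1LargeANDd2Large': 1.0}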

Table 8-4: A "general" voting pattern defining the linguistic description ofdie2 having
the value of6. This corresponds to the fuzzy set {Medium/0.8 + Large/I}.
Word\Person VPI VP2 VP VP4 VP5 VP6 VP7 VP8 VP9 VPI
Large Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes
Medium Yes Yes Yes Yes Yes Yes Yes Yes No No

Table 8-5: A possible voting pattern for the linguistic description of die2 having the
value of 6. This corresponds to the fuzzy set {Medium/0.8 + Large/1}.

Word\Person  4   2   3   7   10  8   1   5   9   6
Large        Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes
Medium       Yes Yes Yes Yes Yes Yes Yes Yes No  No

Table 8-6: A possible voting pattern for the conjunction of the linguistic description of
"die1 = 5" (as presented in Table 8-1) and the linguistic description of "die2 = 6" (as
presented in Table 8-5). This corresponds to the fuzzy set
{d1MediumANDd2Medium/0.5 + d1MediumANDd2Large/0.5 +
d1LargeANDd2Medium/0.7 + d1LargeANDd2Large/1}.

Cartesian word\Person    1   2   3   4   5   6   7   8   9   10
d1LargeANDd2Large        Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes
d1LargeANDd2Medium       Yes Yes Yes Yes Yes No  Yes Yes No  No
d1MediumANDd2Medium      Yes Yes Yes Yes No  No  Yes No  No  No
d1MediumANDd2Large       Yes Yes Yes Yes No  No  Yes No  No  No

The previous paragraphs have justified from a voting model perspective the
applicability of both product and min as conjunction operators for combining the
individual granule memberships. However, the use of the product operator is preferred
as it gives more discrimination between different data values, whereas the min operator
can exhibit a plateau (non-discriminating) behaviour. Furthermore, when mutually
exclusive partitions are used to partition the base variables' universes, using the product
maintains this property of mutual exclusiveness in the Cartesian granule universe.

8.2.2 Generating Cartesian granule fuzzy sets via probability theory
So far, Cartesian granule fuzzy sets have been generated directly from the membership
values associated with the linguistic descriptions of the raw data; maintaining truth
functionality. An alternative approach is considered here, where the Cartesian granule
fuzzy sets are generated via the least prejudiced distributions associated with the
linguistic descriptions of the raw data.

For presentation purposes, this approach to forming Cartesian granule fuzzy sets is
described using an illustrative example. Consider a two-dimensional Cartesian granule
consisting of two features F1 and F2 defined over the set ℝ of real numbers with
corresponding linguistic partitions where the granules are characterised by (mutually
exclusive) triangular fuzzy sets as depicted in Figure 8-4. In the general case, where the
variable F1 is assigned a data value, i.e. F1 = Data, it can be reinterpreted as a linguistic
assignment as shown in Figure 8-4. The linguistic description of Data will take the
form of the following fuzzy set:

    LDData = w2/x + w3/y                                            (8-1)

where 0 ≤ y ≤ x ≤ 1.

[Figure 8-4 appears here: a mutually exclusive triangular linguistic partition
{w1, ..., w5} of the feature universe, with a data value Data intersecting the fuzzy sets
w2 and w3 at membership levels x and y respectively.]

Figure 8-4: Generating a linguistic description of data using the linguistic partition of
the variable universe ΩF1, which is characterised by mutually exclusive triangular fuzzy
sets.

This linguistic fuzzy set LDData corresponds to the following mass assignment:

    MAData = ({w2}: x - y, {w2, w3}: y, ∅: 1 - x).

The mass associated with the null set ∅ (arising from the subnormal fuzzy set LDData)
is redistributed amongst each element in the core of the mass assignment according to a
renormalised prior (assume a uniform prior for this presentation). Other redistributions
are also possible and are considered in Section 8.2.2.1. This leads to the following
revised mass assignment:

    MAData = ({w2}: x - y + (1 - x)/2, {w2, w3}: y, {w3}: (1 - x)/2)

and corresponding least prejudiced distribution:

    LPDData = w2: x, w3: y.

As a result of the mutually exclusive nature of the fuzzy partition in this case and the
strategy used to redistribute the null set mass, the probability associated with each of
the words in the least prejudiced distribution LPDData coincides with the membership
value associated with the word in the original linguistic fuzzy set. On the other hand,
assigning a data value to variable F2 results in similar fuzzy descriptions, mass
assignments and least prejudiced distributions. Subsequently, the joint probability
distribution is formed over these words by associating the product of the individual
probabilities with the Cartesian granules. The use of product here is justified on the
grounds that the linguistic partitions of the base features were generated independently
of each other. In terms of the example, let the least prejudiced distributions
corresponding to the two data values of variables F J and F2 be defined as follows:

    LPDData1 = w12: x1, w13: y1

    LPDData2 = w22: x2, w23: y2.

Combining these least prejudiced distributions leads to the following joint probability
distribution over the Cartesian granules:

    Pr(w12 × w22) = x1·x2,   Pr(w12 × w23) = x1·y2,
    Pr(w13 × w22) = y1·x2,   Pr(w13 × w23) = y1·y2.

This joint least prejudiced distribution can be converted to the corresponding unique
fuzzy set via its mass assignment using the membership-to-probability transformation
(see Chapter 5), yielding the corresponding Cartesian granule fuzzy set. In this case, due
to the mutually exclusive nature of the underlying linguistic partitions of the variable
universes, the resulting Cartesian granule fuzzy set coincides with the Cartesian granule
fuzzy set obtained when individual linguistic descriptions (fuzzy sets) are combined
using the product operation.

8.2.2.1 Generating normal Cartesian granule fuzzy sets via LPDs


Mutually exclusive linguistic partitions generally result in subnormal linguistic fuzzy
sets. Normal fuzzy sets will occur where the data value coincides with the core of one
of the fuzzy sets. Mass assignment theory can be used to convert a non-normal fuzzy
set to a normal fuzzy set. Consider the following mass assignment, which corresponds
to the subnormal linguistic description LDData in Equation 8-1:

    MAData = ({w2}: x - y, {w2, w3}: y, ∅: 1 - x).

This is an incomplete mass assignment [Baldwin 1992] and does not correspond to a
family of probability distributions. Instead it will correspond to a non-normalised
family of probability distributions, in which the probabilities of the domain elements
sum to x (the remaining mass 1 - x being associated with the null set ∅).
On the other hand, by redistributing the mass associated with the null set ∅, normal
probability distributions can be generated. There are infinite ways of redistributing the
mass associated with the null set ∅, thus leading to many different families of
probability distributions. Using a voting model interpretation of this mass assignment,
two ways to distribute the mass associated with the null set ∅ amongst the other focal
elements can be justified: the mass can be distributed amongst the other active domain
elements (i.e. elements of the core of this mass assignment) according to the prior; or
alternatively the mass can be distributed amongst the other focal elements in proportion
to their associated masses [Baldwin, Martin and Pilsworth 1995]. The first approach to
redistributing the mass associated with the null set ∅ using the renormalised prior

(assume a uniform prior here) amongst the domain elements yields the following mass
assignment and least prejudiced distribution:

    MAData = ({w2}: x - y + (1 - x)/2, {w2, w3}: y, {w3}: (1 - x)/2)

    LPDData = w2: x, w3: y.

The probabilities in the least prejudiced distribution LPDData coincide with the
membership values in the original fuzzy set.

Conversely, the mass can be redistributed amongst the other focal elements in
proportion to their associated masses, thereby increasing the mass associated with {w2}
by (x - y)(1 - x)/x and the mass associated with {w2, w3} by y(1 - x)/x. This results in the
following mass assignment and least prejudiced distribution:

    MAData = ({w2}: (x - y)/x, {w2, w3}: y/x)

    LPDData = w2: 1 - y/(2x), w3: y/(2x).

The least prejudiced distribution LPDData in this case corresponds to normalising the
fuzzy set before transforming it into its corresponding mass assignment and least
prejudiced distribution. Both methods result in different least prejudiced distributions
but are equally justifiable.
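
Both redistribution strategies can be illustrated with the following sketch (the function and variable names are purely illustrative), which computes the mass assignment of the two-word fuzzy set w2/x + w3/y and its least prejudiced distributions under each strategy, reproducing the expressions above for x + y = 1.

def mass_assignment(fuzzy_set):
    # Mass assignment of a discrete fuzzy set {element: membership}: nested
    # focal sets receive successive membership differences; the null set
    # receives 1 minus the largest membership.
    items = sorted(fuzzy_set.items(), key=lambda kv: -kv[1])
    mus = [m for _, m in items] + [0.0]
    ma = {}
    for i in range(len(items)):
        mass = mus[i] - mus[i + 1]
        if mass > 0:
            ma[frozenset(e for e, _ in items[: i + 1])] = mass
    return ma, 1.0 - mus[0]          # (focal-set masses, null-set mass)

def lpd(ma):
    # Least prejudiced distribution: share each focal set's mass equally.
    dist = {}
    for focal, mass in ma.items():
        for e in focal:
            dist[e] = dist.get(e, 0.0) + mass / len(focal)
    return dist

x, y = 0.7, 0.3                       # LDData = w2/x + w3/y with x + y = 1
ma, null = mass_assignment({"w2": x, "w3": y})

# Strategy 1: give the null-set mass to the domain elements via a uniform prior.
ma1 = dict(ma)
for e in ("w2", "w3"):
    ma1[frozenset([e])] = ma1.get(frozenset([e]), 0.0) + null / 2
print(lpd(ma1))                       # {'w2': 0.7, 'w3': 0.3}  i.e. w2: x, w3: y

# Strategy 2: share the null-set mass among the focal elements pro rata.
total = sum(ma.values())
ma2 = {focal: mass + null * mass / total for focal, mass in ma.items()}
print(lpd(ma2))                       # w2: 1 - y/(2x) ~ 0.786, w3: y/(2x) ~ 0.214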

So far, this section has focused on the generation of Cartesian granule fuzzy sets via the
least prejudiced distributions associated with the feature linguistic descriptions where
the underlying partitions are mutually exclusive. However, a similar approach could be
taken where the underlying partitions are not mutually exclusive. In this case, the
resulting linguistic descriptions may be normal for all domain values, thereby
simplifying the aggregation process.

In this book Cartesian granule fuzzy sets corresponding to data vectors are generated
using the approach presented in Section 8.1, whereby the individual granule memberships
are combined using the product conjunction operator. This preserves truth functionality,
and is a more efficient and a simpler (involving fewer steps) way of generating
Cartesian granule fuzzy sets. Empirical evidence to date suggests there is little
difference between any of the aggregation approaches considered here when they are
employed in a machine learning context [Baldwin, Martin and Shanahan 1997].

8.3 CARTESIAN GRANULE FEATURE RULES

In modelling a problem domain, since Cartesian granule features can assume fuzzy set
or probabilistic values, they can be quite naturally incorporated into conjunctive,

evidential and causal relational rule structures, thus enabling reasoning using support
logic [Baldwin, Martin and Pilsworth 1995]. Alternatively, Cartesian granule features
can be incorporated into fuzzy logic rules, thus enabling approximate reasoning based
on CRI as presented in Chapter 4. Even though it is possible to combine all the features
(or base variables) of a problem into one Cartesian granule feature, it may not always
be desirable. For example, in Section 10.4, the need for discovering structural
decomposition of input feature spaces into lower order feature spaces is motivated by
the L problem. In general, decomposition is required not only on generalisation grounds
but also from knowledge transparency and tractability perspectives [Baldwin, Martin
and Shanahan 1998; Shanahan 1998]. This partial decomposition can be viewed as a
form of decomposition of the problem domain into low order relationships between
small clusters of semantically related variables, similar in spirit to Bayesian networks
[Pearl 1986], where a Cartesian granule feature represents each cluster of semantically
related variables (variables that have dependencies, such as functional or probabilistic
dependencies). This correlation between Bayesian belief networks and Cartesian
granule features is further discussed in Section 10.4.1.2. As a result of this
decomposition, a means of aggregating the individual Cartesian granule features is
required. In this book, the evidential logic rule is chosen as a natural mechanism for
representing this type of decomposed approach to systems modelling [Shanahan 1998].
This type of model is referred to as an additive model. On the other hand, using the
conjunctive rule to aggregate the individual Cartesian granule features results in a
product model. In Section 10.2.2.1, the use of product and additive models is
compared on the ellipse dataset. As mentioned earlier (Chapter 6) in the context of
evidential rules, additive models permit partial reasoning (i.e. tolerates missing values),
which can be an attractive facet in very uncertain problem domains.

More concretely stated, an additive model will consist of an evidential logic rule
corresponding to each class in the problem domain. An evidential logic rule structure is
reviewed here from a Cartesian granule feature perspective (see Section 6.1.2 for a
complete description) and is depicted in Figure 8-5. Here CLASS can be viewed as a
fuzzy set consisting of a single crisp value, in the case of classification type problems,
or as a fuzzy set characterising part of the output variable universe in the case of
prediction problems. Each rule characterises the relationship between input and output
data for a particular region of the output space, i.e. a concept. The body (conditional
part) of each rule consists of a collection of Cartesian granule features Fi, whose values
CGFSiCLASS correspond to fuzzy sets defined over respective universes ΩFi that
correspond to the output variable value CLASS (in probabilistic terms this can be
viewed as the class conditional Pr(Fi = CGFSiCLASS | Classification = CLASS)). Each
feature Fi is associated with a weight term wi that reflects the importance of this feature
to CLASS.

8.4 APPROXIMATE REASONING USING CARTESIAN GRANULE


FEATURE MODELS

This section considers the support logic approximate reasoning process from an
additive Cartesian granule feature model perspective. As described in detail in the

previous section, each rule consists of a body of Cartesian granule features and their
corresponding fuzzy set values. The first step in the inference process consists of
generating Cartesian granule fuzzy sets from the incoming data vector Xi corresponding
to each (CG) feature (as described in Section 8.1). This results in a Cartesian granule
fuzzy set description CGD; of Xi for each feature F;. Subsequently, for each feature F;
the first level of inference (as described in Section 6.2.1) is performed. That is, a fuzzy
set match is performed using semantic unification between each class fuzzy set
CGFSiCLASS and the corresponding data fuzzy set CGDi as follows:

    SU(CGFSiCLASS | CGDi)

where CGDi corresponds to the Cartesian granule fuzzy set description of xi. Then
evidential reasoning proceeds as described in Section 6.2. Decision making is as
presented in Section 6.3.
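
The inference just described can be outlined as follows; this is an illustrative sketch only, in which the match function is a simplified stand-in for the point semantic unification of Chapter 6 (an expectation of the class memberships under the normalised data description) and the rule filter is taken to be the identity.

from dataclasses import dataclass
from typing import Dict, List

CGFuzzySet = Dict[str, float]           # Cartesian granule -> membership

@dataclass
class EvidentialRule:
    class_label: str
    class_fuzzy_sets: List[CGFuzzySet]  # CGFSiCLASS, one per Cartesian granule feature
    weights: List[float]                # wi (assumed here to sum to 1)

def match(class_fs, data_fs):
    # Simplified stand-in for SU(CGFSiCLASS | CGDi): expected class membership
    # under the normalised data description (the real SU uses mass assignments).
    total = sum(data_fs.values())
    return sum(class_fs.get(g, 0.0) * m for g, m in data_fs.items()) / total

def support(rule, data_descriptions):
    # Additive (evidential) aggregation: weighted sum of per-feature matches.
    return sum(w * match(cfs, d) for w, cfs, d in
               zip(rule.weights, rule.class_fuzzy_sets, data_descriptions))

def classify(rules, data_descriptions):
    # Decision making: the class whose rule attracts the highest support.
    return max(rules, key=lambda r: support(r, data_descriptions)).class_label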

((Classification of Object is CLASS)                        Head/Consequent
  (Evlog filter
    (F1 of Object is CGFS1CLASS) w1
    ...
    (Fi of Object is CGFSiCLASS) wi                         Body/Antecedents and
    ...                                                     associated weights
    (Fm of Object is CGFSmCLASS) wm
  )) : ((1 1)(0 0))                                         Rule Supports

Figure 8-5: A canonical evidential logic rule.

8.5 CARTESIAN GRANULE FEATURES AND FUZZY LOGIC

The previous sections have shown how Cartesian granule features can be incorporated
into evidential logic and conjunctive rule structures and how probabilistic reasoning can
be carried out in this context. As an alternative, Cartesian granule features can also be
incorporated into fuzzy rules in the fuzzy logic sense as described in Chapter 4.
Consequently, as was the case in probabilistic reasoning, the first step in the inference
process consists of generating a Cartesian granule fuzzy set CGD; for each (CG) feature
Fi from the incoming data vector xi . Subsequently, reasoning is performed using the
compositional rule of inference (CRI) in conjunction with defuzzification procedures as
presented in Chapter 4. In this book, the presentation and results are limited to
probabilistic reasoning; however, the use of fuzzy logic in conjunction with Cartesian
granule features will form part of future work.

8.6 SUMMARY

Cartesian granule features have been introduced as a new form of knowledge
representation. A Cartesian granule feature exploits a divide-and-conquer strategy to
representation, capturing knowledge in terms of a network of low-order semantically
related features. This chapter has provided basic definitions and examples of Cartesian
granule features and related concepts. Aggregation within individual Cartesian granule
features has been described both from a fuzzy and probabilistic perspective. Finally, it
was illustrated how these features can be incorporated into both fuzzy logic and
probabilistic models for both classification and prediction problems. This results in a
slightly modified approximate reasoning process for both fuzzy logic and support logic
reasoning, which was also described.

Overall Cartesian granule features open up a new and exciting avenue in uncertainty
modelling which permits not only computing with words but also modelling with words.
The next chapter describes a constructive induction algorithm that facilitates the
extraction of Cartesian granule features models from example data automatically
(modelling with words) for both classification and prediction problems.

8.7 BIBLIOGRAPHY

Baldwin, J. F. (1991). "Combining evidences for evidential reasoning", International


Journal of Intelligent Systems, 6(6):569-616.
Baldwin, J. F. (1992). "Inference Under Incompleteness", Report No. ITRC 175,
Department of Engineering Maths, University of Bristol, UK.
Baldwin, J. F., Martin, T. P., and Pilsworth, B. W. (1995). FRIL - Fuzzy and Evidential
            Reasoning in A.I. Research Studies Press (Wiley Inc.), ISBN 0863801595.
Baldwin, J. F., Martin, T. P., and Shanahan, J. G. (1996). "Modelling with Words using
Cartesian Granule Features", Report No. ITRC 246, Dept. of Engineering
Maths, University of Bristol, UK.
Baldwin, J. F., Martin, T. P., and Shanahan, J. G. (1997). "Modelling with words using
Cartesian granule features." In the proceedings of FUZZ-IEEE, Barcelona,
Spain, 1295-1300.
Baldwin, J. F., Martin, T. P., and Shanahan, J. G. (1998). "Aggregation in Cartesian
granule feature models." In the proceedings of IPMU, Paris, 6.
Baldwin, J. F., and Pilsworth, B. W. (1997). "Genetic Programming for Knowledge
Extraction of Fuzzy Rules." In the proceedings of Fuzzy Logic: Applications
and Future Directions Workshop, London, UK, 238-251.
Duda, R., and Hart, P. (1973). Pattern classification and scene analysis. Wiley, New
York.
Ishibuchi, H., Nozaki, K., Yamamoto, N., and Tanaka, H. (1995). "Selecting fuzzy if-
then rules for classification problems using genetic algorithms", IEEE
Transactions on Fuzzy Systems, 3(3):260-270.
Klir, G. J., and Yuan, B. (1995). Fuzzy Sets and Fuzzy Logic, Theory and Applications.
Prentice Hall, New Jersey.

Pearl, J. (1986). "A constraint-propagation approach to probabilistic reasoning", In


Uncertainty in AI, J. F. Kanal and L. N. Lemmer, eds., Elsevier Science
Publishers, North-Holland, 357-370.
Quinlan, J. R. (1986). "Induction of Decision Trees", Machine Learning, 1(1):86-106.
Shanahan, J. G. (1998). "Cartesian Granule Features: Knowledge Discovery of
Additive Models for Classification and Prediction", PhD Thesis, Dept. of
Engineering Mathematics, University of Bristol, Bristol, UK.
Silverman, B. W. (1986). Density estimation for statistics and data analysis. Chapman
and Hall, New York.
Yager, R. R. (1993). "Families of OWA Operators", Fuzzy Sets and Systems,
59(2):125-148.
Zadeh, L. A. (1994). "Soft Computing and Fuzzy Logic", IEEE Software, 11(6):48-56.
Zadeh, L. A. (1996). "Fuzzy Logic = Computing with Words", IEEE Transactions on
Fuzzy Systems, 4(2): 103-111.
Zimmermann, H. J., and Zysno, P. (1980). "Latent connectives in human decision
making", Fuzzy Sets. and Systems, 4(1):37-51.
PART V
APPLICATIONS

Having introduced Cartesian granule feature models and related induction algorithms,
Part V shifts its attention to applications of Cartesian granule features within the more
general context of knowledge discovery. Chapter 10, for the purposes of illustration and
analysis, applies this approach to artificial problems in both classification and
prediction. Chapter 11 focuses on practical applications of knowledge discovery of
Cartesian granule feature models, in the real world domains of computer vision,
diabetes diagnosis and control, while also comparing this approach with other
techniques such as neural networks, decision trees, naive Bayes and various fuzzy
induction algorithms. Chapter 11 finishes by summarising knowledge discovery from a
Cartesian granule feature perspective and gives some views on what the future may
hold for knowledge discovery in general and for Cartesian granule features.
CHAPTER 9
LEARNING CARTESIAN GRANULE FEATURE MODELS

In the previous chapter, it was shown how Cartesian granule feature models exploit a
divide-and-conquer strategy to representation, capturing knowledge in terms of a
network of low-order semantically related features. Both classification and prediction
problems can be modelled quite naturally in terms of these models. This chapter
describes a constructive induction algorithm, G_DACG (Genetic Discovery of Additive
Cartesian Granule feature models), which facilitates the learning of such models from
example data [Shanahan 1998; Shanahan, Baldwin and Martin 1999]. This involves two
main steps: language identification (identification of the low-order semantically related
features in terms of Cartesian granule features); and parameter identification of class
fuzzy sets and rules. The G_DACG algorithm achieves this by embracing the
synergistic spirit of soft computing, using genetic programming to discover the
language (structure) of the model, fuzzy sets and evidential rules for knowledge
representation, while relying on the well-developed probability theory for learning the
parameters of the model.

This chapter begins by introducing the G_DACG constructive induction algorithm. The
algorithm is presented from both a classification (Section 9.1) and a prediction problem
(Section 9.1.4) perspective. Feature selection and discovery play important roles in the
induction of Cartesian granule feature models, and consequently, in Section 9.2 a
literature review of existing approaches is given. Section 9.3 describes, in detail, the
language identification (feature discovery) component of G_DACG. It is a population-
based search algorithm, centred on genetic programming [Koza 1992; Koza 1994],
where each node in the search space is a Cartesian granule feature that is characterised
by its constituent features and their abstractions (linguistic partitions). A couple of
novel fitness functions are presented in Section 9.3.2, including fitness based upon the
semantic separation of learnt concepts and parsimony promotion. Sections 9.1.2, 9.4,
and 9.5 present the main steps in parameter identification and optimisation -
identification of class fuzzy sets, evidential weights and rule filters respectively.
Section 9.6 proposes an alternative approach to parameter identification that exploits
neural network learning algorithms. For illustration purposes, in Section 9.7, G_DACG
is applied to a small artificial problem - the ellipse classification problem. The use of
different types of fitness function is examined in the context of this problem. Further
applications (real world) of G_DACG are provided in Chapters 10 and 11.

9.1 LEARNING USING THE G_DACG ALGORITHM

The induction of additive Cartesian granule feature models falls into the category of
supervised learning algorithms. Within this framework, problem domains are generally


represented as databases of examples organised in a spreadsheet format as presented in


Table 9-1, which consists of N examples, where each example corresponds to a row t
that is made up of both input feature values {Vt1, ..., Vtn} for corresponding problem
features {f1, ..., fn}, and an output feature value Ct that corresponds to a concept label in
the problem domain. Even though additive Cartesian granule feature models can model
both classification and prediction problems, the G_DACG induction algorithm is
introduced from a classification problem perspective for straightforwardness.
Subsequently however, the algorithm is presented from a prediction perspective. In the
case of Cartesian granule feature modelling, each feature value Vtr can correspond to
numeric value, a symbolic value, or to uncertain or vague information that can be
specified in terms of fuzzy subsets or interval values. Background knowledge about the
domain, other than examples, can also be accommodated within the Cartesian granule
feature framework but is not considered here. This will be a topic of future work (see
Chapter 11).

Table 9-1: Example database in spreadsheet format.

Example   f1    ...   fr    ...   fn    Class
1         V11   ...   V1r   ...   V1n   C1
...       ...   ...   ...   ...   ...   ...
t         Vt1   ...   Vtr   ...   Vtn   Ct
...       ...   ...   ...   ...   ...   ...
N         VN1   ...   VNr   ...   VNn   CN

The goal of supervised learning is to generate a model from the training examples, in
this case an additive Cartesian granule feature model, that covers (classifies correctly)
not only training examples, but also examples that have not been seen during training
i.e. that generalises well. Subsequent paragraphs describe the main steps in learning an
additive Cartesian granule feature model from example data using the G_DACG
constructive induction algorithm (Genetic Discovery of Additive Cartesian Granule
feature models). Since the induction of additive Cartesian granule feature models
involves the construction of new features, the G_DACG algorithm can be categorised
as a constructive induction algorithm [Dietterich and Michalski 1983].

G_DACG can be viewed abstractly in terms of the following two steps (see Figure 9-1
for a schematic overview of G_DACG from a knowledge discovery perspective):

• Language identification (step 2 in G_DACG): This step is concerned with


identifying the language that can be used to describe models in an effective,
tractable and transparent manner, that is, the identification of a network of
low-order semantically related features. The step can also be viewed as feature
selection and discovery. In other words, it identifies "useful" Cartesian granule
features, the language of the model. Feature abstraction (in terms of linguistic
partitions) and feature selection are considered simultaneously, thereby
avoiding local minima models that can result from treating these tasks
independently, as is the case in decision tree approaches [Quinlan 1986]. The
parameter identification phase of the induction algorithm (outlined next) is
used as an evaluation function for identifying the language of the model. As

language identification is done outside the main phase of the induction method
but uses the induction method as the evaluation function, the feature selection
and discovery component of G_DACG is classified as a wrapper approach
[Kohavi and John 1997].
• Parameter identification (steps 3 to 5 in G_DACG): Having identified the
language of the model, parameter identification then estimates the class fuzzy
sets and class aggregation rules. Setting up the class aggregation rules is
further divided into the tasks of estimating the weights associated with the
individual Cartesian granule features (sub-models) and with identifying the
rule filters.

[Figure 9-1 appears here: a schematic of the knowledge discovery process (data
selection, preprocessing, transformation) feeding into the G_DACG constructive
induction algorithm, which is decomposed into language identification and parameter
identification.]

Figure 9-1: The G_DACG perspective: knowledge discovery of additive Cartesian
granule feature models.

In traditional systems modelling, language identification is sometimes referred to as
structure identification [Ljung 1987]. Within the G_DACG algorithm, the structure
identification phase is referred to as language identification due to the linguistic nature
of Cartesian granule features.

((Classification of Object is CLASS)               Head/Consequent

  (Evlog filter
    (F1 of Object is CGFS_1CLASS) w1
    ...                                            Body/Antecedents and
    (Fi of Object is CGFS_iCLASS) wi               associated weights
    ...
    (Fm of Object is CGFS_mCLASS) wm)

): ((1 1)(0 0))                                    Rule supports

Figure 9-2: Evidential logic rule structure.



9.1.1 G_DACG Algorithm


The G_DACG algorithm consists of the following five steps. The details of most steps
are kept brief here in order to provide a succinct overview; however, forward references
to detailed presentations of each step are provided.

Step 1: Setup datasets. Split the database of examples into a training database
        D_train, a control database D_control and a testing database D_test.
Step 2: Language identification. Select which features f_j should be combined to
        form Cartesian granule features F_i. This step is taken care of by an
        automatic, near optimal, feature discovery algorithm that discovers which
        Cartesian granule features and their abstractions (i.e. the linguistic
        partition P_fj of each problem feature universe) are necessary to model a
        problem effectively. It outputs a set of Cartesian granule features {F_1, ...,
        F_i, ..., F_m}. These features are subsequently incorporated into evidential
        logic rules of the form depicted in Figure 9-2. This algorithm and related
        material are presented in Sections 9.2 and 9.3.
Step 3: Learn the class Cartesian granule fuzzy sets. This step extracts the fuzzy
        set values CGFS_iClass of each class-rule feature. For each class Class in
        {CLASS_1, ..., CLASS_c}, extract a fuzzy set CGFS_iClass defined over each
        Cartesian granule feature universe Ω_Fi using the procedure outlined in
        Section 9.1.2.
Step 4: Identify rule weights and filter. The Cartesian granule features {F_1, ..., F_i,
        ..., F_m} and corresponding fuzzy set values are incorporated into
        evidential logic rules of the form depicted in Figure 9-2. This step
        estimates the weights associated with each Cartesian granule feature F_i
        using semantic discrimination analysis (see Section 9.4) and sets each
        class filter to the identity filter. Using the estimated weights, generate the
        corresponding ACGF model, ACGF_SDA.
Step 5: Optimise rule weights and filter. This step is optional but can improve the
        performance of the learnt additive model in some cases. This step
        optimises the rule weights and filters using Powell's direction set
        optimisation algorithm (presented in detail in Section 9.5). The macro-
        level details of this step are as follows:
        • Take the model ACGF_SDA generated in step 4, and optimise the
          filters using Powell's direction set algorithm [Powell 1964].
          Regenerate the ACGF model with the newly optimised filters
          and SDA-based weights, yielding the model ACGF_OptFilters_SDA.
        • Using the model ACGF_OptFilters_SDA generated above, optimise the
          weights using Powell's direction set algorithm. Regenerate the
          ACGF model with the optimised filters and optimised weights,
          yielding ACGF_OptFilters_OptWeights.
        • Re-optimise the filters of the model ACGF_OptFilters_OptWeights using
          Powell's direction set algorithm. Regenerate the ACGF model
          with the re-optimised filters and optimised weights, yielding
          ACGF_OptFilters_OptWeights_2.
        • Calculate the accuracy for each of the generated models on the
          control dataset. Select the ACGF model with the highest
          accuracy on the control set as the learnt model.
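
Viewed at a high level, the five steps amount to a short driver routine. The Python sketch
below is purely illustrative: every helper it calls (split_data, discover_features, and so on)
is a hypothetical placeholder standing in for the corresponding step, not code from this book.

    def g_dacg(examples, split_data, discover_features, learn_class_fuzzy_sets,
               build_sda_model, powell_optimise, accuracy):
        """Skeleton of the five G_DACG steps; every argument is a placeholder callable."""
        train, control, test = split_data(examples)                     # Step 1
        features = discover_features(train, control)                    # Step 2: language identification
        fuzzy_sets = learn_class_fuzzy_sets(features, train)            # Step 3
        model_sda = build_sda_model(features, fuzzy_sets)               # Step 4: SDA weights, identity filters
        candidates = [model_sda] + powell_optimise(model_sda, control)  # Step 5 (optional)
        return max(candidates, key=lambda model: accuracy(model, control))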

9.1.2 Learning Cartesian granule feature fuzzy sets from data


The following steps outline how to extract a Cartesian granule fuzzy set from example
data for a class or concept defined over a feature universe Ω_Fi. This procedure
corresponds to step 3 in the G_DACG algorithm. In the next section, an illustrative
example of this fuzzy set learning algorithm is presented.

A Cartesian granule fuzzy set CGFS_iClass for the concept Class over feature universe Ω_Fi
is learned from example data tuples as follows:

Step F1. Initialise a frequency distribution DIST_iClass defined over all the
         Cartesian granules in the Cartesian granule feature universe Ω_Fi, that
         is, set the count of each Cartesian granule to zero.
Step F2. For each class training tuple T_tClass perform the following (Section 5.4
         presents the membership-to-probability bi-directional transformation
         that exists between fuzzy set theory and probability theory, which is
         used extensively below):
         • Construct the corresponding Cartesian granule fuzzy set (i.e.
           linguistic description of the data vector) CGFS_tClass
           corresponding to the training tuple T_tClass using the approach
           outlined in Section 8.1.
         • Subsequently, the fuzzy set CGFS_tClass is transformed into its
           corresponding least prejudiced distribution LPD_tClass.
         • Update the overall frequency distribution DIST_iClass with this
           least prejudiced distribution LPD_tClass.
Step F3. This frequency distribution DIST_iClass corresponds to the least
         prejudiced distribution LPD_iClass, which can then be transformed into
         the Cartesian granule fuzzy set CGFS_iClass (using the bi-directional
         transformation). In the absence of any other information, a uniform
         prior distribution over the Cartesian granules is assumed for this
         transformation.
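
To make steps F1-F3 concrete, the following Python sketch accumulates least prejudiced
distributions into a frequency distribution and renormalises it into a class fuzzy set. The
triangular membership routine and the simple normalisations used in place of the proper
bi-directional (mass assignment) transformation of Section 5.4 are crude stand-ins
introduced here purely for illustration.

    from collections import defaultdict

    def triangular_memberships(value, centres):
        """Stand-in for the linguistic description of Section 8.1: membership of
        `value` in triangular granules centred at `centres` (value assumed to lie
        within [centres[0], centres[-1]])."""
        memberships = {}
        for i, c in enumerate(centres):
            left = centres[i - 1] if i > 0 else c
            right = centres[i + 1] if i < len(centres) - 1 else c
            if left < value <= c and c != left:
                memberships[i] = (value - left) / (c - left)
            elif c <= value < right and c != right:
                memberships[i] = (right - value) / (right - c)
            elif value == c:
                memberships[i] = 1.0
        return memberships

    def fuzzy_set_to_lpd(fuzzy_set):
        """Crude stand-in for the membership-to-probability transformation:
        memberships are simply normalised to sum to one."""
        total = sum(fuzzy_set.values())
        return {granule: m / total for granule, m in fuzzy_set.items()}

    def learn_class_fuzzy_set(class_values, centres):
        """Steps F1-F3: accumulate the LPD of each class example into a frequency
        distribution and rescale so the most frequent granule has membership one."""
        dist = defaultdict(float)                          # Step F1
        for value in class_values:                         # Step F2
            lpd_t = fuzzy_set_to_lpd(triangular_memberships(value, centres))
            for granule, p in lpd_t.items():
                dist[granule] += p
        peak = max(dist.values())                          # Step F3 (uniform prior assumed)
        return {granule: count / peak for granule, count in dist.items()}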

9.1.3 Cartesian granule fuzzy set induction example


The following example illustrates how to form a one dimensional Cartesian granule
fuzzy set corresponding to the concept of car positions in images (i.e. step 3 in the
G_DACG algorithm). Firstly, the universe of the Position feature is linguistically
partitioned. One possible linguistic partition could be:

P_Position = {Left, Middle, Right}.

This linguistic partition is depicted in Figure 9-3. The main steps in extracting a
Cartesian granule fuzzy set for this simple example are graphically presented in Figure
9-4. The process begins by taking examples of car positions in images and generating
corresponding Cartesian granule fuzzy sets and least prejudiced distributions. The top
left table corresponds to examples of car positions, corresponding linguistic
descriptions (in this case, the Cartesian granule fuzzy sets are equivalent to the
linguistic descriptions due to the one-dimensional nature of the Cartesian granule feature) and least
prejudiced distributions. The top middle graph corresponds to the initial Cartesian
granule frequency distribution. The top right graph depicts the Cartesian granule
frequency distribution after updating with the LPD corresponding to the value of 40.
The right middle graph shows the Cartesian granule frequency distribution after
updating with the LPD corresponding to the value of 60. The bottom right graph
displays the Cartesian granule frequency distribution after counting all the LPDs
corresponding to the example car positions. Finally, the bottom left graph depicts the
corresponding Cartesian granule fuzzy set for car positions in images i.e. a linguistic
summary of car positions in images in terms of the words Left, Middle and Right. Here,
for presentation purposes, the Cartesian granule feature is one dimensional in nature,
however, multidimensional features can be accommodated in a similar fashion.

[Plot of the linguistic variable Position: overlapping fuzzy sets Left, Middle and Right
defined over the universe of Position, with the axis marked at 0, 40, 50 and 100.]

Figure 9-3: Fuzzy partition of universe Ω_Position.

9.1.4 G_DACG algorithm from a prediction perspective


In the case of prediction problems the output (dependent) variable is reinterpreted as a
linguistic variable. The universe of the output variable is partitioned into fuzzy classes,
where the class values CLASS are semantically represented using fuzzy sets. These
fuzzy sets form a fuzzy partition of the universe, and are used to linguistically describe
the data values (in the training dataset) i.e. (Cartesian) granule fuzzy set descriptions.
Each output data value generates a linguistic summary that, in general, includes two or
more linguistic terms (because of the graded transition between concepts modelled by
overlapping fuzzy sets). Consequently, when constructing Cartesian granule fuzzy sets
from example data corresponding to each CLASS (fuzzy subset), the probabilities
associated with each training example need to be distributed in proportion to the
probabilities in the least prejudiced distribution corresponding to the linguistic
description of the output value. In other words, the probabilities associated with a
training example need to be distributed in proportion to Pr(T_tClass | OutputValue). This
results in the following small change to Step F2 in the Cartesian granule fuzzy set
learning algorithm (Section 9.1.2) in order to accommodate prediction problems:

Step F2: For each training tuple that satisfies Pr(T_tClass | OutputValue) > 0, perform
         the following:
         • Construct the corresponding Cartesian granule fuzzy set (i.e.
           linguistic description of the data vector) CGFS_tClass
           corresponding to the training tuple T_tClass.
         • Transform the fuzzy set CGFS_tClass into its corresponding least
           prejudiced distribution LPD_tClass.
         • Update the overall frequency distribution DIST_iClass with this
           least prejudiced distribution LPD_tClass in proportion with
           Pr(T_tClass | OutputValue).
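
Reusing the hypothetical helpers from the sketch in Section 9.1.2, the prediction-oriented
variant of Step F2 only changes the update: each tuple's LPD is added in proportion to the
probability of the tuple belonging to the output class. The tuple layout and the class_prob
callable below are assumptions of this sketch.

    from collections import defaultdict

    def learn_class_fuzzy_set_for_prediction(tuples, centres, class_prob):
        """Weighted variant of steps F1-F3 for prediction problems; class_prob(t) is
        assumed to return Pr(T_tClass | OutputValue) for tuple t on this output class."""
        dist = defaultdict(float)
        for t in tuples:
            weight = class_prob(t)
            if weight <= 0.0:                  # skip tuples outside this fuzzy class
                continue
            lpd_t = fuzzy_set_to_lpd(triangular_memberships(t["input"], centres))
            for granule, p in lpd_t.items():
                dist[granule] += weight * p    # update in proportion to Pr(T | OutputValue)
        peak = max(dist.values())
        return {granule: count / peak for granule, count in dist.items()}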

[Worked example: a table of example car positions with their linguistic descriptions
and least prejudiced distributions, a sequence of bar charts showing the Cartesian
granule frequency distribution after each LPD update, and the resulting Cartesian
granule fuzzy set CarPosition = (Left/0.3 + Middle/1 + Right/0.25).]

Figure 9-4: Induction of the Cartesian granule fuzzy set, {Left/0.3 + Middle/1 +
Right/0.25}, corresponding to car positions in images (lower left graph) from example
car positions (top left table).

9.2 FEATURE DISCOVERY

The feature discovery (language identification) component of G_DACG is a
population-based search algorithm, based on genetic programming [Koza 1992; Koza
1994], where each node in the search space is a Cartesian granule feature that is
characterised by its constituent features and their abstractions (linguistic partitions).
Before describing in detail feature discovery using G_DACG, a brief review of other
feature discovery and selection approaches in the literature is given. This section on
feature discovery and selection approaches and constituent subsections can be omitted
on a first reading without loss of continuity.

9.2.1 Feature selection and discovery


Feature selection can be viewed as the process of selecting those features that should be
used in the subsequent steps of an induction or modelling process. Feature discovery
can be viewed as a process of synthesising features from the base features and also
involves feature selection. The synthesised features (and possibly the original feature
set) can then be used by any induction process for the extraction of concept
descriptions. Synthesised features tend to lead to more succinct and more
discriminating concept descriptions. Numerous ways of synthesising new features have
been proposed in the literature including the work of Baldwin and his colleagues
[Baldwin, Martin and Pilsworth 1995; Baldwin and Pilsworth 1997], where a genetic
programming approach to the synthesis of compound features as algebraic expressions
of base features was proposed and illustrated on some artificial problems. These
synthesised features are subsequently used in fuzzy modelling. Several examples are
presented in the following references [Koza 1992; Koza 1994; Tackett 1995], which
have incorporated feature synthesis indirectly into model construction through genetic
programming. Feature synthesis and selection also forms an important part of neural
network construction, where the hidden nodes may be viewed as higher order features
that are discovered by the learning algorithm. Principal component analysis [Jolliffe
1986] offers an alternative route in constructing higher-order features from weighted
combinations of base features based on variance measures. In this work, Cartesian
granule features are constructed based on the cross product of feature linguistic
partitions (feature abstractions). In this work and in general, one of the most critical
steps in feature synthesis is the feature selection process.

One can view the task of feature discovery as a search problem; for example, the
discovery of Cartesian granule features can be viewed as a search problem, with each
state in the search space specifying a possible Cartesian granule feature. This task can
be viewed as both a feature selection and construction process. There has been
substantial work on feature discovery and selection in various fields such as pattern
recognition, statistics, information theory, machine learning theory and computational
learning theory. Numerous feature selection algorithms exist. Kohavi and John [Blum
and Langley 1997; Kohavi and John 1997] characterise the various approaches as
follows: those that embed the selection within the basic induction algorithm; those that
use feature selection to filter features passed to induction; and those that treat feature
selection as a wrapper around the induction process. Since feature selection plays a
critical role in feature discovery, the various approaches to feature selection are
examined using these categories.

9.2.1.1 Embedded approaches to feature selection


Embedded feature selection involves selecting features within the induction algorithm,
where the general idea is to add or remove features from a concept description in
response to an evaluation function (also known as the cost or fitness function) e.g.
prediction errors on unseen data. The various techniques differ mainly in the search
strategies and heuristics used to guide the search. Because the search space can be
exponentially large, managing the problem requires strong heuristics. For example,
logical description induction techniques such as ID3, C4.5 and CART carry out a hill-
climbing search strategy, guided by information-gain heuristics, to search the space of
programs (i.e. to discover good feature conjunctions), working from general to specific. The
ASMOD neuro-fuzzy algorithm and its various extensions [Bossley 1997; Kalvi 1993]
are examples of an embedded feature selection strategy where the model is iteratively
refined by modifying, adding or removing features.

These embedded techniques, due to the search mechanisms employed, are very
vulnerable to starting points, and local minima [Blum and Langley 1997; Bossley 1997;
Kalvi 1993; Kohavi and John 1997]. These search techniques work well in domains
where there is little interaction amongst the relevant features. However, the presence of
attribute interactions can cause significant problems for these techniques. Parity
concepts constitute the most extreme example of this situation, but it also arises in other
target concepts. Embedded selection methods that rely on greedy search cannot
distinguish between relevant and irrelevant features early in the search. Although
combining forward selection and backward elimination to concept construction may
help to overcome this problem. A better alternative may be to rely on a more random
search such as simulated annealing, or a more random and diverse search technique
such as genetic algorithms or genetic programming.

9.2.1.2 Filter approaches to feature selection


A second general approach to feature selection introduces a separate process for this
purpose that occurs before the basic induction step. For this reason Kohavi and John
[Kohavi and John 1997] have termed them filter methods; they filter out irrelevant
features before induction occurs. The pre-processing step generally relies on general
characteristics of the training set to select some features and exclude others. Thus,
filtering methods are independent of the induction algorithm that will use their output
and they can be combined with any such method. RELIEF [Kira and Rendell 1992] and
FOCUS [Almuallim and Dietterich 1991] and their extensions are amongst the more
commonly used approaches to feature selection and have been shown to contribute
significant improvements in a variety of induction approaches such as decision trees,
nearest neighbours and naive Bayesian classifiers [Blum and Langley 1997]. RELIEF
samples training instances randomly, summing a measure of the relevance of a
particular attribute across each of the training instances. The relevance measure used is
based upon the difference between the selected instance and k nearest instances of the
same class and k nearest instances in the other classes ("near-hit" and "near-miss")
[Kononenko and Hong 1997]. REIGN [Bastian 1995] relies on the use of a feed-
forward neural network (trained using the back-propagation learning algorithm) combined with a
hill-climbing search strategy to determine the feature set that should subsequently be
used by a fuzzy induction algorithm. Principal component analysis [Jolliffe 1986] is a
form of filter that constructs higher-order features, orders them and selects the best such
features. These features are then passed on to the induction algorithm. Filter
approaches, while interesting and useful, totally ignore the demands and capabilities of
the induction algorithm and thus can introduce an entirely different inductive bias to
that of the induction algorithm [Kohavi and John 1997]. This leads to the argument that
the induction method planned for use with the selected features should provide a better
estimate of accuracy than a separate measure that has an entirely different inductive
bias; this leads to the wrapper technique for feature selection.

9.2.1.3 Wrapper approaches to feature selection


A third generic approach performs feature selection outside the induction method but
uses the induction method as the evaluation function. For this reason Kohavi and John
refer to these as wrapper approaches [Kohavi and John 1997]. The typical wrapper
approach conducts a search in the space of possible parameters. Each state in the
parameter space corresponds to a feature subset and various other information
depending on the type of knowledge representation and induction algorithm used. For
example, in the case of Cartesian granule features, the abstraction of a feature universe
is also a parameter of the search space. Each state is evaluated by running the induction
algorithm on the training data and using the estimated accuracy of the resulting model
as a metric (other measures can also be used). Typical search techniques use a stepwise
approach of adding or deleting features to previous states beginning with a state where
all features or no features are present. The wrapper scheme has a long history within the
statistics and pattern recognition communities [Devijver and Kittler 1982; Ivanhnenko
1971]. The major disadvantage of wrapper methods over filter schemes is the former's
computational cost, which results from calling the induction algorithm for each
parameter set evaluated. The approach is also susceptible to local minima when used in
conjunction with stepwise search strategies. The language identification component of
the proposed G_DACG algorithm, which is presented subsequently in Section 9.3, can
be viewed as a wrapper approach to feature selection as it uses the rule and fuzzy set
induction algorithm as part of the evaluation function. This language identification
component avoids some of the problems of other wrapper techniques, such as local
minima, by using a pseudo-random population-based search (based on genetic
programming). Furthermore, the computational cost is reduced drastically by the use of
a cheap evaluation function based on parsimony and discrimination power, in contrast
to evaluating the results of induction on a test dataset.

9.3 FEATURE DISCOVERY IN THE G_DACG ALGORITHM

Due to the constructive nature of Cartesian granule features, the discovery of good,
highly discriminating, and parsimonious Cartesian granule features (i.e. the feature
subsets and the feature universe abstractions) is an exponential search problem that
forms one of the most critical and challenging tasks in model identification. An additive
model composed of Cartesian granule features that are too simple or too inflexible to
represent the data will have a large bias, while one which has too much flexibility (i.e.
redundant structure) may fit idiosyncrasies found in the training set, producing models
that generalise poorly; in this case the model's variance is too high. This is an example
of the classical bias/variance dilemma presented in [Geman, Bienenstock and Doursat
1992]. Bias and variance are complementary quantities, and the best generalisation is
obtained when the model provides the best compromise between the conflicting
requirements of small bias and small variance.

Bias: This represents how the average model (often referred to as the best model)
differs from the true system f(x). If the extracted model converges to the true system,
the model is said to be unbiased i.e. well matched to the system. This type of bias
differs from the inductive bias presented in Section 7.8.5, which refers to the bias used
in the discovery or search for a model and not bias associated with the discovered
model, as is the case here.

Variance: This represents how sensitive the model is to different datasets by measuring
the expected error between the average model and a model identified on a single
dataset.

In order to find the optimum balance between bias and variance, a means of controlling
the effective complexity of the model is required. This trade-off is incorporated directly
into the G_DACG (Genetic Discovery of Additive Cartesian Granule feature models)
discovery algorithm at two levels: one in terms of a fitness function for the individual
Cartesian granule features, and the other at the aggregate model level, where Cartesian
granule features of low significance (as indicated by their weights) are eliminated. In the case
of additive Cartesian granule features models, both the bias and variance can be drawn
towards their minimum, by adding, removing, or altering (granularities, granule
characterisations) the constituent Cartesian granule features, thereby generating models
which tend to generalise better and have a simpler model structure; i.e. Occam's razor,
where all things being equal, the simplest is most likely to be the best.

The search algorithm plays a big part in the discovery of good Cartesian granule
features. It can influence what parts of the space are or are not evaluated and can be
vulnerable to local minima, starting states and computational constraints. Each state in
the parameter space corresponds to a feature subset and the granularity of the individual
base features, that is, the feature selection and feature abstraction steps are combined.
The size of the finite space of all possible Cartesian granule features for any problem
given a finite number of base features is given by the following equation [Baldwin,
Martin and Shanahan 1998]:

        MaxGran    MaxDim
           Σ          Σ      C(NumOfFeat, dim) * gran^dim
      gran=MinGran   dim=1

where MaxDim is the maximum allowed dimensionality of Cartesian granule features,
MinGran and MaxGran denote the granularity range for Cartesian granule features, and
C(NumOfFeat, dim) denotes the number of ways of choosing dim base features from the
NumOfFeat available. Note that in this case, the granule characterisations are assumed to be
fixed (for example, triangular fuzzy sets); otherwise, the complexity could potentially increase by
another order of magnitude. For a sample problem, like the Pima Indian diabetes
problem presented later in Chapter 11, the number of possible Cartesian granule
features runs into millions if the eight base (domain) features are considered with base
feature granularity ranges of [2, 15]. In general for most problems, the search space will
be of the order of millions, increasing exponentially with the permitted dimensionality
of the Cartesian granule features.
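
The size of this search space follows directly from the equation above, as the short
computation below illustrates; the choice of maximum dimensionality in the example is an
assumption made here for the sake of the illustration.

    from math import comb

    def cgf_search_space_size(num_feat, min_gran, max_gran, max_dim):
        """Number of candidate Cartesian granule features according to the equation above."""
        return sum(comb(num_feat, dim) * gran ** dim
                   for gran in range(min_gran, max_gran + 1)
                   for dim in range(1, max_dim + 1))

    # Pima Indians diabetes setting: 8 base features, granularities in [2, 15].
    for max_dim in (2, 3, 4):
        print(max_dim, cgf_search_space_size(8, 2, 15, max_dim))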

The proposed language identification component of the G_DACG algorithm centres
around a pseudo-random distributed search paradigm based upon natural selection and
population genetics; genetic programming (GP) [Koza 1992]. The genetic search
paradigm, due to its distributed nature, avoids pitfalls such as local minima by
exploring large areas of the search space in parallel. Before describing the feature
discovery component of the G_DACG algorithm, the chromosome structure, the fitness
function and the modified GP operators used are described.

9.3.1 Chromosome structure


There are infinite ways of forming the membership value (i.e. aggregating the
individual granule memberships) associated with a Cartesian granule in a Cartesian
granule fuzzy set (see Section 3.5 for an overview of aggregation operators in fuzzy set
theory). This would correspond to an infinite function set in genetic programming
terms. To date, two operators, namely, the product and min operators, have been used.
However, empirical evidence on various problem domains seems to suggest that there
is very little difference between the effectiveness of these two operators [Baldwin,
Martin and Shanahan 1997b; Shanahan 1998]. As a result, the function set has been
reduced to the product operator CGProduct. At a later date, it is hoped to allow a richer
function set and genetically select appropriate conjunction operators. The arity of the
CGProduct function can vary from one to the number of available base features, though
parsimonious (low dimensional) Cartesian granule features are encouraged. This
desire/behaviour is encoded in the fitness function. The terminal set consists of all the
base features that are used in systems modelling along with their respective granularity
range (abstraction). For example, if a problem consists of two base features f1 and f2
and a granularity range of [2..4] is permitted for each base feature, then this leads to a
terminal set made up of the following granule features:

     {f1_G2, f1_G3, f1_G4, f2_G2, f2_G3, f2_G4}

where fi_Gj corresponds to base feature i with a granularity of j.


Since the current function set consists of one member, CGProduct, the complexity of
the chromosome structure can be reduced from a tree structure to a variable length list
structure. The granularity range for the base feature universes is very much feature and
problem dependent, although a range of [2 ..15] is thought to be sufficient for most
problem domains (see Chapters 10 and 11 for examples). The distribution of fuzzy sets
across each of the feature universes is set, by default, to be uniform, in order to reduce
the search complexity. However, this could be automatically determined by extending
the current search space.
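
One concrete (and purely illustrative) way to represent such a chromosome in Python is as a
variable-length list of (base feature index, granularity) pairs with no repeated base features;
the random-initialisation policy below is an assumption, not the book's implementation.

    import random

    def random_chromosome(num_base_features, min_gran=2, max_gran=15, max_dim=4):
        """Grow a random legal Cartesian granule feature: a list of
        (base_feature_index, granularity) genes with no duplicate base features,
        implicitly combined by the single function-set member CGProduct."""
        dim = random.randint(1, max_dim)
        features = random.sample(range(num_base_features), dim)   # no duplicates
        return [(f, random.randint(min_gran, max_gran)) for f in features]

    # Example: a randomly grown Cartesian granule feature over 8 base features.
    print(random_chromosome(num_base_features=8))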

9.3.2 Fitness
The most important and difficult step of genetic programming is the determination of
the fitness function. The fitness function dictates how well a discovered program is able
to solve the problem. The output of the fitness function is used as the basis for selecting
which individuals get to procreate and contribute their genetic material to the next
generation. The structure of the fitness function will vary greatly from problem to
problem. In the case of Cartesian granule feature identification, the fitness function
needs to find Cartesian granule features that provide good class separation (class
corresponds to specific areas of the output variable universe) and that are parsimonious.
Parsimonious features, while providing better transparency, also avoid over-fitting the
data. As a result, when used in fuzzy modelling, these features should yield high
classification accuracy with low computational overhead along with transparent
reasoning. Two functions are proposed for evaluating the fitness of Cartesian granule
feature individuals: fitness based on the semantic separation of concepts and parsimony
promotion; and fitness based on the accuracy of the resulting model on a control dataset
and parsimony promotion.

9.3.2.1 Fitness based on semantic separation and parsimony


In this case the fitness for an individual Cartesian granule feature is a weighted
combination of the discrimination (separation) of the individual and the parsimony of
the individual. To calculate the semantic discrimination of a Cartesian granule feature,
the Cartesian granule fuzzy sets corresponding to each class in the output universe need
to be constructed. Subsequently, the process of semantic discrimination analysis
[Baldwin 1993; Baldwin, Martin and Pilsworth 1995] determines the mutual
dissimilarity of individuals, measured in terms of the point semantic unifications
between each Cartesian granule fuzzy set CGFS_k and the other class fuzzy sets CGFS_j.
This is formally defined as follows:

    Discrimination  =   Min   [ 1  -     Max       Pr(CGFS_k | CGFS_j) ]       (9-1)
                       k=1..c          j=1..c, j≠k

where c corresponds to the number of classes in the current system and Pr(·|·) denotes
point semantic unification.

Parsimony is measured in terms of the dimensionality of the individual and the size
(cardinality) of the individual's universe of discourse. The dimensionality factor
corresponds to the number of base features making up a Cartesian granule feature. The
cardinality of a Cartesian granule feature universe is simply the number of Cartesian
granules in the corresponding universe. During the process of evolution it is important
to promote individuals that have high discrimination, low dimensionality and small
universe size. The latter two of these desires are expressed linguistically using the fuzzy
sets depicted in Figure 9-5. The individual measures are combined in the following
manner:

Fitness_i = W_Dis * Discrimination_i +                                         (9-2)
            W_Dim * μ_SmallDimensionality(Dimensionality_i) +
            W_USize * μ_SmallUniverse(UniverseSize_i)

where W_Dis, W_Dim and W_USize take values in the range [0, 1] and all weights must sum to
one. Since Cartesian granule features of high discrimination are desirable regardless of
other criteria, W_Dis tends to take values in the range [0.7, 0.9]. The remaining weight is
split evenly amongst W_Dim and W_USize. The weights and parsimony-promoting fuzzy sets
(depicted in Figure 9-5) are determined heuristically from trial runs.

[Two plots: (a) the fuzzy set for small dimensionality, and (b) the fuzzy set for small
universe size, defined over UniverseSize with the axis marked at 0, 3000 and 6000.]

Figure 9-5: Fuzzy sets corresponding (a) to small dimensionality and (b) to the small
size of feature universes.

9.3.2.2 Fitness based on accuracy and parsimony


An alternative approach to calculating the fitness of a Cartesian granule feature is based
upon a weighted combination of the accuracy of the resulting model on a control
dataset and the parsimony of the individual. In order to calculate the accuracy of the
individual it is necessary to construct the corresponding Cartesian granule feature
model. Subsequently the accuracy of the resulting model is calculated. This
corresponds to the proportion of correctly classified tuples in the control dataset for
classification problems, and to the model RMS error on the control dataset in the case
of prediction problems. The parsimony of the individual is characterised as above in
terms of the dimensionality of the individual and the cardinality of the individual's
universe of discourse. Formally, the fitness based on accuracy is calculated as follows:

Fitness_i = W_Acc * Accuracy_i +                                               (9-3)
            W_Dim * μ_SmallDimensionality(Dimensionality_i) +
            W_USize * μ_SmallUniverse(UniverseSize_i)

where W_Acc, W_Dim and W_USize take values in the range [0, 1] and all weights must sum to
one. Since Cartesian granule features of high accuracy are desirable regardless of other
criteria, W_Acc tends to take values in the range [0.7, 0.9]. The remaining weight is split
evenly amongst W_Dim and W_USize. The weights and parsimony-promoting fuzzy sets
(depicted in Figure 9-5) are determined heuristically from trial runs.
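
Both fitness variants share the same weighted-sum structure, so a single sketch covers
Equations 9-2 and 9-3. The triangular stand-ins for the parsimony fuzzy sets of Figure 9-5
and the default weight values are assumptions of this sketch, not values taken from the book.

    def mu_small_dimensionality(dim, max_dim=6.0):
        """Hypothetical stand-in for the 'small dimensionality' fuzzy set."""
        return max(0.0, 1.0 - (dim - 1) / (max_dim - 1))

    def mu_small_universe(size, max_size=6000.0):
        """Hypothetical stand-in for the 'small universe size' fuzzy set of Figure 9-5."""
        return max(0.0, 1.0 - size / max_size)

    def cgf_fitness(quality, dimensionality, universe_size,
                    w_quality=0.8, w_dim=0.1, w_usize=0.1):
        """Equations 9-2 / 9-3: quality is either the semantic discrimination of the
        feature (9-2) or the accuracy of the resulting model on a control dataset (9-3)."""
        assert abs(w_quality + w_dim + w_usize - 1.0) < 1e-9   # weights sum to one
        return (w_quality * quality
                + w_dim * mu_small_dimensionality(dimensionality)
                + w_usize * mu_small_universe(universe_size))

    # Example: a 2-dimensional feature with a 49-granule universe and discrimination 0.85.
    print(cgf_fitness(quality=0.85, dimensionality=2, universe_size=49))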

9.3.3 Modified crossover and mutation


Duplicate base features are not permitted in Cartesian granule features. Consequently,
the GP operators of crossover and mutation need to be modified so that legal Cartesian
granule features are evolved. In the case of mutation, a mutation point is randomly
selected within the chromosome and replaced with a randomly grown (sub) Cartesian
granule feature such that no duplicate features arise in the unmodified part of the
chromosome and in the randomly generated sub-chromosome. The crossover operator
is simplified from sub-tree crossovers to list crossovers. Crossover points are randomly
selected within both chromosomes and two offspring are constructed in the normal GP
fashion. Both offspring are inserted into the population if they correspond to legal
Cartesian granule features (i.e. contain no duplicate base features). However, offspring
that result in illegal Cartesian granule features are dropped. Other crossover operators
could also be considered, such as limiting the choice of crossover point in the second
chromosome to points that result in legal chromosomes.
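
A minimal sketch of the legality-preserving list crossover and mutation described above,
reusing the (feature, granularity) gene representation assumed in the earlier chromosome
sketch; a real implementation would also bound dimensionality and handle degenerate
offspring.

    import random

    def is_legal(chromosome):
        """A chromosome is legal if it is non-empty and no base feature appears twice."""
        features = [f for f, _ in chromosome]
        return len(features) > 0 and len(features) == len(set(features))

    def list_crossover(parent_a, parent_b):
        """Single-point list crossover; offspring containing duplicate base features are dropped."""
        cut_a = random.randint(1, len(parent_a))
        cut_b = random.randint(1, len(parent_b))
        child_1 = parent_a[:cut_a] + parent_b[cut_b:]
        child_2 = parent_b[:cut_b] + parent_a[cut_a:]
        return [child for child in (child_1, child_2) if is_legal(child)]

    def mutate(chromosome, num_base_features, min_gran=2, max_gran=15):
        """Replace a randomly chosen gene with a randomly grown one, avoiding duplicates."""
        point = random.randrange(len(chromosome))
        rest = chromosome[:point] + chromosome[point + 1:]
        unused = [f for f in range(num_base_features)
                  if f not in {gene[0] for gene in rest}]
        new_gene = (random.choice(unused), random.randint(min_gran, max_gran))
        return rest[:point] + [new_gene] + rest[point:]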

9.3.4 Reproduction
The reproduction operator is asexual in that it operates on only one individual in the
current population and produces only one individual/offspring in the next generation.
The reproduction operator consists of two steps. First an individual is selected from the
current population according to some selection mechanism based on fitness.
Subsequently, the selected individual is copied, without alteration, from the current
population into the new population. There are many different selection methods based
on fitness. Three of the commonly used methods are fitness-proportionate selection
[Holland 1975; Koza 1992], k-tournament selection and rank selection [Goldberg and
Deb 1991]. To date in this work, both the fitness-proportionate, and k-tournament
selection mechanisms have been used, which are both described here. The fitness-
proportionate approach uses a selection probability based on the fitness of the individual. If f(s_i(t))
is the fitness of an individual s_i in the population at time t, then, under fitness-
proportionate selection, the probability that the individual s_i will be selected (and thus
copied into the next generation) is

                          f(s_i(t))
                      ----------------
                        M
                        Σ  f(s_j(t))
                       j=1

where the denominator corresponds to the sum of all the individual member
fitnesses in the current population. K-tournament selection involves selecting k individuals from
the current population on a fitness proportionate basis. The individual with the highest
fitness within the k selected individuals is selected and copied into the next generation
in this case.
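
The two selection schemes used to date could be sketched as follows; the population is
assumed (for this sketch only) to be a list of (chromosome, fitness) pairs, and
random.choices performs the roulette-wheel draw.

    import random

    def fitness_proportionate_select(population):
        """Roulette-wheel selection: Pr(select s_i) = f(s_i) / sum_j f(s_j)."""
        chromosomes, fitnesses = zip(*population)
        return random.choices(chromosomes, weights=fitnesses, k=1)[0]

    def k_tournament_select(population, k=3):
        """Draw k individuals fitness-proportionately and keep the fittest of them."""
        weights = [fitness for _, fitness in population]
        contestants = [random.choices(population, weights=weights, k=1)[0]
                       for _ in range(k)]
        return max(contestants, key=lambda pair: pair[1])[0]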

9.3.5 Feature discovery algorithm in G_DACG


The feature discovery component of G_DACG is concerned with identifying the
language of the model in terms of Cartesian granule features. It is a population-based
search algorithm, based on genetic programming [Koza 1992; Koza 1994], where each
node in the search space is a Cartesian granule feature that is characterised by its
constituent features and their abstractions (linguistic partitions). The steady state
flavour of genetic programming (SSGP) [Koza 1992; Syswerda 1989] is used in this
process. SSGP permits overlapping generations and, when used in conjunction with k-
tournament selection, avoids the problem of losing good individuals. A version of SSGP
is used where duplicate children are discarded rather than inserted into the population
[Syswerda 1989]. This helps promote diversity and avoids premature convergence in
the population. Furthermore, since the individuals will solve problems collectively
(rather than individually), in the case of additive Cartesian granule feature modelling,
this flavour of genetic programming is deemed to be appropriate. The key steps
involved in the feature discovery component of the G_DACG algorithm are as follows:

Language identification algorithm


(i) Generate a random population of individual Cartesian granule features.
(ii) Assign a fitness value to each individual.
(iii) REPEAT
     • Generate n new fitness-evaluated children
     • Insert new children into population
     • Eliminate n individuals from the population
     • Select best m non-overlapping features from the population
     • Construct and evaluate best-of-generation additive model
       consisting of these m features using steps 3-5 in G_DACG (see
       Section 9.1)
     UNTIL a satisfactory solution is found or the number of generations expires.
(iv) Select best m non-overlapping features from the features visited during
     the search.
(v) Construct and evaluate overall-best additive model consisting of these m
    features using steps 3-5 in G_DACG (see Section 9.1).
(vi) Select the model from the best-of-generation models and the overall-best
     model that has the highest performance (i.e. accuracy) on the control
     dataset.
(vii) The Cartesian granule features {F_1, ..., F_i, ..., F_m} of this model will be
      the language of the model to be learned. Proceed with the remaining steps
      in the G_DACG algorithm (i.e. steps 3-5), which correspond to
      parameter identification of the model. This learnt model will correspond
      to the output of the G_DACG algorithm.

The value n, referred to as the generation gap, is normally chosen to be equal to a
percentage of the population size, typically set to 50% for most problems addressed in
this book. The value of m, the number of Cartesian granule features selected to create
an additive model, is problem dependent. m Cartesian granule features are selected
such that the sets of base features that constitute each feature are not equivalent. That is,
Cartesian granule features with overlapping base feature sets are allowed but if they
contain exactly the same base features and cardinality, then the Cartesian granule
feature with the higher fitness is selected.

Having determined the language of the model (step (vii) of the language identification
algorithm above), the parameters of the model are determined using steps 3-5 in the
G_DACG algorithm. Step 5 of G_DACG can further simplify the language of the
model by removing superfluous Cartesian granule features (through weights learning),
thereby decreasing the additive model's bias and variance. In addition, features that
contribute little (i.e. those associated with low weights) can be removed using backward
elimination [Devijver and Kittler 1982], whereby the worst feature in a rule (in terms
of lowest rule weight) is removed. This process is repeated for each rule until the
elimination of a feature results in a model with a severely degraded performance.
Having identified the language and parameters of a model, an additional step, that of
determining the optimal granule characterisation, can further boost the performance
of a model. Section 9.3.6 looks at different ways of generating linguistic partitions. In
addition, Chapter 10 analyses the effect of granule characterisation in the context of
artificial problems. The overall G_DACG algorithm is depicted in block format in
Figure 9-6.

[Block diagram: the population is generated randomly and the fitness of each
individual is evaluated as part of the constructive induction loop.]

Figure 9-6: G_DACG constructive induction algorithm.

9.3.5.1 Language identification for prediction problems


In the case of prediction problems, an extra step in language identification is required;
that of determining the language of the output variable i.e. the linguistic partition of the
output variable. This can be discovered automatically or provided by the expert in the
domain. In the case of the former, the linguistic partition of the output variable's
universe is determined in an iterative manner beginning with a conservative number of
words and iteratively increasing until no improvement in generalisation is achieved.
There are a variety of ways of deciding on characterisations of the granules (as is the
case for input granules); these are examined next.

9.3.6 Generating linguistic partitions


Linguistic partitions can be generated automatically over base variable universes or can
be provided manually to the system by an expert. Some of the automatic approaches to
partition generation include heuristic based approaches or data driven approaches. One
of the simplest heuristic approaches is to uniformly populate each universe Ω_i with
fuzzy sets of a predetermined shape. Alternatively, data centred approaches could be
used to automatically determine clusters in the data and subsequently use these to
generate partitions. These approaches include percentile techniques and unsupervised
clustering techniques such as Kohonen Networks [Kohonen 1984] and Fuzzy C-Means
[Bezdek 1976; Bezdek 1981].

Any of these clustering techniques will take a training dataset as input and search the
data for structure. The discovered structure is expressed in terms of a list of cluster
centres represented as vectors. These cluster centres can be viewed as corresponding to
concepts in the data. The number of clusters, in this case, is provided by the feature
discovery component of G_DACG, however, this could instead be set by the user or
could autonomously be resolved by the clustering algorithm itself. These centres can
then be used to generate partitions on the variable universes. Figure 9-7 illustrates an
example of how the cluster centres generated by any of the above algorithms could be
utilised in generating a mutually exclusive triangular partition. In this case, cluster
centres x1 and x2 defined over the variable universes Ω_Position and Ω_Size are used to
partition each of the universes. Each cluster centre xi corresponds to a vector, where
each vector element xij is a cluster centre on the corresponding universe Ω_j. Given the
cluster centres for a particular universe, the methods described in Section 4.1.1.4 can be
used to generate the corresponding partitions and thus the associated linguistic variable.
In the case of discrete universes, values can be grouped together into subsets and
labelled as previously described or each discrete value could be used directly to form a
partition.

The formation of partitions has been the subject of many research areas. Some of the
more interesting approaches to generating partitions include ID3 [Quinlan 1986] and its
many fuzzy versions [Jang 1994], which rely on entropy measures to partition the
feature universes. In [Bouchon-Meunier, Marsala and Ramdani 1997] an interesting
partitioning approach is proposed where the morphological operations (from image
processing) of dilation and erosion are employed to grow and shrink regions of input
feature universes that correspond to the same class. These partitioning approaches
would not, however, prove useful in the generation of partitions for Cartesian granule
features, where the words that define the partition structure are used as a means of
describing a class as opposed to being used as part of a decision making rule.

Figure 9-7: Partition generation using centres produced by a clustering algorithm.

9.4 PARAMETER IDENTIFICATION IN G_DACG

Parameter identification is concerned primarily with determining the class Cartesian
granule fuzzy sets, as described in Section 9.1.2, and also with setting up the class
aggregation rules, that is, estimating the weights associated with the individual
Cartesian granule feature and the class rule filters. For now, the class rule filters are set
to the identity filter, however, the next section shows how the filters can be learned
from data using Powell's direction set minimisation algorithm [Powell 1964]. The
weights associated with each feature can be determined using semantic discrimination
analysis [Baldwin 1993; Baldwin, Martin and Pilsworth 1995]. Semantic
discrimination analysis measures the degree of mutual dissimilarity of concepts (or
classes), represented by fuzzy sets defined over the same feature universe. This
similarity measure is based on point semantic unification. The semantic discriminating
power of a Cartesian granule feature F_i for a class Class is calculated as follows:

    Discrimination_F_iClass  =  1  -      Max         Pr(CGFS_iClass | CGFS_j)   (9-4)
                                     j=1..C, j≠Class

This yields a vector of discrimination values (consisting of one discrimination value
for each feature) for each rule. The weight associated with each rule feature F_i is
obtained by normalising each value such that the resulting weights for a rule sum to
one. This is achieved as follows:

                    Discrimination_F_iClass
    w_iClass  =  -----------------------------                                 (9-5)
                    m
                    Σ  Discrimination_F_jClass
                   j=1

where m corresponds to the number of features in the class rule (this can vary from
class to class, see next section for details).
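
Equations 9-4 and 9-5 translate directly into a short routine. The point_semantic_unification
callable (returning Pr(f | g)) is assumed to be supplied by the mass assignment machinery
described earlier and is not defined in this sketch; the nested data layout is likewise an
assumption made for illustration.

    def sda_weights(class_fuzzy_sets, point_semantic_unification):
        """class_fuzzy_sets[i][c] is the Cartesian granule fuzzy set of class c on
        feature F_i. Returns weights[c][i], the normalised semantic discrimination
        weights of Equations 9-4 and 9-5."""
        classes = list(class_fuzzy_sets[0].keys())
        weights = {}
        for c in classes:
            discriminations = []
            for feature_sets in class_fuzzy_sets:
                worst_overlap = max(point_semantic_unification(feature_sets[c], feature_sets[j])
                                    for j in classes if j != c)
                discriminations.append(1.0 - worst_overlap)        # Equation 9-4
            total = sum(discriminations)
            weights[c] = [d / total for d in discriminations]      # Equation 9-5
        return weights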

An alternative means of estimating the weight associated with each feature F_i is to
normalise the fitness values associated with each feature such that the weights sum to
one. More formally:

                      Fitness_F_i
    w_iClass  =  ---------------------                                         (9-6)
                    m
                    Σ  Fitness_F_j
                   j=1

where Fitness_F_i is calculated using either Equation 9-2 or 9-3, and m corresponds to
the number of features in the class rule (this can vary from class to class).

9.5 PARAMETER OPTIMISATION IN G_DACG

The previous section has shown how to estimate the weights associated with each class
rule feature based on the semantic separation of class fuzzy sets or on fitness. In
addition, the filter was set to the identity filter, but the filter can provide an extra degree
of freedom that can sometimes boost the performance and transparency of a learnt
model. This section shows how optimisation techniques can identify rule weights and
filters that can boost the performance of learnt models, while also shedding new light
on the understandability of the model. The next section introduces an alternative
parameter identification technique based upon the Mass Assignment Neuro Fuzzy
(MANF) framework, where neural network learning algorithms can be applied to learn
the aggregation rules.

9.5.1 Feature weights identification using Powell's algorithm


The determination of feature weights can be formulated as an optimisation problem
where the goal is to determine the set of weights that model the aggregation behaviour
of a problem effectively. Any of a number of optimisation techniques can be used to
determine the weights. For this work a direction-set method approach based upon
Powell's minimisation algorithm [Powell 1964] was selected. The choice of Powell's
method was due to the well known nature and simplicity of the approach. Any other
optimisation technique could equally have been used. For example, the weights could
be encoded in a chromosome structure and a genetic search [Holland 1975] carried out
to determine near optimum weights. The cost or fitness function is based on the model
error as determined on the validation dataset.

The weights identification problem is encoded as follows: each class rule weight w_i is
viewed as a variable that satisfies the constraint 0 ≤ w_i ≤ 1.

The approach begins with estimating the weights by measuring the semantic separation
of the inter class fuzzy sets using semantic discrimination analysis (Section 9.4). Then a
constrained Powell's direction set minimisation (see Figure 9-8 for an outline of
Powell's direction set minimisation technique) is carried out for p iterations or until the
function stops decreasing. Each iteration involves N direction sets (where N = R * W_i,
R corresponds to the number of rules in the knowledge base and W_i denotes the
number of feature weights in the body of rule R_i), where the initial directions are set to
the unit directions. Note that in this case it is assumed that each class rule has equal
numbers of weights W, however, this can vary for each class. In order to evaluate the
cost function for a set of weights, the corresponding additive Cartesian granule feature
model is constructed. The class rule weights are set to the normalised Powell variable
values i.e. the constituent weights for a class rule are normalised so that the weights for
a rule sum to one. The constructed model is then evaluated on the validation dataset. In
this case, the class filters are set to the identity function. Following Powell
minimisation, the weight values, whose corresponding model yielded the lowest error,
are taken to be the result of the optimisation.

9.5.2 Filter identification using Powell's algorithm


This section presents an algorithm that determines the filter function for each of the
class evidential logic rules from the data. It begins by reviewing the role of the filter
within the evidential reasoning process [Baldwin 1991; Baldwin, Martin and Pilsworth
1995], before presenting the filter identification algorithm using Powell's direction set
minimisation.

9.5.2.1 Filter interpretation


During evidential reasoning, the support for an evidential logic rule is determined in
two steps: firstly, an intermediate support value is determined by taking a weighted sum
of the supports for the individual terms present in the rule body and secondly this
intermediate value is passed through the filter (usually an S-function) which determines
the overall interpretation of the support for the rule body. The filter takes the following
form:

S(x): [0, 1] → [0, 1].

The filter plays the role of a linguistic quantifier within the evidential logic rule:
determining the conjunctive and disjunctive nature of the rule. This can be more clearly
seen within an evidential logic rule where there are equally weighted terms. Consider
the following filter:

S(x) = { 1    if x = 1
       { 0    otherwise

In this case, the filter yields a rule body that is equivalent to a logic conjunction of
terms i.e. all terms must be satisfied.

Alternatively, consider the following filter:


S(x) = { 1    if x ≥ 1/n
       { 0    otherwise

where n corresponds to the number of body terms. In this case, the filter yields a rule
body that is equivalent to a logic disjunction of terms i.e. only one term is required to
be satisfied. When the weights are not equal, then these interpretations can be modified
to represent weighted conjunction and weighted disjunction interpretations.

Powell's direction set minimisation technique is an iterative approach, carrying
out function minimisations along favourable directions in N-dimensional space.
The following is an outline of the main steps involved:

Initialise the set of directions U_i, for i ∈ [1, N]

REPEAT
  • Save the starting position as P_0.
  • For i ∈ [1, N] move P_{i-1} to the minimum along direction U_i
    (using a golden section search) and call this point P_i.
  • For i ∈ [1, N-1] set U_i to U_{i+1}.
  • Set U_N to P_N - P_0.
  • Move P_N to the minimum along direction U_N and call this point P_0.
UNTIL the function stops decreasing.

Figure 9-8: Outline of Powell's direction set minimisation algorithm.
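
In practice one would rarely re-implement Powell's method from scratch. A sketch of how
the weight identification of Section 9.5.1 might be driven through SciPy's Powell
implementation is given below; build_acgf_model and validation_error are placeholders
for the book's model construction and evaluation steps, and the option values are arbitrary.

    import numpy as np
    from scipy.optimize import minimize

    def optimise_rule_weights(initial_weights, build_acgf_model, validation_error):
        """Tune the flattened vector of class rule weights with Powell's direction set
        method; each Powell variable is clipped to [0, 1], and per-rule normalisation
        is assumed to happen inside the model builder (Section 9.5.1)."""
        def cost(w):
            w = np.clip(w, 0.0, 1.0)
            model = build_acgf_model(w)          # placeholder: constructs the ACGF model
            return validation_error(model)       # placeholder: error on the validation set

        result = minimize(cost, np.asarray(initial_weights, dtype=float),
                          method="Powell", options={"maxiter": 5, "xtol": 1e-3})
        return np.clip(result.x, 0.0, 1.0)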

In the case of the work presented here, the filter structure is limited to two degrees of
freedom and is canonically defined as follows (see also Figure 9-9):

         {  0                   x ≤ a
S(x) =   {  (x - a) / (b - a)   a < x < b
         {  1                   otherwise

where 0 ≤ a ≤ b ≤ 1.

As a and b tend to 1, the function corresponds to an overall logical conjunction of
features, while a and b tending to 0 corresponds to an overall logical disjunction of
features. Outside these values the filter represents a combination of these interpretations
allowing flexibility in determining an appropriate solution. One might consider the
disjunctive region a more optimistic or relaxed interpretation of the body since fewer
features must be supported for the body as a whole to be supported. The conjunctive
region might be considered to be more pessimistic, since most of the features must be
supported for the body as a whole to be supported. These interpretations are discussed
in [Baldwin 1993; Baldwin, Martin and Pilsworth 1995]. The default filter can be
represented as the true fuzzy set, that is, the identity function S(x) = x. The filter can
semantically be viewed as playing a similar role to that of the parameter γ in
compensative operators (Section 3.5.3). The number of degrees of freedom in a filter is
restricted to two in order to increase the transparency of the model and also to reduce
computational complexity in determining the filter. As an alternative, the structure of
each class filter could be left free and the filter structure and parameters could be
determined automatically. This can be quite easily achieved using genetic algorithms
(or any other gradient-free optimisation technique), where the length of the
chromosome is variable.

[Plot of S(x) against x: zero up to a, rising between a and b, and one thereafter.]

Figure 9-9: An S-function filter for the evidential logic rule.
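
The two-degree-of-freedom filter is straightforward to encode; the following sketch simply
implements the canonical definition above, with the conjunction-like, disjunction-like and
identity interpretations recovered as special settings of a and b.

    def s_function_filter(x, a, b):
        """Evidential logic filter: 0 below a, 1 above b, linear in between.
        a = b = 1 approaches a conjunctive reading, a = b = 0 a disjunctive one,
        and a = 0, b = 1 gives the identity (true) filter S(x) = x."""
        assert 0.0 <= a <= b <= 1.0
        if x <= a:
            return 0.0
        if x >= b:
            return 1.0
        return (x - a) / (b - a)

    # Example: a mildly conjunctive interpretation of the weighted body support.
    print(s_function_filter(0.7, a=0.4, b=0.9))   # ≈ 0.6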

9.5.2.2 Filter extraction


While the true filter, may provide adequate behaviour in some problems, its
performance may prove less than adequate in others [Shanahan 1998]. Here, an
algorithm is proposed that determines a filter with two degrees of freedom for each
class rule. Each filter is viewed as a piecewise linear function with two degrees of
freedom that ultimately determine the shape of the filter function. This is depicted in
Figure 9-10. Varying a and b yields a large and flexible range of filters while
maintaining filter transparency. a and b are subject to the constraint 0 ≤ a ≤ b ≤ 1.

The determination of class filters can be formulated as an optimisation problem where
the goal is to determine the set of class filters that model the aggregation behaviour of
the problem effectively. An effective model, as previously presented, is a model that
generalises well and is transparent. Filter transparency is promoted by allowing only
two degrees of freedom in each filter, while the generalisation power attributed to the
selected filters is measured in terms of the model accuracy on a validation dataset.

Any of a number of optimisation techniques can be used to determine the filter
structures. Since the filter functions are piecewise linear in nature, thereby producing
discontinuities, the optimisation techniques need to be derivative free. Consequently, a
direction-set approach based upon Powell's minimisation algorithm [Powell 1964] is
chosen.

[Plot of a parameterised class filter over x, with breakpoints at a and b.]

Figure 9-10: A parameterised class filter.

The problem is encoded as follows: each filter degree of freedom (the a_i and b_i filter points)
is viewed as a variable in the range [0, 1] that satisfies the constraint 0 ≤ a_i ≤ b_i ≤ 1.

The initial filters are set to the true (identity) filter position. Then a constrained Powell's direction set
minimisation (see Figure 9-8 for an outline of Powell's direction set minimisation
technique) is carried out for p iterations (empirical evidence suggests a range of [1, 10]
[Shanahan 1998]) or until the function stops decreasing. Each iteration involves N
(where N = C * 2) direction sets (corresponding to the number of filter variables), where the
initial directions are set to the unit directions. In order to evaluate the cost function for a
set of filters the corresponding additive Cartesian granule feature model is constructed
and evaluated on the validation dataset. Following Powell minimisation, the values
associated with each of the variables, whose corresponding model yielded the lowest
error, are taken as the result of the optimisation and are used to generate the respective
class rule filters.

9.6 A MASS ASSIGNMENT-BASED NEURO-FUZZY NETWORK

In the previous section, parameter identification and optimisation of Cartesian granule
feature models was achieved using a combination of semantic discrimination analysis
and optimisation techniques. This section introduces an alternative parameter
identification procedure based upon the notion of Mass Assignment based Neuro-Fuzzy
(MANF) networks. This section can be omitted on a first time read without loss of
continuity.

MANFs are a supervised modelling approach that combines several complementary soft computing techniques (fuzzy set theory, mass assignment theory and neural networks) in order to model problems more accurately [Baldwin, Martin and Shanahan 1997a].

The more general category of MANF networks may exhibit some of the following characteristics (the list is not exhaustive):

• Data inputs are fuzzy sets (including Cartesian granule fuzzy sets), results of semantic unifications, raw feature data, or combinations of these;
• Outputs are fuzzy numbers;
• Weights are fuzzy numbers;
• Weighted inputs of each neuron are aggregated by some aggregation operator other than summation (evidential logic aggregator, fuzzy integral [Grabisch and Nicolas 1994; Klir and Yuan 1995], etc.);
• Probabilistic neurons;
• The network can be represented as a (feed forward) neural network whose parameters are learned using an algorithm such as the backward propagation learning algorithm (or a fuzzified version of it).

Some of the above could prove interesting as future directions for the work presented here. Figure 9-11 depicts a typical architecture of a MANF network utilised in this work. In general, MANF networks accept raw feature values as inputs. Subsequently, in the case of Cartesian granule features, the network linguistically interprets the raw data values and performs a match (semantic unification) between previously learned classes (expressed in terms of Cartesian granule fuzzy sets) and the linguistic data value. The results of semantic unification are then taken as input to the neural net, which then classifies based on these input values. In the case of a feed forward net, the classification value corresponding to the maximum output node activation is deemed to be the classification of the input data tuple. In the case of prediction problems, the output layer would correspond to a single node whose value denotes the output of the network.

Figure 9-12 outlines the main steps involved in learning a MANF network (such as the network depicted in Figure 9-11) from classified training data. Step 1 (in Figure 9-12) corresponds to structure identification of Cartesian granule feature models as outlined in the previous sections in the G_DACG algorithm, while step 2 is equivalent to parameter identification. Step 3 is a data transformation step and essentially replaces the raw database values for each training tuple with the results of semantic unification, which are generated by taking the semantic unification of each class Cartesian granule fuzzy set given the corresponding data fuzzy set. Step 4 trains the neural network using the transformed data. Empirical evidence to date suggests that single-layer feed forward neural networks are sufficient to model complex real world problems [Baldwin, Martin and Shanahan 1997a; Baldwin, Martin and Shanahan 1997c]. For these problems a conjugate gradient descent learning algorithm [Moller 1993] was used.

The neurons in a MANF network (generally single-layered) can alternatively be represented as a series of evidential logic rules. Each neuron in the output layer of the extracted single layer neural network maps directly onto an evidential logic rule with equivalent behaviour. See Figure 9-13 for a graphic depiction of this correspondence.

The weights extracted by the MANF learning algorithm could be used to aggregate the
features in the evidential logic rule and the filter function could be set to the activation
function used by the MANF network (see Figure 9-13). Each evidential logic rule
needs to take into account the biases of the individual neurons by adding an extra
feature that is always satisfied. This mapping allows the use of the MANF framework
for parameter identification of Cartesian granule feature rule based models. This
permits the use of well known and proven algorithms in the field of machine learning
for parameter identification, although the resulting knowledge representation tends to
be less intuitive due to the presence of negative weights and bias terms. However,
recent work in neural networks addressing restricted weights ranges has shown the
usefulness of neural networks as induction algorithms for logic rule based systems
[Bishop 1995; Fletcher and Hinde 1995; Hertz, Anders and Palmer 1991; Hinde 1997].
These ideas should prove useful in extracting more intuitive models using the MANF
framework and should be investigated in future work.
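A rough sketch of this neuron-to-rule mapping is given below (neuron_to_evlog_rule is a name introduced here for illustration): the learnt weights become the evidential weights, the bias is carried by an extra, always-satisfied feature, and the activation function plays the role of the filter.

def neuron_to_evlog_rule(weights, bias, feature_names, activation):
    # The rule body: one weighted term per Cartesian granule feature, plus an
    # always-satisfied term (support 1) carrying the neuron's bias.
    terms = list(zip(feature_names, weights)) + [("always_true", bias)]

    def rule(support_values):
        # support_values: the semantic unification results for each feature.
        s = sum(w * v for (_, w), v in zip(terms[:-1], support_values)) + bias
        return activation(s)

    return terms, rule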

[Diagram omitted: the MANF network architecture.]

Figure 9-11: A possible architecture for a MANF network.

A MANF network is very similar to the perceptron architecture described in [Minsky and Papert 1969]. Minsky and Papert viewed the perceptron as consisting of an input layer, which they termed the retina, and of hidden and output layers. They viewed the retina as an array of pixels, where each pixel corresponded to a bivalent logic predicate (i.e. predicates/functions which returned 0 or 1). The work in this book extends this notion to allow pixels in the retina to be probabilistic-valued predicates (or alternatively fuzzy predicates). In this case the predicates output probabilistic semantic unifications or fuzzy set membership values in the unit interval.

1. Select class Cartesian granule features
2. Construct class Cartesian granule fuzzy sets
3. Foreach training data tuple DO (this step transforms the data)
   • Foreach class Cartesian granule feature
     - Generate the Cartesian granule fuzzy set corresponding to the raw data values
     - Perform semantic unification of the class Cartesian granule fuzzy set and the Cartesian granule fuzzy set corresponding to the raw input data
   • End Foreach
   • End Foreach
4. Train the neural network using the transformed data of semantic unifications and corresponding classifications.

Figure 9-12: The training algorithm for a MANF network.

[Diagram omitted: (a) a partially evaluated evidential logic rule and (b) the corresponding MANF neuron, whose inputs Pr(CGFSiCLASS | CGFSiDATA) are weighted by wi, combined with a bias input, and passed through an activation function.]

Figure 9-13: The correspondence of evidential logic rules with neural network neurons. (a) A partially evaluated evidential logic rule; (b) A MANF neuron where the input values are the probabilities associated with the semantic unifications in the partially evaluated evidential logic rule (i.e. Pr(CGFSiCLASS | CGFSiDATA)).

9.7 A DETAILED EXAMPLE RUN OF G_DACG

The previous sections have described the G_DACG constructive induction algorithm
as a means of learning additive Cartesian granule feature (ACGF) models from data.
Here, for the purposes of illustration, the G_DACG algorithm is applied to a small
artificial problem: the ellipse classification problem. The use of both fitness functions
proposed above is examined.

9.7.1 Ellipse classification problem


The ellipse problem is a binary classification problem based upon artificially generated data from the real universe R x R. Points satisfying an ellipse inequality are classified as Legal, while all other points are classified as Illegal. This is graphically depicted in Figure 9-14 for the ellipse inequality x² + y² ≤ 1. The two domain input features, X and Y, are defined over the universes Ω_X = [-1.5, 1.5] and Ω_Y = [-1.5, 1.5] respectively. Different training, control (validation) and test datasets, consisting of 1000, 300 and 1000 data vectors respectively, were generated using a pseudo-random number stream. An equal number of data samples for each class were generated. Each data sample consists of a triple <X, Y, Class>, where Class adopts the value Illegal, indicating that the point <X, Y> does not satisfy the ellipse inequality, and the value Legal, otherwise.
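A sketch of how such balanced datasets can be generated is shown below; it uses Python's standard generator in place of the pseudo-random number stream used for the reported experiments, so it will not reproduce the exact datasets.

import random

def generate_ellipse_dataset(n_per_class, seed=0):
    # Balanced <X, Y, Class> triples sampled uniformly over [-1.5, 1.5]^2.
    rng = random.Random(seed)
    legal, illegal = [], []
    while len(legal) < n_per_class or len(illegal) < n_per_class:
        x, y = rng.uniform(-1.5, 1.5), rng.uniform(-1.5, 1.5)
        if x * x + y * y <= 1.0:
            if len(legal) < n_per_class:
                legal.append((x, y, "Legal"))
        elif len(illegal) < n_per_class:
            illegal.append((x, y, "Illegal"))
    data = legal + illegal
    rng.shuffle(data)
    return data

train = generate_ellipse_dataset(500)              # 1000 training tuples
control = generate_ellipse_dataset(150, seed=1)    # 300 validation tuples
test = generate_ellipse_dataset(500, seed=2)       # 1000 test tuples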

[Plot omitted: the ellipse inside the square [-1.5, 1.5] x [-1.5, 1.5].]

Figure 9-14: An ellipse in Cartesian space. Points in lightly shaded region satisfy the ellipse inequality and thus are classified as Legal. Points in darker region are classified as Illegal.

9.7.2 Using G_DACG to learn ellipse classifiers


This section presents the steps and parameter settings involved in a typical run of the G_DACG constructive induction algorithm, with the aim of constructing an additive Cartesian granule feature model for the ellipse problem. In order to run the G_DACG algorithm, various parameters, mostly GP related, need to be set. In a typical run the population size is limited to 30 chromosomes, due to the small nature of the problem. Initial populations are generated using the ramped-half-and-half procedure [Koza 1992], i.e. half random-length chromosomes and half full-length chromosomes. The chromosome length range, in the initial population and in subsequent generations, is problem dependent but parsimony is promoted. The k-tournament selection parameter k was set to 3 for this problem. Table 9-2 depicts the G_DACG tableau of parameters and objectives used to construct an additive Cartesian granule feature model for the ellipse problem.

Table 9-2: G_DACG tableau for the ellipse problem where fitness was measured using semantic discrimination and parsimony.

Objective               Find an additive CG feature model which classifies a point as a Legal or Illegal ellipse point correctly.
Terminal Set            X, Y
Chromosome Length       [1, 2]
Feature Granularity     [2, 12]
Function Set            CGProduct
GP Flavour              Steady state with no duplicates
GP Selection            K-tournament, k = 3
Fitness                 Fitness_i = 0.9 * Discrimination_i + 0.0 * μ_SmallDim(Dimensionality_i) + 0.1 * μ_SmallUniv(UniverseSize_i)
Fuzzy set SmallUniv     [30:1, 100:0] (see footnote 5)
Standardised fitness    Same as Fitness
ACGF model size         [1, 3]
Testing Mechanism       Holdout estimate
Parameters              PopSize = 30, #Generations = 31
Dataset Size (tuples)   Train = 1000, Control = 300, Test = 1000
Success Predicate       100% classification accuracy on the control dataset

The language identification phase of the G_DACG algorithm was allowed to iterate for 31 generations or halted earlier if the stopping criterion was satisfied. The stopping criterion in this case specified that if the best-of-generation model had a classification accuracy of 100% on the control dataset, then language identification would halt. The model language was then set to the language of the model, chosen from the best-of-generation and overall-best models, which had the highest performance on the control dataset. Then the parameters of the corresponding model were determined using steps 3-5 in G_DACG, along with an investigation of which granule characterisation gave the best accuracy. The use of triangular and trapezoidal (with different degrees of overlap) fuzzy sets was examined.

5 Here the fuzzy set SmallUniv: [30:1, 100:0] can be rewritten mathematically as follows (μ_SmallUniv(x) denoting the membership value of x in the fuzzy set SmallUniv):

μ_SmallUniv(x) = 1,               if x ≤ 30
               = (100 − x)/70,    if 30 < x < 100
               = 0,               if x ≥ 100
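For illustration, the fitness entry of Table 9-2 can be read as the following small computation (the discrimination and universe-size values would come from the candidate Cartesian granule feature being evaluated; note that the dimensionality term carries a zero weight in this tableau):

def mu_small_univ(size):
    # Membership in the fuzzy set SmallUniv: [30:1, 100:0] from footnote 5.
    if size <= 30:
        return 1.0
    if size >= 100:
        return 0.0
    return (100.0 - size) / 70.0

def fitness(discrimination, universe_size):
    # 0.9 * semantic discrimination + 0.1 * parsimony (universe-size) term.
    return 0.9 * discrimination + 0.1 * mu_small_univ(universe_size)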

As alluded to previously, both fitness functions were compared in the context of the ellipse problem: fitness based on the semantic separation of concepts and parsimony promotion; and fitness based on accuracy on the control dataset and parsimony promotion. The results for both approaches and a brief discussion are presented subsequently.

9.7.2.1 G_DACG results using semantic separation as fitness


The G_DACG algorithm was run several times on the ellipse problem using the fitness
function based on the semantic separation of concepts and parsimony promotion
(Equation 9-2). In general, the language of the model was selected from one of the best-
of-generation models, even though the algorithm ran for the maximum allowed number
of generations (i.e. the early stopping condition was not satisfied). The best discovered
ACGF model was generated by taking the three best Cartesian granule features from
generation 15 of a G_DACG run. This resulting rule-based model is depicted in Figure
9-15. The Cartesian granule fuzzy sets for each of the rule features are not shown due
to space limitations. Both class rules have similar structure or language in this model.
The learnt filter for the illegal class corresponds to the "identity" filter, while the filter
for the legal class exhibits an intermediate behaviour between disjunction and
conjunction. A trapezoidal fuzzy set with 50% overlap was determined to be the best
granule characterisation in the case of this model. The discovered additive model yields
an accuracy of 98.8% on the unseen test dataset. The class confusion matrix for this
model is displayed in Table 9-3. The cells in the diagonal of the confusion matrix
correspond to the correctly classified points (test tuples) for each class; for example,
495 of the 500 (99%) legal points were correctly labelled. The other non-diagonal cells
correspond to misclassified ellipse points, with the row label representing the actual
classification and the column representing the model predicted classification. The
G_DACG algorithm took approximately one hour to discover this ACGF model on a
multi-user, single CPU, Sun Ultra workstation.

Table 9-3: Confusion matrix for the ellipse model presented in Figure 9-15 on the test dataset. This model yields an accuracy of 98.8% on the test dataset.

Actual \ Predicted    Legal    Illegal    Total    %Accuracy
Legal                   495          5      500         99.0
Illegal                   7        493      500         98.6
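The entries in Table 9-3 can be checked with a few lines of arithmetic, recomputing the per-class and overall accuracies from the confusion counts:

# Confusion counts from Table 9-3 (rows: actual class, columns: predicted).
confusion = {"Legal":   {"Legal": 495, "Illegal": 5},
             "Illegal": {"Legal": 7,   "Illegal": 493}}

for actual, row in confusion.items():
    print(actual, 100.0 * row[actual] / sum(row.values()))   # 99.0 and 98.6

overall = 100.0 * (495 + 493) / 1000                          # 98.8%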

Taking a closer look at step 5 of G_DACG, the weights and filter optimisation step, it
can be seen that the performance of the discovered ellipse model improved as a result
of optimisation. Table 9-4 displays the effects of optimisation in terms of the
accuracies of the resulting models on the training, control and test datasets (columns 3,
4, 5 respectively). The results, expressed in terms of model accuracies on training,
control and test datasets, presented in each row correspond to the following models:
(Row 2) for models where the weights were determined using fitness measures (that
were based on semantic discrimination analysis and parsimony) and the filters were set
to the identity filter; (Row 3) for models where the weights were determined using
fitness measures and the filters were determined using the filter optimisation algorithm
(presented in Section 9.5.2); (Row 4) for models where the filters were set to those
SOFT COMPUTING FOR KN OWLEDGE DISCOVi:RY: INTRODUCING CARTESIAN GRANULE FEATURES 229

determined in Row 3, and where the weights were optimised using Powell's algorithm
(presented in Section 9.5.1); (Row 5) for models where the weights were set to those
determined in Row 4, and where the filters were re-optimised using Powell's algorithm.

?((def_type LEGAL_FILTER [0:0, 0.89:1.0]))      % filters represented
?((def_type ILLEGAL_FILTER [0:0, 1.0:1.0]))     % as fuzzy sets

((Classification for Point is Legal)
 (evlog LEGAL_FILTER (
   (cgValue of ((X 4)) of Point is legalClass) 0.168
   (cgValue of ((Y 8)) of Point is legalClass) 0.342
   (cgValue of ((X 11)(Y 7)) of Point is legalClass) 0.49 ) ) ):((1 1)(0 0))

((Classification for Point is Illegal)
 (evlog ILLEGAL_FILTER (
   (cgValue of ((X 4)) of Point is legalClass) 0.168
   (cgValue of ((Y 8)) of Point is legalClass) 0.342
   (cgValue of ((X 11)(Y 7)) of Point is legalClass) 0.49 ) ) ):((1 1)(0 0))

Figure 9-15: An example of an additive Cartesian granule feature model for the ellipse problem. This model's accuracies for training, control and test datasets are 99.2%, 99.7% and 98.8% respectively.
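A hedged sketch of how a rule set like the one in Figure 9-15 is applied to a data point is given below: each class support is the weighted sum of semantic unification values (one per Cartesian granule feature in the rule body), passed through that class's filter, and the class with the largest filtered support is returned. The helper unify is a hypothetical stand-in for the probabilistic semantic unification Pr(class Cartesian granule fuzzy set | data Cartesian granule fuzzy set).

def class_support(point, features, weights, class_fuzzy_sets, filter_fn, unify):
    # Evidential-logic-style aggregation: weighted sum of feature supports,
    # then the class filter.
    s = sum(w * unify(point, f, cfs)
            for f, w, cfs in zip(features, weights, class_fuzzy_sets))
    return filter_fn(s)

def classify(point, rules, unify):
    # rules: {class_name: (features, weights, class_fuzzy_sets, filter_fn)}
    supports = {c: class_support(point, *r, unify) for c, r in rules.items()}
    return max(supports, key=supports.get)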

Table 9-4: Model accuracies expressed in terms of training, control and test datasets at various stages of filter and weights optimisation. The model resulting from optimising the filters (row 3, in bold) was selected as the output of the G_DACG algorithm for the ellipse problem on the basis of its superior performance on the control dataset.

Filters                Weights              Train %Accuracy    Control %Accuracy    Test %Accuracy
Identity filters       Fitness-SDA-based              0.978                0.983             0.978
Optimised filters      Fitness-SDA-based              0.992                0.997             0.988
Optimised filters      Optimised                      0.979                0.993             0.979
Re-optimised filters   Optimised                      0.979                0.993             0.979

From an evolutionary perspective, the behaviour of the G_DACG algorithm was monitored using fitness and variety curves. Figure 9-16 contains the fitness curves for the G_DACG run that resulted in the above model. This figure shows, by generation, the progress of one G_DACG run of the ellipse problem between generations 0 and 30, using three plots: the fitness of the best-of-generation individual (or Cartesian granule feature) represented by the curve Best Fitness; the fitness of the worst-of-generation individual represented by the curve Worst Fitness; and the average value of fitness for all individuals in the population represented by the curve Average Fitness. As can be seen, the fitness of the best-of-generation individual starts at approximately 0.78 but gradually improves to 0.8. Even though this fitness level is reached at generation 13, the solution (the language of the learnt model) is not discovered, in this run, until generation 15. This situation arises since the individual Cartesian granule features solve problems collectively (rather than individually). The improvement in fitness from generation to generation is monotonically increasing in the case of the best-of-generation individual (due to the nature of steady state GPs); however, the average and worst-of-generation fitnesses exhibit an erratic behaviour, resulting from the exploration of new individuals (or previously explored individuals) and also from the small number of individuals in the population.

[Plot omitted: "Evolution of Chromosome Fitness" — Best Fitness, Worst Fitness and Average Fitness versus generation (0-30).]

Figure 9-16: Fitness curves for the ellipse problem, where fitness is based on semantic discrimination analysis and parsimony.

[Plot omitted: "Population Variety within Pool and Overall (DB usage)" — population variety and % of chromosomes revisited versus number of training cycles (0-30).]

Figure 9-17: Percentage of Cartesian granule features for ellipse problem that were revisited in each generation on a G_DACG run, where fitness is based on semantic discrimination analysis and parsimony.

On the other hand, Figure 9-17 presents the variety, by generation, of the evolutionary
search for the G_DACG run that resulted in the above model. This figure shows, by
generation, the progress of one G_DACG run of the ellipse problem between
generations 0 and 30, using two plots: the percentage of new Cartesian granule features
visited in each generation, though the curve (labelled % of Chromosomes Revisited) is
plotted from the perspective of the number of features that are revisited; and the second
curve displays the chromosome variety in the current population, but this can be
ignored here, since duplicates are not allowed within a population. The number of novel
features in each population decreases steadily as a result of the small scale of this
problem's search space (and population count) and also because of the evolutionary
nature of the search.

9.7.2.2 G_DACG results using accuracy as fitness


The previous section has demonstrated how G_DACG, guided by a fitness function
based on semantic separation and parsimony, discovered an ACGF model for the
ellipse problem. An alternative fitness function based on accuracy and parsimony
(Equation 9-3) is considered here. Once again, the G_DACG algorithm was run several
times on the ellipse problem using the accuracy and parsimony-based fitness function.
In general, over several runs, the language discovery component of G_DACG halted
early, having discovered a model language that yielded an accuracy of 100% on the
control dataset. Figure 9-18 presents a typical model resulting from running the
G_DACG algorithm. The Cartesian granule fuzzy sets for each of the rule features are
not shown due to space limitations. The class rules differ, as the two-dimensional feature was eliminated from the legal rule during weights optimisation. The learnt filters for both classes correspond to the "identity" filter. A trapezoidal fuzzy set with 60% overlap was determined to be the best granule characterisation in the case of this model. The discovered additive model yields an accuracy of 99.2% on the unseen test dataset. The class confusion matrix for this model is displayed in Table 9-5. The language of this model (see Figure 9-18) corresponds to the best-of-generation model for generation 4 of a G_DACG run. In the case of this run, G_DACG ran for 31 generations. Typical G_DACG runs on the ellipse problem using this fitness function take approximately 150 minutes on a multi-user, single CPU, Sun Ultra workstation. Fitness and variety curves for this run are presented in Figure 9-19 and Figure 9-20 respectively. They portray similar characteristics to the G_DACG run presented above, where search was guided by a semantic separation- and parsimony-based fitness function.

Table 9-5: The test confusion matrix for the ellipse model presented in Figure 9-18. This corresponds to an accuracy of 99.2% on the test dataset.

Actual \ Predicted    Legal    Illegal    Total    %Accuracy
Legal                   495          5      500         99.0
Illegal                   3        497      500         99.4

?((def_type LEGAL_FILTER [0:0, 1.0:1.0]))
?((def_type ILLEGAL_FILTER [0:0, 1.0:1.0]))

((Classification for Point is Legal)
 (evlog LEGAL_FILTER (
   (cgValue of ((X 10)) of Point is legalClass) 0.16
   (cgValue of ((Y 3)) of Point is legalClass) 0.84
 ) ) ):((1 1)(0 0))

((Classification for Point is Illegal)
 (evlog ILLEGAL_FILTER (
   (cgValue of ((X 10)) of Point is illegalClass) 0.24
   (cgValue of ((Y 3)) of Point is illegalClass) 0.36
   (cgValue of ((X 11)(Y 7)) of Point is illegalClass) 0.4
 ) ) ):((1 1)(0 0))

Figure 9-18: An example of an additive Cartesian granule feature model for the ellipse problem discovered on a G_DACG run guided by an accuracy- and parsimony-based fitness function. This model's accuracies for training, control and test datasets are 99.1%, 100% and 99.2% respectively.

[Plot omitted: "Evolution of Chromosome Fitness" — Best Fitness, Worst Fitness and Average Fitness versus generation (0-30).]

Figure 9-19: Fitness curves for the ellipse problem over 31 generations, where fitness
is based on control dataset accuracy and parsimony.

9.8 DISCUSSION

The discovery of good, highly discriminating, and parsimonious Cartesian granule features is an exponential search problem that forms one of the most critical and challenging tasks in the identification of Cartesian granule feature models. Most of this exponential effort is spent in evaluating individuals, in other words, evaluating the fitness function. As a result, the fitness function dictates the efficiency of the algorithm. In this chapter, two fitness functions were proposed, which both promote parsimony, but differ on the second measure used: one corresponds to feature accuracy (incorporated in a rule-based model) on a control dataset; and the other to the semantic separation of classes (concepts) expressed in terms of fuzzy sets over this feature space. The latter is computationally far more efficient than the former. This becomes obvious after examining the steps involved in computing both. The fitness function measured in terms of the accuracy of the model on a control dataset involves the following steps:

(i) Learn the class fuzzy sets for this feature.
(ii) Generate the corresponding rule-based model.
(iii) Evaluate the model on the control dataset.

[Plot omitted: "Population Variety within Pool and Overall (DB usage)" — population variety and % of chromosomes revisited versus number of training cycles (0-30).]

Figure 9-20: Percentage of Cartesian granule features for ellipse problem that were
revisited in each generation on a G_DACG run, where fitness is based on control
dataset accuracy and parsimony.

On the other hand, the fitness function measured in terms of the semantic separation of concepts involves the following steps:

(i) Learn the class fuzzy sets for this feature.
(ii) Compute the semantic discriminating power of the feature using Equation 9-1.

Step (ii) of the accuracy-based fitness function can be considered almost negligible, thus both approaches differ in terms of their final steps. From a computational perspective, the effort involved in calculating the semantic separation of concepts will, in general, be a fraction of the effort required to reason about all the examples in a control dataset. To compute the semantic separation of c classes requires c·(c−1) semantic unifications. This is in contrast to N·c semantic unifications for evaluating a model on a control dataset consisting of N examples. For the ellipse problem, for example, with c = 2 classes and a control dataset of N = 300 examples, this amounts to 2 semantic unifications per feature evaluation versus 600. Despite the fact that the computational effort required to calculate one semantic unification for the semantic separation-based fitness function will be greater than that of the accuracy-based approach, the total computational effort of this approach will be much less. This claim is corroborated by the results on the ellipse problem, where the computational time required for a G_DACG run with the semantic separation-based fitness function is less than that required for a G_DACG run where fitness was based on accuracy. In both runs all other parameters were the same.

Another attractive feature of the semantic separation-based fitness function is that it avoids problems that arise when learning algorithms are applied in domains where the prior distribution of data examples is biased towards a small number of classes. In general, for such domains, learning algorithms such as neural networks will focus on the dominant (or the most frequently occurring) classes, generating models or hypotheses that model these classes very well while ignoring rarer classes. Even though various modifications and extensions of learning algorithms have been proposed to alleviate this problem, these tend to be heuristic in nature and problem dependent [Lawrence et al. 1999]. A cost function based on the semantic discrimination of concepts ignores biased priors and directs the search towards features that provide good separation of concepts. However, this approach is not a panacea, as it also has an in-built bias, that of favouring features that provide high discrimination. In general, Cartesian granule features with high granularity will also provide higher discrimination but will not provide good generalisation to unseen cases, i.e. they overfit the training data. The G_DACG algorithm combats this using a parsimony measure.

Generic improvements to genetic programming will naturally improve the performance of the G_DACG algorithm. For example, the evaluation of the fitness function for each chromosome could be carried out in parallel, improving the performance of the G_DACG algorithm by orders of magnitude.

Instead of using a population-based search approach to discover the language of Cartesian granule feature models, cheaper, point-based local-search approaches such as simulated annealing could be used. However, while attractive in terms of their computational requirements, these approaches can lead to models that do not provide satisfactory performance.

9.9 SUMMARY

The G_DACG constructive induction algorithm was introduced as a means of learning additive Cartesian granule feature models from example data. The main steps in the G_DACG algorithm involve identifying the language of the model and subsequently identifying the various rule parameters. The language identification component of G_DACG, which is concerned with identifying Cartesian granule features (expressed in terms of the problem features and their universe abstractions) that model the problem domain effectively, tractably and transparently, is based on genetic programming. A couple of novel fitness functions were presented, including fitness based upon the semantic separation of learnt concepts and parsimony promotion. Parameter identification, on the other hand, learns the class fuzzy sets, evidential weights and filters. The class fuzzy sets are extracted using a probabilistic aggregation of linguistic descriptions of data (i.e. data Cartesian granule fuzzy sets), whereas the feature weights and rule filters are extracted using various optimisation techniques. The G_DACG algorithm was illustrated, in detail, on a small artificial problem and the behaviour of the proposed fitness functions was examined. The discovery of good, highly discriminating, and parsimonious Cartesian granule features is an exponential search problem that forms one of the most critical and challenging tasks in the identification of Cartesian granule feature models. Most of this exponential effort is spent in evaluating the fitness of individuals. The proposed fitness function based on semantic separation of concepts helps reduce the complexity of the discovery task.

In subsequent chapters, the G_DACG algorithm is applied to various artificial and real world problems. It is also compared to other well known learning techniques and parallels are drawn between these approaches from knowledge representation and learning points of view.

9.10 BIBLIOGRAPHY

Almuallim, H., and Dietterich, T. G. (1991). "Learning with irrelevant features." In the
proceedings of AAAI-91, Anaheim, CA, 547-552.
Baldwin, J. F. (1991). "Combining evidences for evidential reasoning", International
Journal of Intelligent Systems, 6(6):569-616.
Baldwin, J. F. (1993). "Evidential Support Logic, FRIL and Case Based Reasoning", Int.
J. of Intelligent Systems, 8(9):939-961.
Baldwin, J. F., Martin, T. P., and Pilsworth, B. W. (1995). FRIL - Fuzzy and Evidential
Reasoning in A.I. Research Studies Press(Wiley Inc.), ISBN 0863801595.
Baldwin, J. F., Martin, T. P., and Shanahan, J. G. (1997a). "Fuzzy logic methods in
vision recognition." In the proceedings of Fuzzy Logic: Applications and
Future Directions Workshop, London, UK, 300-316.
Baldwin, J. F., Martin, T. P., and Shanahan, J. G. (1997b). "Modelling with words
using Cartesian granule features." In the proceedings of FUZZ-IEEE,
Barcelona, Spain, 1295-1300.
Baldwin, J. F., Martin, T. P., and Shanahan, J. G. (1997c). "Structure identification of
fuzzy Cartesian granule feature models using genetic programming." In the
proceedings of IJCAl Workshop on Fuzzy Logic in Artificial Intelligence,
Nagoya, Japan, 1-11.
Baldwin, J. F., Martin, T. P., and Shanahan, J. G. (1998). "System Identification of
Fuzzy Cartesian Granule Feature Models using Genetic Programming", In
IJCAI Workshop on Fuzzy Logic in Artificial Intelligence, Lecture Notes in
Artificial Intelligence (LNAI 1566) - Fuzzy Logic in Artificial Intelligence, A.
L. Ralescu and J. G. Shanahan, eds., Springer, Berlin, 91-116.

Baldwin, J. F., and Pilsworth, B. W. (1997). "Genetic Programming for Knowledge


Extraction of Fuzzy Rules." In the proceedings of Fuzzy Logic: Applications
and Future Directions Workshop, London, UK, 238-251.
Bastian, A (1995). "Modelling and Identifying Fuzzy Systems under varying User
Knowledge", PhD Thesis, Meiji University, Tokyo.
Bezdek, J. C. (1976). "A Physical Interpretation of Fuzzy ISODATA", IEEE Trans. on
System, Man, and Cybernetics, 6(5):387-390.
Bezdek, J. C. (1981). Pattern Recognition with Fuzzy Objective Function Algorithms.
Plenum Press, New York.
Bishop, C. M. (1995). Neural Networks for Pattern Recognition. Clarendon Press,
Oxford.
Blum, A L., and Langley, P. (1997). "Selection of relevant features and examples in
machine learning", Artificial Intelligence, 97:245-271.
Bossley, K. M. (1997). "Neurofuzzy Modelling Approaches in System Identification",
PhD Thesis, Department of Electrical and Computer Science, Southampton
University, Southampton, UK.
Bouchon-Meunier, B., Marsala, C., and Ramdani, M. (1997). "Learning from Imperfect
Data", In Fuzzy Information Engineering, H. P. D. Dubois, R. R. Yager, ed.,
Wiley & Sons, Inc., New York.
Devijver, P. A, and Kittler, J. (1982). Pattern Recognition: A Statistical Approach.
Prentice-Hall, Englewood Cliffs, NJ.
Dietterich, T. G., and Michalski, R. S. (1983). "A comparative review of selected
methods for learning from examples", In Machine Learning: An Artificial
Intelligence Approach, R. S. Michalski, J. G. Carbonell, and T. M. Mitchell,
eds., Springer-Verlag, Berlin, 41-81.
Fletcher, G. P., and Hinde, C. J. (1995). "Using neural networks for constructing rule
based systems", Knowledge Based Systems, 8(4):183-189.
Geman, S., Bienenstock, E., and Doursat, R. (1992). "Neural networks and the
bias/variance dilemma", Neural computation, 4: 1-58.
Goldberg, D. E., and Deb, K. (1991). "A comparative analysis of selection schemes
used in genetic algorithms", In Foundations of Genetic Algorithms, G.
Rawlins, ed., Morgan Kaufmann, San Francisco.
Grabisch, M., and Nicolas, J. (1994). "Classification by fuzzy integral: Performance
and tests", Fuzzy Sets and Systems, 65:255-271.
Hertz, J., Anders, K., and Palmer, R. G. (1991). Introduction to the Theory of Neural
Computation. Addison-Wesley, New York.
Hinde, C. J. (1997). "Intelligible interpretation of neural networks." In the proceedings
of Fuzzy Logic: Applications and Future Directions Workshop, London, UK,
95-122.
Holland, J. H. (1975). Adaptation in Natural and Artificial Systems. University of
Michigan Press, Michigan.
Ivakhnenko, A. G. (1971). "Polynomial theory of complex systems", IEEE
Transactions on Systems, Man and Cybernetics, 1(4):363-378.
Jang, J. S. R. (1994). "Structure Determination in Fuzzy Modelling." In the proceedings
of International Conference on Fuzzy Systems, 480-485.
Jolliffe, I. T. (1986). Principal Component Analysis. Springer, New York.
Kavli, T. (1993). "ASMOD: an algorithm for Adaptive Spline Modelling of
Observation Data", International Journal of Control, 58(4):947-968.

Kira, K., and Rendell, L. (1992). "A practical approach to feature selection." In the
proceedings of 9th Conference in Machine Learning, Aberdeen, Scotland,
249-256.
Klir, G. J., and Yuan, B. (1995). Fuzzy Sets and Fuzzy Logic, Theory and Applications.
Prentice Hall, New Jersey.
Kohavi, R., and John, G. H. (1997). "Wrappers for feature selection", Artificial
Intelligence, 97:273-324.
Kohonen, T. (1984). Self-Organisation and Associative Memory. Springer-Verlag,
Berlin.
Kononenko, I., and Hong, S. J. (1997). "Attribute selection for modelling", FGCS
Special Issue in Data Mining(Fall):34-55.
Koza, J. R. (1992). Genetic Programming. MIT Press, Massachusetts.
Koza, J. R. (1994). Genetic Programming II. MIT Press, Massachusetts.
Lawrence, S., Burns, I., Back, A., Tsoi, A. C., and Giles, C. L. (1999). "Neural network
classification and prior probabilities", In Tricks of the trade, Lecture notes in
computer science, G. Orr, K. R. Muller, and R. Caruana, eds., Springer-
Verlag, New York, 20-36.
Ljung, L. (1987). System identification: theory for the user. Prentice Hall, Englewood
Cliffs, New Jersey, U.S.A.
Minsky, M., and Papert, S. (1969). Perceptrons: An introduction to computational
geometry. M.I.T. Press, Cambridge, MA.
Moller, M. F. (1993). "A scaled conjugate gradient algorithm for fast supervised
learning", Neural Networks, 6:525-533.
Powell, M. J. D. (1964). "An efficient method for finding the minimum of a function of
several variables without calculating derivatives", The Computer Journal,
7:155-162.
Quinlan, J. R. (1986). "Induction of Decision Trees", Machine Learning, 1(1):86-106.
Shanahan, J. G. (1998). "Cartesian Granule Features: Knowledge Discovery of
Additive Models for Classification and Prediction", PhD Thesis, Dept. of
Engineering Mathematics, University of Bristol, Bristol, UK.
Shanahan, J. G., Baldwin, J. F., and Martin, T. P. (1999). "Constructive induction of
fuzzy Cartesian granule feature models using Genetic Programming with
Applications." In the proceedings of Congress of Evolutionary Computation
(CEC), Washington D.C., 218-225.
Syswerda, G. (1989). "Uniform crossover in genetic algorithms", In Third Int'l
Conference on Genetic Algorithms, J. D. Schaffer, ed., Morgan Kaufmann,
San Francisco, USA, 989-995.
Tackett, W. A. (1995). "Mining the Genetic Program", IEEE Expert, 6:28-28.
CHAPTER 10: ANALYSIS OF CARTESIAN GRANULE FEATURE MODELS

Additive Cartesian granule feature (ACGF) models and a corresponding constructive induction algorithm - G_DACG - were introduced in the previous chapters. G_DACG automatically determines the language (Cartesian granule features and linguistic partitions) and parameters of a Cartesian granule feature model. Here, for the purposes of illustration and analysis, this approach is applied in the context of artificial problems in both the classification and prediction domains. Even though the G_DACG algorithm can automatically learn models from example data, here the language of the models is determined manually, while the model parameters are identified automatically. This
allows a close analysis of the effect of various decisions taken primarily in the language
identification phase of learning, on the resulting Cartesian granule feature models. This
analysis involves the systematic sampling of the possible model space in the following
ways and subsequently measuring the accuracy of the resulting model on a test dataset:
use different linguistic partitions of input variable universes; vary the feature
dimensionality of the Cartesian granule features; vary the type of rule used to
aggregate; use different linguistic partitions of the output variable's universe (in the
case of prediction problems). This analysis provides insights on how to model a
problem using Cartesian granule features. Furthermore, this chapter provides a useful platform for understanding many other learning algorithms that may or may not explicitly manipulate fuzzy events or probabilities. For example, in this chapter it is shown how a naïve Bayes classifier is equivalent to crisp Cartesian granule feature classifiers under certain conditions. Other parallels are also drawn between learning approaches such as decision trees [Quinlan 1986; Quinlan 1993] and the data browser [Baldwin and Martin 1995; Baldwin, Martin and Pilsworth 1995]. The example
problems detailed in this chapter lay the foundation for the next chapter, which
describes real world applications of Cartesian granule feature modelling in the fields of
medical decision support, computer vision and control.

This chapter is organised as follows: The first section describes the format for the
experiments and analyses that are described in subsequent sections. Sections 10.2 and
10.3 provide a detailed analysis of Cartesian granule feature modelling for a
classification problem and for a prediction problem respectively. Section 10.4 describes
the application of Cartesian granule feature modelling to a noisy and sparse problem -
the L problem. Finally, an overall discussion on the application of Cartesian granule
features to these artificial problems is presented in Section 10.5.

10.1 EXPERIMENT VARIABLES AND ANALYSIS

The example problems to follow, namely the ellipse problem (Section 10.2) and the sin(x*y) problem (Section 10.3), contain two base (problem) input variables, namely X
and Y, and one output (predicted or dependent) variable. All problems are sufficiently
small, permitting the examination of a significant portion of the possible Cartesian
granule feature models. The purpose of these experiments is to investigate the impact of
different decision variables on the induced Cartesian granule feature model, most of
which lie within the feature discovery process of the G_DACG algorithm. Models
consisting of Cartesian granule features with various levels of granulation, granule
characterization and feature dimensionality are manually and systematically sampled.
Due to resource constraints (time and computing power), the analysis is limited to the
Cartesian granule features where the underlying abstractions of all base feature
universes (within a single Cartesian granule feature) are equivalent; though for the
investigation into data-driven approaches to partitioning, this assumption is dropped.
The examined model sample space represents only a very small proportion of the
infinite abyss of possible models.

In the case of both problems, the use of both one and two dimensional Cartesian
granule features formed over the problem input features X and Y is examined. The
granularity of the partitions is varied from coarse (few fuzzy sets) to very fine (many
fuzzy sets). The finer the granularity, the better the powers of prediction, although
empirical evidence tends to suggest that there is a threshold on the number of fuzzy
sets, above which no significant gains are made in terms of model accuracy. This
threshold will vary from problem to problem. For the results presented here,
granularities in the interval [2, 20] were considered, bearing in mind that if the
partitioning is too fine, model generalisation will suffer. This is more succinctly stated
in the principle of generalisation [Baldwin 1995]: "The more closely we observe and
take into account the detail, the less we are able to generalise to similar but different
situations... ". The effect of the following granule characterisations is observed:
triangular fuzzy sets; crisp sets; and trapezoidal fuzzy sets with differing degrees of
overlap. As presented previously, different rule structures lead to different Cartesian
granule feature models. Evidential logic rules lead to additive models and conjunctive
rules lead to product models. Both rule structures are examined here. Table 10-1
summarises the decision variables and their respective values that are investigated.

The analysis of the results takes place at two levels: firstly, assorted Cartesian granule feature models are compared amongst themselves; and secondly, Cartesian granule feature models are compared with the results of other learning approaches, such as decision trees, neural networks and fuzzy models. This analysis provides a useful platform for understanding learning algorithms that may or may not explicitly manipulate fuzzy events or probabilities.

For both problems, the experimental results are presented as follows: first the use of
different two-dimensional feature models is examined (in terms of linguistic partitions
where the fuzzy sets are characterized by triangular, crisp and trapezoidal fuzzy sets);
then the use of various one dimensional feature models is studied; subsequently the
results of other learning approaches are contrasted with those of Cartesian granule
feature modelling; finally each problem section finishes with a discussion of the results.
In the case of the ellipse problem, the impact of using alternative approaches to
generating linguistic partitions is also investigated.

Table 10-1: Decision variables (and possible choices) analysed in the context of Cartesian granule feature model construction for three artificial problems.

Decision Variable                   Possible Values
Dimensionality of CG features       {1, 2}
Granule characterisation            Triangular, crisp and trapezoidal with various degrees of overlap (10%, 20%, ..., 100%)
Granularity                         [1, 20] fuzzy sets
Linguistic partition construction   Heuristic, data-driven
Rule structure                      Evidential, Conjunctive
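Read as an experiment grid, Table 10-1 amounts to enumerating and scoring a large set of model configurations; the sketch below spells this out (granularities follow the [2, 20] range actually explored in the text, and the data-driven partitioning variants are omitted for brevity).

from itertools import product

dimensionalities = [1, 2]
characterisations = (["triangular", "crisp"] +
                     [("trapezoidal", overlap / 10) for overlap in range(1, 11)])
granularities = range(2, 21)
rule_structures = ["evidential", "conjunctive"]

experiments = list(product(dimensionalities, characterisations,
                           granularities, rule_structures))
print(len(experiments), "candidate model configurations")
# Each configuration would be trained on the training set and scored on the
# test dataset, as described in the surrounding text.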

10.2 ELLIPSE CLASSIFICATION PROBLEM

Before presenting the results of the analysis, the ellipse problem is presented in brief again for convenience. The ellipse problem is a binary classification problem based upon artificially generated data from the real universe R x R. Points satisfying the ellipse inequality, x² + y² ≤ 1, are classified as Legal, while all other points are classified as Illegal. The two domain input features, X and Y, are defined over the universes Ω_X = [-1.5, 1.5] and Ω_Y = [-1.5, 1.5] respectively. Different training, control (validation) and test datasets, consisting of 1000, 300 and 1000 data vectors respectively, were generated using a pseudo-random number stream. An equal number of data samples for each class were generated. Each data sample consists of a triple <X, Y, Class>, where Class adopts the value Illegal or Legal.

10.2.1 An example of ACGF modelling for the ellipse problem


A detailed example of one experiment is presented here outlining how a particular type
of Cartesian granule feature model can be used to linguistically represent an ellipse.
This example serves as a template for problems tackled by Cartesian granule feature
models and their results, as presented in subsequent sections. Each experiment consists
of five steps outlined below. As alluded to previously, the language identification phase of modelling (Cartesian granule feature selection), corresponding to steps (i) and (ii) below, is performed manually. Parameter identification, corresponding to steps (iii) to
(v) below, is performed automatically using steps 3 and 4 of the G_DACG algorithm
(Section 9.1.1). For the experiment described subsequently, a Cartesian granule feature
model is constructed in terms of one two-dimensional Cartesian granule feature. The
following steps overview this process:

(i) Select Cartesian granule features: The use of one two-dimensional


Cartesian granule feature consisting of the base input features X and Y
is examined.

(ii) Determine the granularity of base features in each Cartesian granule


feature: In this case the linguistic partitions of the base features are
characterised by six uniformly placed trapezoidal fuzzy sets that
overlap to a degree of 0.5 (50% overlap). The linguistic partitions of
CHAPTER 10: ANALYSIS OF CARTESIAN GRANULE FEATURE MODELS 244

the universes of the input variables X and Y are defined in Figure 10-
2 (a corresponding graphic depiction is presented in Figure 10-1).

(iii) Learn Cartesian granule fuzzy sets: Subsequently, a Cartesian


granule fuzzy set is learned for each of the legal and illegal classes.
The Cartesian granule fuzzy sets corresponding to the legal and
illegal classes, when the base feature universes were partitioned using
the above linguistic partitions, are depicted graphically in Figure 10-
4. In both figures each grid point corresponds to a Cartesian granule
and its associated membership value. This isomorphic relationship
that exists between the class structures, as represented in Cartesian
format (raw attribute values), and the graphic representation of the
respective Cartesian granule fuzzy sets adds a somewhat intuitive
meaning and interpretation to Cartesian granule fuzzy sets.

(iv) Generate rule set: These Cartesian granule features and learnt class
fuzzy sets are then incorporated directly into the body of the
respective classification rules. In this case since the model only
consists of one feature, the conjunctive rule and the evidential logic
rule will have equivalent behaviour. However, the evidential logic
rule has another degree of freedom, made available through the filter,
which could allow a more accurate modelling of a problem domain.
For this experiment, however, the filter is set to the identity filter, i.e. f(x) = x. The generated rule set for this problem is presented in Figure 10-3.

(v) Estimate the accuracy of the generated model: The effectiveness of


the learnt model is measured based on the accuracy of that model on
the test dataset. In addition to model accuracy, a decision boundary or
surface (in the case of prediction problems) is also graphically
presented. In this experiment the classification accuracy of the
induced model is 96.5% and the corresponding decision boundary is
depicted in Figure 10-5, where the shaded region corresponds to the
predicted legal class, while the unshaded region corresponds to the
predicted illegal class. The true ellipse is superimposed on the
predicted results to illustrate the accuracy of the model. The fuzzy sets used to partition the base variable universes Ω_X and Ω_Y are also shown beneath and on the left of the classification area respectively. In Figure 10-5 an example of a linguistic term, yAround_Neg1.25, on Ω_Y is denoted by a fuzzy set on the vertical axes.
results, linguistic terms are omitted from graphs to avoid clutter. In
this experiment, the extracted model forms a good approximation of
the ellipse as shown in Figure 10-5, though there are regions where
false negatives and false positives occur. In terms of area (this
measure only applies in classification problems), measured in
Cartesian space, the extracted model yields an error rate of around
3.5%. In subsequent sections, accuracy of the extracted models is
presented only in terms of test datasets as opposed to area

percentages as this is more representative (and controllable), especially where classes do not occupy equal size areas (as is the case in the ellipse problem).

[Plot omitted: the trapezoidal fuzzy sets xAround_Neg1.25, ..., xAround_1.25 over Ω_X = [-1.5, 1.5].]

Figure 10-1: A linguistic partition of the variable universe Ω_X, where the granules are characterised by trapezoidal fuzzy sets with 50% overlap.

Px: ( (xAround_Neg1.25 [-1:1 -0.75:0])
      (xAround_Neg0.75 [-1.25:0 -1:1 -0.5:1 -0.25:0])
      (xAround_Neg0.25 [-0.75:0 -0.5:1 0:1 0.25:0])
      (xAround_0.25 [-0.25:0 0:1 0.5:1 0.75:0])
      (xAround_0.75 [0.25:0 0.5:1 1:1 1.25:0])
      (xAround_1.25 [0.75:0 1:1]) )
And
Py: ( (yAround_Neg1.25 [-1:1 -0.75:0])
      (yAround_Neg0.75 [-1.25:0 -1:1 -0.5:1 -0.25:0])
      (yAround_Neg0.25 [-0.75:0 -0.5:1 0:1 0.25:0])
      (yAround_0.25 [-0.25:0 0:1 0.5:1 0.75:0])
      (yAround_0.75 [0.25:0 0.5:1 1:1 1.25:0])
      (yAround_1.25 [0.75:0 1:1]) )

Figure 10-2: Linguistic partitions Px and Py of the variable universes Ω_X and Ω_Y respectively, where each granule is characterised by trapezoidal fuzzy sets with 50% overlap.
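The partitions in Figure 10-2 can be generated programmatically; the sketch below reproduces the six-set, 50%-overlap partition over [-1.5, 1.5] and collapses to a crisp partition at 0% overlap, although the book's exact handling of other overlap values and of the universe boundaries may differ (trapezoidal_partition is a name introduced here).

def trapezoidal_partition(lo, hi, n, overlap=0.5):
    # Each set is returned as (left foot, left shoulder, right shoulder,
    # right foot); membership is 1 between the shoulders and falls linearly
    # to 0 at the feet.
    step = (hi - lo) / n                         # spacing between set centres
    sets = []
    for i in range(n):
        centre = lo + (i + 0.5) * step
        half_core = step / 2.0
        rise = overlap * step
        sets.append((centre - half_core - rise, centre - half_core,
                     centre + half_core, centre + half_core + rise))
    return sets

# Six sets over Omega_X = [-1.5, 1.5] with 50% overlap, as in Figure 10-2:
for s in trapezoidal_partition(-1.5, 1.5, 6, overlap=0.5):
    print([round(v, 2) for v in s])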

10.2.2 Ellipse classification using 2D Cartesian granule features


In the previous section, a prototypical experiment of Cartesian granule feature
modelling in the ellipse domain was presented in terms of a two-dimensional feature
model. Here, the use of other types of two-dimensional Cartesian granule features is
investigated. This investigation begins by exploring the use of uniformly placed,
mutually exclusive, triangular fuzzy sets as a means of partitioning the base feature
universes. The granularity of the base feature universes is varied uniformly across each
feature (i.e. same number of fuzzy sets in each partition). Levels of granularity ranging
from 2 to 20 were investigated and the results achieved using the learnt models on unseen test data are plotted in Figure 10-6. For convenience, the top right hand corner of Figure 10-6 (and of subsequent result graphs) is used to denote the problem being
addressed and the type of Cartesian granule feature model being used to solve it. In the
case of Figure 10-6, the graph presents results for the ellipse problem where the
underlying models consist of one two-dimensional Cartesian granule feature. The
horizontal axis corresponds to the granularity of the base universes and is expressed in
terms of the number of fuzzy sets used. The vertical axis represents the level of
accuracy obtained by the corresponding model. To avoid repetition, it is assumed for
the remainder of this chapter, unless otherwise stated, that result graphs of this type
follow this presentation format. Figure 10-7 shows the ellipse decision boundaries that
were achieved using models where the granularities of the underlying base features
were varied from two to ten. At a granularity level of seven (see Figure 10-7 (f)), the
extracted model starts to fit the ellipse but it is not until a granularity level of about nine
that a good fit is achieved: with an error rate of about 4.8%. Notice that the model
accuracies oscillate (especially in the lower levels). This oscillation is primarily due to
the "lucky fit" of the triangular sets, which have broader support for lower levels of
granularity. This "luck); fit" is more apparent in the case of crisp granules that are
presented subsequently.

[Figure content omitted: the conjunctive classification rules for Legal and Illegal, each expressed in terms of the two-dimensional (X, Y) Cartesian granule fuzzy set of the corresponding class.]

Figure 10-3: A possible rule set for the ellipse problem in terms of two-dimensional Cartesian granule features. See Figure 10-4 for a close-up version of the fuzzy sets in this model.

Next the use of words that are characterised by trapezoidal fuzzy sets is examined, as a
means of partitioning the base feature universes. This type of linguistic partition is not
mutually exclusive. Again the use of one two-dimensional Cartesian granule feature
formed over the base input features X and Y is explored. The trapezoidal fuzzy sets were positioned uniformly over the base universes, varying the trapezoidal overlap
factor from 100% overlap to 0% (0% overlap corresponds to a crisp partition). Figure
10-8 depicts the results obtained using linguistic partitions generated by trapezoidal
fuzzy sets with the following degrees of overlap: 100% overlap (curve named T=1.0),
50% overlap (curve named T= 0.5) and no overlap (curve named crisp i.e. T = 0.0).
Again the granularity of the base input feature universes was varied from 2 to 20 fuzzy
sets.

[Plots omitted: surface plots of the (a) Legal and (b) Illegal Cartesian granule fuzzy sets.]

Figure 10-4: Graphic representation of (a) Legal and (b) Illegal Cartesian granule fuzzy sets where each grid point corresponds to a Cartesian granule and its associated membership.

[Plot omitted: the predicted decision boundary with the true ellipse superimposed, and the trapezoidal fuzzy sets of the X and Y partitions shown along the axes.]

Figure 10-5: Decision boundary for the ellipse problem using a two-dimensional Cartesian granule feature model, where the base feature universes were partitioned using six uniformly placed trapezoidal fuzzy sets with 50% overlap.

Figure 10-6: Classification results for the ellipse problem using one 2D Cartesian
granule feature, where triangular fuzzy sets were used to partition the base features.

In general, the use of fuzzy sets as a means of linguistically quantising the base feature
universes gives better results than those obtained using crisp sets. The results shown in Figure
10-8 empirically support this claim. The decision boundaries of models using crisp
Cartesian granule features lie along the boundaries of the linear crisp granules and thus
it becomes more difficult to model problems other than those with a stepwise linear
decision boundary. Decision tree approaches (ID3/C4.5 [Quinlan 1986]) yield similar
piecewise linear boundaries. This is clearly depicted in Figure 10-9 where the decision
boundaries of various learnt models that use crisp granules are presented. Nevertheless,
as the granularity increases, the Cartesian granules will better fit the surface boundary
for the problem, thereby reducing the model error. But with this increased model
accuracy comes a high complexity cost, which may prove intractable in more complex
systems, and may lead to overfitting.

Figure 10-7: A montage of ellipse decision boundaries generated by models consisting of one 2D Cartesian granule feature, where various numbers of triangular fuzzy sets were used to partition the base features: (a) 2 fuzzy sets; (b) 3 fuzzy sets (everything is classified as illegal); (c) 4 fuzzy sets; (d) 5 fuzzy sets; (e) 6 fuzzy sets; (f) 7 fuzzy sets; (g) 8 fuzzy sets; (h) 9 fuzzy sets; (i) 10 fuzzy sets.

Figure 10-8: Classification results for the ellipse problem using one 2D Cartesian
granule feature, where the base feature universes have been partitioned using
trapezoidal fuzzy sets with various degrees of overlap.

Figure 10-9: (a) Decision boundary for the ellipse problem using a two-dimensional Cartesian granule feature model, where the base feature universes were partitioned using 3 crisp sets; and (b) with 10 crisp sets.

Figure 10-11 and Figure 10-12 present the model classification accuracies obtained
using different two-dimensional Cartesian granule features with varying base feature
granularities where the granules are characterised by trapezoidal fuzzy sets with
different degrees of overlap, ranging from 0% overlap (curve named crisp i.e. T = 0.0)
to 100% overlap (curve named T = 1.0). In Figure 10-14, graphs (a)-(k) illustrate the
effect of the overlap rate on the decision boundary. These are contrasted with the
decision boundary generated by a model where the granule characterisation is a
triangular fuzzy set, as depicted in Figure 10-14(l). In general, for the ellipse problem,
granules characterised by trapezoids with overlapping degrees of between 50% and
70% yield models that fit the ellipse adequately (i.e. error rates in terms of misclassified
area of around 3%) with very few words (five words) used in the linguistic partition of
the base feature universes. Figure 10-10 depicts a model with accuracy of 98% using
seven words that are characterised by trapezoidal fuzzy sets with an overlap degree of
60%. As Figure 10-10 depicts, the misclassified areas correspond to false positive areas
for the ellipse class. This is one of the best results obtained using relatively
parsimonious/succinct linguistic partitions (well inside Miller's magic number of
7 ± 2 concepts [Miller 1956]). Furthermore, when compared with triangular-based
partitions, the use of trapezoidal-based partitions tends to yield models which are more
parsimonious and which better fit the problem. Figure 10-13 contrasts the results
obtained using models that use trapezoidal-based partitions with overlap rates of 0%
(crisp case) and 50%, with models that use triangular-based partitions.

Figure 10-10: Decision boundary for the ellipse problem using a two-dimensional
Cartesian granule feature model, where the base feature universes were partitioned
using 7 trapezoidal fuzzy sets with an overlap rate of 60%.

Figure 10-11: Classification results for the ellipse problem using two-dimensional
Cartesian granule features where the base feature universes are partitioned with
trapezoidal fuzzy sets with various degrees of overlap, ranging from 0% (curve named
crisp) to 50% (curve named T=0.5).

The two-dimensional features presented here represent only a very small proportion of
the abyss of possible two-dimensional features. For example, it is possible to use
features in which the base attribute universes could have been partitioned with different
types of fuzzy set, different numbers of fuzzy sets, and data centred partitioning.

Figure 10-12: Classification results for the ellipse problem using two-dimensional
Cartesian granule features where the base feature universes were partitioned with
trapezoidal fuzzy sets of various overlapping degrees (from 50% to 100%).

Figure 10-13: Comparison of classification results for the ellipse problem using two-
dimensional Cartesian granule features where the base feature universes are
partitioned with triangular and trapezoidal fuzzy sets.

Figure 10-14: A montage of decision boundaries for the ellipse problem using an
assortment of two-dimensional Cartesian granule feature models, where the base
feature universes were partitioned with a granularity of five as follows: (a) - (k)
Trapezoidal fuzzy sets where the degree of overlap varies from 0 to 100% in steps of
10%; (l) Triangular fuzzy sets.

10.2.2.1 Ellipse classification using 1D Cartesian granule features


The use of various types of one-dimensional Cartesian granule feature is examined
subsequently. In this case, each class rule consists of two one-dimensional features that
are based upon the X and Y features respectively. Once again, the use of mutually
exclusive triangular fuzzy sets that were placed uniformly across the base feature
universes is explored initially. Levels of granularity ranging from 2 to 20 were
investigated. The results obtained using the learnt models to classify unseen test data
are graphed in Figure 10-15 (curve named Triang). In this case the Cartesian granule features were incorporated into evidential logic rule structures. The weights of importance associated with each feature were estimated using semantic discrimination analysis (Section 9.4). The one-dimensional models in this case yield accuracies of
around 90%. The best results were obtained from models with a granularity level of 8
and where the granules were characterised by trapezoidal fuzzy sets with an overlap degree of 30%. Such models yielded an impressive accuracy of 96.6%. However, overall, the performance of these one-dimensional Cartesian granule features is less favourable when compared with their two-dimensional counterparts (compare Figure
10-13 and Figure 10-15). This drop in model accuracy results from the decomposition
error that arises when two-dimensional features are decomposed into one-dimensional
features. Figure 10-16(a) and Figure 10-17(a) illustrate some of the typical decision
boundaries achieved using these types of models. In general, the extracted models find
it difficult to capture the curvilinear nature of the ellipse's boundary. In fact it takes a
granularity level of around 11 to achieve a respectable boundary (which is still
somewhat jagged). On the other hand, when the one-dimensional Cartesian granule
features were combined using conjunctive (product) rule structures the model
accuracies decreased by a couple of percentage points (see Figure 10-18 - curve named
ConTriang). Figure 10-16(b) presents a typical decision boundary for conjunctive
models with triangular granule characterizations. In general, using the conjunctive rule
structure, as a means of combining the features, does not produce any false positives for the ellipse class (see Figure 10-16(b) for example), unlike the evidential logic rule (Figure 10-16(a)). This property of producing no false positives may prove desirable in certain mission-critical systems.
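The two ways of combining the one-dimensional features can be sketched as follows, using single support values rather than Fril support pairs; the weights and supports shown are illustrative numbers, not the values estimated by semantic discrimination analysis.

def evidential_support(feature_supports, weights):
    # Evidential logic rule: weighted aggregation of the per-feature class supports
    # (weights are assumed to sum to 1).
    return sum(w * s for w, s in zip(weights, feature_supports))

def conjunctive_support(feature_supports):
    # Conjunctive (product) rule: every feature is treated as equally important.
    product = 1.0
    for s in feature_supports:
        product *= s
    return product

supports_legal = [0.7, 0.9]     # supports from the X-based and Y-based features
weights = [0.4, 0.6]            # e.g. the Y feature weighted more heavily
print(evidential_support(supports_legal, weights))   # 0.82
print(conjunctive_support(supports_legal))           # 0.63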

Figure 10-15: Comparison of classification results for the ellipse problem using two
one-dimensional Cartesian granule features where the base feature universes are
partitioned with triangular and trapezoidal fuzzy sets. The Cartesian granule features are combined using the evidential logic rule.

The use of two one-dimensional Cartesian granule features where the underlying
granules are characterised by trapezoidal fuzzy sets is now examined. The trapezoidal
fuzzy sets were distributed uniformly over the base universes, varying the trapezoidal
overlap factor from 100% overlap to 0% (i.e. a crisp partition). A granularity range of
[2, 20] was investigated with uniformly positioned trapezoidal fuzzy sets with varying

overlap. A subset of the results is presented and is restricted to the following types of
granules: trapezoids with the best overlap rate; crisp granules; and trapezoidal granules
with 100% overlap. This should give some indication of the accuracies attainable with
different degrees of overlap. Figure 10-15 presents results where the evidential logic
rule structure was used as a means of combining the supports of the individual
Cartesian granule features. The classification results plotted correspond to models
where the underlying granules were characterised by trapezoidal fuzzy sets with an
overlap degree of 100% (curve named T=1.0), with an overlap degree of 30% (curve named T=0.3), and no overlap (curve named crisp, i.e. T=0.0). The extracted evidential logic rule models once again outperform their conjunctive counterparts, yielding results in the low 90s (see Figure 10-18 for a comparison and the next paragraph for an explanation). This is due primarily to the fact that the Y-based Cartesian granule feature is more discriminating than the X-based feature, as the ellipse is horizontally oblong. Consequently, this Cartesian granule feature is given a higher weight (via semantic discrimination analysis) within the evidential reasoning process, resulting in better model accuracies than the conjunctive rule structure, which treats the features as equally important. Figure 10-17 gives an indication of the nature of the decision boundary generated by evidential and conjunctive rule structures in this case.

Figure 10-16: Decision boundaries when (a) the evidential logic rule and (b) the
conjunctive rule are used as a means of combining one-dimensional Cartesian granule
features for the ellipse problem. A granularity level of 11 was used on each base
feature universe. The granules were characterised by triangular fuzzy sets.

Overall, in the case of one-dimensional Cartesian granule features, regardless of the rule structure, the use of granules that are characterised by trapezoidal fuzzy sets outperforms their triangular counterparts. This comparison of rule structure and granule membership function shape is presented in Figure 10-18, where the named curves denote the following: ELTriang corresponds to the use of the evidential logic rule with triangular-based granules; ELT=0.3 corresponds to the use of the evidential logic rule with trapezoidal-based granules with an overlap of 30%; ConTriang corresponds to the use of the conjunctive rule with triangular-based granules; ConT=0.2 corresponds to the use of the conjunctive rule with trapezoidal-based granules overlapping to a degree of 20%. On the whole, the use of the evidential logic rule, where the granules are trapezoidal with 30% overlap, gives the best results.

Figure 10-17: (a) Decision boundary when an evidential logic rule is used as a means
of combining one-dimensional Cartesian granule features for the ellipse problem. A
granularity level of 10 was used on each base feature universe. The granules were
characterised by trapezoidal fuzzy sets with an overlap degree of 30%. (b) Decision
boundary when a conjunctive rule is used and the granules were characterised by
trapezoidal fuzzy sets with an overlap degree of 20%.

Figure 10-18: A comparison of using the evidential logic rule vs. the conjunctive rule
as a means of combining one-dimensional Cartesian granule features for the ellipse
problem, where the base feature universes are partitioned with triangular and
trapezoidal fuzzy sets.

10.2.3 Data centred Cartesian granule features


In the previous sections various types of Cartesian granule features were investigated
where the underlying feature partitions were uniform in nature. Here, however, the
presentation briefly digresses to investigate the use of partitions that are generated by
data-centred approaches and also the use of granule merging (commonly known as pruning in the literature) in order to enhance the generalisation and transparency of learnt Cartesian granule feature models. This is done in the context of modelling the
ellipse problem with two-dimensional Cartesian granule features. Two data-driven
approaches to generating partitions of the problem feature universes are investigated: a
percentile-based approach; and a clustering approach. Subsequently, pruning is
examined, that is, where neighbouring granules are merged.

10.2.3.1 Cartesian granule features using percentile-based partitions


Firstly, generating partitions using data percentiles is investigated. Here the cluster
centres were generated for each base variable (i.e. all class data was considered together
- heterogeneous) as follows. The data for each feature was sorted and then split so that the examples are distributed uniformly across the partition sets. The boundary points of each partition set are then used to generate fuzzy partitions of the base universes. Class Cartesian granule fuzzy sets and rules are subsequently learned in terms of the corresponding Cartesian granule features. Figure 10-19 presents some of
the interesting results obtained using two-dimensional Cartesian granule features whose
base feature universes were partitioned as described above. Overall, the heterogeneous
percentile approach to partitioning does not perform as well as the uniform approach
(see the curve labelled Uniform T=0.5 in Figure 10-19). However, heterogeneous
percentile based partitioning does provide granules characterised by crisp sets with a
significant boost in model accuracy. This is mainly due to the focusing of the crisp sets,
in the case of the legal class, in areas where data exist.
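A minimal sketch of the percentile-based step described above follows: the feature values are split so that each partition set receives roughly the same number of examples, and the resulting boundary points (or bin midpoints) can then seed a fuzzy partition. The function names are illustrative.

import numpy as np

def percentile_boundaries(values, granularity):
    # Boundary points that split `values` into `granularity` equally populated bins.
    return np.percentile(values, np.linspace(0, 100, granularity + 1))

def percentile_centres(values, granularity):
    # Bin midpoints, usable as the peaks of triangular fuzzy sets
    # (or the cores of trapezoids).
    b = percentile_boundaries(values, granularity)
    return (b[:-1] + b[1:]) / 2.0

x_data = np.random.default_rng(0).normal(0.5, 0.15, size=1000)  # stand-in feature data
print(percentile_centres(x_data, 7))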

Figure 10-19: Ellipse classification using two-dimensional CG features where the partitions were generated using the one-dimensional percentile approach.

10.2.3.2 Cartesian granule features using clustering-based partitions


The use of clustering techniques is examined as a means of generating partitions in the
input feature universes. Any of a number of clustering techniques such as fuzzy c-
means (FCM) [Bezdek 1981], Kohonen [Kohonen 1984], LVQ [Bezdek 1981] could be
used to cluster the input feature data. Here, the FCM clustering algorithm is used.
Clustering is considered at different levels of dimensionality and homogeneity, where
dimensionality refers to the number of variables considered for clustering at one time
and where homogeneity refers to whether all classes are clustered together or whether
each class is clustered individually. Homogeneous clustering, while facilitating the
extraction of knowledge in terms of constructs which best capture the structure of the
underlying training data, may lead to over fitting (i.e. fitting the class too tightly and
thereby, leaving little room for generalisation). In general, the number of cluster
centres is manually input, but could quite easily be determined automatically (see
[Sugeno and Yasukawa 1993] in the case of FCM). In the case of the G_DACG
algorithm, the language identification step provides the number of cluster centres. The
provided cluster centres are then used to generate partitions of the respective base
universes as described previously in Section 4.1.1.4. Words can subsequently be fitted
to the fuzzy sets either automatically from a predefined dictionary or by the user. Class
Cartesian granule fuzzy sets are generated in terms of these words. The ultimate goal is
to extract cluster centres, which partition the individual universes, in such a way that
good, parsimonious and intuitive linguistic descriptions of concepts can be extracted
from the example data that model the system effectively. In other words, the goal is to
extract anthropomorphic knowledge descriptions of the system that are effective in
modelling the system.
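As a concrete illustration of the clustering route, the sketch below runs a bare-bones fuzzy c-means on a single base feature and returns the sorted cluster centres, which can then be used as the peaks of a triangular partition. It is a simplified stand-in for the FCM algorithm of [Bezdek 1981], with the fuzzifier m fixed at 2 and a fixed number of iterations; it is not the implementation used in the book's experiments.

import numpy as np

def fcm_1d(x, n_clusters, m=2.0, iters=100, seed=0):
    # Bare-bones fuzzy c-means on a 1D array x; returns sorted cluster centres.
    rng = np.random.default_rng(seed)
    u = rng.random((len(x), n_clusters))
    u /= u.sum(axis=1, keepdims=True)                 # memberships sum to 1 per point
    for _ in range(iters):
        um = u ** m
        centres = (um * x[:, None]).sum(axis=0) / um.sum(axis=0)
        d = np.abs(x[:, None] - centres[None, :]) + 1e-12
        inv = d ** (-2.0 / (m - 1.0))
        u = inv / inv.sum(axis=1, keepdims=True)      # standard FCM membership update
    return np.sort(centres)

y_data = np.random.default_rng(1).random(500)         # stand-in for a base feature
print(fcm_1d(y_data, n_clusters=7))                   # peaks for a triangular partition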

10.2.3.2.1 Ellipse classification using one-dimensional clustering based Cartesian granule features
This section briefly illustrates the application of "single feature" clustering approaches
in generating partitions of the base feature universes. For single feature clustering, two
cases are considered: (1) where cluster centres were generated for each feature
independently using the FCM clustering algorithm (i.e. heterogeneous clustering); and
(2) where cluster centres for each class over each feature universe were generated
independently (i.e. homogeneous clustering). Subsequently, linguistic partitions were
created using the extracted cluster centres. These partitions were then used in
conjunction with two-dimensional Cartesian granule features. Figure 10-20 presents
the results obtained using two-dimensional Cartesian granule features where the base
feature universes were partitioned as described above for a fixed granularity of seven
and where the granules were characterised by various types of fuzzy sets (see the X axis
in Figure 10-20). By and large, the "single feature clustering" based models do not
perform as well as their uniform counterparts (compare curves labelled
1DHetroClustering and 1DHomogClustering, representing features generated using heterogeneous and homogeneous clustering respectively, and Uniform in Figure 10-20).

10.2.3.2.2 Ellipse classification using two-dimensional clustering based Cartesian granule features
The use of multidimensional clustering in generating triangular-based partitions of the
base feature universes is examined here. The cluster centres for each class were
generated independently (homogeneous clustering) using the FCM clustering
algorithm. These cluster centres were then used to generate mutually exclusive
triangular-based partitions of the base universes. Table 10-2 presents some of the more
interesting results obtained using two-dimensional Cartesian granule features where the
base feature universes were partitioned as described above. The performance of the
models using multidimensional clustering compares very favourably to models that use
uniformly partitioned features (compare columns 3 and 4 in Table 10-2). Forming
Cartesian granule features using multi-dimensional clustering can lead to close-lying
cluster centres when the multi-dimensional cluster centres are projected on the
individual universes. Consequently, the next variation in the approach is to merge
close-lying cluster centres.

Figure 10-20: Ellipse classification using CG features where the underlying feature partitions are generated using uniform and various clustering approaches. The granularity of the feature universe partition was fixed at seven.

10.2.3.3 Pruning Cartesian granules


Cluster centres which were within 2% of the domain range of each other were merged.
The merging of neighbouring cluster centres results in the generation of a new cluster
centre, which is the midpoint between the merged centres. Table 10-3 gives the results
obtained using the pruned models and also indicates the corresponding reduction. Here
cluster elimination/merging can be viewed as a complexity reduction. While achieving
a reduction in the granularity of the Cartesian granule universe, this also results in a
reduction of the model accuracy. The non-merged approach outperforms the merged
approach, as indicated by the results in Table 10-2 and Table 10-3. However, this reduction in accuracy may be tolerable for more complex systems, since it may be a way of producing a model that is comptractable (computationally tractable) and comprehensible, while also performing satisfactorily in terms of accuracy.
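The merging rule used here can be sketched as follows: projected one-dimensional cluster centres lying within 2% of the domain range of each other are replaced by their midpoint. The function name and the greedy left-to-right pass are illustrative.

def merge_close_centres(centres, domain_lo, domain_hi, tol=0.02):
    # Merge sorted 1D cluster centres that lie within `tol` (a fraction of the
    # domain range) of each other, replacing each close pair by its midpoint.
    threshold = tol * (domain_hi - domain_lo)
    merged = []
    for c in sorted(centres):
        if merged and abs(c - merged[-1]) <= threshold:
            merged[-1] = (merged[-1] + c) / 2.0
        else:
            merged.append(c)
    return merged

print(merge_close_centres([0.10, 0.11, 0.45, 0.46, 0.90], 0.0, 1.0))
# -> [0.105, 0.455, 0.9]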

Other forms of pruning are also possible, but are not discussed here, such as logical
merging of granules. For example, neighbouring granules (in the projected one-
dimensional sense) that exhibit similar membership levels could be merged. Similarly,
modified entropy algorithms as used in decision tree pruning [Quinlan 1993] could be
used here to logically merge neighbouring granules. Pruning in this way is an example
of how to exploit the tolerance for imprecision and uncertainty, while achieving
tractability, robustness and low solution cost, one of the guiding principles of soft
computing [Zadeh 1994].

Table 10-2: A comparison of classification results using Cartesian granule features CGFxy based upon two-dimensional clustering and uniform partitioning.

CG Features (triangular fuzzy sets)    Granularity    % Accuracy (2D-Clustering)    % Accuracy (Uniform)
2D using X, Y                          5              87                            81.9
2D using X, Y                          7              93                            93.2
2D using X, Y                          9              94                            94.4
2D using X, Y                          12             92                            95.6

Table 10-3: Classification results using Cartesian granule features CGFxy based upon two-dimensional clustering partitioning (after pruning).

CG Feature (triangular fuzzy sets)    Granularity    % Accuracy (2D-Clustering)    Granularity on Ωx (legal/illegal)    Granularity on Ωy (legal/illegal)
2D using X, Y                         5              88                            3/4                                  5/5
2D using X, Y                         7              91                            5/5                                  5/7
2D using X, Y                         9              90                            5/7                                  6/6
2D using X, Y                         12             88                            5/7                                  5/7

10.2.4 A G_DACG run on the ellipse problem


In the previous sections, the language of Cartesian granule feature models has been
determined manually, while automatically identifying the parameters of the model. The
space of possible Cartesian granule feature models was manually sampled, resulting in
the construction and analysis of both one- and two-dimensional models (but not
mixtures). In contrast, in the previous chapter the G_DACG algorithm was used to
search through the space of possible models to discover the language (Cartesian granule
features) and parameters of the model automatically. The discovered models (mixtures
of one- and two-dimensional features), act as a yardstick against which the models
investigated in this chapter can be evaluated. Overall, the G_DACG discovered models
yield higher levels of accuracy (99%) and transparency. See Section 9.7 for more
details.

10.2.5 Ellipse results comparison


The previous sections have presented the results of experiments where Cartesian
granule feature models were used to model the ellipse problem. These models were
constructed automatically (using the G_DACG algorithm) and semi-automatically (that
is, where the language space was sampled manually and the corresponding model
parameters identified automatically). At this point, the results of experiments are
examined, where other supervised learning techniques were applied to the ellipse
problem: the data browser; neural networks; and the mass assignment tree induction
(MATI) algorithm. These approaches were assessed using the same datasets that were
used in Cartesian granule feature modelling. This is followed by a discussion where
these approaches are analysed and compared with the use of Cartesian granule feature
based models.

10.2.5.1 Fuzzy data browser


The data browser is an induction system that automatically extracts rules and fuzzy sets
(one-dimensional) from statistical data [Baldwin and Martin 1995]. See Section 7.5.2.3
for an overview of the data browser. Table 10-4 presents the results achieved when the
data browser was used to generate models of different rule types for the ellipse
problem, i.e. evidential logic and conjunctive rule structures. The data browser, while
yielding high accuracy rates on the test dataset, produces an inaccurate decision
boundary, especially on the boundary areas of high curvature (see Figure 10-21(a)).
This may be due in part to the decomposed nature of the generated rules and features.
In the data browser, when generating fuzzy sets over continuous variables from
corresponding data distributions, it is necessary to discretise the domain or assume that
the data is distributed according to some distribution (such as a Gaussian distribution).
Discretisation is a well-known problem in statistics and machine learning where slightly
different partitions of a domain can lead to significantly different models (distributions,
decision trees etc.) [Baldwin and Pilsworth 1997; Shanahan 1998; Silverman 1986]. To
overcome some of the problems resulting from discretisation, such as discontinuities,
various smoothing algorithms can be used [Baldwin and Martin 1999; Weiss and
Indurkhya 1998]. For the models presented in Table 10-4, the data browser extracted
unsmoothed fuzzy sets corresponding to the legal and illegal classes. However, when a
smoothing algorithm was applied to the extracted data distributions a decision
boundary, as depicted in Figure 10-21(b), resulted. Contrary to the thesis that
smoothing can both improve generalisation and reduce the model/fuzzy set complexity,
in this case, it results in a big drop in performance. This drop in performance is mainly
attributed to the underlying abstraction of the domain using crisp buckets (i.e.
histogram-based) from which the smoothing algorithm cannot recover. The use of
fuzzy granules or buckets in histogramming should result in a significant improvement.
In the unsmoothed case, a data browser induced fuzzy set is similar to a Cartesian granule fuzzy set (i.e. one-dimensional features). In both cases a probability distribution
is generated on crisp/fuzzy granules from the example data. Subsequently, the data
browser converts this distribution to a continuous form by linking up the centre points
of granules. The resulting probability density is then transformed to a fuzzy set.


Smoothing algorithms can also be applied prior to this transformation. For example,
piece-wise regression techniques can be used, where neighbouring points in the
probability distribution with similar characteristics, such as similar derivatives, are
summarized using a line. More sophisticated smoothing algorithms may provide better
results. In the case of Cartesian granule features the granularity of the probability
distribution is maintained after the transformation to a fuzzy set (that is, it is not
converted to a continuous fuzzy set).
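For readers unfamiliar with this transformation, the sketch below shows one standard mass-assignment-style conversion of a probability distribution over granules into a fuzzy set, treating the probabilities as a least prejudiced distribution; the exact procedures used by the data browser and by Cartesian granule feature learning may differ in detail.

def probabilities_to_fuzzy_set(probs):
    # Inverse of the least prejudiced distribution: with the probabilities sorted so
    # that p(1) >= p(2) >= ... >= p(n), membership mu(i) = i*p(i) + sum_{j>i} p(j).
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    sorted_p = [probs[i] for i in order]
    mu = [0.0] * len(probs)
    for rank, idx in enumerate(order):
        i = rank + 1
        mu[idx] = i * sorted_p[rank] + sum(sorted_p[i:])
    return mu

# A frequency distribution over five granules and its fuzzy set counterpart
print(probabilities_to_fuzzy_set([0.4, 0.3, 0.2, 0.1, 0.0]))
# approximately [1.0, 0.9, 0.7, 0.4, 0.0]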

Table 10-4: Classification results using the data browser on the ellipse problem.

Fril Rule Type    % Accuracy    Decision Surface Figures
Conjunctive       93.5          Similar to Figure 10-21(a)
Evidential        94            Figure 10-21(a)


Figure 10-21: (a) Ellipse decision boundary using data browser generated rules and
fuzzy sets with no smoothing; (b) ellipse decision boundary using data browser
generated rules and fuzzy sets with smoothing.

10.2.5.2 Multi-layer perceptron


Two-layered perceptrons of various architectures were applied to the ellipse problem.
The neural networks were implemented with the SNNS simulator from [Zell et al.
1995]. A scaled conjugate gradient (SCG) algorithm [Moller 1993] was used as a
learning algorithm for these feedforward neural networks. SCG learning algorithms, due to their second order nature, tend to find better (local) minima than first order techniques (such as back propagation), but at a higher computational cost. The simulator learning grain was set to "pattern". Table 10-5 presents the results
obtained when perceptrons with hidden layers of different sizes were used to model the
ellipse problem. The number of hidden nodes was varied from two to five and this is
indicated in the network architecture column in Table 10-5. For example, the
architecture 2-2-2 corresponds to the following feed forward network: the network has
two input nodes corresponding to the input features X and Y; it has two hidden nodes;
and 2 output nodes, each corresponding to the output classification of legal or illegal
respectively. The output classification of a data vector is determined by taking the
classification corresponding to the maximum of the output values generated by the data
vector. The decision boundaries generated by the neural network models presented in
Table 10-5 are depicted in a series of graphs; the details of which are given in the
column entitled "Decision boundary figures". The neural network performs very well
in modelling this problem but it does require at least three hidden nodes in order to
yield good classification accuracy.

Table 10-5: Classification results using neural networks on the ellipse problem.

Network Architecture    # of Training Epochs    % Accuracy    Decision boundary figures
2-2-2                   500                     67            Figure 10-22(a)
2-3-2                   500                     99.4          Figure 10-22(b)
2-4-2                   500                     99.5          Figure 10-22(c)
2-5-2                   500                     99.29         Like Figure 10-22(c)


Figure 10-22: Decision boundaries achieved using different multi-layer perceptrons:


(a) perceptron with a 2-2-2 architecture; (b) a perceptron with a 2-3-2 architecture;
and (c) a perceptron with a 2-4-2 architecture.

10.2.5.3 Mass Assignment based decision trees


The mass assignment tree induction algorithm (MATI) [Baldwin, Lawry and Martin
1997] induces probabilistic decision trees over linguistically partitioned universes. The
extracted decision trees can be directly translated into extended Fril rules. Applying the
MATI algorithm to the ellipse problem yields a classification accuracy of 99% on the
unseen test data. In this case, the base features of the induced model were partitioned
uniformly using granules that were characterised by trapezoidal fuzzy sets with an
overlap degree of 50%.

10.2.6 Ellipse problem discussion and summary


The previous sections have presented the results of experiments where different
learning paradigms were used to model the ellipse problem. This section discusses and
summarises the main conclusions of these experiments.

In the case of the Cartesian granule features paradigm, models were constructed
automatically (using the G_DACG algorithm) and semi-automatically (the language
space was sampled manually and the model parameters identified automatically). The
latter formed a basis for evaluating models consisting of Cartesian granule features with
different levels of granulation, granule characterisation and feature dimensionality. Due
to resource constraints (time and computing power), this analysis was limited to
Cartesian granule features where the underlying abstractions of the base feature
universes were equivalent, though for the investigation into data-driven approaches to
partitioning, this assumption was dropped. This sample space represents only a very
small proportion of the infinite abyss of possible models. The following are the main
findings of these experiments on the ellipse problem:

• Overall, granules characterised by trapezoidal fuzzy sets outperformed other characterisations.
• One-dimensional and two-dimensional Cartesian granule features were
investigated with the two-dimensional feature yielding a higher accuracy
on unseen test data. This suggests a necessity to model this higher-
dimensional association in order to avoid decomposition error.
• Generating partitions using data-centred approaches such as clustering
and percentile based techniques can lead to simpler models but can reduce
model generalization. Overall, the use of uniformly positioned fuzzy sets
is computationally more efficient and effective. These uniformly placed
fuzzy sets can subsequently be remapped on to a more natural or
humanistic vocabulary using dictionaries, disjunctions, conjunctions,
linguistic hedges etc. to give the model a more anthropomorphic flavour.
• The investigated models consist solely of either one-dimensional features
or of a two-dimensional feature, but when models consisting of mixed-
dimensional features, resulting from the G_DACG algorithm, are used
they lead to better performance and transparency.
• Pruning, as a means of reducing model complexity, while also enhancing
the extracted model's accuracy and generalisation powers, was briefly
presented but needs further work to illustrate its practical usefulness in
this context.

Despite the uncomplicated nature of the ellipse classification problem, it does serve to
illustrate some of the key differences between the Cartesian granule feature approach
and other supervised machine learning techniques. All of the approaches examined here
do very well in modelling the ellipse problem. Table 10-6 presents a summary of some
of the best results achieved using these approaches. From a generalisation perspective,
the composed approaches, such as the multidimensional Cartesian granule feature
models, MATI models, and neural network models, perform better than the approaches
that rely on total decomposition, such as the single dimensional Cartesian granule
feature approaches and data browser approaches. From a model complexity perspective,
the Cartesian granule feature models and the associated reasoning and inference
procedure are glass-box/transparent in nature, and relatively easily interpreted. The data
browser and MATI algorithms provide similar transparency of representation and
inference. The multi-layer perceptron based models, in addition to their high degree of
parameterisation, also have the disadvantage that the mapping they approximate is
embodied in the weights and biases matrices and thus, the approximation may not be
amenable to inspection or analysis except in simple cases.

Using this example, it is easy to see the parallels between product Cartesian granule
feature models and naïve Bayes classifiers (see Sections 5.2.2 and 7.5.2.2 for an overview of naïve Bayes). The use of crisp one-dimensional Cartesian granule features incorporated into product rules yields a model that is equivalent to a naïve Bayes classifier under certain conditions, even though at the surface level the models and inference strategies look very different, with Cartesian granule feature models being represented by fuzzy sets and probabilistic rules, and naïve Bayes classifiers being represented by conditional probabilities and a class prior. Both models yield the same results when the class priors are uniform and the distribution of data amongst Cartesian granules is uniform. See [Shanahan 2000] for further details of this comparison. A possible new approach to learning is to use a naïve Bayesian approach where the events are no longer precise but fuzzy granular.
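The correspondence can be sketched numerically. With crisp granules, the per-class granule probabilities play the role of naïve Bayes conditional probabilities, and under uniform class priors the product rule and naïve Bayes rank the classes identically; the numbers below are illustrative and the sketch deliberately skips the probability-to-fuzzy-set step.

# Illustrative P(granule | class) tables for two crisp one-dimensional features
p_x = {"legal": [0.2, 0.6, 0.2], "illegal": [0.4, 0.2, 0.4]}
p_y = {"legal": [0.1, 0.8, 0.1], "illegal": [0.5, 0.0, 0.5]}
prior = {"legal": 0.5, "illegal": 0.5}                 # uniform class priors

def product_rule_support(cls, gx, gy):
    # Conjunctive (product) Cartesian granule feature model.
    return p_x[cls][gx] * p_y[cls][gy]

def naive_bayes_score(cls, gx, gy):
    # Naive Bayes: prior times the product of conditional probabilities.
    return prior[cls] * p_x[cls][gx] * p_y[cls][gy]

for cls in ("legal", "illegal"):
    print(cls, product_rule_support(cls, 1, 1), naive_bayes_score(cls, 1, 1))
# With uniform priors the two scores differ only by a constant factor, so both
# classifiers choose the same class for every input granule pair.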

Table 10-6: Summary of the ellipse classification problem using various learning approaches.

Approach                                         Features details                                                 Accuracy
Additive Cartesian granule feature model         (X 10), (Y 3) - Legal; (X 10), (Y 3), (X 11)(Y 7) - Illegal      99.2
One two-dimensional Cartesian granule feature    (X 11)(Y 11), Granularity = 11, 60% Overlapping Trapezoids       98.8
Data browser (evidential logic rules)            X, Y (non-smoothed fuzzy sets)                                   94
Neural network                                   X, Y, and 3 hidden nodes                                         99.5
MATI                                             X, Y [Baldwin, Lawry and Martin 1997]                            99

From a model representation point of view, learnt probability distributions in terms of Cartesian granule features are equivalent to maximum-depth crisp decision trees (as generated by ID3 or C4.5) or probabilistic fuzzy decision trees (as generated by MATI), where each leaf node is equivalent to a Cartesian granule. However, once these distributions are transformed into their equivalent fuzzy sets (in the Cartesian granule feature case), the parallel no longer exists. Nevertheless, both approaches tend to yield similar results (for example, see Table 10-6). The added attraction of the Cartesian granule feature approach is that it tries to decompose the problem into a network of low-order, semantically related variables, which are represented by Cartesian granule features that are incorporated into rule-based models, whereas decision trees, in general, try to solve the problem with one big decision tree. Moreover, recent work has illustrated that combining multiple decision trees can lead to useful results [Breiman 1996].

10.3 SIN(X * Y) PREDICTION PROBLEM

The previous section has examined and compared the effectiveness of Cartesian granule
features in modelling classification systems, that is, systems where the dependent
output variable is discrete in nature. Here, however, prediction problems are addressed
where the dependent output variable is continuous in nature. This study investigates the
effectiveness with which Cartesian granule features can model a non-linear static
system; in this case, in terms of a small artificial problem - the function sin(X * Y). The
sin(X * Y) function (nicknamed the swan's neck) has two base input variables, X and Y,
and is graphically depicted in Figure 10-23. The considered domain for both the X and
Y variables is [0, 3]. Different training, control (validation) and test datasets, consisting
of 529 (in grid fashion), 600 (generated randomly) and 900 (in grid fashion) data
vectors respectively, were generated. Each data sample consists of a triple <X, Y, sin(X * Y)>.
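The datasets just described can be reproduced along the following lines: a 23x23 grid gives the 529 training points and a 30x30 grid the 900 test points over [0, 3]^2. The random seed and the uniform sampling of the control set are assumptions.

import numpy as np

def grid_dataset(n_per_axis, lo=0.0, hi=3.0):
    # n_per_axis x n_per_axis grid of <X, Y, sin(X*Y)> triples over [lo, hi]^2.
    axis = np.linspace(lo, hi, n_per_axis)
    X, Y = np.meshgrid(axis, axis)
    return np.column_stack([X.ravel(), Y.ravel(), np.sin(X.ravel() * Y.ravel())])

def random_dataset(n, lo=0.0, hi=3.0, seed=0):
    # n randomly drawn <X, Y, sin(X*Y)> triples over [lo, hi]^2.
    xy = np.random.default_rng(seed).uniform(lo, hi, size=(n, 2))
    return np.column_stack([xy, np.sin(xy[:, 0] * xy[:, 1])])

train, control, test = grid_dataset(23), random_dataset(600), grid_dataset(30)
print(train.shape, control.shape, test.shape)          # (529, 3) (600, 3) (900, 3)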


Figure 10-23: Graphic representation of sin(X * Y).

10.3.1 ACGF modelling of the Sin(X * Y) problem


As presented previously, when modelling prediction problems it is also necessary to
determine an effective linguistic partition of the output variable's universe. Below, a
number of approaches are considered, including uniform partitioning and percentile-
based partitioning. Different granule characterisations are also considered. The
granularity of the output universe (the number of rules, or fuzzy classes) could
alternatively be given by an expert or could be determined in an iterative manner
beginning with a conservative number and iteratively increasing until no improvement
in generalisation is achieved. As in the ellipse problem, the investigation examines
models consisting of Cartesian granule features with various levels of granulation,
granule characterisation and feature dimensionality. In addition, models with different
linguistic partitions of the output variable universe are also explored.
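The iterative strategy mentioned above can be sketched as a simple hold-out loop; build_model and rms_error stand in for the book's learning and evaluation steps, and the toy stand-ins at the bottom exist only so the loop can be executed.

def select_output_granularity(build_model, rms_error, train, control,
                              start=3, max_granularity=20):
    # Increase the output granularity until the control (validation) error
    # stops improving; return the best granularity found.
    best_g, best_err = start, float("inf")
    for g in range(start, max_granularity + 1):
        model = build_model(train, output_granularity=g)
        err = rms_error(model, control)
        if err < best_err:
            best_g, best_err = g, err
        else:
            break                                      # no further generalisation gain
    return best_g

build_model = lambda data, output_granularity: output_granularity  # toy stand-in
rms_error = lambda model, data: abs(model - 6) + 0.5               # pretends 6 is best
print(select_output_granularity(build_model, rms_error, train=None, control=None))  # 6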

Firstly, the results obtained using two-dimensional Cartesian granule features are
examined. Figure 10-24 summarises the results obtained when the output universe was
partitioned with five uniformly placed mutually exclusive triangular fuzzy sets, and the
input space consisted of a two-dimensional Cartesian granule feature, where the
underlying granules are characterised by the following types of fuzzy sets: triangular
fuzzy sets (curve named Triang); trapezoidal fuzzy sets with an overlap rate of 10%
(curve named T=0.1); trapezoidal fuzzy sets with an overlap rate of 40% (curve named
T=O.4); and trapezoidal fuzzy sets with an overlap rate of 100% (curve named T=1.0).
Other degrees of overlap were also investigated (20%, 30%, 50%, 60%, 70%, 80% and
90%), however, an overlap of 40% gave the best results in terms of accuracy and
transparency (granularity). More specifically, the use of granules characterised by
trapezoidal fuzzy sets with an overlap degree of 40% in two-dimensional Cartesian granule features gave an RMS error level that tends towards 3% when the granularity is
increased to twenty. A decision surface for a two-dimensional model is presented in
Figure 10-25 (a).

Figure 10-24: RMS error of various two-dimensional Cartesian granule feature models where the base granules are characterised by trapezoidal (of different overlap degrees) and triangular fuzzy sets. The output space was partitioned using five uniformly placed, mutually exclusive triangular fuzzy sets.


Figure 10-25: (a) Sin(X * Y) decision surface generated using two-dimensional Cartesian granule feature models, where the base feature universes have been partitioned using 14 trapezoidal fuzzy sets with 40% overlap. The RMS error is 4.12%. (b) Sin(X * Y) decision surface generated using two one-dimensional Cartesian granule features, where the base feature universes have been partitioned using 14 triangular fuzzy sets. The RMS error is 28%.

Figure 10-26 summarises the results obtained, where the output universe is partitioned
with six uniformly placed mutually exclusive triangular fuzzy sets and the input space
consists of a two-dimensional Cartesian granule feature, where the underlying granules
are characterised by the following types of membership functions: triangular fuzzy sets
(curve named Triang); trapezoidal fuzzy sets with an overlap rate of 10% (curve named
T=O.l); trapezoidal fuzzy sets with an overlap rate of 40% (curve named T=O.4); and
trapezoidal fuzzy sets with an overlap rate of 100% (curve named T=l.O). Overall, the
use of granules characterised by trapezoidal fuzzy sets with an overlap degree of 40%
for two-dimensional Cartesian granule features outperformed other types of granule,
yielding an RMS error level which tends towards 2.75% as the granularity is increased
to twenty.

Figure 10-26: RMS error of various two-dimensional Cartesian granule feature models where the base granules are characterised by trapezoidal (of different overlap degrees) and triangular fuzzy sets. The output space was partitioned using six uniformly placed, mutually exclusive triangular fuzzy sets.

Figure 10-27 gives an overall summary of the results obtained where the output universe was partitioned using uniform and percentile-based approaches with different levels of granularity. In this graph, only the type of input partition (characterised by the shape of the fuzzy set used) yielding the best results (best average accuracy for granularities in the range [2, 20]) for the corresponding output partition type is presented. The curves graphed here correspond to the following types of Cartesian granule feature model: the output space was partitioned with 5 uniformly placed triangular fuzzy sets and input granules were characterised by trapezoidal fuzzy sets with an overlap rate of 40% (curve named T=0.4(5, UT)); the output space was partitioned with 6 uniformly placed triangular fuzzy sets and input granules were characterised by trapezoidal fuzzy sets with an overlap rate of 40% (curve named T=0.4(6, UT)); the output space was partitioned on a percentile basis with mutually exclusive triangular fuzzy sets and input granules were characterised by triangular fuzzy sets (curve named Triang=(6, PT)); and the output space was partitioned with 7 uniformly placed triangular fuzzy sets and input granules were characterised by trapezoidal fuzzy sets with an overlap rate of 40% (curve named T=0.4(7, UT)). The use of trapezoidal fuzzy sets to partition the output universe was also examined but does not yield any significant performance improvement over the results presented previously for triangular-based fuzzy sets. The two-dimensional model, where the output space was partitioned with 6 uniformly placed triangular fuzzy sets and input granules were characterised by trapezoidal fuzzy sets with an overlap rate of 40% (curve named T=0.4(6, UT)), performed best overall in modelling the sin(X * Y) problem (in this battery of experiments).

The use of one-dimensional Cartesian granule features of various granularities and granule types was also examined. Their incorporation into rule-based models, regardless of the partition type and granularity of the output space, yielded rather large RMS errors, which were generally over 25%. This high RMS error rate results directly from the decomposed nature of the one-dimensional Cartesian granule feature. A typical decision surface, where one-dimensional Cartesian granule feature based models were used to model the sin(X * Y) problem, is depicted in Figure 10-25(b).

Figure 10-27: RMS error of various two-dimensional Cartesian granule feature models where the base granules are characterised by trapezoidal (of different overlap degrees) and triangular fuzzy sets. The output space was partitioned using various types of partition as indicated by the curve names.

10.3.2 A Comparison with other inductive learning techniques


The previous sections have presented the results of experiments where Cartesian
granule feature models were used to model the sin (X * Y) problem. This section
describes the results obtained when other supervised learning techniques were applied
to the sin(X * Y) problem: the data browser; neural networks; and the mass assignment
tree induction (MATI) algorithm. These approaches were assessed using the same
datasets that were used in Cartesian granule feature modelling. This is followed in the
next section by a discussion where these approaches are analysed and compared with
the use of Cartesian granule feature based models. The sin(X * Y) problem proves to be
a particularly difficult prediction problem for most supervised learning approaches.

10.3.2.1 Data browser


Table 10-7 presents the results achieved when the data browser was used to generate
models of different rule types: evidential logic; and conjunctive rule structures. Various
approaches to partitioning the output universe were investigated including uniform
partitioning and percentile partitioning, with the following numbers of fuzzy sets: 5, 7, 9, 12, and 20. However, none of the resulting models gave an RMS error better than 23%.
In addition, the use of smoothing algorithms during the input fuzzy set induction was
investigated, however, the non-smoothed models tended to yield more accurate results.
The data browser's performance is heavily affected by the decomposed nature of the
generated model. This is clearly depicted in the decision surface presented in Figure
10-28 for an extracted evidential logic rule based model.

Table 10-7: RMS error results using the data browser on the Sin(X * Y) problem.

Fril Rule Type    Granularity of Output Variable    % RMS Error    Decision Surface Figures
Conjunctive       6                                 24             Similar to Figure 10-28
Evidential        6                                 23.6           Similar to Figure 10-28
Evidential        9                                 23.48          Figure 10-28
Evidential        12                                23.6           Similar to Figure 10-28

Figure 10-28: Sin(X * Y) decision surface generated by a data browser induced model using nine percentile-positioned triangular fuzzy sets in the output universe. The RMS error of this model is 23.48%.

10.3.2.2 Multi-layer perceptron


Two-layered perceptrons of different architectures were applied to the sin(X * Y)
problem. Table 10-8 presents the results obtained when perceptrons with hidden layers
of different sizes were used to model the sin(X * Y) problem. In this case, the output
node corresponds to the predicted sin(X * Y) value. The two-layered neural network
performs very well in modelling the sin(X * Y) problem, attaining RMS errors of less than 2% with eight hidden nodes. The number of training epochs (one epoch corresponds to a presentation of all the training data) is also presented in Table 10-8.

Table 10-8: RMS error for various feedforward neural networks for Sin(X * Y).

NN Topology    # Training Epochs    RMS Error on Test
2-3-1          1000                 6.11
2-4-1          1000                 6.12
2-5-1          1500                 3.5
2-6-1          2000                 2.4
2-7-1          2000                 2.19
2-8-1          2000                 1.3
2-9-1          2000                 2.90
2-10-1         2000                 1.82
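For comparison, a network of roughly the 2-8-1 topology of Table 10-8 can be reproduced with off-the-shelf tools. The sketch below uses scikit-learn's MLPRegressor, which offers lbfgs/adam optimisers rather than the SCG algorithm used with SNNS in the book, so the exact error figures will differ.

import numpy as np
from sklearn.neural_network import MLPRegressor

def grid(n, lo=0.0, hi=3.0):
    # n x n grid of (X, Y) inputs and sin(X*Y) targets over [lo, hi]^2.
    a = np.linspace(lo, hi, n)
    X, Y = np.meshgrid(a, a)
    return np.column_stack([X.ravel(), Y.ravel()]), np.sin(X.ravel() * Y.ravel())

X_train, y_train = grid(23)                            # 529 training vectors
X_test, y_test = grid(30)                              # 900 test vectors

net = MLPRegressor(hidden_layer_sizes=(8,), activation="tanh",
                   solver="lbfgs", max_iter=2000, random_state=0)
net.fit(X_train, y_train)
rmse = np.sqrt(np.mean((net.predict(X_test) - y_test) ** 2))
print(f"RMS error on the test grid: {100 * rmse:.2f}%")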

10.3.2.3 Mass Assignment based decision trees


The MATI algorithm yields an RMS error of 4.22% on unseen test data when the base
feature universes were partitioned uniformly using ten trapezoidal fuzzy sets with an
overlap degree of 50%. The output variable universe was percentile-partitioned using
five triangular fuzzy sets [Baldwin, Lawry and Martin 1997].

10.3.3 Sin(X * Y) problem discussion and summary


The previous sections have described the results of experiments where different
learning paradigms were used to model the sin(X * Y) problem. This section discusses
and summarises the main results of these experiments.

In the case of the Cartesian granule features modelling, models consisting of input
Cartesian granule features with various levels of granulation, granule characterisation
and feature dimensionality were systematically sampled. Various linguistic partitions of
the output universe were also investigated. The following are the main findings of these
experiments on the sin(X * Y) problem:

• Overall, granules characterised by trapezoidal fuzzy sets outperformed other characterisations.
• Both one-dimensional and two-dimensional Cartesian granule features
were investigated with the two-dimensional feature yielding a
significantly higher accuracy on unseen test data. The one-dimensional
models suffer from a large decomposition error.
• Linguistic partitions of the output variable's universe, where the granules
are characterised by triangular fuzzy sets give a marginal improvement in
performance over their trapezoidal counterparts. This possibly suggests
that fuzzy numbers are a more appropriate form of abstraction for prediction.
• Cartesian granule features where the underlying granules are characterised
by crisp sets cannot be used to model prediction problems and generally
the crisper fuzzy sets tended to perform badly in modelling this problem.
• This problem illustrates how mathematical models can be translated into a
symbolic rule form.
• The investigated models consist solely of either two one-dimensional features or of one two-dimensional feature, but when models
consisting of mixed-dimensional features were investigated, they resulted
in RMS errors typically greater than 20%.

Overall, modelling approaches which use total decomposition, including the one-
dimensional Cartesian granule feature models and the data browser, suffer from large
decomposition errors. A results summary of each of the examined approaches is
presented in Table 10-9. On the whole, additive Cartesian granule feature models lead
to relatively high levels of accuracy for this problem while also providing moderate
levels of model transparency. For prediction problems in general, using data-centred
partitioning (e.g. clustering) may yield a more natural partition of the base feature
universes by focussing on where the data lies rather than covering the whole universe
uniformly.

Table 10-9: Results summary for the Sin(X * Y) prediction problem using various supervised learning approaches.

Approach                                                          Features detail                                                                   % RMS Error
Neural network                                                    X, Y, and 7 hidden nodes                                                          2.19
Two-dimensional Cartesian granule features                        (X, Y), Granularity = 20, Triangular Fuzzy Set, Output granularity = 6            2.6
Two-dimensional Cartesian granule features                        (X, Y), Granularity = 14, 40% Overlapping Trapezoids, Output granularity = 6      4.12
MATI                                                              X, Y, Trapezoidal fuzzy sets with overlap degree of 50%, Output granularity = 5   4.2
One-dimensional Cartesian granule features in Evidential Rules    (X), (Y), regardless of granularity of input or output spaces                     25+
Data browser (conjunctive logic rules)                            X, Y (non-smoothed fuzzy set)                                                     23+

10.4 WHY DECOMPOSED CARTESIAN GRANULE FEATURE MODELS?

Step 2 of the G_DACG algorithm (Section 9.1.1) is concerned with decomposing the
input feature space into low order relationships between small clusters of semantically
related variables (dependent variables), which are subsequently modelled with
Cartesian granule features. This decomposition is necessary on a number of grounds
including generalisation, transparency and comptractibility. This section provides a
motivational example as to why decomposition is necessary from a generalisation
perspective.

In order to enhance the generalisation powers and transparency of Cartesian granule feature models when applied to real world problems, models consisting of decomposed
Cartesian granule features need to be identified. This forms the basis of additive
Cartesian granule feature modelling. This is motivated by the L example problem
presented in the next subsection, where it is necessary to decompose the input feature
space into Cartesian granule feature sub-models (sub-functions) in order to generate
models that generalise well. Functional decomposition is further substantiated by the
desire to generate models that are transparent and amenable to human inspection
thereby providing insight into the problem being modelled.

10.4.1 L classification problem


The L problem is an example of an induction problem with a small dataset and was
introduced previously in [Baldwin 1996a]. The problem posed here is to construct a
model from the training set that classifies a vector of black or white pixels as
corresponding to the letter L or not. The training dataset consists of eight positive and
eight negative examples of the letter L. A training example corresponds to an image
taken by a small 4-pixel camera (see Figure 10-30) that is randomly positioned over a
grid that either corresponds to the letter L or not (see Figure 10-29). The camera is
moved vertically or horizontally until some black pixels occur, thus generating a 2x2
image (or mask) as depicted in Figure 10-30. This image is then transmitted back to a
classification system (either a human or computer). The L-problem is graphically
depicted in Figure 10-31.


Figure 10-29: Examples of Ls: (a) good L and (b) bad L.

10.4.1.1 L-Problem - Datasets


The training dataset consists of the patterns found in Table 10-11, where each example

quadruplet <A, B, C, D> corresponds to the mask elements (Figure 10-30) and each
element takes the value 1 if the corresponding pixel is black and 0 if the pixel is white.
These patterns are illustrated in Figure 10-32 and Figure 10-33. The training set
consists of patterns that are not entirely sufficient to discriminate between positive and
negative examples; some patterns occur as both positive and negative examples. Within
this problem domain it is known that errors in data communication can occur, thus
resulting in patterns of the form presented in Table 10-10 and Figure 10-34. These
patterns correspond to patterns that can only occur as a result of communication errors.
The classification of these patterns is unknown. The patterns with known classifications
are used to train models that provide generalised classification for these unclassified
cases.

Figure 10-30: L mask used for sensing the environment.

Figure 10-31: L-problem definition.

Figure 10-32: Training examples of good Ls.

Figure 10-33: Training examples of bad Ls.



Figure 10-34: Patterns of unknown classification.

Table 10-10: Unclassified corrupted L patterns.

Input Features          L Class
A   B   C   D
1   0   1   0           Unknown
0   1   0   1           Unknown
1   1   0   1           Unknown
1   1   1   0           Unknown
1   1   1   1           Unknown
0   0   1   1           Unknown

Table 10-11: Training L examples.


Input Features          L Class
A   B   C   D
1   0   1   1           Good
0   0   0   1           Good
0   0   0   1           Good
0   0   1   0           Good
0   1   1   0           Good
1   1   0   0           Good
0   1   0   0           Good
1   0   0   0           Good
0   1   1   1           Bad
0   0   1   0           Bad
0   0   1   0           Bad
0   0   0   1           Bad
1   0   0   1           Bad
1   1   0   0           Bad
1   0   0   0           Bad
0   1   0   0           Bad

10.4.1.2 Modelling the L Problem with Cartesian Granule Features


Each of the input features for the L problem is discrete in nature, where the respective
universes consist of two granules or words, true or false (corresponding to the existence
of a black pixel, or not respectively). The output variable is also discrete in nature with
similar words (corresponding to positive or negative examples of the letter L). Due to
the discrete nature of both the input and output universes the number of possible
Cartesian granule features is small. The use of all possible combinations of one, two,
three, and four-dimensional features in modelling this problem is investigated. All
possible features of the same dimensionality are used in the same model, that is, all four
one-dimensional features are combined into one model. Features of different

dimensionality are not mixed in the same model. In each case, the Cartesian granule
features were combined using the evidential logic rule and the weights were determined
using semantic discrimination analysis (see Section 9.4). Table 10-12 presents the
results achieved using these various models. The test results are compared with a
Bayesian model generated using the assumption that an error model exists [Baldwin
1996b]. The following observations regarding these results can be made:

• The one-dimensional Cartesian granule feature model captures equivalent
representations for both positive and negative concepts, thereby resulting in
indistinguishable behaviour when it comes to classification. Consequently, this
model proves useless in terms of generalisation.

• The two-dimensional Cartesian granule feature model captures the training
data moderately well but misclassifies the uncertain tuples 7 and 8, which appear
in both the test and training data. It also misclassifies the training tuples 5 and
13 as uncertain. This is caused by the activated Cartesian granules being in
both the positive and negative fuzzy sets. This scenario arises because of the
low dimensionality (i.e. low constraints) of the Cartesian granule features, thus
resulting in no discrimination for these cases. However, the model generalises
well, yielding a behaviour equivalent to that of the best Bayesian model.

• The three-dimensional Cartesian granule feature model captures the training
data well but misclassifies the uncertain tuples 7 and 8, which appear in both the
test and training data. The model generalises well, yielding a behaviour
equivalent to that of the best Bayesian model.

• The four-dimensional Cartesian granule feature model captures the training
data perfectly, including its idiosyncrasies, leaving no room for generalisation.

• Although the results are not presented here, mixing Cartesian granule features
of different dimensionality in a model (which includes either 2D or 3D
features) does not improve on those achieved by the three-dimensional model.

• The MATI probabilistic decision tree induction algorithm gives similar results
to the three-dimensional Cartesian granule feature model [Baldwin 1996a].

• Discovering the structural dependencies (relationships) in the input features is
a critical step in generating models from example data. If these dependencies
are captured successfully then the learnt model not only describes the training
data effectively and concisely but also, in general, provides better generalisation
on unseen data.

In this section, the L example problem has highlighted the need for discovering
structural decomposition in order to generate Cartesian granule feature models that
provide good generalisation and knowledge transparency. This approach to structural
decomposition parallels other modelling techniques such as Bayesian networks [Good
1961; Lauritzen and Spiegelhalter 1988] and ANOVA [Efron and Stein 1981; Friedman
1991]. The Cartesian granule feature decomposition can be viewed as a linguistic
functional decomposition [Shanahan 1998].

Table 10-12: L problem classification results using various learning algorithms, where
G corresponds to Good, U to Uncertain, B to Bad and N/A to not applicable. The
training results are presented as a triple consisting of the number of correctly classified
tuples, the number of tuples classified as uncertain (i.e. each classification rule returns
the same level of support), and the number of misclassified examples. The test results
are presented as a comparison with the results given by a Bayes classifier.

#   Input     Actual  Evlog     Evlog     Evlog     Evlog
    Features  Class   1D CG     2D CG     3D CG     4D CG     MATI   Bayes
                      Features  Features  Features  Features
Train Set
1   1011      G       U         G         G         G         G      G
2   0001      G       U         G         G         G         G      G
3   0001      G       U         G         G         G         G      G
4   0010      G       U         B         B         B         B      B
5   0110      G       U         U         G         G         G      G
6   1100'     G       U         U         U         U         U      U
7   0100'     G       U         G         G         U         G      U
8   1000'     G       U         B         B         U         B      U
9   0111      B       U         B         B         B         B      B
10  0010      B       U         B         B         B         B      B
11  0010      B       U         B         B         B         B      B
12  0001      B       U         G         G         G         G      G
13  1001      B       U         U         B         B         B      B
14  1100'     B       U         U         U         U         U      U
15  1000'     B       U         B         B         U         B      U
16  0100'     B       U         G         G         U         G      U
Test Set
1   1010      U       U         G         G         U         G      G
2   0101      U       U         B         B         U         B      B
3   1101      U       U         B         B         U         B      B
4   1110      U       U         G         G         U         G      G
5   1111      U       U         U         U         U         U      U
6   0011      U       U         U         U         U         U      U
Train accuracy
(Correct/Uncertain/
Incorrect)            N/A     0/16/0    8/4/4     10/2/4    8/6/2     10/2/4  8/2/6
Test Results
(Correspondence
with Bayes)           N/A     33%       100%      100%      33%       100%    N/A
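As an illustration of the two-dimensional modelling discussed above, the following sketch builds frequency-based Cartesian granule summaries of the Good and Bad classes from the training patterns of Table 10-11 and combines all two-dimensional features with an unweighted average. This is a deliberately simplified stand-in for the Fril mass-assignment machinery and the weighted evidential logic rule actually used; the normalisation (the most frequent granule in each class receives membership 1) and the equal weighting are assumptions made for illustration only.

from collections import Counter
from itertools import combinations

train = [((1,0,1,1),'Good'), ((0,0,0,1),'Good'), ((0,0,0,1),'Good'),
         ((0,0,1,0),'Good'), ((0,1,1,0),'Good'), ((1,1,0,0),'Good'),
         ((0,1,0,0),'Good'), ((1,0,0,0),'Good'),
         ((0,1,1,1),'Bad'), ((0,0,1,0),'Bad'), ((0,0,1,0),'Bad'),
         ((0,0,0,1),'Bad'), ((1,0,0,1),'Bad'), ((1,1,0,0),'Bad'),
         ((1,0,0,0),'Bad'), ((0,1,0,0),'Bad')]          # Table 10-11, (A, B, C, D)

pairs = list(combinations(range(4), 2))                  # all 2D feature pairs

def learn(examples):
    """One Cartesian granule summary per (class, feature pair)."""
    model = {}
    for cls in {c for _, c in examples}:
        for pair in pairs:
            counts = Counter(tuple(x[i] for i in pair)
                             for x, c in examples if c == cls)
            peak = max(counts.values())
            model[(cls, pair)] = {g: n / peak for g, n in counts.items()}
    return model

def supports(model, x):
    """Average granule membership over all 2D features, per class."""
    classes = {cls for cls, _ in model}
    return {cls: sum(model[(cls, p)].get(tuple(x[i] for i in p), 0.0)
                     for p in pairs) / len(pairs) for cls in classes}

model = learn(train)
for pattern in [(1, 0, 1, 0), (0, 1, 0, 1), (1, 1, 1, 1)]:   # unseen (Table 10-10)
    print(pattern, supports(model, pattern))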

10.5 OVERALL DISCUSSION

The previous sections have demonstrated the application of Cartesian granule feature
modelling in the context of artificial classification and prediction problems. This
section details some general comments on the use of Cartesian granule features.

In summary, the results presented in this chapter support the following argument:
approaches that rely on total decomposition, that is, ignore the problem structure (such
as one-dimensional Cartesian granule features, the data browser and naïve Bayes) will
not, in general, perform as well as approaches that focus on modelling the problem
structure (multidimensional Cartesian granule feature models, neural networks and
Bayesian networks). Cartesian granule feature modelling, as personified by the
G_DACG algorithm, searches for structure in terms of a network of low-order
semantically related or dependent features.

Fuzzy sets are a more desirable characterisation of granules than crisp sets. Firstly,
models which employ fuzzy set characterisations of granules will, in general, require a
lower granularity, and this lower granularity will tend to lead to better generalisation.
Secondly, due to the interpolative nature of smooth fuzzy sets, fuzzy set based models
give a much more flexible decision boundary or surface (i.e. not piecewise linear),
whereas the use of crisp sets or fairly crisp sets (fuzzy sets with low degrees of overlap)
yields decision boundaries which are stepwise in nature. Thirdly, models based upon
crisp granules tend to be very sensitive to the location of granule boundaries,
sometimes yielding a discontinuous behaviour when the boundaries are changed,
whereas the use of fuzzy granules tends to be more robust in this respect. Finally,
empirical evidence presented here corroborates that granules, which are characterised
by fuzzy sets, give accurate models that are more succinct than their crisp counterparts.

The use of fuzzy granules facilitates the expression of both classification and prediction
induction algorithms in a single, coherent framework, such that classification problems
can be viewed as a special case of the more general prediction problem, where each
output classification value is interpreted as a crisp classification.

10.6 SUMMARY AND CONCLUSIONS

This chapter has concentrated mainly on the analysis of learnt Cartesian granule feature
models, for both classification and prediction problems, under various conditions. For
the selected problems, the space of possible models was systematically sampled
examining the effect of the following on the resulting model: different linguistic
partitions of input variable universes; the feature dimensionality of the Cartesian
granule features; the type of rule used to aggregate; and different linguistic partitions of
the output variable's universe (in the case of prediction problems). This analysis
provides insights on how to model a problem using Cartesian granule features. It also
serves as a means of comparing this approach with other well known learning
paradigms. In general, the learnt Cartesian granule feature based models performed as
well and in some cases outperformed other well-known learning approaches.

Furthermore, this chapter has provided a useful platform for understanding many other
learning algorithms that may or may not explicitly manipulate fuzzy events or
probabilities. For example, it was shown how a naïve Bayes classifier is equivalent to
crisp Cartesian granule feature classifiers under certain conditions. Other parallels were
also drawn between learning approaches such as decision trees and the data browser.

As a result of this analysis, an extension to the naïve Bayesian approach from crisp
events to fuzzy events is proposed.
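The claimed correspondence can be illustrated with a small sketch: when granules are crisp, the per-class, one-dimensional Cartesian granule summaries reduce to class-conditional histograms, and combining the features with a product rule and class priors behaves exactly like a naïve Bayes classifier. The exact equivalence conditions are those discussed in the chapter; the data, bin edges and smoothing constant below are hypothetical and serve only to illustrate the crisp, one-dimensional case.

import numpy as np

def crisp_granule(x, edges):
    """Index of the crisp granule (histogram bin) containing x."""
    return int(np.clip(np.searchsorted(edges, x, side='right') - 1,
                       0, len(edges) - 2))

def fit(X, y, edges, alpha=1e-6):
    classes = np.unique(y)
    priors = {c: np.mean(y == c) for c in classes}
    hists = {c: [np.histogram(X[y == c, j], bins=e)[0] + alpha
                 for j, e in enumerate(edges)] for c in classes}
    # normalise counts into class-conditional probabilities per feature
    hists = {c: [h / h.sum() for h in hs] for c, hs in hists.items()}
    return priors, hists

def predict(x, priors, hists, edges):
    score = {c: priors[c] * np.prod([hists[c][j][crisp_granule(x[j], e)]
                                     for j, e in enumerate(edges)])
             for c in priors}
    return max(score, key=score.get)

rng = np.random.default_rng(0)
X = np.r_[rng.normal(0, 1, (50, 2)), rng.normal(2, 1, (50, 2))]
y = np.r_[np.zeros(50), np.ones(50)]
edges = [np.linspace(-4, 6, 11)] * 2          # 10 crisp granules per feature
priors, hists = fit(X, y, edges)
print(predict([0.1, -0.2], priors, hists, edges),
      predict([2.3, 1.9], priors, hists, edges))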

Overall, Cartesian granule features open up a new and exciting avenue in probabilistic
fuzzy systems modelling, which allows not only the ability to compute with words but
also to model with words. The use of Cartesian granule features facilitates the paradigm
of modelling with words, yielding anthropomorphic knowledge descriptions that are
effective in modelling classification and prediction systems. The next chapter presents
applications of Cartesian granule features to real world problems in the fields of
medical decision support, computer vision and control.

10.7 BIBLIOGRAPHY

Baldwin, J. F. (1995). "Mllchine Intelligence using Fuzzy Computing." In the


proceedings of ACRC Seminar (November), University of Bristol.
Baldwin, J. F. (1996a). "Knowledge from data using Fril and fuzzy methods", In Fuzzy
Logic, J. F. Baldwin, ed., John Wiley & Sons, 34-76.
Baldwin, J. F. (1996b). "Knowledge from data using Fril and fuzzy methods -
induction", Report No. ITRC 236, A.I. Group University of Bristol, Bristol.
Baldwin, J. F., Lawry, J., and Martin, T. P. (1997). "Mass assignment fuzzy ID3 with
applications." In the proceedings of Fuzzy Logic: Applications and Future
Directions Workshop, London, UK, 278-294.
Baldwin, J. F., and Martin, T. P. (1995). "Fuzzy Modelling in an Intelligent Data
Browser." In the proceedings of FUZZ-IEEE, Yokohama, Japan, 1171-1176.
Baldwin, J. F., and Martin, T. P. (1999). "Basic concepts of a fuzzy logic data browser
with applications", Report No. , ITRC Report 250, Dept. of Engineering
Maths, University of Bristol.
Baldwin, J. F., Martin, T. P., and Pilsworth, B. W. (1995). FRIL - Fuzzy and Evidential
Reasoning in A. I. Research Studies Press (Wiley Inc.), ISBN 0863801595.
Baldwin, J. F., and Pilsworth, B. W. (1997). "Genetic Programming for Knowledge
Extraction of Fuzzy Rules." In the proceedings of Fuzzy Logic: Applications
and Future Directions Workshop, London, UK, 238-251.
Bezdek, J. C. (1981). Pattern Recognition with Fuzzy Objective Function Algorithms.
Plenum Press, New York.
Breiman, L. (1996). "Bagging predictors", Machine Learning, 66:34-53.
Efron, B., and Stein, C. (1981). "The Jackknife Estimate of Variance", Annals of
Statistics, 9:586-596.
Friedman, J. H. (1991). "Multivariate Adaptive Regression Splines", The Annals of
Statistics, 19: 1-141.
Good, I. J. (1961). "A causal calculus", British Journal of the Philosophy of Science,
11:305-318.
Kohonen, T. (1984). Self-Organisation and Associative Memory. Springer-Verlag,
Berlin.
Lauritzen, S. L., and Spiegelhalter, D. J. (1988). "Local computations with probabilities
on graphical structures and their application to expert systems", Journal of the
Royal Statistical Society, B50(2):157-224.

Miller, G. A. (1956). "The magical number seven, plus or minus two: some limits on
our capacity to process information", Psychological Review, 63:81-97.
Moller, M. F. (1993). "A scaled conjugate gradient algorithm for fast supervised
learning", Neural Networks, 6:525-533.
Quinlan, J. R. (1986). "Induction of Decision Trees", Machine Learning, 1(1):86-106.
Quinlan, J. R. (1993). C4.5: Programs for Machine Learning. Morgan Kaufmann, San
Mateo, CA.
Shanahan, J. G. (1998). "Cartesian Granule Features: Knowledge Discovery of
Additive Models for Classification and Prediction", PhD Thesis, Dept. of
Engineering Mathematics, University of Bristol, Bristol, UK.
Shanahan, J. G. (2000). "A comparison between naive Bayes classifiers and product
Cartesian granule feature models", Report No. In preparation, XRCE.
Silverman, B. W. (1986). Density estimation for statistics and data analysis. Chapman
and Hall, New York.
Sugeno, M., and Yasukawa, T. (1993). "A Fuzzy Logic Based Approach to Qualitative
Modelling", I~EE Trans on Fuzzy Systems, 1(1): 7-31.
Weiss, S. M., and Indurkhya, N. (1998). Predictive data mining: a practical guide.
Morgan Kaufmann.
Zadeh, L. A. (1994). "Soft computing", LIFE Seminar, LIFE Laboratory, Yokohama,
Japan (February, 24), published in SOFT Journal, 6:1-10.
Zell, A., Mamier, G., Vogt, M., and Mache, N. (1995). SNNS (Stuttgart Neural Network
Simulator) Version 4.1. Institute for Parallel and Distributed High
Performance Systems (IPVR), Applied Computer Science, University of
Stuttgart, Stuttgart, Germany.
CHAPTER 11

APPLICATIONS
Having illustrated Cartesian granule feature modelling on artificial problems in the
previous chapter, the focus in this chapter switches to real world applications. Four
applications are considered in the domains of computer vision, diabetes diagnosis and
control. Both classification and regression applications are investigated. Knowledge
discovery of Cartesian granule feature models in these problem domains is contrasted
with other techniques such as neural networks, decision trees, naïve Bayes and various
fuzzy induction algorithms using a variety of performance criteria such as accuracy,
understandability and efficiency.

The first problem considered is a region classification problem in outdoor image
understanding [Shanahan et al. 1999; Shanahan, Baldwin and Martin 1999; Shanahan et
al. 2000]. Subsequently, knowledge discovery is applied to discovering useful patterns
in a medical records database of Native American Pima Indians that would improve
medical decision making in the area of diabetes detection [Baldwin, Martin and
Shanahan 1997; Shanahan 1998]. The final two applications investigated are both
regression, coming from the field of control [Baldwin, Martin and Shanahan 1999].
Both focus on the discovery of patterns in historical datasets that facilitate automatic
control. The first example deals with the widely used benchmark problem of modelling
a gas furnace (an example of a dynamical process), which was first presented by Box
and Jenkins [Box and Jenkins 1970], while the second application focuses on the
discovery of patterns in a chemical plant control dataset that facilitate automatic
control. The chapter finishes by drawing some overall conclusions regarding the
knowledge discovery of Cartesian granule feature models and also by identifying future
avenues of research for knowledge discovery in general.

11.1 REGION CLASSIFICATION IN IMAGE UNDERSTANDING

Current learning approaches to computer vision have mainly focused on low-level
image processing and object recognition, while tending to ignore high-level processing
such as understanding. Here the discovery of Cartesian granule feature models is
proposed as an approach to object recognition that facilitates the transition from object
recognition to object understanding. This approach begins by segmenting the images
into regions using standard image processing approaches, which are subsequently
classified using a discovered fuzzy Cartesian granule feature classifier. Understanding
is made possible through the transparent and succinct nature of the discovered models.
The recognition of roads in images is taken as an illustrative problem in the vision
domain.


The next section begins by reviewing existing approaches to object recognition and
discussing some of the motivations behind applying the Cartesian granule feature
knowledge discovery process to image understanding. Section 11.1.2 overviews the
main knowledge discovery steps from an object recognition application perspective,
while also providing a task oriented breakdown of the rest of this section.

11.1.1 Motivations
Fischler and Firschein [Fischler and Firschein 1987] list learning, and representation
and indexing as two of the problems and open issues in computer vision.
Representation and indexing in this context refers to the design of representations for
the visual description of complex scenes that are also suitable for reasoning and
indexing into a large database of stored knowledge. However, in the interim both of
these areas have received much attention from various groups. This attention has
mainly been motivated by the following:

• Traditional image understanding focused on techniques from physics,


mathematics, psychology, computer science and artificial intelligence that
depended tremendously on human input and direction. This led to many
limitations and endless assumptions about what these techniques could achieve;
the resulting systems were usually very labour intensive, very limited, and very sensitive to
change. Examples include knowledge-based approaches such as [Draper et al.
1989] and Condor [Strat 1992] and model-based approaches such as [Brooks
1987; Grimson and Lozano-Perez 1984; Mirmehdi et al. 1999].
• Recent technological advances in visual data (still images and video)
acquisition (through low-cost scanners and digital cameras), in storage
(through compression standards and cheap storage devices) and in
communication (via internet, extranet and intranet) has led to a flood of visual
data. For example, Frawley et al. [Frawley, Piatetsky-Shapiro and Matheus
1991] note that "earth observation satellites planned for the 1990s are expected
to generate one terabyte (l015) of data every day". Consequently, the
development of multimedia management systems (M 3 Systems) that provide
efficient, effective and intuitive means of storing, representing and retrieving
visual data (along with other data forms such as sound) is currently a very
important area of research and development.
• Autonomous vehicles, due to their extensive use of visual data, have fuelled a lot
of interest in object recognition and scene understanding using learning
approaches.

Like many domains of application in pattern recognition, most successful applications


in computer vision that exploit machine learning rely on black box representations;
generally neural network or eigenspace-based models. Consequently, these approaches
to vision provide little or no interpretability. Some eigenspace-based modelling
approaches to 2D and 3D recognition problems include [Kosako, Ralescu and
Shanahan 1994; Murase and Nayar 1993; Turk and Pentland 1991]. Region based
approaches and pixel (very local) based recognition systems that have relied on neural
networks for their learning capabilities include [Campbell et al. 1997; Mukunoki,
Minoh and Ikeda 1994; Wood, Campbell and Thomas 1997]. While symbolic learning
techniques have been applied to vision problems, they have had only mild success in
terms of performance accuracies compared to their mathematical (generally opaque)

counterparts. Some of the more interesting symbolic learning work in the field of image
understanding (IU) includes Winston's [Winston 1975] landmark work; a high level
approach using semantic nets to learn object structures from examples and counter-
examples (Winston's near-misses). This approach while providing understandable
models, largely ignores the lower and intermediate levels of image processing and
understanding. Other symbolic approaches to learning and representation have tended
to focus on small and limited problem domains. These include [Shepherd 1983] where
learnt decision trees were used in the classification of chocolates. Other approaches
based upon learning semantic net representations include the classification of hammers
and overhead views of commercial aircraft [Connell and Brady 1987]. Michalski et al.
[Michalski et al. 1998] provide some interesting results using a battery of learning
approaches: rule-based learning using AQ [Michalski and Chilausky 1980]; neural
network learning; and a hybrid of AQ and neural networks. The application domains
considered by Michalski et al. of outdoor image classification, detection of blasting
caps in X-ray images of luggage, and action recognition in motion video though
somewhat interesting were limited to rather simple uncluttered scenarios [Michalski et
al. 1998]. Ralescu and Shanahan [Ralescu and Shanahan 1995; Ralescu and Shanahan
1999] propose a novel approach to learning the rules of perceptual organisation using
fuzzy modelling techniques (resulting in intuitive and transparent models). This
approach was limited to the perceptual organisation of edge images, however it could
very easily be extended to other forms of perceptual organisation including that of
regions, objects and scenes. The resulting high-level structures could then be used to
compare with object models and thus, lead to object recognition.

Overall, successful approaches to learning in vision are still predominantly black box in
nature. Here a new approach to image understanding is proposed, based upon Cartesian
granule features, that not only provides high levels of accuracy, but also facilitates
understanding due to the transparent and succinct nature of the knowledge
representation used. The approach is illustrated on a road recognition problem. Image
representation and indexing, though not addressed directly in this paper, can benefit
from image recognition and understanding approaches that provide transparency.

11.1.2 Knowledge discovery in image understanding


This section summarises the main knowledge discovery steps in recognising object
regions within the context of digital images of outdoor scenes. The problem was
partitioned into two natural but distinct parts: region segmentation and region
classification. Segmentation was achieved using standard image processing approaches,
whereas region classification is carried out by a classifier. The main goal of the
knowledge discovery process was to construct a classifier, automatically from
examples, that provides high accuracy and good transparency. Additive Cartesian
granule feature models were chosen as the form of knowledge representation, and
consequently, the G_DACG constructive induction algorithm was chosen to learn such
classifiers. Figure 11-1 presents a block diagram of the proposed approach in terms of
the main tasks: feature value generation (Section 11.1.4), system extraction, and system
evaluation (Section 11.1.7). The Bristol image database [Campbell, Thomas and
Troscianko 1997; Mackeown et al. 1994] is chosen as a representative dataset in order
to evaluate the proposed approach and is presented in Section 11.1.3. The results
presented here are limited to object recognition for road and not road regions in

images. In Section 11.1.8, the results obtained with Cartesian granule feature models
are compared with standard machine learning approaches. Finally, Section 11.1.9
finishes with some specific conclusions for the vision problem, while more general
conclusions about knowledge discovery using Cartesian granule features are presented
in Section 11.5.

Figure 11-1: Three stages in classifier generation for the vision problem: stage 1 -
feature value generation, stage 2 - system extraction, stage 3 - system evaluation. Note
that these stages are iterative.

11.1.3 Vision problem description


In order to recognise objects in images, the problem was decomposed into two stages:
image segmentation and object recognition. The first stage automatically segmented the
images into regions and then generated feature values that described each region (see
Section 11.1.5). Various features were considered and the feature selection step was
revisited on a number of occasions during the knowledge discovery process. In order to
reduce the complexity of the learning task, the original feature set was then reduced to a
more manageable size using the filter feature selection algorithm introduced in Section
11.1.6.1. Finally, example regions, expressed in terms of their attribute values and
classification value, were used to train (additive Cartesian granule feature) classifiers
using the G_DACG constructive induction algorithm. The induced classifiers were then
used to classify new image regions.

11.1.4 Vision dataset


The Bristol image database [Campbell, Thomas and Troscianko 1997; Mackeown et al.
1994] consists of over 350 colour images of a wide range of urban and rural scenes.
Figure 11-2 depicts a typical urban scene in this database. The images were originally
hand-segmented and labelled, thus providing a ground truth about the image regions
[Mackeown et al. 1994]. Figure 11-3 illustrates the result of hand segmentation when
applied to the image in Figure 11-2. Subsequent work has led to the automatic
segmentation of the images using various techniques: Campbell et al. [Campbell,
Thomas and Troscianko 1997] illustrate that a k-means approach can effectively
segment images; in [Campbell, Thomas and Troscianko 1997], a self organising feature
map was shown to provide marginally better image segmentation, based upon a cross
correlation with the ground truth regions, than the k-means approach for this database,
when various pixel properties such as colour and texture were considered; other
segmentation approaches were also presented. In this work, the k-means algorithm is
used to automatically segment both the training and testing images. The k-means
segmentation algorithm is computationally more efficient than the self organising
feature map approach presented by Campbell et al. [Campbell, Thomas and Troscianko
1997], and it has only one parameter, k, which is set to 4 for this work.
This k value could be determined automatically through cross-correlating the
segmented regions with the hand-segmented regions. For the scenes in the Bristol
image database, this automatic segmentation generally results in 100-150 regions (of
greater than 80 pixels in size) per image being generated when intensity is used. The k-
means segmented regions are classified based upon a correlation with the hand
segmented and labelled regions (i.e. each automatically segmented region gets the label
of the ideal region with the greatest overlap), thus facilitating supervised learning.
Figure 11-4 depicts the k-means segmentation of the image in Figure 11-2.
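The segmentation and labelling steps described above can be sketched roughly as follows, assuming intensity-only k-means with k = 4, 4-connected components as regions, an 80-pixel minimum region size, and region labels taken from the hand-segmented ground truth with the greatest pixel overlap. The function name, the toy image and the exact connectivity are illustrative assumptions.

import numpy as np
from scipy import ndimage
from sklearn.cluster import KMeans

def segment_and_label(intensity, ground_truth, k=4, min_size=80):
    """intensity: HxW float image; ground_truth: HxW integer label image."""
    h, w = intensity.shape
    clusters = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(
        intensity.reshape(-1, 1)).reshape(h, w)
    regions = []
    for c in range(k):
        labelled, n = ndimage.label(clusters == c)      # connected components
        for r in range(1, n + 1):
            mask = labelled == r
            if mask.sum() < min_size:
                continue
            # label = ground-truth class with the greatest pixel overlap
            overlap = np.bincount(ground_truth[mask].ravel())
            regions.append((mask, int(np.argmax(overlap))))
    return regions

# toy example: a synthetic 64x64 image with four intensity bands
img = np.zeros((64, 64))
img[:, 16:32], img[:, 32:48], img[:, 48:] = 0.33, 0.66, 1.0
gt = (img > 0.5).astype(int)                            # 1 = "road", say
print([(int(m.sum()), lbl) for m, lbl in segment_and_label(img, gt)])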

Figure 11-2: Typical image of an outdoor scene in the Bristol image database.

Figure 11-3: The hand segmentation of the image in Figure 11-2.

Figure 11-4: The k-means segmentation of the image in Figure 11-2.

11.1.5 Description of region features


Each segmented region can be described using a variety of features such as colour,
texture and shape, some of which have been inspired by psychophysical models. A set
of over sixty features for each image region were considered in the road classification
problem. In this section, an overview of the selected features is provided.

11.1.5.1 Colour features


The first three features used are colour related. The luminance of a pixel at position (x,

y) is defined as:

Luminance = (3R(x,y) + 6G(x,y) + B(x,y)) / 10

where R(x,y), G(x,y) and B(x,y) are the colours of the pixel, scaled in [0, 1]. The
coefficients approximate the contribution of the three colour separations to luminance.
To achieve a 3D psychophysically plausible definition of colour [Glassner 1995; Valois
and Valois 1993], the features R-G and Y-B are also considered and are defined as
follows:

R-G = (R(x,y) - G(x,y) + 1) / 2

and

Y-B = (R(x,y) + G(x,y) - 2B(x,y) + 2) / 4

These are the opponent red/green and yellow/blue colour difference signals. The region
value for each colour feature is taken as the average over the corresponding region pixel
values.
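The colour features just defined can be computed for a region as in the following short sketch, which evaluates the per-pixel luminance, R-G and Y-B signals and averages them over a boolean region mask; the image and mask below are toy stand-ins.

import numpy as np

def colour_features(rgb, mask):
    """rgb: HxWx3 array with channels in [0, 1]; mask: HxW boolean region mask."""
    R, G, B = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    luminance = (3 * R + 6 * G + B) / 10
    r_minus_g = (R - G + 1) / 2
    y_minus_b = (R + G - 2 * B + 2) / 4
    # region value = average of the per-pixel values over the region
    return {name: float(chan[mask].mean())
            for name, chan in [("Luminance", luminance),
                               ("R-G", r_minus_g), ("Y-B", y_minus_b)]}

rgb = np.random.default_rng(1).random((32, 32, 3))      # toy image
mask = np.zeros((32, 32), dtype=bool); mask[8:24, 8:24] = True
print(colour_features(rgb, mask))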

11.1.5.2 Location, size and orientation features


Size corresponds to the number of pixels in the regions. Orientation is expressed as the
sine and cosine of the angle of the principal axis. The location of the region is
expressed as the X and Y co-ordinates of the region centroid.

11.1.5.3 Shape features


The next subset of features corresponds to a transformed description of the region
boundary. In order to generate a compact description of a region's boundary, the region
is firstly rotated so that its principal axis of inertia is horizontal. Subsequently, the
boundary is mapped onto a one-dimensional length vector of (uniformly placed) radii
emanating from the centroid of the region. Currently 32 radii are used, approximating
the region shape as a 32-sided polygon (see Figure 11-5 for an example). This vector of
radii lengths is then normalised so that the longest radius is of length one. This
description provides a very robust representation of shape, making it invariant to
translation, scaling, and rotation in the image. In order to reduce the complexity of the
shape description, it is transformed using principal components analysis (PCA) [Jolliffe
1986]. This approach to shape complexity reduction has been successfully used for
flexible template matching [Cootes and Taylor 1995; Cootes et al. 1992] and for region
classification [Campbell, Thomas and Troscianko 1997]. Consequently, only eight of
the resulting transformed variables (that best explain variance) are chosen as a
relatively accurate description of the region boundaries. Since the (transformed) shape
features serve as only polygonal approximations of the original region, the shape
features are supplemented with three measures of the approximation error. The shape
error features correspond to the following: the difference in size (measured in pixels)

between the actual region and the approximated region (the 32-sided polygon); and the
difference in size between the actual region and the approximated region generated using
the PCA eigenvectors. All size differences are normalised by the actual size
of the region. These features provide added discrimination between the polygonal
approximation and all possible region boundaries that can lead to this polygonal
approximation.
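A rough sketch of the shape description follows: the region is centred and rotated so that its principal axis is horizontal, 32 radii are approximated by taking the farthest region pixel in each angular bin, the vector is normalised by the longest radius, and a collection of such vectors is compressed to eight principal modes with PCA. The angular-bin approximation and the toy elliptical regions are assumptions made purely for illustration.

import numpy as np

def radii_descriptor(mask, n_radii=32):
    ys, xs = np.nonzero(mask)
    pts = np.c_[xs, ys].astype(float)
    pts -= pts.mean(axis=0)                            # centre on the centroid
    # principal axis from the covariance of the region pixels
    _, vecs = np.linalg.eigh(np.cov(pts.T))
    major = vecs[:, -1]                                # eigenvector of largest eigenvalue
    theta = np.arctan2(major[1], major[0])
    rot = np.array([[np.cos(-theta), -np.sin(-theta)],
                    [np.sin(-theta),  np.cos(-theta)]])
    pts = pts @ rot.T                                  # principal axis now horizontal
    ang = np.mod(np.arctan2(pts[:, 1], pts[:, 0]), 2 * np.pi)
    dist = np.hypot(pts[:, 0], pts[:, 1])
    bins = (ang / (2 * np.pi) * n_radii).astype(int) % n_radii
    radii = np.zeros(n_radii)
    for b in range(n_radii):                           # farthest pixel per angular bin
        if np.any(bins == b):
            radii[b] = dist[bins == b].max()
    return radii / max(radii.max(), 1e-12)             # longest radius -> length 1

def pca_reduce(shape_vectors, n_components=8):
    X = np.asarray(shape_vectors) - np.mean(shape_vectors, axis=0)
    _, _, vt = np.linalg.svd(X, full_matrices=False)
    return X @ vt[:n_components].T                     # first eight principal modes

masks = []
for i in range(12):                                    # a few toy elliptical regions
    yy, xx = np.mgrid[:64, :64]
    masks.append(((xx - 32) / (10 + i)) ** 2 + ((yy - 32) / 6) ** 2 <= 1)
print(pca_reduce([radii_descriptor(m) for m in masks]).shape)   # (12, 8)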

11.1.5.4 Texture features


The next set of features arises from the use of a psychophysically plausible model of
texture, based upon Gabor filters. This approach to modelling texture is inspired by
various psychophysical experiments, which suggest that there are mechanisms, known
as channels, in the visual cortex of mammals that are tuned to combinations of
frequency and orientation in a narrow range [Campbell and Robson 1968; Valois,
Albreacht and Thorell 1982] for processing visual information in the early stages of the
mammalian vision system. The theory, first proposed by Campbell and Robson [Campbell
and Robson 1968], states that the visual system decomposes the retinal image into a
number of filtered images, each of which contains intensity variations over a narrow
range of frequency and orientation. The channels that give rise to these filtered images
correspond very naturally to band-pass filters, which can be characterised by 2D Gabor
fllters [Bovlik, Clark and Geisler 1990] or Gaussian derivative models [Malik and
Perona 1990]. From a computational modelling perspective this characterisation of
texture has several attractive properties, such as optimal localisation, or resolution, in
both the spatial domain, fix, y), and the frequency domain, F(u, v) [Daugman 1985].
This multi-channelling and multi-resolution flltering approach to texture analysis, via
Gabor fllters, have become widely used for image segmentation [Campbell, Thomas
and Troscianko 1997; Jain, Ratha and Lakshmanan 1997] and in region classifIcation
[Caelli and Reye 1993; Campbell et al. 1997].


Figure 11-5: Region shape description: (a) A typical region boundary; (b) radii from
the region's centroid to its boundary; (c) polygon approximation (dashed line) overlaid
on the original region boundary.

For the purposes of this work, the channels are represented with a bank of Gabor fllters,
which are designed to sample the entire frequency domain of an image, by varying the
shape, bandwidth, centre frequency and orientation parameters. In the spatial domain a
Gabor filter takes the form of a complex sinusoidal grating oriented in a particular
direction, modulated by a two-dimensional Gaussian. The convolution filter located at
(x0, y0), with centre frequency w0, orientation with respect to the x-axis θ0, and scales of

the Gaussian's major and minor axes, σx and σy, is defined in [Bovik, Clark and
Geisler 1990]. The filter has a modulation of (u0, v0) such that w0 = sqrt(u0^2 + v0^2),
and the orientation of the filter is θ0 = tan^-1(v0/u0).

Thus, each Gabor filter is tuned to detect only a specific local sinusoidal pattern of
frequency w0, oriented at angle θ0 in the image plane. The frequency and orientation
selective properties of a Gabor filter are more explicit in its frequency domain
representation, where it corresponds to a Gaussian bandpass filter, centred at a
distance w0 from the origin, with its minor axis at an angle θ0 to the u-axis.

In the experiments described here, a variety of Gabor filters were considered: regular,
angular and isotropic. Regular Gabor filters correspond to Gabor filters as described
above. Thirty-two Gabor filters were used, positioned at the centre frequencies of 2, 4,
8, 16, 32, 64, 128 and 256 and at orientations of 0°, 45°, 90° and 135°. Angular Gabor
filters consider texture as a function of angle while ignoring frequency, and correspond
to the mean response magnitude of all filters at a centre angle. In order to reduce
complexity, the angular Gabor filter responses are reduced to (1) an integer value
indicating which angular Gabor filter provided the highest response, and (2) the sine
and cosine of the angle of the highest-response angular Gabor filter. Isotropic filters, on
the other hand, view texture as a function of frequency while ignoring the orientation,
and correspond to the mean response magnitude of all filters at a centre frequency. The
texture measure, in the case of all three filter types, for a pixel in the spatial domain
corresponds to the magnitude of the complex filter output. The region value for each
texture feature is taken as the average over the corresponding region pixel values.
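The Gabor texture measure can be sketched as follows: a complex sinusoid at centre frequency f0 and orientation θ0 is modulated by a two-dimensional Gaussian, the image is convolved with this kernel, and the texture value of a region is the mean magnitude of the complex response over its pixels. The parameterisation in the text follows Bovik et al.; the isotropic Gaussian scale and kernel size used below are assumptions for illustration.

import numpy as np
from scipy.signal import fftconvolve

def gabor_kernel(f0, theta0, sigma, size=31):
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1].astype(float)
    xr = x * np.cos(theta0) + y * np.sin(theta0)        # rotated coordinates
    yr = -x * np.sin(theta0) + y * np.cos(theta0)
    envelope = np.exp(-0.5 * (xr ** 2 + yr ** 2) / sigma ** 2)
    carrier = np.exp(2j * np.pi * f0 * xr)              # complex sinusoid along xr
    return envelope * carrier

def gabor_texture(image, mask, f0, theta0, sigma=4.0):
    response = fftconvolve(image, gabor_kernel(f0, theta0, sigma), mode="same")
    return float(np.abs(response)[mask].mean())         # magnitude, region mean

img = np.zeros((64, 64))
img[:, ::4] = 1.0                                       # vertical stripe texture, period 4
mask = np.ones_like(img, dtype=bool)
for deg in (0, 45, 90, 135):
    print(deg, round(gabor_texture(img, mask, f0=0.25, theta0=np.radians(deg)), 3))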

11.1.6 Region datasets


Eighty images of typical outdoor rural scenes were selected from the Bristol image
database. Subsequently, these images were segmented into road and non-road regions
using the k-means segmentation algorithm, where k is set to 4. This resulted in 13,628
regions being generated. Feature values were subsequently generated for each region
feature. Non-overlapping training, validation and test sets of regions were subsequently
generated in a class-wise manner as follows: 70% of data allocated to training, 15% to
validation and 15% to testing. Table 11-1 gives a sample-count breakdown for each
class. In order to lessen the complexity of the learning task, the original feature set is
reduced to a more manageable size using a filter feature selection algorithm introduced
subsequently in Section 11.1.6.1. In this case, k has been set to 10. Table 11-2
summarises the original features as described in Section 11.1.5, while Table 11-3
describes the features that have been selected, using the proposed filter feature selection
algorithm, for the subsequent step of induction.
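The class-wise 70/15/15 split described above can be sketched as follows; the printed counts approximately reproduce those of Table 11-1 (exact counts depend on rounding), and the function name is illustrative.

import numpy as np

def stratified_split(y, fractions=(0.70, 0.15, 0.15), seed=0):
    rng = np.random.default_rng(seed)
    parts = ([], [], [])
    for cls in np.unique(y):
        idx = rng.permutation(np.flatnonzero(y == cls))
        n_train = int(round(fractions[0] * len(idx)))
        n_valid = int(round(fractions[1] * len(idx)))
        for part, chunk in zip(parts, (idx[:n_train],
                                       idx[n_train:n_train + n_valid],
                                       idx[n_train + n_valid:])):
            part.append(chunk)
    return [np.concatenate(p) for p in parts]           # train, valid, test indices

y = np.array([0] * 11974 + [1] * 1654)                  # Not-Road / Road region counts
train_idx, valid_idx, test_idx = stratified_split(y)
print(len(train_idx), len(valid_idx), len(test_idx))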

Table 11-1: Object classifications for each region and corresponding sample counts.

Class No.   Class      # Train examples   # Validation examples   # Test examples
1           Not-Road   8381               1796                    1797
2           Road       1157               248                     249
TOTAL       13628      9538               2044                    2046

Table 11-2: Summary of features generated for each region.

63 Original FEATURES
No.          Features
0            Luminance
1            R-G
2            Y-B
3            Size
4, 5         Centroid (X, Y)
6, 7         Orientation
8-15         Shape: principal modes
16, 17, 18   Shape error
19-26        Isotropic Gabor filters at the 8 centre frequencies (2, 4, ..., 256)
27, 28       Largest Gabor directional response in sine and cosine
29-61        Regular Gabor filters at the 8 centre frequencies and 4 orientations
             (0°, 45°, 90°, 135°)
62           Direction of largest texture response

Table 11-3: Selected features for each region that are considered for learning.

10 Selected FEATURES
No.   Features
0     Luminance
1     R-G
2     Y-B
3     X
4     Y
5     Orientation 1
6     Orientation 2
7     Shape 1 (principal mode)
8     Texture G128 - high frequency, isotropic
9     Texture G256 - high frequency, isotropic

11.1.6.1 A filter feature selection algorithm using neural networks


One of the central issues of inductive learning techniques concerns the selection of
useful features. Although most learning approaches provide some mechanism for
feature selection or assign them degrees of importance, the complexity of the learning
task can be greatly reduced, especially in the case of high-dimensional problem
domains, by performing feature selection prior to induction. Such selection algorithms
are known as filter algorithms [Kohavi and John 1997]. Amongst the more commonly
used approaches to filter feature selection algorithms are RELIEF [Kira and Rendell
1992] and FOCUS [Almuallim and Dietterich 1991] and their extensions. These
approaches, while aiding in reducing complexity, have also been shown to improve
generalisation for a variety of induction approaches such as decision trees, nearest
neighbours and naïve Bayesian classifiers [Blum and Langley 1997]. For instance,
RELIEF samples training instances randomly, summing a measure of the relevance of a
particular attribute across each of the training instances. The relevance measure used is
based upon the difference between the selected instance and k nearest instances of the
same class and k nearest instances in the other classes ("near-hit" and "near-miss")
[Kononenko and Hong 1997]. REIGN [Bastian 1995] is another example of a filtering
technique, which relies on the use of a feed forward neural network, with various
feature subsets, combined with a hill climbing search strategy to determine the features
set that should subsequently be used by a fuzzy induction algorithm.

Here, a filter feature selection algorithm is proposed based upon neural networks. The
proposed approach, apart from being computationally very efficient (requiring a single
trained network), benefits from relying on widely available software and well
understood learning algorithms; neural networks. Furthermore, neural networks can
generally handle high dimensional problems and, contrary to their black-box nature, can
provide very useful indicators, in a behavioural manner, on the appropriateness of
deploying various features in modelling a problem domain. The approach, while relying
on (and therefore being vulnerable to) an accuracy measure whose inductive bias differs
entirely from that of the induction method planned for use with the selected features,
benefits from maintaining the original data distribution, albeit in a scrambled form.

Feature filter algorithm


The proposed filtering technique begins by training a single feed forward neural
network (using, in this case, the scaled conjugate gradient algorithm [Moller 1993]),
TrainNN, using all n problem features. Feature selection is then based upon iteratively
testing the trained neural network as follows:

Repeat
    1. Select a feature j.
    2. Scramble the values of feature j in the test dataset, i.e. the new test
       dataset consists of n-1 features with their original values and one
       feature whose values are randomly sampled from the original feature
       values.
    3. Test the trained network TrainNN using the new test dataset.
Until all features have been processed

Subsequently, the k features that give the most degraded performance are selected as
the reduced feature set that will be considered during induction.
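The filter can be sketched as below, with a scikit-learn multi-layer perceptron standing in for the SNNS network trained with scaled conjugate gradient (an assumption made purely to keep the sketch self-contained); the synthetic data, the network size and the value of k are illustrative only.

import numpy as np
from sklearn.neural_network import MLPClassifier

def scramble_filter(X_train, y_train, X_test, y_test, k, seed=0):
    rng = np.random.default_rng(seed)
    net = MLPClassifier(hidden_layer_sizes=(20,), max_iter=2000,
                        random_state=seed).fit(X_train, y_train)
    baseline = net.score(X_test, y_test)
    degradation = []
    for j in range(X_train.shape[1]):
        X_scrambled = X_test.copy()
        X_scrambled[:, j] = rng.permutation(X_scrambled[:, j])          # step 2
        degradation.append(baseline - net.score(X_scrambled, y_test))   # step 3
    return np.argsort(degradation)[::-1][:k]            # k most degrading features

rng = np.random.default_rng(1)
X = rng.normal(size=(400, 6))
y = (X[:, 0] + 2 * X[:, 3] > 0).astype(int)             # only features 0 and 3 matter
print(scramble_filter(X[:300], y[:300], X[300:], y[300:], k=2))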

11.1.7 ACGF modelling of the vision problem


For the purposes of this work the objects that can make up an image were decomposed
into two generic classes: road and not-road. The G_DACG constructive induction
algorithm was used to discover additive Cartesian granule feature classifiers (that
classified image regions into road and not-road) using the selected features and
example regions. Table 11-4 summarises the region classification problem from a
G_DACG constructive induction perspective. The reduced feature set of ten base
features was considered and Cartesian granule features of dimensionality up to three
with granularity ranges of [2, 12] were considered (while parsimony was promoted),
thus yielding a search space of over 500,000 nodes.

Table 11-4: G_DACG parameter tableau for the region classification problem.

Objective              Discover an ACGF model that classifies road regions.
Terminal Set           f0, f1, f2, f3, f4, f5, f6, f7, f8, f9
Chromosome Length      [1, 3]
Feature Granularity    [2, 12]
Function Set           CGProduct
GP Flavour             Steady state with no duplicates
GP Selection           k-tournament, k = 4
Fitness                Fitness_i = W_Dis * Discrimination_i
                                    + W_Dim * μ_smallDim(Dimensionality_i)
                                    + W_USize * μ_smallUniv(UniverseSize_i)
Standardised fitness   Same as Fitness
GP Parameters          PopSize = 120, #Generations = 50, Crossover probability = 0.7
ACGF Model Size        [1, 10]
Testing Mechanism      Holdout estimate
Dataset Size (tuples)  Train = 2314, Control = 2044, Test = 2046
Success Predicate      Around 100% classification accuracy on validation data

Table 11-5: ACGF models discovered using the G_DACG algorithm.

Dimension   Train %    Valid %    Test %     Optimised   Cartesian Granule Features
            Accuracy   Accuracy   Accuracy   Weights
1D          92         95         95.5       No          ((0 5)) ((2 5)) ((4 5))
2D          94         93.3       96.6       No          ((0 5)(2 5)) ((2 5)(4 5))
1D          92         95         95.5       Yes         ((0 5)) ((2 5)) ((4 5))
2D          93.9       93.5       96.7       Yes         ((0 5)(2 5)) ((2 5)(4 5))

The G_DACG algorithm iterated for fifty generations and at the end of each generation,
five of the best Cartesian granule features were selected from the current population.

The discovered features were then used to form additive Cartesian granule feature rule-
based models. Backward elimination was also employed, eliminating extraneous lowly
contributing features. Table 11-5 presents the results of some of the more interesting
additive Cartesian granule feature models that were discovered using G_DACG. In the
case of the models presented in Table 11-5, the models were constructed using equal
numbers of examples of Road and Not-Road for training. By equalising the example
count across classes, a slight improvement in test case accuracy (of less than 1%) was
achieved over learning from the original skewed training set.

For example, when an additive model consisting of three one-dimensional Cartesian


granule features was formed respectively over the features Luminance, Y-B (Colour
difference) and Y-Position, a classification accuracy of 95.5% (after tuning the weights)
on unseen image regions was achieved. The feature universes in this model were
linguistically partitioned using five words, which were characterised by uniformly
placed trapezoidal fuzzy sets with 50% overlap. The resulting additive rule base is
presented in Figure 11-8. The linguistic descriptions, characterised by Cartesian
granule fuzzy sets, corresponding to the luminance of the Road and Not-Road classes are
presented in close-up detail in Figure 11-6. Notice that in the additive rule model in
Figure 11-8 the Luminance feature receives a lower weight than the other features
involved in the decision making process. This is mainly because the Luminance
linguistic summaries do not provide as good a separation of concepts as the other
features, such as the Y-B feature. Figure 11-7 depicts a Java applet screendump that
illustrates the results of applying this ACGF model to a k-means segmented image. The
results are qualitatively very good from a classification perspective; however, the low-
level k-means and region-growing segmentation process has under-segmented parts of
the image, thus leading to some areas of the image being misclassified. An additive
Cartesian granule model composed of two two-dimensional features gives a marginal
improvement (to 96.7%) over the one-dimensional model (see Table 11-5 for details).
The test confusion matrix for this model is presented in Table 11-6.

Road:     {vLow: 0.29, low: 1, medium: 0.86, high: 0.33, vHigh: 0.056}
NotRoad:  {vLow: 0.8, low: 1, medium: 0.39, high: 0.09, vHigh: 0.036}

Figure 11-6: A linguistic summary, in the form of Cartesian granule fuzzy sets, of
luminance for the Road and NotRoad classes.

Figure 11-7: Screendump of a Java applet that displays the original image (top left
quadrant), the k-means segmented image (top right quadrant) and the results of region
classification using a rule-based ACGF model (bottom left quadrant). The regions
classified as road are highlighted in grey and the non-road regions are displayed in
black.

Table 11-6: The confusion matrix generated by the discovered 2D optimised model
detailed in Table 11-5.

Actual \ Predicted   NotRoad   Road   Total   Class % Accuracy
NotRoad              1767      30     1797    98.3
Road                 39        210    249     84.3

((Classification of Object is Road)
 (Evlog identityFilter
   (Luminance of Object is [Road luminance fuzzy set])      0.2
   (Y-B of Object is [Road Y-B fuzzy set])                  0.48
   (Y_Position of Object is [Road Y-position fuzzy set])    0.31 )) : ((1 1)(0 0))

((Classification of Object is NotRoad)
 (Evlog identityFilter
   (Luminance of Object is [NotRoad luminance fuzzy set])   0.21
   (Y-B of Object is [NotRoad Y-B fuzzy set])               0.5
   (Y_Position of Object is [NotRoad Y-position fuzzy set]) 0.29 )) : ((1 1)(0 0))

Figure 11-8: An additive Cartesian granule feature model for road classification (the
Cartesian granule fuzzy sets, shown graphically in the original figure, are indicated here
by bracketed placeholders).
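Evaluating the additive model of Figure 11-8 can be sketched as follows. The luminance summaries and the rule weights are those given above; the class fuzzy sets for Y-B and Y-Position appear only graphically in the figure, so the corresponding condition supports below are hypothetical placeholders, and reading the class fuzzy-set membership of the activated word is only a rough stand-in for the full Fril semantic unification.

from math import isclose

# Linguistic summaries of luminance (Figure 11-6); other features' supports are
# hypothetical placeholders for illustration only.
LUMINANCE = {"Road":    {"vLow": 0.29, "low": 1.0, "medium": 0.86, "high": 0.33, "vHigh": 0.056},
             "NotRoad": {"vLow": 0.80, "low": 1.0, "medium": 0.39, "high": 0.09, "vHigh": 0.036}}
WEIGHTS = {"Road":    {"Luminance": 0.20, "Y-B": 0.48, "Y-Position": 0.31},
           "NotRoad": {"Luminance": 0.21, "Y-B": 0.50, "Y-Position": 0.29}}

def evlog_support(condition_supports, weights):
    """Evidential logic body support: weighted sum of condition supports;
    the ((1 1)(0 0)) support pair passes this value through unchanged."""
    assert isclose(sum(weights.values()), 1.0, abs_tol=0.02)
    return sum(weights[f] * condition_supports[f] for f in weights)

# Suppose a region's luminance fully activates the word "low", and the
# (placeholder) supports of its Y-B and Y-Position conditions are as follows.
condition_supports = {
    "Road":    {"Luminance": LUMINANCE["Road"]["low"],    "Y-B": 0.9, "Y-Position": 0.7},
    "NotRoad": {"Luminance": LUMINANCE["NotRoad"]["low"], "Y-B": 0.2, "Y-Position": 0.4},
}
scores = {cls: evlog_support(condition_supports[cls], WEIGHTS[cls])
          for cls in ("Road", "NotRoad")}
print(scores, "->", max(scores, key=scores.get))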

11.1.8 Vision problem results comparison


The results obtained when additive Cartesian granule feature (ACGF) modelling was
applied to the region classification problem were compared with those achieved using
other standard induction approaches such as neural nets, naïve Bayes, and various
decision tree approaches. The same data and reduced base feature set were used to
compare ACGF modelling with these standard learning techniques. Table 11-7
summarises the results of various modelling approaches that were used on the road
classification problem. It indicates the accuracy of the resulting models on the test
dataset and also the number of domain features that were used in generating these
models.

11.1.9 Vision problem conclusions


A new approach to object recognition, based upon a Cartesian granule feature classifier,
has been proposed that facilitates the transition from recognition to understanding
(discussed in more detail in Section 11.5). The approach was illustrated on a road
classification problem, yielding high levels of accuracy (97%) and very understandable
models. The approach, when compared with decision tree approaches, naïve Bayes and
neural networks, provided simpler models with better accuracy, although taking a little
longer to discover. The extra discovery time needed is mainly due to the search for a
transparent and accurate model. Envisioned applications include content based image
retrieval systems (CBIR). CBIR is an area which relies heavily on human-computer
interaction, where interaction requires understanding, and thus would greatly benefit
from the glassbox approach proposed here. Other potential applications of the proposed

approach include autonomous vehicle navigation systems, medical image analysis, and
landmine detection.

Table 11-7: Comparison of results obtained using a variety of machine learning
techniques on the road classification problem.

Approach                                                   # Features used   % Accuracy
Additive Cartesian granule feature model                   3                 96.7
Naïve Bayes [Good 1965]                                    10                96.2
Oblique decision trees [Murphy, Kasif and Salzberg 1994]   10                94.7
Feed forward neural net [Rumelhart, Hinton and             10                97
Williams 1986]
ID3 or C4.5 [Quinlan 1993]                                 10                92.75

11.2 MODELLING PIMA DIABETES DETECTION PROBLEM

The problem posed here is to discover patterns that predict whether a patient would test
positive or negative for diabetes according to the World Health Organisation criteria
given a number of physiological measurements and medical test results. The dataset
was originally donated by Vincent Sigillito, Applied Physics Laboratory, Johns Hopkins
University, Laurel, MD 20707 and was constructed by a constrained selection from a
larger database held by the National Institute of Diabetes and Digestive and Kidney
Diseases [Smith et al. 1988]. It is publicly available from the machine learning
repository at UCI [Merz and Murphy 1996]. All the patients represented in this dataset
are females at least 21 years old of Pima Indian heritage living near Phoenix, Arizona,
USA. There are eight input features, and one output or dependent feature, the diabetes
diagnosis, which is discrete, taking one of two values: "positive for diabetes" or "negative
for diabetes". These input-output features and their corresponding feature numbers
(used for convenience) are listed in Table 11-8. There are 500 positive examples and
268 negative examples.

11.2.1 ACGF modelling of Pima diabetes problem


The Pima diabetes dataset of 768 tuples was split class-wise, approximately as follows:
60% of data allocated to training; 15% to validation; and 25% to testing. The G_DACG
constructive induction algorithm was subsequently applied to the Pima diabetes
problem. Table 11-9 summarises the Pima diabetes detection problem from G_DACG
constructive induction perspective. All eight base features were considered and
Cartesian granule features of dimensionality up to five with granularity ranges of [2,
12] were considered (while parsimony was promoted in the form of the fitness function
used) thus yielding a multi-million node search space. The G_DACG algorithm iterated
for fifty generations (or if the stopping criterion was satisfied it halted earlier,
arbitrarily set at 90% accuracy) and at the end of each generation five of the best

Cartesian granule features were selected from the current population to form the best-
of-generation model.

Table 11-8: Base features and their corresponding feature numbers for the Pima
diabetes problem.

No.   Feature
0     Number of times pregnant
1     Plasma glucose concentration in an oral glucose tolerance test
2     Diastolic blood pressure (mm Hg)
3     Triceps skin fold thickness (mm)
4     2-hour serum insulin (mu U/ml)
5     Body mass index (kg/m^2)
6     Diabetes pedigree function
7     Age (years)
8     Classification

Table 11-9: G_DACG parameter tableau for the Pima diabetes detection problem.

Objective              Discover an additive CG feature model which classifies a
                       patient as having diabetes (Positive) or not (Negative)
                       according to the World Health Organisation criteria, given
                       a number of physiological measurements and medical test
                       results
Terminal Set           f0, f1, f2, f3, f4, f5, f6, f7
Chromosome Length      [1, 5]
Feature Granularity    [2, 12]
Function Set           CGProduct
GP Flavour             Steady state with no duplicates
GP Selection           k-tournament, k = 4
Fitness                Fitness_i = W_Dis * Discrimination_i
                                    + W_Dim * μ_smallDim(Dimensionality_i)
                                    + W_USize * μ_smallUniv(UniverseSize_i)
Standardised fitness   Same as Fitness
GP Parameters          PopSize = 100, #Generations = 30, Crossover probability = 0.7
ACGF Model Size        [1, 5]
Testing Mechanism      Holdout estimate
Dataset Size (tuples)  Train = 460, Control = 116, Test = 192
Success Predicate      Around 90% classification accuracy on test data

The best discovered ACGF model for the diabetes problem was generated by taking the
five best Cartesian granule features that were visited during the genetic search phase.
Table 11-10 shows the results of the backward elimination process that was taken to
arrive at this model. During the genetic search process the granule characterisations
were set to trapezoidal fuzzy sets with 50% overlap. However, after further
investigation of the best model, a trapezoidal fuzzy set with 70% overlap was
determined to be the best granule characterisation. The best discovered model from
both a model accuracy and simplicity perspective consists of two Cartesian granule
features, yielding a model accuracy on the test data of 79.7% (see Table 11-10). An
evidential logic rule corresponding to the positive class and model filters are presented
in Figure 11-9. The negative class rule filter in this case is more disjunctive or
optimistic in nature than its positive counterpart. This optimism may arise from the fact
that a single feature may be adequate to model this class. Models with other granule
characterisations give similar or slightly higher accuracies but require more Cartesian
granule features. For example, a model with triangular fuzzy set granule
characterisations gives an accuracy of 79.18% but requires all five Cartesian granule
features, resulting in a rather complex model.
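The trapezoidal granule characterisations referred to above can be illustrated by the following sketch, which builds a uniform partition of a feature universe in which neighbouring trapezoids overlap by a given fraction of the granule width; this parameterisation is one plausible reading of the 50% and 70% overlaps used here, not the exact construction employed in the book.

def trapezoidal_partition(lo, hi, k, overlap=0.7):
    """Return k trapezoidal membership functions uniformly covering [lo, hi].

    Each granule has width w = (hi - lo) / k; the sloping shoulders of
    neighbouring granules overlap by overlap * w.
    """
    w = (hi - lo) / float(k)
    half_slope = overlap * w / 2.0
    fuzzy_sets = []
    for i in range(k):
        a = lo + i * w - half_slope          # left foot
        b = lo + i * w + half_slope          # start of core
        c = lo + (i + 1) * w - half_slope    # end of core
        d = lo + (i + 1) * w + half_slope    # right foot

        def mu(x, a=a, b=b, c=c, d=d):
            if b <= x <= c:
                return 1.0
            if x <= a or x >= d:
                return 0.0
            return (x - a) / (b - a) if x < b else (d - x) / (d - c)

        fuzzy_sets.append(mu)
    return fuzzy_sets

Each returned function maps a feature value to its membership in the corresponding linguistic granule.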

Table 11-10: Model accuracies before and after filter optimisation for various ACGF
models (see Table 11-11 for feature key). The underlying granule characterisations are
trapezoidal fuzzy sets with 70% overlap.
Before Filter Optimisation After Filter Optimisation
# of CG Features Train % Valid % Test % Train % Valid % Test %
1 76.52 69.83 66.15 81.96 75 74.5
2 76.04 73.3 69.79 82.17 79.31 79.69
3 78.04 77.59 73.96 82.17 78.45 77.08
4 78.48 76.72 75.5 79.57 80.17 77.08
5 77.83 75 75 81.74 78.45 78.12

Table 11-11: The Cartesian granule feature sets used in the ACGF models presented
in Table 11-10, where, for example, the Cartesian granule feature ((0 8) (1 10) (2 2) (3
12)) denotes the following: this feature consists of four base features (pregnancyCount,
glucoseConcentration, etc.) and the universe of each feature is abstracted by a linguistic
partition with the indicated granularity. For example, the universe of pregnancyCount
is partitioned with granularity 8.

# of CG Features CG Features
1 ((0 8) (1 10) (2 2) (3 12))
2 ((0 10) (1 4) (5 11) (7 3)), ((0 8) (1 10) (2 2) (3 12))
3 ((1 4) (2 2) (3 12) (5 12)), ((0 10) (1 4) (5 11) (7 3)),
((0 8) (1 10) (2 2) (3 12))
4 ((1 4) (3 10) (5 6) (6 6)), ((1 4) (2 2) (3 12) (5 12)),
((0 10) (1 4) (5 11) (7 3)), ((0 8) (1 10) (2 2) (3 12))
5 ((0 10) (1 11) (3 5) (5 3)), ((1 4) (3 10) (5 6) (6 6)),
((1 4) (2 2) (3 12) (5 12)), ((0 10) (1 4) (5 11) (7 3)),
((0 8) (1 10) (2 2) (3 12))

Figure 11-10 contains the fitness curves for this run. This figure shows, by generation,
the progress of one run of the pima diabetes problem between generations 0 and 50
using three plots: the fitness of the best-of-generation individual (Cartesian granule
feature) in the population, the fitness of the worst-of-generation individual in the
population, and the average fitness for all the individuals in the population. On the
other hand, Figure 11-11 presents the variety, by generation, of the evolutionary search
for the G_DACG run that resulted in the above model. This figure shows, by
generation, the progress of one G_DACG run on the diabetes problem between
generations 0 and 50, using two plots: the percentage of new Cartesian granule features
visited in each generation, though the curve (labelled % of Chromosomes Revisited) is
plotted from the perspective of the number of features that are revisited; and the second
curve displays the chromosome variety in the current population, but this can be
ignored here, since duplicates are not allowed within a population. The number of novel
features in each population decreases steadily over time mainly because of the
evolutionary nature (low mutation rate, and a relatively small population) of the search.

((POSITIVE_FILTER [0:0 1:1]))

((NEGATIVE_FILTER [0:0 0.78:1.0]))

((Diabetes Classification of Patient is Positive)
(evlog POSITIVE_FILTER (
(cgValue of ((pregnancyCount 10) (glucoseConcentration 4)
(bodyMassIndex 11) (Age 3))
of Patient is positiveClass) .49
(cgValue of ((pregnancyCount 8) (glucoseConcentration 10)
(bloodPressure 2) (tricepsSkinThickness 12))
of Patient is positiveClass) .51
) ) ) : ((1 1) (0 0))
Figure 11-9: An example of an additive Cartesian granule feature model for Pima
diabetes detection. This model gives 79.7% accuracy on test cases.
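The evidential logic aggregation in Figure 11-9 amounts to a filtered weighted sum of Cartesian granule feature memberships. The sketch below illustrates this reading; the piecewise-linear filter representation and all names are illustrative assumptions rather than the book's implementation.

def piecewise_linear_filter(points):
    """Build a filter f: [0, 1] -> [0, 1] from (support, truth) breakpoints,
    e.g. [(0.0, 0.0), (1.0, 1.0)] for POSITIVE_FILTER above."""
    points = sorted(points)

    def f(s):
        if s <= points[0][0]:
            return points[0][1]
        for (x0, y0), (x1, y1) in zip(points, points[1:]):
            if s <= x1:
                return y1 if x1 == x0 else y0 + (y1 - y0) * (s - x0) / (x1 - x0)
        return points[-1][1]
    return f

def evlog(filter_fn, weighted_memberships):
    """Evidential logic rule body: a weighted sum of Cartesian granule
    feature memberships passed through the rule filter."""
    support = sum(w * m for m, w in weighted_memberships)
    return filter_fn(support)

# Illustrative use with made-up membership values for one patient:
positive_filter = piecewise_linear_filter([(0.0, 0.0), (1.0, 1.0)])
memberships = [(0.8, 0.49),   # membership in the first positiveClass fuzzy set, weight 0.49
               (0.6, 0.51)]   # membership in the second positiveClass fuzzy set, weight 0.51
print(evlog(positive_filter, memberships))    # support for the Positive classification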

11.2.2 Pima diabetes problem results comparison


The Pima diabetes dataset serves as a benchmark problem in the field of machine
learning and has been used to evaluate many learning approaches. Table 11-12 compares
some of the results of the more common machine learning techniques with those of the
ACGF modelling approach.

The Pima diabetes problem is a notoriously difficult machine learning problem. Part of
this difficulty arises from the fact that the dependent output variable is really a
binarised form of another variable which itself is highly indicative of certain types of
diabetes but does not have a one-to-one correspondence with the condition of being
diabetic [Michie, Spiegelhalter and Taylor 1993a]. To date, only one other machine
learning approach has obtained an accuracy higher than 78% [Merz and Murphy 1996];
this is the mass assignment based induction of decision trees - the MATI algorithm
[Baldwin, Lawry and Martin 1997]. The discovered ACGF models yield an
equivalent accuracy of 79.7% (see Table 11-12).

[Plot: Evolution of chromosome fitness - best, average and worst fitness (0 to 1) plotted against generation (0 to 50).]

Figure 11-10: Fitness curves for Pima problem for a G_DACG run.

[Plot: Population variety within pool and overall (DB usage) - pool variety and percentage of chromosomes revisited (0 to 1) plotted against generation (0 to 50).]

Figure 11-11: Percentage of Cartesian granule features that were revisited in each
generation for the Pima diabetes problem. Pool variety is 100% (since duplicates are
not allowed).

Table 11-12: Comparison of results obtained using a variety of machine learning
techniques on the Pima diabetes detection problem.

Approach % Accuracy
Additive Cartesian granule feature model 79.7
MATI decision trees [Baldwin, Lawry and Martin 1997] 79.7
Oblique decision trees [Cristianini 1998] 78.5
Neural networks (normalised data) 78
C4.5 [Michie, Spiegelhalter and Taylor 1993b] 73
Neural networks (unnormalised data) 67
Data browser 70

11.3 MODELLING THE BOX-JENKINS GAS FURNACE PROBLEM



This application deals with the widely used benchmark problem of modelling a gas
furnace (an example of a dynamical process), which was first presented by Box and
Jenkins [Box and Jenkins 1970]. The modelled system consists of a gas furnace in
which air and methane are combined to form a mixture of gases containing CO2 (carbon
dioxide). Air fed to the furnace is kept constant, while the methane feed rate can be
varied in any desired manner. The furnace output, the CO2 concentration, is measured in
the exhaust gases at the outlet of the furnace.

The dataset here corresponds to a time series consisting of 296 successive pairs of
observations of the form (u(t), y(t)), where u(t) represents the methane gas feed rate at
time step t and y(t) represents the concentration of CO2 in the gas outlets. The
sampling time interval is nine seconds. Using a time-discrete formulation, the dynamics
of the system is represented by a relationship that links the predicted system state
y(t+1) to the previous input states u(t_i) and the previous output states y(t_i), that is,
y(t+1) is a function of the previous input and output states, i.e. y(t+1) = f(u(t_1), u(t_2),
..., u(t_n), y(t_1), y(t_2), ..., y(t_n)).

The goal of knowledge discovery here is to detect patterns in the time series data that
facilitate automatic control. After a few iterations of the knowledge discovery process,
the value of n was set to five. Consequently, ten input variables were considered and
the database reduces to 291 data tuples of the form (u(t), u(t-1), ..., u(t-4), y(t), y(t-1),
..., y(t-4), y(t+1)).
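The construction of these lagged tuples can be sketched as follows, assuming the raw series is held as a list of (u(t), y(t)) pairs; the function name is illustrative.

def make_lagged_dataset(series, n=5):
    """Turn a list of (u(t), y(t)) observations into tuples of the form
    (u(t), u(t-1), ..., u(t-n+1), y(t), y(t-1), ..., y(t-n+1), y(t+1))."""
    tuples = []
    for t in range(n - 1, len(series) - 1):
        u_lags = [series[t - k][0] for k in range(n)]    # u(t) ... u(t-4) when n = 5
        y_lags = [series[t - k][1] for k in range(n)]    # y(t) ... y(t-4) when n = 5
        tuples.append(tuple(u_lags + y_lags + [series[t + 1][1]]))
    return tuples

With 296 observations and n = 5 this yields exactly the 291 tuples mentioned above.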

11.3.1 ACGF modelling of the gas furnace problem


The same dataset was used for both training and testing for the gas furnace problem. The main
reason for this is to provide a comparison with other approaches presented in the
literature. When the G_DACG constructive induction algorithm was applied to the gas
furnace problem, all ten base features were considered and Cartesian granule features of
dimensionality up to five with granularity ranges of [2, 12] were considered (while
parsimony was promoted in the form of the fitness function used), thus yielding a
multi-million node search space. The k-tournament selection parameter k was set to 4
for this problem. The output universe was uniformly partitioned using eight triangular
fuzzy sets. The G_DACG algorithm iterated for fifty generations (or halted earlier if the
stopping criterion, arbitrarily set at a mean square error (MSE) of less than 0.05, was
satisfied). As a result of the G_DACG process, an additive Cartesian granule
feature model where each rule consists of two Cartesian granule features was deemed to
be the most suitable model. The model consists of eight rules, and a trapezoidal fuzzy
set with 50% overlap was determined to be the best input feature granule
characterisation. The performance accuracy of the model was measured based upon the
mean square error (MSE) between the actual data outputs (y_i) and the model outputs
(ŷ_i), calculated as follows:
MSE = (1/N) Σ_{i=1}^{N} (y_i − ŷ_i)²
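As a quick illustration of this error measure (assuming the actual and predicted outputs are held in two equal-length lists):

def mean_squared_error(actual, predicted):
    """MSE between the actual outputs y_i and the model outputs y_hat_i."""
    assert len(actual) == len(predicted)
    return sum((y - y_hat) ** 2 for y, y_hat in zip(actual, predicted)) / len(actual)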

The discovered model yields a relatively low MSE of 0.128. In Figure 11-12, the
model performance is compared with the original data. The horizontal axis corresponds
to time while the vertical axis denotes the furnace output, the CO2 concentration. A
sample rule in this model is presented in Figure 11-13, describing a fuzzy class in the
output space, namely Small. Increasing the granularity of the output universe (and
consequently, the number of rules) can lead to models with lower MSE; however, this
also leads to more complex models. For example, if the granularity of the output
universe is increased to ten, the MSE of the model drops to 0.11.

[Plot: CO2 concentration (approximately 45 to 61) against time - actual data versus ACGF model predictions.]

Figure 11-12: ACGF model predictions versus the actual data for the gas furnace
problem.

11.3.2 Gas furnace results comparison


The gas furnace problem serves as a benchmark problem in the field of system
identification and has been used to evaluate many learning approaches. Table 11-13 compares
the results of common statistical and fuzzy based techniques with the ACGF modelling
approach. Overall, the ACGF modelling approach outperforms the other fuzzy and
statistical based approaches from an accuracy perspective, apart from the Takagi-
Sugeno approach. The Takagi-Sugeno linear model gives the best performance
accuracy; however, it lacks the transparency provided by the other approaches, including
that of ACGF modelling. The models generated by the various approaches were
evaluated on the same data that was used to generate them. As a result, the reported
figures provide no information on the generalisation power of the extracted models. From a
model transparency perspective, the extracted ACGF model is relatively easy to interpret since the
extracted Cartesian granule fuzzy sets are all two dimensional in nature. The different
fuzzy approaches, apart from the G_DACG approach, listed in Table 11-13, differ
mainly in the identification algorithms used. In general, they use local hill climbing
strategies and treat the steps of input variable selection and abstraction separately,
which may subsequently result in models that are only locally optimal.

((Predicted level of CO2 at time (t+1) is Small)
(evlog identityFilter (
(cgValue of ((x(t-3) 10) (y(t) 10)) is smallClass) .49
(cgValue of ((x(t-2) 10) (y(t) 10)) is smallClass) .51 )))
Figure 11-13: An example of a rule in the ACGF model for the gas furnace problem.
This model yields an MSE of 0.128. Here identityFilter corresponds to f(s)=s.

Table 11-13: Comparison of results for the gas furnace problem.


Approach MSE
Box & Jenkins statistical approach [Box and Jenkins 1970] 0.710
Tong's fuzzy model [Tong 1980] 0.469
Pedrycz's fuzzy model [Pedrycz 1984] 0.320
Linear model [Sugeno and Yasukawa 1993] 0.193
Takagi-Sugeno linear model [Sugeno and Yasukawa 1993] 0.068
Fuzzy position gradient model [Sugeno and Yasukawa 1993] 0.190
Nakoula et al.'s fuzzy model [Nakoula, Galichet and Foulloy 1997] 0.175
Additive Cartesian granule feature model 0.128

11.4 MODELLING THE HUMAN OPERATION OF A CHEMICAL PLANT CONTROLLER

The goal of this application is to discover a model of a control engineer's actions in a
chemical plant. This problem and corresponding dataset were originally presented by
Sugeno and Yasukawa [Sugeno and Yasukawa 1993]. The chemical plant produces a
polymer through a process of monomer polymerisation. Since the start-up of the plant
is very complicated, a human operator is required to manually control the plant.

The dataset consists of 70 observations taken from actual plant operation. Each
observation consists of five input variables (see Table 11-14 for details) and an output
variable corresponding to the set point for monomer flow rate. The human operator
determines the set point for the monomer flow rate and gives this information to a PID
controller, which calculates the actual monomer flow rate for the plant.

Table 11-14: Input and output base features for chemical plant control problem.
No. Feature
0 Monomer concentration
1 Change of monomer concentration
2 Monomer flow rate
3, 4 Local temperatures inside plant
5 Set point for monomer flow rate

11.4.1 ACGF modelling of the chemical plant problem


In the case of this problem, all data tuples were considered for both training and testing.
The G_DACG constructive induction algorithm was applied to the chemical plant
problem, where all the base input features were considered and Cartesian granule
features of dimensionality up to five with granularity ranges of [2, 12] were considered.
The k-tournament selection parameter k was set to 4 for this problem. The output
universe was uniformly partitioned using eight triangular fuzzy sets. The G_DACG
algorithm iterated for fifty generations (or halted earlier if the stopping criterion,
arbitrarily set at a root mean square error (RMS) of less than 0.05%, was satisfied). As
a result of the G_DACG process, an additive Cartesian granule feature model, where
each rule consists of a single Cartesian granule feature, was deemed to be the most
suitable process controller. This model was made up of eight rules, and a trapezoidal
fuzzy set with 5% overlap was determined to be the best input feature granule
characterisation. The performance accuracy of the model was measured based upon the
root mean square error (RMS). The discovered model yields an RMS of 2%. The model
performance is compared with that of the human operator in Figure 11-15. A sample rule
in this model is presented in Figure 11-14.

((Predicted level of y is small)
(evlog identityFilter (
(cgValue of ((f0 14) (f1 14) (f2 13) (f3 5)) is smallClass) 1 )))
Figure 11-14: An example of a rule in the ACGF model for the chemical plant
problem. This model yields an RMS of 2%.

11.4.2 Chemical Plant Results Comparison


Overall, the discovered additive Cartesian granule feature model performs very well
when compared to the human operator. In the case of this problem, eight rules have
been generated to qualitatively describe the behaviour of the plant. When a neural
network is generated using the same input features as the ACGF model and with 4
hidden nodes, an RMS error of 1% is achieved.

[Plot: set point for monomer flow rate (approximately 1000 to 7000) against time - actual (human operator) versus ACGF model predictions.]

Figure 11-15: ACGF model predictions versus human operator for the chemical plant.

The discovered ACGF model has a high complexity for this problem and may be
suffering from the uniform partitioning of the input feature universes. A more efficient
and possibly lower dimensional Cartesian granule feature may result from a data
centred approach to partitioning.

11.5 DISCUSSION

This section presents a more general discussion of the results presented above,
evaluating Cartesian granule feature models on performance criteria such as transparency,
efficiency and accuracy.

11.5.1.1 Understandability and glassbox-ness


One of the primary concerns of intelligent systems is that they should be able to interact
naturally with their environment. An integral part of many domains is the human.
Consequently, the intelligent system (agent) needs to interact with the human. This can
be achieved by a variety of means and at many different levels such as a graphic display
of trend data. However, one of the most natural forms of communication (and
sometimes most effective) is through words. Considering the road classification
problem, the proposed Cartesian granule feature approach has generated a road
classification system that enlightens the user about what a road is, in terms of
luminance and other feature value descriptions. These descriptions are in terms of
words such as low and very low - generic words in this case, but these could be
assigned from a user-defined dictionary and supplemented with hedges such as very,
not so much, etc. and with connectives such as conjunctions and disjunctions.
Furthermore, the weights associated with each feature inform the user of how important
a particular feature is in the inference process. For this problem, the induced Cartesian
granule feature model facilitates a transition from a low-level object recognition task to
a high level understanding task which should enhance human computer interaction.
This simplification comes from the expression of the knowledge in a form that is
almost directly interpretable by the human user. The proposed approach, while
facilitating machine learning, may also facilitate human learning and understanding
through the generated anthropomorphic models. With regard to the other learning
approaches that were applied to the vision problem, such as the ID3 and C4.5
algorithms, the induced models, while being readable, tend to be large and, consequently,
make understanding very difficult [Shanahan et al. 2000]. In the case of neural
networks and oblique decision trees, the induced knowledge is encoded in vectors of
weights (and biases), which may prove difficult for a user to interpret and understand.
A further consequence of readability and understandability is that it will generally
increase the user's confidence in the system, and it can also enhance reliability. For
example, the user may augment the system's reliability by identifying a data deficiency
or a variable deficiency.

Regarding the other problems considered in this chapter, the resulting Cartesian granule
feature models were not as simple, but they do, however, decompose the problem
domain into lower-order dependencies between semantically related features. The
granularity of these features in most cases is high. This may result from the uniform
characterisation of each granule. Using a more data centred approach to partitioning,
such as clustering, may lead to Cartesian granule features with lower granularities.

11.5.1.2 Computational efficiency


From a classification task perspective, all approaches considered in this chapter have
similar computational requirements. On the other hand, most approaches differ on the
amount of computational effort required for learning. Learning can be split into two
subtasks: structure identification and parameter identification. Structure
identification (that is, determining a network topology in the case of neural networks,
or, in the case of ACGF modelling, discovering the Cartesian granule features) has
varying computational requirements for each of the approaches considered. These
computational requirements are commensurate with the effectiveness of the search
techniques used to determine the structure of the induced model. For example, the
decision tree approaches such as ID3 and OC1 (which learns oblique
decision trees [Murphy, Kasif and Salzburg 1994]) have low computational
requirements, which directly result from the local hill-climbing search technique used.
This search technique, while facilitating efficient structure identification, is vulnerable to
local minima. Furthermore, decision tree approaches such as ID3, in order to provide
better generalisation, require pruning [Quinlan 1993], which can prove to be expensive

in the case of bushy decision trees. On the other hand, the G_DACG constructive
induction algorithm is computationally intensive, which is due to the global, population-
based search approach used, but it avoids local minima. The determination of neural
network topologies is also computationally intensive.

From a parameter identification task perspective, again the computational requirements
vary with the approach used. In the case of ID3, no parameter identification is required.
The naive Bayes parameter identification step has low computational requirements,
since the data examples need to be processed only once in order to estimate the class
densities. Parameter identification for the OC1 algorithm, the G_DACG algorithm and
for neural networks requires the identification of weights and involves a search through
the possible weight space using various search algorithms that offer efficiency
commensurate with the effectiveness of the determined solution. Neural networks, in
general, are multi-layered, whereas oblique decision trees and additive Cartesian
granule models can be viewed as single-layered networks. Parameter identification for
all three approaches involves the identification of feature weights, but in the case of
Cartesian granule feature models, it also involves the identification of fuzzy sets (and
potentially filters). The identification of weights in both neural networks and oblique
decision trees is based on a search through weight space and is far more
computationally intensive than that of Cartesian granule feature models, where the
weights can be identified in one step - that of semantic discrimination analysis. In
addition, parameter identification for additive Cartesian granule feature models requires
determining the class Cartesian granule fuzzy sets; this involves a single pass of the
data. Consequently, the parameter identification of Cartesian granule features can be
significantly more efficient than the parameter identification of a multi-layered neural
network or oblique decision trees.
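The single-pass parameter identification referred to above can be illustrated by the following frequency-counting sketch, in which each example's granule description is accumulated into per-class counts that are then normalised into class Cartesian granule fuzzy sets. This is a simplified, crisp-counting illustration with hypothetical names, not the book's mass-assignment-based procedure.

from collections import defaultdict

def induce_class_fuzzy_sets(examples, granule_labellers):
    """One pass over (feature_vector, class) examples.

    granule_labellers is one function per base feature of the Cartesian granule
    feature, mapping a feature value to the word (granule) that best describes it.
    """
    counts = defaultdict(lambda: defaultdict(float))
    for x, cls in examples:
        granule = tuple(label(v) for label, v in zip(granule_labellers, x))
        counts[cls][granule] += 1.0
    fuzzy_sets = {}
    for cls, freqs in counts.items():
        peak = max(freqs.values())        # normalise so the modal granule has membership 1
        fuzzy_sets[cls] = {g: f / peak for g, f in freqs.items()}
    return fuzzy_sets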

In the case of additive Cartesian granule feature models, the system identification step
is not just concerned with identifying a model that provides high performance accuracy
(the goal of most other induction algorithms), but is also concerned with identifying a
model that is glassbox in nature. This issue of identifying glassbox models, while
incurring extra computational requirements, is compensated for by the identification of
accurate models that facilitate understanding.

11.5.1.3 Feature value representations


The input features for most problems considered in this chapter are continuous in
nature. The values of these features are single numbers, which in the case of some
features correspond to simple statistical measures such as the average. For example, in
the vision problem, the luminance value for a region corresponds to the average pixel
luminance value across that region. For this problem such features prove adequate in
modelling the problem, but in the case of a more difficult multi-class problem more
detailed feature values may be necessary. Single numeric values such as the measures
used here can be susceptible to noise, and generally lead to high data requirements for
learning. An alternative, and possibly promising, approach is to linguistically
summarise the pixel values using a one-dimensional Cartesian granule feature, i.e. to
generate a linguistic histogram. Other features in this problem that may benefit from
linguistic summaries include the texture features, the colour difference features, the
location feature, etc. Linguistic summaries of feature values provide more information
to discriminate amongst different classes while also combating the curse of
dimensionality [Bellman 1961].
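A linguistic histogram of this kind could be computed along the following lines, assuming a list of pixel luminance values for a region and membership functions for words such as veryLow and low (the word set and membership functions are illustrative assumptions):

def linguistic_histogram(pixel_values, word_memberships):
    """Summarise a region's pixel values as a fuzzy set over linguistic words.

    word_memberships maps each word (e.g. "veryLow", "low", ...) to a membership
    function defined over the luminance universe."""
    totals = {word: 0.0 for word in word_memberships}
    for v in pixel_values:
        for word, mu in word_memberships.items():
            totals[word] += mu(v)
    peak = max(totals.values()) or 1.0    # avoid division by zero for empty regions
    return {word: t / peak for word, t in totals.items()}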

11.5.1.4 Alternative features


Current work describes concepts in terms of their own attributes, whereas more
succinct and possibly easier to understand concept definitions can be acquired where
objects are described in terms of other objects. For example, again referring to the
vision problem, an object could be defined as similar to another object. In the case of
the induced Cartesian granule feature models, class rules can be improved by hand by
adding further conditions that a class should satisfy. For example, a reasonable
condition for the road class is that "cars should be above road", where the user
provides the condition and also a definition for above (or, alternatively, above could be
extracted from example data by taking the difference in y-position).

11.6 GENERAL CONCLUSIONS

The focus and motivation behind the new approach presented in the latter parts of this
book has been the development of a knowledge discovery process that leads to models
that are ultimately understandable not only by computers but also by experts in the
domain of application and that perform effectively. This has resulted in the
development of a new form of knowledge representation - Cartesian granule feature
models - and a corresponding constructive induction algorithm - G_DACG. The book
has highlighted that "much of the power comes not from the specific induction method,
but from proper formulation of the problems and from crafting the representation to
make learning more tractable" [Langley and Simon 1998]. Cartesian granule features
incorporated into additive models tolerate and exploit uncertainty in order to achieve
tractability and transparency on the one hand and generalisation on the other. This
approach has been demonstrated on various real world problems, attaining the goals of
understandability and effectiveness to a great extent. Overall, soft computing
approaches (including Cartesian granule feature modelling), through tolerating and
exploiting uncertainty, provide a very powerful means of attaining transparent and
effective inductive inference and will be a key player in the knowledge discovery
processes of the future.

11.7 CURRENT AND FUTURE WORK DIRECTIONS

In the age of ubiquitous computing, knowledge discovery is fast becoming an essential
part of the workplace, bringing with it new requirements. Several avenues of future
work will address these. Some have already been highlighted throughout the course of
this book; however, they are summarised here for convenience. The more general
avenues of future work are presented first, while more specific issues related to the
knowledge discovery of Cartesian granule feature models are presented subsequently.
In particular, future work in knowledge discovery will address the following:

• Incremental learning
By nature, the world is dynamic, continuously changing and evolving.
Knowledge, one of the artefacts of mankind, is no different. For example, the
concept of granny's image will change over time and for an image retrieval
system to successfully retrieve her image over time, its representation will
need to evolve with granny. This type of learning lies in the domain of
incremental learning [Utgoff 1989]. Incremental learning refers to learning
where the observations are presented one (or a few) at a time to the learning
algorithm. Incremental learning is seen a means of tackling concept drift
[Schlimmer and Granger 1986] and is becoming an important area in
knowledge discovery due to the embryonic nature of the information world.
Although the results presented in this book are the result of one-shot learning,
the proposed approaches, due to their probabilistic nature, can facilitate an
incremental approach to learning, which is the subject of current work.

• Distributed learning
To date, knowledge discovery has mainly been centralised in nature, that is,
the discovered knowledge corresponds to a single model that is determined
from a single database of examples. However, as organisations and their
database management systems become more decentralised, knowledge
discovery systems that offer decentralised model building and deployment
capabilities will become more essential and prominent. Alternatively,
individual entities such as banks may pool decision support systems, such as
fraud detection systems, thereby providing more powerful and trustworthy
support (a software parallel of human committees). Distributed knowledge
discovery can be realised by a number of techniques including bagging and
boosting. Future work will investigate the use of Cartesian granule feature
models in this distributed context. An alternative to the aforementioned
approaches to distributed knowledge discovery, in the case of Cartesian
granule feature models, is to merge the individual models into one overall
model. This is the subject of current work.

• Exploiting background knowledge


To date, the Cartesian granule feature learning algorithms, like many
approaches to knowledge discovery, have for the most part ignored
background knowledge, a valuable resource that can increase the efficiency
and quality of the knowledge discovered. Future work should investigate how
to exploit background knowledge within a Cartesian granule feature context.

• Scaling to extremely large datasets


Knowledge discovery of Cartesian granule feature models, like many
approaches to knowledge discovery, has to date been demonstrated on datasets
with thousands of training examples; however, many important datasets are
significantly larger. For example, large retail customer databases can easily
involve a terabyte or more. To provide reasonably efficient knowledge
discovery using the proposed approach (or other approaches) requires
additional research on multiple fronts including database management
systems, data visualisation and machine learning.

• Unsupervised knowledge discovery


In the recent past as the Internet has grown and become mobile, the amount of
unstructured information has increased dramatically. Unsupervised learning
approaches can help in organising this information by bringing structure to it.
Unsupervised learning algorithms have, to date, not received as much attention
as supervised learning approaches. However, this should change over the
coming years. Though the unsupervised learning of Cartesian granule feature
models was not addressed in this book, it will form an important part of future
work.

• Naive Bayes and Cartesian granule features


In this book (see Section 10.2.6), a parallel has been drawn between naive
Bayes and Cartesian granule features, highlighting that naive Bayes is a
special case of one dimensional Cartesian granule feature models under certain
conditions. Current work [Shanahan 2000] is investigating this further, making
these connections more formal and also exploiting some of the well
developed ideas in naive Bayes within a Cartesian granule feature context.

• Belief networks and Cartesian granule features


Cartesian granule feature models represent problem domains in terms of a
network of low-order semantically related features in a similar fashion to
Bayesian networks. A Cartesian granule fuzzy set can be thought of as
representing a probability distribution on granules, and as a consequence,
Bayesian networks can be represented in terms of Cartesian granule features.
Future work will focus on trying to harness the reasoning power of Bayesian
networks and the expressiveness or qualitative power of Cartesian granule
features.

11.8 SUMMARY

This chapter has described how additive Cartesian granule feature modelling has been
applied to a number of real world problems, including diabetes detection in Pima
Indians and region classification in vision understanding. The G_DACG constructive
induction algorithm was used to discover these models. The discovered models perform
very well, yielding in some cases simpler, more transparent models with accuracies
higher than other well known machine learning techniques. Various ways of extending
and improving the ACGF modelling approach were suggested, especially in the context
of the region classification problem. Some overall conclusions regarding the knowledge
discovery of Cartesian granule feature models were drawn. Several future avenues of
research were identified for knowledge discovery in general and for Cartesian granule
feature modelling in particular.

11.9 BIBLIOGRAPHY

Almuallim, H., and Dietterich, T. G. (1991). "Learning with irrelevant features." In the
proceedings of AAAI-9I, Anaheim, CA, 547-552.
Baldwin, J. F., Lawry, J., and Martin, T. P. (1997). "Mass assignment fuzzy ID3 with
applications." In the proceedings of Fuzzy Logic: Applications and Future
Directions Workshop, London, UK, 278-294.
Baldwin, J. F., Martin, T. P., and Shanahan, J. G. (1997). "Structure identification of
fuzzy Cartesian granule feature models using genetic programming." In the
proceedings of IJCAI Workshop on Fuzzy Logic in Artificial Intelligence,
Nagoya, Japan, 1-11.
Baldwin, J. F., Martin, T. P., and Shanahan, J. G. (1999). "Controlling with words
using automatically identified fuzzy Cartesian granule feature models",
International Journal of Approximate Reasoning (IJAR), 22:109-148.
Bastian, A. (1995). "Modelling and Identifying Fuzzy Systems under varying User
Knowledge", PhD Thesis, Meiji University, Tokyo.
Bellman, R. E. (1961). Adaptive Control Processes. Princeton University Press.
Blum, A. L., and Langley, P. (1997). "Selection of relevant features and examples in
machine learning", Artificial Intelligence, 97:245-271.
Bovik, A. C., Clark, M., and Geisler, W. S. (1990). "Multichannel texture analysis
using localised spatial filters", IEEE Transactions on PAMI, 12(1):55-73.
Box, G. E., and Jenkins, G. M. (1970). Time series analysis, forecasting and control.
Holden Day, San Francisco, CA.
Brooks, R. A. (1987). "Model-based three-dimensional interpretations of two-
dimensional images", In Readings in computer vision: issues, problems,
principles, and paradigms, M. A. Fischler and O. Firschein, eds., Kaufmann
Publishers, Inc., Los Altos, CA, USA, 360-370.
Caelli, T., and Reye, D. (1993). "On the classification of image regions by colour,
texture and shape", Pattern Recognition, 26(4):461-470.
Campbell, F. W., and Robson, J. G. (1968). "Application of Fourier analysis to the
visibility of gratings", Journal of Physiology, 197:551-566.
Campbell, N. W., Mackeown, W. P. J., Thomas, B. T., and Troscianko, T. (1997).
"Interpreting Image Databases by Region Classification", Pattern Recognition,
30(4):555-563.
Campbell, N. W., Thomas, B. T., and Troscianko, T. (1997). "Automatic segmentation
and classification of outdoor images using neural networks", International
Journal of Neural Systems, 8(1):137-144.
Connell, J. H., and Brady, M. (1987). "Generating and generalising models of visual
objects", Artificial Intelligence, 34:159-183.
Cootes, T. F., and Taylor, C. J. (1995). "Combining point distributions with shape
models based on finite-element analysis", Image Vision Computation,
13(5):403-409.
Cootes, T. F., Taylor, C. J., Cooper, D. H., and Graham, J. (1992). "Training models of
shape from sets of examples." In the proceedings of British Machine Vision
Conference, Leeds, UK, 9-18.
Cristianini, N. (1998). "Application of oblique decision trees to Pima diabetes
problem", Personal Communication, Department of Engineering Mathematics,
University of Bristol, UK.

Daugman, J. G. (1985). "Uncertainty relation for resolution in space, spatial frequency,
and orientation optimised by two-dimensional visual cortical filters", Journal
of Optical Soc. Am., 2(7):1160-1169.
Draper, B. A, Collins, R. T., Brolio, J., Hanson, A R., and Riseman, E. M. (1989).
''The Schema system", The International Journal of Computer Vision, 2:209-
250.
Fischler, M. A, and Firschein, 0., eds. (1987). "Readings in computer vision: issues,
problems, principles, and paradigms", Kaufmann Publishers, Inc., Los Altos,
CA, USA, 765-768.
Frawley, W. J., Piatetsky-Shapiro, G., and Matheus, C. J. (1991). "Knowledge
Discovery in Databases: An Overview", In Knowledge Discovery in
Databases, G. Piatetsky-Shapiro and W. J. Frawley, eds., AAAI Press/MIT
Press, Cambridge, Mass, USA, 1-27.
Glassner, A S. (1995). Principles of digital image synthesis. Morgan Kaufmann, San
Francisco.
Good, I. J. (1965). The estimation of probabilities: an essay on modern Bayesian
methods. M. I. T. Press.
Grimson, W. E. L., and Lozano-Perez, T. (1984). "Model-based recognition and
localization from sparse range or tactile data", Int'l Jrnl of Robotics Research,
3(3):3-35.
Jain, A K., Ratha, N. K., and Lakshmanan, S. (1997). "Object detection using Gabor
filters", Pattern Recognition, 30(2):295-309.
Jolliffe, I. T. (1986). Principal Component Analysis. Springer, New York.
Kira, K., and Rendell, L. (1992). "A practical approach to feature selection." In the
proceedings of 9th Conference in Machine Learning, Aberdeen, Scotland,
249-256.
Kohavi, R., and John, G. H. (1997). "Wrappers for feature selection", Artificial
Intelligence, 97:273-324.
Kononenko, I., and Hong, S. J. (1997). "Attribute selection for modelling", FGCS
Special Issue in Data Mining(Fall):34-55.
Kosako, A., Ralescu, A. L., and Shanahan, J. G. (1994). "Fuzzy techniques in Image
Understanding." In the proceedings of ISCIE Joint Conference for Automatic
Control, Osaka, Japan, 17-22.
Langley, P., and Simon, H. A. (1998). "Fielded Applications of machine learning", In
Machine Learning and Data Mining, R. S. Michalski, I. Bratko, and M. Kubat,
eds., Wiley, New York, 113-129.
Mackeown, W. P. J., Greenway, P., Thomas, B. T., and Wright, W. A. (1994).
"Contextual Image Labelling with a Neural Network." In the proceedings of
IEE Vision, Speech and Signal Processing, 238-244.
Malik, J., and Perona, P. (1990). "Preattentive texture discrimination with early vision
mechanisms", Journal of Opt. Society Am., A(7):923-932.
Merz, C. J., and Murphy, P. M. (1996). UCI Repository of machine learning databases
[http://www.ics.uci.edu/~mlearn/MLRepository.html]. Irvine, CA: University
of California, Irvine, CA.
Michalski, R. S., and Chilausky, R. L. (1980). "Learning by being told and by
examples", International Journal of Policy Analysis and Information Systems,
4:125-160.

Michalski, R. S., Rosenfeld, A., Duric, Z., Maloof, M., and Zhang, Q. (1998).
"Learning patterns in images", In Machine Learning and Data Mining, R. S.
Michalski, I. Bratko, and M. Kubat, eds., Wiley, New York, 241-268.
Michie, D., Spiegelhalter, D. J., and Taylor, C. C. (1993a). "Dataset Descriptions and
Results", In Machine Learning, Neural and Statistical Classification, D.
Michie, D. J. Spiegelhalter, and C. C. Taylor, eds., 131-174.
Michie, D., Spiegelhalter, D. J., and Taylor, C. C., eds. (1993b). "Machine Learning,
Neural and Statistical Classification", Ellis Horwood, New York, USA.
Mirmehdi, M., Palmer, P. L., Kittler, J., and Dabis, H. (1999). "Feedback control
strategies for object recognition", IEEE Transactions on Image Processing,
8(8):1084-1101.
Moller, M. F. (1993). "A scaled conjugate gradient algorithm for fast supervised
learning", Neural Networks, 6:525-533.
Mukunoki, M., Minoh, M., and Ikeda, K. (1994). "Retrieval of images using pixel
based object models." In the proceedings of IPMU, Paris, France, 1127-1132.
Murase, H., and Nayar, S. K. (1993). "Learning and recognition of 3D objects from
appearance." In the proceedings of IEEE 2nd Qualitative Vision Workshop,
New York, NY, 39-50.
Murphy, S. K., Kasif, S., and Salzburg, S. (1994). "A system for induction of oblique
decision trees", Journal of Artificial Intelligence Research, 2:1-33.
Nakoula, Y., Galichet, S., and Foulloy, L. (1997). "Identification of linguistic fuzzy
models based on learning", In Fuzzy Model Identification, H. Helledoorn and
D. Driankov, eds., Springer, Berlin, 281-319.
Pedrycz, W. (1984). "An identification algorithm in fuzzy relational systems", Fuzzy
Sets and Systems, 13:153-167.
Quinlan, J. R. (1993). C4.5: Programs for Machine Learning. Morgan Kaufmann, San
Mateo, CA.
Ralescu, A. L., and Shanahan, J. G. (1995). "Line structure inference in fuzzy
perceptual grouping." In the proceedings of NSF Workshop on Computer
Vision, Islamabad, Pakistan, 225-239.
Ralescu, A. L., and Shanahan, J. G. (1999). "Fuzzy perceptual organisation of image
structures", Pattern Recognition, 32:1923-1933.
Rumelhart, D. E., Hinton, G. E., and Williams, R. J. (1986). "Learning internal
representations by error propagation", In Parallel Distributed Processing
(Volume 1), D. E. Rumelhart and J. L. McClelland, eds., MIT Press,
Cambridge, USA.
Schlimmer, J. C., and Granger, R. H. (1986). "Beyond incremental processing: tracking
concept drift." In the proceedings of Fifth National Conference on Artificial
Intelligence, Philadelphia, 502-507.
Shanahan, J. G. (1998). "Cartesian Granule Features: Knowledge Discovery of
Additive Models for Classification and Prediction", PhD Thesis, Dept. of
Engineering Mathematics, University of Bristol, Bristol, UK.
Shanahan, J. G. (2000). "A comparison between naive Bayes classifiers and product
Cartesian granule feature models", Report No. In preparation, XRCE.
Shanahan, J. G., Baldwin, J. F., Campbell, N., Martin, T. P., Mirmehdi, M., and
Thomas, B. T. (1999). ''Transitioning from recognition to understanding in
vision using additive Cartesian granule feature models." In the proceedings of
North American Fuzzy Information Processing Society (NAFlPS), New York,
USA,71O-714.

Shanahan, J. G., Baldwin, J. F., and Martin, T. P. (1999). "Constructive induction of
fuzzy Cartesian granule feature models using Genetic Programming with
Applications." In the proceedings of Congress of Evolutionary Computation
(CEC), Washington D.C., 218-225.
Shanahan, J. G., Thomas, B. T., Mirmehdi, M., Martin, T. P., Campbell, N., and
Baldwin, J. F. (2000). "A soft computing approach to road classification",
Journal of Intelligent and Robotic Systems (Theory & Applications)(To
appear):30.
Shepherd, B. A. (1983). "An appraisal of a decision tree approach to image
classification." In the proceedings of International Joint Conference on AI,
473-475.
Smith, J. W., Everhart, J. E., Dickson, W. C., Knowler, W. C., and Johannes, R. S.
(1988). "Using the ADAP learning algorithm to forecast the onset of diabetes
mellitus." In the proceedings of Symposium on Computer Applications and
Medical Care, 261-265.
Strat, T. M. (1992). Natural object recognition. Springer-Verlag, New York, USA.
Sugeno, M., and Yasukawa, T. (1993). "A Fuzzy Logic Based Approach to Qualitative
Modelling", IEEE Trans on Fuzzy Systems, 1(1):7-31.
Tong, R. M. (1980). ''The evaluation of fuzzy models derived from experimental data",
Fuzzy Sets and Systems, 4:1-12.
Turk, M. A., and Pentland, A. P. (1991). "Face recognition using eigenfaces." In the
proceedings of IEEE Conf. on Computer Vision and Pattern Recognition, 586-
591.
Utgoff, P. E. (1989). "Incremental induction of decision trees", Machine
Learning(4):161-186.
Valois, L. R. D., and Valois, K. K. D. (1993). "A multi-stage color model", Vision
Research, 33(8): 1053-1065.
Valois, R. L. D., Albreacht, D. G., and Thorell, L. G. (1982). "Spatial-frequency
selectivity of cells in macaque visual cortex", Visual Research, 22:545-559.
Winston, P. H., ed. (1975). "The Psychology of Computer Vision", McGraw-Hill, USA.
Wood, M. E. J., Campbell, N. W., and Thomas, B. T. (1997). "Searching large image
databases using radial basis function neural networks." In the proceedings of
International Conference on Image Processing and its Applications, London,
U.K., 116-120.
APPENDIX: EVOLUTIONARY COMPUTATION

Evolutionary computation is a branch of soft computing that, by analogy with the
phenomena of evolution in nature, attempts to solve problems through the processes of
natural selection and reproduction. Several versions of evolutionary computing exist
including genetic algorithms [Holland 1975], genetic programming [Koza 1992],
evolutionary programming [Fogel, Owens and Walsh 1966], and evolutionary strategies
[Schwefel 1995]. All approaches build on ideas originally presented by Friedberg
[Friedberg 1958; Friedberg, Dunham and North 1959] who tried to solve simple
problems by teaching a computer to write Fortran computer programs through
simulated evolution. He used a framework similar to modern genetic algorithms. The
presentation here is limited to genetic programming and genetic algorithms, as they lie
at the core of the induction algorithms presented in this book. Though the principal
ideas behind evolutionary computation originated in the work of Friedberg, it was not
until the mid-seventies [Holland 1975] that genetic algorithms were accepted and
illustrated (both empirically and theoretically) as robust search techniques in complex
spaces. Genetic programming was subsequently introduced in the late eighties by Koza
[Koza 1992] as a more flexible extension of genetic algorithms.

Genetic programming (GP) is a highly parallel computational model of biological
evolution. It is an evolutionary approach, in the Darwinian sense of the word, to
computer program induction [Koza 1992]. Genetic programming is isomorphically
very similar to genetic algorithms (GAs) [Holland 1975] in the following ways:

• They both operate on a family/population of individuals (also known as
structures or chromosomes).

• They both have an objective function, normally known as a fitness
function, which helps direct the evolutionary-based search and promote
good individuals/solutions.

• They both possess a collection of reproduction operators that use
individuals in the current population to produce individuals in the next
generation. Usually this set will consist of crossover, reproduction and
mutation operators.

• An evolutionary process using the above operations evolves individuals in
light of the fitness function over a predetermined number of generations
or until a satisfactory solution is reached.

However, GP and GAs differ in many ways, as outlined below:



• Whereas GAs operate on individuals which correspond to actual
solutions, GP operates on individuals which provide a means of getting
solutions, i.e. individuals in GP correspond to programs.

• In GAs individuals are generally represented in terms of fixed-length
strings of bits or characters, while in GP individuals are represented in
terms of variable-sized hierarchical tree structures such as that depicted in
Figure A-1.

• Fitness functions and reproduction operators may differ slightly but will
share the same philosophy.

As mentioned previously in this section, the individual structures that undergo
adaptation in GP are hierarchically structured programs. The size, shape and contents of
these computer programs will dynamically vary during this evolution process. The set
of possible structures/programs comes from the set of all possible compositions of
functions from F = {f1, ..., fn} and a set of terminals/variables T = {t1, ..., tm}. The set of
functions F corresponds normally to algebraic, trigonometric or logic functions that
form the basis for the programming language that is used to describe the chromosomes.
The set of terminals T normally corresponds to the problem variables and constants.
Each function fi will take a specified number of arguments (also known as its arity)
which are either terminals (tj) or the results of other functions. Since program structures
are recursive in nature, the axiom of closure is very important.

The whole evolution process begins by generating a random set of individual
programs/structures utilising the set of functions and terminals of the problem. The
maximum allowed size for evolved programs is preset in terms of tree depth. To
generate an individual, a function is randomly selected from the function set. This
corresponds to the root node. Subsequently, an item is randomly selected from the
terminal set or function set. This item will correspond to an argument of the root node.
To force trees of a certain depth, random selection of node labels is restricted to functions
(to grow deeper trees) or to terminals (if the maximum depth has been reached). This
process is repeated until all leaf nodes are terminal nodes. Figure A-1 depicts an
example of a chromosome which is generated from the function set {+, -, *, /} and the
terminal set {f1, f2, f3, f4, f5, f6}. This program denotes the following simple
calculation: f5 + (f6 - f3).
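The grow-style tree generation just described can be sketched in a few lines of Python; the nested-list node representation and the names used here are illustrative assumptions.

import random

FUNCTIONS = {'+': 2, '-': 2, '*': 2, '/': 2}        # function set with arities
TERMINALS = ['f1', 'f2', 'f3', 'f4', 'f5', 'f6']    # terminal set

def random_tree(max_depth, depth=0, rng=random):
    """Grow a random program tree: functions until max_depth, then terminals."""
    if depth >= max_depth or (depth > 0 and rng.random() < 0.3):
        return rng.choice(TERMINALS)                # leaf (terminal) node
    fn = rng.choice(list(FUNCTIONS))                # internal (function) node
    return [fn] + [random_tree(max_depth, depth + 1, rng) for _ in range(FUNCTIONS[fn])]

# e.g. random_tree(3) might return ['+', 'f5', ['-', 'f6', 'f3']]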

One of the most important and difficult concepts of genetic programming is the
determination of the fitness function. The fitness function determines how well a
program is able to solve the problem. The output of the fitness function is used as the
basis for selecting which individuals get to procreate and contribute their genetic
material to the next generation. The structure of the fitness function will vary greatly
from problem to problem. See Section 9.3.2 for example fitness functions.

Three primary operations are used to adapt the chromosomes in genetic programming:

• reproduction;
• crossover;
• and mutation.
SOFf COMPUTING FOR KNOWLEDGE DISCOVERY: INTRODUCING CARTESIAN GRANULE FEATURES 317

The reproduction operator selects an individual from the current population according
to some selection mechanism based on fitness (such as k-tournament selection, see
Section 9.3.4) and copies it, without alteration, from the current population into the new
population. The crossover operation creates variation in the population by producing
new offspring that share the genetic material of the two parents. This operation consists of
selecting two parents from the current population. Subsequently, a node is selected
randomly within each of the selected parents. The sub-trees rooted at these nodes are
then swapped, resulting in two new offspring that are inserted into the next generation.
The mutation operator introduces random changes in individuals in the current
population. Once again an individual is selected. Then a node is randomly selected
within this individual. The sub-tree rooted at this node is replaced by a newly generated
random sub-tree. The mutated individual is subsequently inserted into the next
generation. These operations are described, in more detail, in the context of learning in
Section 9.3.
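Under the same nested-list representation, the crossover and mutation operators just described can be sketched as follows (again an illustrative sketch rather than the book's implementation):

import copy
import random

def all_nodes(tree, path=()):
    """Yield (path, subtree) pairs, where a path is a tuple of child indices."""
    yield path, tree
    if isinstance(tree, list):
        for i, child in enumerate(tree[1:], start=1):
            yield from all_nodes(child, path + (i,))

def replace_at(tree, path, subtree):
    """Return a copy of tree with the node at path replaced by subtree."""
    if not path:
        return copy.deepcopy(subtree)
    new_tree = copy.deepcopy(tree)
    node = new_tree
    for i in path[:-1]:
        node = node[i]
    node[path[-1]] = copy.deepcopy(subtree)
    return new_tree

def crossover(parent_a, parent_b, rng=random):
    """Swap randomly chosen sub-trees of two parents, giving two offspring."""
    path_a, sub_a = rng.choice(list(all_nodes(parent_a)))
    path_b, sub_b = rng.choice(list(all_nodes(parent_b)))
    return replace_at(parent_a, path_a, sub_b), replace_at(parent_b, path_b, sub_a)

def mutate(parent, new_subtree_fn, rng=random):
    """Replace a randomly chosen sub-tree with a newly generated random sub-tree."""
    path, _ = rng.choice(list(all_nodes(parent)))
    return replace_at(parent, path, new_subtree_fn())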

        +
       / \
     f5   -
         / \
       f6   f3

Figure A-1: An example chromosome structure that denotes the program "f5 + (f6 -
f3)".

A Generic genetic programming algorithm


There are many flavours of genetic programming algorithms (see Section 9.3.5 for a
description and application of steady state genetic programming [Koza 1992]). A
generic GP algorithm is outlined here. A schematic of the algorithm is presented in
Figure A-2. Having selected the problem function set and terminal set, an initial
population is created as previously described. The following steps are then repeated:

• Assign a fitness value to each individual


• Create a new population of individuals using the reproduction, crossover
and mutation operators

until a satisfactory solution (in terms of fitness) is found or the number of generations is
reached. Then the best of the best-of-generation individuals (the best-of-generation
individual being the fittest individual in each generation) is generally chosen as the
solution. Typically, algorithms run for 50 to 100 generations with population sizes
varying from hundreds to thousands.
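Putting the pieces together, a minimal generational loop of the kind outlined above might look as follows; fitness is assumed to be "higher is better", tournament selection with k = 4 is used purely as an example, and crossover_fn and mutate_fn are operators such as those sketched earlier (mutate wrapped into a one-argument callable, e.g. lambda t: mutate(t, lambda: random_tree(2))).

import random

def tournament_select(population, fitnesses, k=4, rng=random):
    """k-tournament selection: the fittest of k randomly drawn individuals."""
    contenders = rng.sample(range(len(population)), k)
    return population[max(contenders, key=lambda i: fitnesses[i])]

def run_gp(init_pop, fitness_fn, crossover_fn, mutate_fn, generations=50,
           p_crossover=0.7, p_mutation=0.05, rng=random):
    population = list(init_pop)
    best = max(population, key=fitness_fn)             # best-of-run individual
    for _ in range(generations):
        fitnesses = [fitness_fn(ind) for ind in population]
        next_pop = []
        while len(next_pop) < len(population):
            parent = tournament_select(population, fitnesses, rng=rng)
            if rng.random() < p_crossover:
                other = tournament_select(population, fitnesses, rng=rng)
                child, _ = crossover_fn(parent, other)
            elif rng.random() < p_mutation:
                child = mutate_fn(parent)
            else:
                child = parent                          # reproduction: copy unchanged
            next_pop.append(child)
        population = next_pop
        best = max([best] + population, key=fitness_fn)
    return best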

Figure A-2: The main steps of a generic genetic programming algorithm.

Bibliography
Fogel, L. J., Owens, A. J., and Walsh, M. J. (1966). Artificial intelligence through
simulated evolution. John Wiley, New York.
Friedberg, R. (1958). "A learning machine, part 1", IBM Journal of Research and
Development, 2:2-13.
Friedberg, R., Dunham, B., and North, T. (1959). "A learning machine, part 2", IBM
Journal of Research and Development, 3:282-287.
Holland, J. H. (1975). Adaptation in Natural and Artificial Systems. University of
Michigan Press, Michigan.
Koza, J. R. (1992). Genetic Programming. MIT Press, Massachusetts.
Schwefel, H. P. (1995). Evolution and optimum seeking. J. Wiley, Chichester.
GLOSSARY OF MAIN SYMBOLS

{x. y ... } Set of elements x, y, ...


0 Null set or empty set
{xIP(x)} Set determined by the property P. "I" is read as "such
that".
<XI. X2, ...• Xn> n-tuple
'\;/ Universal quatifier denotingfor all
3 Existential quatifier denoting there exists
IAI (:ardinality of a set A
{a, ...• b} {a, ... , b} denotes a discrete interval, such that a:S: x:S: b
'\;/ x e {a, ...• b}. For example. {I •... , 6} denotes {I. 2. 3,
4.5.6}
(a, b) A continuous interval denoting any value x that satisfies
the following condition: a < x < b
[a. b] A continuous interval denoting any value x that satisfies
the following condition: a S; x S; b
9\ Set of all real numbers
Xl> ...• Xn Problem domain input variables
y Problem domain output or dependent variable
X~Y The function or mapping from the variable X to the
variable Y
Ox The universe of values that a variable X can assume
iff If and only
xeA Element x belongs to the crisp set A
A(x) Characteristic function of a crisp set A
J..lA(X) The membership grade of x in the fuzzy set A
n
Lx;IPA(X;) The fuzzy set defined over the discrete universe Ox = {Xl
;=1 + X2+ ... + xn}

fx'PA(X)
The fuzzy set defined over the continuous universe Ox
x
An a-cut of fuzzy set A
A=B Set equality
AcB Set inclusion
AcB Proper set inclusion (Le. A c B iff A :f:. B)
...,A The complement of set A
A The complement of set A
AuB The union of the sets A and B
AnB The intersection of sets A and B
A®B Fuzzy intersection or t-norm
GLOSSARY OF MAIN SYMBOLS 320

A ⊕ B   Fuzzy union or t-conorm
A × B   The Cartesian or cross product of sets A and B
×_{i=1..n} ΩXi   Shorthand for the Cartesian product, i.e. ΩX1 × ... × ΩXn
RX×Y   The relation R between the variables X and Y
[R ↓ X]   The projection of the relation R with respect to the variables in the set X
[R ↑ X-Y]   The cylindrical extension of the relation R with respect to the variables in X-Y
a ∨ b   Disjunction of propositions
a ∧ b   Conjunction of propositions
¬a ∨ b   Material implication
I(a, b)   Implication
P(X)   The set of crisp subsets of X (power set)
F(X)   The set of fuzzy subsets of X (fuzzy power set)
Pr(X = xi) or Pr(xi)   The probability measure of the variable X having a value xi
Pr(X)   The probability distribution associated with the variable X
Pr(X = xi | Y = yj)   The conditional probability measure of the variable X having a value xi given that the variable Y has a value yj
Bel   Belief measure
Pl   Plausibility measure
Nec   Necessity measure
Pos   Possibility measure
π(x)   The point possibility of x ∈ ΩX
mE = <Ai : m(Ai)>   A basic probability assignment (bpa) defined in terms of its focal elements Ai and associated masses
MAE = <Ai : m(Ai)>   A mass assignment defined in terms of its focal elements Ai and associated masses
LPDE   Least prejudiced probability distribution
w1 × ... × wm   A Cartesian granule defined over the words wi
P1 × ... × Pm   A Cartesian granule universe
CGF_{F1×F2×...×Fn}   A Cartesian granule feature
CGFS_{F1×F2×...×Fn}   A Cartesian granule fuzzy set
∑_{i=1..n} xi   The sum x1 + x2 + ... + xn
∏_{i=1..n} xi   The product x1 × x2 × ... × xn
argmax_{x ∈ ΩX} f(x)   The value of x that maximises f(x). For example, argmax_{x ∈ {1, 2, -3}} x² = -3
max_{x ∈ ΩX} min[μA(x), μB(x)]   E.g. max(min(0.3, 0.3), min(1, 0.9), min(0.5, 0.5)) = max(0.3, 0.9, 0.5) = 0.9, where ΩX = {x1, x2, x3} and A and B are fuzzy sets defined on ΩX
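As a small illustration of the discrete fuzzy set notation above and of the worked max-min example in the last entry, the sketch below (in Python, reusing the same membership grades purely as an example) stores a discrete fuzzy set as a mapping from elements of ΩX to membership grades and computes max over x of min[μA(x), μB(x)].

# Discrete fuzzy sets A and B over the universe {x1, x2, x3}, stored as
# element-to-membership-grade mappings (the grades mirror the example above).
A = {"x1": 0.3, "x2": 1.0, "x3": 0.5}
B = {"x1": 0.3, "x2": 0.9, "x3": 0.5}

def sup_min(A, B):
    """Compute the maximum over the universe of min[muA(x), muB(x)]."""
    universe = set(A) | set(B)
    return max(min(A.get(x, 0.0), B.get(x, 0.0)) for x in universe)

print(sup_min(A, B))   # 0.9, as in the glossary example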
SUBJECT INDEX

SYMBOL

¯   complement, 49
¬   complement, 49
↑   cylindrical extension, 60
∈   element of, 36
∀   for all, 36
∩   intersection, 48
μ   membership function, 40
↓   projection, 59
⊕   t-conorm, 51
⊗   t-norm, 50
∪   union, 49
Σ   union notation, 39, 40
Ω   universe, 35
α-cut, 43
γ-operator, 56, 187
(...), 40
/   membership separator, 39
[...], 40
{...}, 39
|, 36, 43, 95
+   union notation, 39
<...>, 97, 98, 104, 114

A

accuracy, 244
ACGF model. See additive Cartesian granule feature model
additive Cartesian granule feature model, 194, 228, 241, 242, 293
additive model. See additive Cartesian granule feature model
antecedent, 80
applications, 241, 281
arity, 316
AutoClass, 161
averaging operators, 54

B

background knowledge, 27
basic probability assignment, 104
Bayes' rule, 96
Bayesian network, 99, 278
belief measure, 105
belief network. See Bayesian network
bias, 209
bias/variance dilemma, 209
bijective transformation, 119
body of a rule, 80
body of evidence, 110
bootstrap procedure, 167
Box and Jenkins gas furnace problem, 301
bpa. See basic probability assignment
Bristol image database, 283, 285

C

C4.5, 152. See ID3
car parking problem. See parking problem
Cartesian granule, 180
Cartesian granule features, 179, 243, 245
  approximate reasoning, 194
  definition, 180
  fuzzy logic, 195
  fuzzy set, 181
  fuzzy set induction, 203
  product models, 194
  rules, 193
Cartesian granule fuzzy set, 61, 181, 190, 203
  example, 203
  learning, 203
Cartesian granule universe, 180
case-based learning, 159
causal network. See Bayesian network
causal rule structure, 132
centre of gravity, 86
characteristic function, 36
chemical plant application, 304
chromosome, 315
chromosome structure, 210
classical set theory, 35
classification, 150
closure, 316
clustering, 161, 216
COG. See centre of gravity
cognitive simulation, 169
compensative operators, 55
compositional rule of inference, 77
computational learning theory, 169
computer vision, 281
computing with words, 183
concept drift, 309
conceptual clustering, 160
conditionally independent, 96
confusion matrix, 166, 228, 231
conjunction based inference, 83
conjunctive rule, 131, 242
consequent, 80
conservation of uncertainty method, 119
consonant focal elements, 109
constant threshold assumption, 71, 188
constructive induction, 200
cost function, 165
CRI. See compositional rule of inference
crossover, 213, 315
cylindrical extension, 60

D

data browser, 154, 261
data-driven partitioning, 257
decision making, 25, 137, 148, 150, 159
decision trees, 152, 260
decomposition error, 184, 272
deductive learning, 146
defuzzification, 86
  centre of gravity, 86
  for classification, 87
  maximum height method, 87
DeMorgan triple, 54, 82
Dempster's rule of combination, 106
Dempster-Shafer theory, 103
diabetes application, 296
distributed learning, 309

E

efficiency, 168
ellipse problem, 226, 243, 261
embedded feature selection, 207
evaluation function, 165
evidential logic, 131, 202, 219
evidential logic rule, 202, 219, 224, 242, 276
evidential reasoning. See evidential logic
evolutionary computation, 157, 315
evolutionary programming, 157, 315
evolutionary strategies, 157, 315
extended rule, 132

F

FCM, 161. See fuzzy C-means
feature discovery, 205
feature selection, 206
  embedded approach, 207
  filter approach, 207
  wrapper approach, 208
filter, 219, 244
filter feature selection, 207, 291
filter identification, 219
fitness. See fitness function
fitness function, 210, 315, 316
focal element, 104, 114
FOCUS, 207
forward chaining, 25
frame of discernment, 95, 103
Fril, 129
  decision making, 137
  inference, 133
  rule structures, 129
FRIL. See Fril
fuzziness, 38
fuzzy C-means, 258
fuzzy complement, 54
fuzzy decision making. See defuzzification
fuzzy implication, 81
fuzzy inference, 76
  conjunction based, 83
  implication based, 81
fuzzy integrals, 159, 223
fuzzy interval, 40
fuzzy logic, 67
  applications, 89
  defuzzification, 85
  inference, 76
  learning, 159
fuzzy measures, 159
fuzzy modifiers, 68. See linguistic hedges
fuzzy mutually exclusive partition, 70
fuzzy non-mutually exclusive partition, 70
fuzzy number, 40, 45, 131
fuzzy partition, 69
  fuzzy mutually exclusive partition, 69, 70
  linguistic partition, 71
fuzzy patch, 78
fuzzy predicate, 68
fuzzy probabilities, 68
fuzzy relation, 58, 80, 81
  cylindrical extension, 60
  projection, 59
Fuzzy Relational Inference Language, 129
fuzzy set, 38
  alpha-cut, 43
  averaging operators, 54
  cardinality, 45
  complement, 47
  core, 43
  degree of membership, 38
  example, 39
  generalisations, 57
  height, 44
  interpretations, 41
  intersection, 47
  involutive complement. See involutive complement
  matching, 56
  membership, 38
  membership function, 38
  normal, 44
  normalisation, 44
  notation, 40
  operations, 47
  possibility theory, 113
  properties, 43
  representation, 45
  semantic unification, 116
  support, 43
  transformation to probability distributions, 119
  trapezoidal, 73
  triangular, 73
  type-1 fuzzy set, 62
  type-2 fuzzy sets, 62
  union, 47
  voting model interpretation, 42
fuzzy set theory, 35, 37
  learning, 159
  motivations, 37
fuzzy truth values, 68

G

G_DACG, 200
  applications, 241, 281
  chromosome structure, 210
  detailed example, 226
  for classification problems, 202
  for prediction problems, 204
  wrapper-based, 208
G_DACG Algorithm. See G_DACG
Gabor filters, 288
generalisation, 162, 164, 183
generalised modus ponens, 78
generalised modus tollens, 80
generation gap, 214
genetic algorithms, 157, 315
genetic programming, 157, 210, 214, 315
glassbox models. See model transparency
granularity, 73, 210, 243
granule, 180
granule characterisation, 215
granule fuzzy set, 61
H

head of rule, 80
holdout estimate, 167
human learning, 145
hypothesis. See model
hypothesis language, 24

I

ID3, 216, 296
ignorance, 103, 104
implication based inference, 81
inconsistency, 103, 114
incremental learning, 309
incrementally learning, 27
independence, 96
inductive bias, 168
inductive inference, 146
inductive learning, 145, 148, 199
inference, 25
inference engine, 25
interpretations of fuzzy sets, 41
interval semantic unification, 117
interval-valued fuzzy sets, 62
intuitionistic fuzzy sets, 63
involutive complement, 49, 54

J

joint probability distribution, 97

K

Kleene-Dienes implication, 82
knowledge base. See model
knowledge discovery, 6, 170
  applications, 12
knowledge discovery in databases, 170
knowledge extraction, 10
knowledge representation, 21, 23, 168
  desiderata, 26
  fuzzy sets, 27
  mathematical, 27
  probabilistic, 27
  prototypes, 27
  symbolic, 27
  taxonomy, 27
knowledge stability, 27, 186
k-tournament selection, 213, 317

L

L problem, 273
language identification, 200, 205, 208, 214. See also structure identification
learning by analogy, 146
learning by instruction, 146
least prejudiced distribution, 115, 190, 203
leave-out-one method, 167
linguistic hedges, 76
linguistic partition, 71, 216, 243, 271
linguistic quantifier, 219
linguistic summary, 61
linguistic variable, 71
Lukasiewicz implication, 82

M

Machine Intelligence Quotient, 3, 14
machine learning
  categories, 148
  components, 162
  definition, 147
  history, 143
MANF. See mass assignment based neuro-fuzzy network
mass assignment, 114
mass assignment based neuro-fuzzy network, 222
mass assignment theory, 113
mass assignment tree induction algorithm. See MATI
MATI, 263, 271, 276, 299
maximum height method, 87
mean square error, 302
membership function, 38
membership-to-probability bi-directional transformation, 119, 155, 203
m-estimate, 154
MIQ. See Machine Intelligence Quotient
model, 24, 151
model transparency, 183
modelling with words, 183
mutation, 213, 315
mutually exclusive, 96
mutually exclusive linguistic partitions, 192

N

naive assumption, 98
naive Bayes, 98, 184, 296
naive Bayes classifier, 241, 265
naive Bayes classifiers, 153, 155, 265
  learning algorithms, 153
  m-estimate, 154
necessity measure, 57, 110
nested focal elements, 109
neural networks, 158, 262, 271, 291, 296
neuro-fuzzy, 222
n-fold cross validation, 167
nonmonotonic logic, 26
normal fuzzy set, 44
normalisation of fuzzy sets, 44

O

oblique decision trees, 296, 301
observation language, 24
observations, 24
Occam's razor, 209
one-shot learning, 27
order-weighted aggregation, 55
OWA. See order-weighted aggregation

P

PAC learning, 169
parameter identification, 201, 202, 217, 218, 222
parking problem, 6, 152, 154, 155, 162
parsimony, 211
partition, 69
percentile-based partitions, 257
performance, 166
pignistic distribution, 109, 116
Pima diabetes application. See diabetes application
point semantic unification, 116
possibilistic principle, 113, 119
possibility distribution function, 110
possibility measure, 56, 110
possibility theory, 109
  body of evidence, 110
  fuzzy set theory, 113
possibility/probability consistency principle, 42, 119
Powell's minimisation algorithm, 218
prediction, 150, 204, 215, 265, 266
principle of generalisation, 38, 242
principle of incompatibility, 37
principle of insufficient reason, 108
probability density, 95
probability distribution, 95
probability measures, 105
probability theory, 94
  axioms, 95
  Bayes' rule, 95
  independence, 95
  learning, 158
  point-based approaches, 97, 98
  product rule, 95
  set-based approaches, 102
probability/possibility transformation, 119
product model, 194, 242
product rule, 96
projection, 59
proposition, 95
propositions, 68
pruning, 164, 257, 259, 264

Q

Q-learning, 160
QL-implications, 82
qualified propositions, 68

R

regression. See prediction
reinforcement learning, 159
relation, 58
RELIEF, 207, 291
representational bias, 168
reproduction, 213, 315
resubstitute estimate, 167
R-implications, 82
RMS error, 212
road classification, 289
root mean square error. See RMS error
rote learning, 146

S

scaled conjugate gradient algorithm, 262
search bias, 168
semantic discrimination analysis, 211, 217, 228, 253, 276
semantic unification, 116, 211
shape features, 287
sigma count, 45
sin(X * Y) problem, 265
soft computing, 13, 129, 260
specialisation, 162
steady state genetic programming, 214
stochastic uncertainty, 38
structure identification, 205
subnormal fuzzy set, 44
Sugeno's parametric λ complement, 49
supervised learning, 149, 199
  a taxonomy of approaches, 156
support logic, 129
support pairs, 130
support vector machines, 159
symbolic learning, 156

T

Takagi-Sugeno-Kang model, 85
t-conorm, 51
  non-parameterised, 53
  parameterised, 52
  Schweizer and Sklar class, 53
theorem of total probabilities, 96
t-norm, 50
  non-parameterised, 53
  parameterised, 50
  Schweizer and Sklar class, 51
tractability, 183
transparency. See model transparency
triangular conorm. See t-conorm
triangular norm. See t-norm
type-1 fuzzy set, 62
type-2 fuzzy sets, 62

U

uncertainty management, 25, 26
understandability. See model transparency
unqualified propositions, 68
unsupervised learning, 160

V

variance, 209
version space, 165, 169
vision problem, 284, 295
voting model, 42, 71, 123, 187

W

weighted generalised means, 55
weights identification, 218
wrapper-based feature selection, 208

Z

Zadeh implication, 82
