SOFT COMPUTING
Umeå University
Department of Computing Science
SE-901 87 Umeå
Sweden
These lecture notes were originally prepared for the course in Computational Intelligence at the Department of Computing Science, Umeå University.
Text contributions for this edition have been provided by Jens Bohlin, Patrik Eklund,
Lena Kallin-Westin and Tony Riissanen.
The authors
Contents

I TYPICAL APPLICATIONS

1 Application Scenarios
  1.1 Applications in the Health Care Domain
    1.1.1 Medical decision support
    1.1.2 Information scope and management
    1.1.3 Case studies and scenarios
  1.2 Industrial Applications
    1.2.1 Diagnostics and control
    1.2.2 Case studies and scenarios

3 Logic Programming
  3.1 Logical formulae as a program
  3.2 Resolution in logic programming
    3.2.1 The resolution procedure
    3.2.2 Unification

4 Many-Valued Logic
  4.1 Fuzzy Sets
  4.2 Useful Membership Functions
    4.2.1 Triangular and trapezoidal functions
    4.2.2 Gaußian and sigmoidal functions
  4.3 Fuzzy Logic

7 Fuzzy Clustering
  7.1 Data Clustering
  7.2 Fuzzy c-Means Clustering
  7.3 Identification of Rules
  7.4 Geometric Fuzzy Clustering
  7.5 Applications

9 Parameter Estimations
  9.1 Tuning of Fuzzy Rule-Base by Using Gradient-Descent Algorithm
  9.2 Tuning of Fuzzy Rule-Base by Using Gauss-Newton with Regularization Method
  9.3 Tuning of Fuzzy Rule-Base by Using Levenberg-Marquardt Method

10 Software Developments
  10.1 General Overview
  10.2 System Architectures
  10.3 Design and Integration of Fuzzy Controllers
    10.3.1 Scenario: Rule Generation based on manual control
    10.3.2 Scenario: Integration of control rules
  10.4 Overview of AboaFuzz
TYPICAL APPLICATIONS
Chapter 1
Application Scenarios
scenario in which the computer systems engineer clearly becomes more superfluous, and the domain expert heads development, specifying the types and styles of decision making that need to be backed up and supported within an information system.
[Figure: expert experience and supporting technology: patient record systems and multimedia information systems, connected through information usage and refinement.]
In the data refinement scenario, data mining is supported through a data analysis
workbench, including tools both within statistics and computational intelligence,
together with supporting tools to enable system integration (see Figure 1.2).
Nephropathia Epidemica
NE is a haemorrhagic fever with renal syndrome [Lähdevirta et al 84, Settergren 89]. It is the mild form of this acute infectious disease that occurs in Europe. The bank vole, Clethrionomys glareolus, is the natural host of the Puumala virus [Lähdevirta 71], which causes NE. The diagnosis of NE is often apparent from the clinical findings. NE begins with abrupt high fever for 3-5 days, with headache from the second day onwards. From the third day there are nausea and vomiting, backache and abdominal pains. Some of the patients have acute myopia for one to three days. In the acute phase the patients often have more or less profound thrombocytopenia. During the first week the patients develop symptoms of acute renal failure, such as proteinuria, oliguria or anuria, and azotemia. In ultrasonographic imaging, swelling of the kidneys is a typical finding. A description of the patient material can be found in [Eklund and Forsström 91], where initial numerical tests were made with 27 symptoms and signs. Later, diagnostic success rates were improved to up to 88% correctness. The number of symptoms and signs can be reduced to five, using only CRP (C-reactive protein, values < 200), ESR (erythrocyte sedimentation rate, values < 100), creatinine (values < 500), thrombocytes (values < 400) and myopia (a binary yes/no value), still providing success rates above 80%. The diagnosis (the output of the network) is either positive (virus present in blood) or negative.
Myocardial infarction
Acute Myocardial Infarction (AMI) is the death of heart-muscle cells from reduced or obstructed blood flow through the coronary arteries. Traditionally, the diagnosis of AMI is made upon signs such as sweating, nausea, chest pain, changes in the ECG, and raised levels of biochemical markers. The symptoms can vary a great deal between patients, and patients come under treatment at different stages of AMI. The time factor is critical in the diagnostic process: 50% of the patients die within two hours of the first symptoms, so it is important to find a quick and reliable way to make the diagnosis.
The diagnosis of AMI requires at least two of the following criteria: a history of characteristic chest pain, evolutionary changes on the ECG, or elevation of serial cardiac enzymes [Apple 92]. The values of the enzymes are measured from samples taken at regular intervals within the first 24 hours after the infarction. Markers used are, for instance, CK-MB, myoglobin, and troponin.
In the case of AMI, the level of CK-MB increases 2-3 hours after the AMI has started, reaches a peak 10-24 hours after the AMI, and returns to normal within 3-4 days afterwards. The myoglobin level increases 2-3 hours after the start, peaks at 6-9 hours, and returns to normal within 18-24 hours. Troponin values increase at 4-6 hours, peak at 10-24 hours, and are back to normal within 10-15 days.
Observations related to data as indicated above should be used as a basis for specifying combinations and transformations of data. Apart from data modelling, it is also interesting to investigate whether or not biological modelling, e.g. of CK-MB behaviour, is possible. We know that the value of the markers increases, peaks and decreases, but the values differ from patient to patient. Reasons for the differing values are that patients seek treatment at different stages of their AMI, that the treatment given, e.g. an intravenous drip and/or drugs, affects the values, and that physical activity can raise the level of CK-MB without being a sign of AMI. Clearly, intravenous drips, once started, and similar factors also need to be considered in data modelling.
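As an illustration of what such a biological model could look like, the rise-peak-decay behaviour of a marker can be sketched with, e.g., a log-normal time profile. Python is used for illustration only, and the shape parameters below are purely illustrative, not fitted to patient data:

```python
import math

def marker_level(t_hours, peak_time=17.0, width=0.6, peak_value=1.0):
    """Illustrative rise-peak-decay curve for a cardiac marker such
    as CK-MB, modelled as a log-normal profile over time since onset.
    All parameters are made up for illustration."""
    if t_hours <= 0:
        return 0.0
    z = math.log(t_hours / peak_time)
    return peak_value * math.exp(-(z * z) / (2 * width * width))
```

The curve is 0 at onset, reaches peak_value at peak_time, and decays thereafter; per-patient differences would be captured by fitting peak_time, width and peak_value individually.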
Down's syndrome
The most commonly used biochemical markers are AFP (alpha-foetoprotein) and hCG (human chorion gonadotropin), especially its free subunit β-hCG. Several factors influence AFP and β-hCG levels, e.g. insulin-dependent diabetes, race, smoking, and overweight of the mother. Therefore, it can be expected that by adding more anamnestic information, especially factors known to influence marker levels, it is possible to get more specific and sensitive information with regard to finding the group that is at risk of having a DS baby.
A more detailed description of the syndrome, as far as data analysis is concerned,
can be found in [Kallin et al 95].
Based on computational methods as indicated in [Kallin et al 95] and explained later in the text, a decision support system has been constructed which evaluates the risk for the syndrome given three inputs: the mother's age, AFP, and β-hCG. The system shows an improved performance over the multiple Gaußian formula, one of the most common statistical formulas used in software today.
1.2 Industrial Applications
AGNES data:
• air-filled tyres
• steering axis chain coupled with 0.9 Nm DC motor and 56:1 transmission
[Figure: the AGNES control structure: trajectory measurements from sensors are used to calculate deviations and to generate setpoints for steering and speed, each handled by a fuzzy controller.]
Fertilizer Production
In the fertiliser production line, the consistency of the material is controlled by injections of liquid into the crystalliser and the drum, respectively. The objective is to produce granules of fixed size. Disturbances derive mainly from recycling and added chemicals. Changes of recipe also present situations and control problems where the process needs to be stabilised with minimum delay.
A data set recorded from particular situations involving manual control has been used for analysis. The control actions, injection change to the drum, injection change to the crystalliser, and circulation change, are based on observation of the size of the granules, the direction of change of the granule size, rear temperature, front temperature, revolution of the drum, revolution of the crystalliser, and the amount in circulation.
In the extraction of data, much attention has to be given to obtaining reasonably good approximations for measurements to enable structure identification. At this stage, much engineering knowledge of the process is required. A general comment is that chemical processes of this type are impossible to model exactly. Furthermore, the extracted data is far from clean, even after transformation to suit numerical experimentation. However, for the discussion here, this case study provides a useful counterpart to the other case studies. From an integration point of view, the chemical plant is a typical example that demonstrates the suitability of the rule-base generation approach adopted. The industrial automation system can directly integrate the controller as developed within a design workbench, where the integration with the automation system can be done in various ways.
Part II
Chapter 2
Two-Valued Logic
This chapter draws heavily on the book "Mathematical Logic for Computer Science" by Ben-Ari, which is recommended for further reading on classical two-valued logic.
The study of logic was begun by the ancient Greeks, who considered philosophy and rhetoric extremely important in their culture. Logic was a way to define rules of deduction so that if everybody started with the same premises and followed the same logical rules, the derived conclusions would also be the same.
Words such as axiom, theorem and syllogism were used by the Greeks and are still in use to this day. The famous rule of syllogism:
1. All men are mortal.
2. X is a man.
3. Therefore, X is mortal.
If the first two sentences (the premises) are true, the Law of Syllogism assures that the third sentence is true for every X. We can, for example, use X = Socrates and deduce that Socrates is mortal.
There are, however, some problems: natural language is too imprecise a notation. We can claim false statements to be true, or claim that a statement is true even though its truth does not follow from the premises.
Even if you use the logic carefully, paradoxes can arise. One famous paradox is
”the Liar’s Paradox”:
Epimenides who was a Cretan (a person from the island of Crete) was
heard to say; ”All Cretans are liars.”
Now if, in saying this, Epimenides is telling the truth, then he must be a liar (by virtue of being a Cretan). But if he is lying, then by his statement he is telling the truth. It is hard to see how he can be both lying and telling the truth simultaneously, and a vicious circle results.
Another way of seeing the puzzle is to consider the following statement: "This statement is false." The same vicious circle arises. If the statement is true, then it is the case that it is false. But if false, then it must actually be true...
Mathematicians have studied logic in order to formalise the concept of mathematical proof. Hilbert (1862-1943) tried to find a 'proof theory' that would be a direct check of the consistency of mathematics. He wanted to find a system where, among other things, if a statement is in fact true, there is a proof somewhere out there just waiting to be discovered.
The mathematician Gödel spoiled Hilbert's dream when he showed that there are true statements of arithmetic that are not provable.
In these notes three types of logic are presented: the propositional calculus, the predicate calculus, and fuzzy logic. It should be noted that other logics exist as well, for example intuitionistic logic, temporal logic, and modal logic; for more information about these, the reader is referred to the literature.
x | ¬x
--+---
1 |  0
0 |  1

x y | x∨y x∧y x→y x≡y x⊕y x←y x↑y x↓y
----+--------------------------------
1 1 |  1   1   1   1   0   1   0   0
1 0 |  1   0   0   0   1   1   1   0
0 1 |  1   0   1   0   1   0   1   0
0 0 |  0   0   1   1   0   1   1   1
The set of operators is highly redundant; in fact, only negation together with one of the first six operators is needed to express all the other operators. The last two (nand and nor) are each by themselves sufficient to define all other operators. Which set of operators to use is highly dependent on the application. In mathematics the implication is crucial, but in electronics nand and nor are more important.
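The claim that nand alone suffices can be checked mechanically; a small sketch (Python is used for illustration throughout these examples, the notes themselves prescribe no programming language):

```python
def nand(x, y):
    # x ↑ y: 0 only when both inputs are 1
    return 0 if (x == 1 and y == 1) else 1

def neg(x):        return nand(x, x)                 # ¬x
def conj(x, y):    return neg(nand(x, y))            # x ∧ y
def disj(x, y):    return nand(neg(x), neg(y))       # x ∨ y (de Morgan)
def implies(x, y): return nand(x, neg(y))            # x → y = ¬(x ∧ ¬y)

# check every derived operator against the truth table
for x in (0, 1):
    for y in (0, 1):
        assert conj(x, y) == (x & y)
        assert disj(x, y) == (x | y)
        assert implies(x, y) == (0 if (x, y) == (1, 0) else 1)
```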
The grammar for building a formula is as follows: every atom in a set P is a formula, and if A and B are formulae, then so are ¬A and A op B for each of the binary operators above; parentheses may be placed around any formula. The set P is called the set of atomic propositions, or atoms. As in arithmetic, the operators have an order of precedence, and parentheses can be used to make the precedence clear, or to change it if needed, in the same way as in arithmetic. The order of precedence (from high to low) is ¬, ∧, ∨, →, ←, ≡.
v(P ∧ Q) = 0
v(¬(P ∧ Q)) = 1
Definition A propositional formula A is valid (also called a tautology) if, for every interpretation, the truth-value is 1. Notation: |= A. If the truth-value is 0 in some interpretation, the formula is not valid, or falsifiable.
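Since a formula over finitely many atoms has only finitely many interpretations, validity can be decided by enumeration; a brute-force sketch (Python for illustration; representing formulas as functions is an assumption of this sketch, not the notation of the text):

```python
from itertools import product

def is_valid(formula, atoms):
    """Decide |= A by checking every interpretation.
    `formula` maps an interpretation {atom: 0/1} to a truth-value 0/1."""
    return all(formula(dict(zip(atoms, values)))
               for values in product((0, 1), repeat=len(atoms)))

# ¬(P ∧ Q) ∨ (P ∧ Q) is valid; P ∧ Q is falsifiable
excluded_middle = lambda v: int(not (v["P"] and v["Q"]) or (v["P"] and v["Q"]))
conjunction = lambda v: int(bool(v["P"] and v["Q"]))
assert is_valid(excluded_middle, ["P", "Q"])
assert not is_valid(conjunction, ["P", "Q"])
```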
2.2 Predicate Calculus
1. All men are mortal.
2. Socrates is a man.
3. Therefore, Socrates is mortal.
In the propositional calculus we would use three atoms: p1 = All men are mortal.,
p2 = Socrates is a man., and p3 = Socrates is mortal.. The syllogism would be
written as a formula like this: p1 ∧ p2 → p3 . Even though we intuitively know that
the implication is valid, we are unable to show it.
In predicate calculus we can use the predicates man(x) and mortal(x) and the
three formulas would be written as
1. ∀x(man(x) → mortal(x))
2. man(Socrates)
∀ is the universal quantifier, read "for all". ∃ is the existential quantifier, read "there exists". Quantifiers have the same precedence as negation. As an example of the existential quantifier, consider the sentence "There exist birds that cannot fly". If we use the predicates bird(x) and can fly(x), the corresponding formula in predicate calculus is:

∃x(bird(x) ∧ ¬can fly(x))
• If f is an n-ary function symbol and t1 , ..., tn are terms, then f (t1 , ..., tn ) is a
term.
• If p is an n-ary predicate symbol and t1 , ..., tn are terms, then p(t1 , ..., tn ) is a
formula (also called an atom).
Definition For a quantified formula such as ∃xA, x is called the bound variable and A is the scope of the quantified variable. If every variable in a formula is bound, the formula is said to be closed. The universal closure is obtained by binding every free variable in a formula with the universal quantifier. If the existential quantifier is used instead, it is called an existential closure.
As mentioned before, the symbols in the predicate calculus do not have a special meaning. Before we can define an interpretation for a predicate formula, we have to introduce the concept of substitution.
Definition Let A be a formula, x a variable and a a constant. A[x ← a], the
substitution of a for x, is defined inductively as follows:
such that an n-ary relation is assigned to each n-ary predicate symbol, an n-ary
function is assigned to each n-ary function symbol and a domain element is assigned
to each constant symbol.
is in CNF while
(¬p(x) ∨ q(f (x), y) ∨ r(z)) ∧ ((p(x) ∧ ¬q(f (x), y)) ∨ r(z)) ∧ (¬r(z))
is not in CNF because of the conjunction embedded within the second disjunction.
Definition A formula is in prenex conjunctive normal form (PCNF) iff it is of the form

Q1 x1 . . . Qn xn M

where the Qi are quantifiers and M is a formula in CNF with no quantifiers. The sequence of quantifiers is called the prefix and M is called the matrix.
Definition A formula is in clausal form if it is in PCNF and the prefix consists only
of universal quantifiers.
Example.
∀x∀y((p(f (x)) ∨ ¬p(x) ∨ ¬p(y)) ∧ (p(a) ∨ ¬p(y)))
1. A = ∀x∃yp(x, y)
2. A = ∃yp(y)
In the first case there is at least one universal quantifier to the left of the existential quantifier, and in the second there is no universal quantifier to the left. Fortunately, Skolem showed that there is a way to find a formula A′ in clausal form such that A is satisfiable iff A′ is satisfiable.
In the first example there exists a y for every x such that p(x, y) is true. Then there exists a new function f, y = f(x), that produces these values, and A′ = ∀xp(x, f(x)).
In the second example it is said that there exists a y such that p(y) is true. We choose a new constant c (or, more correctly, a 0-ary function symbol c) that maps to the correct y.
How to transform a formula into clausal form:

1. Give all bound variables unique names.

2. Eliminate all connectives except ¬, ∧, and ∨.

3. Push the negations inward using de Morgan's laws:

¬(A ∧ B) ↔ (¬A ∨ ¬B)
¬(A ∨ B) ↔ (¬A ∧ ¬B)

When pushing a negation through a quantifier, use the equivalences:

¬∀xA(x) ↔ ∃x¬A(x)
¬∃xA(x) ↔ ∀x¬A(x)

4. Extract all quantifiers from the matrix. Choose a quantifier that is not in the scope of any other quantifier in the matrix and extract it using the following rules (Q is either ∀ or ∃, and op is either ∨ or ∧):

A op QxB(x) ↔ Qx(A op B(x))
QxB(x) op A ↔ Qx(B(x) op A)

Repeat until no quantifiers are left in the matrix.

5. Use the distributive laws to transform the matrix into CNF:

(A ∨ (B ∧ C)) ↔ (A ∨ B) ∧ (A ∨ C)
(A ∧ (B ∨ C)) ↔ (A ∧ B) ∨ (A ∧ C)

6. Use Skolem functions to eliminate the existential quantifiers.
Example. Consider the formula ∃x∀yp(x, y) → ∀y∃xp(x, y). We transform it into clausal form step by step:

Rename bound variables: ∃x∀yp(x, y) → ∀w∃zp(z, w)
Eliminate connectives: ¬∃x∀yp(x, y) ∨ ∀w∃zp(z, w)
Push negations inward: ∀x∃y¬p(x, y) ∨ ∀w∃zp(z, w)
Extract quantifiers: ∀x∃y∀w∃z(¬p(x, y) ∨ p(z, w))
Remove existential quantifiers: ∀x∀w(¬p(x, f (x)) ∨ p(g(x, w), w))
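Step 3 of the transformation (pushing negations inward) can be sketched as follows, representing formulas as nested tuples — a representation chosen here for illustration, not one used in the text:

```python
def push_neg(f):
    """Step 3: push negations inward. Formulas are nested tuples:
    ('not', A), ('and', A, B), ('or', A, B),
    ('forall', x, A), ('exists', x, A), or an atom (a string).
    Assumes step 2 has already removed all other connectives."""
    if isinstance(f, str):
        return f
    if f[0] == 'not':
        g = f[1]
        if isinstance(g, str):
            return f                                   # literal, done
        if g[0] == 'not':
            return push_neg(g[1])                      # ¬¬A -> A
        if g[0] == 'and':                              # de Morgan
            return ('or', push_neg(('not', g[1])), push_neg(('not', g[2])))
        if g[0] == 'or':
            return ('and', push_neg(('not', g[1])), push_neg(('not', g[2])))
        if g[0] == 'forall':                           # ¬∀x A -> ∃x ¬A
            return ('exists', g[1], push_neg(('not', g[2])))
        if g[0] == 'exists':                           # ¬∃x A -> ∀x ¬A
            return ('forall', g[1], push_neg(('not', g[2])))
    if f[0] in ('and', 'or'):
        return (f[0], push_neg(f[1]), push_neg(f[2]))
    return (f[0], f[1], push_neg(f[2]))                # quantified formula

# ¬∀y p(x,y) becomes ∃y ¬p(x,y), as in the example above
assert push_neg(('not', ('forall', 'y', 'p(x,y)'))) == \
       ('exists', 'y', ('not', 'p(x,y)'))
```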
Chapter 3
Logic Programming
In this chapter, logic programming is only given a short overview. For a complete description, see [Lloyd 87].
A rule is of the form A ← B1 , . . . , Bn , where A and the Bi are predicates with arity ≥ 0. The meaning of the rule is as follows: for each assignment of each variable, if B1 , . . . , Bn are all true, then A is true. A is called the head and B1 , . . . , Bn the body of the rule. A rule is also called a definite program clause.
Example Suppose we have the following logic program (variables start with capital letters):

loves(X,Y) ← mother(X), child_of(Y,X)
mother(mary) ←
child_of(tom,mary) ←

If this program is given the query ← loves(Person,Who), the answer is YES, Person = mary, Who = tom.
A substitution is a finite set of the form

{x1 ← t1 , . . . , xn ← tn }

where each xi is a distinct variable and each ti is a term which is not identical to the corresponding variable xi .
Let E be a clause and θ = {x1 ← t1 , . . . , xn ← tn } a substitution. An instance
Eθ of E is obtained by simultaneously replacing each occurrence of xi in E by ti .
Example
E = p(x) ← q(y)
θ = {x ← y, y ← a}
Eθ = p(y) ← q(a)

We can see that the word simultaneously in the definition implies that we should not substitute y for x and then a for y.
3.2.2 Unification
With the help of substitutions two atoms can be made equal, i.e. we can obtain
clashing ground clauses.
1. Let k = 0, Wk = W, σk = ε
5. Let k = k + 1. Go to step 2.
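Although the full unification algorithm is not reproduced above, the idea can be sketched compactly (Python for illustration; terms are represented as strings and (functor, argument-list) tuples, an assumption of this sketch, and the occurs check is omitted):

```python
def is_var(t):
    # Prolog convention, as in the example program above:
    # variables start with a capital letter
    return isinstance(t, str) and t[0].isupper()

def walk(term, subst):
    # follow variable bindings already made
    while is_var(term) and term in subst:
        term = subst[term]
    return term

def unify(s, t, subst=None):
    """Return a most general unifier {variable: term}, or None on a
    clash. Terms are strings (variables or constants) or
    (functor, [arguments]) tuples. The occurs check is omitted."""
    if subst is None:
        subst = {}
    s, t = walk(s, subst), walk(t, subst)
    if s == t:
        return subst
    if is_var(s):
        return {**subst, s: t}
    if is_var(t):
        return {**subst, t: s}
    if isinstance(s, str) or isinstance(t, str):
        return None                       # two distinct constants clash
    (f, f_args), (g, g_args) = s, t
    if f != g or len(f_args) != len(g_args):
        return None                       # functor or arity clash
    for a, b in zip(f_args, g_args):
        subst = unify(a, b, subst)
        if subst is None:
            return None
    return subst

# unifying p(X, f(Y)) with p(a, f(b)) gives {X: a, Y: b}
assert unify(("p", ["X", ("f", ["Y"])]),
             ("p", ["a", ("f", ["b"])])) == {"X": "a", "Y": "b"}
```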
Chapter 4

Many-Valued Logic
µA : X → {0, 1}.
”x is in A”.
This function can also be interpreted as a relation consisting of ordered pairs (x, µA (x)).
As a first step, a fuzzy set can be seen as an extension of characteristic functions, i.e. a fuzzy set µ can be defined mathematically by assigning to each possible individual in the universe of discourse a value, µ(x), representing its grade of membership in the fuzzy set µ. This grade corresponds to the degree to which that individual is similar to, or compatible with, the concept represented by the fuzzy set. Often a fuzzy subset A of X is given as a membership function

µA : X → I,

interpreted as the degree of truth of

"x is in A"

or even

"x is A"

if it is desirable to speak of compatibility with the concept A rather than membership.
If X = {x1 , . . . , xn } is a finite set and A a fuzzy subset of X, a more relational notation for A would be

A = µA (x1 )/x1 + · · · + µA (xn )/xn ,

where µA (xi )/xi contains the respective grades of membership and + should be seen as a union. This notation is very informal and certainly not algebraic in any sense.
Example. Suppose we want to define a fuzzy set of natural numbers ”close to 4”
(see Figure 4.1). This can be given e.g. as
A = {(1, 0.0), (2, 0.2), (3, 0.6), (4, 1.0), (5, 0.6), (6, 0.2), (7, 0.0)}.
The above definition of a fuzzy set is a typical situation where the relational style
is convenient.
Example. A fuzzy set A defining ”normal room temperature” can be given as
µA (x) =
  0,                x < 16°C
  (x − 16°C)/2°C,   16°C ≤ x < 18°C
  1,                18°C ≤ x ≤ 22°C
  (24°C − x)/2°C,   22°C < x ≤ 24°C
  0,                x > 24°C.
As can be seen, temperatures below 16◦ C or above 24◦ C are not to any degree
considered to be normal.
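The piecewise definition above translates directly into code; a small sketch (Python used for illustration only):

```python
def normal_room_temp(x):
    """Grade of membership in 'normal room temperature' (x in °C),
    following the piecewise definition above."""
    if x < 16 or x > 24:
        return 0.0
    if x < 18:
        return (x - 16) / 2
    if x <= 22:
        return 1.0
    return (24 - x) / 2

assert normal_room_temp(15) == 0.0   # not normal to any degree
assert normal_room_temp(17) == 0.5
assert normal_room_temp(20) == 1.0
assert normal_room_temp(23) == 0.5
```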
Example. A fuzzy set B defining ”high moisture rates” can be given as
µB (x) =
  0,               x < 30%
  (x − 30%)/20%,   30% ≤ x ≤ 50%
  1,               x > 50%.
Correspondingly, we have functions with open left shoulders, L : X → [0, 1], defined
by
L(x; α, β) =
  1,                 x < α
  (β − x)/(β − α),   α ≤ x ≤ β
  0,                 x > β.
[Figure: the Γ- and L-functions with parameters α and β, and the triangular Λ-function with parameters α, β and γ.]
[Figure: a partition of the interval [−40, 40] into the linguistic labels NB, NS, ZO, PS and PB.]
[Figure: the trapezoidal Π-function with parameters α, β, γ and δ.]
The level between β and γ is sometimes called a plateau (see Figure 4.8).
From an implementation point of view, it is obviously advantageous to consider Γ, L and Λ as special cases of Π. This is possible if the universe of discourse is a bounded interval, e.g. [−10, 10].
G(x; α, β) = e^(−β(x−α)²),
where α is the midpoint and β reflects the slope value. Note that β must be positive,
and that the function never reaches zero.
The Gaußian function can also be extended to have different left and right slopes.
We then have three parameters in
G(x; α, βl , βr ) =
  e^(−βl (x−α)²),   x ≤ α
  e^(−βr (x−α)²),   x > α,
σ(x; α, β) = 1 / (1 + e^(−β(x−α))),
where α is the midpoint and β determines the slope at the inflexion point; note that β is 4 times the derivative value at the inflexion point. As before, β must be positive. This S-function reaches neither 0 nor 1.
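Sketches of the Gaußian and sigmoidal membership functions defined above (Python for illustration; parameter names follow the text):

```python
import math

def gauss(x, alpha, beta):
    """Gaußian membership function: midpoint alpha, slope parameter
    beta > 0. The value never reaches zero (up to float underflow)."""
    return math.exp(-beta * (x - alpha) ** 2)

def sigmoid(x, alpha, beta):
    """S-shaped membership function: alpha is the inflexion point,
    where the value is 0.5 and the derivative is beta/4."""
    return 1.0 / (1.0 + math.exp(-beta * (x - alpha)))

assert gauss(2.0, 2.0, 0.5) == 1.0                # value 1 at the midpoint
assert abs(sigmoid(2.0, 2.0, 4.0) - 0.5) < 1e-12  # value 0.5 at alpha
assert 0.0 < gauss(10.0, 2.0, 0.5) < 1.0          # small but positive
```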
The purpose of this section is to introduce many-valued logic from a more intuitive and informal point of view, as compared to a strongly algebraically developed theory of many-valued logic. Generally speaking, we will stay more on the syntactic side, rather than diving deeply into semantics. Keep in mind our purpose of presenting many-valued logic as a method and technique to support application development. Furthermore, our general viewpoint is that success in applications should guide the search for "the best" understanding of the foundations.
The most commonly used connectives are

¬a = 1 − a
a ∧ b = min{a, b}
a ∨ b = max{a, b}

The intuition for these is obvious, as they clearly reflect worst- and best-case characterisations. However, this is also a disadvantage, since it means that an outcome might remain unchanged even if we modify some value; e.g. min{0.7, 0.5} is the same as min{0.8, 0.5}. If we desire any change in a or b to be effective, we can use e.g. the well-known product connectives

a ∧ b = a · b
a ∨ b = a + b − a · b

or the Łukasiewicz connectives

a ∧ b = max{0, a + b − 1}
a ∨ b = min{a + b, 1}
In the above-mentioned connectives, the outcome depends only on a and b, i.e. there are no additional parameters. Adding parameters, however, introduces several interesting and useful classes of connectives.
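The parameter-free connective families above can be compared directly in code; a small sketch (Python for illustration):

```python
def t_min(a, b):  return min(a, b)
def t_prod(a, b): return a * b
def t_luk(a, b):  return max(0.0, a + b - 1.0)   # Łukasiewicz conjunction
def s_max(a, b):  return max(a, b)
def s_prod(a, b): return a + b - a * b
def s_luk(a, b):  return min(a + b, 1.0)         # Łukasiewicz disjunction

# min/max can be insensitive to a change in one argument ...
assert t_min(0.7, 0.5) == t_min(0.8, 0.5)
# ... while the product connective always reacts
assert t_prod(0.7, 0.5) != t_prod(0.8, 0.5)
assert abs(t_luk(0.7, 0.5) - 0.2) < 1e-12
assert s_luk(0.7, 0.5) == 1.0
```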
[Yager 80]:
[Hamacher 75]:
With a = 0.6, b = 0.9 and c = 0.8, these connectives give:

Definition    a ∧ b                                               (a ∧ b) ∨ ¬c
Łukasiewicz   max{0, 0.6 + 0.9 − 1} = 0.5                         min{0.5 + (1 − 0.8), 1} = 0.7
Yager         1 − min{[(1 − 0.6)² + (1 − 0.9)²]^(1/2), 1} ≈ 0.59  min{[0.59² + (1 − 0.8)²]^(1/2), 1} ≈ 0.62
Hamacher      (0.6 · 0.9)/(2 + (1 − 2)(0.6 + 0.9 − 0.6 · 0.9)) ≈ 0.52
              (0.52 + (1 − 0.8) − (2 − 2) · 0.52 · (1 − 0.8))/(1 − (1 − 2) · 0.52 · (1 − 0.8)) ≈ 0.65
maximum operator for disjunction, the union of A and B is then a fuzzy set µA∪B given by

µA∪B (x) = max{µA (x), µB (x)}.

Similarly, the intersection µA∩B is given by

µA∩B (x) = min{µA (x), µB (x)}.
Example. Let A and B be discrete fuzzy subsets of X = {−3, −2, −1, 0, 1, 2, 3}. If
A = {(−3, 0.0), (−2, 0.3), (−1, 0.6), (0, 1.0), (1, 0.6), (2, 0.3), (3, 0.0)}
and
B = {(−3, 1.0), (−2, 0.5), (−1, 0.2), (0, 0.0), (1, 0.2), (2, 0.5), (3, 1.0)}
then
A ∧ B = {(−3, 0.0), (−2, 0.3), (−1, 0.2), (0, 0.0), (1, 0.2), (2, 0.3), (3, 0.0)}
and
A ∨ B = {(−3, 1.0), (−2, 0.5), (−1, 0.6), (0, 1.0), (1, 0.6), (2, 0.5), (3, 1.0)}.
¬A = {(−3, 1.0), (−2, 0.7), (−1, 0.4), (0, 0.0), (1, 0.4), (2, 0.7), (3, 1.0)}
and
¬B = {(−3, 0.0), (−2, 0.5), (−1, 0.8), (0, 1.0), (1, 0.8), (2, 0.5), (3, 0.0)}.
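The discrete example above can be reproduced mechanically; a sketch (Python for illustration) that verifies the listed grades:

```python
X = [-3, -2, -1, 0, 1, 2, 3]
A = {-3: 0.0, -2: 0.3, -1: 0.6, 0: 1.0, 1: 0.6, 2: 0.3, 3: 0.0}
B = {-3: 1.0, -2: 0.5, -1: 0.2, 0: 0.0, 1: 0.2, 2: 0.5, 3: 1.0}

A_and_B = {x: min(A[x], B[x]) for x in X}     # intersection
A_or_B  = {x: max(A[x], B[x]) for x in X}     # union
not_A   = {x: 1.0 - A[x] for x in X}          # complement

assert A_and_B == {-3: 0.0, -2: 0.3, -1: 0.2, 0: 0.0,
                   1: 0.2, 2: 0.3, 3: 0.0}
assert A_or_B  == {-3: 1.0, -2: 0.5, -1: 0.6, 0: 1.0,
                   1: 0.6, 2: 0.5, 3: 1.0}
# complement, up to floating-point rounding
expected = {-3: 1.0, -2: 0.7, -1: 0.4, 0: 0.0, 1: 0.4, 2: 0.7, 3: 1.0}
assert all(abs(not_A[x] - expected[x]) < 1e-9 for x in X)
```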
[Figure: the fuzzy sets A and B, and their union A ∨ B, shown over the universe {−3, . . . , 3}.]
f :X→Y
f : PX → PY
Strictly speaking, we should use another notation for the second f, e.g. Pf, since the mappings are not the same. However, as there is usually no confusion, it is very common to use f in both situations.
Viewing this extension in the fuzzy domain, we have the following obvious question: given a fuzzy subset µA of X, what is the corresponding image f (µA ) as a fuzzy subset of Y ? A natural approach is to require that f (µA )(f (x)) = µA (x), but in such cases we also need to define f (µA )(f (x)) in situations where there might be other points x′ for which f (x) = f (x′ ) with µA (x) ≠ µA (x′ ). Typically, in such a situation we take "the best we have", i.e. f (µA )(f (x)) would be the largest value (or the supremum in an infinite case) of all µA (x′ ) for which f (x) = f (x′ ). We also need to specify the values of f (µA )(y) where we cannot find any x such that f (x) = y. In such a situation it is natural to require that f (µA )(y) = 0.
To summarize, the Extension Principle states that

f (µA )(y) =
  ∨_{f(x)=y} µA (x),   if {x ∈ X | f (x) = y} ≠ ∅,
  0,                   otherwise.
The extension to n-ary functions

f : Xⁿ → Y

is similar: we have to define f (µA1 , . . . , µAn )(f (x)), x = (x1 , . . . , xn ), given that we have µA1 (x1 ), . . . , µAn (xn ). Now we need to consider worst cases w.r.t. the µAi (xi ), as the combination is more of a conjunction. But the Extension Principle remains basically the same, i.e.

f (µA1 , . . . , µAn )(y) =
  ∨_{f(x)=y} min_i {µAi (xi )},   if {x ∈ Xⁿ | f (x) = y} ≠ ∅,
  0,                              otherwise.
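For a discrete fuzzy set, the Extension Principle can be sketched directly (Python for illustration; the example fuzzy set is made up):

```python
def extend(f, mu_A):
    """Image of a discrete fuzzy set under f by the Extension
    Principle: f(mu_A)(y) is the largest mu_A(x) with f(x) = y;
    points y with no preimage are simply absent (grade 0)."""
    image = {}
    for x, grade in mu_A.items():
        y = f(x)
        image[y] = max(image.get(y, 0.0), grade)
    return image

# f(x) = x^2 merges x and -x; the larger grade survives
mu_A = {-2: 0.2, -1: 0.6, 0: 1.0, 1: 0.9, 2: 0.3}
assert extend(lambda x: x * x, mu_A) == {4: 0.3, 1: 0.9, 0: 1.0}
```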
+:R×R→R
The Cartesian product of fuzzy sets A1 , . . . , An can be given by

µA1 ×···×An (u1 , . . . , un ) = min{µA1 (u1 ), . . . , µAn (un )}

or

µA1 ×···×An (u1 , u2 , . . . , un ) = µA1 (u1 ) · µA2 (u2 ) · · · µAn (un ).
R ◦ S = {[(u, w), sup_v (µR (u, v) ∗ µS (v, w))] | u ∈ U, v ∈ V, w ∈ W },

where ∗ can be any t-norm, e.g. minimum, algebraic product, bounded product or drastic product.
In a general form, a compositional operator may be expressed as a sup-star composition, where "star" denotes an operator, e.g. min, product, etc. In the literature, four kinds of compositional operators are used in the compositional rule of inference:
• Sup-min operation,
• Sup-product operation,
• Sup-bounded-product operation,
• Sup-drastic-product operation.
In FLC applications, the sup-min and sup-product compositional operators are the most frequently used.
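On finite universes, the sup-min composition is simply a max-min "matrix product"; a sketch (Python for illustration, with made-up relation matrices):

```python
def sup_min(R, S):
    """Sup-min (here: max-min) composition of relation matrices
    R (|U| x |V|) and S (|V| x |W|) on finite universes."""
    return [[max(min(R[i][k], S[k][j]) for k in range(len(S)))
             for j in range(len(S[0]))]
            for i in range(len(R))]

R = [[0.3, 0.9],
     [1.0, 0.4]]
S = [[0.5, 0.2],
     [0.8, 1.0]]
assert sup_min(R, S) == [[0.8, 0.9],
                         [0.5, 0.4]]
```

Replacing min by a product in the inner expression gives the sup-product composition instead.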
The fuzzy rule in premise 2 above can be put into the simpler form "A × B → C". Intuitively, this fuzzy rule can be transformed into a ternary fuzzy relation R, specified by the following membership function:

µR (x, y, z) = µ(A×B)×C (x, y, z) = µA (x) ∧ µB (y) ∧ µC (z). (4.1)
The resulting C′ is expressed as

C′ = A′ × B′ ◦ (A × B → C). (4.2)

Thus
The interpretation of multiple rules is usually taken as the union of the fuzzy relations corresponding to the fuzzy rules. For example, given the following fact and rules:

premise 1 (fact): x is A′ and y is B′,
premise 2 (rule 1): if x is A1 and y is B1 then z is C1 ,
premise 3 (rule 2): if x is A2 and y is B2 then z is C2 ,
consequence: z is C′,

we can use the fuzzy reasoning shown in Figure 4.14 as an inference procedure to derive the resulting output fuzzy set C′.
Figure 4.14: Fuzzy reasoning for multiple rules with multiple antecedents.
where C1′ and C2′ are the inferred fuzzy sets for rules 1 and 2, respectively. Figure 4.14 shows graphically the operation of fuzzy reasoning for multiple rules with multiple antecedents. Suppose a fuzzy rule base consists of a collection of fuzzy if...then rules of the following form:
Ri (x1 , x2 , . . . , xm , y) = Ai1 (x1 ) × Ai2 (x2 ) × · · · × Aim (xm ) → Ci (y) (4.6)
We could combine the rules by an aggregation operator Agg into one rule, which is used to obtain C′ from A′. If Agg is the intersection, we have

R(x, y) = ⋂_{i=1}^{n} Ri (x, y) = min_i (Ai1 (x1 ) × Ai2 (x2 ) × · · · × Aim (xm ) → Ci (y)),

and if Agg is the union, we have

R(x, y) = ⋃_{i=1}^{n} Ri (x, y) = max_i (Ai1 (x1 ) × Ai2 (x2 ) × · · · × Aim (xm ) → Ci (y)).
µC′ (y) = ∨_{i=1}^{n} {[µA′_{i1} (x1 ) ∧ µA_{i1} (x1 )] ∧ · · · ∧ [µA′_{im} (xm ) ∧ µA_{im} (xm )] ∧ µC_i (y)}
        = ∨_{i=1}^{n} {∧_{j=1}^{m} [µA′_{ij} (xj ) ∧ µA_{ij} (xj )]} ∧ µC_i (y)
        = ∨_{i=1}^{n} {τ_i ∧ µC_i (y)}
Figure 4.15 shows graphically the operation of fuzzy reasoning for MISO.
[Figure 4.15: for each rule "If x1 is Ai1 and . . . and xm is Aim then y is Ci", the firing strength τi = µAi1 (x1 ) ∧ · · · ∧ µAim (xm ) clips the consequent to τi ∧ µCi (y).]
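For crisp (singleton) inputs, the formula above reduces to Mamdani-style min-max inference; a minimal sketch (Python for illustration; the rules and membership functions below are made-up examples, not from the text):

```python
def miso_infer(rules, inputs, y_grid):
    """Mamdani-style min-max inference for crisp inputs. Each rule is
    (list of antecedent membership functions, consequent membership
    function); returns the output fuzzy set over y_grid as a dict."""
    out = {y: 0.0 for y in y_grid}
    for antecedents, consequent in rules:
        # firing strength: min over the antecedent grades
        tau = min(mf(x) for mf, x in zip(antecedents, inputs))
        for y in y_grid:
            # clip the consequent by tau, combine rules by max
            out[y] = max(out[y], min(tau, consequent(y)))
    return out

# two made-up rules over [0, 1] with toy membership functions
low  = lambda v: max(0.0, 1.0 - v)
high = lambda v: min(1.0, v)
rules = [([low, low], low), ([high, high], high)]
out = miso_infer(rules, inputs=[0.8, 0.6], y_grid=[0.0, 0.5, 1.0])
assert out[1.0] == 0.6   # second rule fires with strength 0.6
```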
From the previous definitions we see that there are six interpretations (1-6) of fuzzy implication, and in each interpretation we may employ different t-norms or t-conorms. Therefore, a fuzzy if...then rule (4.5) can be interpreted in a number of ways, and the output of the fuzzy inference mechanism can take different forms. For these different types of outputs, we can use different defuzzifiers to map them into a single point in the output space V.
4.5 Approximate Reasoning
(5) I(p, q) ≥ q
(10) I is continuous.
The following table gives the most usual implications, which class they belong to
and which properties are satisfied [Dubois, Prade 91, Dubois, Lang, Prade 91].
Let p → q represent the rule if p then q, where p can be of the form p1 and . . . and pn . We can then say that:
(i) the truth value connected to p, τ (p), is of the form τ (p) = τ (p1 ) ∗ . . . ∗ τ (pn ), where ∗ is a t-norm;
Definition Let Γ be a fuzzy set of axioms. The mapping Γ |=: L → [0, 1] is given
by Γ |= P = inf{Υ(P ) | Υ is a valuation w.r.t. Γ}, where inf ∅ = 1.
Chapter 5

Summary and Exercises

EXERCISES
Suppose that their intersection and union are defined by Hamacher's t-norm and t-conorm with γ = 1, respectively. What are then the membership functions of A ∩ B and A ∪ B?
II.3 Show that Yager's ∧ and ∨ are, respectively, a t-norm and a t-conorm.
II.4 Prove that for any t-norm T and any co-t-norm S we have
II.5 Let µ1 , µ2 ∈ Fc (R) (the class of all upper semicontinuous fuzzy sets of R). With the help of the extension principle, we lay down the following for the sum µ1 ⊕ µ2 and the product µ1 ⊙ µ2 :
II.6 Let f (x) = x2 and let A ∈ F be a symmetric triangular fuzzy number with
membership function
A(x) = 1 − |a − x|/α   if |a − x| ≤ α,
A(x) = 0               otherwise.
Then use the extension principle to calculate the membership function of fuzzy set
f (A).
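As a quick numerical check of this exercise: for f (x) = x², the extension principle gives f (A)(y) = sup{A(x) | x² = y} = max(A(√y), A(−√y)) for y ≥ 0, and 0 for y < 0. A sketch (the concrete numbers a = 2, α = 1 are just an illustration):

```python
import math

def triangular(a, alpha):
    """Symmetric triangular fuzzy number A(x) = 1 - |a - x|/alpha on [a-alpha, a+alpha]."""
    def A(x):
        return max(0.0, 1.0 - abs(a - x) / alpha)
    return A

def extend_square(A):
    """Image of A under f(x) = x^2 via the extension principle:
    f(A)(y) = sup{ A(x) : x^2 = y } = max(A(sqrt(y)), A(-sqrt(y))) for y >= 0."""
    def fA(y):
        if y < 0:
            return 0.0          # x^2 = y has no real solution
        r = math.sqrt(y)
        return max(A(r), A(-r))
    return fA

A = triangular(a=2.0, alpha=1.0)   # peak at 2, support [1, 3]
fA = extend_square(A)
print(fA(4.0))   # A(2) = 1, so membership 1 at y = 4
print(fA(1.0))   # A(1) = 0 and A(-1) = 0
```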
II.8 Consider two fuzzy relations R = "x is considerably smaller than y" and G = "x is very close to y":

          y1    y2    y3    y4                  y1    y2    y3    y4
    x1   0.5   0.1   0.1   0.7           x1   0.4   0     0.9   0.6
R = x2   0     0.8   0     0         G = x2   0.9   0.4   0.5   0.7
    x3   0.9   1     0.7   0.8           x3   0.3   0     0.8   0.5
1. What is the intersection of R and G, i.e. the relation "x is considerably smaller than y AND x is very close to y"?
2. What is the union of R and G, i.e. the relation "x is considerably smaller than y OR x is very close to y"?
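With the standard min/max connectives (the exercise does not prescribe other operators here), intersection and union of two relations on the same product space are computed elementwise. A sketch using the matrices of this exercise:

```python
# Relations from exercise II.8, rows x1..x3, columns y1..y4.
R = [[0.5, 0.1, 0.1, 0.7],
     [0.0, 0.8, 0.0, 0.0],
     [0.9, 1.0, 0.7, 0.8]]
G = [[0.4, 0.0, 0.9, 0.6],
     [0.9, 0.4, 0.5, 0.7],
     [0.3, 0.0, 0.8, 0.5]]

# Intersection: elementwise minimum; union: elementwise maximum.
intersection = [[min(r, g) for r, g in zip(rr, gg)] for rr, gg in zip(R, G)]
union        = [[max(r, g) for r, g in zip(rr, gg)] for rr, gg in zip(R, G)]

print(intersection[0])  # [0.4, 0.0, 0.1, 0.6]
print(union[0])         # [0.5, 0.1, 0.9, 0.7]
```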
II.9 Consider two fuzzy relations R = "x is considerably smaller than y" and G = "y is very close to z":

          y1    y2    y3    y4                  z1    z2    z3
    x1   0.5   0.1   0.1   0.7           y1   0.4   0.9   0.3
R = x2   0     0.8   0     0         G = y2   0     0.4   0
    x3   0.9   1     0.7   0.8           y3   0.9   0.5   0.8
                                         y4   0.6   0.7   0.5
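Since R lives on X × Y and G on Y × Z, the natural operation on them is the sup-min (max-min) composition; the question text itself is cut off in these notes, so the following sketch simply shows that computation:

```python
R = [[0.5, 0.1, 0.1, 0.7],
     [0.0, 0.8, 0.0, 0.0],
     [0.9, 1.0, 0.7, 0.8]]          # x_i rows, y_j columns
G = [[0.4, 0.9, 0.3],
     [0.0, 0.4, 0.0],
     [0.9, 0.5, 0.8],
     [0.6, 0.7, 0.5]]               # y_j rows, z_k columns

def maxmin(R, G):
    """Sup-min composition: (R o G)(x, z) = max_y min(R(x, y), G(y, z))."""
    return [[max(min(R[i][j], G[j][k]) for j in range(len(G)))
             for k in range(len(G[0]))]
            for i in range(len(R))]

print(maxmin(R, G))   # rows x1..x3, columns z1..z3
```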
FUZZY SYSTEMS
Chapter 6
Fuzzy Control
where ⟨fuzzy criteria⟩ and ⟨fuzzy conclusion⟩ either are atomic or compound fuzzy propositions. Such a rule can be seen as a causal relation between measurements and control values of the process. If e and ė are insignals and u̇ an outsignal, and
further NS, PS and NL are linguistic variables, then
IF e is N S AND ė is P S THEN u̇ is N L
or
if the present deviation of the control value is N S and the latest change
in the deviation of the control value is P S then this should cause the
control value to be N L.
Both antecedents and consequents can involve several linguistic variables. If this
is the case, the system is called a multi-input-multi-output (MIMO) fuzzy system.
Such systems have several insignals and outsignals. Also multi-input-single-output
(MISO) systems, with several insignals but only one outsignal are very common.
An example of a MISO system is as follows.
R1 : IF x is A1 AND y is B1 THEN z is C1
R2 : IF x is A2 AND y is B2 THEN z is C2
.. ..
. .
Rn : IF x is An AND y is Bn THEN z is Cn .
Here x and y are insignals and z is the outsignal. Further, Ai , Bi and Ci are linguistic
variables.
• An expert that knows the process provides linguistic rules that are specified
given previous knowledge and know-how related to the process.
• The process is described within a fuzzy model, based on which control rules
can be directly derived. Such methods do not yet exist, and require further
research.
• A fuzzy controller is adaptive in the sense that the rule base together with
parameters in rules (and possibly in inference mechanisms) are adjusted in real-
time given possibilities for the systems to identify itself as being in respectively
good or bad states. Some suggestions of related techniques are found e.g. in
[Procyk and Mamdani 79], [Shao 88] and [Sugeno 85].
Whatever technique we use, our goal is to construct a number of fuzzy rules with
the following syntax:
Ri : IF x1 is Ai1 AND . . . AND xm is Aim THEN u is Ui
Note that this syntax is valid for MISO systems. For rule Ri we have x1 , . . . , xm as
insignals and Ai1 , . . . , Aim as respective linguistic quantifiers of the insignals. The
consequent of the rule is ”u is Ui ”.
Example. The following shows an example of fuzziness used for speed control. As insignals we have the actual speed v (km/h) and the load l (N) of the car. Load components are e.g. force F and friction Fµ . As linguistic variables for speed we use LS (low speed), N S (normal speed) and HS (high speed), and for load similarly LL (low load), N L (normal load) and HL (high load). These are typically bell-shaped functions, and as the midpoint of N S we use 70 km/h, which is also the constant speed we try to maintain. The choice of precise shape and transposition of the membership functions is left open at this point. Low load appears e.g. downhill and high load uphill (see figure 6.1).
[Figure 6.1: the car maintaining speed v under load F + Fµ ; normal load on a flat road. Figure 6.2: the structure of a fuzzy controller: the insignal x passes through fuzzification, the rule base and inference mechanism, and defuzzification, producing the outsignal u.]
[Figure: Mamdani inference with two rules: the antecedent membership functions A11 , A12 and A21 , A22 are evaluated at x1 , x2 , the activation levels α1 , α2 are formed with min, and the output sets U1 , U2 are clipped at these levels.]
This is the most common view of fuzzy control. In Mamdani's method, conjunction is given by the minimum operator, implication likewise, and the resulting output membership functions are combined using the maximum operator as disjunction.
To be more precise, if the activation level is given by
αi = ∧_{j=1}^{m} Aij (xj ),
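A minimal sketch of this activation-and-clipping scheme for two rules over two insignals; the triangular membership functions are made up for illustration and are not the ones in the figures:

```python
def tri(a, b, c):
    """Triangular membership function with support [a, c] and peak at b."""
    def mu(x):
        if x <= a or x >= c:
            return 0.0
        return (x - a) / (b - a) if x <= b else (c - x) / (c - b)
    return mu

# Antecedent membership functions A_ij and output sets U_i (assumed shapes).
A = [[tri(0, 2, 4), tri(0, 3, 6)],       # rule 1: A11, A12
     [tri(2, 4, 6), tri(3, 6, 9)]]       # rule 2: A21, A22
U = [tri(0, 1, 2), tri(1, 2, 3)]         # output sets U1, U2

def mamdani(x, u):
    """Activation alpha_i = min_j A_ij(x_j); output mu_U(u) = max_i min(alpha_i, U_i(u))."""
    alphas = [min(Aij(xj) for Aij, xj in zip(Ai, x)) for Ai in A]
    return max(min(a, Ui(u)) for a, Ui in zip(alphas, U))

print(mamdani([2.0, 3.0], 1.0))
```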
In Takagi-Sugeno control the consequent of each rule is instead a linear function of the insignals, i.e. rules are of the form

Ri : IF x1 is Ai1 AND . . . AND xm is Aim THEN ui = pi0 + pi1 x1 + . . . + pim xm ,
where pi0 , . . . , pim are constants related to rule i. Methods to specify the con-
stants are discussed in [Takagi and Sugeno 85], including also algorithms for se-
lecting insignals related to respective ui . The final control value is given by
u = ( Σ_{i=1}^{n} αi ui ) / ( Σ_{i=1}^{n} αi ).
Example. Given insignals x1 = 6.5 and x2 = 9.2, and linear output functions u1 = 2 + 1.7x1 + 1.3x2 and u2 = −3 + 0.5x1 + 2.1x2 , we obtain u1 ≈ 25.0 and u2 ≈ 19.6. The control value u is then the weighted average of u1 and u2 . Each rule output can also be seen as a fuzzy singleton Ui with Ui (ui ) = αi .
[Figure: Takagi-Sugeno inference with two rules: the activation levels α1 , α2 are formed with min, each rule yields a crisp value ui , and the control value is u = Σ αi ui / Σ αi .]
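The example can be completed in code; the activation levels α1 = 0.6 and α2 = 0.4 are hypothetical, since they are not given in the text:

```python
x1, x2 = 6.5, 9.2

# Linear consequents from the example.
u1 = 2 + 1.7 * x1 + 1.3 * x2       # approx. 25.0
u2 = -3 + 0.5 * x1 + 2.1 * x2      # approx. 19.6

# Assumed activation levels of the two rules.
alpha = [0.6, 0.4]

# Control value: weighted average of the rule outputs.
u = (alpha[0] * u1 + alpha[1] * u2) / (alpha[0] + alpha[1])
print(round(u1, 2), round(u2, 2), round(u, 2))
```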
6.3 Defuzzification
As a result of inference we obtain a fuzzy set µU of proposed control values. Each
activated rule Ri , i.e. for which the activation level is non-zero, contributes to µU ,
and therefore we obtain the final conclusion as
µU = ∪_{i=1}^{n} µUi ,
Centre-of-Gravity, CoG
CoG finds the centre of gravity of µU . In the discrete case, we have
u = ( Σ_{k=1}^{l} uk · µU (uk ) ) / ( Σ_{k=1}^{l} µU (uk ) )
[Figure: CoG defuzzification.]
In ICoG we only consider the area of µU that is above a specified level α (see figure
6.8), and compute the centre of gravity for this area. Thus, in the discrete case, we
have
u = ( Σ_{k=1}^{l} uk · [µU (uk )]α ) / ( Σ_{k=1}^{l} [µU (uk )]α ),
where [µU (uk )]α denotes the part of the fuzzy set µU above the α level.
In the continuous case, we have

u = ( ∫_{[U ]α} v · [µU (v)]α dv ) / ( ∫_{[U ]α} [µU (v)]α dv ).
[Figure 6.8: ICoG defuzzification above the level α.]
Centre-of-Sums, CoS
In CoG we had to consider the whole of µU . CoS is similar to CoG but implementationally much more efficient. In CoS, we consider all output functions when
computing the sum of all µUi , i.e. overlapping areas may be considered more than
once (see figure 6.9). In the discrete case, we obtain
u = ( Σ_{k=1}^{l} Σ_{i=1}^{n} uk · µUi (uk ) ) / ( Σ_{k=1}^{l} Σ_{i=1}^{n} µUi (uk ) )
[Figure 6.9: CoS defuzzification.]
First-of-Maxima, FoM
In FoM, defuzzification of µU is defined as the smallest value in the domain of U with maximal membership value, i.e.

u = min{v ∈ U | µU (v) = max µU }.
Middle-of-Maxima, MoM
MoM is similar to FoM. Instead of taking the first value with maximal grade of membership, we compute the average of all values with maximal grades,

u = ( min{v ∈ U | µU (v) = max µU } + max{v ∈ U | µU (v) = max µU } ) / 2.
A graphical representation is given in figure 6.11.
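A sketch of the discrete defuzzifiers described above, applied to a small sampled µU (the membership values are made up):

```python
us = [0.0, 1.0, 2.0, 3.0, 4.0]          # sampled output domain
mu = [0.2, 0.8, 0.8, 0.5, 0.1]          # sampled mu_U (hypothetical)

# Centre-of-Gravity: membership-weighted average of the samples.
cog = sum(u * m for u, m in zip(us, mu)) / sum(mu)

# First-of-Maxima: smallest u with maximal membership.
mmax = max(mu)
fom = min(u for u, m in zip(us, mu) if m == mmax)

# Middle-of-Maxima: average of smallest and largest maximizers.
mom = (min(u for u, m in zip(us, mu) if m == mmax) +
       max(u for u, m in zip(us, mu) if m == mmax)) / 2

print(cog, fom, mom)
```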
Height Method, HM
[Figures: FoM and MoM defuzzification.]
The height method is not applied directly on µU , but is focused on the peaks of the individual µUi , and computes an average of these peak positions weighted by their heights, according to

u = ( Σ_{i=1}^{n} µ_{Ui}^{peak} · fi ) / ( Σ_{i=1}^{n} fi ),
where µ_{Ui}^{peak} is the ui for which µUi (ui ) = 1 (with µUi in its original form); see figure 6.12. For a trapezoidal membership function, µ_{Uk}^{peak} becomes an interval (plateau), from which we can select a representative, e.g. the mean value. The value fi is the height of µUi , i.e. max µUi .
HM is computationally both simple and fast.
[Figure 6.12: the height method: the heights f1 and f2 of µU1 and µU2 at the peak positions µ_{U1}^{peak} and µ_{U2}^{peak} . A further figure illustrates CoLA defuzzification.]
Example. Defuzzification can lead to undesirable effects. Consider e.g. a car trying to avoid an obstacle directly in front of it. The membership function representing candidate control values is then typically as shown in figure 6.14. Both CoG and CoS, however, will result in driving straight ahead.
Chapter 7
Fuzzy Clustering
[Figure 7.1: a set of points partitioned into two crisp clusters; each point has membership value 0 or 1 in each cluster.]
[Figure 7.2: the same set of points with fuzzy memberships; each point has a grade of membership in cluster 1 and in cluster 2 (e.g. 0.47/0.53 near the boundary, 0.91/0.09 close to a cluster centre), the two grades summing to one.]
To describe the algorithm, we need some notation. The set of all points considered is X = {x1 , · · · , xn } (⊂ Rd ). We write ui : X → [0, 1] for the ith cluster,
i = 1, . . . , c, and we will use uik to denote ui (xk ), i.e. the grade of membership of xk
in cluster ui . We also use U = huik i, for the matrix of all membership values. The
’midpoint’ of ui is vi (∈ Rd ), and is computed according to
vi = ( Σ_{k=1}^{n} (uik )m xk ) / ( Σ_{k=1}^{n} (uik )m ).
The memberships are constrained by

Σ_{i=1}^{c} uik = 1,

and the objective is to minimise

J = Σ_{i=1}^{c} Σ_{k=1}^{n} (uik )m ‖ xk − vi ‖² .
The following algorithm for FCM clustering will meet this objective:
if µk = ∅, then

uik = 1 / [ Σ_{j=1}^{c} ( ‖ xk − vi ‖ / ‖ xk − vj ‖ )^{2/(m−1)} ]

otherwise

uik = 0 ∀ i ∉ µk and Σ_{i∈µk} uik = 1.
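The two update steps above can be sketched directly from the formulas (a minimal implementation assuming the Euclidean norm and that no point coincides exactly with a midpoint, so the case µk ≠ ∅ never triggers):

```python
import random

def fcm(X, c, m=2.0, iters=50, seed=0):
    """Fuzzy c-means: alternate the midpoint update for v_i and the
    membership update for u_ik, here for a fixed number of iterations."""
    rng = random.Random(seed)
    n, d = len(X), len(X[0])
    # Random initial memberships, normalised so that sum_i u_ik = 1.
    U = [[rng.random() for _ in range(n)] for _ in range(c)]
    for k in range(n):
        s = sum(U[i][k] for i in range(c))
        for i in range(c):
            U[i][k] /= s

    def dist(x, v):
        return sum((a - b) ** 2 for a, b in zip(x, v)) ** 0.5

    for _ in range(iters):
        # Midpoints: v_i = sum_k (u_ik)^m x_k / sum_k (u_ik)^m.
        V = []
        for i in range(c):
            w = [U[i][k] ** m for k in range(n)]
            V.append([sum(w[k] * X[k][j] for k in range(n)) / sum(w)
                      for j in range(d)])
        # Memberships: u_ik = 1 / sum_j (||x_k - v_i|| / ||x_k - v_j||)^(2/(m-1)).
        for k in range(n):
            ds = [dist(X[k], V[i]) for i in range(c)]
            for i in range(c):
                U[i][k] = 1.0 / sum((ds[i] / ds[j]) ** (2.0 / (m - 1.0))
                                    for j in range(c))
    return U, V

# Two well-separated groups of points.
X = [[0, 0], [0, 1], [1, 0], [9, 9], [9, 10], [10, 9]]
U, V = fcm(X, c=2)
```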
[Figure 7.3: three cluster midpoints v1 , v2 , v3 and the overall midpoint x̄ of all points in X.]
where x̄ is the midpoint of all points in X (see figure 7.3). Obviously, we will select
that particular c0 for which

S(U, c0 ) = min_c S(U, c),

i.e. we should iterate the clustering algorithm with c ranging over a suitably selected interval, e.g. 2 to 20 clusters.
Clearly, in many cases it is not straightforward to find the optimal number of clusters. For a small number of data points in X, a worst case scenario is that the criterion decreases with increasing c and reaches its optimum with c equal to the number of data points in X. Also intuitively (see figure 7.4), it is not obvious which number of clusters is better from a practical viewpoint.
Each cluster will now generate one rule in the following way. Consider a fuzzy
cluster ui on X, and write πp (ui ) : Xp → [0, 1] for the corresponding cluster projected
on the pth axis, i.e. πp (ui )(πp (xk )) = uik . Assume we have decided to use Gaussian
functions in our generated rule. This means we want to estimate the parameters of
Gaussian functions with best fit to respective πp (ui ). Thus, for each p we need to
find βp such that

Σ_{xk ∈X} | uik − e^{−βp (πp (xk )−αip )²} |

is minimised.
[Figure: the projections of the cluster with midpoint v onto the coordinate axes yield membership functions A31 and A32 , centred at α31 and α32 .]
related to the ith cluster. If this cluster is elliptic, with points concentrated around the centre point of the ellipse, then the principal axes of the cluster are given by the eigenvectors of M .
7.5 Applications
There is a wide range of applications of clustering. In industrial process control
there are numerous applications, such as fertilizer production ([Riissanen 92b]),
automatic steering (AGNES [Olli 95, Huber 96]), and process state prognostics
([Sutanto and Warwick 94]), only to mention some examples. The reader is referred
to proceedings e.g. related to IEEE and IFSA conferences on fuzzy systems.
Also in pattern recognition and image processing there are typical applications, e.g. in segmentation of colour images ([Lim and Lee 90]), edge detection in 3-D objects ([Huntsberger et al 86]), and surface approximation for reconstruction of images ([Krishnapuram et al 95]).
Chapter 8
Generalised Perceptrons as Fuzzy Rules
[Figure: the neural computing view versus the fuzzy inference view.]
For neuralists, the first step is the most "magic". (A fuzzy practitioner considers this step to be "logic".) In this step, transformations like 1-of-N coding are applied to handle symbolic data, and histogram equalization overcomes difficulties with out-of-normal-range values. For the second step there are few rules-of-thumb concerning
splitting of the data file into pieces for, respectively, training and testing. The third
step is experimental, and is a generate-and-test approach to finding (what in the end
is believed to be) near-optimal networks. The fourth phase is often nicely supported
by code generating modules in software packages.
According to the neural computing view, raw data is collected from a physical description of the disease. Data is inserted and used as such by the network. Preprocessing of data is a post-activity after the final network has been extracted. By preprocessing, the diagnostic performance might be further improved.
The fuzzy inference view is the opposite: Preprocess first, and organize your sys-
tem based on knowledge extracted from the preprocessing phase. The preprocessing
phase includes a reorganization (often symptom combinations) of the physical dis-
ease description. Data is pushed through (non-linear) transformation functions, and
resulting transformed data is used instead of raw data. In subsequent sections we
will see examples of the power of using transformed data.
[Figure: the processing chain: physical input is preprocessed (a linear transformation) into logical input, passed through network inference, and the network output is defuzzified/reformed into logical output.]
Once the input modelling has been fixed, the second step in forwarding starts off by observing that we need transformation functions to convert physical data into logical values. To see the necessity of the preprocessing functions we need only mention the standard example of fever. A value for fever when determining bronchitis cannot be transformed in the same way as when determining pneumonia.
In all, transformation functions related to particular inputs are always to be given
for the specific disease under consideration.
At this point we need to be critical about the selection of preprocessing function types. As we have seen, e.g. sigmoidal transformation functions can be useful. Again, whether the function needs to be ascending or descending is usually trivially known by the domain expert. Learning strategies for parameter estimation within the network are now easily extended to apply also to the parameters within the sigmoidal functions. Of course, other preprocessing functions can be used, with the recommendation that they be differentiable to enable parameter tuning.
Once the logical inputs are given, the network is then the computational method
that implements the numerical optimization task, i.e. the training of all the system
parameters. Note that, whereas in the neural approach we usually only identify parameter values given a certain network structure, the structures within fuzzy systems domains are identified together with the parameter estimations [Bezdek 81, Jang 92, Sugeno and Yasukawa 93].
The output of the network is still only of logical nature, typically a unit interval
value, and needs further explanation if representing a final decision or recommen-
dation from the end-user support system. The value can also remain as such, or
simply be appropriately transformed, to indicate a risk value, but can also act as
an input into a decision maker that provides even binary type decisions regarding
further actions to be suggested within patient care.
Generally speaking, a diagnostic task is the problem to find the best representation. The specialized preprocessing perceptron computes

y = act( Σ_{i=1}^{n} wi · g[αi , βi ](xi ) ),
where act is the activation function and g[αi , βi ] are sigmoidal functions.
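A sketch of this forward computation, taking both act and the preprocessing functions g[αi , βi ] to be logistic sigmoids (the parameter values, e.g. the fever cut-off, are made-up illustrations):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def g(alpha, beta):
    """Sigmoidal preprocessing function with cut-off alpha and slope beta."""
    return lambda x: sigmoid(beta * (x - alpha))

def preprocessing_perceptron(x, w, alphas, betas, act=sigmoid):
    """y = act( sum_i w_i * g[alpha_i, beta_i](x_i) )."""
    s = sum(wi * g(a, b)(xi) for wi, a, b, xi in zip(w, alphas, betas, x))
    return act(s)

# Example: a fever input with cut-off 38.5 degrees plus one lab value.
y = preprocessing_perceptron(x=[39.0, 4.2], w=[1.5, -0.8],
                             alphas=[38.5, 5.0], betas=[2.0, 1.0])
print(0.0 <= y <= 1.0)   # logical output stays in the unit interval
```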
The specialized preprocessing perceptron shows performance comparable to the multilayer perceptron. Improved discrimination effects and diagnostic success rates have been demonstrated in several case studies [Eklund et al 94, Kallin et al 95]. This, in conjunction with the remark that generalized preprocessing perceptrons use only interpretable parameters, makes the approach very appealing.
The logical disjunction, as the generalization of summation in the weighted sum, corresponds to some co-t-norm S as represented by generator functions g : R → [0, 1], with an existing inverse g −1 . The representation is given by

S(a1 , a2 ) = g(g −1 (a1 ) + g −1 (a2 )).
Incidentally, the representation functions are exactly the counterparts of the activation functions used in the neural framework, and it comes as a surprise that the respective literatures on co-t-norms and neural activation functions have very little in common.
Note how the representation functions can be parametrized in different ways.
From these observations we realize that we obtain a wide spectrum of parametrizations of the weighted sum, with interpretations in a logical framework. The
presentation of neural networks as a logical structure can indeed be taken to the extreme, as is done in a subsequent section, thereby concluding that, from a generalized viewpoint, fuzzy logic and neural networks come down to one and the same thing!
Note that the transformation of addition on the real line is not restricted to the
unit interval only. For learning rules we thereby obtain the following convenience.
For a parametrized function, the learning rule becomes

αk := αk ⊕ ∆p αk ,

where ⊕ is the corresponding addition in the range of the α's, and ∆p αk is given by a particular optimization strategy.
For the specialized preprocessing function we adapt the parameters in the sigmoid
functions according to
∆p αi = −η ∂Ep /∂αi = −η(tp − op )op (1 − op )wi g[αi , βi ](xi )(1 − g[αi , βi ](xi ))βi

and

∆p βi = −η ∂Ep /∂βi = η(tp − op )op (1 − op )wi g[αi , βi ](xi )(1 − g[αi , βi ](xi ))(xi − αi ),
respectively, with mean squares in the error function.
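The two update formulas can be transcribed directly; in this sketch g is the logistic function and op = act(Σ wi g[αi , βi ](xi )) with a logistic act, matching the mean-squares setting above:

```python
import math

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

def updates(t, x, w, alpha, beta, eta=0.5):
    """One pattern's gradient-descent increments for the cut-offs alpha_i
    and slopes beta_i of the sigmoidal preprocessing functions."""
    gs = [logistic(b * (xi - a)) for xi, a, b in zip(x, alpha, beta)]
    o = logistic(sum(wi * gi for wi, gi in zip(w, gs)))
    common = (t - o) * o * (1 - o)
    d_alpha = [-eta * common * wi * gi * (1 - gi) * bi
               for wi, gi, bi in zip(w, gs, beta)]
    d_beta  = [ eta * common * wi * gi * (1 - gi) * (xi - ai)
               for wi, gi, xi, ai in zip(w, gs, x, alpha)]
    return d_alpha, d_beta

da, db = updates(t=1.0, x=[0.7, 0.3], w=[1.0, 1.0],
                 alpha=[0.5, 0.5], beta=[4.0, 4.0])
```

With target t above the output o and positive weight and slope, the cut-off α1 is pushed down (increasing g and hence o towards the target), as the signs of the increments confirm.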
Learning as described above is based on gradient descent. However, this being
rather an exemplification than a choice, we certainly do refer also to other optimiza-
tion techniques to be evaluated for these purposes. For some examples, we refer to
chapters describing this in more detail.
Tp (a1 , a2 ) = (a1^p + a2^p − 1)^{1/p}    if a1^p + a2^p ≥ 1 and p ≠ 0,
Tp (a1 , a2 ) = 0                          if a1^p + a2^p < 1 and p ≠ 0,
Tp (a1 , a2 ) = a1 a2                      if p = 0,
and

Φ_{i=1}^{n} (zi ) = 1 − Π_{i=1}^{n} (1 − zi ),
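A direct transcription of Tp and Φ as code (a sketch; note that p = 1 yields the Łukasiewicz t-norm max(0, a1 + a2 − 1)):

```python
def T(p, a1, a2):
    """Parametrized t-norm: (a1^p + a2^p - 1)^(1/p) when a1^p + a2^p >= 1,
    0 otherwise, and the product a1*a2 in the limiting case p = 0."""
    if p == 0:
        return a1 * a2
    s = a1 ** p + a2 ** p
    return (s - 1) ** (1.0 / p) if s >= 1 else 0.0

def phi(zs):
    """Phi(z_1, ..., z_n) = 1 - prod_i (1 - z_i), the probabilistic sum."""
    out = 1.0
    for z in zs:
        out *= 1 - z
    return 1 - out

print(T(1, 0.7, 0.6))    # max(0, 0.7 + 0.6 - 1) = 0.3
print(T(0, 0.7, 0.6))    # product: 0.42
print(phi([0.5, 0.5]))   # 0.75
```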
In this case study we use both the preprocessing perceptron with the weighted
sum, and also the generalised preprocessing perceptron with a selection of connec-
tives as specified in the previous section.
The data consists of ten different parameters as described in section 1.1.3. Correctness rates, as average values, are high both for the weighted sum (99%) and for the generalised preprocessing perceptron (97%). The values are sensitivities, given that specificity is enforced to provide 95% correctness.
Values for weights (implication uncertainties), cut-offs and slopes are shown in table 8.1. In this table, WS stands for the method using a weighted sum and YI for the method using Yager's implication.
Note that cut-off values in some cases are surprisingly similar. A cautious con-
clusion can be that the preprocessing layer indeed, at least to some extent, is inde-
pendent of the network structure. More experiments should, of course, be done to
verify such claims, but, nevertheless, this observation is most encouraging for aiming
at further understanding hybrids and modularity in learning architectures.
We can say that these methods are successful, but a comparison between them does not seem meaningful. However, we should note that shifting from a linear feedforward to a non-linear logic-based approach does not seem to endanger success rates. The advantage of the logic approach is obvious, since we immediately open up possibilities to identify suitable implication operators for particular symptoms and signs. Thus, this method can in fact be used to identify inference structures in medical decision making, or at least in some general cases.
Data retrieval
Figure 8.6: The different paths a patient can take through the health care system.
A health care centre is visited for preventive care or illness. In both cases the
general practitioner examines the patient and laboratory samples might be taken.
The general practitioner uses a computerised patient record system to store the
information about a patient. As the systems in primary health care are often local,
they may differ from place to place. This makes it difficult to compare and/or
combine the data stored about a patient at a health care centre and at a hospital.
If the doctor at the health care centre wishes to consult specialists or if the patient
needs treatment that cannot take place at home, the patient is referred to hospital.
The patient can also be admitted to hospital directly, e.g., in emergency situations.
In hospital the specialists examine the patient and laboratory tests and/or x-rays
are taken. The hospital performs surgery if needed and often the patient stays in
hospital for after care. During the hospital stay several laboratory samples might
be taken and other data gathered.
Also at the hospitals information is stored about patients’ health status such
as laboratory test results, medications, and diagnoses in vast electronic databases.
Many important medical observations have been made by studying patient history, and much more could be done if scientists had faster and broader access to the knowledge that can be found in the huge data masses. Further observations might be made if it were possible to connect and compare databases at health care centres with the ones at hospitals.
The scientific and the administrative viewpoint on databases do not coincide.
The administrative interest is to find information on one patient at a time, inserting
or retrieving data concerning this particular patient. The scientific interest is in
turn to combine information concerning patients’ health status. Individual patients
are not in focus but larger groups that have certain things, e.g., the same diagnosis,
in common. For administrative purposes summarised information is also of interest,
but such information concerns mainly, for example, how much an average patient
costs or which hospital wards consume the most resources. The scientific viewpoint
has generally not been considered when building hospital databases, but the trends
seem to be shifting slowly.
these program clauses are weighted with values in the unit interval. Thus we consider a Prolog modification based on Łukasiewicz logic.
Definition The subset of atomic formulae of L is denoted Kf . Its elements are
called facts. A rule is a formula of the form
(P ← (. . . (Q1 ¦ Q2 ) ¦ . . . ¦ Qn )),
• (i) Π0 = Π and Πn = Π′ ,
Note, that ≺ defines a preordering on the set of fuzzy logic programs (i.e. ≺ is a
reflexive and transitive relation).
Definition For a fuzzy logic program Π the mapping Π ⊢ : Kf → [0, 1] is given by Π ⊢ P = sup{Π′ (P ) | Π ≺ Π′ }.
In opposition to standard Prolog, where it is sufficient to find just one proof for a fact, in this many-valued modification all proofs for a fact have to be taken into account, in order to obtain the greatest uncertainty factor for the fact. Even in the case when the FLP Π has a finite support, there does not necessarily for every fact P exist an FLP Π′ derivable from Π s.t. Π′ (P ) = Π ⊢ P . If e.g. Π(P ) = 0.5 and Π(P ← P ¦ P ) = 1, Π(Q) = 0 otherwise, where ¦(α, β) = α + β − α · β, then Π ⊢ P = 1, but for all FLP's Π′ derivable from Π, Π′ (P ) < 1 holds.
Although the supremum is never reached, it can be approximated by a sequence
of FLP’s derivable from Π. Therefore, the supremum does not cause severe prob-
lems. The need to consider all instead of only one proof for a fact enforces the
The continuity and the monotonicity of ¦ imply that there are ε1 , . . . , εn > 0 s.t. Π(R) + ¦(Υ′ (Q1 ) − ε1 , . . . , Υ′ (Qn ) − εn ) − 1 > Υ′ (Q). Since Υ′ (Qi ) = Π ⊢ Qi = sup{Π′ (Qi ) | Π ≺ Π′ } > Υ′ (Qi ) − εi for i = 1, . . . , n, there is a fuzzy logic program Π′ , satisfying Π ≺ Π′ and, for all i ∈ {1, . . . , n}, Π′ (Qi ) > Υ′ (Qi ) − εi . But from Π′ we can directly derive a fuzzy logic program Π″ with
which is obviously a contradiction. For simple rules R, Υ(R) ≥ Π(R) can be shown analogously.
one for each program clause in the FLP, where Name is the name of the node. Oper represents the operation in the rule, and is assigned to id for simple rules. AssList is the list of pairs consisting of an antecedent Si in the rule together with the value Π(Si ← X). Note that, before the transformations, the program is completed with rules so that there never exist two rules with the same head. For a leaf node, AssList is NIL. This corresponds to a fact in the FLP.
We may write Ω = FLN(Π) for this set of netnodes. Given an FLP Π : K → [0, 1], a corresponding FLN is now constructed as follows.
If there is more than one rule with head D, we rename these heads with unique D1 , . . . , Dn , respectively, and complete the program with the rule D ← D1 ¦ . . . ¦ Dn , with Π(D ← D1 ¦ . . . ¦ Dn ) = 1, where ¦ is the maximum operation.
In the completed program, for facts S we assign

S ↦ FLNetNode(S, id, NIL).

Simple rules are transformed according to

D ← S ↦ FLNetNode(D, id, [⟨S, Π(D ← S)⟩]).

Non-simple rules are assigned according to

D ← D1 ¦ . . . ¦ Dn ↦ FLNetNode(D, ¦, [. . . , ⟨Si , wi ⟩, . . .]),

where wi = Π(Di ← Si ) if Di ← Si is a simple rule and Π(Di ← Si ) > 0, and wi = 1 if Di ← X is a non-simple rule.
Conversely, given an FLN, a corresponding FLP is generated as follows.
FLNetNode(S, id, NIL) assigns to facts S, where Π(S) are the corresponding input values in these leaf nodes.
FLNetNode(D, id, [⟨S, w⟩]) assigns to simple rules D ← S with Π(D ← S) = w.
FLNetNode(D, ¦, [⟨S1 , w1 ⟩ . . . ⟨Sn , wn ⟩]) assigns to the non-simple rule D ← D1 ¦ . . . ¦ Dn , together with simple rules D1 ← S1 . . . Dn ← Sn , where Π(D ← D1 ¦ . . . ¦ Dn ) = 1 and Π(Di ← Si ) = wi .
Again, we can write Π = FLP(Ω) for this program. Note that Ω = FLN(FLP(Ω)), and that Π and FLP(FLN(Π)) are semantically equivalent.
For FLNs to allow for invoking learning procedures, restrictions have to be made.
These restricted nets are called neural logic nets (NLNs) and must fulfil the following
conditions:
• (i) there are no two rules with the same head,
• (ii) in one and the same rule each proposition constant appears at most once,
• (iii) the graph of named nodes created is acyclic,
• (iv) for non-simple rules R, Π(R) = 1.
The corresponding FLPs are called neural logic programs (NLPs). Because of (i)-(iv) and the finiteness of the support of Π, we get Π ⊢ P = max{Π′ (P ) | Π ≺ Π′ }.
Chapter 9
Parameter Estimations
As we have already seen, there have been many successful applications of fuzzy control, but control designers still face two major obstacles in implementing fuzzy control. The first is the acquisition of fuzzy rules, and the second is the search for optimal parameters of the membership functions in the linguistic rules. We have techniques for generating rules based on clustering algorithms. In general, however, the initial design of the fuzzy controller will not result in optimal control behaviour. To improve the control behaviour, tuning is necessary. From a tuning point of view, the problem becomes to search for optimal parameters of the membership functions in the linguistic rules. Recently, using supervised learning methods to fine-tune membership functions in a fuzzy rule base has received more and more attention. From the adaptive fuzzy control point of view, we take the pragmatic attitude that an adaptive fuzzy controller is a controller with adjustable fuzzy rule-base parameters and a mechanism for adjusting those parameters. An adaptive fuzzy control system can be thought of as having two loops. One loop is the normal feedback loop with the process and the fuzzy logic controller. The other loop is the parameter adjustment loop, which is often slower than the normal feedback loop. The derivation of the different learning algorithms is given in the following sections.
where p labels the pattern. The derivatives of F are obtained by summing the derivatives obtained for each pattern separately. yp∗ is the target output value for pattern p and yp is the actual output of the network function for pattern p. Of course, the different learning strategies will yield different improvements of the control behaviour. But it is not immediately clear how well different learning techniques are suited for ill-behaved data, such as is typically found in chemical and biological systems or in medical informatics, where modelling usually is difficult. In this section we will describe the gradient descent learning method.
Iterative gradient descent techniques can be used in any feed-forward networks.
Therefore, if we represent the fuzzy control system as feed-forward networks in
Figure 5.2.1, then we can use gradient descent (GD) to train different subclasses
of parameters. The objective is the updating of the parameters of the membership
functions in a fuzzy rule base such that the rule base performs a desired mapping
of input to output activations. Optimisation is based on (9.1). The learning goal
now is to find a global minimum of F . The parameters in the rule base are changed along a search direction d(t) given by the first-order derivative, namely the gradient ∇F := ∂F/∂αij , which drives the parameters in the direction of the estimated minimum:

∆α(t) = −η ∂F/∂α(t)   (9.2)

α(t + 1) = α(t) + ∆α(t)   (9.3)
Once the partial derivatives are known, the next step in GD learning is to compute the resulting parameter update. In its simplest form, the parameter update is a scaled step in the opposite direction of the gradient; in other words, the negative derivative is multiplied by a constant value, the learning rate η. This minimization technique is commonly known as "gradient descent":

∆α(t) = −η∇F (t)   (9.4)

or, for a single parameter:

∆αij (t) = −η ∂F/∂αij (t)   (9.5)
The derivatives of F are obtained by summing the derivatives obtained for each pattern. Here i = 1, 2, . . . , n, and k = 0, 1, 2, . . . is the update (iteration) index. From Figure 5.2.1 [Wang and Mendel 92] we see that y depends on αij only through gi . For notational simplicity, let y = a/b, a = Σ_{i=1}^{n} Ci gi , b = Σ_{i=1}^{n} gi and gi = Π_{j=1}^{m} e^{−βij (xj −αij )²} . Hence, using the chain rule, we have

∂F/∂αij = −(y ∗ − y) (∂y/∂gi )(∂gi /∂αij ) = −2(y ∗ − y) ((Ci − y)/b) gi βij (xj − αij )   (9.6)
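Formula (9.6) can be checked numerically against a finite difference; the parameter values in this sketch are made up, and F is the single-pattern error ½(y∗ − y)²:

```python
import math

# Two rules (n = 2), two inputs (m = 2): Gaussian antecedents, constant consequents.
C     = [1.0, -1.0]
alpha = [[0.0, 0.5], [1.0, -0.5]]
beta  = [[1.0, 2.0], [0.5, 1.0]]
x     = [0.3, 0.7]
ystar = 0.5

def output(alpha):
    """y = sum_i C_i g_i / sum_i g_i with g_i = prod_j exp(-beta_ij (x_j - alpha_ij)^2)."""
    g = [math.prod(math.exp(-beta[i][j] * (x[j] - alpha[i][j]) ** 2)
                   for j in range(2)) for i in range(2)]
    return sum(C[i] * g[i] for i in range(2)) / sum(g), g

y, g = output(alpha)
b = sum(g)

# Analytic derivative (9.6) w.r.t. alpha_00.
i, j = 0, 0
dF = -2 * (ystar - y) * (C[i] - y) / b * g[i] * beta[i][j] * (x[j] - alpha[i][j])

# One-sided finite-difference check on F = 0.5 * (ystar - y)^2.
eps = 1e-6
alpha[i][j] += eps
y2, _ = output(alpha)
dF_num = ((ystar - y2) ** 2 / 2 - (ystar - y) ** 2 / 2) / eps
print(abs(dF - dF_num) < 1e-4)
```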
In a manner similar to that described for tuning the consequent part, the mean (midpoint) and variance (slope) are updated as follows:

αij (k + 1) = αij (k) ⊕ 2ηα (y ∗ − y) ((Ci − y)/b) gi βlij (k)(xj − αij (k))   if xj ≤ αij ,
αij (k + 1) = αij (k) ⊕ 2ηα (y ∗ − y) ((Ci − y)/b) gi βrij (k)(xj − αij (k))   if xj > αij ,   (9.7)

βlij (k + 1) = βlij (k) ⊖ ηβl (y ∗ − y) ((Ci − y)/b) gi (xj − αij (k))²   (9.8)

βrij (k + 1) = βrij (k) ⊖ ηβr (y ∗ − y) ((Ci − y)/b) gi (xj − αij (k))²   (9.9)

ci (k + 1) = ci (k) ⊖ ηc ((y ∗ − y)/b) gi .   (9.10)

where ⊕ and ⊖ are (isomorphic) operations in the subset of R, i = 1, 2, . . . , n, j = 1, 2, . . . , m, k = 0, 1, 2, . . . , and ηα , ηβ and ηc are the learning rates.
The training procedure for the neural fuzzy system is a two-pass procedure. First, for given input values x = (x1 , · · · , xm ), compute forward along the network (i.e. the neural fuzzy system) to obtain gi (i = 1, 2, . . . , n), a, b and y; then train the network parameters αij , βlij , βrij and Ci (i = 1, 2, . . . , n, j = 1, 2, . . . , m) backward using (9.6), (9.7), (9.8), (9.9) and (9.10), respectively.
Although the basic learning rule is rather simple, it is often a difficult task to choose the learning rate appropriately. A good choice depends on the shape of the error function, which obviously changes with the learning task itself. A small learning rate will result in a long convergence time on a flat error function, whereas a large learning rate may lead to oscillations, preventing the error from falling below a certain value. A comparison of different learning rates will be shown later. The typical adaptive fuzzy control system tuned by the GD learning algorithm is shown in the following:
F = (1/2) Σp Ep = (1/2) Σp (yp∗ − yp )²

{ Denote:

y = ( Σ_{i=1}^{n} Ci · ( Π_{j=1}^{m} e^{−βij (xj −αij )²} ) ) / ( Σ_{i=1}^{n} Π_{j=1}^{m} e^{−βij (xj −αij )²} ),

i.e. y = a/b, a = Σ_{i=1}^{n} Ci gi , b = Σ_{i=1}^{n} gi and gi = Π_{j=1}^{m} e^{−βij (xj −αij )²} . }
1. Give initial values to the training parameters {p, α, βl, βr, C, ηα , ηβ , eg}.
3. k := 1; y = yinit ;
4.1 If the sum-squared error is less than the goal 0.02, then stop the loop.
4.2 ∂F/∂αij (k) := −e (∂y/∂gi )(∂gi /∂αij (k)) := −2e ((Ci − y)/b) gi βij (k)(xj − αij (k)); { compute the derivative of the error with respect to the midpoint of the membership function }
4.2.1 bp := Σ_{i=1}^{n} Π_{j=1}^{m} e^{−βij (k)(x_j^p −αij (k))²} ;
4.2.2 g_i^p := Π_{j=1}^{m} e^{−βij (k)(x_j^p −αij (k))²} ;
4.2.3 cout := (Ci − yp )/bp ;
4.2.4 ∂Ep /∂αij (k) := −2 ∗ ep ∗ cout ∗ gi ∗ βij (k)(x_j^p − αij (k));
4.2.5 ∂F/∂αij (k) := Σ_{p=1}^{P} ∂Ep /∂αij (k);
4.3 Normalize αij from the unit interval [0, 1] into the real set;
4.4 αij (k + 1) := αij (k) − ηα ∂F/∂αij (k); { learning of the rule-base parameter αij }
4.5 Normalize αij from the real set back into the unit interval;
4.6 Compute y := simuinfer(p, α, βl, βr, c, f 1); e := y ∗ − y; SSE := (1/2) Σp (yp∗ − yp )² ;
It can be shown that this step is completely equivalent to minimizing the error obtained from using an affine model of R(\alpha) around \alpha(k):

\min_\alpha \frac{1}{2}\|M_k(\alpha)\| = \min_\alpha \frac{1}{2}\|R(\alpha(k)) + J(\alpha(k))(\alpha - \alpha(k))\|   (9.16)

If J(\alpha(k)) has full column rank, J(\alpha(k))^T J(\alpha(k)) is nonsingular, the Gauss-Newton step is a descent direction and the method can be modified with line searches (damped Gauss-Newton method) [?]. The use of Tikhonov regularization eliminates that weakness in an elegant way. The introduction of the augmented term \mu(x - c) has the following impact on the computation of the search direction:

\min_{p \in \mathbb{R}^n} \frac{1}{2}\left\| \begin{pmatrix} J(w_c) \\ \mu_t I_n \end{pmatrix} p + \begin{pmatrix} E(w_c) \\ \mu_t (w_c - c) \end{pmatrix} \right\|^2.   (9.17)

Define J_{aug} = \begin{pmatrix} J(w_c) \\ \mu_t I_n \end{pmatrix} and E_{aug} = \begin{pmatrix} E(w_c) \\ \mu_t (w_c - c) \end{pmatrix}. It is clear that J_{aug} always has full column rank (with approximate condition number \|J\|^2/\mu). Thus the Gauss-Newton method applied to the regularized problem becomes well defined and now solves a much easier subproblem. If p solves (9.17) and \alpha is the step length along the search direction p, then w_+ = w_c + \alpha p becomes the new iterate. Last, in each epoch a new decision is made whether to decrease the regularization parameter \mu or not. It is appropriate to start with \mu = 0.1 and then decrease \mu during the iterations. In this implementation \mu := 0.8\mu if the step length in the previous step was \alpha > 0.5 [Zhou and Eriksson]. The tuning of a typical adaptive fuzzy control system by the GNR learning algorithm is shown in the following:
{ Denote: y = \frac{\sum_{i=1}^{n} C_i \left(\prod_{j=1}^{m} e^{-\beta_{ij}(x_j - \alpha_{ij})^2}\right)}{\sum_{i=1}^{n} \prod_{j=1}^{m} e^{-\beta_{ij}(x_j - \alpha_{ij})^2}}, where y = a/b, a = \sum_{i=1}^{n} C_i g_i, b = \sum_{i=1}^{n} g_i and g_i = \prod_{j=1}^{m} e^{-\beta_{ij}(x_j - \alpha_{ij})^2}. }

1. Give initial values to the training parameters {p, t, \alpha, \beta_l, \beta_r, C, \eta_\alpha, \eta_\beta, e_g};
3. k := 1; y := y_{init};
4.1 If the sum-squared error is less than the goal 0.02 then stop the loop;
4.2 (b_1, \cdots, b_P) := \sum_{i=1}^{n} \prod_{j=1}^{m} e^{-\beta_{ij}(k)(x_j^p - \alpha_{ij}(k))^2};
4.3 (g_i^1, \cdots, g_i^P) := \prod_{j=1}^{m} e^{-\beta_{ij}(k)(x_j^p - \alpha_{ij}(k))^2};
4.4 Compute the Jacobian matrix J(\alpha_{ij}) := \left[\frac{\partial e_p}{\partial \alpha_{ij}}\right];
4.4.1 for p = 1 to P do
4.4.1.1 J(\alpha_{ij})_p := -2\,\frac{C_i - y_p}{b_p}\, g_i^p\, \beta_{ij}(k)(x_j^p - \alpha_{ij}(k));
4.4.2 end for
4.4.3 J_{aug\alpha} := \begin{pmatrix} J(\alpha) \\ \mu_\alpha I_n \end{pmatrix}
4.4.4 E_{aug\alpha} := \begin{pmatrix} E(\alpha) \\ \mu_\alpha (\alpha - c) \end{pmatrix}
4.4.5 P_\alpha := -J_{aug\alpha} \backslash E_{aug\alpha};
4.4.6 Normalize \alpha_{ij} from the unit interval [0, 1] into the real set;
4.5 \alpha_{ij}(k+1) := \alpha_{ij}(k) + \gamma P_\alpha; { learning fuzzy rule-base parameter \alpha_{ij} }
4.8 Normalize \alpha_{ij} from the real set back into the unit interval;
4.9 Calculate y; e := y^* - y; SSE := \frac{1}{2}\sum_p (y_p^* - y_p)^2;
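The heart of step 4.4.5, solving the augmented least-squares problem (9.17) for the search direction, can be sketched for a generic Jacobian and residual vector. This is a sketch rather than the full training loop; `gnr_step` is an illustrative name, and the augmentation mirrors J_aug and E_aug above.

```python
import numpy as np

def gnr_step(J, E, w, c, mu):
    """Regularized Gauss-Newton direction: argmin_p 0.5*||J_aug p + E_aug||^2.

    J: (p, n) Jacobian at the current point w; E: (p,) residuals;
    c: (n,) regularization centre; mu: regularization parameter (mu_t above).
    """
    n = w.size
    J_aug = np.vstack([J, mu * np.eye(n)])               # (J; mu*I_n)
    E_aug = np.concatenate([E, mu * (w - c)])            # (E; mu*(w - c))
    p, *_ = np.linalg.lstsq(J_aug, -E_aug, rcond=None)   # well-defined for mu > 0
    return p
```

Because the lower block \mu I_n has full column rank, the subproblem stays well posed even when J itself is rank-deficient.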
current search point. The size of this region is governed by the value of the parameter \mu. The approach seeks to minimize the error function while at the same time keeping the step size small, so as to ensure that the linear approximation remains valid. The method is an approximation to the Gauss-Newton method; the LM modification of the Gauss-Newton step is defined as follows:
1. Give initial values to the training parameters {p, t, \alpha, \beta_l, \beta_r, C, \eta_\alpha, \eta_\beta, e_g};
3. k := 1; y := y_{init}; \mu := \mu \cdot \mu_{init};
4.1 If the sum-squared error is less than the goal 0.02 then stop the loop;
4.2 (b_1, \cdots, b_P) := \sum_{i=1}^{n} \prod_{j=1}^{m} e^{-\beta_{ij}(k)(x_j^p - \alpha_{ij}(k))^2};
4.3 (g_i^1, \cdots, g_i^P) := \prod_{j=1}^{m} e^{-\beta_{ij}(k)(x_j^p - \alpha_{ij}(k))^2};
4.4 repeat
4.4.1 Compute the Jacobian matrix J(\alpha_{ij}) := \left[\frac{\partial e_p}{\partial \alpha_{ij}}\right];
4.4.1.1 J(\alpha_{ij})_p := -2\,\frac{C_i - y_p}{b_p}\, g_i^p\, \beta_{ij}(k)(x_j^p - \alpha_{ij}(k));
4.4.2 J_\alpha := [J(\alpha_{ij})^T J(\alpha_{ij}) + \mu_\alpha I]^{-1} J(\alpha_{ij})^T R(\alpha_{ij}(k));
4.4.3 Normalize \alpha_{ij} from the unit interval [0, 1] into the real set;
4.4.4 new\alpha_{ij} := \alpha_{ij}(k) - J_\alpha; { learning rule-base parameter \alpha_{ij} }
4.4.5 Normalize new\alpha_{ij} from the real set back into the unit interval;
4.4.6 Compute new_y := simuinfer(p, new\alpha, \beta_l, \beta_r, c, f1); new_e; new_{SSE};
4.4.7 If new_{SSE} < SSE then terminate the inner loop;
4.4.8 \mu := \mu \cdot \mu_{inc};
4.5 until (the error is reduced);
4.6 If \mu > \mu_{max} then k := k - 1, break;
4.7 \mu := \mu \cdot \mu_{dec};
4.8 \alpha_{ij}(k+1) := new\alpha_{ij}; { updating parameter \alpha_{ij} }
4.9 y := new_y; e := new_e; SSE := new_{SSE};
4.10 k := k + 1;
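The inner loop above (steps 4.4–4.7) can be sketched generically for an arbitrary residual function. The names `lm_step`/`lm_fit` and the cut-off values are illustrative assumptions; the essential pieces are the step J_\alpha = (J^T J + \mu I)^{-1} J^T R from 4.4.2 and the \mu_{inc}/\mu_{dec} adaptation.

```python
import numpy as np

def lm_step(J, r, mu):
    """One LM step: solve (J^T J + mu I) d = J^T r, as in step 4.4.2."""
    return np.linalg.solve(J.T @ J + mu * np.eye(J.shape[1]), J.T @ r)

def lm_fit(residual, jacobian, w, mu=0.1, mu_inc=10.0, mu_dec=0.8, iters=50):
    """Minimise 0.5*||residual(w)||^2 with the mu-adaptation loop above."""
    sse = 0.5 * np.sum(residual(w) ** 2)
    for _ in range(iters):
        while True:                  # inner repeat ... until the error is reduced
            d = lm_step(jacobian(w), residual(w), mu)
            w_new = w - d
            sse_new = 0.5 * np.sum(residual(w_new) ** 2)
            if sse_new < sse:        # step accepted: move towards Gauss-Newton
                w, sse, mu = w_new, sse_new, mu * mu_dec
                break
            mu *= mu_inc             # step rejected: move towards gradient descent
            if mu > 1e12:            # no further improvement possible
                return w
    return w
```

Small \mu makes the step close to a Gauss-Newton step; large \mu makes it a short gradient-descent step, which is the trust-region behaviour discussed above.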
Software Developments
For decades the CI community has been developing applications with a variety of methods and tools. Applications and application-engineering styles presented at conferences have always been considered freeware in the community. The Fuzzy Boom can at least partly be explained by this generosity provided by R&D groups all around the world.

As a consequence of the increasing interest in applications, the need for sophistication in tools and supporting software is also growing rapidly. Software engineers meet these demands either by incorporating more and more functionalities into their own software packages, thus aiming at providing complete solutions from problem solving to installation, or else by aiming at open architectures and configurable toolboxes in order to reduce engineering effort and support ease of integration into existing systems.

In recent years, several commercial and public-domain software packages have become available to the CI community. Commercial and freeware tools are used in different configurations in a wide range of applications. The pressure on software developers is obvious. The "complete solutions" approach requires a continuous incorporation of standard functionalities found elsewhere. The "open architectures" approach needs to continuously follow the CI toolkit market in order to maintain and improve integration capabilities.

This creates an obvious need for development groups to communicate and exchange ideas and development styles. Furthermore, there are certainly opportunities also to exchange source code and software libraries, thereby creating symbiosis between development groups.
• (ii) platform independence (i.e. managing shifts, e.g. between PC and Unix)
[Figure: from raw monitoring information on the process, an "interview" produces an informal representation; a previewer/rule editor and commercial tools then generate code (or parameter lists) for the fuzzy rule base used by the controller, with sensors and actuators linking the controller to the process.]
the controller software modules need to integrate with different controller hardware
architectures, and interlink to drivers and servers running under a wide range of
operating systems.
[Figure: a processor (PLC boards) with inputs and outputs.]
Further, the software modules need to integrate with different controller software
environments, and interlink to monitoring and simulation modules running under a
wide range of operating systems.
[Figure: a processor (PLC boards) connected over RS 232, with sensors and actuators interfacing the process.]
For design purposes the hardware systems may, of course, in a design stage be
replaced by simulation environments, typically with simulation software interlinked
with monitoring and control software.
[Figure: a control processor attached to the process, with a log file for manual measurements and control actions.]
If the rule base is only presented as a list of text it is very difficult to handle all but the smallest rule bases; in particular, it is difficult to detect errors in the rule base. Various visualization techniques have been tried to clarify the rule base graphically.
A software controller with its corresponding rule base should be directly interlinked
with software modules providing analysis, monitoring, simulation and editing facil-
ities.
[Figure: a monitoring and control architecture combining monitoring & visualization, control editing with fuzzy logic inference, and reinforcement, supervised and unsupervised learning around the control & monitoring core.]
such as COLD, HIGH and SMALL, depending on the type of variable and the parameters of the membership function. The rule-base parameters can be changed by editing or dragging the shapes of the functions.

For a clustered rule base adjustments are required, especially for data outside the range of the data set provided through the recording phase. Giving linguistic names to cluster projections makes the rule base more understandable for process engineers.

Execution of the rule base using DDE is easy for the user - no programming or even integration of code is needed. It works with I/O driver software, process monitoring applications and simulation software providing DDE services. AboaFuzz@Control also supports DDE links to multiple servers if needed. AboaFuzz@Control includes the possibility to watch measurements while running. More sophisticated data analysis and statistics can be applied using existing commercial tools, which is possible due to DDE.
Chapter 11
EXERCISES
where

SMALL(t) = \begin{cases} 1 - \frac{t}{4}, & \text{if } 0 \le t \le 4 \\ 0, & \text{otherwise} \end{cases}

MEDIUM(t) = \begin{cases} 1 - \frac{|t-2|}{2}, & \text{if } 0 \le t \le 4 \\ 0, & \text{otherwise} \end{cases}

BIG(t) = \begin{cases} 1 - \frac{4-t}{4}, & \text{if } 0 \le t \le 4 \\ 0, & \text{otherwise} \end{cases}
where

SMALL(t) = \begin{cases} 1 - \frac{t}{4}, & \text{if } 0 \le t \le 4 \\ 0, & \text{otherwise} \end{cases}

MEDIUM(t) = \begin{cases} 1 - \frac{|t-2|}{2}, & \text{if } 0 \le t \le 4 \\ 0, & \text{otherwise} \end{cases}

BIG(t) = \begin{cases} 1 - \frac{4-t}{4}, & \text{if } 0 \le t \le 4 \\ 0, & \text{otherwise} \end{cases}
use the Mamdani method of inference together with a defuzzification method of your choice to compute the output for the inputs x = 2, y = 2 and z = 4.
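The computation can be checked numerically. The rule base of the exercise is not reproduced in this excerpt, so the two rules in the sketch below are purely illustrative assumptions; the membership functions are the ones defined above, and max-min Mamdani inference with centroid defuzzification is used.

```python
import numpy as np

def small(t):  return 1 - t / 4 if 0 <= t <= 4 else 0.0
def medium(t): return 1 - abs(t - 2) / 2 if 0 <= t <= 4 else 0.0
def big(t):    return 1 - (4 - t) / 4 if 0 <= t <= 4 else 0.0

def mamdani(rules, inputs, universe):
    """Mamdani max-min inference with centroid defuzzification.

    rules: list of (antecedent_mfs, consequent_mf); each rule fires with the
    minimum of its antecedent memberships and clips its consequent.
    """
    agg = np.zeros_like(universe)
    for antecedents, consequent in rules:
        w = min(mf(v) for mf, v in zip(antecedents, inputs))
        agg = np.maximum(agg, np.minimum(w, [consequent(u) for u in universe]))
    return float((universe * agg).sum() / agg.sum())    # centroid

# Hypothetical rules (NOT the exercise's rule base):
#   IF x SMALL AND y MEDIUM AND z BIG THEN out BIG
#   IF x MEDIUM AND y BIG AND z SMALL THEN out SMALL
rules = [([small, medium, big], big), ([medium, big, small], small)]
out = mamdani(rules, (2, 2, 4), np.linspace(0, 4, 401))
```

With these example rules only the first rule fires (with strength 0.5), so the aggregated set is the BIG consequent clipped at 0.5.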
Part IV
PROBABILISTIC COMPUTING
Chapter 12
Introduction
A substantial number of our everyday decisions are made under conditions of uncertainty. We are often faced with situations where conflicting evidence forces us to explicitly value observed information in order to make rational decisions. Different methods, such as logical, probabilistic and numerical approaches, have been developed to handle uncertainty in reasoning systems. Bayesian networks (BNs), also known as causal probabilistic networks, belief networks etc., are one appealing formalism for describing uncertainty within domains. In recent years, the interest in BNs has increased due to new efficient algorithms and user-friendly software, making BNs available to others than the research community responsible for them.

BNs have been successfully applied in applications such as medical diagnosis, image processing and software debugging, and new applications are reported frequently.
12.2.1 Definitions
The probability of an event a is defined as

P(a) = \frac{g(a)}{N},   (12.2.1)

where g(a) is the number of positive outcomes for a and N is the total number of outcomes. The number of positive outcomes for a is bounded by

0 \le g(a) \le N,   (12.2.2)

so that

0 \le P(a) \le 1.   (12.2.3)

If A is a random variable with the discrete and exhaustive states \{a_1, \ldots, a_j\} then

\sum_{i=1}^{j} P(A = a_i) = 1.   (12.2.4)
P(A|B) = \frac{P(A, B)}{P(B)}   (12.2.5)

where P(A, B) is the joint probability of A and B, i.e. "A and B". Rewriting equation (12.2.5) gives an expression for the joint probability of dependent variables:
A is independent of B if and only if P(A|B) = P(A). Using this fact in equation (12.2.6) gives an expression for independent variables:
Since the phrase “A and B” is equal to “B and A”, equation (12.2.5) can be used
to formulate:
P(A|B) = \frac{P(B|A) P(A)}{P(B)}   (12.2.10)
which is known as Bayes' rule or Bayes' theorem. This equation makes it possible to calculate the posterior probability, P(A|B), for an event when the opposite conditional, P(B|A), is known (together with the unconditional probabilities of the individual events).
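Both relations are directly computable one-liners; the helper names below are illustrative.

```python
def conditional(p_ab, p_b):
    """P(A|B) = P(A, B) / P(B), the fundamental rule (12.2.5)."""
    return p_ab / p_b

def bayes(p_b_given_a, p_a, p_b):
    """P(A|B) = P(B|A) P(A) / P(B), Bayes' rule (12.2.10)."""
    return p_b_given_a * p_a / p_b
```

For example, with hypothetical numbers P(a) = 0.01, P(b|a) = 0.9 and P(b) = 0.108, the posterior is bayes(0.9, 0.01, 0.108) = 0.009/0.108 ≈ 0.083.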
Bayesian Networks
[Figure 13.1: the timer T with directed links to Holmes' car H and Watson's car W.]
three random variables: Holmes' car (H), Watson's car (W) and the timer (T). All these variables are discrete with two mutually exclusive states each. The cars can either start or not, and the timer is either working or not.

The causal relationships between these random variables can be expressed by connecting them pairwise with directed links. Here, the timer affects (through the power supply) the functionality of the cars, so a link can be drawn from the timer to each of the two cars, see figure 13.1. The direction of the links is of great importance: the state of the timer (working or not working) affects the cars' ability to start, not the other way around.

In order to complete the network model, a conditional probability table (CPT) must be specified for each node. Since the variable T is a root node in the network, i.e. node T has no incoming links, only the unconditional probability of each state needs to be specified. The probability that the timer will fail, P(¬t), is 20/365 ≈ 0.05, and the probability that the timer will work, P(t), is therefore 1 − P(¬t) ≈ 0.95. The probability tables for Holmes' and Watson's cars must be specified in the context of the timer. For example, from the text it can be read that the chance that Holmes' car will fail when the timer is working, P(¬h|t), is 2/10 = 0.2. The complete CPTs for Holmes' and Watson's cars are given in table 13.1.

With the above example in mind it is now appropriate to give a formal definition of Bayesian networks:
With the above example in mind it is now appropriate to give a formal definition
of Bayesian networks:
[Figure 13.2: the three basic connection types: a) linear A → B → C; b) diverging A ← P → B; c) converging A → C ← B.]
13.2.1 Linear
A linear connection is shown in figure 13.2a. If B is unknown, the probability of B is determined from the status of A. Since C is determined from B, C therefore becomes dependent on A.
If B is known to be in state b_i (and therefore unaffected by A), the probability of C can be calculated directly from its probability table, P(C|b_i), and C is therefore conditionally independent of A.
13.2.2 Diverging
This case was previously illustrated by the car trouble example. Generally, nodes
from a common parent P are dependent unless there is evidence in P . Evidence in
P blocks the path from A to B, see figure 13.2b.
13.2.3 Converging
The last case to consider is when two or more variables cause the same effect (figure 13.2c). If nothing is known about C, the parents A and B are independent. For example, it is fairly safe to state that cold and allergy-attack, which both can cause a person to sneeze, are independent, see figure 13.3 (this example is borrowed from Henrion [Henrion et al 91]). However, observing a person sneezing makes the two events dependent. If a person sneezes when an allergen (for example, from a cat) is present, the support for cold is reduced in favour of the allergy-attack theory. This phenomenon is also known as "explaining away": the allergy-attack explains away the cold. Here, the evidence was observed directly in the converging node S. This is not necessarily the case. In fact, any observed descendant node of S can act as a transmitter between the parents and make them conditionally dependent.
The conclusion is that a converging node, C, blocks the path between its parents unless there is evidence in C or any of its descendants.
[Figure 13.3: Cat points to Allergy-attack (A); Cold (C) and Allergy-attack both point to Sneeze (S).]
Figure 13.3: If a person is sneezing when an allergen is present, the support for cold is reduced.
13.2.4 d-separation
The three cases above can be used to formulate a separability criterion, called
direction-dependent separation or d-separation. The d-separation is a very important property that can be used to find efficient inference algorithms. A formal proof can be found in Pearl [Pearl 88].
Definition 13.2.1 Any two nodes, A and B, in a Bayesian network are d-separated, and therefore conditionally independent, if every path between A and B is blocked by an intermediate node, V ∉ {A, B}. V blocks the path if and only if one of the following holds:
(i) the structure is linear or diverging and V is known;
(ii) the structure is converging and neither V nor its descendants are known.
The d-separability criterion will later be used frequently in the presentation of in-
ference methods.
(b) A ⊥ B | P
(c) A ⊥ B (however, A ⊥ B | C is false)
probability values needed can be too many to manage, and with larger sets of parents the CPT quickly becomes unwieldy. For example, a discrete boolean variable with only four boolean parents needs as many as 32 values to complete the CPT. Even with a large database there is a potential risk that some of the combinations are too unusual to provide a reliable estimate. One commonly used model to avoid this problem is the noisy-Or gate derived by Judea Pearl in [Pearl 88].
The model is based on disjunctive interaction, that is, when the likelihood of a particular condition is unchanged when other conditions occur at the same time. For example, if cold, pneumonia and chicken-pox are likely to cause fever, then disjunctive interaction applies when a person suffering from several of these diseases at the same time would only be more likely to develop fever. Furthermore, if the person is also suffering from a disease that is unlikely to cause fever, this additional evidence does not reduce the support for fever caused by the other diseases. There exists a well-founded theory for a disjunctive model if the following assumptions are made:
(i) Boolean variables: All included variables must have two discrete states, namely true and false.
(ii) Accountability: The event is presumed false if all the conditions listed as its causes are false.
(iii) Exception independence: The processes that inhibit an event under a condition are independent.
Assumption (ii) is not as strict as it might look, as it is always possible to add an "Other causes" variable to represent what is not explicitly specified in the closed-world assumption. The model requires that only the individual probabilities for each parent are specified, i.e. P(E| only H_i). Thus, using this technique requires only
[Figure: Cold (Co), Pneumonia (Pn) and Chicken-pox (Cp) connected through a noisy-Or gate to Fever (Fe).]
Suppose P(fe|co) = 0.4, P(fe|pn) = 0.8 and P(fe|cp) = 0.9. For example, the probability of fever when both Cold and Pneumonia are present is P(fe|co, pn) = 1 − (1 − 0.4)(1 − 0.8) = 0.88.
Table 13.3: Probability values for fever calculated with the noisy-Or model.

Co Pn Cp | P(¬fe) P(fe)
f  f  f  | 1      0
f  f  t  | 0.1    0.9
f  t  f  | 0.2    0.8
f  t  t  | 0.02   0.98
t  f  f  | 0.6    0.4
t  f  t  | 0.06   0.94
t  t  f  | 0.12   0.88
t  t  t  | 0.012  0.988
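The table entries follow from multiplying the independent inhibitor probabilities (1 − p_i) of the causes that are present; a small sketch (the function name is illustrative):

```python
def noisy_or(p_only, present):
    """P(effect) under the noisy-Or model.

    p_only[i] = P(E | only H_i); present[i] says whether cause i holds.
    With independent inhibitors, P(not E) is the product of (1 - p_i)
    over the causes that are present.
    """
    q = 1.0
    for p, on in zip(p_only, present):
        if on:
            q *= 1.0 - p
    return 1.0 - q

fever = [0.4, 0.8, 0.9]   # P(fe | only co), P(fe | only pn), P(fe | only cp)
```

For instance, `noisy_or(fever, (True, True, False))` reproduces the 0.88 entry of table 13.3.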
Chapter 14
Inference in Bayesian Networks
Once a network is constructed it can be used to answer queries about the domain.
The basic task in a Bayesian network is to compute the probability of a variable
under evidence. Since there are no special input or output nodes, any variable can
be computed or observed as evidence. There are basically three different types of
inference that occur in a Bayesian network:
• Causal inference
• Diagnostic inference
• Intercausal inference
Causal inference is when the reasoning follows the same direction as the links in the
network. Diagnostic inference, on the other hand, is when the line of reasoning is
the opposite of the causal dependencies. The basic strategy to handle diagnostic
reasoning is to apply Bayes’ rule in equation (12.2.10). Intercausal inference means
reasoning between causes of a common effect. This is the same condition as in section
13.2.3. The presence of one cause makes the others less likely. Finally, combinations
of these inferences can appear in Bayesian networks. This is sometimes called mixed
inference.
Proof 14.1.1 Induction on the nodes in the network. Suppose that every causal Bayesian network has a joint distribution as in equation 14.1.1. For a network with only one node the hypothesis is obviously true. Suppose that the hypothesis is true for a network of the variables B_{n-1} = \{X_1, \ldots, X_{n-1}\}, that is, P(B_{n-1}) = \prod_{i=1}^{n-1} P(X_i|P_{X_i}). Let B_n = B_{n-1} \cup \{X_n\}, where X_n is a leaf in B_n. (Since B_n is a DAG, there is at least one leaf in B_n.) By using the fundamental rule in equation (12.2.5) the formula can be written

P(B_n) = P(B_{n-1}, X_n) = P(X_n|B_{n-1}) P(B_{n-1}).

Since X_n is independent of B_{n-1} \setminus P_{X_n}, the factor P(X_n|B_{n-1}) can be reduced to P(X_n|P_{X_n}), which gives P(B_n) = \prod_{i=1}^{n} P(X_i|P_{X_i}).

The fact that a Bayesian network can be represented by a joint distribution on the form above results in two important properties.
Here, consistent means that the probability values do not conflict with each other. It is rather easy, intentionally or not, to construct an inconsistent system. Let for example P(a) = 0.7, P(b) = 0.2 and P(b|a) = 0.65. These values might seem alright at a first glance, but they are in fact inconsistent. By using Bayes' rule the term P(a|b) can be written as P(b|a)P(a)/P(b) = 0.65 · 0.7/0.2 ≈ 2.28, which is > 1. The consistency property ensures that a Bayesian network does not violate the axioms of probability.
The joint distribution can be used to answer any query of the network. To
illustrate this, let's again return to the previous car trouble example. Here, the
joint distribution is P (H, W, T ) = P (H|T )P (W |T )P (T ). The computation of, say,
P (h, w, t) can be done by simply multiplying the corresponding values in the CPT:
Table 14.1: Atomic events for the car trouble example calculated with the joint
distribution.
H W T P(H, W, T)
y y y 0.684
y y n 0.0075
y n y 0.076
y n n 0.0075
n y y 0.171
n y n 0.0175
n n y 0.019
n n n 0.0175
The terms P(h, ¬w) and P(¬w) can be computed by using equation (12.2.9) to sum over all matching terms in table 14.1:

P(h|¬w) = \frac{0.076 + 0.0075}{0.076 + 0.0075 + 0.019 + 0.0175} ≈ 0.696.

If both cars fail to start, the probability of a broken timer is:

P(¬t|¬h, ¬w) = \frac{P(¬t, ¬h, ¬w)}{P(¬w, ¬h)} = \frac{0.0175}{0.019 + 0.0175} ≈ 0.48   (14.1.2)
Summing over the joint distribution is an easy method to use when answering queries in Bayesian networks. Unfortunately, due to the exponential growth of atomic events, this method cannot be used in practice in this simple way. The number of cases to consider is equal to the product of the number of states of each variable. If every variable is binary, the number of atomic events in a network with n nodes is as large as 2^n. With larger networks, this method simply becomes intractable. For example, the medical diagnosis system MUNIN (Olesen et al. in [Olesen et al 89]) consists of more than 1000 variables with up to seven states each. Even though the MUNIN system is one of the most elaborate applications found in the literature, real-world modeling often requires far more nodes than is manageable within the reach of this simple method. Even worse, Cooper proved in [Cooper 87] that inference in Bayesian networks is NP-hard irrespective of the method used: there are no efficient methods covering arbitrary networks. Despite this disheartening result there are ways to tackle the problem. The proof in [Cooper 87] applies to arbitrary networks, and there are certain families of network topologies for which more efficient algorithms exist. Also, many applications generate sparse graphs, in which the prospects of finding a more local computation scheme are good. Three such exact inference methods are described in sections 14.3.1, 14.3.2 and 14.4.
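For a network this small, the summing method is nevertheless easy to implement directly. The sketch below uses the CPT entries implied by table 14.1 (P(h|t) = 0.8, P(h|¬t) = 0.3, P(w|t) = 0.9, P(w|¬t) = 0.5); the function names are illustrative.

```python
from itertools import product

# CPTs: T = timer works, H/W = Holmes'/Watson's car starts.
P_T = {True: 0.95, False: 0.05}
P_H = {True: 0.8, False: 0.3}    # P(H starts | T)
P_W = {True: 0.9, False: 0.5}    # P(W starts | T)

def joint(h, w, t):
    """P(H, W, T) = P(H|T) P(W|T) P(T)."""
    ph = P_H[t] if h else 1 - P_H[t]
    pw = P_W[t] if w else 1 - P_W[t]
    return ph * pw * P_T[t]

def query(target, evidence):
    """P(target | evidence) by summing matching atomic events."""
    num = den = 0.0
    for h, w, t in product([True, False], repeat=3):
        world = {'H': h, 'W': w, 'T': t}
        if all(world[k] == v for k, v in evidence.items()):
            p = joint(h, w, t)
            den += p
            if all(world[k] == v for k, v in target.items()):
                num += p
    return num / den
```

`query({'T': False}, {'H': False, 'W': False})` reproduces the ≈ 0.48 of equation (14.1.2), but the loop visits all 2^n atomic events, which is exactly the exponential blow-up discussed above.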
[Figure: example DAGs: a) a singly connected network; b) a multiply connected network.]
[Figure 14.2: the evidence around a node X decomposed into E_X^+, reaching X from above through its parents U_1, ..., U_m (with evidence sets E_{U_k X}), and E_X^-, reaching X from below through its children Y_1, ..., Y_n and their other parents Z_{ij} (with evidence sets E_{X Y_i}).]
which makes \alpha:

\alpha = \frac{1}{\sum_X P(E_X^-|X, E_X^+)}.   (14.2.5)

That is, \alpha can be treated as a normalizing constant which scales the sum of the individual probabilities to 1. There is no need to find an explicit formula for P(E_X^+|E_X^-).

Next, P(X|E_X^+) can be computed by considering all possible configurations of the parents of X. Let P_X = \{U_1, \ldots, U_m\}; we get

P(X|E_X^+) = \sum_{P_X} P(X|P_X, E_X^+)\, P(P_X|E_X^+)   (14.2.6)

The term P(X|P_X, E_X^+) can be simplified into P(X|P_X) since the parents P_X d-separate E_X^+ from X. Furthermore, since X is a converging node, it blocks the undirected path between the parents, which ensures that they are conditionally independent. Using the equation of independent events (12.2.7) the equation can be written

P(X|E_X^+) = \sum_{P_X} P(X|P_X) \prod_k P(U_k|E_X^+).   (14.2.7)

The evidence, E_X^+, can be further decomposed into \{E_{U_1 X}, \ldots, E_{U_m X}\}, see figure 14.2. Since the parents are independent, each U_k is independent of all other parents and their evidence sets. This observation gives

P(X|E_X^+) = \sum_{P_X} P(X|P_X) \prod_k P(U_k|E_{U_k X}).   (14.2.8)
A closer look at the term P(U_k|E_{U_k X}) reveals that it is in fact a recursive instance of the original problem, excluding the node X. The term P(X|P_X) can be found in the conditional probability table for X.
Now, returning to equation (14.2.4), the last term to consider is P(E_X^-|X). Let Y = \{Y_1, \ldots, Y_n\} be the set of children of X. Decomposing the evidence into \{E_{XY_1}, \ldots, E_{XY_n}\} yields
The last term, P(Y_i, Z_i|X), can be written P(Y_i|X, Z_i)\, P(Z_i|X), and since X is d-separated from Z_i the formula becomes:

P(E_X^-|X) = \prod_i \sum_{Y_i} \sum_{Z_i} \frac{P(Z_i|E_{XY_i}^+)\, P(E_{XY_i}^+)}{P(Z_i)}\, P(E_{Y_i}^-|Y_i)\, P(Y_i|X, Z_i)\, P(Z_i)   (14.2.15)
The term P(E_{XY_i}^+) is unconditioned and therefore independent of the states of X. Replacing P(E_{XY_i}^+) with a constant \beta_i and cancelling the terms P(Z_i) gives:

P(E_X^-|X) = \prod_i \beta_i \sum_{Y_i} \sum_{Z_i} P(Z_i|E_{XY_i}^+)\, P(E_{Y_i}^-|Y_i)\, P(Y_i|X, Z_i)   (14.2.16)

Now, inspecting each term reveals that P(E_{Y_i}^-|Y_i) is a recursive instance of P(E_X^-|X) excluding X, P(Y_i|X, Z_i) is a lookup in the conditional probability table (CPT) for Y_i, and P(Z_{ij}|E_{Z_{ij} Y_i}^+) is a recursive instance of the original problem, excluding Y_i. Notice that there is no need to find an explicit value of \beta since it can be combined with \alpha in equation 14.2.18 to form a new normalizing constant \xi. To summarize, the belief for a variable X under evidence, E, in a singly connected network can be evaluated as:

P(X|E) = \xi\, \pi(X)\, \lambda(X)   (14.2.18)

where \xi is a normalizing constant and

\pi(X) = \sum_{U} P(X|U) \prod_k P(U_k|E_{U_k X})   (14.2.19)

\lambda(X) = \prod_i \sum_{Y_i} \sum_{Z_i} P(E_{Y_i}^-|Y_i)\, P(Y_i|X, Z_i) \prod_j P(Z_{ij}|E_{Z_{ij} Y_i})   (14.2.20)
Different versions of how to turn the above equations into a general algorithm can be found in the literature. Pearl constructs in [Pearl 88] an object-oriented message-passing scheme where the flow of belief is updated by messages sent between adjacent nodes. A recursive formulation is derived by Russell and Norvig in [Russell and Norvig 95].
Identifying the symbols reveals that Y_1 = H, Y_2 = W, E_H^- = \{¬h\} and E_W^- = \{¬w\}. Since P(¬h|h) = P(¬w|w) = 0 and P(¬h|¬h) = P(¬w|¬w) = 1, the formula can be reduced to
That is, P(¬t|¬h, ¬w) ≈ 0.48, which is the same result as in equation (14.1.2).
14.3.1 Conditioning
The basic idea in the conditioning approach is to divide the multiply connected network into several smaller singly connected networks conditioned on a set of instantiated variables. Figure 14.4 shows the two networks created when the boolean variable Me in figure 14.3 is instantiated. Generally, the number of resulting subtrees equals the product of the numbers of states of the variables in the cutset, i.e. it grows exponentially with the size of the cutset. The cutset is the set of conditioned variables. The problem in this approach is to find
Br Sc | P(co|Br, Sc)
t  t  | 0.8
t  f  | 0.8
f  t  | 0.8
f  f  | 0.05
[Figure: in each of the two conditioned networks, Sc and Br point to Co, with Me fixed to one of its two states.]
Figure 14.4: Conditioning the node M e creates two singly connected networks.
the minimal cutset that divides the original network into singly connected subnets. Once created, the probability of a variable can be calculated as the weighted sum over each individual polytree.
A nice side-effect of this technique is that the weighted sum can be used to quickly calculate an approximate answer. Starting with the largest weight, the system can compute the probability until a desired level of accuracy is obtained. A simple way to calculate the accuracy range is to sum over all remaining weights to obtain an upper bound. The lower bound is, of course, the probability calculated so far, since probability values are always positive.
14.3.2 Clustering
Clustering takes the opposite approach to conditioning. Instead of dividing the network into smaller parts, clustering algorithms combine nodes into larger clusters. The variables Br and Sc in the coma example could be collapsed into a compound variable Z = {Br, Sc}. The states of the cluster node become the set of combinations of all included variables. Here, the states of Z are {(br, sc), (br, ¬sc), (¬br, sc), (¬br, ¬sc)}. The clustering transforms the network into a polytree where the inference can be performed as usual. The disadvantage is, of course, that if the network is dense, the compound variables can become intractably large, since the number of states is exponential in the number of collapsed variables. Despite this fact, clustering techniques are by many considered the best exact algorithms for most types of non-singly connected networks.

One particularly interesting, and currently the standard, algorithm for clustering networks was originally developed by Lauritzen and Spiegelhalter in [Lauritzen and Spiegelhalter 88]. The method was later improved with a general absorption scheme by Jensen et al. in [Jensen et al 90]. This technique uses properties of the clusters in order to efficiently propagate the flow of belief.
The key issue is the concept of consistent universes. Consider two clusters, V = \{A, B, C\} and W = \{C, D, E\}, in figure 14.5. Now, the probability of the common variable C can be calculated by summing over all elements except C in both V and W. Thus,

P(C) = \sum_{A,B} P(A, B, C) = \sum_{D,E} P(C, D, E)   (14.3.22)

[Figure 14.5: the clusters {A, B, C} and {C, D, E} linked through the separator C.]

If evidence changes the information in V, then the above condition can be used to update the probability for W in the following way. Initially, let the distributions for V and W be P^0(A, B, C) and P^0(C, D, E), respectively. Now, suppose that evidence in V changes the distribution to P^1(A, B, C). With this new information, the probability of the common variable C can be marginalized out of P^1(A, B, C) as in equation 14.3.22,

P^1(C) = \sum_{A,B} P^1(A, B, C).   (14.3.23)

Using the fundamental equation 12.2.5, the term P^1(C, D, E) can be written

P^1(C, D, E) = P^0(D, E|C)\, P^1(C),

where the term P^0(D, E|C) can be calculated from the initial distribution:

P^1(C, D, E) = \frac{P^0(C, D, E)}{P^0(C)}\, P^1(C)   (14.3.26)
The scheme above is called absorption; the cluster W has absorbed from V. In general terms the absorption process can be described as follows:

Definition 14.3.1 Let V and W be clusters of variables and let S be the set of their common variables, that is, S = V ∩ W. Let \psi_V^0, \psi_W^0 and \psi_S^0 be the belief tables associated with each cluster. The absorption procedure is defined by the following steps:
The belief tables

A  B  | \psi_V^0(A, B)      B  C  | \psi_W^0(B, C)
a1 b1 | 0.05                b1 c1 | 0.1
a1 b2 | 0.6                 b1 c2 | 0.3
a2 b1 | 0.3                 b2 c1 | 0.4
a2 b2 | 0.05                b2 c2 | 0.2

for W can be calibrated to V by letting W absorb from V. The new belief table for the separator S = {B} is

\psi_S^1 = \sum_A \psi_V^0 = (0.05 + 0.3,\; 0.6 + 0.05) = (0.35,\; 0.65).

Finally, \psi_W^0 can be updated:

\psi_W^1 = \psi_W^0 \cdot \frac{\psi_S^1}{\psi_S^0} = (0.1, 0.3, 0.4, 0.2) \cdot \frac{(0.35, 0.65)}{(0.4, 0.6)} = (0.1 \cdot 0.875,\; 0.3 \cdot 0.875,\; 0.4 \cdot 1.083,\; 0.2 \cdot 1.083) ≈ (0.0875,\; 0.2625,\; 0.4333,\; 0.2167),

where \psi_S^0 = (0.1 + 0.3,\; 0.4 + 0.2) = (0.4,\; 0.6) is the marginal of \psi_W^0 onto B; the b1 entries of \psi_W^0 are scaled by 0.875 and the b2 entries by 1.083.
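The numbers above can be reproduced in a few lines of numpy; `absorb` is an illustrative helper specialized to this two-cluster case.

```python
import numpy as np

psi_V = np.array([[0.05, 0.6],   # rows a1, a2; columns b1, b2
                  [0.3, 0.05]])
psi_W = np.array([[0.1, 0.3],    # rows b1, b2; columns c1, c2
                  [0.4, 0.2]])

def absorb(psi_V, psi_W):
    """Let W absorb from V over the separator S = {B}."""
    psi_S_new = psi_V.sum(axis=0)    # marginalize V onto B -> (0.35, 0.65)
    psi_S_old = psi_W.sum(axis=1)    # marginalize W onto B -> (0.4, 0.6)
    return psi_W * (psi_S_new / psi_S_old)[:, None]
```

After absorption both clusters marginalize to the same separator table (0.35, 0.65), i.e. the universes are consistent.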
[Figure: a CPN and its moral graph, obtained by connecting the parents of each node and dropping the directions of the links.]
(1) Let i = |V|.
(2) Let X be the i:th node (using the order above) and let P be the set of
    neighbours of X with numbers < i.
(3) Add links between any two nodes in P that are not already connected.
(4) Let i = i − 1.
(5) Continue from (2) until i = 0.
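The fill-in loop above can be sketched directly in Python. The adjacency-set representation and variable names are our own, and the node numbering is assumed to come from the ordering step mentioned in the text:

```python
# A sketch of the fill-in loop in steps (1)-(5).  The graph is an
# adjacency-set dict; `order` is the assumed node numbering.

def triangulate(adj, order):
    """Add fill-in links in place; return the set of links added."""
    index = {node: i for i, node in enumerate(order, start=1)}
    fill_in = set()
    for i in range(len(order), 0, -1):               # steps (1), (4), (5)
        x = order[i - 1]                             # step (2)
        earlier = [n for n in adj[x] if index[n] < i]
        for j in range(len(earlier)):                # step (3)
            for k in range(j + 1, len(earlier)):
                a, b = earlier[j], earlier[k]
                if b not in adj[a]:
                    adj[a].add(b)
                    adj[b].add(a)
                    fill_in.add(frozenset((a, b)))
    return fill_in

# A four-node cycle A-B-D-C-A needs one fill-in link to triangulate:
adj = {"A": {"B", "C"}, "B": {"A", "D"},
       "C": {"A", "D"}, "D": {"B", "C"}}
added = triangulate(adj, ["A", "B", "C", "D"])       # adds the link B-C
```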
The triangulation of the graph ensures that the final cluster tree will fulfill the second
property above; see figure 14.9 for an example. Notice that the triangulation of a
graph is not unique: there are several steps in the algorithm where arbitrary choices
can be made. Intuitively, the best triangulation is the one that yields minimum
fill-in. However, in this case the optimal triangulation is concerned with the size
of the final junction tree. Unfortunately, finding the optimal triangulated graph is
NP-hard. See Arnborg et al. in [Arnborg et al 87] and the further discussion by Jensen
et al. in [Jensen and Jensen 94].
Next, the junction graph can be constructed by identifying the cliques in the
moral and triangulated graph. Links between clusters are added by connecting
clusters with a non-empty intersection, see figure 14.7. The intersection between two
clusters is called a separator, denoted S. Finally, the junction tree can be found in the
(Figure: the triangulated graph and the corresponding junction graph, with cliques
ABC, BCD, CDE and DEF.)
junction graph by finding a maximum spanning tree (Jensen [Jensen and Jensen 94]),
where the weight of a link is given by the number of variables in the separator,
i.e. |S|. Figure 14.8 shows a junction tree found in a junction graph.
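The maximum-spanning-tree construction can be sketched Kruskal-style, weighting each candidate link by its separator size |S|. The helper names and union-find bookkeeping are our own:

```python
# A sketch of extracting the junction tree from the junction graph.

def junction_tree(cliques):
    """cliques: list of frozensets.  Returns links (i, j, separator)."""
    links = sorted(((len(cliques[i] & cliques[j]), i, j)
                    for i in range(len(cliques))
                    for j in range(i + 1, len(cliques))
                    if cliques[i] & cliques[j]),
                   reverse=True)                 # heaviest separator first
    parent = list(range(len(cliques)))
    def find(i):                                 # union-find root
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i
    tree = []
    for _, i, j in links:
        ri, rj = find(i), find(j)
        if ri != rj:                             # keep only acyclic links
            parent[ri] = rj
            tree.append((i, j, cliques[i] & cliques[j]))
    return tree

# The cliques of figure 14.8 give the separators BC, CD and DE:
cliques = [frozenset("ABC"), frozenset("BCD"),
           frozenset("CDE"), frozenset("DEF")]
tree = junction_tree(cliques)
```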
(Figure 14.8: a junction graph over the cliques ABC, BCD, CDE and DEF, and the
junction tree obtained from it, with separators BC, CD and DE.)
(Figure: a CPN, its moral graph and the corresponding cluster graph with clusters
AB, AC, BD, CE and DEF.)
(i) Give all nodes (clusters) and separators a table of ones, i.e. ψ^0 = 1.

(ii) For each variable A, choose one cluster C containing {A} ∪ P_A and multiply
     P(A|P_A) (the CPT) with ψ_C^0.
14.3.2.6 Evidence

Entering observed evidence in a junction tree is easy. Evidence is normally of the
form A = a_j. Semantically, this means that the probability of all other states is 0:
P(A) = (0, . . . , 1, 0, . . .), with a "1" in the j:th position. The same can be done for a
cluster of variables: let E = (0, . . . , 1, 0, . . .) be a finding on A, and multiply E with
the belief table for any cluster containing A.
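Entering a finding can be sketched as a pointwise multiplication with a 0/1 vector; the dict-based table layout is our own illustration:

```python
# A sketch of entering the finding A = a_j: multiply a 0/1 finding
# into the belief table of any cluster containing A.

def enter_evidence(psi, var_pos, observed):
    """Keep entries consistent with the finding; zero out the rest."""
    return {states: (value if states[var_pos] == observed else 0.0)
            for states, value in psi.items()}

psi = {("b1", "c1"): 0.1, ("b1", "c2"): 0.3,
       ("b2", "c1"): 0.4, ("b2", "c2"): 0.2}
psi_e = enter_evidence(psi, var_pos=1, observed="c1")
# Only the entries with C = c1 keep their value (0.1 and 0.4).
```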
(iii) Let the parent absorb from the node: call Absorb(parent, node) (unless
      the parent is the root of the tree).

TopDown(node, parent)
When the junction tree is consistent, the belief for a single variable can be computed
by marginalization:

    P(A) = Σ_{C\A} ψ_C,        (14.3.29)
(Figure: the message passes in BottomUp(R) and TopDown(R): messages are first
collected towards the root R, then distributed back out again.)
(iv) Construct initial belief tables for each node and separator in the junction tree.

(vi) Select a root node, R (any node in the junction tree can act as root).

Note that steps (i) to (iii) constitute a static task; there is no need to redo this
process unless the structure of the network is changed.
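Only fragments of the propagation pseudocode survive above, but the two passes can be sketched as follows. The adjacency-dict tree and the `absorb(a, b)` callback (cluster a absorbs from cluster b) are our own names, reconstructed from the surviving step fragments:

```python
# A sketch of the two propagation passes over the junction tree.
# `tree` maps a cluster id to its neighbours.

def bottom_up(tree, node, parent, absorb):
    for child in tree[node]:
        if child != parent:
            bottom_up(tree, child, node, absorb)
    if parent is not None:
        absorb(parent, node)            # the parent absorbs from the node

def top_down(tree, node, parent, absorb):
    for child in tree[node]:
        if child != parent:
            absorb(child, node)         # the child absorbs from the node
            top_down(tree, child, node, absorb)

def propagate(tree, root, absorb):
    """BottomUp(R) followed by TopDown(R) makes the tree consistent."""
    bottom_up(tree, root, None, absorb)
    top_down(tree, root, None, absorb)

# On a chain 0 - 1 - 2 rooted at 0, absorption runs inwards, then out:
calls = []
propagate({0: [1], 1: [0, 2], 2: [1]}, 0,
          lambda a, b: calls.append((a, b)))
```

After `propagate` the tree is globally consistent, and P(A) follows by marginalizing any cluster containing A, as in equation 14.3.29.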
(Figure: the original BN over Me, Sc, Br and Co; its moral graph; and the junction
tree with clusters MSB and SBC connected by the separator SB.)
term P(Co|Sc, Br) is multiplied with ψ_{C2}^0. The initial belief table for the separator
remains 1; the tables for C1 and C2 are shown in table 14.3.

Notice that these two clusters are not consistent. To make them consistent the
absorption process must be applied. Suppose that C1 is selected as root in the tree.
A call to BottomUp() will cause C1 to absorb from C2. In this absorption nothing is
changed, however, since the separator ψ_S^1 will remain equal to 1 when marginalized
out of ψ_{C2}^0. This is, of course, not a coincidence, since ψ_{C2}^0 at this stage is equal to
P(Co|Sc, Br). Next, the call to TopDown() will force C2 to absorb from C1. The new
belief table for the separator S is shown in table 14.4. Finally, when the junction tree
is globally consistent it is possible to calculate the probability for every individual
variable. For instance, the probability for coma can be marginalized out of ψ_{C2}^2:
    P(co) = Σ_{Sc,Br} ψ_{C2}^2
          = 0.032 + 0.224 + 0.032 + 0.032
          = 0.32.
Now, ψ_{C2}^0 is updated to ψ_{C2}^1 by multiplying ψ_S^2 / ψ_S^1 with ψ_{C2}^0. The result is shown in
the last column of table 14.3.
140 CHAPTER 14. INFERENCE IN BAYESIAN NETWORKS
    Br  Sc  ψ_S^2
    y   y   0.04
    y   n   0.04
    n   y   0.28
    n   n   0.64
P (A)P (T |A)P (E|T, L)P (X|E)P (L|S)P (B|S)P (D|E, B)P (S) (14.4.30)
To compute the unconditional probability of, say, dyspnea, it is possible to sum over
14.4. SYMBOLIC PROBABILISTIC INFERENCE (SPI) 141
which requires substantially fewer computations. The SPI approach is to find the op-
timal factoring so that the necessary calculations in the joint distribution are kept
to a minimum. This problem is closely related to the standard optimal factoring
problem, OFP, which is believed to be NP-hard. However, recently some heuris-
tic search algorithms have been developed which appear to find good factoring
solutions. Also, computations can be saved using caching techniques, to avoid recom-
puting values that are already calculated in a previous step in the summation. For
example, the term Σ_S P(L|S)P(B|S)P(S) above is unchanged during the summa-
tion over the variables A, T, E and B and can therefore be saved in a cache memory
after the first computation, and then be accessed when needed during the remaining
computations.
Since a conditional probability P(X|Y) can be rewritten as P(X, Y)/P(Y), the
algorithm is not restricted to handling conjunctive queries.
A factor is a subset of the complete probability distribution. Each factor contains
a set of variables, which affect the distribution. For example, the factor P (B|A)
includes the variables {A, B} and combining this factor with the factor P (A) yields
a conformal factor, P (B|A)P (A), with the same set of variables as P (B|A).
Now, let Q be the set of target variables; a good factoring for P(Q) can be found
in the following way:
(i) First, find the relevant nodes in the original BN. This can be done using the
d-separation property to exclude parts of the network which have no relevance
to the current query. A linear-time algorithm to find this subtree can be found
in Pearl [Geiger et al 89].
(ii) Let F be a factor set which contains all factors to consider in the next com-
putation, and let C be the set of factor candidates. Initially, let F be all
distributions from the subtree and let C be empty.
(iii) Combine the factors in F pairwise and add to the candidate set C all pairs in
which one factor contains a variable that is a parent or a child of at least one
variable in the other factor. There is no need to add pairs that are already
in C.
(iv) Let U_i be the set of variables for each combination in C. Compute vars(U_i),
the number of variables in U_i excluding any target variable: vars(U_i) = |U_i \ Q|.
For each combination in C, compute sum(U_i), the number of variables
that can be summed out when the two factors are combined. A variable can be
summed out when it appears neither in the set of target variables Q nor
in any of the other factors in F (excluding those in the current pair).
Compute the result size as vars(U_i) − sum(U_i).
(v) Select the best candidate in C in the following way: choose the element in C
with the lowest result size. If more than one element applies, choose the one
with the largest number of variables (including target variables). If there is
still more than one candidate, choose one of them arbitrarily.
(vi) Construct a new factor by combining the chosen pair into a conformal factor.
Update F by replacing the two chosen factors with the new combined one.
Update C by deleting any pair that has a non-empty (factor-level) intersection
with the chosen pair.
(vii) Continue from step (iii) until only one factor remains in F. This is the resulting
factor.
Finally, use the resulting factor above to compute an answer for the conjunctive
probability.
    Candidate        U                 sum(U)   vars(U)   result size
    (f_Me, f_Br)     Me, Br            0        2         2
    (f_Me, f_Sc)     Me, Sc            0        2         2
    (f_Me, f_Co)     Me, Br, Sc, Co    0        3         3
    (f_Br, f_Sc)     Me, Br, Sc        0        3         3
    (f_Br, f_Co)     Me, Br, Sc, Co    0        3         3
    (f_Sc, f_Co)     Me, Br, Sc, Co    0        3         3
In this presentation all variables are assumed to be binary, i.e., they all have two
states. If the number of states is not equal for all variables, it is possible to compute
the size of a factor as the product over the states for every included variable instead
of just considering the number of variables.
According to Li and D’Ambrosio in [Li and D’Ambrosio 94] the set-factoring SPI
algorithm is superior to Jensen’s algorithm for most kinds of networks.
Loop 1: The factor set F is initialized to {f_Me, f_Br, f_Sc, f_Co} and C is empty. Every
    factor in F is then pairwise combined and added to C. The result is shown
    in table 14.5.

    Here, the best combinations are candidates no. 1 and 2, since both have mini-
    mum result size and an equal number of variables. Candidate no. 1, (f_Me, f_Br),
    is chosen; it replaces its two factors in F, which is updated to {(f_Me, f_Br), f_Sc, f_Co}.
    After deleting every pair with a non-empty intersection with (f_Me, f_Br), the
    candidate set C becomes {(f_Sc, f_Co)}.
Loop 2: Adding combinations from F makes C = {((fM e , fBr ), fSc ), ((fM e , fBr ),
fCo ), (fSc ,fCo )}. The best combination is ((fM e , fBr ), fSc ), in which it was
possible to sum out the variable M e. F is updated to {((fM e , fBr ), fSc ), fCo }
and C is empty.
Loop 3: The candidate set C becomes {(((f_Me, f_Br), f_Sc), f_Co)} and the only com-
    bination to choose is therefore (((f_Me, f_Br), f_Sc), f_Co). Both the variables Br
    and Sc could be summed out. F is updated to {(((f_Me, f_Br), f_Sc), f_Co)}, which
    fulfills the termination condition in step (vii) above.
Thus, the factoring result is:

    P(Co) = Σ_{Br,Sc} P(Co|Br, Sc) Σ_{Me} P(Sc|Me) P(Br|Me) P(Me)        (14.4.32)
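The factoring (14.4.32) is just two nested sums. Here is a sketch with made-up placeholder numbers for the CPTs (the chapter's table values are not repeated here); only the summation structure matters:

```python
# Evaluating the factoring: sum over Me first, then over Br and Sc.
# All CPT numbers below are made-up placeholders.

states = (True, False)
p_me = {True: 0.2, False: 0.8}                         # P(Me)
p_sc = {(True, True): 0.8, (False, True): 0.2,         # P(Sc | Me)
        (True, False): 0.2, (False, False): 0.8}
p_br = {(True, True): 0.9, (False, True): 0.1,         # P(Br | Me)
        (True, False): 0.2, (False, False): 0.8}
p_co = {(True, True, True): 0.8,   (False, True, True): 0.2,
        (True, True, False): 0.7,  (False, True, False): 0.3,
        (True, False, True): 0.6,  (False, False, True): 0.4,
        (True, False, False): 0.05, (False, False, False): 0.95}

def p_coma(co):
    """P(Co) by the factoring: the inner sum over Me is done once."""
    inner = {(br, sc): sum(p_sc[(sc, me)] * p_br[(br, me)] * p_me[me]
                           for me in states)
             for br in states for sc in states}
    return sum(p_co[(co, br, sc)] * inner[(br, sc)]
               for br in states for sc in states)
```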
(vii) etc . . .
(viii) When a value has been sampled for all unobserved variables, restart with X1
and repeat the process until sufficiently many cases have been generated.
14.6. CONNECTION TO PROPOSITIONAL CALCULUS 145
To avoid bias from the initial configuration (which can be very unlikely) it is common
to discard the first 5-10 percent of the generated samples. This is called "burn-in".
One problem with this kind of logic sampling is that it is possible to get stuck
in certain areas: there might exist an equally likely area, but in order to reach it
a variable needs to take a highly unlikely value. Another problem is that it can be
very time-consuming to estimate very unlikely events. Finally, the task of selecting
a valid starting configuration can be very tedious; it is in fact an NP-hard problem
in the general case!
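The sampling scheme, including the burn-in discard discussed above, can be sketched on a toy two-variable network A → B. The network and its numbers are our own example; its exact answer is P(B) = 0.3 · 0.9 + 0.7 · 0.2 = 0.41, which the estimate should approach:

```python
# A sketch of Gibbs sampling with burn-in on a toy network A -> B.

import random

def gibbs(n_samples, burn_in_frac=0.1, seed=1):
    rng = random.Random(seed)
    p_a = 0.3                                  # P(A = true)
    p_b = {True: 0.9, False: 0.2}              # P(B = true | A)
    a, b = True, True                          # starting configuration
    samples = []
    for _ in range(n_samples):
        # Resample A from P(A | b), proportional to P(A) P(b | A).
        w_t = p_a * (p_b[True] if b else 1 - p_b[True])
        w_f = (1 - p_a) * (p_b[False] if b else 1 - p_b[False])
        a = rng.random() < w_t / (w_t + w_f)
        # Resample B from P(B | a).
        b = rng.random() < p_b[a]
        samples.append((a, b))
    return samples[int(burn_in_frac * n_samples):]   # discard burn-in

samples = gibbs(20000)
est = sum(b for _, b in samples) / len(samples)      # near P(B) = 0.41
```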
• The noisy-Or model can be used to avoid a parameter explosion when a vari-
able has many parents.
EXERCISES
148 CHAPTER 15. SUMMARY AND EXERCISES
(Figure: the Bayesian network for the exercise, with arcs A → T, S → L, S → B,
T → E, L → E, E → X, E → D and B → D.)
IV.3 (Software required) The Monty Hall puzzle gets its name from an Amer-
ican TV game show, "Let's Make a Deal", hosted by Monty Hall. In this show,
you have the chance to win a prize if you are lucky enough to find the prize
behind one of three doors. The game goes like this:
The problem of the puzzle is: What should you do at your second selection?
Some would say that it does not matter because it is equally likely that the
prize is behind the two remaining doors. This, however, is not quite true. Build
a Bayesian network to conclude which action gives the highest probability.
Here is some help to get you started:
The Monty Hall puzzle can be modeled in three random variables: Prize,
First Selection, and Monty Opens.
– Prize represents the information about which door contains the prize.
This means that it has three states: ”Door 1”, ”Door 2”, and ”Door 3”.
– First Selection represents your first selection. This variable also has the
three states: ”Door 1”, ”Door 2”, and ”Door 3”.
– Monty Opens represents Monty Hall's choice of door when you have made
your first selection. Again, we have the three states: "Door 1", "Door
2", and "Door 3".
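As a sanity check for the network you build, the game can also be simulated directly. This quick Monte-Carlo sketch (door numbering and function names are our own) estimates the winning probability of each policy:

```python
# Simulate the Monty Hall game under the "stay" and "switch" policies.

import random

def play(switch, rng):
    doors = (1, 2, 3)
    prize = rng.choice(doors)
    first = rng.choice(doors)
    # Monty opens a door that hides no prize and was not selected.
    monty = rng.choice([d for d in doors if d != prize and d != first])
    if switch:
        first = next(d for d in doors if d != first and d != monty)
    return first == prize

rng = random.Random(0)
n = 20000
p_stay = sum(play(False, rng) for _ in range(n)) / n     # close to 1/3
p_switch = sum(play(True, rng) for _ in range(n)) / n    # close to 2/3
```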
(p ∨ q) ∧ ¬p ∧ ¬q (15.0.1)
Under which assumptions is the noisy-Or model valid? Do you think the
noisy-Or model is appropriate to use in this particular application?
IV.6 Given a discrete Bayesian network, B = {X1 , . . . , Xn }, an atomic con-
figuration is a specific assignment of each individual variable, i.e. X1 =
x1 , . . . , Xn = xn .
(a) Explain why the sum of the probability of each atomic configuration must
be equal to one, i.e.
    Σ_{X_1,...,X_n} P(X_1, . . . , X_n) = 1.        (15.0.2)
[AboaFuzz 95] AboaFuzz 1.0 User Manual (P. Eklund, M. Fogström, S. Olli), Åbo
Akademi University, 1995.
[Baker 87] J. E. Baker. Reducing Bias and Inefficiency in the Selection Algorithm, In
J. J. Grefenstette, editor, Genetic Algorithms and their Applications: Proceedings
of the Second International Conference on Genetic Algorithms, pages 14-21, 1987.
152 BIBLIOGRAPHY
[Ben-Ari 93] M. Ben-Ari, Mathematical Logic for Computer Science, Prentice Hall,
1993.
[Bezdek 74] J. C. Bezdek, Cluster Validity with fuzzy sets, Journal of Cybernetics,
3 (1974), 58-73.
[Bezdek 80] J. C. Bezdek, A Convergence Theorem for the Fuzzy ISODATA Cluster-
ing Algorithms, IEEE Transactions on Pattern Analysis and Machine Intelligence
2 (1980), 1-8.
[Bezdek 81] J. C. Bezdek, Pattern Recognition with Fuzzy Objective Function Algo-
rithms, Plenum Press, 1981.
[Buckles 97] B. Buckles, Seminar Course: Evolutionary Computation - Lecture 3.
Published on WWW, address http://www.eecs.tulane.edu/www/Buckles.Bill/ec.html, 1997.
[Choe and Jordan 92] H. Choe, J. Jordan, On the Optimal Choice of Parameters
in a Fuzzy C-Means Algorithm, Proc. IEEE International Conference on Fuzzy
Systems, San Diego, 349-354, 1992.
[Cooper and Herskovits 92] G. F. Cooper, E. Herskovits, A Bayesian method for the
induction of probabilistc networks from data, Machine Learning 9 (1992), 309-347.
[Davis 87] L. Davis (ed.), Genetic Algorithms and Simulated Annealing, Technical
Report, University of Illinois, 1988.
[Eklund and Klawonn 92] P. Eklund, F. Klawonn, Neuro Fuzzy Logic Programming,
IEEE Transactions on Neural Networks, Vol 3, No. 5, September 1992, 815-818.
[Eklund and Zhou 96] P. Eklund, J. Zhou, Comparison of Learning Strategies for
Adaptation of Fuzzy Controller Parameters, J. Fuzzy Sets and Systems, to appear.
[Everitt 74] B. S. Everitt, Cluster Analysis, John Wiley & Sons, 1974.
[Fullér 95] R. Fullér, Neural Fuzzy Systems, Meddelanden från ESF vid Åbo
Akademi, Serie A:443, 1995.
[Geisser 75] S. Geisser, The predictive sampling reuse method with applications, J.
Amer. Stat. Assoc. ? (1975), xx-xx.
[Höhle 89] U. Höhle, Monoidal Closed Categories, Weak Topoi, and Generalized
Logics preprint, 1989.
[Holland 92] J. Holland. Adaption in natural and artificial systems, The MIT Press,
Cambridge Massachusetts, London, England, 1992.
[Huber 96] B. Huber, Fibre Optic Gyro Application on Autonomous Vehicular Nav-
igation, PhD thesis, University of Strasbourg, 1996.
[Jain and Dubes 88] A. Jain, R. Dubes, Algorithms for Clustering Data, Prentice
Hall, 1988.
[Jang 92] J.-S. R. Jang, Self-learning fuzzy controllers based on temporal back prop-
agation, IEEE Trans. Neural Networks 3 No 5 (1992), 714-723.
[Moody 94] J. Moody, Prediction risk and architecture selection for neural networks,
In: V. Cherkassky, J. H. Friedman, H. Wechsler (Eds.), From Statistics to Neural
Networks: Theory and Pattern Recognition, NATO ASI Series F, Springer-Verlag,
1994.
[Moody and Utans 95] J. Moody, J. Utans, Architecture selection strategies for neu-
ral networks: Application to corporate bond rating predictions, In: A.-P. Refenes
(Ed.), Neural Networks in the Capital Markets, John Wiley & Sons, 1995, 277-
300.
[Olesen 93] K. G. Olesen, Causal probabilistic networks with both discrete and con-
tinuous variables, IEEE Transactions on Pattern Analysis and Machine Intelli-
gence 3 (1993).
[Olli 95] S. Olli, Fuzzy Control for AGNES, unpublished notes, Åbo Akademi,
1995.
[Riissanen and Eklund 96] T. Riissanen and P. Eklund, Working within a Fuzzy
Control Application Development Workbench: Case Study for a Water Treatment
Plant, Proc. EUFIT’96, 4th European Congress on Intelligent Techniques and
Soft Computing, Aachen, 1142-1145, 1996.
[Russell and Norvig 95] S. Russell, P. Norvig, Artificial Intelligence – a Modern Ap-
proach, Prentice-Hall International, 1995.
[Schweizer and Sklar 61] B. Schweizer, A. Sklar, Associative functions and statisti-
cal triangle inequalities, Publicationes Mathematicae Debrecen, 8 (1961), 169-
186.
[Shao 88] S. Shao, Fuzzy Self-Organizing Controller and its Application for Dynamic
Processes, Fuzzy Sets and Systems 26 (1988), 151-164.
[Smith and Kelleher 88] B. Smith, G. Kelleher (eds.), Reason Maintenance Sys-
tems and their Applications, Ellis Horowood series in Artificial Intelligence, Ellis
Horowood Limited, 1988.
[Takagi and Sugeno 85] T. Takagi, M. Sugeno, Fuzzy Identification of Systems and
Its Applications to Modeling and Control, IEEE Transactions on Systems, Man
and Cybernetics 15 (1985), 116-132.
[Umano 87] M.Umano, Fuzzy-Set Prolog, Second IFSA Congress, Tokyo, 1987, pp.
750-753.
[Wang and Mendel 92] L. X. Wang, J. M. Mendel, Fuzzy Basis Functions, Universal
Approximation, and Orthogonal Least-Squares Learning, IEEE Trans. on Neural
Networks., 3 No.5. September, 1992, 807-813.
[Windham 82] M. Windham, Cluster Validity for the Fuzzy c-Means Clustering Al-
gorithm, IEEE Transactions on Pattern Analysis and Machine Intelligence, 4
(1982), 357-363.
[Yager 80] R. R. Yager, On a General Class of Fuzzy Connectives, Fuzzy Sets and
Systems 4 (1980), 235-242.
[Yager 96b] R. R. Yager, Constrained OWA Aggregation, Fuzzy Sets and Systems
81 (1996), 89-101.
[Zadeh 65] L. A. Zadeh, Fuzzy Sets, Information and Control 8 (1965), 338-353.
[Zadeh 75] L. A. Zadeh, The Concepts of a Linguistic Variable and its Application
to Approximate Reasoning, Information Science, 8 (1975), 199-249.
[Zadeh 78] L. A. Zadeh, Fuzzy Sets as a Basis for a Theory of Possibility, Fuzzy
Sets and Systems, Vol 1, 1978, pp. 3-28.
[Zadeh 89] L. A. Zadeh, The coming age of fuzzy logic, plenary talk at 3rd IFSA,
Seattle, August 6-11, 1989.
[Zhou and Eklund 95] J. Zhou, P. Eklund, Some Remarks on Learning Strategies for
Parameter Identification in Rule Based Systems, Proc. EUFIT’95, 3rd European
Congress on Intelligent Techniques and Soft Computing, Aachen, 1911-1916, 1995.
[] Discussions with Veli Kairisto, Turku University Central Hospital, on the diag-
nostic problem and the TUCH data set for acute myocardial infarction, Spring,
1996.