COMPUTATIONAL INTELLIGENCE
Umeå University
Department of Computing Science
S-901 87 Umeå
Sweden
These lecture notes were originally prepared for the course in Computational In-
telligence held in Spring 1998 at the Department of Computing Science at Umeå
University.
Text contributions for this edition have been provided by Jens Bohlin, Patrik Eklund
and Tony Riissanen.
The authors
Contents

II FUZZY SYSTEMS

3 Fuzzy Control
  3.1 Fuzzy Controllers
    3.1.1 Fuzzy rules
    3.1.2 Fuzzy rule bases
    3.1.3 Overview of fuzzy controller structures
  3.2 Inference in Fuzzy Controllers
    3.2.1 Mamdani's method
    3.2.2 Takagi-Sugeno's method
  3.3 Defuzzification

4 Fuzzy Clustering
  4.1 Data Clustering
  4.2 Fuzzy c-Means Clustering
  4.3 Identification of Rules
  4.4 Geometric Fuzzy Clustering
  4.5 Applications

7 Bayesian Networks
  7.1 A Network Model
  7.2 Conditional Independence
    7.2.1 Linear
    7.2.2 Diverging
    7.2.3 Converging
    7.2.4 d-separation
    7.2.5 A notation for conditional independence
  7.3 Finding the Probabilities
    7.3.1 Disjunctive Interaction
Chapter 1
Many-Valued Logic
A crisp subset A of a universe X can be described by its characteristic function

µA : X → {0, 1},

where µA(x) = 1 is read as

"x is in A".

This function can also be interpreted as a relation consisting of ordered pairs (x, µA(x)).

In a first step, a fuzzy set can be seen as an extension of characteristic functions, i.e. a fuzzy set µ can be defined mathematically by assigning to each possible individual in the universe of discourse a value, µ(x), representing its grade of membership in the fuzzy set µ. This grade corresponds to the degree to which that individual is similar to or compatible with the concept represented by the fuzzy set. Often a fuzzy set is thus given by a membership function

µA : X → I,

where I is the unit interval [0, 1], and µA(x) is read as

"x is in A"

or even

"x is A"

if it is desirable to speak of compatibility with the concept A rather than membership.
If X = {x1, . . . , xn} is a finite set and A a fuzzy subset of X, a more relational notation for A would be

A = µA(x1)/x1 + · · · + µA(xn)/xn,

where µA(xi)/xi contains respective grades of membership and + should be seen as a union. This notation is very informal and certainly not algebraic in any sense.
Example. Suppose we want to define a fuzzy set of natural numbers ”close to 4”
(see Figure 1.1). This can be given e.g. as
A = {(1, 0.0), (2, 0.2), (3, 0.6), (4, 1.0), (5, 0.6), (6, 0.2), (7, 0.0)}.
(Figure 1.1: the membership grades of "close to 4" plotted over 1, . . . , 7.)
The above definition of a fuzzy set is a typical situation where the relational style
is convenient.
Example. A fuzzy set A defining ”normal room temperature” can be given as
µA(x) =
  0,               if x < 16°C
  (x − 16°C)/2°C,  if 16°C ≤ x < 18°C
  1,               if 18°C ≤ x ≤ 22°C
  (24°C − x)/2°C,  if 22°C < x ≤ 24°C
  0,               if x > 24°C.
As can be seen, temperatures below 16◦ C or above 24◦ C are not to any degree
considered to be normal.
Example. A fuzzy set B defining ”high moisture rates” can be given as
µB(x) =
  0,              if x < 30%
  (x − 30%)/20%,  if 30% ≤ x ≤ 50%
  1,              if x > 50%.
Correspondingly, we have functions with open left shoulders, L : X → [0, 1], defined
by
L(x; α, β) =
  1,                 x < α
  (β − x)/(β − α),   α ≤ x ≤ β
  0,                 x > β.
(Figures: the open left shoulder L-function over parameters α, β; the corresponding open right shoulder; the triangular Λ-function with parameters α, β, γ; a linguistic partition NB, NS, ZO, PS, PB over [−40, 40]; and the trapezoidal Π-function with parameters α, β, γ, δ.)
The level between β and γ is sometimes called a plateau (see Figure 1.8).
From an implementation point of view it is obviously advantageous to consider Γ, L and Λ as special cases of Π. This is possible if the universe of discourse is a bounded interval. Suppose the universe of discourse is [−10, 10]. Then, e.g., Γ(x; α, β) = Π(x; α, β, 10, 10).

Another frequently used membership function is the Gaussian,

G(x; α, β) = e^(−β(x−α)²),

where α is the midpoint and β reflects the slope value. Note that β must be positive, and that the function never reaches zero.
The Gaussian function can also be extended to have different left and right slopes. We then have three parameters in

G(x; α, βl, βr) =
  e^(−βl(x−α)²),  x ≤ α
  e^(−βr(x−α)²),  x > α,
Another common choice is the S-shaped (sigmoidal) function

σ(x; α, β) = 1 / (1 + e^(−β(x−α))),

where α is the midpoint and β controls the slope at the inflexion point; in fact β is 4 times the derivative value at the inflexion point. As before, β must be positive. This S-function reaches neither 0 nor 1.
The purpose of this section is to introduce multi-valued logic from a more intuitive and informal point of view, as compared to a strongly algebraically developed theory of multi-valued logic. Generally speaking we will stay more on the syntactic side, rather than diving deeply into semantics. Keep in mind our purpose to present multi-valued logic as a method and technique to support application development. Furthermore, our general viewpoint is that success in applications should guide the search for "the best" understanding of the foundations.

The classical (min/max) connectives on [0, 1] are

¬a = 1 − a
a ∧ b = min{a, b}
a ∨ b = max{a, b}
The intuition for these is obvious, as they clearly reflect worst and best case characterisations. However, this is also a disadvantage, since it means that an outcome might remain unchanged even if we modify some value, e.g. min{0.7, 0.5} is the same as min{0.8, 0.5}. If we desire any change in a or b to be effective we can use e.g. the well-known product connectives

a ∧ b = a · b
a ∨ b = a + b − a · b

or the Łukasiewicz connectives

a ∧ b = max{0, a + b − 1}
a ∨ b = min{a + b, 1}
In the above mentioned connectives the outcome depends only on a and b, i.e.
there are no additional parameters. Adding parameters, however, introduces several
interesting and useful classes of connectives.
Two well-known parametric families are Yager's connectives [Yager 80], with parameter p > 0,

a ∧ b = 1 − min{[(1 − a)^p + (1 − b)^p]^(1/p), 1}
a ∨ b = min{(a^p + b^p)^(1/p), 1},

and Hamacher's connectives [Hamacher 75], with parameter γ ≥ 0,

a ∧ b = a·b / (γ + (1 − γ)(a + b − a·b))
a ∨ b = (a + b − a·b − (1 − γ)·a·b) / (1 − (1 − γ)·a·b).
Example. Let a = 0.6, b = 0.9 and c = 0.8, and take p = 2 (Yager) and γ = 2 (Hamacher):

Definition    a ∧ b                                                (a ∧ b) ∨ ¬c
Łukasiewicz   max{0, 0.6 + 0.9 − 1} = 0.5                          min{0.5 + (1 − 0.8), 1} = 0.7
Yager         1 − min{[(1 − 0.6)² + (1 − 0.9)²]^(1/2), 1} ≈ 0.59   min{[0.59² + (1 − 0.8)²]^(1/2), 1} ≈ 0.62
Hamacher      0.6·0.9 / (2 + (1 − 2)(0.6 + 0.9 − 0.6·0.9)) ≈ 0.52  (0.52 + (1 − 0.8) − (2 − 2)·0.52·(1 − 0.8)) / (1 − (1 − 2)·0.52·(1 − 0.8)) ≈ 0.65
Given fuzzy subsets A and B of X and using the maximum operator for disjunction, the union of A and B is then a fuzzy set µA∪B given by

µA∪B(x) = max{µA(x), µB(x)}.

Similarly, using the minimum operator, the intersection µA∩B is given by

µA∩B(x) = min{µA(x), µB(x)}.
Example. Let A and B be discrete fuzzy subsets of X = {−3, −2, −1, 0, 1, 2, 3}. If

A = {(−3, 0.0), (−2, 0.3), (−1, 0.6), (0, 1.0), (1, 0.6), (2, 0.3), (3, 0.0)}

and

B = {(−3, 1.0), (−2, 0.5), (−1, 0.2), (0, 0.0), (1, 0.2), (2, 0.5), (3, 1.0)},

then

A ∧ B = {(−3, 0.0), (−2, 0.3), (−1, 0.2), (0, 0.0), (1, 0.2), (2, 0.3), (3, 0.0)}

and

A ∨ B = {(−3, 1.0), (−2, 0.5), (−1, 0.6), (0, 1.0), (1, 0.6), (2, 0.5), (3, 1.0)}.

The complements are

¬A = {(−3, 1.0), (−2, 0.7), (−1, 0.4), (0, 0.0), (1, 0.4), (2, 0.7), (3, 1.0)}

and

¬B = {(−3, 0.0), (−2, 0.5), (−1, 0.8), (0, 1.0), (1, 0.8), (2, 0.5), (3, 0.0)}.
(Figures: the discrete fuzzy sets A and B and their pointwise union A ∨ B over {−3, . . . , 3}.)
A mapping

f : X → Y

extends in the usual way to a mapping between powersets,

f : PX → PY.

Strictly speaking we should use another notation for the latter, e.g. Pf, since the mappings are not the same. However, as there is usually no confusion, it is very common to use f in both situations.
Viewing this extension in the fuzzy domain, we have the following obvious question: given a fuzzy subset µA of X, what is the corresponding image f(µA) as a fuzzy subset of Y? A natural approach is to require that

f(µA)(f(x)) = µA(x),

but in such cases we need to define f(µA)(f(x)) also in situations where there are other points x′ for which f(x) = f(x′) but µA(x) ≠ µA(x′). Typically in such a situation we take "the best we have", i.e. f(µA)(f(x)) would be the largest value (or supremum in an infinite case) of all µA(x′) for which f(x) = f(x′). Also we need to specify the values of f(µA)(y) where we cannot find any x such that f(x) = y. In such a situation it is natural to require that f(µA)(y) = 0.
To summarize, the Extension Principle states that
f(µA)(y) =
  ∨_{f(x)=y} µA(x),   if {x ∈ X | f(x) = y} ≠ ∅,
  0,                  otherwise.
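For finite universes the Extension Principle is directly computable. The following sketch (ours) represents a fuzzy set as a dictionary from points to grades and takes the maximum over all preimages; the mapping x mod 3 is chosen only for illustration.

def extend(f, mu_a, codomain):
    # Image of the fuzzy set mu_a under f: for each y, take the largest
    # grade over all x with f(x) = y, and 0 if y has no preimage.
    image = {}
    for y in codomain:
        grades = [mu for x, mu in mu_a.items() if f(x) == y]
        image[y] = max(grades) if grades else 0.0
    return image

# The fuzzy set "close to 4" from the earlier example:
mu_a = {1: 0.0, 2: 0.2, 3: 0.6, 4: 1.0, 5: 0.6, 6: 0.2, 7: 0.0}
print(extend(lambda x: x % 3, mu_a, {0, 1, 2}))
# y = 1 has preimages {1, 4, 7}; the supremum picks the grade 1.0 of x = 4.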
The Extension Principle also applies to mappings of several variables,

f : X^n → Y,

and then we have to define f(µA1, . . . , µAn)(f(x)), x = (x1, . . . , xn), given that we have µA1(x1), . . . , µAn(xn). Now we need to consider worst cases w.r.t. µAi(xi), as the combination is more of a conjunction. But the Extension Principle remains basically the same, i.e.

f(µA1, . . . , µAn)(y) =
  ∨_{f(x)=y} min_i {µAi(xi)},   if {x ∈ X^n | f(x) = y} ≠ ∅,
  0,                            otherwise.
Addition of fuzzy numbers, for instance, is obtained by extending the mapping

+ : R × R → R.

The Cartesian product of fuzzy sets A1, . . . , An can be defined either by the minimum or by the algebraic product,

µA1×···×An(u1, u2, · · · , un) = µA1(u1) · µA2(u2) · · · · · µAn(un).

The composition of fuzzy relations R (on U × V) and S (on V × W) is

R ◦ S = {[(u, w), sup_v (µR(u, v) ∗ µS(v, w))] | u ∈ U, v ∈ V, w ∈ W},

where ∗ could be any t-norm, e.g. minimum, algebraic product, bounded product or drastic product.
In a general form, a compositional operator may be expressed as the sup-star composition, where "star" denotes an operator, e.g. min, product, etc. In the literature, four kinds of compositional operators are used in the compositional rule of inference:
• Sup-min operation,
• Sup-product operation,
• Sup-bounded-product operation,
• Sup-drastic-product operation.
In FLC applications, the sup-min and sup-product compositional operators are the most frequently used.
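For finite relations the sup-min composition is a matrix-like computation, with sums replaced by max and products by min. A minimal sketch (ours; the matrices happen to be those appearing in exercise I.9 below):

def sup_min(R, S):
    # (R o S)(i, j) = max over k of min(R(i, k), S(k, j))
    return [[max(min(R[i][k], S[k][j]) for k in range(len(S)))
             for j in range(len(S[0]))]
            for i in range(len(R))]

R = [[0.5, 0.1, 0.1, 0.7],
     [0.0, 0.8, 0.0, 0.0],
     [0.9, 1.0, 0.7, 0.8]]
G = [[0.4, 0.9, 0.3],
     [0.0, 0.4, 0.0],
     [0.9, 0.5, 0.8],
     [0.6, 0.7, 0.5]]
print(sup_min(R, G))
# e.g. entry (x1, z1): max(min(0.5, 0.4), min(0.1, 0), min(0.1, 0.9), min(0.7, 0.6)) = 0.6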
The fuzzy rule in premise 2 above can be put into the simpler form "A × B → C". Intuitively, this fuzzy rule can be transformed into a ternary fuzzy relation R, which is specified by the following membership function:

µR(x, y, z) = µ(A×B)×C(x, y, z) = µA(x) ∧ µB(y) ∧ µC(z).   (1.1)

The resulting C′ is expressed as

C′ = (A′ × B′) ◦ (A × B → C),   (1.2)

i.e. C′ is obtained via the compositional rule of inference.
The interpretation of multiple rules is usually taken as the union of the fuzzy relations corresponding to the fuzzy rules. For example, given the following fact and rules:

premise 1 (fact): x is A′ and y is B′,
premise 2 (rule 1): if x is A1 and y is B1 then z is C1,
premise 3 (rule 2): if x is A2 and y is B2 then z is C2,
consequence: z is C′,

we can use the fuzzy reasoning shown in Figure 1.14 as an inference procedure to derive the resulting output fuzzy set C′.

Figure 1.14: Fuzzy reasoning for multiple rules with multiple antecedents.

Here C1′ and C2′ are the inferred fuzzy sets for rules 1 and 2, respectively. Figure 1.14 shows graphically the operation of fuzzy reasoning for multiple rules with multiple antecedents. Suppose a fuzzy rule base consists of a collection of fuzzy if...then rules in the following form:

Ri(x1, x2, . . . , xm, y) = Ai1(x1) × Ai2(x2) × · · · × Aim(xm) → Ci(y)   (1.6)
We could combine the rules by an aggregation operator Agg into one rule which is used to obtain C′ from A′. The aggregation can be taken as intersection, that is

R(x, y) = ∩_{i=1}^{n} Ri(x, y) = min_i (Ai1(x1) × Ai2(x2) × · · · × Aim(xm) → Ci(y)),

or as union, that is

R(x, y) = ∪_{i=1}^{n} Ri(x, y) = max_i (Ai1(x1) × Ai2(x2) × · · · × Aim(xm) → Ci(y)).

With the union interpretation, the inferred output fuzzy set is

µC′(y) = ∨_{i=1}^{n} {[µA′1(x1) ∧ µAi1(x1)] ∧ [µA′2(x2) ∧ µAi2(x2)] ∧ · · ·} ∧ µCi(y)
       = ∨_{i=1}^{n} {∧_{j=1}^{m} [µA′j(xj) ∧ µAij(xj)]} ∧ µCi(y)
       = ∨_{i=1}^{n} {τi ∧ µCi(y)},

where τi is the firing strength (activation level) of rule i.
Figure 1.15 shows graphically the operation of fuzzy reasoning for a MISO system: for each rule "If x1 is An1 and . . . and xm is Anm then y is Cn", the firing strength τn = µAn1(x1) ∧ µAn2(x2) ∧ · · · ∧ µAnm(xm) clips the consequent to τn ∧ µCn(y).
From the previous definitions we see that there are six interpretations (1-6) of fuzzy implication, and in each interpretation we may employ different t-norms or t-conorms. A fuzzy if...then rule (1.5) can therefore be interpreted in a number of ways, and the output of the fuzzy inference mechanism differs accordingly. For these different types of outputs we can use different defuzzifiers to defuzzify them into a single point in the output space V.
Typical properties required of a fuzzy implication I include, e.g.,

(5) I(p, q) ≥ q,

(10) I is continuous.

The following table gives the most usual implications, which class they belong to and which properties are satisfied [Dubois, Prade 91, Dubois, Lang, Prade 91].
Let p → q represent the rule if p then q, where p can be of the form p1 and . . . and pn. We can then say that:

(i) The truth value connected to p, τ(p), is then of the form τ(p) = τ(p1) ∗ . . . ∗ τ(pn), where ∗ is a t-norm;
Definition Let Γ be a fuzzy set of axioms. The mapping Γ |=: L → [0, 1] is given
by Γ |= P = inf{Υ(P ) | Υ is a valuation w.r.t. Γ}, where inf ∅ = 1.
Chapter 2

Summary and Exercises

EXERCISES
Suppose that their intersection and union are defined by Hamacher's t-norm and t-conorm with γ = 1, respectively. What are then the membership functions of A ∩ B and A ∪ B?
I.3 Show that Yager’s ∧ and ∨ are, respectively, t-norms and co-t-norms.
I.4 Prove that for any t-norm T and any co-t-norm S we have T(a, b) ≤ min{a, b} and S(a, b) ≥ max{a, b}.
I.5 Let µ1, µ2 ∈ Fc(R) (the class of all upper semicontinuous fuzzy sets of R). With the help of the extension principle, for the sum µ1 ⊕ µ2 and the product µ1 ⊙ µ2, we lay down the following:

(µ1 ⊕ µ2)(z) = sup_{x+y=z} min{µ1(x), µ2(y)},
(µ1 ⊙ µ2)(z) = sup_{x·y=z} min{µ1(x), µ2(y)}.
I.6 Let f(x) = x² and let A ∈ F be a symmetric triangular fuzzy number with membership function

A(x) =
  1 − |a − x|/α,  if |a − x| ≤ α
  0,              otherwise.

Then use the extension principle to calculate the membership function of the fuzzy set f(A).
I.8 Consider two fuzzy relations R = "x is considerably smaller than y" and G = "x is very close to y":

R:
       y1    y2    y3    y4
  x1   0.5   0.1   0.1   0.7
  x2   0     0.8   0     0
  x3   0.9   1     0.7   0.8

G:
       y1    y2    y3    y4
  x1   0.4   0     0.9   0.6
  x2   0.9   0.4   0.5   0.7
  x3   0.3   0     0.8   0.5

1. What is the intersection of R and G, i.e. the relation "x is considerably smaller than y AND x is very close to y"?
2. What is the union of R and G, i.e. the relation "x is considerably smaller than y OR x is very close to y"?
I.9 Consider two fuzzy relations R = "x is considerably smaller than y" and G = "y is very close to z":

R:
       y1    y2    y3    y4
  x1   0.5   0.1   0.1   0.7
  x2   0     0.8   0     0
  x3   0.9   1     0.7   0.8

G:
       z1    z2    z3
  y1   0.4   0.9   0.3
  y2   0     0.4   0
  y3   0.9   0.5   0.8
  y4   0.6   0.7   0.5

Compute the composition R ◦ G.
Part II

FUZZY SYSTEMS
Chapter 3
Fuzzy Control
A fuzzy rule has the form

IF ⟨fuzzy criteria⟩ THEN ⟨fuzzy conclusion⟩,

where ⟨fuzzy criteria⟩ and ⟨fuzzy conclusion⟩ either are atomic or compound fuzzy propositions. Such a rule can be seen as a causal relation between measurements and control values of the process. If e and ė are input signals and u̇ an output signal, and further NS, PS and NL are linguistic variables, then

IF e is NS AND ė is PS THEN u̇ is NL

or
if the present deviation of the control value is N S and the latest change
in the deviation of the control value is P S then this should cause the
control value to be N L.
Both antecedents and consequents can involve several linguistic variables. If this is the case, the system is called a multi-input-multi-output (MIMO) fuzzy system. Such systems have several input signals and output signals. Also multi-input-single-output (MISO) systems, with several input signals but only one output signal, are very common. An example of a MISO system is as follows.
R1 : IF x is A1 AND y is B1 THEN z is C1
R2 : IF x is A2 AND y is B2 THEN z is C2
.. ..
. .
Rn : IF x is An AND y is Bn THEN z is Cn .
Here x and y are input signals and z is the output signal. Further, Ai, Bi and Ci are linguistic variables.
• An expert that knows the process provides linguistic rules that are specified given previous knowledge and know-how related to the process.

• The process is described within a fuzzy model, from which control rules can be directly derived. Such methods do not yet exist, and require further research.

• A fuzzy controller is adaptive in the sense that the rule base together with parameters in rules (and possibly in inference mechanisms) are adjusted in real-time, given possibilities for the system to identify itself as being in respectively good or bad states. Some suggestions of related techniques are found e.g. in [Procyk and Mamdani 79], [Shao 88] and [Sugeno 85].
Whatever technique we use, our goal is to construct a number of fuzzy rules with
the following syntax:
Note that this syntax is valid for MISO systems:

Ri: IF x1 is Ai1 AND . . . AND xm is Aim THEN u is Ui.

For rule Ri we have x1, . . . , xm as input signals and Ai1, . . . , Aim as respective linguistic quantifiers of the input signals. The consequent of the rule is "u is Ui".
Example. The following shows an example of fuzziness used for speed control. As input signals we have the actual speed v (km/h) and the load l (N) of the car. Load components are e.g. force F and friction Fµ. As linguistic variables for speed we use LS (low speed), NS (normal speed) and HS (high speed), and for load similarly LL (low load), NL (normal load) and HL (high load). These are typically bell-shaped functions, and as midpoint in NS we use 70 km/h, which also is the constant speed we try to maintain. The choice of precise shape and transposition of membership functions is left open at this point. Low load appears e.g. downhill and high load uphill (see figure 3.1).
(Figure: the structure of the fuzzy controller. The measured signals, e.g. the load F + Fµ and the speed v, are fuzzified, processed by the rule base and the inference mechanism, and finally defuzzified into the control signal u.)
(Figure: Mamdani inference for two rules. The activation levels α1 and α2 are the minima of the antecedent grades A11(x1), A12(x2) and A21(x1), A22(x2), and clip the output sets U1 and U2.)
This is the most common view of fuzzy control. In Mamdani's method, conjunction is given by the minimum operator, implication likewise, and resulting output membership functions are combined using the maximum operator as disjunction. To be more precise, if the activation level of rule i is given by

αi = ∧_{j=1}^{m} Aij(xj),

then rule i contributes the clipped output αi ∧ Ui(u), and the rule contributions are combined by maximum, as in the sketch below.
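The following sketch (ours) carries out Mamdani inference for two rules with two inputs on a discretized output universe; the triangular membership functions and rule parameters are illustrative only.

def tri(x, a, b, c):
    # Triangular membership function with support [a, c] and peak at b.
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

rules = [  # (antecedent A_i1, antecedent A_i2, consequent U_i) as triangle parameters
    ((0, 2, 4), (0, 2, 4), (0, 2, 4)),
    ((2, 4, 6), (2, 4, 6), (2, 4, 6)),
]
x1, x2 = 2.5, 3.0
grid = [i / 10 for i in range(61)]           # discretized output universe [0, 6]
mu_U = [0.0] * len(grid)
for a1, a2, u_mf in rules:
    alpha = min(tri(x1, *a1), tri(x2, *a2))  # activation level of the rule
    for k, u in enumerate(grid):
        # clip the consequent at alpha, combine rules by maximum
        mu_U[k] = max(mu_U[k], min(alpha, tri(u, *u_mf)))
print(max(mu_U))   # height of the combined output fuzzy set

The resulting mu_U can then be defuzzified with any of the methods of section 3.3.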
In Takagi-Sugeno's method the rule consequents are crisp functions of the input signals, typically linear,

ui = pi0 + pi1·x1 + · · · + pim·xm,

where pi0, . . . , pim are constants related to rule i. Methods to specify the constants are discussed in [Takagi and Sugeno 85], including also algorithms for selecting input signals related to respective ui. The final control value is given by

u = (Σ_{i=1}^{n} αi·ui) / (Σ_{i=1}^{n} αi).
Example. Given input signals x1 = 6.5 and x2 = 9.2, and linear output functions u1 = 2 + 1.7x1 + 1.3x2 and u2 = −3 + 0.5x1 + 2.1x2, we obtain u1 ≈ 25.0 and u2 ≈ 19.6. The control value then becomes the weighted average u = (α1u1 + α2u2)/(α1 + α2), where each rule output can be pictured as a singleton at ui with height Ui(ui) = αi.
(Figure: Takagi-Sugeno inference for two rules. The activation levels α1, α2 are obtained by min over the antecedents, the outputs u1, u2 are singletons, and the control value is the weighted average u = Σ αi ui / Σ αi.)
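A small sketch (ours) of the Takagi-Sugeno combination for the example above; since the antecedent membership functions are unspecified, the activation levels below are assumed values for illustration.

x1, x2 = 6.5, 9.2
u1 = 2 + 1.7 * x1 + 1.3 * x2      # approx. 25.0
u2 = -3 + 0.5 * x1 + 2.1 * x2     # approx. 19.6
alpha = [0.4, 0.6]                # assumed activation levels (not given in the text)
u = (alpha[0] * u1 + alpha[1] * u2) / sum(alpha)
print(round(u, 2))                # weighted average of the rule outputs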
3.3 Defuzzification
As a result of inference we obtain a fuzzy set µU of proposed control values. Each
activated rule Ri , i.e. for which the activation level is non-zero, contributes to µU ,
and therefore we obtain the final conclusion as
µU = ∪_{i=1}^{n} µUi.

(Figure: the clipped outputs of the activated rules are combined into the total output fuzzy set µU.)
Centre-of-Gravity, CoG

CoG finds the centre of gravity of µU. In the discrete case, we have

u = (Σ_{k=1}^{l} uk · µU(uk)) / (Σ_{k=1}^{l} µU(uk)).
Indexed Centre-of-Gravity, ICoG

In ICoG we only consider the area of µU that is above a specified level α (see figure 3.8), and compute the centre of gravity for this area. Thus, in the discrete case, we have

u = (Σ_{k=1}^{l} uk · [µU(uk)]α) / (Σ_{k=1}^{l} [µU(uk)]α),

where [µU(uk)]α denotes the part of the fuzzy set µU above the α level.
In the continuous case, we have

u = (∫_{[U]α} v · [µU(v)]α dv) / (∫_{[U]α} [µU(v)]α dv).
Centre-of-Sums, CoS
In CoG we had to consider the whole of µU . CoS is similar to CoG but imple-
mentationally much more efficient. In CoS, we consider all output functions when
computing the sum of all µUi , i.e. overlapping areas may be considered more than
once (see figure 3.9). In the discrete case, we obtain
u = (Σ_{k=1}^{l} uk · Σ_{i=1}^{n} µUi(uk)) / (Σ_{k=1}^{l} Σ_{i=1}^{n} µUi(uk)).
First-of-Maxima, FoM

In FoM, defuzzification of µU is defined as the smallest value in the domain of U with maximal membership value, i.e.

u = min{v ∈ U | µU(v) = max_w µU(w)}.
Middle-of-Maxima, MoM

MoM is similar to FoM. Instead of taking the first value with maximal grade of membership, we compute the average of all values with maximal grades,

u = (min{v ∈ U | µU(v) = max_w µU(w)} + max{v ∈ U | µU(v) = max_w µU(w)}) / 2.

A graphical representation is given in figure 3.11.
Height Method, HM
The height method is not applied directly on µU, but is focused on heights, and computes a weighted sum of these heights. The weighted sum is according to

u = (Σ_{i=1}^{n} µUi_peak · fi) / (Σ_{i=1}^{n} fi),

where µUi_peak is the ui for which µUi(ui) = 1 (where µUi is in its original form). See figure 3.12. For a trapezoidal membership function, µUi_peak becomes an interval (plateau), from which we can select a representative, e.g. the mean value. The value fi is the height of µUi, i.e. max µUi.
HM is computationally both simple and fast.
(Figure 3.12: the height method. The peak positions µU1_peak and µU2_peak are weighted by the heights f1 and f2. A further figure illustrates the CoLA method.)
Example. Defuzzification can lead to undesirable effects. Consider e.g. a car trying to avoid an obstacle directly in front of the car. The membership function representing candidates of control values is then typically as shown in figure 3.14, with two separated peaks (steer left or steer right). Both CoG and CoS, however, will result in driving straight ahead.
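To make the defuzzifiers concrete, here is a sketch (ours) of CoG and MoM on a discretized output membership function; the grid and the grades are illustrative.

def cog(u, mu):
    # Centre-of-Gravity over a discrete domain.
    return sum(uk * m for uk, m in zip(u, mu)) / sum(mu)

def mom(u, mu):
    # Middle-of-Maxima: average of the smallest and largest maximizing points.
    top = max(mu)
    maxima = [uk for uk, m in zip(u, mu) if m == top]
    return (min(maxima) + max(maxima)) / 2

u = [0, 1, 2, 3, 4, 5, 6]
mu = [0.0, 0.2, 0.8, 1.0, 1.0, 0.4, 0.0]
print(round(cog(u, mu), 2), mom(u, mu))   # approx. 3.18 and 3.5

Note that for a bimodal µU, as in the obstacle example, CoG would land between the two peaks.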
Chapter 4

Fuzzy Clustering
(Figure: a crisp 2-partition of a set of data points. Each point has membership 1 in exactly one of the two clusters and 0 in the other.)
(Figure: the corresponding fuzzy 2-partition. Each point has graded membership in both clusters, the grades summing to 1, e.g. 0.47 and 0.53 for a point midway between the clusters.)
To describe the algorithm, we need some notation. The set of all points considered is X = {x1, · · · , xn} (⊂ R^d). We write ui : X → [0, 1] for the ith cluster, i = 1, . . . , c, and we will use uik to denote ui(xk), i.e. the grade of membership of xk in cluster ui. We also use U = ⟨uik⟩ for the matrix of all membership values. The 'midpoint' of ui is vi (∈ R^d), and is computed according to

vi = (Σ_{k=1}^{n} (uik)^m xk) / (Σ_{k=1}^{n} (uik)^m).
The memberships are constrained by

Σ_{i=1}^{c} uik = 1,  k = 1, . . . , n,

and the fuzzy c-means (FCM) algorithm seeks to minimize the objective function

J = Σ_{i=1}^{c} Σ_{k=1}^{n} (uik)^m ‖xk − vi‖².
The following algorithm for FCM clustering will meet this objective: alternately update the midpoints vi according to the formula above and the memberships uik as follows. If µk = ∅, then

uik = 1 / [Σ_{j=1}^{c} (‖xk − vi‖ / ‖xk − vj‖)^(2/(m−1))],

otherwise

uik = 0 for all i ∉ µk, and Σ_{i∈µk} uik = 1,

where µk denotes the set of clusters whose midpoints vi coincide with xk.
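A compact sketch (ours) of the resulting iteration with m = 2; the data points and the random initial midpoints are illustrative, and the degenerate case xk = vi is side-stepped with a small epsilon instead of the µk bookkeeping above.

import numpy as np

def fcm(X, c, m=2.0, iters=50):
    rng = np.random.default_rng(0)
    V = X[rng.choice(len(X), c, replace=False)]            # initial midpoints
    for _ in range(iters):
        # distances d[k, i] = ||x_k - v_i||, guarded against zero
        d = np.linalg.norm(X[:, None, :] - V[None, :, :], axis=2) + 1e-12
        # membership update: u_ik = 1 / sum_j (d_ik / d_jk)^(2/(m-1))
        U = 1.0 / ((d[:, :, None] / d[:, None, :]) ** (2 / (m - 1))).sum(2)
        # midpoint update: v_i = sum_k u_ik^m x_k / sum_k u_ik^m
        V = (U.T ** m @ X) / (U.T ** m).sum(1, keepdims=True)
    return U, V

X = np.array([[0.0, 0.0], [0.1, 0.2], [0.9, 1.0], [1.0, 0.8]])
U, V = fcm(X, c=2)
print(U.round(2)); print(V.round(2))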
(Figure 4.3: the cluster midpoints v1, v2, v3 together with the overall midpoint x̄ of the data.)
A validity measure S(U, c) for the choice of the number of clusters c can be defined in terms of the distances between the data points, the cluster midpoints vi, and the overall midpoint x̄ of all points in X (see figure 4.3). Obviously, we will select that particular c0 for which

S(U, c0) = min_c S(U, c),

i.e. we should iterate the clustering algorithm with c ranging in a suitably selected interval, e.g. 2 to 20 clusters.
Clearly, in many cases, it is not straightforward to find the optimal number of clusters. For a small number of data points in X, a worst case scenario is that the criterion value decreases with increasing c and reaches its optimum with c equal to the number of data points in X. Also intuitively (see figure 4.4), it is not obvious which number of clusters will be better from a practical viewpoint.
Each cluster will now generate one rule in the following way. Consider a fuzzy cluster ui on X, and write πp(ui) : Xp → [0, 1] for the corresponding cluster projected on the pth axis, i.e. πp(ui)(πp(xk)) = uik. Assume we have decided to use Gaussian functions in our generated rule. This means we want to estimate the parameters of Gaussian functions with best fit to respective πp(ui). Thus, for each p we need to find βp such that

Σ_{xk∈X} | uik − e^(−βp(πp(xk)−αip)²) |

is minimized.
(Figure: projecting the third cluster, with midpoint v, onto the coordinate axes yields membership functions A31 and A32 with midpoints α31 and α32.)
4.4 Geometric Fuzzy Clustering

Consider a scatter matrix M related to the ith cluster. If this cluster is elliptic with points being concentrated around the centre point of the ellipse, then the principal axes of the cluster are given by the eigenvectors of M.
4.5 Applications
There is a wide range of applications of clustering. In industrial process control
there are numerous applications, such as fertilizer production ([Riissanen 92b]),
automatic steering (AGNES [Olli 95, Huber 96]), and process state prognostics
([Sutanto and Warwick 94]), only to mention some examples. The reader is referred
to proceedings e.g. related to IEEE and IFSA conferences on fuzzy systems.
Also in pattern recognition and image processing there are typical applications, e.g. in segmentation of colour images ([Lim and Lee 90]), edge detection in 3-D objects ([Huntsberger et al 86]), and surface approximation for reconstruction of images ([Krishnapuram et al 95]).
Chapter 5
EXERCISES
where

SMALL(t) =
  1 − t/4,  if 0 ≤ t ≤ 4
  0,        otherwise

MEDIUM(t) =
  1 − |t − 2|/2,  if 0 ≤ t ≤ 4
  0,              otherwise

BIG(t) =
  1 − (4 − t)/4,  if 0 ≤ t ≤ 4
  0,              otherwise
where

SMALL(t) =
  1 − t/4,  if 0 ≤ t ≤ 4
  0,        otherwise

MEDIUM(t) =
  1 − |t − 2|/2,  if 0 ≤ t ≤ 4
  0,              otherwise

BIG(t) =
  1 − (4 − t)/4,  if 0 ≤ t ≤ 4
  0,              otherwise
use the Mamdani method of inference together with a defuzzification method of your choice to compute the output for the inputs given by x = 2, y = 2 and z = 4.
Part III

PROBABILISTIC COMPUTING
Chapter 6
Introduction
A substantial number of our everyday decisions are made under conditions of un-
certainty. We are often faced with situations where conflicting evidence forces us
to explicitly value observed information in order to make rational decisions. Dif-
ferent methods, such as logical, probabilistic and numerical approaches, have been developed to handle uncertainty in reasoning systems. Bayesian networks (BNs), also known as causal probabilistic networks, belief networks etc., constitute one appealing formalism for describing uncertainty within domains. In recent years, the interest in BNs has increased due to new efficient algorithms and user-friendly software, making BNs available to others than the research community responsible for them.
BNs have been successfully applied in areas such as medical diagnosis, image processing and software debugging, and new applications are reported frequently.
6.2.1 Definitions

The probability of an event a is defined as

P(a) = g(a)/N,   (6.2.1)

where g(a) is the number of positive outcomes for a and N is the total number of outcomes. The number of positive outcomes for a is bounded by

0 ≤ g(a) ≤ N,   (6.2.2)

and hence

0 ≤ P(a) ≤ 1.   (6.2.3)

If A is a random variable with the discrete and exhaustive states {a1, . . . , aj} then

Σ_{i=1}^{j} P(A = ai) = 1.   (6.2.4)
The conditional probability of A given B is defined as

P(A|B) = P(A, B)/P(B),   (6.2.5)

where P(A, B) is the joint probability for A and B, i.e. "A and B". Rewriting equation (6.2.5) gives an expression for the joint probability for dependent variables:

P(A, B) = P(A|B)P(B).   (6.2.6)

A is independent of B if and only if P(A|B) = P(A). Using this fact in equation (6.2.6) gives an expression for independent variables:

P(A, B) = P(A)P(B).   (6.2.7)
Since the phrase "A and B" is equal to "B and A", equation (6.2.5) can be used to formulate:

P(A|B) = P(B|A)P(A)/P(B),   (6.2.10)

which is known as Bayes' rule or Bayes' theorem. This equation makes it possible to calculate the posterior probability, P(A|B), for an event when the opposite condition, P(B|A), is known (and the unconditional probabilities for the individual events).
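As a small numeric sketch (ours, with illustrative values): if P(B|A) = 0.9, P(A) = 0.1 and P(B) = 0.25, then

def bayes(p_b_given_a, p_a, p_b):
    # posterior P(A|B) from likelihood, prior and evidence
    return p_b_given_a * p_a / p_b

print(bayes(0.9, 0.1, 0.25))   # P(A|B) = 0.36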
Chapter 7

Bayesian Networks
(Figure 7.1: the timer T with directed links to H, Holmes' car, and W, Watson's car.)
three random variables: Holmes’ car (H), Watson’s car (W ) and the timer (T ). All
these variables are discrete with two mutually exclusive states each. The cars can
either start or not, and the timer is either working or not.
The causal relationship between these random variables can be expressed by
connecting them pairwise with directed links. Here, the timer affects (through the
power supply) the functionality of the cars, so a link can be drawn from the timer to
each of the two cars, see figure 7.1. The direction of the links is of great importance: the state of the timer (working or not working) affects the cars' ability to start, not the other way around.
In order to complete the network model, a conditional probability table (CPT)
must be specified for each node. Since the variable T is a root node in the network,
i.e. node T has no incoming links, only the unconditional probability of each state
needs to be specified. The probability that the timer will fail, P(t), is 20/365 ≈ 0.05, and the probability that the timer will work, P(¬t), is therefore 1 − P(t) ≈ 0.95. The probability tables for Holmes' and Watson's cars must be specified in the context of the timer. For example, from the text it can be read that the chance that Holmes' car will fail when the timer is working, P(h|¬t), is 2/10 = 0.2. The complete CPTs for Holmes' and Watson's cars are given in table 7.1.
With the above example in mind it is now appropriate to give a formal definition
of Bayesian networks:
(Figure 7.2: the three basic connection types. a) linear, A → B → C; b) diverging, A ← P → B; c) converging, A → C ← B.)
7.2.1 Linear

A linear connection is shown in figure 7.2a. If B is unknown, the probability of B is determined from the status of A. Since C is determined from B, C therefore becomes dependent on A.

If B is known to be in state bi (and therefore unaffected by A), the probability of C can be calculated directly from its probability table, P(C|bi), and C is therefore conditionally independent of A.
7.2.2 Diverging
This case was previously illustrated by the car trouble example. Generally, nodes
from a common parent P are dependent unless there is evidence in P . Evidence in
P blocks the path from A to B, see figure 7.2b.
7.2.3 Converging
The last case to consider is when two or more variables cause the same effect (figure 7.2c). If nothing is known about C, the parents A and B are independent.
For example, it is fairly safe to state that cold and allergy-attack, which both can
cause a person to sneeze, are independent, see fig 7.3 (this example is borrowed from
Henrion [Henrion et al 91]). However, observing a person sneezing makes the two events dependent. If a person sneezes when an allergen (for example, from a cat) is
present, the support for cold is reduced in favour of the allergy-attack theory. This phenomenon is also known as "explaining away". The allergy-attack explains away
the cold. Here, the evidence was observed directly in the converging node S. This
is not necessarily the case. In fact, any observed descendant node from S can act
as transmitter between the parents and make them conditionally dependent.
The conclusion is that a converging node, C, blocks the path between its parents
unless there is evidence in C or any of its descendants.
(Figure: Cold (C) and Allergy-attack (A) converge on Sneeze (S), with Cat as a parent of Allergy-attack.)
Figure 7.3: If a person is sneezing when an allergen is present, the support for cold
is reduced.
7.2.4 d-separation
The three cases above can be used to formulate a separability criterion, called direction-dependent separation or d-separation. d-separation is a very important property that can be used to find efficient inference algorithms. A formal proof can be found in Pearl [Pearl 88].
Definition 7.2.1 Any two nodes, A and B, in a Bayesian network are d-separated,
and therefore conditionally independent, if every path between A and B is blocked
by an intermediate node, V ∉ {A, B}.
V blocks the path if and only if one of the following holds:
(i) the structure is linear or diverging and V is known.
(ii) the structure is converging and neither V nor its descendants are known.
The d-separability criterion will later be used frequently in the presentation of in-
ference methods.
(a) A ⊥⊥ B | P

(b) A ⊥⊥ B (however A ⊥⊥ B | C is false)
Specifying the conditional probability tables can be problematic: the probability values needed can be too many to manage, and with larger sets of parents the CPT quickly becomes unwieldy. For example, a discrete boolean variable with only four boolean parents needs as many as 32 values to complete the CPT. Even with a large database there is a potential risk that some of the combinations are too unusual to provide a reliable estimate. One commonly used model to avoid this problem is the noisy-Or gate derived by Judea Pearl in [Pearl 88].
The model is based on disjunctive interaction, that is, when the likelihood for a particular condition is unchanged when other conditions occur at the same time. For example, if cold, pneumonia and chicken-pox are likely to cause fever, then disjunctive interaction applies when a person suffering from several of these diseases at the same time would only be more likely to develop fever. Furthermore, if the person is also suffering from a disease that is unlikely to cause fever, this additional evidence does not reduce the support for fever caused by the other diseases. There exists a well-founded theory for a disjunctive model if the following assumptions are made:
(i) Boolean variables: All included variables must have two discrete states, namely true and false.

(ii) Accountability: The effect is presumed false if all conditions listed as causes are false (a closed-world assumption).

(iii) Exception independence: The processes that inhibit an event under a condition are independent.
Assumption (ii) is not as strict as it might look, as it is always possible to add an "Other causes" variable to represent what is not explicitly specified in the closed-world assumption. The model requires that only the individual probabilities for each parent are specified, i.e. P(E| only Hi); the number of values to specify thus grows only linearly with the number of parents.
(Figure: Cold (Co), Pneumonia (Pn) and Chicken-pox (Cp) combined through a noisy-Or gate into Fever (Fe).)
Suppose P(fe|co) = 0.4, P(fe|pn) = 0.8 and P(fe|cp) = 0.9. For example, the probability of fever when both Cold and Pneumonia are present is then 1 − (1 − 0.4)(1 − 0.8) = 0.88.
Table 7.3: Probability values for fever calculated with the noisy-Or model.

Co   Pn   Cp   P(¬fe)   P(fe)
f    f    f    1        0
f    f    t    0.1      0.9
f    t    f    0.2      0.8
f    t    t    0.02     0.98
t    f    f    0.6      0.4
t    f    t    0.06     0.94
t    t    f    0.12     0.88
t    t    t    0.012    0.988
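Table 7.3 can be reproduced with a few lines of code (ours); q holds the inhibitor probabilities 1 − P(fe | only that cause).

q = {"co": 1 - 0.4, "pn": 1 - 0.8, "cp": 1 - 0.9}

def p_fever(present):
    # noisy-Or: fever fails only if every present cause is inhibited,
    # and the inhibitions are independent (assumption (iii)).
    not_fe = 1.0
    for cause in present:
        not_fe *= q[cause]
    return 1.0 - not_fe

print(p_fever({"co", "pn"}))        # 1 - 0.6*0.2 = 0.88
print(p_fever({"co", "pn", "cp"}))  # 1 - 0.6*0.2*0.1 = 0.988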
Chapter 8

Inference in Bayesian Networks
Once a network is constructed it can be used to answer queries about the domain.
The basic task in a Bayesian network is to compute the probability of a variable
under evidence. Since there are no special input or output nodes, any variable can
be computed or observed as evidence. There are basically three different types of
inference that occur in a Bayesian network:
• Causal inference
• Diagnostic inference
• Intercausal inference
Causal inference is when the reasoning follows the same direction as the links in the
network. Diagnostic inference, on the other hand, is when the line of reasoning is
the opposite of the causal dependencies. The basic strategy to handle diagnostic
reasoning is to apply Bayes' rule in equation (6.2.10). Intercausal inference means reasoning between causes of a common effect. This is the same condition as in section 7.2.3: the presence of one cause makes the others less likely. Finally, combinations of these inferences can appear in Bayesian networks. This is sometimes called mixed inference.
Proof 8.1.1 Induction on the nodes in the network. Suppose that every causal Bayesian network has a joint distribution as in equation 8.1.1. For a network with only one node the hypothesis is obviously true. Suppose that the hypothesis is true for a network of the variables Bn−1 = {X1, . . . , Xn−1}, that is P(Bn−1) = Π_{i=1}^{n−1} P(Xi|PXi). Let Bn = Bn−1 ∪ {Xn}, where Xn is a leaf in Bn. (Since Bn is a DAG, there is at least one leaf in Bn.) By using the fundamental rule in equation (6.2.5) the formula can be written

P(Bn) = P(Bn−1, Xn) = P(Xn|Bn−1)P(Bn−1).

Since Xn is independent of Bn−1 \ PXn, the factor P(Xn|Bn−1) reduces to P(Xn|PXn), which completes the induction.
The fact that a Bayesian network can be represented by a joint distribution on the
form above results in two important properties.
Here, consistent means that the probability values do not conflict with each other. It is rather easy, intentionally or not, to construct an inconsistent system. Let for example P(a) = 0.2, P(b) = 0.7 and P(a|b) = 0.65. These values might seem alright at a first glance, but they are in fact inconsistent. By using Bayes' rule the term P(b|a) can be written as P(a|b)P(b)/P(a) = 0.65 · 0.7/0.2 ≈ 2.3, which is > 1. The consistency property ensures that a Bayesian network does not violate the axioms of probability.
The joint distribution can be used to answer any query of the network. To illustrate this, let us again return to the previous car trouble example. Here, the joint distribution is P(H, W, T) = P(H|T)P(W|T)P(T). The computation of, say, P(h, w, t) can be done by simply multiplying the corresponding values in the CPTs:
Table 8.1: Atomic events for the car trouble example calculated with the joint
distribution.
H W T P(H, W, T)
y y y 0.684
y y n 0.0075
y n y 0.076
y n n 0.0075
n y y 0.171
n y n 0.0175
n n y 0.019
n n n 0.0175
The terms P(¬h, w) and P(w) can be computed by using equation (6.2.9) to sum over all matching terms in table 8.1:

P(¬h|w) = (0.076 + 0.0075) / (0.076 + 0.0075 + 0.019 + 0.0175) ≈ 0.696.
If both cars fail to start, the probability of a broken timer is:

P(t|h, w) = P(t, h, w)/P(w, h) = 0.0175 / (0.019 + 0.0175) ≈ 0.48.   (8.1.2)
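The same query can be answered mechanically by enumeration. The sketch below (ours) uses conditional probability values consistent with table 8.1, e.g. P(H = y | T = y) = 0.8.

p_t = {"work": 0.95, "broken": 0.05}     # P(T)
p_h = {"work": 0.8, "broken": 0.3}       # P(Holmes' car starts | T)
p_w = {"work": 0.9, "broken": 0.5}       # P(Watson's car starts | T)

def joint(h_starts, w_starts, t):
    # P(H, W, T) = P(H|T) P(W|T) P(T), as in the factorization above
    ph = p_h[t] if h_starts else 1 - p_h[t]
    pw = p_w[t] if w_starts else 1 - p_w[t]
    return ph * pw * p_t[t]

# P(T = broken | both cars fail), cf. equation (8.1.2):
num = joint(False, False, "broken")
den = sum(joint(False, False, t) for t in p_t)
print(round(num / den, 2))   # approx. 0.48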
Summing over the joint distribution is an easy method to use when answering queries in Bayesian networks. Unfortunately, due to the exponential growth of atomic events, this method cannot be used in practice in this simple way. The number of cases to consider is equal to the product of the number of states for each variable. If every variable is binary, the number of atomic events in a network with n nodes is as large as 2^n. With larger networks, this method simply becomes intractable. For example, the medical diagnosis system MUNIN (Olesen et al in [Olesen et al 89]) consists of more than 1000 variables with up to seven states each. Even though the MUNIN system is one of the most elaborate applications found in the literature, real-world modeling often requires far more nodes than is manageable within the reach of this simple method. Even worse, Cooper proved in [Cooper 87] that inference in a Bayesian network is NP-hard irrespective of the method used. There are no efficient methods for handling an arbitrary network. Despite this disheartening result there are ways to tackle this problem. The proof in [Cooper 87] applies to an arbitrary network, and there are certain families of network topologies where more efficient algorithms exist. Also, many applications generate sparse graphs, in which the prospects of finding a more local computation scheme are good. Three such exact inference methods are described in sections 8.3.1, 8.3.2 and 8.4.
(Figure: a) a singly connected network, with at most one undirected path between any two nodes; b) a multiply connected network.)
(Figure 8.2: the evidence around a node X, decomposed into the causal support E+X transmitted through the parents U1, . . . , Um (with evidence sets EUkX), and the diagnostic support E−X transmitted through the children Y1, . . . , Yn and their other parents Zij (with evidence sets EXYi).)
which makes α:

α = 1 / Σ_X P(E−X | X, E+X).   (8.2.5)

That is, α can be treated as a normalizing constant which scales the sum of the individual probabilities to 1. There is no need to find an explicit formula for P(E+X|E−X).
Next, P(X|E+X) can be computed by considering all possible configurations of the parents of X. Letting PX = {U1, . . . , Um}, we get

P(X|E+X) = Σ_{PX} P(X|PX, E+X) P(PX|E+X).   (8.2.6)
The term P(X|PX, E+X) can be simplified into P(X|PX), since the parents PX d-separate E+X from X. Furthermore, since X is a converging node, it blocks the undirected path between the parents, which ensures that they are conditionally independent. Using the equation of independent events (6.2.7) the equation can be written

P(X|E+X) = Σ_{PX} P(X|PX) Π_k P(Uk|E+X).   (8.2.7)
The evidence E+X can be further decomposed into {EU1X, . . . , EUmX}, see figure 8.2. Since the parents are independent, each Uk is independent of all other parents and their evidence sets. This observation gives

P(X|E+X) = Σ_{PX} P(X|PX) Π_k P(Uk|EUkX).   (8.2.8)
A closer look at the term P(Uk|EUkX) reveals that it is in fact a recursive instance of the original problem, excluding the node X. The term P(X|PX) can be found in the conditional probability table for X.

Now, returning to equation (8.2.4), the last term to consider is P(E−X|X). Let Y = {Y1, . . . , Yn} be the set of children of X. Decomposing the evidence into {EXY1, . . . , EXYn} yields

P(E−X|X) = Π_i P(EXYi|X).
The last term, P(Yi, Zi|X), can be written P(Yi|X, Zi)P(Zi|X), and since X is d-separated from Zi the formula becomes:

P(E−X|X) = Π_i Σ_{Yi} Σ_{Zi} [P(Zi|E+XYi) P(E+XYi) / P(Zi)] P(E−Yi|Yi) P(Yi|X, Zi) P(Zi)   (8.2.15)
The term P(E+XYi) is unconditioned and therefore independent of the states of X. Replacing P(E+XYi) with a constant βi and cancelling the terms P(Zi) gives:

P(E−X|X) = Π_i βi Σ_{Yi} Σ_{Zi} P(Zi|E+XYi) P(E−Yi|Yi) P(Yi|X, Zi)   (8.2.16)
Now, inspecting each term reveals that P(E−Yi|Yi) is a recursive instance of P(E−X|X) excluding X, P(Yi|X, Zi) is a lookup in the conditional probability table (CPT) for Yi, and P(Zij|E+ZijYi) is a recursive instance of the original problem, excluding Yi.
Notice that there is no need to find an explicit value of β since it can be combined
into α in equation 8.2.18 to form a new normalizing constant ξ. To summarize,
the belief for a variable X under evidence, E, in a singly connected network can be
evaluated as:
P (X|E) = ξπ(X)λ(X) (8.2.18)
where ξ is a normalizing constant and
π(X) = Σ_U P(X|U) Π_k P(Uk|EUkX)   (8.2.19)

λ(X) = Π_i Σ_{Yi} Σ_{Zi} P(E−Yi|Yi) P(Yi|X, Zi) Π_j P(Zij|EZijYi)   (8.2.20)
Different versions of how to turn the above equation into a general algorithm can be found in the literature. Pearl constructs in [Pearl 88] an object-oriented message passing scheme where the flow of belief is updated by messages sent between adjacent nodes. A recursive formulation is derived by Russell and Norvig in [Russell and Norvig 95].
Identifying the symbols reveals that Y1 = H, Y2 = W, E−H = {h} and E−W = {w}. Since P(¬h|h) = P(¬w|w) = 0 and P(h|h) = P(w|w) = 1, the formula can be reduced, and we obtain P(t|h, w) ≈ 0.48, which is the same result as in equation (8.1.2).
8.3.1 Conditioning
The basic idea in the conditioning approach is to divide the multiply connected
network into several smaller singly connected networks conditioned on a set of in-
stantiated variables. Figure 8.4 shows the two networks created when the boolean
variable M e in figure 8.3 is instantiated. Generally, the number of resulting sub-
trees is exponential to the product over the states of each variable in the cutset. The
cutset is the set of conditioned variables. The problem in this approach is to find
8.3. MULTIPLY CONNECTED NETWORKS 73
Br Sc P (co|Br, Sc)
t t 0.8
t f 0.8
f t 0.8
f f 0.05
Figure 8.4: Conditioning the node M e creates two singly connected networks.
the minimal cutset that divides the original network into singly connected subnets.
Once created, the probability for a variable can be calculated as the weighted sum
over each individual polytree.
A nice side-effect of this technique is that the weighted sum can be used to quickly
calculate an approximate answer. Starting with the largest weight, the system can
compute the probability until a desired level of accuracy is obtained. A simple way
to calculate the accuracy range is to sum over all remaining weights to obtain an upper bound. The lower bound is, of course, the probability calculated so far, since probability values are always positive.
8.3.2 Clustering

Clustering takes the opposite approach to conditioning. Instead of dividing the network into smaller parts, clustering algorithms combine nodes into larger clusters. The variables Br and Sc in the coma example could be collapsed into a compound variable Z = {Br, Sc}. The states of the cluster node become the set of combinations of all included variables. Here, the states of Z are {(br, sc), (br, ¬sc), (¬br, sc), (¬br, ¬sc)}. The clustering transforms the network into a polytree where the inference can be performed as usual. The disadvantage is, of course, that if the network is dense, the compound variables can become intractably large, since the number of states is exponential in the number of collapsed variables. Despite this fact, clustering techniques are considered by many to be the best exact algorithm for most types of non-singly connected networks.
One particularly interesting and currently standard algorithm for clustering networks was originally developed by Lauritzen and Spiegelhalter in [Lauritzen and Spiegelhalter 88]. The method was later improved with a general absorption scheme by Jensen et al. in [Jensen et al 90]. This technique uses properties of the clusters in order to efficiently propagate the flow of belief.
The key issue is the concept of consistent universes. Consider two clusters, V =
{A, B, C} and W = {C, D, E}, in figure 8.5. Now, the probability for the common
variable C can be calculated by summing over all elements except C in both V and
W. Thus,

P(C) = Σ_{A,B} P(A, B, C) = Σ_{D,E} P(C, D, E).   (8.3.22)
If evidence changes the information in V then the above condition can be used to update the probability for W in the following way: initially, let the distributions for V and W be P0(A, B, C) and P0(C, D, E), respectively. Now, suppose that evidence in V changes the distribution to P1(A, B, C). With this new information,
(Figure 8.5: the clusters V = {A, B, C} and W = {C, D, E} connected through the common variable C.)
the probability for the common variable C can be marginalized out of P1(A, B, C) as in equation 8.3.22,

P1(C) = Σ_{A,B} P1(A, B, C).   (8.3.23)
Using the fundamental equation (6.2.5), the term P(C, D, E) can be written P(D, E|C)P(C), where the term P0(D, E|C) can be calculated from the initial distribution as P0(C, D, E)/P0(C). This gives the update

P1(C, D, E) = (P0(C, D, E) / P0(C)) · P1(C).   (8.3.26)
The scheme above is called absorption; the cluster W has absorbed from V. In general terms the absorption process can be described as follows:

Definition 8.3.1 Let V and W be clusters of variables and let S be the set of their common variables, that is S = V ∩ W. Let ψ0V, ψ0W and ψ0S be the belief tables associated with each cluster. The absorption procedure is defined by the following steps:
V:                       W:
A    B    ψ0V(A, B)      B    C    ψ0W(B, C)
a1   b1   0.05           b1   c1   0.1
a1   b2   0.6            b1   c2   0.3
a2   b1   0.3            b2   c1   0.4
a2   b2   0.05           b2   c2   0.2
for W can be calibrated to V by letting W absorb from V. The new belief table for the separator S = {B} is

ψ1S = Σ_A ψ0V = (ψ0V(a1, b1) + ψ0V(a2, b1), ψ0V(a1, b2) + ψ0V(a2, b2)) = (0.05 + 0.3, 0.6 + 0.05) = (0.35, 0.65).

Finally, ψ0W can be updated (with the initial separator table ψ0S = (0.4, 0.6)):

ψ1W = ψ0W · (ψ1S / ψ0S)
    = ψ0W · ((0.35, 0.65) / (0.4, 0.6))
    = ψ0W · (0.875, 1.083)
    ≈ (0.1 · 0.875, 0.3 · 0.875, 0.4 · 1.083, 0.2 · 1.083)
    ≈ (0.0875, 0.2625, 0.4333, 0.2167),

where the b1 entries of ψ0W are scaled by 0.875 and the b2 entries by 1.083.
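The numeric example can be checked with a few lines of code (ours), letting W absorb from V over the separator S = {B}, with ψ0S = (0.4, 0.6) as above.

import numpy as np

psi_V = np.array([[0.05, 0.6],   # rows a1, a2; columns b1, b2
                  [0.3, 0.05]])
psi_W = np.array([[0.1, 0.3],    # rows b1, b2; columns c1, c2
                  [0.4, 0.2]])
psi_S0 = np.array([0.4, 0.6])    # initial separator table over B

psi_S1 = psi_V.sum(axis=0)                    # marginalize A out: (0.35, 0.65)
psi_W1 = psi_W * (psi_S1 / psi_S0)[:, None]   # scale the b1 and b2 rows
print(psi_S1, psi_W1.round(4))                # (0.0875, 0.2625, 0.4333, 0.2167)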
(Figure: a CPN and its moral graph, obtained by connecting parents of a common child and dropping the link directions.)
(1) Let i = |V|.

(2) Let X be the ith node (using the order above) and let P be the set of neighbours of X with numbers < i.

(3) Add links between any two nodes in P that are not already connected.

(4) Let i = i − 1.

(5) Continue from (2) until i = 0.
The triangulation of the graph ensures that the final cluster tree will fulfill the second property above, see figure 8.9 for an example. Notice that the triangulation of a graph is not unique. There are several steps in the algorithm where arbitrary choices can be made. Intuitively, the best triangulation is the one that yields minimum fill-in. However, in this case the optimal triangulation is concerned with the size of the final junction tree. Unfortunately, finding the optimal triangulated graph is NP-hard. See Arnborg et al. in [Arnborg et al 87] and further discussion by Jensen et al. in [Jensen and Jensen 94].
Next, the junction graph can be constructed by identifying the cliques in the moral and triangulated graph. Links between clusters are added by connecting clusters with a non-empty intersection, see figure 8.7. The intersection between two clusters is called a separator, denoted S. Finally, the junction tree can be found in the
(Figure 8.7: the triangulated graph and the corresponding junction graph, with cliques ABC, BCD, CDE and DEF connected through separators such as BC, CD, C, D and DE.)
junction graph by finding a maximum spanning tree (Jensen [Jensen and Jensen 94]), where the weight of a link is represented by the number of variables in the separator, i.e. |S|. Figure 8.8 shows a junction tree found in a junction graph.
(Figure 8.8: a junction tree (right) found in the junction graph (left) as a maximum spanning tree over the separator sizes.)
(Figure: a CPN, its moral graph and the resulting cluster graph with clusters AB, AC, BD, CE and DEF.)
(i) Give all nodes (clusters) and separators a table of ones, i.e. ψ 0 = 1.
(ii) For each variable, A, choose one cluster, C, containing A ∪ PA and multiply
P (A|PA ) (the CPT) with ψC0 .
8.3.2.6 Evidence
Entering observed evidence in a junction tree is easy. Evidence is normally of the form A = aj. Semantically, this means that the probability of all other states is 0: P(A) = {0, . . . , 1, 0, . . .}, with a "1" in the jth position. The same can be done for a cluster of variables: let E = {0, . . . , 1, 0, . . .} be a finding on A, and multiply E with the belief table for any cluster containing A.
(iii) Let the parent absorb from the node: call Absorb(parent, node) (unless the parent is the root of the tree).

TopDown(node, parent)
When the junction tree is consistent, the belief for a single variable can be computed by marginalization:

P(A) = Σ_{C\A} ψC,   (8.3.29)
(Figure: message passing in the junction tree. BottomUp(R) lets the root R absorb recursively from the leaves; TopDown(R) then distributes the updated tables back outwards.)
(iv) Construct initial belief tables for each node and separator in the junction tree.
(vi) Select a root node, R (any node in the junction tree can act as root).
Note that steps (i) to (iii) are static tasks; there is no need to redo this process unless the structure of the network is changed.
(Figure: the original BN over Me, Sc, Br and Co, its moral graph, and the junction tree with clusters MSB and SBC connected by the separator SB.)
term P(Co|Sc, Br) is multiplied with ψ0C2. The initial belief table for the separator remains 1; the tables for C1 and C2 are shown in table 8.3.

Notice that these two clusters are not consistent. To make them consistent the absorption process must be applied. Suppose that C1 is selected as root in the tree. A call to BottomUp() will cause C1 to absorb from C2. In this absorption nothing is changed, however, since the separator table ψ1S will remain equal to 1 when marginalized out of ψ0C2. This is of course not a coincidence, since ψ0C2 at this stage is equal to P(Co|Sc, Br). Next, the call to TopDown() will force C2 to absorb from C1. The new belief table for the separator S is shown in table 8.4. Finally, when the junction tree is globally consistent it is possible to calculate the probability for every individual variable. For instance, the probability for coma can be marginalized out of ψ2C2:

P(co) = Σ_{Sc,Br} ψ2C2 = 0.032 + 0.224 + 0.032 + 0.032 = 0.32.

(Here ψ0C2 was updated to ψ1C2 by multiplying ψ2S / ψ1S with ψ0C2; the result is shown in the last column of table 8.3.)
Table 8.4: The new belief table for the separator S.

Br   Sc   ψ2S
y    y    0.04
y    n    0.04
n    y    0.28
n    n    0.64
P (A)P (T |A)P (E|T, L)P (X|E)P (L|S)P (B|S)P (D|E, B)P (S) (8.4.30)
To compute the unconditional probability of, say, dyspnea it is possible to sum over
which requires substantially fewer computations. The SPI approach is to find the optimal factoring so that the necessary calculations in the joint distribution are kept to a minimum. This problem is closely related to the standard optimal factoring problem, OFP, which is believed to be NP-hard. However, recently some heuristic search algorithms have been developed which appear to find good factoring solutions. Also, computations can be saved using cache techniques, to avoid recomputing values that are already calculated in a previous step in the summation. For example, the term Σ_S P(L|S)P(B|S)P(S) above is unchanged during the summation over the variables A, T, E and B and can therefore be saved in a cache memory after the first computation, and then be accessed when needed during the remaining computations.
Since any conditional query P(X|Y) can be rewritten as P(X, Y)/P(Y), the algorithm is not restricted to handling conjunctive queries.
A factor is a subset of the complete probability distribution. Each factor contains
a set of variables, which affect the distribution. For example, the factor P (B|A)
includes the variables {A, B} and combining this factor with the factor P (A) yields
a conformal factor, P (B|A)P (A), with the same set of variables as P (B|A).
Now, let Q be the set of target variables; a good factoring for P(Q) can be found in the following way:
(i) First, find the relevant nodes in the original BN. This can be done using the d-separation property to exclude parts of the network which have no relevance to the current query. A linear time algorithm to find this subtree can be found in [Geiger et al 89].
(ii) Let F be a factor set which contains all factors to consider in the next com-
putation, and let C be the set of factor candidates. Initially, let F be all
distributions from the subtree and let C be empty.
(iii) Combine the factors in F pairwise and add to the candidate set C all pairs in which one factor contains a variable which is a parent or a child of at least one variable in the other factor. There is no need to add pairs which are already in C.
(iv) Let Ui be the set of variables for each combination in C. Compute vars(Ui ), the
number of variables in Ui excluding any target variable: vars(Ui ) = |Ui \ Q|.
For each combination in C, compute sum(Ui ), which is the number of variables
that can be summed out when two factors are combined. A variable can be summed out when it appears neither in the set of target variables, Q, nor in any of the other factors in F (excluding those in the current pair). Compute the result size as vars(Ui) − sum(Ui).
(v) Select the best candidate in C in the following way: choose the element in C with the lowest result size. If more than one element applies, choose the one of these with the largest number of variables (including target variables). If there is still more than one candidate, choose one of these arbitrarily.
(vi) Construct a new factor by combining the chosen pair into a conformal factor.
Update F by replacing the two chosen factors with the new combined one.
Update C by deleting any pair which has a non-empty (factor-level) intersection
with the above chosen factor.
(vii) Continue from step (iii) until only one factor remains in F. This is the resulting
factor.
Finally, use the resulting factor above to compute an answer for the conjunctive
probability.
Table 8.5: Candidate combinations in the first loop.

C:            (fMe, fBr)   (fMe, fSc)   (fMe, fCo)       (fBr, fSc)    (fBr, fCo)       (fSc, fCo)
U:            Br, Me       Me, Sc       Me, Br, Sc, Co   Me, Br, Sc    Me, Br, Sc, Co   Me, Br, Sc, Co
sum(U):       0            0            0                0             0                0
vars(U):      2            2            2                3             3                3
result size:  2            2            2                3             3                3
In this presentation all variables are assumed to be binary, i.e., they all have two
states. If the number of states is not equal for all variables, it is possible to compute
the size of a factor as the product over the states for every included variable instead
of just considering the number of variables.
According to Li and D’Ambrosio in [Li and D’Ambrosio 94] the set-factoring SPI
algorithm is superior to Jensen’s algorithm for most kinds of networks.
Loop 1: The factor set F is initialized to {fMe, fBr, fSc, fCo} and C is empty. Every factor in F is then pairwise combined and added to C. The result is shown in table 8.5.

Here, the best combinations are candidates no 1 and 2, since both have minimum result size and an equal number of variables. Candidate no 1, (fMe, fBr), is chosen to replace the factors in F, which is updated to {(fMe, fBr), fSc, fCo}. After deleting every factor with a non-empty intersection with (fMe, fBr), the candidate set C becomes {(fSc, fCo)}.
Loop 2: Adding combinations from F gives C = {((fMe, fBr), fSc), ((fMe, fBr),
fCo), (fSc, fCo)}. The best combination is ((fMe, fBr), fSc), in which the
variable Me can be summed out. F is updated to {((fMe, fBr), fSc), fCo}
and C becomes empty.
Loop 3: The candidate set C becomes {(((fMe, fBr), fSc), fCo)}, so the only
combination to choose is (((fMe, fBr), fSc), fCo). Both of the variables Br
and Sc can be summed out. F is updated to {(((fMe, fBr), fSc), fCo)}, which
fulfills the termination condition in step (vii) above.
Thus, the factoring result is:
P(Co) = Σ_{Br,Sc} P(Co|Br,Sc) Σ_{Me} P(Sc|Me) P(Br|Me) P(Me).    (8.4.32)
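As a sanity check on (8.4.32), the following sketch evaluates both the factored
expression and the brute-force sum over all atomic configurations and confirms that
they agree. The CPT numbers are hypothetical, since the notes do not specify them
for this network.

    import itertools

    # Hypothetical CPTs for the binary variables Me, Br, Sc, Co
    P_Me = {1: 0.1, 0: 0.9}                                           # P(Me)
    P_Br_Me = {(1, 1): 0.8, (0, 1): 0.2, (1, 0): 0.1, (0, 0): 0.9}    # P(Br|Me), key (br, me)
    P_Sc_Me = {(1, 1): 0.7, (0, 1): 0.3, (1, 0): 0.05, (0, 0): 0.95}  # P(Sc|Me), key (sc, me)
    on = {(1, 1): 0.9, (1, 0): 0.6, (0, 1): 0.5, (0, 0): 0.01}        # P(Co=1|Br,Sc)
    P_Co_BrSc = {(1, br, sc): p for (br, sc), p in on.items()}        # key (co, br, sc)
    P_Co_BrSc.update({(0, br, sc): 1 - on[(br, sc)]
                      for br in (0, 1) for sc in (0, 1)})

    def p_co(co):
        # inner sum over Me first, then the outer sum over Br and Sc, as in (8.4.32)
        inner = {(br, sc): sum(P_Sc_Me[(sc, me)] * P_Br_Me[(br, me)] * P_Me[me]
                               for me in (0, 1))
                 for br in (0, 1) for sc in (0, 1)}
        return sum(P_Co_BrSc[(co, br, sc)] * inner[(br, sc)]
                   for br in (0, 1) for sc in (0, 1))

    def p_co_brute(co):
        # direct sum over all atomic configurations
        return sum(P_Me[me] * P_Br_Me[(br, me)] * P_Sc_Me[(sc, me)]
                   * P_Co_BrSc[(co, br, sc)]
                   for me, br, sc in itertools.product((0, 1), repeat=3))

    assert abs(p_co(1) - p_co_brute(1)) < 1e-12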
(vii) etc . . .
(viii) When a value has been sampled for all unobserved variables, restart with X1
and repeat the process until sufficiently many cases have been generated.
To avoid bias from the initial configuration (which can be very unlikely) it is common
to discard the first 5-10 percent of the generated samples. This is called “burn-in”.
One problem with this kind of logic sampling is that the process can get stuck
in certain areas of the state space. There may exist an equally likely area, but in
order to reach it, some variable has to take a highly unlikely value. Another
problem is that estimating very unlikely events can be very time-consuming.
Finally, selecting a valid starting configuration can be very tedious; in the
general case it is in fact an NP-hard problem!
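A minimal sketch of such a sampler with burn-in is given below. The helper
cond(i, state), which should return P(X_i = 1 | all other variables), is a
hypothetical interface that would have to be derived from the network at hand.

    import random

    def gibbs(cond, n_vars, n_samples, burn_in_frac=0.1, start=None):
        # cond(i, state) must return P(X_i = 1 | all other variables in state)
        state = list(start) if start is not None else [random.randint(0, 1)
                                                       for _ in range(n_vars)]
        samples = []
        for _ in range(n_samples):
            for i in range(n_vars):        # resample X_1, ..., X_n in turn
                state[i] = 1 if random.random() < cond(i, state) else 0
            samples.append(tuple(state))
        return samples[int(burn_in_frac * n_samples):]   # discard the burn-in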
• The noisy-Or model can be used to avoid a parameter explosion when a vari-
able has many parents.
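Recall the idea: each present parent independently fails to cause the effect with an
inhibitor probability q_i, so the effect is absent with probability Π q_i over the
present parents, and only one parameter per parent is needed instead of 2^n table
entries. A minimal sketch (the parent names are invented for illustration):

    def noisy_or(q, present):
        # q[u]: probability that parent u's influence is inhibited when u is present
        p_effect_absent = 1.0
        for u in present:                  # parents that are actually present
            p_effect_absent *= q[u]
        return 1.0 - p_effect_absent       # P(effect | present parents)

    print(noisy_or({"flu": 0.2, "cold": 0.6}, ["flu", "cold"]))  # 1 - 0.2*0.6 ≈ 0.88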
EXERCISES
[Figure: a Bayesian network over the nodes A, S, T, L, B, E, X and D; the directed
edges of the original diagram are not reproducible here.]
III.3 (Software required) The Monty Hall puzzle gets its name from an Amer-
ican TV game show, ”Let's Make a Deal”, hosted by Monty Hall. In this show,
you have the chance to win a prize if you are lucky enough to find it behind
one of three doors. The game goes like this: first you select one of the three
doors. Monty Hall, who knows which door hides the prize, then opens one of
the two remaining doors, always one without the prize behind it. Finally, you
are offered a second selection: stay with your original door, or switch to the
other unopened door.
The problem of the puzzle is: what should you do at your second selection?
Some would say that it does not matter, because the prize is equally likely to
be behind either of the two remaining doors. This, however, is not quite true.
Build a Bayesian network to conclude which action gives the highest probability
of winning; a small enumeration sketch for checking your answer follows after
the list below.
Here is some help to get you started:
The Monty Hall puzzle can be modeled with three random variables: Prize,
First Selection, and Monty Opens.
– Prize represents the information about which door contains the prize.
This means that it has three states: ”Door 1”, ”Door 2”, and ”Door 3”.
– First Selection represents your first selection. This variable also has the
three states: ”Door 1”, ”Door 2”, and ”Door 3”.
– Monty Opens represents Monty Hall's choice of door once you have made
your first selection. Again, we have the three states: ”Door 1”, ”Door
2”, and ”Door 3”.
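If you want to check your network afterwards, the following enumeration computes
the winning probabilities of the two actions under the usual modelling assumptions
(Prize and First Selection uniform, Monty choosing uniformly among the doors he is
allowed to open). The sketch is only a cross-check, not a substitute for building
the network:

    from fractions import Fraction
    from itertools import product

    stay = switch = Fraction(0)
    for prize, first in product(range(3), repeat=2):     # both uniform, p = 1/9
        allowed = [d for d in range(3) if d != first and d != prize]
        for monty in allowed:                            # Monty picks uniformly
            p = Fraction(1, 9) * Fraction(1, len(allowed))
            if prize == first:
                stay += p        # staying with the first selection wins
            else:
                switch += p      # the other unopened door hides the prize
    print(stay, switch)          # prints the two winning probabilities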
(p ∨ q) ∧ ¬p ∧ ¬q (9.0.1)
Under which assumptions is the noisy-Or model valid? Do you think the
noisy-Or model is appropriate to use in this particular application?
III.6 Given a discrete Bayesian network, B = {X1 , . . . , Xn }, an atomic con-
figuration is a specific assignment to each individual variable, i.e. X1 =
x1 , . . . , Xn = xn .
(a) Explain why the probabilities of the atomic configurations must sum to
one, i.e.
Σ_{X1,...,Xn} P(X1 , . . . , Xn ) = 1.    (9.0.2)
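A hint for (a): by the chain rule, the joint distribution factors into the CPTs of
the network, and summing out the variables innermost-first makes each CPT telescope
to 1. The toy check below illustrates this on a hypothetical two-node network A → B:

    from itertools import product

    P_A = {1: 0.3, 0: 0.7}                                         # P(A)
    P_B_A = {(1, 1): 0.9, (0, 1): 0.1, (1, 0): 0.2, (0, 0): 0.8}   # P(B|A), key (b, a)
    total = sum(P_A[a] * P_B_A[(b, a)] for a, b in product((0, 1), repeat=2))
    print(total)   # 1.0 (up to rounding): the sum over B telescopes, then the sum over A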
[AboaFuzz 95] AboaFuzz 1.0 User Manual (P. Eklund, M. Fogström, S. Olli), Åbo
Akademi University, 1995.
[Baker 87] J. E. Baker. Reducing Bias and Inefficiency in the Selection Algorithm, In
J. J. Grefenstette, editor, Genetic Algorithms and their Applications: Proceedings
of the Second International Conference on Genetic Algorithms, pages 14-21, 1987.
[Ben-Ari 93] M. Ben-Ari, Mathematical Logic for Computer Science, Prentice Hall,
1993.
[Bezdek 74] J. C. Bezdek, Cluster Validity with fuzzy sets, Journal of Cybernetics,
3 (1974), 58-73.
[Bezdek 80] J. C. Bezdek, A Convergence Theorem for the Fuzzy ISODATA Cluster-
ing Algorithms, IEEE Transactions on Pattern Analysis and Machine Intelligence
2 (1980), 1-8.
[Bezdek 81] J. C. Bezdek, Pattern Recognition with Fuzzy Objective Function Algo-
rithms, Plenum Press, 1981.
[Buckles 97] B. Buckles, Seminar Course: Evolutionary Computation - Lecture 3,
published on the WWW, address http://www.eecs.tulane.edu/www/Buckles.Bill/ec.html,
1997.
[Choe and Jordan 92] H. Choe, J. Jordan, On the Optimal Choice of Parameters
in a Fuzzy C-Means Algorithm, Proc. IEEE International Conference on Fuzzy
Systems, San Diego, 349-354, 1992.
[Cooper and Herskovits 92] G. F. Cooper, E. Herskovits, A Bayesian method for the
induction of probabilistic networks from data, Machine Learning 9 (1992), 309-347.
[Davis 87] L. Davis (ed.), Genetic Algorithms and Simulated Annealing, Pitman,
London, 1987.
[Eklund and Klawonn 92] P. Eklund, F. Klawonn, Neuro Fuzzy Logic Programming,
IEEE Transactions on Neural Networks, Vol 3, No. 5, September 1992, 815-818.
[Eklund and Zhou 96] P. Eklund, J. Zhou, Comparison of Learning Strategies for
Adaptation of Fuzzy Controller Parameters, J. Fuzzy Sets and Systems, to appear.
[Everitt 74] B. S. Everitt, Cluster Analysis, John Wiley & Sons, 1974.
[Fullér 95] R. Fullér, Neural Fuzzy Systems, Meddelanden från ESF vid Åbo
Akademi, Serie A:443, 1995.
[Geisser 75] S. Geisser, The predictive sample reuse method with applications, J.
Amer. Stat. Assoc. ? (1975), xx-xx.
[Höhle 89] U. Höhle, Monoidal Closed Categories, Weak Topoi, and Generalized
Logics, preprint, 1989.
[Holland 92] J. Holland, Adaptation in Natural and Artificial Systems, The MIT
Press, Cambridge Massachusetts, London, England, 1992.
[Huber 96] B. Huber, Fibre Optic Gyro Application on Autonomous Vehicular Nav-
igation, PhD thesis, University of Strasbourg, 1996.
[Jain and Dubes 88] A. Jain, R. Dubes, Algorithms for Clustering Data, Prentice
Hall, 1988.
[Jang 92] J.-S. R. Jang, Self-learning fuzzy controllers based on temporal back prop-
agation, IEEE Trans. Neural Networks 3 No 5 (1992), 714-723.
[Moody 94] J. Moody, Prediction risk and architecture selection for neural networks,
In: V. Cherkassky, J. H. Friedman, H. Wechsler (Eds.), From Statistics to Neural
Networks: Theory and Pattern Recognition, NATO ASI Series F, Springer-Verlag,
1994.
[Moody and Utans 95] J. Moody, J. Utans, Architecture selection strategies for neu-
ral networks: Application to corporate bond rating predictions, In: A.-P. Refenes
(Ed.), Neural Networks in the Capital Markets, John Wiley & Sons, 1995, 277-
300.
[Olesen 93] K. G. Olesen, Causal probabilistic networks with both discrete and con-
tinuous variables, IEEE Transactions on Pattern Analysis and Machine Intelli-
gence 3 (1993).
[Olli 95] S. Olli, Fuzzy Control for AGNES, unpublished notes, Åbo Akademi
University, 1995.
[Riissanen and Eklund 96] T. Riissanen and P. Eklund, Working within a Fuzzy
Control Application Development Workbench: Case Study for a Water Treatment
Plant, Proc. EUFIT’96, 4th European Congress on Intelligent Techniques and
Soft Computing, Aachen, 1142-1145, 1996.
[Russell and Norvig 95] S. Russell, P. Norvig, Artificial Intelligence – a Modern Ap-
proach, Prentice-Hall International, 1995.
[Schweizer and Sklar 61] B. Schweizer, A. Sklar, Associative functions and statisti-
cal triangle inequalities, Publicationes Mathematicae Debrecen 8 (1961), 169-186.
[Shao 88] S. Shao, Fuzzy Self-Organizing Controller and its Application for Dynamic
Processes, Fuzzy Sets and Systems 26 (1988), 151-164.
[Smith and Kelleher 88] B. Smith, G. Kelleher (eds.), Reason Maintenance Sys-
tems and their Applications, Ellis Horwood series in Artificial Intelligence, Ellis
Horwood Limited, 1988.
[Takagi and Sugeno 85] T. Takagi, M. Sugeno, Fuzzy Identification of Systems and
Its Applications to Modeling and Control, IEEE Transactions on Systems, Man
and Cybernetics 15 (1985), 116-132.
[Umano 87] M. Umano, Fuzzy-Set Prolog, Second IFSA Congress, Tokyo, 1987,
750-753.
[Wang and Mendel 92] L. X. Wang, J. M. Mendel, Fuzzy Basis Functions, Universal
Approximation, and Orthogonal Least-Squares Learning, IEEE Trans. Neural
Networks 3, No. 5, September 1992, 807-813.
[Windham 82] M. Windham, Cluster Validity for the Fuzzy c-Means Clustering Al-
gorithm, IEEE Transactions on Pattern Analysis and Machine Intelligence, 4
(1982), 357-363.
[Yager 80] R. R. Yager, On a General Class of Fuzzy Connectives, Fuzzy Sets and
Systems 4 (1980), 235-242.
[Yager 96b] R. R. Yager, Constrained OWA Aggregation, Fuzzy Sets and Systems
81 (1996), 89-101.
[Zadeh 65] L. A. Zadeh, Fuzzy Sets, Information and Control 8 (1965), 338-353.
[Zadeh 75] L. A. Zadeh, The Concepts of a Linguistic Variable and its Application
to Approximate Reasoning, Information Science, 8 (1975), 199-249.
[Zadeh 78] L. A. Zadeh, Fuzzy Sets as a Basis for a Theory of Possibility, Fuzzy
Sets and Systems 1 (1978), 3-28.
[Zadeh 89] L. A. Zadeh, The coming age of fuzzy logic, plenary talk at 3rd IFSA,
Seattle, August 6-11, 1989.
[Zhou and Eklund 95] J. Zhou, P. Eklund, Some Remarks on Learning Strategies for
Parameter Identification in Rule Based Systems, Proc. EUFIT’95, 3rd European
Congress on Intelligent Techniques and Soft Computing, Aachen, 1911-1916, 1995.
[] Discussions with Veli Kairisto, Turku University Central Hospital, on the diag-
nostic problem and the TUCH data set for acute myocardial infarction, Spring
1996.