COMPUTATIONAL INTELLIGENCE
Umeå University
Department of Computing Science
S-901 87 Umeå
Sweden
These lecture notes were originally prepared for the course in Computational In-
telligence held in Spring 1998 at the Department of Computing Science at Umeå
University.
Text contributions for this edition have been provided by Jens Bohlin, Patrik Eklund
and Tony Riissanen.
The authors
Contents

II FUZZY SYSTEMS

3 Fuzzy Control
  3.1 Fuzzy Controllers
    3.1.1 Fuzzy rules
    3.1.2 Fuzzy rule bases
    3.1.3 Overview of fuzzy controller structures
  3.2 Inference in Fuzzy Controllers
    3.2.1 Mamdani's method
    3.2.2 Takagi-Sugeno's method
  3.3 Defuzzification

4 Fuzzy Clustering
  4.1 Data Clustering
  4.2 Fuzzy c-Means Clustering
  4.3 Identification of Rules
  4.4 Geometric Fuzzy Clustering
  4.5 Applications

7 Bayesian Networks
  7.1 A Network Model
  7.2 Conditional Independence
    7.2.1 Linear
    7.2.2 Diverging
    7.2.3 Converging
    7.2.4 d-separation
    7.2.5 A notation for conditional independence
  7.3 Finding the Probabilities
    7.3.1 Disjunctive Interaction
Chapter 1
Many-Valued Logic
A crisp subset A of a universe X can be described by its characteristic function

µA : X → {0, 1},

where µA(x) = 1 is read as

"x is in A".

This function can also be interpreted as a relation consisting of ordered pairs (x, µA(x)).

In a first step, a fuzzy set can be seen as an extension of characteristic functions, i.e. a fuzzy set µ can be defined mathematically by assigning to each possible individual in the universe of discourse a value, µ(x), representing its grade of membership in the fuzzy set µ. This grade corresponds to the degree to which that individual is similar to or compatible with the concept represented by the fuzzy set. Often a fuzzy set is thus given by a membership function

µA : X → I,

where I is the unit interval [0, 1], and µA(x) is read as

"x is in A"

or even

"x is A"

if it is desirable to speak of compatibility with the concept A rather than membership.
If X = {x1, . . . , xn} is a finite set and A a fuzzy subset of X, a more relational notation for A would be

A = µA(x1)/x1 + · · · + µA(xn)/xn,

where µA(xi)/xi contains respective grades of membership and + should be seen as a union. This notation is very informal and certainly not algebraic in any sense.
Example. Suppose we want to define a fuzzy set of natural numbers ”close to 4”
(see Figure 1.1). This can be given e.g. as
A = {(1, 0.0), (2, 0.2), (3, 0.6), (4, 1.0), (5, 0.6), (6, 0.2), (7, 0.0)}.
(Figure 1.1: the membership grades of "close to 4" plotted over 1, . . . , 7.)
The above definition of a fuzzy set is a typical situation where the relational style
is convenient.
Example. A fuzzy set A defining ”normal room temperature” can be given as
µA(x) =
  0,               if x < 16°C
  (x − 16°C)/2°C,  if 16°C ≤ x < 18°C
  1,               if 18°C ≤ x ≤ 22°C
  (24°C − x)/2°C,  if 22°C < x ≤ 24°C
  0,               if x > 24°C.
As can be seen, temperatures below 16◦ C or above 24◦ C are not to any degree
considered to be normal.
Example. A fuzzy set B defining ”high moisture rates” can be given as
µB(x) =
  0,              if x < 30%
  (x − 30%)/20%,  if 30% ≤ x ≤ 50%
  1,              if x > 50%.
Correspondingly, we have functions with open left shoulders, L : X → [0, 1], defined
by
L(x; α, β) =
  1,                 x < α
  (β − x)/(β − α),   α ≤ x ≤ β
  0,                 x > β.
(Figures: the open left shoulder L-function over parameters α, β; the corresponding open right shoulder; the triangular Λ-function with parameters α, β, γ; a linguistic partition NB, NS, ZO, PS, PB over [−40, 40]; and the trapezoidal Π-function with parameters α, β, γ, δ.)
The level between β and γ is sometimes called a plateau (see Figure 1.8).
From an implementation point of view it is obviously advantageous to consider Γ, L and Λ as special cases of Π. This is possible if the universe of discourse is a bounded interval. Suppose the universe of discourse is [−10, 10]. Then, e.g., Γ(x; α, β) = Π(x; α, β, 10, 10).

Another frequently used membership function is the Gaussian,

G(x; α, β) = e^(−β(x−α)²),

where α is the midpoint and β reflects the slope value. Note that β must be positive, and that the function never reaches zero.
The Gaussian function can also be extended to have different left and right slopes. We then have three parameters in

G(x; α, βl, βr) =
  e^(−βl(x−α)²),  x ≤ α
  e^(−βr(x−α)²),  x > α,
Another common choice is the S-shaped (sigmoidal) function

σ(x; α, β) = 1 / (1 + e^(−β(x−α))),

where α is the midpoint and β controls the slope at the inflexion point; in fact β is 4 times the derivative value at the inflexion point. As before, β must be positive. This S-function reaches neither 0 nor 1.
The purpose of this section is to introduce multi-valued logic from a more intuitive and informal point of view, as compared to a strongly algebraically developed theory of multi-valued logic. Generally speaking we will stay more on the syntactic side, rather than diving deeply into semantics. Keep in mind our purpose to present multi-valued logic as a method and technique to support application development. Furthermore, our general viewpoint is that success in applications should guide the search for "the best" understanding of the foundations.

The classical (min/max) connectives on [0, 1] are

¬a = 1 − a
a ∧ b = min{a, b}
a ∨ b = max{a, b}
The intuition for these is obvious, as they clearly reflect worst and best case characterisations. However, this is also a disadvantage, since it means that an outcome might remain unchanged even if we modify some value, e.g. min{0.7, 0.5} is the same as min{0.8, 0.5}. If we desire any change in a or b to be effective we can use e.g. the well-known product connectives

a ∧ b = a · b
a ∨ b = a + b − a · b

or the Łukasiewicz connectives

a ∧ b = max{0, a + b − 1}
a ∨ b = min{a + b, 1}
In the above mentioned connectives the outcome depends only on a and b, i.e.
there are no additional parameters. Adding parameters, however, introduces several
interesting and useful classes of connectives.
Two well-known parametric families are Yager's connectives [Yager 80], with parameter p > 0,

a ∧ b = 1 − min{[(1 − a)^p + (1 − b)^p]^(1/p), 1}
a ∨ b = min{(a^p + b^p)^(1/p), 1},

and Hamacher's connectives [Hamacher 75], with parameter γ ≥ 0,

a ∧ b = a·b / (γ + (1 − γ)(a + b − a·b))
a ∨ b = (a + b − a·b − (1 − γ)·a·b) / (1 − (1 − γ)·a·b).
Example. Let a = 0.6, b = 0.9 and c = 0.8, and take p = 2 (Yager) and γ = 2 (Hamacher):

Definition    a ∧ b                                                (a ∧ b) ∨ ¬c
Łukasiewicz   max{0, 0.6 + 0.9 − 1} = 0.5                          min{0.5 + (1 − 0.8), 1} = 0.7
Yager         1 − min{[(1 − 0.6)² + (1 − 0.9)²]^(1/2), 1} ≈ 0.59   min{[0.59² + (1 − 0.8)²]^(1/2), 1} ≈ 0.62
Hamacher      0.6·0.9 / (2 + (1 − 2)(0.6 + 0.9 − 0.6·0.9)) ≈ 0.52  (0.52 + (1 − 0.8) − (2 − 2)·0.52·(1 − 0.8)) / (1 − (1 − 2)·0.52·(1 − 0.8)) ≈ 0.65
Given fuzzy subsets A and B of X and using the maximum operator for disjunction, the union of A and B is then a fuzzy set µA∪B given by

µA∪B(x) = max{µA(x), µB(x)}.

Similarly, using the minimum operator, the intersection µA∩B is given by

µA∩B(x) = min{µA(x), µB(x)}.
Example. Let A and B be discrete fuzzy subsets of X = {−3, −2, −1, 0, 1, 2, 3}. If

A = {(−3, 0.0), (−2, 0.3), (−1, 0.6), (0, 1.0), (1, 0.6), (2, 0.3), (3, 0.0)}

and

B = {(−3, 1.0), (−2, 0.5), (−1, 0.2), (0, 0.0), (1, 0.2), (2, 0.5), (3, 1.0)},

then

A ∧ B = {(−3, 0.0), (−2, 0.3), (−1, 0.2), (0, 0.0), (1, 0.2), (2, 0.3), (3, 0.0)}

and

A ∨ B = {(−3, 1.0), (−2, 0.5), (−1, 0.6), (0, 1.0), (1, 0.6), (2, 0.5), (3, 1.0)}.

The complements are

¬A = {(−3, 1.0), (−2, 0.7), (−1, 0.4), (0, 0.0), (1, 0.4), (2, 0.7), (3, 1.0)}

and

¬B = {(−3, 0.0), (−2, 0.5), (−1, 0.8), (0, 1.0), (1, 0.8), (2, 0.5), (3, 0.0)}.
(Figures: the discrete fuzzy sets A and B and their pointwise union A ∨ B over {−3, . . . , 3}.)
A mapping

f : X → Y

extends in the usual way to a mapping between powersets,

f : PX → PY.

Strictly speaking we should use another notation for the latter, e.g. Pf, since the mappings are not the same. However, as there is usually no confusion, it is very common to use f in both situations.
Viewing this extension in the fuzzy domain, we have the following obvious question: given a fuzzy subset µA of X, what is the corresponding image f(µA) as a fuzzy subset of Y? A natural approach is to require that

f(µA)(f(x)) = µA(x),

but in such cases we need to define f(µA)(f(x)) also in situations where there are other points x′ for which f(x) = f(x′) but µA(x) ≠ µA(x′). Typically in such a situation we take "the best we have", i.e. f(µA)(f(x)) would be the largest value (or supremum in an infinite case) of all µA(x′) for which f(x) = f(x′). Also we need to specify the values of f(µA)(y) where we cannot find any x such that f(x) = y. In such a situation it is natural to require that f(µA)(y) = 0.
To summarize, the Extension Principle states that
f(µA)(y) =
  ∨_{f(x)=y} µA(x),   if {x ∈ X | f(x) = y} ≠ ∅,
  0,                  otherwise.
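For finite universes the Extension Principle is directly computable. The following sketch (ours) represents a fuzzy set as a dictionary from points to grades and takes the maximum over all preimages; the mapping x mod 3 is chosen only for illustration.

def extend(f, mu_a, codomain):
    # Image of the fuzzy set mu_a under f: for each y, take the largest
    # grade over all x with f(x) = y, and 0 if y has no preimage.
    image = {}
    for y in codomain:
        grades = [mu for x, mu in mu_a.items() if f(x) == y]
        image[y] = max(grades) if grades else 0.0
    return image

# The fuzzy set "close to 4" from the earlier example:
mu_a = {1: 0.0, 2: 0.2, 3: 0.6, 4: 1.0, 5: 0.6, 6: 0.2, 7: 0.0}
print(extend(lambda x: x % 3, mu_a, {0, 1, 2}))
# y = 1 has preimages {1, 4, 7}; the supremum picks the grade 1.0 of x = 4.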
The Extension Principle also applies to mappings of several variables,

f : X^n → Y,

and then we have to define f(µA1, . . . , µAn)(f(x)), x = (x1, . . . , xn), given that we have µA1(x1), . . . , µAn(xn). Now we need to consider worst cases w.r.t. µAi(xi), as the combination is more of a conjunction. But the Extension Principle remains basically the same, i.e.

f(µA1, . . . , µAn)(y) =
  ∨_{f(x)=y} min_i {µAi(xi)},   if {x ∈ X^n | f(x) = y} ≠ ∅,
  0,                            otherwise.
Addition of fuzzy numbers, for instance, is obtained by extending the mapping

+ : R × R → R.

The Cartesian product of fuzzy sets A1, . . . , An can be defined either by the minimum or by the algebraic product,

µA1×···×An(u1, u2, · · · , un) = µA1(u1) · µA2(u2) · · · · · µAn(un).

The composition of fuzzy relations R (on U × V) and S (on V × W) is

R ◦ S = {[(u, w), sup_v (µR(u, v) ∗ µS(v, w))] | u ∈ U, v ∈ V, w ∈ W},

where ∗ could be any t-norm, e.g. minimum, algebraic product, bounded product or drastic product.
In a general form, a compositional operator may be expressed as the sup-star composition, where "star" denotes an operator, e.g. min, product, etc. In the literature, four kinds of compositional operators are used in the compositional rule of inference:
• Sup-min operation,
• Sup-product operation,
• Sup-bounded-product operation,
• Sup-drastic-product operation.
In FLC applications, the sup-min and sup-product compositional operators are the most frequently used.
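For finite relations the sup-min composition is a matrix-like computation, with sums replaced by max and products by min. A minimal sketch (ours; the matrices happen to be those appearing in exercise I.9 below):

def sup_min(R, S):
    # (R o S)(i, j) = max over k of min(R(i, k), S(k, j))
    return [[max(min(R[i][k], S[k][j]) for k in range(len(S)))
             for j in range(len(S[0]))]
            for i in range(len(R))]

R = [[0.5, 0.1, 0.1, 0.7],
     [0.0, 0.8, 0.0, 0.0],
     [0.9, 1.0, 0.7, 0.8]]
G = [[0.4, 0.9, 0.3],
     [0.0, 0.4, 0.0],
     [0.9, 0.5, 0.8],
     [0.6, 0.7, 0.5]]
print(sup_min(R, G))
# e.g. entry (x1, z1): max(min(0.5, 0.4), min(0.1, 0), min(0.1, 0.9), min(0.7, 0.6)) = 0.6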
The fuzzy rule in premise 2 above can be put into the simpler form "A × B → C". Intuitively, this fuzzy rule can be transformed into a ternary fuzzy relation R, which is specified by the following membership function:

µR(x, y, z) = µ(A×B)×C(x, y, z) = µA(x) ∧ µB(y) ∧ µC(z).   (1.1)

The resulting C′ is expressed as

C′ = (A′ × B′) ◦ (A × B → C),   (1.2)

i.e. C′ is obtained via the compositional rule of inference.
The interpretation of multiple rules is usually taken as the union of the fuzzy relations corresponding to the fuzzy rules. For example, given the following fact and rules:

premise 1 (fact): x is A′ and y is B′,
premise 2 (rule 1): if x is A1 and y is B1 then z is C1,
premise 3 (rule 2): if x is A2 and y is B2 then z is C2,
consequence: z is C′,

we can use the fuzzy reasoning shown in Figure 1.14 as an inference procedure to derive the resulting output fuzzy set C′.

Figure 1.14: Fuzzy reasoning for multiple rules with multiple antecedents.

Here C1′ and C2′ are the inferred fuzzy sets for rules 1 and 2, respectively. Figure 1.14 shows graphically the operation of fuzzy reasoning for multiple rules with multiple antecedents. Suppose a fuzzy rule base consists of a collection of fuzzy if...then rules in the following form:

Ri(x1, x2, . . . , xm, y) = Ai1(x1) × Ai2(x2) × · · · × Aim(xm) → Ci(y)   (1.6)
We could combine the rules by an aggregation operator Agg into one rule which is used to obtain C′ from A′. The aggregation can be taken as intersection, that is

R(x, y) = ∩_{i=1}^{n} Ri(x, y) = min_i (Ai1(x1) × Ai2(x2) × · · · × Aim(xm) → Ci(y)),

or as union, that is

R(x, y) = ∪_{i=1}^{n} Ri(x, y) = max_i (Ai1(x1) × Ai2(x2) × · · · × Aim(xm) → Ci(y)).

With the union interpretation, the inferred output fuzzy set is

µC′(y) = ∨_{i=1}^{n} {[µA′1(x1) ∧ µAi1(x1)] ∧ [µA′2(x2) ∧ µAi2(x2)] ∧ · · ·} ∧ µCi(y)
       = ∨_{i=1}^{n} {∧_{j=1}^{m} [µA′j(xj) ∧ µAij(xj)]} ∧ µCi(y)
       = ∨_{i=1}^{n} {τi ∧ µCi(y)},

where τi is the firing strength (activation level) of rule i.
Figure 1.15 shows graphically the operation of fuzzy reasoning for a MISO system: for each rule "If x1 is An1 and . . . and xm is Anm then y is Cn", the firing strength τn = µAn1(x1) ∧ µAn2(x2) ∧ · · · ∧ µAnm(xm) clips the consequent to τn ∧ µCn(y).
From the previous definitions we see that there are six interpretations (1-6) of fuzzy implication, and in each interpretation we may employ different t-norms or t-conorms. A fuzzy if...then rule (1.5) can therefore be interpreted in a number of ways, and the output of the fuzzy inference mechanism differs accordingly. For these different types of outputs we can use different defuzzifiers to defuzzify them into a single point in the output space V.
Typical properties required of a fuzzy implication I include, e.g.,

(5) I(p, q) ≥ q,

(10) I is continuous.

The following table gives the most usual implications, which class they belong to and which properties are satisfied [Dubois, Prade 91, Dubois, Lang, Prade 91].
Let p → q represent the rule if p then q, where p can be of the form p1 and . . . and pn. We can then say that:

(i) The truth value connected to p, τ(p), is then of the form τ(p) = τ(p1) ∗ . . . ∗ τ(pn), where ∗ is a t-norm;
Definition Let Γ be a fuzzy set of axioms. The mapping Γ |=: L → [0, 1] is given
by Γ |= P = inf{Υ(P ) | Υ is a valuation w.r.t. Γ}, where inf ∅ = 1.
Chapter 2

Summary and Exercises

EXERCISES
Suppose that their intersection and union are defined by Hamacher's t-norm and t-conorm with γ = 1, respectively. What are then the membership functions of A ∩ B and A ∪ B?
I.3 Show that Yager’s ∧ and ∨ are, respectively, t-norms and co-t-norms.
I.4 Prove that for any t-norm T and any co-t-norm S we have T(a, b) ≤ min{a, b} and S(a, b) ≥ max{a, b}.
I.5 Let µ1, µ2 ∈ Fc(R) (the class of all upper semicontinuous fuzzy sets of R). With the help of the extension principle, for the sum µ1 ⊕ µ2 and the product µ1 ⊙ µ2, we lay down the following:

(µ1 ⊕ µ2)(z) = sup_{x+y=z} min{µ1(x), µ2(y)},
(µ1 ⊙ µ2)(z) = sup_{x·y=z} min{µ1(x), µ2(y)}.
I.6 Let f(x) = x² and let A ∈ F be a symmetric triangular fuzzy number with membership function

A(x) =
  1 − |a − x|/α,  if |a − x| ≤ α
  0,              otherwise.

Then use the extension principle to calculate the membership function of the fuzzy set f(A).
I.8 Consider two fuzzy relations R = "x is considerably smaller than y" and G = "x is very close to y":

R:
       y1    y2    y3    y4
  x1   0.5   0.1   0.1   0.7
  x2   0     0.8   0     0
  x3   0.9   1     0.7   0.8

G:
       y1    y2    y3    y4
  x1   0.4   0     0.9   0.6
  x2   0.9   0.4   0.5   0.7
  x3   0.3   0     0.8   0.5

1. What is the intersection of R and G, i.e. the relation "x is considerably smaller than y AND x is very close to y"?
2. What is the union of R and G, i.e. the relation "x is considerably smaller than y OR x is very close to y"?
I.9 Consider two fuzzy relations R = "x is considerably smaller than y" and G = "y is very close to z":

R:
       y1    y2    y3    y4
  x1   0.5   0.1   0.1   0.7
  x2   0     0.8   0     0
  x3   0.9   1     0.7   0.8

G:
       z1    z2    z3
  y1   0.4   0.9   0.3
  y2   0     0.4   0
  y3   0.9   0.5   0.8
  y4   0.6   0.7   0.5

Compute the composition R ◦ G.
Part II

FUZZY SYSTEMS
Chapter 3
Fuzzy Control
A fuzzy rule has the form

IF ⟨fuzzy criteria⟩ THEN ⟨fuzzy conclusion⟩,

where ⟨fuzzy criteria⟩ and ⟨fuzzy conclusion⟩ either are atomic or compound fuzzy propositions. Such a rule can be seen as a causal relation between measurements and control values of the process. If e and ė are input signals and u̇ an output signal, and further NS, PS and NL are linguistic variables, then

IF e is NS AND ė is PS THEN u̇ is NL

or
if the present deviation of the control value is N S and the latest change
in the deviation of the control value is P S then this should cause the
control value to be N L.
Both antecedents and consequents can involve several linguistic variables. If this is the case, the system is called a multi-input-multi-output (MIMO) fuzzy system. Such systems have several input signals and output signals. Also multi-input-single-output (MISO) systems, with several input signals but only one output signal, are very common. An example of a MISO system is as follows.
R1 : IF x is A1 AND y is B1 THEN z is C1
R2 : IF x is A2 AND y is B2 THEN z is C2
.. ..
. .
Rn : IF x is An AND y is Bn THEN z is Cn .
Here x and y are input signals and z is the output signal. Further, Ai, Bi and Ci are linguistic variables.
• An expert that knows the process provides linguistic rules that are specified given previous knowledge and know-how related to the process.

• The process is described within a fuzzy model, from which control rules can be directly derived. Such methods do not yet exist, and require further research.

• A fuzzy controller is adaptive in the sense that the rule base together with parameters in rules (and possibly in inference mechanisms) are adjusted in real-time, given possibilities for the system to identify itself as being in respectively good or bad states. Some suggestions of related techniques are found e.g. in [Procyk and Mamdani 79], [Shao 88] and [Sugeno 85].
Whatever technique we use, our goal is to construct a number of fuzzy rules with
the following syntax:
Note that this syntax is valid for MISO systems:

Ri: IF x1 is Ai1 AND . . . AND xm is Aim THEN u is Ui.

For rule Ri we have x1, . . . , xm as input signals and Ai1, . . . , Aim as respective linguistic quantifiers of the input signals. The consequent of the rule is "u is Ui".
Example. The following shows an example of fuzziness used for speed control. As input signals we have the actual speed v (km/h) and the load l (N) of the car. Load components are e.g. force F and friction Fµ. As linguistic variables for speed we use LS (low speed), NS (normal speed) and HS (high speed), and for load similarly LL (low load), NL (normal load) and HL (high load). These are typically bell-shaped functions, and as midpoint in NS we use 70 km/h, which also is the constant speed we try to maintain. The choice of precise shape and transposition of membership functions is left open at this point. Low load appears e.g. downhill and high load uphill (see figure 3.1).
(Figure: the structure of the fuzzy controller. The measured signals, e.g. the load F + Fµ and the speed v, are fuzzified, processed by the rule base and the inference mechanism, and finally defuzzified into the control signal u.)
(Figure: Mamdani inference for two rules. The activation levels α1 and α2 are the minima of the antecedent grades A11(x1), A12(x2) and A21(x1), A22(x2), and clip the output sets U1 and U2.)
This is the most common view of fuzzy control. In Mamdani's method, conjunction is given by the minimum operator, implication likewise, and resulting output membership functions are combined using the maximum operator as disjunction. To be more precise, if the activation level of rule i is given by

αi = ∧_{j=1}^{m} Aij(xj),

then rule i contributes the clipped output αi ∧ Ui(u), and the rule contributions are combined by maximum, as in the sketch below.
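The following sketch (ours) carries out Mamdani inference for two rules with two inputs on a discretized output universe; the triangular membership functions and rule parameters are illustrative only.

def tri(x, a, b, c):
    # Triangular membership function with support [a, c] and peak at b.
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

rules = [  # (antecedent A_i1, antecedent A_i2, consequent U_i) as triangle parameters
    ((0, 2, 4), (0, 2, 4), (0, 2, 4)),
    ((2, 4, 6), (2, 4, 6), (2, 4, 6)),
]
x1, x2 = 2.5, 3.0
grid = [i / 10 for i in range(61)]           # discretized output universe [0, 6]
mu_U = [0.0] * len(grid)
for a1, a2, u_mf in rules:
    alpha = min(tri(x1, *a1), tri(x2, *a2))  # activation level of the rule
    for k, u in enumerate(grid):
        # clip the consequent at alpha, combine rules by maximum
        mu_U[k] = max(mu_U[k], min(alpha, tri(u, *u_mf)))
print(max(mu_U))   # height of the combined output fuzzy set

The resulting mu_U can then be defuzzified with any of the methods of section 3.3.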
In Takagi-Sugeno's method the rule consequents are crisp functions of the input signals, typically linear,

ui = pi0 + pi1·x1 + · · · + pim·xm,

where pi0, . . . , pim are constants related to rule i. Methods to specify the constants are discussed in [Takagi and Sugeno 85], including also algorithms for selecting input signals related to respective ui. The final control value is given by

u = (Σ_{i=1}^{n} αi·ui) / (Σ_{i=1}^{n} αi).
Example. Given input signals x1 = 6.5 and x2 = 9.2, and linear output functions u1 = 2 + 1.7x1 + 1.3x2 and u2 = −3 + 0.5x1 + 2.1x2, we obtain u1 ≈ 25.0 and u2 ≈ 19.6. The control value then becomes the weighted average u = (α1u1 + α2u2)/(α1 + α2), where each rule output can be pictured as a singleton at ui with height Ui(ui) = αi.
(Figure: Takagi-Sugeno inference for two rules. The activation levels α1, α2 are obtained by min over the antecedents, the outputs u1, u2 are singletons, and the control value is the weighted average u = Σ αi ui / Σ αi.)
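A small sketch (ours) of the Takagi-Sugeno combination for the example above; since the antecedent membership functions are unspecified, the activation levels below are assumed values for illustration.

x1, x2 = 6.5, 9.2
u1 = 2 + 1.7 * x1 + 1.3 * x2      # approx. 25.0
u2 = -3 + 0.5 * x1 + 2.1 * x2     # approx. 19.6
alpha = [0.4, 0.6]                # assumed activation levels (not given in the text)
u = (alpha[0] * u1 + alpha[1] * u2) / sum(alpha)
print(round(u, 2))                # weighted average of the rule outputs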
3.3 Defuzzification
As a result of inference we obtain a fuzzy set µU of proposed control values. Each
activated rule Ri , i.e. for which the activation level is non-zero, contributes to µU ,
and therefore we obtain the final conclusion as
µU = ∪_{i=1}^{n} µUi.

(Figure: the clipped outputs of the activated rules are combined into the total output fuzzy set µU.)
Centre-of-Gravity, CoG

CoG finds the centre of gravity of µU. In the discrete case, we have

u = (Σ_{k=1}^{l} uk · µU(uk)) / (Σ_{k=1}^{l} µU(uk)).
Indexed Centre-of-Gravity, ICoG

In ICoG we only consider the area of µU that is above a specified level α (see figure 3.8), and compute the centre of gravity for this area. Thus, in the discrete case, we have

u = (Σ_{k=1}^{l} uk · [µU(uk)]α) / (Σ_{k=1}^{l} [µU(uk)]α),

where [µU(uk)]α denotes the part of the fuzzy set µU above the α level.
In the continuous case, we have

u = (∫_{[U]α} v · [µU(v)]α dv) / (∫_{[U]α} [µU(v)]α dv).
Centre-of-Sums, CoS
In CoG we had to consider the whole of µU . CoS is similar to CoG but imple-
mentationally much more efficient. In CoS, we consider all output functions when
computing the sum of all µUi , i.e. overlapping areas may be considered more than
once (see figure 3.9). In the discrete case, we obtain
u = (Σ_{k=1}^{l} uk · Σ_{i=1}^{n} µUi(uk)) / (Σ_{k=1}^{l} Σ_{i=1}^{n} µUi(uk)).
First-of-Maxima, FoM

In FoM, defuzzification of µU is defined as the smallest value in the domain of U with maximal membership value, i.e.

u = min{v ∈ U | µU(v) = max_w µU(w)}.
Middle-of-Maxima, MoM

MoM is similar to FoM. Instead of taking the first value with maximal grade of membership, we compute the average of all values with maximal grades,

u = (min{v ∈ U | µU(v) = max_w µU(w)} + max{v ∈ U | µU(v) = max_w µU(w)}) / 2.

A graphical representation is given in figure 3.11.
Height Method, HM
The height method is not applied directly on µU, but is focused on heights, and computes a weighted sum of these heights. The weighted sum is according to

u = (Σ_{i=1}^{n} µUi_peak · fi) / (Σ_{i=1}^{n} fi),

where µUi_peak is the ui for which µUi(ui) = 1 (where µUi is in its original form). See figure 3.12. For a trapezoidal membership function, µUi_peak becomes an interval (plateau), from which we can select a representative, e.g. the mean value. The value fi is the height of µUi, i.e. max µUi.
HM is computationally both simple and fast.
(Figure 3.12: the height method. The peak positions µU1_peak and µU2_peak are weighted by the heights f1 and f2. A further figure illustrates the CoLA method.)
Example. Defuzzification can lead to undesirable effects. Consider e.g. a car trying to avoid an obstacle directly in front of the car. The membership function representing candidates of control values is then typically as shown in figure 3.14, with two separated peaks (steer left or steer right). Both CoG and CoS, however, will result in driving straight ahead.
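To make the defuzzifiers concrete, here is a sketch (ours) of CoG and MoM on a discretized output membership function; the grid and the grades are illustrative.

def cog(u, mu):
    # Centre-of-Gravity over a discrete domain.
    return sum(uk * m for uk, m in zip(u, mu)) / sum(mu)

def mom(u, mu):
    # Middle-of-Maxima: average of the smallest and largest maximizing points.
    top = max(mu)
    maxima = [uk for uk, m in zip(u, mu) if m == top]
    return (min(maxima) + max(maxima)) / 2

u = [0, 1, 2, 3, 4, 5, 6]
mu = [0.0, 0.2, 0.8, 1.0, 1.0, 0.4, 0.0]
print(round(cog(u, mu), 2), mom(u, mu))   # approx. 3.18 and 3.5

Note that for a bimodal µU, as in the obstacle example, CoG would land between the two peaks.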
Chapter 4

Fuzzy Clustering
(Figure: a crisp 2-partition of a set of data points. Each point has membership 1 in exactly one of the two clusters and 0 in the other.)
(Figure: the corresponding fuzzy 2-partition. Each point has graded membership in both clusters, the grades summing to 1, e.g. 0.47 and 0.53 for a point midway between the clusters.)
To describe the algorithm, we need some notation. The set of all points considered is X = {x1, · · · , xn} (⊂ R^d). We write ui : X → [0, 1] for the ith cluster, i = 1, . . . , c, and we will use uik to denote ui(xk), i.e. the grade of membership of xk in cluster ui. We also use U = ⟨uik⟩ for the matrix of all membership values. The 'midpoint' of ui is vi (∈ R^d), and is computed according to

vi = (Σ_{k=1}^{n} (uik)^m xk) / (Σ_{k=1}^{n} (uik)^m).
The memberships are constrained by

Σ_{i=1}^{c} uik = 1,  k = 1, . . . , n,

and the fuzzy c-means (FCM) algorithm seeks to minimize the objective function

J = Σ_{i=1}^{c} Σ_{k=1}^{n} (uik)^m ‖xk − vi‖².
The following algorithm for FCM clustering will meet this objective: alternately update the midpoints vi according to the formula above and the memberships uik as follows. If µk = ∅, then

uik = 1 / [Σ_{j=1}^{c} (‖xk − vi‖ / ‖xk − vj‖)^(2/(m−1))],

otherwise

uik = 0 for all i ∉ µk, and Σ_{i∈µk} uik = 1,

where µk denotes the set of clusters whose midpoints vi coincide with xk.
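A compact sketch (ours) of the resulting iteration with m = 2; the data points and the random initial midpoints are illustrative, and the degenerate case xk = vi is side-stepped with a small epsilon instead of the µk bookkeeping above.

import numpy as np

def fcm(X, c, m=2.0, iters=50):
    rng = np.random.default_rng(0)
    V = X[rng.choice(len(X), c, replace=False)]            # initial midpoints
    for _ in range(iters):
        # distances d[k, i] = ||x_k - v_i||, guarded against zero
        d = np.linalg.norm(X[:, None, :] - V[None, :, :], axis=2) + 1e-12
        # membership update: u_ik = 1 / sum_j (d_ik / d_jk)^(2/(m-1))
        U = 1.0 / ((d[:, :, None] / d[:, None, :]) ** (2 / (m - 1))).sum(2)
        # midpoint update: v_i = sum_k u_ik^m x_k / sum_k u_ik^m
        V = (U.T ** m @ X) / (U.T ** m).sum(1, keepdims=True)
    return U, V

X = np.array([[0.0, 0.0], [0.1, 0.2], [0.9, 1.0], [1.0, 0.8]])
U, V = fcm(X, c=2)
print(U.round(2)); print(V.round(2))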
(Figure 4.3: the cluster midpoints v1, v2, v3 together with the overall midpoint x̄ of the data.)
A validity measure S(U, c) for the choice of the number of clusters c can be defined in terms of the distances between the data points, the cluster midpoints vi, and the overall midpoint x̄ of all points in X (see figure 4.3). Obviously, we will select that particular c0 for which

S(U, c0) = min_c S(U, c),

i.e. we should iterate the clustering algorithm with c ranging in a suitably selected interval, e.g. 2 to 20 clusters.
Clearly, in many cases, it is not straightforward to find the optimal number of clusters. For a small number of data points in X, a worst case scenario is that the criterion value decreases with increasing c and reaches its optimum with c equal to the number of data points in X. Also intuitively (see figure 4.4), it is not obvious which number of clusters will be better from a practical viewpoint.
Each cluster will now generate one rule in the following way. Consider a fuzzy cluster ui on X, and write πp(ui) : Xp → [0, 1] for the corresponding cluster projected on the pth axis, i.e. πp(ui)(πp(xk)) = uik. Assume we have decided to use Gaussian functions in our generated rule. This means we want to estimate the parameters of Gaussian functions with best fit to respective πp(ui). Thus, for each p we need to find βp such that

Σ_{xk∈X} | uik − e^(−βp(πp(xk)−αip)²) |

is minimized.
(Figure: projecting the third cluster, with midpoint v, onto the coordinate axes yields membership functions A31 and A32 with midpoints α31 and α32.)
4.4 Geometric Fuzzy Clustering

Consider a scatter matrix M related to the ith cluster. If this cluster is elliptic with points being concentrated around the centre point of the ellipse, then the principal axes of the cluster are given by the eigenvectors of M.
4.5 Applications
There is a wide range of applications of clustering. In industrial process control
there are numerous applications, such as fertilizer production ([Riissanen 92b]),
automatic steering (AGNES [Olli 95, Huber 96]), and process state prognostics
([Sutanto and Warwick 94]), only to mention some examples. The reader is referred
to proceedings e.g. related to IEEE and IFSA conferences on fuzzy systems.
Also in pattern recognition and image processing there are typical applications, e.g. in segmentation of colour images ([Lim and Lee 90]), edge detection in 3-D objects ([Huntsberger et al 86]), and surface approximation for reconstruction of images ([Krishnapuram et al 95]).
Chapter 5
EXERCISES
where

SMALL(t) =
  1 − t/4,  if 0 ≤ t ≤ 4
  0,        otherwise

MEDIUM(t) =
  1 − |t − 2|/2,  if 0 ≤ t ≤ 4
  0,              otherwise

BIG(t) =
  1 − (4 − t)/4,  if 0 ≤ t ≤ 4
  0,              otherwise
where

SMALL(t) =
  1 − t/4,  if 0 ≤ t ≤ 4
  0,        otherwise

MEDIUM(t) =
  1 − |t − 2|/2,  if 0 ≤ t ≤ 4
  0,              otherwise

BIG(t) =
  1 − (4 − t)/4,  if 0 ≤ t ≤ 4
  0,              otherwise
use the Mamdani method of inference together with a defuzzification method of your choice to compute the output for the inputs given by x = 2, y = 2 and z = 4.
Part III

PROBABILISTIC COMPUTING
Chapter 6
Introduction
A substantial number of our everyday decisions are made under conditions of un-
certainty. We are often faced with situations where conflicting evidence forces us
to explicitly value observed information in order to make rational decisions. Dif-
ferent methods, such as logical, probabilistic and numerical approaches, have been developed to handle uncertainty in reasoning systems. Bayesian networks (BNs), also known as causal probabilistic networks, belief networks etc., constitute one appealing formalism for describing uncertainty within domains. In recent years, the interest in BNs has increased due to new efficient algorithms and user-friendly software, making BNs available to others than the research community responsible for them.
BNs have been successfully applied in areas such as medical diagnosis, image processing and software debugging, and new applications are reported frequently.
6.2.1 Definitions

The probability of an event a is defined as

P(a) = g(a)/N,   (6.2.1)

where g(a) is the number of positive outcomes for a and N is the total number of outcomes. The number of positive outcomes for a is bounded by

0 ≤ g(a) ≤ N,   (6.2.2)

and hence

0 ≤ P(a) ≤ 1.   (6.2.3)

If A is a random variable with the discrete and exhaustive states {a1, . . . , aj} then

Σ_{i=1}^{j} P(A = ai) = 1.   (6.2.4)
The conditional probability of A given B is defined as

P(A|B) = P(A, B)/P(B),   (6.2.5)

where P(A, B) is the joint probability for A and B, i.e. "A and B". Rewriting equation (6.2.5) gives an expression for the joint probability for dependent variables:

P(A, B) = P(A|B)P(B).   (6.2.6)

A is independent of B if and only if P(A|B) = P(A). Using this fact in equation (6.2.6) gives an expression for independent variables:

P(A, B) = P(A)P(B).   (6.2.7)
Since the phrase "A and B" is equal to "B and A", equation (6.2.5) can be used to formulate:

P(A|B) = P(B|A)P(A)/P(B),   (6.2.10)

which is known as Bayes' rule or Bayes' theorem. This equation makes it possible to calculate the posterior probability, P(A|B), for an event when the opposite condition, P(B|A), is known (and the unconditional probabilities for the individual events).
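As a small numeric sketch (ours, with illustrative values): if P(B|A) = 0.9, P(A) = 0.1 and P(B) = 0.25, then

def bayes(p_b_given_a, p_a, p_b):
    # posterior P(A|B) from likelihood, prior and evidence
    return p_b_given_a * p_a / p_b

print(bayes(0.9, 0.1, 0.25))   # P(A|B) = 0.36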
Chapter 7

Bayesian Networks
(Figure 7.1: the timer T with directed links to H, Holmes' car, and W, Watson's car.)
three random variables: Holmes’ car (H), Watson’s car (W ) and the timer (T ). All
these variables are discrete with two mutually exclusive states each. The cars can
either start or not, and the timer is either working or not.
The causal relationship between these random variables can be expressed by
connecting them pairwise with directed links. Here, the timer affects (through the
power supply) the functionality of the cars, so a link can be drawn from the timer to
each of the two cars, see figure 7.1. The direction of the links is of great importance: the state of the timer (working or not working) affects the cars' ability to start, not the other way around.
In order to complete the network model, a conditional probability table (CPT)
must be specified for each node. Since the variable T is a root node in the network,
i.e. node T has no incoming links, only the unconditional probability of each state
needs to be specified. The probability that the timer will fail, P(t), is 20/365 ≈ 0.05, and the probability that the timer will work, P(¬t), is therefore 1 − P(t) ≈ 0.95. The probability tables for Holmes' and Watson's cars must be specified in the context of the timer. For example, from the text it can be read that the chance that Holmes' car will fail when the timer is working, P(h|¬t), is 2/10 = 0.2. The complete CPTs for Holmes' and Watson's cars are given in table 7.1.
With the above example in mind it is now appropriate to give a formal definition
of Bayesian networks:
(Figure 7.2: the three basic connection types. a) linear, A → B → C; b) diverging, A ← P → B; c) converging, A → C ← B.)
7.2.1 Linear

A linear connection is shown in figure 7.2a. If B is unknown, the probability of B is determined from the status of A. Since C is determined from B, C therefore becomes dependent on A.

If B is known to be in state bi (and therefore unaffected by A), the probability of C can be calculated directly from its probability table, P(C|bi), and C is therefore conditionally independent of A.
7.2.2 Diverging
This case was previously illustrated by the car trouble example. Generally, nodes
from a common parent P are dependent unless there is evidence in P . Evidence in
P blocks the path from A to B, see figure 7.2b.
7.2.3 Converging
The last case to consider is when two or more variables cause the same effect (figure 7.2c). If nothing is known about C, the parents A and B are independent.
For example, it is fairly safe to state that cold and allergy-attack, which both can
cause a person to sneeze, are independent, see fig 7.3 (this example is borrowed from
Henrion [Henrion et al 91]). However, observing a person sneezing makes the two events dependent. If a person sneezes when an allergen (for example, from a cat) is
present, the support for cold is reduced in favour of the allergy-attack theory. This phenomenon is also known as "explaining away". The allergy-attack explains away
the cold. Here, the evidence was observed directly in the converging node S. This
is not necessarily the case. In fact, any observed descendant node from S can act
as transmitter between the parents and make them conditionally dependent.
The conclusion is that a converging node, C, blocks the path between its parents
unless there is evidence in C or any of its descendants.
(Figure: Cold (C) and Allergy-attack (A) converge on Sneeze (S), with Cat as a parent of Allergy-attack.)
Figure 7.3: If a person is sneezing when an allergen is present, the support for cold
is reduced.
7.2.4 d-separation
The three cases above can be used to formulate a separability criterion, called direction-dependent separation or d-separation. d-separation is a very important property that can be used to find efficient inference algorithms. A formal proof can be found in Pearl [Pearl 88].
Definition 7.2.1 Any two nodes, A and B, in a Bayesian network are d-separated,
and therefore conditionally independent, if every path between A and B is blocked
by an intermediate node, V ∉ {A, B}.
V blocks the path if and only if one of the following holds:
(i) the structure is linear or diverging and V is known.
(ii) the structure is converging and neither V nor its descendants are known.
The d-separability criterion will later be used frequently in the presentation of in-
ference methods.
(a) A ⊥⊥ B | P

(b) A ⊥⊥ B (however A ⊥⊥ B | C is false)
Specifying the conditional probability tables can be problematic: the probability values needed can be too many to manage, and with larger sets of parents the CPT quickly becomes unwieldy. For example, a discrete boolean variable with only four boolean parents needs as many as 32 values to complete the CPT. Even with a large database there is a potential risk that some of the combinations are too unusual to provide a reliable estimate. One commonly used model to avoid this problem is the noisy-Or gate derived by Judea Pearl in [Pearl 88].
The model is based on disjunctive interaction, that is, when the likelihood for a particular condition is unchanged when other conditions occur at the same time. For example, if cold, pneumonia and chicken-pox are likely to cause fever, then disjunctive interaction applies when a person suffering from several of these diseases at the same time would only be more likely to develop fever. Furthermore, if the person is also suffering from a disease that is unlikely to cause fever, this additional evidence does not reduce the support for fever caused by the other diseases. There exists a well-founded theory for a disjunctive model if the following assumptions are made:
(i) Boolean variables: All included variables must have two discrete states, namely true and false.

(ii) Accountability: The effect is presumed false if all conditions listed as causes are false (a closed-world assumption).

(iii) Exception independence: The processes that inhibit an event under a condition are independent.
Assumption (ii) is not as strict as it might look, as it is always possible to add an "Other causes" variable to represent what is not explicitly specified in the closed-world assumption. The model requires that only the individual probabilities for each parent are specified, i.e. P(E| only Hi); the number of values to specify thus grows only linearly with the number of parents.
(Figure: Cold (Co), Pneumonia (Pn) and Chicken-pox (Cp) combined through a noisy-Or gate into Fever (Fe).)
Suppose P(fe|co) = 0.4, P(fe|pn) = 0.8 and P(fe|cp) = 0.9. For example, the probability of fever when both Cold and Pneumonia are present is then 1 − (1 − 0.4)(1 − 0.8) = 0.88.
Table 7.3: Probability values for fever calculated with the noisy-Or model.

Co   Pn   Cp   P(¬fe)   P(fe)
f    f    f    1        0
f    f    t    0.1      0.9
f    t    f    0.2      0.8
f    t    t    0.02     0.98
t    f    f    0.6      0.4
t    f    t    0.06     0.94
t    t    f    0.12     0.88
t    t    t    0.012    0.988
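Table 7.3 can be reproduced with a few lines of code (ours); q holds the inhibitor probabilities 1 − P(fe | only that cause).

q = {"co": 1 - 0.4, "pn": 1 - 0.8, "cp": 1 - 0.9}

def p_fever(present):
    # noisy-Or: fever fails only if every present cause is inhibited,
    # and the inhibitions are independent (assumption (iii)).
    not_fe = 1.0
    for cause in present:
        not_fe *= q[cause]
    return 1.0 - not_fe

print(p_fever({"co", "pn"}))        # 1 - 0.6*0.2 = 0.88
print(p_fever({"co", "pn", "cp"}))  # 1 - 0.6*0.2*0.1 = 0.988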
Chapter 8

Inference in Bayesian Networks
Once a network is constructed it can be used to answer queries about the domain.
The basic task in a Bayesian network is to compute the probability of a variable
under evidence. Since there are no special input or output nodes, any variable can
be computed or observed as evidence. There are basically three different types of
inference that occur in a Bayesian network:
• Causal inference
• Diagnostic inference
• Intercausal inference
Causal inference is when the reasoning follows the same direction as the links in the
network. Diagnostic inference, on the other hand, is when the line of reasoning is
the opposite of the causal dependencies. The basic strategy to handle diagnostic
reasoning is to apply Bayes' rule in equation (6.2.10). Intercausal inference means reasoning between causes of a common effect. This is the same condition as in section 7.2.3: the presence of one cause makes the others less likely. Finally, combinations of these inferences can appear in Bayesian networks. This is sometimes called mixed inference.
Proof 8.1.1 Induction on the nodes in the network. Suppose that every causal Bayesian network has a joint distribution as in equation 8.1.1. For a network with only one node the hypothesis is obviously true. Suppose that the hypothesis is true for a network of the variables Bn−1 = {X1, . . . , Xn−1}, that is P(Bn−1) = Π_{i=1}^{n−1} P(Xi|PXi). Let Bn = Bn−1 ∪ {Xn}, where Xn is a leaf in Bn. (Since Bn is a DAG, there is at least one leaf in Bn.) By using the fundamental rule in equation (6.2.5) the formula can be written

P(Bn) = P(Bn−1, Xn) = P(Xn|Bn−1)P(Bn−1).

Since Xn is independent of Bn−1 \ PXn, the factor P(Xn|Bn−1) reduces to P(Xn|PXn), which completes the induction.
The fact that a Bayesian network can be represented by a joint distribution on the
form above results in two important properties.
Here, consistent means that the probability values do not conflict with each other. It is rather easy, intentionally or not, to construct an inconsistent system. Let for example P(a) = 0.2, P(b) = 0.7 and P(a|b) = 0.65. These values might seem alright at a first glance, but they are in fact inconsistent. By using Bayes' rule the term P(b|a) can be written as P(a|b)P(b)/P(a) = 0.65 · 0.7/0.2 ≈ 2.3, which is > 1. The consistency property ensures that a Bayesian network does not violate the axioms of probability.
The joint distribution can be used to answer any query of the network. To illustrate this, let us again return to the previous car trouble example. Here, the joint distribution is P(H, W, T) = P(H|T)P(W|T)P(T). The computation of, say, P(h, w, t) can be done by simply multiplying the corresponding values in the CPTs:
Table 8.1: Atomic events for the car trouble example calculated with the joint
distribution.
H W T P(H, W, T)
y y y 0.684
y y n 0.0075
y n y 0.076
y n n 0.0075
n y y 0.171
n y n 0.0175
n n y 0.019
n n n 0.0175
The terms P(¬h, w) and P(w) can be computed by using equation (6.2.9) to sum over all matching terms in table 8.1:

P(¬h|w) = (0.076 + 0.0075) / (0.076 + 0.0075 + 0.019 + 0.0175) ≈ 0.696.
If both cars fail to start, the probability of a broken timer is:

P(t|h, w) = P(t, h, w)/P(w, h) = 0.0175 / (0.019 + 0.0175) ≈ 0.48.   (8.1.2)
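The same query can be answered mechanically by enumeration. The sketch below (ours) uses conditional probability values consistent with table 8.1, e.g. P(H = y | T = y) = 0.8.

p_t = {"work": 0.95, "broken": 0.05}     # P(T)
p_h = {"work": 0.8, "broken": 0.3}       # P(Holmes' car starts | T)
p_w = {"work": 0.9, "broken": 0.5}       # P(Watson's car starts | T)

def joint(h_starts, w_starts, t):
    # P(H, W, T) = P(H|T) P(W|T) P(T), as in the factorization above
    ph = p_h[t] if h_starts else 1 - p_h[t]
    pw = p_w[t] if w_starts else 1 - p_w[t]
    return ph * pw * p_t[t]

# P(T = broken | both cars fail), cf. equation (8.1.2):
num = joint(False, False, "broken")
den = sum(joint(False, False, t) for t in p_t)
print(round(num / den, 2))   # approx. 0.48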
Summing over the joint distribution is an easy method to use when answering queries in Bayesian networks. Unfortunately, due to the exponential growth of atomic events, this method cannot be used in practice in this simple way. The number of cases to consider is equal to the product of the number of states for each variable. If every variable is binary, the number of atomic events in a network with n nodes is as large as 2^n. With larger networks, this method simply becomes intractable. For example, the medical diagnosis system MUNIN (Olesen et al in [Olesen et al 89]) consists of more than 1000 variables with up to seven states each. Even though the MUNIN system is one of the most elaborate applications found in the literature, real-world modeling often requires far more nodes than is manageable within the reach of this simple method. Even worse, Cooper proved in [Cooper 87] that inference in a Bayesian network is NP-hard irrespective of the method used. There are no efficient methods for handling an arbitrary network. Despite this disheartening result there are ways to tackle this problem. The proof in [Cooper 87] applies to an arbitrary network, and there are certain families of network topologies where more efficient algorithms exist. Also, many applications generate sparse graphs, in which the prospects of finding a more local computation scheme are good. Three such exact inference methods are described in sections 8.3.1, 8.3.2 and 8.4.
(Figure: a) a singly connected network, with at most one undirected path between any two nodes; b) a multiply connected network.)
(Figure 8.2: the evidence around a node X, decomposed into the causal support E+X transmitted through the parents U1, . . . , Um (with evidence sets EUkX), and the diagnostic support E−X transmitted through the children Y1, . . . , Yn and their other parents Zij (with evidence sets EXYi).)
which makes α:

α = 1 / Σ_X P(E−X | X, E+X).   (8.2.5)

That is, α can be treated as a normalizing constant which scales the sum of the individual probabilities to 1. There is no need to find an explicit formula for P(E+X|E−X).
Next, P(X|E+X) can be computed by considering all possible configurations of the parents of X. Letting PX = {U1, . . . , Um}, we get

P(X|E+X) = Σ_{PX} P(X|PX, E+X) P(PX|E+X).   (8.2.6)
The term P(X|PX, E+X) can be simplified into P(X|PX), since the parents PX d-separate E+X from X. Furthermore, since X is a converging node, it blocks the undirected path between the parents, which ensures that they are conditionally independent. Using the equation of independent events (6.2.7) the equation can be written

P(X|E+X) = Σ_{PX} P(X|PX) Π_k P(Uk|E+X).   (8.2.7)
The evidence E+X can be further decomposed into {EU1X, . . . , EUmX}, see figure 8.2. Since the parents are independent, each Uk is independent of all other parents and their evidence sets. This observation gives

P(X|E+X) = Σ_{PX} P(X|PX) Π_k P(Uk|EUkX).   (8.2.8)
A closer look at the term P(Uk|EUkX) reveals that it is in fact a recursive instance of the original problem, excluding the node X. The term P(X|PX) can be found in the conditional probability table for X.

Now, returning to equation (8.2.4), the last term to consider is P(E−X|X). Let Y = {Y1, . . . , Yn} be the set of children of X. Decomposing the evidence into {EXY1, . . . , EXYn} yields

P(E−X|X) = Π_i P(EXYi|X).
The last term, P(Yi, Zi|X), can be written P(Yi|X, Zi)P(Zi|X), and since X is d-separated from Zi the formula becomes:

P(E−X|X) = Π_i Σ_{Yi} Σ_{Zi} [P(Zi|E+XYi) P(E+XYi) / P(Zi)] P(E−Yi|Yi) P(Yi|X, Zi) P(Zi)   (8.2.15)
The term P(E+XYi) is unconditioned and therefore independent of the states of X. Replacing P(E+XYi) with a constant βi and cancelling the terms P(Zi) gives:

P(E−X|X) = Π_i βi Σ_{Yi} Σ_{Zi} P(Zi|E+XYi) P(E−Yi|Yi) P(Yi|X, Zi)   (8.2.16)
Now, inspecting each term reveals that P(E−Yi|Yi) is a recursive instance of P(E−X|X) excluding X, P(Yi|X, Zi) is a lookup in the conditional probability table (CPT) for Yi, and P(Zij|E+ZijYi) is a recursive instance of the original problem, excluding Yi.
Notice that there is no need to find an explicit value of β since it can be combined
into α in equation 8.2.18 to form a new normalizing constant ξ. To summarize,
the belief for a variable X under evidence, E, in a singly connected network can be
evaluated as:
P (X|E) = ξπ(X)λ(X) (8.2.18)
where ξ is a normalizing constant and
π(X) = Σ_U P(X|U) Π_k P(Uk|EUkX)   (8.2.19)

λ(X) = Π_i Σ_{Yi} Σ_{Zi} P(E−Yi|Yi) P(Yi|X, Zi) Π_j P(Zij|EZijYi)   (8.2.20)
Different versions of how to turn the above equation into a general algorithm can be found in the literature. Pearl constructs in [Pearl 88] an object-oriented message passing scheme where the flow of belief is updated by messages sent between adjacent nodes. A recursive formulation is derived by Russell and Norvig in [Russell and Norvig 95].
Identifying the symbols reveals that Y1 = H, Y2 = W, E−H = {h} and E−W = {w}. Since P(¬h|h) = P(¬w|w) = 0 and P(h|h) = P(w|w) = 1, the formula can be reduced, and we obtain P(t|h, w) ≈ 0.48, which is the same result as in equation (8.1.2).
8.3.1 Conditioning
The basic idea in the conditioning approach is to divide the multiply connected
network into several smaller singly connected networks conditioned on a set of in-
stantiated variables. Figure 8.4 shows the two networks created when the boolean
variable M e in figure 8.3 is instantiated. Generally, the number of resulting sub-
trees is exponential to the product over the states of each variable in the cutset. The
cutset is the set of conditioned variables. The problem in this approach is to find
8.3. MULTIPLY CONNECTED NETWORKS 73
Br Sc P (co|Br, Sc)
t t 0.8
t f 0.8
f t 0.8
f f 0.05
Figure 8.4: Conditioning the node M e creates two singly connected networks.
the minimal cutset that divides the original network into singly connected subnets.
Once created, the probability for a variable can be calculated as the weighted sum
over each individual polytree.
A nice side-effect of this technique is that the weighted sum can be used to quickly
calculate an approximate answer. Starting with the largest weight, the system can
compute the probability until a desired level of accuracy is obtained. A simple way
to calculate the accuracy range is to sum over all remaining weights to obtain an upper bound. The lower bound is, of course, the probability calculated so far, since probability values are always positive.
8.3.2 Clustering

Clustering takes the opposite approach to conditioning. Instead of dividing the network into smaller parts, clustering algorithms combine nodes into larger clusters. The variables Br and Sc in the coma example could be collapsed into a compound variable Z = {Br, Sc}. The states of the cluster node become the set of combinations of all included variables. Here, the states of Z are {(br, sc), (br, ¬sc), (¬br, sc), (¬br, ¬sc)}. The clustering transforms the network into a polytree where the inference can be performed as usual. The disadvantage is, of course, that if the network is dense, the compound variables can become intractably large, since the number of states is exponential in the number of collapsed variables. Despite this fact, clustering techniques are considered by many to be the best exact algorithm for most types of non-singly connected networks.
One particularly interesting and currently standard algorithm for clustering networks was originally developed by Lauritzen and Spiegelhalter in [Lauritzen and Spiegelhalter 88]. The method was later improved with a general absorption scheme by Jensen et al. in [Jensen et al 90]. This technique uses properties of the clusters in order to efficiently propagate the flow of belief.
The key issue is the concept of consistent universes. Consider two clusters, V =
{A, B, C} and W = {C, D, E}, in figure 8.5. Now, the probability for the common
variable C can be calculated by summing over all elements except C in both V and
W. Thus,

P(C) = Σ_{A,B} P(A, B, C) = Σ_{D,E} P(C, D, E).   (8.3.22)
If evidence changes the information in V then the above condition can be used to update the probability for W in the following way: initially, let the distributions for V and W be P0(A, B, C) and P0(C, D, E), respectively. Now, suppose that evidence in V changes the distribution to P1(A, B, C). With this new information,
(Figure 8.5: the clusters V = {A, B, C} and W = {C, D, E} connected through the common variable C.)
the probability for the common variable C can be marginalized out of P1(A, B, C) as in equation 8.3.22,

P1(C) = Σ_{A,B} P1(A, B, C).   (8.3.23)
Using the fundamental equation (6.2.5), the term P(C, D, E) can be written P(D, E|C)P(C), where the term P0(D, E|C) can be calculated from the initial distribution as P0(C, D, E)/P0(C). This gives the update

P1(C, D, E) = (P0(C, D, E) / P0(C)) · P1(C).   (8.3.26)
The scheme above is called absorption; the cluster W has absorbed from V. In general terms the absorption process can be described as follows:

Definition 8.3.1 Let V and W be clusters of variables and let S be the set of their common variables, that is S = V ∩ W. Let ψ0V, ψ0W and ψ0S be the belief tables associated with each cluster. The absorption procedure is defined by the following steps:
V:                       W:
A    B    ψ0V(A, B)      B    C    ψ0W(B, C)
a1   b1   0.05           b1   c1   0.1
a1   b2   0.6            b1   c2   0.3
a2   b1   0.3            b2   c1   0.4
a2   b2   0.05           b2   c2   0.2
for W can be calibrated to V by letting W absorb from V. The new belief table for the separator S = {B} is

ψ1S = Σ_A ψ0V = (ψ0V(a1, b1) + ψ0V(a2, b1), ψ0V(a1, b2) + ψ0V(a2, b2)) = (0.05 + 0.3, 0.6 + 0.05) = (0.35, 0.65).

Finally, ψ0W can be updated (with the initial separator table ψ0S = (0.4, 0.6)):

ψ1W = ψ0W · (ψ1S / ψ0S)
    = ψ0W · ((0.35, 0.65) / (0.4, 0.6))
    = ψ0W · (0.875, 1.083)
    ≈ (0.1 · 0.875, 0.3 · 0.875, 0.4 · 1.083, 0.2 · 1.083)
    ≈ (0.0875, 0.2625, 0.4333, 0.2167),

where the b1 entries of ψ0W are scaled by 0.875 and the b2 entries by 1.083.
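The numeric example can be checked with a few lines of code (ours), letting W absorb from V over the separator S = {B}, with ψ0S = (0.4, 0.6) as above.

import numpy as np

psi_V = np.array([[0.05, 0.6],   # rows a1, a2; columns b1, b2
                  [0.3, 0.05]])
psi_W = np.array([[0.1, 0.3],    # rows b1, b2; columns c1, c2
                  [0.4, 0.2]])
psi_S0 = np.array([0.4, 0.6])    # initial separator table over B

psi_S1 = psi_V.sum(axis=0)                    # marginalize A out: (0.35, 0.65)
psi_W1 = psi_W * (psi_S1 / psi_S0)[:, None]   # scale the b1 and b2 rows
print(psi_S1, psi_W1.round(4))                # (0.0875, 0.2625, 0.4333, 0.2167)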
(Figure: a CPN and its moral graph, obtained by connecting parents of a common child and dropping the link directions.)
(1) Let i = |V|.

(2) Let X be the ith node (using the order above) and let P be the set of neighbours of X with numbers < i.

(3) Add links between any two nodes in P that are not already connected.

(4) Let i = i − 1.

(5) Continue from (2) until i = 0.
The triangulation of the graph ensures that the final cluster tree will fulfill the second property above, see figure 8.9 for an example. Notice that the triangulation of a graph is not unique. There are several steps in the algorithm where arbitrary choices can be made. Intuitively, the best triangulation is the one that yields minimum fill-in. However, in this case the optimal triangulation is concerned with the size of the final junction tree. Unfortunately, finding the optimal triangulated graph is NP-hard. See Arnborg et al. in [Arnborg et al 87] and further discussion by Jensen et al. in [Jensen and Jensen 94].
Next, the junction graph can be constructed by identifying the cliques in the moral and triangulated graph. Links between clusters are added by connecting clusters with a non-empty intersection, see figure 8.7. The intersection between two clusters is called a separator, denoted S. Finally, the junction tree can be found in the
(Figure 8.7: the triangulated graph and the corresponding junction graph, with cliques ABC, BCD, CDE and DEF connected through separators such as BC, CD, C, D and DE.)
junction graph by finding a maximum spanning tree (Jensen [Jensen and Jensen 94]), where the weight of a link is represented by the number of variables in the separator, i.e. |S|. Figure 8.8 shows a junction tree found in a junction graph.
(Figure 8.8: a junction tree (right) found in the junction graph (left) as a maximum spanning tree over the separator sizes.)
(Figure: a CPN, its moral graph and the resulting cluster graph with clusters AB, AC, BD, CE and DEF.)
(i) Give all nodes (clusters) and separators a table of ones, i.e. ψ 0 = 1.
(ii) For each variable, A, choose one cluster, C, containing A ∪ PA and multiply
P (A|PA ) (the CPT) with ψC0 .
8.3.2.6 Evidence
Entering observed evidence in a junction tree is easy. Evidence is normally of the form A = aj. Semantically, this means that the probability of all other states is 0: P(A) = {0, . . . , 1, 0, . . .}, with a "1" in the jth position. The same can be done for a cluster of variables: let E = {0, . . . , 1, 0, . . .} be a finding on A, and multiply E with the belief table for any cluster containing A.
(iii) Let the parent absorb from the node: call Absorb(parent, node) (unless the parent is the root of the tree).

TopDown(node, parent)
When the junction tree is consistent, the belief for a single variable can be computed by marginalization:

P(A) = Σ_{C\A} ψC,   (8.3.29)
(Figure: message passing in the junction tree. BottomUp(R) lets the root R absorb recursively from the leaves; TopDown(R) then distributes the updated tables back outwards.)
(iv) Construct initial belief tables for each node and separator in the junction tree.
(vi) Select a root node, R (any node in the junction tree can act as root).
Note that steps (i) to (iii) are static tasks; there is no need to redo this process unless the structure of the network is changed.
(Figure: the original BN over Me, Sc, Br and Co, its moral graph, and the junction tree with clusters MSB and SBC connected by the separator SB.)
term P(Co|Sc, Br) is multiplied with ψ0C2. The initial belief table for the separator remains 1; the tables for C1 and C2 are shown in table 8.3.

Notice that these two clusters are not consistent. To make them consistent the absorption process must be applied. Suppose that C1 is selected as root in the tree. A call to BottomUp() will cause C1 to absorb from C2. In this absorption nothing is changed, however, since the separator table ψ1S will remain equal to 1 when marginalized out of ψ0C2. This is of course not a coincidence, since ψ0C2 at this stage is equal to P(Co|Sc, Br). Next, the call to TopDown() will force C2 to absorb from C1. The new belief table for the separator S is shown in table 8.4. Finally, when the junction tree is globally consistent it is possible to calculate the probability for every individual variable. For instance, the probability for coma can be marginalized out of ψ2C2:

P(co) = Σ_{Sc,Br} ψ2C2 = 0.032 + 0.224 + 0.032 + 0.032 = 0.32.

(Here ψ0C2 was updated to ψ1C2 by multiplying ψ2S / ψ1S with ψ0C2; the result is shown in the last column of table 8.3.)
Table 8.4: The new belief table for the separator S.

Br   Sc   ψ2S
y    y    0.04
y    n    0.04
n    y    0.28
n    n    0.64
P (A)P (T |A)P (E|T, L)P (X|E)P (L|S)P (B|S)P (D|E, B)P (S) (8.4.30)
To compute the unconditional probability of, say, dyspnea it is possible to sum over
which requires substantially fewer computations. The SPI approach is to find the optimal factoring so that the necessary calculations in the joint distribution are kept to a minimum. This problem is closely related to the standard optimal factoring problem, OFP, which is believed to be NP-hard. However, recently some heuristic search algorithms have been developed which appear to find good factoring solutions. Also, computations can be saved using cache techniques, to avoid recomputing values that are already calculated in a previous step in the summation. For example, the term Σ_S P(L|S)P(B|S)P(S) above is unchanged during the summation over the variables A, T, E and B and can therefore be saved in a cache memory after the first computation, and then be accessed when needed during the remaining computations.
Since any conditional query P(X|Y) can be rewritten as P(X, Y)/P(Y), the algorithm is not restricted to handling conjunctive queries.
A factor is a subset of the complete probability distribution. Each factor contains
a set of variables, which affect the distribution. For example, the factor P (B|A)
includes the variables {A, B} and combining this factor with the factor P (A) yields
a conformal factor, P (B|A)P (A), with the same set of variables as P (B|A).
Now, let Q be the set of target variables; a good factoring for P(Q) can be found in the following way:
(i) First, find the relevant nodes in the original BN. This can be done using the d-separation property to exclude parts of the network which have no relevance to the current query. A linear time algorithm to find this subtree can be found in [Geiger et al 89].
(ii) Let F be a factor set which contains all factors to consider in the next com-
putation, and let C be the set of factor candidates. Initially, let F be all
distributions from the subtree and let C be empty.
(iii) Combine the factors in F pairwise and add to the candidate set C all pairs in which one factor contains a variable which is a parent or a child of at least one variable in the other factor. There is no need to add pairs which are already in C.
(iv) Let Ui be the set of variables for each combination in C. Compute vars(Ui ), the
number of variables in Ui excluding any target variable: vars(Ui ) = |Ui \ Q|.
For each combination in C, compute sum(Ui ), which is the number of variables
that can be summed out when two factors are combined. A variable can be summed out when it appears neither in the set of target variables, Q, nor in any of the other factors in F (excluding those in the current pair). Compute the result size as vars(Ui) − sum(Ui).
(v) Select the best candidate in C in the following way: choose the element in C with the lowest result size. If more than one element applies, choose the one of these with the largest number of variables (including target variables). If there is still more than one candidate, choose one of these arbitrarily.
(vi) Construct a new factor by combining the chosen pair into a conformal factor.
Update F by replacing the two chosen factors with the new combined one.
Update C by deleting any pair which has a non-empty (factor-level) intersection
with the above chosen factor.
(vii) Continue from step (iii) until only one factor remains in F. This is the resulting
factor.
Finally, use the resulting factor above to compute an answer for the conjunctive
probability.
Table 8.5: Candidate combinations in the first loop.

C:            (fMe, fBr)   (fMe, fSc)   (fMe, fCo)       (fBr, fSc)    (fBr, fCo)       (fSc, fCo)
U:            Br, Me       Me, Sc       Me, Br, Sc, Co   Me, Br, Sc    Me, Br, Sc, Co   Me, Br, Sc, Co
sum(U):       0            0            0                0             0                0
vars(U):      2            2            2                3             3                3
result size:  2            2            2                3             3                3
In this presentation all variables are assumed to be binary, i.e., they all have two
states. If the number of states is not equal for all variables, it is possible to compute
the size of a factor as the product over the states for every included variable instead
of just considering the number of variables.
According to Li and D’Ambrosio in [Li and D’Ambrosio 94] the set-factoring SPI
algorithm is superior to Jensen’s algorithm for most kinds of networks.
Loop 1: The factor set F is initialized to {fMe, fBr, fSc, fCo} and C is empty. Every factor in F is then pairwise combined and added to C. The result is shown in table 8.5.

Here, the best combinations are candidates no 1 and 2, since both have minimum result size and an equal number of variables. Candidate no 1, (fMe, fBr), is chosen to replace the factors in F, which is updated to {(fMe, fBr), fSc, fCo}. After deleting every factor with a non-empty intersection with (fMe, fBr), the candidate set C becomes {(fSc, fCo)}.
Loop 2: Adding combinations from F gives C = {((fMe, fBr), fSc), ((fMe, fBr),
fCo), (fSc, fCo)}. The best combination is ((fMe, fBr), fSc), in which the
variable Me can be summed out. F is updated to {((fMe, fBr), fSc), fCo}
and C becomes empty.
Loop 3: The candidate set C becomes {(((fMe, fBr), fSc), fCo)}, so the only
combination to choose is (((fMe, fBr), fSc), fCo). Both of the variables Br
and Sc can be summed out. F is updated to {(((fMe, fBr), fSc), fCo)}, which
fulfills the termination condition in step (vii) above.
Thus, the factoring result is:
P(Co) = Σ_{Br,Sc} P(Co|Br,Sc) Σ_{Me} P(Sc|Me) P(Br|Me) P(Me).    (8.4.32)
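As a sanity check on (8.4.32), the following sketch evaluates both the factored
expression and the brute-force sum over all atomic configurations and confirms that
they agree. The CPT numbers are hypothetical, since the notes do not specify them
for this network.

    import itertools

    # Hypothetical CPTs for the binary variables Me, Br, Sc, Co
    P_Me = {1: 0.1, 0: 0.9}                                           # P(Me)
    P_Br_Me = {(1, 1): 0.8, (0, 1): 0.2, (1, 0): 0.1, (0, 0): 0.9}    # P(Br|Me), key (br, me)
    P_Sc_Me = {(1, 1): 0.7, (0, 1): 0.3, (1, 0): 0.05, (0, 0): 0.95}  # P(Sc|Me), key (sc, me)
    on = {(1, 1): 0.9, (1, 0): 0.6, (0, 1): 0.5, (0, 0): 0.01}        # P(Co=1|Br,Sc)
    P_Co_BrSc = {(1, br, sc): p for (br, sc), p in on.items()}        # key (co, br, sc)
    P_Co_BrSc.update({(0, br, sc): 1 - on[(br, sc)]
                      for br in (0, 1) for sc in (0, 1)})

    def p_co(co):
        # inner sum over Me first, then the outer sum over Br and Sc, as in (8.4.32)
        inner = {(br, sc): sum(P_Sc_Me[(sc, me)] * P_Br_Me[(br, me)] * P_Me[me]
                               for me in (0, 1))
                 for br in (0, 1) for sc in (0, 1)}
        return sum(P_Co_BrSc[(co, br, sc)] * inner[(br, sc)]
                   for br in (0, 1) for sc in (0, 1))

    def p_co_brute(co):
        # direct sum over all atomic configurations
        return sum(P_Me[me] * P_Br_Me[(br, me)] * P_Sc_Me[(sc, me)]
                   * P_Co_BrSc[(co, br, sc)]
                   for me, br, sc in itertools.product((0, 1), repeat=3))

    assert abs(p_co(1) - p_co_brute(1)) < 1e-12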
(vii) etc . . .
(viii) When a value has been sampled for all unobserved variables, restart with X1
and repeat the process until sufficiently many cases have been generated.
To avoid bias from the initial configuration (which can be very unlikely) it is common
to discard the first 5-10 percent of the generated samples. This is called “burn-in”.
One problem with this kind of logic sampling is that the process can get stuck
in certain areas of the state space. There may exist an equally likely area, but in
order to reach it, some variable has to take a highly unlikely value. Another
problem is that estimating very unlikely events can be very time-consuming.
Finally, selecting a valid starting configuration can be very tedious; in the
general case it is in fact an NP-hard problem!
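A minimal sketch of such a sampler with burn-in is given below. The helper
cond(i, state), which should return P(X_i = 1 | all other variables), is a
hypothetical interface that would have to be derived from the network at hand.

    import random

    def gibbs(cond, n_vars, n_samples, burn_in_frac=0.1, start=None):
        # cond(i, state) must return P(X_i = 1 | all other variables in state)
        state = list(start) if start is not None else [random.randint(0, 1)
                                                       for _ in range(n_vars)]
        samples = []
        for _ in range(n_samples):
            for i in range(n_vars):        # resample X_1, ..., X_n in turn
                state[i] = 1 if random.random() < cond(i, state) else 0
            samples.append(tuple(state))
        return samples[int(burn_in_frac * n_samples):]   # discard the burn-in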
• The noisy-Or model can be used to avoid a parameter explosion when a vari-
able has many parents.
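Recall the idea: each present parent independently fails to cause the effect with an
inhibitor probability q_i, so the effect is absent with probability Π q_i over the
present parents, and only one parameter per parent is needed instead of 2^n table
entries. A minimal sketch (the parent names are invented for illustration):

    def noisy_or(q, present):
        # q[u]: probability that parent u's influence is inhibited when u is present
        p_effect_absent = 1.0
        for u in present:                  # parents that are actually present
            p_effect_absent *= q[u]
        return 1.0 - p_effect_absent       # P(effect | present parents)

    print(noisy_or({"flu": 0.2, "cold": 0.6}, ["flu", "cold"]))  # 1 - 0.2*0.6 ≈ 0.88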
EXERCISES
[Figure: a Bayesian network over the nodes A, S, T, L, B, E, X and D; the directed
edges of the original diagram are not reproducible here.]
III.3 (Software required) The Monty Hall puzzle gets its name from an Amer-
ican TV game show, ”Let's Make a Deal”, hosted by Monty Hall. In this show,
you have the chance to win a prize if you are lucky enough to find it behind
one of three doors. The game goes like this: first you select one of the three
doors. Monty Hall, who knows which door hides the prize, then opens one of
the two remaining doors, always one without the prize behind it. Finally, you
are offered a second selection: stay with your original door, or switch to the
other unopened door.
The problem of the puzzle is: what should you do at your second selection?
Some would say that it does not matter, because the prize is equally likely to
be behind either of the two remaining doors. This, however, is not quite true.
Build a Bayesian network to conclude which action gives the highest probability
of winning; a small enumeration sketch for checking your answer follows after
the list below.
Here is some help to get you started:
The Monty Hall puzzle can be modeled with three random variables: Prize,
First Selection, and Monty Opens.
– Prize represents the information about which door contains the prize.
This means that it has three states: ”Door 1”, ”Door 2”, and ”Door 3”.
– First Selection represents your first selection. This variable also has the
three states: ”Door 1”, ”Door 2”, and ”Door 3”.
– Monty Opens represents Monty Hall's choice of door once you have made
your first selection. Again, we have the three states: ”Door 1”, ”Door
2”, and ”Door 3”.
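If you want to check your network afterwards, the following enumeration computes
the winning probabilities of the two actions under the usual modelling assumptions
(Prize and First Selection uniform, Monty choosing uniformly among the doors he is
allowed to open). The sketch is only a cross-check, not a substitute for building
the network:

    from fractions import Fraction
    from itertools import product

    stay = switch = Fraction(0)
    for prize, first in product(range(3), repeat=2):     # both uniform, p = 1/9
        allowed = [d for d in range(3) if d != first and d != prize]
        for monty in allowed:                            # Monty picks uniformly
            p = Fraction(1, 9) * Fraction(1, len(allowed))
            if prize == first:
                stay += p        # staying with the first selection wins
            else:
                switch += p      # the other unopened door hides the prize
    print(stay, switch)          # prints the two winning probabilities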
(p ∨ q) ∧ ¬p ∧ ¬q (9.0.1)
Under which assumptions is the noisy-Or model valid? Do you think the
noisy-Or model is appropriate to use in this particular application?
III.6 Given a discrete Bayesian network, B = {X1 , . . . , Xn }, an atomic con-
figuration is a specific assignment to each individual variable, i.e. X1 =
x1 , . . . , Xn = xn .
(a) Explain why the probabilities of the atomic configurations must sum to
one, i.e.
Σ_{X1,...,Xn} P(X1 , . . . , Xn ) = 1.    (9.0.2)
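A hint for (a): by the chain rule, the joint distribution factors into the CPTs of
the network, and summing out the variables innermost-first makes each CPT telescope
to 1. The toy check below illustrates this on a hypothetical two-node network A → B:

    from itertools import product

    P_A = {1: 0.3, 0: 0.7}                                         # P(A)
    P_B_A = {(1, 1): 0.9, (0, 1): 0.1, (1, 0): 0.2, (0, 0): 0.8}   # P(B|A), key (b, a)
    total = sum(P_A[a] * P_B_A[(b, a)] for a, b in product((0, 1), repeat=2))
    print(total)   # 1.0 (up to rounding): the sum over B telescopes, then the sum over A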
[AboaFuzz 95] AboaFuzz 1.0 User Manual (P. Eklund, M. Fogström, S. Olli), Åbo
Akademi University, 1995.
[Baker 87] J. E. Baker. Reducing Bias and Inefficiency in the Selection Algorithm, In
J. J. Grefenstette, editor, Genetic Algorithms and their Applications: Proceedings
of the Second International Conference on Genetic Algorithms, pages 14-21, 1987.
[Ben-Ari 93] M. Ben-Ari, Mathematical Logic for Computer Science, Prentice Hall,
1993.
[Bezdek 74] J. C. Bezdek, Cluster Validity with fuzzy sets, Journal of Cybernetics,
3 (1974), 58-73.
[Bezdek 80] J. C. Bezdek, A Convergence Theorem for the Fuzzy ISODATA Cluster-
ing Algorithms, IEEE Transactions on Pattern Analysis and Machine Intelligence
2 (1980), 1-8.
[Bezdek 81] J. C. Bezdek, Pattern Recognition with Fuzzy Objective Function Algo-
rithms, Plenum Press, 1981.
[Buckles 97] B. Buckles, Seminar Course: Evolutionary Computation - Lecture 3,
published on the WWW, address http://www.eecs.tulane.edu/www/Buckles.Bill/ec.html,
1997.
[Choe and Jordan 92] H. Choe, J. Jordan, On the Optimal Choice of Parameters
in a Fuzzy C-Means Algorithm, Proc. IEEE International Conference on Fuzzy
Systems, San Diego, 349-354, 1992.
[Cooper and Herskovits 92] G. F. Cooper, E. Herskovits, A Bayesian method for the
induction of probabilistic networks from data, Machine Learning 9 (1992), 309-347.
[Davis 87] L. Davis (ed.), Genetic Algorithms and Simulated Annealing, Pitman,
London, 1987.
[Eklund and Klawonn 92] P. Eklund, F. Klawonn, Neuro Fuzzy Logic Programming,
IEEE Transactions on Neural Networks, Vol 3, No. 5, September 1992, 815-818.
[Eklund and Zhou 96] P. Eklund, J. Zhou, Comparison of Learning Strategies for
Adaptation of Fuzzy Controller Parameters, J. Fuzzy Sets and Systems, to appear.
[Everitt 74] B. S. Everitt, Cluster Analysis, John Wiley & Sons, 1974.
[Fullér 95] R. Fullér, Neural Fuzzy Systems, Meddelanden från ESF vid Åbo
Akademi, Serie A:443, 1995.
[Geisser 75] S. Geisser, The predictive sample reuse method with applications, J.
Amer. Stat. Assoc. ? (1975), xx-xx.
[Höhle 89] U. Höhle, Monoidal Closed Categories, Weak Topoi, and Generalized
Logics, preprint, 1989.
[Holland 92] J. Holland, Adaptation in Natural and Artificial Systems, The MIT
Press, Cambridge Massachusetts, London, England, 1992.
[Huber 96] B. Huber, Fibre Optic Gyro Application on Autonomous Vehicular Nav-
igation, PhD thesis, University of Strasbourg, 1996.
[Jain and Dubes 88] A. Jain, R. Dubes, Algorithms for Clustering Data, Prentice
Hall, 1988.
[Jang 92] J.-S. R. Jang, Self-learning fuzzy controllers based on temporal back prop-
agation, IEEE Trans. Neural Networks 3 No 5 (1992), 714-723.
[Moody 94] J. Moody, Prediction risk and architecture selection for neural networks,
In: V. Cherkassky, J. H. Friedman, H. Wechsler (Eds.), From Statistics to Neural
Networks: Theory and Pattern Recognition, NATO ASI Series F, Springer-Verlag,
1994.
[Moody and Utans 95] J. Moody, J. Utans, Architecture selection strategies for neu-
ral networks: Application to corporate bond rating predictions, In: A.-P. Refenes
(Ed.), Neural Networks in the Capital Markets, John Wiley & Sons, 1995, 277-
300.
[Olesen 93] K. G. Olesen, Causal probabilistic networks with both discrete and con-
tinuous variables, IEEE Transactions on Pattern Analysis and Machine Intelli-
gence 3 (1993).
[Olli 95] S. Olli, Fuzzy Control for AGNES, unpublished notes, Åbo Akademi
University, 1995.
[Riissanen and Eklund 96] T. Riissanen and P. Eklund, Working within a Fuzzy
Control Application Development Workbench: Case Study for a Water Treatment
Plant, Proc. EUFIT’96, 4th European Congress on Intelligent Techniques and
Soft Computing, Aachen, 1142-1145, 1996.
[Russell and Norvig 95] S. Russell, P. Norvig, Artificial Intelligence – a Modern Ap-
proach, Prentice-Hall International, 1995.
[Schweizer and Sklar 61] B. Schweizer, A. Sklar, Associative functions and statisti-
cal triangle inequalities, Publicationes Mathematicae Debrecen 8 (1961), 169-186.
[Shao 88] S. Shao, Fuzzy Self-Organizing Controller and its Application for Dynamic
Processes, Fuzzy Sets and Systems 26 (1988), 151-164.
[Smith and Kelleher 88] B. Smith, G. Kelleher (eds.), Reason Maintenance Sys-
tems and their Applications, Ellis Horwood series in Artificial Intelligence, Ellis
Horwood Limited, 1988.
[Takagi and Sugeno 85] T. Takagi, M. Sugeno, Fuzzy Identification of Systems and
Its Applications to Modeling and Control, IEEE Transactions on Systems, Man
and Cybernetics 15 (1985), 116-132.
[Umano 87] M. Umano, Fuzzy-Set Prolog, Second IFSA Congress, Tokyo, 1987,
750-753.
[Wang and Mendel 92] L. X. Wang, J. M. Mendel, Fuzzy Basis Functions, Universal
Approximation, and Orthogonal Least-Squares Learning, IEEE Trans. Neural
Networks 3, No. 5, September 1992, 807-813.
[Windham 82] M. Windham, Cluster Validity for the Fuzzy c-Means Clustering Al-
gorithm, IEEE Transactions on Pattern Analysis and Machine Intelligence, 4
(1982), 357-363.
[Yager 80] R. R. Yager, On a General Class of Fuzzy Connectives, Fuzzy Sets and
Systems 4 (1980), 235-242.
[Yager 96b] R. R. Yager, Constrained OWA Aggregation, Fuzzy Sets and Systems
81 (1996), 89-101.
[Zadeh 65] L. A. Zadeh, Fuzzy Sets, Information and Control 8 (1965), 338-353.
[Zadeh 75] L. A. Zadeh, The Concepts of a Linguistic Variable and its Application
to Approximate Reasoning, Information Science, 8 (1975), 199-249.
[Zadeh 78] L. A. Zadeh, Fuzzy Sets as a Basis for a Theory of Possibility, Fuzzy
Sets and Systems 1 (1978), 3-28.
[Zadeh 89] L. A. Zadeh, The coming age of fuzzy logic, plenary talk at 3rd IFSA,
Seattle, August 6-11, 1989.
[Zhou and Eklund 95] J. Zhou, P. Eklund, Some Remarks on Learning Strategies for
Parameter Identification in Rule Based Systems, Proc. EUFIT’95, 3rd European
Congress on Intelligent Techniques and Soft Computing, Aachen, 1911-1916, 1995.
[] Discussions with Veli Kairisto, Turku University Central Hospital, on the diag-
nostic problem and the TUCH data set for acute myocardial infarction, Spring
1996.