SOFT COMPUTING
Umeå University
Department of Computing Science
SE-901 87 Umeå
Sweden
These lecture notes were originally prepared for the course in Computational Intelligence at the Department of Computing Science, Umeå University.
Text contributions for this edition have been provided by Jens Bohlin, Patrik Eklund,
Lena Kallin-Westin and Tony Riissanen.
The authors
Contents

I TYPICAL APPLICATIONS

1 Application Scenarios
  1.1 Applications in the Health Care Domain
    1.1.1 Medical decision support
    1.1.2 Information scope and management
    1.1.3 Case studies and scenarios
  1.2 Industrial Applications
    1.2.1 Diagnostics and control
    1.2.2 Case studies and scenarios

3 Logic Programming
  3.1 Logical formulae as a program
  3.2 Resolution in logic programming
    3.2.1 The resolution procedure
    3.2.2 Unification

4 Many-Valued Logic
  4.1 Fuzzy Sets
  4.2 Useful Membership Functions
    4.2.1 Triangular and trapezoidal functions
    4.2.2 Gaußian and sigmoidal functions
  4.3 Fuzzy Logic

7 Fuzzy Clustering
  7.1 Data Clustering
  7.2 Fuzzy c-Means Clustering
  7.3 Identification of Rules
  7.4 Geometric Fuzzy Clustering
  7.5 Applications

9 Parameter Estimations
  9.1 Tuning of Fuzzy Rule-Base by Using Gradient-Descent Algorithm
  9.2 Tuning of Fuzzy Rule-Base by Using Gauss-Newton with Regularization Method
  9.3 Tuning of Fuzzy Rule-Base by Using Levenberg-Marquardt Method

10 Software Developments
  10.1 General Overview
  10.2 System Architectures
  10.3 Design and Integration of Fuzzy Controllers
    10.3.1 Scenario: Rule Generation based on manual control
    10.3.2 Scenario: Integration of control rules
  10.4 Overview of AboaFuzz
TYPICAL APPLICATIONS
Chapter 1
Application Scenarios
scenario in which the computer systems engineer clearly becomes more superfluous, and the domain expert heads development, specifying the types and styles of decision making that need to be backed up and supported within an information system.
[Figure: expert experience and supporting technology: patient record systems and multimedia information systems, connected through information usage and refinement.]
In the data refinement scenario, data mining is supported through a data analysis
workbench, including tools both within statistics and computational intelligence,
together with supporting tools to enable system integration (see Figure 1.2).
Nephropathia Epidemica
NE is a haemorrhagic fever with renal syndrome [Lähdevirta et al 84, Settergren 89]. It is the mild form of this acute infectious disease that occurs in Europe. The bank vole, Clethrionomys glareolus, is the natural host of the Puumala virus [Lähdevirta 71], which causes NE. The diagnosis of NE is often apparent from the clinical findings. NE begins with abrupt high fever for 3-5 days, with headache from the second day onwards. From the third day there are nausea and vomiting, backache and abdominal pains. Some of the patients have acute myopia for one to three days. In the acute phase the patients often have more or less profound thrombocytopenia. During the first week the patients develop symptoms of acute renal failure, such as proteinuria, oliguria or anuria, and azotemia. In ultrasonographic imaging, swelling of the kidneys is a typical finding. A description of the patient material can be found in [Eklund and Forsström 91], where initial numerical tests were made with 27 symptoms and signs. Later, diagnostic success rates were improved to up to 88% correctness. The number of symptoms and signs can be reduced to five, using only CRP (C-reactive protein, values < 200), ESR (erythrocyte sedimentation rate, values < 100), creatinine (values < 500), thrombocytes (values < 400) and myopia (a binary yes/no value), still providing success rates above 80%. The diagnosis (the output of the network) is either positive (virus present in blood) or negative.
Myocardial infarction
Acute Myocardial Infarction (AMI) is the death of heart-muscle cells from reduced or obstructed blood flow through the coronary arteries. Traditionally, the diagnosis of AMI is made upon signs such as sweating, nausea, chest pain, changes in the ECG, and raised levels of biochemical markers. The symptoms can vary a great deal between patients, and patients come under treatment at different stages of AMI. The time factor is critical in the diagnostic process: 50% of the patients die within two hours of the first symptoms, so it is important to find a quick and reliable way to make the diagnosis.
The diagnosis of AMI requires at least two of the following criteria: a history of characteristic chest pain, evolutionary changes on the ECG, or elevation of serial cardiac enzymes [Apple 92]. The values of the enzymes are measured from samples taken at regular intervals within the first 24 hours after the infarction. Markers used are, for instance, CK-MB, myoglobin, and troponin.
In the case of AMI, the level of CK-MB increases 2-3 hours after the AMI has started, reaches a peak 10-24 hours after the AMI, and returns to normal within 3-4 days afterwards. The myoglobin level increases 2-3 hours after the start, peaks at 6-9 hours, and returns to normal within 18-24 hours. Troponin values increase at 4-6 hours, peak at 10-24 hours, and are back to normal within 10-15 days.
Observations related to data as indicated above should be used as a basis for specifying combinations and transformations of data. Apart from data modelling, it is also interesting to investigate whether or not biological modelling, e.g. of CK-MB behaviour, is possible. We know that the value of the markers increases, peaks and decreases, but the values differ from patient to patient. Reasons for the differing values are that patients seek treatment at different stages of their AMI, that the treatment given, e.g. an intravenous drip and/or drugs, affects the values, and that physical activity can raise the level of CK-MB without being a sign of AMI. Clearly, intravenous drips, once started, and similar factors also need to be considered in data modelling.
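As an illustration of what such a biological model could look like, the rise-peak-decay behaviour of a marker can be sketched with, e.g., a log-normal time profile. Python is used for illustration only, and the shape parameters below are purely illustrative, not fitted to patient data:

```python
import math

def marker_level(t_hours, peak_time=17.0, width=0.6, peak_value=1.0):
    """Illustrative rise-peak-decay curve for a cardiac marker such
    as CK-MB, modelled as a log-normal profile over time since onset.
    All parameters are made up for illustration."""
    if t_hours <= 0:
        return 0.0
    z = math.log(t_hours / peak_time)
    return peak_value * math.exp(-(z * z) / (2 * width * width))
```

The curve is 0 at onset, reaches peak_value at peak_time, and decays thereafter; per-patient differences would be captured by fitting peak_time, width and peak_value individually.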
Down's syndrome
The most commonly used biochemical markers are AFP (alpha-foetoprotein) and hCG (human chorion gonadotropin), especially its free subunit β-hCG. Several factors influence AFP and β-hCG levels, e.g. insulin-dependent diabetes, race, smoking, and overweight of the mother. Therefore, it can be expected that by adding more anamnestic information, especially factors known to influence marker levels, it is possible to get more specific and sensitive information with regard to finding the group that is at risk of having a DS baby.
A more detailed description of the syndrome, as far as data analysis is concerned,
can be found in [Kallin et al 95].
Based on computational methods as indicated in [Kallin et al 95] and explained later in the text, a decision support system has been constructed which evaluates the risk for the syndrome given three inputs: the mother's age, AFP, and β-hCG. The system shows an improved performance over the multiple Gaußian formula, one of the most common statistical formulas used in software today.
1.2 Industrial Applications
AGNES data:
• air-filled tyres
• steering axis chain coupled with 0.9 Nm DC motor and 56:1 transmission
[Figure: the AGNES control structure: trajectory measurements from sensors are used to calculate deviations and to generate setpoints for steering and speed, each handled by a fuzzy controller.]
Fertilizer Production
In the fertiliser production line, the consistency of the material is controlled by injections of liquid into the crystalliser and the drum, respectively. The objective is to produce granules of fixed size. Disturbances derive mainly from recycling and added chemicals. Changes of recipe also present situations and control problems where the process needs to be stabilised with minimum delay.
A data set recorded from particular situations involving manual control has been used for analysis. The control actions, injection change to the drum, injection change to the crystalliser, and circulation change, are based on observation of the size of the granules, the direction of change of the granule size, rear temperature, front temperature, revolution of the drum, revolution of the crystalliser, and the amount in circulation.
In the extraction of data, much attention has to be given to obtaining reasonably good approximations for measurements to enable structure identification. At this stage, much engineering knowledge of the process is required. A general comment is that chemical processes of this type are impossible to model exactly. Furthermore, the extracted data is far from clean, even after transformation to suit numerical experimentation. However, for the discussion here, this case study provides a useful counterpart to the other case studies. From an integration point of view, the chemical plant is a typical example that demonstrates the suitability of the rule-base generation approach adopted. The industrial automation system can directly integrate the controller as developed within a design workbench, where the integration with the automation system can be done in various ways.
Part II
Chapter 2
Two-Valued Logic
This chapter draws heavily on the book "Mathematical Logic for Computer Science" by Ben-Ari, which is recommended for further reading on classical two-valued logic.
The study of logic was begun by the ancient Greeks, who considered philosophy and rhetoric extremely important in their culture. Logic was a way to define rules of deduction so that if everybody started with the same premises and followed the same logical rules, the derived conclusions would also be the same.
Words such as axiom, theorem and syllogism were used by the Greeks and are still in use to this day. The famous rule of syllogism:
1. All men are mortal.
2. X is a man.
3. Therefore, X is mortal.
If the first two sentences (the premises) are true, the Law of Syllogism assures that the third sentence is true for every X. We can, for example, use X = Socrates and deduce that Socrates is mortal.
There are, however, some problems: natural language is too imprecise a notation. We can claim false statements to be true, or claim that a statement is true even though its truth does not follow from the premises.
Even if you use the logic carefully, paradoxes can arise. One famous paradox is
”the Liar’s Paradox”:
Epimenides who was a Cretan (a person from the island of Crete) was
heard to say; ”All Cretans are liars.”
Now if, in saying this, Epimenides is telling the truth, then he must be a liar (by virtue of being a Cretan). But if he is lying, then by his statement he is telling the truth. It is hard to see how he can be both lying and telling the truth simultaneously, and a vicious circle results.
Another way of seeing the puzzle is to consider the following statement: "This statement is false." The same vicious circle arises. If the statement is true, then it is the case that it is false. But if false, then it must actually be true...
Mathematicians have studied logic in order to formalise the concept of mathematical proof. Hilbert (1862-1943) tried to find a 'proof theory' that would be a direct check of the consistency of mathematics. He wanted to find a system where, among other things, if a statement is in fact true, there is a proof somewhere out there just waiting to be discovered.
The mathematician Gödel spoiled Hilbert's dream when he showed that there are true statements of arithmetic that are not provable.
In these notes three types of logic are presented: the propositional calculus, the predicate calculus, and fuzzy logic. It should be noted that other logics exist as well, for example intuitionistic logic, temporal logic, and modal logic; for more information about these, the reader is referred to the literature.
x | ¬x
--+---
1 |  0
0 |  1

x y | x∨y x∧y x→y x≡y x⊕y x←y x↑y x↓y
----+--------------------------------
1 1 |  1   1   1   1   0   1   0   0
1 0 |  1   0   0   0   1   1   1   0
0 1 |  1   0   1   0   1   0   1   0
0 0 |  0   0   1   1   0   1   1   1
The set of operators is highly redundant; in fact, only negation together with one of the first six operators is needed to express all the other operators. The last two (nand and nor) are each by themselves sufficient to define all other operators. Which set of operators to use is highly dependent on the application. In mathematics the implication is crucial, but in electronics nand and nor are more important.
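The claim that nand alone suffices can be checked mechanically; a small sketch (Python is used for illustration throughout these examples, the notes themselves prescribe no programming language):

```python
def nand(x, y):
    # x ↑ y: 0 only when both inputs are 1
    return 0 if (x == 1 and y == 1) else 1

def neg(x):        return nand(x, x)                 # ¬x
def conj(x, y):    return neg(nand(x, y))            # x ∧ y
def disj(x, y):    return nand(neg(x), neg(y))       # x ∨ y (de Morgan)
def implies(x, y): return nand(x, neg(y))            # x → y = ¬(x ∧ ¬y)

# check every derived operator against the truth table
for x in (0, 1):
    for y in (0, 1):
        assert conj(x, y) == (x & y)
        assert disj(x, y) == (x | y)
        assert implies(x, y) == (0 if (x, y) == (1, 0) else 1)
```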
The grammar for building a formula is as follows: every atom in a set P is a formula, and if A and B are formulae, then so are ¬A and A op B for each of the binary operators above; parentheses may be placed around any formula. The set P is called the set of atomic propositions, or atoms. As in arithmetic, the operators have an order of precedence, and parentheses can be used to make the precedence clear, or to change it if needed, in the same way as in arithmetic. The order of precedence (from high to low) is ¬, ∧, ∨, →, ←, ≡.
v(P ∧ Q) = 0
v(¬(P ∧ Q)) = 1
Definition A propositional formula A is valid (also called a tautology) if, for every interpretation, the truth-value is 1. Notation: |= A. If the truth-value is 0 in some interpretation, the formula is not valid, or falsifiable.
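Since a formula over finitely many atoms has only finitely many interpretations, validity can be decided by enumeration; a brute-force sketch (Python for illustration; representing formulas as functions is an assumption of this sketch, not the notation of the text):

```python
from itertools import product

def is_valid(formula, atoms):
    """Decide |= A by checking every interpretation.
    `formula` maps an interpretation {atom: 0/1} to a truth-value 0/1."""
    return all(formula(dict(zip(atoms, values)))
               for values in product((0, 1), repeat=len(atoms)))

# ¬(P ∧ Q) ∨ (P ∧ Q) is valid; P ∧ Q is falsifiable
excluded_middle = lambda v: int(not (v["P"] and v["Q"]) or (v["P"] and v["Q"]))
conjunction = lambda v: int(bool(v["P"] and v["Q"]))
assert is_valid(excluded_middle, ["P", "Q"])
assert not is_valid(conjunction, ["P", "Q"])
```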
2.2 Predicate Calculus
1. All men are mortal.
2. Socrates is a man.
3. Therefore, Socrates is mortal.
In the propositional calculus we would use three atoms: p1 = All men are mortal.,
p2 = Socrates is a man., and p3 = Socrates is mortal.. The syllogism would be
written as a formula like this: p1 ∧ p2 → p3 . Even though we intuitively know that
the implication is valid, we are unable to show it.
In predicate calculus we can use the predicates man(x) and mortal(x) and the
three formulas would be written as
1. ∀x(man(x) → mortal(x))
2. man(Socrates)
∀ is the universal quantifier, read "for all". ∃ is the existential quantifier, read "there exists". Quantifiers have the same precedence as negation. As an example of the existential quantifier, consider the sentence "There exist birds that cannot fly". If we use the predicates bird(x) and can fly(x), the corresponding formula in predicate calculus is:

∃x(bird(x) ∧ ¬can fly(x))
• If f is an n-ary function symbol and t1 , ..., tn are terms, then f (t1 , ..., tn ) is a
term.
• If p is an n-ary predicate symbol and t1 , ..., tn are terms, then p(t1 , ..., tn ) is a
formula (also called an atom).
Definition For a quantified formula such as ∃xA, x is called the bound variable and A is the scope of the quantified variable. If every variable in a formula is bound, the formula is said to be closed. The universal closure is obtained by binding every free variable in a formula with the universal quantifier. If the existential quantifier is used instead, it is called an existential closure.
As mentioned before, the symbols in the predicate calculus do not have a special meaning. Before we can define an interpretation for a predicate formula, we have to introduce the concept of substitution.
Definition Let A be a formula, x a variable and a a constant. A[x ← a], the
substitution of a for x, is defined inductively as follows:
such that an n-ary relation is assigned to each n-ary predicate symbol, an n-ary
function is assigned to each n-ary function symbol and a domain element is assigned
to each constant symbol.
is in CNF while
(¬p(x) ∨ q(f (x), y) ∨ r(z)) ∧ ((p(x) ∧ ¬q(f (x), y)) ∨ r(z)) ∧ (¬r(z))
is not in CNF because of the conjunction embedded within the second disjunction.
Definition A formula is in prenex conjunctive normal form (PCNF) iff it is of the form

Q1 x1 . . . Qn xn M

where the Qi are quantifiers and M is a formula in CNF with no quantifiers. The sequence of quantifiers is called the prefix and M is called the matrix.
Definition A formula is in clausal form if it is in PCNF and the prefix consists only
of universal quantifiers.
Example.
∀x∀y((p(f (x)) ∨ ¬p(x) ∨ ¬p(y)) ∧ (p(a) ∨ ¬p(y)))
1. A = ∀x∃yp(x, y)
2. A = ∃yp(y)
In the first case there is at least one universal quantifier to the left of the existential quantifier, and in the second there is no universal quantifier to the left. Fortunately, Skolem showed that there is a way to find a formula A′ in clausal form such that A is satisfiable iff A′ is satisfiable.
In the first example there exists a y for every x such that p(x, y) is true. Then there exists a new function f, y = f(x), that produces these values, and A′ = ∀xp(x, f(x)).
In the second example it is said that there exists a y such that p(y) is true. We choose a new constant c (or, more correctly, a 0-ary function symbol c) that maps to the correct y.
How to transform a formula into clausal form:

1. Give all bound variables unique names.

2. Eliminate all connectives except ¬, ∧, and ∨.

3. Push the negations inward using de Morgan's laws:

¬(A ∧ B) ↔ (¬A ∨ ¬B)
¬(A ∨ B) ↔ (¬A ∧ ¬B)

When pushing a negation through a quantifier, use the equivalences:

¬∀xA(x) ↔ ∃x¬A(x)
¬∃xA(x) ↔ ∀x¬A(x)

4. Extract all quantifiers from the matrix. Choose a quantifier that is not in the scope of any other quantifier in the matrix and extract it using the following rules (Q is either ∀ or ∃, and op is either ∨ or ∧):

A op QxB(x) ↔ Qx(A op B(x))
QxB(x) op A ↔ Qx(B(x) op A)

Repeat until no quantifiers are left in the matrix.

5. Use the distributive laws to transform the matrix into CNF:

(A ∨ (B ∧ C)) ↔ (A ∨ B) ∧ (A ∨ C)
(A ∧ (B ∨ C)) ↔ (A ∧ B) ∨ (A ∧ C)

6. Use Skolem functions to eliminate the existential quantifiers.
Example. Consider the formula ∃x∀yp(x, y) → ∀y∃xp(x, y). We transform it into clausal form step by step:

Rename bound variables: ∃x∀yp(x, y) → ∀w∃zp(z, w)
Eliminate connectives: ¬∃x∀yp(x, y) ∨ ∀w∃zp(z, w)
Push negations inward: ∀x∃y¬p(x, y) ∨ ∀w∃zp(z, w)
Extract quantifiers: ∀x∃y∀w∃z(¬p(x, y) ∨ p(z, w))
Remove existential quantifiers: ∀x∀w(¬p(x, f (x)) ∨ p(g(x, w), w))
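Step 3 of the transformation (pushing negations inward) can be sketched as follows, representing formulas as nested tuples — a representation chosen here for illustration, not one used in the text:

```python
def push_neg(f):
    """Step 3: push negations inward. Formulas are nested tuples:
    ('not', A), ('and', A, B), ('or', A, B),
    ('forall', x, A), ('exists', x, A), or an atom (a string).
    Assumes step 2 has already removed all other connectives."""
    if isinstance(f, str):
        return f
    if f[0] == 'not':
        g = f[1]
        if isinstance(g, str):
            return f                                   # literal, done
        if g[0] == 'not':
            return push_neg(g[1])                      # ¬¬A -> A
        if g[0] == 'and':                              # de Morgan
            return ('or', push_neg(('not', g[1])), push_neg(('not', g[2])))
        if g[0] == 'or':
            return ('and', push_neg(('not', g[1])), push_neg(('not', g[2])))
        if g[0] == 'forall':                           # ¬∀x A -> ∃x ¬A
            return ('exists', g[1], push_neg(('not', g[2])))
        if g[0] == 'exists':                           # ¬∃x A -> ∀x ¬A
            return ('forall', g[1], push_neg(('not', g[2])))
    if f[0] in ('and', 'or'):
        return (f[0], push_neg(f[1]), push_neg(f[2]))
    return (f[0], f[1], push_neg(f[2]))                # quantified formula

# ¬∀y p(x,y) becomes ∃y ¬p(x,y), as in the example above
assert push_neg(('not', ('forall', 'y', 'p(x,y)'))) == \
       ('exists', 'y', ('not', 'p(x,y)'))
```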
Chapter 3
Logic Programming
In this chapter, logic programming is only given a short overview. For a complete description, see [Lloyd 87].
A rule is of the form A ← B1 , . . . , Bn , where A and the Bi are predicates with arity ≥ 0. The meaning of the rule is as follows: for each assignment of each variable, if B1 , . . . , Bn are all true, then A is true. A is called the head and B1 , . . . , Bn the body of the rule. A rule is also called a definite program clause.
Example Suppose we have the following logic program (variables start with capital letters):

loves(X,Y) ← mother(X), child_of(Y,X)
mother(mary) ←
child_of(tom,mary) ←

If this program is given the query ← loves(Person,Who), the answer is YES, Person = mary, Who = tom.
A substitution is a finite set of the form

{x1 ← t1 , . . . , xn ← tn }

where each xi is a distinct variable and each ti is a term which is not identical to the corresponding variable xi .
Let E be a clause and θ = {x1 ← t1 , . . . , xn ← tn } a substitution. An instance
Eθ of E is obtained by simultaneously replacing each occurrence of xi in E by ti .
Example
E = p(x) ← q(y)
θ = {x ← y, y ← a}
Eθ = p(y) ← q(a)

We can see that the word simultaneously in the definition implies that we should not substitute y for x and then a for y.
3.2.2 Unification
With the help of substitutions two atoms can be made equal, i.e. we can obtain
clashing ground clauses.
1. Let k = 0, Wk = W, σk = ε
5. Let k = k + 1. Go to step 2.
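Although the full unification algorithm is not reproduced above, the idea can be sketched compactly (Python for illustration; terms are represented as strings and (functor, argument-list) tuples, an assumption of this sketch, and the occurs check is omitted):

```python
def is_var(t):
    # Prolog convention, as in the example program above:
    # variables start with a capital letter
    return isinstance(t, str) and t[0].isupper()

def walk(term, subst):
    # follow variable bindings already made
    while is_var(term) and term in subst:
        term = subst[term]
    return term

def unify(s, t, subst=None):
    """Return a most general unifier {variable: term}, or None on a
    clash. Terms are strings (variables or constants) or
    (functor, [arguments]) tuples. The occurs check is omitted."""
    if subst is None:
        subst = {}
    s, t = walk(s, subst), walk(t, subst)
    if s == t:
        return subst
    if is_var(s):
        return {**subst, s: t}
    if is_var(t):
        return {**subst, t: s}
    if isinstance(s, str) or isinstance(t, str):
        return None                       # two distinct constants clash
    (f, f_args), (g, g_args) = s, t
    if f != g or len(f_args) != len(g_args):
        return None                       # functor or arity clash
    for a, b in zip(f_args, g_args):
        subst = unify(a, b, subst)
        if subst is None:
            return None
    return subst

# unifying p(X, f(Y)) with p(a, f(b)) gives {X: a, Y: b}
assert unify(("p", ["X", ("f", ["Y"])]),
             ("p", ["a", ("f", ["b"])])) == {"X": "a", "Y": "b"}
```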
Chapter 4

Many-Valued Logic
µA : X → {0, 1}.
”x is in A”.
This function can also be interpreted as a relation consisting of ordered pairs (x, µA (x)).
As a first step, a fuzzy set can be seen as an extension of characteristic functions, i.e. a fuzzy set µ can be defined mathematically by assigning to each possible individual in the universe of discourse a value, µ(x), representing its grade of membership in the fuzzy set µ. This grade corresponds to the degree to which that individual is similar to, or compatible with, the concept represented by the fuzzy set. Often a fuzzy subset A of X is given as a membership function

µA : X → I,

interpreted as the degree of truth of

"x is in A"

or even

"x is A"

if it is desirable to speak of compatibility with the concept A rather than membership.
If X = {x1 , . . . , xn } is a finite set and A a fuzzy subset of X, a more relational notation for A would be

A = µA (x1 )/x1 + · · · + µA (xn )/xn ,

where µA (xi )/xi contains the respective grades of membership and + should be seen as a union. This notation is very informal and certainly not algebraic in any sense.
Example. Suppose we want to define a fuzzy set of natural numbers ”close to 4”
(see Figure 4.1). This can be given e.g. as
A = {(1, 0.0), (2, 0.2), (3, 0.6), (4, 1.0), (5, 0.6), (6, 0.2), (7, 0.0)}.
The above definition of a fuzzy set is a typical situation where the relational style
is convenient.
Example. A fuzzy set A defining ”normal room temperature” can be given as
µA (x) =
  0,                x < 16°C
  (x − 16°C)/2°C,   16°C ≤ x < 18°C
  1,                18°C ≤ x ≤ 22°C
  (24°C − x)/2°C,   22°C < x ≤ 24°C
  0,                x > 24°C.
As can be seen, temperatures below 16◦ C or above 24◦ C are not to any degree
considered to be normal.
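The piecewise definition above translates directly into code; a small sketch (Python used for illustration only):

```python
def normal_room_temp(x):
    """Grade of membership in 'normal room temperature' (x in °C),
    following the piecewise definition above."""
    if x < 16 or x > 24:
        return 0.0
    if x < 18:
        return (x - 16) / 2
    if x <= 22:
        return 1.0
    return (24 - x) / 2

assert normal_room_temp(15) == 0.0   # not normal to any degree
assert normal_room_temp(17) == 0.5
assert normal_room_temp(20) == 1.0
assert normal_room_temp(23) == 0.5
```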
Example. A fuzzy set B defining ”high moisture rates” can be given as
µB (x) =
  0,               x < 30%
  (x − 30%)/20%,   30% ≤ x ≤ 50%
  1,               x > 50%.
Correspondingly, we have functions with open left shoulders, L : X → [0, 1], defined
by
L(x; α, β) =
  1,                 x < α
  (β − x)/(β − α),   α ≤ x ≤ β
  0,                 x > β.
[Figure: the Γ- and L-functions with parameters α and β, and the triangular Λ-function with parameters α, β and γ.]
[Figure: a partition of the interval [−40, 40] into the linguistic labels NB, NS, ZO, PS and PB.]
[Figure: the trapezoidal Π-function with parameters α, β, γ and δ.]
The level between β and γ is sometimes called a plateau (see Figure 4.8).
From an implementation point of view, it is obviously advantageous to consider Γ, L and Λ as special cases of Π. This is possible if the universe of discourse is a bounded interval, e.g. [−10, 10].
G(x; α, β) = e^(−β(x−α)²),
where α is the midpoint and β reflects the slope value. Note that β must be positive,
and that the function never reaches zero.
The Gaußian function can also be extended to have different left and right slopes.
We then have three parameters in
G(x; α, βl , βr ) =
  e^(−βl (x−α)²),   x ≤ α
  e^(−βr (x−α)²),   x > α,
σ(x; α, β) = 1 / (1 + e^(−β(x−α))),
where α is the midpoint and β determines the slope at the inflexion point; note that β is 4 times the derivative value at the inflexion point. As before, β must be positive. This S-function reaches neither 0 nor 1.
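Sketches of the Gaußian and sigmoidal membership functions defined above (Python for illustration; parameter names follow the text):

```python
import math

def gauss(x, alpha, beta):
    """Gaußian membership function: midpoint alpha, slope parameter
    beta > 0. The value never reaches zero (up to float underflow)."""
    return math.exp(-beta * (x - alpha) ** 2)

def sigmoid(x, alpha, beta):
    """S-shaped membership function: alpha is the inflexion point,
    where the value is 0.5 and the derivative is beta/4."""
    return 1.0 / (1.0 + math.exp(-beta * (x - alpha)))

assert gauss(2.0, 2.0, 0.5) == 1.0                # value 1 at the midpoint
assert abs(sigmoid(2.0, 2.0, 4.0) - 0.5) < 1e-12  # value 0.5 at alpha
assert 0.0 < gauss(10.0, 2.0, 0.5) < 1.0          # small but positive
```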
The purpose of this section is to introduce many-valued logic from a more intuitive and informal point of view, as compared to a strongly algebraically developed theory of many-valued logic. Generally speaking, we will stay more on the syntactic side, rather than diving deeply into semantics. Keep in mind our purpose of presenting many-valued logic as a method and technique to support application development. Furthermore, our general viewpoint is that success in applications should guide the search for "the best" understanding of the foundations.
The most commonly used connectives are

¬a = 1 − a
a ∧ b = min{a, b}
a ∨ b = max{a, b}

The intuition for these is obvious, as they clearly reflect worst- and best-case characterisations. However, this is also a disadvantage, since it means that an outcome might remain unchanged even if we modify some value; e.g. min{0.7, 0.5} is the same as min{0.8, 0.5}. If we desire any change in a or b to be effective, we can use e.g. the well-known product connectives

a ∧ b = a · b
a ∨ b = a + b − a · b

or the Łukasiewicz connectives

a ∧ b = max{0, a + b − 1}
a ∨ b = min{a + b, 1}
In the above-mentioned connectives, the outcome depends only on a and b, i.e. there are no additional parameters. Adding parameters, however, introduces several interesting and useful classes of connectives.
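The parameter-free connective families above can be compared directly in code; a small sketch (Python for illustration):

```python
def t_min(a, b):  return min(a, b)
def t_prod(a, b): return a * b
def t_luk(a, b):  return max(0.0, a + b - 1.0)   # Łukasiewicz conjunction
def s_max(a, b):  return max(a, b)
def s_prod(a, b): return a + b - a * b
def s_luk(a, b):  return min(a + b, 1.0)         # Łukasiewicz disjunction

# min/max can be insensitive to a change in one argument ...
assert t_min(0.7, 0.5) == t_min(0.8, 0.5)
# ... while the product connective always reacts
assert t_prod(0.7, 0.5) != t_prod(0.8, 0.5)
assert abs(t_luk(0.7, 0.5) - 0.2) < 1e-12
assert s_luk(0.7, 0.5) == 1.0
```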
[Yager 80]:
[Hamacher 75]:
With a = 0.6, b = 0.9 and c = 0.8, these connectives give:

Definition    a ∧ b                                               (a ∧ b) ∨ ¬c
Łukasiewicz   max{0, 0.6 + 0.9 − 1} = 0.5                         min{0.5 + (1 − 0.8), 1} = 0.7
Yager         1 − min{[(1 − 0.6)² + (1 − 0.9)²]^(1/2), 1} ≈ 0.59  min{[0.59² + (1 − 0.8)²]^(1/2), 1} ≈ 0.62
Hamacher      (0.6 · 0.9)/(2 + (1 − 2)(0.6 + 0.9 − 0.6 · 0.9)) ≈ 0.52
              (0.52 + (1 − 0.8) − (2 − 2) · 0.52 · (1 − 0.8))/(1 − (1 − 2) · 0.52 · (1 − 0.8)) ≈ 0.65
maximum operator for disjunction, the union of A and B is then a fuzzy set µA∪B given by

µA∪B (x) = max{µA (x), µB (x)}.

Similarly, the intersection µA∩B is given by

µA∩B (x) = min{µA (x), µB (x)}.
Example. Let A and B be discrete fuzzy subsets of X = {−3, −2, −1, 0, 1, 2, 3}. If
A = {(−3, 0.0), (−2, 0.3), (−1, 0.6), (0, 1.0), (1, 0.6), (2, 0.3), (3, 0.0)}
and
B = {(−3, 1.0), (−2, 0.5), (−1, 0.2), (0, 0.0), (1, 0.2), (2, 0.5), (3, 1.0)}
then
A ∧ B = {(−3, 0.0), (−2, 0.3), (−1, 0.2), (0, 0.0), (1, 0.2), (2, 0.3), (3, 0.0)}
and
A ∨ B = {(−3, 1.0), (−2, 0.5), (−1, 0.6), (0, 1.0), (1, 0.6), (2, 0.5), (3, 1.0)}.
¬A = {(−3, 1.0), (−2, 0.7), (−1, 0.4), (0, 0.0), (1, 0.4), (2, 0.7), (3, 1.0)}
and
¬B = {(−3, 0.0), (−2, 0.5), (−1, 0.8), (0, 1.0), (1, 0.8), (2, 0.5), (3, 0.0)}.
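The discrete example above can be reproduced mechanically; a sketch (Python for illustration) that verifies the listed grades:

```python
X = [-3, -2, -1, 0, 1, 2, 3]
A = {-3: 0.0, -2: 0.3, -1: 0.6, 0: 1.0, 1: 0.6, 2: 0.3, 3: 0.0}
B = {-3: 1.0, -2: 0.5, -1: 0.2, 0: 0.0, 1: 0.2, 2: 0.5, 3: 1.0}

A_and_B = {x: min(A[x], B[x]) for x in X}     # intersection
A_or_B  = {x: max(A[x], B[x]) for x in X}     # union
not_A   = {x: 1.0 - A[x] for x in X}          # complement

assert A_and_B == {-3: 0.0, -2: 0.3, -1: 0.2, 0: 0.0,
                   1: 0.2, 2: 0.3, 3: 0.0}
assert A_or_B  == {-3: 1.0, -2: 0.5, -1: 0.6, 0: 1.0,
                   1: 0.6, 2: 0.5, 3: 1.0}
# complement, up to floating-point rounding
expected = {-3: 1.0, -2: 0.7, -1: 0.4, 0: 0.0, 1: 0.4, 2: 0.7, 3: 1.0}
assert all(abs(not_A[x] - expected[x]) < 1e-9 for x in X)
```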
[Figure: the fuzzy sets A and B, and their union A ∨ B, shown over the universe {−3, . . . , 3}.]
f :X→Y
f : PX → PY
Strictly speaking, we should use another notation for the second f, e.g. Pf, since the mappings are not the same. However, as there is usually no confusion, it is very common to use f in both situations.
Viewing this extension in the fuzzy domain, we have the following obvious question: given a fuzzy subset µA of X, what is the corresponding image f (µA ) as a fuzzy subset of Y ? A natural approach is to require that f (µA )(f (x)) = µA (x), but in such cases we also need to define f (µA )(f (x)) in situations where there might be other points x′ for which f (x) = f (x′ ) with µA (x) ≠ µA (x′ ). Typically, in such a situation we take "the best we have", i.e. f (µA )(f (x)) would be the largest value (or the supremum in an infinite case) of all µA (x′ ) for which f (x) = f (x′ ). We also need to specify the values of f (µA )(y) where we cannot find any x such that f (x) = y. In such a situation it is natural to require that f (µA )(y) = 0.
To summarize, the Extension Principle states that

f (µA )(y) =
  ∨_{f(x)=y} µA (x),   if {x ∈ X | f (x) = y} ≠ ∅,
  0,                   otherwise.
The extension to n-ary functions

f : Xⁿ → Y

is similar: we have to define f (µA1 , . . . , µAn )(f (x)), x = (x1 , . . . , xn ), given that we have µA1 (x1 ), . . . , µAn (xn ). Now we need to consider worst cases w.r.t. the µAi (xi ), as the combination is more of a conjunction. But the Extension Principle remains basically the same, i.e.

f (µA1 , . . . , µAn )(y) =
  ∨_{f(x)=y} min_i {µAi (xi )},   if {x ∈ Xⁿ | f (x) = y} ≠ ∅,
  0,                              otherwise.
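For a discrete fuzzy set, the Extension Principle can be sketched directly (Python for illustration; the example fuzzy set is made up):

```python
def extend(f, mu_A):
    """Image of a discrete fuzzy set under f by the Extension
    Principle: f(mu_A)(y) is the largest mu_A(x) with f(x) = y;
    points y with no preimage are simply absent (grade 0)."""
    image = {}
    for x, grade in mu_A.items():
        y = f(x)
        image[y] = max(image.get(y, 0.0), grade)
    return image

# f(x) = x^2 merges x and -x; the larger grade survives
mu_A = {-2: 0.2, -1: 0.6, 0: 1.0, 1: 0.9, 2: 0.3}
assert extend(lambda x: x * x, mu_A) == {4: 0.3, 1: 0.9, 0: 1.0}
```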
+:R×R→R
The Cartesian product of fuzzy sets A1 , . . . , An can be given by

µA1 ×···×An (u1 , . . . , un ) = min{µA1 (u1 ), . . . , µAn (un )}

or

µA1 ×···×An (u1 , u2 , . . . , un ) = µA1 (u1 ) · µA2 (u2 ) · · · µAn (un ).
R ◦ S = {[(u, w), sup_v (µR (u, v) ∗ µS (v, w))] | u ∈ U, v ∈ V, w ∈ W },

where ∗ can be any t-norm, e.g. minimum, algebraic product, bounded product or drastic product.
In a general form, a compositional operator may be expressed as a sup-star composition, where "star" denotes an operator, e.g. min, product, etc. In the literature, four kinds of compositional operators are used in the compositional rule of inference:
• Sup-min operation,
• Sup-product operation,
• Sup-bounded-product operation,
• Sup-drastic-product operation.
In FLC applications, the sup-min and sup-product compositional operators are the most frequently used.
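On finite universes, the sup-min composition is simply a max-min "matrix product"; a sketch (Python for illustration, with made-up relation matrices):

```python
def sup_min(R, S):
    """Sup-min (here: max-min) composition of relation matrices
    R (|U| x |V|) and S (|V| x |W|) on finite universes."""
    return [[max(min(R[i][k], S[k][j]) for k in range(len(S)))
             for j in range(len(S[0]))]
            for i in range(len(R))]

R = [[0.3, 0.9],
     [1.0, 0.4]]
S = [[0.5, 0.2],
     [0.8, 1.0]]
assert sup_min(R, S) == [[0.8, 0.9],
                         [0.5, 0.4]]
```

Replacing min by a product in the inner expression gives the sup-product composition instead.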
The fuzzy rule in premise 2 above can be put into the simpler form "A × B → C". Intuitively, this fuzzy rule can be transformed into a ternary fuzzy relation R, specified by the following membership function:

µR (x, y, z) = µ(A×B)×C (x, y, z) = µA (x) ∧ µB (y) ∧ µC (z). (4.1)
The resulting C′ is expressed as

C′ = A′ × B′ ◦ (A × B → C). (4.2)

Thus
The interpretation of multiple rules is usually taken as the union of the fuzzy relations corresponding to the fuzzy rules. For example, given the following fact and rules:

premise 1 (fact): x is A′ and y is B′,
premise 2 (rule 1): if x is A1 and y is B1 then z is C1 ,
premise 3 (rule 2): if x is A2 and y is B2 then z is C2 ,
consequence: z is C′,

we can use the fuzzy reasoning shown in Figure 4.14 as an inference procedure to derive the resulting output fuzzy set C′.
Figure 4.14: Fuzzy reasoning for multiple rules with multiple antecedents.
where C1′ and C2′ are the inferred fuzzy sets for rules 1 and 2, respectively. Figure 4.14 shows graphically the operation of fuzzy reasoning for multiple rules with multiple antecedents. Suppose a fuzzy rule base consists of a collection of fuzzy if...then rules of the following form:
Ri (x1 , x2 , . . . , xm , y) = Ai1 (x1 ) × Ai2 (x2 ) × · · · × Aim (xm ) → Ci (y) (4.6)
We could combine the rules by an aggregation operator Agg into one rule, which is used to obtain C′ from A′. If Agg is the intersection, we have

R(x, y) = ⋂_{i=1}^{n} Ri (x, y) = min_i (Ai1 (x1 ) × Ai2 (x2 ) × · · · × Aim (xm ) → Ci (y)),

and if Agg is the union, we have

R(x, y) = ⋃_{i=1}^{n} Ri (x, y) = max_i (Ai1 (x1 ) × Ai2 (x2 ) × · · · × Aim (xm ) → Ci (y)).
µC′ (y) = ∨_{i=1}^{n} {[µA′_{i1} (x1 ) ∧ µA_{i1} (x1 )] ∧ · · · ∧ [µA′_{im} (xm ) ∧ µA_{im} (xm )] ∧ µC_i (y)}
        = ∨_{i=1}^{n} {∧_{j=1}^{m} [µA′_{ij} (xj ) ∧ µA_{ij} (xj )]} ∧ µC_i (y)
        = ∨_{i=1}^{n} {τ_i ∧ µC_i (y)}
Figure 4.15 shows graphically the operation of fuzzy reasoning for MISO.
[Figure 4.15: for each rule "If x1 is Ai1 and . . . and xm is Aim then y is Ci", the firing strength τi = µAi1 (x1 ) ∧ · · · ∧ µAim (xm ) clips the consequent to τi ∧ µCi (y).]
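For crisp (singleton) inputs, the formula above reduces to Mamdani-style min-max inference; a minimal sketch (Python for illustration; the rules and membership functions below are made-up examples, not from the text):

```python
def miso_infer(rules, inputs, y_grid):
    """Mamdani-style min-max inference for crisp inputs. Each rule is
    (list of antecedent membership functions, consequent membership
    function); returns the output fuzzy set over y_grid as a dict."""
    out = {y: 0.0 for y in y_grid}
    for antecedents, consequent in rules:
        # firing strength: min over the antecedent grades
        tau = min(mf(x) for mf, x in zip(antecedents, inputs))
        for y in y_grid:
            # clip the consequent by tau, combine rules by max
            out[y] = max(out[y], min(tau, consequent(y)))
    return out

# two made-up rules over [0, 1] with toy membership functions
low  = lambda v: max(0.0, 1.0 - v)
high = lambda v: min(1.0, v)
rules = [([low, low], low), ([high, high], high)]
out = miso_infer(rules, inputs=[0.8, 0.6], y_grid=[0.0, 0.5, 1.0])
assert out[1.0] == 0.6   # second rule fires with strength 0.6
```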
From the previous definitions we see that there are six interpretations (1-6) of fuzzy implication, and in each interpretation we may employ different t-norms or t-conorms. Therefore, a fuzzy if...then rule (4.5) can be interpreted in a number of ways, and the output of the fuzzy inference mechanism can take different forms. For these different types of outputs, we can use different defuzzifiers to map them into a single point in the output space V.
4.5 Approximate Reasoning
(5) I(p, q) ≥ q
(10) I is continuous.
The following table gives the most usual implications, which class they belong to
and which properties are satisfied [Dubois, Prade 91, Dubois, Lang, Prade 91].
Let p → q represent the rule if p then q, where p can be of the form p1 and . . . and pn . We can then say that:
(i) the truth value connected to p, τ (p), is of the form τ (p) = τ (p1 ) ∗ . . . ∗ τ (pn ), where ∗ is a t-norm;
Definition Let Γ be a fuzzy set of axioms. The mapping Γ |=: L → [0, 1] is given
by Γ |= P = inf{Υ(P ) | Υ is a valuation w.r.t. Γ}, where inf ∅ = 1.
Chapter 5

Summary and Exercises

EXERCISES
Suppose that their intersection and union are defined by Hamacher's t-norm and t-conorm with γ = 1, respectively. What are then the membership functions of A ∩ B and A ∪ B?
II.3 Show that Yager's ∧ and ∨ are, respectively, a t-norm and a t-conorm.
II.4 Prove that for any t-norm T and any co-t-norm S we have
II.5 Let µ1 , µ2 ∈ Fc (R) (the class of all upper semicontinuous fuzzy sets of R). With the help of the extension principle, we lay down the following for the sum µ1 ⊕ µ2 and the product µ1 ⊙ µ2 :
II.6 Let f (x) = x2 and let A ∈ F be a symmetric triangular fuzzy number with
membership function
A(x) = 1 − |a − x|/α   if |a − x| ≤ α,
A(x) = 0               otherwise.
Then use the extension principle to calculate the membership function of fuzzy set
f (A).
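As a quick numerical check of this exercise: for f (x) = x², the extension principle gives f (A)(y) = sup{A(x) | x² = y} = max(A(√y), A(−√y)) for y ≥ 0, and 0 for y < 0. A sketch (the concrete numbers a = 2, α = 1 are just an illustration):

```python
import math

def triangular(a, alpha):
    """Symmetric triangular fuzzy number A(x) = 1 - |a - x|/alpha on [a-alpha, a+alpha]."""
    def A(x):
        return max(0.0, 1.0 - abs(a - x) / alpha)
    return A

def extend_square(A):
    """Image of A under f(x) = x^2 via the extension principle:
    f(A)(y) = sup{ A(x) : x^2 = y } = max(A(sqrt(y)), A(-sqrt(y))) for y >= 0."""
    def fA(y):
        if y < 0:
            return 0.0          # x^2 = y has no real solution
        r = math.sqrt(y)
        return max(A(r), A(-r))
    return fA

A = triangular(a=2.0, alpha=1.0)   # peak at 2, support [1, 3]
fA = extend_square(A)
print(fA(4.0))   # A(2) = 1, so membership 1 at y = 4
print(fA(1.0))   # A(1) = 0 and A(-1) = 0
```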
II.8 Consider two fuzzy relations R = "x is considerably smaller than y" and G = "x is very close to y":

          y1    y2    y3    y4                  y1    y2    y3    y4
    x1   0.5   0.1   0.1   0.7           x1   0.4   0     0.9   0.6
R = x2   0     0.8   0     0         G = x2   0.9   0.4   0.5   0.7
    x3   0.9   1     0.7   0.8           x3   0.3   0     0.8   0.5
1. What is the intersection of R and G, i.e. the relation "x is considerably smaller than y AND x is very close to y"?
2. What is the union of R and G, i.e. the relation "x is considerably smaller than y OR x is very close to y"?
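With the standard min/max connectives (the exercise does not prescribe other operators here), intersection and union of two relations on the same product space are computed elementwise. A sketch using the matrices of this exercise:

```python
# Relations from exercise II.8, rows x1..x3, columns y1..y4.
R = [[0.5, 0.1, 0.1, 0.7],
     [0.0, 0.8, 0.0, 0.0],
     [0.9, 1.0, 0.7, 0.8]]
G = [[0.4, 0.0, 0.9, 0.6],
     [0.9, 0.4, 0.5, 0.7],
     [0.3, 0.0, 0.8, 0.5]]

# Intersection: elementwise minimum; union: elementwise maximum.
intersection = [[min(r, g) for r, g in zip(rr, gg)] for rr, gg in zip(R, G)]
union        = [[max(r, g) for r, g in zip(rr, gg)] for rr, gg in zip(R, G)]

print(intersection[0])  # [0.4, 0.0, 0.1, 0.6]
print(union[0])         # [0.5, 0.1, 0.9, 0.7]
```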
II.9 Consider two fuzzy relations R = "x is considerably smaller than y" and G = "y is very close to z":

          y1    y2    y3    y4                  z1    z2    z3
    x1   0.5   0.1   0.1   0.7           y1   0.4   0.9   0.3
R = x2   0     0.8   0     0         G = y2   0     0.4   0
    x3   0.9   1     0.7   0.8           y3   0.9   0.5   0.8
                                         y4   0.6   0.7   0.5
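Since R lives on X × Y and G on Y × Z, the natural operation on them is the sup-min (max-min) composition; the question text itself is cut off in these notes, so the following sketch simply shows that computation:

```python
R = [[0.5, 0.1, 0.1, 0.7],
     [0.0, 0.8, 0.0, 0.0],
     [0.9, 1.0, 0.7, 0.8]]          # x_i rows, y_j columns
G = [[0.4, 0.9, 0.3],
     [0.0, 0.4, 0.0],
     [0.9, 0.5, 0.8],
     [0.6, 0.7, 0.5]]               # y_j rows, z_k columns

def maxmin(R, G):
    """Sup-min composition: (R o G)(x, z) = max_y min(R(x, y), G(y, z))."""
    return [[max(min(R[i][j], G[j][k]) for j in range(len(G)))
             for k in range(len(G[0]))]
            for i in range(len(R))]

print(maxmin(R, G))   # rows x1..x3, columns z1..z3
```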
FUZZY SYSTEMS
Chapter 6
Fuzzy Control
where ⟨fuzzy criteria⟩ and ⟨fuzzy conclusion⟩ either are atomic or compound fuzzy propositions. Such a rule can be seen as a causal relation between measurements and control values of the process. If e and ė are insignals and u̇ an outsignal, and
further NS, PS and NL are linguistic variables, then
IF e is N S AND ė is P S THEN u̇ is N L
or
if the present deviation of the control value is N S and the latest change
in the deviation of the control value is P S then this should cause the
control value to be N L.
Both antecedents and consequents can involve several linguistic variables. If this
is the case, the system is called a multi-input-multi-output (MIMO) fuzzy system.
Such systems have several insignals and outsignals. Also multi-input-single-output
(MISO) systems, with several insignals but only one outsignal are very common.
An example of a MISO system is as follows.
R1 : IF x is A1 AND y is B1 THEN z is C1
R2 : IF x is A2 AND y is B2 THEN z is C2
.. ..
. .
Rn : IF x is An AND y is Bn THEN z is Cn .
Here x and y are insignals and z is the outsignal. Further, Ai , Bi and Ci are linguistic
variables.
• An expert that knows the process provides linguistic rules that are specified
given previous knowledge and know-how related to the process.
• The process is described within a fuzzy model, based on which control rules
can be directly derived. Such methods do not yet exist, and require further
research.
• A fuzzy controller is adaptive in the sense that the rule base together with
parameters in rules (and possibly in inference mechanisms) are adjusted in real-
time given possibilities for the systems to identify itself as being in respectively
good or bad states. Some suggestions of related techniques are found e.g. in
[Procyk and Mamdani 79], [Shao 88] and [Sugeno 85].
Whatever technique we use, our goal is to construct a number of fuzzy rules with
the following syntax:
Ri : IF x1 is Ai1 AND . . . AND xm is Aim THEN u is Ui
Note that this syntax is valid for MISO systems. For rule Ri we have x1 , . . . , xm as
insignals and Ai1 , . . . , Aim as respective linguistic quantifiers of the insignals. The
consequent of the rule is ”u is Ui ”.
Example. The following shows an example of fuzziness used for speed control. As insignals we have the actual speed v (km/h) and the load l (N) of the car. Load components are e.g. force F and friction Fµ . As linguistic variables for speed we use LS (low speed), N S (normal speed) and HS (high speed), and for load similarly LL (low load), N L (normal load) and HL (high load). These are typically bell-shaped functions, and as the midpoint of N S we use 70 km/h, which is also the constant speed we try to maintain. The choice of precise shape and transposition of the membership functions is left open at this point. Low load appears e.g. downhill and high load uphill (see figure 6.1).
[Figure 6.1: the car maintaining speed v under load F + Fµ ; normal load on a flat road. Figure 6.2: the structure of a fuzzy controller: the insignal x passes through fuzzification, the rule base and inference mechanism, and defuzzification, producing the outsignal u.]
[Figure: Mamdani inference with two rules: the antecedent membership functions A11 , A12 and A21 , A22 are evaluated at x1 , x2 , the activation levels α1 , α2 are formed with min, and the output sets U1 , U2 are clipped at these levels.]
This is the most common view of fuzzy control. In Mamdani's method, conjunction is given by the minimum operator, implication likewise, and the resulting output membership functions are combined using the maximum operator as disjunction.
To be more precise, if the activation level is given by
αi = ∧_{j=1}^{m} Aij (xj ),
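A minimal sketch of this activation-and-clipping scheme for two rules over two insignals; the triangular membership functions are made up for illustration and are not the ones in the figures:

```python
def tri(a, b, c):
    """Triangular membership function with support [a, c] and peak at b."""
    def mu(x):
        if x <= a or x >= c:
            return 0.0
        return (x - a) / (b - a) if x <= b else (c - x) / (c - b)
    return mu

# Antecedent membership functions A_ij and output sets U_i (assumed shapes).
A = [[tri(0, 2, 4), tri(0, 3, 6)],       # rule 1: A11, A12
     [tri(2, 4, 6), tri(3, 6, 9)]]       # rule 2: A21, A22
U = [tri(0, 1, 2), tri(1, 2, 3)]         # output sets U1, U2

def mamdani(x, u):
    """Activation alpha_i = min_j A_ij(x_j); output mu_U(u) = max_i min(alpha_i, U_i(u))."""
    alphas = [min(Aij(xj) for Aij, xj in zip(Ai, x)) for Ai in A]
    return max(min(a, Ui(u)) for a, Ui in zip(alphas, U))

print(mamdani([2.0, 3.0], 1.0))
```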
In Takagi-Sugeno control the consequent of each rule is instead a linear function of the insignals, i.e. rules are of the form

Ri : IF x1 is Ai1 AND . . . AND xm is Aim THEN ui = pi0 + pi1 x1 + . . . + pim xm ,
where pi0 , . . . , pim are constants related to rule i. Methods to specify the con-
stants are discussed in [Takagi and Sugeno 85], including also algorithms for se-
lecting insignals related to respective ui . The final control value is given by
u = ( Σ_{i=1}^{n} αi ui ) / ( Σ_{i=1}^{n} αi ).
Example. Given insignals x1 = 6.5 and x2 = 9.2, and linear output functions u1 = 2 + 1.7x1 + 1.3x2 and u2 = −3 + 0.5x1 + 2.1x2 , we obtain u1 ≈ 25.0 and u2 ≈ 19.6. The control value u is then the weighted average of u1 and u2 . Each rule output can also be seen as a fuzzy singleton Ui with Ui (ui ) = αi .
[Figure: Takagi-Sugeno inference with two rules: the activation levels α1 , α2 are formed with min, each rule yields a crisp value ui , and the control value is u = Σ αi ui / Σ αi .]
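The example can be completed in code; the activation levels α1 = 0.6 and α2 = 0.4 are hypothetical, since they are not given in the text:

```python
x1, x2 = 6.5, 9.2

# Linear consequents from the example.
u1 = 2 + 1.7 * x1 + 1.3 * x2       # approx. 25.0
u2 = -3 + 0.5 * x1 + 2.1 * x2      # approx. 19.6

# Assumed activation levels of the two rules.
alpha = [0.6, 0.4]

# Control value: weighted average of the rule outputs.
u = (alpha[0] * u1 + alpha[1] * u2) / (alpha[0] + alpha[1])
print(round(u1, 2), round(u2, 2), round(u, 2))
```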
6.3 Defuzzification
As a result of inference we obtain a fuzzy set µU of proposed control values. Each
activated rule Ri , i.e. for which the activation level is non-zero, contributes to µU ,
and therefore we obtain the final conclusion as
µU = ∪_{i=1}^{n} µUi ,
Centre-of-Gravity, CoG
CoG finds the centre of gravity of µU . In the discrete case, we have
u = ( Σ_{k=1}^{l} uk · µU (uk ) ) / ( Σ_{k=1}^{l} µU (uk ) )
[Figure: CoG defuzzification.]
In ICoG we only consider the area of µU that is above a specified level α (see figure
6.8), and compute the centre of gravity for this area. Thus, in the discrete case, we
have
u = ( Σ_{k=1}^{l} uk · [µU (uk )]α ) / ( Σ_{k=1}^{l} [µU (uk )]α ),
where [µU (uk )]α denotes the part of the fuzzy set µU above the α level.
In the continuous case, we have

u = ( ∫_{[U ]α} v · [µU (v)]α dv ) / ( ∫_{[U ]α} [µU (v)]α dv ).
[Figure 6.8: ICoG defuzzification above the level α.]
Centre-of-Sums, CoS
In CoG we had to consider the whole of µU . CoS is similar to CoG but implementationally much more efficient. In CoS, we consider all output functions when
computing the sum of all µUi , i.e. overlapping areas may be considered more than
once (see figure 6.9). In the discrete case, we obtain
u = ( Σ_{k=1}^{l} Σ_{i=1}^{n} uk · µUi (uk ) ) / ( Σ_{k=1}^{l} Σ_{i=1}^{n} µUi (uk ) )
[Figure 6.9: CoS defuzzification.]
First-of-Maxima, FoM
In FoM, defuzzification of µU is defined as the smallest value in the domain of U with maximal membership value, i.e.

u = min{v ∈ U | µU (v) = max µU }.
Middle-of-Maxima, MoM
MoM is similar to FoM. Instead of taking the first value with maximal grade of membership, we compute the average of all values with maximal grades,

u = ( min{v ∈ U | µU (v) = max µU } + max{v ∈ U | µU (v) = max µU } ) / 2.
A graphical representation is given in figure 6.11.
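A sketch of the discrete defuzzifiers described above, applied to a small sampled µU (the membership values are made up):

```python
us = [0.0, 1.0, 2.0, 3.0, 4.0]          # sampled output domain
mu = [0.2, 0.8, 0.8, 0.5, 0.1]          # sampled mu_U (hypothetical)

# Centre-of-Gravity: membership-weighted average of the samples.
cog = sum(u * m for u, m in zip(us, mu)) / sum(mu)

# First-of-Maxima: smallest u with maximal membership.
mmax = max(mu)
fom = min(u for u, m in zip(us, mu) if m == mmax)

# Middle-of-Maxima: average of smallest and largest maximizers.
mom = (min(u for u, m in zip(us, mu) if m == mmax) +
       max(u for u, m in zip(us, mu) if m == mmax)) / 2

print(cog, fom, mom)
```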
Height Method, HM
[Figures: FoM and MoM defuzzification.]
The height method is not applied directly on µU , but is focused on the peaks of the individual µUi , and computes an average of these peak positions weighted by their heights, according to

u = ( Σ_{i=1}^{n} µ_{Ui}^{peak} · fi ) / ( Σ_{i=1}^{n} fi ),
where µ_{Ui}^{peak} is the ui for which µUi (ui ) = 1 (with µUi in its original form); see figure 6.12. For a trapezoidal membership function, µ_{Uk}^{peak} becomes an interval (plateau), from which we can select a representative, e.g. the mean value. The value fi is the height of µUi , i.e. max µUi .
HM is computationally both simple and fast.
[Figure 6.12: the height method: the heights f1 and f2 of µU1 and µU2 at the peak positions µ_{U1}^{peak} and µ_{U2}^{peak} . A further figure illustrates CoLA defuzzification.]
Example. Defuzzification can lead to undesirable effects. Consider e.g. a car trying to avoid an obstacle directly in front of it. The membership function representing candidate control values is then typically as shown in figure 6.14. Both CoG and CoS, however, will result in driving straight ahead.
Chapter 7
Fuzzy Clustering
[Figure 7.1: a set of points partitioned into two crisp clusters; each point has membership value 0 or 1 in each cluster.]
[Figure 7.2: the same set of points with fuzzy memberships; each point has a grade of membership in cluster 1 and in cluster 2 (e.g. 0.47/0.53 near the boundary, 0.91/0.09 close to a cluster centre), the two grades summing to one.]
To describe the algorithm, we need some notation. The set of all points considered is X = {x1 , · · · , xn } (⊂ Rd ). We write ui : X → [0, 1] for the ith cluster,
i = 1, . . . , c, and we will use uik to denote ui (xk ), i.e. the grade of membership of xk
in cluster ui . We also use U = huik i, for the matrix of all membership values. The
’midpoint’ of ui is vi (∈ Rd ), and is computed according to
vi = ( Σ_{k=1}^{n} (uik )m xk ) / ( Σ_{k=1}^{n} (uik )m ).
The memberships are constrained by

Σ_{i=1}^{c} uik = 1,

and the objective is to minimise

J = Σ_{i=1}^{c} Σ_{k=1}^{n} (uik )m ‖ xk − vi ‖² .
The following algorithm for FCM clustering will meet this objective:
if µk = ∅, then

uik = 1 / [ Σ_{j=1}^{c} ( ‖ xk − vi ‖ / ‖ xk − vj ‖ )^{2/(m−1)} ]

otherwise

uik = 0 ∀ i ∉ µk and Σ_{i∈µk} uik = 1.
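The two update steps above can be sketched directly from the formulas (a minimal implementation assuming the Euclidean norm and that no point coincides exactly with a midpoint, so the case µk ≠ ∅ never triggers):

```python
import random

def fcm(X, c, m=2.0, iters=50, seed=0):
    """Fuzzy c-means: alternate the midpoint update for v_i and the
    membership update for u_ik, here for a fixed number of iterations."""
    rng = random.Random(seed)
    n, d = len(X), len(X[0])
    # Random initial memberships, normalised so that sum_i u_ik = 1.
    U = [[rng.random() for _ in range(n)] for _ in range(c)]
    for k in range(n):
        s = sum(U[i][k] for i in range(c))
        for i in range(c):
            U[i][k] /= s

    def dist(x, v):
        return sum((a - b) ** 2 for a, b in zip(x, v)) ** 0.5

    for _ in range(iters):
        # Midpoints: v_i = sum_k (u_ik)^m x_k / sum_k (u_ik)^m.
        V = []
        for i in range(c):
            w = [U[i][k] ** m for k in range(n)]
            V.append([sum(w[k] * X[k][j] for k in range(n)) / sum(w)
                      for j in range(d)])
        # Memberships: u_ik = 1 / sum_j (||x_k - v_i|| / ||x_k - v_j||)^(2/(m-1)).
        for k in range(n):
            ds = [dist(X[k], V[i]) for i in range(c)]
            for i in range(c):
                U[i][k] = 1.0 / sum((ds[i] / ds[j]) ** (2.0 / (m - 1.0))
                                    for j in range(c))
    return U, V

# Two well-separated groups of points.
X = [[0, 0], [0, 1], [1, 0], [9, 9], [9, 10], [10, 9]]
U, V = fcm(X, c=2)
```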
[Figure 7.3: three cluster midpoints v1 , v2 , v3 and the overall midpoint x̄ of all points in X.]
where x̄ is the midpoint of all points in X (see figure 7.3). Obviously, we will select
that particular c0 for which

S(U, c0 ) = min_c S(U, c),

i.e. we should iterate the clustering algorithm with c ranging over a suitably selected interval, e.g. 2 to 20 clusters.
Clearly, in many cases it is not straightforward to find the optimal number of clusters. For a small number of data points in X, a worst case scenario is that the criterion decreases with increasing c and reaches its optimum with c equal to the number of data points in X. Also intuitively (see figure 7.4), it is not obvious which number of clusters is better from a practical viewpoint.
Each cluster will now generate one rule in the following way. Consider a fuzzy
cluster ui on X, and write πp (ui ) : Xp → [0, 1] for the corresponding cluster projected
on the pth axis, i.e. πp (ui )(πp (xk )) = uik . Assume we have decided to use Gaussian
functions in our generated rule. This means we want to estimate the parameters of
Gaussian functions with best fit to respective πp (ui ). Thus, for each p we need to
find βp such that

Σ_{xk ∈X} | uik − e^{−βp (πp (xk )−αip )²} |

is minimised.
[Figure: the projections of the cluster with midpoint v onto the coordinate axes yield membership functions A31 and A32 , centred at α31 and α32 .]
related to the ith cluster. If this cluster is elliptic, with points concentrated around the centre point of the ellipse, then the principal axes of the cluster are given by the eigenvectors of M .
7.5 Applications
There is a wide range of applications of clustering. In industrial process control
there are numerous applications, such as fertilizer production ([Riissanen 92b]),
automatic steering (AGNES [Olli 95, Huber 96]), and process state prognostics
([Sutanto and Warwick 94]), only to mention some examples. The reader is referred
to proceedings e.g. related to IEEE and IFSA conferences on fuzzy systems.
Also in pattern recognition and image processing there are typical applications, e.g. in segmentation of colour images ([Lim and Lee 90]), edge detection in 3-D objects ([Huntsberger et al 86]), and surface approximation for reconstruction of images ([Krishnapuram et al 95]).
Chapter 8
Generalised Perceptrons as Fuzzy Rules
[Figure: the neural computing view versus the fuzzy inference view.]
For neuralists, the first step is the most "magic". (A fuzzy practitioner considers this step to be "logic".) In this step, transformations like 1-of-N coding are applied to handle symbolic data, and histogram equalization overcomes difficulties with out-of-normal-range values. For the second step there are few rules-of-thumb concerning
splitting of the data file into pieces for, respectively, training and testing. The third
step is experimental, and is a generate-and-test approach to finding (what in the end
is believed to be) near-optimal networks. The fourth phase is often nicely supported
by code generating modules in software packages.
According to the neural computing view, raw data is collected from a physical description of the disease. Data is inserted and used as such by the network. Preprocessing of data is a post-activity after the final network has been extracted. By preprocessing, the diagnostic performance might be further improved.
The fuzzy inference view is the opposite: Preprocess first, and organize your sys-
tem based on knowledge extracted from the preprocessing phase. The preprocessing
phase includes a reorganization (often symptom combinations) of the physical dis-
ease description. Data is pushed through (non-linear) transformation functions, and
resulting transformed data is used instead of raw data. In subsequent sections we
will see examples of the power of using transformed data.
[Figure: the processing chain: physical input is preprocessed (a linear transformation) into logical input, passed through network inference, and the network output is defuzzified/reformed into logical output.]
Once the input modelling has been fixed, the second step in forwarding starts off by observing that we need transformation functions to convert physical data into logical values. To see the necessity of the preprocessing functions we need only mention the standard example of fever. A value for fever when determining bronchitis cannot be transformed in the same way as when determining pneumonia.
In all, transformation functions related to particular inputs are always to be given
for the specific disease under consideration.
At this point we need to be critical about the selection of preprocessing function types. As we have seen, e.g. sigmoidal transformation functions can be useful. Again, whether the function needs to be ascending or descending is usually trivially known by the domain expert. Learning strategies for parameter estimation within the network are now easily extended to apply also to the parameters within the sigmoidal functions. Of course, other preprocessing functions can be used, with the recommendation that they be differentiable to enable parameter tuning.
Once the logical inputs are given, the network is then the computational method
that implements the numerical optimization task, i.e. the training of all the system
parameters. Note that, whereas in the neural approach we usually only identify parameter values given a certain network structure, the structures within fuzzy systems domains are identified together with the parameter estimations [Bezdek 81, Jang 92, Sugeno and Yasukawa 93].
The output of the network is still only of logical nature, typically a unit interval
value, and needs further explanation if representing a final decision or recommen-
dation from the end-user support system. The value can also remain as such, or
simply be appropriately transformed, to indicate a risk value, but can also act as
an input into a decision maker that provides even binary type decisions regarding
further actions to be suggested within patient care.
Generally speaking, a diagnostic task is the problem to find the best representation. The specialized preprocessing perceptron computes

y = act( Σ_{i=1}^{n} wi · g[αi , βi ](xi ) ),
where act is the activation function and g[αi , βi ] are sigmoidal functions.
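A sketch of this forward computation, taking both act and the preprocessing functions g[αi , βi ] to be logistic sigmoids (the parameter values, e.g. the fever cut-off, are made-up illustrations):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def g(alpha, beta):
    """Sigmoidal preprocessing function with cut-off alpha and slope beta."""
    return lambda x: sigmoid(beta * (x - alpha))

def preprocessing_perceptron(x, w, alphas, betas, act=sigmoid):
    """y = act( sum_i w_i * g[alpha_i, beta_i](x_i) )."""
    s = sum(wi * g(a, b)(xi) for wi, a, b, xi in zip(w, alphas, betas, x))
    return act(s)

# Example: a fever input with cut-off 38.5 degrees plus one lab value.
y = preprocessing_perceptron(x=[39.0, 4.2], w=[1.5, -0.8],
                             alphas=[38.5, 5.0], betas=[2.0, 1.0])
print(0.0 <= y <= 1.0)   # logical output stays in the unit interval
```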
The specialized preprocessing perceptron shows performance comparable to the multilayer perceptron. Improved discrimination effects and diagnostic success rates have been demonstrated in several case studies [Eklund et al 94, Kallin et al 95]. This, in conjunction with the remark that generalized preprocessing perceptrons use only interpretable parameters, makes the approach very appealing.
The logical disjunction, as the generalization of summation in the weighted sum, corresponds to some co-t-norm S as represented by generator functions g : R → [0, 1], with an existing inverse g −1 . The representation is given by

S(a1 , a2 ) = g(g −1 (a1 ) + g −1 (a2 )).
Incidentally, the representation functions are exactly the counterparts of the activation functions used in the neural framework, and it comes as a surprise that the respective literatures on co-t-norms and neural activation functions have very little in common.
Note how the representation functions can be parametrized in different ways.
From these observations we realize that we obtain a wide spectrum of parametrizations of the weighted sum, with interpretations in a logical framework. The
presentation of neural networks as a logical structure can indeed be taken to the extreme, as is done in a subsequent section, thereby concluding that, from a generalized viewpoint, fuzzy logic and neural networks come down to one and the same thing!
Note that the transformation of addition on the real line is not restricted to the
unit interval only. For learning rules we thereby obtain the following convenience.
For a parametrized function, the learning rule becomes

αk := αk ⊕ ∆p αk ,

where ⊕ is the corresponding addition in the range of the α's, and ∆p αk is given by a particular optimization strategy.
For the specialized preprocessing function we adapt the parameters in the sigmoid
functions according to
∆p αi = −η ∂Ep /∂αi = −η(tp − op )op (1 − op )wi g[αi , βi ](xi )(1 − g[αi , βi ](xi ))βi

and

∆p βi = −η ∂Ep /∂βi = η(tp − op )op (1 − op )wi g[αi , βi ](xi )(1 − g[αi , βi ](xi ))(xi − αi ),
respectively, with mean squares in the error function.
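The two update formulas can be transcribed directly; in this sketch g is the logistic function and op = act(Σ wi g[αi , βi ](xi )) with a logistic act, matching the mean-squares setting above:

```python
import math

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

def updates(t, x, w, alpha, beta, eta=0.5):
    """One pattern's gradient-descent increments for the cut-offs alpha_i
    and slopes beta_i of the sigmoidal preprocessing functions."""
    gs = [logistic(b * (xi - a)) for xi, a, b in zip(x, alpha, beta)]
    o = logistic(sum(wi * gi for wi, gi in zip(w, gs)))
    common = (t - o) * o * (1 - o)
    d_alpha = [-eta * common * wi * gi * (1 - gi) * bi
               for wi, gi, bi in zip(w, gs, beta)]
    d_beta  = [ eta * common * wi * gi * (1 - gi) * (xi - ai)
               for wi, gi, xi, ai in zip(w, gs, x, alpha)]
    return d_alpha, d_beta

da, db = updates(t=1.0, x=[0.7, 0.3], w=[1.0, 1.0],
                 alpha=[0.5, 0.5], beta=[4.0, 4.0])
```

With target t above the output o and positive weight and slope, the cut-off α1 is pushed down (increasing g and hence o towards the target), as the signs of the increments confirm.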
Learning as described above is based on gradient descent. However, this being
rather an exemplification than a choice, we certainly do refer also to other optimiza-
tion techniques to be evaluated for these purposes. For some examples, we refer to
chapters describing this in more detail.
Tp (a1 , a2 ) = (a1^p + a2^p − 1)^{1/p}    if a1^p + a2^p ≥ 1 and p ≠ 0,
Tp (a1 , a2 ) = 0                          if a1^p + a2^p < 1 and p ≠ 0,
Tp (a1 , a2 ) = a1 a2                      if p = 0,
and

Φ_{i=1}^{n} (zi ) = 1 − Π_{i=1}^{n} (1 − zi ),
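A direct transcription of Tp and Φ as code (a sketch; note that p = 1 yields the Łukasiewicz t-norm max(0, a1 + a2 − 1)):

```python
def T(p, a1, a2):
    """Parametrized t-norm: (a1^p + a2^p - 1)^(1/p) when a1^p + a2^p >= 1,
    0 otherwise, and the product a1*a2 in the limiting case p = 0."""
    if p == 0:
        return a1 * a2
    s = a1 ** p + a2 ** p
    return (s - 1) ** (1.0 / p) if s >= 1 else 0.0

def phi(zs):
    """Phi(z_1, ..., z_n) = 1 - prod_i (1 - z_i), the probabilistic sum."""
    out = 1.0
    for z in zs:
        out *= 1 - z
    return 1 - out

print(T(1, 0.7, 0.6))    # max(0, 0.7 + 0.6 - 1) = 0.3
print(T(0, 0.7, 0.6))    # product: 0.42
print(phi([0.5, 0.5]))   # 0.75
```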
In this case study we use both the preprocessing perceptron with the weighted
sum, and also the generalised preprocessing perceptron with a selection of connec-
tives as specified in the previous section.
The data consists of ten different parameters as described in section 1.1.3. Correctness rates, as average values, are high both for the weighted sum (99%) and for the generalised preprocessing perceptron (97%). The values are sensitivities, given that specificity is enforced to provide 95% correctness.
Values for weights (implication uncertainties), cut-offs and slopes are shown in table 8.1. In this table, WS stands for the method using a weighted sum and YI for the method using Yager's implication.
Note that cut-off values in some cases are surprisingly similar. A cautious con-
clusion can be that the preprocessing layer indeed, at least to some extent, is inde-
pendent of the network structure. More experiments should, of course, be done to
verify such claims, but, nevertheless, this observation is most encouraging for aiming
at further understanding hybrids and modularity in learning architectures.
We can say that these methods are successful, but a comparison between them does not seem meaningful. However, we should note that shifting from a linear feedforward to a non-linear logic-based approach does not seem to endanger success rates. The advantage of the logic approach is obvious, since we immediately open up possibilities to identify suitable implication operators for particular symptoms and signs. Thus, this method can in fact be used to identify inference structures in medical decision making, or at least in some general cases.
Data retrieval
Figure 8.6: The different paths a patient can take through the health care system.
A health care centre is visited for preventive care or illness. In both cases the
general practitioner examines the patient and laboratory samples might be taken.
The general practitioner uses a computerised patient record system to store the
information about a patient. As the systems in primary health care are often local,
they may differ from place to place. This makes it difficult to compare and/or
combine the data stored about a patient at a health care centre and at a hospital.
If the doctor at the health care centre wishes to consult specialists or if the patient
needs treatment that cannot take place at home, the patient is referred to hospital.
The patient can also be admitted to hospital directly, e.g., in emergency situations.
In hospital the specialists examine the patient and laboratory tests and/or x-rays
are taken. The hospital performs surgery if needed and often the patient stays in
hospital for after care. During the hospital stay several laboratory samples might
be taken and other data gathered.
Also at the hospitals information is stored about patients’ health status such
as laboratory test results, medications, and diagnoses in vast electronic databases.
Many important medical observations have been made by studying patient history, and much more could be done if scientists had faster and broader access to the knowledge that can be found in the huge data masses. Further observations might be made if it were possible to connect and compare databases at health care centres with the ones at hospitals.
The scientific and the administrative viewpoint on databases do not coincide.
The administrative interest is to find information on one patient at a time, inserting
or retrieving data concerning this particular patient. The scientific interest is in
turn to combine information concerning patients’ health status. Individual patients
are not in focus but larger groups that have certain things, e.g., the same diagnosis,
in common. For administrative purposes summarised information is also of interest,
but such information concerns mainly, for example, how much an average patient
costs or which hospital wards consume the most resources. The scientific viewpoint
has generally not been considered when building hospital databases, but the trends
seem to be shifting slowly.
these program clauses are weighted with values in the unit interval. Thus we consider a Prolog modification based on Łukasiewicz logic.
Definition The subset of atomic formulae of L is denoted Kf . Its elements are
called facts. A rule is a formula of the form
(P ← (. . . (Q1 ¦ Q2 ) ¦ . . . ¦ Qn )),
• (i) Π0 = Π and Πn = Π′ ,
Note, that ≺ defines a preordering on the set of fuzzy logic programs (i.e. ≺ is a
reflexive and transitive relation).
Definition For a fuzzy logic program Π the mapping Π ⊢ : Kf → [0, 1] is given by Π ⊢ P = sup{Π′ (P ) | Π ≺ Π′ }.
In opposition to standard Prolog, where it is sufficient to find just one proof for a fact, in this many-valued modification all proofs for a fact have to be taken into account, in order to obtain the greatest uncertainty factor for the fact. Even in the case when the FLP Π has a finite support, there does not necessarily for every fact P exist an FLP Π′ derivable from Π s.t. Π′ (P ) = Π ⊢ P . If e.g. Π(P ) = 0.5 and Π(P ← P ¦ P ) = 1, Π(Q) = 0 otherwise, where ¦(α, β) = α + β − α · β, then Π ⊢ P = 1, but for all FLP's Π′ derivable from Π, Π′ (P ) < 1 holds.
Although the supremum is never reached, it can be approximated by a sequence
of FLP’s derivable from Π. Therefore, the supremum does not cause severe prob-
lems. The need to consider all instead of only one proof for a fact enforces the
The continuity and the monotonicity of ¦ imply that there are ε1 , . . . , εn > 0 s.t. Π(R) + ¦(Υ′ (Q1 ) − ε1 , . . . , Υ′ (Qn ) − εn ) − 1 > Υ′ (Q). Since Υ′ (Qi ) = Π ⊢ Qi = sup{Π′ (Qi ) | Π ≺ Π′ } > Υ′ (Qi ) − εi for i = 1, . . . , n, there is a fuzzy logic program Π′ , satisfying Π ≺ Π′ and, for all i ∈ {1, . . . , n}, Π′ (Qi ) > Υ′ (Qi ) − εi . But from Π′ we can directly derive a fuzzy logic program Π″ with
which is obviously a contradiction. For simple rules R, Υ(R) ≥ Π(R) can be shown analogously.
one for each program clause in the FLP, where Name is the name of the node. Oper represents the operation in the rule, and is assigned to id for simple rules. AssList is the list of pairs consisting of an antecedent Si in the rule together with the value Π(Si ← X). Note that, before the transformations, the program is completed with rules so that there never exist two rules with the same head. For a leaf node, AssList is NIL. This corresponds to a fact in the FLP.
We may write Ω = FLN(Π) for this set of netnodes. Given an FLP Π : K → [0, 1], a corresponding FLN is now constructed as follows.
If there is more than one rule with head D, we rename these heads with unique D1 , . . . , Dn , respectively, and complete the program with the rule D ← D1 ¦ . . . ¦ Dn , with Π(D ← D1 ¦ . . . ¦ Dn ) = 1, where ¦ is the maximum operation.
In the completed program, for facts S we assign

S ↦ FLNetNode(S, id, NIL).

Simple rules are transformed according to

D ← S ↦ FLNetNode(D, id, [⟨S, Π(D ← S)⟩]).

Non-simple rules are assigned according to

D ← D1 ¦ . . . ¦ Dn ↦ FLNetNode(D, ¦, [. . . , ⟨Si , wi ⟩, . . .]),

where wi = Π(Di ← Si ) if Di ← Si is a simple rule and Π(Di ← Si ) > 0, and wi = 1 if Di ← X is a non-simple rule.
Conversely, given an FLN, a corresponding FLP is generated as follows.
FLNetNode(S, id, NIL) assigns to facts S, where Π(S) are the corresponding input values in these leaf nodes.
FLNetNode(D, id, [⟨S, w⟩]) assigns to simple rules D ← S with Π(D ← S) = w.
FLNetNode(D, ¦, [⟨S1 , w1 ⟩ . . . ⟨Sn , wn ⟩]) assigns to the non-simple rule D ← D1 ¦ . . . ¦ Dn , together with simple rules D1 ← S1 . . . Dn ← Sn , where Π(D ← D1 ¦ . . . ¦ Dn ) = 1 and Π(Di ← Si ) = wi .
Again, we can write Π = FLP(Ω) for this program. Note that Ω = FLN(FLP(Ω)), and that Π and FLP(FLN(Π)) are semantically equivalent.
For FLNs to allow for invoking learning procedures, restrictions have to be made.
These restricted nets are called neural logic nets (NLNs) and must fulfil the following
conditions:
• (i) there are no two rules with the same head,
• (ii) in one and the same rule each proposition constant appears at most once,
• (iii) the graph of named nodes created is acyclic,
• (iv) for non-simple rules R, Π(R) = 1.
The corresponding FLPs are called neural logic programs (NLPs). Because of (i)-(iv) and the finiteness of the support of Π, we get Π ⊢ P = max{Π′ (P ) | Π ≺ Π′ }.
Chapter 9
Parameter Estimations
As we have already seen, there have been many successful applications of fuzzy control, but control designers still face two major obstacles in implementing fuzzy control. The first is the acquisition of fuzzy rules, and the second is the search for optimal parameters of the membership functions in the linguistic rules. We have techniques for generating rules based on clustering algorithms. In general, however, the initial design of the fuzzy controller will not result in optimal control behaviour. To improve the control behaviour, tuning is necessary. From a tuning point of view, the problem becomes to search for optimal parameters of the membership functions in the linguistic rules. Recently, using supervised learning methods to fine-tune membership functions in a fuzzy rule base has received more and more attention. From the adaptive fuzzy control point of view, we take the pragmatic attitude that an adaptive fuzzy controller is a controller with adjustable fuzzy rule-base parameters and a mechanism for adjusting those parameters. An adaptive fuzzy control system can be thought of as having two loops. One loop is the normal feedback loop with the process and the fuzzy logic controller. The other loop is the parameter adjustment loop, which is often slower than the normal feedback loop. The derivation of the different learning algorithms is given in the following sections.
where p labels the pattern. The derivatives of F are obtained by summing the derivatives obtained for each pattern separately. yp∗ is the target output value for pattern p and yp is the actual output of the network function for pattern p. Of course, the different learning strategies will yield different improvements of the control behaviour. But it is not immediately clear how well different learning techniques are suited for ill-behaved data, such as is typically found in chemical and biological systems or in medical informatics, where modelling usually is difficult. In this section we will describe the gradient descent learning method.
Iterative gradient descent techniques can be used in any feed-forward networks.
Therefore, if we represent the fuzzy control system as feed-forward networks in
Figure 5.2.1, then we can use gradient descent (GD) to train different subclasses
of parameters. The objective is the updating of the parameters of the membership
functions in a fuzzy rule base such that the rule base performs a desired mapping
of input to output activations. Optimisation is based on (9.1). The learning goal
now is to find a global minimum of F . The parameters in the rule base are changed along a search direction d(t) given by the first-order derivative, namely the gradient ∇F := ∂F/∂αij , which drives the parameters in the direction of the estimated minimum:

∆α(t) = −η ∂F/∂α(t)   (9.2)

α(t + 1) = α(t) + ∆α(t)   (9.3)
Once the partial derivatives are known, the next step in GD learning is to compute the resulting parameter update. In its simplest form, the parameter update is a scaled step in the opposite direction of the gradient; in other words, the negative derivative is multiplied by a constant value, the learning rate η. This minimization technique is commonly known as "gradient descent":

∆α(t) = −η∇F (t)   (9.4)

or, for a single parameter:

∆αij (t) = −η ∂F/∂αij (t)   (9.5)
The derivatives of F are obtained by summing the derivatives obtained for each pattern. Here i = 1, 2, . . . , n, and k = 0, 1, 2, . . . is the update (iteration) index. From Figure 5.2.1 [Wang and Mendel 92] we see that y depends on αij only through gi . For notational simplicity, let y = a/b, a = Σ_{i=1}^{n} Ci gi , b = Σ_{i=1}^{n} gi and gi = Π_{j=1}^{m} e^{−βij (xj −αij )²} . Hence, using the chain rule, we have

∂F/∂αij = −(y ∗ − y) (∂y/∂gi )(∂gi /∂αij ) = −2(y ∗ − y) ((Ci − y)/b) gi βij (xj − αij )   (9.6)
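Formula (9.6) can be checked numerically against a finite difference; the parameter values in this sketch are made up, and F is the single-pattern error ½(y∗ − y)²:

```python
import math

# Two rules (n = 2), two inputs (m = 2): Gaussian antecedents, constant consequents.
C     = [1.0, -1.0]
alpha = [[0.0, 0.5], [1.0, -0.5]]
beta  = [[1.0, 2.0], [0.5, 1.0]]
x     = [0.3, 0.7]
ystar = 0.5

def output(alpha):
    """y = sum_i C_i g_i / sum_i g_i with g_i = prod_j exp(-beta_ij (x_j - alpha_ij)^2)."""
    g = [math.prod(math.exp(-beta[i][j] * (x[j] - alpha[i][j]) ** 2)
                   for j in range(2)) for i in range(2)]
    return sum(C[i] * g[i] for i in range(2)) / sum(g), g

y, g = output(alpha)
b = sum(g)

# Analytic derivative (9.6) w.r.t. alpha_00.
i, j = 0, 0
dF = -2 * (ystar - y) * (C[i] - y) / b * g[i] * beta[i][j] * (x[j] - alpha[i][j])

# One-sided finite-difference check on F = 0.5 * (ystar - y)^2.
eps = 1e-6
alpha[i][j] += eps
y2, _ = output(alpha)
dF_num = ((ystar - y2) ** 2 / 2 - (ystar - y) ** 2 / 2) / eps
print(abs(dF - dF_num) < 1e-4)
```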
In a manner similar to that described for tuning the consequent part, the mean (midpoint) and variance (slope) are updated as follows:

αij (k + 1) = αij (k) ⊕ 2ηα (y ∗ − y) ((Ci − y)/b) gi βlij (k)(xj − αij (k))   if xj ≤ αij ,
αij (k + 1) = αij (k) ⊕ 2ηα (y ∗ − y) ((Ci − y)/b) gi βrij (k)(xj − αij (k))   if xj > αij ,   (9.7)

βlij (k + 1) = βlij (k) ⊖ ηβl (y ∗ − y) ((Ci − y)/b) gi (xj − αij (k))²   (9.8)

βrij (k + 1) = βrij (k) ⊖ ηβr (y ∗ − y) ((Ci − y)/b) gi (xj − αij (k))²   (9.9)

ci (k + 1) = ci (k) ⊖ ηc ((y ∗ − y)/b) gi .   (9.10)

where ⊕ and ⊖ are (isomorphic) operations in the subset of R, i = 1, 2, . . . , n, j = 1, 2, . . . , m, k = 0, 1, 2, . . . , and ηα , ηβ and ηc are the learning rates.
The training procedure for the neural fuzzy system is a two-pass procedure. First, for given input values x = (x1 , · · · , xm ), compute forward along the network (i.e. the neural fuzzy system) to obtain gi (i = 1, 2, . . . , n), a, b and y; then train the network parameters αij , βlij , βrij and Ci (i = 1, 2, . . . , n, j = 1, 2, . . . , m) backward using (9.6), (9.7), (9.8), (9.9) and (9.10), respectively.
Although the basic learning rule is rather simple, it is often a difficult task to choose the learning rate appropriately. A good choice depends on the shape of the error function, which obviously changes with the learning task itself. A small learning rate will result in a long convergence time on a flat error function, whereas a large learning rate may lead to oscillations, preventing the error from falling below a certain value. A comparison of different learning rates will be shown later. The typical adaptive fuzzy control system tuned by the GD learning algorithm is shown in the following:
F = (1/2) Σp Ep = (1/2) Σp (yp∗ − yp )²

{ Denote:

y = ( Σ_{i=1}^{n} Ci · ( Π_{j=1}^{m} e^{−βij (xj −αij )²} ) ) / ( Σ_{i=1}^{n} Π_{j=1}^{m} e^{−βij (xj −αij )²} ),

i.e. y = a/b, a = Σ_{i=1}^{n} Ci gi , b = Σ_{i=1}^{n} gi and gi = Π_{j=1}^{m} e^{−βij (xj −αij )²} . }
1. Give initial values to the training parameters {p, α, βl, βr, C, ηα , ηβ , eg}.
3. k := 1; y = yinit ;
4.1 If the sum-squared error is less than the goal 0.02, then stop the loop.
4.2 ∂F/∂αij (k) := −e (∂y/∂gi )(∂gi /∂αij (k)) := −2e ((Ci − y)/b) gi βij (k)(xj − αij (k)); { compute the derivative of the error with respect to the midpoint of the membership function }
4.2.1 bp := Σ_{i=1}^{n} Π_{j=1}^{m} e^{−βij (k)(x_j^p −αij (k))²} ;
4.2.2 g_i^p := Π_{j=1}^{m} e^{−βij (k)(x_j^p −αij (k))²} ;
4.2.3 cout := (Ci − yp )/bp ;
4.2.4 ∂Ep /∂αij (k) := −2 ∗ ep ∗ cout ∗ gi ∗ βij (k)(x_j^p − αij (k));
4.2.5 ∂F/∂αij (k) := Σ_{p=1}^{P} ∂Ep /∂αij (k);
4.3 Normalize αij from the unit interval [0, 1] into the real set;
4.4 αij (k + 1) := αij (k) − ηα ∂F/∂αij (k); { learning of the rule-base parameter αij }
4.5 Normalize αij from the real set back into the unit interval;
4.6 Compute y := simuinfer(p, α, βl, βr, c, f 1); e := y ∗ − y; SSE := (1/2) Σp (yp∗ − yp )² ;
It can be shown that this step is completely equivalent to minimizing the error obtained from using an affine model of R(\alpha) around \alpha(k):

\min_\alpha \frac{1}{2}\|M_k(\alpha)\| = \min_\alpha \frac{1}{2}\|R(\alpha(k)) + J(\alpha(k))(\alpha - \alpha(k))\|   (9.16)

If J(\alpha(k)) has full column rank, J(\alpha(k))^T J(\alpha(k)) is nonsingular, the Gauss-Newton step is a descent direction and the method can be modified with line searches (damped Gauss-Newton method) [?]. The use of Tikhonov regularization eliminates that weakness in an elegant way. The introduction of the augmented term \mu(x - c) has the following impact on the computation of the search direction:

\min_{p \in \mathbb{R}^n} \frac{1}{2}\left\| \begin{pmatrix} J(w_c) \\ \mu_t I_n \end{pmatrix} p + \begin{pmatrix} E(w_c) \\ \mu_t (w_c - c) \end{pmatrix} \right\|^2.   (9.17)

Define J_{aug} = \begin{pmatrix} J(w_c) \\ \mu_t I_n \end{pmatrix} and E_{aug} = \begin{pmatrix} E(w_c) \\ \mu_t (w_c - c) \end{pmatrix}. It is clear that J_{aug} always has full column rank (with approximate condition number \|J\|^2/\mu). Thus the Gauss-Newton method applied to the regularized problem becomes well defined and now solves a much easier subproblem. If p solves (9.17) and \alpha is the step length along the search direction p, then w_+ = w_c + \alpha p becomes the new iterate. Last, in each epoch a new decision is made whether to decrease the regularization parameter \mu or not. It is appropriate to start with \mu = 0.1 and then decrease \mu during the iterations. In this implementation \mu := 0.8\mu if the step length in the previous step was \alpha > 0.5 [Zhou and Eriksson]. The tuning of a typical adaptive fuzzy control system by the GNR learning algorithm is shown in the following:
{ Denote: y = \frac{\sum_{i=1}^{n} C_i \left(\prod_{j=1}^{m} e^{-\beta_{ij}(x_j - \alpha_{ij})^2}\right)}{\sum_{i=1}^{n} \prod_{j=1}^{m} e^{-\beta_{ij}(x_j - \alpha_{ij})^2}}, where y = a/b, a = \sum_{i=1}^{n} C_i g_i, b = \sum_{i=1}^{n} g_i and g_i = \prod_{j=1}^{m} e^{-\beta_{ij}(x_j - \alpha_{ij})^2}. }

1. Give initial values to the training parameters {p, t, \alpha, \beta_l, \beta_r, C, \eta_\alpha, \eta_\beta, e_g};
3. k := 1; y := y_{init};
4.1 If the sum-squared error is less than the goal 0.02 then stop the loop;
4.2 (b_1, \cdots, b_P) := \sum_{i=1}^{n} \prod_{j=1}^{m} e^{-\beta_{ij}(k)(x_j^p - \alpha_{ij}(k))^2};
4.3 (g_i^1, \cdots, g_i^P) := \prod_{j=1}^{m} e^{-\beta_{ij}(k)(x_j^p - \alpha_{ij}(k))^2};
4.4 Compute the Jacobian matrix J(\alpha_{ij}) := \left[\frac{\partial e_p}{\partial \alpha_{ij}}\right];
4.4.1 for p = 1 to P do
4.4.1.1 J(\alpha_{ij})_p := -2\,\frac{C_i - y_p}{b_p}\, g_i^p\, \beta_{ij}(k)(x_j^p - \alpha_{ij}(k));
4.4.2 end for
4.4.3 J_{aug\alpha} := \begin{pmatrix} J(\alpha) \\ \mu_\alpha I_n \end{pmatrix}
4.4.4 E_{aug\alpha} := \begin{pmatrix} E(\alpha) \\ \mu_\alpha (\alpha - c) \end{pmatrix}
4.4.5 P_\alpha := -J_{aug\alpha} \backslash E_{aug\alpha};
4.4.6 Normalize \alpha_{ij} from the unit interval [0, 1] into the real set;
4.5 \alpha_{ij}(k+1) := \alpha_{ij}(k) + \gamma P_\alpha; { learning fuzzy rule-base parameter \alpha_{ij} }
4.8 Normalize \alpha_{ij} from the real set back into the unit interval;
4.9 Calculate y; e := y^* - y; SSE := \frac{1}{2}\sum_p (y_p^* - y_p)^2;
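The heart of step 4.4.5, solving the augmented least-squares problem (9.17) for the search direction, can be sketched for a generic Jacobian and residual vector. This is a sketch rather than the full training loop; `gnr_step` is an illustrative name, and the augmentation mirrors J_aug and E_aug above.

```python
import numpy as np

def gnr_step(J, E, w, c, mu):
    """Regularized Gauss-Newton direction: argmin_p 0.5*||J_aug p + E_aug||^2.

    J: (p, n) Jacobian at the current point w; E: (p,) residuals;
    c: (n,) regularization centre; mu: regularization parameter (mu_t above).
    """
    n = w.size
    J_aug = np.vstack([J, mu * np.eye(n)])               # (J; mu*I_n)
    E_aug = np.concatenate([E, mu * (w - c)])            # (E; mu*(w - c))
    p, *_ = np.linalg.lstsq(J_aug, -E_aug, rcond=None)   # well-defined for mu > 0
    return p
```

Because the lower block \mu I_n has full column rank, the subproblem stays well posed even when J itself is rank-deficient.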
current search point. The size of this region is governed by the value of the parameter \mu. The approach seeks to minimize the error function while at the same time keeping the step size small, so as to ensure that the linear approximation remains valid. The method is an approximation to the Gauss-Newton method; the LM modification of the Gauss-Newton step is defined as follows:
1. Give initial values to the training parameters {p, t, \alpha, \beta_l, \beta_r, C, \eta_\alpha, \eta_\beta, e_g};
3. k := 1; y := y_{init}; \mu := \mu \cdot \mu_{init};
4.1 If the sum-squared error is less than the goal 0.02 then stop the loop;
4.2 (b_1, \cdots, b_P) := \sum_{i=1}^{n} \prod_{j=1}^{m} e^{-\beta_{ij}(k)(x_j^p - \alpha_{ij}(k))^2};
4.3 (g_i^1, \cdots, g_i^P) := \prod_{j=1}^{m} e^{-\beta_{ij}(k)(x_j^p - \alpha_{ij}(k))^2};
4.4 repeat
4.4.1 Compute the Jacobian matrix J(\alpha_{ij}) := \left[\frac{\partial e_p}{\partial \alpha_{ij}}\right];
4.4.1.1 J(\alpha_{ij})_p := -2\,\frac{C_i - y_p}{b_p}\, g_i^p\, \beta_{ij}(k)(x_j^p - \alpha_{ij}(k));
4.4.2 J_\alpha := [J(\alpha_{ij})^T J(\alpha_{ij}) + \mu_\alpha I]^{-1} J(\alpha_{ij})^T R(\alpha_{ij}(k));
4.4.3 Normalize \alpha_{ij} from the unit interval [0, 1] into the real set;
4.4.4 new\alpha_{ij} := \alpha_{ij}(k) - J_\alpha; { learning rule-base parameter \alpha_{ij} }
4.4.5 Normalize new\alpha_{ij} from the real set back into the unit interval;
4.4.6 Compute new_y := simuinfer(p, new\alpha, \beta_l, \beta_r, c, f1); new_e; new_{SSE};
4.4.7 If new_{SSE} < SSE then terminate the inner loop;
4.4.8 \mu := \mu \cdot \mu_{inc};
4.5 until (the error is reduced);
4.6 If \mu > \mu_{max} then k := k - 1, break;
4.7 \mu := \mu \cdot \mu_{dec};
4.8 \alpha_{ij}(k+1) := new\alpha_{ij}; { updating parameter \alpha_{ij} }
4.9 y := new_y; e := new_e; SSE := new_{SSE};
4.10 k := k + 1;
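The inner loop above (steps 4.4–4.7) can be sketched generically for an arbitrary residual function. The names `lm_step`/`lm_fit` and the cut-off values are illustrative assumptions; the essential pieces are the step J_\alpha = (J^T J + \mu I)^{-1} J^T R from 4.4.2 and the \mu_{inc}/\mu_{dec} adaptation.

```python
import numpy as np

def lm_step(J, r, mu):
    """One LM step: solve (J^T J + mu I) d = J^T r, as in step 4.4.2."""
    return np.linalg.solve(J.T @ J + mu * np.eye(J.shape[1]), J.T @ r)

def lm_fit(residual, jacobian, w, mu=0.1, mu_inc=10.0, mu_dec=0.8, iters=50):
    """Minimise 0.5*||residual(w)||^2 with the mu-adaptation loop above."""
    sse = 0.5 * np.sum(residual(w) ** 2)
    for _ in range(iters):
        while True:                  # inner repeat ... until the error is reduced
            d = lm_step(jacobian(w), residual(w), mu)
            w_new = w - d
            sse_new = 0.5 * np.sum(residual(w_new) ** 2)
            if sse_new < sse:        # step accepted: move towards Gauss-Newton
                w, sse, mu = w_new, sse_new, mu * mu_dec
                break
            mu *= mu_inc             # step rejected: move towards gradient descent
            if mu > 1e12:            # no further improvement possible
                return w
    return w
```

Small \mu makes the step close to a Gauss-Newton step; large \mu makes it a short gradient-descent step, which is the trust-region behaviour discussed above.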
Software Developments
For decades the CI community has been developing applications with a variety of methods and tools. Applications and application-engineering styles presented at conferences have always been considered freeware in the community. The Fuzzy Boom can at least partly be explained by this generosity provided by R&D groups all around the world.

As a consequence of the increasing interest in applications, the need for sophistication in tools and supporting software is also growing rapidly. Software engineers meet these demands either by incorporating more and more functionalities into their own software packages, thus aiming at providing complete solutions from problem solving to installation, or else by aiming at open architectures and configurable toolboxes in order to reduce engineering effort and support ease of integration into existing systems.

In recent years, several commercial and public-domain software packages have become available to the CI community. Commercial and freeware tools are used in different configurations in a wide range of applications. The pressure on software developers is obvious. The "complete solutions" approach requires a continuous incorporation of standard functionalities found elsewhere. The "open architectures" approach needs to continuously follow the CI toolkit market in order to maintain and improve integration capabilities.

This creates an obvious need for development groups to communicate and exchange ideas and development styles. Furthermore, there are certainly opportunities also to exchange source code and software libraries, thereby creating symbiosis between development groups.
• (ii) platform independence (i.e. managing shifts, e.g. between PC and Unix)
[Figure: from raw monitoring information on the process, an "interview" produces an informal representation; a previewer/rule editor and commercial tools then generate code (or parameter lists) for the fuzzy rule base used by the controller, with sensors and actuators linking the controller to the process.]
the controller software modules need to integrate with different controller hardware
architectures, and interlink to drivers and servers running under a wide range of
operating systems.
[Figure: a processor (PLC boards) with inputs and outputs.]
Further, the software modules need to integrate with different controller software
environments, and interlink to monitoring and simulation modules running under a
wide range of operating systems.
[Figure: a processor (PLC boards) connected over RS 232, with sensors and actuators interfacing the process.]
For design purposes the hardware systems may, of course, in a design stage be
replaced by simulation environments, typically with simulation software interlinked
with monitoring and control software.
[Figure: a control processor attached to the process, with a log file for manual measurements and control actions.]
If the rule base is only presented as a list of text it is very difficult to handle all but the smallest rule bases; in particular, it is difficult to detect errors in the rule base. Various visualization techniques have been tried to clarify the rule base graphically.
A software controller with its corresponding rule base should be directly interlinked
with software modules providing analysis, monitoring, simulation and editing facil-
ities.
[Figure: a monitoring and control architecture combining monitoring & visualization, control editing with fuzzy logic inference, and reinforcement, supervised and unsupervised learning around the control & monitoring core.]
such as COLD, HIGH and SMALL, depending on the type of variable and the parameters of the membership function. The rule-base parameters can be changed by editing or dragging the shapes of the functions.

For a clustered rule base adjustments are required, especially for data outside the range of the data set provided through the recording phase. Giving linguistic names to cluster projections makes the rule base more understandable for process engineers.

Execution of the rule base using DDE is easy for the user - no programming or even integration of code is needed. It works with I/O driver software, process monitoring applications and simulation software providing DDE services. AboaFuzz@Control also supports DDE links to multiple servers if needed. AboaFuzz@Control includes the possibility to watch measurements while running. More sophisticated data analysis and statistics can be applied using existing commercial tools, which is possible due to DDE.
Chapter 11
EXERCISES
where

SMALL(t) = \begin{cases} 1 - \frac{t}{4}, & \text{if } 0 \le t \le 4 \\ 0, & \text{otherwise} \end{cases}

MEDIUM(t) = \begin{cases} 1 - \frac{|t-2|}{2}, & \text{if } 0 \le t \le 4 \\ 0, & \text{otherwise} \end{cases}

BIG(t) = \begin{cases} 1 - \frac{4-t}{4}, & \text{if } 0 \le t \le 4 \\ 0, & \text{otherwise} \end{cases}
where

SMALL(t) = \begin{cases} 1 - \frac{t}{4}, & \text{if } 0 \le t \le 4 \\ 0, & \text{otherwise} \end{cases}

MEDIUM(t) = \begin{cases} 1 - \frac{|t-2|}{2}, & \text{if } 0 \le t \le 4 \\ 0, & \text{otherwise} \end{cases}

BIG(t) = \begin{cases} 1 - \frac{4-t}{4}, & \text{if } 0 \le t \le 4 \\ 0, & \text{otherwise} \end{cases}
use the Mamdani method of inference together with a defuzzification method of your choice to compute the output for the inputs x = 2, y = 2 and z = 4.
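The computation can be checked numerically. The rule base of the exercise is not reproduced in this excerpt, so the two rules in the sketch below are purely illustrative assumptions; the membership functions are the ones defined above, and max-min Mamdani inference with centroid defuzzification is used.

```python
import numpy as np

def small(t):  return 1 - t / 4 if 0 <= t <= 4 else 0.0
def medium(t): return 1 - abs(t - 2) / 2 if 0 <= t <= 4 else 0.0
def big(t):    return 1 - (4 - t) / 4 if 0 <= t <= 4 else 0.0

def mamdani(rules, inputs, universe):
    """Mamdani max-min inference with centroid defuzzification.

    rules: list of (antecedent_mfs, consequent_mf); each rule fires with the
    minimum of its antecedent memberships and clips its consequent.
    """
    agg = np.zeros_like(universe)
    for antecedents, consequent in rules:
        w = min(mf(v) for mf, v in zip(antecedents, inputs))
        agg = np.maximum(agg, np.minimum(w, [consequent(u) for u in universe]))
    return float((universe * agg).sum() / agg.sum())    # centroid

# Hypothetical rules (NOT the exercise's rule base):
#   IF x SMALL AND y MEDIUM AND z BIG THEN out BIG
#   IF x MEDIUM AND y BIG AND z SMALL THEN out SMALL
rules = [([small, medium, big], big), ([medium, big, small], small)]
out = mamdani(rules, (2, 2, 4), np.linspace(0, 4, 401))
```

With these example rules only the first rule fires (with strength 0.5), so the aggregated set is the BIG consequent clipped at 0.5.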
Part IV
PROBABILISTIC COMPUTING
Chapter 12
Introduction
A substantial number of our everyday decisions are made under conditions of uncertainty. We are often faced with situations where conflicting evidence forces us to explicitly value observed information in order to make rational decisions. Different methods, such as logical, probabilistic and numerical approaches, have been developed to handle uncertainty in reasoning systems. Bayesian networks (BNs), also known as causal probabilistic networks, belief networks etc., are one appealing formalism for describing uncertainty within domains. In recent years, the interest in BNs has increased due to new efficient algorithms and user-friendly software, making BNs available to others than the research community responsible for them.

BNs have been successfully applied in applications such as medical diagnosis, image processing and software debugging, and new applications are reported frequently.
12.2.1 Definitions
The probability of an event a is defined as

P(a) = \frac{g(a)}{N},   (12.2.1)

where g(a) is the number of positive outcomes for a and N is the total number of outcomes. The number of positive outcomes for a is bounded by

0 \le g(a) \le N,   (12.2.2)

so that

0 \le P(a) \le 1.   (12.2.3)

If A is a random variable with the discrete and exhaustive states \{a_1, \ldots, a_j\} then

\sum_{i=1}^{j} P(A = a_i) = 1.   (12.2.4)
P(A|B) = \frac{P(A, B)}{P(B)}   (12.2.5)

where P(A, B) is the joint probability of A and B, i.e. "A and B". Rewriting equation (12.2.5) gives an expression for the joint probability of dependent variables:
A is independent of B if and only if P(A|B) = P(A). Using this fact in equation (12.2.6) gives an expression for independent variables:
Since the phrase “A and B” is equal to “B and A”, equation (12.2.5) can be used
to formulate:
P(A|B) = \frac{P(B|A) P(A)}{P(B)}   (12.2.10)
which is known as Bayes' rule or Bayes' theorem. This equation makes it possible to calculate the posterior probability, P(A|B), for an event when the opposite conditional, P(B|A), is known (together with the unconditional probabilities of the individual events).
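Both relations are directly computable one-liners; the helper names below are illustrative.

```python
def conditional(p_ab, p_b):
    """P(A|B) = P(A, B) / P(B), the fundamental rule (12.2.5)."""
    return p_ab / p_b

def bayes(p_b_given_a, p_a, p_b):
    """P(A|B) = P(B|A) P(A) / P(B), Bayes' rule (12.2.10)."""
    return p_b_given_a * p_a / p_b
```

For example, with hypothetical numbers P(a) = 0.01, P(b|a) = 0.9 and P(b) = 0.108, the posterior is bayes(0.9, 0.01, 0.108) = 0.009/0.108 ≈ 0.083.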
Bayesian Networks
[Figure 13.1: the timer T with directed links to Holmes' car H and Watson's car W.]
three random variables: Holmes' car (H), Watson's car (W) and the timer (T). All these variables are discrete with two mutually exclusive states each. The cars can either start or not, and the timer is either working or not.

The causal relationships between these random variables can be expressed by connecting them pairwise with directed links. Here, the timer affects (through the power supply) the functionality of the cars, so a link can be drawn from the timer to each of the two cars, see figure 13.1. The direction of the links is of great importance: the state of the timer (working or not working) affects the cars' ability to start, not the other way around.

In order to complete the network model, a conditional probability table (CPT) must be specified for each node. Since the variable T is a root node in the network, i.e. node T has no incoming links, only the unconditional probability of each state needs to be specified. The probability that the timer will fail, P(¬t), is 20/365 ≈ 0.05, and the probability that the timer will work, P(t), is therefore 1 − P(¬t) ≈ 0.95. The probability tables for Holmes' and Watson's cars must be specified in the context of the timer. For example, from the text it can be read that the chance that Holmes' car will fail when the timer is working, P(¬h|t), is 2/10 = 0.2. The complete CPTs for Holmes' and Watson's cars are given in table 13.1.

With the above example in mind it is now appropriate to give a formal definition of Bayesian networks:
With the above example in mind it is now appropriate to give a formal definition
of Bayesian networks:
[Figure 13.2: the three basic connection types: a) linear A → B → C; b) diverging A ← P → B; c) converging A → C ← B.]
13.2.1 Linear
A linear connection is shown in figure 13.2a. If B is unknown, the probability of B is determined from the status of A. Since C is determined from B, C therefore becomes dependent on A.
If B is known to be in state b_i (and therefore unaffected by A), the probability of C can be calculated directly from its probability table, P(C|b_i), and C is therefore conditionally independent of A.
13.2.2 Diverging
This case was previously illustrated by the car trouble example. Generally, nodes
from a common parent P are dependent unless there is evidence in P . Evidence in
P blocks the path from A to B, see figure 13.2b.
13.2.3 Converging
The last case to consider is when two or more variables cause the same effect (figure 13.2c). If nothing is known about C, the parents A and B are independent. For example, it is fairly safe to state that cold and allergy-attack, which both can cause a person to sneeze, are independent, see figure 13.3 (this example is borrowed from Henrion [Henrion et al 91]). However, observing a person sneezing makes the two events dependent. If a person sneezes when an allergen (for example, from a cat) is present, the support for cold is reduced in favour of the allergy-attack theory. This phenomenon is also known as "explaining away": the allergy-attack explains away the cold. Here, the evidence was observed directly in the converging node S. This is not necessarily the case. In fact, any observed descendant node of S can act as a transmitter between the parents and make them conditionally dependent.
The conclusion is that a converging node, C, blocks the path between its parents unless there is evidence in C or any of its descendants.
[Figure 13.3: Cat points to Allergy-attack (A); Cold (C) and Allergy-attack both point to Sneeze (S).]
Figure 13.3: If a person is sneezing when an allergen is present, the support for cold is reduced.
13.2.4 d-separation
The three cases above can be used to formulate a separability criterion, called
direction-dependent separation or d-separation. The d-separation is a very important property that can be used to find efficient inference algorithms. A formal proof can be found in Pearl [Pearl 88].
Definition 13.2.1 Any two nodes, A and B, in a Bayesian network are d-separated, and therefore conditionally independent, if every path between A and B is blocked by an intermediate node, V ∉ {A, B}. V blocks the path if and only if one of the following holds:
(i) the structure is linear or diverging and V is known;
(ii) the structure is converging and neither V nor its descendants are known.
The d-separability criterion will later be used frequently in the presentation of in-
ference methods.
(b) A ⊥ B | P
(c) A ⊥ B (however, A ⊥ B | C is false)
probability values needed can be too many to manage, and with larger sets of parents the CPT quickly becomes unwieldy. For example, a discrete boolean variable with only four boolean parents needs as many as 32 values to complete the CPT. Even with a large database there is a potential risk that some of the combinations are too unusual to provide a reliable estimate. One commonly used model to avoid this problem is the noisy-Or gate derived by Judea Pearl in [Pearl 88].
The model is based on disjunctive interaction, that is, when the likelihood of a particular condition is unchanged when other conditions occur at the same time. For example, if cold, pneumonia and chicken-pox are likely to cause fever, then disjunctive interaction applies when a person suffering from several of these diseases at the same time would only be more likely to develop fever. Furthermore, if the person is also suffering from a disease that is unlikely to cause fever, this additional evidence does not reduce the support for fever caused by the other diseases. There exists a well-founded theory for a disjunctive model if the following assumptions are made:
(i) Boolean variables: All included variables must have two discrete states, namely true and false.
(ii) Accountability: The event is presumed false if all the conditions listed as its causes are false.
(iii) Exception independence: The processes that inhibit an event under a condition are independent.
Assumption (ii) is not as strict as it might look, as it is always possible to add an "Other causes" variable to represent what is not explicitly specified in the closed-world assumption. The model requires that only the individual probabilities for each parent are specified, i.e. P(E| only H_i). Thus, using this technique requires only
[Figure: Cold (Co), Pneumonia (Pn) and Chicken-pox (Cp) connected through a noisy-Or gate to Fever (Fe).]
Suppose P(fe|co) = 0.4, P(fe|pn) = 0.8 and P(fe|cp) = 0.9. For example, the probability of fever when both Cold and Pneumonia are present is P(fe|co, pn) = 1 − (1 − 0.4)(1 − 0.8) = 0.88.
Table 13.3: Probability values for fever calculated with the noisy-Or model.

Co Pn Cp | P(¬fe) P(fe)
f  f  f  | 1      0
f  f  t  | 0.1    0.9
f  t  f  | 0.2    0.8
f  t  t  | 0.02   0.98
t  f  f  | 0.6    0.4
t  f  t  | 0.06   0.94
t  t  f  | 0.12   0.88
t  t  t  | 0.012  0.988
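The table entries follow from multiplying the independent inhibitor probabilities (1 − p_i) of the causes that are present; a small sketch (the function name is illustrative):

```python
def noisy_or(p_only, present):
    """P(effect) under the noisy-Or model.

    p_only[i] = P(E | only H_i); present[i] says whether cause i holds.
    With independent inhibitors, P(not E) is the product of (1 - p_i)
    over the causes that are present.
    """
    q = 1.0
    for p, on in zip(p_only, present):
        if on:
            q *= 1.0 - p
    return 1.0 - q

fever = [0.4, 0.8, 0.9]   # P(fe | only co), P(fe | only pn), P(fe | only cp)
```

For instance, `noisy_or(fever, (True, True, False))` reproduces the 0.88 entry of table 13.3.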
Chapter 14
Inference in Bayesian Networks
Once a network is constructed it can be used to answer queries about the domain.
The basic task in a Bayesian network is to compute the probability of a variable
under evidence. Since there are no special input or output nodes, any variable can
be computed or observed as evidence. There are basically three different types of
inference that occur in a Bayesian network:
• Causal inference
• Diagnostic inference
• Intercausal inference
Causal inference is when the reasoning follows the same direction as the links in the
network. Diagnostic inference, on the other hand, is when the line of reasoning is
the opposite of the causal dependencies. The basic strategy to handle diagnostic
reasoning is to apply Bayes’ rule in equation (12.2.10). Intercausal inference means
reasoning between causes of a common effect. This is the same condition as in section
13.2.3. The presence of one cause makes the others less likely. Finally, combinations
of these inferences can appear in Bayesian networks. This is sometimes called mixed
inference.
Proof 14.1.1 Induction on the nodes in the network. Suppose that every causal Bayesian network has a joint distribution as in equation 14.1.1. For a network with only one node the hypothesis is obviously true. Suppose that the hypothesis is true for a network of the variables B_{n-1} = \{X_1, \ldots, X_{n-1}\}, that is, P(B_{n-1}) = \prod_{i=1}^{n-1} P(X_i|P_{X_i}). Let B_n = B_{n-1} \cup \{X_n\}, where X_n is a leaf in B_n. (Since B_n is a DAG, there is at least one leaf in B_n.) By using the fundamental rule in equation (12.2.5) the formula can be written

P(B_n) = P(B_{n-1}, X_n) = P(X_n|B_{n-1}) P(B_{n-1}).

Since X_n is independent of B_{n-1} \setminus P_{X_n}, the factor P(X_n|B_{n-1}) can be reduced to P(X_n|P_{X_n}), which gives P(B_n) = \prod_{i=1}^{n} P(X_i|P_{X_i}).

The fact that a Bayesian network can be represented by a joint distribution on the form above results in two important properties.
Here, consistent means that the probability values do not conflict with each other. It is rather easy, intentionally or not, to construct an inconsistent system. Let for example P(a) = 0.7, P(b) = 0.2 and P(b|a) = 0.65. These values might seem alright at a first glance, but they are in fact inconsistent. By using Bayes' rule the term P(a|b) can be written as P(b|a)P(a)/P(b) = 0.65 · 0.7/0.2 ≈ 2.28, which is > 1. The consistency property ensures that a Bayesian network does not violate the axioms of probability.
The joint distribution can be used to answer any query of the network. To
illustrate this, let's again return to the previous car trouble example. Here, the
joint distribution is P (H, W, T ) = P (H|T )P (W |T )P (T ). The computation of, say,
P (h, w, t) can be done by simply multiplying the corresponding values in the CPT:
Table 14.1: Atomic events for the car trouble example calculated with the joint
distribution.
H W T P(H, W, T)
y y y 0.684
y y n 0.0075
y n y 0.076
y n n 0.0075
n y y 0.171
n y n 0.0175
n n y 0.019
n n n 0.0175
The terms P(h, ¬w) and P(¬w) can be computed by using equation (12.2.9) to sum over all matching terms in table 14.1:

P(h|¬w) = \frac{0.076 + 0.0075}{0.076 + 0.0075 + 0.019 + 0.0175} ≈ 0.696.

If both cars fail to start, the probability of a broken timer is:

P(¬t|¬h, ¬w) = \frac{P(¬t, ¬h, ¬w)}{P(¬w, ¬h)} = \frac{0.0175}{0.019 + 0.0175} ≈ 0.48   (14.1.2)
Summing over the joint distribution is an easy method to use when answering queries in Bayesian networks. Unfortunately, due to the exponential growth of atomic events, this method cannot be used in practice in this simple way. The number of cases to consider is equal to the product of the number of states of each variable. If every variable is binary, the number of atomic events in a network with n nodes is as large as 2^n. With larger networks, this method simply becomes intractable. For example, the medical diagnosis system MUNIN (Olesen et al. in [Olesen et al 89]) consists of more than 1000 variables with up to seven states each. Even though the MUNIN system is one of the most elaborate applications found in the literature, real-world modeling often requires far more nodes than is manageable within the reach of this simple method. Even worse, Cooper proved in [Cooper 87] that inference in Bayesian networks is NP-hard irrespective of the method used: there are no efficient methods covering arbitrary networks. Despite this disheartening result there are ways to tackle the problem. The proof in [Cooper 87] applies to arbitrary networks, and there are certain families of network topologies for which more efficient algorithms exist. Also, many applications generate sparse graphs, in which the prospects of finding a more local computation scheme are good. Three such exact inference methods are described in sections 14.3.1, 14.3.2 and 14.4.
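For a network this small, the summing method is nevertheless easy to implement directly. The sketch below uses the CPT entries implied by table 14.1 (P(h|t) = 0.8, P(h|¬t) = 0.3, P(w|t) = 0.9, P(w|¬t) = 0.5); the function names are illustrative.

```python
from itertools import product

# CPTs: T = timer works, H/W = Holmes'/Watson's car starts.
P_T = {True: 0.95, False: 0.05}
P_H = {True: 0.8, False: 0.3}    # P(H starts | T)
P_W = {True: 0.9, False: 0.5}    # P(W starts | T)

def joint(h, w, t):
    """P(H, W, T) = P(H|T) P(W|T) P(T)."""
    ph = P_H[t] if h else 1 - P_H[t]
    pw = P_W[t] if w else 1 - P_W[t]
    return ph * pw * P_T[t]

def query(target, evidence):
    """P(target | evidence) by summing matching atomic events."""
    num = den = 0.0
    for h, w, t in product([True, False], repeat=3):
        world = {'H': h, 'W': w, 'T': t}
        if all(world[k] == v for k, v in evidence.items()):
            p = joint(h, w, t)
            den += p
            if all(world[k] == v for k, v in target.items()):
                num += p
    return num / den
```

`query({'T': False}, {'H': False, 'W': False})` reproduces the ≈ 0.48 of equation (14.1.2), but the loop visits all 2^n atomic events, which is exactly the exponential blow-up discussed above.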
[Figure: example DAGs: a) a singly connected network; b) a multiply connected network.]
[Figure 14.2: the evidence around a node X decomposed into E_X^+, reaching X from above through its parents U_1, ..., U_m (with evidence sets E_{U_k X}), and E_X^-, reaching X from below through its children Y_1, ..., Y_n and their other parents Z_{ij} (with evidence sets E_{X Y_i}).]
which makes \alpha:

\alpha = \frac{1}{\sum_X P(E_X^-|X, E_X^+)}.   (14.2.5)

That is, \alpha can be treated as a normalizing constant which scales the sum of the individual probabilities to 1. There is no need to find an explicit formula for P(E_X^+|E_X^-).

Next, P(X|E_X^+) can be computed by considering all possible configurations of the parents of X. Let P_X = \{U_1, \ldots, U_m\}; we get

P(X|E_X^+) = \sum_{P_X} P(X|P_X, E_X^+)\, P(P_X|E_X^+)   (14.2.6)

The term P(X|P_X, E_X^+) can be simplified into P(X|P_X) since the parents P_X d-separate E_X^+ from X. Furthermore, since X is a converging node, it blocks the undirected path between the parents, which ensures that they are conditionally independent. Using the equation of independent events (12.2.7) the equation can be written

P(X|E_X^+) = \sum_{P_X} P(X|P_X) \prod_k P(U_k|E_X^+).   (14.2.7)

The evidence, E_X^+, can be further decomposed into \{E_{U_1 X}, \ldots, E_{U_m X}\}, see figure 14.2. Since the parents are independent, each U_k is independent of all other parents and their evidence sets. This observation gives

P(X|E_X^+) = \sum_{P_X} P(X|P_X) \prod_k P(U_k|E_{U_k X}).   (14.2.8)
A closer look at the term P(U_k|E_{U_k X}) reveals that it is in fact a recursive instance of the original problem, excluding the node X. The term P(X|P_X) can be found in the conditional probability table for X.
Now, returning to equation (14.2.4), the last term to consider is P(E_X^-|X). Let Y = \{Y_1, \ldots, Y_n\} be the set of children of X. Decomposing the evidence into \{E_{XY_1}, \ldots, E_{XY_n}\} yields
The last term, P(Y_i, Z_i|X), can be written P(Y_i|X, Z_i)\, P(Z_i|X), and since X is d-separated from Z_i the formula becomes:

P(E_X^-|X) = \prod_i \sum_{Y_i} \sum_{Z_i} \frac{P(Z_i|E_{XY_i}^+)\, P(E_{XY_i}^+)}{P(Z_i)}\, P(E_{Y_i}^-|Y_i)\, P(Y_i|X, Z_i)\, P(Z_i)   (14.2.15)
The term P(E_{XY_i}^+) is unconditioned and therefore independent of the states of X. Replacing P(E_{XY_i}^+) with a constant \beta_i and cancelling the terms P(Z_i) gives:

P(E_X^-|X) = \prod_i \beta_i \sum_{Y_i} \sum_{Z_i} P(Z_i|E_{XY_i}^+)\, P(E_{Y_i}^-|Y_i)\, P(Y_i|X, Z_i)   (14.2.16)

Now, inspecting each term reveals that P(E_{Y_i}^-|Y_i) is a recursive instance of P(E_X^-|X) excluding X, P(Y_i|X, Z_i) is a lookup in the conditional probability table (CPT) for Y_i, and P(Z_{ij}|E_{Z_{ij} Y_i}^+) is a recursive instance of the original problem, excluding Y_i. Notice that there is no need to find an explicit value of \beta since it can be combined with \alpha in equation 14.2.18 to form a new normalizing constant \xi. To summarize, the belief for a variable X under evidence, E, in a singly connected network can be evaluated as:

P(X|E) = \xi\, \pi(X)\, \lambda(X)   (14.2.18)

where \xi is a normalizing constant and

\pi(X) = \sum_{U} P(X|U) \prod_k P(U_k|E_{U_k X})   (14.2.19)

\lambda(X) = \prod_i \sum_{Y_i} \sum_{Z_i} P(E_{Y_i}^-|Y_i)\, P(Y_i|X, Z_i) \prod_j P(Z_{ij}|E_{Z_{ij} Y_i})   (14.2.20)
Different versions of how to turn the above equations into a general algorithm can be found in the literature. Pearl constructs in [Pearl 88] an object-oriented message-passing scheme where the flow of belief is updated by messages sent between adjacent nodes. A recursive formulation is derived by Russell and Norvig in [Russell and Norvig 95].
Identifying the symbols reveals that Y_1 = H, Y_2 = W, E_H^- = \{¬h\} and E_W^- = \{¬w\}. Since P(¬h|h) = P(¬w|w) = 0 and P(¬h|¬h) = P(¬w|¬w) = 1, the formula can be reduced to
That is, P(¬t|¬h, ¬w) ≈ 0.48, which is the same result as in equation (14.1.2).
14.3.1 Conditioning
The basic idea in the conditioning approach is to divide the multiply connected network into several smaller singly connected networks conditioned on a set of instantiated variables. Figure 14.4 shows the two networks created when the boolean variable Me in figure 14.3 is instantiated. Generally, the number of resulting subtrees equals the product of the numbers of states of the variables in the cutset, i.e. it grows exponentially with the size of the cutset. The cutset is the set of conditioned variables. The problem in this approach is to find
Br Sc | P(co|Br, Sc)
t  t  | 0.8
t  f  | 0.8
f  t  | 0.8
f  f  | 0.05
[Figure: in each of the two conditioned networks, Sc and Br point to Co, with Me fixed to one of its two states.]
Figure 14.4: Conditioning the node M e creates two singly connected networks.
the minimal cutset that divides the original network into singly connected subnets. Once created, the probability of a variable can be calculated as the weighted sum over each individual polytree.
A nice side-effect of this technique is that the weighted sum can be used to quickly calculate an approximate answer. Starting with the largest weight, the system can compute the probability until a desired level of accuracy is obtained. A simple way to calculate the accuracy range is to sum over all remaining weights to obtain an upper bound. The lower bound is, of course, the probability calculated so far, since probability values are always positive.
14.3.2 Clustering
Clustering takes the opposite approach to conditioning. Instead of dividing the network into smaller parts, clustering algorithms combine nodes into larger clusters. The variables Br and Sc in the coma example could be collapsed into a compound variable Z = {Br, Sc}. The states of the cluster node become the set of combinations of all included variables. Here, the states of Z are {(br, sc), (br, ¬sc), (¬br, sc), (¬br, ¬sc)}. The clustering transforms the network into a polytree where the inference can be performed as usual. The disadvantage is, of course, that if the network is dense, the compound variables can become intractably large, since the number of states is exponential in the number of collapsed variables. Despite this fact, clustering techniques are by many considered the best exact algorithms for most types of non-singly connected networks.

One particularly interesting, and currently the standard, algorithm for clustering networks was originally developed by Lauritzen and Spiegelhalter in [Lauritzen and Spiegelhalter 88]. The method was later improved with a general absorption scheme by Jensen et al. in [Jensen et al 90]. This technique uses properties of the clusters in order to efficiently propagate the flow of belief.
The key issue is the concept of consistent universes. Consider two clusters, V = \{A, B, C\} and W = \{C, D, E\}, in figure 14.5. Now, the probability of the common variable C can be calculated by summing over all elements except C in both V and W. Thus,

P(C) = \sum_{A,B} P(A, B, C) = \sum_{D,E} P(C, D, E)   (14.3.22)

[Figure 14.5: the clusters {A, B, C} and {C, D, E} linked through the separator C.]

If evidence changes the information in V, then the above condition can be used to update the probability for W in the following way. Initially, let the distributions for V and W be P^0(A, B, C) and P^0(C, D, E), respectively. Now, suppose that evidence in V changes the distribution to P^1(A, B, C). With this new information, the probability of the common variable C can be marginalized out of P^1(A, B, C) as in equation 14.3.22,

P^1(C) = \sum_{A,B} P^1(A, B, C).   (14.3.23)

Using the fundamental equation 12.2.5, the term P^1(C, D, E) can be written

P^1(C, D, E) = P^0(D, E|C)\, P^1(C),

where the term P^0(D, E|C) can be calculated from the initial distribution:

P^1(C, D, E) = \frac{P^0(C, D, E)}{P^0(C)}\, P^1(C)   (14.3.26)
The scheme above is called absorption; the cluster W has absorbed from V. In general terms the absorption process can be described as follows:

Definition 14.3.1 Let V and W be clusters of variables and let S be the set of their common variables, that is, S = V ∩ W. Let \psi_V^0, \psi_W^0 and \psi_S^0 be the belief tables associated with each cluster. The absorption procedure is defined by the following steps:
The belief tables

A  B  | \psi_V^0(A, B)      B  C  | \psi_W^0(B, C)
a1 b1 | 0.05                b1 c1 | 0.1
a1 b2 | 0.6                 b1 c2 | 0.3
a2 b1 | 0.3                 b2 c1 | 0.4
a2 b2 | 0.05                b2 c2 | 0.2

for W can be calibrated to V by letting W absorb from V. The new belief table for the separator S = {B} is

\psi_S^1 = \sum_A \psi_V^0 = (0.05 + 0.3,\; 0.6 + 0.05) = (0.35,\; 0.65).

Finally, \psi_W^0 can be updated:

\psi_W^1 = \psi_W^0 \cdot \frac{\psi_S^1}{\psi_S^0} = (0.1, 0.3, 0.4, 0.2) \cdot \frac{(0.35, 0.65)}{(0.4, 0.6)} = (0.1 \cdot 0.875,\; 0.3 \cdot 0.875,\; 0.4 \cdot 1.083,\; 0.2 \cdot 1.083) ≈ (0.0875,\; 0.2625,\; 0.4333,\; 0.2167),

where \psi_S^0 = (0.1 + 0.3,\; 0.4 + 0.2) = (0.4,\; 0.6) is the marginal of \psi_W^0 onto B; the b1 entries of \psi_W^0 are scaled by 0.875 and the b2 entries by 1.083.
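The numbers above can be reproduced in a few lines of numpy; `absorb` is an illustrative helper specialized to this two-cluster case.

```python
import numpy as np

psi_V = np.array([[0.05, 0.6],   # rows a1, a2; columns b1, b2
                  [0.3, 0.05]])
psi_W = np.array([[0.1, 0.3],    # rows b1, b2; columns c1, c2
                  [0.4, 0.2]])

def absorb(psi_V, psi_W):
    """Let W absorb from V over the separator S = {B}."""
    psi_S_new = psi_V.sum(axis=0)    # marginalize V onto B -> (0.35, 0.65)
    psi_S_old = psi_W.sum(axis=1)    # marginalize W onto B -> (0.4, 0.6)
    return psi_W * (psi_S_new / psi_S_old)[:, None]
```

After absorption both clusters marginalize to the same separator table (0.35, 0.65), i.e. the universes are consistent.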
[Figure: a CPN and its moral graph, obtained by connecting the parents of each node and dropping the directions of the links.]
(1) Let i = |V|.
(2) Let X be the i:th node (using the order above) and let P be the set of
    neighbours of X with numbers < i.
(3) Add links between any two nodes in P that are not already connected.
(4) Let i = i − 1.
(5) Continue from (2) until i = 0.
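The fill-in loop above can be sketched directly in Python. The adjacency-set representation and variable names are our own, and the node numbering is assumed to come from the ordering step mentioned in the text:

```python
# A sketch of the fill-in loop in steps (1)-(5).  The graph is an
# adjacency-set dict; `order` is the assumed node numbering.

def triangulate(adj, order):
    """Add fill-in links in place; return the set of links added."""
    index = {node: i for i, node in enumerate(order, start=1)}
    fill_in = set()
    for i in range(len(order), 0, -1):               # steps (1), (4), (5)
        x = order[i - 1]                             # step (2)
        earlier = [n for n in adj[x] if index[n] < i]
        for j in range(len(earlier)):                # step (3)
            for k in range(j + 1, len(earlier)):
                a, b = earlier[j], earlier[k]
                if b not in adj[a]:
                    adj[a].add(b)
                    adj[b].add(a)
                    fill_in.add(frozenset((a, b)))
    return fill_in

# A four-node cycle A-B-D-C-A needs one fill-in link to triangulate:
adj = {"A": {"B", "C"}, "B": {"A", "D"},
       "C": {"A", "D"}, "D": {"B", "C"}}
added = triangulate(adj, ["A", "B", "C", "D"])       # adds the link B-C
```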
The triangulation of the graph ensures that the final cluster tree will fulfill the second
property above; see figure 14.9 for an example. Notice that the triangulation of a
graph is not unique: there are several steps in the algorithm where arbitrary choices
can be made. Intuitively, the best triangulation is the one that yields minimum
fill-in. However, in this case the optimal triangulation is concerned with the size
of the final junction tree. Unfortunately, finding the optimal triangulated graph is
NP-hard. See Arnborg et al. in [Arnborg et al 87] and the further discussion by Jensen
et al. in [Jensen and Jensen 94].
Next, the junction graph can be constructed by identifying the cliques in the
moral and triangulated graph. Links between clusters are added by connecting
clusters with a non-empty intersection, see figure 14.7. The intersection between two
clusters is called a separator, denoted S. Finally, the junction tree can be found in the
(Figure: the triangulated graph and the corresponding junction graph, with cliques
ABC, BCD, CDE and DEF.)
junction graph by finding a maximum spanning tree (Jensen [Jensen and Jensen 94]),
where the weight of a link is given by the number of variables in the separator,
i.e. |S|. Figure 14.8 shows a junction tree found in a junction graph.
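The maximum-spanning-tree construction can be sketched Kruskal-style, weighting each candidate link by its separator size |S|. The helper names and union-find bookkeeping are our own:

```python
# A sketch of extracting the junction tree from the junction graph.

def junction_tree(cliques):
    """cliques: list of frozensets.  Returns links (i, j, separator)."""
    links = sorted(((len(cliques[i] & cliques[j]), i, j)
                    for i in range(len(cliques))
                    for j in range(i + 1, len(cliques))
                    if cliques[i] & cliques[j]),
                   reverse=True)                 # heaviest separator first
    parent = list(range(len(cliques)))
    def find(i):                                 # union-find root
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i
    tree = []
    for _, i, j in links:
        ri, rj = find(i), find(j)
        if ri != rj:                             # keep only acyclic links
            parent[ri] = rj
            tree.append((i, j, cliques[i] & cliques[j]))
    return tree

# The cliques of figure 14.8 give the separators BC, CD and DE:
cliques = [frozenset("ABC"), frozenset("BCD"),
           frozenset("CDE"), frozenset("DEF")]
tree = junction_tree(cliques)
```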
(Figure 14.8: a junction graph over the cliques ABC, BCD, CDE and DEF, and the
junction tree obtained from it, with separators BC, CD and DE.)
(Figure: a CPN, its moral graph and the corresponding cluster graph with clusters
AB, AC, BD, CE and DEF.)
(i) Give all nodes (clusters) and separators a table of ones, i.e. ψ^0 = 1.

(ii) For each variable A, choose one cluster C containing {A} ∪ P_A and multiply
     P(A|P_A) (the CPT) with ψ_C^0.
14.3.2.6 Evidence

Entering observed evidence in a junction tree is easy. Evidence is normally of the
form A = a_j. Semantically, this means that the probability of all other states is 0:
P(A) = (0, . . . , 1, 0, . . .), with a "1" in the j:th position. The same can be done for a
cluster of variables: let E = (0, . . . , 1, 0, . . .) be a finding on A, and multiply E with
the belief table for any cluster containing A.
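Entering a finding can be sketched as a pointwise multiplication with a 0/1 vector; the dict-based table layout is our own illustration:

```python
# A sketch of entering the finding A = a_j: multiply a 0/1 finding
# into the belief table of any cluster containing A.

def enter_evidence(psi, var_pos, observed):
    """Keep entries consistent with the finding; zero out the rest."""
    return {states: (value if states[var_pos] == observed else 0.0)
            for states, value in psi.items()}

psi = {("b1", "c1"): 0.1, ("b1", "c2"): 0.3,
       ("b2", "c1"): 0.4, ("b2", "c2"): 0.2}
psi_e = enter_evidence(psi, var_pos=1, observed="c1")
# Only the entries with C = c1 keep their value (0.1 and 0.4).
```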
(iii) Let the parent absorb from the node: call Absorb(parent, node) (unless
      the parent is the root of the tree).

TopDown(node, parent)
When the junction tree is consistent, the belief for a single variable can be computed
by marginalization:

    P(A) = Σ_{C\A} ψ_C,        (14.3.29)
(Figure: the message passes in BottomUp(R) and TopDown(R): messages are first
collected towards the root R, then distributed back out again.)
(iv) Construct initial belief tables for each node and separator in the junction tree.

(vi) Select a root node, R (any node in the junction tree can act as root).

Note that steps (i) to (iii) constitute a static task; there is no need to redo this
process unless the structure of the network is changed.
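Only fragments of the propagation pseudocode survive above, but the two passes can be sketched as follows. The adjacency-dict tree and the `absorb(a, b)` callback (cluster a absorbs from cluster b) are our own names, reconstructed from the surviving step fragments:

```python
# A sketch of the two propagation passes over the junction tree.
# `tree` maps a cluster id to its neighbours.

def bottom_up(tree, node, parent, absorb):
    for child in tree[node]:
        if child != parent:
            bottom_up(tree, child, node, absorb)
    if parent is not None:
        absorb(parent, node)            # the parent absorbs from the node

def top_down(tree, node, parent, absorb):
    for child in tree[node]:
        if child != parent:
            absorb(child, node)         # the child absorbs from the node
            top_down(tree, child, node, absorb)

def propagate(tree, root, absorb):
    """BottomUp(R) followed by TopDown(R) makes the tree consistent."""
    bottom_up(tree, root, None, absorb)
    top_down(tree, root, None, absorb)

# On a chain 0 - 1 - 2 rooted at 0, absorption runs inwards, then out:
calls = []
propagate({0: [1], 1: [0, 2], 2: [1]}, 0,
          lambda a, b: calls.append((a, b)))
```

After `propagate` the tree is globally consistent, and P(A) follows by marginalizing any cluster containing A, as in equation 14.3.29.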
(Figure: the original BN over Me, Sc, Br and Co; its moral graph; and the junction
tree with clusters MSB and SBC connected by the separator SB.)
term P(Co|Sc, Br) is multiplied with ψ_{C2}^0. The initial belief table for the separator
remains 1; the tables for C1 and C2 are shown in table 14.3.

Notice that these two clusters are not consistent. To make them consistent the
absorption process must be applied. Suppose that C1 is selected as root in the tree.
A call to BottomUp() will cause C1 to absorb from C2. In this absorption nothing is
changed, however, since the separator ψ_S^1 will remain equal to 1 when marginalized
out of ψ_{C2}^0. This is, of course, not a coincidence, since ψ_{C2}^0 at this stage is equal to
P(Co|Sc, Br). Next, the call to TopDown() will force C2 to absorb from C1. The new
belief table for the separator S is shown in table 14.4. Finally, when the junction tree
is globally consistent it is possible to calculate the probability for every individual
variable. For instance, the probability for coma can be marginalized out of ψ_{C2}^2:
    P(co) = Σ_{Sc,Br} ψ_{C2}^2
          = 0.032 + 0.224 + 0.032 + 0.032
          = 0.32.
Now, ψ_{C2}^0 is updated to ψ_{C2}^1 by multiplying ψ_S^2 / ψ_S^1 with ψ_{C2}^0. The result is shown in
the last column of table 14.3.
140 CHAPTER 14. INFERENCE IN BAYESIAN NETWORKS
    Br  Sc  ψ_S^2
    y   y   0.04
    y   n   0.04
    n   y   0.28
    n   n   0.64
P (A)P (T |A)P (E|T, L)P (X|E)P (L|S)P (B|S)P (D|E, B)P (S) (14.4.30)
To compute the unconditional probability of, say, dyspnea, it is possible to sum over
14.4. SYMBOLIC PROBABILISTIC INFERENCE (SPI) 141
which requires substantially fewer computations. The SPI approach is to find the op-
timal factoring so that the necessary calculations in the joint distribution are kept
to a minimum. This problem is closely related to the standard optimal factoring
problem, OFP, which is believed to be NP-hard. However, recently some heuris-
tic search algorithms have been developed which appear to find good factoring
solutions. Also, computations can be saved using caching techniques, to avoid recom-
puting values that are already calculated in a previous step in the summation. For
example, the term Σ_S P(L|S)P(B|S)P(S) above is unchanged during the summa-
tion over the variables A, T, E and B and can therefore be saved in a cache memory
after the first computation, and then be accessed when needed during the remaining
computations.
Since a conditional probability P(X|Y) can be rewritten as P(X, Y)/P(Y), the
algorithm is not restricted to handling conjunctive queries.
A factor is a subset of the complete probability distribution. Each factor contains
a set of variables, which affect the distribution. For example, the factor P (B|A)
includes the variables {A, B} and combining this factor with the factor P (A) yields
a conformal factor, P (B|A)P (A), with the same set of variables as P (B|A).
Now, let Q be the set of target variables; a good factoring for P(Q) can be found
in the following way:
(i) First, find the relevant nodes in the original BN. This can be done using the
d-separation property to exclude parts of the network which have no relevance
to the current query. A linear-time algorithm to find this subtree can be found
in Pearl [Geiger et al 89].
(ii) Let F be a factor set which contains all factors to consider in the next com-
putation, and let C be the set of factor candidates. Initially, let F be all
distributions from the subtree and let C be empty.
(iii) Combine the factors in F pairwise and add to the candidate set C all pairs in
which one factor contains a variable that is a parent or a child of at least one
variable in the other factor. There is no need to add pairs that are already
in C.
(iv) Let U_i be the set of variables for each combination in C. Compute vars(U_i),
the number of variables in U_i excluding any target variable: vars(U_i) = |U_i \ Q|.
For each combination in C, compute sum(U_i), the number of variables
that can be summed out when the two factors are combined. A variable can be
summed out when it appears neither in the set of target variables Q nor
in any of the other factors in F (excluding those in the current pair).
Compute the result size as vars(U_i) − sum(U_i).
(v) Select the best candidate in C in the following way: choose the element in C
with the lowest result size. If more than one element applies, choose the one
with the largest number of variables (including target variables). If there is
still more than one candidate, choose one of them arbitrarily.
(vi) Construct a new factor by combining the chosen pair into a conformal factor.
Update F by replacing the two chosen factors with the new combined one.
Update C by deleting any pair that has a non-empty (factor-level) intersection
with the chosen pair.
(vii) Continue from step (iii) until only one factor remains in F. This is the resulting
factor.
Finally, use the resulting factor above to compute an answer for the conjunctive
probability.
    Candidate        U                 sum(U)   vars(U)   result size
    (f_Me, f_Br)     Me, Br            0        2         2
    (f_Me, f_Sc)     Me, Sc            0        2         2
    (f_Me, f_Co)     Me, Br, Sc, Co    0        3         3
    (f_Br, f_Sc)     Me, Br, Sc        0        3         3
    (f_Br, f_Co)     Me, Br, Sc, Co    0        3         3
    (f_Sc, f_Co)     Me, Br, Sc, Co    0        3         3
In this presentation all variables are assumed to be binary, i.e., they all have two
states. If the number of states is not equal for all variables, it is possible to compute
the size of a factor as the product over the states for every included variable instead
of just considering the number of variables.
According to Li and D’Ambrosio in [Li and D’Ambrosio 94] the set-factoring SPI
algorithm is superior to Jensen’s algorithm for most kinds of networks.
Loop 1: The factor set F is initialized to {f_Me, f_Br, f_Sc, f_Co} and C is empty. Every
    factor in F is then pairwise combined and added to C. The result is shown
    in table 14.5.

    Here, the best combinations are candidates no. 1 and 2, since both have mini-
    mum result size and an equal number of variables. Candidate no. 1, (f_Me, f_Br),
    is chosen; it replaces its two factors in F, which is updated to {(f_Me, f_Br), f_Sc, f_Co}.
    After deleting every pair with a non-empty intersection with (f_Me, f_Br), the
    candidate set C becomes {(f_Sc, f_Co)}.
Loop 2: Adding combinations from F makes C = {((fM e , fBr ), fSc ), ((fM e , fBr ),
fCo ), (fSc ,fCo )}. The best combination is ((fM e , fBr ), fSc ), in which it was
possible to sum out the variable M e. F is updated to {((fM e , fBr ), fSc ), fCo }
and C is empty.
Loop 3: The candidate set C becomes {(((f_Me, f_Br), f_Sc), f_Co)} and the only com-
    bination to choose is therefore (((f_Me, f_Br), f_Sc), f_Co). Both the variables Br
    and Sc could be summed out. F is updated to {(((f_Me, f_Br), f_Sc), f_Co)}, which
    fulfills the termination condition in step (vii) above.
Thus, the factoring result is:

    P(Co) = Σ_{Br,Sc} P(Co|Br, Sc) Σ_{Me} P(Sc|Me) P(Br|Me) P(Me)        (14.4.32)
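The factoring (14.4.32) is just two nested sums. Here is a sketch with made-up placeholder numbers for the CPTs (the chapter's table values are not repeated here); only the summation structure matters:

```python
# Evaluating the factoring: sum over Me first, then over Br and Sc.
# All CPT numbers below are made-up placeholders.

states = (True, False)
p_me = {True: 0.2, False: 0.8}                         # P(Me)
p_sc = {(True, True): 0.8, (False, True): 0.2,         # P(Sc | Me)
        (True, False): 0.2, (False, False): 0.8}
p_br = {(True, True): 0.9, (False, True): 0.1,         # P(Br | Me)
        (True, False): 0.2, (False, False): 0.8}
p_co = {(True, True, True): 0.8,   (False, True, True): 0.2,
        (True, True, False): 0.7,  (False, True, False): 0.3,
        (True, False, True): 0.6,  (False, False, True): 0.4,
        (True, False, False): 0.05, (False, False, False): 0.95}

def p_coma(co):
    """P(Co) by the factoring: the inner sum over Me is done once."""
    inner = {(br, sc): sum(p_sc[(sc, me)] * p_br[(br, me)] * p_me[me]
                           for me in states)
             for br in states for sc in states}
    return sum(p_co[(co, br, sc)] * inner[(br, sc)]
               for br in states for sc in states)
```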
(vii) etc . . .
(viii) When a value has been sampled for all unobserved variables, restart with X1
and repeat the process until sufficiently many cases have been generated.
14.6. CONNECTION TO PROPOSITIONAL CALCULUS 145
To avoid bias from the initial configuration (which can be very unlikely) it is common
to discard the first 5-10 percent of the generated samples. This is called "burn-in".
One problem with this kind of logic sampling is that it is possible to get stuck
in certain areas: there might exist an equally likely area, but in order to reach it
a variable needs to take a highly unlikely value. Another problem is that it can be
very time-consuming to estimate very unlikely events. Finally, the task of selecting
a valid starting configuration can be very tedious; it is in fact an NP-hard problem
in the general case!
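The sampling scheme, including the burn-in discard discussed above, can be sketched on a toy two-variable network A → B. The network and its numbers are our own example; its exact answer is P(B) = 0.3 · 0.9 + 0.7 · 0.2 = 0.41, which the estimate should approach:

```python
# A sketch of Gibbs sampling with burn-in on a toy network A -> B.

import random

def gibbs(n_samples, burn_in_frac=0.1, seed=1):
    rng = random.Random(seed)
    p_a = 0.3                                  # P(A = true)
    p_b = {True: 0.9, False: 0.2}              # P(B = true | A)
    a, b = True, True                          # starting configuration
    samples = []
    for _ in range(n_samples):
        # Resample A from P(A | b), proportional to P(A) P(b | A).
        w_t = p_a * (p_b[True] if b else 1 - p_b[True])
        w_f = (1 - p_a) * (p_b[False] if b else 1 - p_b[False])
        a = rng.random() < w_t / (w_t + w_f)
        # Resample B from P(B | a).
        b = rng.random() < p_b[a]
        samples.append((a, b))
    return samples[int(burn_in_frac * n_samples):]   # discard burn-in

samples = gibbs(20000)
est = sum(b for _, b in samples) / len(samples)      # near P(B) = 0.41
```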
• The noisy-Or model can be used to avoid a parameter explosion when a vari-
able has many parents.
EXERCISES
148 CHAPTER 15. SUMMARY AND EXERCISES
(Figure: the Bayesian network for the exercise, with arcs A → T, S → L, S → B,
T → E, L → E, E → X, E → D and B → D.)
IV.3 (Software required) The Monty Hall puzzle gets its name from an Amer-
ican TV game show, "Let's Make a Deal", hosted by Monty Hall. In this show,
you have the chance to win a prize if you are lucky enough to find the prize
behind one of three doors. The game goes like this:
The problem of the puzzle is: What should you do at your second selection?
Some would say that it does not matter because it is equally likely that the
prize is behind the two remaining doors. This, however, is not quite true. Build
a Bayesian network to conclude which action gives the highest probability.
Here is some help to get you started:
The Monty Hall puzzle can be modeled in three random variables: Prize,
First Selection, and Monty Opens.
– Prize represents the information about which door contains the prize.
This means that it has three states: ”Door 1”, ”Door 2”, and ”Door 3”.
– First Selection represents your first selection. This variable also has the
three states: ”Door 1”, ”Door 2”, and ”Door 3”.
– Monty Opens represents Monty Hall's choice of door when you have made
your first selection. Again, we have the three states: "Door 1", "Door
2", and "Door 3".
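As a sanity check for the network you build, the game can also be simulated directly. This quick Monte-Carlo sketch (door numbering and function names are our own) estimates the winning probability of each policy:

```python
# Simulate the Monty Hall game under the "stay" and "switch" policies.

import random

def play(switch, rng):
    doors = (1, 2, 3)
    prize = rng.choice(doors)
    first = rng.choice(doors)
    # Monty opens a door that hides no prize and was not selected.
    monty = rng.choice([d for d in doors if d != prize and d != first])
    if switch:
        first = next(d for d in doors if d != first and d != monty)
    return first == prize

rng = random.Random(0)
n = 20000
p_stay = sum(play(False, rng) for _ in range(n)) / n     # close to 1/3
p_switch = sum(play(True, rng) for _ in range(n)) / n    # close to 2/3
```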
(p ∨ q) ∧ ¬p ∧ ¬q (15.0.1)
Under which assumptions is the noisy-Or model valid? Do you think the
noisy-Or model is appropriate to use in this particular application?
IV.6 Given a discrete Bayesian network, B = {X1 , . . . , Xn }, an atomic con-
figuration is a specific assignment of each individual variable, i.e. X1 =
x1 , . . . , Xn = xn .
(a) Explain why the sum of the probability of each atomic configuration must
be equal to one, i.e.
    Σ_{X_1,...,X_n} P(X_1, . . . , X_n) = 1.        (15.0.2)
[AboaFuzz 95] AboaFuzz 1.0 User Manual (P. Eklund, M. Fogström, S. Olli), Åbo
Akademi University, 1995.
[Baker 87] J. E. Baker. Reducing Bias and Inefficiency in the Selection Algorithm, In
J. J. Grefenstette, editor, Genetic Algorithms and their Applications: Proceedings
of the Second International Conference on Genetic Algorithms, pages 14-21, 1987.
152 BIBLIOGRAPHY
[Ben-Ari 93] M. Ben-Ari, Mathematical Logic for Computer Science, Prentice Hall,
1993.
[Bezdek 74] J. C. Bezdek, Cluster Validity with fuzzy sets, Journal of Cybernetics,
3 (1974), 58-73.
[Bezdek 80] J. C. Bezdek, A Convergence Theorem for the Fuzzy ISODATA Cluster-
ing Algorithms, IEEE Transactions on Pattern Analysis and Machine Intelligence
2 (1980), 1-8.
[Bezdek 81] J. C. Bezdek, Pattern Recognition with Fuzzy Objective Function Algo-
rithms, Plenum Press, 1981.
[Buckles 97] B. Buckles, Seminar Course: Evolutionary Computation - Lecture 3.
Published on WWW, address http://www.eecs.tulane.edu/www/Buckles.Bill/ec.html, 1997.
[Choe and Jordan 92] H. Choe, J. Jordan, On the Optimal Choice of Parameters
in a Fuzzy C-Means Algorithm, Proc. IEEE International Conference on Fuzzy
Systems, San Diego, 349-354, 1992.
[Cooper and Herskovits 92] G. F. Cooper, E. Herskovits, A Bayesian method for the
induction of probabilistc networks from data, Machine Learning 9 (1992), 309-347.
[Davis 87] L. Davis (ed.), Genetic Algorithms and Simulated Annealing, Technical
Report, University of Illinois, 1988.
[Eklund and Klawonn 92] P. Eklund, F. Klawonn, Neuro Fuzzy Logic Programming,
IEEE Transactions on Neural Networks, Vol 3, No. 5, September 1992, 815-818.
[Eklund and Zhou 96] P. Eklund, J. Zhou, Comparison of Learning Strategies for
Adaptation of Fuzzy Controller Parameters, J. Fuzzy Sets and Systems, to appear.
[Everitt 74] B. S. Everitt, Cluster Analysis, John Wiley & Sons, 1974.
[Fullér 95] R. Fullér, Neural Fuzzy Systems, Meddelanden från ESF vid Åbo
Akademi, Serie A:443, 1995.
[Geisser 75] S. Geisser, The predictive sampling reuse method with applications, J.
Amer. Stat. Assoc. ? (1975), xx-xx.
[Höhle 89] U. Höhle, Monoidal Closed Categories, Weak Topoi, and Generalized
Logics preprint, 1989.
[Holland 92] J. Holland. Adaption in natural and artificial systems, The MIT Press,
Cambridge Massachusetts, London, England, 1992.
[Huber 96] B. Huber, Fibre Optic Gyro Application on Autonomous Vehicular Nav-
igation, PhD thesis, University of Strasbourg, 1996.
[Jain and Dubes 88] A. Jain, R. Dubes, Algorithms for Clustering Data, Prentice
Hall, 1988.
[Jang 92] J.-S. R. Jang, Self-learning fuzzy controllers based on temporal back prop-
agation, IEEE Trans. Neural Networks 3 No 5 (1992), 714-723.
[Moody 94] J. Moody, Prediction risk and architecture selection for neural networks,
In: V. Cherkassky, J. H. Friedman, H. Wechsler (Eds.), From Statistics to Neural
Networks: Theory and Pattern Recognition, NATO ASI Series F, Springer-Verlag,
1994.
[Moody and Utans 95] J. Moody, J. Utans, Architecture selection strategies for neu-
ral networks: Application to corporate bond rating predictions, In: A.-P. Refenes
(Ed.), Neural Networks in the Capital Markets, John Wiley & Sons, 1995, 277-
300.
[Olesen 93] K. G. Olesen, Causal probabilistic networks with both discrete and con-
tinuous variables, IEEE Transactions on Pattern Analysis and Machine Intelli-
gence 3 (1993).
[Olli 95] S. Olli, Fuzzy Control for AGNES, unpublished notes, Åbo Akademi,
1995.
[Riissanen and Eklund 96] T. Riissanen and P. Eklund, Working within a Fuzzy
Control Application Development Workbench: Case Study for a Water Treatment
Plant, Proc. EUFIT’96, 4th European Congress on Intelligent Techniques and
Soft Computing, Aachen, 1142-1145, 1996.
[Russell and Norvig 95] S. Russell, P. Norvig, Artificial Intelligence – a Modern Ap-
proach, Prentice-Hall International, 1995.
[Schweizer and Sklar 61] B. Schweizer, A. Sklar, Associative functions and statisti-
cal triangle inequalities, Publicationes Mathematicae Debrecen, 8 (1961), 169-
186.
[Shao 88] S. Shao, Fuzzy Self-Organizing Controller and its Application for Dynamic
Processes, Fuzzy Sets and Systems 26 (1988), 151-164.
[Smith and Kelleher 88] B. Smith, G. Kelleher (eds.), Reason Maintenance Sys-
tems and their Applications, Ellis Horowood series in Artificial Intelligence, Ellis
Horowood Limited, 1988.
[Takagi and Sugeno 85] T. Takagi, M. Sugeno, Fuzzy Identification of Systems and
Its Applications to Modeling and Control, IEEE Transactions on Systems, Man
and Cybernetics 15 (1985), 116-132.
[Umano 87] M.Umano, Fuzzy-Set Prolog, Second IFSA Congress, Tokyo, 1987, pp.
750-753.
[Wang and Mendel 92] L. X. Wang, J. M. Mendel, Fuzzy Basis Functions, Universal
Approximation, and Orthogonal Least-Squares Learning, IEEE Trans. on Neural
Networks., 3 No.5. September, 1992, 807-813.
[Windham 82] M. Windham, Cluster Validity for the Fuzzy c-Means Clustering Al-
gorithm, IEEE Transactions on Pattern Analysis and Machine Intelligence, 4
(1982), 357-363.
[Yager 80] R. R. Yager, On a General Class of Fuzzy Connectives, Fuzzy Sets and
Systems 4 (1980), 235-242.
[Yager 96b] R. R. Yager, Constrained OWA Aggregation, Fuzzy Sets and Systems
81 (1996), 89-101.
[Zadeh 65] L. A. Zadeh, Fuzzy Sets, Information and Control 8 (1965), 338-353.
[Zadeh 75] L. A. Zadeh, The Concepts of a Linguistic Variable and its Application
to Approximate Reasoning, Information Science, 8 (1975), 199-249.
[Zadeh 78] L. A. Zadeh, Fuzzy Sets as a Basis for a Theory of Possibility, Fuzzy
Sets and Systems, Vol 1, 1978, pp. 3-28.
[Zadeh 89] L. A. Zadeh, The coming age of fuzzy logic, plenary talk at 3rd IFSA,
Seattle, August 6-11, 1989.
[Zhou and Eklund 95] J. Zhou, P. Eklund, Some Remarks on Learning Strategies for
Parameter Identification in Rule Based Systems, Proc. EUFIT’95, 3rd European
Congress on Intelligent Techniques and Soft Computing, Aachen, 1911-1916, 1995.
[] Discussions with Veli Kairisto, Turku University Central Hospital, on the diag-
nostic problem and the TUCH data set for acute myocardial infarction, Spring,
1996.