ARTIFICIAL INTELLIGENCE (PE-511 IT)


UNIT-III
PROBABILISTIC REASONING

TOPICS
Probabilistic Reasoning,
Probability,
Conditional probability,
Bayes’ Rule,
Bayesian Networks- Representation, Construction, Inference,
Temporal model,
Hidden Markov model.

UNCERTAINTY
Most intelligent systems have some degree of uncertainty associated with them. Uncertainty in
systems arises primarily because of problems with data:
• A major cause is missing or unavailable data; data may be present but unreliable or
ambiguous due to errors in measurement, the presence of multiple conflicting measurements, and so on.
• It is not always possible to represent data in a precise and consistent manner.
• Data is often based on defaults, which may have exceptions, leading to errors in
intelligent systems.
Uncertainty may also be caused by the representation of knowledge since it might represent only
the best guesses of the expert based on observation or statistical conclusions, which may not be
appropriate in all situations. Because of these numerous sources of error, it is imperative that
uncertainty management is incorporated in intelligent systems.

Important issues to be considered in an uncertainty-handling scheme


• Representation of uncertain data.
• Combination of two or more pieces of uncertain data.
• Drawing inference using uncertain data.

Several methods are available for handling uncertain information, such as

• Probability theory
• Fuzzy logic
• Modal logic
• Temporal logic

1. PROBABILITY THEORY
Probability theory is designed to estimate the degree of uncertainty. It is the oldest of these methods
and has a strong mathematical basis.
The term probability is defined as a way of turning an opinion or an expectation into a number lying
between 0 and 1. It basically reflects the likelihood of an event, or the chance that a particular event
will occur.
Assume a set S (known as a sample space) consisting of all possible outcomes of a random
experiment.

 Every non-empty subset A of the sample space S is called an event.
 The empty set is called an impossible event;
 S is called a sure event.
 The probability of an event A, denoted by P(A), is defined as

P(A) = (Number of outcomes favorable to A) / (Total number of possible outcomes)

 The probabilities of all the events {A1, A2, …, AN} must sum up to certainty, i.e.,
P(A1) + … + P(AN) = 1.
 An impossible event has probability 0, while a sure event has probability 1.

Consider an example of tossing a coin:

The probability of throwing two successive heads with a fair coin is found to be 0.25. There
are only four possible outcomes {HH, HT, TH, TT} of tossing a fair coin twice. Since there is only
one way of getting HH out of four, the probability = 1/4 = 0.25.

If S represents a sample space and A and B represent events, then the following axioms hold true,
where A' represents the complement of set A:
 P(A) ≥ 0
 P(S) = 1
 P(A') = 1 − P(A)
 P(A∪B) = P(A) + P(B), if events A and B are mutually exclusive.
 P(A∪B) = P(A) + P(B) − P(A∩B), if A and B are not mutually exclusive. This is called the
addition rule of probability.
 In general, if events A1, A2, …, AN in S are mutually exclusive, then we can write
P(A1∪A2∪…∪AN) = P(A1) + … + P(AN)

1.1 JOINT PROBABILITY


Joint probability is defined as the probability of two events in conjunction, i.e., it refers to the
probability of both events occurring together. The joint probability of A and B is written as
P(A∩B) or P(A and B). For independent events,
P(A∩B) = P(A) * P(B)
Two events are said to be independent if the occurrence of one event does not affect the
probability of occurrence of the other.
Consider an example of tossing two fair coins separately.
 The probability of getting head H on tossing the first coin is denoted by P(A) = 0.5, and the
probability of getting head on tossing the second coin is denoted by P(B) = 0.5.
 The probability of getting H on both the coins is called the joint probability and is represented
as P(A and B). It is calculated as
P(A and B) = P(A) * P(B)
= 0.5 * 0.5
= 0.25
 Similarly, the probability of getting H on one or both coins can be calculated. It is called the
union of the probabilities P(A) and P(B), is denoted by P(A∪B), and is also written as
P(A or B):
P(A or B) = P(A) + P(B) − P(A∩B)
= P(A) + P(B) − P(A) * P(B)
= 0.5 + 0.5 − 0.25
= 0.75

Consider another example where Joint probability distribution of two variables A and B are
given in table.
Joint probabilities:

         A       A'
B       0.20    0.12
B'      0.65    0.03

P(A and B) = 0.20, P(A and B') = 0.65, P(A' and B) = 0.12, P(A' and B') = 0.03.
From this data we can compute P(A) and P(B) as
P(A) = P(A and B) + P(A and B') = 0.20 + 0.65 = 0.85
P(B) = P(A and B) + P(A' and B) = 0.20 + 0.12 = 0.32
We can compute the probability of any logical combination of A and B as follows:
P(A or B) = P(A) + P(B) − P(A∩B)
= [P(A and B) + P(A and B')] + [P(A and B) + P(A' and B)] − P(A and B)
= P(A and B) + P(A and B') + P(A' and B)
= 0.20 + 0.65 + 0.12 = 0.97
Alternatively, we can compute P(A or B) as follows:
P(A or B) = 1 − P((A or B)')
= 1 − P(A' and B')
= 1 − 0.03 = 0.97
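
These computations can also be checked in code. Below is a minimal Python sketch (the dictionary layout is just one convenient encoding, not part of the original example) that recovers the marginals and P(A or B) from the joint table above.

```python
# Minimal sketch: marginals and unions from a joint probability table.
# Keys are (A, B) truth-value pairs matching the table above.
joint = {
    (True, True): 0.20,    # P(A and B)
    (True, False): 0.65,   # P(A and B')
    (False, True): 0.12,   # P(A' and B)
    (False, False): 0.03,  # P(A' and B')
}

# Marginals: sum the joint over the other variable.
p_a = sum(p for (a, b), p in joint.items() if a)   # 0.85
p_b = sum(p for (a, b), p in joint.items() if b)   # 0.32

# Union via the addition rule, and via the complement.
p_a_or_b = p_a + p_b - joint[(True, True)]         # 0.97
p_a_or_b_alt = 1 - joint[(False, False)]           # 0.97

print(p_a, p_b, p_a_or_b, p_a_or_b_alt)
```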

3. CONDITIONAL PROBABILITY
When the probability of the statement is based on a piece of evidence, this type of probability is
called Conditional Probability. The concept of Conditional Probability relates the probability of
one event to the occurrence of another. It is defined as the probability of the occurrence of an
event H(hypothesis) provided an event E(evidence) is known to have occurred. It is denoted by
P(H|E)
P(H | E) = (Number of events favourable to H which are also favourable to E) / (Number of events favourable to E)

P(H | E) = P(H and E) / P(E)

This rule cannot be used in cases where P(E)=0

Result 1: If events H and E defined in the sample space S of a random experiment are independent,
then P(H|E) = P(H).
Proof: P(H|E) = P(H and E) / P(E)   (by definition)
= P(H) * P(E) / P(E)   (by independence)
= P(H)
Similarly, P(E|H) = P(E).

EXAMPLE 1(INDEPENDENT EVENTS)


We are given the probability of any person chosen at random being literate as 0.40, and the
probability of any person chosen at random having age > 60 as 0.005. Find the probability that a
person chosen at random of age > 60 years is literate.
Solution
The probability that a given person chosen at random is both literate and of age > 60 years is
calculated as follows (the two events are treated as independent):
P(X is literate and Age of X > 60) = P(X is literate) * P(Age of X > 60)
= 0.40 * 0.005
= 0.002
Then the probability that a person chosen at random of age > 60 years is literate is
calculated using the conditional probability formula as follows:

P(X is literate | Age of X > 60) = P(X is literate and Age of X > 60) / P(Age of X > 60)
= 0.002 / 0.005
= 0.4

EXAMPLE 2(DEPENDENT EVENTS)


Suppose we are given the probability of some event today, such as the sun being bright today, as
P(sunny_today) = 0.6, and the probability of the same event tomorrow, given that the sun is bright
today, as P(sunny_tomorrow | sunny_today) = 0.8.

It must be noted that if the sun is bright today then the probability of the sun being bright tomorrow
is high; these two events are not independent. All possible joint probabilities of these two
variables can be computed. Let A represent "the sun is bright today" and B represent "the sun will
be bright tomorrow".

All four situations can be represented as paths in a tree, and using the paths we can compute the
joint probabilities appropriately.
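
Only P(sunny_today) and P(sunny_tomorrow | sunny_today) are given above, so a small sketch of the four joint probabilities needs one extra assumption; the value used for P(sunny_tomorrow | not sunny_today) below is purely illustrative.

```python
# Sketch of the joint probabilities for the dependent-events example.
# p_a and p_b_given_a come from the text; p_b_given_not_a is NOT given
# above, so the value here (0.3) is an illustrative assumption only.
p_a = 0.6               # P(sunny_today)
p_b_given_a = 0.8       # P(sunny_tomorrow | sunny_today)
p_b_given_not_a = 0.3   # assumed for illustration only

# Joint probabilities via the product rule P(A and B) = P(B|A) * P(A).
p_ab = p_b_given_a * p_a                          # 0.48
p_a_notb = (1 - p_b_given_a) * p_a                # 0.12
p_nota_b = p_b_given_not_a * (1 - p_a)            # 0.12
p_nota_notb = (1 - p_b_given_not_a) * (1 - p_a)   # 0.28

# The four joint probabilities must sum to 1.
assert abs(p_ab + p_a_notb + p_nota_b + p_nota_notb - 1.0) < 1e-9
```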


4. BAYES’ THEOREM
Thomas Bayes' theorem for two events, published in 1763, is built on the concept of
conditional probability. This theorem provides a mathematical model for reasoning where prior
beliefs are combined with evidence to get estimates of uncertainty.
It relates the conditional probability and probabilities of events H and E. The basic idea is
to compute P(H|E) that represents the probability assigned to H after taking into account the new
piece of evidence E.
This approach relies on the concept that one should incorporate the prior probability of an
event into the interpretation of a new situation.

Bayes' theorem relates the conditional probabilities of events, allowing us to express P(H|E) in
terms of the probabilities P(E|H), P(E), and P(H). H denotes the hypothesis, while E represents a new
piece of evidence.

P(H | E) = P(E | H) * P(H) / P(E)

Bayes' theorem can be derived from conditional probability as follows:

P(H|E) = P(H and E) / P(E)
P(H|E) * P(E) = P(H and E)    (1)

Similarly,

P(E|H) = P(E and H) / P(H)
P(E|H) * P(H) = P(E and H)    (2)

From equations (1) and (2),

P(H|E) * P(E) = P(E|H) * P(H)
Hence we obtain
P(H | E) = P(E | H) * P(H) / P(E)


The probability P(H|E) can also be expressed in terms of P(E|H), P(E|~H), and P(H). Here P(E|H)
and P(E|~H) are two conditional probabilities which indicate how probable our piece of evidence is
depending on whether our theory is true or not true.
P(H|E) can be expressed as

P(H | E) = P(E | H) * P(H) / [P(E | H) * P(H) + P(E | ~H) * P(~H)]

To prove this, we need to show that P(E) = P(E|H) * P(H) + P(E|~H) * P(~H).

We know that P(E) = P(E and H) + P(E and ~H).
Using conditional probability, P(E and H) = P(E|H) * P(H) and
P(E and ~H) = P(E|~H) * P(~H)
Therefore,
P(E) = P(E|H) * P(H) + P(E|~H) * P(~H)
Substituting this expression for P(E) in the equation above gives

P(H | E) = P(E | H) * P(H) / [P(E | H) * P(H) + P(E | ~H) * P(~H)]

P(H) is called the prior probability of H. It is called prior because it does not take into account any
information regarding E. P(H|E) is known as the conditional probability of H given E. It is also called
the posterior probability because it is derived from, or depends on, the specified value of E. P(E|H) is
known as the conditional probability of E given H. P(E) is the prior probability of E and acts as a
normalizing constant.
Bayesian probability theory was used by the PROSPECTOR expert system for locating
deposits of various minerals from geological evidence at a particular location. Some expert systems
use Bayesian theory to derive further concepts.

Rules of the 'if-then' type in a rule-based system can incorporate probabilities, as in 'If X is true then
Y can be concluded with probability p'. For example, in the rule 'IF the patient has a cold, THEN the
patient will sneeze (0.75)', the probability 0.75 is stated at the end of the rule.

EXAMPLE 2: Suppose we are given


➢ The probability that Mike has a cold is 0.25.
➢ The probability that Mike was observed sneezing when he had a cold in the past is 0.9.
➢ The probability that Mike was observed sneezing when he did not have a cold is 0.20.
Find the probability of Mike having a cold given that he sneezes.

Solution:
The probability of the hypothesis that Mike has a cold: P(H) = 0.25
The probability that Mike was observed sneezing when he had a cold in the past:
P(Mike was observed sneezing | Mike has a cold) = P(E|H) = 0.90
The probability that Mike was observed sneezing when he did not have a cold:
P(Mike was observed sneezing | Mike does not have a cold) = P(E|~H) = 0.20
We have to find the probability of Mike having a cold when he was observed sneezing, that is,
P(Mike has a cold | Mike was observed sneezing) = P(H|E)
Using the formula


P(H | E) = P(E | H) * P(H) / [P(E | H) * P(H) + P(E | ~H) * P(~H)]

Compute the denominator: P(E|H)*P(H) + P(E|~H)*P(~H) = (0.90)(0.25) + (0.20)(0.75) = 0.375

Hence, we obtain
P(H|E) = (0.9 * 0.25) / 0.375 = 0.6

The probability of Mike having a cold given that he sneezes is equal to 0.6.
Similarly, we can compute the probability of his having a cold if he was not sneezing:
P(H|~E) = [P(~E|H) * P(H)] / P(~E)
= [(1 − 0.9) * 0.25] / [1 − 0.375]
= 0.025 / 0.625 = 0.04
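
The same computation can be packaged as a small function. Below is a minimal Python sketch of the two-hypothesis form of Bayes' rule, applied to the numbers above.

```python
def posterior(p_h, p_e_given_h, p_e_given_not_h):
    """Two-hypothesis Bayes' rule: returns P(H|E).

    P(H|E) = P(E|H)P(H) / [P(E|H)P(H) + P(E|~H)P(~H)]
    """
    p_e = p_e_given_h * p_h + p_e_given_not_h * (1 - p_h)
    return p_e_given_h * p_h / p_e

# Mike example from the text.
p_cold_given_sneeze = posterior(0.25, 0.90, 0.20)
print(round(p_cold_given_sneeze, 3))      # 0.6

# Probability of a cold given that he was NOT sneezing:
p_cold_given_no_sneeze = posterior(0.25, 1 - 0.90, 1 - 0.20)
print(round(p_cold_given_no_sneeze, 3))   # 0.04
```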
Bayes’ Theorem can be extended to include more than two events.
 One Hypothesis and Two Evidences
 One Hypothesis and Multiple Evidences
 Chain Evidences
 Multiple Hypothesis and Single Evidences
 Multiple Hypothesis and Multiple Evidences

4.1 One Hypothesis and Two Evidences


Consider a given hypothesis H and two evidences E1 and E2. If E1 and E2 are conditionally
independent given H, the probability of H when both E1 and E2 are true is calculated using the
following formula:

P(H | E1 and E2) = P(E1 | H) * P(E2 | H) * P(H) / P(E1 and E2)

We can derive different forms of this formula using Bayes' theorem and conditional probability to
compute P(H | E1 and E2).
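
As a sketch, the denominator can be expanded by total probability as P(E1 and E2) = P(E1|H)P(E2|H)P(H) + P(E1|~H)P(E2|~H)P(~H), again assuming conditional independence of the evidences given H and given ~H. The numbers in the example call below are hypothetical, used only to exercise the formula.

```python
def posterior_two_evidences(p_h, p_e1_h, p_e2_h, p_e1_not_h, p_e2_not_h):
    """P(H | E1 and E2), assuming E1 and E2 are conditionally
    independent given H and given ~H."""
    num = p_e1_h * p_e2_h * p_h
    den = num + p_e1_not_h * p_e2_not_h * (1 - p_h)
    return num / den

# Hypothetical values for illustration only.
print(posterior_two_evidences(0.25, 0.9, 0.8, 0.2, 0.3))  # 0.8
```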


BAYESIAN METHOD: ADVANTAGES AND DISADVANTAGES


Bayesian networks are useful for both inferential exploration of previously undetermined
relationships among variables and descriptions of these relationships upon discovery.
Advantages
❖The Bayesian method is based on a strong theoretical foundation in probability theory; it is currently
the most advanced of all certainty-reasoning methods.
❖This method has well-defined semantics for decision making.
Disadvantages
❖A system using the Bayesian approach needs quite a large amount of probability data to construct a
knowledge base.
❖Since conditional probabilities are based on statistical data, the sample size must be sufficient to
ensure that the probabilities obtained from it are accurate.
❖If conditional probabilities are based on human experts, then the question arises of whether the
values are consistent and comprehensive.
❖Since associations between the hypotheses and evidence are reduced to numbers, the actual
knowledge embedded within the data is eliminated.

5. BAYESIAN BELIEF NETWORKS


Joint probability requires 2^n entries for n variables to store the distribution explicitly. The time and
storage requirements for such computations become impractical as the value of n increases.
Moreover, inferring with such a large number of probabilities does not seem to model the human
process of reasoning.
When reasoning with uncertain beliefs, humans tend to single out a few
propositions that are known to be causally linked. This leads to the concept of forming a belief
network called a Bayesian belief network, which is a probabilistic graphical model that encodes a set
of variables and their probabilistic dependencies. This belief network is an efficient
structure for storing the joint probability distribution. In a Bayesian network, only one branch of the
tree needs to be traversed.

Formal Definition of Bayesian Belief Network
A Bayesian belief network may be formally defined as an acyclic (with no cycles) directed
graph where the nodes of the graph represent evidence or hypotheses and an arc connecting two
nodes represents the dependence between them.
For example, if there exists an arc from node X to another node Y, then X is called a parent of Y and
Y is called a child of X.
The set of parent nodes of a node Xi is represented by parent_nodes(Xi).
Joint probability of n variables: The joint probability for n variables (dependent or independent)
may be computed as
P(X1, …, Xn) = P(Xn | X1, …, Xn−1) * P(X1, …, Xn−1), or, expanding recursively,
P(X1, …, Xn) = P(Xn | X1, …, Xn−1) * P(Xn−1 | X1, …, Xn−2) * … * P(X2 | X1) * P(X1)

JOIN PROBABILITY OF N VARIABLES USING BAYESIAN NETWORK


 In a Bayesian network, the joint probability distribution can be written as the product of the local
distributions of each node given its parents:

P(X1, …, Xn) = ∏_{i=1}^{n} P(Xi | parent_nodes(Xi))

 This expression is a reduction of the joint probability formula for n variables, since the terms
corresponding to independent variables are not required.
 If a node Xi has no parents, its local probability distribution is said to be unconditional and is written
as P(Xi) instead of P(Xi | parent_nodes(Xi)).
 Nodes having parents are called conditional nodes.
 If the value of a node is observed, the node is said to be an evidence node.
 Nodes with no children are termed hypothesis nodes,
 while nodes with no parents are called independent nodes.

Example: Bayesian Belief Network

 Four nodes: A, B, C, D.
 {A, B} represents the evidence; {C, D} represents the hypotheses.
 A and B are unconditional nodes; C and D are conditional nodes.
 Arcs between the nodes represent the interdependence of evidence and hypotheses: C is
dependent on A, and D is dependent on both A and B.
 Nodes A and B are independent of each other.


 Each node has a probability attached to it:

P(A) = 0.3        P(B) = 0.6

A   P(C)          A   B   P(D)
T   0.4           T   T   0.7
F   0.2           T   F   0.4
                  F   T   0.2
                  F   F   0.01

Following are the given probabilities


 P(A)=0.3
 P(B)=0.6
 P(C|A)=0.4
 P(C|~A)=0.2
 P(D|A,B)=0.7
 P(D|A,~B)=0.4
 P(D|~A,B)=0.2
 P(D|~A,~B)=0.01
 The joint probability for the four variables can be computed using
P(A,B,C,D) = P(D|A,B,C) * P(C|A,B) * P(B|A) * P(A). This can be simplified using the Bayesian
belief network to
P(A,B,C,D) = P(D|A,B) * P(C|A) * P(B) * P(A)
= 0.7 * 0.4 * 0.6 * 0.3 = 0.0504
This result can also be obtained by using
P(X1, …, Xn) = ∏_{i=1}^{n} P(Xi | parent_nodes(Xi))
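
A minimal Python sketch of this factorization follows; the CPTs are stored as dictionaries keyed by the parent values, following the tables above.

```python
# Minimal sketch: joint probability from the network's local distributions.
# CPTs follow the tables above; dictionary keys are parent truth values.
P_A = {True: 0.3, False: 0.7}
P_B = {True: 0.6, False: 0.4}
P_C_given_A = {True: 0.4, False: 0.2}                    # P(C=T | A)
P_D_given_AB = {(True, True): 0.7, (True, False): 0.4,
                (False, True): 0.2, (False, False): 0.01}  # P(D=T | A, B)

def joint(a, b, c, d):
    """P(A=a, B=b, C=c, D=d) = P(A) * P(B) * P(C|A) * P(D|A,B)."""
    p_c = P_C_given_A[a] if c else 1 - P_C_given_A[a]
    p_d = P_D_given_AB[(a, b)] if d else 1 - P_D_given_AB[(a, b)]
    return P_A[a] * P_B[b] * p_c * p_d

print(joint(True, True, True, True))  # 0.3 * 0.6 * 0.4 * 0.7 = 0.0504
```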


5.1 SIMPLE BAYESIAN BELIEF NETWORKS: EXAMPLE


 The topology of the network—the set of nodes and links—specifies the conditional independence
relationships that hold in the domain, in a way that will be made precise shortly.
 The intuitive meaning of an arrow is typically that X has a direct influence on Y, which suggests
that causes should be parents of effects. It is usually easy for a domain expert to decide what direct
influences exist in the domain—much easier, in fact, than actually specifying the probabilities
themselves.
 Once the topology of the Bayesian network is laid out, we need only specify a conditional
probability distribution for each variable, given its parents.
 Consider an example consisting of the variables Toothache, Cavity, Catch, and Weather . Weather
is independent of the other variables; furthermore, Toothache and Catch are conditionally
independent, given Cavity.
 Formally, the conditional independence of Toothache and Catch, given Cavity, is indicated by the
absence of a link between Toothache and Catch.
 Intuitively, the network represents the fact that Cavity is a direct cause of Toothache and Catch,
whereas no direct causal relationship exists between Toothache and Catch.

Consider an alarm system installed in a house that can be triggered by three events, namely
• earthquake, burglary, and tornado.
• This situation can be modeled with the help of a Bayesian network. All four variables have two
possible values, T (for true) and F (for false).
• The names of the variables are abbreviated: A = alarm, E = earthquake, B = burglary, and
D = tornado.
• The directed acyclic graph represents the causal relationships among the nodes. The
variables E and B are independent variables, and A and D are dependent variables.

P(E) = 0.4        P(B) = 0.7

E   P(D)
T   0.8
F   0.5

E   B   D   P(A)
T   T   T   1.0
T   T   F   0.9
T   F   T   0.95
T   F   F   0.85
F   T   T   0.89
F   T   F   0.7
F   F   T   0.87
F   F   F   0.3

The joint probabilities can be computed as P(E,B,D,A) = P(A|E,B,D) * P(D|E) * P(E) * P(B)

= 1.0 * 0.8 * 0.4 * 0.7
= 0.224
 Now consider the following example, which is just a little more complex.
 You also have two neighbors, John and Mary, who have promised to call you at work when
they hear the alarm.
 John nearly always calls when he hears the alarm, but sometimes confuses the telephone
ringing with the alarm and calls then, too.
 Mary, on the other hand, likes rather loud music and often misses the alarm altogether. Given
the evidence of who has or has not called, we would like to estimate the probability of a
burglary.

 The network structure shows that burglary and earthquakes directly affect the probability of the
alarm’s going off, but whether John and Mary call depends only on the alarm.
 The network thus represents our assumptions that they do not perceive burglaries directly, they do
not notice minor earthquakes, and they do not confer before calling.
 The conditional distributions in Figure 14.2 are shown as a conditional probability table, or CPT.
 Each row in a CPT contains the conditional probability of each node value for a conditioning case.
A conditioning case is just a possible combination of values for the parent nodes—a miniature
possible world.
 Each row must sum to 1, because the entries represent an exhaustive set of cases for the variable.
 For Boolean variables, once you know that the probability of a true value is p, the probability of
false must be 1 − p.
 In general, a table for a Boolean variable with k Boolean parents contains 2^k independently
specifiable probabilities. A node with no parents has only one row, representing the prior
probabilities of each possible value of the variable.
 Notice that the network does not have nodes corresponding to Mary’s currently listening to loud
music or to the telephone ringing and confusing John. These factors are summarized in the
uncertainty associated with the links from Alarm to JohnCalls and MaryCalls. This shows both
laziness and ignorance in operation.

5.3 SEMANTICS OF BAYESIAN NETWORKS


There are two ways in which one can understand the semantics of Bayesian networks.
The first is to see the network as a representation of the joint probability distribution.
The second is to view it as an encoding of a collection of conditional independence
statements.
The two views are equivalent, but the first turns out to be helpful in understanding how to
construct networks, whereas the second is helpful in designing inference procedures.

Representing the full joint distribution
Viewed as a piece of "syntax," a Bayesian network is a directed acyclic graph with some numeric
parameters attached to each node. One way to define what the network means—its semantics—is
to define the way in which it represents a specific joint distribution over all the variables. The
parameters associated with each node correspond to the conditional probabilities P(Xi | Parents(Xi)):

P(X1, …, Xn) = ∏_{i=1}^{n} P(Xi | parent_nodes(Xi))      (Equation 1)

The next step is to explain how to construct a Bayesian network in such a way that the resulting
joint distribution is a good representation of a given domain. Equation 1 implies certain
conditional independence relationships that can be used to guide the knowledge engineer in
constructing the topology of the network.
First, we rewrite the entries in the joint distribution in terms of conditional probability, using the
product rule:

P(x1, …, xn) = P(xn | xn−1, …, x1) * P(xn−1 | xn−2, …, x1) * … * P(x2 | x1) * P(x1)

This identity is called the chain rule. It holds for any set of random variables. Comparing it with
Equation 1, we see that the specification of the joint distribution is equivalent to the general
assertion that, for every variable Xi in the network,

P(Xi | Xi−1, …, X1) = P(Xi | parent_nodes(Xi))      (Equation 2)

provided that Parents(Xi) ⊆ {Xi−1, . . ., X1}. This last condition is satisfied by numbering the nodes
in a way that is consistent with the partial order implicit in the graph structure.
What Equation 2 says is that the Bayesian network is a correct representation of the domain only if
each node is conditionally independent of its other predecessors in the node ordering, given its
parents. We can satisfy this condition with this methodology:
Nodes:
First determine the set of variables that are required to model the domain. Now order
them, {X1, . . . , Xn}. Any order will work, but the resulting network will be more compact if
the variables are ordered such that causes precede effects.
Links:
For i = 1 to n do:
• Choose, from X1, . . . , Xi−1, a minimal set of parents for Xi, such that Equation 2 is
satisfied.
• For each parent, insert a link from the parent to Xi.
CPTs:
Write down the conditional probability table, P(Xi | Parents(Xi)). The parents of node
Xi should contain all those nodes in X1, . . . , Xi−1 that directly influence Xi.

Because each node is connected only to earlier nodes, this construction method guarantees that the
network is acyclic.
Another important property of Bayesian networks is that they contain no redundant
probability values. If there is no redundancy, then there is no chance for inconsistency: it is

impossible for the knowledge engineer or domain expert to create a Bayesian network that violates
the axioms of probability.
Example

Suppose we have completed the network in the figure except for the choice of parents for MaryCalls.
MaryCalls is certainly influenced by whether there is a Burglary or an Earthquake, but not directly
influenced.
Intuitively, our knowledge of the domain tells us that these events influence Mary's calling
behavior only through their effect on the alarm. Also, given the state of the alarm, whether John
calls has no influence on Mary's calling. Formally speaking, we believe that the following
conditional independence statement holds:
P(MaryCalls | JohnCalls, Alarm, Earthquake, Burglary) = P(MaryCalls | Alarm)
Thus, Alarm will be the only parent node for MaryCalls.

CONDITIONAL INDEPENDENCE RELATIONS IN BAYESIAN NETWORKS


The topological semantics specifies that each variable is conditionally independent of its non-
descendants, given its parents. For example, JohnCalls is independent of Burglary, Earthquake,
and MaryCalls given the value of Alarm.

MARKOV BLANKET
Another important independence property is implied by the topological semantics.
A node is conditionally independent of all other nodes in the network, given its parents, children,
and children’s parents—that is, given its Markov blanket.
For example, Burglary is independent of JohnCalls and MaryCalls, given Alarm and Earthquake.

5.4 BAYESIAN BELIEF NETWORKS: ADVANTAGES AND DISADVANTAGES
Bayesian belief network has several advantages when used in conjunction with statistical
techniques.
Advantages
 Since this model encodes dependencies among all variables, it can handle situations where some
data entries are missing.
 It is intuitively easier for a human to understand direct dependencies and local distributions than
complete joint distribution.
 It can be used to learn causal relationships, and hence can be used to gain understanding about a
problem domain and to predict the consequences of intervention.
 It is an ideal representation for combining prior knowledge (which comes in a causal form) and
data because the model has both causal and probabilistic semantics.
 Bayesian statistical methods in conjunction with Bayesian network offer an efficient and
principled approach for avoiding over-fitting of data.
Disadvantages
 The probabilities are described as a single numeric point value. This can be a distortion of the
precision that is actually available for supporting evidence.
 There is no way to differentiate between ignorance and uncertainty. These are two distinct
concepts and should be treated as such.
 There exists the computational difficulty of exploring a previously unknown network.
 All branches must be calculated to find the probability of any one branch of the network. Even
though evaluating the resulting network can be performed in linear time, the process of
network discovery is an NP-hard task. It might be either too costly to perform or impossible, given
the number and combination of variables.
 The quality and extent of the prior beliefs used in Bayesian inference processing are major
shortcomings.
 The reliability of a Bayesian network depends on the reliability of prior knowledge.

6. TEMPORAL MODEL
Time and Uncertainty
We have developed our techniques for probabilistic reasoning in the context of static worlds, in
which each random variable has a single fixed value.
For example, when repairing a car, we assume that whatever is broken remains broken
during the process of diagnosis; our job is to infer the state of the car from observed evidence,
which also remains fixed.
Now consider a slightly different problem: treating a diabetic patient. As in the case of car
repair, we have evidence such as recent insulin doses, food intake, blood sugar measurements, and
other physical signs.
The task is to assess the current state of the patient, including the actual blood sugar level
and insulin level. Given this information, we can make a decision about the patient’s food intake
and insulin dose. Unlike the case of car repair, here the dynamic aspects of the problem are
essential.

The same considerations arise in many other contexts, such as tracking the location of a robot,
tracking the economic activity of a nation, and making sense of a spoken or written sequence of
words. How can dynamic situations like these be modeled?
➢ We view the world as a series of snapshots, or time slices, each of which contains a set of
random variables, some observable and some not.

➢ For simplicity, we will assume that the same subset of variables is observable in each time
slice.
➢ We will use Xt to denote the set of state variables at time t, which are assumed to be
unobservable, and
➢ Et to denote the set of observable evidence variables.
➢ The observation at time t is Et =et for some set of values et.

FOR EXAMPLE:
You are the security guard stationed at a secret underground installation. You want to know
whether it’s raining today, but your only access to the outside world occurs each morning when
you see the director coming in with, or without, an umbrella.
For each day t, the set Et thus contains a single evidence variable Umbrellat, or Ut for short (whether
the umbrella appears), and the set Xt contains a single state variable Raint, or Rt for short (whether
it is raining).

➢ The interval between time slices also depends on the problem. For diabetes monitoring, a suitable
interval might be an hour rather than a day.
➢ We assume the interval between slices is fixed, so we can label times by integers.
➢ Assume that evidence starts arriving at t=1. Hence, our umbrella world is represented by state
variables R0, R1, R2, . . . and evidence variables U1, U2, . . ..
➢ We will use the notation a:b to denote the sequence of integers from a to b (inclusive), and the
notation Xa:b to denote the set of variables from Xa to Xb. For example, U1:3 corresponds to the
variables U1, U2, U3.
➢ An agent maintains a belief state that represents which states of the world are currently possible.
➢ From the belief state and a transition model, the agent can predict how the world might evolve in
the next time step.
➢ From the percepts observed and a sensor model, the agent can update the belief state. A percept is
the input that an intelligent agent is perceiving at any given moment. It is essentially the same
concept as a percept in psychology, except that it is being perceived not by the brain but by the
agent.

➢ The transition and sensor models may be uncertain: the transition model describes the probability
distribution of the variables at time t, given the state of the world at past times,
➢ While the sensor model describes the probability of each percept at time t, given the current state
of the world.

With the set of state and evidence variables for a given problem decided on, the next step is to
specify how the world evolves (the transition model) and how the evidence variables get their
values (the sensor model).
The transition model specifies the probability distribution over the latest state variables, given the
previous values, that is, P(Xt |X0:t−1). The set X0:t−1 is unbounded in size as t increases. We solve
the problem by making a Markov assumption.

MARKOV ASSUMPTION
The Markov assumption states that the current state depends on only a finite fixed number of
previous states. Processes satisfying this assumption were first studied in depth by the Russian
statistician Andrei Markov (1856–1922) and are called Markov processes or Markov chains.

MARKOV PROCESSES OR MARKOV CHAINS.
They come in various flavors; the simplest is the first-order Markov process, in which the current
state depends only on the previous state and not on any earlier states. In other words, a state
provides enough information to make the future conditionally independent of the past, and we
have

P(Xt | X0:t−1) = P(Xt | Xt−1)

Hence, in a first-order Markov process, the transition model is the conditional distribution
P(Xt | Xt−1). The transition model for a second-order Markov process is the conditional distribution
P(Xt | Xt−2, Xt−1). The figure shows the Bayesian network structures corresponding to first-order and
second-order Markov processes. There are infinitely many possible values of t. Do we need to
specify a different distribution for each time step?
We avoid this problem by assuming that changes in the world state are caused by a stationary
process—that is, a process of change that is governed by laws that do not themselves change over
time.

In the umbrella world, then, the conditional probability of rain, P(Rt |Rt−1), is the same for all t,
and we only have to specify one conditional probability table.

Now for the sensor model. The evidence variables Et could depend on previous variables as well
as the current state variables. Thus, we make a sensor Markov assumption as follows:
P(Et |X0:t,E0:t−1) = P(Et |Xt)
Thus P(Et |Xt) is our sensor model (sometimes called the observation model).

Figure 15.2 shows both the transition model and the sensor model for the umbrella example.
The structure in Figure 15.2 is a first-order Markov process—the probability of rain is
assumed to depend only on whether it rained the previous day.

Whether such an assumption is reasonable depends on the domain itself. The first-order
Markov assumption says that the state variables contain all the information needed to characterize
the probability distribution for the next time slice. Sometimes the assumption is exactly true;
sometimes it is only approximate.
Notice the direction of the dependence between state and sensors: the arrows go from the
actual state of the world to sensor values because the state of the world causes the sensors to take
on particular values: the rain causes the umbrella to appear.
The inference process, of course, goes in the other direction; the distinction between the
direction of modeled dependencies and the direction of inference is one of the principal
advantages of Bayesian networks.
In addition to specifying the transition and sensor models, we need to say how everything
gets started—the prior probability distribution at time 0, P(X0). With that, we have a
specification of the complete joint distribution over all the variables, using Equation (14.2). For
any t,

P(X0:t, E1:t) = P(X0) ∏_{i=1}^{t} P(Xi | Xi−1) P(Ei | Xi)

There are two ways to improve the accuracy of the approximation:


1. Increasing the order of the Markov process model. For example, we could make a second-
order model by adding Raint−2 as a parent of Raint, which might give slightly more accurate
predictions. For example, in Palo Alto, California, it very rarely rains more than two days in a row.
2. Increasing the set of state variables. For example, we could add Seasont to allow us to
incorporate historical records of rainy seasons, or we could add Temperaturet, Humidityt and
Pressuret (perhaps at a range of locations) to allow us to use a physical model of rainy conditions.

6.1 INFERENCE IN TEMPORAL MODELS

1. Filtering (state estimation)


This is the task of computing the belief state—the posterior distribution over the most recent
state—given all evidence to date. Filtering is also called state estimation. In our example, we
wish to compute P(Xt | e1:t).
In the umbrella example, this would mean computing the probability of rain today, given all the
observations of the umbrella carrier made so far. Filtering is what a rational agent does to keep
track of the current state so that rational decisions can be made. It turns out that an almost identical
calculation provides the likelihood of the evidence sequence, P(e1:t).

2. Prediction
This is the task of computing the posterior distribution over the future state, given all
evidence to date. That is, we wish to compute P(Xt+k | e1:t) for some k > 0.
In the umbrella example, this might mean computing the probability of rain three
days from now, given all the observations to date. Prediction is useful for evaluating possible
courses of action based on their expected outcomes.
3. Smoothing
This is the task of computing the posterior distribution over a past state, given all
evidence up to the present. That is, we wish to compute P(Xk | e1:t) for some k such that 0 ≤ k < t.

In the umbrella example, it might mean computing the probability that it rained last
Wednesday, given all the observations of the umbrella carrier made up to today. Smoothing
provides a better estimate of the state than was available at the time, because it incorporates more
evidence.
4. Most likely explanation
Given a sequence of observations, we might wish to find the sequence of states that is most
likely to have generated those observations. That is, we wish to compute
argmaxx1:t P(x1:t | e1:t).
For example, if the umbrella appears on each of the first three days and is absent on
the fourth, then the most likely explanation is that it rained on the first three days and did not rain
on the fourth.
Algorithms for this task are useful in many applications, including speech recognition—where the
aim is to find the most likely sequence of words, given a series of sounds—and the reconstruction
of bit strings transmitted over a noisy channel.

In addition to these inference tasks, we also have


Learning
 The transition and sensor models, if not yet known, can be learned from observations. Just as with
static Bayesian networks, dynamic Bayes net learning can be done as a by-product of inference.
 Inference provides an estimate of what transitions actually occurred and of what states generated
the sensor readings, and these estimates can be used to update the models.
 The updated models provide new estimates, and the process iterates to convergence. The overall
process is an instance of the expectation maximization or EM algorithm.
Note that learning requires smoothing, rather than filtering, because smoothing provides better
estimates of the states of the process. Learning with filtering can fail to converge correctly.
Consider, for example, the problem of learning to solve murders: unless you are an
eyewitness, smoothing is always required to infer what happened at the murder scene from
the observable variables.

6.1.1 FILTERING AND PREDICTION


A useful filtering algorithm needs to maintain a current state estimate and update it, rather than
going back over the entire history of percepts for each update. (Otherwise, the cost of each update
increases as time goes by.) In other words, given the result of filtering up to time t, the agent needs
to compute the result for t + 1 from the new evidence et+1:

P(Xt+1 | e1:t+1) = f(et+1, P(Xt | e1:t))

for some function f. This process is called recursive estimation. We can view the calculation as
being composed of two parts: first, the current state distribution is projected forward from t to t+1;
then it is updated using the new evidence et+1. This two-part process emerges quite simply when
the formula is rearranged:

P(Xt+1 | e1:t+1) = α P(et+1 | Xt+1) P(Xt+1 | e1:t)

Here α is a normalizing constant used to make the probabilities sum to 1. The second term,
P(Xt+1 | e1:t), represents a one-step prediction of the next state, and the first term updates this with the new

evidence; notice that P(et+1 | Xt+1) is obtainable directly from the sensor model. Now we obtain the
one-step prediction for the next state by conditioning on the current state Xt:

P(Xt+1 | e1:t+1) = α P(et+1 | Xt+1) Σ_{xt} P(Xt+1 | xt) P(xt | e1:t)      (Equation 15.5)

Within the summation, the first factor comes from the transition model and the second comes from
the current state distribution. Hence, we have the desired recursive formulation. The filtered estimate
P(Xt | e1:t) can be thought of as a "message" f1:t that is propagated forward along the sequence,
modified by each transition and updated by each new observation. The process is given by

f1:t+1 = α FORWARD(f1:t, et+1)

where FORWARD implements the update described in Equation (15.5) and the process begins
with f1:0 = P(X0). When all the state variables are discrete, the time for each update is constant
(i.e., independent of t), and the space required is also constant. (The constants depend, of course,
on the size of the state space and the specific type of temporal model in question.) The time
and space requirements for updating must be constant if an agent with limited memory is to keep
track of the current state distribution over an unbounded sequence of observations.
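
A minimal Python sketch of this predict-then-update cycle for the umbrella model is given below. The transition and sensor numbers are not stated in the text above (the figure is missing), so the values used here are assumed for illustration.

```python
# Minimal sketch: forward filtering for the umbrella HMM.
# The model numbers below are assumed for illustration.
P_RAIN_GIVEN_RAIN = 0.7      # transition: P(R_t=T | R_{t-1}=T)
P_RAIN_GIVEN_NO_RAIN = 0.3   # transition: P(R_t=T | R_{t-1}=F)
P_UMB_GIVEN_RAIN = 0.9       # sensor: P(U_t=T | R_t=T)
P_UMB_GIVEN_NO_RAIN = 0.2    # sensor: P(U_t=T | R_t=F)

def forward(f, umbrella):
    """One filtering step: f is (P(rain), P(no rain)); returns the
    normalized posterior after predicting and folding in the evidence."""
    # Predict: one-step prediction by summing over the previous state.
    p_rain = P_RAIN_GIVEN_RAIN * f[0] + P_RAIN_GIVEN_NO_RAIN * f[1]
    p_no_rain = 1 - p_rain
    # Update: weight by the sensor model for the observed evidence.
    if umbrella:
        p_rain *= P_UMB_GIVEN_RAIN
        p_no_rain *= P_UMB_GIVEN_NO_RAIN
    else:
        p_rain *= 1 - P_UMB_GIVEN_RAIN
        p_no_rain *= 1 - P_UMB_GIVEN_NO_RAIN
    alpha = 1.0 / (p_rain + p_no_rain)   # normalizing constant
    return (alpha * p_rain, alpha * p_no_rain)

f = (0.5, 0.5)                 # prior P(R0)
for u in [True, True]:         # umbrella observed on days 1 and 2
    f = forward(f, u)
    print(f)   # day 1: ~(0.818, 0.182); day 2: ~(0.883, 0.117)
```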


In addition to filtering and prediction, we can use a forward recursion to compute the likelihood of
the evidence sequence, P(e1:t). This is a useful quantity if we want to compare different temporal
models that might have produced the same evidence sequence (e.g., two different models for the
persistence of rain). For this recursion, we use a likelihood message
L1:t(Xt) = P(Xt, e1:t).

It is a simple exercise to show that the message calculation is identical to that for filtering:
L1:t+1 = FORWARD(L1:t, et+1)

Having computed L1:t, we obtain the actual likelihood by summing out Xt:

P(e1:t) = Σ_{xt} L1:t(xt)


6.1.2 SMOOTHING
Smoothing is the process of computing the distribution over a past state given evidence up to the
present; that is, P(Xk | e1:t) for 0 ≤ k < t (see Figure 15.3). In anticipation of another recursive
message-passing approach, we can split the computation into two parts—the evidence up to k and
the evidence from k+1 to t:

P(Xk | e1:t) = α f1:k × bk+1:t      (Equation 15.8)

where × denotes pointwise multiplication, f1:k = P(Xk | e1:k) is the forward message from filtering,
and bk+1:t = P(ek+1:t | Xk) is a backward message computed by a recursion that runs from t back
toward k.


The backward message satisfies the recursion

bk+1:t = Σ_{xk+1} P(ek+1 | xk+1) bk+2:t(xk+1) P(xk+1 | Xk)      (Equation 15.9)

which can be written as bk+1:t = BACKWARD(bk+2:t, ek+1), where BACKWARD implements the
update described in Equation (15.9). As with the forward recursion, the time and space needed for
each update are constant and thus independent of t. We can now see that the two terms in
Equation (15.8) can both be computed by recursions through time, one running forward from 1 to k
and using the filtering equation (15.5), and the other running backward from t down to k+1.

Both the forward and backward recursions take a constant amount of time per step; hence, the time
complexity of smoothing with respect to evidence e1:t is O(t). This is the complexity for smoothing
at a particular time step k.
If we want to smooth the whole sequence, one obvious method is simply to run the whole
smoothing process once for each time step to be smoothed. This results in a time complexity of
O(t²).
A better approach uses a simple application of dynamic programming to reduce the
complexity to O(t). A clue appears in the preceding analysis of the umbrella example, where we
were able to reuse the results of the forward-filtering phase.
The key to the linear-time algorithm is to record the results of forward filtering over the
whole sequence. Then we run the backward recursion from t down to 1, computing the
smoothed estimate at each step k from the computed backward message bk+1:t and the stored
forward message f1:k. The algorithm is aptly called the forward–backward algorithm.
The forward–backward algorithm forms the computational backbone for many applications
that deal with sequences of noisy observations.
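
The following Python sketch outlines the forward–backward algorithm for the two-state umbrella model, with the same assumed model numbers as in the filtering sketch; the forward messages are stored on the first pass and combined with the backward message on the second.

```python
# Sketch of the forward-backward algorithm for the two-state umbrella
# model. Model numbers are assumed (not given in the text above).
T = {(True, True): 0.7, (True, False): 0.3,    # P(R_t | R_{t-1})
     (False, True): 0.3, (False, False): 0.7}
O = {True: 0.9, False: 0.2}                    # P(U_t=T | R_t)

def normalize(d):
    z = sum(d.values())
    return {k: v / z for k, v in d.items()}

def forward(f, u):
    """Filtering update: predict via T, then weight by the sensor model."""
    return normalize({
        r1: (O[r1] if u else 1 - O[r1]) *
            sum(T[(r0, r1)] * f[r0] for r0 in (True, False))
        for r1 in (True, False)})

def backward(b, u):
    """Backward update: b'(r0) = sum_r1 P(u|r1) * b(r1) * P(r1|r0)."""
    return {r0: sum((O[r1] if u else 1 - O[r1]) * b[r1] * T[(r0, r1)]
                    for r1 in (True, False))
            for r0 in (True, False)}

def forward_backward(evidence, prior):
    fs = [prior]                          # stored forward messages f_{1:k}
    for u in evidence:
        fs.append(forward(fs[-1], u))
    b = {True: 1.0, False: 1.0}           # initial backward message
    smoothed = [None] * len(evidence)
    for k in range(len(evidence), 0, -1):
        smoothed[k - 1] = normalize(      # P(X_k | e_{1:t}) = alpha * f * b
            {r: fs[k][r] * b[r] for r in (True, False)})
        b = backward(b, evidence[k - 1])
    return smoothed

print(forward_backward([True, True], {True: 0.5, False: 0.5}))
# smoothed P(rain) is ~0.883 on both day 1 and day 2
```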


IT HAS TWO PRACTICAL DRAWBACKS.

The first is that its space complexity can be too high when the state space is large and the
sequences are long. It uses O(|f|t) space, where |f| is the size of the representation of the forward
message. The space requirement can be reduced to O(|f| log t) with a concomitant increase in the
time complexity by a factor of log t. In some cases a constant-space algorithm can be used.

The second drawback of the basic algorithm is that it needs to be modified to work in an online
setting where smoothed estimates must be computed for earlier time slices as new observations are
continuously added to the end of the sequence.

The most common requirement is for fixed-lag smoothing, which requires computing the smoothed
estimate P(Xt−d | e1:t) for fixed d. That is, smoothing is done for the time slice d steps behind the
current time t; as t increases, the smoothing has to keep up.
Obviously, we can run the forward–backward algorithm over the d-step “window” as each new
observation is added, but this seems inefficient.

6.1.3 FINDING THE MOST LIKELY SEQUENCE


Suppose that [true, true, false, true, true] is the umbrella sequence for the security guard’s first
five days on the job.
What is the weather sequence most likely to explain this? Does the absence of the umbrella
on day 3 mean that it wasn’t raining, or did the director forget to bring it? If it didn’t rain on day 3,
perhaps (because weather tends to persist) it didn’t rain on day 4 either, but the director brought the
umbrella just in case.
In all, there are 2^5 = 32 possible weather sequences we could pick. Is there a way to find the
most likely one, short of enumerating all of them? We could try this linear-time procedure: use
smoothing to find the posterior distribution for the weather at each time step; then construct the
sequence, using at each step the weather that is most likely according to the posterior.
Such an approach should set off alarm bells in the reader's head, because the posterior
distributions computed by smoothing are distributions over single time steps, whereas to find the
most likely sequence we must consider joint probabilities over all the time steps. The results can in
fact be quite different.

There is a linear-time algorithm for finding the most likely sequence, but it requires a little more
thought. It relies on the same Markov property that yielded efficient algorithms for filtering and
smoothing. The easiest way to think about the problem is to view each sequence as a path through
a graph whose nodes are the possible states at each time step.
Such a graph is shown for the umbrella world in Figure 15.5(a). Now consider the task of
finding the most likely path through this graph, where the likelihood of any path is the product of
the transition probabilities along the path and the probabilities of the given observations at each
state.
Let's focus in particular on paths that reach the state Rain5 = true. Because of the Markov
property, it follows that the most likely path to the state Rain5 = true consists of the most likely
path to some state at time 4 followed by a transition to Rain5 = true; and the state at time 4 that
will become part of the path to Rain5 = true is whichever maximizes the likelihood of that path.
In other words, there is a recursive relationship between most likely paths to each state xt+1
and most likely paths to each state xt. We can write this relationship as an equation connecting the
probabilities of the paths:

max_{x1…xt} P(x1, …, xt, Xt+1 | e1:t+1)
= α P(et+1 | Xt+1) max_{xt} [ P(Xt+1 | xt) max_{x1…xt−1} P(x1, …, xt−1, xt | e1:t) ]      (Equation 15.11)


Thus, the algorithm for computing the most likely sequence is similar to filtering: it runs forward
along the sequence, computing the message m1:t at each time step, using Equation (15.11). The
progress of this computation is shown in Figure 15.5(b).
At the end, it will have the probability for the most likely sequence reaching each of the final
states. One can thus easily select the most likely sequence overall (the states outlined in bold). In
order to identify the actual sequence, as opposed to just computing its probability, the algorithm will
also need to record, for each state, the best state that leads to it; these are indicated by the bold
arrows in Figure 15.5(b).
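
A compact Python sketch of this procedure (the Viterbi algorithm) for the umbrella model follows, again with the assumed model numbers used earlier; it tracks both the best path probabilities and back-pointers so the sequence itself can be recovered.

```python
# Sketch of the Viterbi algorithm for the umbrella model.
# Model numbers are assumed, as in the earlier sketches.
T = {(True, True): 0.7, (True, False): 0.3,
     (False, True): 0.3, (False, False): 0.7}   # P(R_t | R_{t-1})
O = {True: 0.9, False: 0.2}                     # P(U_t=T | R_t)

def viterbi(evidence, prior):
    states = (True, False)
    # m[r] = probability of the most likely path ending in state r.
    m = {r: prior[r] * (O[r] if evidence[0] else 1 - O[r]) for r in states}
    back = []                       # back-pointers for path recovery
    for u in evidence[1:]:
        prev, m, ptr = m, {}, {}
        for r1 in states:
            # Best predecessor for r1, per the recursion above.
            r0 = max(states, key=lambda r: prev[r] * T[(r, r1)])
            ptr[r1] = r0
            m[r1] = (O[r1] if u else 1 - O[r1]) * prev[r0] * T[(r0, r1)]
        back.append(ptr)
    # Recover the most likely sequence by following back-pointers.
    last = max(states, key=lambda r: m[r])
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return list(reversed(path))

print(viterbi([True, True, False, True, True], {True: 0.5, False: 0.5}))
# [True, True, False, True, True] for this observation sequence
```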

7. HIDDEN MARKOV MODELS


An HMM is a temporal probabilistic model in which the state of the process is described by a single
discrete random variable. The possible values of the variable are the possible states of the world. The
umbrella example described in the preceding section is therefore an HMM, since it has just one state
variable: Raint. What happens if you have a model with two or more state variables? You can still
fit it into the HMM framework by combining the variables into a single "megavariable" whose
values are all possible tuples of values of the individual state variables.

SIMPLIFIED MATRIX ALGORITHMS


With a single, discrete state variable Xt, we can give concrete form to the representations of the
transition model, the sensor model, and the forward and backward messages. Let the state variable
Xt have values denoted by integers 1, . . . , S, where S is the number of possible states. The
transition model P(Xt | Xt−1) becomes an S × S matrix T, where
Tij = P(Xt = j | Xt−1 = i)
is the probability of a transition from state i to state j. The sensor model for observation et can
similarly be placed in an S × S diagonal matrix Ot whose i-th diagonal entry is P(et | Xt = i).
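
In this matrix form, the forward filtering equation (15.12) becomes f1:t+1 = α Ot+1 Tᵀ f1:t. A brief numpy sketch for the umbrella model follows, with the same assumed numbers as before.

```python
import numpy as np

# Matrix form of HMM filtering for the umbrella model (assumed numbers).
T = np.array([[0.7, 0.3],     # T[i, j] = P(X_t = j | X_{t-1} = i)
              [0.3, 0.7]])    # state 0 = rain, state 1 = no rain

def O(u):
    """Diagonal sensor matrix for umbrella observation u."""
    return np.diag([0.9, 0.2]) if u else np.diag([0.1, 0.8])

f = np.array([0.5, 0.5])      # prior over states
for u in [True, True]:
    f = O(u) @ T.T @ f        # f_{1:t+1} = alpha * O_{t+1} T' f_{1:t}
    f = f / f.sum()           # normalize (the alpha step)
    print(f)                  # day 1: [0.818, 0.182]; day 2: [0.883, 0.117]
```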


Besides providing an elegant description of the filtering and smoothing algorithms for HMMs, the
matrix formulation reveals opportunities for improved algorithms.
The first is a simple variation on the forward–backward algorithm that allows smoothing to
be carried out in constant space, independently of the length of the sequence. The idea is that
smoothing for any particular time slice k requires the simultaneous presence of both the forward
and backward messages, f1:k and bk+1:t, according to Equation (15.8).
The forward–backward algorithm achieves this by storing the fs computed on the forward pass so
that they are available during the backward pass.
Another way to achieve this is with a single pass that propagates both f and b in the
same direction. For example, the "forward" message f can be propagated backward if we
manipulate Equation (15.12) to work in the other direction:

f1:t = α' (Tᵀ)⁻¹ (Ot+1)⁻¹ f1:t+1

The modified smoothing algorithm works by first running the standard forward pass to
compute f1:t (forgetting all the intermediate results) and then running the backward pass for both
b and f together, using them to compute the smoothed estimate at each step. Since only one copy
of each message is needed, the storage requirements are constant (i.e., independent of t, the length
of the sequence).

There are two significant restrictions on this algorithm: it requires that the transition matrix
be invertible and that the sensor model have no zeroes—that is, that every observation be possible
in every state.
A second area in which the matrix formulation reveals an improvement is in online
smoothing with a fixed lag. The fact that smoothing can be done in constant space suggests that
there should exist an efficient recursive algorithm for online smoothing—that is, an algorithm
whose time complexity is independent of the length of the lag. Let us suppose that the lag is d; that
is, we are smoothing at time slice t−d, where the current time is t.


In the localization problem for the vacuum world the robot had a single nondeterministic Move action
and its sensors reported perfectly whether or not obstacles lay immediately to the north, south, east,
and west; the robot’s belief state was the set of possible locations it could be in.
Here we make the problem slightly more realistic by including a simple probability model
for the robot's motion and by allowing for noise in the sensors. The state variable Xt represents the
location of the robot on the discrete grid; the domain of this variable is the set of empty squares {s1, . .
. , sn}.
Let NEIGHBORS(s) be the set of empty squares that are adjacent to s and let N(s) be the
size of that set. Then the transition model for the Move action says that the robot is equally likely to
end up at any neighboring square:

P(Xt+1 = j | Xt = i) = Tij = 1/N(i) if j ∈ NEIGHBORS(i), else 0

The sensor variable Et has 16 possible values, each a four-bit sequence giving the presence or absence
of an obstacle in a particular compass direction. We write NS, for example, to mean that the north and
south sensors report an obstacle and the east and west do not.
Suppose that each sensor's error rate is ε and that errors occur independently for the four sensor
directions.
In that case, the probability of getting all four bits right is (1−ε)^4 and the probability of getting them
all wrong is ε^4.
Furthermore, if dit is the discrepancy—the number of bits that are different—between the true
values for square i and the actual reading et, then the probability that a robot in square i would
receive the sensor reading et is

P(Et = et | Xt = si) = (Ot)ii = (1−ε)^(4−dit) ε^(dit)
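
A short Python sketch of this sensor model, with the discrepancy d computed by counting differing bits:

```python
def sensor_prob(true_bits, read_bits, eps):
    """P(E_t = e_t | X_t = s_i): bits are 4-element tuples of 0/1.
    Counts the discrepancy d and returns (1-eps)^(4-d) * eps^d."""
    d = sum(t != r for t, r in zip(true_bits, read_bits))
    return (1 - eps) ** (4 - d) * eps ** d

# e.g. true obstacles N and S only, reading NS as well, 20% per-bit error:
print(sensor_prob((1, 0, 1, 0), (1, 0, 1, 0), 0.2))  # (0.8)^4 = 0.4096
```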


Given the matrices T and Ot, the robot can use Equation (15.12) to compute the posterior
distribution over locations—that is, to work out where it is. Figure 15.7 shows the distributions
P(X1 |E1 =NSW) and P(X2 |E1 =NSW,E2 =NS).
In addition to filtering to estimate its current location, the robot can use smoothing to work
out where it was at any given past time—for example, where it began at time 0—and it can use the
Viterbi algorithm to work out the most likely path it has taken to get where it is now.
Figure 15.8 shows the localization error and Viterbi path accuracy for various values of the
per-bit sensor error rate ε. Even when ε is 20%—which means that the overall sensor reading is
wrong 59% of the time—the robot is usually able to work out its location within two squares after
25 observations.
This is because of the algorithm’s ability to integrate evidence over time and to take into
account the probabilistic constraints imposed on the location sequence by the transition model.

When ε is 10%, the performance after a half-dozen observations is hard to distinguish from the
performance with perfect sensing. Broadly speaking, high levels of localization and path accuracy are
maintained even in the face of substantial errors in the models used.
The state variable for the example we have considered in this section is a physical location in the
world. Other problems can, of course, include other aspects of the world. Consider a version of the
vacuum robot that has the policy of going straight for as long as it can; only when it encounters an
obstacle does it change to a new (randomly selected) heading.
To model this robot, each state in the model consists of a (location, heading) pair. For the
environment in Figure 15.7, which has 42 empty squares, this leads to 168 states and a transition
matrix with 168² = 28,224 entries—still a manageable number.
If we add the possibility of dirt in the squares, the number of states is multiplied by 2^42
and the transition matrix ends up with more than 10^29 entries—no longer a manageable number.
