
Probability in AI

Professor Abdelwadood MESLEH

Based on slides for Russell & Norvig's AIMA (chapters 13, 14, 15, 17)

1
Quantifying Uncertainty

2
Acting Under Uncertainty
• Agent may need to handle uncertainty, whether due to partial observability,
non-determinism, or a combination of the two

• The agent’s knowledge can at best provide only a degree of belief in the
relevant sentences

• The main tool for dealing with degrees of belief is probability theory (utility theory is used to
represent and reason about preferences)

• Probability provides a way of summarizing the uncertainty that comes from
laziness and ignorance, thereby solving the qualification problem
– Laziness: It is too much work to list the complete set of antecedents or consequents
needed to ensure an exceptionless rule and too hard to use such rules.

– Theoretical ignorance: Medical science has no complete theory for the domain.

– Practical ignorance: Even if we know all the rules, we might be uncertain about a
particular patient because not all the necessary tests have been or can be run.
3
Acting Under Uncertainty
• Uncertainty in logical sentences

– Toothache ⇒ Cavity (True?)

– Toothache ⇒ Cavity ∨ Gum Problem ∨ Abscess . . .

• Probability

– From statistical data, 80% of toothache patients have had cavities

4
Acting Under Uncertainty
• Let action At = leave for BAU t minutes before class

– Will At get me there on time?

• Problems:

– partial observability (road state, other drivers' plans, etc.)

– uncertainty in action outcomes (flat tire, etc.)

5
Acting Under Uncertainty
• If we use pure logic to solve this problem, then

– A25 will get the student to arrive on time if “there is no
accident or traffic jam and it does not rain”

• No Rain AND No Accident ⇒ On Time

6
Acting Under Uncertainty

7
Acting Under Uncertainty
• Preferences, as expressed by utilities, are combined with

probabilities in the general theory of rational decisions called

decision theory

– Decision Theory = probability theory + utility theory

• The fundamental idea of decision theory is that an agent is rational if and only if
it chooses the action that yields the highest expected utility, averaged over all
the possible outcomes of the action; this is the principle of maximum expected
utility (MEU).
8
Acting Under Uncertainty

9
Basic Probability Notation
• In probability theory, the set of all possible worlds is called the
sample space

• For example, if we are about to roll two (distinguishable) dice,
there are 36 possible worlds to consider: (1,1), (1,2), . . ., (6,6)

• The probability of each possible world is as follows:

– I.e. each of the 36 outcomes of rolling the two dice has probability 1/36

10
Basic Probability Notation
• A proposition corresponds to a set of possible worlds

• Its probability is the sum of the probabilities of the worlds in which it holds:

– I.e P(Total=11) = P((5,6)) + P((6,5)) = 1/36 + 1/36 = 1/18

11
Basic Probability Notation
• There are two kinds of probabilities:

– Unconditional or prior probabilities

• Degrees of belief in propositions in the absence of any


other information

– Conditional or posterior probabilities

• There is some evidence (information) to support the


probability

• If the first die shows 5, what is P(Total = 11)?
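As a quick check (assuming fair dice): once the first die shows 5, the only remaining world with Total = 11 is (5, 6), so P(Total = 11 | Die1 = 5) = P(Die2 = 6) = 1/6, compared with the prior P(Total = 11) = 1/18.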

12
Basic Probability Notation
• Conditional probabilities

– The “|” is pronounced “given”

– Example

– Product rule

13
Basic Probability Notation
• Variables in probability theory are called random variables

– Total, Die1

• Every random variable has a domain

– Total = {2, …, 12}

– Die1 = {1, …, 6}

14
Basic Probability Notation
• Probabilities of all the possible values of a random variable:

– P(Weather =sunny) = 0.6

– P(Weather =rain) = 0.1

– P(Weather =cloudy) = 0.29

– P(Weather =snow) = 0.01

• The probability distribution for the random variable Weather

– P(Weather) = ⟨0.6, 0.1, 0.29, 0.01⟩

15
Basic Probability Notation
• For continuous variables, it is not possible to write out the
entire distribution as a vector, because there are infinitely
many values

• The temperature at noon is distributed uniformly between 18 and 26 degrees Celsius

– P(NoonTemp = x) = Uniform[18C, 26C](x)

– This is called a probability density function

16
Basic Probability Notation

• We need notation for distributions on multiple variables

• P(Weather , Cavity) denotes the probabilities of all combinations


of the values of Weather and Cavity

• This is a 4×2 table of probabilities called the joint probability


distribution of Weather and Cavity

17

Inference Using Full Joint Distributions


• Probabilistic Inference

– Given observed evidence, estimate the posterior probability P(query | evidence)
– Full-joint distribution = Knowledge base

18
Inference Using Full Joint Distributions
• Probabilistic Inference

– To measure the probability of Cavity = True, we use


marginalization (or summing out)

19
Inference Using Full Joint Distributions
• Full Joint Distribution for the Toothache, Cavity, Catch

• Four possible worlds in which toothache holds:

– P(toothache) = 0.108 + 0.012 + 0.016 + 0.064 = 0.2

20
Inference Using Full Joint Distributions
• Full Joint Distribution for the Toothache, Cavity, Catch

• Six possible worlds in which cavity ∨ toothache holds:

– P(cavity ∨ toothache)

• = 0.108 + 0.012 + 0.072 + 0.008 + 0.016 + 0.064 = 0.28

21
Inference Using Full Joint Distributions
• Full Joint Distribution for the Toothache, Cavity, Catch

• Can also compute conditional probabilities:

– P(cavity | toothache)
= P(cavity ∧ toothache) / P(toothache)
= (0.108 + 0.012) / (0.108 + 0.012 + 0.016 + 0.064)
= 0.6
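A minimal sketch of these full-joint computations in Python. The six entries shown on the slides are used directly; the two remaining entries (0.144, 0.576) are the standard AIMA values and should be read as assumptions, as are the helper names:

```python
# Full joint distribution over (Cavity, Toothache, Catch), as on the slides.
joint = {
    # (cavity, toothache, catch): probability
    (True,  True,  True):  0.108, (True,  True,  False): 0.012,
    (True,  False, True):  0.072, (True,  False, False): 0.008,
    (False, True,  True):  0.016, (False, True,  False): 0.064,
    (False, False, True):  0.144, (False, False, False): 0.576,  # assumed AIMA values
}

def prob(event):
    """Marginalization: sum the probabilities of all worlds where `event` holds."""
    return sum(p for world, p in joint.items() if event(*world))

p_toothache = prob(lambda cav, tooth, catch: tooth)                    # 0.2
p_cav_or_tooth = prob(lambda cav, tooth, catch: cav or tooth)          # 0.28
p_cav_given_tooth = prob(lambda cav, tooth, catch: cav and tooth) / p_toothache  # 0.6

print(p_toothache, p_cav_or_tooth, p_cav_given_tooth)
```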
22
Independence
• How are P(toothache, catch, cavity, cloudy) and P(toothache,
catch, cavity) related?
– Use the product rule :

• P(toothache, catch, cavity, cloudy) = P(cloudy |


toothache, catch, cavity) P(toothache, catch, cavity)
– However, the weather does not influence the dental
variables, therefore : (independence)

• P(cloudy | toothache, catch, cavity) = P(cloudy)

23
Independence
• A and B are independent iff
– P(A|B) = P(A) or P(B|A) = P(B) or P(A, B) = P(A) P(B)

– P(Toothache, Catch, Cavity, Weather) = P(Toothache, Catch,


Cavity) P(Weather)

24
Independence

25
Independence
• Conditional independence

– P(X, Y | Z) = P(X | Z)P(Y | Z)

• As example, let’s take a look at toothache and catch


probabilities, given cavity

– P(toothache ∧ catch | Cavity) = P(toothache | Cavity) P(catch | Cavity)

– These variables are conditionally independent, given the presence or the absence of a cavity

26
Probability and Bayes’ Theorem
• Bayes’ rule or Bayes’ theorem

– Why use this?


• We perceive as evidence the effect of some unknown
cause and we would like to determine that cause
27
Ex1: Probability and Bayes’ Theorem
• For example, a doctor knows that the disease meningitis
causes the patient to have a stiff neck, say, 70% of the time

• The doctor also knows some unconditional facts: the prior


probability that a patient has meningitis is 1/50,000, and the
prior probability that any patient has a stiff neck is 1%.

28
Ex2: Probability and Bayes’ Theorem
Ali has developed symptoms such as spots on the face. The doctor considers whether Ali has chicken pox, given the following estimates:
• Probability of spots appearing on the face if Ali has chicken pox:
p(spots | chicken pox) = 0.8
• Prior probability that Ali has chicken pox (before observing any symptoms):
p(chicken pox) = 0.4
• Probability of spots appearing on the face if Ali has an allergy:
p(spots | allergy) = 0.3
• Prior probability that Ali has an allergy: p(allergy) = 0.7
29
Ex2: Probability and Bayes’ Theorem~
• Probability of spots appearing on the face if Ali has pimples:
p(spots | pimples) = 0.9

• Prior probability that Ali has pimples (before observing any symptoms):
p(pimples) = 0.5

Calculate the posterior probability of each possible cause given the spots:


p(chicken pox | spots) = (0.8 × 0.4) / (0.8 × 0.4 + 0.3 × 0.7 + 0.9 × 0.5) = 0.32 / 0.98 ≈ 0.327
p(allergy | spots) = (0.3 × 0.7) / 0.98 ≈ 0.214
p(pimples | spots) = (0.9 × 0.5) / 0.98 ≈ 0.459
30
Ex3:

Solved HW- Probability and Bayes’ Theorem


• Problem : Marie is getting married tomorrow, at an outdoor
ceremony in the desert. In recent years, it has rained only 5 days
each year. Unfortunately, the weatherman has predicted rain for
tomorrow.

• When it actually rains, the weatherman correctly forecasts rain


90% of the time.

• When it doesn't rain, he incorrectly forecasts rain 10% of the time.

• What is the probability that it will rain on the day of Marie's


wedding?

31
Solved HW- Probability and Bayes’ Theorem~
• Solution: The sample space is defined by two mutually-
exclusive events - it rains or it does not rain. Additionally, a
third event occurs when the weatherman predicts rain.
Notation for these events appears below.

 Event A1. It rains on Marie's wedding.

 Event A2. It does not rain on Marie's wedding.

 Event B. The weatherman predicts rain.

32
Solved HW- Probability and Bayes’ Theorem~
• P( A1 ) = 5/365 = 0.0136985 [It rains 5 days out of the year.]

• P( A2 ) = 360/365 = 0.9863014 [It does not rain 360 days out


of the year.]

• P( B | A1 ) = 0.9 [When it rains, the weatherman predicts rain


90% of the time.]

• P( B | A2 ) = 0.1 [When it does not rain, the weatherman


predicts rain 10% of the time.]

33
Solved HW- Probability and Bayes’ Theorem~
• We want to know P( A1 | B ), the probability it will rain on the
day of Marie's wedding, given a forecast for rain by the
weatherman. The answer can be determined from Bayes'
theorem, as shown below.

P(A1 | B) = P(A1) P(B | A1) / [P(A1) P(B | A1) + P(A2) P(B | A2)]
= (0.014 × 0.9) / (0.014 × 0.9 + 0.986 × 0.1) ≈ 0.111

34
Probabilistic Reasoning

35
Representing
Knowledge in an Uncertain Domain
• A Bayesian network is a directed graph that represents the
dependencies among variables and can represent, in effect, any full joint
probability distribution

– Each node corresponds to a random variable

– A set of directed arrows connects pairs of nodes

• Direct influence

– Each node has a conditional probability distribution

• P (Xi | Parents(Xi))

36
Representing
Knowledge in an Uncertain Domain
• Example

– Weather is independent of the other variables

– Toothache and Catch are conditionally independent given


Cavity

A Simple Bayesian Network


37
Representing
Knowledge in an Uncertain Domain
• Example (A new burglar alarm)
– The alarm responds (almost always) to a burglary and (sometimes)
to a minor earthquake
– There are two neighbours (John and Mary) who call you
when they hear the alarm
– John nearly always calls when he hears the alarm, but sometimes
mistakes the telephone ringing for the alarm
– Mary often misses the alarm because she likes listening to loud
music
– Draw the Bayesian network!

38
Representing
Knowledge in an Uncertain Domain

A Typical Bayesian Network


39
Representing
Knowledge in an Uncertain Domain
• The conditional distributions are shown as a conditional
probability table (CPT)
• CPT contains the conditional probability of each node value for a
conditioning case

– A possible combination of values for the parent nodes

40
Semantic of Bayesian Networks
• The full joint distribution is defined as the product of the local
conditional distributions:
P(x1, …, xn) = ∏i P(xi | parents(Xi))

• What is the probability that the alarm rings but neither a burglary
nor an earthquake has occurred, and both John and Mary call?

41
Semantic of Bayesian Networks

What is the probability that the alarm rings but neither a burglary nor an
earthquake has occurred, and both John and Mary call?

• P(j ∧ m ∧ a ∧ ¬b ∧ ¬e)
= P(j | a) P(m | a) P(a | ¬b, ¬e) P(¬b) P(¬e)
= 0.90 × 0.70 × 0.001 × 0.999 × 0.998 ≈ 0.000628
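A small sketch of this chain-rule computation in Python. Only the CPT entries used by this query appear on these slides; the remaining numbers below are the standard AIMA burglary values and should be read as assumptions, as are the function names:

```python
# CPTs for the burglary network (standard AIMA values; assumed here).
P_B = 0.001                       # P(Burglary)
P_E = 0.002                       # P(Earthquake)
P_A = {(True, True): 0.95, (True, False): 0.94,     # P(Alarm | B, E)
       (False, True): 0.29, (False, False): 0.001}
P_J = {True: 0.90, False: 0.05}   # P(JohnCalls | Alarm)
P_M = {True: 0.70, False: 0.01}   # P(MaryCalls | Alarm)

def joint(b, e, a, j, m):
    """P(b, e, a, j, m) as the product of the local conditional distributions."""
    p = (P_B if b else 1 - P_B) * (P_E if e else 1 - P_E)
    p *= P_A[(b, e)] if a else 1 - P_A[(b, e)]
    p *= (P_J[a] if j else 1 - P_J[a]) * (P_M[a] if m else 1 - P_M[a])
    return p

# P(j, m, a, ¬b, ¬e) = 0.9 * 0.7 * 0.001 * 0.999 * 0.998 ≈ 0.000628
print(joint(b=False, e=False, a=True, j=True, m=True))
```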
42
Semantic of Bayesian Networks
• A method for constructing Bayesian networks

– Nodes: Determine the set of variables to model the domain


{Xi,…,Xn}

– Links: For i =1 to n, do:

• Choose the parents for Xi

• For each parent, insert a link from the parent to Xi

• CPTs: Write down the conditional probability table

43
Example

44
Efficient
Representation of Conditional Distribution

• We can represent the conditional distribution in more efficient


way by utilizing deterministic nodes (no uncertainty)

– It has its value specified exactly by the values of its parents

– i.e. The relationship between the parent nodes Canadian, US,


Mexican and the child node NorthAmerican is simply that the
child is the disjunction of the parents

45
Efficient
Representation of Conditional Distribution

• From individual probabilities, the entire CPT can be built

– qcold = P(¬fever | cold, ¬flu, ¬malaria) = 0.6

– qflu = P(¬fever | ¬cold, flu, ¬malaria) = 0.2

– qmalaria = P(¬fever | ¬cold, ¬flu, malaria) = 0.1

• The general (noisy-OR) rule is: P(fever | parents) = 1 − ∏ qj over the parents that are true
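A sketch of how the full fever CPT can be generated from these three numbers under the noisy-OR rule just stated; the q values come from the slide, while the helper names are illustrative:

```python
from itertools import product

# Noise parameters: probability that a true cause fails to produce a fever.
q = {"cold": 0.6, "flu": 0.2, "malaria": 0.1}

def p_fever(**causes):
    """Noisy-OR: P(fever | causes) = 1 - product of q_j over the causes that are true."""
    p_no_fever = 1.0
    for cause, is_true in causes.items():
        if is_true:
            p_no_fever *= q[cause]
    return 1.0 - p_no_fever

# Build all 2^3 rows of the CPT.
for cold, flu, malaria in product([False, True], repeat=3):
    print(cold, flu, malaria, p_fever(cold=cold, flu=flu, malaria=malaria))
```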

46
qcold = P(¬fever | cold, ¬flu, ¬malaria) = 0.6
qflu = P(¬fever | ¬cold, flu, ¬malaria) = 0.2
qmalaria = P(¬fever | ¬cold, ¬flu, malaria) = 0.1

Efficient
Representation of Conditional Distribution

47
Exact Inference in Bayesian Networks

• The basic task for any probabilistic inference system is to


compute the posterior probability for a set of query variables,
given some observed event
• In the burglary network, we might observe the event in which JohnCalls
= true and MaryCalls = true

48
Exact Inference in Bayesian Networks
• Inference by enumeration

– A query can be answered using
P(X | e) = α P(X, e) = α Σy P(X, e, y)
where E is the set of evidence variables, e the list of observed values
for them, and Y the remaining unobserved (hidden) variables

– Query P(Burglary | JohnCalls = true, MaryCalls = true)

Therefore, a query can be answered using a Bayesian network by computing sums


of products of conditional probabilities from the network.

49
Exact Inference in Bayesian Networks
• Inference by enumeration

– For Burglary = true

– The P(b) term is a constant and can be moved outside the
summations over a and e, and the P(e) term can be
moved outside the summation over a. Hence, we have:

50
Exact Inference in Bayesian Networks
• Inference by enumeration

• This expression can be evaluated by


looping through the variables in order,
multiplying CPT entries as we go. For
each summation, we also need to
loop over the variable’s possible
values.
• The structure of this computation is
shown in Figure 14.8. [shown next
slide]

• Using the numbers shown on the right graph we obtain P(b, j, m) = 0.00059224

• The same computation for ¬b gives 0.0014919

• Normalizing: P(B | j, m) = α ⟨0.00059224, 0.0014919⟩ ≈ ⟨0.284, 0.716⟩
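A compact sketch of this enumeration in Python. The CPT values are the standard AIMA burglary numbers (only some appear on these slides), so treat them and the helper names as assumptions:

```python
# CPTs (standard AIMA burglary values; assumed).
P_B, P_E = 0.001, 0.002
P_A = {(True, True): 0.95, (True, False): 0.94, (False, True): 0.29, (False, False): 0.001}
P_J = {True: 0.90, False: 0.05}
P_M = {True: 0.70, False: 0.01}

def query_burglary(j, m):
    """P(Burglary | JohnCalls=j, MaryCalls=m) by enumerating (summing out) E and A."""
    dist = []
    for b in (True, False):
        total = 0.0
        for e in (True, False):
            for a in (True, False):
                p = (P_B if b else 1 - P_B) * (P_E if e else 1 - P_E)
                p *= P_A[(b, e)] if a else 1 - P_A[(b, e)]
                p *= (P_J[a] if j else 1 - P_J[a]) * (P_M[a] if m else 1 - P_M[a])
                total += p
        dist.append(total)
    alpha = 1.0 / sum(dist)                    # normalization constant
    return [alpha * p for p in dist]           # [P(b | j, m), P(¬b | j, m)]

print(query_burglary(j=True, m=True))          # approximately [0.284, 0.716]
```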
51
Figure 14.8 : the evaluation proceeds top down, multiplying values
along each path and summing at the “+” nodes.
Notice the repetition of the paths for j and m: this is a problem.

52
Exact Inference in Bayesian Networks
• Variable elimination algorithm
– To eliminate repeated calculation in enumeration process [in Figure 14.8]
– Let’s apply this to burglary network
– Note that each part of the expression is represented by a factor (f).

– In f4 and f5, the remaining (dependent) variable is A:

– f3(A, B, E) will be a 2 × 2 × 2 matrix: its “first” element is P(a | b, e) = 0.95 and its
“last” element is P(¬a | ¬b, ¬e) = 0.999 [derived from the CPT in slide 42]

53
Exact Inference in Bayesian Networks
• The process of evaluation is a process of summing out variables (right
to left) from pointwise products of factors to produce new factors,
eventually yielding a factor that is the solution, i.e., the posterior
distribution over the query variable. The steps are as follows:

– First, we sum out A from the product of f3, f4, and f5 ,This gives us a
new 2 × 2 factor f6 (B, E) whose indices range over just B and E:

– Now we are left with the expression

54
Exact Inference in Bayesian Networks
– Next, we sum out E from the product of f2 and f6

– This leaves the expression

which can be evaluated by taking the pointwise product and


normalizing the result.
– Examining this sequence, we see that two basic computational
operations are required: point-wise product of a pair of factors,
and summing out a variable from a product of factors. The next
section describes each of these operations.

55
Inference in Bayesian Networks
• Operation on factors: pointwise product of two factors f1 and f2 yields a new
factor f whose variables are the union of the variables in f1 and f2 and whose
elements are given by the product of the corresponding elements in the two
factors.

• Given two factors f1(A, B) and f2(B, C), the pointwise product f1 × f2 = f3(A, B, C) has 2^(1+1+1) = 8 entries
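A minimal sketch of the two factor operations in Python, representing each factor as a dict from truth-value tuples to numbers. The example factors f1 and f2 (and all names) are illustrative, not taken from the network:

```python
from itertools import product

def pointwise_product(f1, vars1, f2, vars2):
    """Multiply two factors; the result ranges over the union of their variables."""
    out_vars = vars1 + [v for v in vars2 if v not in vars1]
    out = {}
    for assignment in product([True, False], repeat=len(out_vars)):
        row = dict(zip(out_vars, assignment))
        v1 = f1[tuple(row[v] for v in vars1)]
        v2 = f2[tuple(row[v] for v in vars2)]
        out[assignment] = v1 * v2
    return out, out_vars

def sum_out(var, f, vars_):
    """Sum a variable out of a factor by adding the rows that agree on the other variables."""
    keep = [v for v in vars_ if v != var]
    out = {}
    for assignment, value in f.items():
        row = dict(zip(vars_, assignment))
        key = tuple(row[v] for v in keep)
        out[key] = out.get(key, 0.0) + value
    return out, keep

# Illustrative factors f1(A, B) and f2(B, C); the product f3(A, B, C) has 2^3 = 8 entries.
f1 = {(True, True): 0.3, (True, False): 0.7, (False, True): 0.9, (False, False): 0.1}
f2 = {(True, True): 0.2, (True, False): 0.8, (False, True): 0.6, (False, False): 0.4}
f3, f3_vars = pointwise_product(f1, ["A", "B"], f2, ["B", "C"])
f4, f4_vars = sum_out("B", f3, f3_vars)   # a factor over just A and C
print(len(f3), f4)
```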

56
Exact Inference in Bayesian Networks

• Summing out a variable, from a product of factors is done by


adding up the submatrices formed by fixing the variable.

• The only trick is to notice that any factor that does not depend
on the variable to be summed out can be moved outside the
summation. For example, if we were to sum out E first in the
burglary network, the relevant part of the expression would be

57
Exact Inference in Bayesian Networks
• Variable ordering and variable relevance:
– The algorithm in Figure 14.8 includes an unspecified ORDER
function to choose an ordering for the variables.
– Every choice of ordering yields a valid algorithm, but different
orderings cause different intermediate factors to be generated
during the calculation.
– For example, in the calculation shown previously, we eliminated A
before E; if we do it the other way, the calculation becomes

during which a new factor f6(A, B) will be generated.


– the time and space requirements of variable elimination are
dominated by the size of the largest factor constructed during the
operation of the algorithm. It turns out to be intractable to determine
the optimal ordering, but several good heuristics are available. One
fairly effective method is a greedy one: eliminate whichever variable
minimizes the size of the next factor to be constructed.

58
Exact Inference in Bayesian Networks
• Let us consider one more query: P(JohnCalls | Burglary = true). As
usual, the first step is to write out the nested summation:

• Evaluating this expression from right to left, we notice the last


summation (with red) is equal to 1 by definition! Hence, there was no
need to include it in the first place; the variable M is irrelevant to this
query.
• Another way of saying this is that the result of the query P (JohnCalls |
Burglary = true) is unchanged if we remove MaryCalls from the network
altogether.
• In general, we can remove any leaf node that is not a query variable or
an evidence variable (Figure 14.8)
• After its removal, there may be some more leaf nodes, and these too
may be irrelevant.
• Continuing this process, we eventually find that every variable that is
not an ancestor of a query variable or evidence variable is irrelevant to
the query.
• A variable elimination algorithm can therefore remove all these
variables before evaluating the query.
59
Probabilistic Reasoning over Time

60
Random Variable
• A random variable is a variable whose value is unknown or a
function that assigns values to each of an experiment's
outcomes.
• Ex: letter X may be designated to represent the sum of the
resulting numbers after three dice are rolled. In this case, X
could be 3 (1 + 1+ 1), 18 (6 + 6 + 6), or somewhere between 3
and 18, since the highest number of a die is 6 and the lowest
number is 1.

61
Decision Theory

• Decision theory = probability theory + utility theory

62
Decision Theory

Axioms of probability:

63
Bayes Rule
• A doctor knows that the disease meningitis causes the patient to have
a stiff neck, say, 50% of the time.
• The doctor also knows some unconditional facts:
– the prior probability of a patient having meningitis is 1/50,000, and
– the prior probability of any patient having a stiff neck is 1/20.
• Letting:
– S be the proposition that the patient has a stiff neck and
– M be the proposition that the patient has meningitis
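Substituting into Bayes' rule (the arithmetic, which is not on the slide): P(M | S) = P(S | M) P(M) / P(S) = (0.5 × 1/50,000) / (1/20) = 0.0002, i.e. about 1 patient in 5,000 with a stiff neck is expected to have meningitis.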

64
Time and Uncertainty

• How do we estimate the probabilities of random variables that change over time?
– When a car is broken, it remains broken during the diagnosis
process (static)
– On the other hand, a diabetic patient has changing
evidence (blood sugar, insulin doses, etc.) (dynamic)
• We view the world as a series of snapshots (time slices)
– Xt denotes the set of state variables at time t
– Et denotes the observable evidences at time t

65
Notations
• Each time slice involves a set of random variables indexed by t:
– the set of unobservable state variables Xt
– the set of observable evidence variable Et
– The observation at time t is Et = et for some set of values et
– The notation Xa:b denotes the set of variables from Xa to Xb

66
Time and Uncertainty

First-order

Second-order

• First-order Markov chain: P(Xt | Xt-1):


• Second-order Markov chain: P(Xt | Xt-2,Xt-1)
• Sensor Markov assumption: P(Et | X0:t, E1:t-1) = P(Et | Xt)
- note: the sensor model is also called the observation model

• Stationary process: transition model and sensor model fixed


67
Time and Uncertainty
• The complete joint distribution is the combination of the transition model
and sensor model:

– Transition model: P(Xt |Xt-1)

– Sensor model: P(Et |Xt )

– Prior probability: P(X0)

68
Markov Chains

• First order Markov Chains P(Rt | Rt-1) of umbrella world


– Probability of rain on day 0: P(R0) = [0.8, 0.2]
– Transition model: P(Rt | Rt-1) = [[0.7, 0.3], [0.3, 0.7]]
– Probability of rain on day 1: P(R1) = P(R0) P(Rt | Rt-1)
• P(R1) = [(0.8*0.7 + 0.2*0.3), (0.8*0.3 + 0.2*0.7)]
• P(R1) = [0.62, 0.38]
– So the probability of rain = true is 0.62 and rain = false is 0.38

69
Markov Chains

• Instead of using only the previous state, we can also incorporate the
probability of the sensor reading (the umbrella)
– Observation model: P(Ut | Rt) = [0.9, 0.2]
– Probability of rain on day 1 given the umbrella:
P(R1 | u1) ∝ P(R0) P(Rt | Rt-1) P(u1 | R1)
• P(R1 | u1) ∝ [(0.8*0.7 + 0.2*0.3), (0.8*0.3 + 0.2*0.7)] ⊙ [0.9, 0.2] (elementwise)
• P(R1 | u1) ∝ [0.62, 0.38] ⊙ [0.9, 0.2]
• P(R1 | u1) ∝ [0.558, 0.076], which normalizes to ≈ [0.88, 0.12]

70
Markov Chains
• Example
– A child with a lower-class parent has a 60% chance of
remaining in the lower class, has a 40% chance to rise to the
middle class, and has no chance to reach the upper class.
– A child with a middle-class parent has a 30% chance of falling
to the lower class, a 40% chance of remaining middle class,
and a 30% chance of rising to the upper class.
– A child with an upper-class parent has no chance of falling to
the lower class, has a 70% chance of falling to the middle
class, and has a 30% chance of remaining in the upper class.
– Assuming that 20% of the population belongs to the lower
class, that 30% belong to the middle class, and that 50%
belong to the upper class.

71
Markov Chains
• Example
– Markov transition matrix

– Markov transition diagram

– Initial condition

72
Markov Chains
• Solution
– To illustrate, consider the population dynamics over the next 4
generations (a sketch follows):
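A short sketch of this computation in Python; the transition matrix rows come directly from the three bullets above, and numpy is used for the repeated multiplication (names are illustrative):

```python
import numpy as np

# Rows/columns: lower, middle, upper (probabilities from the example above).
P = np.array([[0.6, 0.4, 0.0],
              [0.3, 0.4, 0.3],
              [0.0, 0.7, 0.3]])
x = np.array([0.2, 0.3, 0.5])   # initial class distribution

for generation in range(1, 5):
    x = x @ P                   # distribution after one more generation
    print(generation, x)
```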

73
Inference in Temporal Model
• Inference tasks:
– Filtering (monitoring): computing the belief state, i.e.: Current state estimation P(Xt | e1:t)
• Example: What is the probability that it is raining today, given all the umbrella
observations up through today?
– Prediction: computing the posterior distribution of future state, i.e.: Future state prediction
P(Xt+k | e1:t)
• Example: What is the probability that it will rain the day after tomorrow, given all the
umbrella observations up through today?
– Smoothing: computing the posterior distribution of past state
• Past state analysis P(Xk | e1:t)
• Example: What is the probability that it rained yesterday, given all the umbrella
observations through today?

– Most likely explanation: arg max x1,..xt P(x1,…,xt |e1,…,et ) given sequence of
observation, find sequence of states that is most likely to have generated those
observations.
• Example: if the umbrella appeared the first three days but not on the fourth, what is
the most likely weather sequence to produce these umbrella sightings. 74
Inference in Temporal Model
• Filtering (and prediction)
– To recursively update the distribution using a forward
message from previous states, i.e. recursive estimation to
compute P(Xt+1 | e1:t+1) as a function of et+1 and P(Xt | e1:t);
we can write this as follows:

P(Xt+1 | e1:t+1) = α P(et+1 | Xt+1) Σxt P(Xt+1 | xt) P(xt | e1:t)

– This leads to a recursive definition

f1:t+1 = α FORWARD(f1:t, et+1)
75
• Filtering solved example:
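A minimal sketch of the worked filtering example in Python, on the umbrella world with the numbers used in the earlier slides (P(R0) = [0.8, 0.2]); the function names are illustrative:

```python
import numpy as np

T = np.array([[0.7, 0.3],       # P(R_t | R_{t-1}): rows = previous state (rain, no rain)
              [0.3, 0.7]])
O = np.array([0.9, 0.2])        # P(umbrella | R_t) for R_t = (rain, no rain)
f = np.array([0.8, 0.2])        # prior P(R_0) from the earlier slide

def forward_step(f, umbrella_seen):
    """One filtering step: predict with the transition model, then weight by the evidence."""
    predicted = f @ T
    likelihood = O if umbrella_seen else 1.0 - O
    unnormalized = likelihood * predicted
    return unnormalized / unnormalized.sum()

for t, u in enumerate([True, True], start=1):   # evidence: umbrella on days 1 and 2
    f = forward_step(f, u)
    print(f"P(R_{t} | e_1:{t}) =", f)           # day 1 gives roughly [0.88, 0.12]
```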

76
Inference in Temporal Model

• Smoothing

– Process of computing the distribution over past states given


evidence up to the present

– The computation can be split into two parts: forward message


and backward message

forward backward
77
Inference in Temporal Model
• Smoothing

– The process of backward message

78
Smoothing (forward-backward algorithm) solved example
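A sketch of the forward–backward computation on the same umbrella model (same assumed numbers as above, with umbrellas observed on days 1 and 2), computing the smoothed estimate P(R1 | u1, u2):

```python
import numpy as np

T = np.array([[0.7, 0.3], [0.3, 0.7]])   # transition model P(R_t | R_{t-1})
O = np.array([0.9, 0.2])                 # P(umbrella | R_t)
prior = np.array([0.8, 0.2])             # P(R_0), as in the earlier slides

# Forward message after day 1 (umbrella seen).
f1 = O * (prior @ T)
f1 /= f1.sum()

# Backward message for day 1: sum over R_2 of P(u_2 | R_2) * P(R_2 | R_1).
b = T @ O

smoothed = f1 * b
smoothed /= smoothed.sum()
print("P(R_1 | u_1, u_2) =", smoothed)   # roughly [0.925, 0.075] with this prior
```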

79
Inference in Temporal Model
• Most likely explanation:
– There is a recursive relationship between the most likely paths to
each state xt+1 and most likely paths to each state xt (Markov
property)

• Note that the most likely sequence is not the same as the sequence of
most likely states.

• Most likely path to each xt+1 = most likely path to some xt plus
one more step.

• Identical to filtering, except that the forward message f1:t is replaced by the message m1:t

80
Inference in Temporal Model
• Most likely explanation:
– i.e., m1:t(i) gives the probability of the most likely path to state i.
Update has sum replaced by max, giving the Viterbi algorithm:

81
Viterbi example
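A sketch of the Viterbi update on the same umbrella model (transition, observation, and prior values assumed as above), finding the most likely weather sequence for the observations umbrella, umbrella, no umbrella:

```python
import numpy as np

T = np.array([[0.7, 0.3], [0.3, 0.7]])   # P(R_t | R_{t-1})
O = np.array([0.9, 0.2])                 # P(umbrella | R_t)
prior = np.array([0.8, 0.2])             # P(R_0)
evidence = [True, True, False]           # umbrella sightings on days 1..3

likelihoods = [O if u else 1.0 - O for u in evidence]

# m[i] = probability of the most likely path ending in state i.
m = likelihoods[0] * (prior @ T)
backpointers = []
for lik in likelihoods[1:]:
    candidates = m[:, None] * T          # candidates[i, j] = m[i] * P(state j | state i)
    backpointers.append(candidates.argmax(axis=0))
    m = lik * candidates.max(axis=0)     # Viterbi: max instead of sum

# Recover the most likely state sequence (0 = rain, 1 = no rain).
path = [int(m.argmax())]
for back in reversed(backpointers):
    path.append(int(back[path[-1]]))
path.reverse()
print(path)   # [0, 0, 1]: rain, rain, no rain
```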

82
HMM

83
Hidden Markov Models
• Simple Markov models: the observer knows the state directly

• Hidden Markov models: the observer knows the state only
indirectly (through an output state or observed data)

84
Hidden Markov Models

• At each time slice t, the state of the world is


described by an unobservable variable Xt and
an observable evidence variable Et
• Transition model: distribution over the current
state given the whole past history:
P(Xt | X0, …, Xt-1) = P(Xt | X0:t-1)
• Observation model: P(Et | X0:t, E1:t-1)

X0 X1 X2 … Xt-1 Xt

E1 E2 Et-1 Et
85
Hidden Markov Models
• Markov assumption (first order)
– The current state is conditionally independent of all the
other states given the state in the previous time step
– What does P(Xt | X0:t-1) simplify to?
P(Xt | X0:t-1) = P(Xt | Xt-1)
• Markov assumption for observations
– The evidence at time t depends only on the state at time t
– What does P(Et | X0:t, E1:t-1) simplify to?
P(Et | X0:t, E1:t-1) = P(Et | Xt)

X0 X1 X2 … Xt-1 Xt

E1 E2 Et-1 Et
86
Example

state

evidence

87
Example 1

Transition model

state

evidence

Observation model

88
An alternative visualization

Transition model P(Rt | Rt-1):
  Rt-1 = T:  Rt = T: 0.7   Rt = F: 0.3
  Rt-1 = F:  Rt = T: 0.3   Rt = F: 0.7

Observation model P(Ut | Rt):
  Rt = T:  Ut = T: 0.9   Ut = F: 0.1
  Rt = F:  Ut = T: 0.2   Ut = F: 0.8
89
Example 2

• States: X = {home, office, cafe}


• Observations: E = {sms, facebook, email}

90
The Joint Distribution

• Transition model: P(Xt | X0:t-1) = P(Xt | Xt-1)


• Observation model: P(Et | X0:t, E1:t-1) = P(Et | Xt)
• How do we compute the full joint P(X0:t, E1:t)?
P(X0:t, E1:t) = P(X0) ∏ from i=1 to t of P(Xi | Xi-1) P(Ei | Xi)

X0 X1 X2 … Xt-1 Xt

E1 E2 Et-1 Et
91
HMM Learning and Inference

• Inference tasks
– Filtering: what is the distribution over the current state Xt
given all the evidence so far, e1:t
– Smoothing: what is the distribution of some state Xk given the
entire observation sequence e1:t?
– Evaluation: compute the probability of a given observation
sequence e1:t
– Decoding: what is the most likely state sequence X0:t given
the observation sequence e1:t?
• Learning
– Given a training sample of sequences, learn the model
parameters (transition and emission probabilities)
– Tool: Hidden Markov Model Toolkit (HTK) – SR course!!!

92
Approximate inference

93
Approximate inference: Sampling
• A Bayesian network is a generative model
– Allows us to efficiently generate samples from the joint distribution
• To get approximate answer we can do stochastic simulation (sampling)
• Algorithm for sampling the joint distribution:
– While not all variables are sampled:
• Pick a variable that is not yet sampled, but whose parents are
sampled
• Draw its value from P(X | parents(X))

94
Example of sampling from the joint distribution

95
Example of Sampling

• From a Bayesian network that has no evidence associated with it, we


can sample each variable in turn, in topological order
• When the values of parent nodes have been drawn, we know from
which distribution we have to sample the child
• Let us fix a topological order for the nodes of our network:
[Cloudy, Sprinkler, Rain, WetGrass]
– 1. Sample from P(Cloudy) = [0.5, 0.5]; suppose this returns True
– 2. Sample from P(Sprinkler | cloudy) = [0.1, 0.9]; suppose this
returns False
– 3. Sample from P(Rain | cloudy) = [0.8, 0.2]; suppose this returns
True
– 4. Sample from P(WetGrass | ¬sprinkler, rain) = [0.9, 0.1]; suppose
this returns True
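A sketch of prior sampling for this sprinkler network in Python. The CPT entries that do not appear on the slides (the ¬cloudy rows and the other WetGrass rows) are the standard AIMA values and should be treated as assumptions, as are the names:

```python
import random

def sample_sprinkler():
    """Sample one event in topological order: Cloudy, Sprinkler, Rain, WetGrass."""
    cloudy = random.random() < 0.5
    sprinkler = random.random() < (0.1 if cloudy else 0.5)       # 0.5 assumed
    rain = random.random() < (0.8 if cloudy else 0.2)            # 0.2 assumed
    p_wet = {(True, True): 0.99, (True, False): 0.90,            # rows other than (False, True) assumed
             (False, True): 0.90, (False, False): 0.0}[(sprinkler, rain)]
    wet = random.random() < p_wet
    return cloudy, sprinkler, rain, wet

samples = [sample_sprinkler() for _ in range(100_000)]
# Empirical frequency of the event [True, False, True, True]; expected about 0.324.
target = (True, False, True, True)
print(sum(s == target for s in samples) / len(samples))
```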

96
Example of Sampling
• From the prior joint distribution specified by the network we have
drawn the event [True, False, True, True]
• Let S_PS(x1, …, xn) be the probability that a specific event is generated by this prior
sampling algorithm

• Just looking at the sampling process, we have:

S_PS(x1, …, xn) = ∏i P(xi | parents(Xi))

• On the other hand, this is also the probability of the event according to
the Bayesian net’s representation of the joint distribution; i.e.:

S_PS(x1, …, xn) = P(x1, …, xn)

• Let N_PS(x1, …, xn) be the frequency of the specific event x1, …, xn, and suppose there
are N total samples
97
Example of Sampling
• We expect this frequency to converge in the limit to its expected
value according to the sampling probability:

N_PS(x1, …, xn) / N → S_PS(x1, …, xn) = P(x1, …, xn)

• E.g. S_PS(true, false, true, true) = 0.5 × 0.9 × 0.8 × 0.9 = 0.324,
hence in the limit of large N, we expect 32.4% of the samples to
be of this event
• The estimate of prior sampling is consistent in the sense that the
estimated probability becomes exact in the large-sample limit
• One can also produce a consistent estimate of the probability of any
partially specified event x1, …, xm, where m ≤ n

• Let us denote by P̂ the probability estimated from a sample

98
Inference via sampling

• Suppose we drew N samples from the joint


distribution
• How do we compute P(X = x | e)?

P(X = x | e) = P(x, e) / P(e) ≈ (# of times x and e happen / N) / (# of times e happens / N)

99
Inference via sampling

• Rejection sampling: to compute P(X = x | e),


keep only the samples in which e happens and
find in what proportion of them x also happens
• What if e is a rare event?
– Example: burglary ∧ earthquake
– Rejection sampling ends up throwing away most of
the samples

100
Inference via sampling

• Rejection sampling: to compute P(X = x | e),


keep only the samples in which e happens and
find in what proportion of them x also happens
• What if e is a rare event?
– Example: burglary ∧ earthquake
– Rejection sampling ends up throwing away most of
the samples
• Likelihood weighting
– Fix the evidence variables and sample only the remaining variables, weighting each sample by the likelihood it assigns to the evidence e

101
Inference via sampling: Summary

• Use the Bayesian network to generate samples


from the joint distribution
• Approximate any desired conditional or marginal
probability by empirical frequencies
– This approach is consistent: in the limit of infinitely
many samples, frequencies converge to probabilities
– No free lunch: to get a good approximate of the
desired probability, you may need an exponential
number of samples anyway

102
Other sampling methods

• Likelihood weighting: use evidence to weight samples

103
Other approximate inference methods

• Variational methods
– Approximate the original network by a simpler one
(e.g., a polytree) and try to minimize the divergence
between the simplified and the exact model
• Belief propagation
– Iterative message passing: each node computes a local estimate
and shares it with its neighbors. On the next iteration, it uses
the information from its neighbors to update its estimate.

104
Markov Decision Process

MDPs

105
Markov Decision Processes
• In HMMs, we see a sequence of observations and try to reason
about the underlying state sequence
– There are no actions involved
• But what if we have to take an action at each step that, in turn,
will affect the state of the world?

106
Markov Decision Processes
• Components that define the MDP. Depending on the problem
statement, you either know these, or you learn them from data:
– States s, beginning with initial state s0
– Actions a
• Each state s has actions A(s) available from it
– Transition model P(s’ | s, a)
• Markov assumption: the probability of going to s’ from s
depends only on s and a and not on any other past actions
or states
– Reward function R(s)
• Policy – the “solution” to the MDP:
– π(s) ∈ A(s): the action that an agent takes in any given state
• MDPs are non-deterministic search problems
• First, we will look at how to “solve” MDPs, or find the optimal policy
when the transition model and the reward function are known

107
Game show
• A series of questions with increasing level of difficulty and increasing
payoff
• Decision: at each step, take your earnings and quit, or go for the
next question
– If you answer wrong, you lose everything

[Diagram: a chain of questions Q1 ($100), Q2 ($1,000), Q3 ($10,000), Q4 ($50,000); a correct answer advances to the next question (answering Q4 correctly pays $61,100), an incorrect answer pays $0, and quitting before Q2/Q3/Q4 pays $100/$1,100/$11,100.]

108
Game show
• What should we do in Q3?
• First consider the $50,000 question (Q4)
– Probability of guessing correctly: 1/10
– Quit or go for the question?
– Expected payoff for continuing: 0.1 * 61,100 + 0.9 * 0 = $6,110
– Payoff for quitting: $11,100, so the optimal decision at Q4 is to quit
• Back in Q3: payoff for quitting is $1,100, payoff for continuing is 0.5 * $11,100 = $5,550, so continue
• What about Q2? $100 for quitting vs. $4,162 for continuing
• What about Q1?

[Diagram: the same question chain annotated with the probability of a correct answer at each question (9/10, 3/4, 1/2, 1/10) and the resulting utilities U(Q1) = $3,746, U(Q2) = $4,162, U(Q3) = $5,550, U(Q4) = $11,100.]
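A sketch of the backward-induction computation behind these utilities, with the payoffs read off the example (the 9/10 success probability for Q1 is inferred from the figure; all names are illustrative):

```python
# Accumulated winnings you keep if you quit at each question, and probability of a correct answer.
quit_now = [0, 100, 1_100, 11_100]         # before attempting Q1, Q2, Q3, Q4
p_correct = [9/10, 3/4, 1/2, 1/10]
final_prize = 61_100                        # winnings after answering Q4 correctly

u_next = final_prize                        # value of having answered the last question correctly
utilities = []
for quit_value, p in reversed(list(zip(quit_now, p_correct))):
    u = max(quit_value, p * u_next)         # quit, or continue (a wrong answer pays $0)
    utilities.append(u)
    u_next = u
utilities.reverse()
print(utilities)                            # approximately [3746, 4162, 5550, 11100]
```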

109
Solving MDPs
• MDP components:
– States s
– Actions a
– Transition model P(s’ | s, a)
– Reward function R(s)
• The solution:
– Policy (s): mapping from states to actions
– How to find the optimal policy?

110
Maximizing expected utility
• The optimal policy π(s) should maximize the expected utility over all
possible state sequences produced by following that policy:

Σ over state sequences starting from s0 of P(sequence | s0, a = π(s0)) U(sequence)

• How to define the utility of a state sequence?


– Sum of rewards of individual states
– Problem: infinite state sequences

111
Utilities of state sequences
• Normally, we would define the utility of a state sequence as
the sum of the rewards of the individual states
• Problem: infinite state sequences
• Solution: discount the individual state rewards by a factor γ
between 0 and 1 (0 ≤ γ < 1):

U([s0, s1, s2, …]) = R(s0) + γ R(s1) + γ² R(s2) + …
= Σ from t=0 to ∞ of γ^t R(st) ≤ Rmax / (1 − γ)

– Sooner rewards count more than later rewards


– Makes sure the total utility stays bounded
– Helps algorithms converge

112
Utilities of states

• Expected utility obtained by policy π starting in state s:

U^π(s) = Σ over state sequences starting from s of P(sequence | s, a = π(s)) U(sequence)

• The “true” utility of a state, denoted U(s), is the best possible


expected sum of discounted rewards
– if the agent executes the best possible policy starting in
state s
• Reminiscent of minimax values of states…

113
Finding the utilities of states
[Diagram: a max node for the agent's action choice, followed by chance nodes with outcome probabilities P(s’ | s, a) leading to successor utilities U(s’).]

• What is the expected utility of taking action a in state s?

Σs' P(s' | s, a) U(s')

• How do we choose the optimal action?

π*(s) = argmax over a ∈ A(s) of Σs' P(s' | s, a) U(s')

• What is the recursive expression for U(s) in terms of the
utilities of its successor states?

U(s) = R(s) + γ max over a of Σs' P(s' | s, a) U(s')
114
The Bellman equation
• Recursive relationship between the utilities of successive
states:
U(s) = R(s) + γ max over a ∈ A(s) of Σs' P(s' | s, a) U(s')

Reading the equation: receive reward R(s), choose the optimal action a,
end up in s' with probability P(s' | s, a), and get utility U(s')
(discounted by γ)

115
The Bellman equation
• Recursive relationship between the utilities of successive
states:
U(s) = R(s) + γ max over a ∈ A(s) of Σs' P(s' | s, a) U(s')

• For N states, we get N equations in N unknowns


– Solving them solves the MDP
– Nonlinear equations -> no closed-form solution, need to
use an iterative solution method (is there a globally
optimum solution?)
– We could try to solve them through expectiminimax search,
but that would run into trouble with infinite sequences
– Instead, we solve them algebraically
– Two methods: value iteration and policy iteration

116
Method 1: Value iteration

• Start out with every U(s) = 0


• Iterate until convergence
– During the ith iteration, update the utility of each state
according to this rule:

U i 1 ( s )  R( s )   max  P( s' | s, a)U i ( s ' )


aA ( s )
s'

• In the limit of infinitely many iterations, guaranteed to find the


correct utility values
– In practice, don’t need an infinite number of iterations…
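A minimal value-iteration sketch in Python for an MDP given as dictionaries; the tiny two-state MDP at the bottom is purely illustrative, not from the slides:

```python
def value_iteration(states, actions, P, R, gamma=0.9, tol=1e-6):
    """Repeatedly apply the Bellman update U(s) <- R(s) + gamma * max_a sum_s' P(s'|s,a) U(s')."""
    U = {s: 0.0 for s in states}
    while True:
        U_new = {}
        for s in states:
            best = max(sum(p * U[s2] for s2, p in P[(s, a)].items()) for a in actions[s])
            U_new[s] = R[s] + gamma * best
        if max(abs(U_new[s] - U[s]) for s in states) < tol:
            return U_new
        U = U_new

# Illustrative two-state MDP: "stay" is safe, "jump" sometimes reaches the rewarding state.
states = ["a", "b"]
actions = {"a": ["stay", "jump"], "b": ["stay"]}
P = {("a", "stay"): {"a": 1.0},
     ("a", "jump"): {"a": 0.5, "b": 0.5},
     ("b", "stay"): {"b": 1.0}}
R = {"a": 0.0, "b": 1.0}
print(value_iteration(states, actions, P, R))
```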

117
Value iteration

• What effect does the update have?

U i 1 ( s )  R( s )   max  P( s' | s, a)U i ( s ' )


aA ( s )
s'

• Notes:
– Reachable states from s by doing a (small set – S’)
– Probability of getting s’ from s via a
– S’ is the expected value of following a policy in s’
– a is the action indicated by the policy in s

118
Method 2: Policy iteration
• Start with some initial policy 0 and alternate between the following
steps:
– Policy evaluation: calculate Ui(s) for every state s
– Policy improvement: calculate a new policy i+1 based on the
updated utilities

• Notice it’s kind of hill-climbing.


– Policy evaluation: Find ways in which the current policy is
suboptimal
– Policy improvement: Fix those problems

• Unlike Value Iteration, this is guaranteed to converge in a finite


number of steps, as long as the state space and action set are both
finite.

119
Method 2, Step 1: Policy evaluation

• Given a fixed policy π, calculate U^π(s) for every state s:

U^π(s) = R(s) + γ Σs' P(s' | s, π(s)) U^π(s')

• π(s) is fixed, therefore P(s' | s, π(s)) is an s' × s matrix, and we
can solve a linear system to get U^π(s)!
• Note: this “Policy Evaluation” formula is much easier to solve than
the original Bellman equation!!

120
Method 2, Step 2: Policy improvement

• Given U(s) for every state s, find an improved (s)

 i 1 ( s)  arg max  P( s' | s, a)U  ( s' )


i

aA ( s ) s'

121
Extra
Appendix: Linear Solution to the Fixed
Policy Utility Calculation
• u = r + γ P u,
– where u = [u(1), u(2), …, u(N)]^T
– This is the same as the equation on the previous slide:
– u(s) = r(s) + γ Σs' p(s, s') u(s')
• The solution is: u = (I − γP)^{-1} r
• The computational cost of this solution is O(N^3), where N is the number of
distinguishable states
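A small numpy sketch of this linear solve, evaluating a fixed policy by solving (I − γP)u = r; the two-state numbers are illustrative only:

```python
import numpy as np

gamma = 0.9
# P[s, s'] = probability of moving from s to s' under the fixed policy pi.
P = np.array([[0.5, 0.5],
              [0.0, 1.0]])
r = np.array([0.0, 1.0])            # reward of each state

u = np.linalg.solve(np.eye(2) - gamma * P, r)   # u = (I - gamma P)^-1 r
print(u)
```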
Summary
• MDP defined by states, actions, transition model, reward function
• The “solution” to an MDP is the policy: what do you do when you’re in any
given state
• The Bellman equation tells you the utility of any given state, and incidentally, also
tells you the optimum policy. The Bellman equation is N nonlinear equations
in N unknowns (the utilities), therefore it can’t be solved in closed form.
• In value iteration:
– Every iteration updates both the values and (implicitly) the policy
– We don’t track the policy, but taking the max over actions implicitly
recomputes it
• In policy iteration:
– We do several passes that update utilities with fixed policy (each pass is
fast because we consider only one action, not all of them)
– After the policy is evaluated, a new policy is chosen (slow like a value
iteration pass)
– The new policy will be better (or we’re done)
• Both value iteration and policy iteration compute the same thing (all optimal
values), i.e. Both value iteration and policy iteration are dynamic programs for
solving MDPs 123
Probabilistic First Order Logic/Statistical
Relational Learning

124
Statistical relational learning
• Most learners assume i.i.d. data (independent and identically distributed)
– One type of object
– Objects have no relation to each other
• Real applications: dependent, variously distributed data
– Multiple types of objects
– Relations between objects
• Benefits
– Better predictive accuracy
– Better understanding of domains
– Growth path for machine learning
• Costs
– Learning is much harder
– Inference becomes a crucial issue
– Greater complexity for user
• Goal: Learn from non-i.i.d. data as easily as from i.i.d. data
• Open-source software available

125
3 in 1
• We have the elements:
– Probability for handling uncertainty
– Logic for representing types, relations,
and complex dependencies between them
– Learning and inference algorithms for each
• Figure out how to put them together
• Tremendous leverage on a wide range of applications

Markov Logic
=
First Order Logic
+
Markov Networks

• Other approaches are shown next slide.

126
Representations
Representation                       Logical language     Probabilistic language
Knowledge-based model construction   Horn clauses         Bayes nets
Stochastic logic programs            Horn clauses         PCFGs
Probabilistic relational models      Frame systems        Bayes nets
Relational Markov networks           SQL queries          Markov nets
Bayesian logic                       First-order          Bayes nets
Markov logic                         First-order logic    Markov nets
127
First-Order Logic - review

• Constants, variables, functions, predicates


E.g.: Anna, X, mother_of(X), friends(X, Y)
• Grounding: Replace all variables by constants
E.g.: friends (Anna, Bob)
• World (model, interpretation):
Assignment of truth values to all ground predicates

128
Markov Networks
Undirected graphical model

Smoking Cancer

Asthma Cough
• Markov networks (Markov random fields) are represented by undirected
graphical models.
• Undirected graph over a set of random variables, where an edge represents
a dependency.
• The Markov blanket of a node, X, in a Markov Net is the set of its neighbors
in the graph (nodes that have an edge connecting to X).
• Every node in a Markov Net is conditionally independent of every other node
given its Markov blanket.
• Graph :
– Bayesian networks: Directed graphs.
– Markov networks: Undirected graphs
• Differences:
– Bayes nets represent a subclass of joint distributions that capture non-
cyclic causal dependencies between variables.
– A Markov net can represent any joint distribution. 129
Graphs
Directed Models vs. Undirected Models

Parent Friend 1

Child Friend 2

P(Child|Parent) φ(Friend1,Friend2)

130
Undirected Probabilistic Logic Models
Upgrade undirected propositional models to relational setting

• Markov Nets  Markov Logic Networks


• Markov Random Fields  Relational Markov Nets
• Conditional Random Fields  Relational CRFs

131
Markov Logic Networks

• Soften logical clauses


– A first-order clause is a hard constraint on the world
∀x person(x) ⇒ ∃y person(y) ∧ father(x, y)
– Soften the constraints so that when a constraint is violated, the world
is less probable, not impossible
w : friends(x, y) ⇒ (smokes(x) ⇔ smokes(y))

– Higher weight ⇒ stronger constraint

– Weight of ∞ ⇒ first-order logic

Probability(World S) = (1/Z) exp { Σi weight_i × numberTimesTrue(f_i, S) }

132
Markov Networks

Markov Blanket

A node is conditionally independent of all other nodes conditioned only on the
neighboring nodes (its Markov blanket).

If all paths from A to B are blocked by C, A is said to be d-separated from B by C.
If A is d-separated from B by C, the joint distribution over all variables in the graph
satisfies A ⊥ B | C.
Cliques

Clique

Maximal Clique

134
Markov networks
• The distribution of a Markov net is most compactly described in terms of a set of
potential functions, φk, one for each clique k in the graph.
• For each joint assignment of values to the variables in clique k, φk assigns a
non-negative real value that represents the compatibility of these values.
• The joint distribution of a Markov net is then defined by:

P(x1, x2, …, xn) = (1/Z) ∏k φk(x{k})

where x{k} represents the joint assignment of the variables in clique k, and Z is a
normalizing constant that makes the joint distribution sum to 1:

Z = Σx ∏k φk(x{k})

• Example potential over the clique {Smoking, Cancer}:
Smoking   Cancer   Φ(S,C)
False     False    4.5
False     True     4.5
True      False    2.7
True      True     4.5
135
Markov Logic
• Most developed approach to date
• Many other approaches can be viewed as special cases
• A logical KB is a set of hard constraints on the set of possible worlds
• Let’s make them soft constraints: When a world violates a formula, It
becomes less probable, not impossible
• Give each formula a weight (higher weight ⇒ stronger constraint)
• A Markov Logic Network (MLN) is a set of pairs (F, w) where
– F is a formula in first-order logic
– w is a real number
• Together with a set of constants, it defines a Markov network with
– One node for each grounding of each predicate in the MLN
– One feature for each grounding of each formula F in the MLN, with
the corresponding weight w
i.e. MLN is weighted logical formulas and a set of constants.

136
• Clearly these formulas are not always true: not everyone who smokes
gets cancer
• Adding weights to these formulas allows us to capture formulas which
are generally true
– the weights reflect the strength of the coupling of the two nodes, but
are not probabilities (may be more than 1)
137
Markov Logic: Intuition(1)
• A logical KB is a set of hard constraints
on the set of possible worlds

∀x Smokes(x) ⇒ Cancer(x)

138
Markov Logic: Intuition(1)
• A logical KB is a set of hard constraints
on the set of possible worlds

∀x Smokes(x) ⇒ Cancer(x)

• Let’s make them soft constraints:


When a world violates a formula,
the world becomes less probable, not impossible

139
Markov Logic: Intuition (2)
• The more formulas in the KB a possible world satisfies the more it
should be likely
• Give each formula a weight
• Adopting a log-linear model, by design, if a possible world satisfies
a formula its probability should go up proportionally to exp(the
formula weight).

P(world) ∝ exp( Σ weights of formulas it satisfies )


That is, if a possible world satisfies a formula its log probability
should go up proportionally to the formula weight.

log P(world) ∝ Σ weights of formulas it satisfies

140
Markov Logic: Definition

• A Markov Logic Network (MLN) is


– a set of pairs (F, w) where Grounding:
• F is a formula in first-order logic substituting vars
• w is a real number with constants
– Together with a set C of constants,
• It defines a Markov network with
– One binary node for each grounding of each predicate in the MLN
– One feature/factor for each grounding of each formula F in the MLN,
with the corresponding weight w

141
Example: Friends & Smokers
Smoking causes cancer.
Friends have similar smoking habits.

1.5  ∀x Smokes(x) ⇒ Cancer(x)

1.1  ∀x, y Friends(x, y) ⇒ (Smokes(x) ⇔ Smokes(y))

Two constants: Anna (A) and Bob (B)

142
MLN nodes
1.5  ∀x Smokes(x) ⇒ Cancer(x)
1.1  ∀x, y Friends(x, y) ⇒ (Smokes(x) ⇔ Smokes(y))
Two constants: Anna (A) and Bob (B)
 One binary node for each grounding of each
predicate in the MLN Grounding:
substituting vars
with constants

Smokes(A) Smokes(B)

Cancer(A) Cancer(B)

 Any nodes missing?


143
MLN nodes (complete)
1.5  ∀x Smokes(x) ⇒ Cancer(x)
1.1  ∀x, y Friends(x, y) ⇒ (Smokes(x) ⇔ Smokes(y))
Two constants: Anna (A) and Bob (B)
 One binary node for each grounding of each
predicate in the MLN
Friends(A,B)

Friends(A,A) Smokes(A) Smokes(B) Friends(B,B)

Cancer(A) Cancer(B)
Friends(B,A)

144
MLN features
1.5  ∀x Smokes(x) ⇒ Cancer(x)
1.1  ∀x, y Friends(x, y) ⇒ (Smokes(x) ⇔ Smokes(y))
Two constants: Anna (A) and Bob (B)
Edge between two nodes iff the corresponding ground predicates
appear together in at least one grounding of one formula
Grounding:
Friends(A,B)
substituting vars
with constants

Friends(A,A) Smokes(A) Smokes(B) Friends(B,B)

Cancer(A) Cancer(B)
Friends(B,A)

Which edge should not be there?


145
MLN features
1.5  ∀x Smokes(x) ⇒ Cancer(x)
1.1  ∀x, y Friends(x, y) ⇒ (Smokes(x) ⇔ Smokes(y))
Two constants: Anna (A) and Bob (B)
Edge between two nodes iff the corresponding ground predicates
appear together in at least one grounding of one formula
Friends(A,B)

Friends(A,A) Smokes(A) Smokes(B) Friends(B,B)

Cancer(A) Cancer(B)
Friends(B,A)

146
MLN features
1.5  ∀x Smokes(x) ⇒ Cancer(x)
1.1  ∀x, y Friends(x, y) ⇒ (Smokes(x) ⇔ Smokes(y))
Two constants: Anna (A) and Bob (B)
Friends(A,B)

Friends(A,A) Smokes(A) Smokes(B) Friends(B,B)

Cancer(A) Cancer(B)
Friends(B,A)

One feature/factor for each grounding of each formula F in


the MLN
147
MLN: parameters
• For each formula i we have a factor

1.5  ∀x Smokes(x) ⇒ Cancer(x)

f(Smokes(x), Cancer(x)) = 1 if Smokes(x) ⇒ Cancer(x) holds, 0 otherwise

148
MLN: prob. of possible world
1.5  ∀x Smokes(x) ⇒ Cancer(x)
1.1  ∀x, y Friends(x, y) ⇒ (Smokes(x) ⇔ Smokes(y))
Two constants: Anna (A) and Bob (B)

P(pw) = (1/Z) ∏c φc(pwc)

Friends(A,B)

Friends(A,A) Smokes(A) Smokes(B) Friends(B,B)

Cancer(A) Cancer(B)
Friends(B,A)

149
MLN: prob. of possible world
1.5  ∀x Smokes(x) ⇒ Cancer(x)
1.1  ∀x, y Friends(x, y) ⇒ (Smokes(x) ⇔ Smokes(y))
Two constants: Anna (A) and Bob (B)

P(pw) = (1/Z) ∏c φc(pwc)

Friends(A,B)

Friends(A,A) Smokes(A) Smokes(B) Friends(B,B)

Cancer(A) Cancer(B)
Friends(B,A)

150
MLN: prob. of possible world

• Probability of a world pw:

P(pw) = (1/Z) exp( Σi wi ni(pw) )

where wi is the weight of formula i and ni(pw) is the number of true groundings of formula i in pw

Friends(A,B)

Friends(A,A) Smokes(A) Smokes(B) Friends(B,B)

Cancer(A) Cancer(B)
Friends(B,A)

P(world) ∝ exp( Σ weights of grounded formulas it satisfies )


Relation to Statistical Models

• Special cases:
– Markov networks
– Markov random fields
– Bayesian networks
– Log-linear models
– Exponential models
– Max. entropy models
– Gibbs distributions
– Boltzmann machines
– Logistic regression
– Hidden Markov models
– Conditional random fields
• Obtained by making predicates zero-arity
• Markov logic allows objects to be interdependent (non-i.i.d.)
• Easy to compose models

152
Relation to First-Order Logic
• Infinite weights ⇒ first-order logic
• Satisfiable KB, positive weights ⇒
satisfying assignments = modes of the distribution
• Markov logic allows contradictions between formulas
How MLNs generalize FOL

• First-order logic (with some mild assumptions) is the special case of Markov
logic obtained when
– all the weights are equal
– and tend to infinity
How MLNs generalize FOL
• Consider an MLN containing only one formula:

w  ∀x R(x) ⇒ S(x),   C = {A}

As w → ∞, P(S(A) | R(A)) → 1 (“recovering logical entailment”)


Inference
• Goal: Compute marginal probabilities of nodes (or formulas) given
evidence
• Other inference tasks:
– Compute most likely state of world
– Compute actions that maximize utility
• Approaches such as MAP/MPE, Lazy inference, Lifted Inference

156
Inference in MLN
• MLN acts as a template for a Markov Network
• We can always answer prob. queries using standard Markov network
inference methods on the instantiated network
• However, due to the size and complexity of the resulting network, this is
often infeasible.
• Instead, we combine probabilistic methods with ideas from logical
inference, including satisfiability and resolution.
• This leads to efficient methods that take full advantage of the logical
structure.

157
MAP Inference
• Problem: Find most likely state of world

argmax over pw of P(pw)

• Probability of a world pw:

P(pw) = (1/Z) exp( Σi wi ni(pw) )

where wi is the weight of formula i and ni(pw) is the number of true groundings of formula i in pw

argmax over pw of (1/Z) exp( Σi wi ni(pw) )
158
MAP Inference

argmax over pw of (1/Z) exp( Σi wi ni(pw) )

= argmax over pw of Σi wi ni(pw)

159
MAP Inference

• Therefore, the MAP problem in Markov logic reduces to finding


the truth assignment that maximizes the sum of weights of
satisfied formulas (let’s assume clauses)

argmax over pw of Σi wi ni(pw)

• This is just the weighted MaxSAT problem


• Use weighted SAT solver (e.g., MaxWalkSAT [Kautz et al.,
1997])

160
(Stochastic) Local Search Algorithms can be used for
this task!
Evaluation Function f(pw) : number of satisfied clauses
WalkSat: One of the simplest and most effective algorithms:
Start from a randomly generated interpretation (pw)
• Pick randomly an unsatisfied clause
• Pick a proposition/atom in that clause to flip, choosing between two options:
1. flip a randomly chosen atom in the clause
2. flip the atom that maximizes the number of satisfied clauses

161
MaxWalkSAT algorithm
Evaluation Function f(pw) : ∑ weights(sat. clauses in pw)

current pw <- randomly generated interpretation


Generate a new pw by repeatedly doing the following:
• Pick an unsatisfied clause at random
• Pick a proposition/atom in that clause to flip, choosing between two options:
1. flip a randomly chosen atom in the clause
2. flip the atom that maximizes ∑ weights(sat. clauses in the resulting pw)
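A toy MaxWalkSAT-style sketch in Python over ground clauses represented as lists of (atom, sign) literals. The two clauses and their weights are illustrative (the second is only the implication part of the friends/smokers formula), and the flip heuristic follows the two options listed above:

```python
import random

# Weighted ground clauses: each clause is (weight, [(atom, positive?), ...]).
clauses = [(1.5, [("Smokes_A", False), ("Cancer_A", True)]),                       # ¬Smokes(A) ∨ Cancer(A)
           (1.1, [("Friends_AB", False), ("Smokes_A", False), ("Smokes_B", True)])]  # illustrative
atoms = sorted({a for _, lits in clauses for a, _ in lits})

def satisfied(clause, world):
    return any(world[a] == sign for a, sign in clause[1])

def score(world):
    return sum(w for w, lits in clauses if satisfied((w, lits), world))

def max_walk_sat(max_flips=1000, p_random=0.5):
    world = {a: random.choice([True, False]) for a in atoms}   # random initial interpretation
    best = dict(world)
    for _ in range(max_flips):
        unsat = [c for c in clauses if not satisfied(c, world)]
        if not unsat:
            return world
        _, lits = random.choice(unsat)                          # pick an unsatisfied clause
        if random.random() < p_random:
            atom = random.choice(lits)[0]                       # option 1: flip a random atom
        else:                                                   # option 2: flip the best-scoring atom
            atom = max((a for a, _ in lits),
                       key=lambda a: score({**world, a: not world[a]}))
        world[atom] = not world[atom]
        if score(world) > score(best):
            best = dict(world)
    return best

print(max_walk_sat())
```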

162
Computing Probabilities
P(Formula | M_{L,C}) = ?
• Brute force: sum the probabilities of the possible worlds where the formula holds:

P(F | M_{L,C}) = Σ over pw ∈ PW_F of P(pw | M_{L,C})

• MCMC: sample worlds and check whether the formula holds:

P(F | M_{L,C}) ≈ |S_F| / |S|, where S is the set of sampled worlds and S_F those satisfying F
163
Computing Cond. Probabilities
1.5  ∀x Smokes(x) ⇒ Cancer(x)
1.1  ∀x, y Friends(x, y) ⇒ (Smokes(x) ⇔ Smokes(y))
Let’s look at the simplest case
P(ground literal | conjuction of ground literals, ML,C)
P(Cancer(B)| Smokes(A), Friends(A, B), Friends(B, A) )

Friends(A,B)

Friends(A,A) Smokes(A) Smokes(B) Friends(B,B)

Cancer(A) Cancer(B)
Friends(B,A)
Computing Cond. Probabilities
Let’s look at the simplest case
P(ground literal | conjuction of ground literals, ML,C)
P(Cancer(B)| Smokes(A), Friends(A, B), Friends(B, A) )

You do not need to create (ground) the part of the Markov Network from which
the query is independent given the evidence
Computing Cond. Probabilities

P(Cancer(B)| Smokes(A), Friends(A, B), Friends(B, A) )

Then you can perform (Gibbs) Sampling in


this Sub Network
Alchemy
• Open-source software including:
• Full first-order logic syntax
• Generative & discriminative weight learning
• Structure learning
• Inference (marginals and most prob. states)
• Programming language features

http://alchemy.cs.washington.edu/

167
                 Alchemy                    Prolog             BUGS

Representation   F.O. logic + Markov nets   Horn clauses       Bayes nets

Inference        Lifted BP, etc.            Theorem proving    MCMC

Learning         Parameters & structure     No                 Parameters

Uncertainty      Yes                        No                 Yes

Relational       Yes                        Yes                No

168
Applications
• Information extraction
• Entity resolution
• Link prediction
• Collective classification
• Web mining
• Natural language processing
• Computational biology
• Social network analysis
• Robot mapping
• Activity recognition
• Many challenging applications now within reach –
FYPs, TPs, HWs using Alchemy

169
