Artificial Intelligence EECS 491
Probabilistic Reasoning
The process of probabilistic inference
1. define model of problem
2. derive posterior distributions and estimators
3. estimate parameters from data
4. evaluate model accuracy
Axioms of probability
• Axioms (Kolmogorov):

  0 ≤ P(A) ≤ 1
  P(true) = 1, P(false) = 0
  P(A or B) = P(A) + P(B) − P(A and B)

• Corollaries:

  The probabilities of the values of a single random variable must sum to 1:

  Σ_{i=1..n} P(D = d_i) = 1

  The joint probability of a set of variables must also sum to 1.

  If A and B are mutually exclusive: P(A or B) = P(A) + P(B)

Rules of probability

• conditional probability

  P(A|B) = P(A and B) / P(B),   P(B) > 0

• corollary (Bayes’ rule)

  P(A and B) = P(A|B)P(B) = P(B|A)P(A)   ⇒   P(A|B) = P(B|A)P(A) / P(B)
Discrete probability distributions
• discrete probability distribution
• joint probability distribution
• marginal probability distribution
• Bayes’ rule
• independence
Basic concepts
Making rational decisions when faced with uncertainty:

• Probability: the precise representation of knowledge and uncertainty
• Probability theory: how to optimally update your knowledge based on new information
• Decision theory (probability theory + utility theory): how to use this information to achieve maximum expected utility

Basic probability theory:

• random variables
• probability distributions (discrete) and probability densities (continuous)
• rules of probability
• expectation and the computation of 1st and 2nd moments
• joint and multivariate probability distributions and densities
• covariance and principal components
The Joint Distribution

Recipe for making a joint distribution of M variables:

1. Make a truth table listing all combinations of values of your variables (if there are M Boolean variables then the table will have 2^M rows).
2. For each combination of values, say how probable it is.
3. If you subscribe to the axioms of probability, those numbers must sum to 1.

Example: Boolean variables A, B, C

A B C  Prob
0 0 0  0.30
0 0 1  0.05
0 1 0  0.10
0 1 1  0.05
1 0 0  0.05
1 0 1  0.10
1 1 0  0.25
1 1 1  0.10

All the nice looking slides like this one are from Andrew Moore.
Copyright © 2001, Andrew W. Moore, CMU. Probabilistic Analytics: Slide 41
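The recipe can be sketched directly in Python. The row probabilities below are the ones from the slide's A, B, C table; `joint` is just an illustrative variable name:

```python
# Moore's recipe as code: enumerate all 2^M rows of the truth table,
# assign each row a probability, and check the axioms.
from itertools import product

# P(A, B, C) for each row of the slide's truth table.
joint = {
    (0, 0, 0): 0.30, (0, 0, 1): 0.05,
    (0, 1, 0): 0.10, (0, 1, 1): 0.05,
    (1, 0, 0): 0.05, (1, 0, 1): 0.10,
    (1, 1, 0): 0.25, (1, 1, 1): 0.10,
}

# Step 1: with M = 3 Boolean variables there must be 2^3 = 8 rows.
assert set(joint) == set(product((0, 1), repeat=3))
# Step 3: the numbers must sum to 1.
assert abs(sum(joint.values()) - 1.0) < 1e-9
```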
Using the Joint

Once you have the JD you can ask for the probability of any logical expression involving your attributes:

P(E) = Σ_{rows matching E} P(row)

Examples (from Moore’s census data):

P(Poor and Male) = 0.4654
P(Poor) = 0.7604

Inference with the Joint

P(E1|E2) = P(E1 and E2) / P(E2) = Σ_{rows matching E1 and E2} P(row) / Σ_{rows matching E2} P(row)

P(Male|Poor) = 0.4654 / 0.7604 = 0.612

Copyright © 2001, Andrew W. Moore. Probabilistic Analytics: Slides 42-46
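The same two query rules can be run against the toy A, B, C joint distribution from the slide above (we don't have Moore's census table, so the numbers differ; `prob` and `cond_prob` are illustrative helper names):

```python
# P(E) = sum of P(row) over rows matching E, and
# P(E1|E2) = P(E1 and E2) / P(E2), evaluated on the slide's A, B, C table.
joint = {
    (0, 0, 0): 0.30, (0, 0, 1): 0.05,
    (0, 1, 0): 0.10, (0, 1, 1): 0.05,
    (1, 0, 0): 0.05, (1, 0, 1): 0.10,
    (1, 1, 0): 0.25, (1, 1, 1): 0.10,
}

def prob(event):
    """P(E): add up the rows (a, b, c) where the predicate E holds."""
    return sum(p for row, p in joint.items() if event(*row))

def cond_prob(e1, e2):
    """P(E1|E2) = P(E1 and E2) / P(E2)."""
    both = prob(lambda a, b, c: e1(a, b, c) and e2(a, b, c))
    return both / prob(e2)

p_a = prob(lambda a, b, c: a == 1)                # P(A) = 0.50
p_b_given_a = cond_prob(lambda a, b, c: b == 1,
                        lambda a, b, c: a == 1)   # P(B|A) = 0.35 / 0.50 = 0.70
```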
Continuous probability distributions
• probability density function (pdf)
• joint probability density
• marginal probability
• calculating probabilities using the pdf
• Bayes’ rule
A PDF of American Ages in 2000

(more of Andrew’s nice slides)
Copyright © 2001, Andrew W. Moore. Probability Densities: Slide 5

Let X be a continuous random variable. If p(x) is a Probability Density Function for X then

P(a ≤ X < b) = ∫_{x=a}^{b} p(x) dx

P(30 ≤ Age < 50) = ∫_{age=30}^{50} p(age) d(age) = 0.36

Copyright © 2001, Andrew W. Moore. Probability Densities: Slide 6

What does p(x) mean? (Talking to your stomach: what’s the gut-feel meaning of p(x)?)

• It does not mean a probability! First of all, it’s not a value between 0 and 1.
• It’s just a value, and an arbitrary one at that, e.g. “p(b) is very close to 5.92”.
• The likelihood p(a) can only be compared relatively to other values such as p(b).
• It indicates the relative probability of the integrated density over a small delta. If p(a) = α·p(b), then

  lim_{h→0} [ P(a − h ≤ X ≤ a + h) / P(b − h ≤ X ≤ b + h) ] = α

Copyright © 2001, Andrew W. Moore. Probability Densities: Slides 13-14
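The integral rule is easy to check numerically. The density below is a made-up stand-in (an exponential with scale 35), not the real 2000 age data, so the 0.36 figure is not reproduced; the point is only that probabilities come from integrating p(x):

```python
# P(a <= X < b) = integral of the density p(x) from a to b,
# approximated with a simple midpoint rule.
import math

SCALE = 35.0  # arbitrary toy parameter, not fitted to the age data

def p(x):
    # toy density on [0, infinity): p(x) = (1/SCALE) * exp(-x/SCALE)
    return math.exp(-x / SCALE) / SCALE

def integrate(f, a, b, n=100_000):
    h = (b - a) / n
    return sum(f(a + (i + 0.5) * h) for i in range(n)) * h

total = integrate(p, 0.0, 2000.0)    # whole density: should be ~1
p_30_50 = integrate(p, 30.0, 50.0)   # P(30 <= X < 50) under the toy density
```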
Expectations

E[X] = the expected value of random variable X
     = the average value we’d see if we took a very large number of random samples of X

E[X] = ∫_{x=−∞}^{∞} x p(x) dx

E[age] = 35.897

= the first moment of the shape formed by the axes and the blue curve
= the best value to choose if you must guess an unknown person’s age and you’ll be fined the square of your error

Copyright © 2001, Andrew W. Moore. Probability Densities: Slides 17-18
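The expectation integral and the "average of many samples" reading agree, which can be checked with the same toy exponential density used earlier (scale 35 is an arbitrary stand-in whose true mean is exactly 35, not the census figure):

```python
# E[X] two ways: numerically integrating x * p(x), and averaging random samples.
import math
import random

random.seed(0)
SCALE = 35.0  # toy exponential density; its true mean is exactly SCALE

# Midpoint-rule approximation of the integral of x * p(x) dx.
n, hi = 200_000, 3000.0
h = hi / n
integral_mean = sum((i + 0.5) * h * math.exp(-(i + 0.5) * h / SCALE) / SCALE
                    for i in range(n)) * h

# Sample-average version of the same expectation.
samples = [random.expovariate(1 / SCALE) for _ in range(100_000)]
sample_mean = sum(samples) / len(samples)
```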
Expectation of a function

µ = E[f(X)] = the expected value of f(x) where x is drawn from X’s distribution
            = the average value we’d see if we took a very large number of random samples of f(X)

µ = ∫_{x=−∞}^{∞} f(x) p(x) dx

E[age²] = 1786.64      (E[age])² = 1288.62

Note that in general:  E[f(X)] ≠ f(E[X])

Copyright © 2001, Andrew W. Moore. Probability Densities: Slide 19

Variance

σ² = Var[X] = the expected squared difference between x and E[X]

σ² = ∫_{x=−∞}^{∞} (x − µ)² p(x) dx

Var[age] = 498.02

= amount you’d expect to lose if you must guess an unknown person’s age and you’ll be fined the square of your error, assuming you play optimally

Copyright © 2001, Andrew W. Moore. Probability Densities: Slide 20

Standard Deviation

σ = Standard Deviation = “typical” deviation of X from its mean

σ = √Var[X]

σ = 22.32

Copyright © 2001, Andrew W. Moore. Probability Densities: Slide 21
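The age figures on these slides hang together through the identity Var[X] = E[X²] − (E[X])², which can be checked directly (agreement is up to the rounding in the slide values):

```python
# Var[X] = E[X^2] - (E[X])^2 and sigma = sqrt(Var[X]),
# using the slides' (rounded) age figures.
import math

e_age = 35.897       # E[age]
e_age_sq = 1786.64   # E[age^2]

var_age = e_age_sq - e_age ** 2   # ~498.02, matching the variance slide
sigma = math.sqrt(var_age)        # ~22.32, matching the standard-deviation slide
```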
Simple example: medical test results
• Test report for a rare disease comes back positive; the test is 90% accurate.
• What’s the probability that you have the disease?
• What if the test is repeated?
• This is the simplest example of reasoning by combining sources of information.
How do we model the problem?

• Which is the correct description of “Test is 90% accurate”?

  P(T = true) = 0.9 ?
  P(T = true | D = true) = 0.9 ?
  P(D = true | T = true) = 0.9 ?

• What do we want to know?

  P(T = true)?   P(T = true | D = true)?   P(D = true | T = true)?

• More compact notation:

  P(T = true | D = true)  →  P(T|D)
  P(T = false | D = false)  →  P(T̄|D̄)

Evaluating the posterior probability through Bayesian inference

• We want P(D|T) = “the probability of having the disease given a positive test”
• Use Bayes’ rule to relate it to what we know, P(T|D):

  P(D|T) = P(T|D) P(D) / P(T)

  (posterior = likelihood × prior / normalizing constant)

• What’s the prior P(D)? Disease is rare, so let’s assume P(D) = 0.001
• What about P(T)? What’s the interpretation of that?
Evaluating the normalizing constant

P(D|T) = P(T|D) P(D) / P(T)

(posterior = likelihood × prior / normalizing constant)

• P(T) is the marginal probability of P(T, D) = P(T|D) P(D)
• So, compute with a summation:

  P(T) = Σ_{all values of D} P(T|D) P(D)

• For true or false propositions:

  P(T) = P(T|D) P(D) + P(T|D̄) P(D̄)

• What are these?
Refining our model of the test

• We also have to consider the negative case to incorporate all information:

  P(T|D) = 0.9
  P(T|D̄) = ?

• What should it be?
• What about P(D̄)?
Plugging in the numbers

• Our complete expression is

  P(D|T) = P(T|D) P(D) / [ P(T|D) P(D) + P(T|D̄) P(D̄) ]

• Plugging in the numbers we get:

  P(D|T) = (0.9 × 0.001) / (0.9 × 0.001 + 0.1 × 0.999) = 0.0089

• Does this make intuitive sense?
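The complete expression is a one-liner, worth wrapping in a function so the later what-if questions can reuse it (`posterior` is an illustrative name, not from the slides):

```python
# P(D|T) = P(T|D)P(D) / [P(T|D)P(D) + P(T|~D)P(~D)]
def posterior(prior, p_t_given_d, p_t_given_not_d):
    num = p_t_given_d * prior
    return num / (num + p_t_given_not_d * (1.0 - prior))

p_d_given_t = posterior(0.001, 0.9, 0.1)    # ~0.0089, as on the slide
perfect_test = posterior(0.001, 1.0, 0.0)   # 1.0: a 100% reliable test
common_disease = posterior(0.1, 0.9, 0.1)   # 0.5: same test, but P(D) = 0.1
```

The last two lines reproduce the "Playing around with the numbers" slide below.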
Same problem, different situation

• Suppose we have a test to determine if you won the lottery.
• It’s 90% accurate.
• What is P($ = true | T = true) then?
Playing around with the numbers

P(D|T) = P(T|D) P(D) / [ P(T|D) P(D) + P(T|D̄) P(D̄) ]

• What if the test were 100% reliable?

  P(D|T) = (1.0 × 0.001) / (1.0 × 0.001 + 0.0 × 0.999) = 1.0

• What if the test was the same, but the disease wasn’t so rare, say P(D) = 0.1?

  P(D|T) = (0.9 × 0.1) / (0.9 × 0.1 + 0.1 × 0.9) = 0.5
Repeating the test

• We can relax, right? P(D|T) = 0.0089.
• Just to be sure, the doctor recommends repeating the test. How do we represent this?

  P(D|T1, T2) = P(T1, T2|D) P(D) / P(T1, T2)

• Again, we apply Bayes’ rule.
• How do we model P(T1, T2|D)?
Modeling repeated tests

P(D|T1, T2) = P(T1, T2|D) P(D) / P(T1, T2)

• Easiest is to assume the tests are conditionally independent given the disease state:

  P(T1, T2|D) = P(T1|D) P(T2|D)

• This also implies: P(T1, T2) = P(T1) P(T2)
• Plugging these in, we have

  P(D|T1, T2) = P(T1|D) P(T2|D) P(D) / [ P(T1) P(T2) ]
Evaluating the normalizing constant again

• Expanding as before, we have

  P(D|T1, T2) = P(T1|D) P(T2|D) P(D) / Σ_{D∈{t,f}} P(T1|D) P(T2|D) P(D)

• Plugging in the numbers gives us

  P(D|T1, T2) = (0.9 × 0.9 × 0.001) / (0.9 × 0.9 × 0.001 + 0.1 × 0.1 × 0.999) = 0.075

• Another way to think about this:
  - What’s the chance of 1 false positive from the test? What’s the chance of 2 false positives?
• The chance of 2 false positives is still 10× more likely than the a priori probability of having the disease.

Simpler: Combining information the Bayesian way

• Let’s look at the equation again:

  P(D|T1, T2) = P(T1|D) P(T2|D) P(D) / [ P(T1) P(T2) ]

• If we rearrange slightly:

  P(D|T1, T2) = P(T2|D) / P(T2) × [ P(T1|D) P(D) / P(T1) ]

• We’ve seen this before! It’s the posterior for the first test, which we just computed:

  P(D|T1) = P(T1|D) P(D) / P(T1)

The old posterior is the new prior

• We can just plug in the value of the old posterior.
• It plays exactly the same role as our old prior:

  P(D|T1, T2) = P(T2|D) P(D|T1) / P(T2)

• Plugging in the numbers gives the same answer:

  P(D|T1, T2) = (0.9 × 0.0089) / (0.9 × 0.0089 + 0.1 × 0.9911) = 0.075

• This is how Bayesian reasoning combines old information with new information to update our belief states.
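Both routes to P(D|T1, T2) can be checked numerically: the joint two-test formula, and Bayes' rule applied twice with the first posterior as the new prior (the helper function is an illustrative sketch):

```python
# One positive 90%-accurate test updates the prior; a second one updates again.
def posterior(prior, p_t_given_d=0.9, p_t_given_not_d=0.1):
    num = p_t_given_d * prior
    return num / (num + p_t_given_not_d * (1.0 - prior))

# Joint formula, assuming the tests are conditionally independent given D:
num = 0.9 * 0.9 * 0.001
joint = num / (num + 0.1 * 0.1 * 0.999)    # ~0.075

# Sequential update: the old posterior plays the role of the prior.
sequential = posterior(posterior(0.001))   # same ~0.075
```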
Reasoning in networks containing many variables, for which the graphical notations of chapter(2) will play a central role:

Example 1.2 (Hamburgers). Consider the following fictitious scientific information: Doctors find that people with Kreuzfeld-Jacob disease (KJ) almost invariably ate hamburgers, thus p(Hamburger Eater|KJ) = 0.9. The probability of an individual having KJ is currently rather low, about one in 100,000.

1. Assuming eating lots of hamburgers is rather widespread, say p(Hamburger Eater) = 0.5, what is the probability that a hamburger eater will have Kreuzfeld-Jacob disease? This may be computed as

   p(KJ|Hamburger Eater) = p(Hamburger Eater, KJ) / p(Hamburger Eater) = p(Hamburger Eater|KJ) p(KJ) / p(Hamburger Eater)   (1.1)

   = (9/10 × 1/100000) / (1/2) = 1.8 × 10⁻⁵   (1.2)

2. If the fraction of people eating hamburgers was rather small, p(Hamburger Eater) = 0.001, what is the probability that a regular hamburger eater will have Kreuzfeld-Jacob disease? Repeating the above calculation, this is given by

   (9/10 × 1/100000) / (1/1000) ≈ 1/100   (1.3)

   This is much higher than in scenario (1) since here we can be more sure that eating hamburgers is related to the illness.

Example 1.3 (Inspector Clouseau). Inspector Clouseau arrives at the scene of a crime. The victim lies dead in the room alongside the possible murder weapon, a knife. The Butler (B) and Maid (M) are the inspector’s main suspects. The inspector has a prior belief of 0.6 that the Butler is the murderer, and a prior belief of 0.2 that the Maid is the murderer. These beliefs are independent in the sense that p(B, M) = p(B)p(M). (It is possible that both the Butler and the Maid murdered the victim or neither.) The inspector’s prior criminal knowledge can be formulated mathematically as follows:

dom(B) = dom(M) = {murderer, not murderer},  dom(K) = {knife used, knife not used}   (1.4)

p(B = murderer) = 0.6,  p(M = murderer) = 0.2   (1.5)

p(knife used|B = not murderer, M = not murderer) = 0.3
p(knife used|B = not murderer, M = murderer) = 0.2
p(knife used|B = murderer, M = not murderer) = 0.6
p(knife used|B = murderer, M = murderer) = 0.1   (1.6)

In addition, p(K, B, M) = p(K|B, M)p(B)p(M). Assuming that the knife is the murder weapon, what is the probability that the Butler is the murderer? (Remember that it might be that neither is the murderer.) Using b for the two states of B and m for the two states of M:

p(B|K) = Σ_m p(B, m|K) = Σ_m p(B, m, K) / p(K) = p(B) Σ_m p(K|B, m)p(m) / [ Σ_b p(b) Σ_m p(K|b, m)p(m) ]   (1.7)

DRAFT February 27, 2012
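The Clouseau posterior of equation (1.7) in code, marginalizing over the Maid's two states (variable names are illustrative):

```python
# p(B | K = knife used) = p(B) * sum_m p(K|B, m) p(m), normalized over B.
p_b = {True: 0.6, False: 0.4}    # prior that the Butler is the murderer
p_m = {True: 0.2, False: 0.8}    # prior that the Maid is the murderer
p_k = {                          # p(knife used | B, M)
    (False, False): 0.3, (False, True): 0.2,
    (True, False): 0.6, (True, True): 0.1,
}

def unnormalized(b):
    # p(B = b) * sum over the Maid's states of p(K | b, m) p(m)
    return p_b[b] * sum(p_k[(b, m)] * p_m[m] for m in (True, False))

p_butler = unnormalized(True) / (unnormalized(True) + unnormalized(False))
# p_butler = 0.3 / 0.412, roughly 0.73: the knife evidence raises the 0.6 prior
```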
Example 1.5 (Aristotle: Resolution). We can represent the statement ‘All apples are fruit’ by p(F = tr|A = tr) = 1. Similarly, ‘All fruits grow on trees’ may be represented by p(T = tr|F = tr) = 1. Additionally we assume that whether or not something grows on a tree depends only on whether or not it is a fruit, p(T|A, F) = p(T|F). From this we can compute

p(T = tr|A = tr) = Σ_F p(T = tr|F, A = tr) p(F|A = tr) = Σ_F p(T = tr|F) p(F|A = tr)
                 = p(T = tr|F = fa) p(F = fa|A = tr) + p(T = tr|F = tr) p(F = tr|A = tr) = 1   (1.16)

where the first term vanishes because p(F = fa|A = tr) = 0, and both factors in the second term equal 1. In other words, we have deduced that ‘All apples grow on trees’ is a true statement, based on the information presented. (This kind of reasoning is called resolution and is a form of transitivity: from the statements A ⇒ F and F ⇒ T we can infer A ⇒ T.)

Example 1.6 (Aristotle: Inverse Modus Ponens). According to Logic, from the statement ‘If A is true then B is true’, one may deduce that ‘if B is false then A is false’. To see how this fits in with a probabilistic reasoning system we can first express the statement ‘If A is true then B is true’ as p(B = tr|A = tr) = 1. Then we may infer p(A = fa|B = fa) = 1.
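The resolution argument is a two-line marginalization in code. The value chosen for p(T = tr|F = fa) below is deliberately arbitrary: it is multiplied by p(F = fa|A = tr) = 0, so it cannot affect the conclusion:

```python
# 'All apples are fruit' and 'all fruit grows on trees' as degenerate distributions.
p_f_given_a = {True: 1.0, False: 0.0}   # p(F | A = tr)
p_t_given_f = {True: 1.0, False: 0.5}   # p(T = tr | F); the 0.5 is irrelevant

# p(T = tr | A = tr) = sum over F of p(T = tr|F) p(F|A = tr)
p_t_given_a = sum(p_t_given_f[f] * p_f_given_a[f] for f in (True, False))
# = 1.0: 'All apples grow on trees'
```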
Next time
• Bayesian belief networks