
Dr Michael Ashcroft January 3, 2012

This document remains the property of Inatas. Reproduction in whole or in part without the written permission of Inatas is strictly forbidden.

Contents

1 Introduction to Discrete Probability
  1.1 Discrete Probability Spaces
    1.1.1 Sample Spaces, Outcomes and Events
    1.1.2 Probability Functions
  1.2 The probabilities of events
  1.3 Random Variables
  1.4 Combinations of events
  1.5 Conditional Probability
  1.6 Independence
  1.7 Conditional Independence
  1.8 The Chain Rule
  1.9 Bayes Theorem

2 Introduction to Bayesian Networks
  2.1 Bayesian Networks
  2.2 D-Separation, The Markov Blanket and Markov Equivalence
  2.3 Potentials
  2.4 Exact Inference on a Bayesian Network: The Variable Elimination Algorithm
  2.5 Exact Inference on a Bayesian Network: The Junction Tree Algorithm
  2.6 Inexact Inference on a Bayesian Network: Likelihood Sampling

3 Parameter Learning
  3.1 The Dirichlet Distribution
  3.2 Parameter Dirichlet Distribution

4 Structure Learning
  4.1 Search Spaces
    4.1.1 Ordered DAG Topologies
    4.1.2 DAG Topologies
    4.1.3 Markov Equivalence Classes of DAG Topologies
  4.2 The Bayesian Scoring Criterion (BS)
  4.3 The Bayesian Equivalent Scoring Criterion (BSe)

1 Introduction to Discrete Probability

1.1 Discrete Probability Spaces

A discrete probability space is a pair <S, P>, where S is a Sample Space and P is a probability function.

1.1.1 Sample Spaces, Outcomes and Events

An outcome is a value that the stochastic system we are modeling can take. The sample space of our model is the set of all outcomes. An event is a subset of the sample space. (So events are also sets of outcomes.)

Example 1. The sample space corresponding to rolling a die is {1, 2, 3, 4, 5, 6}. The outcomes of this sample space are 1, 2, 3, 4, 5 and 6. The events of this sample space are the members of its power set: ∅, {1}, {2}, {3}, {4}, {5}, {6}, {1, 2}, {1, 3}, ..., {2, 3, 4, 5, 6}, {1, 2, 3, 4, 5, 6}.

1.1.2 Probability Functions

A probability function is a function p from S to the real numbers, such that:

(i) 0 ≤ p(s) ≤ 1, for each s ∈ S
(ii) Σ p(s) = 1, summing over all s ∈ S

Example 2. If a six sided die is fair, the sample space associated with the result of a single throw is {1, 2, 3, 4, 5, 6} and the probability function is:

p(1) = 1/6
p(2) = 1/6
p(3) = 1/6
p(4) = 1/6
p(5) = 1/6
p(6) = 1/6

Example 3. Now imagine a die that is not fair: it has twice the probability of coming up six as it does of coming up any other number. The probability function for such a die would be:

p(1) = 1/7
p(2) = 1/7
p(3) = 1/7
p(4) = 1/7
p(5) = 1/7
p(6) = 2/7

1.2 The probabilities of events

The probability of an event, E, is defined:

p(E) = Σo∈E p(o)

Example 4. Continuing example 2 (the fair die), the event E, that the roll of the die produces an odd number, is {1, 3, 5}. Therefore:

p(E) = Σo∈E p(o) = p(1) + p(3) + p(5) = 1/6 + 1/6 + 1/6 = 1/2

Example 5. Likewise, for example 3 (the biased die), the event F, that the roll of the die produces 5 or 6, is {5, 6}. Therefore:

p(F) = Σo∈F p(o) = p(5) + p(6) = 1/7 + 2/7 = 3/7

Notice that an event represents the disjunctive claim that one of the outcomes that are its members occurred. So the event {1, 2, 3} represents the claim that the roll of the die produced 1 or 2 or 3.

1.3 Random Variables

Note: A random variable is neither random nor a variable! Often, we are interested in numerical values that are connected with our outcomes. We use random variables to model these. A random variable is a function from a sample space to the real numbers.

Example 6. Suppose a coin is flipped three times. Let X(t) be the random variable that equals the number of heads that appear when t is the outcome. Then X(t) takes the following values:

X(HHH) = 3
X(HHT) = X(HTH) = X(THH) = 2
X(HTT) = X(THT) = X(TTH) = 1
X(TTT) = 0

Notice that a random variable divides the sample space into a disjoint and exhaustive set of events, each mapped to a unique real number. Let us term this set of events Er, for each r ∈ X(S).

The probability distribution of a random variable, X, on a sample space, S, is the set of ordered pairs <r, p(X = r)> for all r ∈ X(S), where p(X = r) = Σo∈Er p(o). This is the probability that an outcome o ∈ Er occurred such that X(o) = r, and is often characterized by saying that X took the value r. As we would expect:

(i) 0 ≤ p(Er) ≤ 1, for each r ∈ X(S).
(ii) Σr∈X(S) p(Er) = 1.

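The distribution of Example 6 can be tabulated directly. A minimal sketch (the dict-based representation of the probability space is ours, not part of the text):

```python
from fractions import Fraction
from itertools import product

# The random variable of Example 6: X counts the heads in three fair
# coin tosses; we tabulate its probability distribution exactly.
outcomes = ["".join(t) for t in product("HT", repeat=3)]
p = {o: Fraction(1, 8) for o in outcomes}

dist = {}
for o in outcomes:
    r = o.count("H")                       # X(o) = number of heads
    dist[r] = dist.get(r, 0) + p[o]        # p(X = r) = sum over Er

print(dist[3], dist[2], dist[1], dist[0])  # 1/8 3/8 3/8 1/8
```

Note that the events Er partition the sample space, so the tabulated probabilities sum to 1, as required.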
Placing this in table form gives us a familiar discrete probability distribution. Points to note about random variables:

1. A random variable and its probability distribution together constitute a probability space.
2. A function from the codomain of a random variable to the real numbers is itself a random variable.

1.4 Combinations of events

Some theorems:

1. p(Ē) = 1 − p(E)
2. p(E or F) = p(E ∪ F) = p(E) + p(F) − p(E ∩ F)
3. p(E and F) = p(E, F) = p(E ∩ F)

Theorem 1 should be obvious! Note, though, that combined with the definition of a probability distribution, it entails that p(∅) = 0. Let's look at an example for theorem 2.

Example 7. Returning to example 4 with the fair die, let the event E, that the roll of the die produces 5 or 6, be {5, 6}, and event F, that the roll of the die produces an odd number, be {1, 3, 5}. We want to calculate the probability that one of these events occurs, which is to say that the event E ∪ F occurs. Using theorem two we see:

p(E ∪ F) = p(E) + p(F) − p(E ∩ F) = (p(5) + p(6)) + (p(1) + p(3) + p(5)) − p(5) = p(1) + p(3) + p(5) + p(6) = 4/6

Which is as it should be!

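Theorem 2 can be checked mechanically on Example 7, letting Python's set operations supply the union and intersection (a small sketch; the dict representation of the die is ours):

```python
from fractions import Fraction

# The fair die of Example 7, with exact arithmetic via Fraction.
fair_die = {s: Fraction(1, 6) for s in range(1, 7)}

def p(event):
    """p(E) = sum of p(o) over the outcomes o in E."""
    return sum(fair_die[o] for o in event)

E, F = {5, 6}, {1, 3, 5}
# Inclusion-exclusion: p(E ∪ F) = p(E) + p(F) - p(E ∩ F).
print(p(E | F) == p(E) + p(F) - p(E & F))  # True
print(p(E | F))                            # 2/3
```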
1.5 Conditional Probability

The conditional probability of one event, E, given another, F, is denoted p(E|F) and defined:

p(E|F) = p(E ∩ F) / p(F)

Example 8. Continuing example 7, we can calculate the probability that the roll of the die produces 5 or 6 given that it produces an odd number:

p(E|F) = p(E ∩ F) / p(F) = p(5) / (p(1) + p(3) + p(5)) = (1/6) / (3/6) = 1/3

1.6 Independence

Two events, E and F, are independent if and only if p(E, F) = p(E)p(F). That is to say, the probability of both events occurring is simply the probability of the first event occurring multiplied by the probability that the second event occurs. Likewise, two random variables X and Y are independent if and only if p(X(s) = r1, Y(s) = r2) = p(X(s) = r1)p(Y(s) = r2), for all real numbers r1 and r2.

Independence is of great practical importance: it significantly simplifies working out complex probabilities. Where a number of events are independent, we can quickly calculate their joint probability distribution from their individual probabilities.

Example 9. Imagine we are examining the results of the (ordered) tosses of three coins. Given that the possible results of each coin are {H, T}, the sample space for our model will be {H, T}^3. Let us define three random variables, X1, X2 and X3: X1 maps outcomes to 1 if the first coin lands heads, and 0 otherwise; X2 and X3 do likewise for the second and third coins. Now assume we are given the following information:

1. p(X1 = 1) = 0.7
2. p(X2 = 1) = 0.4
3. p(X3 = 1) = 0.1

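Under independence, the three marginals of Example 9 suffice to rebuild the entire joint table. A small sketch:

```python
from itertools import product

# Marginals from Example 9: p(Xi = 1) and the complementary p(Xi = 0).
p1 = {1: 0.7, 0: 0.3}
p2 = {1: 0.4, 0: 0.6}
p3 = {1: 0.1, 0: 0.9}

# Under independence the joint is the product of the marginals.
joint = {(a, b, c): p1[a] * p2[b] * p3[c]
         for a, b, c in product((1, 0), repeat=3)}

print(round(joint[(1, 1, 0)], 3))      # 0.252
print(round(sum(joint.values()), 10))  # 1.0
```

Three stored numbers determine all eight joint entries, which is the storage saving the text goes on to discuss.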
If we also know that these random variables are independent, then we can immediately calculate the joint probability distribution for the three random variables from these three values alone (remembering that p(Xn = 0) = 1 − p(Xn = 1)):

1. p(X1 = 1, X2 = 1, X3 = 1) = 0.7 · 0.4 · 0.1 = 0.028
2. p(X1 = 1, X2 = 1, X3 = 0) = 0.7 · 0.4 · 0.9 = 0.252
3. p(X1 = 1, X2 = 0, X3 = 1) = 0.7 · 0.6 · 0.1 = 0.042
4. p(X1 = 1, X2 = 0, X3 = 0) = 0.7 · 0.6 · 0.9 = 0.378
5. p(X1 = 0, X2 = 1, X3 = 1) = 0.3 · 0.4 · 0.1 = 0.012
6. p(X1 = 0, X2 = 1, X3 = 0) = 0.3 · 0.4 · 0.9 = 0.108
7. p(X1 = 0, X2 = 0, X3 = 1) = 0.3 · 0.6 · 0.1 = 0.018
8. p(X1 = 0, X2 = 0, X3 = 0) = 0.3 · 0.6 · 0.9 = 0.162

If we do not know that these random variables are independent, we require much more information: we will need to have the values for each of the entries in the joint probability distribution. Notice that:

• our storage requirements have jumped from linear in the number of random variables to exponential. (Very bad.)
• our computational complexity has fallen from linear in the number of random variables to constant. (Good, but we could obtain this in the earlier case as well, if we kept the probabilities after we calculated them.)

Typically, the probability distributions that are of interest to us are such that this exponential storage complexity renders them intractable. Some methods for dealing with this, such as the naive Bayes classifier, simply assume independence among the random variables they are modeling. But this can lead to significantly lower accuracy from the model.

1.7 Conditional Independence

Analogously to independence, we say that two events, E and F, are conditionally independent given another, G, if and only if p(G) ≠ 0 and one of the following holds:

1. p(E|F ∩ G) = p(E|G), where p(E|G) ≠ 0 and p(F|G) ≠ 0.
2. p(E|G) = 0 or p(F|G) = 0.

Example 10. Say we have 13 objects. Each object is either black (B) or white (W), each object has either a '1' or a '2' written on it, and each object is either a square (2) or a diamond (3). The objects are:

B12, B12, B13, B22, B22, B22, B22, B23, B23, W12, W13, W22, W23

If we are interested in the characteristics of a randomly drawn object and assume all objects have an equal chance of being drawn, we can see that the event, E1, that a randomly selected object has a '1' written on it is not independent from the event, E2, that such an object is square. But they are conditionally independent given the event, EB, that the object is black (and, in fact, also given the event that the object is white):

p(E1) = 5/13
p(E1|E2) = 3/8
p(E1|EB) = 3/9 = 1/3
p(E1|E2 ∩ EB) = 2/6 = 1/3

There is little more to say about conditional independence at this point, but soon it will take center stage as a means of obtaining the accuracy of using the full joint distribution of the random variables we are modeling while avoiding the complexity issues that accompany this.

1.8 The Chain Rule

The chain rule for events says that given n events, E1, E2, ..., En, defined on the same sample space S:

p(E1, E2, ..., En) = p(En|En−1, ..., E1) ... p(E2|E1)p(E1)

Applied to random variables, this gives us that for n random variables, X1, X2, ..., Xn, defined on the same sample space S:

p(X1 = x1, X2 = x2, ..., Xn = xn) = p(Xn = xn|Xn−1 = xn−1, ..., X1 = x1) ... p(X2 = x2|X1 = x1)p(X1 = x1)

It is straightforward to prove this rule using the rule for conditional probability.

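Returning to Example 10, the conditional independence claim can be checked by counting. The multiset of (colour, number, shape) triples below is a reconstruction consistent with the probabilities quoted in the example, not a verbatim transcription:

```python
from fractions import Fraction

# One multiset of 13 objects matching Example 10's quoted probabilities.
objects = (
    [("B", 1, "square")] * 2 + [("B", 1, "diamond")] +
    [("B", 2, "square")] * 4 + [("B", 2, "diamond")] * 2 +
    [("W", 1, "square")] + [("W", 1, "diamond")] +
    [("W", 2, "square")] + [("W", 2, "diamond")]
)

def p(event, given=lambda o: True):
    """Conditional probability by counting under uniform draws."""
    g = [o for o in objects if given(o)]
    return Fraction(sum(1 for o in g if event(o)), len(g))

E1 = lambda o: o[1] == 1          # has a '1' written on it
E2 = lambda o: o[2] == "square"   # is a square
EB = lambda o: o[0] == "B"        # is black

print(p(E1), p(E1, E2))                              # 5/13 3/8
print(p(E1, EB), p(E1, lambda o: E2(o) and EB(o)))   # 1/3 1/3
```

Conditioning on E2 changes the probability of E1, but once EB is given, further conditioning on E2 changes nothing: exactly the conditional independence the example asserts.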
1.9 Bayes Theorem

Bayes theorem is:

p(F|E) = p(E|F)p(F) / (p(E|F)p(F) + p(E|F̄)p(F̄))

Proof:

1. By the definition of conditional probability, p(F|E) = p(E ∩ F)/p(E) and p(E|F) = p(E ∩ F)/p(F).
2. Therefore, p(E ∩ F) = p(F|E)p(E) = p(E|F)p(F).
3. Therefore, p(F|E) = p(E|F)p(F)/p(E).
4. p(E) = p(E ∩ S) = p(E ∩ (F ∪ F̄)) = p((E ∩ F) ∪ (E ∩ F̄)).
5. (E ∩ F) and (E ∩ F̄) are disjoint (otherwise some x ∈ (F ∩ F̄) = ∅). Therefore p((E ∩ F) ∪ (E ∩ F̄)) = p(E|F)p(F) + p(E|F̄)p(F̄).
6. Therefore, p(F|E) = p(E|F)p(F)/p(E) = p(E|F)p(F)/(p(E|F)p(F) + p(E|F̄)p(F̄)). (Bayes theorem)

Example 11. Suppose 1 person in 100000 has a particular rare disease. There exists a diagnostic test for this disease that is accurate 99% of the time when given to those who have the disease and 99.5% of the time when given to those who do not. Given this information, we can find the probability that someone who tests positive for the disease actually has the disease.

Let E be the event that someone tests positive for the disease and F be the event that a person has the disease. We want to find p(F|E). We know that p(F) = 1/100000, and so p(F̄) = 99999/100000. We also know that p(E|F) = 99/100. Likewise we know that p(Ē|F̄) = 995/1000, so p(E|F̄) = 5/1000. So by Bayes theorem:

p(F|E) = p(E|F)p(F) / (p(E|F)p(F) + p(E|F̄)p(F̄)) = (.99)(.00001) / ((.99)(.00001) + (.005)(.99999)) ≈ .002

Notice that the result was not intuitively obvious. Most people, if told only the information we had available, assume that testing positive means a very high probability of having the disease.

2 Introduction to Bayesian Networks

2.1 Bayesian Networks

A Bayesian Network is a model of a system, which in turn consists of a number of random variables. It consists of:

1. A directed acyclic graph (DAG), within which each random variable is represented by a node. The topology of this DAG must meet the Markov Condition: each node must be conditionally independent of its nondescendants given its parents.
2. A set of conditional probability distributions, one for each node, which give the probability of the random variable represented by the given node taking particular values given the values the random variables represented by the node's parents take.

It has been proven that every discrete probability distribution (and many continuous ones) can be represented by a Bayesian Network, and that every Bayesian Network represents some probability distribution.

Examine the DAG in Figure 1 and the information in Table 1. From the chain rule, we know that the joint probability distribution of the random variables is p(A, B, C, D, E) = p(E|D, C, B, A)p(D|C, B, A)p(C|B, A)p(B|A)p(A). But given the conditional independencies present in P, we know that:

• p(C|B, A) = p(C|A)
• p(D|C, B, A) = p(D|C, B)
• p(E|D, C, B, A) = p(E|C)

So we know that p(A, B, C, D, E) = p(E|C)p(D|C, B)p(C|A)p(B|A)p(A). What we have done is pull the joint probability distribution apart by its conditional independencies. This may not seem a huge improvement, but it is. It means we can calculate the full joint distribution from the (normally much, much smaller) conditional probability tables associated with each node. We now have a means of obtaining tractable calculations using the full joint distribution. As the networks get bigger, the advantages of such a method become crucial.

Obviously, if there are no conditional independencies in the joint probability distribution, representing it with a Bayesian Network gains us nothing. But in practice, while independence relationships between random variables in a system we are interested in modeling are rare (and assumptions regarding such independence dangerous), conditional independencies are plentiful.

Some important points about Bayesian Networks:

• Bayesian Networks provide much more information than simple classifiers (like neural networks or support vector machines). Most importantly, when used to predict the value a random variable will take, they return a probability distribution rather than simply specifying what value is most probable.

• Bayesian Networks have an easily understandable and informative physical interpretation (unlike neural networks or support vector machines, which are effectively black boxes to all but experts).
• Bayesian Networks can also be extended to 'Influence Diagrams' with decision and utility nodes, etc., in order to perform automated decision making.

2.2 D-Separation, The Markov Blanket and Markov Equivalence

The Markov Condition also entails other conditional independencies. These conditional independencies have a graph theoretic criterion called D-Separation (which we will not define, as it is difficult). Accordingly, when one set of random variables, Γ, is conditionally independent of another, ∆, given a third, Θ, we will say that the nodes representing the random variables in Γ are D-Separated from ∆ by Θ. We will also say that two DAGs are Markov Equivalent if they have the same D-Separations.

The most important case of D-Separation/Conditional Independence is:

• A node is D-Separated from the entire graph given its parents, its children, and the other parents of its children. The parents, children and other parents of a node's children are called the Markov Blanket of the node.

Imagine we have a node, α (which is associated with a random variable), whose probability distribution we wish to predict and whose Markov Blanket is the set of nodes Γ. If we know the value of (the random variables associated with) every node in Γ, then we know that there is no more information to be had regarding the value taken by (the random variable associated with) α. This is important. Since, in practice, collecting data on random variables can be costly, if we are confident that we can always establish the values of some of the random variables our network is modeling, we can often see that certain of the random variables are superfluous, and we need not continue to include them in the network nor collect information on them. This can be very helpful.

We can use Bayesian Networks simply to model the correlations and conditional independencies between the random variables of systems. But generally we are interested in inferring the probability distributions of a subset of the random variables of the network given knowledge of the values taken by another (possibly empty) subset. We will see one advantage of this in the next section.

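Returning to the factorization of section 2.1, we can confirm numerically that a handful of small CPTs determines a full joint distribution. The sketch below uses the Figure 1 network; every CPT number is invented for illustration:

```python
from itertools import product

# Hypothetical CPTs for the binary network of Figure 1
# (A -> B, A -> C, {B, C} -> D, C -> E); all numbers invented.
pA = {1: 0.6, 0: 0.4}
pB = {1: {1: 0.7, 0: 0.3}, 0: {1: 0.2, 0: 0.8}}            # p(B | A)
pC = {1: {1: 0.5, 0: 0.5}, 0: {1: 0.9, 0: 0.1}}            # p(C | A)
pD = {(1, 1): {1: 0.8, 0: 0.2}, (1, 0): {1: 0.6, 0: 0.4},
      (0, 1): {1: 0.3, 0: 0.7}, (0, 0): {1: 0.1, 0: 0.9}}  # p(D | B, C)
pE = {1: {1: 0.25, 0: 0.75}, 0: {1: 0.65, 0: 0.35}}        # p(E | C)

def joint(a, b, c, d, e):
    # p(A, B, C, D, E) = p(E|C) p(D|C, B) p(C|A) p(B|A) p(A)
    return pE[c][e] * pD[(b, c)][d] * pC[a][c] * pB[a][b] * pA[a]

# The factorized joint is a proper distribution over all 32 assignments.
total = sum(joint(*vals) for vals in product((0, 1), repeat=5))
print(abs(total - 1.0) < 1e-9)  # True
```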
Figure 1: A DAG with five nodes

Node  Conditional Independencies
A     None
B     C and E, given A
C     B, given A
D     A and E, given B and C
E     A, B and D, given C

Table 1: Conditional independencies required of the random variables for the DAG in Figure 1 to be a Bayesian Network

Figure 2: The Markov Blanket of Node L (in a larger network with nodes A through W)

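The blanket-reading rule above (parents, children, and the other parents of children) can be sketched directly. The DAG below is the Figure 1 network, with edges as implied by the factorization in section 2.1; the map-of-parents representation is ours:

```python
# DAG stored as a map from each node to the list of its parents.
parents = {"A": [], "B": ["A"], "C": ["A"], "D": ["B", "C"], "E": ["C"]}

def markov_blanket(node):
    """Parents, children, and the other parents of children."""
    children = [n for n, ps in parents.items() if node in ps]
    blanket = set(parents[node]) | set(children)
    for child in children:                  # co-parents of each child
        blanket |= set(parents[child])
    blanket.discard(node)
    return blanket

# C's blanket: parent A, children D and E, and D's other parent B.
print(sorted(markov_blanket("C")))  # ['A', 'B', 'D', 'E']
```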
2.3 Potentials

Where V is a set of random variables {v1, ..., vn}, let ΓV be the Cartesian product of the co-domains of the random variables in V. So ΓV consists of all the possible combinations of values that the random variables of V can take. Let φV be a mapping V × ΓV → R, such that φV(vi, x) = the ith term of x, where x ∈ ΓV. I.e. φV gives us the value 'assigned' to a particular member of V by a particular member of ΓV. If W ⊆ V, let ψW be a mapping ΓV → ΓW, such that φW(x, ψW(y)) = φV(x, y), for all x ∈ W, y ∈ ΓV. So ψW gives us the member of ΓW in which all the members of W are 'assigned' the same values as in a particular member of ΓV.

A potential is an ordered pair <V, F>, where V is a set of random variables and F is a mapping ΓV → R. We call the set of random variables in a potential the potential's scheme.

Given a set of potentials {<V1, F1>, ..., <Vn, Fn>}, the multiplication of these potentials is itself a potential, <Vα, Fα>, where:

• Vα = V1 ∪ ... ∪ Vn
• Fα(x) = F1(ψV1(x)) · ... · Fn(ψVn(x))

This is simpler than it appears. The scheme of a product of a set of potentials is the union of the schemes of the factors, and the value assigned by the function in the product to a particular value combination of the random variables is the product of the values assigned by the functions of the factors to the same value combination (for those random variables present in the factor).

Example 12. Take the multiplication of two potentials pot1 = <{X1, X2}, f> and pot2 = <{X1, X3}, g>, where all random variables are binary:

x1  x2  f(X1 = x1, X2 = x2)
1   1   0.7
1   0   0.3
0   1   0.4
0   0   0.6

Table 2: pot1

Likewise, given a potential <V, F>, the marginalization out of some random variable v ∈ V from this potential is itself a potential, <Vα, Fα>, where:

• Vα = V \ {v}
• Fα(x) = Σ F(y), summing over those y ∈ ΓV such that ψVα(y) = x

x1  x3  g(X1 = x1, X3 = x3)
1   1   0.9
1   0   0.1
0   1   0.2
0   0   0.8

Table 3: pot2

Where pot3 = pot1 · pot2:

x1  x2  x3  h(X1 = x1, X2 = x2, X3 = x3)
1   1   1   0.7 · 0.9 = 0.63
1   1   0   0.7 · 0.1 = 0.07
1   0   1   0.3 · 0.9 = 0.27
1   0   0   0.3 · 0.1 = 0.03
0   1   1   0.4 · 0.2 = 0.08
0   1   0   0.4 · 0.8 = 0.32
0   0   1   0.6 · 0.2 = 0.12
0   0   0   0.6 · 0.8 = 0.48

Table 4: pot3

Example 13. If pot4 is the result of marginalizing X1 out of pot1 from Example 12, then:

x2  i(X2 = x2)
1   1.1
0   0.9

Table 5: pot4

Some points:

• Note that potentials are simply generalizations of probability distributions, and that the latter are necessarily the former. In fact, a conditional probability table is a potential, but not vice versa.
• Unlike distributions, potentials need not sum to 1.

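The two potential operations can be sketched compactly. Representing a potential as a (scheme, table) pair is our choice, not the text's; the numbers follow Examples 12 and 13:

```python
from itertools import product

# Potentials as (scheme, table): scheme is a tuple of variable names,
# the table maps value tuples (in scheme order) to reals.
pot1 = (("X1", "X2"), {(1, 1): 0.7, (1, 0): 0.3, (0, 1): 0.4, (0, 0): 0.6})
pot2 = (("X1", "X3"), {(1, 1): 0.9, (1, 0): 0.1, (0, 1): 0.2, (0, 0): 0.8})

def multiply(a, b):
    """Scheme of the product = union of schemes; values multiply."""
    (va, fa), (vb, fb) = a, b
    vs = tuple(dict.fromkeys(va + vb))        # ordered union of schemes
    table = {}
    for combo in product((1, 0), repeat=len(vs)):
        env = dict(zip(vs, combo))
        table[combo] = (fa[tuple(env[v] for v in va)] *
                        fb[tuple(env[v] for v in vb)])
    return (vs, table)

def marginalize_out(pot, var):
    """Sum the rows that agree on the remaining variables."""
    vs, table = pot
    keep = tuple(v for v in vs if v != var)
    idx = [vs.index(v) for v in keep]
    out = {}
    for combo, value in table.items():
        key = tuple(combo[i] for i in idx)
        out[key] = out.get(key, 0.0) + value
    return (keep, out)

pot3 = multiply(pot1, pot2)                   # Table 4
pot4 = marginalize_out(pot1, "X1")            # Table 5
print(round(pot3[1][(1, 1, 1)], 3))           # 0.63
print(round(pot4[1][(1,)], 3), round(pot4[1][(0,)], 3))  # 1.1 0.9
```

Note that pot4's values (1.1 and 0.9) illustrate the last point above: a potential need not sum to 1.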
2.4 Exact Inference on a Bayesian Network: The Variable Elimination Algorithm

Let Γ be a subset of random variables in our network, and let f be a function that assigns each random variable, v ∈ Γ, a particular value, f(v), from those that v can take. To obtain the probability that the random variables in Γ take the values assigned to them by f:

1. Perform a topological sort on the DAG. (From the definition of a DAG, this is always possible.) This gives us an ordering where all nodes occur before their descendants.
2. For each node, n, construct a 'bucket', bn. Also construct a null bucket, b∅.
3. For each conditional probability distribution in the network:
   (a) Create a list of the random variables present in the conditional probability distribution.
   (b) For each random variable, v ∈ Γ, eliminate all rows corresponding to values other than f(v), and eliminate v from the associated list.
   (c) Place the distribution in the bucket associated with the random variable remaining in the list associated with the highest ordered node. If there are no random variables remaining, place the distribution in the null bucket.
4. Proceed in the given order through the buckets:
   (a) Create a new potential by multiplying all potentials in the bucket.
   (b) In this potential, marginalize out (i.e. sum over) the random variable associated with the bucket.
   (c) Associate with this potential a list of random variables that includes all random variables on the lists associated with the original potentials in the bucket, and remove the random variable associated with the bucket from this list. Place the potential in the bucket associated with the random variable remaining in the list associated with the highest ordered node. If there are no random variables remaining, place the potential in the null bucket.
5. Multiply together the 'distributions' in the null bucket (this is simply scalar multiplication).

To obtain the a posteriori probability that a subset of random variables, Γ, takes particular values given the observation that a second subset, ∆, has taken particular values, we run the algorithm twice: first on Γ ∪ ∆, then on ∆, and we divide the first result by the second.

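The core bucket operation (multiply the potentials in a bucket, then sum out the bucket's variable) can be sketched on a tiny hypothetical network A -> B with invented CPTs, computing the marginal p(B):

```python
# Invented CPTs for a two-node network A -> B.
pA = {1: 0.3, 0: 0.7}
pB_given_A = {1: {1: 0.9, 0: 0.1}, 0: {1: 0.4, 0: 0.6}}

# The bucket for A holds p(A) and p(B|A); multiplying them and
# marginalizing A out leaves a potential over B, which is p(B).
pB = {b: sum(pA[a] * pB_given_A[a][b] for a in (0, 1)) for b in (0, 1)}
print(round(pB[1], 3), round(pB[0], 3))  # 0.55 0.45
```

In a larger network this eliminated potential would be dropped into the bucket of the highest-ordered variable it still mentions, exactly as step 4c prescribes.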
Some points to note:

• The algorithm can be extended to obtain good estimates of error bars for our probability estimates.
• The algorithm can be run on the smallest sub-graph containing (the nodes representing) the variables whose a posteriori probabilities we wish to find that is D-Separated from the remainder of the Network by (nodes representing) random variables whose values we know.
• The complexity of the algorithm is dominated by the largest potential, which will be at least the size of the largest conditional probability table and which is, in practice, much, much smaller than the full joint distribution.
• When used to calculate a large number of probabilities (such as the a posteriori probability distributions for each unobserved random variable), the algorithm is relatively inefficient: if f is a function from the random variables in the network to the number of values each can take, it must be run f(v) − 1 times for each unobserved random variable v.

2.5 Exact Inference on a Bayesian Network: The Junction Tree Algorithm

The Junction Tree algorithm is the work horse of Bayesian Network inference algorithms. It utilizes a secondary structure formed from the Bayesian Network, called a Junction Tree or Join Tree, permitting efficient exact inference. It does not, though, permit the calculation of error bars for our probability estimates. Since the Junction Tree algorithm is a generalization of the Variable Elimination algorithm, there is hope that the extension to the latter that permits us to obtain such error bars can likewise be generalized so as to be utilized in the former. Whether, and if so how, this can be done is an open research question.

We first show how to create this structure. Some definitions:

• A cluster is a maximally connected sub-graph.
• The weight of a node is the number of values its associated random variable has.
• The weight of a cluster is the product of the weights of its constituent nodes.

The Create (an Optimal) Junction Tree Algorithm:

1. Take a copy, G, of the DAG. For each node, v, join all unconnected parents of v, then undirect all edges.

(c) Insert s between the cliques X and Y only if X and Y are on diﬀerent trees in the forest. n.) Before explaining how to perform inference using a Junction Tree. (c) If C is not a sub-graph of a previously stored cluster. (This merges their two trees into a larger tree. such that n causes the least number of edges to be added in step 2b. from G. each consisting of a single stored clique. adding edges as required. we require some deﬁnitions: Evidence Potentials 18 . (b) Delete s from S . 3. from the node and its neighbors. breaking ties by choosing the node which induces the cluster with the least weight. Further ties can be broken arbitrarily. store C as a clique. breaking ties by calculating the product of the number of values of the random variables in the sets. Also create a set. that has the largest number of variables in it.A B C D E F G H I Figure 3: A simple Bayesian Network 2. C . (b) Form a cluster. until you are left with a single tree: The Junction Tree. s. (d) Remove n from G. S . and choosing the set with the lowest. Repeat until n − 1 sepsets have been inserted into the forest: (a) Select from S the sepset. While there are still nodes left in G: (a) Select a node. Create n trees.

Figure 4: The Junction Tree constructed from Figure 3

Variable  Value 1  Value 2  Value 3  Notes
A         1        1        1        Nothing known
B         1        0        0        Observed to be value 1
C         1        0        1        Observed to not be value 2
D         0.75     0.2      0.05     Soft evidence, with actual probabilities
E         150      40       10       Soft evidence, assigns same probabilities as D

Table 6: Evidence Potentials

An evidence potential has a singleton set of random variables, and maps the random variable's values to real numbers. If working with hard evidence, it will map 0 to values which evidence has ruled out, and 1 to all other values (where at least one value must be mapped to 1). Where all values are mapped to 1, nothing is known about the random variable. Where all values except one are mapped to 0, it is known that the random variable takes the specified value. If working with soft evidence, values can be mapped to any non-negative real number, but the sum of these must be non-zero. Such a potential assigns values probabilities as specified by its normalization.

Message Pass

We pass a message from one clique, c1, to another, c2, via the intervening sepset, s, by:

1. Saving the potential associated with s.
2. Marginalizing a new potential for s, containing only those variables in s, out of c1.
3. Assigning a new potential to c2, such that:

pot(c2)new = pot(c2)old · (pot(s)new / pot(s)old)

Collect Evidence

When called on a clique, c, Collect Evidence does the following:

1. Marks c.
2. Calls Collect Evidence recursively on the unmarked neighbors of c, if any.
3. Passes a message from c to the clique that called Collect Evidence, if any.

Disperse Evidence

When called on a clique, c, Disperse Evidence does the following:

1. Marks c.
2. Passes a message to each of the unmarked neighbors of c, if any.
3. Calls Disperse Evidence recursively on the unmarked neighbors of c, if any.

To perform inference on a Junction Tree, we use the following algorithm:

1. Associate with each clique and sepset a potential whose random variables are those of the clique/sepset, and which associates with all value combinations of these random variables the value 1.

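The message-pass update defined above can be sketched numerically. All numbers here are invented; the sepset holds a single binary variable X, and clique c2's scheme is {X, Y}:

```python
# Sepset potential before and after marginalizing out of c1.
pot_s_old = {(1,): 1.0, (0,): 1.0}
pot_s_new = {(1,): 0.6, (0,): 0.9}
pot_c2 = {(1, 1): 0.5, (1, 0): 0.5, (0, 1): 0.2, (0, 0): 0.8}

# pot(c2)new = pot(c2)old * pot(s)new / pot(s)old, matched on the
# sepset variable (here the first coordinate of c2's scheme).
pot_c2_new = {k: v * pot_s_new[(k[0],)] / pot_s_old[(k[0],)]
              for k, v in pot_c2.items()}
print(round(pot_c2_new[(1, 1)], 3), round(pot_c2_new[(0, 0)], 3))  # 0.3 0.72
```

With all sepset potentials initialized to 1, the first pass simply multiplies the message in; later passes divide out the stale sepset values, which is what keeps repeated passes consistent.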
given a set of random variables whose values we know (or are assuming):

1. For each node in the network:

   (a) Associate with the node an evidence potential representing current knowledge.

   (b) Find a clique containing the node and its parents (it is certain to exist) and multiply in the node's conditional probability table to the clique's potential. (By 'multiply in' is meant: multiply the node's conditional probability table and the clique's potential, and replace the clique's potential with the result.)

   (c) Multiply in the evidence potential associated with the node.

2. Pick an arbitrary root clique.

3. Call collect evidence and then disperse evidence on this clique.

4. For each node you wish to obtain a posteriori probabilities for:

   (a) Select the smallest clique containing this node.

   (b) Create a copy of the potential associated with this clique.

   (c) Marginalize all other nodes out of the clique.

   (d) Normalize the resulting potential. This is the random variable's a posteriori probability distribution.

Some points to note:

• The complexity of the algorithm is dominated by the largest potential associated with a clique, which will be at least the size of, and probably much larger than, the largest conditional probability table. But it is, in practice, much smaller than the full joint distribution.

• When cliques are relatively small, the algorithm is comparatively efficient. There are also numerous techniques to improve efficiency available in the literature.

• A Junction Tree can be formed from the smallest sub-graph containing (the nodes representing) the variables whose a posteriori probabilities we wish to find that is D-Separated from the remainder of the network by (nodes representing) random variables whose values we know.
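To make the extraction step concrete, here is a minimal sketch of steps 4(b)-(d) in Python. The dict-of-assignments representation of a potential, and all names and numbers below, are illustrative assumptions rather than anything prescribed by the notes.

```python
# Sketch of steps 4(b)-(d): copy a clique potential, marginalize all other
# variables out, and normalize to get an a posteriori distribution.
# A potential is represented (hypothetically) as a dict mapping
# assignments -- tuples of (variable, value) pairs -- to non-negative reals.

def marginal_from_clique(potential, query_var):
    scores = {}
    for assignment, value in potential.items():
        v = dict(assignment)[query_var]
        scores[v] = scores.get(v, 0.0) + value        # marginalize others out
    total = sum(scores.values())                      # normalizing constant
    return {v: s / total for v, s in scores.items()}  # normalize

# A toy potential over two binary variables A and B:
phi = {
    (("A", 0), ("B", 0)): 0.125,
    (("A", 0), ("B", 1)): 0.125,
    (("A", 1), ("B", 0)): 0.25,
    (("A", 1), ("B", 1)): 0.5,
}
print(marginal_from_clique(phi, "A"))  # {0: 0.25, 1: 0.75}
```

In a real implementation the potential would be copied and updated in place within the junction tree; here we simply return the queried marginal.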

2.6 Inexact Inference on a Bayesian Network: Likelihood Sampling

If the network is sufficiently complex, exact inference algorithms will become intractable. In such cases we turn to likelihood sampling. Using this algorithm, given a set of random variables, E, whose values we know (or are assuming), we can estimate a posteriori probabilities for the other random variables, U:

1. Perform a topological sort on the DAG.

2. Set all random variables in E to the value they are known/assumed to take.

3. For each random variable in U, create a score card with a number for each value the random variable can take. Initially set all numbers to zero.

4. Repeat:

   (a) In the order generated in step 1, randomly assign values to each random variable in U using their conditional probability tables.

   (b) Given the values assigned, calculate, from the conditional probability tables of the random variables in E:

      p(E = e) = ∏_{En ∈ E} p(En = en | Par(En) = par(En))

   (c) For each random variable in U, add p(E = e) to the score for the value it was assigned in this sample.

5. For each random variable in U, normalize its score card. This is an estimate of the random variable's a posteriori probability distribution.

Here Par(v) denotes the random variables associated with the parents of the node associated with random variable v, par(v) the values these parents have been assigned, and E = {E1, ..., En}.

3 Parameter Learning

3.1 The Dirichlet Distribution

The Dirichlet distribution is a multivariate distribution parametrized by a vector α of positive reals. It is often said that Dirichlet distributions represent the probabilities associated with seeing value ai occur xi out of N = x1 + ... + xn times. If the probability of a random variable, X, taking particular values from the set a1, a2, ..., an is given by a Dirichlet distribution dir(x1, x2, ..., xn), then:

p(X = ai) = xi / N

The corresponding probability density function is:

p(f1, f2, ..., fn−1) = [Γ(N) / (Γ(x1) Γ(x2) · · · Γ(xn))] f1^(x1−1) f2^(x2−1) · · · fn^(xn−1)

where 0 ≤ fk ≤ 1 and fn is shorthand for 1 − f1 − f2 − ... − fn−1, so that f1 + f2 + ... + fn = 1.
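Stepping back briefly, the likelihood-sampling procedure of section 2.6 can be sketched in code. The two-node network A → B, its probabilities, and the sample count below are all invented for illustration; only the loop structure follows the numbered steps.

```python
# Minimal sketch of likelihood sampling on a hypothetical two-node
# network A -> B, estimating p(A | B = 1). All numbers are invented.
import random

random.seed(0)                               # deterministic for illustration

p_A = {0: 0.6, 1: 0.4}                       # p(A)
p_B_given_A = {0: {0: 0.9, 1: 0.1},          # p(B | A = 0)
               1: {0: 0.2, 1: 0.8}}          # p(B | A = 1)

observed_b = 1                               # E = {B}, known to take value 1
scores = {0: 0.0, 1: 0.0}                    # score card for A (steps 2-3)

for _ in range(10000):                       # step 4: repeat
    a = 0 if random.random() < p_A[0] else 1 # 4(a): sample the free variable
    weight = p_B_given_A[a][observed_b]      # 4(b): p(E = e) for this sample
    scores[a] += weight                      # 4(c): add to the score card

total = sum(scores.values())
posterior_A = {v: s / total for v, s in scores.items()}   # step 5: normalize
print(posterior_A)   # close to the exact p(A=1 | B=1) = 0.32/0.38, about 0.842
```

Weighting each sample by p(E = e) rather than rejecting samples inconsistent with the evidence is what distinguishes likelihood sampling from naive rejection sampling.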

As an example, take two binary random variables, X and Y, with values (codomains) {x1, x2} and {y1, y2}, and let their probabilities be represented by the Dirichlet distributions dir(4, 6) and dir(40, 60) respectively. While p(X = x1) = p(Y = y1) = .4 and p(X = x2) = p(Y = y2) = .6, our confidence in the probabilities given for Y would be much higher than those given for X. We shall also see that, for our purposes, the probabilities for Y would be much more resistant to emendation from new evidence than those for X, since so much more of the density distribution lies in the vicinity of these values.

3.2 Parameter Learning

We can now give the algorithm for learning a network's parameters given data D and graph G:

1. Perform a topological sort on G.

2. For each node, associate a set of Dirichlet distributions: one for each possible combination of values the random variables associated with the node's parents can take.

3. For each datum, d ∈ D:

   (a) For each node, n ∈ G, in the order given by step 1:

      i. Find the Dirichlet distribution associated with n that corresponds to the values taken by the random variables associated with the parents of n in d.

      ii. Add 1 to the parameter corresponding to the value the random variable associated with n takes in d.

To encode prior information and/or enforce a level of conservatism, we can set the initial parameters of the Dirichlets of step 2. Regarding the conservatism: we do not want to conclude from a single instance that it is certain a random variable will take a given value. To avoid this, it is often suggested that the parameters all be initialized to 1, though this is normally rendered irrelevant by the use of an equivalent sample size (see below).

4 Structure Learning

To score network topologies we require:

1. A search space

2. A set of transitions which will permit us to search the state space

3. A scoring function which we will seek to maximise

4. A search strategy/algorithm
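The parameter-learning loop of section 3.2 can be sketched for a single node with one binary parent. The dataset and the all-ones prior initialization below are invented for illustration.

```python
# Sketch of the section 3.2 loop for one node with a single binary parent.
# The data are invented; priors are initialized to 1 as suggested above.

dirichlets = {0: [1, 1], 1: [1, 1]}     # parent value -> Dirichlet parameters

# Each datum is a (parent value, node value) pair:
data = [(0, 0), (0, 1), (0, 0), (1, 1), (1, 1)]

for parent_val, node_val in data:
    # 3(a)i: select the Dirichlet matching the parents' values in d;
    # 3(a)ii: add 1 to the parameter for the value the node takes in d.
    dirichlets[parent_val][node_val] += 1

# p(X = ai) = xi / N then gives the learnt conditional probability table:
cpt = {pv: [x / sum(xs) for x in xs] for pv, xs in dirichlets.items()}
print(cpt)  # {0: [0.6, 0.4], 1: [0.25, 0.75]}
```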

4.1 Search Spaces

There are three search spaces that might be used.

4.1.1 Ordered DAG Topologies

Firstly, we might specify an ordering on the variables and search the topologies that respect this ordering. Justification for this is that the chain rule is valid in any order. For example:

p(X1, X2, X3) = p(X3 | X1, X2) p(X2 | X1) p(X1) = p(X1 | X2, X3) p(X2 | X3) p(X3)

Motivations include the small size of the state space and the ability to produce compound graphs from a number of high scoring topologies. Problematically, though, because all graphs respect the ordering, not all conditional independencies potentially present can be represented by the topologies respecting a particular ordering. In cases where such independencies are present, and hence not encoded in the network, the network is more susceptible to noise and more complex than it needs to be. Searching ordered DAG topologies also raises the issues that will be presented below for searching all DAG topologies.

4.1.2 DAG Topologies

We can avoid these issues by searching all possible DAG topologies. This too, though, presents concerns. Remember that graph topologies can be divided into Markov equivalence classes, and that all topologies belonging to the same equivalence class encode the same conditional independencies. A priori, we would like all equivalence classes to be equally likely to be selected. But some equivalence classes have massively more members than others. Problematically, since our state space searches are heuristic, and can get stuck at local maxima, if we search DAG topologies such equivalence classes are much more likely than those with few members to be learnt.

4.1.3 Markov Equivalence Classes of DAG Topologies

This leads to the obvious final search space: equivalence classes of DAG topologies. This is generally the best option and is the option used in high end Bayesian Network applications.
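The claim in 4.1.1 that the chain rule is valid in any order is easy to check numerically; the joint table below is arbitrary illustrative data, not anything from the notes.

```python
# Numerical check that the chain rule factorizes the same joint in any
# ordering (the identity quoted in 4.1.1). The joint table is arbitrary.
import itertools

outcomes = list(itertools.product([0, 1], repeat=3))        # (x1, x2, x3)
vals = [0.02, 0.08, 0.10, 0.05, 0.20, 0.15, 0.25, 0.15]     # sums to 1
joint = dict(zip(outcomes, vals))

def marg(fixed):
    """Marginal probability that the given positions take the given values."""
    return sum(p for xs, p in joint.items()
               if all(xs[i] == v for i, v in fixed.items()))

for x1, x2, x3 in outcomes:
    full = joint[(x1, x2, x3)]
    # p(X3|X1,X2) p(X2|X1) p(X1):
    a = (full / marg({0: x1, 1: x2})) * \
        (marg({0: x1, 1: x2}) / marg({0: x1})) * marg({0: x1})
    # p(X1|X2,X3) p(X2|X3) p(X3):
    b = (full / marg({1: x2, 2: x3})) * \
        (marg({1: x2, 2: x3}) / marg({2: x3})) * marg({2: x3})
    assert abs(a - full) < 1e-12 and abs(b - full) < 1e-12
print("both orderings reproduce the joint")
```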

4.2 The Bayesian Scoring Criterion (BS)

Given a new topology, we learn the parameters of the network using the algorithm explained earlier. We then score the topology, given these parameters, using the BS. The Bayesian scoring criterion (BS) scores the fitness of a topology by calculating the probability of the data given the topology:

P(d|G) = ∏_{i∈G} ∏_{j∈PA(i)} [ Γ(Nij) / Γ(Nij + Mij) ] ∏_k [ Γ(aijk + sijk) / Γ(aijk) ]

where

• d is our learning data.

• G is the graph we are scoring.

• n is the number of nodes in the graph.

• PA is a function from a node to the possible value combinations of the parents of that node.

• ri is the number of values node i has, and the inner product over k runs over these values.

• Nij is the sum, for graph G, of the Dirichlet prior parameters for the row of node i's conditional probability table corresponding to its parents' value combination j.

• Mij is the sum of the learnt additions to the Dirichlet parameters for the same row.

• aijk is the Dirichlet prior parameter corresponding to value k in row j for node i in graph G.

• sijk is the sum of the learnt additions to the same parameter.

Importantly, the BS is locally updateable: we can calculate the effects that alterations to the topology have on the score, rather than needing to recalculate the score from scratch.

4.3 The Bayesian Equivalent Scoring Criterion (BSe)

To ensure that the procedure outlined in the previous section results in Markov equivalent topologies obtaining equal scores, we must use an equivalent sample size. What this means is that we pick some number, n, and fix it such that the prior parameters in the Dirichlets associated with each node's conditional probability table sum to n. Because the size of the conditional probability distributions is exponential in the number of parents of the node, this often results either in nodes with no, or few, parents having large prior parameters, and hence being resistant to learning from the data, or in nodes

with many parents having prior parameters very close to zero, which results in a lack of conservatism. Generally, the second choice is chosen. Using the Bayesian scoring criterion with an equivalent sample size is called using the Bayesian equivalent scoring criterion (BSe).
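A single (node i, parent combination j) factor of the BS formula can be computed in log space with log-gamma functions for numerical stability. The prior parameters and learnt counts below are illustrative (the counts (4, 6) echo the dir(4, 6) example above).

```python
# One (node i, parent combination j) factor of the BS, computed in log
# space with math.lgamma for stability. Priors and counts are illustrative.
from math import exp, lgamma

def bs_row_factor(priors, counts):
    """log of Gamma(Nij)/Gamma(Nij+Mij) * prod_k Gamma(aijk+sijk)/Gamma(aijk)."""
    N = sum(priors)                     # Nij: sum of prior parameters in row
    M = sum(counts)                     # Mij: sum of learnt additions to row
    log_score = lgamma(N) - lgamma(N + M)
    for a, s in zip(priors, counts):    # product over the node's values k
        log_score += lgamma(a + s) - lgamma(a)
    return log_score

# Uniform priors of 1 and learnt counts (4, 6) for one binary node:
print(exp(bs_row_factor([1, 1], [4, 6])))  # 1/2310, roughly 0.000433
```

Working in log space and summing per-row terms is also what makes the local-updateability noted above practical: changing one node's parents only changes that node's row factors.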
