
# Ai99 Tutorial 4


11/08/2012


Bayesian AI Tutorial

AI’99, Sydney, 6 December 1999

Ann E. Nicholson and Kevin B. Korb
School of Computer Science and Software Engineering
Monash University, Clayton, VIC 3168, AUSTRALIA
{annn,korb}@csse.monash.edu.au

Overview
1. Introduction to Bayesian AI (20 min)
2. Bayesian networks (50 min)
Break (10 min)
3. Applications (50 min)
Break (10 min)
4. Learning Bayesian networks (50 min)
5. Current research issues (10 min)
6. Bayesian Net Lab (60 min: Optional)
7. Dinner (Optional)

Introduction to Bayesian AI
Reasoning under uncertainty
Probabilities
Alternative formalisms
– Fuzzy logic
– MYCIN’s certainty factors
– Default Logic
Bayesian philosophy
– Dutch book arguments
– Bayes’ Theorem
– Conditionalization
– Confirmation theory
Bayesian decision theory
Towards a Bayesian AI

Reasoning under uncertainty
Uncertainty: The quality or state of being not clearly known. This encompasses most of what we understand about the world, and most of what we would like our AI systems to understand. It distinguishes deductive knowledge (e.g., mathematics) from inductive belief (e.g., science).
Sources of uncertainty
– Ignorance (which side of this coin is up?)
– Physical randomness (which side of this coin will land up?)
– Vagueness (which tribe am I closest to genetically? Picts? Angles? Saxons? Celts?)

Probabilities
Classic approach to reasoning under uncertainty (Blaise Pascal and Fermat).
Kolmogorov’s Axioms:
1. $P(U) = 1$
2. $\forall X \subseteq U,\ P(X) \geq 0$
3. $\forall X, Y \subseteq U$, if $X \cap Y = \emptyset$ then $P(X \vee Y) = P(X) + P(Y)$
Conditional Probability: $P(X \mid Y) = \dfrac{P(X \wedge Y)}{P(Y)}$
Independence: $X \perp\!\!\!\perp Y$ iff $P(X \mid Y) = P(X)$

Fuzzy Logic
Designed to cope with vagueness: Is Fido a Labrador or a Shepherd?
Fuzzy set theory: $m(\text{Fido} \in \text{Labrador}) = m(\text{Fido} \in \text{Shepherd}) = 0.5$
Extended to fuzzy logic, which takes intermediate truth values: $T(\text{Labrador}(\text{Fido})) = 0.5$.
Combination rules:
$T(p \wedge q) = \min(T(p), T(q))$
$T(p \vee q) = \max(T(p), T(q))$
$T(\neg p) = 1 - T(p)$
Not suitable for coping with randomness or ignorance. Obviously not: Uncertainty(inclement weather) = max(Uncertainty(rain), Uncertainty(hail), ...)


MYCIN’s Certainty Factors
Uncertainty formalism developed for the early expert system MYCIN (Buchanan and Shortliffe, 1984).
Elicit for $(h, e)$:
– measure of belief: $MB(h, e) \in [0, 1]$
– measure of disbelief: $MD(h, e) \in [0, 1]$
$$CF(h, e) = MB(h, e) - MD(h, e) \in [-1, 1]$$
Special functions provided for combining evidence.
Problems:
– No semantics ever given for ‘belief’/‘disbelief’.
– Heckerman (1986) proved that restrictions required for a probabilistic semantics imply absurd independence assumptions.

Default Logic
Intended to reflect “stereotypical” reasoning under uncertainty (Reiter 1980). Example:
$$\frac{\text{Bird}(\text{Tweety}) : \text{Bird}(x) \rightarrow \text{Flies}(x)}{\text{Flies}(\text{Tweety})}$$
Problems:
– Best semantics for default rules are probabilistic (Pearl 1988, Korb 1995).
– Mishandles combinations of low probability events. E.g.,
$$\frac{\text{ApplyforJob}(\text{me}) : \text{ApplyforJob}(x) \rightarrow \text{Reject}(x)}{\text{Reject}(\text{me})}$$
I.e., the dole always looks better than applying for a job!


Probability Theory
So, why not use probability theory to represent uncertainty? That’s what it was invented for: dealing with physical randomness and degrees of ignorance.
Furthermore, if you make bets which violate probability theory, you are subject to Dutch books: a Dutch book is a sequence of “fair” bets which collectively guarantee a loss.
Fair bets are bets based upon the standard odds-probability relation:
$$O(h) = \frac{P(h)}{1 - P(h)} \qquad P(h) = \frac{O(h)}{1 + O(h)}$$

A Dutch Book
Payoff table on a bet for $h$ (Odds $= p/(1-p)$; $S$ = betting unit):

| h | Payoff |
|---|---|
| T | \$(1-p) × S |
| F | -\$p × S |

Given a fair bet, the expected value from such a payoff is always \$0.
Now, let’s violate the probability axioms.
Example: Say, $P(A) = -0.1$ (violating A2).
Payoff table against A (inverse of: for A), with S = 1:

| ¬A | Payoff |
|---|---|
| T | \$pS = -\$0.10 |
| F | -\$(1-p)S = -\$1.10 |

Whichever way A turns out, the payoff is negative: a guaranteed loss.
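The payoff tables above can be checked numerically. The following is a minimal sketch, not part of the original slides; the function names `payoff_for`, `payoff_against` and `expected_value_for` are hypothetical. It shows that a bet priced at a coherent probability has zero expected value, while a bet against A priced at the incoherent P(A) = -0.1 loses in every outcome.

```python
def payoff_for(p, h_true, stake=1.0):
    """Payoff of a bet for h at price p: win (1-p)*S if h is true, lose p*S otherwise."""
    return (1 - p) * stake if h_true else -p * stake


def payoff_against(p, h_true, stake=1.0):
    """Payoff of a bet against h: the mirror image of the bet for h."""
    return -payoff_for(p, h_true, stake)


def expected_value_for(p, stake=1.0):
    """Expected payoff of a bet for h when h really has probability p."""
    return p * payoff_for(p, True, stake) + (1 - p) * payoff_for(p, False, stake)


# A coherent price gives a fair bet.
print(expected_value_for(0.3))

# An incoherent price P(A) = -0.1: betting against A loses whatever happens.
for a_true in (True, False):
    print("A is", a_true, "-> payoff", round(payoff_against(-0.1, a_true), 2))
```

Both printed payoffs in the loop are negative, which is the guaranteed loss that defines a Dutch book.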

Bayes’ Theorem; Conditionalization
Due to Reverend Thomas Bayes (1764)
$$P(h \mid e) = \frac{P(e \mid h)\,P(h)}{P(e)}$$
Conditionalization: $P'(h) = P(h \mid e)$
Posterior = (Likelihood × Prior) / Prob of evidence
Assumptions:
1. Joint priors over $\{h_i\}$ and $e$ exist.
2. Total evidence: $e$, and only $e$, is learned.

Bayesian Decision Theory
Frank Ramsey (1931)
Decision making under uncertainty: what action to take (plan to adopt) when the future state of the world is not known.
Bayesian answer: Find the utility of each possible outcome (action-state pair) and take the action that maximizes expected utility.
Example

| Action | Rain (p = .4) | Shine (1 - p = .6) |
|---|---|---|
| Take umbrella | 30 | 10 |
| Leave umbrella | -100 | 50 |

Expected utilities:
E(Take umbrella) = (30)(.4) + (10)(.6) = 18
E(Leave umbrella) = (-100)(.4) + (50)(.6) = -10
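Conditionalization via Bayes’ theorem can be sketched in a few lines. This is an illustrative example only: the function name `conditionalize` and the disease/test numbers are hypothetical and do not come from the tutorial. It computes the posterior over a set of mutually exclusive hypotheses from their priors and likelihoods.

```python
def conditionalize(priors, likelihoods):
    """Return P(h_i | e) from priors P(h_i) and likelihoods P(e | h_i)."""
    joint = {h: priors[h] * likelihoods[h] for h in priors}
    p_e = sum(joint.values())  # P(e) = sum_i P(e | h_i) P(h_i)
    return {h: joint[h] / p_e for h in joint}


# Hypothetical numbers: a disease with 1% prevalence and a test that is
# positive for 90% of sick patients and 5% of healthy patients.
priors = {"disease": 0.01, "healthy": 0.99}
likelihoods = {"disease": 0.90, "healthy": 0.05}

posterior = conditionalize(priors, likelihoods)
print(posterior)
```

The posterior then serves as the new prior $P'(h)$ for the next piece of evidence, provided that evidence is all that was learned.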


Bayesian AI
A Bayesian conception of an AI is: An autonomous agent which
– Has a utility structure (preferences)
– Can learn about its world and the relation between its actions and future states (probabilities)
– Maximizes its expected utility
The techniques used in learning about the world are (primarily) statistical. Hence Bayesian data mining.

Bayesian Networks: Overview
Syntax
Semantics
Evaluation methods
Influence diagrams (Decision Networks)
Dynamic Bayesian Networks


Bayesian Networks
Data structure which represents the dependence between variables; gives a concise specification of the joint probability distribution.
A Bayesian Network is a graph in which the following holds:
1. A set of random variables makes up the nodes in the network.
2. A set of directed links or arrows connects pairs of nodes.
3. Each node has a conditional probability table that quantifies the effects the parents have on the node.
4. Directed, acyclic graph (DAG), i.e. no directed cycles.

Example: Earthquake (Pearl, R&N)
You have a new burglar alarm installed. It is reliable about detecting burglary, but responds to minor earthquakes. Two neighbours (John, Mary) promise to call you at work when they hear the alarm.
– John always calls when he hears the alarm, but confuses the alarm with the phone ringing (and calls then also).
– Mary likes loud music and sometimes misses the alarm!
Given evidence about who has and hasn’t called, estimate the probability of a burglary.


Earthquake Example: Notes
Assumptions: John and Mary don’t perceive burglary directly; they do not feel minor earthquakes.
Note: there is no info about loud music or the telephone ringing and confusing John. This is summarised in the uncertainty in the links from Alarm to JohnCalls and MaryCalls.
Once the topology is specified, we need to specify a conditional probability table (CPT) for each node.
– Each row contains the conditional probability of each node value for a conditioning case.
– Each row must sum to 1.
– A table for a Boolean variable with $n$ Boolean parents contains $2^{n+1}$ probabilities.
– A node with no parents has one row (the prior probabilities).

Earthquake Example: Network Structure
[Figure: Burglary → Alarm ← Earthquake; Alarm → JohnCalls; Alarm → MaryCalls]

P(B) = 0.001, P(E) = 0.002

| B | E | $P(A \mid B, E)$ |
|---|---|---|
| T | T | 0.95 |
| T | F | 0.94 |
| F | T | 0.29 |
| F | F | 0.001 |

| A | $P(J \mid A)$ |
|---|---|
| T | 0.90 |
| F | 0.05 |

| A | $P(M \mid A)$ |
|---|---|
| T | 0.70 |
| F | 0.01 |


Semantics of Bayesian Networks
A (more compact) representation of the joint probability distribution.
– helpful in understanding how to construct the network
Encoding a collection of conditional independence statements.
– helpful in understanding how to design inference procedures

Representing the joint probability distribution
$$P(X_1 = x_1, X_2 = x_2, \ldots, X_n = x_n) = P(x_1, x_2, \ldots, x_n)$$
$$= P(x_1) \times P(x_2 \mid x_1) \times \cdots \times P(x_n \mid x_1 \wedge \cdots \wedge x_{n-1})$$
$$= \prod_i P(x_i \mid x_1 \wedge \cdots \wedge x_{i-1})$$
$$= \prod_i P(x_i \mid \pi(X_i))$$
where $\pi(X_i)$ denotes the parents of $X_i$.
Example:
$$P(J \wedge M \wedge A \wedge \neg B \wedge \neg E) = P(J \mid A)\,P(M \mid A)\,P(A \mid \neg B \wedge \neg E)\,P(\neg B)\,P(\neg E)$$
$$= 0.9 \times 0.7 \times 0.001 \times 0.999 \times 0.998 \approx 0.00063$$
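The factorisation above can be evaluated directly from the CPTs. This is a small sketch under stated assumptions: it takes P(B) = 0.001 and P(E) = 0.002, the priors implied by the factors 0.999 and 0.998 in the worked example, and the helper names `bern` and `joint` are hypothetical.

```python
# CPT entries for the earthquake network, keyed by parent values.
P_B = 0.001
P_E = 0.002
P_A = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}
P_J = {True: 0.90, False: 0.05}
P_M = {True: 0.70, False: 0.01}


def bern(p_true, value):
    """Probability that a Boolean variable takes `value`, given P(True)."""
    return p_true if value else 1.0 - p_true


def joint(b, e, a, j, m):
    """P(B=b, E=e, A=a, J=j, M=m) as a product of one CPT entry per node."""
    return (bern(P_B, b) * bern(P_E, e) * bern(P_A[(b, e)], a)
            * bern(P_J[a], j) * bern(P_M[a], m))


p = joint(b=False, e=False, a=True, j=True, m=True)
print(round(p, 6))
```

Five CPT lookups replace a lookup in a 32-entry joint table, which is the compactness the network representation provides.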


Network Construction
1. Choose the set of relevant variables $X_i$ that describe the domain.
2. Choose an ordering for the variables.
3. While there are variables left:
(a) Pick a variable $X_i$ and add a node to the network for it.
(b) Set $\pi(X_i)$ (the parents of $X_i$) to some minimal set of nodes already in the net such that the conditional independence property is satisfied:
$$P(X_i \mid X_{i-1}, \ldots, X_1) = P(X_i \mid \pi(X_i))$$
(c) Define the CPT for $X_i$.

Compactness and Node Ordering
Compactness of a BN is an example of a locally structured (or sparse) system.
The correct order to add nodes is to add the “root causes” first, then the variables they influence, and so on until the “leaves” are reached.
Examples of wrong ordering (which still represent the same joint distribution):
1. MaryCalls, JohnCalls, Alarm, Burglary, Earthquake.
[Figure: network built with the ordering MaryCalls, JohnCalls, Alarm, Burglary, Earthquake]


Compactness and Node Ordering (cont.)
2. MaryCalls, JohnCalls, Earthquake, Burglary, Alarm.
[Figure: network built with the ordering MaryCalls, JohnCalls, Earthquake, Burglary, Alarm]
More probabilities than the full joint! See below for why.

Conditional Independence: Causal Chains
Causal chains give rise to conditional independence:
[Figure: A → B → C]
$$P(C \mid A \wedge B) = P(C \mid B)$$
Example: A = Jack’s flu, B = severe cough, C = Jill’s flu


Conditional Independence: Common Causes
Common causes (or ancestors) also give rise to conditional independence:
[Figure: A ← B → C]
$$P(C \mid A \wedge B) = P(C \mid B)$$
Example: A = Jack’s flu, B = Joe’s flu, C = Jill’s flu

Conditional Dependence: Common Effects
Common effects (or their descendants) give rise to conditional dependence:
[Figure: A → B ← C]
$$P(A \mid C \wedge B) \neq P(A \mid B)$$
Example: A = flu, B = severe cough, C = tuberculosis
Given a severe cough, flu “explains away” tuberculosis.

D-separation
Graph-theoretic criterion of conditional independence.
We can determine whether a set of nodes X is independent of another set Y, given a set of evidence nodes E, i.e., $X \perp\!\!\!\perp Y \mid E$.
Earthquake example
[Figure: Burglary → Alarm ← Earthquake; Alarm → JohnCalls; Alarm → MaryCalls]
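One standard way to test d-separation mechanically is the moralised ancestral graph criterion, which is equivalent to the path-blocking definition. The sketch below is an illustration, not the tutorial’s own procedure; the function name `d_separated` and the `parents` dictionary encoding are hypothetical choices. It keeps only the relevant nodes and their ancestors, links co-parents, deletes the evidence nodes, and checks whether X can still reach Y.

```python
def d_separated(parents, xs, ys, zs):
    """True if every node in xs is d-separated from every node in ys given zs.

    `parents` maps each node of a DAG to the list of its parents.
    """
    xs, ys, zs = set(xs), set(ys), set(zs)

    # 1. Keep only the query/evidence nodes and their ancestors.
    keep, stack = set(), list(xs | ys | zs)
    while stack:
        node = stack.pop()
        if node not in keep:
            keep.add(node)
            stack.extend(parents[node])

    # 2. Moralise: link each node to its parents and link co-parents.
    nbrs = {n: set() for n in keep}
    for child in keep:
        ps = list(parents[child])
        for p in ps:
            nbrs[child].add(p)
            nbrs[p].add(child)
        for i, p in enumerate(ps):
            for q in ps[i + 1:]:
                nbrs[p].add(q)
                nbrs[q].add(p)

    # 3. Delete the evidence nodes and test reachability from xs to ys.
    seen, stack = set(), [x for x in xs if x not in zs]
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        stack.extend(nbrs[node] - zs)
    return not (seen & ys)


earthquake = {
    "Burglary": [], "Earthquake": [],
    "Alarm": ["Burglary", "Earthquake"],
    "JohnCalls": ["Alarm"], "MaryCalls": ["Alarm"],
}
print(d_separated(earthquake, {"Burglary"}, {"Earthquake"}, set()))
print(d_separated(earthquake, {"Burglary"}, {"Earthquake"}, {"Alarm"}))
print(d_separated(earthquake, {"JohnCalls"}, {"MaryCalls"}, {"Alarm"}))
```

In the earthquake network, Burglary and Earthquake are separated with no evidence but become dependent once Alarm (or one of its descendants) is observed, while observing Alarm separates JohnCalls from MaryCalls.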

Causal Ordering
Why does variable order affect network density? Because
– Using the causal order allows direct representation of conditional independencies
– Violating causal order requires new arcs to re-establish conditional independencies


Causal Ordering (cont’d)
[Figure: Flu → Cough ← TB]
Flu and TB are marginally independent.
Given the ordering: Cough, Flu, TB:
[Figure: Cough → Flu, Cough → TB]
Marginal independence of Flu and TB must be re-established by adding Flu → TB or Flu ← TB.

Inference in Bayesian Networks
Basic task for any probabilistic inference system: Compute the posterior probability distribution for a set of query variables, given values for some evidence variables.
Also called Belief Updating.
Types of Inference:
[Figure: four panels showing query (Q) and evidence (E) nodes: Diagnostic, Causal, Intercausal (Explaining Away), Mixed]


Kinds of Inference
Diagnostic inferences: from effect to causes.
P(Burglary|JohnCalls)
Causal inferences: from causes to effects.
P(JohnCalls|Burglary)
P(MaryCalls|Burglary)
Intercausal inferences: between causes of a common effect.
P(Burglary|Alarm)
P(Burglary|Alarm ∧ Earthquake)
Mixed inference: combining two or more of the above.
P(Alarm|JohnCalls ∧ ¬Earthquake)
P(Burglary|JohnCalls ∧ ¬Earthquake)

Inference Algorithms: Overview
Exact inference
– Trees and polytrees: message-passing algorithm
– Multiply-connected networks: Clustering
Approximate inference
– Large, complex networks: Stochastic Simulation
– Other approximation methods
In the general case, both sorts of inference are computationally complex (“NP-hard”).
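For a network as small as the earthquake example, each kind of inference listed above can be checked by brute-force enumeration of the joint distribution. The sketch below is illustrative only and is not one of the algorithms named in the overview: it is exponential in the number of nodes. It assumes P(B) = 0.001 and P(E) = 0.002 together with the CPT values of the earthquake example; the helper names `bern`, `joint` and `query` are hypothetical.

```python
from itertools import product

P_B, P_E = 0.001, 0.002
P_A = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}
P_J = {True: 0.90, False: 0.05}
P_M = {True: 0.70, False: 0.01}
NAMES = ("B", "E", "A", "J", "M")


def bern(p_true, value):
    """Probability that a Boolean variable takes `value`, given P(True)."""
    return p_true if value else 1.0 - p_true


def joint(w):
    """Joint probability of one complete assignment w (a dict over NAMES)."""
    return (bern(P_B, w["B"]) * bern(P_E, w["E"])
            * bern(P_A[(w["B"], w["E"])], w["A"])
            * bern(P_J[w["A"]], w["J"]) * bern(P_M[w["A"]], w["M"]))


def query(var, evidence):
    """P(var = True | evidence), summing the joint over consistent worlds."""
    num = den = 0.0
    for values in product((True, False), repeat=len(NAMES)):
        w = dict(zip(NAMES, values))
        if any(w[k] != v for k, v in evidence.items()):
            continue
        p = joint(w)
        den += p
        if w[var]:
            num += p
    return num / den


diagnostic = query("B", {"J": True})             # P(Burglary | JohnCalls)
causal = query("J", {"B": True})                 # P(JohnCalls | Burglary)
alarm_only = query("B", {"A": True})             # P(Burglary | Alarm)
intercausal = query("B", {"A": True, "E": True}) # P(Burglary | Alarm, Earthquake)
mixed = query("A", {"J": True, "E": False})      # P(Alarm | JohnCalls, no Earthquake)
print(diagnostic, causal, alarm_only, intercausal, mixed)
```

With these numbers P(Burglary|Alarm) is roughly 0.37 but falls to roughly 0.003 once Earthquake is also known, which is the explaining-away effect of intercausal inference.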

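Message Passing Example
[Figure: the earthquake network extended with a PhoneRings node: Burglary → Alarm ← Earthquake; PhoneRings → JohnCalls ← Alarm; Alarm → MaryCalls]

P(B) = 0.001, P(E) = 0.002, P(Ph) = 0.05

| B | E | P(A) |
|---|---|---|
| T | T | 0.95 |
| T | F | 0.94 |
| F | T | 0.29 |
| F | F | 0.001 |

| Ph | A | P(J) |
|---|---|---|
| T | T | 0.95 |
| T | F | 0.5 |
| F | T | 0.90 |
| F | F | 0.01 |

| A | P(M) |
|---|---|
| T | 0.70 |
| F | 0.01 |

Values shown at the nodes:
– $\pi(B) = (.001, .999)$, $\lambda(B) = (1, 1)$, $bel(B) = (.001, .999)$
– $\pi(E) = (.002, .998)$, $\lambda(E) = (1, 1)$, $bel(E) = (.002, .998)$
– $\pi(Ph) = (.05, .95)$, $\lambda(Ph) = (1, 1)$, $bel(Ph) = (.05, .95)$
– $\lambda(J) = (1, 1)$
– $\lambda(M) = (1, 0)$
Messages passed along the arcs: $\lambda_A(B)$, $\pi_A(B)$, $\lambda_A(E)$, $\pi_A(E)$, $\lambda_J(Ph)$, $\pi_J(Ph)$, $\lambda_J(A)$, $\pi_J(A)$, $\lambda_M(A)$, $\pi_M(A)$

Inference in multiply connected networks
Networks where two nodes are connected by more than one path
– Two or more possible causes which share a common ancestor
– One variable can influence another through more than one causal mechanism
Example: Cancer network
[Figure: A = Metastatic cancer, B = Increased total serum calcium, C = Brain tumour, D = Coma, E = Severe headaches; arcs A → B, A → C, B → D, C → D, C → E]
Message passing doesn’t work: evidence gets “counted twice”.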

Clustering methods
Transform the network into a probabilistically equivalent polytree by merging (clustering) offending nodes.
Cancer example: new node Z combining B and C
[Figure: A → Z = B,C; Z → D; Z → E]
$$P(z \mid a) = P(b, c \mid a) = P(b \mid a)\,P(c \mid a)$$
$$P(e \mid z) = P(e \mid b, c) = P(e \mid c)$$
$$P(d \mid z) = P(d \mid b, c)$$

Clustering methods (cont.)
The Jensen join-tree version (Jensen, 1996) is the current most efficient algorithm in this class (e.g. used in Hugin, Netica).
Network evaluation is done in two stages:
– Compile into join-tree. May be slow. May require too much memory if the original network is highly connected.
– Do belief updating in join-tree (usually fast).
Caveat: clustered nodes have increased complexity; updates may be computationally complex.

Approximate inference with stochastic simulation
– Use the network to generate a large number of cases that are consistent with the network distribution.
– Evaluation may not converge to exact values (in reasonable time).
– Usually converges to close to the exact solution quickly if the evidence is not too unlikely.
– Performs better when evidence is nearer to root nodes; however, in real domains, evidence tends to be near leaves (Nicholson & Jitnah, 1998).
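Likelihood weighting is one common stochastic simulation scheme and gives a feel for the behaviour described above. The sketch below is an assumption-laden illustration rather than the specific method the tutorial has in mind: it reuses the earthquake network with P(B) = 0.001 and P(E) = 0.002, fixes the evidence JohnCalls = MaryCalls = true, and the function name `likelihood_weighting` and the sample size are arbitrary choices.

```python
import random

P_B, P_E = 0.001, 0.002
P_A = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}
P_J = {True: 0.90, False: 0.05}
P_M = {True: 0.70, False: 0.01}


def likelihood_weighting(n_samples, rng):
    """Estimate P(Burglary | JohnCalls = T, MaryCalls = T)."""
    weighted_b = total = 0.0
    for _ in range(n_samples):
        # Sample the non-evidence nodes in topological order.
        b = rng.random() < P_B
        e = rng.random() < P_E
        a = rng.random() < P_A[(b, e)]
        # Evidence nodes are not sampled; each case is weighted by the
        # probability of the observed evidence given its sampled parents.
        w = P_J[a] * P_M[a]
        total += w
        if b:
            weighted_b += w
    return weighted_b / total


estimate = likelihood_weighting(200_000, random.Random(0))
print(estimate)
```

The exact answer for this query is about 0.284. Because the evidence sits at the leaves and is unlikely a priori, only a few hundred of the sampled cases carry most of the weight, so the estimate is still noisy at this sample size.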

Making Decisions
Bayesian networks can be extended to support decision making.
Preferences between different outcomes of various plans.
– Utility theory
Decision theory = Utility theory + Probability theory.


Decision Networks
A Decision network represents information about
– the agent’s current state
– its possible actions
– the state that will result from the agent’s action
– the utility of that state
Also called Influence Diagrams (Howard & Matheson, 1981).

Type of Nodes
Chance nodes: (ovals) represent random variables (same as Bayesian networks). Each has an associated CPT. Parents can be decision nodes and other chance nodes.
Decision nodes: (rectangles) represent points where the decision maker has a choice of actions.
Utility nodes: (diamonds) represent the agent’s utility function (also called value nodes in the literature). Parents are variables describing the outcome state that directly affect utility. Each has an associated table representing a multi-attribute utility function.


Example: Umbrella
[Figure: decision network with chance nodes Weather and Forecast, decision node Take Umbrella, and utility node U]
P(Weather = Rain) = 0.3
P(Forecast = Rainy | Weather = Rain) = 0.60
P(Forecast = Cloudy | Weather = Rain) = 0.25
P(Forecast = Sunny | Weather = Rain) = 0.15
P(Forecast = Rainy | Weather = NoRain) = 0.1
P(Forecast = Cloudy | Weather = NoRain) = 0.2
P(Forecast = Sunny | Weather = NoRain) = 0.7
U(NoRain, TakeUmbrella) = 20
U(NoRain, LeaveAtHome) = 100
U(Rain, TakeUmbrella) = 70
U(Rain, LeaveAtHome) = 0

Evaluating Decision Networks: Algorithm
1. Set the evidence variables for the current state.
2. For each possible value of the decision node
(a) Set the decision node to that value.
(b) Calculate the posterior probabilities for the parent nodes of the utility node (as for BNs).
(c) Calculate the resulting (expected) utility for the action.
3. Return the action with the highest expected utility.
Simple for a single decision, less so when executing several actions in sequence (i.e. a plan).
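The evaluation algorithm can be traced on the umbrella example with a few lines of code. This is a minimal sketch using the probabilities and utilities listed for that example; the names `weather_posterior` and `best_action` are hypothetical, and the action called TakeIt in one utility entry is treated as TakeUmbrella.

```python
P_RAIN = 0.3
P_FORECAST = {  # P(Forecast | Weather)
    "Rain": {"Rainy": 0.60, "Cloudy": 0.25, "Sunny": 0.15},
    "NoRain": {"Rainy": 0.10, "Cloudy": 0.20, "Sunny": 0.70},
}
UTILITY = {
    ("NoRain", "TakeUmbrella"): 20, ("NoRain", "LeaveAtHome"): 100,
    ("Rain", "TakeUmbrella"): 70, ("Rain", "LeaveAtHome"): 0,
}
ACTIONS = ("TakeUmbrella", "LeaveAtHome")


def weather_posterior(forecast=None):
    """Step 1 and 2(b): P(Weather | Forecast), or the prior if no forecast."""
    prior = {"Rain": P_RAIN, "NoRain": 1 - P_RAIN}
    if forecast is None:
        return prior
    joint = {w: prior[w] * P_FORECAST[w][forecast] for w in prior}
    z = sum(joint.values())
    return {w: p / z for w, p in joint.items()}


def best_action(forecast=None):
    """Steps 2(c) and 3: expected utility per action and the best action."""
    post = weather_posterior(forecast)
    eu = {a: sum(post[w] * UTILITY[(w, a)] for w in post) for a in ACTIONS}
    return max(eu, key=eu.get), eu


for forecast in (None, "Rainy", "Cloudy", "Sunny"):
    print(forecast, best_action(forecast))
```

In this example the umbrella does not influence the weather, so the posterior in step 2(b) is the same for both actions; in general it has to be recomputed for each value of the decision node.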

Dynamic Belief Networks
[Figure: time slices State t-2, State t-1, State t, State t+1, State t+2 linked by the state evolution model; each State has an observation node (Obs t-2 to Obs t+2) linked by the sensor model]
The values of state variables at time $t$ depend only on the values at $t - 1$.
Can calculate distributions for $S_{t+1}$ and further: probabilistic projection.
Can be done using standard BN updating algorithms.
This type of DBN gets very large, very quickly.
Usually only keep two time slices of the network.

Dynamic Decision Network
Similarly, Decision Networks can be extended to include temporal aspects.
Sequence of decisions taken = Plan.
[Figure: State t to State t+3, each with an observation node (Obs t to Obs t+3) and a decision node ($D_t$, $D_{t+1}$, ...), and a utility node $U_{t+3}$]


Uses of Bayesian Networks
1. Calculating the belief in query variables given values for evidence variables (above).
2. Predicting values in dependent variables given values for independent variables.
3. Decision making based on probabilities in the network and on the agent’s utilities (Influence Diagrams [Howard and Matheson 1981]).
4. Deciding which additional evidence should be observed in order to gain useful information.
5. Sensitivity analysis to test impact of changes in probabilities or utilities on decisions.

Bayesian Networks: Summary
Bayes’ rule allows unknown probabilities to be computed from known ones.
Conditional independence (due to causal relationships) allows efficient updating.
Bayesian networks are a natural way to represent conditional independence info.
– links between nodes: qualitative aspects;
– conditional probability tables: quantitative aspects.
Inference means computing the probability distribution for a set of query variables, given a set of evidence variables.
Inference in Bayesian networks is very flexible: can enter evidence about any node and update beliefs in any other nodes.
The speed of inference in practice depends on the structure of the network: how many loops; numbers of parents; location of evidence and query nodes.
Bayesian networks can be extended with decision nodes and utility nodes to support decision making: Decision Networks or Influence Diagrams.
Bayesian and Decision networks can be extended to allow explicit reasoning about changes over time.

Applications: Overview
(Simple) Example Networks
Applications
– Medical Decision Making: Survey of applications
– Planning and Plan Recognition
– Natural Language Generation (NAG)
– Bayesian poker
Deployed Bayesian Networks (See Handout for details)
BN Software
Web Resources


Example: Cancer
Metastatic cancer is a possible cause of a brain tumor and is also an explanation for increased total serum calcium. In turn, either of these could explain a patient falling into a coma. Severe headache is also possibly associated with a brain tumor. (Example from (Pearl, 1988).)
[Figure: A = Metastatic cancer, B = Increased total serum calcium, C = Brain tumour, D = Coma, E = Severe headaches; arcs A → B, A → C, B → D, C → D, C → E]
$P(a) = 0.2$
$P(b \mid a) = 0.80$, $P(b \mid \neg a) = 0.20$
$P(c \mid a) = 0.20$, $P(c \mid \neg a) = 0.05$
$P(d \mid b, c) = 0.80$, $P(d \mid \neg b, c) = 0.80$
$P(d \mid b, \neg c) = 0.80$, $P(d \mid \neg b, \neg c) = 0.05$
$P(e \mid c) = 0.80$, $P(e \mid \neg c) = 0.60$

Example: Asia
A patient presents to a doctor with shortness of breath. The doctor considers that possible causes are tuberculosis, lung cancer and bronchitis. Other relevant information is whether the patient has recently visited Asia (where tuberculosis is more prevalent) and whether or not the patient is a smoker (which increases the chances of cancer and bronchitis). A positive X-ray would indicate either TB or lung cancer. (Example from (Lauritzen, 1988).)
[Figure: network with nodes visit to Asia, smoking, tuberculosis, lung cancer, bronchitis, either tub or lung cancer, positive X-ray, dyspnoea]


Example: A Lecturer’s Life
Dr. Ann Nicholson spends 60% of her work time in her office. The rest of her work time is spent elsewhere. When Ann is in her office, half the time her light is off (when she is trying to hide from students and get some real work done). When she is not in her office, she leaves her light on only 5% of the time. 80% of the time she is in her office, Ann is logged onto the computer. Because she sometimes logs onto the computer from home, 10% of the time she is not in her office, she is still logged onto the computer.
Suppose a student checks Dr. Nicholson’s login status and sees that she is logged on. What effect does this have on the student’s belief that Dr. Nicholson’s light is on? (Example from (Nicholson, 1999))
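The question can be answered by summing over the hidden in-office variable. The sketch below is one way to do the calculation and assumes the obvious structure in which both the light and the login status depend only on whether she is in her office; the names `P_IN`, `P_LIGHT`, `P_LOGGED` and `p_light_on` are hypothetical.

```python
P_IN = 0.6
P_LIGHT = {True: 0.5, False: 0.05}   # P(light on | in office?)
P_LOGGED = {True: 0.8, False: 0.1}   # P(logged on | in office?)


def p_light_on(logged_on=None):
    """P(light on), optionally conditioned on the observed login status."""
    num = den = 0.0
    for in_office in (True, False):
        p = P_IN if in_office else 1 - P_IN
        if logged_on is not None:
            p_log = P_LOGGED[in_office]
            p *= p_log if logged_on else 1 - p_log
        den += p
        num += p * P_LIGHT[in_office]
    return num / den


print(p_light_on())       # belief before checking the login status
print(p_light_on(True))   # belief after seeing that she is logged on
```

Seeing that she is logged on raises the probability that she is in her office from 0.6 to about 0.92, and with it the belief that the light is on from 0.32 to about 0.47.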

[Figure (A Lecturer’s Life): network with nodes in-office, lights-on, logged-on]

Probabilistic reasoning in medicine
See handout from (Dean et al., 1993).
Simplest tree-structured network for diagnostic reasoning
– H = disease hypothesis; F = findings (symptoms, test results)
Multiply-connected network (QMR structure)
– B = background information (e.g. age, sex of patient)


Medical Applications
Pathfinder case study: see handout using material from (Russell & Norvig, 1995, pp. 457-458).
QMR (Quick Medical Reference): 600 diseases, 4,000 findings, 40,000 arcs (Dean & Wellman, 1991).
MUNIN (Andreassen et al., 1989): neuromuscular disorders, about 1000 nodes; exact computation < 5 seconds.
Glucose prediction and insulin dose adjustment (DBN application) (Andreassen et al., 1991).
CPCS project (Pradhan et al., 1994)
– 448 nodes, 906 links, 8254 conditional probability values
– LW algorithm: answers in 35 mins (1994)

Application of LW to medical diagnosis (Shwe & Cooper, 1990).
Forecasting sleep apnea (Dagum et al., 1993).
ALARM (Beinlich et al., 1989): 37 nodes, 42 arcs. (See Netica examples.)
[Figure: the ALARM network; each node is labelled with its name and number of states, e.g. MinVolSet (3), Ventmach (4), Disconnect (2), PulmEmbolus (2), Intubation (3)]

Example of a Dynamic Decision Network (Dean & Wellman, 1991).


Plan Recognition Applications
Keyhole plan recognition in an Adventure game (Albrecht et al., 1998).
[Figure: four DBN structures over nodes A0 to A3, L0 to L3, Q and Q’: (a) mainModel, (b) indepModel, (c) actionModel, (d) locationModel]

Traffic Monitoring: BATmobile
(Forbes et al., 1995)
Example of a DBN


Natural Language Generation
NAG (McConachy et al., 1999), A Nice Argument Generator, uses two Bayesian networks to generate and assess natural language arguments:
– Normative Model: Represents our best understanding of the domain; proper (constrained) Bayesian updating, given premises.
– User Model: Represents our best understanding of the human; Bayesian updating modified to reflect human biases (e.g., overconfidence; Korb, McConachy, Zukerman, 1997).
BNs are embedded in a semantic hierarchy
– supports attentional modeling
– constrained updating
[Figure: semantic hierarchy. Top: higher level concepts like ‘motivation’ or ‘ability’; below: lower level concepts like ‘Grade Point Average’; layers: semantic network (2nd layer), semantic network (1st layer), Bayesian network, whose nodes are propositions, e.g., [publications authored by person X cited >5 times]]

Bayesian Poker
(Korb et al., 1999)
Poker is ideal for testing automated reasoning under uncertainty
– Physical randomisation
– Incomplete hand information
– Incomplete opponent info (strategies, bluffing, etc)
Bayesian networks are a good representation for complex game playing.
Our Bayesian Poker Player (BPP) plays 5-card stud poker at the level of a good amateur human player.
To play: telnet indy13.cs.monash.edu.au, login: poker, password: maverick

Bayesian Poker BN
Bayesian network provides an estimate of winning at any point in the hand.
Betting curves based on pot-odds used to determine action (bet/call, pass or fold).
[Figure: network with nodes BPP Win, OPP Final, BPP Final, OPP Current, BPP Current, OPP Action, OPP Upcards; arcs labelled with the matrices $M_{C|F}$ (twice), $M_{A|C}$ and $M_{U|C}$]


Bayesian Poker BN (cont.)
Different networks (matrices) for each round.
OPP Current, BPP Current: (partial) hand types with cards dealt so far.
OPP Final, BPP Final: hand types after all 5 cards dealt.
Observation nodes:
– OPP Upcards: All opponent’s cards except first are visible to BPP.
– OPP Action: BPP knows opponent’s action.

Bayesian Poker BN (cont.)
Hand Types
Initial 9 hand types too coarse. We use a finer granularity for most common hands (busted and a pair):
– low, medium, Q-high, K-high, A-high
– results in 17 hand-types
Conditional Probability Matrices
– $M_{A|C}$: probability of opponent’s action given current hand type, learned from observed showdown data.
– $M_{U|C}$ and $M_{C|F}$: estimated by dealing out $10^7$ poker hands.
Belief Updating: Since the network is a polytree, a simple fast propagation updating algorithm is used.

Current Status, Possible Extensions
BPP outperforms automated opponents, is fairly even with average amateur humans, and loses to experienced humans.
Learning the OPP Action CPTs does not (yet) appear to improve performance.
BN Improvements
– Refine action nodes
– Further refinement of hand types
– Improve network structure
– Adding bluffing to the opponent model
– Improved learning of opponent model
More complex poker: multi-opponent games, table stake games.
DBN model to represent changes over time

Deployed BNs
From Web Site database: See handout for details.
– TRACS: Predicting reliability of military vehicles.
– Andes: intelligent tutoring system for physics.
– Distributed Virtual Agents advising online users on web sites.
– Information extraction from natural language text.
– DXPLAIN: decision support for medical diagnosis.
– Iliad: teaching tool for medical students.
– Microsoft Health Product: “find by symptom” feature.

– Weapons scheduling.
– Monitoring power generation.
– Processor fault diagnosis.
– Knowledge Industries applications: (a) in medicine, sleep disorders, pathology, trauma care, hand and wrist evaluations, dermatology, and home-based health evaluations; (b) in capital equipment, locomotives, gas-turbine engines for aircraft and land-based power production, the space shuttle, and office equipment.
– Software debugging.
– Vista: decision support system used at NASA Mission Control Center.
– MS: (a) Answer Wizard (Office 95), Information retrieval; (b) Print Troubleshooter; (c) Aladdin, troubleshooting customer support.

BN Software: Issues
Functionality
– Especially application vs API
Price
– Many free for demo versions or educational use
– Commercial licence costs
Availability (platforms)
Quality
– GUI
– Documentation and Help
Leading edge
Robustness
– software
– company

BN Software
Analytica: www.lumina.com
Hugin: www.hugin.com
Netica: www.norsys.com
Above 3 available during tutorial lab session.
JavaBayes: http://www.cs.cmu.edu/~javabayes/Home/
Many other packages (see next slide)

Web Resources
Bayesian Belief Network site (Russell Greiner): www.cs.ualberta.ca/~greiner/bn.html
Bayesian Network Repository (Nir Friedman): www-nt.cs.berkeley.edu/home/nir/public_html/Repository/index.htm
Summary of BN software and links to software sites (Kevin Murphy): HTTP.CS.Berkeley.EDU/~murphyk/Bayes/bnsoft.html


Applications: Summary
Various BN structures are available to compactly and accurately represent certain types of domain features.
Bayesian networks have been used for a wide range of AI applications.
Robust and easy to use Bayesian network software is now readily available.

Learning Bayesian Networks
Linear and Discrete Models
Learning Network Parameters
– Linear Coefficients
– Learning Probability Tables
Learning Causal Structure
Conditional Independence Learning
– Statistical Equivalence
– TETRAD II
Bayesian Learning of Bayesian Networks
– Cooper & Herskovits: K2
– Learning Variable Order
– Statistical Equivalence Learners
Full Causal Learners
Minimum Encoding Methods
– Lam & Bacchus’s MDL learner
– MML metrics
– MML search algorithms
– MML Sampling
Empirical Results


Linear and Discrete Models
Linear Models: Used in biology & social sciences since Sewall Wright (1921).
Linear models represent causal relationships as sets of linear functions of “independent” variables.
[Figure: X1 → X3 ← X2]
Equivalently (assuming linear parameters):
$$X_3 = a_{13} X_1 + a_{23} X_2 + \epsilon_1$$
Discrete models: “Bayesian nets” replace vectors of linear coefficients with CPTs.

Learning Linear Parameters
Maximum likelihood methods have been available since Wright’s path model analysis (1921).
Equivalent methods:
– Simon-Blalock method (Simon, 1954; Blalock, 1964)
– Ordinary least squares multiple regression (OLS)


Learning Conditional Probability Tables
Spiegelhalter & Lauritzen (1990):
– assume parameter independence
– each CPT cell $i$ = a parameter in a Dirichlet distribution $D[\alpha_1, \ldots, \alpha_i, \ldots, \alpha_K]$ for K parents
– prob of outcome $i$ is $\alpha_i / \sum_{k=1}^{K} \alpha_k$
– observing outcome $i$, update D to $D[\alpha_1, \ldots, \alpha_i + 1, \ldots, \alpha_K]$
Others are looking at learning without parameter independence. E.g.,
– Decision trees to learn structure within CPTs (Boutillier et al. 1996).
– Dual log-linear and full CPT models (Neil, Wallace, Korb 1999).
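The Dirichlet update rule described for CPT learning amounts to keeping a count per cell and incrementing the count of whatever is observed. The sketch below illustrates this for a single CPT row; the class name `DirichletRow` and the uniform starting counts are hypothetical choices, not something specified in the tutorial.

```python
class DirichletRow:
    """One CPT row with Dirichlet parameters alpha_1, ..., alpha_K."""

    def __init__(self, alphas):
        self.alphas = list(alphas)

    def prob(self, i):
        """Current estimate of outcome i: alpha_i / sum_k alpha_k."""
        return self.alphas[i] / sum(self.alphas)

    def observe(self, i):
        """Seeing outcome i replaces alpha_i with alpha_i + 1."""
        self.alphas[i] += 1


row = DirichletRow([1, 1, 1])      # uniform prior over three outcomes
before = row.prob(0)
for outcome in (0, 0, 2):          # a short stream of observed cases
    row.observe(outcome)
after = [row.prob(i) for i in range(3)]
print(before, after)
```

Larger initial counts make the prior harder to move, so the choice of starting values expresses how much weight prior knowledge carries relative to the data.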

Learning Causal Structure
This is the real problem; parameterizing models is essentially numerical computing. There are two basic methods:
– Learning from conditional independencies (CI learning)
– Learning using a scoring metric (Metric learning)

CI learning (Verma and Pearl, 1991)
Suppose you have an Oracle who can answer yes or no to any question of the type:

$X \perp\!\!\!\perp Y \mid S$?

Then you can learn the correct causal model, up to statistical equivalence.

Statistical Equivalence
Verma and Pearl’s rules identify the set of causal models which are statistically equivalent. Two causal models $H_1$ and $H_2$ are statistically equivalent iff they contain the same variables and joint samples over them provide no statistical grounds for preferring one over the other.
Examples:
– All fully connected models are equivalent.
– $A \to B \to C$ and $A \leftarrow B \leftarrow C$.
– $A \to B \to D \leftarrow C$ and $A \leftarrow B \to D \leftarrow C$.
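A small sketch of how statistical equivalence can be checked mechanically, using the graphical criterion associated with Verma and Pearl (1991) and Chickering (1995): same skeleton and same v-structures. It assumes both models are over the same variables and represents a dag as a set of (parent, child) pairs; the function names are illustrative only.

```python
from itertools import combinations


def skeleton(arcs):
    """Undirected version of the arcs."""
    return {frozenset(arc) for arc in arcs}


def v_structures(arcs):
    """Triples (a, child, b) where a -> child <- b and a, b are not adjacent."""
    skel = skeleton(arcs)
    parents = {}
    for parent, child in arcs:
        parents.setdefault(child, set()).add(parent)
    found = set()
    for child, pas in parents.items():
        for a, b in combinations(sorted(pas), 2):
            if frozenset((a, b)) not in skel:
                found.add((a, child, b))
    return found


def statistically_equivalent(arcs1, arcs2):
    """Same skeleton and same v-structures (variables assumed identical)."""
    return (skeleton(arcs1) == skeleton(arcs2)
            and v_structures(arcs1) == v_structures(arcs2))


chain = {("A", "B"), ("B", "C")}        # A -> B -> C
reverse = {("B", "A"), ("C", "B")}      # A <- B <- C
collider = {("A", "B"), ("C", "B")}     # A -> B <- C
print(statistically_equivalent(chain, reverse))
print(statistically_equivalent(chain, collider))
```

The two chains share a skeleton and have no v-structures, so they are reported as equivalent; the collider shares the skeleton but adds a v-structure, so it is not.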


Statistical Equivalence
Chickering (1995): Any two causal models over the same variables which have the same skeleton (undirected arcs) and the same directed v-structures are statistically equivalent.
If $H_1$ and $H_2$ are statistically equivalent, then they have the same maximum likelihoods relative to any joint samples:

$\max P(e \mid H_1, \theta_1) = \max P(e \mid H_2, \theta_2)$

where $\theta_i$ is a parameterization of $H_i$.

TETRAD II
Spirtes, Glymour and Scheines (1993)
Replace the Oracle with statistical tests:
– for linear models, a significance test on partial correlation:

$X \perp\!\!\!\perp Y \mid S$ iff $\rho_{XY \cdot S} = 0$

– for discrete models, a $\chi^2$ test on the difference between CPT counts expected with independence ($E_i$) and observed ($O_i$):

$X \perp\!\!\!\perp Y \mid S$ iff $2 \sum_i O_i \ln \frac{O_i}{E_i} \approx 0$
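To make the idea of replacing the Oracle with a significance test on partial correlation concrete, here is a minimal sketch for linear (Gaussian) data. It uses the Fisher z-transform, which is one common choice and not necessarily the test TETRAD II itself applies; the function names, the 0.05 threshold and the synthetic chain A -> B -> C are all assumptions.

```python
import math
import numpy as np


def partial_correlation(data, x, y, s):
    """Partial correlation of columns x and y given the columns listed in s."""
    cols = [x, y] + list(s)
    corr = np.corrcoef(data[:, cols], rowvar=False)
    precision = np.linalg.inv(corr)
    return -precision[0, 1] / math.sqrt(precision[0, 0] * precision[1, 1])


def independent(data, x, y, s, alpha=0.05):
    """Answer 'is X independent of Y given S?' with a Fisher z-test."""
    n = data.shape[0]
    r = partial_correlation(data, x, y, s)
    r = max(min(r, 0.999999), -0.999999)          # guard the log below
    z = 0.5 * math.log((1 + r) / (1 - r)) * math.sqrt(n - len(s) - 3)
    p_value = math.erfc(abs(z) / math.sqrt(2))     # two-sided normal p-value
    return p_value > alpha


# Synthetic chain A -> B -> C (columns 0, 1, 2).
rng = np.random.default_rng(1)
a = rng.normal(size=2000)
b = a + rng.normal(size=2000)
c = b + rng.normal(size=2000)
data = np.column_stack([a, b, c])

print("A indep C?      ", independent(data, 0, 2, []))
print("A indep C | B?  ", independent(data, 0, 2, [1]))
```

As with any significance test, the answer for a true independence is wrong in roughly a fraction alpha of samples, which is one reason such tests become unreliable with weak links or small samples.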


TETRAD II
Asymptotically finds causal structure to within the statistical equivalence class of the true model.
Requires larger sample sizes than MML (Dai, Korb, Wallace & Wu, 1997): Statistical tests are not robust given weak causal interactions and/or small samples.
Cheap, and easy to use.

Bayesian LBN: Cooper & Herskovits
Cooper & Herskovits (1991, 1992)
Compute $P(h_i \mid e)$ by brute force, under the assumptions:
1. All variables are discrete.
2. Samples are i.i.d.
3. No missing values.
4. All values of child variables are uniformly distributed.
5. Priors over hypotheses are uniform.
With these assumptions, Cooper & Herskovits reduce the computation of $P_{CH}(h, e)$ to a polynomial time counting problem.
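For reference, the counting formula usually quoted from Cooper & Herskovits (1992) is reproduced below. The notation is theirs rather than the slides': $r_i$ is the number of states of variable $X_i$, $q_i$ the number of joint states of its parents in $h$, $N_{ijk}$ the number of samples with $X_i$ in state $k$ and its parents in configuration $j$, and $N_{ij} = \sum_k N_{ijk}$.

```latex
P_{CH}(h, e) \;=\; P(h) \prod_{i=1}^{n} \prod_{j=1}^{q_i}
  \frac{(r_i - 1)!}{(N_{ij} + r_i - 1)!} \prod_{k=1}^{r_i} N_{ijk}!
```

Every factor depends only on counts read off the sample, which is why the computation is polynomial in the number of variables and samples.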


Cooper & Herskovits
But the hypothesis space is exponential; they go for dramatic simplification:
6. Assume we know the temporal ordering of the variables.
Now for any pair of variables: either they are connected by an arc or they are not. Further, cycles are impossible.
New hypothesis space has size only $2^{(n^2 - n)/2}$ (still exponential).
Algorithm “K2” does a greedy search through this reduced space.

Learning Variable Order
Reliance upon a given variable order is a major drawback to K2, and to many other algorithms (Buntine 1991, Bouckaert 1994, Suzuki 1996, Madigan & Raftery 1994).
What’s wrong with that?
– We want autonomous AI (data mining). If experts can order the variables they can likely supply models.
– Determining variable ordering is half the problem. If we know A comes before B, the only remaining issue is whether there is a link between the two.
– The number of orderings consistent with dags is apparently exponential (Brightwell & Winkler 1990). So iterating over all possible orderings will not scale up.
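The greedy search that K2 performs once a variable order is given can be sketched as follows. This is an illustrative reconstruction, not the authors' code: it scores each family with the log of the Cooper & Herskovits counting metric (uniform structure prior), and the cap `max_parents`, the data layout (a list of dicts) and all function names are assumptions.

```python
from collections import Counter
from math import lgamma


def log_family_score(data, child, parents, arity):
    """Log CH score of one node given a candidate parent list."""
    r = arity[child]
    n_ij, n_ijk = Counter(), Counter()
    for row in data:
        j = tuple(row[p] for p in parents)
        n_ij[j] += 1
        n_ijk[(j, row[child])] += 1
    # log[(r-1)! / (N_ij + r - 1)!] for each observed parent configuration
    score = sum(lgamma(r) - lgamma(n + r) for n in n_ij.values())
    # log[N_ijk!] for each observed (configuration, state) pair
    score += sum(lgamma(n + 1) for n in n_ijk.values())
    return score


def k2(data, order, arity, max_parents=2):
    """Greedily add, for each node, the predecessor that most improves the score."""
    parents = {v: [] for v in order}
    for pos, child in enumerate(order):
        best = log_family_score(data, child, parents[child], arity)
        candidates = list(order[:pos])
        while candidates and len(parents[child]) < max_parents:
            scored = [(log_family_score(data, child, parents[child] + [c], arity), c)
                      for c in candidates]
            top_score, top = max(scored)
            if top_score <= best:
                break                    # no candidate improves the score
            parents[child].append(top)
            candidates.remove(top)
            best = top_score
    return parents


# Toy sample: B copies A, C is unrelated to both.
data = [{"A": a, "B": a, "C": c} for a in (0, 1) for c in (0, 1)] * 5
arity = {"A": 2, "B": 2, "C": 2}
result = k2(data, ["A", "B", "C"], arity)
print(result)
```

On this toy sample the search links A to B and leaves C without parents. Note that the outcome depends entirely on the order passed in, which is the limitation discussed above.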


Statistical Equivalence Learners
Heckerman & Geiger (1995) advocate learning only up to statistical equivalence classes (a la TETRAD II).
Since observational data cannot distinguish between equivalent models, there’s no point trying to go further.
⇒ Geiger and Heckerman (1994) define Bayesian metrics for linear and discrete equivalence classes of models (BGe and BDe)

Statistical Equivalence Learners
Wallace & Korb (1999): This is not right!
These are causal models; they are distinguishable on experimental data.
– Failure to collect some data is no reason to change prior probabilities. E.g., if your thermometer topped out at 35°, you wouldn’t treat 35° and 34° as equally likely.
Not all equivalence classes are created equal:
$\{A \leftarrow B \to C,\ A \to B \to C,\ A \leftarrow B \leftarrow C\}$
$\{A \to B \leftarrow C\}$
Within classes some dags should have greater priors than others. E.g.,
LightsOn → InOffice → LoggedOn v. LightsOn ← InOffice → LoggedOn


Full Causal Learners
So a full causal learner is an algorithm that:
1. Learns causal connectedness.
2. Learns v-structures. Hence, learns equivalence classes.
3. Learns full variable order. Hence, learns full causal structure (order + connectedness).

TETRAD II: 1, 2.
Madigan et al.: 1, 2.
Cooper & Herskovits’ K2: 1.
Lam and Bacchus MDL: 1, 2 (partial), 3 (partial).
Wallace, Neil, Korb MML: 1, 2, 3.

MDL
Minimum Description Length (MDL) inference
Invented by Rissanen (1978), based upon Minimum Message Length (MML), invented by Wallace (Wallace and Boulton, 1968).
Trades off
– model simplicity
– model fit to the data
by minimizing the length of a joint description of the model and of the data given the model.


Lam & Bacchus (1993)
MDL encoding of causal models:
Network:

$\sum_{i=1}^{n} \left[ k_i \log(n) + d (s_i - 1) \prod_{j \in \pi(i)} s_j \right]$

– $k_i \log(n)$ for specifying $k_i$ parents for the ith node
– $d (s_i - 1) \prod_{j \in \pi(i)} s_j$ for specifying the CPT: $d$ is the fixed bit-length per probability; $s_i$ is the number of states for node i

Data given network:

$N \sum_{i=1}^{n} H(X_i) - N \sum_{i=1}^{n} M(X_i; \pi(i))$

– $M(X_i; \pi(i))$ is mutual information between $X_i$ and its parent set
– $H(X_i)$ is entropy of variable $X_i$

(NB: This code is not efficient. E.g., treats every node as equally likely to be a parent; assumes knowledge of all $k_i$.)

Lam & Bacchus
Search algorithm:
– Initial constraints taken from domain expert: partial variable order, direct connections
– Greedy search: every possible arc addition is tested, best MDL measure used to add one (Note: no arcs are deleted)
– Local arcs checked for improved MDL via arc reversal
– Iterate until MDL fails to improve
⇒ Results similar to K2, but without full variable ordering
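The two parts of the Lam & Bacchus description length (network cost plus data-given-network cost) can be computed directly from a sample, as in the sketch below. It assumes base-2 logarithms, a hypothetical d = 8 bits per probability, empirical entropies, and a sample stored as a list of dicts; it uses the identity H(X) - M(X; parents) = H(X, parents) - H(parents).

```python
from collections import Counter
from math import log2


def entropy(data, cols):
    """Empirical joint entropy, in bits, of the given columns."""
    n = len(data)
    counts = Counter(tuple(row[c] for c in cols) for row in data)
    return -sum(c / n * log2(c / n) for c in counts.values())


def mdl_score(data, parents, arity, d=8):
    """Network bits + data bits for a dag given as {node: [parents]}."""
    n_vars, n_rows = len(parents), len(data)
    network_bits = data_bits = 0.0
    for child, pa in parents.items():
        cpt_rows = 1
        for p in pa:
            cpt_rows *= arity[p]
        # k_i log(n) for the parent list, d (s_i - 1) prod s_j for the CPT
        network_bits += len(pa) * log2(n_vars) + d * (arity[child] - 1) * cpt_rows
        # N [H(X_i) - M(X_i; parents)] = N [H(X_i, parents) - H(parents)]
        h_cond = entropy(data, [child] + list(pa)) - entropy(data, list(pa))
        data_bits += n_rows * h_cond
    return network_bits + data_bits


# Toy sample in which B is a copy of A.
data = [{"A": a, "B": a} for a in (0, 1)] * 10
arity = {"A": 2, "B": 2}
unlinked = mdl_score(data, {"A": [], "B": []}, arity)
linked = mdl_score(data, {"A": [], "B": ["A"]}, arity)
print(unlinked, linked)
```

Adding the arc A -> B costs extra bits for the larger CPT and the parent list but saves all 20 bits needed to encode B, so the linked model gets the shorter total description.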

MML
Minimum Message Length (Wallace & Boulton 1968) uses Shannon’s measure of information:

$I(m) = -\log P(m)$

Applied in reverse, we can compute $P(h, e)$ from $I(h, e)$.
Given an efficient joint encoding method for the hypothesis & evidence space (i.e., satisfying Shannon’s law), MML:
Searches $\{h_i\}$ for that hypothesis $h$ that minimizes $I(h) + I(e \mid h)$.
Equivalent to that $h$ that maximizes $P(h) P(e \mid h)$, i.e., $P(h \mid e)$.
The other significant difference from MDL: MML takes parameter estimation seriously.

MML Metric for Linear Models
Network:

$\log n! + \frac{n(n-1)}{2} - \log E$

– $\log n!$ for variable order
– $\frac{n(n-1)}{2}$ for connectivity
– $\log E$: restore efficiency by subtracting cost of selecting a linear extension

Parameters given dag $h$:

$\sum_{X_j} -\log \frac{f(\theta_j \mid h)}{\sqrt{F(\theta_j)}}$

where $\theta_j$ are the parameters for $X_j$ and $F(\theta_j)$ is the Fisher information. $f(\theta_j \mid h)$ is assumed to be $N(0, \sigma_j)$.
(Cf. with MDL’s fixed length for parms)
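The claim that minimizing I(h) + I(e|h) picks the same hypothesis as maximizing P(h)P(e|h) follows from I(m) = -log P(m), and can be checked on a toy example. The three hypotheses and their prior and likelihood values below are invented purely for illustration.

```python
import math


def message_length(prior, likelihood):
    """Two-part message length in bits: I(h) + I(e|h)."""
    return -math.log2(prior) - math.log2(likelihood)


# Hypothetical (prior, likelihood) pairs for three competing hypotheses.
hypotheses = {"h1": (0.5, 0.01), "h2": (0.3, 0.05), "h3": (0.2, 0.02)}

lengths = {h: message_length(p, l) for h, (p, l) in hypotheses.items()}
best_by_length = min(lengths, key=lengths.get)
best_by_posterior = max(hypotheses, key=lambda h: hypotheses[h][0] * hypotheses[h][1])

print(best_by_length, best_by_posterior)
```

Because the logarithm is monotone, the shortest message and the highest unnormalized posterior always single out the same hypothesis.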


MML Metric for Linear Models
Sample for $X_j$ given $h$ and $\theta_j$:

$-\log P(e \mid h, \theta_j) = -\log \prod_{k=1}^{K} \frac{1}{\sqrt{2\pi}\,\sigma_j} e^{-\epsilon_{jk}^2 / 2\sigma_j^2}$

where $K$ is the number of sample values and $\epsilon_{jk}$ is the difference between the observed value of $X_j$ and its linear prediction.

MML Metric for discrete models
We can use $P_{CH}(h_i, e)$ (from Cooper & Herskovits) to define an MML metric for discrete models.
Difference between MML and Bayesian metrics:
MML partitions the parameter space and selects optimal parameters.
Equivalent to a penalty of $\frac{1}{2} \log \frac{\pi e}{6}$ per parameter (Wallace & Freeman 1987); hence:

$I(e, h_i) = \frac{s_j}{2} \log \frac{\pi e}{6} - \log P_{CH}(h_i, e)$    (1)

Applied in MML Sampling algorithm.
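As a worked instance of the sample cost for a linear model, the negative log-likelihood of K Gaussian residuals expands to K log(sqrt(2 pi) sigma) plus the sum of squared residuals over 2 sigma squared. The sketch below evaluates it in nits (natural logarithms); the function name and the example residuals are assumptions.

```python
import math


def gaussian_sample_cost(residuals, sigma):
    """-log P(e | h, theta_j) in nits for residuals eps_jk with std dev sigma."""
    k = len(residuals)
    normaliser = k * math.log(math.sqrt(2 * math.pi) * sigma)
    fit = sum(r * r for r in residuals) / (2 * sigma ** 2)
    return normaliser + fit


perfect = gaussian_sample_cost([0.0, 0.0], sigma=1.0)
noisy = gaussian_sample_cost([1.0, -1.0], sigma=1.0)
print(perfect, noisy)
```

Larger residuals lengthen the message: with unit variance, each residual of size 1 adds half a nit over a perfect prediction.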


MML search algorithms
MML metrics need to be combined with search. This has been done three ways:
1. Wallace, Korb, Dai (1996): greedy search (linear).
– Brute force computation of linear extensions (small models only).
2. Neil and Korb (1999): genetic algorithms (linear).
– Asymptotic estimator of linear extensions
– GA chromosomes = causal models
– Genetic operators manipulate them
– Selection pressure is based on MML
3. Wallace and Korb (1999): MML sampling (linear, discrete).
– Stochastic sampling through space of totally ordered causal models
– No counting of linear extensions required

MML Sampling
Search space of totally ordered models (TOMs).
Sampled via a Metropolis algorithm (Metropolis et al. 1953).
From current model $M$, find the next model $M'$ by:
– Randomly select a variable; attempt to swap order with its predecessor.
– Or, randomly select a pair; attempt to add/delete an arc.
Attempts succeed whenever $P(M') / P(M) > U$ (per MML metric), where $U$ is uniformly random from $[0 : 1]$.
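A minimal sketch of the sampling loop described above, over totally ordered models represented as (order, arcs). It is not the authors' implementation: the scoring function `toy_log_prob` is a made-up stand-in for the MML metric, the 50/50 choice between the two move types is an assumption, and so is the decision to reverse an arc between two variables when their order is swapped.

```python
import math
import random


def propose(order, arcs, rng):
    """Return a neighbouring totally ordered model (order, arcs)."""
    new_order, new_arcs = list(order), set(arcs)
    if rng.random() < 0.5:
        # Swap a randomly chosen variable with its predecessor.
        i = rng.randrange(1, len(new_order))
        a, b = new_order[i - 1], new_order[i]
        new_order[i - 1], new_order[i] = b, a
        if (a, b) in new_arcs:           # assumption: reverse the arc between them
            new_arcs.remove((a, b))
            new_arcs.add((b, a))
    else:
        # Add or delete the arc between a random pair, oriented by the order.
        i, j = sorted(rng.sample(range(len(new_order)), 2))
        new_arcs ^= {(new_order[i], new_order[j])}
    return new_order, new_arcs


def metropolis(order, arcs, log_prob, steps, seed=0):
    """Sample TOMs; a move succeeds when P(M')/P(M) > U, with U uniform on [0, 1)."""
    rng = random.Random(seed)
    current = (list(order), set(arcs))
    visits = []
    for _ in range(steps):
        candidate = propose(*current, rng)
        delta = log_prob(*candidate) - log_prob(*current)
        if rng.random() < math.exp(min(0.0, delta)):
            current = candidate
        visits.append((tuple(current[0]), frozenset(current[1])))
    return visits


def toy_log_prob(order, arcs):
    """Hypothetical score: rewards the arc A -> B, charges one nit per arc."""
    return (2.0 if ("A", "B") in arcs else 0.0) - 1.0 * len(arcs)


visits = metropolis(["A", "B", "C"], set(), toy_log_prob, steps=500)
print(len(visits), "models visited")
```

Both move types are symmetric (the reverse move is proposed with the same probability), so the plain Metropolis acceptance rule applies, and visit counts can then be tallied per model.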


MML Sampling
Metropolis: this procedure samples TOMs with a frequency proportional to their posterior probability.
To find posterior of dag h: keep count of visits to all TOMs consistent with h.
Estimated by counting visits to all TOMs with identical max likelihoods to h.
Output: Probabilities of
– Top dags
– Top statistical equivalence classes
– Top MML equivalence classes

Empirical Results
A weakness in this area, and in AI generally.
– Paper publications based upon very small models, loose comparisons.
– ALARM net often used; everything gets it to within 1 or 2 arcs.
Neil and Korb (1999) compared MML and BGe (Heckerman & Geiger’s Bayesian metric over equivalence classes), using identical GA search over linear models:
– On KL distance and topological distance from the true model, MML and BGe performed nearly the same.
– On test prediction accuracy on strict effect nodes (those with no children), MML clearly outperformed BGe.


Current Research Issues
size and complexity
difficulties with elicitation
combinations of discrete and continuous (i.e. mixing node types)
Learning issues
– Missing data
– Latent variables
– Experimental data
– Learning CPT structure
– Multi-structure models
continuous & discrete
CPTs w/ & w/o parm independence
inappropriate problems (deterministic systems, legal rules)

(Other) Limitations


References
Introduction to Bayesian AI
T. Bayes (1764) “An Essay Towards Solving a Problem in the Doctrine of Chances.” Phil Trans of the Royal Soc of London. Reprinted in Biometrika, 45 (1958), 296-315.
B. Buchanan and E. Shortliffe (eds.) (1984) Rule-Based Expert Systems: The MYCIN Experiments of the Stanford Heuristic Programming Project. Addison-Wesley.
B. de Finetti (1964) “Foresight: Its Logical Laws, Its Subjective Sources,” in Kyburg and Smokler (eds.) Studies in Subjective Probability. NY: Wiley.
D. Heckerman (1986) “Probabilistic Interpretations for MYCIN’s Certainty Factors,” in L.N. Kanal and J.F. Lemmer (eds.) Uncertainty in Artificial Intelligence. North-Holland.
C. Howson and P. Urbach (1993) Scientific Reasoning: The Bayesian Approach. Open Court. A MODERN REVIEW OF BAYESIAN THEORY.
K.B. Korb (1995) “Inductive learning and defeasible inference,” Jrn for Experimental and Theoretical AI, 7, 291-324.
R. Neapolitan (1990) Probabilistic Reasoning in Expert Systems. Wiley. CHAPTERS 1, 2 AND 4 COVER SOME OF THE RELEVANT HISTORY.
J. Pearl (1988) Probabilistic Reasoning in Intelligent Systems, Morgan Kaufmann.
F. P. Ramsey (1931) “Truth and Probability” in The Foundations of Mathematics and Other Essays. NY: Humanities Press. THE ORIGIN OF MODERN BAYESIANISM. INCLUDES LOTTERY-BASED ELICITATION AND DUTCH-BOOK ARGUMENTS FOR THE USE OF PROBABILITIES.
R. Reiter (1980) “A logic for default reasoning,” Artificial Intelligence, 13, 81-132.
J. von Neumann and O. Morgenstern (1947) Theory of Games and Economic Behavior, 2nd ed. Princeton Univ. STANDARD REFERENCE ON ELICITING UTILITIES VIA LOTTERIES.

Bayesian Networks
E. Charniak (1991) “Bayesian Networks Without Tears”, Artificial Intelligence Magazine, pp. 50-63, Vol 12. AN ELEMENTARY INTRODUCTION.

D. D’Ambrosio (1999) “Inference in Bayesian Networks”. Artificial Intelligence Magazine, Vol 20, No. 2.
P. Haddawy (1999) “An Overview of Some Recent Developments in Bayesian Problem-Solving Techniques”. Artificial Intelligence Magazine, Vol 20, No. 2.
Howard & Matheson (1981) “Influence Diagrams,” Principles and Applications of Decision Analysis.
F. V. Jensen (1996) An Introduction to Bayesian Networks, Springer.
R. Neapolitan (1990) Probabilistic Reasoning in Expert Systems. Wiley. SIMILAR COVERAGE TO THAT OF PEARL; MORE EMPHASIS ON PRACTICAL ALGORITHMS FOR NETWORK UPDATING.
J. Pearl (1988) Probabilistic Reasoning in Intelligent Systems, Morgan Kaufmann. THIS IS THE CLASSIC TEXT INTRODUCING BAYESIAN NETWORKS TO THE AI COMMUNITY.
Poole, D., Mackworth, A., and Goebel, R. (1998) Computational Intelligence: a logical approach. Oxford University Press.
Russell & Norvig (1995) Artificial Intelligence: A Modern Approach, Prentice Hall.

Applications
D.W. Albrecht, I. Zukerman and Nicholson, A.E. (1998) Bayesian Models for Keyhole Plan Recognition in an Adventure Game. User Modeling and User-Adapted Interaction, 8(1-2), 5-47, Kluwer Academic Publishers.
S. Andreassen, F.V. Jensen, S.K. Andersen, B. Falck, U. Kjærulff, M. Woldbye, A.R. Sørensen, A. Rosenfalck and F. Jensen (1989) “MUNIN: An Expert EMG Assistant”, Computer-Aided Electromyography and Expert Systems, Chapter 21, J.E. Desmedt (Ed.), Elsevier.
S.A. Andreassen, J.J Benn, R. Hovorka, K.G. Olesen and R.E. Carson (1991) “A Probabilistic Approach to Glucose Prediction and Insulin Dose Adjustment: Description of Metabolic Model and Pilot Evaluation Study”.
I. Beinlich, H. Suermondt, R. Chavez and G. Cooper (1992) “The ALARM monitoring system: A case study with two probabilistic inference techniques for belief networks”, Proc. of the 2nd European Conf. on Artificial Intelligence in Medicine, pp. 689-693.
T.L. Dean and M.P. Wellman (1991) Planning and control, Morgan Kaufmann.
T.L. Dean, J. Allen and J. Aloimonos (1994) Artificial Intelligence: Theory and Practice, Benjamin/Cummings.

P. Dagum, A. Galper and E. Horvitz (1992) “Dynamic Network Models for Forecasting”, Proceedings of the 8th Conference on Uncertainty in Artificial Intelligence, pp. 41-48.
J. Forbes, T. Huang, K. Kanazawa and S. Russell (1995) “The BATmobile: Towards a Bayesian Automated Taxi”, Proceedings of the 14th Int. Joint Conf. on Artificial Intelligence (IJCAI’95), pp. 1878-1885.
S.L. Lauritzen and D.J. Spiegelhalter (1988) “Local Computations with Probabilities on Graphical Structures and their Application to Expert Systems”, Journal of the Royal Statistical Society, 50(2), pp. 157-224.
McConachy et al (1999)
A.E. Nicholson (1999) “CSE2309/3309 Artificial Intelligence, Monash University, Lecture Notes”, a http://www.csse.monash.edu.au/˜nnn/2-3309.html.
M. Pradhan, G. Provan, B. Middleton and M. Henrion (1994) “Knowledge engineering for large belief networks”, Proceedings of the 10th Conference on Uncertainty in Artificial Intelligence.
D. Pynadath and M. P. Wellman (1995) “Accounting for Context in Plan Recognition, with Application to Traffic Monitoring”, Proceedings of the 11th Conference on Uncertainty in Artificial Intelligence, pp. 472-481.
M. Shwe and G. Cooper (1990) “An Empirical Analysis of Likelihood-Weighting Simulation on a Large, Multiply Connected Belief Network”, Proceedings of the Sixth Workshop on Uncertainty in Artificial Intelligence, pp. 498-508.
L.C. van der Gaag, S. Renooij, C.L.M. Witteman, B.M.P. Aleman, B.G. Taal (1999) “How to Elicit Many Probabilities”, Laskey & Prade (eds) UAI99, 647-654.
Zukerman, I., McConachy, R., Korb, K. and Pickett, D. (1999) “Exploratory Interaction with a Bayesian Argumentation System,” in IJCAI-99 Proceedings, the Sixteenth International Joint Conference on Artificial Intelligence, pp. 1294-1299, Stockholm, Sweden, Morgan Kaufmann.

Learning Bayesian Networks
H. Blalock (1964) Causal Inference in Nonexperimental Research. University of North Carolina.
R. Bouckaert (1994) Probabilistic network construction using the minimum description length principle. Technical Report RUU-CS-94-27, Dept of Computer Science, Utrecht University.
C. Boutilier, N. Friedman, M. Goldszmidt, D. Koller (1996) “Context-specific independence in Bayesian networks,” in Horvitz & Jensen (eds.) UAI 1996, 115-123.

G. Brightwell and P. Winkler (1990) Counting linear extensions is #P-complete. Technical Report DIMACS 90-49, Dept of Computer Science, Rutgers Univ.
W. Buntine (1991) “Theory refinement on Bayesian networks,” in D’Ambrosio, Smets and Bonissone (eds.) UAI 1991, 52-69.
W. Buntine (1996) “A Guide to the Literature on Learning Probabilistic Networks from Data,” IEEE Transactions on Knowledge and Data Engineering, 8, 195-210.
D.M. Chickering (1995) “A Transformational Characterization of Equivalent Bayesian Network Structures,” in P. Besnard and S. Hanks (eds.) Proceedings of the Eleventh Conference on Uncertainty in Artificial Intelligence (pp. 87-98). San Francisco: Morgan Kaufmann. STATISTICAL EQUIVALENCE.
G.F. Cooper and E. Herskovits (1991) “A Bayesian Method for Constructing Bayesian Belief Networks from Databases,” in D’Ambrosio, Smets and Bonissone (eds.) UAI 1991, 86-94.
G.F. Cooper and E. Herskovits (1992) “A Bayesian Method for the Induction of Probabilistic Networks from Data,” Machine Learning, 9, 309-347. AN EARLY BAYESIAN CAUSAL DISCOVERY METHOD.
H. Dai, K.B. Korb, C.S. Wallace and X. Wu (1997) “A study of causal discovery with weak links and small samples.” Proceedings of the Fifteenth International Joint Conference on Artificial Intelligence (IJCAI), pp. 1304-1309. Morgan Kaufmann.
N. Friedman (1997) “The Bayesian Structural EM Algorithm,” in D. Geiger and P.P. Shenoy (eds.) Proceedings of the Thirteenth Conference on Uncertainty in Artificial Intelligence (pp. 129-138). San Francisco: Morgan Kaufmann.
Geiger and Heckerman (1994) “Learning Gaussian networks,” in Lopez de Mantaras and Poole (eds.) UAI 1994, 235-243.
D. Heckerman and D. Geiger (1995) “Learning Bayesian networks: A unification for discrete and Gaussian domains,” in Besnard and Hanks (eds.) UAI 1995, 274-284.
D. Heckerman, D. Geiger, and D.M. Chickering (1995) “Learning Bayesian Networks: The Combination of Knowledge and Statistical Data,” Machine Learning, 20, 197-243. BAYESIAN LEARNING OF STATISTICAL EQUIVALENCE CLASSES.
K. Korb (1999) “Probabilistic Causal Structure” in H. Sankey (ed.) Causation and Laws of Nature: Australasian Studies in History and Philosophy of Science 14. Kluwer Academic. INTRODUCTION TO THE RELEVANT PHILOSOPHY OF CAUSATION FOR LEARNING BAYESIAN NETWORKS.
P. Krause (1998) Learning Probabilistic Networks. http://www.auai.org/bayes USKrause.ps.gz BASIC INTRODUCTION TO BNS, PARAMETERIZATION AND LEARNING CAUSAL STRUCTURE.
W. Lam and F. Bacchus (1993) “Learning Bayesian belief networks: An approach based on the MDL principle,” Jrn Comp Intelligence, 10, 269-293.
D. Madigan, S.A. Andersson, M.D. Perlman & C.T. Volinsky (1996) “Bayesian model averaging and model selection for Markov equivalence classes of acyclic digraphs,” Comm in Statistics: Theory and Methods, 25, 2493-2519.
D. Madigan and A. E. Raftery (1994) “Model selection and accounting for model uncertainty in graphical models using Occam’s window,” Jrn Amer Stat Assoc, 89, 1535-1546.
N. Metropolis, A.W. Rosenbluth, M.N. Rosenbluth, A.H. Teller and E. Teller (1953) “Equations of state calculations by fast computing machines,” Jrn Chemical Physics, 21, 1087-1091.
J.R. Neil and K.B. Korb (1999) “The Evolution of Causal Models: A Comparison of Bayesian Metrics and Structure Priors,” in N. Zhong and L. Zhou (eds.) Methodologies for Knowledge Discovery and Data Mining: Third Pacific-Asia Conference (pp. 432-437). Springer Verlag. GENETIC ALGORITHMS FOR CAUSAL DISCOVERY; STRUCTURE PRIORS.
J.R. Neil, C.S. Wallace and K.B. Korb (1999) “Learning Bayesian networks with restricted causal interactions,” in Laskey and Prade (eds.) UAI 99, 486-493.
J. Rissanen (1978) “Modeling by shortest data description,” Automatica, 14, 465-471.
H. Simon (1954) “Spurious Correlation: A Causal Interpretation,” Jrn Amer Stat Assoc, 49, 467-479.
D. Spiegelhalter & S. Lauritzen (1990) “Sequential Updating of Conditional Probabilities on Directed Graphical Structures,” Networks, 20, 579-605.
P. Spirtes, C. Glymour and R. Scheines (1990) “Causality from Probability,” in J.E. Tiles, G.T. McKee and G.C. Dean Evolving Knowledge in Natural Science and Artificial Intelligence. London: Pitman. AN ELEMENTARY INTRODUCTION TO STRUCTURE LEARNING VIA CONDITIONAL INDEPENDENCE.
P. Spirtes, C. Glymour and R. Scheines (1993) Causation, Prediction and Search: Lecture Notes in Statistics 81. Springer Verlag. A THOROUGH PRESENTATION OF THE ORTHODOX STATISTICAL APPROACH TO LEARNING CAUSAL STRUCTURE.
J. Suzuki (1996) “Learning Bayesian Belief Networks Based on the Minimum Description Length Principle,” in L. Saitta (ed.) Proceedings of the Thirteenth International Conference on Machine Learning (pp. 462-470). San Francisco: Morgan Kaufmann.
T.S. Verma and J. Pearl (1991) “Equivalence and Synthesis of Causal Models,” in P. Bonissone, M. Henrion, L. Kanal and J.F. Lemmer (eds) Uncertainty in Artificial Intelligence 6 (pp. 255-268). Elsevier. THE GRAPHICAL CRITERION FOR STATISTICAL EQUIVALENCE.
C.S. Wallace and D. Boulton (1968) “An information measure for classification,” Computer Jrn, 11, 185-194.
C.S. Wallace and P.R. Freeman (1987) “Estimation and inference by compact coding,” Jrn Royal Stat Soc (Series B), 49, 240-252.
C. S. Wallace and K. B. Korb (1999) “Learning Linear Causal Models by MML Sampling,” in A. Gammerman (ed.) Causal Models and Intelligent Data Management. Springer Verlag. SAMPLING APPROACH TO LEARNING CAUSAL MODELS; DISCUSSION OF STRUCTURE PRIORS.
C. S. Wallace, K. B. Korb, and H. Dai (1996) “Causal Discovery via MML,” in L. Saitta (ed.) Proceedings of the Thirteenth International Conference on Machine Learning (pp. 516-524). San Francisco: Morgan Kaufmann. INTRODUCES AN MML METRIC FOR CAUSAL MODELS.
S. Wright (1921) “Correlation and Causation,” Jrn Agricultural Research, 20, 557-585.
S. Wright (1934) “The Method of Path Coefficients,” Annals of Mathematical Statistics, 5, 161-215.

Current Research

Bayesian Network URL’s
