
BAYESIAN NETWORKS

LECTURER:
Humera Farooq, Ph.D.
Computer Sciences Department,
Bahria University (Karachi Campus)
Outline
• Introduction
• Bayesian Interpretation
• Bayes Theorem
• Naïve Bayes
• Bayesian Networks
• Example
• Conclusion
Basics of Bayesian Learning
• Goal: find the best hypothesis from some space H of hypotheses, given the observed data D.
• Define "best" as the most probable hypothesis in H.
• In order to do that, we need to assume a probability distribution over the class H.
• In addition, we need to know something about the relation between the observed data and the hypotheses (e.g., a coin-tossing problem).
• h is a class variable (the hypothesis) and D are the examples (features).
Basics of Bayesian Learning
• P(h): the prior probability of hypothesis h (class variable). It reflects background knowledge, before any data is observed; with no information, assume a uniform distribution.
• P(D): the probability that this sample of the data is observed (with no knowledge of the hypothesis).
• P(D|h): the probability of observing the sample D, given that hypothesis h is the target.
• P(h|D): the posterior probability of h, i.e., the probability that h is the target, given that D has been observed.
Bayes Theorem
• In ML problems, we are interested in the probability P(h|D) that h holds given the observed training data D.
• Bayes Theorem provides a way to calculate the posterior probability P(h|D) from the prior probability P(h), together with P(D) and P(D|h).
• Bayes Theorem:
  P(h|D) = P(D|h) P(h) / P(D)
• P(h|D) increases with P(h) and with P(D|h), according to Bayes theorem.
• P(h|D) decreases as P(D) increases, because the more probable it is that D will be observed independently of h, the less evidence D provides in support of h.
Bayes Theorem: An example
Maximum A Posteriori (MAP) Hypothesis, hMAP
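In standard form, the MAP hypothesis is the one that maximizes the posterior probability; the evidence term P(D) can be dropped because it does not depend on h:

  hMAP = argmax_{h ∈ H} P(h|D) = argmax_{h ∈ H} P(D|h) P(h)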
Maximum Likelihood (ML) Hypothesis, hML
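When every hypothesis in H is assumed equally probable a priori, maximizing the posterior reduces to maximizing the likelihood:

  hML = argmax_{h ∈ H} P(D|h)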
Example: Does a patient have cancer or not?
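A minimal numeric sketch of this kind of diagnosis in Python; the prior and test accuracies below are assumed only for illustration and are not taken from the lecture:

# Hypothetical numbers, chosen only for illustration (not from the slide).
p_cancer = 0.008               # assumed prior P(cancer)
p_pos_given_cancer = 0.98      # assumed P(positive test | cancer)
p_pos_given_healthy = 0.03     # assumed P(positive test | no cancer)

# Unnormalized posteriors for a positive test result (numerators of Bayes theorem).
post_cancer = p_pos_given_cancer * p_cancer             # 0.00784
post_healthy = p_pos_given_healthy * (1 - p_cancer)     # 0.02976

# MAP decision: pick the hypothesis with the larger (unnormalized) posterior.
print("MAP diagnosis:", "cancer" if post_cancer > post_healthy else "no cancer")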
Naïve Bayes
• It is a classification technique based on Bayes' Theorem with an assumption of independence among predictors. In simple terms, a Naïve Bayes classifier assumes that the presence of a particular feature in a class is unrelated to the presence of any other feature.
• Naïve: it is called Naïve because it assumes that the occurrence of a certain feature is independent of the occurrence of other features.
• Bayes: it is called Bayes because it depends on the principle of Bayes' Theorem.
• It is mainly used in text classification, which involves high-dimensional training datasets.
• It is a probabilistic classifier, which means it predicts on the basis of the probability of an object belonging to each class.
• Some popular applications of the Naïve Bayes algorithm are spam filtering, sentiment analysis, and classifying articles.
Types of Naïve Bayes
• Gaussian: The Gaussian model assumes that features follow a normal
distribution. This means if predictors take continuous values instead of discrete,
then the model assumes that these values are sampled from the Gaussian
distribution.
• Multinomial: The Multinomial Naïve Bayes classifier is used when the data is multinomially distributed. It is primarily used for document classification problems, i.e., deciding which category a particular document belongs to, such as sports, politics, education, etc.
The classifier uses the frequency of words as the predictors.
• Bernoulli: The Bernoulli classifier works similarly to the Multinomial classifier, but the predictor variables are independent Boolean variables, such as whether a particular word is present or not in a document. This model is also popular for document classification tasks.
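A minimal sketch of the three variants, assuming scikit-learn is available; the tiny arrays below are made up solely to show the expected kind of input for each model:

import numpy as np
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB

X_cont = np.array([[1.8, 70.0], [1.6, 55.0], [1.7, 65.0], [1.5, 50.0]])   # continuous features
X_counts = np.array([[3, 0, 1], [0, 2, 4], [2, 1, 0], [0, 3, 2]])          # word counts per document
X_bool = (X_counts > 0).astype(int)                                        # word present / absent
y = np.array([0, 1, 0, 1])                                                 # class labels

print(GaussianNB().fit(X_cont, y).predict(X_cont))         # Gaussian: normally distributed features
print(MultinomialNB().fit(X_counts, y).predict(X_counts))  # Multinomial: word-frequency features
print(BernoulliNB().fit(X_bool, y).predict(X_bool))        # Bernoulli: binary presence/absence features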
Bayes Classifiers Working
Let's write Bayes Theorem as
  P(y|X) = P(X|y) P(y) / P(X)
where y is the class variable and X is a dependent feature vector (of size n):
  X = (x1, x2, ..., xn)
We split the evidence into independent parts. If any two events A (class variable) and B (feature vector) are independent, then
  P(A, B) = P(A) P(B)
Hence:
  P(y|x1, ..., xn) = [P(x1|y) P(x2|y) ... P(xn|y) P(y)] / [P(x1) P(x2) ... P(xn)]
and this can be expressed as
  P(y|x1, ..., xn) ∝ P(y) ∏i P(xi|y)
by ignoring the denominator, since it is constant for a given input.

Bayes Classifiers
To create the classifier model, calculate the probability of the input for each value of the class variable y and select the class with the maximum probability:
  y = argmax_y P(y) ∏i P(xi|y)
Find P(y) and the P(xi|y) terms from the weather (play tennis) dataset below.
Example
• Example: Play Tennis
Example
• Learning Phase

Outlook    Play=Yes  Play=No        Temperature  Play=Yes  Play=No
Sunny        2/9       3/5          Hot            2/9       2/5
Overcast     4/9       0/5          Mild           4/9       2/5
Rain         3/9       2/5          Cool           3/9       1/5

Humidity   Play=Yes  Play=No        Wind         Play=Yes  Play=No
High         3/9       4/5          Strong         3/9       3/5
Normal       6/9       1/5          Weak           6/9       2/5

P(Play=Yes) = 9/14    P(Play=No) = 5/14
Example
• Test Phase
– Given a new instance,
  x' = (Outlook=Sunny, Temperature=Cool, Humidity=High, Wind=Strong)
– Look up the tables:
  P(Outlook=Sunny|Play=Yes) = 2/9        P(Outlook=Sunny|Play=No) = 3/5
  P(Temperature=Cool|Play=Yes) = 3/9     P(Temperature=Cool|Play=No) = 1/5
  P(Humidity=High|Play=Yes) = 3/9        P(Humidity=High|Play=No) = 4/5
  P(Wind=Strong|Play=Yes) = 3/9          P(Wind=Strong|Play=No) = 3/5
  P(Play=Yes) = 9/14                     P(Play=No) = 5/14
– MAP rule
  P(Yes|x') ∝ [P(Sunny|Yes) P(Cool|Yes) P(High|Yes) P(Strong|Yes)] P(Play=Yes) = 0.0053
  P(No|x')  ∝ [P(Sunny|No) P(Cool|No) P(High|No) P(Strong|No)] P(Play=No) = 0.0206
Given that P(Yes|x') < P(No|x'), we label x' as "No".
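A minimal sketch in plain Python that reproduces the two unnormalized scores above from the learning-phase tables:

# Conditional probabilities for x' = (Sunny, Cool, High, Strong), read from the tables.
likelihood_yes = (2/9) * (3/9) * (3/9) * (3/9)   # P(Sunny|Yes) P(Cool|Yes) P(High|Yes) P(Strong|Yes)
likelihood_no  = (3/5) * (1/5) * (4/5) * (3/5)   # P(Sunny|No)  P(Cool|No)  P(High|No)  P(Strong|No)

score_yes = likelihood_yes * (9/14)   # multiply by the prior P(Play=Yes)
score_no  = likelihood_no  * (5/14)   # multiply by the prior P(Play=No)

print(round(score_yes, 4))            # 0.0053
print(round(score_no, 4))             # 0.0206
print("Play =", "Yes" if score_yes > score_no else "No")   # "No"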
Naïve Bayes Summary
• Naïve Bayes can handle missing values by ignoring the sample during probability computation, and it is robust to outliers and irrelevant features.
• The Naïve Bayes algorithm is very easy to implement for applications involving textual data (e.g., sentiment analysis, news article classification, spam filtering).
• Convergence is quicker relative to logistic regression, which is discriminative in nature.
• It performs well even when the independence assumption between features does not hold.
• The resulting decision boundaries can be non-linear and/or piecewise.
• Disadvantage: it is not robust to redundant features. If the features have a strong relationship or correlation with each other, Naïve Bayes is not a good choice. Naïve Bayes has high bias and low variance, and there is no regularization to adjust for this bias.
Bayesian Networks
• A graphical model for representing probabilistic relationships among inputs and labels. It generalizes the idea of Naïve Bayes to model distributions over groups of variables with more complex conditional independence relationships.
• A Bayesian network consists of a collection of conditional probability distributions such that their product is the full joint distribution over all the variables.
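Concretely, with one conditional distribution per node, the product takes the standard form
  P(X1, X2, ..., Xn) = ∏i P(Xi | parents(Xi))
where parents(Xi) are the parents of node Xi in the graph.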
Bayesian Networks
Overview – Example: Bayesian Network for Liver Disorder Diagnosis (A. Onisko, M. Druzdzel, and H. Wasyluk, Sept. 1999)
Bayesian Network
• Edges are directed and represent direct dependencies ("connections"); no directed cycles are allowed, so the graph is a DAG.
• Each node is conditionally independent of its non-descendants (in particular, its ancestors) given its parents (Markov property).
Example of Simple Bayesian Networks
• A and B are marginally independent, but when C is given, they become conditionally dependent. This is called "explaining away".
Examples of 3-way Bayesian Networks
• When B is given, A and C are conditionally independent.
  For example, a Markov chain:
  A is the past, B is the present, C is the future.
Examples of 3-way Bayesian Networks
• When A is given, B and C are conditionally independent.
  For example: A is the common cause of the two independent effects B and C.
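For reference, the joint factorizations implied by these three-variable structures are:
  Chain (A → B → C):           P(A, B, C) = P(A) P(B|A) P(C|B)
  Common cause (B ← A → C):    P(A, B, C) = P(A) P(B|A) P(C|A)
  Common effect (A → C ← B):   P(A, B, C) = P(A) P(B) P(C|A, B)
In the common-effect case, A and B are marginally independent but become dependent once C is observed (explaining away).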
Other Examples
Example
Example

P(B=T)   P(B=F)          P(E=T)   P(E=F)
0.001    0.999           0.002    0.998

B   E   P(A=T)   P(A=F)
T   T   0.95     0.05
T   F   0.94     0.06
F   T   0.29     0.71
F   F   0.001    0.999

A   P(JC=T)   P(JC=F)        A   P(MC=T)   P(MC=F)
T   0.90      0.10           T   0.70      0.30
F   0.05      0.95           F   0.01      0.99
Constructing a Bayesian Network: Step 1
Constructing a Bayesian Network: Step 2
The Resulting Bayesian Network
Bayesian Networks (different variable ordering)
Example
• What is the probability that the alarm has sounded but neither a burglary nor an earthquake has occurred, and both John and Mary call?

  P(JC, MC, A, ¬B, ¬E)
  = P(JC|A) P(MC|A) P(A|¬B, ¬E) P(¬B) P(¬E)
  = 0.90 x 0.70 x 0.001 x 0.999 x 0.998
  = 0.00062
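A minimal sketch in plain Python that encodes the conditional probability tables above and recomputes this query from the network factorization:

# CPTs of the burglary/earthquake alarm network (values from the tables above).
P_B = {True: 0.001, False: 0.999}                   # P(Burglary)
P_E = {True: 0.002, False: 0.998}                   # P(Earthquake)
P_A = {(True, True): 0.95, (True, False): 0.94,     # P(Alarm=T | Burglary, Earthquake)
       (False, True): 0.29, (False, False): 0.001}
P_JC = {True: 0.90, False: 0.05}                    # P(JohnCalls=T | Alarm)
P_MC = {True: 0.70, False: 0.01}                    # P(MaryCalls=T | Alarm)

def joint(jc, mc, a, b, e):
    # Network factorization: P(JC,MC,A,B,E) = P(JC|A) P(MC|A) P(A|B,E) P(B) P(E)
    p_a = P_A[(b, e)] if a else 1 - P_A[(b, e)]
    p_jc = P_JC[a] if jc else 1 - P_JC[a]
    p_mc = P_MC[a] if mc else 1 - P_MC[a]
    return p_jc * p_mc * p_a * P_B[b] * P_E[e]

# Alarm sounded, no burglary, no earthquake, both John and Mary call:
print(joint(True, True, True, False, False))   # ≈ 0.00062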
