
Bayesian Reasoning

Reference

Gopal, M. (2019). Applied Machine Learning. McGraw-Hill Education.
Bayesian Reasoning

• Bayesian models associate a probability with each decision.
• Bayesian learning techniques are relevant to the study of machine learning for two separate reasons:
  1. Bayesian learning algorithms that compute explicit probabilities, for example the naive Bayes classifier, are among the most practical approaches, especially for large datasets and NLP.
  2. They give a meaningful perspective for understanding various learning algorithms that do not explicitly manipulate probabilities.
Bayes Theorem

• Bayes' theorem states that the conditional probability of an event, given the occurrence of another event, is equal to the likelihood of the second event given the first event multiplied by the probability of the first event.
• We use Bayes' theorem for the following problem setting:

  D : \{ (x^{(i)}, y^{(i)});\ i = 1, 2, \ldots, N \}

  with patterns x = (x_1\ x_2\ \ldots\ x_n)^T
• We consider y to be a random variable that must be described probabilistically:

  y : (y_1, y_2, \ldots, y_q, \ldots, y_M)

  where y_q;\ q = 1, \ldots, M corresponds to class q \in \{1, \ldots, M\}.
• The distribution over all possible values of the discrete random variable y is expressed as a probability distribution:

  P(y) = (P(y_1), \ldots, P(y_M)); \qquad P(y_1) + \ldots + P(y_M) = 1

• Known priors: P(y_q)
• Bayes' theorem provides a way to obtain the posterior P(y_k \mid x);\ k \in \{1, \ldots, M\} from the known priors P(y_q), using the known conditional probabilities P(x \mid y_q);\ q = 1, \ldots, M:

  P(y_k \mid x) = \frac{P(y_k)\, P(x \mid y_k)}{P(x)}, \qquad P(x) = \sum_{q=1}^{M} P(x \mid y_q)\, P(y_q)
• P(x) expresses the variability of the observed data, independent of the class.
• P(x \mid y_k) is called the class likelihood; it is the conditional probability that a pattern belonging to class y_k has the associated observation value x.

  Posterior = \frac{Prior \times Likelihood}{Evidence}
• The posterior can therefore be calculated as:

  P(y_k \mid x) = \frac{P(y_k)\, P(x \mid y_k)}{\sum_{q=1}^{M} P(x \mid y_q)\, P(y_q)}
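• As a concrete illustration of this computation, the sketch below evaluates the posterior for a small two-class problem; the prior and likelihood numbers are illustrative assumptions, not taken from the text.

```python
# Minimal sketch of Bayes' rule: posterior from priors and class likelihoods.
# The numbers below are illustrative assumptions, not from the lecture notes.

priors = {"y1": 0.6, "y2": 0.4}          # P(y_q), assumed known
likelihoods = {"y1": 0.2, "y2": 0.7}     # P(x | y_q) for one observed pattern x

# Evidence: P(x) = sum_q P(x | y_q) P(y_q)
evidence = sum(likelihoods[q] * priors[q] for q in priors)

# Posterior: P(y_q | x) = P(y_q) P(x | y_q) / P(x)
posteriors = {q: likelihoods[q] * priors[q] / evidence for q in priors}

print(posteriors)   # {'y1': 0.3, 'y2': 0.7} -- the posteriors sum to 1
```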

• We can determine the Maximum A Posteriori (MAP) class by choosing:

  Class k if P(y_k \mid x) = \max_q P(y_q \mid x)

• Thus, y_{MAP} corresponds to the MAP class provided:

  y_{MAP} \equiv \arg\max_q P(y_q \mid x)
        = \arg\max_q \frac{P(y_q)\, P(x \mid y_q)}{P(x)}
        \equiv \arg\max_q P(y_q)\, P(x \mid y_q)        (1)
• P(x \mid y_q) represents the likelihood of the data x given class y_q.
• In some cases, classes are assumed to be equally probable, P(y_k) = P(y_q)\ \forall k, q; then only the likelihood needs to be considered.

• Any class that maximizes P(x \mid y_q) is called the Maximum Likelihood (ML) class. Thus, y_{ML} corresponds to the ML class provided:

  y_{ML} \equiv \arg\max_q P(x \mid y_q)
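• Continuing the illustrative numbers from the posterior sketch above, a short sketch of the difference between the MAP and ML decisions (all quantities are assumed, not from the text):

```python
# MAP vs ML class selection on an assumed two-class example.
priors = {"y1": 0.6, "y2": 0.4}          # P(y_q)
likelihoods = {"y1": 0.2, "y2": 0.7}     # P(x | y_q)

# MAP class: argmax_q P(y_q) * P(x | y_q)   (the common denominator P(x) is dropped)
y_map = max(priors, key=lambda q: priors[q] * likelihoods[q])

# ML class: argmax_q P(x | y_q)   (equivalent to MAP when all priors are equal)
y_ml = max(likelihoods, key=lambda q: likelihoods[q])

print(y_map, y_ml)   # here both give 'y2'; they can differ when the priors are skewed
```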
Disadvantages of Bayes’ Classifier
• Requires initial knowledge of prior probability
𝑃(𝑦𝑞 ) and likelihood 𝑃(𝑥|𝑦𝑞 )
• In real world problems, these probabilities are not
known in advance
• With the knowledge of the probabilistic structure
of the problem, conditional densities can be
parameterized
• In most pattern recognition problems, assumption
of knowledge of probability structure is not always
valid
• Classical parametric models are unimodal, but
multimodal densities are found in many real
problems
Parameter Estimation and Dependencies
• It is easier to estimate the conditional density parameters, if the
probability structure is known

• E.g., if it is known that P(x \mid y_q) \sim N(\mu_q, \sigma_q^2), it is simpler to estimate \mu_q and \sigma_q^2.

• Sometimes parameterized density functions are not enough, as


there are statistical dependencies or causal relationships among
the features

• When such relationships are known, the dependencies can be


represented with the help of Bayesian Belief Networks

• If the dependency structure is unknown, we proceed by the most


basic assumption: features are conditionally independent given the
class
Naive Bayes Classifier

• Sometimes, very simple algorithms perform quite well

• Naive Bayes is one of the most widely used algorithms for classification problems.
• It is derived from Bayes' theorem and is very useful for high-dimensional datasets and text classification.
• Naive Bayes assumes conditional independence of the features, an assumption that Bayes' theorem itself does not make.
• Naive Bayes considers all features as equally important and independent of each other.
Naive Bayes Classifier
• Assume the features are categorical.
• Continuous features can be converted to categorical features by creating bins.
• To obtain P(y_k \mid x);\ k \in \{1, \ldots, M\}, we must specify P(y_q) and P(x \mid y_q).
• P(y_q) (if prior knowledge is not available) may be estimated simply by counting the frequency with which class y_q occurs in the training data:

  P(y_q) = \frac{\text{Number of data with class } y_q}{\text{Total number } N \text{ of data}}

• Class-conditional probabilities P(x \mid y_q) can be estimated as:

  P(x \mid y_q) = \frac{\text{Number of times pattern } x \text{ appears with class } y_q}{\text{Number of times } y_q \text{ appears in the data}}
Naive Bayes Classifier
• The assumption is that, given the class of the pattern, the probability of observing the conjunction x_1, x_2, \ldots, x_n is just the product of the probabilities of the individual attributes (conditional independence):

  P(x_1, x_2, \ldots, x_n \mid y_q) = \prod_j P(x_j \mid y_q)

• Substituting this into Equation (1), we have the naive Bayes algorithm:

  y_{NB} \equiv \arg\max_q P(y_q) \prod_j P(x_j \mid y_q)        (2)

  where y_{NB} is the class output.
• The number of P(x_j \mid y_q) terms is given by the number of distinct attributes (n) times the number of classes M.
• The values are estimated simply by counting the frequency of data combinations within the training sample (a counting sketch is given below).
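• The sketch below implements this counting-based naive Bayes for categorical features. The function names and the tiny dataset are illustrative assumptions, not from the text.

```python
from collections import Counter, defaultdict

def fit_naive_bayes(X, y):
    """Estimate P(y_q) and P(x_j | y_q) by counting frequencies in the training data."""
    N = len(y)
    class_counts = Counter(y)                               # N_q for each class
    priors = {q: class_counts[q] / N for q in class_counts}
    # cond_counts[q][j][value] = number of class-q samples whose j-th attribute equals value
    cond_counts = defaultdict(lambda: defaultdict(Counter))
    for xi, yi in zip(X, y):
        for j, value in enumerate(xi):
            cond_counts[yi][j][value] += 1
    return priors, cond_counts, class_counts

def predict_naive_bayes(x, priors, cond_counts, class_counts):
    """y_NB = argmax_q P(y_q) * prod_j P(x_j | y_q), with probabilities from raw counts."""
    scores = {}
    for q in priors:
        score = priors[q]
        for j, value in enumerate(x):
            score *= cond_counts[q][j][value] / class_counts[q]
        scores[q] = score
    return max(scores, key=scores.get)

# Illustrative categorical data: (outlook, windy) -> play / no-play (assumed values).
X = [("sunny", "no"), ("sunny", "yes"), ("rainy", "yes"), ("rainy", "no"), ("sunny", "no")]
y = ["play", "no-play", "no-play", "play", "play"]
model = fit_naive_bayes(X, y)
print(predict_naive_bayes(("sunny", "no"), *model))   # -> 'play'
```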
Naive Bayes Classifier
• If the features are continuous, then discretization gives us categorical values V_{x_j}.
• If x_j can take d_j countable values, then

  V_{x_j} = \{ v^1_{x_j}, v^2_{x_j}, v^3_{x_j}, \ldots, v^{d_j}_{x_j} \} = \{ v^l_{x_j};\ l = 1, 2, \ldots, d_j \}

• Let the value of x_j be v^l_{x_j}. Then

  P(x_j \mid y_q) = \frac{N_q^{v^l_{x_j}}}{N_q}

  where N_q^{v^l_{x_j}} is the number of training samples of class y_q having the value v^l_{x_j} for attribute x_j, and N_q is the total number of training samples of class y_q.
• Class prior probabilities may be calculated as

  P(y_q) = \frac{N_q}{N}

  where N is the total number of training samples and N_q is the number of samples of class y_q.
Example 1: Consider the dataset D given in Table.

          Gender   Height   Sport      y
          (x1)     (x2)
s(1)      F        1.6 m    Cricket    y1
s(2)      M        2 m      Football   y3
s(3)      F        1.9 m    Tennis     y2
s(4)      F        1.88 m   Tennis     y2
s(5)      F        1.7 m    Cricket    y1
s(6)      M        1.85 m   Tennis     y2
s(7)      F        1.6 m    Cricket    y1
s(8)      M        1.7 m    Cricket    y1
s(9)      M        2.2 m    Football   y3
s(10)     M        2.1 m    Football   y3
s(11)     F        1.8 m    Tennis     y2
s(12)     M        1.95 m   Tennis     y2
s(13)     F        1.9 m    Tennis     y2
s(14)     F        1.8 m    Tennis     y2
s(15)     F        1.75 m   Tennis     y2
Solution: y1 corresponds to the class ‘Cricket’, y2 corresponds to the class ‘Tennis’, and y3 corresponds to the class ‘Football’. Therefore,

M = 3, N = 15

P(y_1) = \frac{N_1}{N} = \frac{4}{15} = 0.267

P(y_2) = \frac{N_2}{N} = \frac{8}{15} = 0.533

P(y_3) = \frac{N_3}{N} = \frac{3}{15} = 0.2

V_{x_1} = \{M, F\} = \{ v^1_{x_1}, v^2_{x_1} \};\ d_1 = 2

V_{x_2} = \{ v^1_{x_2}, v^2_{x_2}, v^3_{x_2}, v^4_{x_2}, v^5_{x_2}, v^6_{x_2} \};\ d_2 = 6
        = bins \{(0, 1.6], (1.6, 1.7], (1.7, 1.8], (1.8, 1.9], (1.9, 2.0], (2.0, \infty)\}
The count table generated from the data is given in Table.

Table: Number of training samples, N_q^{v^l_{x_j}}, of class q having value v^l_{x_j}

Value v^l_{x_j}               Cricket    Tennis    Football
                              q = 1      q = 2     q = 3
v^1_{x_1}: M                  1          2         3
v^2_{x_1}: F                  3          6         0
v^1_{x_2}: (0, 1.6] bin       2          0         0
v^2_{x_2}: (1.6, 1.7] bin     2          0         0
v^3_{x_2}: (1.7, 1.8] bin     0          3         0
v^4_{x_2}: (1.8, 1.9] bin     0          4         0
v^5_{x_2}: (1.9, 2.0] bin     0          1         1
v^6_{x_2}: (2.0, ∞) bin       0          0         2
We consider an instance from the given dataset (the same procedure applies for a data tuple not in the given dataset, i.e., an unseen instance):

x : {M, 1.95 m} = {x_1, x_2}

In the discretized domain, ‘M’ corresponds to v^1_{x_1} and ‘1.95 m’ corresponds to v^5_{x_2}.

P(x_1 \mid y_1) = \frac{N_1^{v^1_{x_1}}}{N_1} = \frac{1}{4}

P(x_1 \mid y_2) = \frac{N_2^{v^1_{x_1}}}{N_2} = \frac{2}{8}

P(x_1 \mid y_3) = \frac{N_3^{v^1_{x_1}}}{N_3} = \frac{3}{3}

P(x_2 \mid y_1) = \frac{N_1^{v^5_{x_2}}}{N_1} = \frac{0}{4}

P(x_2 \mid y_2) = \frac{N_2^{v^5_{x_2}}}{N_2} = \frac{1}{8}

P(x_2 \mid y_3) = \frac{N_3^{v^5_{x_2}}}{N_3} = \frac{1}{3}

P(x \mid y_1) = P(x_1 \mid y_1) \cdot P(x_2 \mid y_1) = \frac{1}{4} \cdot 0 = 0

P(x \mid y_2) = P(x_1 \mid y_2) \cdot P(x_2 \mid y_2) = \frac{2}{8} \cdot \frac{1}{8} = \frac{1}{32}

P(x \mid y_3) = P(x_1 \mid y_3) \cdot P(x_2 \mid y_3) = \frac{3}{3} \cdot \frac{1}{3} = \frac{1}{3}

P(x \mid y_1)\, P(y_1) = 0 \times 0.267 = 0

P(x \mid y_2)\, P(y_2) = \frac{1}{32} \times 0.533 = 0.0167

P(x \mid y_3)\, P(y_3) = \frac{1}{3} \times 0.2 = 0.0667

y_{NB} = \arg\max_q P(x \mid y_q)\, P(y_q)

This gives q = 3.

Therefore, for the pattern x = {M, 1.95 m}, the predicted class is ‘Football’.

The true class in the data table is ‘Tennis’. Note that we are working with an artificial toy dataset; applying the naive Bayes algorithm to real-life datasets with large N brings out the power of the naive Bayes classifier.
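The following self-contained sketch reproduces Example 1 numerically (the data, bins, and expected result are those in the tables above; the helper names are illustrative):

```python
from collections import Counter

# Example 1 data: (gender, height in metres, class)
data = [("F", 1.6, "y1"), ("M", 2.0, "y3"), ("F", 1.9, "y2"), ("F", 1.88, "y2"),
        ("F", 1.7, "y1"), ("M", 1.85, "y2"), ("F", 1.6, "y1"), ("M", 1.7, "y1"),
        ("M", 2.2, "y3"), ("M", 2.1, "y3"), ("F", 1.8, "y2"), ("M", 1.95, "y2"),
        ("F", 1.9, "y2"), ("F", 1.8, "y2"), ("F", 1.75, "y2")]

def height_bin(h):
    """Discretize height into the bins (0,1.6], (1.6,1.7], ..., (2.0, inf)."""
    edges = [1.6, 1.7, 1.8, 1.9, 2.0]
    for i, e in enumerate(edges):
        if h <= e:
            return i
    return len(edges)

N = len(data)
Nq = Counter(c for _, _, c in data)                        # class counts N_q
classes = sorted(Nq)

def cond_prob(feature_index, value, q):
    """P(x_j = value | y_q) estimated by counting, as in the count table."""
    count = sum(1 for row in data
                if row[2] == q and
                (row[0] if feature_index == 0 else height_bin(row[1])) == value)
    return count / Nq[q]

# Test pattern x = {M, 1.95 m}
x_gender, x_bin = "M", height_bin(1.95)                    # 1.95 m falls in the (1.9, 2.0] bin
scores = {q: (Nq[q] / N) * cond_prob(0, x_gender, q) * cond_prob(1, x_bin, q)
          for q in classes}
print(scores)                       # approx {'y1': 0.0, 'y2': 0.0167, 'y3': 0.0667}
print(max(scores, key=scores.get))  # 'y3' -> 'Football', matching the worked example
```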
Naïve Bayes

• Suppose that, due to lack of data, one of the class-conditional probabilities becomes zero; it will make the whole product in equation (2) zero.
• It is then customary to smooth the estimates so that no probability is exactly zero:

  Original:   P(x_j \mid y_q) = \frac{N_{qj}}{N_q}

  Laplace:    P(x_j \mid y_q) = \frac{N_{qj} + 1}{N_q + d_j}

  m-estimate: P(x_j \mid y_q) = \frac{N_{qj} + m\,p}{N_q + m}

  where N_{qj} is the number of class-q training samples with the observed value of x_j, d_j is the number of distinct values x_j can take, p is a prior estimate of the probability (often uniform, p = 1/d_j), and m is the equivalent sample size (a smoothing weight).
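• A small sketch contrasting the raw, Laplace-smoothed, and m-estimate versions of the class-conditional estimate (the counts below are made up for illustration):

```python
# Smoothed estimates of P(x_j | y_q) from counts.
# N_qj: class-q samples with the observed attribute value; N_q: all class-q samples;
# d_j: number of distinct values of attribute x_j. All numbers below are illustrative.

def raw_estimate(N_qj, N_q):
    return N_qj / N_q

def laplace_estimate(N_qj, N_q, d_j):
    # Add-one smoothing: never returns exactly zero.
    return (N_qj + 1) / (N_q + d_j)

def m_estimate(N_qj, N_q, m, p):
    # p is a prior estimate of the probability (e.g. uniform 1/d_j), m the equivalent sample size.
    return (N_qj + m * p) / (N_q + m)

N_qj, N_q, d_j = 0, 8, 6                     # a value never seen with this class
print(raw_estimate(N_qj, N_q))               # 0.0 -- wipes out the whole product
print(laplace_estimate(N_qj, N_q, d_j))      # 1/14 ~= 0.071
print(m_estimate(N_qj, N_q, m=3, p=1/d_j))   # 0.5/11 ~= 0.045
```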
Gaussian Naive Bayes

• Gaussian Naive Bayes (GNB) is an extension of naive Bayes to continuous attributes.
• It is a classification technique used in machine learning (ML), based on the probabilistic approach and the Gaussian distribution.
• A univariate normal distribution for the attribute x_j is defined by

  p(x_j = x \mid y_q) = \frac{1}{\sigma_{qj} \sqrt{2\pi}} \exp\!\left( -\frac{1}{2} \left( \frac{x - \mu_{qj}}{\sigma_{qj}} \right)^2 \right)

• With reference to the general Bayes theorem, the naive Bayes classifier for continuous variables follows from equation (2):

  y_{NB} = \arg\max_q P(y_q) \prod_j p(x_j \mid y_q)
Example 2: Consider the dataset given in Table.

        Height    Weight    Foot size   Class
        x1        x2        x3          y
        (feet)    (lbs)     (inches)
s(1)    6         180       12          y1
s(2)    5.92      190       11          y1
s(3)    5.58      170       12          y1
s(4)    5.92      165       10          y1
s(5)    5.00      100       8           y2
s(6)    5.50      150       8           y2
s(7)    5.42      130       7           y2
s(8)    5.75      150       9           y2
Solution: Let \mu_{qj} be the mean of the values x_j (j = 1, 2, 3) associated with the class y_q (q = 1, 2), and \sigma^2_{qj} be its variance.

\mu_{qj} = \frac{1}{N_q} \sum_i x_j^{(i)}, which gives

\mu_{11} = 5.855, \mu_{12} = 176.25, \mu_{13} = 11.25,
\mu_{21} = 5.4175, \mu_{22} = 132.5, \mu_{23} = 8

\sigma^2_{qj} = \frac{1}{N_q} \sum_i (x_j^{(i)} - \mu_{qj})^2, which gives

\sigma^2_{11} = 0.0263, \sigma^2_{12} = 92.1875, \sigma^2_{13} = 0.6875,
\sigma^2_{21} = 0.0729, \sigma^2_{22} = 418.75, \sigma^2_{23} = 0.5
Testing sample: x_1 = 6, x_2 = 130, x_3 = 8

The class-conditional probability densities p(x_j \mid y_q) are calculated as:

p(x_1 \mid y_1) = 1.65,   p(x_2 \mid y_1) = 3.76 \times 10^{-7},   p(x_3 \mid y_1) = 2.21 \times 10^{-4}
p(x_1 \mid y_2) = 0.145,  p(x_2 \mid y_2) = 0.019,                 p(x_3 \mid y_2) = 0.564

Class\ k = \arg\max_q P(y_q) \prod_j p(x_j \mid y_q)
         = \arg\max_q p(x_1 \mid y_q)\, p(x_2 \mid y_q)\, p(x_3 \mid y_q)\, P(y_q)

This gives k = 2; therefore, the test sample is associated with class y_2 (‘female’).
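A self-contained sketch reproducing Example 2: it estimates the per-class means and (population) variances from the table, evaluates the univariate Gaussian densities at the test sample, and picks the class with the largest P(y_q) ∏_j p(x_j | y_q). Variable names are illustrative.

```python
import math

# Example 2 data: (height ft, weight lbs, foot size in) per class
data = {
    "y1": [(6.00, 180, 12), (5.92, 190, 11), (5.58, 170, 12), (5.92, 165, 10)],
    "y2": [(5.00, 100, 8), (5.50, 150, 8), (5.42, 130, 7), (5.75, 150, 9)],
}
x_test = (6.0, 130.0, 8.0)

def gaussian_pdf(x, mu, var):
    """Univariate normal density p(x_j = x | y_q)."""
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2 * math.pi * var)

N = sum(len(rows) for rows in data.values())
scores = {}
for q, rows in data.items():
    prior = len(rows) / N                                        # P(y_q) = N_q / N = 0.5 here
    score = prior
    for j in range(3):
        values = [row[j] for row in rows]
        mu = sum(values) / len(values)                           # mu_qj
        var = sum((v - mu) ** 2 for v in values) / len(values)   # sigma^2_qj (1/N_q form)
        score *= gaussian_pdf(x_test[j], mu, var)
    scores[q] = score

print(scores)                       # y1 ~ 6.9e-11, y2 ~ 7.9e-04 (matches the slide up to rounding)
print(max(scores, key=scores.get))  # 'y2' -> 'female'
```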
Confusion Matrix

• A confusion matrix is a table that is used to describe the performance of a classification algorithm.
• A prediction on the test set has four possible outcomes, depicted in Table 1.

                                     Hypothesized class (prediction)
                                     Classified +ve    Classified −ve
  Actual class      Actual +ve            TP                FN
  (observation)     Actual −ve            FP                TN

  Table 1: Confusion Matrix


• The true positive (TP) and the true negative (TN) are
accurate classifications.
• A false positive (FP) takes place when the result is
inaccurately predicted as positive when it is negative in
reality.
• A false negative (FN) is said to occur when the result is
inaccurately predicted as negative when in reality it is
positive.
• Misclassification error: The overall success rate on a given test set is the number of correct classifications divided by the total number of classifications:

  Success rate = \frac{TP + TN}{TP + TN + FP + FN}

  The misclassification rate of a classifier is simply (1 − recognition rate):

  Misclassification rate = \frac{FP + FN}{TP + TN + FP + FN}

• Sensitivity = True Positive Rate = \frac{TP}{TP + FN}

• Specificity = True Negative Rate = \frac{TN}{FP + TN}

• 1 − Specificity = False Positive Rate = (1 − True Negative Rate):

  1 − Specificity = \frac{FP}{FP + TN}  (the fp rate)

• Sensitivity and Specificity may not be useful for


imbalanced data
• In such cases, precision-recall metrics may be used
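• A small sketch computing these confusion-matrix metrics from raw counts (the TP/FP/FN/TN numbers are made up for illustration):

```python
# Confusion-matrix metrics from the four counts; the counts themselves are illustrative.
TP, FN, FP, TN = 40, 10, 5, 45
total = TP + TN + FP + FN

success_rate = (TP + TN) / total              # also called accuracy / recognition rate
misclassification_rate = (FP + FN) / total    # = 1 - success_rate
sensitivity = TP / (TP + FN)                  # true positive rate
specificity = TN / (FP + TN)                  # true negative rate
fp_rate = FP / (FP + TN)                      # = 1 - specificity

print(success_rate, misclassification_rate)   # 0.85 0.15
print(sensitivity, specificity, fp_rate)      # 0.8 0.9 0.1
```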
ROC Curves

• The true positives, true negatives, false positives and false


negatives have different costs and benefits (or risks and
gains) with respect to a classification model.
• ROC stands for Receiver Operating Characteristic; ROC curves were developed in the 1950s to separate signal from noise in radar communication.
• The ROC Graph (a two-dimensional graph) plots sensitivity
on the y-axis and complement of the specificity on the x-
axis.
• An ROC graph, hence, shows relative trade-offs between
advantages (true positives) and costs (false positives).
• Each value of decision threshold corresponds to a point on
ROC curve.
Figure 1: ROC Curve
AUC = area under the ROC curve; it varies between 0.5 and 1 (the larger, the better).
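• A minimal sketch of how an ROC curve and its AUC can be traced by sweeping the decision threshold over predicted scores (the scores and labels below are made up; each threshold yields one (FPR, TPR) point, as stated above):

```python
# Trace an ROC curve by sweeping the decision threshold; illustrative scores and labels.
scores = [0.95, 0.85, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]   # classifier scores for 8 test samples
labels = [1, 1, 0, 1, 0, 1, 0, 0]                      # 1 = actual positive, 0 = actual negative

P = sum(labels)            # number of actual positives
N = len(labels) - P        # number of actual negatives

points = []
for threshold in sorted(set(scores), reverse=True):
    predicted = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for p, l in zip(predicted, labels) if p == 1 and l == 1)
    fp = sum(1 for p, l in zip(predicted, labels) if p == 1 and l == 0)
    points.append((fp / N, tp / P))                    # (FPR, TPR) for this threshold

points = [(0.0, 0.0)] + points                         # start of the curve
# AUC by the trapezoidal rule over the (FPR, TPR) points.
auc = sum((x2 - x1) * (y1 + y2) / 2
          for (x1, y1), (x2, y2) in zip(points, points[1:]))
print(points)
print(auc)
```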
Precision-Recall Curves
• Information retrieval problems focus on “relevance”.
• Precision is a metric for the relevancy of prediction results: what proportion of positive identifications was actually correct?

  Precision = \frac{TP}{TP + FP}

• Precision is the fraction of retrieved documents that are actually relevant.
• Recall is a metric for how many of the truly relevant results are retrieved: what proportion of actual positives was identified correctly?

  Recall = \frac{TP}{TP + FN}
Figure 2: Precision-Recall Curve
Figure 3: Precision-Recall Curve vs ROC Curve
F-Score

• An F-score is the weighted harmonic mean of the precision and recall values:

  F\text{-score} = \frac{(\beta^2 + 1) \cdot \text{Precision} \cdot \text{Recall}}{\beta^2 \cdot \text{Precision} + \text{Recall}}

• The default balanced F-score weighs precision and recall equally (\beta = 1). It is commonly written as F_1:

  F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}

• Values of \beta < 1 put more weight on precision than recall, while values of \beta > 1 emphasize recall.
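• A brief sketch computing precision, recall, and the F_beta score from confusion-matrix counts (counts are illustrative; the beta values show how the weighting shifts):

```python
# Precision, recall and F-beta from illustrative confusion-matrix counts.
TP, FP, FN = 40, 5, 10

precision = TP / (TP + FP)                     # ~0.889
recall = TP / (TP + FN)                        # 0.8

def f_beta(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall."""
    b2 = beta ** 2
    return (b2 + 1) * precision * recall / (b2 * precision + recall)

print(round(f_beta(precision, recall, beta=1.0), 3))   # F1 ~ 0.842
print(round(f_beta(precision, recall, beta=0.5), 3))   # favours precision
print(round(f_beta(precision, recall, beta=2.0), 3))   # favours recall
```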
F-Score

• F1 summarizes the model effectiveness for a specific decision threshold.
• The AUC of an ROC curve summarizes effectiveness across all thresholds.
• For the F-score to be high, both precision and recall have to be high, because the harmonic mean is used.
• The F-score varies between 0 and 1: 1 means perfect precision and recall, and the score is 0 if either precision or recall is zero.
