
Bayesian Reasoning

Reference

Gopal, M. (2019). Applied Machine Learning. McGraw-Hill Education.
Bayesian Reasoning

• Bayesian models associate a probability with each decision.
• Bayesian learning techniques are relevant to the study of machine learning for two separate reasons:
  1. Bayesian learning algorithms that compute explicit probabilities, for example the naive Bayes classifier, are among the most practical approaches, especially for large datasets and NLP.
  2. They give a meaningful perspective for understanding various learning algorithms that do not explicitly manipulate probabilities.
Bayes Theorem

• Bayes' theorem states that the conditional probability of an event, given the occurrence of another event, is equal to the likelihood of the second event given the first event multiplied by the probability of the first event.
• We use Bayes' theorem for the following problem setting:

  D : \{ (x^{(i)}, y^{(i)});\ i = 1, 2, \ldots, N \}

  with patterns x = (x_1\ x_2\ \ldots\ x_n)^T
• We consider y to be a random variable that must be described probabilistically:

  y : (y_1, y_2, \ldots, y_q, \ldots, y_M)

  where y_q;\ q = 1, \ldots, M corresponds to class q \in \{1, \ldots, M\}.
• The distribution over all possible values of the discrete random variable y is expressed as a probability distribution:

  P(y) = (P(y_1), \ldots, P(y_M)); \qquad P(y_1) + \ldots + P(y_M) = 1

• Known priors: P(y_q)
• Bayes' theorem provides a way to obtain the posterior P(y_k \mid x);\ k \in \{1, \ldots, M\} from the known priors P(y_q), using the known conditional probabilities P(x \mid y_q);\ q = 1, \ldots, M:

  P(y_k \mid x) = \frac{P(y_k)\, P(x \mid y_k)}{P(x)}, \qquad P(x) = \sum_{q=1}^{M} P(x \mid y_q)\, P(y_q)
• P(x) expresses the variability of the observed data, independent of the class.
• P(x \mid y_k) is called the class likelihood; it is the conditional probability that a pattern belonging to class y_k has the associated observation value x.

  Posterior = \frac{Prior \times Likelihood}{Evidence}
• The posterior can therefore be calculated as:

  P(y_k \mid x) = \frac{P(y_k)\, P(x \mid y_k)}{\sum_{q=1}^{M} P(x \mid y_q)\, P(y_q)}
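• As a concrete illustration of this computation, the sketch below evaluates the posterior for a small two-class problem; the prior and likelihood numbers are illustrative assumptions, not taken from the text.

```python
# Minimal sketch of Bayes' rule: posterior from priors and class likelihoods.
# The numbers below are illustrative assumptions, not from the lecture notes.

priors = {"y1": 0.6, "y2": 0.4}          # P(y_q), assumed known
likelihoods = {"y1": 0.2, "y2": 0.7}     # P(x | y_q) for one observed pattern x

# Evidence: P(x) = sum_q P(x | y_q) P(y_q)
evidence = sum(likelihoods[q] * priors[q] for q in priors)

# Posterior: P(y_q | x) = P(y_q) P(x | y_q) / P(x)
posteriors = {q: likelihoods[q] * priors[q] / evidence for q in priors}

print(posteriors)   # {'y1': 0.3, 'y2': 0.7} -- the posteriors sum to 1
```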

• We can determine the Maximum A Posteriori (MAP) class by choosing:

  Class k if P(y_k \mid x) = \max_q P(y_q \mid x)

• Thus, y_{MAP} corresponds to the MAP class provided:

  y_{MAP} \equiv \arg\max_q P(y_q \mid x)
        = \arg\max_q \frac{P(y_q)\, P(x \mid y_q)}{P(x)}
        \equiv \arg\max_q P(y_q)\, P(x \mid y_q)        (1)
• P(x \mid y_q) represents the likelihood of the data x given class y_q.
• In some cases, classes are assumed to be equally probable, P(y_k) = P(y_q)\ \forall k, q; then only the likelihood needs to be considered.

• Any class that maximizes P(x \mid y_q) is called the Maximum Likelihood (ML) class. Thus, y_{ML} corresponds to the ML class provided:

  y_{ML} \equiv \arg\max_q P(x \mid y_q)
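• Continuing the illustrative numbers from the posterior sketch above, a short sketch of the difference between the MAP and ML decisions (all quantities are assumed, not from the text):

```python
# MAP vs ML class selection on an assumed two-class example.
priors = {"y1": 0.6, "y2": 0.4}          # P(y_q)
likelihoods = {"y1": 0.2, "y2": 0.7}     # P(x | y_q)

# MAP class: argmax_q P(y_q) * P(x | y_q)   (the common denominator P(x) is dropped)
y_map = max(priors, key=lambda q: priors[q] * likelihoods[q])

# ML class: argmax_q P(x | y_q)   (equivalent to MAP when all priors are equal)
y_ml = max(likelihoods, key=lambda q: likelihoods[q])

print(y_map, y_ml)   # here both give 'y2'; they can differ when the priors are skewed
```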
Disadvantages of Bayes’ Classifier
• Requires initial knowledge of prior probability
𝑃(𝑦𝑞 ) and likelihood 𝑃(𝑥|𝑦𝑞 )
• In real world problems, these probabilities are not
known in advance
• With the knowledge of the probabilistic structure
of the problem, conditional densities can be
parameterized
• In most pattern recognition problems, assumption
of knowledge of probability structure is not always
valid
• Classical parametric models are unimodal, but
multimodal densities are found in many real
problems
Parameter Estimation and Dependencies
• It is easier to estimate the conditional density parameters, if the
probability structure is known

• E.g., if it is known that P(x \mid y_q) \sim N(\mu_q, \sigma_q^2), it is simpler to estimate \mu_q and \sigma_q^2.

• Sometimes parameterized density functions are not enough, as


there are statistical dependencies or causal relationships among
the features

• When such relationships are known, the dependencies can be


represented with the help of Bayesian Belief Networks

• If the dependency structure is unknown, we proceed by the most


basic assumption: features are conditionally independent given the
class
Naive Bayes Classifier

• Sometimes, very simple algorithms perform quite well

• Naive Bayes is one of the most widely used algorithms for classification problems.
• It is derived from Bayes' theorem and is very useful for high-dimensional datasets and text classification.
• Naive Bayes assumes conditional independence of the features, an assumption that Bayes' theorem itself does not make.
• Naive Bayes considers all features as equally important and independent of each other.
Naive Bayes Classifier
• Assume the features are categorical.
• Continuous features can be converted to categorical features by creating bins.
• To obtain P(y_k \mid x);\ k \in \{1, \ldots, M\}, we must specify P(y_q) and P(x \mid y_q).
• P(y_q) (if prior knowledge is not available) may be estimated simply by counting the frequency with which class y_q occurs in the training data:

  P(y_q) = \frac{\text{Number of data with class } y_q}{\text{Total number } N \text{ of data}}

• Class-conditional probabilities P(x \mid y_q) can be estimated as:

  P(x \mid y_q) = \frac{\text{Number of times pattern } x \text{ appears with class } y_q}{\text{Number of times } y_q \text{ appears in the data}}
Naive Bayes Classifier
• The assumption is that, given the class of the pattern, the probability of observing the conjunction x_1, x_2, \ldots, x_n is just the product of the probabilities of the individual attributes (conditional independence):

  P(x_1, x_2, \ldots, x_n \mid y_q) = \prod_j P(x_j \mid y_q)

• Substituting this into Equation (1), we have the naive Bayes algorithm:

  y_{NB} \equiv \arg\max_q P(y_q) \prod_j P(x_j \mid y_q)        (2)

  where y_{NB} is the class output.
• The number of P(x_j \mid y_q) terms is given by the number of distinct attributes (n) times the number of classes M.
• The values are estimated simply by counting the frequency of data combinations within the training sample (a counting sketch is given below).
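• The sketch below implements this counting-based naive Bayes for categorical features. The function names and the tiny dataset are illustrative assumptions, not from the text.

```python
from collections import Counter, defaultdict

def fit_naive_bayes(X, y):
    """Estimate P(y_q) and P(x_j | y_q) by counting frequencies in the training data."""
    N = len(y)
    class_counts = Counter(y)                               # N_q for each class
    priors = {q: class_counts[q] / N for q in class_counts}
    # cond_counts[q][j][value] = number of class-q samples whose j-th attribute equals value
    cond_counts = defaultdict(lambda: defaultdict(Counter))
    for xi, yi in zip(X, y):
        for j, value in enumerate(xi):
            cond_counts[yi][j][value] += 1
    return priors, cond_counts, class_counts

def predict_naive_bayes(x, priors, cond_counts, class_counts):
    """y_NB = argmax_q P(y_q) * prod_j P(x_j | y_q), with probabilities from raw counts."""
    scores = {}
    for q in priors:
        score = priors[q]
        for j, value in enumerate(x):
            score *= cond_counts[q][j][value] / class_counts[q]
        scores[q] = score
    return max(scores, key=scores.get)

# Illustrative categorical data: (outlook, windy) -> play / no-play (assumed values).
X = [("sunny", "no"), ("sunny", "yes"), ("rainy", "yes"), ("rainy", "no"), ("sunny", "no")]
y = ["play", "no-play", "no-play", "play", "play"]
model = fit_naive_bayes(X, y)
print(predict_naive_bayes(("sunny", "no"), *model))   # -> 'play'
```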
Naive Bayes Classifier
• If the features are continuous, then discretization gives us categorical values V_{x_j}.
• If x_j can take d_j countable values, then

  V_{x_j} = \{ v^1_{x_j}, v^2_{x_j}, v^3_{x_j}, \ldots, v^{d_j}_{x_j} \} = \{ v^l_{x_j};\ l = 1, 2, \ldots, d_j \}

• Let the value of x_j be v^l_{x_j}. Then

  P(x_j \mid y_q) = \frac{N_q^{v^l_{x_j}}}{N_q}

  where N_q^{v^l_{x_j}} is the number of training samples of class y_q having the value v^l_{x_j} for attribute x_j, and N_q is the total number of training samples of class y_q.
• Class prior probabilities may be calculated as

  P(y_q) = \frac{N_q}{N}

  where N is the total number of training samples and N_q is the number of samples of class y_q.
Example 1: Consider the dataset D given in Table.

          Gender   Height   Sport      y
          (x1)     (x2)
s(1)      F        1.6 m    Cricket    y1
s(2)      M        2 m      Football   y3
s(3)      F        1.9 m    Tennis     y2
s(4)      F        1.88 m   Tennis     y2
s(5)      F        1.7 m    Cricket    y1
s(6)      M        1.85 m   Tennis     y2
s(7)      F        1.6 m    Cricket    y1
s(8)      M        1.7 m    Cricket    y1
s(9)      M        2.2 m    Football   y3
s(10)     M        2.1 m    Football   y3
s(11)     F        1.8 m    Tennis     y2
s(12)     M        1.95 m   Tennis     y2
s(13)     F        1.9 m    Tennis     y2
s(14)     F        1.8 m    Tennis     y2
s(15)     F        1.75 m   Tennis     y2
Solution: y1 corresponds to the class ‘Cricket’, y2 corresponds to the class ‘Tennis’, and y3 corresponds to the class ‘Football’. Therefore,

M = 3, N = 15

P(y_1) = \frac{N_1}{N} = \frac{4}{15} = 0.267

P(y_2) = \frac{N_2}{N} = \frac{8}{15} = 0.533

P(y_3) = \frac{N_3}{N} = \frac{3}{15} = 0.2

V_{x_1} = \{M, F\} = \{ v^1_{x_1}, v^2_{x_1} \};\ d_1 = 2

V_{x_2} = \{ v^1_{x_2}, v^2_{x_2}, v^3_{x_2}, v^4_{x_2}, v^5_{x_2}, v^6_{x_2} \};\ d_2 = 6
        = bins \{(0, 1.6], (1.6, 1.7], (1.7, 1.8], (1.8, 1.9], (1.9, 2.0], (2.0, \infty)\}
The count table generated from the data is given in Table.

Table: Number of training samples, N_q^{v^l_{x_j}}, of class q having value v^l_{x_j}

Value v^l_{x_j}               Cricket    Tennis    Football
                              q = 1      q = 2     q = 3
v^1_{x_1}: M                  1          2         3
v^2_{x_1}: F                  3          6         0
v^1_{x_2}: (0, 1.6] bin       2          0         0
v^2_{x_2}: (1.6, 1.7] bin     2          0         0
v^3_{x_2}: (1.7, 1.8] bin     0          3         0
v^4_{x_2}: (1.8, 1.9] bin     0          4         0
v^5_{x_2}: (1.9, 2.0] bin     0          1         1
v^6_{x_2}: (2.0, ∞) bin       0          0         2
We consider an instance from the given dataset (the same procedure applies for a data tuple not in the given dataset, i.e., an unseen instance):

x : {M, 1.95 m} = {x_1, x_2}

In the discretized domain, ‘M’ corresponds to v^1_{x_1} and ‘1.95 m’ corresponds to v^5_{x_2}.

P(x_1 \mid y_1) = \frac{N_1^{v^1_{x_1}}}{N_1} = \frac{1}{4}

P(x_1 \mid y_2) = \frac{N_2^{v^1_{x_1}}}{N_2} = \frac{2}{8}

P(x_1 \mid y_3) = \frac{N_3^{v^1_{x_1}}}{N_3} = \frac{3}{3}

P(x_2 \mid y_1) = \frac{N_1^{v^5_{x_2}}}{N_1} = \frac{0}{4}

P(x_2 \mid y_2) = \frac{N_2^{v^5_{x_2}}}{N_2} = \frac{1}{8}

P(x_2 \mid y_3) = \frac{N_3^{v^5_{x_2}}}{N_3} = \frac{1}{3}

P(x \mid y_1) = P(x_1 \mid y_1) \cdot P(x_2 \mid y_1) = \frac{1}{4} \cdot 0 = 0

P(x \mid y_2) = P(x_1 \mid y_2) \cdot P(x_2 \mid y_2) = \frac{2}{8} \cdot \frac{1}{8} = \frac{1}{32}

P(x \mid y_3) = P(x_1 \mid y_3) \cdot P(x_2 \mid y_3) = \frac{3}{3} \cdot \frac{1}{3} = \frac{1}{3}

P(x \mid y_1)\, P(y_1) = 0 \times 0.267 = 0

P(x \mid y_2)\, P(y_2) = \frac{1}{32} \times 0.533 = 0.0167

P(x \mid y_3)\, P(y_3) = \frac{1}{3} \times 0.2 = 0.0667

y_{NB} = \arg\max_q P(x \mid y_q)\, P(y_q)

This gives q = 3.

Therefore, for the pattern x = {M, 1.95 m}, the predicted class is ‘Football’.

The true class in the data table is ‘Tennis’. Note that we are working with an artificial toy dataset; applying the naive Bayes algorithm to real-life datasets with large N brings out the power of the naive Bayes classifier.
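The following self-contained sketch reproduces Example 1 numerically (the data, bins, and expected result are those in the tables above; the helper names are illustrative):

```python
from collections import Counter

# Example 1 data: (gender, height in metres, class)
data = [("F", 1.6, "y1"), ("M", 2.0, "y3"), ("F", 1.9, "y2"), ("F", 1.88, "y2"),
        ("F", 1.7, "y1"), ("M", 1.85, "y2"), ("F", 1.6, "y1"), ("M", 1.7, "y1"),
        ("M", 2.2, "y3"), ("M", 2.1, "y3"), ("F", 1.8, "y2"), ("M", 1.95, "y2"),
        ("F", 1.9, "y2"), ("F", 1.8, "y2"), ("F", 1.75, "y2")]

def height_bin(h):
    """Discretize height into the bins (0,1.6], (1.6,1.7], ..., (2.0, inf)."""
    edges = [1.6, 1.7, 1.8, 1.9, 2.0]
    for i, e in enumerate(edges):
        if h <= e:
            return i
    return len(edges)

N = len(data)
Nq = Counter(c for _, _, c in data)                        # class counts N_q
classes = sorted(Nq)

def cond_prob(feature_index, value, q):
    """P(x_j = value | y_q) estimated by counting, as in the count table."""
    count = sum(1 for row in data
                if row[2] == q and
                (row[0] if feature_index == 0 else height_bin(row[1])) == value)
    return count / Nq[q]

# Test pattern x = {M, 1.95 m}
x_gender, x_bin = "M", height_bin(1.95)                    # 1.95 m falls in the (1.9, 2.0] bin
scores = {q: (Nq[q] / N) * cond_prob(0, x_gender, q) * cond_prob(1, x_bin, q)
          for q in classes}
print(scores)                       # approx {'y1': 0.0, 'y2': 0.0167, 'y3': 0.0667}
print(max(scores, key=scores.get))  # 'y3' -> 'Football', matching the worked example
```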
Naïve Bayes

• Suppose that, due to lack of data, one of the class-conditional probabilities becomes zero; it will make the whole product in equation (2) zero.
• It is then customary to smooth the estimates so that no probability is exactly zero:

  Original:   P(x_j \mid y_q) = \frac{N_{qj}}{N_q}

  Laplace:    P(x_j \mid y_q) = \frac{N_{qj} + 1}{N_q + d_j}

  m-estimate: P(x_j \mid y_q) = \frac{N_{qj} + m\,p}{N_q + m}

  where N_{qj} is the number of class-q training samples with the observed value of x_j, d_j is the number of distinct values x_j can take, p is a prior estimate of the probability (often uniform, p = 1/d_j), and m is the equivalent sample size (a smoothing weight).
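• A small sketch contrasting the raw, Laplace-smoothed, and m-estimate versions of the class-conditional estimate (the counts below are made up for illustration):

```python
# Smoothed estimates of P(x_j | y_q) from counts.
# N_qj: class-q samples with the observed attribute value; N_q: all class-q samples;
# d_j: number of distinct values of attribute x_j. All numbers below are illustrative.

def raw_estimate(N_qj, N_q):
    return N_qj / N_q

def laplace_estimate(N_qj, N_q, d_j):
    # Add-one smoothing: never returns exactly zero.
    return (N_qj + 1) / (N_q + d_j)

def m_estimate(N_qj, N_q, m, p):
    # p is a prior estimate of the probability (e.g. uniform 1/d_j), m the equivalent sample size.
    return (N_qj + m * p) / (N_q + m)

N_qj, N_q, d_j = 0, 8, 6                     # a value never seen with this class
print(raw_estimate(N_qj, N_q))               # 0.0 -- wipes out the whole product
print(laplace_estimate(N_qj, N_q, d_j))      # 1/14 ~= 0.071
print(m_estimate(N_qj, N_q, m=3, p=1/d_j))   # 0.5/11 ~= 0.045
```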
Gaussian Naive Bayes

• Gaussian Naive Bayes (GNB) is an extension of naive Bayes to continuous attributes.
• It is a classification technique used in machine learning (ML), based on the probabilistic approach and the Gaussian distribution.
• A univariate normal distribution for the attribute x_j is defined by

  p(x_j = x \mid y_q) = \frac{1}{\sigma_{qj} \sqrt{2\pi}} \exp\!\left( -\frac{1}{2} \left( \frac{x - \mu_{qj}}{\sigma_{qj}} \right)^2 \right)

• With reference to the general Bayes theorem, the naive Bayes classifier for continuous variables follows from equation (2):

  y_{NB} = \arg\max_q P(y_q) \prod_j p(x_j \mid y_q)
Example 2: Consider the dataset given in Table.

        Height    Weight    Foot size   Class
        x1        x2        x3          y
        (feet)    (lbs)     (inches)
s(1)    6         180       12          y1
s(2)    5.92      190       11          y1
s(3)    5.58      170       12          y1
s(4)    5.92      165       10          y1
s(5)    5.00      100       8           y2
s(6)    5.50      150       8           y2
s(7)    5.42      130       7           y2
s(8)    5.75      150       9           y2
Solution: Let \mu_{qj} be the mean of the values x_j (j = 1, 2, 3) associated with the class y_q (q = 1, 2), and \sigma^2_{qj} be its variance.

\mu_{qj} = \frac{1}{N_q} \sum_i x_j^{(i)}, which gives

\mu_{11} = 5.855, \mu_{12} = 176.25, \mu_{13} = 11.25,
\mu_{21} = 5.4175, \mu_{22} = 132.5, \mu_{23} = 8

\sigma^2_{qj} = \frac{1}{N_q} \sum_i (x_j^{(i)} - \mu_{qj})^2, which gives

\sigma^2_{11} = 0.0263, \sigma^2_{12} = 92.1875, \sigma^2_{13} = 0.6875,
\sigma^2_{21} = 0.0729, \sigma^2_{22} = 418.75, \sigma^2_{23} = 0.5
Testing sample: x_1 = 6, x_2 = 130, x_3 = 8

The class-conditional probability densities p(x_j \mid y_q) are calculated as:

p(x_1 \mid y_1) = 1.65,   p(x_2 \mid y_1) = 3.76 \times 10^{-7},   p(x_3 \mid y_1) = 2.21 \times 10^{-4}
p(x_1 \mid y_2) = 0.145,  p(x_2 \mid y_2) = 0.019,                 p(x_3 \mid y_2) = 0.564

Class\ k = \arg\max_q P(y_q) \prod_j p(x_j \mid y_q)
         = \arg\max_q p(x_1 \mid y_q)\, p(x_2 \mid y_q)\, p(x_3 \mid y_q)\, P(y_q)

This gives k = 2; therefore, the test sample is associated with class y_2 (‘female’).
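A self-contained sketch reproducing Example 2: it estimates the per-class means and (population) variances from the table, evaluates the univariate Gaussian densities at the test sample, and picks the class with the largest P(y_q) ∏_j p(x_j | y_q). Variable names are illustrative.

```python
import math

# Example 2 data: (height ft, weight lbs, foot size in) per class
data = {
    "y1": [(6.00, 180, 12), (5.92, 190, 11), (5.58, 170, 12), (5.92, 165, 10)],
    "y2": [(5.00, 100, 8), (5.50, 150, 8), (5.42, 130, 7), (5.75, 150, 9)],
}
x_test = (6.0, 130.0, 8.0)

def gaussian_pdf(x, mu, var):
    """Univariate normal density p(x_j = x | y_q)."""
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2 * math.pi * var)

N = sum(len(rows) for rows in data.values())
scores = {}
for q, rows in data.items():
    prior = len(rows) / N                                        # P(y_q) = N_q / N = 0.5 here
    score = prior
    for j in range(3):
        values = [row[j] for row in rows]
        mu = sum(values) / len(values)                           # mu_qj
        var = sum((v - mu) ** 2 for v in values) / len(values)   # sigma^2_qj (1/N_q form)
        score *= gaussian_pdf(x_test[j], mu, var)
    scores[q] = score

print(scores)                       # y1 ~ 6.9e-11, y2 ~ 7.9e-04 (matches the slide up to rounding)
print(max(scores, key=scores.get))  # 'y2' -> 'female'
```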
Confusion Matrix

• A confusion matrix is a table that is used to describe the performance of a classification algorithm.
• A prediction on the test set has four possible outcomes, depicted in Table 1.

                                     Hypothesized class (prediction)
                                     Classified +ve    Classified −ve
  Actual class      Actual +ve            TP                FN
  (observation)     Actual −ve            FP                TN

  Table 1: Confusion Matrix


• The true positive (TP) and the true negative (TN) are
accurate classifications.
• A false positive (FP) takes place when the result is
inaccurately predicted as positive when it is negative in
reality.
• A false negative (FN) is said to occur when the result is
inaccurately predicted as negative when in reality it is
positive.
• Misclassification error: The overall success rate on a given test set is the number of correct classifications divided by the total number of classifications:

  Success rate = \frac{TP + TN}{TP + TN + FP + FN}

  The misclassification rate of a classifier is simply (1 − recognition rate):

  Misclassification rate = \frac{FP + FN}{TP + TN + FP + FN}

• Sensitivity = True Positive Rate = \frac{TP}{TP + FN}

• Specificity = True Negative Rate = \frac{TN}{FP + TN}

• 1 − Specificity = False Positive Rate = (1 − True Negative Rate):

  1 − Specificity = \frac{FP}{FP + TN}  (the fp rate)

• Sensitivity and Specificity may not be useful for


imbalanced data
• In such cases, precision-recall metrics may be used
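• A small sketch computing these confusion-matrix metrics from raw counts (the TP/FP/FN/TN numbers are made up for illustration):

```python
# Confusion-matrix metrics from the four counts; the counts themselves are illustrative.
TP, FN, FP, TN = 40, 10, 5, 45
total = TP + TN + FP + FN

success_rate = (TP + TN) / total              # also called accuracy / recognition rate
misclassification_rate = (FP + FN) / total    # = 1 - success_rate
sensitivity = TP / (TP + FN)                  # true positive rate
specificity = TN / (FP + TN)                  # true negative rate
fp_rate = FP / (FP + TN)                      # = 1 - specificity

print(success_rate, misclassification_rate)   # 0.85 0.15
print(sensitivity, specificity, fp_rate)      # 0.8 0.9 0.1
```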
ROC Curves

• The true positives, true negatives, false positives and false


negatives have different costs and benefits (or risks and
gains) with respect to a classification model.
• ROC stands for Receiver Operating Characteristic; ROC curves were developed in the 1950s to separate signal from noise in radar communication.
• The ROC Graph (a two-dimensional graph) plots sensitivity
on the y-axis and complement of the specificity on the x-
axis.
• An ROC graph, hence, shows relative trade-offs between
advantages (true positives) and costs (false positives).
• Each value of decision threshold corresponds to a point on
ROC curve.
Figure 1: ROC Curve
AUC = area under the ROC curve; it varies between 0.5 and 1 (the larger, the better).
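• A minimal sketch of how an ROC curve and its AUC can be traced by sweeping the decision threshold over predicted scores (the scores and labels below are made up; each threshold yields one (FPR, TPR) point, as stated above):

```python
# Trace an ROC curve by sweeping the decision threshold; illustrative scores and labels.
scores = [0.95, 0.85, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]   # classifier scores for 8 test samples
labels = [1, 1, 0, 1, 0, 1, 0, 0]                      # 1 = actual positive, 0 = actual negative

P = sum(labels)            # number of actual positives
N = len(labels) - P        # number of actual negatives

points = []
for threshold in sorted(set(scores), reverse=True):
    predicted = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for p, l in zip(predicted, labels) if p == 1 and l == 1)
    fp = sum(1 for p, l in zip(predicted, labels) if p == 1 and l == 0)
    points.append((fp / N, tp / P))                    # (FPR, TPR) for this threshold

points = [(0.0, 0.0)] + points                         # start of the curve
# AUC by the trapezoidal rule over the (FPR, TPR) points.
auc = sum((x2 - x1) * (y1 + y2) / 2
          for (x1, y1), (x2, y2) in zip(points, points[1:]))
print(points)
print(auc)
```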
Precision-Recall Curves
• Information retrieval problems focus on “relevance”.
• Precision is a metric for the relevancy of prediction results: what proportion of positive identifications was actually correct?

  Precision = \frac{TP}{TP + FP}

• Precision is the fraction of retrieved documents that are actually relevant.
• Recall is a metric for how many of the truly relevant results are retrieved: what proportion of actual positives was identified correctly?

  Recall = \frac{TP}{TP + FN}
Figure 2: Precision-Recall Curve
Figure 3: Precision-Recall Curve vs ROC Curve
F-Score

• An F-score is the weighted harmonic mean of the precision and recall values:

  F\text{-score} = \frac{(\beta^2 + 1) \cdot \text{Precision} \cdot \text{Recall}}{\beta^2 \cdot \text{Precision} + \text{Recall}}

• The default balanced F-score weighs precision and recall equally (\beta = 1). It is commonly written as F_1:

  F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}

• Values of \beta < 1 put more weight on precision than recall, while values of \beta > 1 emphasize recall.
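• A brief sketch computing precision, recall, and the F_beta score from confusion-matrix counts (counts are illustrative; the beta values show how the weighting shifts):

```python
# Precision, recall and F-beta from illustrative confusion-matrix counts.
TP, FP, FN = 40, 5, 10

precision = TP / (TP + FP)                     # ~0.889
recall = TP / (TP + FN)                        # 0.8

def f_beta(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall."""
    b2 = beta ** 2
    return (b2 + 1) * precision * recall / (b2 * precision + recall)

print(round(f_beta(precision, recall, beta=1.0), 3))   # F1 ~ 0.842
print(round(f_beta(precision, recall, beta=0.5), 3))   # favours precision
print(round(f_beta(precision, recall, beta=2.0), 3))   # favours recall
```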
F-Score

• F1 summarizes the model effectiveness for a specific decision threshold.
• The AUC of an ROC curve summarizes effectiveness across all thresholds.
• For the F-score to be high, both precision and recall have to be high, because the harmonic mean is used.
• The F-score varies between 0 and 1: 1 means perfect precision and recall, and the score is 0 if either precision or recall is zero.
