BASAV ROYCHOUDHURY
What do we learn?
Understand classification
Naïve rule for classification
Naïve Bayes rule for classification
K Nearest Neighbor classifier
Naïve Rule
Simplest possible classifier
Ignoring the predictor variables entirely, every new record is assigned to the majority class
For detecting "Fraud" in bank transactions, any new transaction would be classified as "Not fraud", since that is the majority class
Serves as a baseline against which the performance of more sophisticated classifiers is assessed
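As a rough sketch in plain Python (the 990/10 fraud counts are illustrative, not from the slides), the naive rule simply memorises the majority class and predicts it for every record:

```python
from collections import Counter

def naive_rule_fit(labels):
    """Return the majority class; every new record is predicted as this class."""
    return Counter(labels).most_common(1)[0][0]

# Hypothetical bank data: 990 legitimate transactions vs 10 frauds.
labels = ["Not fraud"] * 990 + ["Fraud"] * 10
majority = naive_rule_fit(labels)
print(majority)                               # Not fraud
print(labels.count(majority) / len(labels))   # 0.99 accuracy, yet every fraud is missed
```

The high accuracy despite zero fraud detection is exactly why this rule is only useful as a baseline.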
Naïve Bayes
Simplicity first – simple ideas often work very well
All attributes contribute to the decision
They are assumed to be equally important and independent of one another
Unrealistic, but the simplification often works well in practice
Useful when a large dataset is available
Google uses a Naïve Bayes classifier to correct misspellings that users type in
◦ Based on the possible pattern and associated words
Weather Dataset
Outlook Temperature Humidity Windy Play
Sunny Hot High False No
Sunny Hot High True No
Overcast Hot High False Yes
Rainy Mild High False Yes
Rainy Cool Normal False Yes
Rainy Cool Normal True No
Overcast Cool Normal True Yes
Sunny Mild High False No
Sunny Cool Normal False Yes
Rainy Mild Normal False Yes
Sunny Mild Normal True Yes
Overcast Mild High True Yes
Overcast Hot Normal False Yes
Rainy Mild High True No
Prediction – New Day
Conditional probabilities computed from the weather data (9 "Yes" days, 5 "No" days):

                 P(value | Yes)   P(value | No)
Outlook
  Sunny               2/9              3/5
  Overcast            4/9              0/5
  Rainy               3/9              2/5
Temperature
  Hot                 2/9              2/5
  Mild                4/9              2/5
  Cool                3/9              1/5
Humidity
  High                3/9              4/5
  Normal              6/9              1/5
Windy
  False               6/9              2/5
  True                3/9              3/5
Play (prior)          9/14             5/14

New day to classify: Outlook = Sunny, Temperature = Cool, Humidity = High, Windy = True
Prediction – New Day
Likelihood of Yes
= 2/9 × 3/9 × 3/9 × 3/9 × 9/14 = 0.0053
Likelihood of No
= 3/5 × 1/5 × 4/5 × 3/5 × 5/14 = 0.0206
Prediction – New Day
Probability of Yes
= 0.0053 / (0.0053 + 0.0206) = 20.5%
Probability of No
= 0.0206 / (0.0053 + 0.0206) = 79.5%
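The arithmetic above can be checked directly; a quick Python sketch, with the fractions read off the frequency table for the new day (Sunny, Cool, High, Windy = True):

```python
# Class-conditional probabilities for the new day, times the class prior.
p_yes = (2/9) * (3/9) * (3/9) * (3/9) * (9/14)   # likelihood of Yes, ~0.0053
p_no  = (3/5) * (1/5) * (4/5) * (3/5) * (5/14)   # likelihood of No,  ~0.0206

# Normalise so the two posterior probabilities sum to 1.
total = p_yes + p_no
print(round(p_yes / total, 3))   # 0.205  -> 20.5% Yes
print(round(p_no / total, 3))    # 0.795  -> 79.5% No
```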
Bayes Rule
For a hypothesis H borne out by evidence E on the hypothesis,
P(H|E) = P(E|H) × P(H) / P(E)
⇒ posterior = (likelihood × prior probability of proposition) / (prior probability of evidence)
H -> proposition or hypothesis, E -> evidence
In classification notation, for class c and predictor vector x:
P(y = c | x) = P(x | y = c) × P(y = c) / P(x)
P(x | y = c) – the likelihood function
Bayesian analysis requires the prior distribution P(y = c)
Bayes Rule, Cont'd
If prior information is not available, one can use a non-informative prior
If predictors are numeric, the probability distribution is assumed to be normal
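Under that normality assumption, the class-conditional likelihood of a numeric predictor is the normal density evaluated with that class's sample mean and standard deviation. A minimal sketch (the mean 73 and standard deviation 6.2 are illustrative values, not taken from the weather data above):

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Normal density, used as P(x_i | y = c) for a numeric predictor."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

# Likelihood of observing temperature 66, given a class whose temperatures
# have sample mean 73 and sample standard deviation 6.2 (hypothetical values).
print(round(gaussian_pdf(66, 73, 6.2), 4))   # 0.034
```

This density replaces the fraction-counting used for categorical predictors; everything else in the computation stays the same.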
Bayes Rule: Issue
The Bayes rule requires a large number of records for learning, especially in the presence of a large number of predictors
Naive Bayes – Bayes' theorem with the "naive" assumption of independence between every pair of features
◦ P(x | y = c) = ∏ᵢ P(xᵢ | y = c)
Naïve Bayes
Using
◦ P(x | y = c) = ∏ᵢ P(xᵢ | y = c)
the Naïve Bayes rule becomes
◦ P(y = c | x) = ∏ᵢ P(xᵢ | y = c) × P(y = c) / P(x)
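The rule can be sketched end-to-end in plain Python on the weather data; the function below is an illustrative implementation of this formula, not a library API, and it reproduces the 20.5% / 79.5% result computed by hand above:

```python
from collections import Counter

# Weather data from the table above: (Outlook, Temperature, Humidity, Windy, Play)
data = [
    ("Sunny", "Hot", "High", "False", "No"),    ("Sunny", "Hot", "High", "True", "No"),
    ("Overcast", "Hot", "High", "False", "Yes"),("Rainy", "Mild", "High", "False", "Yes"),
    ("Rainy", "Cool", "Normal", "False", "Yes"),("Rainy", "Cool", "Normal", "True", "No"),
    ("Overcast", "Cool", "Normal", "True", "Yes"),("Sunny", "Mild", "High", "False", "No"),
    ("Sunny", "Cool", "Normal", "False", "Yes"),("Rainy", "Mild", "Normal", "False", "Yes"),
    ("Sunny", "Mild", "Normal", "True", "Yes"), ("Overcast", "Mild", "High", "True", "Yes"),
    ("Overcast", "Hot", "Normal", "False", "Yes"),("Rainy", "Mild", "High", "True", "No"),
]

def naive_bayes_posterior(data, x_new):
    """P(y=c|x) = prod_i P(x_i|c) * P(c) / P(x), with P(x) as the normaliser."""
    class_counts = Counter(row[-1] for row in data)
    scores = {}
    for c, n_c in class_counts.items():
        rows = [row for row in data if row[-1] == c]
        score = n_c / len(data)                        # prior P(y = c)
        for i, value in enumerate(x_new):              # product of P(x_i | y = c)
            score *= sum(1 for row in rows if row[i] == value) / n_c
        scores[c] = score
    evidence = sum(scores.values())                    # P(x)
    return {c: s / evidence for c, s in scores.items()}

posterior = naive_bayes_posterior(data, ("Sunny", "Cool", "High", "True"))
print(round(posterior["Yes"], 3), round(posterior["No"], 3))   # 0.205 0.795
```

Note this sketch uses raw counts, so an unseen attribute value zeroes out a class (the Overcast/No case); practical implementations add a smoothing term to the counts.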
Iris Dataset
The Iris flower data set, or Fisher's Iris data set, is a multivariate data set introduced by Sir Ronald Fisher
Attribute Information:
◦ 1. sepal length in cm
◦ 2. sepal width in cm
◦ 3. petal length in cm
◦ 4. petal width in cm
◦ 5. class:
-- Iris Setosa
-- Iris Versicolour
-- Iris Virginica
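As a sketch (assuming scikit-learn is installed), GaussianNB fits a Naïve Bayes classifier to the iris data, treating the four numeric predictors as normally distributed within each class; the 60/40 split and random seed are illustrative choices:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Load the four numeric predictors and the three-class label.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=0)

model = GaussianNB().fit(X_train, y_train)
print(model.score(X_test, y_test))   # accuracy on the validation split
```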
Advantages
Simplicity
Often outperforms more sophisticated classifiers
◦ Even when the underlying assumption of independent predictors does not hold
◦ Especially in the face of a large number of predictors
Error rate
◦ percentage of misclassified records out of the total records in the validation data
                       Predicted Class
Actual Class     setosa   versicolor   virginica
setosa             34         0            0
versicolor          0        33            2
virginica           0         2           29
Error Rate
Assume a two-class case with classes C0 and C1
Notation: ni,j – number of class Ci records predicted as class Cj
Total number of observations
◦ n = n0,0 + n0,1 + n1,0 + n1,1
Overall accuracy
◦ accuracy = 1 − err = (n0,0 + n1,1) / n
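The same computation extends to more than two classes: accuracy is the sum of the diagonal of the confusion matrix over the total count. A quick Python check against the iris confusion matrix shown earlier:

```python
# Confusion matrix from the iris validation data: rows = actual, columns = predicted.
conf = [
    [34, 0, 0],    # actual setosa
    [0, 33, 2],    # actual versicolor
    [0, 2, 29],    # actual virginica
]

n = sum(sum(row) for row in conf)                 # total validation records
correct = sum(conf[i][i] for i in range(len(conf)))  # diagonal = correct predictions
accuracy = correct / n

print(n, correct, round(accuracy, 3))   # 100 96 0.96
print(round(1 - accuracy, 3))           # error rate 0.04
```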