
Classification

BASAV ROYCHOUDHURY
What do we learn?
Understand classification
Naïve rule for classification
Naïve Bayes rule for classification
K Nearest Neighbor classifier
Naïve Rule
Simplest possible classification
Regardless of the values of the predictor variables, every record is classified as the majority class
For detecting “Fraud” in bank transactions, any new transaction will be classified as “Not fraud”, since that is the majority class
Used as a baseline to assess the performance of more sophisticated classifiers (a minimal sketch follows)
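A minimal sketch of the naïve rule as a majority-class baseline in Python; the function names and the toy fraud labels are illustrative only:

```python
from collections import Counter

def naive_rule_fit(train_labels):
    """Return the majority class observed in the training labels."""
    return Counter(train_labels).most_common(1)[0][0]

def naive_rule_predict(majority_class, n_records):
    """Predict the majority class for every new record, ignoring all predictors."""
    return [majority_class] * n_records

# Toy fraud-detection example: "Not fraud" dominates, so every new
# transaction is classified as "Not fraud".
train_labels = ["Not fraud"] * 98 + ["Fraud"] * 2
majority = naive_rule_fit(train_labels)          # "Not fraud"
print(naive_rule_predict(majority, 3))           # ['Not fraud', 'Not fraud', 'Not fraud']
```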
Naïve Bayes
Simplicity First – simple ideas often work very well
All attributes contribute to the decision
They are assumed to be equally important and independent of one another
Unrealistic assumptions, but they keep the model simple
Useful when a large dataset is available
Google uses a Naïve Bayes classifier to correct misspellings that users type in
◦ Based on the possible patterns and associated words
Weather Dataset
Outlook Temperature Humidity Windy Play
Sunny Hot High False No
Sunny Hot High True No
Overcast Hot High False Yes
Rainy Mild High False Yes
Rainy Cool Normal False Yes
Rainy Cool Normal True No
Overcast Cool Normal True Yes
Sunny Mild High False No
Sunny Cool Normal False Yes
Rainy Mild Normal False Yes
Sunny Mild Normal True Yes
Overcast Mild High True Yes
Overcast Hot Normal False Yes
Rainy Mild High True No
Prediction – New Day

Outlook Temperature Humidity Windy Play

Sunny Cool High True ?


Knowledge Discovery
Counts from the training data (days with Play = Yes, Play = No):
◦ Outlook: Sunny 2, 3; Overcast 4, 0; Rainy 3, 2
◦ Temperature: Hot 2, 2; Mild 4, 2; Cool 3, 1
◦ Humidity: High 3, 4; Normal 6, 1
◦ Windy: False 6, 2; True 3, 3
◦ Play: Yes 9, No 5

Conditional probabilities and priors (P(value | Yes), P(value | No)):
◦ Outlook: Sunny 2/9, 3/5; Overcast 4/9, 0/5; Rainy 3/9, 2/5
◦ Temperature: Hot 2/9, 2/5; Mild 4/9, 2/5; Cool 3/9, 1/5
◦ Humidity: High 3/9, 4/5; Normal 6/9, 1/5
◦ Windy: False 6/9, 2/5; True 3/9, 3/5
◦ Play: P(Yes) = 9/14, P(No) = 5/14
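As an illustration, these counts and conditional probabilities can be reproduced with pandas; the DataFrame below simply re-enters the 14 records from the weather table:

```python
import pandas as pd

# Weather dataset from the table above (14 records)
data = pd.DataFrame({
    "Outlook":     ["Sunny","Sunny","Overcast","Rainy","Rainy","Rainy","Overcast",
                    "Sunny","Sunny","Rainy","Sunny","Overcast","Overcast","Rainy"],
    "Temperature": ["Hot","Hot","Hot","Mild","Cool","Cool","Cool",
                    "Mild","Cool","Mild","Mild","Mild","Hot","Mild"],
    "Humidity":    ["High","High","High","High","Normal","Normal","Normal",
                    "High","Normal","Normal","Normal","High","Normal","High"],
    "Windy":       [False,True,False,False,False,True,True,
                    False,False,False,True,True,False,True],
    "Play":        ["No","No","Yes","Yes","Yes","No","Yes",
                    "No","Yes","Yes","Yes","Yes","Yes","No"],
})

# Class counts: P(Play = Yes) = 9/14, P(Play = No) = 5/14
print(data["Play"].value_counts())

# Conditional probabilities P(attribute value | Play), e.g. P(Sunny | Yes) = 2/9
for col in ["Outlook", "Temperature", "Humidity", "Windy"]:
    print(pd.crosstab(data[col], data["Play"], normalize="columns"), "\n")
```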
Prediction – New Day
Likelihood of Yes
= P(Sunny | Yes) × P(Cool | Yes) × P(High | Yes) × P(True | Yes) × P(Yes)
= (2/9) × (3/9) × (3/9) × (3/9) × (9/14) = 0.0053
Likelihood of No
= P(Sunny | No) × P(Cool | No) × P(High | No) × P(True | No) × P(No)
= (3/5) × (1/5) × (4/5) × (3/5) × (5/14) = 0.0206
Prediction – New Day
Probability of Yes = 0.0053 / (0.0053 + 0.0206) = 20.5%
Probability of No = 0.0206 / (0.0053 + 0.0206) = 79.5%
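The same calculation spelled out in Python, with the fractions taken directly from the tables above:

```python
# Conditional probabilities for the new day (Sunny, Cool, High, Windy = True),
# each multiplied by the class priors 9/14 and 5/14
likelihood_yes = (2/9) * (3/9) * (3/9) * (3/9) * (9/14)   # ≈ 0.0053
likelihood_no  = (3/5) * (1/5) * (4/5) * (3/5) * (5/14)   # ≈ 0.0206

total = likelihood_yes + likelihood_no
print(f"P(Play = Yes) = {likelihood_yes / total:.1%}")    # ≈ 20.5%
print(f"P(Play = No)  = {likelihood_no  / total:.1%}")    # ≈ 79.5%
```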
Bayes Rule
For a hypothesis H supported by evidence E:

P(y = c \mid x) = \frac{P(x \mid y = c)\, P(y = c)}{P(x)}

\Rightarrow P(H \mid E) = \frac{P(E \mid H) \times P(H)}{P(E)}

\Rightarrow \text{posterior} = \frac{\text{likelihood} \times \text{prior probability of the proposition}}{\text{prior probability of the evidence}}

H → proposition or hypothesis, E → evidence
P(x \mid y = c) is the likelihood function
Bayesian analysis requires a prior distribution P(y = c)
Bayes Rule, Cont’d
If prior information is not available, one can use a non-informative prior
If predictors are numeric, the probability distribution of each predictor is assumed to be normal (Gaussian)
Bayes Rule: Issue
Bayes rule requires a large number of records for learning, especially in the presence of a large number of predictors
Naive Bayes applies Bayes’ theorem with the “naive” assumption of independence between every pair of features:
◦ P(x \mid y = c) = \prod_i P(x_i \mid y = c)
Naïve Bayes
Using
◦ P(x \mid y = c) = \prod_i P(x_i \mid y = c)
Naïve Bayes Rule
◦ P(y = c \mid x) = \frac{\prod_i P(x_i \mid y = c)\, P(y = c)}{P(x)}
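A minimal implementation sketch of this rule for categorical predictors; the function and variable names are illustrative, not from any particular library:

```python
from collections import Counter, defaultdict

def fit_naive_bayes(X, y):
    """Estimate class priors P(y=c) and conditional counts for P(x_i | y=c)
    from a list of records (dicts of attribute -> value) and their labels."""
    priors = {c: n / len(y) for c, n in Counter(y).items()}
    cond = defaultdict(Counter)            # (class, attribute) -> Counter of values
    for record, label in zip(X, y):
        for attr, value in record.items():
            cond[(label, attr)][value] += 1
    return priors, cond

def predict_naive_bayes(record, priors, cond):
    """Score each class by prior * product of conditionals, then normalize."""
    scores = {}
    for c, prior in priors.items():
        score = prior
        for attr, value in record.items():
            counts = cond[(c, attr)]
            score *= counts[value] / sum(counts.values())
        scores[c] = score
    total = sum(scores.values())
    return {c: s / total for c, s in scores.items()}
```

Fitting these functions on the 14 weather records and scoring the new day reproduces the 20.5% / 79.5% split computed earlier. Note that an attribute value never seen with a class yields a zero conditional probability, an issue revisited under the disadvantages below.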
Iris Dataset
The Iris flower data set (Fisher's Iris data set) is a multivariate data set introduced by Sir Ronald Fisher
Attribute Information:
◦ 1. sepal length in cm
◦ 2. sepal width in cm
◦ 3. petal length in cm
◦ 4. petal width in cm
◦ 5. class: Iris Setosa, Iris Versicolour, Iris Virginica
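As a sketch of Naïve Bayes with numeric predictors, where each predictor is assumed to be normally distributed within a class (as noted earlier), scikit-learn's GaussianNB can be applied to the Iris data; the split proportions chosen here are arbitrary:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.33, random_state=1)

model = GaussianNB().fit(X_train, y_train)
print("Validation accuracy:", model.score(X_valid, y_valid))
```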
Advantages
Simplicity
Often outperforms sophisticated classifiers
◦ Even if the underlying assumption of independent predictors does not hold
◦ Especially in the face of a large number of predictors
Faster at predicting classes than many other classification algorithms
Can be used for binary and multiclass classification
Not sensitive to irrelevant features
Disadvantages
Problem with rare predictor values
◦ Naive Bayesian prediction requires each conditional probability to be non-zero; otherwise, the predicted probability will be zero
◦ If a predictor value that was not present in the training set appears at prediction time, it is assigned a zero conditional probability for the target variable (see the smoothing sketch after this list)
Known to be a poor probability estimator, so the probability outputs should not be taken too seriously
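A common remedy for the zero-probability problem is Laplace (add-one) smoothing: add a small pseudo-count to every attribute value so unseen values receive a small but non-zero probability. A minimal sketch, using the Outlook = Overcast, Play = No cell from the weather data as an example (scikit-learn's CategoricalNB exposes the same idea through its alpha parameter):

```python
from collections import Counter

def smoothed_conditional(counts, value, n_distinct_values, alpha=1.0):
    """P(x_i = value | y = c) with Laplace smoothing:
    (count + alpha) / (class_total + alpha * number_of_distinct_values)."""
    class_total = sum(counts.values())
    return (counts[value] + alpha) / (class_total + alpha * n_distinct_values)

# Outlook = Overcast never occurs with Play = No (count 0 out of 5 No-days).
# Without smoothing P(Overcast | No) = 0/5 = 0; with alpha = 1 and 3 distinct
# Outlook values it becomes (0 + 1) / (5 + 1 * 3) = 1/8.
outlook_given_no = Counter({"Sunny": 3, "Rainy": 2})   # "Overcast" unseen for No
print(smoothed_conditional(outlook_given_no, "Overcast", n_distinct_values=3))  # 0.125
```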
Evaluating Classification Performance
Why Evaluate?
Multiple methods are available to classify or predict
Each method offers multiple choices of settings
A single method with different settings can lead to completely different results
To choose the best model, we need to assess each model’s performance
How to assess the models?
Misclassification Error:
◦ classifying a record as belonging to one class when it actually belongs to another class
Error rate:
◦ percent of misclassified records out of the total records in the validation data
It would be great to have a model with an error rate of zero, but this is practically impossible due to the presence of ‘noise’ in real-world data
The aim is to do better than the naïve rule (with exceptions!)

SOURCE: SHMUELI, PATEL & BRUCE


Separation of Records
How many records/cases do we need for good classification?
Separation of records
◦ High separation of records - using the predictor variables attains low error
◦ Low separation of records - using the predictor variables does not improve much on the naïve rule



Separation of Records, Cont’d



Water, water, every where,
Nor any drop to drink!



Confusion Matrix
Answer to confusion
◦ Confusion Matrix/Classification Matrix

Summarizes correct and incorrect classifications
Estimates true classification and misclassification rates
Simple estimates based on the test/validation set
◦ reliability can be increased by using a larger dataset
An honest estimate of classification error requires the matrix to be computed from the validation set or test set



Confusion Matrix

                Predicted Class
Actual Class    setosa   versicolor   virginica
setosa              34            0           0
versicolor           0           33           2
virginica            0            2          29
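A sketch of how such a matrix can be produced with scikit-learn; the exact counts depend on the particular train/validation split, so they need not match the table above exactly:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import confusion_matrix, accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.33, random_state=1)

y_pred = GaussianNB().fit(X_train, y_train).predict(X_valid)

# Rows are actual classes, columns are predicted classes
print(confusion_matrix(y_valid, y_pred))
print("Accuracy:", accuracy_score(y_valid, y_pred))
```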
Error Rate
Assume a two-class case with classes C0 and C1
Notation: n_{i,j} denotes the number of records of actual class C_i that are classified as class C_j
Total number of observations
◦ n = n_{0,0} + n_{0,1} + n_{1,0} + n_{1,1}
The estimated misclassification rate (overall error rate)
◦ err = \frac{n_{0,1} + n_{1,0}}{n}
Overall accuracy
◦ accuracy = 1 - err = \frac{n_{0,0} + n_{1,1}}{n}
Error Rate
                Predicted Class
Actual Class    setosa   versicolor   virginica
setosa              34            0           0
versicolor           0           33           2
virginica            0            2          29

Overall error rate = (2 + 2)/100 = 4%
Accuracy = 1 - err = (34 + 33 + 29)/100 = 96%
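The same arithmetic, written out against the matrix above (diagonal entries are correct classifications, off-diagonal entries are errors):

```python
import numpy as np

cm = np.array([[34,  0,  0],
               [ 0, 33,  2],
               [ 0,  2, 29]])

n = cm.sum()                          # 100 records in total
err = (n - np.trace(cm)) / n          # (2 + 2) / 100 = 0.04
accuracy = np.trace(cm) / n           # (34 + 33 + 29) / 100 = 0.96
print(f"error rate = {err:.0%}, accuracy = {accuracy:.0%}")
```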
Example Dataset
CRS_DEP_TIME Scheduled departure time
CARRIER Airline
DEP_TIME Actual departure time
DEST Destination airport
DISTANCE Distance
FL_DATE Flight date
FL_NUM Flight Number
ORIGIN Departure airport
WEATHER Whether inclement (1) or not (0)
DAY_WEEK Day of week – 1 for Monday
DAY_OF_MONTH Day of the month
TAIL_NUM Plane specific tail number
FLIGHT STATUS Ontime or Delayed (more than 15 minutes)
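A possible way to fit a Naïve Bayes classifier to this dataset is sketched below; the file name FlightDelays.csv, the choice of predictors, and the exact column spellings are assumptions that would need to match the actual data file:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import CategoricalNB
from sklearn.preprocessing import OrdinalEncoder

# Assumed file name; column names follow the list above
df = pd.read_csv("FlightDelays.csv")
predictors = ["CARRIER", "DEST", "ORIGIN", "WEATHER", "DAY_WEEK"]

enc = OrdinalEncoder()
X = enc.fit_transform(df[predictors].astype(str))
y = df["FLIGHT STATUS"]

X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.4, random_state=1)

# alpha=1 applies Laplace smoothing; min_categories guards against category
# codes that happen to appear only in the validation split
n_categories = [len(c) for c in enc.categories_]
model = CategoricalNB(alpha=1.0, min_categories=n_categories).fit(X_train, y_train)
print("Validation accuracy:", model.score(X_valid, y_valid))
```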
Thank you
