
# Classification methods

## Methods & Characteristics

The three methods:

- Naïve rule
- Naïve Bayes
- K-nearest-neighbor

Common characteristics:

- Data-driven, not model-driven
- Make no assumptions about the data

## Naïve Rule

- Classify all records as the majority class
- Not a real method
- Introduced so it will serve as a benchmark against which to measure other results

[Figure: firms by Charge (Y/N) and Size (S/L); Truthful 60%, Fraud 40%]
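The benchmark role of the naïve rule can be sketched in a few lines. This is a minimal illustration, assuming a list of class labels with the 60% truthful / 40% fraud split of the fraud example; the function name `naive_rule` is made up for this sketch.

```python
from collections import Counter

def naive_rule(training_labels):
    """Return the majority class; every new record is assigned this class."""
    return Counter(training_labels).most_common(1)[0][0]

# Assumed labels: 6 truthful and 4 fraudulent firms (the 60% / 40% split).
labels = ["truthful"] * 6 + ["fraud"] * 4
majority = naive_rule(labels)

# Benchmark error rate: share of records that are not in the majority class.
benchmark_error = sum(label != majority for label in labels) / len(labels)
print(majority, benchmark_error)
```

Any real classifier should beat this benchmark error; otherwise it adds nothing over always predicting the majority class.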

## Naïve Bayes

## Idea of Naïve Bayes: Financial Fraud

Target variable: fraud / truthful

Predictors: Charge (Y/N), Size

Classify based on the majority in each cell (conditional probability)

## Naïve Bayes: The Basic Idea

- For a given new record to be classified, find other records like it (i.e., same values for the predictors)
- What is the prevalent class among those records?
- Assign that class to your new record

## Usage

- Requires categorical variables
- Numerical variables must be binned and converted to categorical
- Can be used with very large data sets
- Example: spell check. The computer attempts to assign your misspelled word to an established class (i.e., a correctly spelled word)
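Binning a numerical variable into categories can be sketched as follows. The cut points, the category labels, and the helper name `bin_value` are hypothetical, chosen only for illustration; in practice the bins would be chosen to suit the data.

```python
from bisect import bisect_right

def bin_value(value, cut_points, labels):
    """Map a number to a category label using sorted cut points.

    len(labels) must equal len(cut_points) + 1.
    """
    return labels[bisect_right(cut_points, value)]

# Hypothetical cut points and labels, for illustration only.
cut_points = [50.0, 80.0]
labels = ["low", "medium", "high"]

values = [33.0, 60.0, 85.5, 110.1]
binned = [bin_value(v, cut_points, labels) for v in values]
print(binned)
```

With `bisect_right`, a value equal to a cut point falls into the higher bin; that boundary convention is a design choice of this sketch.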

## Exact Bayes Classifier

- Relies on finding other records that share the same predictor values as the record to be classified
- Want to find the probability of belonging to class C, given specified values of the predictors
- Conditional probability: $P(Y = C \mid X = (x_1, \ldots, x_p))$

## Example: Financial Fraud

Target variable: fraud / truthful

Predictors:

- Prior pending legal charges (yes/no)
- Size of firm (small/large)

Classify based on the majority in each cell. Error rate: 20%

[Figure: firms by Charge (Y/N) and Size (S/L)]

## Exact Bayes Calculations

| Charges? | Size | Outcome |
|---|---|---|
| y | small | truthful |
| n | small | truthful |
| n | large | truthful |
| n | large | truthful |
| n | small | truthful |
| n | small | truthful |
| y | small | fraud |
| y | large | fraud |
| n | large | fraud |
| y | large | fraud |

Counts (T,F) in each cell:

| | Small | Large |
|---|---|---|
| Charges Yes | (1,1) | (0,2) |
| Charges No | (3,0) | (2,1) |

P(F|C,S):

| Charges | Small |
|---|---|
| Y | 0.5 |
| N | 0 |

Rule:

| Charges | Small |
|---|---|
| Y | ? |
| N | Truthful |

## Exact Bayes Calculations

- Goal: classify (as fraudulent or as truthful) a small firm with charges filed
- There are 2 firms like that, one fraudulent and the other truthful
- P(fraud | charges = y, size = small) = 1/2 = 0.50
- Note: the calculation is limited to the two firms matching those characteristics
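The exact Bayes calculation can be sketched directly on the ten firms from the example. The function name `exact_bayes` and the tuple layout `(charges, size, outcome)` are choices made for this sketch, not part of the original material.

```python
from collections import Counter

# The ten firms from the example: (charges, size, outcome).
firms = [
    ("y", "small", "truthful"), ("n", "small", "truthful"),
    ("n", "large", "truthful"), ("n", "large", "truthful"),
    ("n", "small", "truthful"), ("n", "small", "truthful"),
    ("y", "small", "fraud"),    ("y", "large", "fraud"),
    ("n", "large", "fraud"),    ("y", "large", "fraud"),
]

def exact_bayes(records, charges, size, target="fraud"):
    """P(target | charges, size), using only records that match exactly."""
    matches = [outcome for c, s, outcome in records if (c, s) == (charges, size)]
    if not matches:
        return None  # no matching records: the exact estimate is undefined
    return Counter(matches)[target] / len(matches)

print(exact_bayes(firms, "y", "small"))
```

The `None` branch shows the weakness of the exact approach: when no training record shares all predictor values with the new record, there is nothing to count.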

## Problem

Even with large data sets, it may be hard to find other records that exactly match your record in terms of predictor values.

## Solution: Naïve Bayes

- Assume independence of predictor variables (within each class)
- Use the multiplication rule
- Find the same probability that a record belongs to class C, given its predictor values, without limiting the calculation to records that share all those same values

## Refining the primitive idea: Naïve Bayes

Main idea: instead of looking at combinations of predictors (crossed pivot table), look at each predictor separately.

How can this be done? A probability trick!

- Based on Bayes' rule
- Then make a simplifying assumption
- And get a powerful classifier!

## Conditional Probability

- A = the event X = A
- B = the event Y = B
- P(A | B) denotes the probability of A given B (the conditional probability that A occurs given that B occurred)

$$P(A \mid B) = \frac{P(A \cap B)}{P(B)}, \quad \text{if } P(B) > 0$$

[Figure: events A and B and their intersection A ∩ B]

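## Bayes Rule (Reverse conditioning)

What if I only know the opposite direction? Bayes' rule gives a neat way to reverse the direction!

$$P(A \cap B) = P(B \mid A)\,P(A) = P(A \mid B)\,P(B)$$

$$P(B \mid A) = \frac{P(A \mid B)\,P(B)}{P(A)}$$

[Figure: events A and B and their intersection A ∩ B]

For the fraud example:

$$P(\text{Fraud} \mid \text{Charge})\,P(\text{Charge}) = P(\text{Charge} \mid \text{Fraud})\,P(\text{Fraud})$$

$$P(\text{Fraud} \mid \text{Charge}) = \frac{P(\text{Charge} \mid \text{Fraud})\,P(\text{Fraud})}{P(\text{Charge})}$$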
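As a numerical check, the counts from the ten-firm fraud example can be plugged into Bayes' rule: 3 of the 4 fraudulent firms have charges, 4 of the 10 firms are fraudulent, and 4 of the 10 firms have charges.

$$P(\text{Fraud} \mid \text{Charge}) = \frac{P(\text{Charge} \mid \text{Fraud})\,P(\text{Fraud})}{P(\text{Charge})} = \frac{(3/4)(4/10)}{4/10} = 0.75$$

This agrees with counting directly: 3 of the 4 firms with charges are fraudulent.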

## Flipping the condition:

$$P(Y = 1 \mid X_1, \ldots, X_p) = \frac{P(X_1, \ldots, X_p \mid Y = 1)\,P(Y = 1)}{P(X_1, \ldots, X_p)}$$

## How is this used to solve our problem?

- We want to estimate $P(Y=1 \mid X_1, \ldots, X_p)$
- But we don't have enough examples of each possible profile $X_1, \ldots, X_p$ in the training set
- If we had instead $P(X_1, \ldots, X_p \mid Y=1)$, we could separate it into $P(X_1 \mid Y=1)\,P(X_2 \mid Y=1) \cdots P(X_p \mid Y=1)$
- True if we can assume independence between $X_1, \ldots, X_p$ within each class
- That means we could use single pivot tables!
- If the dependence is not extreme, it will work reasonably well

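## Independence Assumption

With the independence assumption, $P(A \cap B) = P(A)\,P(B)$. We can thus calculate

$$P(X_1, \ldots, X_p \mid Y=1) = P(X_1 \mid Y=1)\,P(X_2 \mid Y=1) \cdots P(X_p \mid Y=1)$$

$$P(X_1, \ldots, X_p \mid Y=0) = P(X_1 \mid Y=0)\,P(X_2 \mid Y=0) \cdots P(X_p \mid Y=0)$$

$$P(X_1, \ldots, X_p) = P(X_1, \ldots, X_p \mid Y=1)\,P(Y=1) + P(X_1, \ldots, X_p \mid Y=0)\,P(Y=0)$$

[Figure: events A and B and their intersection A ∩ B]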

## Putting it all together: How it works

1. All predictors must be categorical.
2. From the training set, create all pivot tables of Y on each separate X. We can thus obtain P(X), P(X|Y=1), P(X|Y=0).
3. For a to-be-predicted observation with predictors $X_1, X_2, \ldots, X_p$, software computes the probability of belonging to Y=1 using the formula

   $$P(Y = 1 \mid X_1, \ldots, X_p) = \frac{P(X_1 \mid Y = 1)\,P(X_2 \mid Y = 1) \cdots P(X_p \mid Y = 1)\,P(Y = 1)}{P(X_1, \ldots, X_p)}$$

   Each of the probabilities in the formula is estimated from a pivot table, and the estimated P(Y=1) is the proportion of 1s in the training set.
4. Use the cutoff to determine the classification of this observation. Default: cutoff = 0.5 (classify to the group that is most likely).
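The steps above can be sketched in plain Python on the ten firms from the fraud example. This is a minimal illustration, not how XLMiner implements it; the names `fit_naive_bayes` and `predict_proba` are made up for the sketch, and the denominator is obtained by summing the class-wise products over the classes.

```python
from collections import Counter, defaultdict

# The ten firms from the fraud example: (charges, size, outcome).
firms = [
    ("y", "small", "truthful"), ("n", "small", "truthful"),
    ("n", "large", "truthful"), ("n", "large", "truthful"),
    ("n", "small", "truthful"), ("n", "small", "truthful"),
    ("y", "small", "fraud"),    ("y", "large", "fraud"),
    ("n", "large", "fraud"),    ("y", "large", "fraud"),
]

def fit_naive_bayes(records):
    """Build class counts and one pivot table per predictor from (x1, ..., xp, y) rows."""
    class_counts = Counter(row[-1] for row in records)
    n_predictors = len(records[0]) - 1
    # cond_counts[j][cls][value] = number of class-cls records with predictor j == value
    cond_counts = [defaultdict(Counter) for _ in range(n_predictors)]
    for *xs, y in records:
        for j, value in enumerate(xs):
            cond_counts[j][y][value] += 1
    return class_counts, cond_counts

def predict_proba(model, xs):
    """Return {class: P(class | xs)} under the independence assumption."""
    class_counts, cond_counts = model
    total = sum(class_counts.values())
    scores = {}
    for cls, n_cls in class_counts.items():
        score = n_cls / total  # prior P(Y = cls)
        for j, value in enumerate(xs):
            score *= cond_counts[j][cls][value] / n_cls  # P(X_j = value | Y = cls)
        scores[cls] = score
    denom = sum(scores.values())  # P(X_1, ..., X_p)
    return {cls: s / denom for cls, s in scores.items()}

model = fit_naive_bayes(firms)
probs = predict_proba(model, ("y", "small"))
predicted = "fraud" if probs["fraud"] >= 0.5 else "truthful"  # cutoff = 0.5
print(round(probs["fraud"], 3), predicted)
```

Because the sketch keeps unrounded intermediate values, its results can differ in the third decimal from figures computed with rounded probabilities.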

## Naïve Bayes, cont.

- Note that the probability estimate does not differ greatly from the exact one
- All records are used in calculations, not just those matching predictor values
- This makes calculations practical in most circumstances
- Relies on the assumption of independence between predictor variables within each class

## Independence Assumption

- Not strictly justified (variables are often correlated with one another)
- Often good enough

## Example: Financial Fraud

Target variable: Fraud / Truthful

Predictors: Charge (Y/N), Size

For a small firm (S) with charges (Y):

$$P(S, Y \mid T)\,P(T) = P(S \mid T)\,P(Y \mid T)\,P(T) = (4/6)(1/6)(6/10) = 0.067$$

$$P(S, Y \mid F)\,P(F) = P(S \mid F)\,P(Y \mid F)\,P(F) = (1/4)(3/4)(4/10) = 0.075$$

Exact P(F|C,S):

| Charge | Small |
|---|---|
| Y | 0.5 |
| N | 0 |

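## Naïve Bayes Calculations

| Charges? | Size | Outcome |
|---|---|---|
| y | small | truthful |
| n | small | truthful |
| n | large | truthful |
| n | large | truthful |
| n | small | truthful |
| n | small | truthful |
| y | small | fraud |
| y | large | fraud |
| n | large | fraud |
| y | large | fraud |

Counts (T,F):

| Charges | Small | Large | sum |
|---|---|---|---|
| Y | (1,1) | (0,2) | (1,3) |
| N | (3,0) | (2,1) | (5,1) |
| sum | (4,1) | (2,3) | (6,4) |

P(C,S|F)P(F), with P(F) = 0.40:

| Charges | Small | Large | P(C\|F) |
|---|---|---|---|
| Y | 0.075 | 0.225 | 0.75 |
| N | 0.025 | 0.075 | 0.25 |
| P(S\|F) | 0.25 | 0.75 | |

For example, for Y and Small: 0.25 * 0.75 * 0.40 = 0.075

P(C,S|T)P(T), with P(T) = 0.60:

| Charges | Small | Large | P(C\|T) |
|---|---|---|---|
| Y | 0.067 | 0.034 | 0.17 |
| N | 0.334 | 0.164 | 0.83 |
| P(S\|T) | 0.67 | 0.33 | |

$$P(F \mid C, S) = \frac{P(C, S \mid F)\,P(F)}{P(C, S)} = \frac{P(C \mid F)\,P(S \mid F)\,P(F)}{P(C, S)}$$

$$P(C, S) = P(C, S \mid F)\,P(F) + P(C, S \mid T)\,P(T)$$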

## Example: Financial Fraud

Target variable: Fraud / Truthful

Predictors:

- Prior pending legal charges (yes/no)
- Size of firm (small/large)

Estimated conditional probability P(F|C,S):

| Charge | Small | Large |
|---|---|---|
| Y | 0.528 | 0.869 |
| N | 0.070 | 0.316 |

## The good

- Simple
- Can handle a large number of predictors
- High performance accuracy when the goal is ranking
- Pretty robust to the independence assumption!

## The bad

- Need to categorize continuous predictors
- Predictors with rare categories -> zero probability (if this category is important, this is a problem)
- Gives a biased probability of class membership
- No insight about the importance/role of each predictor

## Classification > Naïve Bayes

Sheet: NNB-Output1

## Prior class probabilities

According to relative occurrences in training data:

| Class | Prob. | |
|---|---|---|
| 1 | 0.095333333 | <-- Success Class |
| 0 | 0.904666667 | |

P(accept=1) = 0.095

## Conditional probabilities

| Input variable | Value | Prob (class 1) | Prob (class 0) |
|---|---|---|---|
| Online | 0 | 0.374125874 | 0.401621223 |
| Online | 1 | 0.625874126 | 0.598378777 |
| CreditCard | 0 | 0.699300699 | 0.711864407 |
| CreditCard | 1 | 0.300699301 | 0.288135593 |

## Scoring the validation data

Sheet: NNB-ValidScore1

Data range: ['UniversalBank KNN NBayes.xls']'Data_Partition1'!\$C\$3019:\$O\$5018

Cutoff: 0.5

| Row Id. | Predicted Class | Actual Class | Prob. for 1 (success) | Online | CreditCard |
|---|---|---|---|---|---|
| 2 | 0 | 0 | 0.08795125 | 0 | 0 |
| 3 | 0 | 0 | 0.08795125 | 0 | 0 |
| 7 | 0 | 0 | 0.097697987 | 1 | 0 |
| 8 | 0 | 0 | 0.092925663 | 0 | 1 |
| 11 | 0 | 0 | 0.08795125 | 0 | 0 |
| 13 | 0 | 0 | 0.08795125 | 0 | 0 |
| 14 | 0 | 0 | 0.097697987 | 1 | 0 |
| 15 | 0 | 0 | 0.08795125 | 0 | 0 |
| 16 | 0 | 0 | 0.10316131 | 1 | 1 |
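The validation scores can be reproduced from the prior and conditional probabilities reported in the XLMiner output. The sketch below copies those numbers and applies the naïve Bayes formula; the function name `prob_success` and the dictionary layout are choices made for this illustration.

```python
# Prior and conditional probabilities copied from the XLMiner output.
prior = {1: 0.095333333, 0: 0.904666667}
cond = {
    "Online":     {1: {0: 0.374125874, 1: 0.625874126},
                   0: {0: 0.401621223, 1: 0.598378777}},
    "CreditCard": {1: {0: 0.699300699, 1: 0.300699301},
                   0: {0: 0.711864407, 1: 0.288135593}},
}

def prob_success(online, credit_card):
    """Naive Bayes P(class = 1 | Online, CreditCard)."""
    score = {}
    for cls in (1, 0):
        score[cls] = (prior[cls]
                      * cond["Online"][cls][online]
                      * cond["CreditCard"][cls][credit_card])
    # Normalize by the sum over both classes, i.e. P(Online, CreditCard).
    return score[1] / (score[1] + score[0])

for online, cc in [(0, 0), (1, 0), (0, 1), (1, 1)]:
    print(online, cc, round(prob_success(online, cc), 6))
```

All four probabilities are far below the 0.5 cutoff, which is why every listed record is predicted as class 0.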

## K-Nearest Neighbors

## Basic Idea

- For a given record to be classified, identify nearby records
- "Near" means records with similar predictor values $X_1, X_2, \ldots, X_p$
- Classify the record as whatever the predominant class is among the nearby records (the "neighbors")

## How to Measure "nearby"?

The most popular distance measure is Euclidean distance
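For reference, the standard definition of Euclidean distance between two records is given below. The symbols $(x_1, \ldots, x_p)$ and $(u_1, \ldots, u_p)$ for the two records' predictor values are notation assumed here, not taken from the slides.

$$d(x, u) = \sqrt{(x_1 - u_1)^2 + (x_2 - u_2)^2 + \cdots + (x_p - u_p)^2}$$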

## Choosing k

- k is the number of nearby neighbors to be used to classify the new record
  - k=1 means use the single nearest record
  - k=5 means use the 5 nearest records
- Typically choose the value of k that has the lowest error rate in validation data

[Figure: example with k=3; axes X1 and X2]

## Low k vs. High k

- Low values of k (1, 3, …) capture local structure in data (but also noise)
- High values of k provide more smoothing and less noise, but may miss local structure
- Note: the extreme case of k = n (i.e., the entire data set) is the same thing as the naïve rule (classify all records according to the majority class)

## Example: Riding Mowers

Data: 24 households classified as owning or not owning riding mowers

Predictors = Income, Lot Size

| Income | Lot_Size | Ownership |
|---|---|---|
| 60.0 | 18.4 | owner |
| 85.5 | 16.8 | owner |
| 64.8 | 21.6 | owner |
| 61.5 | 20.8 | owner |
| 87.0 | 23.6 | owner |
| 110.1 | 19.2 | owner |
| 108.0 | 17.6 | owner |
| 82.8 | 22.4 | owner |
| 69.0 | 20.0 | owner |
| 93.0 | 20.8 | owner |
| 51.0 | 22.0 | owner |
| 81.0 | 20.0 | owner |
| 75.0 | 19.6 | non-owner |
| 52.8 | 20.8 | non-owner |
| 64.8 | 17.2 | non-owner |
| 43.2 | 20.4 | non-owner |
| 84.0 | 17.6 | non-owner |
| 49.2 | 17.6 | non-owner |
| 59.4 | 16.0 | non-owner |
| 66.0 | 18.4 | non-owner |
| 47.4 | 16.4 | non-owner |
| 33.0 | 18.8 | non-owner |
| 51.0 | 14.0 | non-owner |
| 63.0 | 14.8 | non-owner |
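A k-nearest-neighbor classifier for this data can be sketched as follows. It is a minimal illustration: it uses Euclidean distance on the raw (unscaled) values, all 24 households as training data, and a hypothetical new household that is not part of the source data. The function name `knn_classify` is made up for the sketch.

```python
import math
from collections import Counter

# (income, lot_size, ownership) for the 24 households in the example.
households = [
    (60.0, 18.4, "owner"), (85.5, 16.8, "owner"), (64.8, 21.6, "owner"),
    (61.5, 20.8, "owner"), (87.0, 23.6, "owner"), (110.1, 19.2, "owner"),
    (108.0, 17.6, "owner"), (82.8, 22.4, "owner"), (69.0, 20.0, "owner"),
    (93.0, 20.8, "owner"), (51.0, 22.0, "owner"), (81.0, 20.0, "owner"),
    (75.0, 19.6, "non-owner"), (52.8, 20.8, "non-owner"),
    (64.8, 17.2, "non-owner"), (43.2, 20.4, "non-owner"),
    (84.0, 17.6, "non-owner"), (49.2, 17.6, "non-owner"),
    (59.4, 16.0, "non-owner"), (66.0, 18.4, "non-owner"),
    (47.4, 16.4, "non-owner"), (33.0, 18.8, "non-owner"),
    (51.0, 14.0, "non-owner"), (63.0, 14.8, "non-owner"),
]

def knn_classify(training, query, k):
    """Majority class among the k training records nearest to query (Euclidean)."""
    by_distance = sorted(training, key=lambda rec: math.dist(rec[:2], query))
    votes = Counter(label for _, _, label in by_distance[:k])
    # On a tied vote, most_common returns the class encountered first.
    return votes.most_common(1)[0][0]

# Hypothetical new household (income, lot size); not part of the source data.
new_household = (60.0, 20.0)
print(knn_classify(households, new_household, k=3))
```

Because the predictors are left unscaled, income differences dominate the distance; rescaling the predictors first is a common alternative that this sketch does not apply.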

## XLMiner Output

- For each record in the validation data (6 records), XLMiner finds neighbors amongst the training data (18 records).
- The record is scored for k=1, k=2, …, k=18.
- Best k seems to be k=8.
- k=9, k=10, k=12, and k=14 also share the low error rate, but it is best to choose the lowest k.

| Value of k | % Error Training | % Error Validation | |
|---|---|---|---|
| 1 | 0.00 | 33.33 | |
| 2 | 16.67 | 33.33 | |
| 3 | 11.11 | 33.33 | |
| 4 | 22.22 | 33.33 | |
| 5 | 11.11 | 33.33 | |
| 6 | 27.78 | 33.33 | |
| 7 | 22.22 | 33.33 | |
| 8 | 22.22 | 16.67 | <--- Best k |
| 9 | 22.22 | 16.67 | |
| 10 | 22.22 | 16.67 | |
| 11 | 16.67 | 33.33 | |
| 12 | 16.67 | 16.67 | |
| 13 | 11.11 | 33.33 | |
| 14 | 11.11 | 16.67 | |
| 15 | 5.56 | 33.33 | |
| 16 | 16.67 | 33.33 | |
| 17 | 11.11 | 33.33 | |
| 18 | 50.00 | 50.00 | |
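The selection rule (lowest validation error, smallest k among ties) can be sketched directly on the reported validation errors. The variable names are choices made for this illustration.

```python
# Validation error (%) for k = 1..18, copied from the XLMiner output.
validation_error = [33.33, 33.33, 33.33, 33.33, 33.33, 33.33, 33.33, 16.67,
                    16.67, 16.67, 33.33, 16.67, 33.33, 16.67, 33.33, 33.33,
                    33.33, 50.00]

lowest = min(validation_error)
# All k values that tie for the lowest validation error (k is 1-based).
tied = [k for k, err in enumerate(validation_error, start=1) if err == lowest]
best_k = min(tied)  # prefer the smallest k among the ties
print(tied, best_k)
```

With only 6 validation records, each misclassification moves the error rate by 16.67 points, so ties like these are expected.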

## Using K-NN for Prediction (for Numerical Outcome)

- Instead of a majority vote determining the class, use the average of the response values
- May be a weighted average, with the weight decreasing with distance
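Both variants can be sketched as follows. The training data is hypothetical, and inverse-distance weighting is only one possible choice of weights that decrease with distance; the function name `knn_predict` is made up for the sketch.

```python
import math

def knn_predict(training, query, k, weighted=False):
    """Predict a numerical outcome from the k nearest training records.

    training: list of (predictors, response) pairs; predictors is a tuple.
    """
    nearest = sorted(training, key=lambda rec: math.dist(rec[0], query))[:k]
    if not weighted:
        return sum(y for _, y in nearest) / len(nearest)
    # Inverse-distance weights; the small constant avoids division by zero.
    weights = [1.0 / (math.dist(x, query) + 1e-9) for x, _ in nearest]
    return sum(w * y for w, (_, y) in zip(weights, nearest)) / sum(weights)

# Hypothetical one-predictor data, for illustration only.
training = [((1.0,), 10.0), ((2.0,), 20.0), ((4.0,), 40.0), ((8.0,), 80.0)]
plain = knn_predict(training, (2.5,), k=3)
weighted = knn_predict(training, (2.5,), k=3, weighted=True)
print(plain, weighted)
```

The weighted prediction is pulled toward the response of the closest record, while the plain average treats all k neighbors equally.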

## Advantages

- Simple
- No assumptions required about Normal distribution, etc.
- Effective at capturing complex interactions among variables without having to define a statistical model

## Shortcomings

- Required size of training set increases exponentially with the number of predictors, p
  - This is because the expected distance to the nearest neighbor increases with p (with a large vector of predictors, all records end up "far away" from each other)
- In a large training set, it takes a long time to find distances to all the neighbors and then identify the nearest one(s)
- These constitute the "curse of dimensionality"
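The growth of nearest-neighbor distance with p can be illustrated with a small simulation. This sketch uses simulated data only (points drawn uniformly from the unit cube with a fixed seed), not any data set from the slides; the function name is made up.

```python
import math
import random

def mean_nearest_neighbor_distance(n_records, p, seed=0):
    """Average distance from each record to its nearest neighbor,
    for n_records points drawn uniformly from the p-dimensional unit cube."""
    rng = random.Random(seed)
    points = [[rng.random() for _ in range(p)] for _ in range(n_records)]
    total = 0.0
    for i, a in enumerate(points):
        total += min(math.dist(a, b) for j, b in enumerate(points) if j != i)
    return total / n_records

# Same number of records, increasing number of predictors.
for p in (2, 5, 10, 20):
    print(p, round(mean_nearest_neighbor_distance(200, p), 3))
```

With the number of records held fixed, the average nearest-neighbor distance grows as predictors are added, so the "nearest" records become less and less similar to the record being classified.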

## Dealing with the Curse

- Reduce dimension of predictors (e.g., with PCA)
- Computational shortcuts that settle for "almost nearest neighbors"

## Summary

- Naïve rule: benchmark
- Naïve Bayes and K-NN are two variations on the same theme: "Classify new record according to the class of similar records"
- No statistical models involved
- These methods pay attention to complex interactions and local structure
- Computational challenges remain