K723 Data Mining

Attribution Non-Commercial (BY-NC)

The three methods: naive rule, naive Bayes, and k-nearest neighbors. Common characteristic: no statistical model is involved.

Naive Rule

Classify all records as the majority class. This is not a real method: it is introduced to serve as a benchmark against which to measure other results.
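As a minimal sketch of the benchmark, the naive rule can be written in a few lines. The function name `naive_rule` and the label list are illustrative assumptions; the 60%/40% class mix is the one used in the fraud example of these slides.

```python
from collections import Counter

def naive_rule(train_labels):
    """Return the majority class of the training labels."""
    return Counter(train_labels).most_common(1)[0][0]

# Class mix from the fraud example: 60% truthful, 40% fraudulent.
labels = ["truthful"] * 6 + ["fraud"] * 4

majority = naive_rule(labels)

# Every record is assigned the majority class, so the benchmark error
# rate equals the share of records in the other class(es).
benchmark_error = sum(label != majority for label in labels) / len(labels)

print(majority, benchmark_error)
```

Any real classifier should beat this error rate to be worth using.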

[Figure: fraud example. Predictors: Charge (Y/N) and Size (S/L). Classes: Truthful 60%, Fraud 40%.]

Naive Bayes

Target variable: fraud / truthful. Predictors: Charge (Y/N), Size.

(Conditional Probability)

For a given new record to be classified, find other records like it (i.e., with the same values for the predictors). What is the prevalent class among those records? Assign that class to the new record.

Usage

Requires categorical variables. A numerical variable must be binned and converted to categorical. Can be used with very large data sets. Example: a spell checker attempts to assign your misspelled word to an established class (i.e., a correctly spelled word).

Relies on finding other records that share the same predictor values as the record to be classified. We want to find the probability of belonging to class C, given specified values of the predictors: the conditional probability P(Y = C | X1 = x1, ..., Xp = xp).

Target variable: fraud / truthful
Predictors: prior pending legal charges (yes/no), size of firm (small/large)

Charges?   Size    Outcome
y          small   truthful
n          small   truthful
n          large   truthful
n          large   truthful
n          small   truthful
n          small   truthful
y          small   fraud
y          large   fraud
n          large   fraud
y          large   fraud

Outcome counts (truthful, fraud):

              Small    Large
Charges Yes   (1,1)    (0,2)
Charges No    (3,0)    (2,1)

P(F|C,S):

      Small   Large
Y     0.5     1.0
N     0       0.33

Rule:

      Small      Large
Y     ?          Fraud
N     Truthful   Truthful

Classify based on the majority in each cell. Error rate: 20%.

Goal: classify (as fraudulent or as truthful) a small firm with charges filed. There are 2 firms like that, one fraudulent and the other truthful, so P(fraud | charges = y, size = small) = 1/2 = 0.50. Note: the calculation is limited to the two firms matching those characteristics.
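The exact calculation can be sketched as a filter-and-count over the ten firms. The list `firms` transcribes the example data; the function name `exact_prob_fraud` is an illustrative assumption, not something from the slides.

```python
# The ten firms from the example: (charges, size, outcome).
firms = [
    ("y", "small", "truthful"), ("n", "small", "truthful"),
    ("n", "large", "truthful"), ("n", "large", "truthful"),
    ("n", "small", "truthful"), ("n", "small", "truthful"),
    ("y", "small", "fraud"),    ("y", "large", "fraud"),
    ("n", "large", "fraud"),    ("y", "large", "fraud"),
]

def exact_prob_fraud(records, charges, size):
    """P(fraud | charges, size) using only records matching both predictors."""
    matches = [outcome for c, s, outcome in records if c == charges and s == size]
    if not matches:
        return None  # no record shares this exact profile
    return matches.count("fraud") / len(matches)

print(exact_prob_fraud(firms, "y", "small"))
```

A return value of `None` corresponds to the situation where no record in the data matches the profile at all.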

Problem

Even with large data sets, may be hard to find other records that exactly match your record, in terms of predictor values.

Assume independence of the predictor variables (within each class). Use the multiplication rule. Find the same probability that a record belongs to class C, given its predictor values, without limiting the calculation to records that share all those same values.

Main idea: instead of looking at combinations of predictors (a crossed pivot table), look at each predictor separately. How can this be done? A probability trick!

It is based on Bayes rule. Then make a simplifying assumption, and get a powerful classifier!

Conditional Probability

A = the event {X = a}, B = the event {Y = b}. P(A | B) denotes the probability of A given B (the conditional probability that A occurs given that B occurred):

$$P(A \mid B) = \frac{P(A \cap B)}{P(B)}, \quad \text{if } P(B) > 0$$

What if I only know the opposite direction? Bayes rule gives a neat way to reverse the direction of conditioning. Since

$$P(A \cap B) = P(B \mid A)\,P(A) = P(A \mid B)\,P(B),$$

we get

$$P(B \mid A) = \frac{P(A \mid B)\,P(B)}{P(A)}$$

Example:

P(Fraud | Charge) P(Charge) = P(Charge | Fraud) P(Fraud)
P(Fraud | Charge) = P(Charge | Fraud) P(Fraud) / P(Charge)
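As a quick numerical check of the last identity, the counts from the ten-firm example can be plugged in: 4 fraudulent firms, 4 firms with charges, and 3 firms that are both. The variable names below are illustrative, and exact fractions are used so the two sides can be compared without rounding.

```python
from fractions import Fraction

# Counts read off the ten-firm fraud example.
n_firms = 10
n_fraud = 4             # fraudulent firms
n_charge = 4            # firms with prior charges
n_fraud_and_charge = 3  # fraudulent firms with prior charges

p_fraud = Fraction(n_fraud, n_firms)
p_charge = Fraction(n_charge, n_firms)
p_charge_given_fraud = Fraction(n_fraud_and_charge, n_fraud)

# Bayes rule: reverse the conditioning.
via_bayes = p_charge_given_fraud * p_fraud / p_charge

# Direct calculation: share of fraudulent firms among those with charges.
direct = Fraction(n_fraud_and_charge, n_charge)

print(via_bayes, direct)
```

Both routes give the same probability, as the rule requires.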


$$P(Y = 1 \mid X_1, \dots, X_p) = \frac{P(X_1, \dots, X_p \mid Y = 1)\,P(Y = 1)}{P(X_1, \dots, X_p)}$$

We want to estimate P(Y=1 | X1,...,Xp), but we don't have enough examples of each possible profile X1,...,Xp in the training set. If we had P(X1,...,Xp | Y=1) instead, we could separate it into P(X1|Y=1) * P(X2|Y=1) * ... * P(Xp|Y=1).

This is true if we can assume independence between X1,...,Xp within each class. That means we could use single pivot tables! If the dependence is not extreme, it will work reasonably well.

Independence Assumption

With the independence assumption, P(A ∩ B) = P(A) * P(B). We can thus calculate:

P(X1,...,Xp | Y=1) = P(X1|Y=1) * P(X2|Y=1) * ... * P(Xp|Y=1)
P(X1,...,Xp | Y=0) = P(X1|Y=0) * P(X2|Y=0) * ... * P(Xp|Y=0)
P(X1,...,Xp) = P(X1,...,Xp | Y=1) P(Y=1) + P(X1,...,Xp | Y=0) P(Y=0)

1. All predictors must be categorical.

2. From the training set, create all pivot tables of Y on each separate X. We can thus obtain P(X), P(X|Y=1), and P(X|Y=0).

3. For a to-be-predicted observation with predictors X1, X2, ..., Xp, the software computes the probability of belonging to Y=1 using the formula

$$P(Y = 1 \mid X_1, \dots, X_p) = \frac{P(X_1 \mid Y = 1)\,P(X_2 \mid Y = 1) \cdots P(X_p \mid Y = 1)\,P(Y = 1)}{P(X_1, \dots, X_p)}$$

Each of the probabilities in the formula is estimated from a pivot table, and the estimated P(Y=1) is the proportion of 1s in the training set.

4. Use the cutoff to determine the classification of this observation. Default: cutoff = 0.5 (classify to the group that is most likely).
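The steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not XLMiner's implementation: the function names `fit_naive_bayes` and `predict_proba` are hypothetical, the "pivot tables" are stored as counts, and the denominator is computed as the sum of the per-class numerators. The data are the ten firms from the fraud example.

```python
from collections import Counter, defaultdict

def fit_naive_bayes(rows, labels):
    """Step 2: build one 'pivot table' of counts per (class, predictor)."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)  # (class, predictor index) -> value counts
    for row, label in zip(rows, labels):
        for j, value in enumerate(row):
            value_counts[(label, j)][value] += 1
    return class_counts, value_counts

def predict_proba(class_counts, value_counts, row):
    """Step 3: P(Y = c | row), assuming independence within each class."""
    n = sum(class_counts.values())
    numerators = {}
    for c, count in class_counts.items():
        score = count / n  # estimated P(Y = c)
        for j, value in enumerate(row):
            # estimated P(X_j = value | Y = c)
            score *= value_counts[(c, j)][value] / count
        numerators[c] = score
    # Estimated P(X_1, ..., X_p); zero if every class has a zero factor.
    total = sum(numerators.values())
    return {c: s / total for c, s in numerators.items()}

# Ten firms: (charges, size) with outcome labels.
rows = [("y", "small"), ("n", "small"), ("n", "large"), ("n", "large"),
        ("n", "small"), ("n", "small"), ("y", "small"), ("y", "large"),
        ("n", "large"), ("y", "large")]
labels = ["truthful"] * 6 + ["fraud"] * 4

class_counts, value_counts = fit_naive_bayes(rows, labels)
probs = predict_proba(class_counts, value_counts, ("y", "small"))

# Step 4: apply the default cutoff of 0.5.
predicted = "fraud" if probs["fraud"] >= 0.5 else "truthful"
print(probs, predicted)
```

With unrounded arithmetic the estimate for a small firm with charges is about 0.529, which differs in the third decimal from figures computed with rounded intermediate probabilities.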


Note that the probability estimate does not differ greatly from the exact one. All records are used in the calculations, not just those matching the predictor values. This makes the calculations practical in most circumstances. The method relies on the assumption of independence between predictor variables within each class.

Independence Assumption

Not strictly justified (variables are often correlated with one another). Often good enough.

Target variable: fraud / truthful
Predictors: Charge (Y/N), Size (small/large)
Data: the same ten firms (4 fraudulent, 6 truthful).

P(F|C,S), exact:

Charge   Small   Large
Y        0.5     1.0
N        0       0.33

Outcome counts (truthful, fraud):

Charge   Small   Large   Sum
Y        (1,1)   (0,2)   (1,3)
N        (3,0)   (2,1)   (5,1)
Sum      (4,1)   (2,3)   (6,4)

P(C,S|F) P(F):

Charge   Small   Large   P(C|F)
Y        0.075   0.225   0.75
N        0.025   0.075   0.25
P(S|F)   0.25    0.75    P(F) = 0.40

For example, for Y and Small: 0.25 * 0.75 * 0.40 = 0.075

P(C,S|T) P(T):

Charge   Small   Large   P(C|T)
Y        0.067   0.034   0.17
N        0.334   0.164   0.83
P(S|T)   0.67    0.33    P(T) = 0.60

P(F|C,S) = P(C,S|F) P(F) / P(C,S)
         = P(C|F) P(S|F) P(F) / P(C,S)
P(C,S) = P(C,S|F) P(F) + P(C,S|T) P(T)

P(F|C,S), naive Bayes:

Charge   Small   Large
Y        0.528   0.869
N        0.070   0.316

The good

Simple
Can handle a large number of predictors
High performance accuracy when the goal is ranking
Pretty robust to the independence assumption!

The bad

Need to categorize continuous predictors
Predictors with rare categories -> zero probability (if this category is important, this is a problem)
Gives a biased probability of class membership
No insight about the importance/role of each predictor

Sheet: NNB-Output1

Prior class probabilities (according to relative occurrences in training data):

Class   Prob.
1       0.095333333   <-- Success Class
0       0.904666667

P(accept=1) = 0.095

Conditional probabilities (class 1):

Input variable   Value   Prob.
Online           0       0.374125874
Online           1       0.625874126
CreditCard       0       0.699300699
CreditCard       1       0.300699301

(The class 0 probabilities are not shown.)

Sheet: NNB-ValidScore1

Data range: ['UniversalBank KNN NBayes.xls']'Data_Partition1'!$C$3019:$O$5018
Cutoff: 0.5

Row Id.   Predicted Class   Actual Class   Prob. for 1 (success)   Online   CreditCard
2         0                 0              0.08795125              0        0
3         0                 0              0.08795125              0        0
7         0                 0              0.097697987             1        0
8         0                 0              0.092925663             0        1
11        0                 0              0.08795125              0        0
13        0                 0              0.08795125              0        0
14        0                 0              0.097697987             1        0
15        0                 0              0.08795125              0        0
16        0                 0              0.10316131              1        1

K-Nearest Neighbors

Basic Idea

For a given record to be classified, identify nearby records. "Near" means records with similar predictor values X1, X2, ..., Xp. Classify the record as whatever the predominant class is among the nearby records (the "neighbors").

The most popular distance measure is Euclidean distance
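For reference, Euclidean distance between two records is the square root of the sum of squared differences across predictors. A minimal sketch follows; the function name `euclidean` is an illustrative choice, and the two example records are (Income, Lot_Size) pairs.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length numeric records."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Two records described by (Income, Lot_Size).
d = euclidean((60.0, 18.4), (85.5, 16.8))
print(round(d, 2))
```

Because the predictors enter on their raw scales, a predictor with a larger range (here Income) dominates the distance unless the data are rescaled first.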

Choosing k

k=1 means use the single nearest record; k=5 means use the 5 nearest records.

Typically, choose the value of k that has the lowest error rate in the validation data.

[Figure: neighborhood for k=3, plotted on axes X1 and X2]

Low values of k (1, 3, ...) capture local structure in the data (but also noise). High values of k provide more smoothing and less noise, but may miss local structure. Note: the extreme case of k = n (i.e., the entire data set) is the same thing as the naive rule (classify all records according to the majority class).

Data: 24 households classified as owning or not owning riding mowers. Predictors: Income, Lot Size.

Income   Lot_Size   Ownership
60.0     18.4       owner
85.5     16.8       owner
64.8     21.6       owner
61.5     20.8       owner
87.0     23.6       owner
110.1    19.2       owner
108.0    17.6       owner
82.8     22.4       owner
69.0     20.0       owner
93.0     20.8       owner
51.0     22.0       owner
81.0     20.0       owner
75.0     19.6       non-owner
52.8     20.8       non-owner
64.8     17.2       non-owner
43.2     20.4       non-owner
84.0     17.6       non-owner
49.2     17.6       non-owner
59.4     16.0       non-owner
66.0     18.4       non-owner
47.4     16.4       non-owner
33.0     18.8       non-owner
51.0     14.0       non-owner
63.0     14.8       non-owner
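A minimal k-NN classifier over these 24 households might look like the sketch below. It uses unscaled Euclidean distance and a simple majority vote; the function name `knn_classify` and the new household (Income 60.0, Lot_Size 20.0) are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter

income = [60.0, 85.5, 64.8, 61.5, 87.0, 110.1, 108.0, 82.8, 69.0, 93.0,
          51.0, 81.0, 75.0, 52.8, 64.8, 43.2, 84.0, 49.2, 59.4, 66.0,
          47.4, 33.0, 51.0, 63.0]
lot_size = [18.4, 16.8, 21.6, 20.8, 23.6, 19.2, 17.6, 22.4, 20.0, 20.8,
            22.0, 20.0, 19.6, 20.8, 17.2, 20.4, 17.6, 17.6, 16.0, 18.4,
            16.4, 18.8, 14.0, 14.8]
ownership = ["owner"] * 12 + ["non-owner"] * 12
households = list(zip(income, lot_size))

def knn_classify(train_x, train_y, new_x, k):
    """Majority class among the k training records nearest to new_x."""
    # Indices of training records, sorted by distance to the new record.
    order = sorted(range(len(train_x)),
                   key=lambda i: math.dist(train_x[i], new_x))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]

# Hypothetical new household, not part of the original data.
new_household = (60.0, 20.0)
print(knn_classify(households, ownership, new_household, k=3))
```

With an even k, or with k equal to the full data set here (12 owners and 12 non-owners), votes can tie; this sketch resolves ties arbitrarily, which a real implementation would want to handle explicitly.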

XLMiner Output

For each record in the validation data (6 records), XLMiner finds neighbors amongst the training data (18 records). The record is scored for k=1, k=2, ..., k=18. The best k seems to be k=8. k = 9, 10, 12, and 14 share the same low error rate, but it is best to choose the lowest k.

Value of k   % Error Training   % Error Validation
1            0.00               33.33
2            16.67              33.33
3            11.11              33.33
4            22.22              33.33
5            11.11              33.33
6            27.78              33.33
7            22.22              33.33
8            22.22              16.67   <--- Best k
9            22.22              16.67
10           22.22              16.67
11           16.67              33.33
12           16.67              16.67
13           11.11              33.33
14           11.11              16.67
15           5.56               33.33
16           16.67              33.33
17           11.11              33.33
18           50.00              50.00
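The selection rule (lowest validation error, ties broken in favor of the smallest k) can be sketched directly on the validation error column of the output. The dictionary name `validation_error` is illustrative.

```python
# % validation error for k = 1..18, transcribed from the XLMiner output.
validation_error = {
    1: 33.33, 2: 33.33, 3: 33.33, 4: 33.33, 5: 33.33, 6: 33.33,
    7: 33.33, 8: 16.67, 9: 16.67, 10: 16.67, 11: 33.33, 12: 16.67,
    13: 33.33, 14: 16.67, 15: 33.33, 16: 33.33, 17: 33.33, 18: 50.00,
}

lowest = min(validation_error.values())

# All values of k that achieve the lowest validation error.
tied = sorted(k for k, err in validation_error.items() if err == lowest)

# Break ties by preferring the smallest k.
best_k = tied[0]
print(tied, best_k)
```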

For a numerical outcome, instead of having a majority vote determine the class, use the average of the neighbors' response values. This may be a weighted average, with the weight decreasing with distance.
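A minimal sketch of k-NN for a numerical outcome follows. The inverse-distance weighting (weight = 1/distance) is one common choice and is an assumption here, since the slides do not specify a weighting scheme; the function name `knn_predict` and the tiny data set are hypothetical.

```python
import math

def knn_predict(train_x, train_y, new_x, k, weighted=False):
    """Average (optionally distance-weighted) response of the k nearest records."""
    # Pair each training record's distance with its response, nearest first.
    nearest = sorted((math.dist(x, new_x), y) for x, y in zip(train_x, train_y))[:k]
    if not weighted:
        return sum(y for _, y in nearest) / k
    # Inverse-distance weights; the small constant avoids division by zero
    # when the new record coincides with a training record.
    weights = [1.0 / (d + 1e-9) for d, _ in nearest]
    return sum(w * y for w, (_, y) in zip(weights, nearest)) / sum(weights)

# Hypothetical one-predictor data set.
train_x = [(1.0,), (2.0,), (4.0,), (8.0,)]
train_y = [10.0, 20.0, 40.0, 80.0]

plain = knn_predict(train_x, train_y, (2.5,), k=3)
weighted = knn_predict(train_x, train_y, (2.5,), k=3, weighted=True)
print(plain, weighted)
```

In the weighted version the closest neighbor pulls the prediction toward its own response value, so the two averages generally differ.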

Advantages

Simple
No assumptions required about Normal distribution, etc.
Effective at capturing complex interactions among variables without having to define a statistical model

Shortcomings

The required size of the training set increases exponentially with the number of predictors, p. This is because the expected distance to the nearest neighbor increases with p (with a large vector of predictors, all records end up "far away" from each other).

In a large training set, it takes a long time to find the distances to all the neighbors and then identify the nearest one(s).

These constitute the "curse of dimensionality".

Ways to deal with it:
Reduce the dimension of the predictors (e.g., with PCA)
Use computational shortcuts that settle for "almost nearest neighbors"

Summary

Naive rule: a benchmark.
Naive Bayes and k-NN are two variations on the same theme: classify a new record according to the class of similar records.
No statistical models are involved.
These methods pay attention to complex interactions and local structure.
Computational challenges remain.
