Classification and Prediction

Published by Surya Kameswari
a book related to data mining

Published by: Surya Kameswari on Feb 27, 2013
Copyright:Attribution Non-commercial


Pirjo Moen

Univ. of Helsinki/Dept. of CS

Data mining methods – Spring 2005

Classification

Introduction 17.1.
KDD Process 20.1.
Concept description 31.1.
Classification 14.2.
Clustering 28.2.
Association rules 14.3.
Conclusions 4.4.
Exam 12.4.

What is classification?

- Aim: predict categorical class labels for new tuples/samples
- Input: a training set of tuples/samples, each with a class label
- Output: a model (a classifier) based on the training set and the class labels

Classification and prediction

Overview

- What is classification? What is prediction?
- Decision tree induction
- Bayesian classification
- Other classification methods
- Classification accuracy
- Summary

What is prediction?

- Is similar to classification
    - constructs a model
    - uses the model to predict unknown or missing values
- Major method: regression
    - linear and multiple regression
    - non-linear regression

Classification vs. prediction

- Classification:
    - predicts categorical class labels
    - classifies data based on the training set and the values of the classification attribute
    - uses the constructed model in classifying new data
- Prediction:
    - models continuous-valued functions
    - predicts unknown or missing values

Classification vs. clustering

- Classification = supervised learning
    - training set of tuples/samples accompanied by class labels
    - classify new data based on the training set
- Clustering = unsupervised learning
    - class labels of training data are unknown
    - the aim is to find classes or clusters that may exist in the data

Applications

- Typical applications
    - Credit approval
    - Target marketing
    - Medical diagnosis
    - Treatment effectiveness analysis
    - Performance prediction

Classification: a two-step process

It's a 2-step process!

- Step 1: Model construction
    - build a model from the training set
- Step 2: Model usage
    - check the accuracy of the model and use it for classifying new data

Model construction

- Each tuple/sample is assumed to belong to a predefined class.
- The class of a tuple/sample is determined by the class label attribute.
- The training set of tuples/samples is used for model construction.
- The model is represented as classification rules, decision trees or mathematical formulae.

Model usage

- Classify future or unknown objects
- Estimate the accuracy of the model:
    - the known class of a test tuple/sample is compared with the result given by the model
    - accuracy rate = percentage of the test tuples/samples correctly classified by the model
- Note: the test set should be independent of the training set; otherwise overfitting goes undetected

Example of model construction (Step 1)

Training data:

NAME    RANK             YEARS   TENURED
Mary    Assistant Prof   3       No
James   Assistant Prof   7       Yes
Bill    Professor        2       Yes
John    Associate Prof   7       Yes
Mark    Assistant Prof   6       No
Annie   Associate Prof   3       No

Example of model usage (Step 2)

Test data:

NAME    RANK             YEARS   TENURED
Tom     Assistant Prof   2       No
Lisa    Associate Prof   7       No
Jack    Professor        5       Yes
Ann     Assistant Prof   7       Yes

Unseen data: (Jeff, Professor, 4). Tenured?
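The accuracy estimate of step 2 can be sketched in Python as follows. The rule used as the classifier is an assumption: it is one rule that agrees with all six training tuples, not necessarily the model drawn on the original slide. The names `classify` and `accuracy` are illustrative as well.

```python
# Test set from the model-usage example: (name, rank, years, tenured).
TEST_SET = [
    ("Tom", "Assistant Prof", 2, "No"),
    ("Lisa", "Associate Prof", 7, "No"),
    ("Jack", "Professor", 5, "Yes"),
    ("Ann", "Assistant Prof", 7, "Yes"),
]


def classify(rank: str, years: int) -> str:
    """Hypothetical model (assumption): tenured if professor or more than 6 years."""
    return "Yes" if rank == "Professor" or years > 6 else "No"


def accuracy(test_set) -> float:
    """Fraction of test tuples whose known class matches the model's output."""
    correct = sum(
        1 for _, rank, years, label in test_set if classify(rank, years) == label
    )
    return correct / len(test_set)


if __name__ == "__main__":
    print(f"accuracy = {accuracy(TEST_SET):.2f}")
    print("Jeff tenured?", classify("Professor", 4))
```

Under this assumed rule, three of the four test tuples are classified correctly (Lisa is not), and the unseen tuple (Jeff, Professor, 4) would be labeled Yes.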

Data preparation

- Data cleaning
    - noise
    - missing values
- Relevance analysis (feature selection)
- Data transformation

Evaluation of classification methods

- Accuracy
- Speed
- Robustness
- Scalability
- Interpretability
- Simplicity

Decision tree induction

A decision tree is a tree where

- internal node = a test on an attribute
- tree branch = an outcome of the test
- leaf node = class label or class distribution

[Figure: example decision tree with test nodes A?, B?, C?, D? and a leaf labeled Yes]

Decision tree generation

Two phases of decision tree generation:

- tree construction
    - at start, all the training examples at the root
    - partition examples based on selected attributes
    - test attributes are selected based on a heuristic or a statistical measure
- tree pruning
    - identify and remove branches that reflect noise or outliers

Decision tree induction – classical example

Training set: play tennis?

Outlook    Temperature  Humidity  Windy  Class
sunny      hot          high      false  N
sunny      hot          high      true   N
overcast   hot          high      false  P
rain       mild         high      false  P
rain       cool         normal    false  P
rain       cool         normal    true   N
overcast   cool         normal    true   P
sunny      mild         high      false  N
sunny      cool         normal    false  P
rain       mild         normal    false  P
sunny      mild         normal    true   P
overcast   mild         high      true   P
overcast   hot          normal    false  P
rain       mild         high      true   N

Decision tree obtained with ID3 (Quinlan 1986)

[Figure: decision tree for the play-tennis data; not recoverable from the extracted text]

From a decision tree to classification rules

- One rule is generated for each path in the tree from the root to a leaf.
- Each attribute-value pair along a path forms a conjunction.
- The leaf node holds the class prediction.
- Rules are generally simpler to understand than trees.

Decision tree algorithms

- Basic algorithm
    - constructs a tree in a top-down recursive divide-and-conquer manner
    - attributes are assumed to be categorical
    - greedy (may get trapped in local maxima)
- Many variants: ID3, C4.5, CART, CHAID
    - main difference: divide (split) criterion / attribute selection measure
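The path-to-rule conversion can be illustrated with a short Python sketch. The tree below is an assumption: it is a tree that classifies all 14 play-tennis samples correctly, written as nested tuples and dictionaries, and the representation and the name `tree_to_rules` are illustrative choices.

```python
# Assumed tree: internal nodes are (attribute, {value: subtree}) pairs,
# leaves are class labels.
TREE = ("outlook", {
    "sunny": ("humidity", {"high": "N", "normal": "P"}),
    "overcast": "P",
    "rain": ("windy", {"true": "N", "false": "P"}),
})


def tree_to_rules(node, conditions=()):
    """Return one 'IF ... THEN class = ...' rule per root-to-leaf path."""
    if isinstance(node, str):  # leaf: holds the class prediction
        body = " AND ".join(f"{a} = {v}" for a, v in conditions) or "TRUE"
        return [f"IF {body} THEN class = {node}"]
    attribute, branches = node
    rules = []
    for value, subtree in branches.items():
        # Each attribute-value pair on the path adds one conjunct.
        rules.extend(tree_to_rules(subtree, conditions + ((attribute, value),)))
    return rules


if __name__ == "__main__":
    for rule in tree_to_rules(TREE):
        print(rule)
```

The assumed tree has five leaves, so the sketch produces five rules, for instance `IF outlook = overcast THEN class = P`.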

Attribute selection measures

- Information gain
- Gini index
- χ² contingency table statistic
- G-statistic

Information gain

- Select the attribute with the highest information gain
- Let P and N be two classes and S a dataset with p P-elements and n N-elements
- The amount of information needed to decide if an arbitrary example belongs to the class P or N is

$$I(p, n) = -\frac{p}{p+n}\log_2\frac{p}{p+n} - \frac{n}{p+n}\log_2\frac{n}{p+n}$$

Information gain (2)

- Let sets {S1, S2, …, Sv} form a partition of the set S, when using the attribute A.
- Let each Si contain pi examples of P and ni examples of N.
- The entropy, or the expected information needed to classify objects in all the subtrees Si, is

$$E(A) = \sum_{i=1}^{v} \frac{p_i + n_i}{p + n}\, I(p_i, n_i)$$

- The information that would be gained by branching on the attribute A is

$$\mathrm{Gain}(A) = I(p, n) - E(A)$$

Example of information gain

- Assumptions:
    - Class P: plays_tennis = "yes"
    - Class N: plays_tennis = "no"
- Information needed to classify a given sample:

$$I(p, n) = I(9, 5) = 0.940$$
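The three quantities I(p, n), E(A) and Gain(A) translate almost directly into code. The following is a minimal Python sketch; the function names and the choice to pass a partition as a list of (p_i, n_i) pairs are assumptions made for illustration.

```python
from math import log2


def info(p: int, n: int) -> float:
    """I(p, n): bits needed to decide between class P and class N."""
    total = p + n
    # A class with zero examples contributes nothing (0 * log2(0) is taken as 0).
    return -sum(c / total * log2(c / total) for c in (p, n) if c > 0)


def expected_info(partition) -> float:
    """E(A) for a partition given as a list of (p_i, n_i) pairs."""
    total = sum(p_i + n_i for p_i, n_i in partition)
    return sum((p_i + n_i) / total * info(p_i, n_i) for p_i, n_i in partition)


def gain(partition) -> float:
    """Gain(A) = I(p, n) - E(A)."""
    p = sum(p_i for p_i, _ in partition)
    n = sum(n_i for _, n_i in partition)
    return info(p, n) - expected_info(partition)


if __name__ == "__main__":
    print(f"I(9, 5) = {info(9, 5):.3f}")
```

For the play-tennis data, `info(9, 5)` evaluates to about 0.940, matching the value in the example.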

Example of information gain (2)

Compute the entropy for the attribute outlook:

outlook    pi   ni   I(pi, ni)
sunny      2    3    0.971
overcast   4    0    0
rain       3    2    0.971

Now

$$E(\text{outlook}) = \frac{5}{14} I(2, 3) + \frac{4}{14} I(4, 0) + \frac{5}{14} I(3, 2) = 0.694$$

Hence

$$\mathrm{Gain}(\text{outlook}) = I(9, 5) - E(\text{outlook}) = 0.246$$

Similarly

$$\mathrm{Gain}(\text{temperature}) = 0.029, \quad \mathrm{Gain}(\text{humidity}) = 0.151, \quad \mathrm{Gain}(\text{windy}) = 0.048$$

Other criteria used in decision tree construction

- Conditions for stopping partitioning
    - all samples belong to the same class
    - no attributes left for further partitioning => majority voting for classifying the leaf
    - no samples left for classifying
    - the value of the attribute selection measure below a given threshold
- Branching scheme
    - binary vs. k-ary splits
    - categorical vs. continuous attributes
- Labeling rule
    - leaf node label = the class to which most samples at the node belong

Overfitting in decision tree classification

- The generated tree may overfit the training data
    - too many branches
    - poor accuracy for unseen samples
- Reasons for overfitting
    - noise and outliers
    - too little training data
    - local maxima in the greedy search

How to avoid overfitting?

Two approaches:

- prepruning: stop tree construction earlier (what is an appropriate threshold?)
- postpruning: remove branches from a "fully grown" tree (which is the "best pruned tree"?)
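The gain computation and the greedy top-down construction can be tried out on the 14 play-tennis samples. The Python sketch below is a simplified illustration under stated assumptions, not a full ID3 implementation: it handles categorical attributes only, stops when a node is pure or no attributes are left (majority vote), and does no pruning. All names are illustrative.

```python
from collections import Counter
from math import log2

ATTRIBUTES = ("outlook", "temperature", "humidity", "windy")
# The 14 play-tennis samples: four attribute values, then the class (P or N).
DATA = [row.split() for row in """
sunny hot high false N
sunny hot high true N
overcast hot high false P
rain mild high false P
rain cool normal false P
rain cool normal true N
overcast cool normal true P
sunny mild high false N
sunny cool normal false P
rain mild normal false P
sunny mild normal true P
overcast mild high true P
overcast hot normal false P
rain mild high true N
""".strip().splitlines()]


def info(rows) -> float:
    """Information needed to classify a sample drawn from rows."""
    counts = Counter(row[-1] for row in rows)
    return -sum(c / len(rows) * log2(c / len(rows)) for c in counts.values())


def gain(rows, index: int) -> float:
    """Information gained by partitioning rows on the attribute at index."""
    subsets = {}
    for row in rows:
        subsets.setdefault(row[index], []).append(row)
    expected = sum(len(s) / len(rows) * info(s) for s in subsets.values())
    return info(rows) - expected


def build_tree(rows, candidates):
    """Greedy top-down induction; candidates are the unused attribute indices."""
    classes = Counter(row[-1] for row in rows)
    majority = classes.most_common(1)[0][0]
    if len(classes) == 1 or not candidates:
        return majority  # pure node, or no attributes left: majority vote
    best = max(candidates, key=lambda i: gain(rows, i))
    remaining = [i for i in candidates if i != best]
    branches = {}
    for value in sorted({row[best] for row in rows}):
        subset = [row for row in rows if row[best] == value]
        branches[value] = build_tree(subset, remaining)
    return (ATTRIBUTES[best], branches)


if __name__ == "__main__":
    for i, name in enumerate(ATTRIBUTES):
        print(f"Gain({name}) = {gain(DATA, i):.3f}")
    print(build_tree(DATA, list(range(len(ATTRIBUTES)))))
```

Computed without intermediate rounding, the gains for outlook and humidity come out as roughly 0.247 and 0.152; the values 0.246 and 0.151 in the example result from subtracting already rounded terms. Outlook has the highest gain either way, so it becomes the root test.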

Classification in large databases

- Scalability: classifying data sets with millions of samples and hundreds of attributes with reasonable speed
- Why decision tree induction in data mining?
    - Relatively faster learning speed than other methods
    - Convertible to simple and understandable classification rules
    - Can use SQL queries for accessing databases
    - Comparable classification accuracy

Scalable decision tree induction methods

- SLIQ (EDBT'96, Mehta et al.)
- SPRINT (VLDB'96, J. Shafer et al.)
- PUBLIC (VLDB'98, Rastogi & Shim)
- RainForest (VLDB'98, Gehrke, Ramakrishnan & Ganti)

Bayesian classification: Why?

- Probabilistic learning:
    - calculate explicit probabilities for hypothesis
    - among the most practical approaches to certain types of learning problems
- Incremental:
    - each training example can incrementally increase/decrease the probability that a hypothesis is correct
    - prior knowledge can be combined with observed data
- Probabilistic prediction:
    - predict multiple hypotheses, weighted by their probabilities
- Standard:
    - standard of optimal decision making against which other methods can be measured

Bayesian classification

- The classification problem may be formalized using posterior probabilities:
    - P(C|X) = probability that the sample tuple X = <x1,…,xk> belongs to the class C
    - For example P(class=N | outlook=sunny, windy=true, …)
- Idea: assign to a sample X the class label C such that P(C|X) is maximal

Estimating posterior probabilities

- Bayes theorem: P(C|X) = P(X|C)·P(C) / P(X) where
    - P(X) is constant for all the classes
    - P(C) is the relative frequency of samples in the class C
- Finding a class C such that P(C|X) is maximum means finding a class C such that P(X|C)·P(C) is maximum
    - problem: computing P(X|C) directly is infeasible!

Naïve Bayesian classification

- Naïve assumption: attribute independence
    - P(x1,…,xk|C) = P(x1|C)·…·P(xk|C)
- If the i-th attribute is categorical: P(xi|C) is estimated as the relative frequency of samples having value xi as the i-th attribute in the class C
- If the i-th attribute is continuous: P(xi|C) is estimated through a Gaussian density function
- Computationally easy in both cases

Example of naïve Bayesian classification

Estimating P(xi|C):

P(p) = 9/14                     P(n) = 5/14

Outlook
P(sunny | p) = 2/9              P(sunny | n) = 3/5
P(overcast | p) = 4/9           P(overcast | n) = 0
P(rain | p) = 3/9               P(rain | n) = 2/5

Temperature
P(hot | p) = 2/9                P(hot | n) = 2/5
P(mild | p) = 4/9               P(mild | n) = 2/5
P(cool | p) = 3/9               P(cool | n) = 1/5

Humidity
P(high | p) = 3/9               P(high | n) = 4/5
P(normal | p) = 6/9             P(normal | n) = 1/5

Windy
P(true | p) = 3/9               P(true | n) = 3/5
P(false | p) = 6/9              P(false | n) = 2/5

Example of naïve Bayesian classification (2)

Classifying a sample X:

1. An unseen sample X = <rain, hot, high, false>
2. P(X|p)·P(p) = P(rain|p)·P(hot|p)·P(high|p)·P(false|p)·P(p) = 3/9·2/9·3/9·6/9·9/14 = 0.010582
3. P(X|n)·P(n) = P(rain|n)·P(hot|n)·P(high|n)·P(false|n)·P(n) = 2/5·2/5·4/5·2/5·5/14 = 0.018286
4. The sample X is classified in class N (don't play).
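The worked example can be reproduced with a short Python sketch that estimates every probability as a relative frequency over the 14 play-tennis samples. It covers categorical attributes only and applies no smoothing; the function names and the use of `fractions.Fraction` for exact arithmetic are illustrative assumptions.

```python
from fractions import Fraction

# The 14 play-tennis samples: outlook, temperature, humidity, windy, class.
DATA = [row.split() for row in """
sunny hot high false N
sunny hot high true N
overcast hot high false P
rain mild high false P
rain cool normal false P
rain cool normal true N
overcast cool normal true P
sunny mild high false N
sunny cool normal false P
rain mild normal false P
sunny mild normal true P
overcast mild high true P
overcast hot normal false P
rain mild high true N
""".strip().splitlines()]


def score(sample, label: str) -> Fraction:
    """P(X|C)·P(C) under the naive attribute-independence assumption."""
    in_class = [row for row in DATA if row[-1] == label]
    result = Fraction(len(in_class), len(DATA))  # P(C)
    for i, value in enumerate(sample):
        matches = sum(1 for row in in_class if row[i] == value)
        result *= Fraction(matches, len(in_class))  # P(x_i | C)
    return result


def classify(sample) -> str:
    """Pick the class that maximizes P(X|C)·P(C)."""
    return max(("P", "N"), key=lambda label: score(sample, label))


if __name__ == "__main__":
    x = ("rain", "hot", "high", "false")
    for label in ("P", "N"):
        print(f"P(X|{label})·P({label}) = {float(score(x, label)):.6f}")
    print("class:", classify(x))
```

Because P(X) is the same for both classes, comparing the two products is enough; the sketch never computes P(C|X) itself.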

Naïve Bayesian classification – the independence hypothesis

- … makes computation possible
- … yields optimal classifiers when satisfied
- … but is seldom satisfied in practice, as attributes (variables) are often correlated.
- Attempts to overcome this limitation:
    - Bayesian networks, which combine Bayesian reasoning with causal relationships between attributes
    - Decision trees, which reason on one attribute at a time, considering the most important attributes first

Other classification methods

More methods:

- Neural networks
- Association-based approaches
- k-nearest neighbor classifier
- Case-based reasoning
- Genetic algorithms
- Rough set approach
- Fuzzy set approaches

Classification accuracy

Estimating error rates:

- Partition: training-and-testing (large data sets)
    - use two independent data sets, e.g., training set (2/3), test set (1/3)
- Cross-validation (moderate data sets)
    - divide the data set into k subsets
    - use k-1 subsets as training data and one subset as test data (k-fold cross-validation)
- Bootstrapping and leave-one-out (small data sets)

Summary

- Classification is an extensively studied problem.
- Classification is probably one of the most widely used data mining techniques with a lot of extensions.
- Scalability is still an important issue for database applications.
- Research directions: classification of nonrelational data, e.g., text, spatial and multimedia.
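The k-fold partitioning step can be sketched in Python as follows. This is a minimal illustration that only produces index sets; training and scoring a classifier on each split is left out. The function name, the fixed seed and the round-robin fold assignment are assumptions.

```python
import random


def k_fold_splits(n_samples: int, k: int, seed: int = 0):
    """Yield (train_indices, test_indices) once for each of the k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # shuffle so folds are not order-dependent
    folds = [indices[i::k] for i in range(k)]  # k subsets of nearly equal size
    for i, test in enumerate(folds):
        # The other k-1 subsets together form the training data.
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train, test


if __name__ == "__main__":
    for train, test in k_fold_splits(n_samples=14, k=7):
        print(f"train on {len(train)} samples, test on {sorted(test)}")
```

The error rate estimate is then the average of the k per-fold error rates. Setting `k` equal to `n_samples` gives leave-one-out, where every test set holds a single sample.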

References

- C. Apte and S. Weiss. Data mining with decision trees and decision rules. Future Generation Computer Systems, 13, 1997.
- F. Bonchi, F. Giannotti, G. Mainetto, D. Pedreschi. Using Data Mining Techniques in Fiscal Fraud Detection. In Proc. DaWak'99, First Int. Conf. on Data Warehousing and Knowledge Discovery, Sept. 1999.
- F. Bonchi, F. Giannotti, G. Mainetto, D. Pedreschi. A Classification-based Methodology for Planning Audit Strategies in Fraud Detection. In Proc. KDD-99, ACM-SIGKDD Int. Conf. on Knowledge Discovery & Data Mining, Aug. 1999.
- L. Breiman, J. Friedman, R. Olshen, and C. Stone. Classification and Regression Trees. Wadsworth International Group, 1984.
- J. Catlett. Megainduction: machine learning on very large databases. PhD Thesis, Univ. Sydney, 1991.
- P. K. Chan and S. J. Stolfo. Metalearning for multistrategy and parallel learning. In Proc. 2nd Int. Conf. on Information and Knowledge Management, p. 314-323, 1993.
- P. K. Chan and S. J. Stolfo. Learning arbiter and combiner trees from partitioned data for scaling machine learning. In Proc. KDD'95, August 1995.
- J. Gehrke, R. Ramakrishnan, and V. Ganti. Rainforest: A framework for fast decision tree construction of large datasets. In Proc. 1998 Int. Conf. Very Large Data Bases, pages 416-427, New York, NY, August 1998.
- B. Liu, W. Hsu and Y. Ma. Integrating classification and association rule mining. In Proc. KDD'98, New York, 1998.
- J. Magidson. The CHAID approach to segmentation modeling: Chi-squared automatic interaction detection. In R. P. Bagozzi, editor, Advanced Methods of Marketing Research, pages 118-159. Blackwell Business, Cambridge Massachusetts, 1994.
- M. Mehta, R. Agrawal, and J. Rissanen. SLIQ: A fast scalable classifier for data mining. In Proc. 1996 Int. Conf. Extending Database Technology (EDBT'96), Avignon, France, March 1996.
- S. K. Murthy. Automatic Construction of Decision Trees from Data: A Multi-Disciplinary Survey. Data Mining and Knowledge Discovery 2(4): 345-389, 1998.
- J. R. Quinlan. Induction of decision trees. Machine Learning, 1:81-106, 1986.
- J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.
- J. R. Quinlan. Bagging, boosting, and C4.5. In Proc. 13th Natl. Conf. on Artificial Intelligence (AAAI'96), 725-730, Portland, OR, Aug. 1996.
- R. Rastogi and K. Shim. Public: A decision tree classifier that integrates building and pruning. In Proc. 1998 Int. Conf. Very Large Data Bases, 404-415, New York, NY, August 1998.
- D. E. Rumelhart, G. E. Hinton and R. J. Williams. Learning internal representation by error propagation. In D. E. Rumelhart and J. L. McClelland (eds.) Parallel Distributed Processing. The MIT Press, 1986.
- J. Shafer, R. Agrawal, and M. Mehta. SPRINT: A scalable parallel classifier for data mining. In Proc. 1996 Int. Conf. Very Large Data Bases, 544-555, Bombay, India, Sept. 1996.
- S. M. Weiss and C. A. Kulikowski. Computer Systems that Learn: Classification and Prediction Methods from Statistics, Neural Nets, Machine Learning, and Expert Systems. Morgan Kaufmann, 1991.

Thank you

Thank you to Jiawei Han from Simon Fraser University for his slides, which greatly helped in preparing this lecture! Thank you also to Fosca Giannotti and Dino Pedreschi from Pisa for their slides on classification!
