DS 535: ADVANCED DATA MINING FOR BUSINESS
Lecture Notes #6: Classification and Regression Trees, Random Forests, and Boosted Trees
(Textbook reading: Chapter 9)

Trees and Rules
Goal: Classify or predict an outcome based on a set of predictors
The output is a set of rules
Example:
• Goal: classify a record as "will accept credit card offer" or "will not accept"
• A rule might be "IF (Income >= 106) AND (Education < 1.5) AND (Family <= 2.5) THEN Class = 0 (nonacceptor)"
• Also called CART, Decision Trees, or just Trees
• Rules are represented by tree diagrams

FIGURE 9.1 Best-pruned tree obtained by fitting a full tree to the training data (splits on Income, Education, Family, and CCAvg)

Example: Riding Mowers data, 24 households (12 owners and 12 nonowners of riding mowers)

Income ($000s)   Lot Size (000s sq ft)   Ownership
60.0             18.4                    owner
85.5             16.8                    owner
64.8             21.6                    owner
61.5             20.8                    owner
87.0             23.6                    owner
110.1            19.2                    owner
108.0            17.6                    owner
82.8             22.4                    owner
69.0             20.0                    owner
93.0             20.8                    owner
51.0             22.0                    owner
81.0             20.0                    owner
75.0             19.6                    non-owner
52.8             20.8                    non-owner
64.8             17.2                    non-owner
43.2             20.4                    non-owner
84.0             17.6                    non-owner
49.2             17.6                    non-owner
59.4             16.0                    non-owner
66.0             18.4                    non-owner
47.4             16.4                    non-owner
33.0             18.8                    non-owner
51.0             14.0                    non-owner
63.0             14.8                    non-owner

FIGURE 9.2 Scatter plot of Lot Size vs. Income for the 24 owners and nonowners of riding mowers

How to split
• Order records according to one variable, say Income
• Take a predictor value, say 60 (the first record), and divide records into those with Income >= 60 and those with Income < 60
• Measure the resulting purity (homogeneity) of class in each resulting portion
• Try all other split values
• Repeat for the other variable(s)
• Select the one variable & split value that yields the most purity

Note: Categorical Variables
• Examine all possible ways in which the categories can be split.
• E.g., categories A, B, C can be split 3 ways:
  {A} and {B, C}
  {B} and {A, C}
  {C} and {A, B}
• With many categories, the number of possible splits becomes huge

The first split: Income = 60

FIGURE 9.5 Splitting the 24 records first by the Income value of 60, and then by Lot Size

After All Splits

FIGURE 9.6 Final stage of recursive partitioning; each rectangle consists of a single class (owners or nonowners)

Gini index for rectangle A:
I(A) = 1 - \sum_{k=1}^{m} p_k^2
where p_k = proportion of cases in rectangle A that belong to class k (out of m classes)
• I(A) = 0 when all cases belong to the same class
• Maximum value when all classes are equally represented (= 0.50 in the binary case)
Note: XLMiner uses a variant called the "delta splitting rule"

Entropy:
entropy(A) = - \sum_{k=1}^{m} p_k \log_2(p_k)
where p_k = proportion of cases in rectangle A that belong to class k (out of m classes)
• Entropy ranges between 0 (most pure) and log2(m) (equal representation of classes)

Impurity and Recursive Partitioning
• Obtain an overall impurity measure (weighted avg. of the individual rectangles)
• At each successive stage, compare this measure across all possible splits in all variables
• Choose the split that reduces impurity the most
• Chosen split points become nodes on the tree

R code:

library(rpart)
library(rpart.plot)
mower.df <- read.csv("RidingMowers.csv")
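Before fitting a tree, here is a minimal sketch (not part of the textbook code) that computes the Gini index and entropy of the root node directly from the definitions above; the helper function names gini() and entropy() are ours, while mower.df and the Ownership column come from the data just read in. With 12 owners and 12 nonowners, both measures are at their maximum.

# impurity of the root node, computed from the definitions above
gini <- function(p) 1 - sum(p^2)
entropy <- function(p) -sum(p * log2(p))
p.root <- prop.table(table(mower.df$Ownership))   # class proportions: 0.5, 0.5
gini(p.root)      # 0.5  (maximum for two classes)
entropy(p.root)   # 1    (= log2(2))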
# use rpart() to run a classification tree.
# define rpart.control() in rpart() to determine the depth of the tree.
# maxdepth sets the maximum depth of any node of the final tree, with the root node
# counted as depth 0.
class.tree <- rpart(Ownership ~ ., data = mower.df,
                    control = rpart.control(maxdepth = 2), method = "class")
# in this example, maxdepth = 1 gives the same results

## plot tree
# use prp() to plot the tree. You can control plotting parameters such as color, shape,
# and information displayed (which and where).
prp(class.tree, type = 1, extra = 1, split.font = 1, varlen = -10)

First Split: The Tree

> prp(class.tree)   # gives the plot of the one-split tree

The left rectangle (Income < 60) contains 7 nonowners and 1 owner. The impurity measures for this rectangle are:
• Gini_left = 1 - (7/8)^2 - (1/8)^2 = 0.219
• entropy_left = -(7/8) log2(7/8) - (1/8) log2(1/8) = 0.544

The right rectangle contains 11 owners and 5 nonowners. The impurity measures of the right rectangle are therefore:
• Gini_right = 1 - (11/16)^2 - (5/16)^2 = 0.430
• entropy_right = -(11/16) log2(11/16) - (5/16) log2(5/16) = 0.896

The combined impurity of the two rectangles created by the split is a weighted average of the two impurity measures, weighted by the number of records in each:
• Gini = (8/24)(0.219) + (16/24)(0.430) = 0.359
• entropy = (8/24)(0.544) + (16/24)(0.896) = 0.779

Thus, the Gini impurity index decreased from 0.5 before the split to 0.359 after the split. Similarly, the entropy impurity measure decreased from 1 before the split to 0.779 after the split.

Tree after all splits

> # plot tree after all splits
> class.tree <- rpart(Ownership ~ ., data = mower.df, method = "class",
+                     cp = 0, minsplit = 1)
> prp(class.tree, type = 1, extra = 1, split.font = 1, varlen = -10)

# cp: any split that does not decrease the overall lack of fit by a factor of cp is not
# attempted. For instance, with anova splitting, this means that the overall R-squared
# must increase by cp at each step. The main role of this parameter is to save computing
# time by pruning off splits that are obviously not worthwhile. Essentially, the user
# informs the program that any split which does not improve the fit by cp will likely be
# pruned off by cross-validation, and hence the program need not pursue it.
# Default value cp = 0.01.
# minsplit: the minimum number of observations that must exist in a node in order for a
# split to be attempted. Default value minsplit = 20.

The first split is on Income (at 60); the next split is on Lot Size, for both the low-income group (at lot size 21) and the high-income group (at lot size 20).
• The dominant class in the right-hand node of the first split (Income >= 60) is "owner" (11 owners and 5 nonowners).
• The next split for this group of 16 was made on the basis of lot size, splitting at 20.

Read down the tree to derive rules, e.g.,
IF (Income >= 60) AND (Lot_Size >= 20) THEN Class = owner

The Overfitting Problem
Full trees are complex and overfit the data
• The natural end of the process is 100% purity in each leaf
• This overfits the data; the tree ends up fitting noise in the data
• Consider Example 2, Loan Acceptance, with more records and more variables than the Riding Mower data; the full tree is very complex

code for creating a deeper classification tree

#### Figure 9.10
deeper.ct <- rpart(Personal.Loan ~ ., data = train.df, method = "class",
                   cp = 0, minsplit = 1)
# count number of leaves
length(deeper.ct$frame$var[deeper.ct$frame$var == "<leaf>"])
# plot tree
prp(deeper.ct, type = 1, extra = 1, under = TRUE, split.font = 1, varlen = -10,
    box.col = ifelse(deeper.ct$frame$var == "<leaf>", 'gray', 'white'))

> length(deeper.ct$frame$var[deeper.ct$frame$var == "<leaf>"])
[1] 53
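As a quick way to gauge how much of this complexity actually helps, here is a sketch (not part of the textbook code) using rpart's own printcp() function. Because deeper.ct was fit above, rpart has already stored a cross-validation table for it; printcp() lists, for each candidate subtree size, the training error (rel error) and the cross-validated error (xerror), and the cross-validated error typically stops improving well before the 53-leaf tree is reached.

# cross-validation results stored by rpart() during fitting
printcp(deeper.ct)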
Full trees are too complex; they end up fitting noise, overfitting the data.

FIGURE 9.10 A full tree for the loan acceptance data using the training set (3000 records)

code for classifying the validation data using a tree, and computing the confusion matrices and accuracy for the training and validation data

> #### Table 9.3
> library(caret)
Loading required package: lattice
Loading required package: ggplot2
Warning message:
package 'caret' was built under R version 3.4.4
>
> # classify records in the validation data.
> # set argument type = "class" in predict() to generate predicted class membership.
> # (default.ct is the classification tree fitted earlier with rpart's default settings.)
> default.ct.point.pred.train <- predict(default.ct, train.df, type = "class")
> # generate confusion matrix for training data (rows = predicted, columns = actual)
> table(default.ct.point.pred.train, train.df$Personal.Loan)

default.ct.point.pred.train    0    1
                          0 2696   26
                          1   13  265

Note: Accuracy : 0.987        # = (2696 + 265) / (2696 + 26 + 13 + 265)
      Sensitivity : 0.9952
      Specificity : 0.9107
      'Positive' Class : 0

> ### repeat the code for the validation set
> default.ct.point.pred.valid <- predict(default.ct, valid.df, type = "class")
> table(default.ct.point.pred.valid, valid.df$Personal.Loan)

default.ct.point.pred.valid    0    1
                          0 1792   18
                          1   19  171

> ### repeat the code using the deeper tree
> default.ct.point.pred.train <- predict(deeper.ct, train.df, type = "class")
> table(default.ct.point.pred.train, train.df$Personal.Loan)

default.ct.point.pred.train    0    1      # training data
                          0 2709    0
                          1    0  291

> default.ct.point.pred.valid <- predict(deeper.ct, valid.df, type = "class")
> table(default.ct.point.pred.valid, valid.df$Personal.Loan)

default.ct.point.pred.valid    0    1      # validation data
                          0 1788   19
                          1   23  170

Suppose a 2x2 confusion matrix table is denoted as:

                     Reference
Predicted      Event      No Event
Event            A            B
No Event         C            D

Some other common metrics are:
Sensitivity = A / (A + C)
Specificity = D / (B + D)
Prevalence = (A + C) / (A + B + C + D)
PPV = (sensitivity * prevalence) / ((sensitivity * prevalence) + ((1 - specificity) * (1 - prevalence)))
NPV = (specificity * (1 - prevalence)) / (((1 - sensitivity) * prevalence) + (specificity * (1 - prevalence)))
DetectionRate = A / (A + B + C + D)
DetectionPrevalence = (A + B) / (A + B + C + D)
BalancedAccuracy = (sensitivity + specificity) / 2
Precision = A / (A + B)
Recall = A / (A + C)
F1 = (1 + beta^2) * precision * recall / ((beta^2 * precision) + recall), where beta = 1 for this function

9.4 Avoiding Overfitting

Overfitting produces poor predictive performance: past a certain point in tree complexity, the error rate on new data starts to increase.

Figure: error rate vs. number of splits, for the training data and for unseen (validation) data. The training error keeps decreasing, while the error on unseen data reaches a minimum and then increases.

Stopping tree growth: CHAID
• CHAID, older than CART, uses a chi-square statistical test to limit tree growth
• Splitting stops when the purity improvement is not statistically significant

One can think of different criteria for stopping the tree growth before it starts overfitting the data. Examples are tree depth (i.e., number of splits), minimum number of records in a terminal node, and minimum reduction in impurity. In R's rpart(), for example, we can control the depth of the tree with the complexity parameter (cp).
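As an illustration, here is a minimal sketch of how those three stopping criteria map onto rpart.control() arguments for the loan data used above. The object name stopped.ct and the specific parameter values are ours for illustration, not from the textbook; train.df, Personal.Loan, and the prp() plotting call are as in the earlier code.

# stop growth early: limit tree depth (maxdepth), require at least 25 records in each
# terminal node (minbucket), and require every split to improve fit by at least cp = 0.005
stopped.ct <- rpart(Personal.Loan ~ ., data = train.df, method = "class",
                    control = rpart.control(maxdepth = 5, minbucket = 25, cp = 0.005))
prp(stopped.ct, type = 1, extra = 1, split.font = 1, varlen = -10)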
The problem is that it is not simple to determine what a good stopping point is using such rules.

• CART lets the tree grow to its full extent, then prunes it back
• The idea is to find the point at which the validation error is at a minimum
• Generate successively smaller trees by pruning leaves
• At each pruning stage, multiple trees are possible
• Use cost complexity to choose the best tree at that stage

Which branch to cut at each stage of pruning?

CC(T) = Err(T) + \alpha L(T)

where
CC(T) = cost complexity of a tree
Err(T) = proportion of misclassified records
L(T) = number of leaves (terminal nodes) of the tree
\alpha = penalty factor attached to tree size (set by user)

• Among trees of a given size, choose the one with the lowest CC
• Do this for each size of tree (stage of pruning)
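Here is a sketch of how this pruning is typically done with rpart, assuming the deeper.ct tree grown earlier with cp = 0 (the object names best.cp and pruned.ct are ours). rpart's cptable stores, for each candidate tree size, the complexity parameter and its cross-validated error (xerror); choose the cp value with the smallest xerror and prune back to it. Note that rpart uses cross-validation rather than a separate validation set, but the idea is the same: keep the subtree whose estimated error on new data is smallest.

# pick the complexity parameter with the lowest cross-validated error, then prune
best.cp <- deeper.ct$cptable[which.min(deeper.ct$cptable[, "xerror"]), "CP"]
pruned.ct <- prune(deeper.ct, cp = best.cp)
prp(pruned.ct, type = 1, extra = 1, split.font = 1, varlen = -10)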

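Finally, a short sketch (not from the textbook code) of how the pruned tree from the sketch above would be assessed on the validation data; it assumes valid.df and the caret package loaded earlier, and the object name pruned.pred.valid is ours. confusionMatrix() reports the accuracy, sensitivity, specificity, and related metrics defined in the formulas above.

pruned.pred.valid <- predict(pruned.ct, valid.df, type = "class")
confusionMatrix(pruned.pred.valid, as.factor(valid.df$Personal.Loan))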