Table 9.5 shows the result of running a boosted tree on the loan acceptance example that we saw earlier. We can see that, compared to the performance of the single pruned tree (Table 9.3), the boosted tree has better performance on the validation data in terms of overall accuracy, and especially in terms of correct classification of 1's, the rare class of special interest. Where does boosting's special talent for finding 1's come from? When one class is dominant (0's constitute over 90% of the data here), basic classifiers are tempted to classify cases as belonging to the dominant class, and the 1's in this case constitute most of the misclassifications with the single best-pruned tree. The boosting algorithm concentrates on the misclassifications (which are mostly 1's), so it is naturally going to do well in reducing the misclassification of 1's (from 18 in the single tree to 15 in the boosted tree, in the validation set).

Description of Bagging and Boosting in Chapter 13

Bagging
Another form of ensemble is based on averaging across multiple random data samples. Bagging, short for "bootstrap aggregating," comprises two steps:
1. Generate multiple random samples by sampling with replacement from the original data (this method is called "bootstrap sampling").
2. Run an algorithm on each sample and produce scores.
Bagging improves the performance stability of a model and helps avoid overfitting by separately modeling different data samples and then combining the results. It is therefore especially useful for algorithms such as trees and neural networks.

Boosting
Boosting is a slightly different approach to creating ensembles. Here the goal is to directly improve areas in the data where our model makes errors, by forcing the model to pay more attention to those records. The steps in boosting are (a minimal sketch follows this list):
1. Fit a model to the data.
2. Draw a sample from the data so that misclassified records (or records with large prediction errors) have higher probabilities of selection.
3. Fit the model to the new sample.
4. Repeat Steps 2-3 multiple times.
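To make Steps 2-4 concrete, here is a minimal sketch of the resampling idea in R. It is illustrative, not the adabag implementation: the function name simple_boost and the rule of doubling the selection weight of misclassified records are simplifying assumptions (AdaBoost uses a specific multiplicative update based on the error rate, shown in the coeflearn argument description below).

library(rpart)

simple_boost <- function(formula, data, rounds = 5) {
  n <- nrow(data)
  weights <- rep(1, n)             # start with equal selection weights
  outcome <- all.vars(formula)[1]  # name of the response variable
  trees <- vector("list", rounds)
  for (r in 1:rounds) {
    # Step 2: draw a bootstrap sample, favoring high-weight (misclassified) records
    idx <- sample(1:n, size = n, replace = TRUE, prob = weights / sum(weights))
    # Step 3: fit the model to the new sample
    trees[[r]] <- rpart(formula, data = data[idx, ], method = "class")
    # Update: double the weight of records the current tree misclassifies
    pred <- predict(trees[[r]], data, type = "class")
    miss <- pred != data[[outcome]]
    weights[miss] <- weights[miss] * 2
  }
  trees  # classify new records by majority vote across these trees
}

A call such as simple_boost(Personal.Loan ~ ., train.df) would return a list of trees whose later members focus increasingly on the hard-to-classify records.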
Bagging and Boosting in R
In Chapter 9, we described random forests, an ensemble based on bagged trees, and illustrated a random forest implementation for the personal loan example. The adabag package in R can be used to generate bagged and boosted trees. Tables 13.1 and 13.2 show the R code and output producing a bagged tree and a boosted tree for the personal loan data, and how they are used to generate classifications for the validation set.

Example of Bagging and Boosting Classification Trees on the Personal Loan Data: R code

> library(adabag)
> library(rpart)
> library(caret)
>
> bank.df <- read.csv("UniversalBank.csv")
> bank.df <- bank.df[ , -c(1, 5)]  # drop ID and zip code columns
>
> # transform Personal.Loan into a categorical variable
> bank.df$Personal.Loan <- as.factor(bank.df$Personal.Loan)
>
> # partition the data
> set.seed(1)
> train.index <- sample(c(1:dim(bank.df)[1]), dim(bank.df)[1]*0.6)
> train.df <- bank.df[train.index, ]
> valid.df <- bank.df[-train.index, ]

> # single tree
> tr <- rpart(Personal.Loan ~ ., data = train.df)
> pred <- predict(tr, valid.df, type = "class")
> table(pred, valid.df$Personal.Loan)

pred    0    1
   0 1803   13
   1   10  174

(13 + 10 = 23 cases misclassified)

Usage
bagging(formula, data, mfinal = 100, control, par = FALSE, ...)

Arguments
formula   a formula, as in the lm function.
data      a data frame in which to interpret the variables named in the formula.
mfinal    an integer, the number of iterations for which boosting is run or the number of trees to use. Defaults to mfinal = 100 iterations.
control   options that control details of the rpart algorithm. See rpart.control for more details.

> # bagging
> bag <- bagging(Personal.Loan ~ ., data = train.df)
> pred <- predict(bag, valid.df, type = "class")
> table(pred$class, valid.df$Personal.Loan)

        0    1
   0 1806   25
   1    7  162

(25 + 7 = 32 cases misclassified)

Boosting: Description
Fits the AdaBoost.M1 (Freund and Schapire, 1996) and SAMME (Zhu et al., 2009) algorithms using classification trees as single classifiers.

Usage
boosting(formula, data, boos = TRUE, mfinal = 100, coeflearn = "Breiman", control, ...)

Arguments
formula    a formula, as in the lm function.
data       a data frame in which to interpret the variables named in formula.
boos       if TRUE (by default), a bootstrap sample of the training set is drawn using the weights for each observation on that iteration. If FALSE, every observation is used with its weights.
mfinal     an integer, the number of iterations for which boosting is run or the number of trees to use. Defaults to mfinal = 100 iterations.
coeflearn  if 'Breiman' (by default), alpha = (1/2) ln((1 - err)/err) is used. If 'Freund', alpha = ln((1 - err)/err) is used. In both cases the AdaBoost.M1 algorithm is used and alpha is the weight updating coefficient. On the other hand, if coeflearn is 'Zhu', the SAMME algorithm is implemented with alpha = ln((1 - err)/err) + ln(nclasses - 1).
control    options that control details of the rpart algorithm. See rpart.control for more details.

> # boosting
> boost <- boosting(Personal.Loan ~ ., data = train.df)
> pred <- predict(boost, valid.df, type = "class")
> table(pred$class, valid.df$Personal.Loan)

        0    1
   0 1810   18
   1    3  169

(18 + 3 = 21 cases misclassified)
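Beyond the class labels, adabag's predict functions return additional components that are useful for evaluation; per the package documentation these include $prob (class membership probabilities), $confusion, and $error. A brief sketch, reusing the boost object and the caret package loaded above:

> pred <- predict(boost, valid.df)
> head(pred$prob)     # propensities, e.g., for lift charts or alternative cutoffs
> pred$confusion      # confusion matrix computed by adabag
> pred$error          # validation misclassification rate
> # caret's confusionMatrix() adds accuracy, sensitivity, and specificity
> confusionMatrix(factor(pred$class), valid.df$Personal.Loan)

The factor() conversion is needed because pred$class is returned as a character vector, while valid.df$Personal.Loan is already a factor.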
Summary
- Classification and regression trees are an easily understandable and transparent method for predicting or classifying new records.
- A single tree is a graphical representation of a set of rules.
- Tree growth must be stopped to avoid overfitting of the training data; cross-validation helps you pick the right cp level to stop tree growth.
- Ensembles (random forests, boosting) improve predictive performance, but you lose interpretability and the rules embodied in a single tree.

Problems
1. Competitive Auctions on eBay.com. The file eBayAuctions.csv contains information on 1972 auctions that transacted on eBay.com during May-June 2004. The goal is to use these data to build a model that will classify auctions as competitive or noncompetitive. A competitive auction is defined as an auction with at least two bids placed on the item auctioned. The data include variables that describe the item (auction category), the seller (his/her eBay rating), and the auction terms that the seller selected (auction duration, opening price, currency, day-of-week of auction close). In addition, we have the price at which the auction closed. The task is to predict whether or not the auction will be competitive.

Data Preprocessing. Convert variable Duration into a categorical variable. Split the data into training (60%) and validation (40%) datasets.

a. Fit a classification tree using all predictors, using the best-pruned tree. To avoid overfitting, set the minimum number of records in a terminal node to 50 (in R: minbucket = 50). Also, set the maximum number of levels to be displayed at seven (in R: maxdepth = 7). Write down the results in terms of rules. (Note: If you had to slightly reduce the number of predictors due to software limitations, or for clarity of presentation, which would be a good variable to choose?) A setup sketch appears after this problem.
b. Is this model practical for predicting the outcome of a new auction?
c. Describe the interesting and uninteresting information that these rules provide.
d. Fit another classification tree (using the best-pruned tree, with a minimum number of records per terminal node = 50 and maximum allowed number of displayed levels = 7), this time only with predictors that can be used for predicting the outcome of a new auction. Describe the resulting tree in terms of rules. Make sure to report the smallest set of rules required for classification.
e. Plot the resulting tree on a scatter plot: use the two axes for the two best (quantitative) predictors. Each auction will appear as a point, with coordinates corresponding to its values on those two predictors. Use different colors or symbols to separate competitive and noncompetitive auctions. Draw lines (you can sketch these by hand or use R) at the values that create splits. Does this splitting seem
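For parts (a) and (d), a minimal setup sketch, offered as a starting point rather than a solution: the object names ebay.df, train.df, and valid.df are arbitrary, and the outcome column is assumed here to be named Competitive (check the actual header in eBayAuctions.csv, since R may rename columns on import).

> library(rpart)
> ebay.df <- read.csv("eBayAuctions.csv")
> ebay.df$Duration <- as.factor(ebay.df$Duration)      # Duration as categorical
> ebay.df$Competitive <- as.factor(ebay.df$Competitive) # outcome as categorical
> set.seed(1)
> train.index <- sample(c(1:dim(ebay.df)[1]), dim(ebay.df)[1]*0.6)
> train.df <- ebay.df[train.index, ]
> valid.df <- ebay.df[-train.index, ]
> # tree with the controls from part (a); a very small cp grows a full tree to prune back
> ct <- rpart(Competitive ~ ., data = train.df, method = "class",
+    control = rpart.control(minbucket = 50, maxdepth = 7, cp = 0.00001))
> printcp(ct)  # choose the cp with the lowest cross-validation error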
