(IJCSIS) International Journal of Computer Science and Information Security, Vol. 9, No. 11, November 2011
An Empirical Comparison of Boosting and Bagging Algorithms
R. Kalaichelvi Chandrahasan
College of Computer Studies, AMA International University, Kingdom of Bahrain
kalai_hasan@yahoo.com

Angeline Christobel Y
College of Computer Studies, AMA International University, Kingdom of Bahrain
angeline_christobel@yahoo.com

Usha Rani Sridhar
College of Computer Studies, AMA International University, Kingdom of Bahrain
ama_usharani@yahoo.com

Arockiam L
Dept. of Computer Science, St. Joseph's College, Tiruchirappalli, TN, India
larockiam@yahoo.co.in
Abstract - Classification is one of the data mining techniques that analyses a given data set and induces a model for each class based on the features present in the data. Bagging and boosting are heuristic approaches to developing classification models. These techniques generate a diverse ensemble of classifiers by manipulating the training data given to a base learning algorithm, and they are very successful in improving the accuracy of some algorithms on artificial and real-world datasets. We review algorithms such as AdaBoost, Bagging, ADTree and Random Forest in conjunction with the meta classifier and decision tree classifier families, and we describe a large empirical study comparing several variants. The algorithms are analyzed on Accuracy, Precision, Error Rate and Execution Time.

Keywords - Data Mining, Classification, Meta Classifier, Decision Tree
I. INTRODUCTION

Data mining is an iterative, multi-step process of knowledge discovery in databases, undertaken with the intention of uncovering hidden patterns. The amount of data to process grows ever more significant. Modern data-mining problems involve streams of data that grow continuously over time, including customer click streams, telephone records, large sets of web pages, multimedia data, retail chain transactions, credit risk assessment, medical diagnosis, scientific data analysis, music information retrieval and market research reports [32].

Classification is a robust data mining technique that uses exhaustive methods to generate models from simple to highly complex data. The induced model is used to classify unseen data instances. Classification algorithms can be referred to as supervised learning algorithms because they assign class labels to data objects. There are many approaches to developing a classification model, including decision trees, meta algorithms, neural networks, nearest neighbor methods and rough set-based methods [14, 17].

Meta classifiers and decision trees are among the most commonly used classification algorithms because they are easy to implement and easier to understand than other classification algorithms.

The main objective of this paper is to compare the AdaBoost, Bagging, ADTree and Random Forest algorithms, which use bagging or boosting techniques, on Accuracy, Precision, Error Rate and Processing Time. The algorithms were run on three medical datasets, "Wisconsin-BreastCancer", "Heart-statlog" and "Liver-disorders", obtained from the UCI Machine Learning Repository [40].

Section 2 presents the ensemble methods based on bagging and boosting techniques, while section 3 discusses the procedure for performance estimation. Experimental results on the three medical data sets, comparing the four algorithms on accuracy, precision, error rate and processing time, are presented in section 4. We conclude in section 5 with a summary and further research areas.
 
II. BOOSTING AND BAGGING APPROACHES

Meta learning is used in predictive data mining to combine the predictions from multiple models. It is especially useful when the models being combined are very different in nature; in this setting the method is known as stacking or stacked generalization. The predictions from the various classifiers serve as input to a meta-learner, and the final classification is created by combining the predictions from the multiple methods. This procedure generally yields more accurate predictions than the individual classifiers.

Decision tree induction is a data mining technique for solving classification problems. The goal in constructing a decision tree is a tree with good accuracy and performance. A decision tree is made up of a root, internal nodes, branches and leaf nodes, and is used to classify unknown data records. To classify an instance, one starts at the root and follows the branch corresponding to the value of the tested attribute observed in the instance; this process is repeated at the subtree rooted at that branch until a leaf node is reached. The resulting classification is the class label on the leaf [26], as sketched in the code example below.

In this paper we study the classification task with emphasis on boosting and bagging methods. The four popular ensemble approaches are boosting, bagging, rotation forest and the random subspace method; this paper covers the boosting and bagging techniques. Boosting combines an ensemble of weak classifiers into one strong classifier: successive models give extra weight to the examples that earlier predictors misclassified. In bagging, by contrast, successive trees do not depend on earlier trees; each model is independently constructed using a bootstrap sample of the data set, and the overall prediction is made by majority voting. The rest of this section presents two meta classifiers and two decision tree classifiers whose accuracy and precision we compare.
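To make the root-to-leaf walk concrete, here is a minimal sketch in Python (the language used for all examples in this paper's margin notes); the dictionary-based node structure and the weather-style attribute values are invented for illustration, not taken from the paper.

```python
# Minimal sketch of classifying one instance by walking a decision tree.
# The node layout and attribute names are hypothetical.
def classify(node, instance):
    # Follow the branch matching the instance's value for the tested
    # attribute until a leaf is reached; the leaf carries the class label.
    while not node["is_leaf"]:
        value = instance[node["attribute"]]
        node = node["branches"][value]
    return node["label"]

tree = {
    "is_leaf": False, "attribute": "outlook",
    "branches": {
        "sunny": {"is_leaf": True, "label": "no"},
        "rainy": {"is_leaf": True, "label": "yes"},
    },
}
print(classify(tree, {"outlook": "sunny"}))  # -> no
```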
A. Meta Classifier: AdaBoost Algorithm
Adaptive boosting (AdaBoost) is a popular and powerful meta ensemble algorithm. "Boosting" is an effective method for improving the performance of any learning algorithm and is also referred to as "stagewise additive modeling". The model is user friendly and does not tend to suffer from overfitting. It solves binary classification problems as well as multiclass problems in the machine learning community, and AdaBoost also extends to regression problems. Boosting algorithms are stronger than bagging on noise-free data, and the algorithm's behavior depends more on the data set than on the type of base classifier. AdaBoost puts many weak classifiers together to create one strong classifier, producing the classifiers sequentially.

To construct a classifier:

1. A training set is taken as input.

2. A set of weak or base learning algorithms is called repeatedly in a series of rounds while a set of weights over the training set is maintained. Initially, all weights are set equally, but on each round the weights of incorrectly classified examples are increased so that the weak learner is forced to focus on the hard examples in the training data.

3. The boosting can be applied in two frameworks: (i) boosting by weighting and (ii) boosting by sampling. In boosting by weighting, the base learning algorithm accepts a weighted training set directly; with such algorithms, the entire training set is given to the base learner. In boosting by sampling, examples are drawn with replacement from the training set with probability proportional to their weights.

4. The stopping iteration is determined by cross validation.

The algorithm does not require prior knowledge about the weak learner and so can be flexibly combined with any method for finding weak hypotheses. Finally, it comes with a set of theoretical guarantees, given sufficient data and a weak learner that can reliably provide moderately accurate weak hypotheses.

The algorithm suits learning problems having either of two properties. The first property is that the observed examples tend to have varying degrees of hardness; the boosting algorithm tends to generate distributions that concentrate on the harder examples, thus challenging the weak learning algorithm to perform well on these harder parts of the sample space. The second property is that the algorithm is sensitive to changes in the training examples, so that significantly different hypotheses are generated for different training sets. The reweighting loop of step 2 is sketched below.
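To illustrate the boosting-by-weighting framework of step 3, here is a minimal discrete AdaBoost sketch, assuming scikit-learn is available, labels coded as -1/+1, and decision stumps as the weak learner; the round count and the small constant guarding the logarithm are illustrative choices, not the paper's Weka configuration.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, n_rounds=50):
    """Discrete AdaBoost via boosting by weighting; y must be a {-1, +1} array."""
    n = len(y)
    w = np.full(n, 1.0 / n)               # step 2: start with uniform weights
    stumps, alphas = [], []
    for _ in range(n_rounds):
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=w)  # weak learner accepts weights directly
        pred = stump.predict(X)
        err = w[pred != y].sum()          # weighted training error
        if err >= 0.5:                    # no better than chance: stop early
            break
        alpha = 0.5 * np.log((1 - err) / (err + 1e-12))
        w *= np.exp(-alpha * y * pred)    # increase weight of misclassified examples
        w /= w.sum()
        stumps.append(stump)
        alphas.append(alpha)
    return stumps, alphas

def adaboost_predict(stumps, alphas, X):
    # Weighted vote of the weak classifiers; the sign gives the strong prediction.
    score = sum(a * s.predict(X) for s, a in zip(stumps, alphas))
    return np.sign(score)
```

In boosting by sampling, one would instead resample the training set with probabilities proportional to w before each round; the weight update itself is unchanged.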
B. Meta Classifier: Bagging Algorithm
Bagging, also known as bootstrap aggregating, is a machine learning method for combining multiple predictors; it is a model averaging approach. Bagging generates multiple training sets by sampling with replacement from the available training data. Bootstrap aggregating improves classification and regression models in terms of stability and accuracy; it also reduces variance and helps to avoid overfitting. It can be applied to any type of classifier. The bootstrap resampling that underlies bagging is also a popular method for estimating bias and standard errors and for constructing confidence intervals for parameters.

To build a model:

i) Split the data set into a training set and a test set.

ii) Draw a bootstrap sample from the training data and train a predictor on that sample.

Repeat these steps a chosen number of times. The models from the samples are combined by averaging the output for regression or voting for classification. Bagging automatically yields an estimate of the out-of-sample error, also referred to as the generalization error. Bagging works well for unstable learning algorithms such as neural networks, decision trees and regression trees, but works poorly for stable classifiers such as k-nearest neighbors. The main disadvantage of bagging is the loss of interpretability. Bagging has also been used in the unsupervised context of cluster analysis.
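A minimal sketch of the bootstrap-and-vote procedure above, assuming scikit-learn, NumPy arrays and integer class labels; the ensemble size and base learner are illustrative (scikit-learn's BaggingClassifier packages the same idea).

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(X, y, n_models=25, seed=0):
    """Train each model on a bootstrap sample of the training data."""
    rng = np.random.default_rng(seed)
    n = len(y)
    models = []
    for _ in range(n_models):
        idx = rng.integers(0, n, size=n)  # sample n rows with replacement
        model = DecisionTreeClassifier()
        model.fit(X[idx], y[idx])
        models.append(model)
    return models

def bagging_predict(models, X):
    # Majority vote over the ensemble (use the mean instead for regression).
    preds = np.stack([m.predict(X) for m in models]).astype(int)
    return np.array([np.bincount(col).argmax() for col in preds.T])
```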
C. Decision Tree Classifier: ADTree Algorithm
The Alternating Decision Tree (ADTree) is a successful machine learning classification technique that combines many decision rules, using boosting as its meta-algorithm to gain accuracy. The induction algorithm is used to solve binary classification problems. Alternating decision trees provide a mechanism for generating a strong classifier out of a set of weak classifiers. At each boosting iteration, a splitter node and two prediction nodes are added to the tree. The algorithm determines a place for the splitter node by analyzing all prediction nodes in accordance with the improvement in purity. To classify an instance, the algorithm takes the sum of all prediction nodes reached by the instance to obtain an overall prediction value; in two-class data sets, a positive sum represents one class and a negative sum represents the other. A special feature of ADTrees is that trees can be merged together. In multiclass problems the alternating decision tree can make use of all the weak hypotheses in boosting to arrive at a single interpretable tree rather than a large number of trees.
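The sum-then-sign rule can be illustrated with a single-level alternating tree: each splitter node contributes one of its two prediction values, the contributions are added to a base score, and the sign of the total gives the class. This is a hand-built scoring sketch only, not the boosting-based induction; the attribute names, thresholds and prediction values are invented.

```python
# Scoring a single-level alternating decision tree: base score plus one
# prediction value per splitter node; the sign of the sum gives the class.
# All rules and numbers here are hypothetical.
def adtree_score(instance, base_score, splitters):
    score = base_score
    for condition, value_true, value_false in splitters:
        score += value_true if condition(instance) else value_false
    return score

splitters = [
    (lambda x: x["clump_thickness"] > 4.5, +0.7, -0.3),
    (lambda x: x["cell_size"] > 2.5, +0.9, -0.5),
]
margin = adtree_score({"clump_thickness": 6, "cell_size": 3}, -0.2, splitters)
label = +1 if margin > 0 else -1  # positive sum -> one class, negative -> the other
```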
D. Decision Tree Classifier: Random Forest Algorithm
A random forest is a refinement of bagged trees that constructs a collection of decision trees with controlled variation. The method combines Breiman's bagging and Ho's random subspace method, and it improves on bagging by de-correlating the trees. It grows the trees in parallel, independently of one another. Random forests are often used on very large datasets with a very large number of input variables. A random forest model is made up of hundreds of decision trees. It does not require tree pruning, and it handles continuous and categorical variables as well as missing values. The algorithm can also be used to generate tree-based clusters through sample proximity.

The Random Forest algorithm is as follows:

1. First randomization (bagging). Random Forest uses the bootstrap aggregation / bagging method of ensemble learning: each tree is grown unpruned on a bootstrap sample (i.e., a sample drawn with replacement from the original data), with a randomized selection of features at each split during tree induction. Splits are chosen by purity measures: classification uses Gini or deviance, while regression uses squared error.

2. Second randomization (selection of a subset of predictors). At each internal node, a random subset of the predictors is considered and the best split among them is chosen. Let k be the total number of predictors and mtry the number of predictors tried at each split; for classification mtry = √k, and for regression mtry = k/3. Bagging is the special case of Random Forest in which mtry = k. A subset of predictors is much faster to search than all predictors.

The overall prediction is made by majority voting (classification) or averaging (regression) over the predictions of the ensemble. Since it is a parallel algorithm, several random forests can be run on many machines and the votes then aggregated to get the final result. As it has only two parameters, (i) the number of variables in the random subset and (ii) the number of trees in the forest, it is user-friendly.

For each tree grown, 33-36% of the samples are not selected in the bootstrap; these are called "Out of Bootstrap" or "Out of Bag" (OOB) samples [8]. Predictions are made using these OOB samples as input, and an OOB estimate of the error rate is computed by aggregating the OOB predictions. Because the algorithm generates an internal unbiased estimate of the test error, cross validation is not necessary. The algorithm builds trees until the error no longer decreases, and the number of predictors determines the number of trees necessary for good performance. A configuration sketch follows.
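The following sketch shows both randomizations expressed through scikit-learn's RandomForestClassifier; the tree count and the assumption of k = 9 predictors (Wisconsin-BreastCancer's tenth attribute being the class) are illustrative choices, not the paper's Weka settings.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

k = 9  # total number of predictors (hypothetical; 9 features + 1 class attribute)
forest = RandomForestClassifier(
    n_estimators=100,              # number of trees in the forest
    max_features=int(np.sqrt(k)),  # mtry = sqrt(k) for classification
    bootstrap=True,                # first randomization: bagging
    oob_score=True,                # estimate test error from out-of-bag samples
    random_state=0,
)
# After forest.fit(X, y), forest.oob_score_ holds the OOB accuracy estimate,
# so a separate cross validation run is not strictly necessary.
```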
 
III. PERFORMANCE EVALUATION

Performance evaluation is a significantly important aspect of any classifier. It covers metrics for evaluating a single classifier, metrics for comparing multiple classifiers, and measures of the effectiveness of classifiers, that is, their ability to take the right classification decisions. Various performance metrics are used to evaluate classification effectiveness, including accuracy, correct rate, recognition rate, error rate, false rate, reject rate, recall and precision.

Cross validation is considered a standard procedure for performance estimation. There are several cross validation approaches, such as resubstitution validation, hold-out validation, k-fold cross validation, leave-one-out cross validation and repeated k-fold cross validation. In this study, we selected k-fold cross validation for evaluating the classifiers [3, 9].

Estimates of accuracy, precision and error rate are the key factors in determining an algorithm's effectiveness in a supervised learning environment. In our empirical tests, these characteristics are evaluated using the data from the confusion matrix obtained. A confusion matrix contains information about the actual and predicted classifications produced by a classification algorithm. The time taken to build the model is also taken as another factor in the comparison.

The Accuracy, Precision and Error are computed as follows:

Accuracy = (a + d) / (a + b + c + d)
Precision = d / (b + d)
Error = (b + c) / (a + b + c + d)

where
 
a is the number of correct predictions that an instance is negative,
b is the number of incorrect predictions that an instance is positive,
c is the number of incorrect predictions that an instance is negative, and
d is the number of correct predictions that an instance is positive.
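The three formulas translate directly into code; this is a minimal sketch, with the example cell counts invented. For the k-fold estimation mentioned above, scikit-learn's cross_val_score can supply the per-fold scores.

```python
from sklearn.model_selection import cross_val_score  # e.g. cross_val_score(clf, X, y, cv=10)

def metrics(a, b, c, d):
    """a: true negatives, b: false positives, c: false negatives, d: true positives."""
    total = a + b + c + d
    accuracy = (a + d) / total
    precision = d / (b + d)
    error = (b + c) / total
    return accuracy, precision, error

# Hypothetical confusion matrix counts:
print(metrics(a=440, b=18, c=16, d=225))  # -> (0.9514..., 0.9259..., 0.0486...)
```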
 
IV. EXPERIMENTAL ANALYSIS

We carried out experiments using the Wisconsin-BreastCancer, Heart-statlog and Liver-disorders data sets obtained from the UCI Machine Learning Repository [40]. In our comparison study, the algorithms were run in the machine learning tool Weka, version 3.6.5. Weka is very supportive for learning the basic concepts of data mining, as one can apply different options and analyze the output that is produced.

Table 1 shows the datasets used for the implementation of the algorithms, with their numbers of instances and attributes.

Table 1: Description of the Datasets
Dataset                  Instances  Attributes
Wisconsin-BreastCancer      699         10
Heart-statlog               270         14
Liver-disorders             345          7

Table 2 shows the accuracy of the various classifiers, and Figure 1 gives an idea of the accuracy of the selected algorithms in graphical format.