
International Journal of Emerging Trends & Technology in Computer Science (IJETTCS)

Web Site: www.ijettcs.org Email: editor@ijettcs.org


Volume 5, Issue 1, January - February 2016
ISSN 2278-6856

Survey on Approaches, Problems and Applications of Boosting
1Rutuja Shirbhate, 2Dr. S. D. Babar

1,2Sinhgad Institute of Technology, Savitribai Phule Pune University,
309/310 Kusgaon (BK), Off Mumbai-Pune Expressway, Lonavala, Maharashtra 410401

Abstract
Cluster analysis, or clustering, is the task of grouping a set of objects in such a way that objects in the same group are more similar to each other than to those in other groups. Boosting is an iterative process that increases the accuracy of supervised learning algorithms, i.e., classifiers. Combining clustering with boosting improves the quality of the mining process. In boosting, instances misclassified by the initial classifier are used to train subsequent classifiers, and the resulting set of classifiers is used to classify further instances. The use of boosting in many applications has proved its effectiveness. Despite its success, boosting has certain problems: it cannot handle noisy data or data with troublesome areas. This limitation is addressed by cluster-based boosting (CBB), in which the data are clustered before boosting and boosting is performed depending on the cluster. CBB works well on benchmark data. Real-world data, however, contain many irrelevant features. In CBB, all features are used for clustering, and this consideration of irrelevant features may lead to inaccurate clustering. Inaccurately clustered data may, in turn, negatively affect boosting performance. To overcome this issue, feature selection will be applied to the training data before clustering.
Keywords: Data Classification, Boosting, Clustering, Ensemble of Classifiers

1. INTRODUCTION
In data mining, classification is the process of labeling an unlabeled instance (testing data) using knowledge extracted by learning from labeled instances (training data). Classifiers in data mining can be categorized by their learning process or by the representation of the extracted knowledge. Support vector machines (SVM), decision trees such as ID3 and C4.5, k-nearest neighbor classifiers, and probability-based classifiers such as Naïve Bayes are some popular examples of classifiers proposed in the literature.
Once the learning process is completed, the learned model (classifier) is used to classify, i.e., predict labels for, the test (unlabeled) data.
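For illustration only (not part of the paper), a minimal sketch of this train-then-predict workflow, assuming scikit-learn and its bundled Iris data set:

# Minimal sketch: learn on labeled data, then predict labels for unseen data.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)                       # labeled instances
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = DecisionTreeClassifier().fit(X_train, y_train)    # learning phase
predicted = clf.predict(X_test)                         # labeling/prediction phase
print("accuracy:", clf.score(X_test, y_test))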

When a classifier is evaluated, its prediction accuracy is usually not 100% and may be below an acceptable figure, e.g., 40% or 60%.
After evaluation of the classifier, when the prediction accuracy is unacceptable, there is a need to improve the accuracy of the classifier. Most supervised learning algorithms are prone to overfitting. In overfitting, the classifier's learning process starts memorizing the training data instead of learning from it; this happens due to the high complexity of the data. If a classifier memorizes the training data, its prediction accuracy will be low when it is tested on non-training data. In the literature, boosting has been proposed to improve the accuracy of the initial classifier. Boosting is an iterative process in which subsequent classifiers are trained on the data misclassified by the previous classifier, and all classifiers are then used for further classification using a majority voting policy. The literature also provides theoretical arguments that boosting is resistant to overfitting [7]. Results of boosting show an improvement in predictive accuracy over using a single classifier.
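As a minimal sketch of this process (scikit-learn's AdaBoost implementation and a synthetic data set are assumptions used here for illustration; they are not the algorithms evaluated in this survey):

# Sketch: boosting trains classifiers sequentially on re-weighted data and
# combines them by weighted (majority) voting; compare against a single weak classifier.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, random_state=0)

single = DecisionTreeClassifier(max_depth=1)               # one weak classifier (a stump)
boosted = AdaBoostClassifier(n_estimators=50, random_state=0)  # default weak learner is the same stump

print("single :", cross_val_score(single, X, y).mean())
print("boosted:", cross_val_score(boosted, X, y).mean())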
Despite the success of boosting, there were some limitations in the boosting process. Boosting could not deal with noisy data or with troublesome areas in the training data. Noisy data means that the labels provided for the training data are incorrect. A troublesome area in the data means that some feature set is relevant to the class label in one part of the training data and irrelevant to the class label in another part. When a classifier is learned on such data, it learns the relevancy of features in one area; when data from another part come for classification, the classifier gives wrong results, i.e., boosting cannot depend on the previous function to decide whether instances are classified correctly or incorrectly. Boosting therefore cannot handle such noisy data or data with troublesome areas.
Many methods have been proposed in the literature to address the issues mentioned above. The technique proposed in [6], cluster-based boosting (CBB), can deal with noisy data and troublesome areas in the training data. In CBB, the training data are clustered using the k-means algorithm, and the cluster set with a suitable k is considered for further processing. The type of each cluster in the cluster set is evaluated, and boosting is performed depending on the cluster type. In CBB, clusters are created using all features in the data; this works well on standard data sets.
However, real-world data sets contain a large number of features, which may result in inaccurate clusters. Such inaccurate clusters can affect CBB negatively. A study and investigation are needed to fix this problem.
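The clustering step that precedes boosting in CBB could be sketched roughly as follows (this is our own illustrative approximation using scikit-learn's k-means, not the code of [6]; the value of k and the data set are assumptions):

# Sketch: partition the training data with k-means before per-cluster boosting (the CBB idea).
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification

X_train, y_train = make_classification(n_samples=500, n_features=10, random_state=0)

k = 4                                                      # CBB would select a suitable k
cluster_ids = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_train)

# Each cluster is later examined (e.g., initial-classifier accuracy on its members)
# to decide how strongly boosting should be applied to it.
for c in range(k):
    members = cluster_ids == c
    print("cluster", c, "size:", members.sum())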

2. LITERATURE SURVEY
[1] Boosting helps to improve the classification process and can provide better results. As described in [1], using ensembles for the sake of better clustering quality offers scope for obtaining better results. To achieve significant accuracy in classification, the partitioning process needs to be improved. To improve the quality of partitioning, this paper proposed a robust multi-clustering solution based on the general principles of boosting, obtained by boosting a simple clustering algorithm. This multiple-clustering approach iterates over the training examples, producing multiple clusterings and deriving a common partitioning. The common partitioning was achieved through iterations of a basic clustering algorithm and aggregation of the multiple clustering results. The partition aggregation is obtained using weighted voting, where each partition carries a weight indicating its quality. Experimental results showed that the method is promising and provides robustness and improved performance.
[2] The boosting methodology uses incorrectly classified instances for subsequent function learning. It is known that boosting does not usually overfit the training data, even with classifiers of large size [1]. Schapire et al. explained this using the margins the classifier achieves on training examples, where the margin represents the confidence of the prediction of the aggregated classifier. This paper studied Breiman's arc-gv algorithm for maximizing margins; it also explains why boosting is resistant to overfitting and how it refines the decision boundary for accurate predictions.
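For reference, the (normalized) margin used in this line of analysis is commonly written as follows, in standard notation rather than that of [2]:

\text{margin}(x_i, y_i) \;=\; \frac{y_i \sum_{t=1}^{T} \alpha_t\, h_t(x_i)}{\sum_{t=1}^{T} \alpha_t},
\qquad y_i \in \{-1, +1\},\ \alpha_t \ge 0,

so a larger positive margin indicates a more confident correct prediction by the aggregated classifier.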
[3] AdaBoost rarely suffers from overfitting when the data contain little noise. The adaptive boosting algorithm known as AdaBoost has provided great success and is regarded as one of the most important developments in classification methodologies described in the literature. However, on highly noisy data, AdaBoost suffers from overfitting. This paper focused on AdaBoost and, to improve its robustness, proposed two regularization schemes from the viewpoint of mathematical programming. The two resulting algorithms, AdaBoostKL and AdaBoostNorm2, are based on different penalty functions and can be considered extensions of AdaBoostReg; by pursuing a soft margin, they achieve better performance than AdaBoostReg. The paper showed that, among the regularized AdaBoost algorithms, AdaBoostKL performs best.
[4] In general, boosting faces the overfitting problem on some data sets while working well on others. The authors of this paper observed that this problem happens due to the presence of overlapping classes. To overcome this problem in boosting, confusing samples are detected using a Bayesian classifier and removed during the boosting phase. The authors experimented with the proposed approach: they removed the confusing examples and analyzed the results of AdaBoost without them. To detect the confusing instances, the authors used a perfect Bayesian classifier; instances misclassified by this classifier are considered confusing. The AdaBoost algorithm is used for boosting, with the confusing instances removed from the boosting process. The experimental results confirmed that the observation about overlapping classes was correct.
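A simplified sketch of this idea (the paper uses a perfect Bayesian classifier; Gaussian naive Bayes from scikit-learn is substituted here purely for illustration, so this is an approximation rather than the authors' procedure):

# Sketch: flag "confusing" training instances misclassified by a Bayesian classifier,
# drop them, then boost on the cleaned data.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=1000, flip_y=0.1, random_state=0)  # flip_y injects label noise

bayes = GaussianNB().fit(X, y)
keep = bayes.predict(X) == y                 # instances the Bayesian classifier gets right

boosted = AdaBoostClassifier(n_estimators=50, random_state=0)
boosted.fit(X[keep], y[keep])                # boosting without the confusing examples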
[5] Ensembles of classifiers are obtained by generating and combining base classifiers constructed using other machine learning methods. The goal of these ensembles is to increase predictive accuracy with respect to the base classifiers. One of the most popular methods for creating ensembles is boosting, a family of methods of which AdaBoost is the most prominent member. Boosting is a well-established, general approach in the machine learning community for improving the performance of any learning algorithm: many weak classifiers produced by a weak learner are combined into a strong classifier, i.e., a very accurate prediction rule is produced by combining rough and moderately inaccurate rules of thumb. Boosting methods combine many weak classifiers into a committee and in this respect resemble bagging and other committee-based methods; the weak classifiers are applied sequentially to modified versions of the data, and their predictions are combined to produce a powerful classifier, thereby improving predictive accuracy with respect to the base classifiers. Paper [5] described the evolution of boosting and the evaluation of boosting algorithms with different parameters. Experiments showed that boosting has superior prediction capability compared with bagging, as it classifies the samples more correctly.
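A comparison of this kind could be set up along the following lines (the data set, learners, and scoring below are illustrative assumptions, not the experimental setup of [5]):

# Sketch: compare a bagging ensemble and a boosting ensemble on the same data.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, random_state=0)

bagging = BaggingClassifier(n_estimators=50, random_state=0)    # bags full decision trees by default
boosting = AdaBoostClassifier(n_estimators=50, random_state=0)  # boosts decision stumps by default

print("bagging :", cross_val_score(bagging, X, y).mean())
print("boosting:", cross_val_score(boosting, X, y).mean())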
[6] This paper discussed the problem that boosting cannot handle noisy data and data with troublesome areas. Noisy data means the training data are wrongly labeled. Troublesome areas are areas of instances whose relevant features differ from those of the rest of the training data. The authors observed that this problem occurs because boosting uses only incorrectly predicted data for subsequent function learning. To overcome this issue, they designed a boosting method that uses both correctly and incorrectly predicted data from the initial function for subsequent function learning. The paper considered a process that partitions the training data into clusters and then integrates these clusters directly into the boosting process. Cluster-based boosting applies a selective boosting strategy to each cluster based on the previous function's accuracy on the cluster's members; the selective strategy applies a high learning rate, a low learning rate, or no boosting to each cluster. This addresses the problems of subsequent functions overfitting, and of filtering subsequent functions, when the data have noisy labels and troublesome areas.
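A very rough sketch of this selective strategy (our own interpretation for illustration only; the actual cluster criteria and learning rates used in [6] differ):

# Sketch: apply a different boosting effort to each cluster, based on how well
# the initial classifier performs on that cluster's members.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=600, random_state=0)
initial = DecisionTreeClassifier(max_depth=2).fit(X, y)          # initial function
clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

per_cluster_model = {}
for c in np.unique(clusters):
    Xc, yc = X[clusters == c], y[clusters == c]
    acc = initial.score(Xc, yc)                                  # previous function's accuracy here
    if acc > 0.95 or len(np.unique(yc)) < 2:
        per_cluster_model[c] = initial                           # cluster left alone: no boosting
    else:
        rate = 1.0 if acc < 0.70 else 0.3                        # "high" vs. "low" boosting effort
        per_cluster_model[c] = AdaBoostClassifier(
            n_estimators=30, learning_rate=rate, random_state=0).fit(Xc, yc)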

3. CONCLUSION AND FUTURE WORK
In this paper, we discussed various boosting problems and proposed solutions, and also described some clustering techniques. Boosting has proved advantageous for obtaining more accurate results in machine learning. The cluster-based boosting approach addresses limitations of boosting on supervised learning algorithms. In CBB, clusters are created using all features in the data, and this works well on standard data sets. However, real-world data sets contain a large number of features, which may result in inaccurate clusters, and such inaccurate clusters can affect CBB negatively. There is therefore scope to apply feature selection and to use only the data of the selected features for clustering.
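One possible realization of this direction, sketched under our own assumptions (scikit-learn's SelectKBest with mutual information is chosen arbitrarily as the feature-selection step; CBB does not prescribe a particular method):

# Sketch: select informative features first, then cluster only on those features.
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif

X_train, y_train = make_classification(n_samples=500, n_features=30,
                                        n_informative=5, random_state=0)

selector = SelectKBest(mutual_info_classif, k=5).fit(X_train, y_train)
X_selected = selector.transform(X_train)                 # irrelevant features dropped

clusters = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X_selected)
# These (hopefully cleaner) clusters would then drive the CBB boosting step.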

References
[1] D. Frossyniotis, A. Likas, and A. Stafylopatis, "A clustering method based on boosting," Pattern Recog. Lett., vol. 25, pp. 641-654, 2004.
[2] L. Reyzin and R. Schapire, "How boosting the margin can also boost classifier complexity," in Proc. Int. Conf. Mach. Learn., 2006, pp. 753-760.
[3] Y. Sun, J. Li, and W. Hager, "Two new regularized AdaBoost algorithms," in Proc. Int. Conf. Mach. Learn. Appl., 2004, pp. 41-48.
[4] A. Vezhnevets and O. Barinova, "Avoiding boosting overfitting by removing confusing samples," in Proc. Eur. Conf. Mach. Learn., 2007, pp. 430-441.
[5] A. Ganatra and Y. Kosta, "Comprehensive evolution and evaluation of boosting," Int. J. Comput. Theory Eng., vol. 2, pp. 931-936, 2010.
[6] L. D. Miller and L.-K. Soh, "Cluster-based boosting," IEEE Trans. Knowl. Data Eng., 2015.
[7] W. Gao and Z.-H. Zhou, "On the doubt about margin explanation of boosting," Artif. Intell., vol. 203, pp. 1-18, Oct. 2013.

AUTHORS
Rutuja Shirbhate received the B.E. degree in Information Technology from Sant Gadge Baba Amravati University in 2012. She is currently pursuing the M.E. postgraduate degree at Savitribai Phule Pune University.
Dr. S. D. Babar is a professor at Sinhgad Institute of Technology, Lonavala, Pune.
