




Introduction to Adaptive Boosting

These notes are based on the following article: R. Rojas, "AdaBoost and the Super Bowl of Classifiers: A Tutorial Introduction to Adaptive Boosting," Freie Universität Berlin, 2009.

1 Introduction
Suppose you are working on a two-class pattern recognition problem for which you are given a large pool of classifiers (experts), and you want to submit a still better classifier to a pattern recognition competition. The AdaBoost (adaptive boosting) algorithm, proposed in 1995 by Yoav Freund and Robert Schapire, is a general method for generating a strong classifier out of a set of weak classifiers. The algorithm is widely used; for example, Viola and Jones built a well-known face detection method on top of AdaBoost.

2 Model
Given a set of classifiers and training patterns, we want to generate a combined classifier as a linear combination. More specifically, assume we have L expert classifiers, and every expert k_j can emit an opinion on each pattern x_i: k_j(x_i) ∈ {-1, 1}, where -1 means "no" and +1 means "yes" on the classification question. We can then generate a new, stronger classifier by linearly combining the opinions of the experts drafted from the pool:

C(x_i) = a_1 k_1(x_i) + a_2 k_2(x_i) + ... + a_M k_M(x_i)

where each k_m denotes an expert classifier selected from the pool and a_m denotes the constant weight we assign to that expert's opinion. We regard sign(C(x_i)) as the final decision of the generated classifier on pattern x_i.
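As a minimal sketch of this decision rule, the snippet below combines the opinions of a few weak classifiers; the decision stumps, their thresholds, the weights a_m, and the data are all invented for illustration:

```python
import numpy as np

# Hypothetical weak classifiers: decision stumps on a 1-D feature.
# Each emits -1 ("no") or +1 ("yes") for every pattern.
def stump(threshold):
    return lambda x: np.where(x > threshold, 1, -1)

experts = [stump(0.2), stump(0.5), stump(0.8)]  # drafted experts (made up)
alphas = [0.4, 0.7, 0.3]                        # illustrative weights a_m

def C(x):
    """Linear combination of the drafted experts' opinions."""
    return sum(a * k(x) for a, k in zip(alphas, experts))

x = np.array([0.1, 0.6, 0.9])
decision = np.sign(C(x))  # final class label in {-1, +1}
```

The key point is only the last two lines: the strong classifier is nothing more than a weighted vote of the experts, thresholded by sign.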

3 Approach
By examining the proposed model, we can divide the generation problem into two smaller ones: which classifier in the pool to select, and how much weight to assign to it. Intuitively, we can regard the generation as a military selection procedure and call these the drafting and weighting problems, respectively. Before drafting members, the candidates' abilities must be assessed: we first evaluate each candidate's classification performance on a given training set. Assuming we have a training set T of N multidimensional data points x_i and L classifiers in the pool, the table below records their classification results:

             x_1  x_2  x_3  ...  x_N
classifier 1   0    0    1  ...    0
classifier 2   1    0    1  ...    0
...          ...  ...  ...  ...  ...
classifier L   1    1    0  ...    0

where 1 indicates a hit, meaning the classifier classifies the point correctly, and 0 indicates a miss, meaning the classifier makes an erroneous classification. Given the ground truth for each data point, the table is easy to obtain by testing. Having evaluated every classifier, we can proceed to the drafting and weighting procedure. Classifiers are drafted and given their weights iteratively. The basic principle behind these two steps is that in every iteration we want to select the "best" classifier, the one that most helps the current group with the points it still misclassifies. Only by doing so can we improve the performance of the generated classifier with each round of drafting and weighting.
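Such a hit/miss table can be built in a few lines; the stumps, thresholds, and training points below are made up for the sketch:

```python
import numpy as np

# Hypothetical training set: N = 4 one-dimensional points with labels.
X = np.array([0.1, 0.3, 0.6, 0.9])   # training patterns x_1..x_N
y = np.array([-1, -1, 1, 1])         # ground-truth labels y_i

# Hypothetical pool of L = 3 decision stumps k_j(x) in {-1, +1}.
pool = [lambda x, t=t: np.where(x > t, 1, -1) for t in (0.2, 0.5, 0.95)]

# hits[j, i] = 1 if classifier j labels x_i correctly (a hit), else 0 (a miss)
hits = np.array([(k(X) == y).astype(int) for k in pool])
```

Each row of `hits` is one row of the table above; the ground truth `y` is all that is needed to fill it in.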

3.1 Drafting
As has been discussed, the goal of drafting is to select a classifier that helps with the points currently misclassified. But how? Here we model this as an optimization problem by introducing a cost function for the classification. Assume we have drafted m-1 classifiers in the previous m-1 iterations, giving

C_{m-1}(x_i) = a_1 k_1(x_i) + a_2 k_2(x_i) + ... + a_{m-1} k_{m-1}(x_i)

We now want to draft a new member to extend it to

C_m(x_i) = C_{m-1}(x_i) + a_m k_m(x_i)

The cost function of the classification is defined as

E = Σ_{i=1}^{N} exp(-y_i C_m(x_i))

where y_i ∈ {-1, 1} is the class label of each point. Observing the equation above: when y_i and C_m(x_i) have the same sign, meaning C_m(x_i) classifies the point correctly, exp(-y_i C_m(x_i)) contributes a cost less than 1; when y_i and C_m(x_i) bear different signs, in other words C_m(x_i) produces a miss, it contributes a cost greater than 1. In a word, the cost function penalizes a miss more heavily than a hit. With this in mind, what we have to do is find the classifier in the pool whose addition to the group yields a lower E value than any other. We rewrite the above expression as

E = Σ_{i=1}^{N} w_i^{(m)} exp(-y_i a_m k_m(x_i))

where w_i^{(m)} = exp(-y_i C_{m-1}(x_i)) for i = 1, 2, 3, ..., N. In the first iteration, w_i^{(1)} = 1 for i = 1, 2, 3, ..., N. During later iterations, the vector w^{(m)} represents the weight assigned to each data point in the training set at iteration m. Splitting the sum into hits and misses gives

E = e^{-a_m} Σ_{y_i = k_m(x_i)} w_i^{(m)} + e^{a_m} Σ_{y_i ≠ k_m(x_i)} w_i^{(m)}

This equation can be read as a weighted hit cost plus a weighted miss cost. Writing W_c for the total weight of the correctly classified points and W_e for the total weight of the misclassified points, we have

E = W_c e^{-a_m} + W_e e^{a_m}

Multiplying both sides by the non-zero factor exp(a_m), we get

e^{a_m} E = W_c + W_e e^{2 a_m} = (W_c + W_e) + W_e (e^{2 a_m} - 1)

Since exp(a_m) is positive, minimizing E is equivalent to minimizing exp(a_m) E. (W_c + W_e) is the constant total sum W of the weights of all data points, and (e^{2 a_m} - 1) is positive for a_m > 0, so we must draft the classifier with the lowest W_e value to minimize the total cost E. This makes sense: the next draftee should be the one with the lowest penalty given the current set of weights.
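The drafting step reduces to an argmin over weighted miss costs, which can be sketched as follows (the miss table and weights are invented for the example):

```python
import numpy as np

def draft(miss_table, w):
    """Pick the classifier with the lowest weighted miss cost W_e.

    miss_table[j, i] is 1 when classifier j misclassifies point i;
    w[i] is the current weight of data point i."""
    We = miss_table @ w          # weighted miss cost of every candidate
    best = int(np.argmin(We))
    return best, float(We[best])

# Toy example: 3 classifiers, 4 points, uniform initial weights w_i = 1.
miss = np.array([[0, 1, 0, 0],
                 [0, 0, 1, 1],
                 [1, 1, 0, 1]])
j, We_best = draft(miss, np.ones(4))
```

With uniform weights the draftee is simply the classifier with the fewest misses; once the weights diverge, misses on heavily weighted points cost more, which is what steers later draftees toward the hard cases.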

3.2 Weighting
The weight serves as an evaluation of the importance of the drafted classifier. The basic principle of weighting is again to minimize the cost function E. Regarding E as a function of a_m, we can determine its value via differentiation.

Differentiating and equating the result to zero,

dE/da_m = -W_c e^{-a_m} + W_e e^{a_m} = 0

we thus obtain

a_m = (1/2) ln(W_c / W_e) = (1/2) ln((1 - e_m) / e_m)


where e_m = W_e / (W_c + W_e) is the percentage rate of error given the weights of the data points.
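The closed-form expert weight is a one-liner; this small helper just evaluates the formula above:

```python
import math

def alpha(e_m):
    """Expert weight a_m = (1/2) * ln((1 - e_m) / e_m), for 0 < e_m < 1."""
    return 0.5 * math.log((1.0 - e_m) / e_m)
```

Note the behavior at the extremes: e_m = 1/2 (random guessing) yields a weight of zero, while the weight grows without bound as e_m approaches 0, matching the discussion of the weighting equation below.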

In summary, the algorithm proceeds as follows:

1. Initialization: set w_i^{(1)} = 1 for i = 1, 2, ..., N

2. Drafting: calculate the W_e of each classifier in the pool and draft the classifier k_m with the lowest W_e value

3. Weighting: set the weight a_m = (1/2) ln((1 - e_m) / e_m)

4. Updating the weights: if k_m(x_i) is a miss, set w_i^{(m+1)} = w_i^{(m)} e^{a_m}; if it is a hit, set w_i^{(m+1)} = w_i^{(m)} e^{-a_m}


5. Go to step 2

According to the weighting equation, a classifier with e_m = 1/2 tells us nothing about the data points, performing no better than random guessing, so a weight of zero is assigned to it. A classifier with e_m = 0, which we could call a perfect classifier, would receive an infinite weight, since it would be the only member we need. A classifier with e_m = 1, a perfect liar, would be assigned a negative infinite weight, and we could use it as a perfect classifier simply by reversing its decisions.
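The five steps above can be sketched as one short loop. Everything in the demo at the bottom (stump thresholds, data, number of rounds M) is made up, and the sketch assumes every drafted classifier has an error strictly between 0 and 1/2 (the perfect classifier and perfect liar cases are not handled):

```python
import numpy as np

def adaboost(X, y, pool, M):
    """Sketch of the drafting/weighting loop described above.

    pool: list of weak classifiers k(x) -> {-1, +1}
    Returns the drafted experts and their weights a_m.
    Assumes each draftee's weighted error e_m lies in (0, 1/2)."""
    N = len(y)
    w = np.ones(N)                        # step 1: w_i(1) = 1
    preds = np.array([k(X) for k in pool])
    miss = (preds != y).astype(float)     # hit/miss table from section 3
    drafted, alphas = [], []
    for _ in range(M):
        We = miss @ w                     # step 2: weighted miss costs
        m = int(np.argmin(We))            #         draft the cheapest expert
        e = We[m] / w.sum()               # weighted error rate e_m
        a = 0.5 * np.log((1 - e) / e)     # step 3: expert weight a_m
        drafted.append(pool[m])
        alphas.append(a)
        # step 4: raise weights of missed points, lower weights of hits
        w = w * np.exp(np.where(miss[m] == 1, a, -a))
    return drafted, alphas

# Tiny made-up demo: three decision stumps on a 1-D feature, M = 2 rounds.
X = np.array([0.1, 0.3, 0.6, 0.9])
y = np.array([-1, 1, -1, 1])
pool = [lambda x, t=t: np.where(x > t, 1, -1) for t in (0.2, 0.5, 0.8)]
experts, weights = adaboost(X, y, pool, M=2)
```

A real implementation would re-fit or exclude already-drafted experts and guard against e_m = 0 or e_m ≥ 1/2; the sketch only mirrors the derivation in these notes.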
