
# Exercises

Machine Learning

## Institute of Computational Science, Dept. of Computer Science, ETH Zürich

Peter Orbanz, E-Mail: porbanz@inf.ethz.ch, Web: http://www.inf.ethz.ch/porbanz/ml

Series (Boosting)

The objective of this problem is to implement the AdaBoost algorithm. We will use a simple type of decision trees as weak learners and run the algorithm on the USPS data set.

AdaBoost: Assume we are given a training sample $(x_i, y_i)$, $i = 1, \dots, n$, where the $x_i$ are data values in $\mathbb{R}^d$ and $y_i \in \{-1, +1\}$ are class labels. Along with the training data, we provide the algorithm with a training routine for some classifier $c$ (the "weak learner", also called the base classifier). Here is the AdaBoost algorithm for the two-class problem:

1. Initialize weights: $w_i = \frac{1}{n}$
2. for $b = 1, \dots, B$
   - (a) Train a base classifier $c_b$ on the weighted training data.
   - (b) Compute error: $\epsilon_b := \dfrac{\sum_{i=1}^{n} w_i \, I\{y_i \neq c_b(x_i)\}}{\sum_{i=1}^{n} w_i}$
   - (c) Compute voting weights: $\alpha_b = \log\left(\dfrac{1 - \epsilon_b}{\epsilon_b}\right)$
   - (d) Recompute weights: $w_i = w_i \exp\left(\alpha_b \, I\{y_i \neq c_b(x_i)\}\right)$
3. Return classifier $\hat{c}_B(x) = \mathrm{sgn}\left(\sum_{b=1}^{B} \alpha_b \, c_b(x)\right)$
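Steps (b) to (d) of a single round can be traced on a toy example. The sketch below assumes four training points, uniform initial weights, and a hypothetical base classifier that misclassifies only the last point; the vectors `y` and `pred` are made up purely for illustration.

```matlab
% One AdaBoost round on made-up data (illustration only).
y    = [ 1;  1; -1; -1];      % true labels (hypothetical)
pred = [ 1;  1; -1;  1];      % base classifier output; last point is wrong
n    = numel(y);
w    = ones(n, 1) / n;        % step 1: uniform initial weights

miss  = double(pred ~= y);            % indicator I{y_i ~= c_b(x_i)}
err   = sum(w .* miss) / sum(w);      % step (b): weighted error
alpha = log((1 - err) / err);         % step (c): voting weight
w     = w .* exp(alpha * miss);       % step (d): upweight misclassified points

fprintf('err = %.4f, alpha = %.4f\n', err, alpha);
disp(w');
```

Here the error is 0.25, so the voting weight is log 3 ≈ 1.10, and the weight of the misclassified point triples from 0.25 to 0.75 while the other three weights stay at 0.25.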

Decision stumps: In the lecture, we discussed decision tree classifiers. The simplest non-trivial type of decision tree (a root node with two leaves) is called a decision stump. A stump classifier $c$ is defined by

$$c(x \mid j, \theta) := \begin{cases} 1 & x_j > \theta \\ 0 & \text{otherwise} \end{cases} \qquad (1)$$
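A stump evaluation in the `label=classify(x,pars)` form might look as follows. This is a minimal sketch under two assumptions that are not fixed by the sheet: the parameter vector is laid out as `pars = [j; theta]`, and the output is mapped to {-1, +1} so that it can be compared directly with the labels $y_i$, whereas definition (1) as printed returns 1 and 0.

```matlab
function label = classify(x, pars)
% CLASSIFY  Evaluate a decision stump on a single vector x.
%   pars = [j; theta] is an assumed parameter layout: j is the feature
%   index and theta the threshold from definition (1).
%   Assumption: the output is mapped to {-1,+1} to match the labels y_i;
%   definition (1) as printed returns 1 and 0 instead.
    j     = pars(1);
    theta = pars(2);
    if x(j) > theta
        label = 1;
    else
        label = -1;
    end
end
```

For example, `classify([0.2; 0.9], [2; 0.5])` looks only at the second entry, 0.9, and returns 1 because it exceeds the threshold 0.5.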

Since the stump ignores all entries of $x$ except $x_j$, it is equivalent to a linear classifier defined by an affine hyperplane. The plane is orthogonal to the $j$th axis, which it intersects at $x_j = \theta$. We will employ stumps as base learners in our boosting algorithm. To train stumps on weighted data, use the learning rule

$$(j^*, \theta^*) := \arg\min_{j, \theta} \frac{\sum_{i=1}^{n} w_i \, I\{y_i \neq c(x_i \mid j, \theta)\}}{\sum_{i=1}^{n} w_i} \qquad (2)$$
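The minimization in (2) can be sketched as an exhaustive scan over dimensions and candidate thresholds. The version below is one possible approach, not a prescribed solution: it assumes the parameter layout `pars = [j; theta]`, assumes the stump predicts +1 for $x_j > \theta$ and -1 otherwise (so predictions are comparable with the labels), and uses `bsxfun` to evaluate all thresholds of one dimension at once.

```matlab
function pars = train(X, w, y)
% TRAIN  Fit a decision stump to weighted data by exhaustive search.
%   X : d-by-n matrix, columns are training vectors
%   w : n weights, y : n labels in {-1,+1}
%   pars = [j; theta] (assumed layout) minimizing the cost in (2).
%   Assumption: the stump predicts +1 if x_j > theta and -1 otherwise.
    w = w(:);  y = y(:);
    d = size(X, 1);
    bestCost = Inf;
    pars = [1; -Inf];
    for j = 1:d
        xj = X(j, :)';
        % Candidate thresholds: one below all values, then each data value.
        cand = [min(xj) - 1; unique(xj)];
        % pred(i,k) = +1 if xj(i) > cand(k), else -1
        pred = 2 * double(bsxfun(@gt, xj, cand')) - 1;
        % cost(k) = weighted fraction of misclassified points for cand(k)
        cost = (w' * double(bsxfun(@ne, pred, y))) / sum(w);
        [c, k] = min(cost);
        if c < bestCost
            bestCost = c;
            pars = [j; cand(k)];
        end
    end
end
```

Only the data values themselves need to be tried as thresholds, because the cost in (2) is piecewise constant in $\theta$ and can change only when $\theta$ crosses one of the $x_{ij}$.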

Implement rule (2) in your training routine by first finding an optimal parameter $\theta_j^*$ for each dimension $j = 1, \dots, d$, and then selecting the $j^*$ for which the cost term in (2) is minimal.

USPS data: This data set consists of scanned images of handwritten numerals, collected by the US Postal Service. (We have used this data set before, to test the SVM implementation in problem 5.1 of the Machine Learning I lecture.) The USPS data has acquired some fame in machine learning, since linear classifiers notoriously fail on it, and it became one of the showcase applications which secured the fame of the kernelized SVM. The data file available on the ML II homepage contains 100 data vectors each for two classes (corresponding to the numerals 5 and 6). The original images are 16-by-16 pixel, 8-bit grayscale, represented in the data set as vectors; we assume the feature space to be $\mathbb{R}^{256}$. The data comes in two files, `uspsdata.txt` (containing the data vectors) and `uspscl.txt` (the class labels). You can directly load the text files into MATLAB; typing `load uspsdata.txt` at the MATLAB prompt will create a matrix called `uspsdata` with the data vectors as rows.
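Since the loaded matrix stores the data vectors as rows, it is convenient to have a small routine that shuffles the rows and divides them into two halves. The helper below is a sketch; `split_half` is a hypothetical name, and returning the vectors as columns is a design choice made here so that the output can be handed to a routine expecting one vector per column.

```matlab
function [Xtr, ytr, Xva, yva] = split_half(data, labels)
% SPLIT_HALF  Randomly split data into two equally sized subsets.
%   (hypothetical helper, not part of the exercise interface)
%   data   : n-by-d matrix with the data vectors as rows (as loaded)
%   labels : n class labels
%   Returns d-by-(n/2) matrices whose columns are the data vectors.
%   Possible usage, assuming both text files are in the current folder:
%     load uspsdata.txt; load uspscl.txt;
%     [Xtr, ytr, Xva, yva] = split_half(uspsdata, uspscl);
    n    = size(data, 1);
    perm = randperm(n);              % random ordering of the indices
    half = floor(n / 2);
    trIdx = perm(1:half);
    vaIdx = perm(half+1:2*half);     % drops one point if n is odd
    labels = labels(:);
    Xtr = data(trIdx, :)';   ytr = labels(trIdx);
    Xva = data(vaIdx, :)';   yva = labels(vaIdx);
end
```

Calling the helper again produces a different random split, which is what repeated runs on the same data require.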

1. Implement the AdaBoost algorithm in MATLAB. The algorithm requires two auxiliary functions, to train and evaluate the base classifier. We also need a function which implements the resulting boosting classifier. To ensure that an arbitrary base learner can easily be plugged into your boosting algorithm, please use function calls of the following form:

   - `pars=train(X,w,y)` for the base classifier training routine, where `X` is a matrix the columns of which are the training vectors $x_1, \dots, x_n$, `w` and `y` are vectors containing the weights and class labels, and `pars` is a vector of parameters specifying the resulting classifier.
   - `label=classify(x,pars)` for the classification routine, which evaluates the base classifier on a test vector `x`.
   - A function `agg_class(x,alpha)` which evaluates the boosting classifier ("aggregated classifier") for a test vector `x`. `alpha` denotes the vector of voting weights $\alpha_b$.

2. Implement the functions `train` and `classify` for decision stumps.

3. Add a cross validation step to the training algorithm: After each iteration $b$ of the algorithm, estimate the current classification error of the current boosting classifier (not the base classifier) by cross validation. Assume that the training data is split only once, before the AdaBoost algorithm is executed, so AdaBoost uses one of the two subsets for training and cross validation is performed using the remaining data points. Store the acquired estimates.

4. Run your algorithm on the USPS data. Perform a random split of the 200 data points into two equally sized subsets, one for training and one for validation. Run this at least three times and plot the cross validation error estimates (as three graphs in a common plot) vs. the number $b$ of iterations.
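The pieces above can be combined into a boosting loop that records the held-out error after every iteration. The following is a sketch under stated assumptions rather than a reference solution: the names `adaboost_cv` and `agg_class_sketch` are hypothetical, the base learner is passed in as function handles, the aggregated classifier takes the stored base classifier parameters as an extra argument beyond `agg_class(x,alpha)`, and the clipping of the error and the weight normalization are numerical safeguards added here that do not appear on the sheet.

```matlab
function [alpha, allPars, valErr] = adaboost_cv(Xtr, ytr, Xva, yva, B, trainFn, classifyFn)
% ADABOOST_CV  AdaBoost with a held-out error estimate after each round.
%   (hypothetical driver; names and argument order are assumptions)
%   Xtr, Xva : d-by-n matrices, columns are data vectors
%   ytr, yva : labels in {-1,+1}
%   trainFn, classifyFn : handles following pars = train(X,w,y) and
%   label = classify(x,pars), e.g. @train and @classify.
    ytr = ytr(:);  yva = yva(:);
    n = size(Xtr, 2);  nVa = size(Xva, 2);
    w = ones(n, 1) / n;                      % step 1: uniform weights
    alpha  = zeros(B, 1);
    valErr = zeros(B, 1);
    for b = 1:B
        pars = trainFn(Xtr, w, ytr);         % step (a)
        pars = pars(:);
        if b == 1
            allPars = zeros(numel(pars), B); % column b holds parameters of c_b
        end
        allPars(:, b) = pars;
        pred = zeros(n, 1);
        for i = 1:n
            pred(i) = classifyFn(Xtr(:, i), pars);
        end
        miss = double(pred ~= ytr);
        err  = sum(w .* miss) / sum(w);      % step (b)
        % Safeguard (not on the sheet): keep err away from 0 and 1 so the
        % logarithm stays finite.
        err  = min(max(err, 1e-10), 1 - 1e-10);
        alpha(b) = log((1 - err) / err);     % step (c)
        w = w .* exp(alpha(b) * miss);       % step (d)
        w = w / sum(w);                      % rescaling; err in (b) is unaffected
        % Held-out error of the aggregated classifier after b rounds.
        wrong = 0;
        for i = 1:nVa
            yhat  = agg_class_sketch(Xva(:, i), alpha(1:b), allPars(:, 1:b), classifyFn);
            wrong = wrong + (yhat ~= yva(i));
        end
        valErr(b) = wrong / nVa;
    end
end

function label = agg_class_sketch(x, alpha, allPars, classifyFn)
% Aggregated classifier sgn(sum_b alpha_b c_b(x)). The arguments allPars and
% classifyFn are assumptions added to the agg_class(x,alpha) interface.
    s = 0;
    for b = 1:numel(alpha)
        s = s + alpha(b) * classifyFn(x, allPars(:, b));
    end
    label = sign(s);                         % a tie (s == 0) yields 0
end
```

A tie counts as an error here because 0 never equals a label in {-1, +1}. Re-evaluating all base classifiers after each round keeps the code close to the task description; keeping a running sum of $\alpha_b c_b(x)$ per validation point would avoid the repeated work. The returned `valErr` from several random splits can then be drawn into one figure with `plot(1:B, valErr)` and `hold on`.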