$$w \cdot x + b = 0 \qquad (1)$$

corresponding to the decision function

$$f(x) = \mathrm{sgn}(w \cdot x + b), \qquad (2)$$

for $w \in \Re^N$ and $b \in \Re$. Given a set of labeled data $D = \{(x_1, y_1), (x_2, y_2), \ldots, (x_m, y_m)\}$, where $x_i \in \Re^N$ and $y_i \in \{-1, +1\}$, the optimal hyperplane is the unique hyperplane that separates positive and negative examples for which the margin is maximized:²

$$\max_{w,b}\; \min_{x_i} \left\{\, \|x - x_i\| \;:\; x \in \Re^N,\; w \cdot x + b = 0 \,\right\}. \qquad (3)$$

² This notion of “optimality” is not directly tied to the performance of the classifier. There is, however, evidence that maximizing the margin acts as a form of structural risk minimization (Vapnik, 1998).

When the data are not separable, a soft margin classifier is used. This introduces a misclassification cost $C$, which is assigned to each misclassified training example.

Equation 3 is usually optimized by introducing Lagrange multipliers $\alpha_i$ and recasting the problem in terms of its Wolfe dual:

maximize:
$$L_D = \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \, x_i \cdot x_j \qquad (4)$$

subject to:
$$0 \le \alpha_i \le C, \quad \text{and} \quad \sum_i \alpha_i y_i = 0. \qquad (5)$$

The $x_i$ for which $\alpha_i$ are non-zero have a special meaning. They are the training examples which fall on the margin: the support vectors.
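As a concrete illustration of this formulation (a sketch of ours, not code from the paper), the snippet below fits a soft-margin linear SVM on toy data with an off-the-shelf solver and reads off the support vectors, i.e. the examples whose $\alpha_i$ are non-zero; the toy data and the value of $C$ are placeholders.

# Sketch (not the paper's implementation): fit a soft-margin linear SVM
# and inspect which training examples end up as support vectors.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 1, (50, 2)), rng.normal(+1, 1, (50, 2))])  # toy data
y = np.array([-1] * 50 + [+1] * 50)

clf = SVC(kernel="linear", C=1.0)        # C is the misclassification cost of Eq. (5)
clf.fit(X, y)

# Support vectors are the x_i whose Lagrange multipliers alpha_i are non-zero.
alphas = np.abs(clf.dual_coef_).ravel()  # alpha_i (times y_i, in absolute value)
print("support vectors:", len(clf.support_), "of", len(X))
print("decision value of first example:", clf.decision_function(X[:1]))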
2. Active Learning for Support Vector Machines

Learners traditionally attempt to minimize error on future data. In a probabilistic framework, an active learner can do this by estimating expected future error, then selecting training examples that are expected to minimize this expected future error. SVMs, however, are a discriminant classifier; only recently have attempts been made to provide probabilistic estimates of a label’s confidence.

In this section, we describe two active learning criteria for selecting data. The first is a probabilistically “correct,” but impractical approach, while the second is a simple, efficient heuristic inspired by the first. We find empirically that the heuristic achieves remarkable performance at little cost.

2.1 A Greedy Optimal Strategy

Platt (1999) describes an intuitive means of assigning probabilities to points in the space classified by a support vector machine: project all examples onto an axis perpendicular to the dividing hyperplane, and perform logistic regression on them to extract class probabilities. By integrating the probability of error over the volume of the space, weighted by some assumed distribution of test examples, we can estimate the expected error of the classifier.³

³ This integration would probably be done via Monte Carlo methods, on a random sample of unlabeled examples sampled or generated according to the test distribution. See Cohn et al. (1996) for details.
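A minimal sketch of this idea (our illustration, with hypothetical helper names; scikit-learn’s SVC(probability=True) offers a built-in variant): take the SVM’s decision values, which are the projection onto the hyperplane normal up to scale, and fit a one-dimensional logistic regression to them. Platt’s original formulation also smooths the target labels; that refinement is omitted here.

# Sketch: Platt-style class probabilities from SVM decision values.
import numpy as np
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression

def fit_platt(clf, X_train, y_train):
    """Fit a 1-D logistic regression on the SVM's decision values."""
    scores = clf.decision_function(X_train).reshape(-1, 1)
    platt = LogisticRegression()
    platt.fit(scores, y_train)
    return platt

def predict_proba(clf, platt, X):
    """Return P(y = +1 | x) for each row of X."""
    scores = clf.decision_function(X).reshape(-1, 1)
    return platt.predict_proba(scores)[:, list(platt.classes_).index(+1)]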
From there, it is straightforward to compute the expected effect of adding an arbitrary unlabeled example x:

1. Use logistic regression to compute the class probabilities $P(y=+1 \mid x)$ and $P(y=-1 \mid x)$.

2. Add $(x, +1)$ to the training set, retrain, and compute the new expected error $E_{(x,+1)}$.

3. Remove $(x, +1)$, add $(x, -1)$ to the training set, retrain, and compute the new expected error $E_{(x,-1)}$.

4. Estimate the expected error after labeling and adding example $x$ as $P(y=+1 \mid x)\, E_{(x,+1)} + P(y=-1 \mid x)\, E_{(x,-1)}$.

[...] guaranteed to have an effect on the solution, and appears to be a simple and effective form of divide and conquer.⁴

[Figure: candidate examples relative to the current margin. Annotations: an example that is “too far out” displaces the hyperplane but causes little change in the margin; an example near the hyperplane causes maximal displacement of the hyperplane and decrease of the margin.]
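The loop below sketches steps 1 through 4 under the Monte Carlo approach of footnote 3. It is our own illustration rather than the paper’s implementation: the error estimate, the reference sample X_ref (playing the role of the Monte Carlo sample), and the reuse of the fit_platt and predict_proba helpers from the sketch above are all assumptions made for concreteness.

# Sketch of the greedy criterion: for each candidate x, retrain with each possible
# label, estimate the resulting expected error on a reference sample, and weight
# the two outcomes by the current model's class probabilities.
import numpy as np
from sklearn.svm import SVC

def expected_error(clf, platt, X_ref):
    """Monte Carlo estimate of expected error over a reference sample X_ref."""
    p_pos = predict_proba(clf, platt, X_ref)        # helper from the sketch above
    return np.minimum(p_pos, 1.0 - p_pos).mean()    # chance the predicted label is wrong

def greedy_select(X_lab, y_lab, X_pool, X_ref, C=1.0):
    base = SVC(kernel="linear", C=C).fit(X_lab, y_lab)
    base_platt = fit_platt(base, X_lab, y_lab)
    p_pos = predict_proba(base, base_platt, X_pool)  # step 1
    best_i, best_risk = None, np.inf
    for i, x in enumerate(X_pool):
        risk = 0.0
        for label, p in ((+1, p_pos[i]), (-1, 1.0 - p_pos[i])):
            X_new = np.vstack([X_lab, x])            # steps 2 and 3: retrain with each label
            y_new = np.append(y_lab, label)
            clf = SVC(kernel="linear", C=C).fit(X_new, y_new)
            risk += p * expected_error(clf, fit_platt(clf, X_new, y_new), X_ref)  # step 4
        if risk < best_risk:
            best_i, best_risk = i, risk
    return best_i                                    # index of the example to label next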
For each class of two domains, we performed five runs with independently-drawn 50%/50% test/training splits for both the heuristic and the default strategy of adding example labels at random. Each experiment began with four randomly chosen positive examples and four randomly chosen negative examples. Successive examples were added in batches of size b, chosen from the training set either randomly or in order of proximity to the current dividing hyperplane. We performed runs for values of b = 4, 8, 16, and 32 new examples per iteration. As expected, finer sampling granularity leads to a sharper increase in performance for small training set sizes (Figure 2). Computationally, this increase must be traded off against the cost of re-solving a new QP problem more frequently. The optimal time-vs.-label cost tradeoff depends heavily on the domain; for simplicity, all experiments discussed for the rest of the paper use b = 8.

Figure 2. Effect of sample granularity of active learning for the “earn” Reuters group. There is a large initial difference, but little asymptotic effect from the number of training examples added on each iteration. [Plot of accuracy vs. training set size for 4, 8, 16, 32, and 64 examples per iteration.]
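The following is a sketch of this experimental loop (our reconstruction for illustration, not the authors’ code): it seeds with four positive and four negative examples and then repeatedly labels the b pool examples closest to the current hyperplane.

# Sketch of the active learning heuristic: repeatedly label the b pool examples
# closest to the current dividing hyperplane, starting from 4 positive and 4
# negative seed examples chosen at random.
import numpy as np
from sklearn.svm import SVC

def active_learn(X_pool, y_pool, b=8, seeds_per_class=4, C=1.0, rng=None):
    rng = rng or np.random.default_rng(0)
    pos = rng.choice(np.flatnonzero(y_pool == +1), seeds_per_class, replace=False)
    neg = rng.choice(np.flatnonzero(y_pool == -1), seeds_per_class, replace=False)
    labeled = set(pos.tolist()) | set(neg.tolist())

    while len(labeled) < len(y_pool):
        idx = np.fromiter(labeled, dtype=int)
        clf = SVC(kernel="linear", C=C).fit(X_pool[idx], y_pool[idx])

        unlabeled = np.setdiff1d(np.arange(len(y_pool)), idx)
        dist = np.abs(clf.decision_function(X_pool[unlabeled]))  # proximity to hyperplane
        batch = unlabeled[np.argsort(dist)[:b]]                  # b closest examples
        labeled.update(batch.tolist())
        yield clf, idx            # classifier and the examples it was trained on

At each iteration the caller can score the returned classifier on a held-out test set to trace out learning curves like those in Figures 3-8.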
We used a QP solver based on Joachims (1998a) to train the SVMs at each iteration for the USENET data. The solver used the working set strategy (Osuna et al., 1997) with four elements and the PR LOQO solver (Smola, 1998) without shrinking to solve each QP sub-problem. We used a version of Platt’s SMO algorithm (Keerthi et al., 1999) [...]

[...] The goal was to learn the correct newsgroup that a message belongs to, analogous to an email filter. The 20 newsgroups dataset does include a small number of cross-posted articles, which makes some datasets inseparable under any domain representation. All of the newsgroups have approximately the same size (1000 documents). Articles with uuencoded content were ignored and headers were removed before classification.

3.4 Reuters

The second data set involved articles on five topics from Reuters news articles. These articles were taken from the Reuters-21578 Distribution 1.0 dataset (Lewis, 1997). The dataset contains 21,578 Reuters newswire articles from 1987. Of those, the subset of 10,711 articles that have topic labels was used. On each run, the articles were randomly divided into equally sized test and train sets. We chose the five topics with the largest number of “positive” examples: acquisitions, earn, grain, crude oil, and money-fx. The remainder of the 10,711 documents not bearing the selected topic were labeled negative. The earn set had 3802 “positive” documents (those with the earn label), acq had 2327, money-fx had 743, crude had 592, and grain had 589.

4. Results

4.1 USENET

In Figure 3 we plot test accuracy of the active learner and random sampler as a function of training set size. The performance of the active learner is slightly, but consistently, better than that of the random sampler, and all learners reach their asymptotes after relatively few documents. It is worth noting that in these experiments, the numbers of positive and negative examples (in both training and test sets) are approximately equal.

It should also be noted that the USENET dataset contains crosspostings; approximately 20% of the atheism/religion newsgroup articles are posted to both newsgroups, limiting a perfect learner to 90% accuracy. The atheism/religion newsgroup exhibits a slight downward slope after half the documents have been inspected. In this case, the dip may be due to an increasing number of conflicts between labels of the crossposted articles; we will encounter the phenomenon again with the Reuters groups, which contain no conflicting labels.
[Figure 3: accuracy vs. training set size for the USENET pairs atheism/religion, baseball/cryptography, graphics/X, and MS-Windows/PC-hardware, comparing active and random selection.]

Figure 4. Accuracy curves for Reuters “earn” topic. Left is full curve; right is close-up of initial segment.

[Figure 5: accuracy curves for the Reuters “acquisitions” topic; left is full curve, right is close-up of the initial segment, active vs. random.]
4.2 Reuters

The difference between active learning and random sampling is much more pronounced on all of the Reuters sets (Figures 4–8). The accuracies reported are the macro-averaged⁵ accuracies of both classes. The accuracy of the active learner reaches a maximum long before that of the random learner, which reaches its maximum using all of the documents.

It is worth noting that the active learner’s performance is strongest over other methods when the split between categories is most uneven, where the smaller class can [...]

⁵ The macro-average (Yang, 1999) is the average over each class instead of each document. In other words, the average of each class’s accuracy.
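For concreteness (our example, not from the paper), the macro-average of footnote 5 weights each class equally rather than each document:

# Macro-averaged accuracy: average the per-class accuracies rather than
# averaging over documents, so small classes count as much as large ones.
import numpy as np

def macro_accuracy(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    per_class = [np.mean(y_pred[y_true == c] == c) for c in np.unique(y_true)]
    return float(np.mean(per_class))

# e.g. 9 of 10 negatives and 1 of 2 positives correct:
# per-document accuracy = 10/12 ~ 0.83, macro accuracy = (0.9 + 0.5)/2 = 0.7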
5. Discussion

In the previous section, we observed an unusual phenomenon with the learning curves. When training examples were added at random, generalization increased monotonically until all available examples were added. When training examples were added via the heuristic, generalization peaked to a level above that achieved by using all available data, then slowly degraded to the level achieved by the random learner when all data had finally been added. We are actually achieving better performance from a small subset of the data than we can achieve using all available data. In the name of cost and accuracy, we would like to estimate when we have achieved this peak performance, so we can stop adding examples.
Figure 6. Accuracy curves for Reuters “money-fx” topic. Left is full curve; right is close-up of initial segment.

Figure 7. Accuracy curves for Reuters “crude” topic. Left is full curve; right is close-up of initial segment.

Figure 8. Accuracy curves for Reuters “grain” topic. Left is full curve; right is close-up of initial segment.

Table 1. [...] number of examples at which the stopping criterion was satisfied.

          peak    max accuracy        cutoff   accuracy at cutoff
  label    at    active   random              active   random
  earn    504    0.95     0.93         875    0.95     0.922
  acq     752    0.92     0.90        1050    0.917    0.882
  money   456    0.846    0.817        675    0.842    0.771
  crude   256    0.852    0.842        550    0.849    0.750
  grain   272    0.89     0.87         580    0.883    0.741
5.1 Stopping Criteria

One obvious way of estimating when we have reached peak generalization performance is the use of crossvalidation, or of a hold-out set. Crossvalidation on the training set is impractical, given the time needed to re-solve the SVM for each crossvalidation split. Even if time were not an issue, crossvalidation assumes that the training set distribution is representative of the test set distribution, an assumption violated by the active learning approach. Given that the motivation of active learning is to use as few labeled examples as possible, [...] statistic.

Let us make the (unreasonable) assumption that our data is linearly separable. If this is the case, then only unlabeled examples within the margin will have any effect on our learner. Labelling an example in the margin may shift the margin such that examples that were previously “outside” are now “inside,” but once all unlabeled examples in the margin have been exhausted, no future labelings will affect the learner in the slightest.

Working from this assumption, it would be reasonable to stop labeling data once the margin has been exhausted. We can compute this stopping criterion as a byproduct of evaluating candidates for labeling: if the best one (closest to the hyperplane) is no closer than any of the support vectors, our margin has been exhausted. Empirically, this stopping criterion performs well; when the margin has been exhausted, the number of new support vectors drops (recall that it should go to zero only if our data is linearly separable). While this point generally lags the true peak, accuracy at margin exhaustion appears to closely approximate peak accuracy (see Figure 9 and Table 1).

Figure 9. As examples are labeled and added to the training set, there is a “knee” in the fraction of new points that become support vectors. This point corresponds to the point where all unlabeled examples within the margin have been labeled. Accuracy is plotted along with the number of support vectors (not to scale) as a function of training set size for “earn” (left) and “acq” (right).
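A sketch of that check (ours, assuming a fitted linear-kernel scikit-learn SVC and a candidate pool as inputs):

# Sketch of the margin-exhaustion stopping criterion: stop once the closest
# unlabeled candidate is no closer to the hyperplane than any support vector.
import numpy as np

def margin_exhausted(clf, X_unlabeled):
    """clf: fitted sklearn.svm.SVC; X_unlabeled: remaining candidate pool."""
    cand_dist = np.abs(clf.decision_function(X_unlabeled))
    sv_dist = np.abs(clf.decision_function(clf.support_vectors_))
    # For separable data the margin support vectors sit at functional distance 1,
    # so this reduces to asking whether the best candidate lies outside the margin.
    return cand_dist.min() >= sv_dist.max()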
5.2 Timing Considerations

As we remarked earlier, training an SVM requires time $\Omega(m^2)$ in the number of training examples. Training with all available data requires solving a single QP problem with a large value of $m$. Active learning requires iteratively solving QP problems for small but increasing values of $m$. The superlinear time penalty for adding training data combines well with the increased generalization performance from active learning in our experiments. When the cutoff point is $O(n^{2/3})$ in the total number of examples, active learning will be more efficient than batch learning on the entire set. Though not always superior, the running times for the active learning experiments on large datasets were always competitive with, and in some cases clearly faster than, training just once on all available data (Figure 10).
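As a rough consistency check (our own back-of-the-envelope calculation, assuming each iteration re-solves the QP from scratch at quadratic cost), labeling in batches of $b$ up to a cutoff of $c$ examples costs about

$$\sum_{j=1}^{c/b} (jb)^2 \;\approx\; \frac{c^3}{3b},$$

which stays below the $n^2$ cost of a single batch solve whenever $c \lesssim (3b)^{1/3}\, n^{2/3}$.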
[Figure 10: accuracy for “earn” (left) and “acq” (right) under active and batch learning, plotted against a logarithmic axis.]

[...] contain no bound support vectors. Therefore, when adding one example per training iteration, the active heuristic cannot add an example which will be misclassified, at least not until all examples within the margin have been exhausted. Examples which make the model inconsistent will not be added until no more examples remain within the margin. One hypothesis, then, would be that it is these inconsistent examples which lead to the degradation in performance, and that the degradation could be eliminated by removing inconsistent examples. Empirically, however, removing the misclassified examples and retraining does not recover the peak performance of the active learner. For instance, the “earn” group from the Reuters experiments reaches 95% accuracy through the active learner, but attains 87% accuracy on the entire set with bound support vector removal. Although bound support vector removal sometimes helps, it has shown no systematic improvement over traditional soft-margin classifiers, while active learning has consistently outperformed soft-margin classifiers built over all examples. Since the active learning heuristic we describe will [...] may be benefitting from a better heuristic to remove noise.
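For concreteness, here is a sketch of the bound-support-vector removal just described (our reconstruction; the tolerance and the use of scikit-learn’s dual coefficients are assumptions): drop the training examples whose $\alpha_i$ has hit the cost bound $C$ and retrain.

# Sketch: remove bound support vectors (alpha_i == C, i.e. margin-violating or
# misclassified examples) and retrain the soft-margin SVM without them.
import numpy as np
from sklearn.svm import SVC

def retrain_without_bound_svs(X, y, C=1.0, tol=1e-8):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    alphas = np.abs(clf.dual_coef_).ravel()      # alpha_i for the support vectors
    bound = clf.support_[alphas >= C - tol]      # indices of bound support vectors
    keep = np.setdiff1d(np.arange(len(y)), bound)
    return SVC(kernel="linear", C=C).fit(X[keep], y[keep]), bound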