[Figure 5: three panels plotting Scaled Precision against Scaled Recall for labels A, B, and Good, each comparing the Coarse Model alone against a cascade using Fine Model A, B, or C.]

Figure 5: Cascade Models vs. Single Models. Values from three representative adversarial category identification tasks show that using cascade methodology significantly improves recall at low false positive rates. (Note that precision and recall values have been linearly transformed to protect sensitive data.)
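The cascade idea behind Figure 5 can be sketched as a two-stage scorer: a cheap coarse model screens every example, and the more expensive fine model is consulted only for examples the coarse stage flags. This is an illustrative sketch with placeholder models and thresholds, not the system's actual components.

```python
# Hypothetical two-stage cascade: the coarse model filters out clear
# negatives, and the fine model scores only the survivors. Thresholds
# and model functions here are toy placeholders.

def cascade_score(x, coarse_model, fine_model, coarse_threshold=0.5):
    """Return (score, stage) for example x."""
    coarse = coarse_model(x)
    if coarse < coarse_threshold:
        # Coarse stage is confident the example is negative: stop early.
        return coarse, "coarse"
    # Only suspicious examples pay the cost of the fine model.
    return fine_model(x), "fine"

# Toy linear models over a two-feature example.
coarse = lambda x: 0.9 * x[0]
fine = lambda x: 0.5 * x[0] + 0.5 * x[1]

score, stage = cascade_score([0.1, 0.9], coarse, fine)
```

Most traffic is resolved by the coarse stage alone, which is what makes the cascade cheap at serving time.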
large-scale learning [28, 4]. For example, only a few CPU seconds are required for training on data sets that are considered large in the academic literature, such as RCV1 [21]. But because SGD is a sequential online methodology, it is non-trivial to parallelize SGD training across multiple machines.

[Figure: the MapReduce training framework. Data snapshots feed training data to Mappers 1 through n, each of which filters examples in parallel.]
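The sequential nature of SGD that the paragraph above refers to can be seen in a minimal sketch: each update reads the weights written by the previous update, so the steps form a chain that cannot simply be split across machines. The squared-loss learner below is illustrative only, not the system's actual training objective.

```python
# Minimal SGD sketch: step t consumes the weight vector produced by
# step t-1, so the updates are inherently sequential.

def sgd_step(w, x, y, lr=0.1):
    """One SGD step for squared loss on a single (x, y) example."""
    pred = sum(wi * xi for wi, xi in zip(w, x))
    grad = [2 * (pred - y) * xi for xi in x]
    return [wi - lr * gi for wi, gi in zip(w, grad)]

w = [0.0, 0.0]
data = [([1.0, 0.0], 1.0), ([0.0, 1.0], -1.0)]
for x, y in data:  # strictly in order: each step needs the previous w
    w = sgd_step(w, x, y)
```

The expensive per-example work (parsing, featurization) has no such dependency, which is why it parallelizes while the update loop does not.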
Do expensive work in parallel. The expensive work of parsing, filtering, labeling, transforming, and encoding data can all be done independently in parallel. We use hundreds of machines in the Map phase. The output of the Map phase is labeled training data that is efficiently compressed.

Do cheap work sequentially. Because our models are small enough to fit in memory, a single Reduce machine can perform the SGD training quickly once the data has been properly prepared and formatted. This eliminates the need for expensive message passing or synchronization.

This framework allows us to train models within minutes on large data sets, and is used both for ROC-SVM training and for training our cascade models. A similar framework is used for evaluating models on holdout test data.

3.3.2 Controlling Model Size

The learned models must be small enough to fit in memory on a single machine; we use two strategies for keeping model size suitably restricted.

The first of these is to use a feature-hashing approach similar in spirit to that of [36]. If we think of w ∈ R^d as a set of key-value pairs where many values are exactly 0, then it is efficient to store the non-zero values in a hash map. Hashing the keys ensures that the model size will not grow beyond a certain bound. We have found that ignoring collisions does not degrade model performance, in line with results from [36], and keeps model size manageable.

The second strategy is to encourage sparsity in the learned model, so that many weight values are indeed exactly 0. We follow a projected-gradient methodology similar to that of [10], projecting w to an L1-ball of a specified radius λ after updates. This is done every k steps, after step 10 in Algorithm 1. The exact L1-projection of [10] was somewhat slow, so we use a simpler and faster approximate projection, given in Algorithm 2. The method of Duchi et al. uses an approach similar to randomized median finding to find the exact value of θ that is used to project a given vector w onto an L1-ball of radius at most λ. We make do with a value of θ that is guaranteed to cause ||w||1 to converge to radius at most λ after repeated calls. In practice, we find this works well, is fast to compute, and is easier to tune than the truncated gradient approach of [20].
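The approximate projection of Algorithm 2 amounts to soft-thresholding every non-zero weight by θ = (||w||1 − λ) / ||w||0. A minimal Python rendering, with the sparse weight vector stored as a hash map of non-zero entries as described above (`lam` and `theta` stand for λ and θ; variable names are illustrative):

```python
# One approximate L1-projection step: shrink each non-zero weight
# toward zero by theta = (||w||_1 - lam) / ||w||_0. Repeated calls
# drive ||w||_1 down to at most lam.

def approx_l1_project(w, lam):
    """Apply one approximate projection to the sparse weight map w."""
    l1 = sum(abs(v) for v in w.values())
    c = max(l1 - lam, 0.0)
    d = sum(1 for v in w.values() if v != 0.0)  # ||w||_0
    if d == 0 or c == 0.0:
        return w  # already inside the ball
    theta = c / d
    return {k: (1 if v > 0 else -1) * max(abs(v) - theta, 0.0)
            for k, v in w.items() if v != 0.0}

w = {"feat_a": 3.0, "feat_b": -1.0, "feat_c": 0.5}
for _ in range(20):  # repeated calls converge to the radius
    w = approx_l1_project(w, lam=1.0)
l1 = sum(abs(v) for v in w.values())
```

Note how small weights are driven exactly to zero and drop out of the map, which is what keeps the stored model sparse.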
Algorithm 2 Approximate projection to L1-ball of radius at most λ. Repeated calls to this projection will converge to radius λ.
1: c ← max(||w||1 − λ, 0)
2: d ← ||w||0
3: θ ← c/d
4: for each non-zero element i of w do
5:   s ← sign(wi)
6:   wi ← s · max(|wi| − θ, 0)
7: end for

3.4 Model Management

It is worth briefly looking at some of the engineering issues involved in maintaining a large-scale data mining system with many component models. Our management strategies include performing automated model calibration, establishing effective automated monitoring of live models, and bundling useful information into models. In particular, each model must be able to:

- Transform features as needed, including scaling, discretizing, etc.

- Label training data as a positive or a negative, and distinguish test data from training data.

- Report the parameters that were used to train it, so that the model may be re-trained if needed.

- Score an example using a feature vector w.

- Calibrate its output scores onto a consistent scale.

Together, these requirements define a somewhat broader view of a model than is generally considered in academic literature, which often discusses only the weight vector w. We have found that bundling this data together reduces system complexity and eases the burden of managing and maintaining a large number of models in production.
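The bundling requirements above can be sketched as a small class that carries the weight vector together with its feature transforms, training parameters, and calibration. All names here are hypothetical illustrations of the idea, not the system's actual interfaces, and labeling/training logic is elided.

```python
# Hypothetical "bundled model": the weights travel with everything
# needed to transform features, score, calibrate, and re-train.

class ModelBundle:
    def __init__(self, weights, train_params, feature_specs, calib):
        self.weights = weights            # sparse weight vector w
        self.train_params = train_params  # enough to re-train if needed
        self.feature_specs = feature_specs  # scaling / discretization rules
        self.calib = calib                # maps raw scores to a common scale

    def transform(self, raw_features):
        """Apply per-feature transforms; unknown features pass through."""
        return {k: self.feature_specs.get(k, lambda v: v)(v)
                for k, v in raw_features.items()}

    def score(self, raw_features):
        """Transform, take the sparse dot product, then calibrate."""
        x = self.transform(raw_features)
        raw = sum(self.weights.get(k, 0.0) * v for k, v in x.items())
        return self.calib(raw)

    def report_params(self):
        return dict(self.train_params)

# Example: identity calibration and no feature transforms.
bundle = ModelBundle({"f": 2.0}, {"lr": 0.1}, {}, lambda s: s)
score = bundle.score({"f": 1.5})
```

Keeping all of this in one object is what lets many models be deployed, monitored, and re-trained uniformly.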