
Detecting Adversarial Advertisements in the Wild

D. Sculley, Matthew Eric Otey, Michael Pohl, Bridget Spitznagel, John Hainsworth, Yunkai Zhou
Google, Inc.
{dsculley, otey, mpohl, drsprite, hainsworth, yunkaiz}@google.com

ABSTRACT

In a large online advertising system, adversaries may attempt to profit from the creation of low quality or harmful advertisements. In this paper, we present a large scale data mining effort that detects and blocks such adversarial advertisements for the benefit and safety of our users. Because both false positives and false negatives have high cost, our deployed system uses a tiered strategy combining automated and semi-automated methods to ensure reliable classification. We also employ strategies to address the challenges of learning from highly skewed data at scale, allocating the effort of human experts, leveraging domain expert knowledge, and independently assessing the effectiveness of our system.

Categories and Subject Descriptors

I.5.4 [Computing Methodologies]: Pattern Recognition Applications

General Terms

Experimentation

Keywords

online advertisement, data mining, adversarial learning

1. INTRODUCTION

The multi-billion dollar online advertising industry continues to grow [15]. This growth is fueled by users who find that online advertisements yield high quality, trustworthy content, as provided by millions of good-faith advertisers. However, in this favorable landscape a small number of adversarial advertisers may seek to profit by attempting to promote low quality or untrustworthy content via online advertising systems. Our goal is to detect and block these motivated adversaries, protecting users and ensuring that online advertisement remains a trustworthy source of commercial information.

The problem of detecting adversarial advertisements is complicated by scale. With millions of advertisers and billions of advertiser landing pages,¹ automated detection methods are clearly needed. However, unlike many data-mining tasks in which the cost of false positives (FPs) and false negatives (FNs) may be traded off, in this setting both false positives and false negatives carry extremely high misclassification cost. Thus, both FP and FN rates must be driven toward zero, even for difficult edge cases.

The need for extreme reliability at scale necessitates the use of both automated and semi-automated methods in a tiered system. Automated detection methods, based on high-precision, large-scale machine learning methods, are able to handle the bulk of the detection work. High-recall models are then used in semi-automated fashion to guide the effort of expert humans who can resolve hard edge cases. Together, these approaches form the basis of a system that quickly and reliably identifies adversarial advertisements and blocks them from serving.

¹The ad landing page is the web page to which a user is directed after clicking on an ad.

1.1 Challenges

This paper presents the full anatomy of the multi-tiered data mining system currently deployed at Google for detecting and blocking adversarial advertisements, and is intended to serve as a detailed case study. This study is structured around the following key challenges:

High cost of both FPs and FNs. In our setting, both FPs and FNs have high cost; we cannot trade off one against the other. Using a combination of automated and semi-automated effort helps drive both FP and FN rates towards zero.

Minority-class and multi-class issues. The vast majority of ads are from good-faith advertisers; thus detecting adversarial advertisements presents a difficult class imbalance issue [6]. This challenge is compounded by the presence of many different classes of adversarial advertisements, described in Section 2.

Training many models at scale. At a high level, our system may be viewed as an ensemble composed of many large-scale component models. Each of these models must be frequently trained, evaluated, calibrated, and monitored; an efficient paradigm for this effort is presented in Section 3.

Capturing expert knowledge. To cope with constantly evolving adversarial tactics, our system needs to be able to capture and leverage expert knowledge quickly and efficiently. Using experts to label examples is one such method. Section 5 details additional approaches including the use of active learning, providing exploratory tools to experts, and enabling experts to develop rule-based models for fast response.

Allocating expert effort for multiple concurrent goals. Expert effort is required not only for handling edge cases, but also for providing training data and unbiased evaluation metrics. Section 4 presents an ensemble-aided stratified sampling approach to achieve these multiple goals simultaneously.

Independent evaluation. Because we rely on human experts for ground truth, regular independent evaluations are critical to ensure that our ground truth understanding is accurate and comprehensive (see Section 6).
1.2 System Architecture

A high-level overview of our system architecture is given in Figure 1. The main source of data for our system is provided by a feed of advertisement data, including a crawl of the ad landing pages themselves. Because the crawl is responsible for fetching the contents of billions of ad landing pages and is a massive system in its own right, a detailed description of the crawl system is outside the scope of this work.

Each ad is evaluated by a large number of deployed models. The decisions from the models are aggregated; if there is a high-confidence decision of block or allow, this decision is put into serving. If the automated models are unable to provide a high-confidence decision, the ad may be shown to human experts as part of our ensemble-aided stratified sampling process (see Section 4). Human experts may also develop models using automated assistance, or use exploratory tools to find adversarial cases and add this data to our system. Because ad content may change dynamically over time, we record a snapshot of all features for an ad at the moment it is labeled by a human so that our repository of labeled data is contextually accurate.

Figure 1: System-Level Architecture. Our system relies both on automated detection using large-scale learning and semi-automated detection in which learned models direct the effort of human experts.

2. BACKGROUND: ADVERSARIAL ADVERTISEMENTS

To our knowledge, adversarial advertisements have not yet been widely studied in the literature (see Section 7). In this section, we clarify the problem area by describing some representative categories of adversarial advertisements that we have encountered in practice. Note that this section is a partial listing intended to guide the reader's intuition for this paper; official policies are available online via the Google AdWords Help Center [13].

Counterfeit goods. Some adversaries attempt to sell counterfeit or otherwise fraudulent goods while representing the goods as authentic.

Misleading or inaccurate claims. This class of adversarial advertisements attempts to make claims that are unrealistic or scientifically impossible, such as a weight-loss plan promising extreme weight loss without exercise or dieting.

User safety issues. Some adversaries attempt to profit by causing the user some form of harm, such as with false financial or medical claims.

Phishing. Adversaries may attempt to obtain sensitive personal information by disguising their site to look like another site.

Arbitrage. Advertisements whose sole or primary purpose is to direct the user to additional advertisements add little or no value to the user experience, in contrast to ads that provide useful content.

Unclear or deceptive billing. Advertisements that list inaccurate or deceptive pricing, or which obscure the pricing or billing method, can constitute adversarial attempts to profit from false pretenses.

Malware. Some adversaries attempt to direct users to landing pages where they might unwittingly download malware, badware, or other malicious software.

3. LEARNING METHODS

We now turn to the details of our learning-based approaches to detecting adversarial advertisements, starting with an examination of the features available for learning. We then describe the approaches we use to cope with the particular form of hierarchical multi-class classification required in this setting, including methods for dealing with highly skewed class imbalance. We present a simple MapReduce framework for training such models at scale, and conclude this section with an examination of practical considerations that must be addressed in a live production setting.

3.1 Features

Feature engineering is a key component of effective data mining; the following is a listing of the features extracted from advertisements during training and classification.

Natural language features are extracted from the text of advertisement keywords, creatives, and landing pages. These include term-level features, and semantically related terms and topic-level features using methods similar to [9] and [3].
String-based features are intended to avoid the possibility that adversaries may exploit alternate spellings or typographical manipulation to avoid detection. We incorporate features that allow inexact string matching, similar in spirit to [30].

Structural features are extracted from the structural layout of the landing page.

Page-type features are given by sub-classifiers that determine the general landing page type, such as a blog posting or a list of shopping results.

Crawl-based features are extracted from the results of the http fetch of the landing page.

Link-based features are based on links and redirects from the landing page.

Non-textual content-based features yield information about the image, video, or other multimedia content on the page.

Advertiser account-level features provide various information that may help identify suspicious or gaming behavior from adversaries.

Policy-specific features include a variety of proprietary hand-crafted features that help to identify violations in particular policy areas.

3.2 Multi-Class and Minority-Class Issues

As described in Section 2, this problem area is inherently multi-class. For policy reasons, it is important to determine the exact category or categories for a given example, rather than a binary "adversarial" or "non-adversarial" classification. This problem is made more challenging by the fact that the vast majority of ads are non-adversarial, making each adversarial category an extreme minority class.

Our automated classification methods include a variety of inherently multi-class classifiers, including nearest neighbor approaches, naive Bayes variants, and semi-supervised graph-based algorithms. Because these methods are well known [2], we will not describe them further here.

Interestingly, the most effective methods we have found are based on sets of binary-class linear classifiers deployed as per-class classifiers, including linear support vector machines (SVMs) [16, 28], linear rank-based SVMs [17, 26], and linear models in cascades [33]. These methods will be the focus of this section.

Figure 2: Class Structure. There is a clear distinction between adversarial and non-adversarial, but members of the adversarial classes overlap in places.

3.2.1 One-vs-Good Multi-Class Classification

Typical strategies for performing k-class multi-class classification with binary classifiers include the one-vs-all method of training k individual models to distinguish each class from all other classes, and the one-vs-one method of training k(k − 1)/2 models to distinguish each class from each other class [14].

The class labels in our setting have a special structure. There is a single large class of non-adversarial (or "good") ads, and then a large number of possibly overlapping adversarial classes (see Figure 2). This setting naturally suggests a multi-class decomposition that we call one-vs-good, in which for each of the k − 1 adversarial classes, a model is trained to distinguish that class from members of the non-adversarial class only. This allows overlapping classes to be detected by examining the output of all models. In situations where several classes overlap significantly, we found it useful to train additional models to distinguish all members of the high-overlap set from the non-adversarial class.
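To make the one-vs-good decomposition concrete, the sketch below shows one minimal way to implement it. It is illustrative only: the record format and the train_binary_classifier argument are assumptions, not the production system.

```python
def train_one_vs_good(ads, adversarial_classes, train_binary_classifier):
    """Train one binary model per adversarial class against 'good' ads only.

    `ads` is a list of (features, labels) pairs, where `labels` is the set of
    adversarial class names for the ad (empty for good-faith ads).
    `train_binary_classifier` is any binary learner, e.g. an SGD-trained
    linear SVM or ROC-SVM.
    """
    good = [(x, -1) for x, labels in ads if not labels]
    models = {}
    for cls in adversarial_classes:
        positives = [(x, +1) for x, labels in ads if cls in labels]
        # Each model sees only its own class vs. the non-adversarial class;
        # other adversarial classes are excluded from its training data.
        models[cls] = train_binary_classifier(positives + good)
    return models

def predict_classes(models, x, thresholds):
    # Overlapping classes are detected by examining the output of all models.
    return {cls for cls, m in models.items() if m.score(x) >= thresholds[cls]}
```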
3.2.2 Learning-to-Rank Methods for Classification

Linear SVMs have been found to be highly effective at high dimensional classification tasks similar to those encountered here, such as text classification [17]. Linear SVMs are trained by solving the following optimization problem:

    min_w  (λ/2) ||w||² + L(w, D)

Here, w ∈ R^d is a d-dimensional linear weight vector and λ is a regularization parameter controlling model complexity. L(w, D) is the total hinge-loss of w over the labeled training data D = ((x_1, y_1), ..., (x_m, y_m)), given by Σ_{i=1}^{m} max(0, 1 − y_i⟨w, x_i⟩). Each labeled example contains a feature vector x ∈ R^d and a class label y ∈ {−1, +1}. Linear SVMs may be trained efficiently using stochastic gradient descent variants such as the Pegasos SVM algorithm [28].

However, in cases of extreme class imbalance, linear SVMs have been found to give less than ideal results [21]. One approach is to set per-class weights on the loss function to emphasize the importance of the minority class [21], but we have found (in accordance with prior work [18, 26]) that using a pairwise objective function both improves results and eliminates the need to tune special per-class weights.

In the binary-class case, we refer to this pairwise method as the ROC Area SVM, or ROC-SVM. An ROC-SVM is trained by solving the following optimization problem:

    min_w  (λ/2) ||w||² + L(w, P)

Here, P is the set of all candidate pairs in the original data set D. A candidate pair contains one member x_p of the positive class and one member x_n of the negative class, and is used to construct a labeled pairwise example ((x_p − x_n), +1). The ROC-SVM may be trained efficiently (despite the quadratic number of candidate pairs) using stochastic gradient descent and an indexed sampling scheme; see [26].²

Results. Using ROC-SVM instead of a standard Pegasos SVM improves recall by as much as 15% at our high-precision threshold for automated blocking of adversarial advertisements.

²Open source code for (single-machine) ROC-SVM using SGD is freely available at http://code.google.com/p/sofia-ml.
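A minimal single-machine sketch of this pairwise training scheme is shown below, assuming sparse feature vectors represented as Python dicts and a Pegasos-style learning rate; it samples candidate pairs on the fly rather than materializing P, and is not the indexed sampling implementation of [26].

```python
import random

def train_roc_svm(positives, negatives, lam=1e-5, steps=100_000):
    """Sketch of ROC-SVM training: SGD over sampled positive/negative pairs.

    Each step draws one positive and one negative example and applies a
    hinge-loss update to the pairwise example ((x_p - x_n), +1).
    Examples are sparse dicts mapping feature id -> value.
    """
    w = {}
    for i in range(1, steps + 1):
        eta = 1.0 / (lam * i)                      # Pegasos-style learning rate
        xp, xn = random.choice(positives), random.choice(negatives)
        diff = {k: xp.get(k, 0.0) - xn.get(k, 0.0) for k in set(xp) | set(xn)}
        margin = sum(w.get(k, 0.0) * v for k, v in diff.items())
        for k in list(w):                          # L2 regularization shrinkage
            w[k] *= (1.0 - eta * lam)
        if margin < 1.0:                           # pairwise hinge-loss violation
            for k, v in diff.items():
                w[k] = w.get(k, 0.0) + eta * v
    return w
```

A production implementation would use lazy regularization updates and indexed pair sampling, but the structure of the update is the same.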
Figure 3: Visualizing Cascades. The vast majority of easy, good ads are filtered out by a high-recall coarse model (left). Finely-grained models then detect specific adversarial classes (right).

Figure 4: Multi-Class Cascade Framework. The coarse model filters out examples that are clearly non-adversarial. The remainder are passed to a set of per-class models for fine-grained classification.

3.2.3 Cascade Models
The single model ROC-SVM approach works well for a number of adversarial classes, but other classes are more difficult to classify at the high-precision levels needed for automated blocking. For these cases, we use a more sophisticated methodology based on cascades.

The basic cascade framework uses a series of models, each of which rejects a portion of the data space, as illustrated in Figure 3. This approach has been particularly successful in the field of computer vision for tasks such as face recognition [33] and email spam filtering [37]. In theory, there is no limit to the number of stages that may be applied (and boosting approaches may result in dozens of stages). In practice, tuning a large number of cascade stages requires significant manual effort, creating a heavy maintenance burden.

After experimenting with a range of multi-stage configurations, we found a simple strategy that achieves good results with minimal system complexity. We use a single coarse model, common to all of our cascade models, trained to distinguish adversarial from non-adversarial with high recall. We then train a set of more finely-grained models to detect each of these more difficult classes with high precision, using the one-vs-good framework (see Figure 4).

Cascade models are particularly susceptible to problems of over-fitting. We have found tightly regularizing the coarse-level model to be effective. (Another approach uses cross-validation on the training data [37], but it is then non-trivial to combine the models in stages in a principled way.) The coarse model is tightly L1-regularized (see Section 3.3.2), inducing sparsity that keeps the memory footprint of this coarse model relatively small; this is an important consideration when dealing with billions of features. The sub-models are each trained on data sets that are much reduced in size due to the coarse-level filtering, reducing their size significantly as well.

Results. Representative results for three difficult per-class cascade models are given in Figure 5. (Note that the precision and recall values given in these graphs have been linearly transformed to obscure sensitive data; the relative performance trends remain unchanged.) In general, the cascade models give excellent improvement in recall at high precision levels (the upper-left corner of each graph), in comparison with a single ROC-SVM model for the same class problem. Note that in cases where the precision/recall curves cross (at lower precision levels, used for prioritizing human expert effort) we can always use the better of the two models in the different regions.
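The scoring path through such a cascade can be sketched as follows. The model objects, their score() method, and the threshold values are assumptions for illustration; the production decision logic is more involved.

```python
def cascade_classify(ad_features, coarse_model, fine_models, coarse_threshold,
                     fine_thresholds):
    """Two-stage cascade: a high-recall coarse filter, then per-class models.

    `coarse_model` is shared by all cascades and tuned for high recall, so
    clearly good ads are rejected cheaply. Each entry of `fine_models` is a
    one-vs-good model tuned for high precision on one difficult class.
    """
    # Stage 1: most good ads exit here and are never scored by fine models.
    if coarse_model.score(ad_features) < coarse_threshold:
        return set()                      # treated as non-adversarial
    # Stage 2: only survivors reach the finely-grained, high-precision models.
    flagged = set()
    for cls, model in fine_models.items():
        if model.score(ad_features) >= fine_thresholds[cls]:
            flagged.add(cls)
    return flagged
```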
3.3 Large-Scale Training

By some standards, the data sets used to train our models may be seen as large (measured in terabytes); we need an efficient methodology for frequent training of dozens or hundreds of models. By enforcing sparsity during training, we ensure that the resulting models fit in memory on a single machine. This allows us to deploy an efficient stochastic gradient descent (SGD) training paradigm in a MapReduce³ setting for fast model training.

³MapReduce is a paradigm for embarrassingly parallel tasks, widely used in cluster-based computing [8].

3.3.1 MapReduce SGD

Solving optimization problems such as those presented in Section 3.2.2 may be done efficiently using SGD, which quickly converges to approximate solutions that are highly satisfactory from a machine learning perspective [4, 28]. SGD is an iterative solver, sequentially examining data points one at a time in a random order. The basic SGD training paradigm for linear SVM [38] is given in Algorithm 1.

Algorithm 1 Training Linear SVM using SGD.
1: w_0 ← 0
2: for i = 1 to t do
3:   (x, y) ← RandomExample(D)
4:   η ← GetLearningRate(i)
5:   if y⟨w_{i−1}, x⟩ < 1.0 then
6:     τ ← 1
7:   else
8:     τ ← 0
9:   end if
10:  w_i ← (1 − ηλ)w_{i−1} + τηyx
11: end for
12: return w_t

In general, each iteration is fast to compute when x is sparse, and can be completed in O(s) time where s is the number of non-zero elements in x. The Pegasos SVM variant includes a step that projects w into an L2-ball of fixed radius after step 10; this may be done in O(1) time with appropriate data structures [28]. The number of iterations t may be large, but is provably independent of the number of training examples, making the SGD framework ideal for large-scale learning [28, 4]. For example, only a few CPU seconds are required for training on data sets that are considered large in the academic literature, such as RCV1 [21].
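The following Python sketch mirrors Algorithm 1 on sparse dict-based feature vectors. The Pegasos-style learning rate η = 1/(λi) is an assumption; the pseudocode above only names a GetLearningRate routine.

```python
import random

def sgd_linear_svm(data, lam=1e-5, steps=100_000):
    """Sketch of Algorithm 1: SGD training of a linear SVM.

    `data` is a list of (x, y) pairs with x a sparse dict and y in {-1, +1}.
    """
    w = {}
    for i in range(1, steps + 1):
        x, y = random.choice(data)                     # RandomExample(D)
        eta = 1.0 / (lam * i)                          # GetLearningRate(i)
        margin = y * sum(w.get(k, 0.0) * v for k, v in x.items())
        tau = 1 if margin < 1.0 else 0                 # hinge-loss violation?
        for k in list(w):                              # (1 - eta * lam) * w
            w[k] *= (1.0 - eta * lam)
        if tau:                                        # + tau * eta * y * x
            for k, v in x.items():
                w[k] = w.get(k, 0.0) + eta * y * v
    return w
```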
[Figure 5 plots Scaled Precision against Scaled Recall in three panels, comparing Cascade Models A, B, and C with the corresponding Single Models A, B, and C.]

Figure 5: Cascade Models vs. Single Models. Values from three representative adversarial category identification tasks show that using cascade methodology significantly improves recall at low false positive rates. (Note that precision and recall values have been linearly transformed to protect sensitive data.)

But because SGD is a sequential online methodology, it is non-trivial to parallelize SGD training across multiple machines. Approaches that have been analyzed include message passing approaches [23], lazy distributed updates [39], training on multiple independent samples of the data [40], and performing multiple chained MapReduce steps [35]. However, these methods all add significant complexity to a relatively simple learning algorithm, and in some cases adversely impact model quality or give only limited incremental benefit as more machines are added.

Interestingly, we have observed that the main cost of SGD training is actually reading and parsing data from disk; this can be orders of magnitude more expensive than the actual training. Langford et al. have found it effective to reduce this cost by using a specialized data format, which significantly reduces disk-read times [19]. This observation allows us to parallelize our training effort with a simpler approach than those described above. As shown in Figure 6, our approach is summarized as follows:

Do expensive work in parallel. The expensive work of parsing, filtering, labeling, transforming, and encoding data can all be done independently in parallel. We use hundreds of machines in the Map phase. The output of the Map phase is labeled training data that is efficiently compressed.

Do cheap work sequentially. Because our models are small enough to fit in memory, a single Reduce machine can perform the SGD training quickly once the data has been properly prepared and formatted. This eliminates the need for expensive message passing or synchronization.

This framework allows us to train models within minutes on large data sets, and is used for both ROC-SVM training and for training our cascade models. A similar framework is used for evaluating models on holdout test data.

[Figure 6 diagram: data snapshots and training data flow to Mappers 1 through n, each of which filters examples, assigns labels, transforms features, and encodes the data; a single Reducer then runs the SGD learner to produce the trained model.]

Figure 6: SGD learning via MapReduce. Pre-processing is parallelized; training is sequential.
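The same "expensive work in parallel, cheap work sequentially" pattern can be sketched with a local process pool standing in for the Map phase; the record format and preprocessing steps here are simplified assumptions, not the production MapReduce pipeline.

```python
from multiprocessing import Pool

def prepare_example(raw_record):
    """Map phase: filter, label, transform, and encode one raw record.

    Assumes a toy record format of (text, is_adversarial); the real pipeline
    parses full ad and landing-page snapshots.
    """
    text, is_adversarial = raw_record
    if not text:                                   # Filter Examples
        return None
    y = +1 if is_adversarial else -1               # Assign Label
    x = {hash(tok) % (1 << 20): 1.0                # Transform + Encode
         for tok in text.lower().split()}
    return (x, y)

def map_then_single_reduce(raw_records, sgd_learner, workers=8):
    # Expensive work in parallel: reading/parsing dominates the SGD cost.
    with Pool(workers) as pool:
        prepared = [ex for ex in pool.map(prepare_example, raw_records) if ex]
    # Cheap work sequentially: a single "reducer" trains the in-memory model.
    return sgd_learner(prepared)
```

On platforms that spawn worker processes, the call to map_then_single_reduce should be made under an if __name__ == "__main__": guard.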
3.3.2 Controlling Model Size

The learned models must be small enough to fit in memory on a single machine; we use two strategies for keeping model size suitably restricted.

The first of these is to use a feature-hashing approach similar in spirit to that of [36]. If we think of w ∈ R^d as a set of key-value pairs where many values are exactly 0, then it is efficient to store the non-zero values in a hash map. Hashing the keys ensures that the model size will not grow beyond a certain bound. We have found that ignoring collisions does not degrade model performance, in line with results from [36], and keeps model size manageable.
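A minimal sketch of this hashing trick is shown below, using Python's built-in hash for illustration; a production system would use a fixed, process-stable hash function, and the bucket count here is arbitrary.

```python
NUM_BUCKETS = 1 << 22        # fixed bound on the number of distinct weights

def hash_feature(feature_name):
    # Collisions are simply ignored: colliding features share one weight.
    return hash(feature_name) % NUM_BUCKETS

class HashedLinearModel:
    """Linear model whose weights are keyed by hashed feature ids, so the
    number of distinct weights can never exceed NUM_BUCKETS."""

    def __init__(self):
        self.w = {}          # bucket id -> weight; stored sparsely

    def score(self, features):
        # `features` maps raw feature names to values.
        return sum(self.w.get(hash_feature(name), 0.0) * value
                   for name, value in features.items())

    def update(self, features, delta):
        for name, value in features.items():
            b = hash_feature(name)
            self.w[b] = self.w.get(b, 0.0) + delta * value
```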
The second strategy is to encourage sparsity in the learned model, so that many weight values are indeed exactly 0. We follow a projected-gradient methodology similar to that of [10], projecting w to an L1-ball of a specified radius after updates. This is done every k steps, after step 10 in Algorithm 1. The exact L1-projection of [10] was somewhat slow, so we use a simpler and faster approximate projection given in Algorithm 2. The method of Duchi et al. uses an approach similar to randomized median finding to find the exact value of θ that is used to project a given vector w onto an L1-ball of radius at most λ. We make do with using a value of θ that is guaranteed to cause ||w||_1 to converge to radius at most λ after repeated calls. In practice, we find this works well, is fast to compute, and is easier to tune than the truncated gradient approach of [20].
Algorithm 2 Approximate projection to L1-ball of radius at most λ. Repeated calls to this projection will converge to radius λ.
1: c ← max(||w||_1 − λ, 0)
2: d ← ||w||_0
3: θ ← c / d
4: for each non-zero element i of w do
5:   s ← sign(w_i)
6:   w_i ← s · max(|w_i| − θ, 0)
7: end for
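A direct transcription of Algorithm 2 into Python, operating in place on a sparse dict of weights; deleting zeroed entries is an implementation detail assumed here to keep the model sparse.

```python
def approx_l1_projection(w, radius):
    """Sketch of Algorithm 2: shrink w toward an L1-ball of radius `radius`.

    A single call removes roughly the current excess L1 mass; repeated calls
    (e.g. every k SGD steps) converge to ||w||_1 <= radius.
    `w` is a sparse dict mapping feature id -> weight, modified in place.
    """
    l1_norm = sum(abs(v) for v in w.values())
    c = max(l1_norm - radius, 0.0)                 # excess L1 mass
    d = sum(1 for v in w.values() if v != 0.0)     # ||w||_0
    if c == 0.0 or d == 0:
        return w
    theta = c / d                                  # uniform per-coordinate shrinkage
    for k, v in list(w.items()):
        shrunk = max(abs(v) - theta, 0.0)
        if shrunk == 0.0:
            del w[k]                               # zeroed weights keep the model sparse
        else:
            w[k] = shrunk if v > 0 else -shrunk
    return w
```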
3.4 Model Management

It is worth briefly looking at some of the engineering issues involved in maintaining a large-scale data mining system with many component models. Our management strategies include performing automated model calibration, establishing effective automated monitoring of live models, and bundling useful information into models.

3.4.1 Calibration

As models are re-trained over time, the semantic meaning of their output scores may drift or vary, resulting in constant adjustment of decision thresholds. To avoid the need for manual threshold adjustment, we automatically calibrate each model so that its final output is an actual probability estimate rather than an arbitrary score.

Recall that each linear model w scores a given example x using a scoring function f(x) = ⟨w, x⟩. We learn a calibration function c() using holdout validation data to ensure that c(f(x)) = Pr[y_x = 1]. This ensures that scores from different model versions may be interpreted along the same natural scale.
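The text does not specify the form of c(); the sketch below fits a logistic (Platt-style) calibration on holdout (score, label) pairs as one plausible choice, using an illustrative gradient-descent fit.

```python
import math

def _sigmoid(z):
    z = max(min(z, 35.0), -35.0)      # clamp to avoid overflow in exp
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic_calibration(scores, labels, steps=2000, lr=0.05):
    """Fit c(s) = sigmoid(a*s + b) on holdout data by gradient descent.

    `scores` are raw model outputs f(x) on a holdout set; `labels` are 0/1
    indicators of the positive class. Returns a callable calibration map.
    """
    a, b = 1.0, 0.0
    n = float(len(scores))
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for s, y in zip(scores, labels):
            p = _sigmoid(a * s + b)
            grad_a += (p - y) * s / n     # log-loss gradient w.r.t. a
            grad_b += (p - y) / n         # log-loss gradient w.r.t. b
        a -= lr * grad_a
        b -= lr * grad_b
    return lambda s: _sigmoid(a * s + b)
```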
3.4.2 Monitoring

In a live production setting, it is critical to monitor the quality and output of models. Our first level of monitoring involves a set of precision/recall tests that each model must pass, based on holdout test data, whenever the model is re-trained. If the model fails these tests, the new version is not pushed to production. Second, we monitor the input features, to make sure that in aggregate the distribution of values is relatively stable. We need to be alerted if our input features or signals were to vary significantly, as this would cause changed behavior from our models. Third, we monitor model output scores to detect drift in the distribution of values. A sudden drift would be a warning of a change somewhere in our system. We also monitor the actual decision rates in addition to the score distributions, to ensure we are aware of any sudden changes there. Finally, we monitor the overall system quality, based on data from our ensemble-aided stratified sampling pipeline, as described in Section 4.

3.4.3 Bundling Model Data

It is clear that a model needs to know how to classify examples. But what else should a model know how to do? Over time, we have found that it is useful to bundle a surprisingly large amount of information together into a model object. In particular, a model should know how to do the following:

Filter out examples that should be ignored, for example if they are in the wrong language.

Transform features as needed, including scaling, discretizing, etc.

Label training data as a positive or a negative, and distinguish test data from training data.

Report the parameters that were used to train it, so that the model may be re-trained if needed.

Score an example using a feature vector w.

Calibrate its output scores onto a consistent scale.

Together, these requirements define a somewhat broader view of a model than is generally considered in the academic literature, which often only discusses the weight vector w. We have found that bundling this data together reduces system complexity and eases the burden of managing and maintaining a large number of models in production.

4. ENSEMBLE-AIDED STRATIFIED SAMPLING

Acquiring hand-labeled data represents a significant cost, requiring expert judgement to navigate intricate policies and to recognize a wide variety of clever adversarial attacks. Our pilot experiments testing low-cost, crowd-sourced rater pools showed that crowd-sourcing was not a viable option to achieve labels of the needed quality for production use. Thus, we rely on more expensive, specially trained expert raters. It is therefore critical that we make the most efficient use of this limited resource.

In this section, we detail an ensemble-aided approach to stratified sampling that helps allocate rater effort efficiently to achieve multiple distinct goals.

4.1 Multiple Needs for Hand-Labeled Data

In the machine learning literature, it is common to focus on gathering hand-labeled data only for the purpose of model training. But in our real-world setting, there are actually several important and distinct areas in our system that require hand-labeled data from expert raters. These are:

Catching hard adversaries. Human judgement is needed to provide a final verdict on cases which are too difficult for automated methods to classify with high precision.

Improving learned models. Hand-labeled data is needed to train models, and to keep the models current as adversarial methods change over time. Examples which are most beneficial for improved models (such as may be selected by various active learning strategies [32]) may be different from hard adversaries, above.

Detecting new trends. It is helpful to prioritize the review of new ads, so that new trends from adversaries are quickly detected both by our automated systems and by our human domain experts.

Maximizing impact. Because expert human-rater capacity is expensive and limited, it is desirable to maximize the impact by focusing on advertisements that have high impression counts.
Providing unbiased metrics. Hand-rated data should be able to provide unbiased estimates for model-level and system-level precision and recall. The naive approach to gathering unbiased evaluation data would be to hand rate a uniform sample of ads. However, this would waste effort because the vast majority of ads are not adversarial. Sampling ads to achieve the other goals above results in biased evaluation data; by constructing the sample in a careful manner we can later remove the sample bias when computing metrics.

Not surprisingly, these different goals are largely disjoint, making it non-trivial to determine which ads should be selected for human rating.

4.2 Ensemble-Aided Stratification

We considered various forms of model-aided sampling [25], using learned models to induce a sampling bias towards more useful ads for human rating. We first tried to define an aggregate utility score based on these different factors, but it was unclear how to combine (or in some cases even how to measure) these different quantities. We also considered framing this problem in a bandit setting; however, McMahan and Streeter show that the use of bandit algorithms is problematic for selecting data for model-training due to the non-stationarity of the underlying models [24].

Our approach is to stratify the data across several different dimensions, as shown in Figure 7. First, ads in each language are considered separately. Within a given language, ads are divided into three categories: "new ads" that are less than t hours old, "recently blocked ads" that have been caught and turned off within the last t hours, and "all other ads" that have been actively served over the last t hours.

Figure 7: Ensemble-Aided Stratified Sampling. Ads in a given language are binned based on probability estimates from an ensemble. Grey and black circles represent un-sampled and sampled ads, respectively.

To aggregate the scores from the many different models in production, we train an ensemble model [2] for the given language, using the output scores of each of the m models as features for the ensemble. We use a binary-class (adversarial vs. non-adversarial) linear ROC-Area SVM to train the ensemble, and calibrate its output scores as described in Section 3.4.1. The score from the ensemble model is used to divide the ads in each category into uniformly spaced score-bins containing different numbers of ads.

This coarse binning allows us to explicitly decide how many ads to select from each bin in order to balance the differing goals listed in Section 4.1. Selecting sites from mid-probability bins is akin to uncertainty sampling [32], and provides benefit for model training. Selecting ads from higher probability bins in the "new ads" or "all other ads" categories gives priority to catching adversaries that have not yet been detected automatically. Selecting some ads from every bin ensures coverage of the entire data space.
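One way to realize this binning and per-bin allocation is sketched below. All field names, bin counts, and quota numbers are illustrative assumptions; the production policy balances the goals of Section 4.1 in a more nuanced way.

```python
def stratify_ads(ads, ensemble_prob, num_bins=10):
    """Group ads into uniformly spaced probability bins within each stratum.

    `ads` yields records with .language and .category (one of 'new',
    'recently_blocked', 'other'); `ensemble_prob(ad)` is the calibrated
    ensemble estimate of P(adversarial).
    """
    strata = {}
    for ad in ads:
        p = ensemble_prob(ad)
        b = min(int(p * num_bins), num_bins - 1)
        strata.setdefault((ad.language, ad.category, b), []).append(ad)
    return strata

def allocate_quotas(strata, num_bins=10, base=5, extra_uncertain=20, extra_high=10):
    """Decide how many ads to send for hand-rating from each bin."""
    quotas = {}
    for (lang, cat, b), bucket in strata.items():
        center = (b + 0.5) / num_bins
        q = base                                   # some ads from every bin: coverage
        if 0.3 <= center <= 0.7:
            q += extra_uncertain                   # mid-probability: uncertainty sampling
        if center > 0.7 and cat in ("new", "other"):
            q += extra_high                        # catch not-yet-blocked adversaries
        quotas[(lang, cat, b)] = min(q, len(bucket))
    return quotas
```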
4.3 Priority Sampling from Bins

Assuming we have decided to select k ads from a given bin, how should we choose which k to pick from that bin? Ideally, we would like a low-variance estimate of the impression-weighted total of each class of adversarial advertisement from the given bin; however, impression counts vary dramatically across ads, following a heavy-tail distribution.

In this heavy-tail setting, using uniform sampling to select ads within a bin is a poor strategy, resulting in high variance estimates [11]. Using an intuitive sampling-proportional-to-impression-count strategy is better, but far from optimal [11]. Instead, we use the Priority Sampling strategy of Duffield et al., which has been proven to yield near-optimally low variance estimates for arbitrary subset sums [11]. This strategy is reviewed in Algorithm 3.

Algorithm 3 Priority Sampling Ads from a Bin. We use the following priority-sampling algorithm from [11] to select ads from a given bin for near-optimally low variance.
1: for each advertisement i with impressions w_i do
2:   pick p_i uniformly at random from (0, 1]
3:   let priority q_i = w_i / p_i
4: end for
5: sort all ads by their priority q
6: let τ equal the (k + 1)-th highest priority
7: for each of the k highest priority ads do
8:   let effective weight w̃_i = max(w_i, τ)
9: end for
10: exclude all other ads from the sample for this bin

The variance from an estimate based on k samples selected with this strategy is provably at least as low as the variance of k + 1 samples selected with the (computationally infeasible) optimal strategy [31]. It also has the attractive quality of tending to prioritize very-high impression ads for review to maximize impact of human effort while maintaining good coverage of the long tail.

Results. Using our ensemble-aided stratified sampling approach instead of the naive approach of separate samplings increased the effective impact of our human experts by 50%, and reduced the latency required to compute unbiased metrics by an order of magnitude with no added cost.
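A compact transcription of Algorithm 3 in Python is given below; the returned effective weights max(w_i, τ) are what later allow near-unbiased, impression-weighted estimates despite the biased sample. The input format is an assumption.

```python
import random

def priority_sample(ads_with_impressions, k):
    """Sketch of Algorithm 3: priority-sample k ads from one bin.

    `ads_with_impressions` is a list of (ad, impressions) pairs. Returns a
    list of (ad, effective_weight) pairs for the k sampled ads.
    """
    prioritized = []
    for ad, w in ads_with_impressions:
        p = random.random() or 1e-12          # uniform draw from (0, 1]
        prioritized.append((w / p, ad, w))    # priority q_i = w_i / p_i
    prioritized.sort(key=lambda t: t[0], reverse=True)
    if len(prioritized) <= k:
        return [(ad, w) for _, ad, w in prioritized]
    tau = prioritized[k][0]                   # the (k + 1)-th highest priority
    return [(ad, max(w, tau)) for _, ad, w in prioritized[:k]]
```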
5. LEVERAGING EXPERT KNOWLEDGE

We have found it critical to leverage the knowledge of human experts to help detect evolving adversarial advertisements. Here, data mining methods provide automated guidance, helping to make the most effective use of this limited expert effort.

5.1 Active Learning

Experts periodically detect new categories of bad ads, or particular emerging trends, for which it is useful to develop a new model. Lacking initial training data, we have found that margin-based uncertainty sampling (akin to the simple strategy of Tong and Koller [32]) has been an effective methodology for rapid development of new models, often requiring only a few dozen hand-labeled examples.
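A minimal sketch of margin-based uncertainty sampling for bootstrapping such a model; the model interface and budget are assumptions.

```python
def select_most_uncertain(unlabeled_ads, model, budget):
    """Pick the ads whose scores lie closest to the decision boundary.

    `model.score(ad)` is the current (possibly crude) linear model's score;
    the `budget` ads with the smallest |score| are sent to expert raters.
    """
    scored = sorted(unlabeled_ads, key=lambda ad: abs(model.score(ad)))
    return scored[:budget]
```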
5.2 Exploring for Adversaries

Attenberg et al. recently reported that in cases of extreme class imbalance, traditional active learning strategies may fail from difficulty in locating any members of the proposed positive class. They suggested using information retrieval systems in such cases, allowing expert users to search for positive examples guided by their intuition.

Independently, we have also found that providing a search-based interface for expert users provides valuable automated assistance for finding new examples of adversarial advertisements. Because this search-based tool is used by experts, it has been practical to augment standard keyword-based search with a variety of feature-based filters (using many of the features listed in Section 3.1). This allows experts to make guided searches in real time, based on their intuition and a large store of informative data.

5.3 Rule-Based Models

Coming from a machine-learning background, it has surprised us that our experts have proven capable of developing hand-crafted, rule-based models with extremely high precision. Enabling such models to be served in production provides a rapid response mechanism to new adversarial attacks, and gives an effective means of injecting expert knowledge directly into our system.

Because such models do not adapt over time, we have developed automated monitoring of the effectiveness of each rule-based model; models that cease to be effective are removed. Although rule-based models account for less than 4% of the overall system impact, they provide an important capability to respond to new classes of adversarial attacks within minutes of discovery.

6. INDEPENDENT EVALUATION

Finally, we examine the data challenge of evaluating the human components of our system.

6.1 Monitoring Human Rater Quality

Because human ratings are used as our ground truth, it is critical to measure how reliable these ratings are. To establish this, we regularly evaluate the precision and recall of our base-level raters, using higher-level experts to re-rate a sample of ratings from each lower-level rater. We also regularly double-check these results using an independent set of higher-level experts. This allows us both to assess the performance of the base-level raters and to measure our confidence in those assessments.

6.2 Monitoring User Experience

The different levels of human experts described above are all paid and carefully vetted experts, and as such may have a viewpoint that does not always align with the perception of common users. To ensure that we get an accurate reading of real user perception, we additionally perform regular large-scale evaluations using an approach similar to crowd-sourcing [29]. These evaluations are used to calibrate our understanding of real user perception and ensure that our system continues to protect the interests of actual users. The aggregate results from these independent evaluations have consistently shown strong agreement with our human expert opinion.
7. RELATED WORK

To our knowledge, the general problem of detecting adversarial advertisements has not previously been studied. In the most closely related work, Attenberg et al. detected unsafe advertisements, such as those containing adult content or hate speech, and used a search-based methodology over active-learning for model training [1]. We consider a broader range of adversarial advertisements, including many that are often difficult for non-experts to distinguish from good-faith advertisements without aid.

The field of email spam filtering has a large body of literature on the use of data mining for blocking adversarial messages (see [12] for an informative survey). The problem of adversarial advertisement detection differs in several ways, including the multi-class nature of the problem, the minority class difficulties, the presence of dynamically changing content, and the need for trained expert human raters.

Dalvi et al. attempt to learn classifiers in adversarial situations by modeling the adversaries [7], but accurately modeling motivated adversaries is problematic in real-world settings. Lowd and Meek point out that in a publicly-facing system, adversaries may attempt to reverse-engineer the model via membership queries [22]. Our inclusion of semi-automated methods involving human effort helps to minimize the effectiveness of such strategies.

Crowd-sourcing efforts like reCAPTCHA [34] attempt to use the effort of anonymous users to block abuse on the web. This approach is difficult in the case of advertisements because it would be problematic to keep motivated adversaries from poisoning the signal. Sculley et al. explored the use of aggregate user-based signals such as bounce-rate for determining user satisfaction [27], but this approach is unsuitable for making per-advertisement decisions with high precision due to signal noise. Various approaches such as that of Chakrabarti et al. have used click-feedback to determine ad relevance [5], but for adversarial advertisements relevance is a secondary factor compared to user safety.
8. REFERENCES

[1] J. Attenberg and F. J. Provost. Why label when you can search?: Alternatives to active learning for applying human resources to build classification models under extreme class imbalance. In KDD, 2010.
[2] C. M. Bishop. Pattern Recognition and Machine Learning. Springer-Verlag New York, Inc., 2006.
[3] D. M. Blei, A. Ng, and M. Jordan. Latent Dirichlet allocation. JMLR, 3, 2003.
[4] L. Bottou and O. Bousquet. The tradeoffs of large scale learning. In Advances in Neural Information Processing Systems 20, 2008.
[5] D. Chakrabarti, D. Agarwal, and V. Josifovski. Contextual advertising by combining relevance with click feedback. In WWW '08: Proceedings of the 17th International Conference on World Wide Web, 2008.
[6] N. V. Chawla, N. Japkowicz, and A. Kotcz. Editorial: Special issue on learning from imbalanced data sets. SIGKDD Explor. Newsl., 6, June 2004.
[7] N. Dalvi, P. Domingos, Mausam, S. Sanghai, and D. Verma. Adversarial classification. In KDD '04: Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2004.
[8] J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. Commun. ACM, 51, January 2008.
[9] S. Deerwester, S. Dumais, T. Landauer, G. Furnas, and R. Harshman. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 1990.
[10] J. Duchi, S. Shalev-Shwartz, Y. Singer, and T. Chandra. Efficient projections onto the l1-ball for learning in high dimensions. In ICML '08: Proceedings of the 25th International Conference on Machine Learning, 2008.
[11] N. Duffield, C. Lund, and M. Thorup. Priority sampling for estimation of arbitrary subset sums. J. ACM, 54, December 2007.
[12] J. Goodman, G. V. Cormack, and D. Heckerman. Spam and the ongoing battle for the inbox. Commun. ACM, 50(2), 2007.
[13] Landing page and site policies. Google AdWords Help Center, 2011. http://goo.gl/XcbPO.
[14] C.-W. Hsu and C.-J. Lin. A comparison of methods for multiclass support vector machines. Neural Networks, IEEE Transactions on, 13(2), Mar. 2002.
[15] IAB internet advertising revenue report, 2010. http://www.iab.net/media/file/IAB_report_1H_2010_Final.pdf.
[16] T. Joachims. Making large-scale support vector machine learning practical. 1999.
[17] T. Joachims. Optimizing search engines using clickthrough data. In KDD '02: Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2002.
[18] T. Joachims. A support vector method for multivariate performance measures. In ICML '05: Proceedings of the 22nd International Conference on Machine Learning, 2005.
[19] J. Langford. Vowpal Wabbit. Open source release, 2007. http://hunch.net/~vw/.
[20] J. Langford, L. Li, and T. Zhang. Sparse online learning via truncated gradient. J. Mach. Learn. Res., 10, 2009.
[21] D. D. Lewis, Y. Yang, T. G. Rose, and F. Li. RCV1: A new benchmark collection for text categorization research. J. Mach. Learn. Res., 5, 2004.
[22] D. Lowd and C. Meek. Adversarial learning. In KDD '05: Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery in Data Mining, 2005.
[23] G. Mann, R. McDonald, M. Mohri, N. Silberman, and D. Walker. Efficient large-scale distributed training of conditional maximum entropy models. In Advances in Neural Information Processing Systems 22, 2009.
[24] H. B. McMahan and M. Streeter. Tighter bounds for multi-armed bandits with expert advice. In COLT '09: 22nd Annual Conference on Learning Theory, 2009.
[25] C.-E. Särndal, B. Swensson, and J. Wretman. Model Assisted Survey Sampling. Springer, 2003.
[26] D. Sculley. Large scale learning to rank. In NIPS 2009 Workshop on Advances in Ranking, 2009.
[27] D. Sculley, R. G. Malkin, S. Basu, and R. J. Bayardo. Predicting bounce rates in sponsored search advertisements. In KDD '09: Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2009.
[28] S. Shalev-Shwartz, Y. Singer, and N. Srebro. Pegasos: Primal estimated sub-gradient solver for SVM. In ICML '07: Proceedings of the 24th International Conference on Machine Learning, 2007.
[29] R. Snow, B. O'Connor, D. Jurafsky, and A. Y. Ng. Cheap and fast -- but is it good?: Evaluating non-expert annotations for natural language tasks. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, EMNLP '08, 2008.
[30] S. Sonnenburg, G. Rätsch, and B. Schölkopf. Large scale genomic sequence SVM classifiers. In ICML '05: Proceedings of the 22nd International Conference on Machine Learning, 2005.
[31] M. Szegedy. The DLT priority sampling is essentially optimal. In Proceedings of the Thirty-Eighth Annual ACM Symposium on Theory of Computing, STOC '06, 2006.
[32] S. Tong and D. Koller. Support vector machine active learning with applications to text classification. J. Mach. Learn. Res., 2, March 2002.
[33] P. Viola and M. Jones. Rapid object detection using a boosted cascade of simple features. CVPR: IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 1:511, 2001.
[34] L. von Ahn, B. Maurer, C. McMillen, D. Abraham, and M. Blum. reCAPTCHA: Human-based character recognition via web security measure. September 2008.
[35] M. Weimer, S. Rao, and M. Zinkevich. A convenient framework for efficient parallel multipass algorithms. In NIPS 2010 Workshop on Learning on Cores, Clusters and Clouds, 2010.
[36] K. Weinberger, A. Dasgupta, J. Langford, A. Smola, and J. Attenberg. Feature hashing for large scale multitask learning. In Proceedings of the 26th Annual International Conference on Machine Learning, ICML '09, 2009.
[37] W. Yih, J. Goodman, and G. Hulten. Learning at low false positive rates. In Proceedings of the Third Conference on Email and Anti-Spam (CEAS), 2006.
[38] T. Zhang. Solving large scale linear prediction problems using stochastic gradient descent algorithms. In ICML '04: Proceedings of the Twenty-First International Conference on Machine Learning, 2004.
[39] M. Zinkevich, A. Smola, and J. Langford. Slow learners are fast. In Advances in Neural Information Processing Systems 22, 2009.
[40] M. Zinkevich, M. Weimer, A. Smola, and L. Li. Parallelized stochastic gradient descent. In Advances in Neural Information Processing Systems 23, 2010.
