Suspicious Behavior Detection: Current Trends and Future Directions
Meng Jiang and Peng Cui, Tsinghua University
As Web applications such as Hotmail, Facebook, Twitter, and Amazon
[Figure 1 chart: y-axis Percent (0 to 100); x-axis 2005–2007 through 2014. Curve labels include email spam, SMS spam, Web spam, restaurant/hotel/product/app reviews, wall posts, messages, phone calls, "likes," ratings, zombie followers, astroturf, and sybils.]
Figure 1. Percentage of recent research works on four main categories of suspicious behaviors: traditional spam, fake review,
social spam, and link farming. Lately, we’ve seen tremendous progress in link-farming detection systems.
Table 1. Experimentally successful suspicious behavior detection techniques and their gathered information from data.*

Information aspects | Traditional spam | Fake reviews | Social spam | Link farming
C | AFSD1 | — | Astroturf2 and Decorate3 | —
N | MailRank4 | — | SybilLimit5 and Truthy6 | OddBall7
B | SMSF8 | — | — | CopyCatch9
C+N | — | — | SSDM10 and SybilRank11 | Collusionrank12
C+B | — | ASM,13 GSRank,14 OpinSpam,15 and SBM16 | URLSpam17 and Scavenger18 | —
N+B | — | FraudEagle19 | — | Com2 20 and LockInfer21
C+N+B | — | LBP22 | — | CatchSync23

* C: content, N: network, and B: behavioral pattern.

Email spam can include malware or malicious links from botnets with no current relationship to the recipient, wasting time, bandwidth, and money. A content-based approach called Adaptive Fusion for Spam Detection (AFSD) extracts text features (bag-of-words) from an email's character strings, develops a spam detector for a binary classification task (spam versus regular message), and shows promising accuracy in combating email spam.1 People don't want legitimate email blocked, so to take false-positive rates (FPRs) into consideration, AFSD gives the accuracy score measured by the area under the receiver-operating-characteristics curve (AUC) of 0.991 (with the highest score being 1) on a dataset from NetEase, one of the largest email service providers in China, indicating an almost-perfect performance in stopping text-rich spam for email services.

The MailRank system studies email networks using data gathered from the log files of a company-wide server and the email addresses that users prefer to communicate with (typically acquaintances rather than other unknown users).4 The system applies personalized PageRank algorithms with trusted addresses from different information sources to classify emails. It achieves stable, high-quality performance: the FPR is close to zero, smaller than 0.5 percent even when the network is 50 percent sparser.

Text message spams often use the promise of free gifts, product offers, or debt relief services to get users to reveal personal information. Due to the low cost of sending them, the resulting massive amount of SMS spam seriously harms users' confidence in their telecom service providers. Content-based spam filtering can use indicative keywords in this space, such as "gift card" or "Cheap!!!," but obtaining a text message's content is expensive and often infeasible. An algorithm for SMS filtering (SMSF) detects spam using static features such as the total number of messages in seven days and temporal features such as message size every day.8 On SMS data from 5 million senders on a Chinese telecom platform, we can use static features to train support vector machine (SVM) classifiers and get the AUC to 0.883; incorporating temporal features gives an additional 7 percent improvement.

Researchers have developed various data mining approaches to detect email, SMS, and Web spam. High accuracy (near-1 AUC and near-0 FPR) makes these methods applicable in real systems, and some achieve a certain
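The static and temporal features used to train the SVM classifiers for SMS data can be illustrated with a short sketch. This is a minimal illustration, not the published SMSF pipeline: the (sender, day, size) log format and the feature names are assumptions, and a trained SVM would consume the resulting vectors.

```python
from collections import defaultdict

def sms_features(messages, days=7):
    """Toy SMSF-style features per sender from a message log.

    `messages` is a list of (sender, day, size) tuples with `day` in
    range(days). Returns, for each sender, a static feature (total
    number of messages in the window) and a temporal feature (average
    message size on each day).
    """
    count = defaultdict(int)                        # static: message volume
    size_sum = defaultdict(lambda: defaultdict(int))
    size_cnt = defaultdict(lambda: defaultdict(int))
    for sender, day, size in messages:
        count[sender] += 1
        size_sum[sender][day] += size
        size_cnt[sender][day] += 1
    features = {}
    for sender in count:
        # Temporal profile: average message size on each day of the window.
        daily = [size_sum[sender][d] / size_cnt[sender][d]
                 if size_cnt[sender][d] else 0.0 for d in range(days)]
        features[sender] = {"total_messages_7d": count[sender],
                            "avg_size_per_day": daily}
    return features

# A bulk sender versus an ordinary user over a 7-day window.
log = ([("spammer", d, 160) for d in range(7) for _ in range(50)]
       + [("user", 2, 40), ("user", 5, 25)])
f = sms_features(log)
```

Vectors like these would then be fed to an SVM, which can separate bulk senders, whose volume and daily size patterns are far more regular, from ordinary users.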
properties of malicious clusters and identifies spam campaigns such as "Get free ringtones" in 30,000 posts and "Check out this cool video" in 11,000 posts from a large dataset composed of more than 187 million posts. The proposed Social Spammer Detection in Microblogging (SSDM) approach is a matrix factorization-based model that integrates both social network information and content information for social spammer detection.10,24,25 The unified model achieves 9.73 percent higher accuracy than models with only one kind of information on a Twitter dataset of 2,000 spammers and 10,000 legitimate users.

Social sybils refer to suspicious accounts creating multiple fake identities to unfairly increase a single user's power or influence. With social networking information of n user nodes, SybilLimit accepts only O(log n) sybil nodes per attack edge.5 The intuition is that if malicious users create too many sybil identities, the graph will have a small set of attack edges whose removal disconnects a large number of nodes (all the sybil identities). SybilRank relies on social graph properties to rank users according to their perceived likelihood of being fake sybils.11 SybilRank found 90 percent of 200,000 accounts as most likely being fake on Tuenti, an SNS with 11 million users.

Social media has rapidly grown in importance as a forum for political, advertising, and religious activism. Astroturfing is a particular type of abuse disguised as spontaneous "grassroots" behavior, but that is in reality carried out by a single person or organization. Using content-based features such as hashtags, mentions, URLs, and phrases can help determine astroturfing content with 0.96 accuracy.2 The Truthy system includes network-based information (degree, edge weight, and clustering coefficient) to track political memes in Twitter and detect astroturfing campaigns in the context of US political elections.6

Link farming
Link farming previously referred to a form of spamming on search engine indexes that connected all of a webpage's hyperlinks to every other page in a group. Today, it's grown to include many graph-based applications with millions of nodes and billions of edges. For example, in Twitter's "who-follows-whom" graph, fraudsters are paid to make certain accounts seem more legitimate or famous by giving them additional followers (zombies). In Facebook's "who-likes-what-page" graph, fraudsters create ill-gotten page likes to turn a profit from groups of users acting together, generally liking the same pages at around the same time. Unlike spam content that can be caught via existing antispam techniques, link-farming fraudsters can easily avoid content-based detection: zombie followers don't have to post suspicious content, they just distort the graph structure. Thus, the problem of combating link farming is rather challenging.

With a set of known spammers and a Twitter network, a PageRank-like approach can give high Collusionrank scores to zombie followers.12 LockInfer uncovers lockstep behaviors in zombie followers and provides initialization scores by reading the social graph's connectivity patterns.21 CatchSync exploits two of the tell-tale signs left in graphs by fraudsters: they're often required to perform some task together and have "synchronized" behavioral patterns, meaning their patterns are rare and very different from the majority's.23 Quantifying these concepts and using a distance-based outlier detection method, CatchSync can achieve 0.751 accuracy in detecting zombie followers on Twitter and 0.694 accuracy on Tencent Weibo, one of the biggest microblogging platforms in China. CatchSync works well with content-based methods: combining content and behavioral information can improve accuracy by 6 and 9 percent, respectively. So far, it has found 3 million suspicious users who connect to around 20 users from a set of 1,500 celebrity-like accounts on the 41-million-node Twitter network, creating a big spike on the out-degree distribution of the graph. Furthermore, removing the suspicious user nodes leaves a smooth power-law distribution on the remaining part of the graph, strong evidence that recall on the full dataset is high.

CopyCatch detects lockstep Facebook page-like patterns by analyzing the social graph between users and pages and the times at which the edges in the graph (the likes) were created.9 Specifically, it searches for temporally coherent "bipartite cores," where the same set of users like the same set of pages, and adds constraints on the relationship between edge properties (like times) in this core. CopyCatch is actively in use at Facebook, searching for attacks on its social graph. Com2 leveraged tensor decomposition on (caller, callee, day) triplets and a minimum description length (MDL)-based stopping criterion to find time-varying communities in a European mobile carrier dataset of 3.95 million users and 210 million phone calls over 14 days.20 One observation was that five users received, on average, 500 phone calls each, on each of four consecutive days, from a single caller. Similarly, OddBall spots anomalous donator nodes whose neighbors are very well connected ("near-cliques") or not connected ("stars") on a large graph of political donations.7

Solving link-farming problems in the real world, such as hashtag hijacking, retweet promoting, or false news spreading, requires a deep understanding of their specific suspicious behavioral patterns. Only through the integration of content, network, and behavioral information can we find multi-aspect clues for effective and interpretable detection.

Detection Methods
Suspicious behavior detection problems can be formulated as machine learning tasks. Table 2 presents three main categories of detection methods: supervised, clustering, and graph-based methods. Supervised methods infer a function from labeled email content, webposts, and reviews. Labeling the training data manually is often hard, so large-scale real systems use clustering methods on millions of suspicious users and identify spammer and fraudster clusters. Social network information and behavioral information are often represented as graph data, so graph-based methods have been quite popular in detecting suspicious behavioral links (injected following links, ill-gotten Facebook likes, and strange phone calls). Supervised methods are often applied to detect suspicious users (fake reviewers, sybil accounts, and social spammers). Because labeling data is difficult and graph data is emerging, unsupervised methods (which include both clustering and graph-based methods) nicely overcome the limitations of labeled data access and generalize to the real world.

Supervised Methods
The major approaches in supervised detection methods are linear/logistic regression models, naive Bayesian models, SVMs, nearest-neighbor algorithms (such as k-NN), least squares, and ensembles of classifiers (AdaBoost).

We start with suspicious users, such as social spammers, {u1, ..., uN}, where the ith user is ui = (xi, yi): xi ∈ R^{D×1} (D is the number of features) is the feature representation of a certain user (a training example), and yi ∈ {0, 1} is the label denoting whether the user is a spammer (1) or a legitimate user (0). A supervised method seeks a function g: X → Y, where X is the input feature space and Y is the output label space. Regressions and naive Bayes are conditional probability models, where g takes the form g(x) = P(y|x). Linear regression in SBM models the relationship between a scalar dependent variable (label) y and one or more independent variables (features) x. Logistic regression, applied in AFSD, Decorate, and OpinSpam, assumes a logistic function to measure the relationship between labels and features. Naive Bayes classifiers assume strong independence between features; Gaussian, multinomial, and Bernoulli naive Bayes make different assumptions on the distributions of features. An SVM constructs a hyperplane that represents the largest margin between the two classes (spammers and legitimate users). SMSF applies Gaussian kernels and URLSpam applies a radial basis function kernel to maximum-margin hyperplanes to efficiently perform a nonlinear classification. Using a distance measure instead of the margin, the k-NN algorithms find the top k nearest training instances for each test instance. The least squares method in SSDM learns a linear model to fit the training data. This model can use an l1-norm penalization to control sparsity, and the classification task can be performed by solving the optimization problem

min_W (1/2)||X^T W − Y||_F^2 + λ||W||_1,

where X ∈ R^{D×N} denotes the feature matrix, Y ∈ R^{N×C} denotes the label matrix (C is the number of categories/labels, and C = 2 if we focus on classifying users as spammers or legitimate users), and λ is a positive regularization parameter. AdaBoost is an algorithm for constructing a strong classifier as a linear combination of simple classifiers. Statistical analysis has demonstrated that ensembles of classifiers can improve the performance of individual classifiers.

The key process in making supervised methods work better is feature engineering, that is, using domain knowledge of the data to create features. Feature engineering is much more difficult and time-consuming than feature selection (returning a subset of relevant features). Logistic regression models in OpinSpam, for example, were fed with 36 proposed features of content and behavior information from three aspects (review, reviewer, and product), all of which required knowledge from experts who are familiar with Amazon and its review dataset.

[Figure 2 panels: a retweet example; "Which is more suspicious?"; facet, size, and density.]
Figure 2. Three future directions for suspicious behavior detection: (a) behavior-driven suspicious pattern analysis, (b) multifaceted behavioral information integration, and (c) general metrics for suspicious behaviors. Detecting retweet hijacking requires comprehensively analyzing suspicious behavioral patterns (such as lockstep or synchronized). Detection techniques should integrate multifaceted behavioral information such as user, content, device, and time stamp.

The bottleneck in supervised methods is the lack of labeled training data in large-scale, real-world applications. As the CopyCatch authors note, unfortunately there's no ground truth about whether any individual Facebook page "like" is legitimate. Similarly, CatchSync argues that labeling Twitter accounts as zombie followers or normal users can be difficult with only subtle red flags: each account, on its own, generates a few small suspicions, but collectively, they raise many more. Researchers have come to realize the power of unsupervised methods such as clustering and graph-based methods that look for suspicious behaviors from millions of users and objects.

Clustering Methods
Clustering is the task of grouping a set of objects so that those in the same cluster are more similar to each other than to those in other clusters. To detect suspicious users in this manner, Scavenger extracts URLs from Facebook wall posts, builds a wall post similarity graph, and then clusters wall posts that share similar URLs.18 The next step is to identify which clusters are likely to represent the results of spam campaigns. Clustering methods can reveal hidden structures in countless unlabeled data and summarize key features.

Latent variable models use a set of latent variables to represent unobserved variables such as the spamicity of Amazon reviewers, where spamicity is the degree to which something is spam. The Author Spamicity Model (ASM) is an unsupervised Bayesian inference framework that formulates fake review detection as a clustering problem.13 Based on the hypothesis that fake reviewers differ from others on behavioral dimensions, ASM looks at the population distributions of two clusters (fake reviewers and normal reviewers) in a latent space. Its accurate classification results give good confidence that unsupervised spamicity models can be effective.

Graph-Based Methods
Graphs (directed/undirected, binary/weighted, static/time-evolving) represent interdependencies through the links or edges between objects, capturing network information in social spam and behavioral information in link-farming scenarios. Graph-based suspicious behavior detection methods can be categorized into PageRank-like approaches and density-based methods.

PageRank-like approaches solve the suspicious node detection problem in large graphs from the ranking perspective, such as MailRank for spam ranking, SybilRank and Collusionrank for sybil ranking, and FraudEagle for fraud ranking. For example, given a graph G = (U, E), where U = {u1, ..., uN} is the set of nodes, Eij is the edge from node ui to uj, and the initial spamicity scores (PageRank values) of the set of nodes are R(0) = [R(u1), ..., R(uN)]^T, find the nodes that are most likely to be suspicious. The iterative equation of the solution is

R(t + 1) = d·T·R(t) + ((1 − d)/N)·1,

where d is a damping factor, T ∈ R^{N×N} denotes the adjacency matrix of the graph or transition matrix, and 1 is the all-ones vector. Note that the initial scores could be empty, but if they were determined with heuristic rules or former observations, the performance would improve. LockInfer works on Twitter's "who-follows-whom" graph and infers strange connectivity patterns from subspace plots. With seed nodes of high scores from observations on the plots, the PageRank-like (trust propagation) algorithm accurately finds suspicious followers and followees who perform lockstep behaviors.
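The iterative update above can be sketched as a small power-iteration loop. This is a minimal, generic sketch under assumptions, not LockInfer's or Collusionrank's actual implementation: the adjacency-list input format and the uniform handling of dangling nodes are illustrative choices.

```python
def trust_propagation(adj, d=0.85, iters=50, seeds=None):
    """Power iteration for R(t+1) = d*T*R(t) + ((1 - d)/N)*1,
    where T is the column-normalized adjacency matrix.

    `adj[j]` lists the nodes that node j points to; `seeds`, if given,
    supplies the initial scores R(0) (e.g., from heuristic rules or
    former observations).
    """
    n = len(adj)
    r = list(seeds) if seeds else [1.0 / n] * n
    for _ in range(iters):
        nxt = [(1.0 - d) / n] * n          # the ((1 - d)/N)*1 term
        for j, targets in enumerate(adj):
            if targets:
                share = d * r[j] / len(targets)
                for i in targets:          # the d*T*R(t) term
                    nxt[i] += share
            else:
                # Dangling node: spread its score uniformly.
                for i in range(n):
                    nxt[i] += d * r[j] / n
        r = nxt
    return r

# Tiny "who-follows-whom" graph: nodes 0-2 all point to node 3,
# so node 3 accumulates the highest score.
scores = trust_propagation([[3], [3], [3], []])
```

Seeding R(0) with scores derived from known spammers (as Collusionrank does) or from suspicious connectivity patterns (as LockInfer does) biases the propagation toward the accounts those seeds touch.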
for example). Figure 2b represents retweeting behaviors as "user-tweet-IP-..." multidimensional tensors. User behaviors naturally evolve with the changing of both endogenous (intention) and exogenous (environment) factors, resulting in different multifaceted, dynamic behavioral patterns. However, there's a lack of research to support suspicious behavior analysis with multifaceted and temporal information.

Flexible Evolutionary Multifaceted Analysis (FEMA) uses a flexible and dynamic high-order tensor factorization scheme to analyze user behavioral data sequences, integrating various bits of knowledge embedded in multiple aspects of behavioral information.26 This method sheds light on behavioral pattern discovery in real-world applications, and integrating multifaceted behavioral information provides a deeper understanding of how to distinguish suspicious and normal behaviors.

General Metrics for Suspicious Behaviors
Suppose we use "user-tweet-IP," a three-order tensor, to represent a retweeting dataset. Figure 2c gives us three subtensors: the first two are dense three-order subtensors of different sizes, and the third is a two-order subtensor that takes all the values on the third mode. For example, the first subtensor has 225 Twitter users, all retweeting the same 5 tweets, 10 to 15 times each, using 2 IP addresses; the second has 2,000 Twitter users, retweeting the same 30 tweets, 3 to 5 times each, using 30 IP addresses; and the third has 10 Twitter users, retweeting all the tweets, 5 to 10 times each, using 10 IP addresses. If our job at Twitter is to detect when fraudsters are trying to manipulate the most popular tweets, given time pressure and considering facet (user, tweet, and IP), size, and density, which subtensor is more worthy of our investigation?

Dense blocks (subtensors) often indicate suspicious behavioral patterns in many detection scenarios. Purchased page likes on Facebook result in dense "user-page-time" three-mode blocks, and zombie followers on Twitter create dense "follower-followee" two-mode blocks. Density is worth inspecting, but how do we evaluate the suspiciousness? In other words, can we find a general metric for the suspicious behaviors?

A recent fraud detection study provides a set of basic axioms that a good metric must meet to detect dense blocks in multifaceted data (for example, if two blocks are the same size, the denser one is more suspicious).27 It demonstrates that, while the axioms are simple, meeting all of them is nontrivial. The authors derived a metric from the probability of "dense-block" events that meet the specified criteria. Their experimental results show that a search algorithm based on this metric can catch hashtag promotion and retweet hijacking.

Different real-world applications have different definitions of suspicious behaviors. Detection methods often look for the most suspicious parts of the data by optimizing (maximizing) suspiciousness scores. However, quantifying the suspiciousness of a behavioral pattern is still an open issue.

References
1. C. Xu et al., "An Adaptive Fusion Algorithm for Spam Detection," IEEE Intelligent Systems, vol. 29, no. 4, 2014, pp. 2–8.
2. J. Ratkiewicz et al., "Detecting and Tracking Political Abuse in Social Media," Proc. 5th Int'l AAAI Conf. Weblogs and Social Media, 2011; www.aaai.org/ocs/index.php/ICWSM/ICWSM11/paper/view/2850.
3. K. Lee, J. Caverlee, and S. Webb, "Uncovering Social Spammers: Social Honeypots + Machine Learning," Proc. 33rd Int'l ACM SIGIR Conf. Research and Development in Information Retrieval, 2010, pp. 435–442.
4. P.-A. Chirita, J. Diederich, and W. Nejdl, "MailRank: Using Ranking for Spam Detection," Proc. 14th ACM Int'l Conf. Information and Knowledge Management, 2005, pp. 373–380.
5. H. Yu et al., "SybilLimit: A Near-Optimal Social Network Defense against Sybil Attacks," IEEE/ACM Trans. Networking, vol. 18, no. 3, 2010, pp. 885–898.
6. J. Ratkiewicz et al., "Truthy: Mapping the Spread of Astroturf in Microblog Streams," Proc. 20th Int'l Conf. Companion on World Wide Web, 2011, pp. 249–252.
7. L. Akoglu, M. McGlohon, and C. Faloutsos, "Oddball: Spotting