
© The British Computer Society 2018. All rights reserved.

For permissions, please e-mail: journals.permissions@oup.com


Advance Access publication on 21 December 2018 doi:10.1093/comjnl/bxy124

An Unsupervised Method for Detecting Shilling Attacks in Recommender Systems by Mining Item Relationship and Identifying Target Items
HONGYUN CAI 1,2,3,4 AND FUZHI ZHANG 1,3,4,*

1 School of Information Science and Engineering, Yanshan University, Qinhuangdao 066000, Hebei Province, China
2 School of Cyber Security and Computer, Hebei University, Baoding 071000, Hebei Province, China
3 The Key Laboratory for Computer Virtual Technology and System Integration of Hebei Province, Qinhuangdao, Hebei Province, China
4 The Key Laboratory for Software Engineering of Hebei Province, Qinhuangdao, Hebei Province, China
* Corresponding author: xjzfz@ysu.edu.cn

Collaborative filtering (CF) recommender systems have been shown to be vulnerable to shilling attacks. How to quickly and effectively detect shilling attacks is a key challenge for improving the quality and reliability of CF recommender systems. Although many recent studies have been devoted to detecting shilling attacks, there are still problems that require further discussion, especially the improvement of the detection performance on real-world unlabelled datasets. In this work, we propose an unsupervised approach that exploits item relationship and target item(s) for attack detection. We first extract behaviour features based on the item relationship. Then, we distinguish suspicious users from normal users and construct a set of suspicious users. Finally, we identify target item(s) by analysing the aggregation behaviour of suspicious users, based on which we detect attack users from the set of suspicious users. Extensive experiments on the MovieLens 100K dataset and sampled Amazon review dataset demonstrate the effectiveness of the proposed approach for detecting shilling attacks in recommender systems.

Keywords: collaborative filtering recommender systems; shilling attacks; shilling attack detection;
behaviour features; item relationship; target item identification

Received 9 August 2018; revised 14 October 2018; editorial decision 1 November 2018
Handling editor: Albert Levi

1. INTRODUCTION

With the rapid development of the Internet, the problem of information overload has become increasingly prominent [1]. Collaborative filtering (CF) recommender systems [2, 3] have arisen as an effective method to deal with the problem of information overload. CF recommender systems rely on historic ratings given by users on items to make recommendations. Currently, CF recommender systems have been widely used in e-commerce [4], social networks [5], video on demand [6], etc. However, previous research has shown that they are highly vulnerable to 'shilling' attacks (a.k.a. 'profile injection' attacks) due to the openness of recommender systems [7, 8]. Researchers have discussed various attacks. These attacks are mounted by injecting a number of fake profiles to promote or demote the recommendation of a target item. Shilling attacks can damage the trustworthiness of CF recommender systems. Therefore, how to protect CF recommender systems against shilling attacks is a crucial issue.

In recent years, researchers have endeavoured to extract effective detection features and put forward efficient detection methods. The existing features are usually extracted based on rating statistical characteristics or item distributions. They are


effective in detecting known types of attacks when supervised or semi-supervised detection methods are adopted. However, these methods cannot work when the training samples are hard to obtain. In such cases, unsupervised detection methods would be more applicable than supervised or semi-supervised based methods because they do not need to label the training samples. To achieve better performance, the unsupervised detection methods usually require a priori knowledge of attacks (e.g., attack size). However, this knowledge is difficult to acquire in practice. Additionally, the existing unsupervised methods do not always perform well in detecting attacks, especially when the attack users have similar rating patterns to genuine ones. For example, in the average over popular (AoP) attack [9], the generated attack profiles look very similar to genuine ones, which makes them more difficult to detect.

To solve these problems, an effective unsupervised detection approach should precisely detect attack users under various attacks without needing any a priori knowledge of attacks. In CF recommender systems, genuine users tend to rate items according to their preferences or needs, while attack users usually randomly choose non-target items to rate for the purpose of reducing the attack cost. Therefore, the rating behaviour is different between genuine users and attack ones. Moreover, the behaviour intention of attack users is obvious. In this paper, we propose an unsupervised detection method based on item relationship mining and target item identification. The proposed method is named IRM-TIA and consists of three stages. Particularly, we first extract detection features by mining item relationship on the user's ordered item rating sequence. Next, we construct the set of suspicious users by combining principal component analysis (PCA) and the k-means algorithm. Finally, we identify target item(s) by comprehensively analysing the aggregation behaviour of suspicious users, based on which we determine the attack users from the set of suspicious users.

The main contributions of this paper are summarized as follows:

(1) By mining the item relationship, four detection features are extracted from the user's ordered item rating sequence to characterize the difference between attack users and genuine ones in rating behaviours, which do not rely on the specific attacks.
(2) By using PCA and clustering techniques, the set of suspicious users is constructed, which can reduce the scope of detection.
(3) By analysing the aggregation behaviour of suspicious users, target item(s) can be identified and used to spot attack users from the set of suspicious users, which can further improve the detection precision.
(4) To evaluate the detection performance of the proposed method, we conduct extensive experiments on the MovieLens 100K and the sampled Amazon review datasets to compare it with three baseline methods.

The rest of this paper is organized as follows. Section 2 introduces the related work on shilling attack detection in recommender systems. Section 3 describes the proposed detection method in detail, which includes the extraction of behaviour features, the construction of the set of suspicious users and the detection of attack users. The experimental results are reported and discussed in Section 4. In the last section, our work is summarized, and our future work is discussed.

2. BACKGROUND AND RELATED WORK

2.1. Shilling profiles and attack models

A shilling attack against a CF recommender system comprises a group of shilling profiles injected by the attacker, the motivation of which is to increase or decrease the probability of a target item being recommended. The general form of a shilling profile was first defined by Bhaumik et al. [10]. As described in [10], each shilling profile usually contains ratings on four sets of items: I_S, I_F, I_T and I_f. The set I_S represents the set of selected items that are chosen by the attacker, and the ratings of items in I_S are specified by the rating function δ. The set I_F consists of filler items that are randomly chosen by the attacker, and ratings of the items in I_F are determined by the rating function χ. For each shilling profile, there is at least one target item i_t. The set I_T represents the set of target items to be promoted or demoted, and ratings of target items are determined by the function γ. The set I_f contains the unrated items.

The most well-studied attack models can be found in [8, 11]. The intent of each attack model is to push or nuke the target items, so the ratings on target items given by attack users are usually the maximum rating (r_max) for push attacks or the minimum rating (r_min) for nuke attacks. The shilling attack models used in this paper are listed in Table 1.
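To make the general profile form concrete, the following sketch builds push-attack profiles for two of the models summarized in Table 1 (random and average attack). It is only an illustration of the (I_S, I_F, I_T, I_f) structure under assumed inputs (a dict-based rating store, a chosen target item and a filler count); it is not the authors' attack-generation code.

```python
import random
import statistics

def make_push_profile(ratings, target_item, filler_size, model="average", r_max=5):
    """Build one shilling profile as {item: rating} following the general
    (I_S, I_F, I_T, I_f) form; unrated items (I_f) are simply absent, and
    I_S is empty for the random and average attack models of Table 1.

    ratings: dict mapping item -> list of genuine ratings (assumed input).
    model:   "random"  -> filler ratings around the system mean,
             "average" -> filler ratings around each item's mean.
    """
    all_items = list(ratings)
    system_mean = statistics.mean(r for rs in ratings.values() for r in rs)

    # I_F: filler items chosen at random from the non-target items.
    candidates = [i for i in all_items if i != target_item]
    fillers = random.sample(candidates, min(filler_size, len(candidates)))

    profile = {}
    for item in fillers:
        base = system_mean if model == "random" else statistics.mean(ratings[item])
        profile[item] = min(r_max, max(1, round(random.gauss(base, 1.0))))

    # I_T: the target item receives the extreme rating (r_max for a push attack).
    profile[target_item] = r_max
    return profile

# Example: a 5-filler average-attack profile against item "i42" on toy data.
toy = {f"i{k}": [random.randint(1, 5) for _ in range(10)] for k in range(100)}
print(make_push_profile(toy, "i42", filler_size=5))
```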
2.2. Related work

Detecting shilling attacks can help to improve the robustness of CF recommender systems. For this purpose, a number of shilling attack detection methods have been presented. Generally speaking, a shilling attack detection method consists of two parts: feature extraction and the detection algorithm.

2.2.1. Feature extraction

Effective detection features can help to distinguish shilling profiles from genuine ones, so they play an important role in detecting shilling attacks. So far, researchers have proposed a number of metrics to characterize the difference between genuine profiles and attack ones, which can generally be classified as generic or type-specific features.


TABLE 1. Summary of the attack models used in this paper.

Attack model                      | I_S items       | I_S rating | I_F items                                     | I_F rating  | i_t
Random attack                     | Null            | Null       | Randomly chosen                               | System mean | r_max/r_min
Average attack                    | Null            | Null       | Randomly chosen                               | Item mean   | r_max/r_min
Bandwagon attack                  | Popular items   | r_max      | Randomly chosen                               | System mean | r_max/r_min
Average over popular items (AoP)  | Null            | Null       | Chosen from the top x% of most popular items  | Item mean   | r_max/r_min
Power user attack                 | Null            | Null       | Chosen by the selected power user profiles    | Item mean   | r_max/r_min
Love/Hate attack                  | Null            | Null       | Randomly chosen                               | r_max       | r_min
Reverse bandwagon attack          | Unpopular items | r_max      | Randomly chosen                               | System mean | r_min

Chirita et al. [12] proposed two generic features derived from each user profile, which are rating deviation from mean agreement (RDMA) and degree of similarity with top neighbours (DegSim). After that, several new generic features were derived based on the significant difference of the ratings, which are weighted deviation from mean agreement (WDMA), weighted degree of agreement (WDA) and length variance (LengthVar) [13, 14]. Some type-specific features were presented in [14] and [15], which are mean variance (MeanVar), filler mean target difference (FMTD), and target model focus (TMF), etc. In addition, Zhang et al. [16, 17] presented features based on the number of ratings on rated items for each user profile, which are filler size with total items (FSTI), filler size with popular items (FSPI), etc. Yang et al. [18] also presented three type-specific features based on the number of specific ratings on filler or selected items, which are filler size with maximum rating in itself (FSMAXRI), filler size with minimum rating in itself (FSMINRI) and filler size with average rating in itself (FSARI). Zhou [19] used the term frequency-inverse document frequency (TF-IDF) to extract two features for detecting AoP attack. Li et al. [20] also extracted three features based on item popularity. These existing features mainly focus on differences between shilling profiles and genuine ones in the statistical characteristics and item popularity of ratings, and few of them are extracted based on the behaviour characteristics hidden in a user's item rating sequence. Moreover, the existing features are not always effective in detecting various types of attacks when an unsupervised detection approach is adopted. Therefore, it is worth extracting new features that can reflect the behaviour difference between genuine and shilling profiles and that do not rely on the specific attacks.

2.2.2. Detection algorithms

Detection algorithms against shilling attacks in CF recommender systems fall into three categories: supervised detection methods, semi-supervised detection methods and unsupervised detection methods.

As for supervised detection methods, Williams et al. [21] trained a classifier based on some generic and attack type-specific detection attributes. Wu et al. [22] proposed a feature selection algorithm to select effective features for detecting a specific type of attack, based on which they trained two classifiers based on a k-nearest neighbour classifier and a Bayesian classifier, respectively. Li et al. [20] used the improved ID3 decision tree algorithm to detect shilling attacks. Yang et al. [18] applied a variant of the boosting algorithm to detect shilling attacks based on 18 statistical features, which improved the detection precision compared with methods using a single classifier. Zhang et al. [16] detected shilling profiles by combining a Hilbert-Huang transform and a Support Vector Machine (SVM). In this method, detection features are extracted using the Hilbert-Huang transform, based on which an SVM-based classifier is trained to detect shilling profiles. Zhang et al. [17] also presented an ensemble framework that combines multiple base classifiers, which is effective in detecting some known types of attacks. Zhou et al. [23] proposed a detection approach for detecting unknown types of attack, which uses the technique of bionic pattern recognition to cover the samples of genuine profiles and identifies profiles outside of that coverage as attack ones.

As for semi-supervised detection methods, Wu et al. [24] presented a hybrid shilling attack detection approach, which exploited both labelled profiles and unlabelled ones to classify shilling profiles. This method can detect hybrid shilling attacks effectively, but the detection precision is inferior to that of the C4.5 decision tree method. In addition, Cao et al. [25] proposed a semi-supervised learning method that first trains a Bayes classifier on some labelled profiles and then incorporates unlabelled profiles with a weighting factor for expectation maximization (EM) to improve the initial classifier. Lately, Zhang et al. [26, 27] utilized the semi-supervised approach to detect spammer groups from product reviews.

As for unsupervised detection methods, Zhang et al. [28] proposed a detection approach with time series data, which used the time series of each item to judge whether it is a target item. This method is based on the assumption that there are a number of ratings from genuine users on target items and that ratings on target items given by attack users are concentrated within a short period. Bryan et al. [29] presented an


unsupervised shilling attack detection algorithm based on the Hv-score metric, which is effective in detecting midsize attacks. However, it does not perform well under small-scale bandwagon attacks. Mehta et al. [30] presented PCA-VarSelect using PCA, which is based on the high similarity between shilling profiles. PCA-VarSelect can detect standard attacks (e.g., random attack and average attack) efficiently, but it requires a priori knowledge of the number of shilling profiles. Hurley et al. [9] presented an unsupervised detection method based on the Neyman–Pearson theory. This method is effective for detecting standard attacks and AoP attack, but it performs poorly for small attack sizes and filler sizes. Lathia et al. [31] defined a classification of temporal attacks, which used global, user and item behaviour over time to detect an ongoing attack. Lee and Zhu [32] proposed a two-stage unsupervised detection method based on multidimensional scaling (MDS). This method is not effective when detecting random attack with a low filler size. Zou et al. [33] presented an unsupervised detection method based on the belief propagation algorithm. This method is effective when the number of target items is large; however, the detection performance declines dramatically if the number of target items is small. Zhang et al. [34, 35] put forward graph-based shilling attack detection algorithms, which detect shilling profiles by finding a maximum submatrix in the user-user similarity matrix. These methods can detect various types of attacks, but they do not perform well when the attack size is small. Bilge et al. [36] proposed a bisecting k-means clustering algorithm for detecting specific shilling attacks (i.e. average, bandwagon and segment attacks) with a large filler size. Gunes et al. [37] presented a hierarchical clustering based detection method, which can detect shilling attacks with a large attack size in privacy-preserving CF recommender systems. Zhang et al. [38] introduced a shilling detection framework based on the idea of fraudulent action propagation. This method performs well if the number of labelled candidate profiles is large enough, but it also requires knowing the number of shilling profiles in advance. Zhou et al. [39, 40] detected shilling profiles based on clustering and target item analysis. Wang et al. [41] presented an algorithm to detect shilling profiles at the group level, which is effective in detecting a single target item with a large attack size. Wang et al. [42] proposed a detector based on frequent pattern mining and PCA, which detects shilling groups on a real-life dataset from Amazon. The detector can calculate the degree of shilling, but it requires a priori knowledge of the number of shilling groups. Gao et al. [43] proposed a group-based ranking method, which calculates user reputation and identifies those users with lower reputations as attackers. Yang et al. [44] exploited a graph mining method to find suspicious users and further detected attack users from the set of suspicious users by target item analysis. This method does not rely on specific attacks, but it does not perform well for small attack sizes. Recently, Yang et al. [45] applied adaptive structure learning to select more effective features and exploited a density-based clustering method to discover shilling profiles. Zhang et al. [46] presented an unsupervised detection method based on a hidden Markov model and hierarchical clustering. This method first calculates the suspicious degree of each user using a hidden Markov model and then obtains the attack users using hierarchical clustering techniques. This method can detect some types of shilling profiles with high precision and high recall, but its precision is not high for detecting AoP attack. In addition, researchers [47, 48] have presented other methods (e.g., a dynamic time interval segmentation technique [47] and multivariate auto regression [48]) to identify target items, which can help to detect shilling attacks. These methods assume that the ratings given by attack users to target items will be concentrated within a short time period and that there are a number of ratings from genuine users before the attack event.

Supervised and semi-supervised based detection methods are usually effective in detecting known types of attacks, but they need to label the training samples. In this work, we focus on developing an unsupervised detection method based on item relationship mining and target item identification. In contrast to the supervised or semi-supervised based methods in [17, 18, 24], our approach does not require training a classifier based on labelled profiles. As opposed to the unsupervised method in [30], our approach does not require knowing the number of shilling profiles in advance. Unlike the method in [38], our approach does not need to label the candidate spammers. In the method in [46], the attack users are detected using a hidden Markov model and hierarchical clustering, which could suffer from low detection performance in detecting AoP attack. In our approach, attack users are distinguished from genuine users by utilizing behaviour differences and target items, and we can detect attacks regardless of the specific attack model. In the existing methods of identifying target items in [40, 44], an absolute count threshold is used to capture target items, which could result in some normal items being misidentified as target items. In our method, we identify target items based on not only the number of extreme ratings but also the aggregation behaviour of suspicious users, which can improve the accuracy of target item identification. As opposed to the item anomaly detection approaches of [47, 48], our approach can identify target items whether or not they have been rated by genuine users before an attack event.

3. THE DETECTION FRAMEWORK OF IRM-TIA

Figure 1 illustrates the detection framework of IRM-TIA, which consists of three stages, i.e. extracting behaviour features, constructing the set of suspicious users and detecting


FIGURE 1. Detection framework of IRM-TIA.

attack users. In the first stage, the ordered item sequence is constructed for each user, and then four features are presented by mining the item relationship on the user's ordered item sequence. In the second stage, the residual of each user is calculated and combined with the user's behaviour vector length to construct the set of suspicious users. In the third stage, the target item(s) can be identified by analysing the aggregation behaviour of suspicious users, based on which we determine the attack users from the set of suspicious users.

To facilitate discussions, we give descriptions of the notations used in this paper in Table 2.

TABLE 2. Notations and their descriptions

Notation         | Description
U                | The set of users, |U| = m
I                | The set of items, |I| = n
T                | Rating timestamp matrix T = [t_{u,j}]_{m x n}, where t_{u,j} denotes the timestamp of the rating by user u on item j
R                | User-item rating matrix R = [r_{u,j}]_{m x n}, where r_{u,j} denotes the rating by user u on item j
[r_min, r_max]   | Rating scale, where r_min means most dislike and r_max means most like
r̄_j              | The average of ratings on item j
r̄_u              | The average of ratings given by user u
r̄                | The average of ratings over all items and users
σ                | The standard deviation of ratings over all items and users

3.1. Extracting behaviour features

In CF recommender systems, recommendations are made based on historic ratings given by users on items. Normal users usually rate items according to their real preferences or actual needs. Unlike normal users, attack users rate a number of non-target items at random for the purpose of promoting or demoting the recommendations of target items. In this section, we first analyse the item relationship based on co-occurrence and topic similarity. Then, we propose four detection features to characterize the difference between genuine users and attack ones in the user intra-track relationship.

DEFINITION 1 (Degree of co-occurrence correlation between items, DCCI). For any two items i ∈ I and j ∈ I, the degree of co-occurrence correlation between them refers to their co-occurrence (i.e. co-rated by the same user) relation, which is denoted by DCCI_{i,j}.

An association rule is usually used to show the co-occurrence relation between items. In recommender systems, most items are rated by only a small number of users, so the support between two items is usually very low. To address this limitation and to avoid the impact of zero transactions, we adopt the Kulc coefficient to calculate the degree of co-occurrence correlation between items i and j; DCCI_{i,j} is calculated as follows:

$$DCCI_{i,j} = \frac{1}{2}\left(\frac{Co_{i,j}}{NR_i} + \frac{Co_{i,j}}{NR_j}\right) \qquad (1)$$

where Co_{i,j} is the number of users who give ratings for both item i and item j, and NR_i and NR_j are the numbers of ratings for items i and j, respectively.
occurrence (i.e. co-rated by the same user) relation, which is In recommender systems, an item typically concerns multiple
denoted by DCCIi, j . hidden topics in different proportions, which can be obtained
by a latent factor model, e.g., latent dirichlet allocation (LDA)
An association rule is usually used to show the co- [49] or probabilistic latent semantic analysis (PLSA) [50]. In
occurrence relation between items. In recommender systems, this work, we use the well-known PLSA based on a mixture
most items are rated by only a small number of users, so the decomposition derived from a latent class model. For an item
support between two items is usually very low. To address i Î I , the corresponding hidden topic distribution vector is


denoted as HTDV_i = (p_{i,1}, p_{i,2}, ..., p_{i,c}), where c is the total number of hidden topics, p_{i,x} is the probability or proportion of item i belonging to the xth topic, 0 ≤ p_{i,x} ≤ 1 and Σ_{x=1}^{c} p_{i,x} = 1.
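The paper obtains these hidden topic distributions with PLSA. As a stand-in illustration (an assumption, since the paper does not detail the preprocessing), the sketch below treats each item as a 'document' whose 'words' are the users who rated it and fits scikit-learn's LDA to obtain an HTDV-like vector per item.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

def hidden_topic_vectors(rating_matrix, n_topics=5, seed=0):
    """Return an (n_items, n_topics) matrix whose rows play the role of HTDV_i.

    rating_matrix: (m_users, n_items) array, 0 meaning 'not rated'.
    Each item is treated as a document over user 'tokens' (illustrative choice only).
    """
    item_user_counts = (rating_matrix.T > 0).astype(int)     # items x users rated/unrated counts
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=seed)
    htdv = lda.fit_transform(item_user_counts)
    return htdv / htdv.sum(axis=1, keepdims=True)            # rows sum to 1, as required

# Example with a small random rating matrix.
rng = np.random.default_rng(0)
R = rng.integers(0, 6, size=(50, 20))
print(hidden_topic_vectors(R, n_topics=4).shape)             # (20, 4)
```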
DEFINITION 2 (Degree of topic similarity between items, DTSI). For any two items i ∈ I and j ∈ I, the degree of topic similarity between them refers to the degree of similarity between HTDV_i and HTDV_j, which is denoted by DTSI_{i,j} and calculated as follows:

$$DTSI_{i,j} = \frac{1}{1 + Div_{i,j}} \qquad (2)$$

where Div_{i,j} is the difference in topic distribution between item i and item j.

In mathematical statistics, the Kullback–Leibler divergence (also called relative entropy) can be used to measure how one probability distribution diverges from the other one. The Kullback–Leibler divergence from HTDV_i to HTDV_j is denoted as

$$D_{KL}(HTDV_j \,\|\, HTDV_i) = \sum_{x=1}^{c} p_{i,x} \cdot \log(p_{i,x}/p_{j,x}) \qquad (3)$$

The Kullback–Leibler divergence is always non-negative, and in general, D_{KL}(HTDV_j || HTDV_i) does not equal D_{KL}(HTDV_i || HTDV_j), so we use the average of D_{KL}(HTDV_j || HTDV_i) and D_{KL}(HTDV_i || HTDV_j) to denote the difference between items i ∈ I and j ∈ I in topic distribution, which is calculated by

$$Div_{i,j} = \frac{D_{KL}(HTDV_j \,\|\, HTDV_i) + D_{KL}(HTDV_i \,\|\, HTDV_j)}{2} \qquad (4)$$
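The following is a small sketch of Equations (2)-(4): the symmetrized Kullback–Leibler divergence between two topic vectors and the resulting DTSI score. The epsilon smoothing is an assumption added to keep the logarithm finite when a proportion is zero.

```python
import numpy as np

def kl(p, q, eps=1e-12):
    """Kullback-Leibler divergence as used in Equation (3), with smoothing for zero entries."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    return float(np.sum(p * np.log(p / q)))

def dtsi(htdv_i, htdv_j):
    """Topic similarity of Equation (2) via the symmetric divergence of Equation (4)."""
    div = 0.5 * (kl(htdv_i, htdv_j) + kl(htdv_j, htdv_i))
    return 1.0 / (1.0 + div)

print(dtsi([0.7, 0.2, 0.1], [0.1, 0.2, 0.7]))   # dissimilar topic mixes -> well below 1
print(dtsi([0.5, 0.3, 0.2], [0.5, 0.3, 0.2]))   # identical mixes -> 1.0
```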
DEFINITION 3 (User rating track). For any user u ∈ U, let Tr_u be the set of items rated by user u. The rating track of user u refers to the ordered sequence of Tr_u, which is sorted in ascending order of rating time and denoted as Track_u.

In the following, we will analyse user intra-track relationships and extract four behaviour features from user rating tracks.

DEFINITION 4 (Relevance degree of user track, RDUT). For any user u ∈ U, the relevance degree of Track_u refers to the average of the co-occurrence correlation degree between items in Track_u, which is calculated by

$$RDUT_u = \frac{\sum_{i,j \in Tr_u \wedge i \neq j} DCCI_{i,j}}{(|Tr_u| - 1)^2} \qquad (5)$$

DEFINITION 5 (Similarity degree of user track, SDUT). For any user u ∈ U, the similarity degree of Track_u refers to the average of the topic similarity degree between items in Track_u, which is calculated by

$$SDUT_u = \frac{\sum_{i,j \in Tr_u \wedge i \neq j} DTSI_{i,j}}{(|Tr_u| - 1)^2} \qquad (6)$$

A user's preferences and needs are usually stable over a short period, but they may change dynamically as time goes on. Hence, for item i and item j in Track_u, the closer the timestamps given to them by user u, the closer the relationship between them. To extract more features from the rating track of each user, we divide each user rating track into disjoint subsequences (also called time windows). The granularity of a time window depends on the context of application. In this paper, we define the length of one time window as one day. Therefore, items rated by the same user on the same day are divided into the same time window. For any user u ∈ U, let NW_u denote the number of time windows on Track_u, and let TWin_{u,t} be the set of items divided into the tth (t = 1, 2, ..., NW_u) time window, where Tr_u = ∪_{t=1}^{NW_u} TWin_{u,t}.

DEFINITION 6 (Average relevance degree of time windows, ARDTW). For any user u ∈ U, the average relevance degree of time windows refers to the average of the co-occurrence correlation degree between items divided into the same time window on Track_u, which is calculated by

$$ARDTW_u = \frac{1}{NW_u} \sum_{t=1}^{NW_u} \frac{\sum_{i,j \in TWin_{u,t} \wedge i \neq j} DCCI_{i,j}}{(|TWin_{u,t}| - 1)^2} \qquad (7)$$

DEFINITION 7 (Average similarity degree of time windows, ASDTW). For any user u ∈ U, the average similarity degree of time windows refers to the average of the topic similarity degree between items divided into the same time window on Track_u, which is calculated by

$$ASDTW_u = \frac{1}{NW_u} \sum_{t=1}^{NW_u} \frac{\sum_{i,j \in TWin_{u,t} \wedge i \neq j} DTSI_{i,j}}{(|TWin_{u,t}| - 1)^2} \qquad (8)$$

Note that if one user only rates one item in the tth (t = 1, 2, ..., NW_u) time window, we set its co-occurrence


correlation degree and similarity degree in this time window to be constant.

Based on the above description, the algorithm for extracting behaviour features is described as follows.

Algorithm 1 Extracting behaviour features.
Input: user-item rating time matrix T, rating matrix R
Output: four behaviour features RDUT, SDUT, ARDTW and ASDTW
1: for any two items i ∈ I and j ∈ I do
2:   if i ≠ j then
3:     Compute DCCI_{i,j} by Equation (1)
4:     Compute DTSI_{i,j} by Equation (2)
5:   end if
6: end for
7: for each user u ∈ U do
8:   Compute RDUT_u by Equation (5)
9:   Compute SDUT_u by Equation (6)
10:  Compute ARDTW_u by Equation (7)
11:  Compute ASDTW_u by Equation (8)
12: end for
13: return RDUT, SDUT, ARDTW and ASDTW

Algorithm 1 mainly includes two parts. The first part (Lines 1–6) calculates the co-occurrence correlation degree and topic similarity degree between any two different items. The second part (Lines 7–12) calculates the values of four features for each user.
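As a rough Python rendering of Algorithm 1 (a sketch under assumed inputs, not the authors' code): given precomputed pairwise DCCI and DTSI matrices, each user's time-ordered track and its per-day time windows, the four features of Equations (5)-(8) can be computed as below. Windows or tracks with a single item fall back to a constant, as the paper notes.

```python
import numpy as np

def pair_average(items, pair_matrix, single_value=1.0):
    """Average pairwise score over an item set: sum_{i != j} M[i, j] / (|set| - 1)^2."""
    items = list(items)
    if len(items) < 2:
        return single_value                      # constant for one-item windows or tracks
    idx = np.array(items)
    sub = pair_matrix[np.ix_(idx, idx)]
    total = sub.sum() - np.trace(sub)            # drop the i == j terms
    return total / (len(items) - 1) ** 2

def extract_features(tracks, windows, dcci, dtsi):
    """Return {user: (RDUT, SDUT, ARDTW, ASDTW)} following Equations (5)-(8).

    tracks:  {user: [item indices ordered by rating time]}
    windows: {user: [[item indices of day 1], [item indices of day 2], ...]}
    dcci, dtsi: (n_items, n_items) numpy arrays from Equations (1) and (2).
    """
    features = {}
    for u, track in tracks.items():
        rdut = pair_average(track, dcci)
        sdut = pair_average(track, dtsi)
        ardtw = np.mean([pair_average(w, dcci) for w in windows[u]])
        asdtw = np.mean([pair_average(w, dtsi) for w in windows[u]])
        features[u] = (rdut, sdut, ardtw, asdtw)
    return features
```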
3.2. Constructing the set of suspicious users

In CF recommender systems, the behaviour intention of attack users differs greatly from that of genuine users, which causes an obvious difference between attack users and genuine ones in the behaviour feature space. In this section, we use PCA to model user behaviour and to detect suspicious users.

Let x_{uj} be the value of the jth behaviour feature for user u ∈ U, and let x_u = (x_{u1}, x_{u2}, ..., x_{uf}) denote an f-dimensional (f = 4 in this paper) vector corresponding to user u. X = [x_1, x_2, ..., x_m] ∈ R^{m×f} is an m × f feature matrix. To make the matrix zero-centred, a simple linear transform is used by deducting the mean of every column from each variable, i.e.

B = [b_{uj}]_{m×f} = [x_{uj} - x̄_j]_{m×f},

where x̄_j is the mean of the jth column. The covariance matrix of X is C = [cov_{ij}]_{f×f}, and each element cov_{ij} is calculated as follows:

$$cov_{ij} = \frac{\sum_{k'=1}^{m} b_{k'i}\, b_{k'j}}{\sqrt{\sum_{k_1=1}^{m} b_{k_1 i}^2}\, \sqrt{\sum_{k_2=1}^{m} b_{k_2 j}^2}} \qquad (9)$$

For a covariance matrix C, we can obtain the eigenvalues λ_1 ≥ λ_2 ≥ ... ≥ λ_f and the corresponding eigenvectors v_1, v_2, ..., v_f. PCA selects the eigenvector with the largest eigenvalue as the first principal component. Similarly, the lth (1 ≤ l ≤ f) eigenvector v_l with eigenvalue λ_l is the lth principal component. We construct the matrix PrinSpace consisting of the top-k principal components, where k (k ≤ f) is the minimum value that satisfies the predefined threshold of information loss rate ε, s.t. (Σ_{l1=1}^{k} λ_{l1}) / (Σ_{l2=1}^{f} λ_{l2}) ≥ 1 - ε.

DEFINITION 8 (Projection residual, PR). For any user u ∈ U, let x_u denote the behaviour vector of user u, and let x̂_u denote the projection of x_u in PrinSpace. The projection residual of user u refers to the distance or deviation between x_u and x̂_u, which is denoted as follows:

$$PR_u = \| x_u - \hat{x}_u \|_2 \qquad (10)$$

where ||·||_2 represents the Frobenius norm.

Generally speaking, the projection residual values of most genuine users are relatively low, but those of attack users and a few genuine users are larger. In addition, we also calculate the length of the behaviour vector for each user.

DEFINITION 9 (Behaviour vector length, BVL). For any user u ∈ U, the behaviour vector length of user u refers to the length of its behaviour vector, which is denoted as follows:

$$BVL_u = \| x_u \|_2 \qquad (11)$$

For most genuine users, the lengths of their behaviour vectors are close to each other, but those of attack users and a few genuine users are obviously different.

Based on the above analysis of projection residuals and behaviour vector length, we use the k-means algorithm (with k equal to 2) to cluster users. All of the users are split into two parts. One part contains most of the genuine users; the other part consists of all of the attack users and a few genuine users. The mean of the behaviour vector length for each part is calculated, and the part with the greater mean of behaviour vector length is regarded as the set of suspicious users. The algorithm for constructing the set of suspicious users is described as follows.

Algorithm 2 Constructing the set of suspicious users.
Input: user-item time matrix T, rating matrix R and the threshold of information loss rate ε
Output: the set of suspicious users SuspUserSet
1: RDUT, SDUT, ARDTW and ASDTW ← Call Algorithm 1
2: Construct matrix X with RDUT, SDUT, ARDTW and ASDTW
3: Compute covariance matrix C by Equation (9)
4: Obtain λ_1 ≥ λ_2 ≥ ... ≥ λ_f and v_1, v_2, ..., v_f by performing an eigen-decomposition of matrix C
5: Compute the value of k satisfying the threshold of information loss rate ε
6: Extract k principal components
7: for each u ∈ U do
8:   Compute PR_u by Equation (10)
9:   Compute BVL_u by Equation (11)
10: end for
11: R1, R2 ← Cluster on (PR, BVL) using the k-means algorithm
12: mean1 ← Calculate the mean of the behaviour vector length for cluster R1
13: mean2 ← Calculate the mean of the behaviour vector length for cluster R2
14: if mean1 > mean2 then
15:   SuspUserSet ← R1
16: else
17:   SuspUserSet ← R2
18: end if
19: return SuspUserSet

Algorithm 2 mainly includes three parts. The first part (Lines 1–6) constructs the m × f matrix X and extracts k principal components. The second part (Lines 7–10)


calculates the projection residual and behaviour vector length for each user. The third part (Lines 11–19) constructs the set of suspicious users.
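A condensed sketch of this stage (Algorithm 2) is given below. It zero-centres the four behaviour features, eigen-decomposes the normalized matrix of Equation (9), keeps the smallest k whose explained-eigenvalue share reaches 1 − ε (ε = 0.3 is the value later used on MovieLens in Section 4.3.1), computes the projection residual and behaviour vector length of Equations (10) and (11), and clusters the (PR, BVL) pairs with 2-means, returning the cluster with the larger mean BVL. scikit-learn's KMeans is used for the clustering step; the array layout is an assumption made for the example.

```python
import numpy as np
from sklearn.cluster import KMeans

def suspicious_users(X, eps=0.3, seed=0):
    """Return indices of suspicious users from an (m_users, 4) behaviour feature matrix X."""
    B = X - X.mean(axis=0)                                   # zero-centred matrix B
    norms = np.sqrt((B ** 2).sum(axis=0))
    C = (B.T @ B) / np.outer(norms, norms)                   # Equation (9)
    eigvals, eigvecs = np.linalg.eigh(C)
    order = np.argsort(eigvals)[::-1]                        # largest eigenvalue first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    share = np.cumsum(eigvals) / eigvals.sum()
    k = int(np.searchsorted(share, 1.0 - eps) + 1)           # smallest k with share >= 1 - eps
    prin_space = eigvecs[:, :k]

    proj = X @ prin_space @ prin_space.T                     # projection onto PrinSpace
    pr = np.linalg.norm(X - proj, axis=1)                    # Equation (10)
    bvl = np.linalg.norm(X, axis=1)                          # Equation (11)

    labels = KMeans(n_clusters=2, n_init=10, random_state=seed).fit_predict(
        np.column_stack([pr, bvl]))
    mean_bvl = [bvl[labels == c].mean() for c in (0, 1)]
    suspect_cluster = int(np.argmax(mean_bvl))               # larger mean BVL -> suspicious part
    return np.where(labels == suspect_cluster)[0]
```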
where wu¢, i denotes the suspicious contribution of each rating
on item i described as Equation (14), and Gu denotes the indi-
cator function described in Equation (15).
3.3. Detecting attack users
ïìï wu, i
, if ru, i ¹ 0
To further improve the detection precision of shilling attacks, ¢ ï
wu, i = í å vÎ U , rv, i¹ 0wv, i (14)
we need to filter out the genuine users or determine the attack ïï
users from the set of suspicious users. In our work, we distin- ïî0, otherwise
guish the attack users from genuine users according to
ïì1, if u Î SuspUserSet
Gu = ïí
whether they rate target item(s) with extreme ratings (i.e. the
(15)
highest rating for push attacks or lowest rating for nuke ïïî0, otherwise
attacks). The intuition behind this is that the attack users must
give extreme ratings to target items when they mount an
attack. To achieve the desired attack effect, the attack users
must provide enough ratings for the target item(s). Therefore, Intuitionally, suspicious behaviour aggregation helps to
we can detect attack users by identifying the target item(s) judge whether the item is a target item. The larger the suspi-
that they rate. The target item(s) can be identified by analys- cious behaviour aggregation of an item, the greater the likeli-
ing the aggregation behaviour of suspicious users. hood of being a target item. Moreover, we should take into


consideration the number of suspicious users for an item. For example, the suspicious behaviour aggregation of one item equals 1 when all of the ratings on this item are given by suspicious users. However, it is unreasonable to identify this item as a target item if the number of suspicious users is very small. Thus, we introduce an aggregation degree to represent the aggregation behaviour of suspicious users, which should increase monotonically as the number of suspicious users increases and show a considerable change around a threshold θ. We consider this threshold based on the assumption that a certain number of attack users are required to make a considerable prediction shift on the target item [7, 40]. For any item i ∈ I, the aggregation degree of item i is calculated as follows:

$$\varphi_i = \left( \arctan\left( \sum_{u \in U} Q_{u,i} - \theta \right) + \frac{\pi}{2} \right) \Big/ \pi \qquad (16)$$

where Q_{u,i} is the indicator function described in Equation (17).

$$Q_{u,i} = \begin{cases} 1, & \text{if } (u \in SuspUserSet) \wedge (r_{u,i} = a) \\ 0, & \text{otherwise} \end{cases} \qquad (17)$$

In Equation (17), the constant a is set to r_max for detecting push attacks and r_min for nuke attacks. For each item i ∈ I, combining its suspicious behaviour aggregation and aggregation degree, we can calculate its probability of being a target item according to Equation (18).

$$P_i = SBA_i \cdot \varphi_i \qquad (18)$$

Obviously, the larger the probability, the greater the likelihood of being a target item. In the case of detecting shilling attacks with a single target item, we can regard the item with the maximum probability as the target item. If there exist multiple shilling groups and the number of target items may be more than one for each shilling group, we consider items as target items if their probability is larger than a threshold δ. Users in the set of suspicious users are regarded as attack ones if they rate a target item with r_max (in the case of a push attack) or r_min (in the case of a nuke attack).

Based on the above description, the algorithm for detecting attack users is described below.

Algorithm 3 Detecting attack users.
Input: user-item rating matrix R and threshold δ
Output: set of attack users AttUserSet
1: AttUserSet ← ∅
2: SuspUserSet ← Call Algorithm 2
3: for each u ∈ U do
4:   for each i ∈ I do
5:     Compute w_{u,i} by Equation (12)
6:   end for
7: end for
8: for each u ∈ U do
9:   for each i ∈ I do
10:    Compute w'_{u,i} by Equation (14)
11:  end for
12: end for
13: for each i ∈ I do
14:  Compute SBA_i by Equation (13)
15:  Compute φ_i by Equation (16)
16:  Compute P_i by Equation (18)
17:  if P_i ≥ δ then
18:    for each u ∈ SuspUserSet do
19:      if r_{u,i} = a then
20:        AttUserSet ← AttUserSet ∪ {u}
21:      end if
22:    end for
23:  end if
24: end for
25: return AttUserSet
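The sketch below mirrors the core of Algorithm 3 under assumed numpy inputs: it computes the rating weights of Equation (12), the normalized contributions and aggregation of Equations (13)-(14), the aggregation degree of Equation (16) and the target probability of Equation (18), and then flags suspicious users who gave the extreme rating a to a likely target item. The default θ = 3 and δ = 0.9 follow the values used for the Amazon experiments in Section 4.1; the code itself is an illustration, not the authors' implementation.

```python
import numpy as np

def detect_attack_users(R, susp, a=5, theta=3, delta=0.9):
    """R: (m, n) rating matrix with 0 meaning 'unrated'; susp: boolean mask of suspicious users.
    Returns (attack_user_indices, target_item_indices)."""
    rated = R > 0
    user_mean = np.where(rated.sum(1) > 0, R.sum(1) / np.maximum(rated.sum(1), 1), 1.0)
    item_mean = np.where(rated.sum(0) > 0, R.sum(0) / np.maximum(rated.sum(0), 1), 1.0)
    global_mean = R[rated].mean()

    # Equation (12): user, item and global rating bias.
    w = 1 + (R - user_mean[:, None]) / user_mean[:, None] \
          + (R - item_mean[None, :]) / item_mean[None, :] \
          + (R - global_mean) / global_mean
    w = np.where(rated, w, 0.0)

    # Equations (13)-(14): per-item normalized contribution summed over suspicious users.
    col_sum = w.sum(0)
    col_sum = np.where(col_sum == 0, 1.0, col_sum)
    sba = (w / col_sum)[susp].sum(0)

    # Equations (16)-(18): aggregation degree and target probability.
    q = (rated & (R == a) & susp[:, None]).sum(0)
    phi = (np.arctan(q - theta) + np.pi / 2) / np.pi
    p = sba * phi

    targets = np.where(p >= delta)[0]
    if len(targets) == 0:
        return np.array([], dtype=int), targets
    hit_target = (R[:, targets] == a).any(axis=1)            # gave the extreme rating to a target
    attackers = np.where(susp & hit_target)[0]
    return attackers, targets
```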

4. EXPERIMENTAL EVALUATION

4.1. Experimental dataset and setting

To evaluate the effectiveness of the proposed method, the following two datasets are used as the experimental data.

(1) MovieLens 100K dataset. This dataset consists of 100,000 ratings on 1682 movies by 943 users. All of the ratings are integers between 1 and 5, where 1 is the lowest (most disliked) and 5 is the highest (most liked). Each user has rated at least 20 movies. The rating time is in UNIX seconds since 1/1/1970 UTC. Similar to previous work, we assume that all of the profiles in the MovieLens 100K dataset are genuine ones. Shilling profiles are generated using the attack models described in Section 2 and injected into the dataset, respectively. In the experiments, we use the models of random attack, average attack, bandwagon attack, average over popular items (AoP) attack and power user attack as push attacks, and we use love/hate attack and reverse bandwagon attack as nuke attacks. The attack size and filler size vary. Specifically, the attack size is set to 3%, 5%, 8%, 10% and 12%, and the filler size is set to 3%, 5%, 8% and 10%. For a push attack, the target item is chosen at random from items having an average rating lower than 3. For a nuke attack, the target item is randomly chosen from the top 15% of popular items (i.e., items with a large number of ratings). To obtain an accurate result, we repeat each experiment 10 times under the same conditions (i.e., same attack, attack size and filler size), and the average values of the 10 experiments are reported as the final evaluation results. As longer attacks have less of an effect, attackers have to rate items quickly [31]. For


the experiments on the MovieLens 100K dataset, the rating timestamps of the attack users for the items are randomly selected from 10 sequential time windows.

(2) Sampled Amazon review dataset. The Amazon review dataset [51] was crawled from Amazon.cn until 20 August 2012 and consists of 1,205,125 ratings from 645,072 users towards 136,785 items. All of the ratings are integers between 1 and 5, where 1 and 5 are the lowest (most disliked) and highest (most liked) values, respectively. For the purpose of experimental evaluation, we select users with labels and their rating records as our experimental data. The sampled Amazon review dataset consists of 53,777 ratings from 5050 users towards 17,610 items. Among the 5050 labelled users, the number of attack users is 1937. For experiments on the sampled Amazon review dataset, we set the IRM-TIA model parameters ε, θ and δ to 0.05, 3 and 0.9, respectively.

4.2. Evaluation metrics

We use precision, recall and F1-measure as our evaluation metrics, which are defined as follows:

$$precision = \frac{TP}{TP + FP} \qquad (19)$$

$$recall = \frac{TP}{TP + FN} \qquad (20)$$

$$F1\text{-}measure = \frac{2 \times precision \times recall}{precision + recall} \qquad (21)$$

where TP is the number of shilling profiles correctly detected, FP is the number of genuine profiles misclassified as attack ones and FN is the number of shilling profiles misclassified as genuine ones.
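For completeness, a tiny helper computing Equations (19)-(21) from a set of detected users and the ground-truth attack users (the function name and set-based representation are assumptions made for the example):

```python
def detection_metrics(detected, true_attackers):
    """Return (precision, recall, F1) per Equations (19)-(21)."""
    detected, true_attackers = set(detected), set(true_attackers)
    tp = len(detected & true_attackers)            # shilling profiles correctly detected
    fp = len(detected - true_attackers)            # genuine profiles flagged as attacks
    fn = len(true_attackers - detected)            # shilling profiles that were missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

print(detection_metrics({1, 2, 3, 4}, {2, 3, 4, 5}))   # (0.75, 0.75, 0.75)
```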
4.3. Experimental results and analysis

In this section, we compare the performance of IRM-TIA with that of three baseline methods, including PCA-VarSelect [30], CBS [38] and UD-HMM [46].

(1) PCA-VarSelect: An unsupervised method for shilling attack detection, which detects attack users according to the covariance value and requires a priori knowledge of the attack size. In the experiments, we assume that the attack size is known in advance.
(2) CBS (Catch the black sheep: unified framework for shilling attack detection based on fraudulent action propagation): An unsupervised method for shilling attack detection, which detects attack users according to the spam probability value. For calculating the spam probability value and to spot attack users, CBS needs to label a certain number of candidate spam users and requires knowing the total number of attack users. In the experiments, we assume that the attack size is known in advance. In addition, the detection performance of CBS is directly affected by the number of candidate spam users. Similar to the experiments in [38], we conduct experiments under different k values and adopt the average value for k = 5, 10 and 15 as the final detection result under each attack size and filler size on the MovieLens 100K dataset. For the experiment on the sampled Amazon review dataset, the average value for k = 10, 30 and 50 is used as the final detection result.
(3) UD-HMM: An unsupervised method for shilling attack detection based on a hidden Markov model and hierarchical clustering, which analyses differences in user rating behaviour patterns and detects attack users using the hierarchical clustering algorithm. In the experiments, parameters N and α are set to 5 and 0.7, respectively.

4.3.1. Selection of IRM-TIA parameters on the MovieLens 100K dataset

The parameters in the IRM-TIA detection model are the threshold of information loss rate ε, the count threshold θ in Equation (16) and the probability threshold δ. In the experiments on the MovieLens 100K dataset, we only considered a single target item for each experiment, so the item with the maximum probability was regarded as the target item. Therefore, the parameter δ was not used. To illustrate the influence of parameters ε and θ on IRM-TIA on the MovieLens 100K dataset, we generated and injected shilling profiles under the various attack models (described in Section 2) with a 5% attack size and 5% filler size. Figure 2 shows the influence of parameters ε and θ on the F1-measure of IRM-TIA on the MovieLens 100K dataset.

It is observed from Fig. 2a that IRM-TIA performs well under various values of parameter ε. In the experiments on the MovieLens 100K dataset, we set ε to 0.3. As shown in Fig. 2b, except in the case of PUA, the F1-measure of IRM-TIA shows no significant difference for various values of θ in detecting shilling profiles. In this work, we set θ to 10 on the MovieLens 100K dataset.

4.3.2. Comparison of detection results for four methods on the MovieLens 100K dataset

Tables 3 and 4 show a comparison of precision and recall for the four methods under random attack and average attack with various attack sizes and filler sizes on the MovieLens 100K dataset, where the set of selected items is empty and the filler items of attack users are chosen at random from all items.


FIGURE 2. Influence of parameters ε and θ on the F1-measure of IRM-TIA on the MovieLens 100K dataset. (a) Influence of parameter ε on the F1-measure of IRM-TIA and (b) influence of parameter θ on the F1-measure of IRM-TIA.

TABLE 3. Comparison of precision and recall for four methods under random attack.

Attack size (%) | Filler size (%) | PCA-VarSelect       | CBS                 | UD-HMM              | IRM-TIA
                |                 | Precision  Recall   | Precision  Recall   | Precision  Recall   | Precision  Recall
3               | 3               | 0.9310     0.9642   | 0.3141     0.3321   | 0.9076     1        | 0.9931     1
3               | 5               | 0.9241     0.9571   | 0.4518     0.4789   | 0.9453     1        | 0.9931     1
3               | 8               | 0.9448     0.9789   | 0.6514     0.6910   | 0.9663     1        | 1          1
3               | 10              | 0.9586     0.9929   | 0.7468     0.7910   | 0.9116     1        | 1          1
5               | 3               | 0.9429     0.9830   | 0.4330     0.4569   | 0.9875     1        | 0.9835     1
5               | 5               | 0.9347     0.9745   | 0.6217     0.6560   | 0.9787     1        | 0.9917     1
5               | 8               | 0.9469     0.9872   | 0.7799     0.8229   | 0.9776     1        | 0.9918     1
5               | 10              | 0.9469     0.9872   | 0.8392     0.8854   | 0.9597     1        | 0.9958     1
8               | 3               | 0.9235     0.9973   | 0.5191     0.5676   | 0.9807     1        | 0.9897     1
8               | 5               | 0.9210     0.9947   | 0.7200     0.7870   | 0.9934     1        | 0.9899     1
8               | 8               | 0.9235     0.9973   | 0.8670     0.9566   | 0.9515     1        | 0.9947     1
8               | 10              | 0.9259     1        | 0.8848     0.9669   | 0.9934     1        | 1          1
10              | 3               | 0.9107     0.9979   | 0.5191     0.6211   | 0.9815     1        | 1          1
10              | 5               | 0.9107     0.9979   | 0.7622     0.8443   | 0.9969     1        | 0.9979     1
10              | 8               | 0.9126     1        | 0.8701     0.9636   | 0.9979     1        | 0.9979     1
10              | 10              | 0.9107     0.9979   | 0.8928     0.9888   | 0.9833     1        | 1          1
12              | 3               | 0.8968     1        | 0.6237     0.7028   | 0.9098     1        | 1          1
12              | 5               | 0.8968     1        | 0.7814     0.8805   | 0.9736     1        | 0.9965     1
12              | 8               | 0.8968     1        | 0.8735     0.9840   | 0.9904     1        | 1          1
12              | 10              | 0.8968     1        | 0.8774     0.9884   | 0.9855     1        | 0.9982     1

As shown in Tables 3 and 4, the detection precision of PCA-VarSelect under the two attacks is between 0.8759 and 0.9586, and the detection recall of PCA-VarSelect under the two attacks is over 0.9. This means that PCA-VarSelect can perform well in detecting the shilling profiles generated by the random and average attack models when a priori knowledge of the attack size is available. The detection precision and recall of CBS increase with an increasing attack size and filler size, and its detection performance is not affected by the type of attack. However, CBS is not effective in detecting these attacks with a small attack size and filler size on the MovieLens 100K dataset. Therefore, IRM-TIA outperforms PCA-VarSelect and CBS in detecting random attack and average attack on the MovieLens 100K dataset. The detection precision of UD-HMM is over 0.9, and the detection recall is over 0.9716. The detection precision of IRM-TIA is between 0.9617 and 1, and almost all of the detection recall values of IRM-TIA are 1. These results show that UD-HMM and IRM-TIA have excellent detection performance in detecting random attack and average attack.


TABLE 4. Comparison of precision and recall for four methods under average attack.

Attack size (%) | Filler size (%) | PCA-VarSelect       | CBS                 | UD-HMM              | IRM-TIA
                |                 | Precision  Recall   | Precision  Recall   | Precision  Recall   | Precision  Recall
3               | 3               | 0.9034     0.9357   | 0.2736     0.2896   | 0.9509     0.9881   | 0.9862     0.9929
3               | 5               | 0.8966     0.9286   | 0.3546     0.3761   | 0.9266     0.9881   | 0.9931     1
3               | 8               | 0.9034     0.9357   | 0.5986     0.6352   | 0.9198     0.9881   | 0.9798     1
3               | 10              | 0.8759     0.9071   | 0.7211     0.7641   | 0.9427     1        | 1          1
5               | 3               | 0.9306     0.9702   | 0.4692     0.4952   | 0.9115     0.9915   | 0.9617     1
5               | 5               | 0.9306     0.9702   | 0.5687     0.6004   | 0.9129     0.9716   | 1          1
5               | 8               | 0.9020     0.9404   | 0.7418     0.7828   | 0.9444     0.9929   | 0.9877     1
5               | 10              | 0.9184     0.9574   | 0.8159     0.8608   | 0.9223     1        | 1          1
8               | 3               | 0.9037     0.9760   | 0.5073     0.5547   | 0.9433     0.9947   | 0.9974     1
8               | 5               | 0.9086     0.9813   | 0.7326     0.8008   | 0.9573     0.9920   | 1          1
8               | 8               | 0.9086     0.9813   | 0.8498     0.9287   | 0.9409     0.9991   | 0.9974     1
8               | 10              | 0.9062     0.9787   | 0.8830     0.9650   | 0.9741     1        | 1          1
10              | 3               | 0.9068     0.9936   | 0.5671     0.6282   | 0.9874     0.9979   | 0.9958     1
10              | 5               | 0.9087     0.9957   | 0.7528     0.8340   | 0.9835     1        | 0.9876     1
10              | 8               | 0.9068     0.9936   | 0.8650     0.9580   | 0.9875     1        | 1          1
10              | 10              | 0.9010     0.9872   | 0.8879     0.9833   | 0.9834     1        | 0.9979     1
12              | 3               | 0.8889     0.9912   | 0.6165     0.6947   | 0.9526     0.9947   | 0.9965     0.9982
12              | 5               | 0.8905     0.9930   | 0.7745     0.8726   | 0.9812     0.9982   | 0.9965     1
12              | 8               | 0.8952     0.9982   | 0.8707     0.9808   | 0.9329     1        | 0.9931     1
12              | 10              | 0.8889     0.9912   | 0.8823     0.9938   | 0.9780     1        | 1          1

Furthermore, most of the precision values of IRM-TIA are greater than those of UD-HMM.

Table 5 shows the comparison of precision and recall for the four methods under bandwagon attack with various attack sizes and filler sizes on the MovieLens 100K dataset, where the attack users rate the target item and a few popular items with the highest rating. As shown in Table 5, the detection precision of PCA-VarSelect is between 0.7079 and 0.8980, and the detection recall of PCA-VarSelect is between 0.7234 and 0.9809. CBS performs well with an increased attack size and filler size, but its detection precision and recall are low with a small attack size and filler size. The detection precision of UD-HMM is between 0.7137 and 0.9683, and the detection recall of UD-HMM is between 0.9702 and 1. Compared with the results in Tables 3 and 4, UD-HMM declines slightly in detection precision under bandwagon attack because some genuine users are misclassified as attack users by UD-HMM when detecting bandwagon attack. Both the detection precision and recall of IRM-TIA are over 0.98, and most of the recall values of IRM-TIA are 100%, indicating that IRM-TIA can precisely detect bandwagon attack. Therefore, IRM-TIA has recall that is more or less as good as that of UD-HMM and outperforms the three baselines in terms of the precision metric in detecting bandwagon attack.

Table 6 shows the comparison of precision and recall for the four methods under AoP attack with various attack sizes and filler sizes on the MovieLens 100K dataset, where the filler items of attack users are chosen at random from the top 40% of the most popular items. As shown in Table 6, PCA-VarSelect is ineffective in detecting AoP attack, and its detection precision and recall decline sharply with an increased filler size. The detection precision of CBS is between 0.6745 and 0.8975, and the detection recall of CBS is between 0.7374 and 1, which means that CBS performs well in detecting AoP attack if the attack size is known in advance. The detection performance of UD-HMM may improve with an increased attack size but declines with an increased filler size, indicating that UD-HMM may not only group a large number of genuine profiles as attack ones but also misclassify some attack profiles as genuine ones when the attack size is very small and the filler size is large. As shown in Table 6, the detection precision of CBS is better than that of PCA-VarSelect and UD-HMM, but it is still lower than that of IRM-TIA. The detection precision of IRM-TIA is between 0.9681 and 1, and the detection recall of IRM-TIA is between 0.9643 and 0.9978. Moreover, the precision and recall values of IRM-TIA are very stable under various attack sizes and filler sizes, which means that IRM-TIA can precisely distinguish shilling profiles generated by the AoP attack model from genuine ones. Therefore, we can conclude that the detection precision of IRM-TIA obviously outperforms that of the three baselines while maintaining a high recall in detecting AoP attack on the MovieLens 100K dataset.


TABLE 5. Comparison of precision and recall for four methods under bandwagon attack.

Attack size (%) | Filler size (%) | PCA-VarSelect       | CBS                 | UD-HMM              | IRM-TIA
                |                 | Precision  Recall   | Precision  Recall   | Precision  Recall   | Precision  Recall
3               | 3               | 0.7172     0.7429   | 0.3366     0.3568   | 0.8162     0.9814   | 0.9860     0.9857
3               | 5               | 0.7586     0.7857   | 0.4474     0.4742   | 0.7882     0.9914   | 1          1
3               | 8               | 0.8552     0.8857   | 0.6689     0.7089   | 0.7648     0.9986   | 0.9966     1
3               | 10              | 0.8828     0.9143   | 0.7347     0.7785   | 0.7137     1        | 0.9966     1
5               | 3               | 0.6939     0.7234   | 0.4141     0.4370   | 0.8371     0.9770   | 0.9917     0.9957
5               | 5               | 0.8327     0.8681   | 0.6168     0.6510   | 0.8907     0.9881   | 0.9958     1
5               | 8               | 0.8857     0.9234   | 0.7910     0.8346   | 0.8199     1        | 0.9979     1
5               | 10              | 0.8980     0.9362   | 0.8450     0.8913   | 0.9832     1        | 1          1
8               | 3               | 0.7111     0.7680   | 0.5156     0.5638   | 0.8198     0.9669   | 0.9974     1
8               | 5               | 0.8469     0.9147   | 0.7036     0.7693   | 0.8224     0.9893   | 0.9987     1
8               | 8               | 0.8840     0.9547   | 0.8530     0.9323   | 0.8374     0.9893   | 1          1
8               | 10              | 0.8936     0.9653   | 0.8901     0.9726   | 0.8212     0.9952   | 0.9987     1
10              | 3               | 0.7359     0.8064   | 0.5982     0.6627   | 0.8557     0.9702   | 0.9859     0.9859
10              | 5               | 0.8369     0.9170   | 0.7628     0.8449   | 0.8686     1        | 0.9989     0.9989
10              | 8               | 0.8874     0.9723   | 0.8721     0.9659   | 0.8834     1        | 0.9989     0.9989
10              | 10              | 0.8951     0.9809   | 0.8929     0.9888   | 0.8864     1        | 1          1
12              | 3               | 0.7079     0.7894   | 0.6270     0.7067   | 0.9155     0.9947   | 0.9982     0.9982
12              | 5               | 0.8222     0.9168   | 0.7850     0.8845   | 0.9438     1        | 0.9965     0.9965
12              | 8               | 0.8667     0.9664   | 0.8740     0.9846   | 0.9683     1        | 0.9897     0.9897
12              | 10              | 0.8794     0.9805   | 0.8867     0.9988   | 0.9614     1        | 0.9965     0.9965

TABLE 6. Comparison of precision and recall for four methods under AoP attack.

Attack size (%) | Filler size (%) | PCA-VarSelect       | CBS                 | UD-HMM              | IRM-TIA
                |                 | Precision  Recall   | Precision  Recall   | Precision  Recall   | Precision  Recall
3               | 3               | 0.4828     0.5      | 0.7650     0.8109   | 0.6604     0.7929   | 1          0.9643
3               | 5               | 0.3103     0.3214   | 0.8468     0.8972   | 0.5125     0.9119   | 0.9729     0.9929
3               | 8               | 0.1034     0.1071   | 0.8371     0.8866   | 0.3795     0.8667   | 0.9862     0.9862
3               | 10              | 0.0345     0.0357   | 0.8604     0.9111   | 0.2961     0.5905   | 0.9867     0.9867
5               | 3               | 0.5837     0.6085   | 0.7516     0.7932   | 0.7632     0.9986   | 0.9681     0.9681
5               | 5               | 0.4776     0.4979   | 0.7891     0.8327   | 0.6670     1        | 0.9916     0.9916
5               | 8               | 0.2776     0.2864   | 0.8674     0.9150   | 0.6815     1        | 0.9918     0.9918
5               | 10              | 0.1143     0.1192   | 0.8768     0.9250   | 0.5414     1        | 0.9917     0.9917
8               | 3               | 0.6862     0.7413   | 0.6745     0.7374   | 0.6449     0.8299   | 0.9968     0.9968
8               | 5               | 0.6099     0.6587   | 0.7819     0.8546   | 0.6454     0.8389   | 0.9893     0.9893
8               | 8               | 0.3185     0.3440   | 0.8629     0.9431   | 0.4681     0.7472   | 0.9922     0.9922
8               | 10              | 0.1901     0.2053   | 0.8975     0.9807   | 0.3566     0.7115   | 0.9974     0.9974
10              | 3               | 0.7185     0.7872   | 0.7188     0.7963   | 0.7006     0.8413   | 0.9712     0.9712
10              | 5               | 0.6097     0.6681   | 0.8090     0.8961   | 0.5820     0.8238   | 0.9978     0.9978
10              | 8               | 0.3864     0.4234   | 0.8838     0.9788   | 0.5039     0.7587   | 0.9857     0.9857
10              | 10              | 0.2175     0.2383   | 0.8963     0.9926   | 0.4377     0.7600   | 0.9897     0.9897
12              | 3               | 0.7587     0.8460   | 0.7453     0.8399   | 0.8579     0.9979   | 0.9953     0.9953
12              | 5               | 0.6063     0.6761   | 0.8151     0.9183   | 0.7551     0.9986   | 0.9892     0.9892
12              | 8               | 0.3984     0.4442   | 0.8778     0.9888   | 0.7396     0.9596   | 0.9965     0.9965
12              | 10              | 0.2571     0.2867   | 0.8906     1        | 0.7411     0.9589   | 0.9895     0.9895


Table 7 shows the comparison of the precision and recall for four methods under PUA attack with various attack sizes on the MovieLens 100K dataset, where the power users are identified based on the approach of Indegree [11]. As shown in Table 7, all four detection methods perform well under PUA attack with various attack sizes. All of the detection recall values of the four detection methods are over 0.94, and the detection recall values of UD-HMM and IRM-TIA are 1, which means that UD-HMM and IRM-TIA can detect all shilling profiles under PUA attack. Specifically, the detection precision of PCA-VarSelect is between 0.8968 and 0.9924, the detection precision of CBS is between 0.8878 and 0.9546, the detection precision of UD-HMM is between 0.8092 and 0.9740, and the detection precision of IRM-TIA is over 0.9838. Therefore, among these methods, IRM-TIA obtains the best detection result under PUA attack with various attack sizes.

In addition, we conducted further experiments to evaluate the performance of IRM-TIA under nuke attacks. Taking the love/hate attack and reverse bandwagon attack as examples, Tables 8 and 9 show the comparison of the precision and recall for four methods under these two types of attacks on the MovieLens 100K dataset. As listed in Tables 8 and 9, PCA-VarSelect, UD-HMM and IRM-TIA perform very well under love/hate attack and reverse bandwagon attack with various attack sizes and filler sizes. The detection precision of PCA-VarSelect declines slightly when the attack size increases. CBS also performs well when the attack size and filler size are large. These results demonstrate the effectiveness of the four methods in detecting love/hate attack and reverse bandwagon attack.
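The power user attack evaluated in Table 7 depends on how power users are selected. One common reading of the in-degree criterion attributed to [11] is to count how often a user appears in other users' k-nearest-neighbour lists and to take the most frequently chosen users. The sketch below follows that reading only as an illustration; the similarity matrix and the parameters k and n_power are assumed inputs rather than values from the paper.

    import numpy as np

    # Illustrative reading of the in-degree criterion: a user's in-degree is the number
    # of other users whose k most similar users include that user.
    def power_users_by_indegree(similarity, k=25, n_power=50):
        n_users = similarity.shape[0]
        indegree = np.zeros(n_users, dtype=int)
        for u in range(n_users):
            sims = similarity[u].copy()
            sims[u] = -np.inf                        # a user is not its own neighbour
            neighbours = np.argsort(sims)[-k:]       # u's k most similar users
            indegree[neighbours] += 1
        return np.argsort(indegree)[-n_power:]       # the most frequently chosen users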

TABLE 7. Comparison of precision and recall for four methods under PUA attack.

Attack size (%) PCA-VarSelect CBS UD-HMM IRM-TIA

Precision Recall Precision Recall Precision Recall Precision Recall

3 0.9103 0.9428 0.9448 0.9971 0.8092 1 0.9931 1


5 0.9924 0.9617 0.9546 1 0.8447 1 0.9838 1
8 0.9209 0.9946 0.9153 0.9933 0.9674 1 0.9896 1
10 0.9126 0.9978 0.9030 0.9976 0.9740 1 0.9937 1
12 0.8968 0.9982 0.8878 0.9988 0.9705 1 0.9965 1

TABLE 8. Comparison of precision and recall for four methods under love/hate attack.

Attack size (%)  Filler size (%)  PCA-VarSelect         CBS                   UD-HMM                IRM-TIA
                                  Precision  Recall     Precision  Recall     Precision  Recall     Precision  Recall
3                3                0.9517     0.9857     0.2449     0.2600     0.9123     1          0.9673     1
3                5                0.9448     0.9786     0.3989     0.4227     0.8993     0.9714     0.9612     1
3                8                0.9448     0.9786     0.6417     0.6804     0.8549     1          0.9457     1
3                10               0.9310     0.9643     0.7182     0.7007     0.8963     1          0.9490     1
5                3                0.9429     0.9830     0.3879     0.4094     0.9423     1          0.9733     0.9979
5                5                0.9388     0.9787     0.5687     0.6003     0.8937     1          0.9736     1
5                8                0.9551     0.9957     0.7712     0.8137     0.9217     1          0.9698     1
5                10               0.9510     0.9915     0.8125     0.8574     0.9477     1          0.9716     1
8                3                0.9210     0.9947     0.5099     0.5575     0.9343     0.9947     0.9883     0.9987
8                5                0.9235     0.9973     0.7290     0.7970     0.9499     1          0.9858     1
8                8                0.9259     1          0.8435     0.9219     0.9542     1          0.9819     1
8                10               0.9185     0.992      0.8734     0.9545     0.9474     1          0.9974     1
10               3                0.9126     1          0.5540     0.6138     0.9697     0.9574     0.9864     0.9989
10               5                0.9126     1          0.7403     0.8202     0.9792     0.9979     0.9886     1
10               8                0.9126     1          0.8603     0.9528     0.9634     1          0.9927     1
10               10               0.9126     1          0.8915     0.9873     0.9613     1          0.9896     1
12               3                0.8968     1          0.6044     0.6811     0.9754     0.9841     0.9854     0.9991
12               5                0.8968     1          0.7564     0.8524     0.9645     0.9929     0.9879     1
12               8                0.8968     1          0.8551     0.9634     0.9692     0.9982     0.9922     1
12               10               0.8968     1          0.8818     0.9932     0.9593     1          0.9879     1


TABLE 9. Comparison of precision and recall for four methods under reverse bandwagon attack.

Attack size (%)  Filler size (%)  PCA-VarSelect         CBS                   UD-HMM                IRM-TIA
                                  Precision  Recall     Precision  Recall     Precision  Recall     Precision  Recall
3                3                0.9310     0.9643     0.5751     0.6109     0.8851     0.9657     0.9032     1
3                5                0.9379     0.9714     0.6688     0.7092     0.9303     0.9929     0.9032     1
3                8                0.9379     0.9714     0.7365     0.7804     0.8967     1          0.9032     1
3                10               0.9517     0.9857     0.8004     0.8478     0.9163     0.9886     0.9032     1
5                3                0.9551     0.9957     0.5763     0.6081     0.8964     0.9779     0.9400     1
5                5                0.9388     0.9787     0.7118     0.7513     0.9476     0.9898     0.9400     1
5                8                0.9429     0.9830     0.8436     0.8900     0.9192     0.9983     0.9400     1
5                10               0.9551     0.9957     0.8800     0.9283     0.9224     0.9932     0.9400     1
8                3                0.9235     0.9973     0.6375     0.6970     0.8973     0.9888     0.9615     1
8                5                0.9259     1          0.7743     0.8464     0.9390     0.9936     0.9615     1
8                8                0.9210     0.9947     0.8670     0.9475     0.9184     0.9984     0.9615     1
8                10               0.9259     1          0.9011     0.9847     0.9375     0.9979     0.9615     1
10               3                0.9068     0.9936     0.6806     0.7540     0.9103     0.9753     0.9745     0.9979
10               5                0.9107     0.9979     0.7988     0.8848     0.9378     0.9970     0.9691     1
10               8                0.9087     0.9957     0.8887     0.9842     0.9534     1          0.9691     1
10               10               0.9107     0.9979     0.8948     0.9910     0.9438     1          0.9691     1
12               3                0.8952     0.9982     0.7099     0.7999     0.9388     0.9802     0.9814     0.9947
12               5                0.8952     0.9982     0.8228     0.9270     0.9517     0.9943     0.9741     1
12               8                0.8968     1          0.8707     0.9809     0.9516     1          0.9741     1
12               10               0.8968     1          0.8884     0.9999     0.9535     1          0.9741     1

The detection precision of IRM-TIA is slightly inferior to that of PCA-VarSelect in detecting reverse bandwagon attack when the attack size is less than 5%, but its precision is over 0.96 and superior to that of PCA-VarSelect and UD-HMM when the attack size is more than 5%. Moreover, the detection recall of IRM-TIA is almost 100% under various attack sizes. Therefore, we can conclude that IRM-TIA not only has recall that is more or less as good as that of PCA-VarSelect and UD-HMM but also has precision superior to that of the three baselines in most cases when detecting these attacks with various attack sizes and filler sizes.

4.3.3. Statistical significance between IRM-TIA and other methods

To further illustrate the performance differences between IRM-TIA and other methods (i.e., PCA-VarSelect, CBS and UD-HMM), we conducted the Wilcoxon rank-sum test [52] based on our experimental results on the MovieLens 100K dataset. The Wilcoxon rank-sum test is a non-parametric hypothesis test that checks whether there is a significant difference between two independent samples. In our test, the null hypothesis is that IRM-TIA and the other methods are equally good for precision and recall at a significance level of 0.05. Table 10 lists the P-values and test results of IRM-TIA versus the other methods for precision and recall under seven attacks (i.e., random attack, average attack, bandwagon attack, AoP attack, power user attack, love/hate attack and reverse bandwagon attack) on the MovieLens 100K dataset, where the P-value denotes a significance level criterion and the test result indicates whether the null hypothesis is rejected. If the P-value is below 0.05, the test result is 1 and the null hypothesis is rejected; otherwise, the test result is 0 and the null hypothesis is acceptable.

As shown in Table 10, all of the P-values of IRM-TIA versus the other methods for precision in detecting the seven attacks on the MovieLens 100K dataset are less than 0.05 and all of the test results are 1. These results indicate that the null hypothesis should be rejected, which means the precision of IRM-TIA is significantly different from that of the other three benchmark methods at a significance level of 0.05 in detecting the seven attacks. These results are consistent with the conclusion that IRM-TIA is superior to PCA-VarSelect, CBS and UD-HMM in its precision in detecting the seven attacks on the MovieLens 100K dataset.

It is observed from Table 10 that all of the P-values of IRM-TIA versus PCA-VarSelect (or CBS) for recall under the seven attack models on the MovieLens 100K dataset are less than 0.05, and that all of the test results are 1. These results mean that the difference between IRM-TIA and PCA-VarSelect (or CBS) in recall is significant at the 0.05 significance level in detecting these attacks.


It is also observed from Table 10 that the P-values of IRM-TIA versus UD-HMM for recall in detecting random attack, bandwagon attack, PUA and love/hate attack on the MovieLens 100K dataset are greater than 0.05 and that the test results are 0. These results show that the difference between IRM-TIA and UD-HMM in recall is not significant at a significance level of 0.05 under these attacks. The reason is that IRM-TIA and UD-HMM are equally good in terms of recall in detecting these four attacks. These results are consistent with the conclusion that IRM-TIA outperforms or is as good as UD-HMM in recall in detecting the seven attacks on the MovieLens 100K dataset at a significance level of 0.05.

4.3.4. Comparison of detection results for four methods on the sampled Amazon review dataset

Table 11 shows the detection results of IRM-TIA on the sampled Amazon review dataset with various values of parameters d and q. As listed in Table 11, with an increase in the values of parameters d and q, the precision values of IRM-TIA increase while the recall values of IRM-TIA decrease, indicating that some shilling groups only contain a relatively small number of attack users on the sampled Amazon review dataset and that attack users in these shilling groups cannot be detected when parameters d and q exceed a certain scope. Moreover, the larger the probability of being a target item, the greater the likelihood of being attacked by attack users. For example, if there are 17 ratings on a certain item whose probability is 0.9775 and if 16 of these ratings are given by suspicious users, it is reasonable to consider this item as a target item. In fact, these suspicious users really are attackers on the sampled Amazon review dataset.
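The reasoning in the example above amounts to flagging items whose ratings are dominated by suspicious users. The sketch below implements that heuristic in its simplest form; the threshold and data structures are illustrative and stand in for the probability computed by the paper's algorithm, which is not reproduced here.

    # Illustrative heuristic: flag items whose ratings are dominated by suspicious users.
    # The threshold stands in for the probability computed by the paper's algorithm.
    def likely_target_items(ratings_by_item, suspicious_users, threshold=0.9):
        targets = []
        for item, raters in ratings_by_item.items():
            if not raters:
                continue
            share = sum(1 for u in raters if u in suspicious_users) / len(raters)
            if share >= threshold:            # e.g. 16 of 17 ratings gives a share of about 0.94
                targets.append(item)
        return targets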

TABLE 10. P-values and the test results of IRM-TIA versus other methods by the Wilcoxon rank-sum test on the MovieLens 100K dataset.

Methods Attack model Precision Recall

P-values Test results P-values Test results

IRM-TIA versus PCA-VarSelect Random attack 6.1179e-08 1 9.3540e-06 1


Average attack 1.3949e-07 1 3.6806e-08 1
Bandwagon attack 6.6626e-08 1 4.9366e-08 1
AoP attack 6.7956e-08 1 6.7956e-08 1
Power user attack 0.0317 1 0.0079 1
Love/hate attack 1.6092e-07 1 0.0044 1
Reverse bandwagon attack 0.0012 1 1.9699e-05 1
IRM-TIA versus CBS Random attack 6.2503e-08 1 8.0065e-09 1
Average attack 1.4060e-07 1 1.7866e-08 1
Bandwagon attack 6.6626e-08 1 1.9135e-07 1
AoP attack 6.7956e-08 1 3.2931e-05 1
Power user attack 0.0079 1 0.0476 1
Love/hate attack 6.7860e-08 1 2.4426e-08 1
Reverse bandwagon attack 6.4399e-08 1 2.1052e-08 1
IRM-TIA versus UD-HMM Random attack 1.5109e-05 1 1 0
Average attack 8.9913e-07 1 9.0037e-04 1
Bandwagon attack 6.6626e-08 1 0.1741 0
AoP attack 6.7956e-08 1 0.0315 1
Power user attack 0.0079 1 1 0
Love/hate attack 1.8058e-05 1 0.1487 0
Reverse bandwagon attack 0.0025 1 3.3600e-05 1
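The P-values and 0/1 test results in Table 10 follow the decision rule described above (reject the null hypothesis when the P-value is below 0.05). A minimal sketch of that computation with SciPy is given below; the input vectors are placeholders for the per-setting precision or recall values of two methods and are not taken from the tables.

    from scipy.stats import ranksums

    # Wilcoxon rank-sum test between two samples of a detection metric.
    def compare(metric_a, metric_b, alpha=0.05):
        statistic, p_value = ranksums(metric_a, metric_b)
        test_result = 1 if p_value < alpha else 0    # 1 means the null hypothesis is rejected
        return p_value, test_result

    # Placeholder usage: precision of two detectors over the same experimental settings.
    # p_value, result = compare(precision_irm_tia, precision_baseline)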

TABLE 11. Detection results of our proposed approach with various parameters.

d 0.6 0.7 0.8 0.9

q 3 5 10 3 5 10 3 5 10 3 5 10

Precision 0.4864 0.4998 0.5259 0.5115 0.5264 0.5617 0.6241 0.6882 0.7489 0.8207 0.8551 0.8966
Recall 0.8049 0.7692 0.6438 0.8012 0.7455 0.6412 0.7501 0.7135 0.6174 0.6758 0.6366 0.5509
F1-measure 0.6064 0.6059 0.5789 0.6244 0.6171 0.5988 0.6814 0.7006 0.6769 0.7412 0.7298 0.6824
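Table 11 is, in effect, a grid search over the two parameters, with the F1-measure balancing the opposing trends of precision and recall. The sketch below shows how such a sweep can be organized; run_detector is a hypothetical stand-in for IRM-TIA's detection step, which is not reproduced here, and the grids simply mirror the parameter values listed in the table.

    # Placeholder sweep over the parameter grid of Table 11; run_detector is a hypothetical
    # stand-in that returns (precision, recall) for one setting of the two parameters.
    def f1(precision, recall):
        return 0.0 if precision + recall == 0 else 2 * precision * recall / (precision + recall)

    def sweep(run_detector, d_grid=(0.6, 0.7, 0.8, 0.9), q_grid=(3, 5, 10)):
        results = {}
        for d in d_grid:
            for q in q_grid:
                p, r = run_detector(d, q)
                results[(d, q)] = (p, r, f1(p, r))
        return results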


FIGURE 3. Comparison of the detection performance for four methods on the sampled Amazon review dataset.

We conduct experiments on the sampled Amazon review dataset to compare our approach with PCA-VarSelect, CBS and UD-HMM. Figure 3 shows the detection results of the four methods, where parameters d and q of IRM-TIA are set to 0.9 and 3, respectively. As shown in Fig. 3, the detection precision and recall of PCA-VarSelect and UD-HMM on the sampled Amazon review dataset are not very good. These results indicate that not only are many genuine users misidentified as attack ones by these two methods but also many attack users cannot be detected. One possible reason is that the sampled Amazon review dataset is very sparse and many items rated by genuine users are unpopular. In addition, on the sampled Amazon review dataset, the number of co-rated items by attack users is usually more than one and most items rated by attack users are target items, which makes PCA-VarSelect and UD-HMM ineffective in detecting shilling attacks on the sampled Amazon review dataset. The detection performance of CBS is better than that of PCA-VarSelect and UD-HMM, but it is still inferior to that of IRM-TIA. The detection precision, recall and F1-measure of IRM-TIA are 0.8207, 0.6758 and 0.7412, respectively. These results indicate that IRM-TIA outperforms the three baselines in detection performance on the sampled Amazon dataset.

5. CONCLUSIONS

The CF recommender system is vulnerable to shilling attacks. To protect CF recommender systems against shilling attacks, we propose an unsupervised detection method based on item relationship mining and target item identification. We extract four behaviour features from the aspects of co-occurrence correlation and topic similarity between items on user-item rating tracks, which can reflect the difference between genuine users and attack ones in their behaviour characteristics. We construct the set of suspicious users based on the projection residual and behaviour vector length. We devise an algorithm to calculate each item's probability of being a target item, and based on this algorithm, we identify the target items and spot attack users within the set of suspicious users. The extensive experiments on the MovieLens 100K dataset show the superiority of IRM-TIA in detecting various types of shilling attacks. Additionally, the experimental results on the sampled Amazon review dataset also show that IRM-TIA performs significantly better than the three baselines in detecting the collusive spammers.

This work provides a new perspective for distinguishing attack users from genuine users in CF recommender systems, but there is still room for further improvement. In our future work, we will utilize user relationships to improve the detection performance of our approach in detecting the collusive spammers. In addition, we will further investigate the group behaviour characteristics of collusive spammers and present a new detection method for group shilling attacks in CF recommender systems.

FUNDING

This work was supported by the National Natural Science Foundation of China (Nos 61772452, 61379116), the Natural Science Foundation of Hebei Province, China (Nos F2015203046, F2014201165) and the Key Program of Research on Science and Technology of Higher Education Institutions of Hebei Province, China (No. ZD2016043).

REFERENCES

[1] Costa, H. and Macedo, L. (2013) Emotion-Based Recommender System for Overcoming the Problem of Information Overload. 11th Int. Conf. Practical Applications of Agents and Multi-Agent Systems, Salamanca, Spain, 22–24 May, pp. 178–189. Springer, Berlin.
[2] Schafer, J.B., Frankowski, D., Herlocker, J. and Sen, S. (2004) Collaborative filtering recommender systems. ACM Trans. Inf. Syst., 22, 5–53.
[3] Bobadilla, J., Ortega, F. and Hernando, A. (2013) Recommender systems survey. Knowl. Based Syst., 46, 109–132.
[4] Zhu, T., Harrington, P., Li, J. and Tang, L. (2014) Bundle Recommendation in Ecommerce. Proc. SIGIR 2014, Gold Coast, Australia, 6–11 July, pp. 657–666. ACM, New York.
[5] Chamoso, P., Rivas, A., Rodríguez, S. and Bajo, J. (2018) Relationship recommender system in a business and employment-oriented social network. Inf. Sci., 433–434, 204–220.
[6] Mei, T., Yang, B., Hua, X. and Li, S. (2011) Contextual video recommendation by multimodal relevance and user feedback. ACM Trans. Inf. Syst., 29, 1–24.
[7] Lam, S.K. and Riedl, J. (2004) Shilling Recommender Systems for Fun and Profit. 13th Int. Conf. World Wide Web, New York, USA, 17–20 May, pp. 393–402. ACM, New York.
[8] Gunes, I., Kaleli, C., Bilge, A. and Polat, H. (2014) Shilling attacks against recommender systems: a comprehensive survey. Artif. Intell. Rev., 42, 767–799.
[9] Hurley, N., Cheng, Z. and Zhang, M. (2009) Statistical Attack Detection. 3rd ACM Conf. Recommender Systems, New York, USA, 22–25 October, pp. 149–156. ACM, New York.


[10] Mobasher, B., Burke, R., Bhaumik, R. and Williams, C. (2007) Toward trustworthy recommender systems: an analysis of attack models and algorithm robustness. ACM Trans. Internet Technol., 7, 1–41.
[11] Wilson, D.C. and Seminario, C.E. (2014) Evil Twins: Modeling Power Users in Attacks on Recommender Systems. Proc. UMAP 2014, Aalborg, Denmark, 7–11 July, pp. 231–242. Springer, Berlin.
[12] Chirita, P., Nejdl, W. and Zamfir, C. (2005) Preventing Shilling Attacks in Online Recommender Systems. Proc. 7th Annual ACM Int. Workshop on Web Information and Data Management, Bremen, Germany, 4 November, pp. 67–74. ACM, New York.
[13] Williams, C. (2006) Profile Injection Attack Detection for Securing Collaborative Recommender Systems. Technical Report, School of Computer Science, DePaul University.
[14] Burke, R., Mobasher, B., Williams, C. and Bhaumik, R. (2006) Classification Features for Attack Detection in Collaborative Recommendation Systems. 12th Int. Conf. Knowledge Discovery and Data Mining, Philadelphia, USA, 20–23 August, pp. 542–547. ACM, New York.
[15] Morid, M. and Shajari, M. (2013) Defending recommender systems by influence analysis. Inf. Retr., 17, 137–152.
[16] Zhang, F. and Zhou, Q. (2014) HHT–SVM: an online method for detecting profile injection attacks in collaborative recommender systems. Knowl. Based Syst., 65, 96–105.
[17] Zhang, F. and Chen, H. (2015) An ensemble method for detecting shilling attacks based on ordered item sequences. Secur. Commun. Netw., 9, 680–696.
[18] Yang, Z., Xu, L., Cai, Z. and Xu, Z. (2016) Re-scale AdaBoost for attack detection in collaborative filtering recommender systems. Knowl. Based Syst., 100, 74–88.
[19] Zhou, Q. (2016) Supervised approach for detecting average over popular items attack in collaborative recommender systems. IET Inf. Secur., 10, 134–141.
[20] Li, W., Gao, M., Li, H., Xiong, Q., Wen, J. and Ling, B. (2015) A shilling attack detection algorithm based on popularity degree features. Acta Automatica Sin., 41, 1563–1575. (in Chinese).
[21] Williams, C., Mobasher, B. and Burke, R. (2007) Defending recommender systems: detection of profile injection attacks. Serv. Oriented Comput. Appl., 1, 157–170.
[22] Wu, Z., Zhuang, Y., Wang, Y. and Cao, J. (2012) Shilling attack detection based on feature selection for recommendation systems. Acta Electron. Sin., 40, 1687–1693. (in Chinese).
[23] Zhou, Q. and Zhang, F. (2014) Detecting unknown recommendation attacks based on bionic pattern recognition. J. Softw., 11, 2652–2665. (in Chinese).
[24] Wu, Z., Wu, J., Cao, J. and Tao, D. (2012) HySAD: A Semi-supervised Hybrid Shilling Attack Detector for Trustworthy Product Recommendation. Proc. 18th ACM SIGKDD, Beijing, China, 12–16 August, pp. 985–993. ACM, New York.
[25] Cao, J., Wu, Z., Mao, B. and Zhang, Y. (2013) Shilling attack detection utilizing semi-supervised learning method for collaborative recommender system. World Wide Web, 16, 729–748.
[26] Zhang, L., Yuan, Y., Wu, Z. and Cao, J. (2017) Semi-SGD: Semi-Supervised Learning Based Spammer Group Detection in Product Reviews. Fifth Int. Conf. Advanced Cloud and Big Data (CBD), Shanghai, China, 13–15 August, pp. 368–373. IEEE, Piscataway.
[27] Zhang, L., Wu, Z. and Cao, J. (2018) Detecting spammer groups from product reviews: a partially supervised learning model. IEEE Access, 6, 2559–2568.
[28] Zhang, S., Chakrabarti, A., Ford, J. and Makedon, F. (2006) Attack Detection in Time Series for Recommender Systems. 12th Int. Conf. Knowledge Discovery and Data Mining, Philadelphia, USA, 20–23 August, pp. 809–814. ACM, New York.
[29] Bryan, K., O'Mahony, M. and Cunningham, P. (2008) Unsupervised Retrieval of Shilling Profiles in Collaborative Recommender Systems. 2nd ACM Conf. Recommender Systems, Lausanne, Switzerland, 23–25 October, pp. 155–162. ACM, New York.
[30] Mehta, B. and Nejdl, W. (2009) Unsupervised strategies for shilling detection and robust collaborative filtering. User Model. User Adapt. Interact., 19, 65–97.
[31] Lathia, N., Hailes, S. and Capra, L. (2010) Temporal Defenses for Robust Recommendations. Int. ECML/PKDD Conf. Privacy and Security Issues in Data Mining and Machine Learning, Barcelona, Spain, 20–24 September, pp. 64–77. Springer, Berlin.
[32] Lee, J. and Zhu, D. (2012) Shilling attack detection—a new approach for a trustworthy recommender system. INFORMS J. Comput., 24, 117–131.
[33] Zou, J. and Fekri, F. (2013) A Belief Propagation Approach for Detecting Shilling Attacks in Collaborative Filtering. 22nd ACM Int. Conf. Information and Knowledge Management (CIKM 2013), Burlingame, USA, 27 October–1 November, pp. 1837–1840. ACM, New York.
[34] Zhang, Z. and Kulkarni, S.R. (2013) Graph-Based Detection of Shilling Attacks in Recommender Systems. IEEE Int. Workshop on Machine Learning for Signal Processing (MLSP), Southampton, United Kingdom, 22–25 September, pp. 1–6. IEEE, Piscataway.
[35] Zhang, Z. and Kulkarni, S.R. (2014) Detection of Shilling Attacks in Recommender Systems Via Spectral Clustering. 17th Int. Conf. Information Fusion (FUSION), Salamanca, Spain, 7–10 July, pp. 1–8. IEEE, Piscataway.
[36] Bilge, A., Ozdemir, Z. and Polat, H. (2014) A Novel Shilling Attack Detection Method. 2nd ITQM, Moscow, Russia, 3–5 June, pp. 165–174. Procedia Computer Science.
[37] Gunes, I. and Polat, H. (2015) Hierarchical Clustering-Based Shilling Attack Detection in Private Environments. Proc. ISDFS 2015, Ankara, Turkey, 11–12 May, pp. 1–7. IEEE, Piscataway.
[38] Zhang, Y., Tan, Y., Zhang, M., Liu, Y., Tat-Seng, C. and Ma, S. (2015) Catch the Black Sheep: Unified Framework for Shilling Attack Detection Based on Fraudulent Action Propagation. Proc. IJCAI 2015, Buenos Aires, Argentina, 25–31 July, pp. 2408–2414. AAAI Press, Palo Alto.
[39] Zhou, W., Yun, S., Wen, J., Alam, S. and Dobbie, G. (2014) Detection of Abnormal Profiles on Group Attacks in Recommender Systems. Int. ACM SIGIR Conf. Research & Development in Information Retrieval, Gold Coast, Australia, 6–11 July, pp. 955–958. ACM, New York.
Semi-Supervised Learning Based Spammer Group Detection in


[40] Zhou, W., Wen, J., Gao, M., Ren, H. and Li, P. (2015) Abnormal profiles detection based on time series and target item analysis for recommender systems. Math. Probl. Eng., 2015, 1–9.
[41] Wang, Q., Ren, Y., He, N., Wan, M. and Lu, G. (2015) A Group Attack Detecter for Collaborative Filtering Recommendation. 12th IEEE Int. Computer Conf. Wavelet Active Media Technology and Information Processing, Chengdu, China, 18–20 December, pp. 454–457. IEEE, Piscataway.
[42] Wang, Y., Wu, Z., Bu, Z., Cao, J. and Yang, D. (2016) Discovering shilling groups in a real e-commerce platform. Online Inf. Rev., 40, 62–78.
[43] Gao, J., Dong, Y., Shang, M., Cai, S. and Zhou, T. (2015) Group-based ranking method for online rating systems with spamming attacks. EPL, 110, 1–6.
[44] Yang, Z., Cai, Z. and Guan, X. (2016) Estimating user behavior toward detecting anomalous ratings in rating systems. Knowl. Based Syst., 111, 144–158.
[45] Yang, Z., Cai, Z. and Yuan, Y. (2017) Spotting anomalous ratings for rating systems by analyzing target users and items. Neurocomputing, 240, 25–46.
[46] Zhang, F., Zhang, Z., Zhang, P. and Wang, S. (2018) UD-HMM: an unsupervised method for shilling attack detection based on hidden Markov model and hierarchical clustering. Knowl. Based Syst., 148, 146–166.
[47] Xia, H., Fang, B., Gao, M., Ma, H., Tang, Y. and Wen, J. (2015) A novel item anomaly detection approach against shilling attacks in collaborative recommendation systems using the dynamic time interval segmentation technique. Inf. Sci., 306, 150–165.
[48] Günnemann, N., Günnemann, S. and Faloutsos, C. (2014) Robust Multivariate Auto Regression for Anomaly Detection in Dynamic Product Ratings. WWW 2014, Seoul, Korea, 7–11 April, pp. 361–372. ACM, New York.
[49] McAuley, J. and Leskovec, J. (2013) Hidden Factors and Hidden Topics: Understanding Rating Dimensions with Review Text. 7th ACM Conf. Recommender Systems, Hong Kong, 12–16 October, pp. 165–172. ACM, New York.
[50] Kabutoya, Y., Iwata, T., Toda, H. and Kitagawa, H. (2013) A Probabilistic Model for Diversifying Recommendation Lists. APWeb 2013, Sydney, Australia, 4–6 April, pp. 348–359. Springer, Berlin.
[51] Xu, C., Zhang, J., Long, C. and Long, C. (2013) Uncovering Collusive Spammers in Chinese Review Websites. ACM Int. Conf. Information & Knowledge Management (CIKM), San Francisco, USA, 27 October–1 November, pp. 979–988. ACM, New York.
[52] Perolat, J., Couso, I., Loquin, K. and Strauss, O. (2015) Generalizing the Wilcoxon rank-sum test for interval data. Int. J. Approx. Reason., 56, 108–121.
