Collaborative filtering (CF) recommender systems have been shown to be vulnerable to shilling attacks. How to quickly and effectively detect shilling attacks is a key challenge for improving the quality and reliability of CF recommender systems. Although many recent studies have been devoted to detecting shilling attacks, there are still problems that require further discussion, especially the improvement of the detection performance on real-world unlabelled datasets. In this work, we propose an unsupervised approach that exploits item relationship and target item(s) for attack detection. We first extract behaviour features based on the item relationship. Then, we distinguish suspicious users from normal users and construct a set of suspicious users. Finally, we identify target item(s) by analysing the aggregation behaviour of suspicious users, based on which we detect attack users from the set of suspicious users. Extensive experiments on the MovieLens 100K dataset and sampled Amazon review dataset demonstrate the effectiveness of the proposed approach for detecting shilling attacks in recommender systems.
Keywords: collaborative filtering recommender systems; shilling attacks; shilling attack detection;
behaviour features; item relationship; target item identification
Received 9 August 2018; revised 14 October 2018; editorial decision 1 November 2018
Handling editor: Albert Levi
1. INTRODUCTION
With the rapid development of the Internet, the problem of information overload has become increasingly prominent [1]. Collaborative filtering (CF) recommender systems [2, 3] have arisen as an effective means of dealing with information overload. CF recommender systems rely on historic ratings given by users on items to make recommendations. Currently, CF recommender systems are widely used in e-commerce [4], social networks [5], video on demand [6], etc. However, previous research has shown that they are highly vulnerable to ‘shilling’ attacks (a.k.a. ‘profile injection’ attacks) due to the openness of recommender systems [7, 8]. Researchers have discussed various such attacks, which are mounted by injecting a number of fake profiles to promote or demote the recommendation of a target item. Shilling attacks can damage the trustworthiness of CF recommender systems. Therefore, protecting CF recommender systems against shilling attacks is a crucial issue.

In recent years, researchers have endeavoured to extract effective detection features and put forward efficient detection methods. The existing features are usually extracted based on rating statistical characteristics or item distributions. They are effective in detecting known types of attacks when supervised or semi-supervised detection methods are adopted. However, these methods cannot work when training samples are hard to obtain. In such cases, unsupervised detection methods are more applicable than supervised or semi-supervised methods because they do not need labelled training samples. To achieve better performance, however, unsupervised detection methods usually require a priori knowledge of

The rest of this paper is organized as follows. Section 2 introduces the related work on shilling attack detection in recommender systems. Section 3 describes the proposed detection method in detail, including the extraction of behaviour features, the construction of the set of suspicious users and the detection of attack users. The experimental results are reported and discussed in Section 4. In the last section, our work is summarized and our future work is discussed.
Attack model       I_S (items / ratings)    I_F (items / ratings)            I_t (rating)
Random attack      Null / Null              Randomly chosen / System mean    r_max or r_min
Average attack     Null / Null              Randomly chosen / Item mean      r_max or r_min
Bandwagon attack   Popular items / r_max    Randomly chosen / System mean    r_max or r_min
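The attack models in the table can be instantiated as rating profiles; the following is a minimal sketch, in which the function name, the dict-based profile representation and the rounding of mean ratings are illustrative choices of ours, not the paper's implementation.

```python
import random

R_MIN, R_MAX = 1, 5  # rating scale, as in the MovieLens 100K experiments

def make_profile(model, target, candidates, filler_size, system_mean,
                 item_means, popular_items=None, push=True):
    """Build one shilling profile {item: rating} per the attack-model table.

    I_S: selected items (bandwagon only), I_F: filler items, I_t: target item.
    """
    profile = {}
    if model == "bandwagon":
        for item in popular_items:          # I_S = popular items, rated r_max
            profile[item] = R_MAX
    # I_F: filler items chosen at random from the remaining items
    pool = [i for i in candidates if i != target and i not in profile]
    for item in random.sample(pool, filler_size):
        if model == "average":              # item mean for the average attack
            profile[item] = round(item_means.get(item, system_mean))
        else:                               # random / bandwagon: system mean
            profile[item] = round(system_mean)
    # I_t: r_max for a push attack, r_min for a nuke attack
    profile[target] = R_MAX if push else R_MIN
    return profile
```

For a push attack the target receives r_max; setting `push=False` yields the nuke variant with r_min instead.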
and degree of similarity with top neighbours (DegSim). After that, several new generic features were derived based on the significant difference of the ratings, namely weighted deviation from mean agreement (WDMA), weighted degree of agreement (WDA) and length variance (LengthVar) [13, 14]. Some type-specific features were presented in [14] and [15], such as mean variance (MeanVar), filler mean target difference (FMTD) and target model focus (TMF). In addition, Zhang et al. [16, 17] presented features based on the number of ratings on rated items for each user profile, such as filler size with total items (FSTI) and filler size with popular items (FSPI). Yang et al. [18] also presented three type-specific features based on the number of specific ratings on filler or selected items: filler size with maximum rating in itself (FSMAXRI), filler size with minimum rating in itself (FSMINRI) and filler size with average rating in itself (FSARI). Zhou [19] used term frequency-inverse document frequency (TF-IDF) to extract two features for detecting the AoP attack. Li et al. [20] also extracted three features based on item popularity. These existing features mainly focus on differences between shilling profiles and genuine ones in the statistical characteristics and item popularity of ratings; few are extracted based on the behaviour characteristics hidden in a user's item rating sequence. Moreover, the existing features are not always effective in detecting various types of attacks when an unsupervised detection approach is adopted. Therefore, it is worth extracting new features that reflect the behaviour difference between genuine and shilling profiles and that do not rely on specific attacks.

2.2.2. Detection algorithms

Detection algorithms against shilling attacks in CF recommender systems fall into three categories: supervised, semi-supervised and unsupervised detection methods.

As for supervised detection methods, Williams et al. [21] trained a classifier based on some generic and attack type-specific detection attributes. Wu et al. [22] proposed a feature selection algorithm to select effective features for detecting a specific type of attack, based on which they trained two classifiers, a k-nearest neighbour classifier and a Bayesian classifier. Li et al. [20] used an improved ID3 decision tree algorithm to detect shilling attacks. Yang et al. [18] applied a variant of the boosting algorithm to detect shilling attacks based on 18 statistical features, which improved the detection precision compared with methods using a single classifier. Zhang et al. [16] detected shilling profiles by combining a Hilbert-Huang transform and a Support Vector Machine (SVM). In this method, detection features are extracted using the Hilbert-Huang transform, based on which an SVM-based classifier is trained to detect shilling profiles. Zhang et al. [17] also presented an ensemble framework that combines multiple base classifiers, which is effective in detecting some known types of attacks. Zhou et al. [23] proposed an approach for detecting unknown types of attack, which uses bionic pattern recognition to cover the samples of genuine profiles and identifies profiles outside that coverage as attack ones.

As for semi-supervised detection methods, Wu et al. [24] presented a hybrid shilling attack detection approach, which exploits both labelled and unlabelled profiles to classify shilling profiles. This method can detect hybrid shilling attacks effectively, but its detection precision is inferior to that of the C4.5 decision tree method. In addition, Cao et al. [25] proposed a semi-supervised learning method that first trains a Bayes classifier on some labelled profiles and then incorporates unlabelled profiles with a weighting factor for expectation maximization (EM) to improve the initial classifier. Lately, Zhang et al. [26, 27] utilized the semi-supervised approach to detect spammer groups from product reviews.

As for unsupervised detection methods, Zhang et al. [28] proposed a detection approach based on time series data, which uses the time series of each item to judge whether it is a target item. This method rests on the assumptions that there are a number of ratings from genuine users on target items and that ratings on target items given by attack users are concentrated within a short period. Bryan et al. [29] presented an unsupervised shilling attack detection algorithm based on the Hv-score metric, which is effective in detecting midsize attacks. However, it does not perform well under small-scale bandwagon attacks. Mehta et al. [30] presented PCA-VarSelect using principal component analysis (PCA), which relies on the high similarity between shilling profiles. PCA-VarSelect can detect standard attacks (e.g., random attack and average attack) efficiently, but it requires a priori knowledge of the number of shilling attack users. Recently, Yang et al. [45] applied adaptive structure learning to select more effective features and exploited a density-based clustering method to discover shilling profiles. Zhang et al. [46] presented an unsupervised detection method based on a hidden Markov model and hierarchical clustering. This method first calculates the suspicious degree of each user using a hidden Markov model and then obtains the attack users using hierarchical clustering techniques. This method can detect
attack users. In the first stage, the ordered item sequence is constructed for each user, and then four features are presented by mining the item relationship in each user's ordered item sequence. In the second stage, the residual of each user is calculated and combined with the user's behaviour vector length to construct the set of suspicious users. In the third stage, the target item(s) can be identified by analysing the aggregation behaviour of suspicious users, based on which we determine the attack users from the set of suspicious users.

To facilitate discussion, we give descriptions of the notations used in this paper in Table 2.

TABLE 2. Notations and their descriptions.

Notation          Description
U                 The set of users, with |U| = m
I                 The set of items, with |I| = n
T                 Rating timestamp matrix T = [t_{u,j}]_{m×n}, where t_{u,j} denotes the timestamp of the rating by user u on item j
R                 User-item rating matrix R = [r_{u,j}]_{m×n}, where r_{u,j} denotes the rating by user u on item j
[r_min, r_max]    Rating scale, where r_min means most dislike and r_max means most like
r̄_j               The average of ratings on item j
r̄_u               The average of ratings given by user u
r̄                 The average of ratings over all items and users
σ                 The standard deviation of ratings over all items and users

3.1. Extracting behaviour features
In CF recommender systems, recommendations are made based on historic ratings given by users on items. Normal users usually rate items according to their real preferences or actual needs. Unlike normal users, attack users rate a number of non-target items at random for the purpose of promoting or demoting the recommendations of target items. In this section, we first analyse the item relationship based on co-occurrence and topic similarity. Then, we propose four detection features to characterize the difference between genuine users and attack ones in the user intra-track relationship.

DEFINITION 1 (Degree of co-occurrence correlation between items, DCCI). For any two items i ∈ I and j ∈ I, the degree of co-occurrence correlation between them refers to their co-occurrence (i.e. co-rated by the same user) relation, which is denoted by DCCI_{i,j}.

An association rule is usually used to show the co-occurrence relation between items. In recommender systems, most items are rated by only a small number of users, so the support between two items is usually very low. To address this limitation and to avoid the impact of zero transactions, we adopt the Kulc coefficient to calculate the degree of co-occurrence correlation between items i and j; DCCI_{i,j} is calculated as follows:

    DCCI_{i,j} = (1/2) (Co_{i,j}/NR_i + Co_{i,j}/NR_j)    (1)

where Co_{i,j} is the number of users who give ratings for both item i and item j, and NR_i and NR_j are the numbers of ratings for items i and j, respectively.

In recommender systems, an item typically concerns multiple hidden topics in different proportions, which can be obtained by a latent factor model, e.g., latent Dirichlet allocation (LDA) [49] or probabilistic latent semantic analysis (PLSA) [50]. In this work, we use the well-known PLSA based on a mixture decomposition derived from a latent class model. For an item i ∈ I, the corresponding hidden topic distribution vector is denoted as HTDV_i = (p_{i,1}, p_{i,2}, …, p_{i,c}), where c is the total number of hidden topics, p_{i,x} is the probability or proportion of item i belonging to the xth topic, 0 ≤ p_{i,x} ≤ 1 and Σ_{x=1}^{c} p_{i,x} = 1.

DEFINITION 2 (Degree of topic similarity between items, DTSI). For any two items i ∈ I and j ∈ I, the degree of topic similarity between them refers to the degree of similarity

    RDUT_u = ( Σ_{i,j ∈ Track_u, i ≠ j} DCCI_{i,j} ) / ( |Track_u| − 1 )²    (5)

DEFINITION 5 (Similarity degree of user track, SDUT). For any user u ∈ U, the similarity degree of Track_u refers to the average of the topic similarity degree between items in Track_u.
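Equation (1) is straightforward to compute from the rating data. A small sketch follows, assuming ratings are held as a mapping from each user to the set of items that user rated; this representation is illustrative, not the paper's.

```python
def dcci(ratings_by_user, i, j):
    """Degree of co-occurrence correlation between items i and j (Equation (1)).

    Kulc coefficient: the average of Co_ij/NR_i and Co_ij/NR_j, where Co_ij is
    the number of users who rated both items and NR_i, NR_j are the numbers of
    ratings on i and j.  Unlike plain support, it is not dragged down when most
    users rated neither item (the 'zero transaction' problem).
    """
    raters_i = {u for u, items in ratings_by_user.items() if i in items}
    raters_j = {u for u, items in ratings_by_user.items() if j in items}
    if not raters_i or not raters_j:
        return 0.0
    co = len(raters_i & raters_j)
    return 0.5 * (co / len(raters_i) + co / len(raters_j))
```

For example, with `{"u1": {"a", "b"}, "u2": {"a"}, "u3": {"b"}}`, items a and b are co-rated by one of two raters each, giving DCCI = 0.5.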
correlation degree and similarity degree in this time window to be constant.

Based on the above description, the algorithm for extracting behaviour features is described as follows.

Algorithm 1 Extracting behaviour features.
Input: user-item rating time matrix T, rating matrix R
Output: four behaviour features RDUT, SDUT, ARDTW and ASDTW
1: for any two items i ∈ I and j ∈ I do
2:   if i ≠ j then
3:     Compute DCCI_{i,j} by Equation (1)

Algorithm 1 mainly includes two parts. The first part (Lines 1–6) calculates the co-occurrence correlation degree and topic similarity degree between any two different items. The second part (Lines 7–12) calculates the values of the four features for each user.

3.2. Constructing the set of suspicious users

In CF recommender systems, the behaviour intention of attack users differs greatly from that of genuine users, which causes an obvious difference between attack users and genuine ones in the behaviour feature space. In this section, we use PCA to model user behaviour and to detect suspicious users.

Let x_{uj} be the value of the jth behaviour feature for user u ∈ U, and let x_u = (x_{u1}, x_{u2}, …, x_{uf}) denote the f-dimensional (f = 4 in this paper) vector corresponding to user u. X = [x_1, x_2, …, x_m] ∈ R^{m×f} is an m×f feature matrix. To make the matrix zero-centred, a simple linear transform is used by deducting the mean of every column from each variable, i.e.

    B = [b_{uj}]_{m×f} = [x_{uj} − x̄_j]_{m×f},

where x̄_j is the mean of the jth column. The covariance matrix of X is C = [cov_{ij}]_{f×f}, and each element cov_{ij} is calculated as follows:

    cov_{ij} = ( Σ_{k'=1}^{m} b_{k'i} b_{k'j} ) / ( √(Σ_{k1=1}^{m} b_{k1,i}²) √(Σ_{k2=1}^{m} b_{k2,j}²) )    (9)

For a covariance matrix C, we can obtain the eigenvalues λ1 ≥ λ2 ≥ … ≥ λf and the corresponding eigenvectors v1, v2, …, vf. PCA selects the eigenvector with the largest eigenvalue as the first principal component. Similarly, the lth principal component is the eigenvector corresponding to the lth largest eigenvalue. The projection residual of user u is then

    PR_u = || x_u − x̂_u ||_2    (10)

where ||·||_2 represents the Frobenius norm.

Generally speaking, the projection residual values of most genuine users are relatively low, but those of attack users and a few genuine users are larger. In addition, we also calculate the length of the behaviour vector for each user.

DEFINITION 9 (Behaviour vector length, BVL). For any user u ∈ U, the behaviour vector length of user u refers to the length of its behaviour vector, which is denoted as follows:

    BVL_u = || x_u ||_2    (11)

For most genuine users, the lengths of their behaviour vectors are close to each other, but those of attack users and a few genuine users differ obviously.

Based on the above analysis of projection residuals and behaviour vector lengths, we use the k-means algorithm (with k equal to 2) to cluster users. All of the users are split into two parts. One part contains most of the genuine users; the other part consists of all of the attack users and a few genuine users. The mean of the behaviour vector length for each part is calculated, and the part with the greater mean behaviour vector length is regarded as the set of suspicious users. The algorithm for constructing the set of suspicious users is described as follows.

Algorithm 2 mainly includes three parts. The first part (Lines 1–6) constructs the m×f matrix X and extracts k principal components. The second part (Lines 7–10)
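The construction of the suspicious-user set described above can be sketched with numpy as follows. The function name and the seeding of the 2-means step are our own choices, and the paper's Algorithm 2 is not reproduced verbatim; this is only a sketch under those assumptions.

```python
import numpy as np

def suspicious_users(X, k=1):
    """Split users into a normal and a suspicious part (Section 3.2 sketch).

    X is the m x f behaviour-feature matrix (f = 4 in the paper).  We
    zero-centre it, take the top-k principal components of the covariance
    matrix, and score each user by the projection residual (Equation (10))
    and the behaviour-vector length (Equation (11)); a 2-means clustering on
    these scores then yields the suspicious part.
    """
    B = X - X.mean(axis=0)                       # zero-centred matrix B
    C = np.cov(B, rowvar=False)                  # f x f covariance matrix
    eigvals, eigvecs = np.linalg.eigh(C)
    V = eigvecs[:, np.argsort(eigvals)[::-1][:k]]    # top-k principal axes
    residual = np.linalg.norm(B - B @ V @ V.T, axis=1)   # PR_u
    bvl = np.linalg.norm(X, axis=1)                      # BVL_u
    S = np.column_stack([residual, bvl])
    # plain 2-means on the scores, seeded at the BVL extremes
    centres = S[[bvl.argmin(), bvl.argmax()]]
    for _ in range(20):
        labels = np.argmin(((S[:, None, :] - centres) ** 2).sum(-1), axis=1)
        centres = np.array([S[labels == c].mean(axis=0) for c in (0, 1)])
    # the part with the greater mean behaviour-vector length is suspicious
    suspect = max((0, 1), key=lambda c: bvl[labels == c].mean())
    return np.where(labels == suspect)[0]
```

The extraction of the four behaviour features themselves (Algorithm 1) is assumed to have been done upstream; here X is simply taken as given.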
    P_i = SBA_i · f_i    (18)

Obviously, the larger the probability, the greater the likelihood of being a target item. In the case of detecting shilling attacks with a single target item, we can regard the item with the maximum probability as the target item. If there exist multiple shilling groups, and if the number of target items may be more than one for each shilling group, we consider items to be target items if their probability is larger than a threshold δ. Users in the set of suspicious users are regarded as attack users if they rate a target item with r_max (in the case of a push attack) or r_min (in the case of a nuke attack).

Based on the above description, the algorithm for detecting attack users is described below.

4. EXPERIMENTAL EVALUATION

4.1. Experimental dataset and setting

To evaluate the effectiveness of the proposed method, the following two datasets are used as the experimental data.

(1) MovieLens 100K dataset. This dataset consists of 100,000 ratings on 1682 movies by 943 users. All of the ratings are integers between 1 and 5, where 1 is the lowest (most disliked) and 5 is the highest (most liked). Each user has rated at least 20 movies. The rating time is in UNIX seconds since 1/1/1970 UTC. As in previous work, we assume that all of the profiles in the MovieLens 100K dataset are genuine. Shilling profiles are generated using the attack models described in Section 2 and injected into the dataset. In the experiments, we use the models of random attack, average attack, bandwagon attack, average over popular items (AoP) attack and power user attack as push attacks, and we use the love/hate attack and reverse bandwagon attack as nuke attacks. The attack size and filler size vary: specifically, the attack size is set to 3%, 5%, 8%, 10% and 12%, and the filler size is set to 3%, 5%, 8% and 10%. For a push attack, the target item is chosen at random from items having an average rating lower than 3. For a nuke attack, the target item is randomly chosen from the top 15% of popular items (i.e., items with a large number of ratings). To obtain an accurate result, we repeat each experiment 10 times under the same conditions (i.e., the same attack, attack size and filler size), and the average values of the 10 experiments are reported as the final evaluation results. As longer attacks have less of an effect, attackers have to rate items quickly [31]. For the experiments on the MovieLens 100K dataset, the rating timestamps of the attack users for the items are randomly selected from 10 sequential time windows.

(2) Sampled Amazon review dataset. The Amazon review dataset [51] was crawled from Amazon.cn until 20 August 2012 and consists of 1,205,125 ratings from 645,072 users towards 136,785 items. All of the ratings are integers between 1 and 5, where 1 and 5 are the lowest (most disliked) and

attack users, CBS needs to label a certain number of candidate spam users and requires knowing the total number of attack users. In the experiments, we assume that the attack size is known in advance. In addition, the detection performance of CBS is directly affected by the number of candidate spam users. Similar to the experiments in [38], we conduct experiments under different k values and adopt the
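The target-item identification and attack-user labelling rules described above reduce to a simple decision procedure. The sketch below assumes the item probabilities of Equation (18) have already been computed; the function name and the (user, item)-keyed rating mapping are illustrative choices of ours.

```python
def detect_attack_users(probs, suspicious, ratings, delta=None,
                        push=True, r_min=1, r_max=5):
    """Flag attack users among the suspicious set (sketch of the final stage).

    probs: {item: probability of being a target item} (Equation (18) output).
    If delta is None, a single target item (the argmax) is assumed; otherwise
    every item whose probability exceeds the threshold delta is a target.
    A suspicious user who rated a target item with r_max (push attack) or
    r_min (nuke attack) is labelled an attack user.
    """
    if delta is None:
        targets = {max(probs, key=probs.get)}
    else:
        targets = {i for i, p in probs.items() if p > delta}
    flag = r_max if push else r_min
    return {u for u in suspicious
            if any(ratings.get((u, t)) == flag for t in targets)}
```

With a single target the rule degenerates to the argmax case; the threshold form covers multiple shilling groups, each possibly with its own target item(s).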
FIGURE 2. Influence of parameters ε and θ on the F1-measure of IRM-TIA on the MovieLens 100K dataset: (a) influence of parameter ε on the F1-measure of IRM-TIA and (b) influence of parameter θ on the F1-measure of IRM-TIA.
TABLE 3. Comparison of precision and recall for four methods (PCA-VarSelect, CBS, UD-HMM, IRM-TIA) under random attack.
0.9586, and the detection recall of PCA-VarSelect under three attacks is over 0.9. This means that PCA-VarSelect can perform well in detecting the shilling profiles generated by the random and average attack models when a priori knowledge of the attack size is available. The detection precision and recall of CBS increase with increasing attack size and filler size, and its detection performance is not affected by the type of attack. However, CBS is not effective in detecting the three attacks with a small attack size and filler size on the MovieLens 100K dataset. Therefore, IRM-TIA outperforms PCA-VarSelect and CBS in detecting random attack and average attack on the MovieLens 100K dataset. The detection precision of UD-HMM is over 0.9, and its detection recall is over 0.9716. The detection precision of IRM-TIA is between 0.9617 and 1, and almost all of the detection recall values of IRM-TIA are 1. These results show that UD-HMM and IRM-TIA have excellent detection performance in detecting random attack and average attack. Furthermore, most of the precision values of IRM-TIA are greater than those of UD-HMM.

TABLE 4. Comparison of precision and recall for four methods (PCA-VarSelect, CBS, UD-HMM, IRM-TIA) under average attack.

Table 5 shows the comparison of precision and recall for the four methods under bandwagon attack with various attack sizes and filler sizes on the MovieLens 100K dataset, where the attack users rate the target item and a few popular items with the highest rating. As shown in Table 5, the detection precision of PCA-VarSelect is between 0.7079 and 0.8980, and its detection recall is between 0.7234 and 0.9809. CBS performs well with an increased attack size and filler size, but its detection precision and recall are low with a small attack size and filler size. The detection precision of UD-HMM is between 0.7137 and 0.9683, and its detection recall is between 0.9702 and 1. Compared with the results in Tables 3 and 4, UD-HMM declines slightly in detection precision under bandwagon attack because some genuine users are misclassified as attack users. Both the detection precision and recall of IRM-TIA are over 0.98, and most of the recall values of IRM-TIA are 100%, indicating that IRM-TIA can precisely detect bandwagon attack. Therefore, IRM-TIA has recall that is more or less as good as that of UD-HMM and outperforms the three baselines in terms of precision in detecting bandwagon attack.

Table 6 shows the comparison of precision and recall for the four methods under AoP attack with various attack sizes and filler sizes on the MovieLens 100K dataset, where the filler items of attack users are chosen at random from the top 40% of the most popular items. As shown in Table 6, PCA-VarSelect is ineffective in detecting AoP attack, and its detection precision and recall decline sharply with an increased filler size. The detection precision of CBS is between 0.6745 and 0.8975, and its detection recall is between 0.7374 and 1, which means that CBS performs well in detecting AoP attack if the attack size is known in advance. The detection performance of UD-HMM may improve with an increased attack size but declines with an increased filler size, indicating that UD-HMM may not only group a large number of genuine profiles as attack ones but may also misclassify some attack profiles as genuine ones when the attack size is very small and the filler size is large. As shown in Table 6, the detection precision of CBS is better than that of PCA-VarSelect and UD-HMM, but it is still lower than that of IRM-TIA. The detection precision of IRM-TIA is between 0.9681 and 1, and its detection recall is between 0.9643 and 0.9978. Moreover, the precision and recall values of IRM-TIA are very stable under various attack sizes and filler sizes, which means that IRM-TIA can precisely distinguish shilling profiles generated by the AoP attack model from genuine ones. Therefore, we can conclude that the detection precision of IRM-TIA obviously outperforms that of the three baselines while maintaining a high recall in detecting AoP attack on the MovieLens 100K dataset.
TABLE 5. Comparison of precision and recall for four methods (PCA-VarSelect, CBS, UD-HMM, IRM-TIA) under bandwagon attack.

TABLE 6. Comparison of precision and recall for four methods (PCA-VarSelect, CBS, UD-HMM, IRM-TIA) under AoP attack.
Table 7 shows the comparison of precision and recall for the four methods under PUA attack with various attack sizes on the MovieLens 100K dataset, where the power users are identified based on the Indegree approach [11]. As shown in Table 7, all four detection methods perform well under PUA attack with various attack sizes. All of the detection recall values of the four methods are over 0.94, and the detection recall values of UD-HMM and IRM-TIA are 1,

the best detection result under PUA attack with various attack sizes.

In addition, we conducted further experiments to evaluate the performance of IRM-TIA under nuke attacks. Taking the love/hate attack and reverse bandwagon attack as examples, Tables 8 and 9 show the comparison of precision and recall for the four methods under these two types of attacks on the MovieLens 100K dataset. As listed in Tables 8 and 9, PCA-VarSelect
TABLE 7. Comparison of precision and recall for four methods (PCA-VarSelect, CBS, UD-HMM, IRM-TIA) under PUA attack.

TABLE 8. Comparison of precision and recall for four methods (PCA-VarSelect, CBS, UD-HMM, IRM-TIA) under love/hate attack.

TABLE 9. Comparison of precision and recall for four methods (PCA-VarSelect, CBS, UD-HMM, IRM-TIA) under reverse bandwagon attack.
bandwagon attack. The detection precision of IRM-TIA is slightly inferior to that of PCA-VarSelect in detecting reverse bandwagon attack when the attack size is less than 5%, but its precision is over 0.96 and superior to that of PCA-VarSelect and UD-HMM when the attack size is more than 5%. Moreover, the detection recall of IRM-TIA is almost 100% under various attack sizes. Therefore, we can conclude that IRM-TIA not only has recall that is more or less as good as that of PCA-VarSelect and UD-HMM but also has precision superior to that of the three baselines in most cases when detecting these attacks with various attack sizes and filler sizes.

4.3.3. Statistical significance between IRM-TIA and other methods

To further illustrate the performance differences between IRM-TIA and the other methods (i.e., PCA-VarSelect, CBS and UD-HMM), we conducted the Wilcoxon rank-sum test [52] based on our experimental results on the MovieLens 100K dataset. The Wilcoxon rank-sum test is a non-parametric hypothesis test that checks whether there is a significant difference between two independent samples. In our test, the null hypothesis is that IRM-TIA and the other methods are equally good in precision and recall at a significance level of 0.05. Table 10 lists the P-values and test results of IRM-TIA versus the other methods for precision and recall under seven attacks (i.e., random attack, average attack, bandwagon attack, AoP attack, power user attack, love/hate attack and reverse bandwagon attack) on the MovieLens 100K dataset, where the P-value denotes a significance level criterion and the test result indicates whether the null hypothesis is rejected. If the P-value is below 0.05, the test result is 1 and the null hypothesis is rejected; otherwise, the test result is 0 and the null hypothesis is accepted.

As shown in Table 10, all of the P-values of IRM-TIA versus the other methods for precision in detecting the seven attacks on the MovieLens 100K dataset are less than 0.05, and all of the test results are 1. These results indicate that the null hypothesis should be rejected, which means that the precision of IRM-TIA is significantly different from that of the other three benchmark methods at a significance level of 0.05 in detecting the seven attacks. These results are consistent with the conclusion that IRM-TIA is superior to PCA-VarSelect, CBS and UD-HMM in precision in detecting the seven attacks on the MovieLens 100K dataset.

It is observed from Table 10 that all of the P-values of IRM-TIA versus PCA-VarSelect (or CBS) for recall under the seven attack models on the MovieLens 100K dataset are less than 0.05 and that all of the test results are 1. These results mean that the difference between IRM-TIA and PCA-VarSelect (or CBS) in recall is significant at the 0.05 significance level in detecting these attacks. It is also observed from Table 10 that the P-values of IRM-TIA versus UD-HMM for recall in detecting random attack, bandwagon attack, PUA and love/hate attack on the MovieLens 100K dataset are greater than 0.05 and that the test results are 0. These results show that the difference between IRM-TIA and UD-HMM in recall is not significant at a significance level of 0.05 under these attacks; IRM-TIA and UD-HMM are equally good in terms of recall in detecting these four attacks. These results are consistent with the conclusion that IRM-TIA outperforms or is as good as UD-HMM in recall in detecting the seven attacks on the MovieLens 100K dataset.

increase while the recall values of IRM-TIA decrease, indicating that some shilling groups contain only a relatively small number of attack users on the sampled Amazon review dataset and that attack users in these shilling groups cannot be detected when parameters δ and θ exceed a certain scope. Moreover, the larger the probability of being a target item, the greater the likelihood of being attacked by attack users. For example, if there are 17 ratings on a certain item whose probability is 0.9775 and if 16
TABLE 10. P-values and the test results of IRM-TIA versus other methods by the Wilcoxon rank-sum test on the MovieLens 100K dataset.
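The significance test behind Table 10 can be reproduced with any statistics package; for completeness, here is a self-contained sketch of the two-sided Wilcoxon rank-sum test using the usual normal approximation. This is our own minimal implementation, adequate for sample sizes like the per-configuration precision and recall values compared above, not the paper's code.

```python
import math

def rank_sum_pvalue(x, y):
    """Two-sided Wilcoxon rank-sum test via the normal approximation.

    Ranks the pooled samples (average ranks for ties), sums the ranks of x,
    and compares against the null mean and variance of the rank-sum statistic.
    """
    pooled = sorted((v, 0 if k < len(x) else 1)
                    for k, v in enumerate(list(x) + list(y)))
    ranks, k = [0.0] * len(pooled), 0
    while k < len(pooled):
        j = k
        while j + 1 < len(pooled) and pooled[j + 1][0] == pooled[k][0]:
            j += 1
        for t in range(k, j + 1):            # average rank over a tied run
            ranks[t] = (k + j) / 2 + 1
        k = j + 1
    n1, n2 = len(x), len(y)
    w = sum(r for r, (_, grp) in zip(ranks, pooled) if grp == 0)
    mu = n1 * (n1 + n2 + 1) / 2              # null mean of the rank sum
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (w - mu) / sigma
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
```

Following the convention in Table 10, the null hypothesis is rejected (test result 1) whenever the returned P-value is below 0.05.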
TABLE 11. Detection results of our proposed approach with various parameters.

θ            3       5       10      3       5       10      3       5       10      3       5       10
Precision    0.4864  0.4998  0.5259  0.5115  0.5264  0.5617  0.6241  0.6882  0.7489  0.8207  0.8551  0.8966
Recall       0.8049  0.7692  0.6438  0.8012  0.7455  0.6412  0.7501  0.7135  0.6174  0.6758  0.6366  0.5509
F1-measure   0.6064  0.6059  0.5789  0.6244  0.6171  0.5988  0.6814  0.7006  0.6769  0.7412  0.7298  0.6824
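The F1-measure rows in Table 11 are the harmonic mean of the corresponding precision and recall rows; for example, the first column (precision 0.4864, recall 0.8049) gives F1 ≈ 0.6064.

```python
def f1_measure(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)
```

The same check reproduces every F1 entry in the table from its precision and recall values.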
FIGURE. F1-measure comparison of PCA–VarSelect, UD–HMM, CBS and IRM–TIA.

performs significantly better than the three baselines in detecting the collusive spammers.

This work provides a new perspective for distinguishing attack users from genuine users in CF recommender systems, but there is still room for further improvement. In our future work, we will utilize user relationships to improve the detection performance of our approach in detecting collusive spammers. In addition, we will further investigate the group
[10] Mobasher, B., Burke, R., Bhaumik, R. and Williams, C. (2007) Toward trustworthy recommender systems: an analysis of attack models and algorithm robustness. ACM Trans. Internet Technol., 7, 1–41.
[11] Wilson, D.C. and Seminario, C.E. (2014) Evil Twins: Modeling Power Users in Attacks on Recommender Systems. Proc. UMAP 2014, Aalborg, Denmark, 7–11 July, pp. 231–242. Springer, Berlin.
… Product Reviews. Fifth Int. Conf. Advanced Cloud and Big Data (CBD), Shanghai, China, 13–15 August, pp. 368–373. IEEE, Piscataway.
[27] Zhang, L., Wu, Z. and Cao, J. (2018) Detecting spammer groups from product reviews: a partially supervised learning model. IEEE Access, 6, 2559–2568.
[28] Zhang, S., Chakrabarti, A., Ford, J. and Makedon, F. (2006) Attack Detection in Time Series for Recommender Systems. 12th
[40] Zhou, W., Wen, J., Gao, M., Ren, H. and Li, P. (2015) Abnormal profiles detection based on time series and target item analysis for recommender systems. Math. Probl. Eng., 2015, 1–9.
[41] Wang, Q., Ren, Y., He, N., Wan, M. and Lu, G. (2015) A Group Attack Detecter for Collaborative Filtering Recommendation. 12th IEEE Int. Computer Conf. Wavelet Active Media Technology and Information Processing, Chengdu, China, 18–20 December, pp. 454–457. IEEE, Piscataway.
[47] Xia, H., Fang, B., Gao, M., Ma, H., Tang, Y. and Wen, J. (2015) A novel item anomaly detection approach against shilling attacks in collaborative recommendation systems using the dynamic time interval segmentation technique. Inf. Sci., 306, 150–165.
[48] Günnemann, N., Günnemann, S. and Faloutsos, C. (2014) Robust Multivariate Auto Regression for Anomaly Detection in Dynamic Product Ratings. WWW 2014, Seoul, Korea, 7–11