Xinkai Yang
College of Information, Mechanical and Electronical Engineering, Shanghai Normal University
Shanghai, China, 200234
E-mail: xkyang@shnu.edu.cn
Abstract-In this paper, we propose an iterative computation framework to detect spam reviews based on coherence examination. We first define several review coherence metrics to analyze review coherence at sentence granularity. Then the framework and its evaluation process are discussed in detail.

Keywords- spam review detection; coherence metric; word transition probability; word concurrence probability

I. INTRODUCTION

Nowadays, product reviews (including reviews of services, e.g. hotels and restaurants) play an important role in consumers' online shopping activities. Many people read these reviews before making a purchase decision. Normally, the reviews for a product may be positive or negative. Customers pay more attention to products with positive reviews and avoid those with negative ones. This effect can bring a larger amount of business or lead to potential financial losses [9]. Although many product reviews are posted by real consumers to express their views and share their shopping experience with other people, more and more untruthful reviews appear on e-business web sites for financial reasons. For example, suppose a lodger writes a very negative review of a certain hotel on a review website because of its bad service. This review will present an unfavorable impression of the hotel to its potential customers and damage its business. In order to avoid business losses caused by this kind of truthful review, the owner of the hotel could employ or entice some people to write untruthful reviews to promote its reputation. Because of these pervasive untruthful reviews, customers are easily misled into buying low-quality services or products, while decent services or products can be defamed by malicious reviews. Usually, the writers who post these deceptive reviews to intentionally mislead consumers or opinion analysis systems are called spammers, and the untruthful reviews are called spam reviews [9]. To show how difficult spam detection is, we first present one example review, which was posted to ctrip.com (English version). This review was posted by a person who tries to promote the hotel, and it is hard to identify it as a deceptive review:

"If you can put up with the small size of the room (which is normal in Japan) then you will love this place. Five minute walk from the main station, modern clean room, and great breakfast with espresso machine, outstanding service, and great price. Can't find a thing to complain about it."

The rest of this paper is organized as follows. In Section II, some related research works on spam review detection are discussed. Based on coherence examination, we propose a general framework in Section III, followed by the coherence metrics defined in Section IV. We discuss our evaluation process in Section V and conclude this paper in Section VI.

II. RELATED WORKS

Early research works mostly use the features of review contents and reviewers' behaviors to detect spam reviews [9]. By representing a review with a set of review-level, reviewer-level and product-level features, a supervised classification method is applied to detect three types of fake reviews in ref. [4]. Duplicated reviews are first identified using the Jaccard ratio, and these reviews are then used as the training examples for a logistic regression classifier [4, 7]. In particular, untruthful review detection is performed by using duplicate reviews as labeled data.

Opinion and sentiment analysis of reviews extracts and aggregates positive and negative opinions from product reviews [5]. Researchers study the problem of generating feature-based summaries of customer reviews of products sold online. Given a data set of customer reviews of any particular product, the task involves three subtasks: (1) identifying features of the product that customers have discussed or expressed their opinion on; (2) for each feature, identifying review sentences which give an opinion (positive or negative); and (3) producing a summary using the discovered information [5].

Other works study spam reviews by detecting fraudulent or unfair ratings. The review ratings are clustered into unfairly high ratings and unfairly low ratings by using third party ratings products [4]. Reviews with unfair ratings can then be removed to restore a fair item evaluation system. A new classification approach is developed to solve helpfulness prediction using review content and rating features. The reviews' rating features include the difference between the review rating and the average rating of all reviews of the product. These features are then used to train a helpfulness classification method [4]. These works do not address review
with the surrounding reviews, we propose a model to analyze review coherence at sentence granularity. Intuitively, spam reviews always use some untruthful sentiment words to express their deceptive feelings, to promote or defame a store through such positive or negative feedback. These sentiment words are linked with the targeted stores in the spam reviews. So it is crucial to find the coherent relationships between the specific store and the sentiment words among them.

Based on the above assumptions, we consider each review as a bag of sentences, and define a sentence set as $S = \{s_1, s_2, s_3, \ldots, s_n\}$. For each sentence, we capture the intra-sentence information through the store-sentiment word pair and the coherent features, which are pair-wise sentence coherence. In addition, we construct a sentiment word lexicon $V_0$. We also

the connection between reviews and store-sentiment word pairs. Each edge $e_{ij}$ is associated with a weight $w_{ij}^{k} \in [0, 1]$ denoting the contribution of $p_{ij}$ to the review $d_k$. The weight $w_{ij}^{k}$ is computed from the contribution of word pair $p_{ij}$ in all sentences of $d_k$. Further metrics will be defined to investigate the coherence between these sentiment words and other related words to find more information.

IV. COHERENCE METRICS AND ALGORITHM

In this part we provide several metrics to measure the coherence of a review based on the flow smoothness information between sentences: 1) word transition probability (conditional probability); 2) word concurrence probability (joint probability). Then the framework is provided.
A. Word transition probability

Usually, human writing naturally demonstrates certain word transition patterns between two consecutive sentences. When a word is given in one sentence, certain words can be observed in the following sentence with some probability [3]. However, such transition patterns in spam reviews can be impaired due to their deceptive nature. The consecutive sentences $s$ and $s'$ are denoted as a pair $(s, s')$. The symbol $S_i$ represents the sentence set formed by the sentences $s'$ whose preceding sentence $s$ contains $w_i$. Let $f(w, S_i)$ denote the frequency of $w$ in $S_i$. Given a set of words $W$, we can define the one-step transition probability $p(w_j \mid w_i)$ as the probability of observing $w_j$ in one sentence when the word $w_i$ appears in the previous sentence.

$$p(w_j \mid w_i) = \frac{f(w_j, S_i)}{\sum_{w_k \in W} f(w_k, S_i)} \tag{2}$$

For any two sentences $s_1$ and $s_2$, the transition probability to observe $s_2$ after $s_1$, i.e. $s_1 \to s_2$, can be estimated by choosing the edge with the highest one step transition

$$\mathrm{Con}(s_1, s_2) = \max_{w_i \in s_1,\, w_j \in s_2} \log\left(\frac{\omega_{ij}}{\omega_i\,\omega_j}\right) \tag{6}$$

Abnormal word concurrence phenomena can commonly be observed in spam reviews. For example, the fixed pattern of two associative words in a spam review can hardly be found in truthful reviews. Therefore, the value of this coherence metric for spam reviews is often lower than that of truthful reviews. So this coherence metric is also very helpful for detecting spam reviews, and it can be used jointly with the former one.

$$\mathrm{Con}(r) = \frac{1}{n-1} \sum_{i=1}^{n-1} \mathrm{Con}(s_i, s_{i+1}) \tag{7}$$

ALGORITHM: Coherence Metrics Computation Process
Input: The review set V, store S, and the number of iterations $t_0$
Output: reviews' coherence metric
counter = 0
repeat
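As a rough illustration of the metrics in this section, the following Python sketch estimates the one-step transition probability of Eq. (2) and the concurrence-based coherence of Eqs. (6) and (7) from a small corpus of tokenized reviews. It is only a sketch under stated assumptions, not the implementation evaluated in this paper: the function names are hypothetical, each review is assumed to be a list of sentences already split into lowercase words, and, since the definitions of the $\omega$ terms are not given in this part, they are assumed here to be relative frequencies over consecutive sentence pairs.

```python
import math
from collections import Counter
from itertools import product


def transition_probabilities(reviews):
    """Estimate p(w_j | w_i) as in Eq. (2).

    reviews: list of reviews; each review is a list of sentences;
    each sentence is a list of word tokens.
    """
    # follow[w_i] counts the words of every sentence that directly
    # follows a sentence containing w_i (the set S_i in the text).
    follow = {}
    for sentences in reviews:
        for s, s_next in zip(sentences, sentences[1:]):
            for w_i in set(s):
                follow.setdefault(w_i, Counter()).update(s_next)
    probs = {}
    for w_i, counts in follow.items():
        total = sum(counts.values())
        probs[w_i] = {w_j: c / total for w_j, c in counts.items()}
    return probs


def cooccurrence_stats(reviews):
    """Collect counts over consecutive sentence pairs.

    ASSUMPTION: omega_i, omega_j and omega_ij are taken to be the
    fraction of sentence pairs in which w_i occurs in the first
    sentence, w_j in the second, and both, respectively.
    """
    n_pairs = 0
    first, second, joint = Counter(), Counter(), Counter()
    for sentences in reviews:
        for s, s_next in zip(sentences, sentences[1:]):
            n_pairs += 1
            a, b = set(s), set(s_next)
            first.update(a)
            second.update(b)
            joint.update(product(a, b))
    return n_pairs, first, second, joint


def pair_coherence(s, s_next, stats):
    """Eq. (6): best log-ratio over word pairs of two sentences."""
    n_pairs, first, second, joint = stats
    scores = [
        # omega_ij / (omega_i * omega_j), written with raw counts
        math.log(joint[(a, b)] * n_pairs / (first[a] * second[b]))
        for a, b in product(set(s), set(s_next))
        if joint[(a, b)] > 0
    ]
    # Word pairs never seen in the corpus contribute nothing; a sentence
    # pair with no seen word pair gets the lowest possible score.
    return max(scores) if scores else float("-inf")


def review_coherence(sentences, stats):
    """Eq. (7): average coherence over consecutive sentence pairs."""
    pairs = list(zip(sentences, sentences[1:]))
    if not pairs:
        raise ValueError("a review needs at least two sentences")
    return sum(pair_coherence(s, t, stats) for s, t in pairs) / len(pairs)
```

In use, `cooccurrence_stats` would be computed once over the whole review set, and `review_coherence` would then be called per review; reviews with unusually low scores would be the candidates for closer inspection.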
information is used to identify spam reviews from truthful reviews by human evaluators.

In our evaluation, we label a review as a suspicious spam review if more than one student agrees. Since the students identified 46 spam reviews among the 100 suspicious reviews flagged by our algorithm, the accuracy of our methodology is 46%. Although this accuracy is not very high compared to other similar works, it is still meaningful, because we are trying to detect spam reviews on the semantic level, which requires handling more complex situations. Little existing research is found in this direction. In addition, the student evaluators agreed on each other's judgment.

TABLE I. RESULT OF OUR EVALUATION

            Student 1   Student 2   Student 3
Student 1   46          29          36
Student 2               38          30
Student 3                           40

The evaluation result is listed in Table I, which shows the agreement level among evaluators. For example, student 1 identified 46 suspicious reviews, of which 29 were also recognized by student 2 and 36 were also caught by student 3. The differing results also show that it is hard to reach agreement on some reviews among all the evaluators without sufficiently strong evidence.

VI. CONCLUSION

To date, only a few attempts have been made at review spam detection, which is a challenging and underexplored area [2]. In this paper, we propose a general framework to detect spam reviews based on coherence examination. We first discuss some assumptions about spam reviews, and then we define some review coherence metrics. The iterative computation framework is also provided. Our proposed model analyzes review coherence at sentence granularity. We define metrics to investigate the coherence between sentiment words and other related words based on the flow smoothness between sentences: word transition probability and word concurrence probability. Because we try to identify spam reviews on the semantic level, our proposed model can reveal more important clues about spam reviews based on their word patterns. This work provides a novel viewpoint for spam review detection, and more potential approaches could be explored in the future.

REFERENCES

[1] C. Castillo and D. Donato, "A reference collection for web spam", SIGIR Forum, 40(2), pp. 11-24, 2006.
[2] G. Wang, S. Xie, B. Liu, and P. S. Yu, "Identify Online Store Review Spammers via Social Review Graph", ACM Transactions on Intelligent Systems and Technology, 2012.
[3] H. Sun, A. Morales, and X. Yan, "Synthetic review spamming and defense", Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '13), 2013.
[4] http://www.cs.uic.edu.
[5] K. Sharma and K.-I. Lin, "Review spam detector with rating consistency check", Proceedings of the 51st ACM Southeast Conference (ACMSE '13), 2013.
[6] C. Xu, "Detecting collusive spammers in online review communities", Proceedings of the Sixth Workshop on Ph.D. Students in Information and Knowledge Management (PIKM '13), 2013.
[7] R. Y. K. Lau, S. Y. Liao, R. C.-W. Kwok, K. Xu, Y. Xia, and Y. Li, "Text mining and probabilistic language modeling for online review spam detection", ACM Transactions on Management Information Systems, 2011.
[8] J. Liu, Y. Cao, C.-Y. Lin, Y. Huang, and M. Zhou, "Low-quality product review detection in opinion summarization", In EMNLP-CoNLL, 2007.
[9] S. Xie, G. Wang, S. Lin, and P. S. Yu, "Review spam detection via temporal pattern discovery", Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '12), 2012.
[10] G. Wang, S. Xie, B. Liu, and P. S. Yu, "Review graph based online store review spammer detection", IEEE 11th International Conference on Data Mining, pp. 1242-1247, 2011.
[11] A. Mukherjee, B. Liu, and N. Glance, "Spotting fake reviewer groups in consumer reviews", In Proceedings of the 21st International Conference on World Wide Web (WWW), 2012.
[12] N. Jindal and B. Liu, "Opinion spam and analysis", In Proceedings of the International Conference on Web Search and Web Data Mining (WSDM), pp. 219-230, 2008.
[13] N. Jindal and B. Liu, "Analyzing and detecting review spam", In Proceedings of the 7th IEEE International Conference on Data Mining, pp. 547-552, 2007.
[14] N. Jindal and B. Liu, "Review spam detection", In Proceedings of the 16th International Conference on World Wide Web, pp. 1189-1190, 2007.
[15] E. Lim, V. Nguyen, N. Jindal, B. Liu, and H. Lauw, "Detecting product review spammers using rating behaviors", In Proceedings of the 19th ACM International Conference on Information and Knowledge Management (CIKM), pp. 939-948, 2010.
[16] C. Dellarocas, "Immunizing online reputation reporting systems against unfair ratings and discriminatory behavior", In ACM EC, 2000.
[17] S.-M. Kim, P. Pantel, T. Chklovski, and M. Pennacchiotti, "Automatically assessing review helpfulness", In EMNLP, 2006.
[18] J. Kleinberg, "Authoritative sources in a hyperlinked environment", J. ACM, 46(5), pp. 604-632, 1999.