BP - Product Review Credibility Analysis
Abstract - Product reviews are vital sources of customers' opinions and have a significant impact on purchasing decisions and product rankings on online shopping sites. Unfortunately, fraudsters (spammers) may write deceptive reviews (spam reviews) appreciating or deprecating a product, which can mislead potential customers and negatively affect the revenues of many genuine organizations. Therefore, there is a great need for an effective approach to detect fake reviews and spammers. In this paper, we propose a statistical credibility scoring mechanism to identify spam reviews. It consists of three components: detection of duplicate reviews, detection of anomaly in review count and rating distribution, and detection of incentivized reviews. These three methodologies complement each other to effectively indicate the credibility of product reviews without requiring significant computational resources. The mechanism can aid data mining and online spam filtering systems in filtering out spam product reviews and refining product rankings.

Keywords - spam detection, product review, credibility analysis, online spam filtering

I. INTRODUCTION

With near-ubiquitous access to the Internet and the growing popularity of online shopping, an increasing share of purchases is made through online stores. Unlike traditional retail models, purchase decisions at these online stores are often based on product reviews, which should be written by real consumers revealing their honest experiences with a product. However, driven by profit or the desire to win over competitors, some unscrupulous professionals (spammers) attempt to bias product reviews by writing spam comments that intentionally deceive consumers.

To address this issue, a few consumer sites [16] have consolidated tips and clues to manually spot spam reviews. However, for online products with a huge number of reviews, it is practically impossible to manually check and distinguish the spam opinions from the real reviews. Several high-profile cases have been reported [14][15][19], and spammers have openly admitted in media investigations to being paid to write fake reviews [13][20]. Many businesses have rewarded positive reviews with promotions and coupons.

Recently, researchers have shown great interest in identifying the truthfulness of reviews. Reviewer behavior analysis is conducted in [4][12] to detect suspicious activities using the temporal footprints of reviewers. [4] and [9] hypothesize that there are natural distributions of review star ratings and that spam reviews can skew the rating distribution. [5][11] use content similarity analysis, which treats duplicate and near-duplicate reviews as potential spam reviews, based on the evidence that spammers often sign in with different identities to reuse similar reviews on the same product. However, research in spam detection is far from adequate to address the challenges in this area. For instance, most existing techniques for spam detection rely mainly on supervised learning and therefore require high-quality labeled training data, which unfortunately is lacking in real-life datasets [6]. Content similarity analysis can be applied to any type of dataset but requires expensive computations. In addition, many existing studies present theoretical approaches and typically test on small datasets as proofs of concept rather than in practical applications. In this study, we propose an end-to-end system that works seamlessly on large datasets and generates user-consumable reports. It integrates multiple methodologies in a practical fashion to overcome the ever-changing techniques employed by spammers. The additional novel methodology the system uses is the detection of incentivized reviews based on text analysis.

II. PROPOSED FRAMEWORK

As discussed earlier, the framework consists of three main methods - detection of duplicate reviews, detection of anomaly in review count and rating distribution, and detection of incentivized reviews - to identify the reviews that hamper the credibility of the review rating.

A. Detection of Duplicate Reviews

Most real-life review datasets do not provide labeled ground truth, which poses the main challenge in applying supervised learning methods. Therefore, we adopt duplicate review detection, an unsupervised approach, as one of the methodologies to score the credibility of a product. Different from existing work [5][11], we introduce various algorithms to improve the computational efficiency and to optimize the process.

Figure 1. Overall process to detect duplicate reviews

The overall process and examples are illustrated in Fig. 1. First, each review is converted into a set of bi-gram shingles that are formed by combining each two consecutive words.
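As a minimal sketch of the shingling step (the function name and sample text are illustrative, not from the paper), a review can be turned into its set of bi-gram shingles as follows:

```python
def bigram_shingles(review):
    """Tokenize a review on whitespace and return the set of
    bi-gram shingles (each pair of consecutive words)."""
    words = review.split()
    return {" ".join(words[i:i + 2]) for i in range(len(words) - 1)}

# A review with w words yields at most w - 1 shingles,
# fewer when a word pair repeats, since shingles form a set.
print(bigram_shingles("Works great and arrived quickly"))
```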
For example, given the review "This product is amazing," a set of bi-gram shingles is generated containing "This product," "product is," and "is amazing." Using such bi-grams is more meaningful than tokenizing individual words, as it helps capture the relationship between contexts.

In this way, the similarity between two reviews can be computed from the overlap between their sets of bi-gram shingles. There are various similarity measures in the literature, such as Euclidean distance, cosine similarity, Minkowski distance, and Jaccard similarity. For this research, Jaccard similarity is applied because, unlike measures that compute similarity for vectors and points, Jaccard similarity is defined for data objects represented as sets (unordered collections of objects), just like the sets of bi-gram shingles. Given two sets A and B, Jaccard similarity is defined in Eq. (1).

J(A, B) = |A ∩ B| / |A ∪ B|    (1)

Pairs of reviews with similarity greater than a set threshold are considered duplicates. Here the threshold is set to 70%, as used in [10].

However, for a very large review dataset, this process leads to an enormous computational load. For example, a review with 1,000 words generates 999 bi-gram shingles, and a pair-wise comparison is needed among all reviews. To improve the efficiency, the following optimizations are performed:

1) Cyclic Redundancy Check (CRC) 32 Hash and min-Hash. To make the Jaccard similarity easier to compute, the shingles are first mapped to shingle IDs using the CRC32 hash function [2][3], which converts a variable-length string into a 32-bit value. Now the set for each review is represented as a set of integers instead of substring shingles. Still, the sets are too large for efficient similarity computation.

Our goal is to have a smaller representation of these large sets, called "signatures" [22]. The key property of such signatures is that they should be a good representation of the large set at a much smaller size. Signatures for each set are derived using the min-hash scheme with k hash functions, where k is a fixed integer. The value of k is set to 105 for this research. According to the Chernoff bound [7], the expected error rate of min-hash is O(1/√k), and in general a k around 100 leads to a small error probability [23]. Using min-hash values avoids having to explicitly compute random permutations of all the shingle IDs.

2) Inverted Index. After obtaining min-hash values to represent each review, the Jaccard similarity is approximated as the number of identical signatures in reviews A and B divided by the total number of unique signatures in them. However, comparing min-hash values among reviews still requires O(n²) complexity, where n is the number of reviews in the dataset. To optimize the comparison, inverted indexes are built using the min-hash values. The inverted index is a widely adopted approach in information retrieval. In our study, a dictionary is built using all min-hash values of a product as index keys, where each index key points to the reviews containing the corresponding min-hash value. This inverted index returns all the reviews with a given min-hash value in O(log n) time.

B. Detection of Anomaly in Review Count and Rating Distribution

In a spam review detection system, identifying burst patterns of product reviews over time is significant evidence of spammer attacks. Generally, a product is expected to receive reviews and ratings progressively over time, at random intervals. But when spammers are hired to write fake reviews, there is a swift increase in the number of reviews within a short interval. The spammers may increase or decrease the rating of the targeted product in such short time periods, mostly giving extreme ratings. The literature shows a high correlation between spammer attacks and such sudden spikes in the number of reviews and in the star rating of the product [4][9]. Such spikes can be flagged as anomalies in the distribution. On the other hand, such anomalies may also appear for highly seasonal products or products of unexpected popularity. For example, a sunscreen product will most likely receive its reviews during summer. Hence, it is important to detect anomalies in the presence of seasonality and other underlying trends [8].

Figure 2. Overall process to detect anomaly

In this study, the approach to detect anomalies in review count and rating distribution is outlined in Fig. 2. First, all the reviews of a product are extracted and binned into 30-day buckets to generate the time series distribution. Next, using the time series, a pandas dataframe is built: a two-dimensional tabular data structure holding the review counts over time. These dataframes form a seasonal univariate time series and are passed into the Seasonal Hybrid Extreme Studentized Deviate (S-H-ESD) algorithm [18]. S-H-ESD detects both global and local anomalies in the presence of seasonality and growth, and it is efficient for processing large datasets. Fig. 3 shows examples of the anomalies detected by the algorithm and the types of behaviors accepted as non-anomalies [21]. It detects a sudden increase, an abnormal peak, and unusually high activity. It does not flag linear growth or linear seasonal growth, where the product has gained its popularity over time. This demonstrates the applicability of the algorithm for our study. As a result, the number of rating anomalies and count anomalies for each product (namely PrdRatingAnm and PrdCountAnm) can be counted.
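The min-hash signatures and inverted-index lookup of Section II-A can be sketched as follows. This is a simplified illustration, not the system's implementation: the xor-mask hash family, seed, and helper names are assumptions, and k is kept small for brevity (the paper fixes k = 105).

```python
import zlib
from collections import defaultdict
from random import Random

K = 8  # number of min-hash functions; the paper uses k = 105

# Fixed random masks define K cheap hash functions over 32-bit shingle IDs.
_rng = Random(42)
MASKS = [_rng.getrandbits(32) for _ in range(K)]

def signature(shingles):
    """Map shingles to CRC32 shingle IDs, then keep the minimum of each
    masked hash: a K-value min-hash signature of the shingle set."""
    ids = [zlib.crc32(s.encode()) for s in shingles]
    return tuple(min(i ^ m for i in ids) for m in MASKS)

def est_jaccard(sig_a, sig_b):
    """Estimated Jaccard similarity = fraction of matching signature slots."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / K

def build_inverted_index(signatures):
    """Index key = one min-hash value, pointing to the reviews containing it,
    so candidate pairs share at least one key instead of all-pairs O(n^2)."""
    index = defaultdict(set)
    for review_id, sig in signatures.items():
        for value in sig:
            index[value].add(review_id)
    return index
```

A pair of reviews would then be flagged as a duplicate when the estimated similarity exceeds the 70% threshold used in the paper.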
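The 30-day binning step of Section II-B can be sketched with only the standard library. The paper builds a pandas dataframe and applies S-H-ESD [18]; the z-score flag below is merely a stand-in showing where a detector plugs in, not the S-H-ESD algorithm, and the function names are illustrative.

```python
from datetime import date
from collections import Counter
from statistics import mean, stdev

def bin_reviews(review_dates, bucket_days=30):
    """Bucket review dates into consecutive 30-day windows and
    return the review count per bucket as a time series."""
    start = min(review_dates)
    counts = Counter((d - start).days // bucket_days for d in review_dates)
    return [counts.get(i, 0) for i in range(max(counts) + 1)]

def flag_spikes(series, z=3.0):
    """Stand-in detector: flag buckets more than z standard deviations
    above the mean count. The actual system uses S-H-ESD [18]."""
    mu, sigma = mean(series), stdev(series)
    if sigma == 0:
        return []
    return [i for i, c in enumerate(series) if (c - mu) / sigma > z]
```

A sudden burst of reviews concentrated in one window then stands out against the surrounding buckets, mirroring the spike behavior described above.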
III. EXPERIMENTAL RESULTS

To evaluate the proposed methods, six product categories from the Amazon review dataset [6] are used as our experimental data. Table II shows the number of reviews and the number of products under each product category. The test machine is a MacBook Pro with a 2.7 GHz Intel Core i5 processor and 8 GB of RAM.

…reviews for the product during the year 2013, while the rating distribution has remained between 4 and 5 stars throughout the review time period. The increase in the review count helps increase the product ranking. This spike in the review count is swift and short-lived, which indicates suspicious spam activity.
TABLE II. SIX AMAZON PRODUCT CATEGORIES FOR EXPERIMENTS