
Product Review Credibility Analysis


Anusha Prabakaran, Min Chen
Division of Computing and Software Systems, School of STEM
University of Washington Bothell
Bothell, USA
{anushapr, minchen2}@uw.edu

Abstract— Product reviews are vital sources of customers' opinions and have a significant impact on purchasing decisions and product rankings on online shopping sites. Unfortunately, fraudsters (spammers) may write deceptive reviews (spam reviews) appreciating or deprecating a product, which can mislead potential customers and negatively affect the revenues of many genuine organizations. Therefore, there is a great need for an effective approach to detect fake reviews and spammers. In this paper, we propose a statistical credibility scoring mechanism to identify spam reviews. It consists of three components: detection of duplicate reviews, detection of anomaly in review count and rating distribution, and detection of incentivized reviews. These three methodologies complement each other to effectively indicate the credibility of product reviews without requiring significant computational resources. The mechanism can aid data mining and online spam filtering systems in filtering out spam product reviews and refining product rankings.

Keywords— Spam detection, product review, credibility analysis, online spam filtering
I. INTRODUCTION

With near-ubiquitous access to the Internet and the growing popularity of online shopping, there is a rapid increase in the percentage of shopping done through online stores. Unlike traditional retail models, purchase decisions at these online stores are often based on product reviews, which should be written by real consumers who reveal their honest experiences with a product. However, driven by profit or the desire to win over competitors, some unscrupulous professionals (spammers) attempt to bias product reviews by writing spam comments to intentionally deceive consumers.

To address this issue, a few consumer sites [16] have consolidated tips and clues to manually spot spam reviews. However, for online products with a huge number of reviews, it is practically impossible to manually check and distinguish the spam opinions from the real reviews. Several high-profile cases have been reported [14][15][19], and spammers have openly admitted in media investigations to being paid to write fake reviews [13][20]. Many businesses have rewarded positive reviews with promotions and coupons.

Recently, researchers have shown great interest in identifying the truthfulness of reviews. Reviewer behavior analysis is conducted in [4][12] to detect suspicious activities using the temporal footprints of the reviewers. [4] and [9] hypothesize that there are natural distributions of review star ratings and that spam reviews can lead to skewed rating distributions. [5][11] use content similarity analysis, which treats duplicate and near-duplicate reviews as potential spam reviews based on the evidence that spammers often sign in with different identities to reuse similar reviews on the same product. However, research in spam detection is far from adequate to address the challenges in this area. For instance, most existing techniques for spam detection rely mainly on supervised learning, which requires high-quality labeled training data that is unfortunately lacking in real-life datasets [6]. Content similarity analysis could be applied to any type of dataset but requires expensive computations. In addition, many existing studies present theoretical approaches and typically test on small datasets as proofs of concept instead of practical applications. In this study, we propose an end-to-end system that works seamlessly on large datasets and generates user-consumable reports. It integrates multiple methodologies in a practical fashion to overcome the ever-changing techniques employed by spammers. The additional novel methodology the system uses is to detect incentivized reviews based on text analysis.

II. PROPOSED FRAMEWORK

As discussed earlier, the framework consists of three main methods - detection of duplicate reviews, detection of anomaly in review count and rating distribution, and detection of incentivized reviews - to identify the reviews that hamper the credibility of the review rating.

A. Detection of Duplicate Reviews

Most real-life review datasets do not provide labeled ground truth, which poses a main challenge in applying supervised learning methods. Therefore, we adopt duplicate review detection, an unsupervised approach, as one of the methodologies to score the credibility of a product. Different from existing work [5][11], we introduce various algorithms to improve the computational efficiency and to optimize the process.

Figure 1. Overall process to detect duplicate reviews


The overall process and examples are illustrated in Fig. 1. First, each review is converted into a set of bi-gram shingles, which are formed by combining two consecutive words. For example, given the review "This product is amazing," a set of bi-gram shingles is generated containing "This product," "product is," and "is amazing." Using such bigrams is more meaningful than tokenizing each word individually, as it helps increase the relevance between contexts.

In this way, the similarity between two reviews can be computed by checking the ratio of the intersection of their sets of bi-gram shingles. There are various similarity measures in the literature, such as Euclidean distance, Cosine similarity, Minkowski distance, and Jaccard similarity. For this research, Jaccard similarity is applied because, compared to other measures that compute similarity for vectors and points, Jaccard similarity is designed for data objects represented as sets (unordered collections of objects), just like the sets of bi-gram shingles. Given two sets A and B, Jaccard similarity is defined in Eq. (1).

J(A, B) = |A ∩ B| / |A ∪ B|    (1)

Pairs of reviews with similarity greater than a set threshold are considered duplicates. Here the threshold is set to 70% as used in [10].
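For illustration, the shingling step and the Jaccard computation of Eq. (1) can be sketched in Python as follows. This is a minimal sketch rather than the exact implementation used in this work; the function names and the two sample reviews are illustrative, while the 0.70 threshold comes from the text above.

```python
def bigram_shingles(review):
    # Split a review into a set of bi-gram shingles (pairs of consecutive words).
    words = review.lower().split()
    return {" ".join(pair) for pair in zip(words, words[1:])}

def jaccard(a, b):
    # Jaccard similarity of two shingle sets: |A intersect B| / |A union B| (Eq. (1)).
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

r1 = "This product is amazing"
r2 = "This product is truly amazing"
score = jaccard(bigram_shingles(r1), bigram_shingles(r2))
print(round(score, 2), score >= 0.70)  # 0.4 False -> below the duplicate threshold
```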
However, for a very large review dataset, this process leads to an enormous computational load. For example, a review with 1,000 words will generate 999 bigram shingles, and a pair-wise comparison is needed among all reviews. To improve the efficiency, the following optimizations are performed:

1) Cyclic Redundancy Check (CRC) 32 Hash and min-Hash. To ease the computation of the Jaccard similarity value, the shingles are first mapped to shingle IDs using the CRC32 hash function [2][3], which converts a variable-length string into a hexadecimal value of a 32-bit binary sequence. Now the set for each review is represented as a set of integers instead of substring shingles. Still, the set is too large to compute the similarity efficiently.

Our goal is to have a smaller representation of these large sets, called "signatures" [22]. The key property of such signatures is that they should be a good representation of the large set at a much smaller size. Signatures for each set are derived using the min-hash scheme with k hash functions, where k is a fixed integer. The value of k is set to 105 for this research. According to the Chernoff Bound [7], the expected error rate of min-Hash is O(1/√k), and in general a k around 100 leads to a small error probability [23]. Using min-Hash values avoids having to explicitly compute random permutations of all of the shingle IDs.
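A sketch of this signature step is shown below, assuming Python's built-in zlib.crc32 for the shingle IDs and a random hash family of the form (a*x + b) mod p for the k min-hash functions. Only the use of CRC32 and k = 105 come from the text; the specific hash family and constants are assumptions for illustration.

```python
import random
import zlib

K = 105                    # number of min-hash functions, as set in this study
PRIME = (1 << 61) - 1      # large prime for the (a*x + b) % p hash family (assumed)
random.seed(42)
HASH_PARAMS = [(random.randrange(1, PRIME), random.randrange(0, PRIME)) for _ in range(K)]

def shingle_ids(shingles):
    # Map each bi-gram shingle string to a 32-bit integer ID with CRC32.
    return {zlib.crc32(s.encode("utf-8")) for s in shingles}

def minhash_signature(ids):
    # Compress a set of shingle IDs into a k-value min-hash signature.
    return [min((a * x + b) % PRIME for x in ids) for a, b in HASH_PARAMS]

def signature_similarity(sig_a, sig_b):
    # The fraction of agreeing positions estimates the Jaccard similarity.
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)
```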
2) Inverted Index. After obtaining the min-Hash values that represent each review, the Jaccard similarity is approximated by the number of identical signature values in reviews A and B divided by the total number of unique signature values in them. However, comparing min-Hash values among reviews still requires O(n²) complexity, where n is the number of reviews in the dataset. To optimize the comparison, inverted indexes are built using the min-Hash values. The inverted index is a widely adopted approach in information retrieval. In our study, a dictionary is built using all min-Hash values of a product as index keys, where each index key points to the reviews containing the corresponding min-Hash value. This inverted index returns all the reviews with a given min-Hash value in O(log n) time.
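A minimal sketch of this indexing step follows; the data layout (a dictionary from review ID to signature) and the function names are illustrative assumptions. Only reviews that share at least one min-Hash value become candidate pairs, so most of the O(n²) comparisons are skipped.

```python
from collections import defaultdict
from itertools import combinations

def build_inverted_index(signatures):
    # signatures: {review_id: [min-hash values]} -> {min-hash value: [review_ids]}
    index = defaultdict(list)
    for review_id, sig in signatures.items():
        for value in set(sig):
            index[value].append(review_id)
    return index

def candidate_pairs(index):
    # Only reviews sharing at least one min-hash value are compared pair-wise.
    pairs = set()
    for review_ids in index.values():
        for a, b in combinations(sorted(review_ids), 2):
            pairs.add((a, b))
    return pairs
```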

B. Detection of Anomaly in Review Count and Rating Distribution

In a spam review detection system, identifying burst patterns of product reviews over time is significant evidence of spammer attacks. Generally, a product is expected to get reviews and ratings progressively over time at random intervals. But when spammers are hired to write fake reviews, there is a swift increase in the number of reviews within a short interval. The spammers may increase or decrease the rating value of the targeted product in such short time periods and mostly offer extreme ratings. The literature shows a high correlation between such sudden spikes in the number of reviews and star ratings of a product and spammer attacks [4][9]. Such spikes can be flagged as anomalies in the distribution. On the other hand, such anomalies may also appear for highly seasonal products or products of unexpected popularity. For example, a sunscreen product will most likely get its reviews during summer. Hence, it is important to detect anomalies in the presence of seasonality and other underlying trends [8].

Figure 2. Overall process to detect anomaly

In this study, the approach to detect anomalies in review count and rating distribution is outlined in Fig. 2. First, all the reviews of a product are extracted and binned into 30-day buckets to generate the time series distribution. Next, using the time series, a pandas dataframe is built, which is a two-dimensional tabular data structure holding the review counts over the time series. These dataframes form a seasonal univariate time series and are passed into the Seasonal Hybrid Extreme Studentized Deviate (S-H-ESD) algorithm [18]. S-H-ESD detects both global and local anomalies in the presence of seasonality and growth, and it is efficient for processing large datasets. Fig. 3 shows examples of anomalies detected by the algorithm and the types of behaviors accepted as non-anomalies [21]. It detects a sudden increase, an abnormal peak, and unusually high activity. It does not flag linear growth or linear seasonal growth, where the product has simply gained popularity over time. This supports the applicability of this algorithm for our study. As a result, the number of rating anomalies and count anomalies for each product (namely ProdRatingAnom and ProdCountAnom) can be counted.
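The binning step can be sketched with pandas as follows; the tuple layout of the input and the column names are illustrative assumptions. The anomaly detection itself uses the S-H-ESD implementation from Twitter's AnomalyDetection package [18], which is an R library, so that call is only indicated in a comment rather than reproduced here.

```python
import pandas as pd

def build_review_timeseries(reviews):
    # reviews: list of (unix_time, rating) tuples for one product.
    # Returns a DataFrame binned into 30-day buckets with review count and mean rating.
    df = pd.DataFrame(reviews, columns=["unix_time", "rating"])
    df["date"] = pd.to_datetime(df["unix_time"], unit="s")
    grouped = df.set_index("date").resample("30D")["rating"]
    binned = pd.DataFrame({
        "review_count": grouped.size(),
        "mean_rating": grouped.mean(),
    })
    return binned.fillna(0)  # empty buckets contribute a count of 0

# The resulting seasonal univariate series (review_count, and separately
# mean_rating) would then be passed to S-H-ESD [18], e.g. through a Python
# bridge such as rpy2 or a port of the algorithm.
```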
Figure 3. Example results of anomaly detection

C. Detection of Incentivized Reviews

Many e-commerce websites allow sellers to offer products for free or at high discounts in exchange for positive reviews. While some reviews include a disclaimer such as "the customer received this product in exchange for an honest review," there still exist many reviews written under such conditions without the disclaimer. [17] discussed how incentivized reviews have affected product ratings. Incentivized reviewers, though claiming to be unbiased, tend to give positive and less critical reviews for the products compared to non-incentivized reviewers. The occurrence of these reviews has transformed review panels into advertising forums. However, detecting such incentivized or biased reviews is often more challenging.

Figure 4. Process to detect incentivized reviews

Fig. 4 shows the overall steps to detect incentivized reviews. First, a collection of synonym phrases for a set of key phrases is built using the Natural Language Toolkit's WordNet interface [24]. WordNet is a large lexical database that resembles a thesaurus [24]. It labels the semantic relations among words and is widely recognized [1] for generating synonyms that are found in close proximity to one another in the network. For example, the equivalent phrases for the key phrase "honest review" are "truthful review," "genuine review," "genuine feedback," and so on according to WordNet. The dictionary is made of both single words and paired words, e.g., "discount" and "no charge." Reviews containing these synonyms are identified using regular expressions, and the time intervals are also captured for analysis.
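The synonym expansion and matching can be sketched in Python with NLTK as follows. The seed phrases and the word-by-word expansion shown here are illustrative assumptions; the study's exact key-phrase list and phrase-building rules are not reproduced.

```python
import re
from nltk.corpus import wordnet  # requires a one-time nltk.download("wordnet")

def expand_word(word):
    # Collect WordNet synonyms (lemma names) for a single word.
    variants = {word.lower()}
    for synset in wordnet.synsets(word):
        for lemma in synset.lemmas():
            variants.add(lemma.name().replace("_", " ").lower())
    return variants

KEY_PHRASES = ["honest review", "discount", "free sample"]  # illustrative seed list
VOCAB = {v for phrase in KEY_PHRASES for w in phrase.split() for v in expand_word(w)}
VOCAB.update(p.lower() for p in KEY_PHRASES)
PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(v) for v in sorted(VOCAB, key=len, reverse=True)) + r")\b",
    re.IGNORECASE,
)

def is_incentivized(review_text):
    # Flag a review if it mentions any phrase from the expanded dictionary.
    return PATTERN.search(review_text) is not None
```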
D. Generation of Credibility Score

A data-driven approach is applied to score the credibility of products. Depending on the score, a color-coded result is displayed for easy visualization. First, an average scoring scale is created for each product category. Then an individual product's scores are analyzed for credibility against the average category scoring scale.

As shown in Eqs. (2)-(5), for each given product category, the total number of reviews and the total number of products in it (listed as TotalReviews(Category) and TotalProduct(Category), respectively) are counted. The total numbers of duplicate reviews, incentivized reviews, rating value anomalies, and review count anomalies (named TotalDuplicate(Category), TotalIncentivized(Category), TotalRatingAnom(Category), and TotalCountAnom(Category), respectively) are calculated and stored. This provides the average ratio of the duplicate reviews (Eq. (2)), incentivized reviews (Eq. (3)), and anomalies (Eqs. (4) and (5)) present in the entire dataset of that product category.

AvgDuplicate = TotalDuplicate(Category) / TotalReviews(Category)    (2)
AvgIncentivized = TotalIncentivized(Category) / TotalReviews(Category)    (3)
AvgRatingAnom = TotalRatingAnom(Category) / TotalProduct(Category)    (4)
AvgCountAnom = TotalCountAnom(Category) / TotalProduct(Category)    (5)

For each product analyzed under that product category, the individual product's duplicate and incentivized review ratios, named ProdDuplicate and ProdIncentivized, are calculated accordingly in Eqs. (6) and (7).

ProdDuplicate = TotalDuplicate(Prod) / TotalReviews(Prod)    (6)
ProdIncentivized = TotalIncentivized(Prod) / TotalReviews(Prod)    (7)

Here TotalReviews(Prod), TotalDuplicate(Prod), and TotalIncentivized(Prod) refer to a particular product's total number of reviews, total number of duplicate reviews, and total number of incentivized reviews, respectively. In our current study, the four values for an individual product, namely ProdRatingAnom and ProdCountAnom (obtained from Section II.B) and ProdDuplicate and ProdIncentivized (from Eqs. (6) and (7)), are compared to the corresponding category average values. If any individual value is greater than its corresponding average category value, we consider the product reviews to have one warning sign. So each product can have from 0 to 4 warning signs, which determines the level of credibility of the reviews. Table I shows the color-coded smileys used in this study corresponding to the number of warning signs.

TABLE I. SCORING SCHEME FOR CREDIBILITY OF PRODUCT REVIEWS

# of warning signs | Color-coded emoji | Description
0 | (smiley image) | Nothing of concern
1 | (smiley image) | Some concern
>= 2 | (smiley image) | Highly concerning
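The warning-sign comparison can be summarized in a short Python sketch; the dictionary layout and the example numbers are illustrative, not values from the paper's dataset.

```python
def warning_signs(prod, avg):
    # Count how many of the four product-level values exceed the category averages.
    keys = ["Duplicate", "Incentivized", "RatingAnom", "CountAnom"]
    return sum(prod[k] > avg[k] for k in keys)

def credibility_label(n_warnings):
    # Map the number of warning signs to the three-level scheme of Table I.
    if n_warnings == 0:
        return "nothing of concern"
    if n_warnings == 1:
        return "some concern"
    return "highly concerning"

prod = {"Duplicate": 0.08, "Incentivized": 0.12, "RatingAnom": 2, "CountAnom": 1}
avg = {"Duplicate": 0.03, "Incentivized": 0.05, "RatingAnom": 1, "CountAnom": 1}
print(warning_signs(prod, avg), credibility_label(warning_signs(prod, avg)))  # 3 highly concerning
```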

III. EXPERIMENTAL RESULTS

To evaluate the proposed methods, six product categories from the Amazon review dataset [6] are used as our experimental data. Table II shows the number of reviews and the number of products under each product category. The test machine is a MacBook Pro with a 2.7 GHz Intel Core i5 processor and 8 GB of RAM.

TABLE II. SIX AMAZON PRODUCT CATEGORIES FOR EXPERIMENTS

The correctness and accuracy of the outputs from each method are presented in Sections III.A to III.C. All the manual text evaluation is mainly done on the Automotive dataset as it has a smaller number of reviews. Section III.D examines the efficiency, and Section III.E analyzes the effectiveness of using the combination of methodologies to identify the credibility score.

A. Detection of Duplicate Reviews

The output from the duplicate detection method is stored in a separate CSV file for each product category. The columns of this CSV file are product X's asin, review rating, and unixtime; product Y's asin, review rating, and unixtime; and the similarity score between X and Y.
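Such an output file can be inspected with pandas as sketched below; the header names and the file name are assumptions for illustration, since the paper lists the fields but not the exact header strings.

```python
import pandas as pd

COLUMNS = ["asin_x", "rating_x", "unixtime_x",
           "asin_y", "rating_y", "unixtime_y", "similarity"]

dupes = pd.read_csv("Automotive_duplicates.csv", names=COLUMNS)  # hypothetical file name
suspicious = dupes[dupes["similarity"] >= 0.70].sort_values("similarity", ascending=False)
print(suspicious.head())
```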

Figure 5. Screenshot of the example duplicate detection output

To validate the output, example duplicate detection results are presented in Fig. 5. The highlighted rows 366 and 367 show high similarity scores that indicate potential duplicate reviews between products B00155237W and B00063X7KG, and between B00155237W and B002XOXSI2. To validate the results, the product asins are used to find the product descriptions on Amazon.com:

• B00155237W: Cruiser Accessories 76200 Tuf Flat Shield Novelty / License Plate Shield
• B00063X7KG: Meguiar's G1016 Smooth Surface Clay Kit
• B002XOXSI2: Meguiar's G110V2 Professional Dual Action Polisher

Based on the unixtime, the reviewer "Mack Wu" has written similar reviews for all three different types of products at the same time (reviewTime: 04 15, 2013), which indicates spam. This is in line with the observation made by [10].

B. Detection of Anomaly in Review Count and Rating Distribution

Figs. 6 and 7 show some examples of the anomalies detected in rating distribution and review count, where the red circles indicate the anomalies detected by the algorithm discussed in Section II.B. In Fig. 6, the algorithm detects an anomaly only in the review count. There is an unusually high number of reviews for the product during the year 2013, while the rating distribution stays between 4 and 5 stars throughout the review time period. The increase in the review count helps increase the product ranking. This spike in the review count is swift and short-lived, which indicates suspicious spam activity.

Figure 6. Example #1 graph showing the anomaly detection

In Fig. 7, the algorithm detects 3 anomalies in review count and 4 anomalies in rating distribution. The graph illustrates that when the review counts are high, the product has better ratings. In other words, genuine reviews give a lower rating than the potentially fake reviews that were registered during the spikes. This attests that it is important to look for anomalies in both review count and rating distribution.

Figure 7. Example #2 graph showing the anomaly detection

C. Detection of Incentivized Reviews

The main purpose of this detection is to identify whether a product received the majority of its reviews when it was at a discounted price or given for free. As mentioned in the literature [17], these reviews are normally long and most of them are positive reviews with a high star rating. Here are a few review examples detected by the algorithm; to save space, the long reviews are shortened to highlight the incentivized phrases.

"I've received this as a 'for review' unit but the seller did not ask or even imply that a 'positive' review was expected in exchange…"

"I tested Leather Nova against Lexol .. I received a complimentary Leather Nova sample and this is my honest review."

The use of the WordNet natural language processing toolkit has helped detect all the different phrases related to incentives and discounts.

D. Efficiency

The system can perform product credibility analysis efficiently. Running on a MacBook Pro, the system is able to generate the credibility report for all the product categories (more than 2 million reviews) within an hour. This makes the system usable for online website deployment. It is well adapted to large datasets and could also be used for other datasets with few changes to the feature names.

E. Credibility Score

As explained in Section II.D, each product category in the Amazon dataset has its own adaptive scoring scale. For example, it is relatively common for books to get a high ratio of reviews from promotional copies, which are considered incentivized reviews. Therefore, the scale should be contextual and adaptive based on product categories, so that noise in the results is diminished by this adaptive methodology. Table III shows the total number of duplicate reviews, incentivized reviews, and anomalies in review count and rating distribution for each product category in the Amazon dataset.

TABLE III. CREDIBILITY SCORING SCALE FOR THE SIX DATASETS

Given a baby product B00012Q0F4 (see Fig. 8), its individual product values are computed and compared with the average Baby category scores. As we can see, with all four values showing warning signs (i.e., higher than the corresponding category scoring scales), a red credibility score is assigned.

Figure 8. Red credibility score example

IV. CONCLUSIONS

In this paper, we present a multi-dimensional analysis to detect spam reviews. It generates credibility reports for given products based on three methods: detection of duplicate reviews, detection of anomaly in review count and rating distribution, and detection of incentivized reviews. All the methods provide useful information, serving as an overlay to enable the discovery of fake reviews. To the best of our knowledge, no other work in the literature has investigated the integration of these algorithms into a practical and deployable framework to determine credibility. In addition, the experiments are conducted on 2,387,645 reviews belonging to 104,581 products across 6 product categories, which is much larger than other projects reported in the literature. The results show the effectiveness and efficiency of our proposed approach.

REFERENCES

[1] Bird, S., and Loper, E. NLTK: the Natural Language Toolkit. In Proceedings of the ACL Interactive Poster and Demonstration Sessions (2004), Association for Computational Linguistics, p. 31.
[2] Broder, A. Z. On the resemblance and containment of documents. In Compression and Complexity of Sequences 1997, Proceedings (1997), IEEE, pp. 21-29.
[3] CRC. CRC32 hash: https://docs.aws.amazon.com/redshift/latest/dg/crc32-function.html.
[4] Feng, S., Xing, L., Gogar, A., and Choi, Y. Distributional footprints of deceptive product reviews. ICWSM 12 (2012), 98-105.
[5] Fusilier, D. H., Montes-y Gómez, M., Rosso, P., and Cabrera, R. G. Detection of opinion spam with character n-grams. In International Conference on Intelligent Text Processing and Computational Linguistics (2015), Springer, pp. 285-294.
[6] He, R., and McAuley, J. Ups and downs: Modeling the visual evolution of fashion trends with one-class collaborative filtering. In Proceedings of the 25th International Conference on World Wide Web (2016), International World Wide Web Conferences Steering Committee, pp. 507-517.
[7] Hellman, M., and Raviv, J. Probability of error, equivocation, and the Chernoff bound. IEEE Transactions on Information Theory 16, 4 (1970), 368-372.
[8] Hochenbaum, J., Vallis, O. S., and Kejariwal, A. Automatic anomaly detection in the cloud via statistical learning. arXiv preprint arXiv:1704.07706 (2017).
[9] Hu, N., Koh, N. S., and Reddy, S. K. Ratings lead you to the product, reviews help you clinch it? The mediating role of online review sentiments on product sales. Decision Support Systems 57 (2014), 42-53.
[10] Jindal, N., and Liu, B. Review spam detection. In Proceedings of the 16th International Conference on World Wide Web (2007), ACM, pp. 1189-1190.
[11] Jindal, N., and Liu, B. Opinion spam and analysis. In Proceedings of the 2008 International Conference on Web Search and Data Mining (2008), ACM, pp. 219-230.
[12] Jindal, N., Liu, B., and Lim, E.-P. Finding unusual review patterns using unexpected rules. In Proceedings of the 19th ACM International Conference on Information and Knowledge Management (2010), ACM, pp. 1549-1552.
[13] Kost, A. Woman paid to post five-star Google feedback. ABC7 News (2012).
[14] Meyer, D. Fake reviews prompt Belkin apology. CNET News (2009).
[15] Miller, C. Company settles case of reviews it faked. New York Times (2009).
[16] Popken, B. Ways you can spot fake online reviews. The Consumerist (30).
[17] ReviewMeta. Analysis of 7 million Amazon reviews: https://reviewmeta.com/blog/analysis-of-7-millionamazon-reviews-customers-who-receive-free-or-discounted-item-much-more-likelyto-write-positive-review/.
[18] S-H-ESD. Anomaly detection with R, code access: https://github.com/twitter/AnomalyDetection.
[19] Streitfeld, D. For $2 a star, an online retailer gets 5-star product reviews. New York Times 26 (2012).
[20] Topping, A. Historian Orlando Figes agrees to pay damages for fake reviews. The Guardian 16 (2010).
[21] Twitter. Anomaly detection with Twitter graph: https://anomaly.io/anomaly-detection-twitter-r/.
[22] Ullman, J. Jaccard similarity. Stanford InfoLab: http://infolab.stanford.edu/~ullman/mmds/ch3.pdf, 1992, ch. 3.
[23] Vassilvitskii, S. Dealing with massive data: http://www.cs.columbia.edu/~coms699812, 2011, pp. 1-30.
[24] WordNet. WordNet with NLTK: https://pythonprogramming.net/wordnet-nltk-tutorial/.
