Abstract— Social media has increasingly prompted users to provide customer feedback on shopping experiences in the form of online product reviews. Many major e-commerce sites collect revenue from product advertisements and create better shopping experiences for customers with the help of such reviews. The problem of accurately verifying review authenticity steadily grows, and its solution can make or break the good name of a product. Feature engineering performed on the reviews assists in extracting the useful features that help in identifying fake reviews. Supervised machine learning algorithms help in classifying reviews as authentic or fake. The newly emerging phenomenon of personality prediction has taken hold of social media, and it is employed here on reviewer traits to identify the key personality traits of fake reviewers. The Big Five model is used for this, which will be useful in tracking such people through their associated social media accounts.

Keywords—product review, personality trait, social media, opinion mining, feature extraction, review deviation, content similarity, spam detection.

I. INTRODUCTION

The onset of social media has brought about massive impacts in all spheres of life. From casual comments on social networking sites to more serious studies concerning market trends and strategies, all of these have been possible largely due to the rising influence of social media data. Data mining tasks applied to Big Data have led to knowledge discovery that helps analyze market trends and target customer interests to bring in bigger profits for organizations. People also continue to express their interests and opinions through various social platforms, and the collective analysis of these texts has contributed to the growth of sentiment analysis. Social media analytics has gained much importance and interest in present times, judging from the rapidly growing participation in online social media, which has risen from 25% to 43% in many countries. This has motivated studies which focus on extracting knowledge from opinions expressed on social platforms like Twitter and Facebook and on e-commerce sites such as Amazon. One such form of social media data is the product review, written by customers narrating their level of satisfaction with a particular product. Organizations such as Amazon are involved in hundreds of thousands of electronic transactions every day, where customer reviews help new buyers make a better shopping experience and also guide organizations with marketing and product design strategies.

Most review sites are customer centric, which makes the presence of fake and unfair competitor-placed reviews an everyday problem. These kinds of untruthful reviews bring huge financial gain and attention to businesses and organizations. This has led to the widely known problem of opinion spamming, seen commonly in the form of fake reviews, social network postings such as tweets, blogs and other forms of deception. These are different forms of illegal activities which aim at misleading readers by giving undeserving positive opinions in order to promote some entities, or false negative opinions in order to damage the reputation of others. The current work deals with accurately differentiating fake and authentic product reviews from online shopping sites, predominantly Amazon. Going through such a vast amount of information to understand the general opinion is impossible for users, by the sheer volume of data alone. Every customer is influenced by recent market trends and public opinion, hence the need for such a system. Further, labeling these documents with their sentiment would provide readers with a succinct summary of the general opinion regarding an entity. Before every purchase, a customer analyzes the most recent reviews provided by reliable users of the product, and the question arises as to which reviews are reliable or authentic. Determining a reliability score for each review is not an easy task, and these opinions shape the future of the product. Thus the task of detecting and verifying authentic reviews or opinions has become critical and significant.

II. RELATED WORKS

The truthfulness of online product reviews has been widely discussed in the past. One of the earlier works, by Dixit et al. [6], distinguished reviews based on their content as untruthful reviews, reviews on brands, and non-reviews. The first category, untruthful reviews, is of most concern, as they undermine the integrity of the online review system. Detection of type 1 review spam is a challenging task, as fake and real reviews are hard to distinguish even by manually reading them.

Data mining and machine learning techniques together have been effective in detecting fraudulent reviews. A straightforward example of web content mining is opinion
question mark etc., which help in building a grammatical vector representing each reviewer. The Big Five describes a personality structure divided into five basic elements known as OCEAN: Openness, Conscientiousness, Extroversion, Agreeableness and Neuroticism. These are the five core personality traits; the presence of each trait is measured for each reviewer and the most common traits identified. Since a reviewer cannot be analyzed in terms of all five traits of the model, only the traits extroversion, neuroticism and openness are considered in this work.

A. Preprocessing

Preprocessing is the first step in any text mining process and plays a very important role. The most common words in text documents are articles, prepositions and pronouns, which do not add any specific meaning to the text. These words are treated as stop words. Stop words are removed from a text because they take up additional processing, and removing them reduces the dimensionality of the review text being analyzed. Many stop word removal methods can be used to remove stop words from the data, which in this case are the product reviews. The classic method is based on removing stop words obtained from an already pre-compiled list. Another approach removes words with low inverse document frequency (IDF) or low mutual information; a lower mutual information value suggests that the stop word has low discrimination power and hence can be removed from the review text. The classical method has been adopted in the proposed system, which uses a file of stop words and removes every occurrence of each such word from the review text.

The second step of preprocessing uses POS tagging. Traditional parts of speech are nouns, verbs, adverbs, conjunctions, etc. Part-of-speech taggers typically take a sequence of words (i.e. a review) as input and provide a list of tuples as output, where each word is associated with the related tag based on both its definition and its context. The tagger used in this work is the Stanford POS tagger, which provides an accuracy of 96.7%. Taggers are available for various languages; the English taggers use the Penn Treebank tag set. In the proposed system the tagger takes as input a file containing review text and annotates each word with its corresponding tag. This can be used to find the count of each part of speech in a review, which will later be used in the personality prediction module.

B. Feature Extraction

The features are divided into two categories: review centric and reviewer centric features. Review centric features are constructed using information available from a single review, while reviewer centric features adopt a holistic approach, taking into account all of the reviews written by a particular reviewer along with any additional information regarding the reviewer himself. In this case, profile information relating to a reviewer is not available, so we rely only on the information collected from the past reviews of a reviewer to build the reviewer centric features. The four basic features used in the proposed system are defined as follows.

• Review deviation

Every review is marked by a reviewer with a review rating that denotes the level of satisfaction achieved from using the product. For a product p, the rating is marked on a five-point scale ranging from a very bad rating of 1 to a very high rating of 5. The maximum deviation possible for any product is therefore 4. To calculate the review deviation for a product by a given reviewer, the average rating of all customers except the current reviewer is taken and the review deviation is calculated; to normalize the score, this value is divided by a factor of 4. This is expressed by the formula defined below. If the average deviation is greater than a selected threshold value, then the reviewer is marked as likely fake.

RD(p) = |R(p) − R(avg, p)| / 4    (1)

• Burst Detection

There are durations within a product's lifetime when a large number of reviews flood in. These may be due to the sudden popularity of the product, which may be attributed to advertisements and other seasonal promotions, or due to fake reviewer groups who write many reviews in a day on various products. These are known as burst periods, and the detection of such burst periods is an indication of likely fake reviews. Given a product which has a set of m reviews, the first step is to divide the life span of the product into small sub-intervals or bins, by choosing a proper bin size (e.g. two weeks). The next step is finding the average number of reviews in each bin. All those months for which the review count is greater than the average are marked with flag 1, denoting an exceptionally high review count for that duration, and the others are marked 0. This is done for each year in the lifespan of the product. After each month of every year is labeled either 1 or 0, we next select every two years and check whether recurring patterns occur, such as seasonal offers or promotions, which would explain the higher review count for that month. If such a repeating pattern is observed within a span of two years, the burst flag is labeled 0; otherwise it is labeled 1. This means that every month which has a higher review count in a particular year without repeating is prone to contain fake reviews; all reviews for the product during that particular month are marked 1, denoting likely fake. This factor is given lesser weight than the other features, because the burst flag does not prove all reviews during that period to be fake; only a subset of such reviews may actually be fake.

• Review Content Similarity

This feature has been used previously and has proved useful in identifying fake reviews. This is clearly a reviewer centric feature, as it takes into account all the reviews written by a reviewer and measures the content similarity value between each pair of reviews. Content similarity can be found using various methods such as cosine similarity, Jaccard similarity and other similarity measures. Cosine similarity is selected because it gives a fast and accurate value for similarity between
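As a rough sketch of the review deviation of Eq. (1) and the word-count cosine similarity described above, the two features can be computed as follows. This is a minimal plain-Python illustration; the function and variable names are our own, and the proposed system's actual implementation may differ.

```python
import math
from collections import Counter

def review_deviation(rating, other_ratings):
    """RD(p) = |R(p) - R(avg, p)| / 4, where R(avg, p) is the average
    rating given to product p by all customers except this reviewer."""
    avg = sum(other_ratings) / len(other_ratings)
    return abs(rating - avg) / 4  # divide by 4, the maximum possible deviation

def cosine_similarity(review_a, review_b):
    """Cosine similarity between two review texts over word-count vectors."""
    va = Counter(review_a.lower().split())
    vb = Counter(review_b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0

# A reviewer rating a product 5 when everyone else averages 1.5
# yields a normalized deviation of |5 - 1.5| / 4 = 0.875.
print(review_deviation(5, [1, 2, 2, 1]))                     # 0.875
# Near-duplicate reviews by the same reviewer score close to 1.
print(cosine_similarity("great product", "great product"))   # 1.0
```

A deviation above the chosen threshold, or a high average pairwise similarity across a reviewer's reviews, would mark the reviewer as likely fake.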
2016 International Conference on Emerging Technological Trends [ICETT]
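The monthly flagging step of the burst-detection feature described above can be sketched as below. This is a simplified illustration with our own names: it flags months whose review count exceeds that year's monthly average; the follow-up pass over pairs of consecutive years, clearing flags for months that recur seasonally, is omitted here.

```python
from collections import Counter

def monthly_burst_flags(review_dates):
    """Given review dates as (year, month) pairs for one product, mark each
    month 1 if its review count exceeds that year's monthly average, else 0."""
    per_year = {}
    for year, month in review_dates:
        per_year.setdefault(year, Counter())[month] += 1
    flags = {}
    for year, counts in per_year.items():
        avg = sum(counts.values()) / len(counts)  # average reviews per active month
        for month, n in counts.items():
            flags[(year, month)] = int(n > avg)
    return flags

# July 2015 floods with reviews while other months stay quiet.
flags = monthly_burst_flags([(2015, 7)] * 10 + [(2015, 1), (2015, 3)])
print(flags[(2015, 7)])  # 1 -> likely burst month
print(flags[(2015, 1)])  # 0
```

Reviews falling in flagged months that do not repeat across consecutive years would then be marked likely fake, with the lesser weight noted above.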
each label will correspond to a personality trait in the Big Five model as a binary problem; that is, a problem in which each review text may or may not contain a personality trait. Let R be a reviewer characterized by the traits extroversion and openness. After the transformation, this reviewer has the labels for extroversion and openness marked '1' and the other three personality traits, agreeableness, conscientiousness and neuroticism, marked '0'. In the subsequent classification module, each classifier is assigned the role of determining whether a reviewer has a given personality trait; the presence of a trait is marked by 1 and its absence by 0. Only three classifiers are used in the present work because, as previously explained, review texts would not be useful in detecting traits like conscientiousness and agreeableness.

C. Classification module

One of the most commonly used methods of classification in machine learning is the use of semi-supervised techniques, which use a small number of labeled data and a large number of unlabeled data. This is ideal in the present case, as the product reviews do not come with text previously annotated with personality traits. Personality prediction has been carried out on many social networking sites, but to the best of our knowledge it has not been applied to product reviews. As a labeled dataset is not available in the present scenario, we take up an approach from a previous work in which prediction was performed on a set of tweets. The dataset for this purpose is synthetically created by analyzing a similar dataset of tweets whose approximate length corresponds to the review lengths of the fake reviewers. The data matching the overall length of the review texts was analyzed, and a dataset was created with labeled traits of extroversion, neuroticism and openness. The grammatical features are studied and a comparative study is done to create a small labeled set of traits for each review. This method was adopted because of the lack of a labeled set for product reviews.

used product reviews from Amazon belonging to the musical instruments category, and the features extracted are subjective to this category of products. Each review is represented by a set of features including reviewer name and ID, product code, user review rating, review date and review text.

All reviews are initially loaded from a file into a table, and each reviewer's details are inserted into a separate Reviewer table. Content similarity values between every two reviews of a particular reviewer are calculated and saved against each distinct reviewer. Against each reviewer, the number of reviews written during burst periods is also noted, which comprises the burst ratio feature. Sentiment scores are calculated using deep learning approaches: each sentence of a review is annotated, its sentiment calculated, and the average score for the review computed. This in turn is used to calculate the overall sentiment score of each reviewer. Burst detection is carried out per product id, for each year in which the product was reviewed. Every month of the year is marked 1 or 0, indicating the presence or absence of a burst. If a burst month reappears in two consecutive years, the scenario is considered to be due to natural reasons such as seasonal promotions which occur during the same months every year. In this case, only a small weight is added to this feature, because this does not always have to be the case, as fake reviews are also present during seasonal times. Every other burst month in the year under observation is marked likely fake with a flag of 1.

Therefore we can treat the spam reviews as belonging to the positive class and non-spam reviews as belonging to the negative class. A classifier can then be built to separate the two classes of reviews. The values for accuracy, precision, recall and f-measure are described in
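The binary-relevance label transformation and the evaluation measures mentioned above can be illustrated as follows. This is a minimal sketch with our own function names; the trait example mirrors the reviewer R characterized by extroversion and openness.

```python
BIG_FIVE = ["openness", "conscientiousness", "extroversion",
            "agreeableness", "neuroticism"]

def binary_labels(traits):
    """Binary-relevance transform: one 0/1 label per Big Five trait,
    so each trait can be handled by its own binary classifier."""
    return {t: int(t in traits) for t in BIG_FIVE}

def precision_recall_f1(predicted, actual):
    """Precision, recall and F-measure for 0/1 predictions."""
    tp = sum(p == 1 and a == 1 for p, a in zip(predicted, actual))
    fp = sum(p == 1 and a == 0 for p, a in zip(predicted, actual))
    fn = sum(p == 0 and a == 1 for p, a in zip(predicted, actual))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Reviewer R with extroversion and openness: those labels 1, the rest 0.
print(binary_labels({"extroversion", "openness"}))
```

Each of the three trait classifiers (extroversion, neuroticism, openness) is then trained and evaluated against its own binary label vector.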
[7] Dave, Kushal, Steve Lawrence, and David M. Pennock. "Mining the peanut gallery: Opinion extraction and semantic classification of product reviews." Proceedings of the 12th international conference on World Wide Web. ACM, 2003.
[8] Fei, Geli, et al. "Exploiting Burstiness in Reviews for Review Spammer Detection." ICWSM 13 (2013).
[9] Feng, Song, et al. "Distributional Footprints of Deceptive Product Reviews." ICWSM 12 (2012).
[10] Jindal, Nitin, Bing Liu, and Ee-Peng Lim. "Finding unusual review patterns using unexpected rules." Proceedings of the 19th ACM international conference on Information and knowledge management. ACM, 2010.
[11] Jindal, Nitin, and Bing Liu. "Opinion spam and analysis." Proceedings of the 2008 International Conference on Web Search and Data Mining. ACM, 2008.
[12] Jindal, Nitin, and Bing Liu. "Review spam detection." Proceedings of the 16th international conference on World Wide Web. ACM, 2007.
[13] Judge, Timothy A., et al. "The big five personality traits, general mental ability, and career success across the life span." Personnel psychology 52.3 (1999).
[14] Khan, Khairullah, et al. "Mining opinion components from unstructured reviews: A review." Journal of King Saud University-Computer and Information Sciences (2014).
[15] Khan, Khairullah, Baharum B. Baharudin, and Aurangzeb Khan. "Mining opinion from text documents: A survey." 2009 3rd IEEE International Conference on Digital Ecosystems and Technologies. IEEE, 2009.
[16] Laryea, Bryan Nii Lartey, et al. "Web Application for Sentiment Analysis Using Supervised Machine Learning." International Journal of Software Engineering and Its Applications 9.1 (2015).
[17] Lau, Raymond YK, et al. "Text mining and probabilistic language modeling for online review spam detecting." ACM Transactions on Management Information Systems 2.4 (2011).
[18] Li, Jiwei, et al. "Towards a General Rule for Identifying Deceptive Opinion Spam." ACL (1). 2014.
[19] Li, Fangtao, et al. "Learning to identify review spam." IJCAI Proceedings-International Joint Conference on Artificial Intelligence. Vol. 22. No. 3. 2011.
[20] Lim, Ee-Peng, et al. "Detecting product review spammers using rating behaviors." Proceedings of the 19th ACM international conference on Information and knowledge management. ACM, 2010.
[21] Liu, Bing, et al. "Building text classifiers using positive and unlabeled examples." Data Mining, 2003. ICDM 2003. Third IEEE International Conference on. IEEE, 2003.
[22] Mukherjee, Arjun, et al. Fake review detection: Classification and analysis of real and pseudo reviews. UIC-CS-03-2013. Technical Report, 2013.
[23] Mukherjee, Arjun, et al. "Spotting opinion spammers using behavioral footprints." Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2013.
[24] Mukherjee, Arjun, et al. "What yelp fake review filter might be doing?." ICWSM. 2013.
[25] Mukherjee, Arjun, Bing Liu, and Natalie Glance. "Spotting fake reviewer groups in consumer reviews." Proceedings of the 21st international conference on World Wide Web. ACM, 2012.
[26] Narayanan, Vivek, Ishan Arora, and Arjun Bhatia. "Fast and accurate sentiment classification using an enhanced Naive Bayes model." International Conference on Intelligent Data Engineering and Automated Learning. Springer Berlin Heidelberg, 2013.
[27] Pang, Bo, and Lillian Lee. "Opinion mining and sentiment analysis." Foundations and trends in information retrieval (2008).
[28] Ott, Myle, et al. "Finding deceptive opinion spam by any stretch of the imagination." Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies-Volume 1. Association for Computational Linguistics, 2011.
[29] Ott, Myle, Claire Cardie, and Jeffrey T. Hancock. "Negative Deceptive Opinion Spam." HLT-NAACL. 2013.
[30] Qian, Tieyun, and Bing Liu. "Identifying Multiple Userids of the Same Author." EMNLP. 2013.
[31] Shojaee, Somayeh, et al. "Detecting deceptive reviews using lexical and syntactic features." 2013 13th International Conference on Intelligent Systems Design and Applications. IEEE, 2013.
[32] Saulsman, Lisa M., and Andrew C. Page. "The five-factor model and personality disorder empirical literature: A meta-analytic review." Clinical psychology review (2004).
[33] Virmani, Deepali, Vikrant Malhotra, and Ridhi Tyagi. "Sentiment Analysis Using Collaborated Opinion Mining." arXiv preprint arXiv (2014).
[34] Xie, Sihong, et al. "Review spam detection via temporal pattern discovery." Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2012.
[35] McAuley, Julian, Rahul Pandey, and Jure Leskovec. "Inferring networks of substitutable and complementary products." Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2015.