
2nd International Conference on Advances in Science & Technology (ICAST-2019)

K. J. Somaiya Institute of Engineering & Information Technology, University of Mumbai, Maharashtra, India

COMM-AN
Opinion Mining of Customer Feedback

Delna Madan
K J Somaiya Institute of Engineering and Information Technology, Sion, Mumbai
University of Mumbai
delna.madan@somaiya.edu

Mohan Jobanputra
K J Somaiya Institute of Engineering and Information Technology, Sion, Mumbai
University of Mumbai
mohan.j@somaiya.edu

Harvi Shah
K J Somaiya Institute of Engineering and Information Technology, Sion, Mumbai
University of Mumbai
harvi.shah@somaiya.edu

Sarita Rathod
K J Somaiya Institute of Engineering and Information Technology, Sion, Mumbai
University of Mumbai
sarita.r@somaiya.edu

Abstract— Product reviews have now moved to online platforms, and the sheer number of reviews over the past few years has skyrocketed. It is impossible to read them all and get an overall view of what customers feel about a product. The idea of this project is to take over the time-consuming job of examining the fetched data and unearthing its underlying meaning. This is done by first gathering the data, cleaning/pre-processing it, clustering it, and finally representing the yield in a diagrammatic/graphical format. We aim to design this project for creators, developers, and producers rather than for the customers browsing through the site(s).

Keywords— Natural Language Processing (NLP), Corpus, Opinion Mining, Sentiment Analysis, Topic Modelling, Document-Term Matrix.

I. INTRODUCTION

Nowadays, the way to get a clear picture of what users truly feel about a product is to sift through the large number of user reviews and feedback. Having human personnel do this job is not only very time consuming but also an expensive endeavour; instead, it is wiser to have a computer system do this menial job of going through customer reviews and finding a general consensus.
We also need to take into account that not all of the received data (customer reviews and comments) will be genuine; we can naturally assume it will contain some malicious spam content as well. To tackle this, the system will follow a set of rules to filter out such content to the best of its ability.
To understand customer feedback, we must first understand the types of feedback most important to the business in question. This is normally best done by a human, who can understand the underlying intent behind language, but nowadays this is usually impossible since the number of feedback items a company/organization receives is too large for a person or even a group of people to handle [1]. This menial work can instead be left to a machine/system to first differentiate fake reviews and comments, then filter them out, and lastly sort the reviews and comments according to what the user has entered.
A general rule that can be applied to help make sense of customer feedback is to group it by:
1. Type (classifying into categories)
2. Theme (what is mentioned/asked/complained about)
3. Code (the purpose of the comments, in a concise manner)
After classifying all of the reviews and comments as mentioned above, we can simply see which category, theme or feature is the most popular. This information, when forwarded to the concerned organization, is very useful and can be used to decide any future actions [2].

II. LITERATURE SURVEY

In paper [1], the overall goal of the papers studied is to develop a common concept of data management to make data meaningful and to ensure maximum re-usability; to speed up the classification process, the benchmarking data are analyzed using natural language processing (NLP) techniques.
In paper [2], in the domain of IT benchmarking, collected data are often stored as natural language text and are therefore intrinsically unstructured. To ease data analysis and data evaluation across different types of IT benchmarking approaches, a semantic representation of this information is crucial. Thus, the identification of conceptual (semantic) similarities is the first step in the development of an integrative data management in the domain of IT benchmarking.


This paper explains a part of the second phase of the proposed system, the Document Classification phase. In this phase we take the previously cleaned data from the earlier phase, POS-tag the words in the article, and compare the words with the Opinion lexicon as well as the Opinion target words stored in our database. From the words found we prepare an appropriate seed list containing the overall negative or positive score of the particular article, and finally classify the article as Positive, Neutral or Negative opinionated.
In paper [3], a system was proposed for feature extraction and for training a classifier to test whether an article is Positive, Negative or Neutral, and also to remove irrelevant ad content. The proposed system would work in two modes, Online as well as Offline. Along with this, a Clarity score is calculated using the Divergence algorithm. For feature extraction purposes, the comments that are dependent on or independent of the domain are separated. In feature extraction, an NER tagger, a Naive Bayes classifier and the Porter Stemming algorithm are used.
In paper [4], a probabilistic approach based on Topic Modeling is used for more finely grained opinion mining. This system applies the model to search for aspects relating to various segmentations of the data, which includes classifying various age groups and other such attributes.
In paper [5], different techniques and concepts are mentioned which are to be used in different applications. Notably, it explains mining of the data and the tools used in knowledge discovery from data (KDD). Prior to explaining the process of data mining, this edition also covers methods for preprocessing, processing and warehousing the data, along with information about data warehouses, online analytical processing (OLAP), and data cube technology.

III. PROPOSED METHODOLOGY

The objective of the proposed system is to aid organizations in achieving their goals by analyzing customer feedback; the system is capable of acquiring, cleaning and analyzing the data automatically. Fig. 1 shows the 8-step NLP workflow used for this system, and it is explained below:
A. Start with a question: Ask the question prior to defining the data, and build the process around it.
B. Collection of data: Data is collected from various sources to perform appropriate actions on it.
C. Data pre-processing: The collected data is pre-processed, i.e. the unwanted data is removed.
D. Data cleaning: The data is cleaned; all the stop words, words containing numbers, etc. are removed, after which we can make use of the data.
E. Perform EDA: A statistical method used to make sense of the data and of how the data can make sense.
F. Apply suitable techniques: We apply suitable techniques to develop methods and obtain results; this could be, for example, a linear regression.
G. Share insights: Showing the detailed graphical report.
H. Review and update: Tweaking the above graph to reach our intended goal.

Fig. 1 : 8 Step Workflow
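The workflow above can be summarized as a small pipeline skeleton. The following is a minimal sketch only; the function name, the file 'reviews.csv' and the 'comment' column are illustrative assumptions, not the system's actual implementation.

# Minimal sketch of the 8-step workflow; names and file layout are assumptions.
import pandas as pd

def run_feedback_pipeline(csv_path, question="What do customers think of the product?"):
    # A. Start with a question: the question drives the rest of the process.
    print("Question:", question)
    # B. Collection of data: reviews are assumed to sit in a CSV with a 'comment' column.
    data = pd.read_csv(csv_path)
    # C/D. Pre-processing and cleaning (lower-casing here; stop words, numbers, etc. later).
    data["clean"] = data["comment"].astype(str).str.lower()
    # E. Perform EDA: basic summary statistics of the cleaned comments.
    print(data["clean"].str.len().describe())
    # F. Apply suitable techniques (sentiment analysis, topic modeling, ...),
    # G. share the insights as charts, and H. review and update the results.
    return data

data = run_feedback_pipeline("reviews.csv")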
A. NLP techniques used in this system
● Sentiment analysis: This helps in determining the emotional intent behind the feedback that a customer gives; it is the activity of gauging the psychological tone underneath the words. It also gives us an overview of the public opinion of a product or topic.
● Topic modeling: This type of modelling involves a mathematical and statistical approach for discovering the abstract topics present in a collection of documents. With the help of these models, organizations can make better decisions. Unlike other approaches, it does not use regular expressions or dictionary-based keyword searching techniques. It is an unsupervised learning approach used to group related texts into clusters according to their content (called "topics").
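As an illustration of the topic modeling technique, the following is a minimal sketch using gensim's LDA model; the example comments and the choice of two topics are assumptions for demonstration, not data or settings from the actual system.

# Minimal LDA topic-modeling sketch (gensim); comments and topic count are illustrative.
from gensim import corpora, models

comments = [
    "battery life is great and the screen is bright",
    "battery drains fast, very poor battery life",
    "delivery was late and the packaging was damaged",
    "fast delivery, well packaged product",
]

tokenized = [c.lower().split() for c in comments]          # simple whitespace tokenization
dictionary = corpora.Dictionary(tokenized)                 # term <-> id mapping
bow_corpus = [dictionary.doc2bow(t) for t in tokenized]    # bag-of-words per comment

lda = models.LdaModel(bow_corpus, num_topics=2, id2word=dictionary, passes=10)
for topic_id, words in lda.print_topics(num_words=4):      # top words per discovered topic
    print(topic_id, words)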
B. Technologies used to implement NLP
➔ Programming: For handling the data we use the pandas package, as it is open source, flexible and easy to use. It provides high-level manipulation and analysis of data sets using its robust data structures.
➔ Maths and statistics: For data cleaning, a text corpus is used; it is a large and structured set of texts (nowadays usually electronically stored and processed). The corpus is used for verifying occurrences or checking linguistic rules within a clearly defined language territory.
➔ Communication: Effectively communicating the data is an important aspect of this process; it can be broken down into the following parts:

◆ Data designing: Defining the data scope and visualizing it along with extra insights.
◆ Domain: Defining the data according to the expertise domain of the data.
We also use NLTK, a toolkit that consists of a range of packages; it is used to make computers understand human language with the intent of generating responses similar to those of humans.
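To make the pandas and NLTK points concrete, the following is a minimal sketch of loading customer comments and tokenizing them; the file name 'reviews.csv' and the 'comment' column are assumptions for illustration only.

# Minimal sketch: load comments with pandas, tokenize with NLTK.
import pandas as pd
import nltk
from nltk.tokenize import word_tokenize

nltk.download("punkt")  # tokenizer models (newer NLTK versions may also need "punkt_tab")

reviews = pd.read_csv("reviews.csv")                                     # one review per row
reviews["tokens"] = reviews["comment"].astype(str).apply(word_tokenize)  # list of tokens per review
print(reviews[["comment", "tokens"]].head())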
After the text corpus, we use a document-term matrix (DTM). It is basically a matrix with documents designated by rows and words by columns, whose elements are the counts or the weights (usually TF-IDF). Subsequent analysis is usually based on the DTM. We perform Exploratory Data Analysis (EDA) on the DTM; EDA refers to the process of performing initial investigations on the data so as to discover patterns and anomalies with the help of summary statistics and graphical representations. It is all about making sense of the data in hand before using it for the required purpose. Following these steps, the NLP techniques mentioned above are applied.
Following is the algorithm for implementation:

Data Pre-processing (Cleaning and Spell Checking) -
This is the major task in the text mining process, where we filter out a lot of unwanted terms from the bag of words (the fetched data). For further analysis such as pattern finding, it is preferred to treat each document as a bag of words, i.e. as a set of all words together with the frequency with which each word occurs in that document, also known as a corpus. We apply several cleaning methods to preprocess our corpus. Some documents have implicit structure terms like titles, sections, paragraphs etc. The step-by-step preprocessing task is as follows (a short code sketch follows the steps):
Step 1: Convert all upper case letters into lower case letters, for example "GAME" is converted into "game"; here all upper case letters are changed into lower case letters.
Step 2: Remove the punctuation marks (e.g. , . ? etc.) and replace them with a single space character, so as to not unintentionally merge words.
Step 3: Remove stop words such as "the, a, of" etc.; such words are of no use for text analysis. We can also remove some topic-based common keywords from the documents.
Step 4: The spell checking process is then used to find words not existing in the dictionary and to find a suitable replacement that is close enough to the word not found. Words that have no match are left as is.
Step 5: Stemming, one of the most important tasks in pre-processing, since it transforms words to their roots, for example by removal of prefixes or suffixes such as -ING, -S, -TION, -ALY, -LY, -IOUS, -IOUSLY, -ED, -EDLY, etc.
Step 6: After completion of the above 5 steps some excess white spaces are generated; such white spaces are removed from the documents.
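The following is a minimal sketch of Steps 1-6 in Python, assuming NLTK for the stop-word list and the Porter stemmer; Step 4 (spell checking) is indicated only as a comment because the replacement strategy is not specified here.

# Minimal sketch of preprocessing Steps 1-6, assuming NLTK is installed.
import re
import string
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

nltk.download("stopwords")
stemmer = PorterStemmer()
stop_words = set(stopwords.words("english"))

def preprocess(text):
    text = text.lower()                                                  # Step 1: lower case
    text = re.sub("[" + re.escape(string.punctuation) + "]", " ", text)  # Step 2: punctuation -> space
    words = [w for w in text.split() if w not in stop_words]             # Step 3: remove stop words
    # Step 4 (spell checking) would go here, e.g. replacing 'goood' with 'good'
    # via an edit-distance based checker; it is left out of this sketch.
    words = [stemmer.stem(w) for w in words]                             # Step 5: stemming
    return " ".join(words)                                               # Step 6: no excess white space

print(preprocess("The GAME was GOOD, really good!!!"))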
Data Representation (Document-Term Matrix) -
A document-term matrix (DTM), or term-document matrix if transposed, is a matrix of the frequency of the words or terms present in a collection of texts. Here, the documents correspond to the rows and the words/terms to the columns, each entry holding the frequency (number of occurrences) of that term in that document. To determine the value of each entry in the matrix, various weighting schemes can be used; one such scheme is TF-IDF.
For instance, if one has the following two (short) documents:
D1 = "I like databases"
D2 = "I hate databases"
then the document-term matrix would be:

Table 1 : Example Document-Term Matrix
       I   like   hate   databases
D1     1      1      0           1
D2     1      0      1           1

Table 1 shows which documents contain which terms and how frequently they appear.
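A DTM like Table 1 can be built directly with scikit-learn; the following is a minimal sketch using CountVectorizer on D1 and D2 (TfidfVectorizer can be substituted to obtain TF-IDF weights instead of raw counts). The relaxed token pattern is needed because the default tokenizer drops one-letter tokens such as "I".

# Minimal sketch: building the document-term matrix of Table 1 with scikit-learn.
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

docs = ["I like databases", "I hate databases"]

# Keep one-letter tokens ("I"); CountVectorizer lower-cases, so the column is 'i'.
vectorizer = CountVectorizer(token_pattern=r"(?u)\b\w+\b")
dtm = vectorizer.fit_transform(docs)

table = pd.DataFrame(dtm.toarray(),
                     columns=vectorizer.get_feature_names_out(),
                     index=["D1", "D2"])
print(table)
# For TF-IDF weights instead of raw counts, use TfidfVectorizer in the same way.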
Using this DTM we can gain insight into the most frequently occurring terms in the documents or collection of documents. If the Data Cleaning step is not done well, the DTM will contain useless or erroneous data. For example, if stop words like 'the' or 'a', which occur very frequently in regular sentences, are included, they will appear among the most frequent words, giving us no useful insight into the data. Another example of erroneous data disrupting the insight gained from a DTM is the inclusion of incorrect words: if these words are not corrected or removed they will either be counted as a separate column or will not be joined with the column they are supposed to fall under. For example, if the word 'good' occurs 5 times and the error word 'goood' occurs 1 time, the error word will form its own column instead of being included in the column for 'good'.

Gaining and Displaying Insights -
Now that we have made the DTM for the documents, we need to process it to gain some useful insights from it. This can be achieved by implementing Word Cloud and Sentiment Analysis among other techniques; a short code sketch of both follows the two items below.
❏ Word Cloud - a diagrammatic representation of the frequency/importance of a word or term in a document. In this step we simply apply the wordcloud package in Python, using the DTM to provide the frequency of each term. The higher the frequency count of a word, the larger the word appears in the word cloud. The output of this step is shown in Fig. 2.
❏ Sentiment Analysis - used to determine the polarity and subjectivity of a document or, in this case, a set of comments. Polarity is the value assigned to each term which differentiates terms into degrees of positivity and negativity within a range of -1 to 1. Subjectivity is a value, set within the range of -1 to 1, showing the degree to which the term is a fact or an opinion. The overall polarity and subjectivity values are the mean of the polarity and subjectivity of each term in the document. For example, the word 'good' may be assigned a polarity value of 0.7 and a subjectivity value of 0.6, whereas the string "very good" may have a polarity value of 0.9 and a subjectivity value of 0.78.
The output of this step is shown in Fig. 3.
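The following is a minimal sketch of both insight steps, assuming the wordcloud and textblob packages; the example comments are illustrative. Note that TextBlob reports polarity in the range -1 to 1 and subjectivity in the range 0 to 1.

# Minimal Word Cloud + Sentiment Analysis sketch ('wordcloud' and 'textblob' packages).
from wordcloud import WordCloud
from textblob import TextBlob

comments = ["The battery is very good", "Awful screen, very disappointing"]

# Word cloud from raw text frequencies (the DTM's column sums could be used instead).
wc = WordCloud(width=400, height=300).generate(" ".join(comments))
wc.to_file("wordcloud.png")                     # corresponds to Fig. 2

# Per-comment polarity and subjectivity (plotted as a scatter chart -> Fig. 3).
for c in comments:
    sentiment = TextBlob(c).sentiment
    print(c, "->", sentiment.polarity, sentiment.subjectivity)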
Document Clustering Methods -
Partition Around Medoids (PAM) [4]: Both the k-means and PAM algorithms are partitional (they break the data set up into groups). K-medoids chooses actual data points as the centers of the clusters. It is used here as a partitioning technique that clusters a data set of 'n' documents into 'k' clusters with 'k' known medoids (a center for each cluster).
PAM algorithm [4]:
Input: DocumentSet = {d1, d2, …, dn} of size 'n', the set of documents as a corpus.
Steps:
➢ Initialize: Select 'k' of the 'n' total data points as the medoids (centers).
➢ Assign each data point to its closest medoid.
➢ Loop while the Cost of the configuration decreases:
○ For every medoid 'm' and for every non-medoid data point 'o':
■ Swap 'm' and 'o', and recalculate the sum of distances of points to their respective medoids. This is used to find a better, more suitable center for a cluster among the points present in the cluster.
■ If the total Cost of the configuration increases due to the swap made in the previous step, undo the swap between 'm' and 'o'.
Output: 'k' clusters of the text documents, shown in Fig. 4.
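A compact sketch of the PAM steps above, operating on a pairwise distance matrix of TF-IDF document vectors; this is an illustrative implementation of the listed steps, not the system's actual code, and the example documents are assumptions.

# Minimal PAM (k-medoids) sketch following the steps above; illustrative only.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import pairwise_distances

docs = ["good battery life", "battery lasts long", "slow delivery", "late delivery"]
D = pairwise_distances(TfidfVectorizer().fit_transform(docs).toarray())

def pam(D, k, seed=0):
    n = len(D)
    medoids = list(np.random.default_rng(seed).choice(n, size=k, replace=False))  # Initialize
    cost = D[:, medoids].min(axis=1).sum()            # sum of distances to closest medoids
    improved = True
    while improved:                                   # loop while the cost decreases
        improved = False
        for i in range(k):                            # for every medoid ...
            for o in range(n):                        # ... and every non-medoid point
                if o in medoids:
                    continue
                trial = medoids.copy()
                trial[i] = o                          # swap medoid i with point o
                new_cost = D[:, trial].min(axis=1).sum()
                if new_cost < cost:                   # keep the swap only if the cost drops
                    medoids, cost, improved = trial, new_cost, True
    return medoids, D[:, medoids].argmin(axis=1)      # medoids and cluster label per document

print(pam(D, k=2))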
IV. RESULTS AND DISCUSSIONS

This system has a number of outputs, from which we can infer the underlying intent of the customers when they post a review for a product.
In Fig. 2, we can see the frequently occurring words, depicted in the word cloud format. The commonly used words across comments are listed here. To make this more efficient and meaningful we must first remove common words relating to the product being reviewed. The higher the frequency count of a word, the larger the word appears in the word cloud.

Fig. 2 : Sample Product Word Cloud

Fig. 3 : Sample Sentiment Analysis Chart

In Fig. 3, we see the sentiment analysis of the comments, depicted in a graphical format (scatter plot). It shows the value of each comment in the polarity and subjectivity domains. Each point in the scatter plot represents a single comment from a user. The position of the point is defined by the comment's polarity (value from -1 to +1) as the x-coordinate and its subjectivity (value from -1 to +1) as the y-coordinate.

Fig. 4 : Sample Topic Modeling Graph

Fig. 4 shows the topic modeling output, depicted in a graphical format (pie chart). It shows the overall topics mentioned by the customers, as a pie chart of the multiple topics across all the comments for a particular product.

Fig. 5 : Spam Detection Pie Chart

In Fig. 5 we can see the spam detection results, depicted in a graphical format (pie chart).
It shows the number of spam-like comments out of the total number of comments. This is determined by applying a weighted dictionary filled with spam words and their weights (how often each word is used in spam).
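The following is a minimal sketch of such weighted-dictionary spam scoring; the dictionary entries and the threshold are illustrative assumptions, not the weights used by the actual system.

# Minimal weighted-dictionary spam scoring sketch; words, weights and threshold are illustrative.
SPAM_WEIGHTS = {"free": 0.8, "click": 0.7, "winner": 0.9, "http": 0.6}
THRESHOLD = 1.0

def spam_score(comment):
    # Sum the weights of every known spam word appearing in the comment.
    return sum(SPAM_WEIGHTS.get(w, 0.0) for w in comment.lower().split())

comments = ["click here for a free winner prize", "battery life is great"]
flags = [spam_score(c) >= THRESHOLD for c in comments]
print(list(zip(comments, flags)))   # the share of flagged comments feeds the pie chart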
V. CONCLUSION AND FUTURE SCOPE

Reviewing product comments and feedback will help the users (producers, creators, and developers) to better understand how their product audience feels, on average, about their product(s). Using NLP we can read through innumerable amounts of comments posted by customers and, through cleaning, filtering and processing them, obtain an output in a concise and understandable format. In the future, we could add more complex and accurate topic modeling algorithms, along with implementing real-time analysis of the comments submitted by users.

REFERENCES
[1] S. Manke, N. Shivale, "A Review on: Opinion Mining and Sentiment Analysis based on Natural Language Processing", International Journal of Computer Applications, 2015.
[2] M. Pfaff, H. Krcmar, "Natural Language Processing Techniques for Document Classification in IT Benchmarking", Conference Paper, 2015.
[3] D. Bhattacharyya, S. Biswas, T. Kim, "A Review on Natural Language Processing in Opinion Mining", International Journal of Smart Home, 2010.
[4] V. C. Cheng, C. H. C. Leung, J. Liu, A. Milani, "Probabilistic Aspect Based Mining Model for Drug Reviews", IEEE Transactions on Knowledge and Data Engineering, 2014.
[5] J. Han, M. Kamber, J. Pei, "Data Mining: Concepts and Techniques", Second Edition (The Morgan Kaufmann Series in Data Management Systems), 2006.
[6] B. Liu, "Sentiment Analysis and Subjectivity", in: Handbook of Natural Language Processing, Second Edition, Taylor and Francis Group, Boca Raton, 2010.
[7] R. Sharma, S. Nigam, R. Jain, "Opinion Mining of Movie Reviews at Document Level", International Journal on Information Theory (IJIT), July 2014.
[8] N. Mishra, C. K. Jha, "Classification of Opinion Mining Techniques", International Journal of Computer Applications, October 2012.
[9] P. Singh, M. Husain, "Methodological Study of Opinion Mining and Sentiment Analysis Techniques", International Journal on Soft Computing (IJSC), February 2014.
[10] T. Ahmad, M. Doja, "Ranking System for Opinion Mining of Features from Review Documents", IJCSI International Journal of Computer Science Issues, July 2012.
[11] G. Krishna, S. Kavitha, S. Yamini, A. Rekha, "Analysis of Various Opinion Mining Algorithms", International Journal of Computer Trends and Technology (IJCTT), April 2015.
[12] P. Nie, J. Li, S. Khurshid, R. Mooney, M. Gligoric, "Natural Language Processing and Program Analysis for Supporting Todo Comments as Software Evolves", The University of Texas at Austin, 2014.
[13] Q. Su, X. Xu, H. Guo, Z. Guo, X. Wu, X. Zhang, B. Swen, Z. Su, "Hidden Sentiment Association in Chinese Web Opinion Mining", Proc. 17th Int'l Conf. World Wide Web, 2008.
[14] I. Smeureanu, C. Bucur, "Applying Supervised Opinion Mining Techniques on Online User Reviews", Informatica Economică, 2012.
[15] G. Qiu, C. Wang, J. Bu, K. Liu, C. Chen, "Incorporate the Syntactic Knowledge in Opinion Mining in User-Generated Content", Proc. WWW 2008 Workshop on NLP Challenges in the Information Explosion Era, 2008.
[16] A. Jebaseeli, E. Kirubakaran, "M-Learning Sentiment Analysis with Data Mining Techniques", International Journal of Computer Science and Telecommunications, August 2012.
[17] A. Berger, S. Della Pietra, V. J. Della Pietra, "A Maximum Entropy Approach to Natural Language Processing", Computational Linguistics, 1996.
[18] G. Angulakshmi, R. Manickachezian, "An Analysis on Opinion Mining: Techniques and Tools", International Journal of Advanced Research in Computer and Communication Engineering, July 2014.
[19] Q. Mei, X. Ling, M. Wondra, H. Su, C. Zhai, "Topic Sentiment Mixture: Modeling Facets and Opinions in Weblogs", Proc. 16th Int. Conf. WWW, USA, 2007.
[20] Z. Zhang, Q. Ye, Z. Zhang, Y. Li, "Sentiment Classification of Internet Restaurant Reviews written in Cantonese", Expert Systems with Applications, 2011.
