Abstract—Sentiment analysis is becoming a popular area of research
and social media analysis, especially around user reviews and tweets.
It is a special case of text mining generally focused on identifying
opinion polarity. People are usually interested in seeking positive and
negative opinions, containing likes and dislikes, shared by users.
Therefore, reviews or product features play a significant role in
sentiment analysis. In addition to the considerable work being
performed in text analytics, feature extraction in sentiment analysis
is now becoming an active area of research. Feature-based sentiment
analysis includes feature extraction, sentiment classification and
finally sentiment evaluation. This paper explores machine learning
classification approaches with different feature selection schemes to
obtain a sentiment analysis model for the movie review dataset, and the
results obtained are compared to identify the best possible approach.

Keywords— Feature Extraction, Bag of Words, Classification, Bigram
Collocation, Information Features, Evaluation

I. INTRODUCTION
Text mining is the process of discovering useful and interesting
knowledge from unstructured text. In order to discover knowledge from
unstructured text data, the first step is to convert text data into a
manageable representation. A common practice is to model text in a
document as a set of word features, i.e., “bag of words” (BOW). Often,
some feature selection techniques are applied, such as stop-word
removal or stemming, to keep only meaningful features and improve the
accuracy of a supervised classification algorithm [1].

Sentiment Analysis or Opinion Mining is a challenging task in Text
Mining and Natural Language Processing for feature extraction,
classification and summarization of sentiments. Sentiment Analysis also
assists individuals and organizations who are interested in knowing
what other people comment about a particular product, service, topic,
issue or event, so that they can find the optimal choice they are
looking for. Sentiment Analysis in the context of Government
Intelligence aims at extracting public views on government policies and
decisions to infer possible public reaction to the implementation of
certain policies [2].

For sentiments with respect to movie reviews, the reviews are
categorized into pos and neg categories. In this proposed work, a simple
Naïve Bayes classifier is considered as a baseline, using Boolean word
feature extraction.

Sentiment analysis can be performed at three different levels:
document, sentence and aspect level. Document-level sentiment analysis
aims at classifying the entire document as positive or negative (Pang
et al. [3]; Turney [4]). Sentence-level sentiment analysis is closely
related to subjectivity analysis. At this level each sentence is
analyzed and its opinion is determined as positive, negative or neutral
(Riloff et al. [4]; Terveen et al. [5]). Aspect-level sentiment
analysis aims at identifying the target of the opinion. The basis of
this approach is that every opinion has a target and an opinion without
a target is of limited use (Hu and Liu [6]).

The simplest strategy uses a bag-of-words model, which creates lists of
'positive' and 'negative' words and judges a document based on whether
it has a preponderance of positive or negative words. Creating such
lists is not easy; some of the words are likely to be quite different
for different kinds of topics. Initially, one takes a collection of
documents on some topic, rates them (by hand) as positive or negative,
and then uses this collection to train a classifier model.

Limitations:
The bag-of-words model does quite well in assessing short reviews which
are clearly addressed to a single object, but it has several
limitations:
• some important words are ambiguous in their opinion: "low" is
  positive in "low price" but negative in "low quality"
• the assessment does not reveal which aspects of the product led to
  the positive or negative sentiment, although this may be more crucial
  to the decision maker than the overall assessment
• if the review compares the item to other items, the bag-of-words
  approach is unable to distinguish references to the different items
Overcoming all three limitations requires a richer model which can
capture some of the structure of the language.

N-grams: Unigrams present the simplest model for the n-gram approach. A
unigram model consists of all the individual words present in the text.
The bigram model uses pairs of adjacent words; each pair of adjacent
words forms a single bigram. Higher-order n-grams can be formed in a
similar way by taking n adjacent words together. Higher-order n-grams
are more effective at capturing context, as they provide a better
understanding of word position.

An n-gram is a subsequence of n items from a given sequence. It is used
in various fields of natural language
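
The unigram and bigram definitions given for the n-gram approach can be
sketched in a few lines of Python. This is only an illustrative sketch,
not the feature extractor used in the paper: the helper names
`tokenize` and `ngrams` and the sample review are assumptions introduced
here.

```python
from collections import Counter


def tokenize(text):
    """Lowercase, split on whitespace and trim surrounding punctuation."""
    words = (word.strip(".,;:!?\"'()") for word in text.lower().split())
    return [word for word in words if word]


def ngrams(tokens, n):
    """Return every run of n adjacent tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Hypothetical review text, chosen to echo the "low price" / "low quality"
# ambiguity discussed for the bag-of-words model.
review = "Low price, but low quality."
tokens = tokenize(review)

unigrams = ngrams(tokens, 1)   # individual words
bigrams = ngrams(tokens, 2)    # pairs of adjacent words

unigram_counts = Counter(unigrams)
bigram_counts = Counter(bigrams)
```

In this toy input the unigram "low" occurs twice and carries no clear
polarity on its own, whereas the bigrams ("low", "price") and
("low", "quality") keep the two uses apart.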
C. Classification
Classification using machine learning can be done in two steps:
1. Learning the model using the training dataset.
2. Applying the trained model to the test dataset.
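
The two steps above can be sketched with a small Naïve Bayes classifier
over Boolean word features, the baseline named in the introduction. This
is a minimal sketch under stated assumptions, not the paper's
implementation: the function names, the add-one smoothing, the choice to
score only the words that are present in a document, and the four toy
training reviews are all introduced here for illustration.

```python
import math
from collections import Counter, defaultdict


def boolean_features(text):
    """Boolean word features: the set of distinct lowercase words."""
    return set(text.lower().split())


def train(labelled_docs):
    """Step 1: learn class counts and per-class word document counts."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in labelled_docs:
        features = boolean_features(text)
        class_counts[label] += 1
        word_counts[label].update(features)  # each word counted once per doc
        vocabulary |= features
    return class_counts, word_counts, vocabulary


def classify(model, text):
    """Step 2: return the class with the highest log posterior score."""
    class_counts, word_counts, vocabulary = model
    total_docs = sum(class_counts.values())
    # Assumption: words never seen in training are ignored.
    features = boolean_features(text) & vocabulary
    best_label, best_score = None, float("-inf")
    for label, n_docs in class_counts.items():
        score = math.log(n_docs / total_docs)  # log prior
        for word in features:
            # Add-one smoothed estimate of P(word present | class).
            score += math.log((word_counts[label][word] + 1) / (n_docs + 2))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical toy data standing in for the movie review dataset.
training_set = [
    ("a great and moving film", "pos"),
    ("great acting and a wonderful story", "pos"),
    ("a dull and boring film", "neg"),
    ("boring plot and terrible acting", "neg"),
]
model = train(training_set)
prediction_a = classify(model, "a wonderful moving story")
prediction_b = classify(model, "terrible and boring")
```

Working in log space avoids numerical underflow when many word
probabilities are multiplied, and the smoothing keeps a word unseen in
one class from forcing that class's score to zero.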
Sentiment analysis is a text classification problem, and thus any
existing supervised classification method can be applied. The Naïve
Bayes classifier is a simple probabilistic classifier based on Bayes'
theorem. This classification technique assumes that the presence or
absence of any feature in the document is independent of the presence
or absence of any other feature. The Naïve Bayes classifier treats a
document as a bag of words and assumes that the probability of a word in
the document is independent of its position in the document and of the
presence of other words. For a document d and class c, Bayes' theorem
gives P(c | d) = P(d | c) P(c) / P(d). The experiment was performed
using the Naïve Bayes classifier to classify the movie reviews, and the
results below were obtained.

Fig. 3.1 Proposed model for improving feature extraction

D. Evaluation
There are various methods to determine effectiveness; however,
precision, recall, and accuracy are most often used. To determine these,
one must first establish whether the classification of a sentiment was a
true positive (TP), false positive (FP), true negative (TN), or false
negative (FN).
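
As a sketch of how these counts lead to the three measures, the standard
definitions are precision = TP / (TP + FP), recall = TP / (TP + FN) and
accuracy = (TP + TN) / (TP + FP + TN + FN). The function names and the
gold and predicted label lists below are hypothetical and are not
results from the paper.

```python
def confusion_counts(gold, predicted, positive="pos"):
    """Count TP, FP, TN and FN, treating `positive` as the target class."""
    tp = fp = tn = fn = 0
    for gold_label, predicted_label in zip(gold, predicted):
        if predicted_label == positive and gold_label == positive:
            tp += 1
        elif predicted_label == positive:
            fp += 1
        elif gold_label == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def precision(tp, fp):
    """Share of predicted positives that are truly positive."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    """Share of true positives that were found."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def accuracy(tp, fp, tn, fn):
    """Share of all documents that were classified correctly."""
    total = tp + fp + tn + fn
    return (tp + tn) / total if total else 0.0


# Hypothetical labels for eight reviews.
gold = ["pos", "pos", "pos", "neg", "neg", "neg", "neg", "pos"]
predicted = ["pos", "pos", "neg", "neg", "neg", "pos", "neg", "pos"]

tp, fp, tn, fn = confusion_counts(gold, predicted)
p = precision(tp, fp)
r = recall(tp, fn)
a = accuracy(tp, fp, tn, fn)
```

For these made-up labels the counts are TP = 3, FP = 1, TN = 3 and
FN = 1, so precision, recall and accuracy all come out at 0.75.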