
International Journal of Engineering Trends and Technology (IJETT) Volume 18 Number 4 Dec 2014

Sentiment Analysis Using Weka


Umadevi V
Associate Professor, Department of CSE, BMS College of Engineering,
Bangalore, India

Abstract: Online social networks are now pervasive. Mining the
text in online social networks is useful for predictive
analytics. Predicting information from the unstructured data in
social networks is a challenging research problem. Extracting,
identifying, or otherwise characterizing the sentiment content
of a text unit using statistical and machine learning methods is
referred to as sentiment analysis or text analysis. In this work,
sentiment analysis using Decision Trees and Support Vector
Machines, two machine learning algorithms, is demonstrated with
the WEKA tool. Sentiment analysis using Support Vector Machines
showed higher accuracy than Decision Trees.

III. EXPERIMENT
Machine learning is about learning from the structure of
data. The two main categories of machine learning algorithms are
supervised and unsupervised. In this paper, two popular
supervised machine learning algorithms, namely Decision Tree
(DT) and Support Vector Machines (SVM), were used for
sentiment analysis. A general supervised machine learning
approach for sentiment analysis is shown in Fig. 1.
A supervised machine learning algorithm is provided with
labelled data as a training set. The algorithm learns from it and
outputs a trained model. The effectiveness of this model is then
evaluated on unseen data, i.e., the unlabelled data set.

Keywords: Sentiment Analysis, Text Classification, Social
Network, Decision Tree, Support Vector Machines.
I. INTRODUCTION
Social networks such as Facebook, Twitter, and LinkedIn
are sites where internet users post their comments and
views. The amount of text on these sites is increasing rapidly
as users spend more time in online social
networks. Text written by users in social
networks is useful to companies for extracting business
intelligence without conducting explicit customer
surveys. Sentiment analysis is the process of determining
the contextual polarity of a text, i.e., whether the text is
positive, negative, or neutral. The purpose of this analysis is to
find out how people feel about a particular topic.
Categorizing documents by positive or negative
sentiment is useful for unlabelled documents. In this
paper, an in-depth comparison of two very popular
classifiers for the task of classifying SMS text as either
positive or negative has been carried out. The two classifiers
examined were Decision Tree (DT) and Support Vector
Machine (SVM).
II. RELATED WORK
Sentiment analysis, also called opinion mining, is the field
of study that analyses people's opinions and sentiments towards
entities such as products and services, and their attributes [1].
A probabilistic approach for SMS classification systems has
been proposed in [2]. A suspicious email detection system based
on the decision tree method has been discussed in [3].
Comparison with other machine learning algorithms aids
performance evaluation. In this work, sentiment
classification of SMS text using two popular machine learning
algorithms, i.e., Decision Trees and Support Vector Machines,
has been carried out.

ISSN: 2231-5381

Fig. 1 Supervised Machine learning system approach


Sentiment analysis, or text classification, is a supervised
learning task, which means that each training document or text
has a class label. In feature extraction, a sentence or
document is broken into words to build up the feature matrix.
In the matrix, each sentence or document is a row, each
word forms a feature column, and each value is the
frequency count of the word in the sentence or document.
The feature matrix is then passed to each classifier and their
performance is evaluated. The control flow of the system
proposed in this paper is shown in Fig. 2. DT classifies data
by recursively splitting the feature space into two parts and
assigning a class to a sentence according to the region of the
divided space in which its features place it. The SVM classifies
data by finding the decision boundary that maximizes the margin
between the classes; the support vectors are the training points
that lie closest to this boundary and define it.
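The feature matrix described above can be sketched in a few lines of Python. This is only an illustration under simple assumptions (lowercasing and whitespace splitting; the function name build_feature_matrix and the three sample messages are made up for the example), not the WEKA implementation used in the experiments.

```python
from collections import Counter

def build_feature_matrix(documents):
    """Return (vocabulary, matrix) where matrix[i][j] counts word j in document i."""
    # Assumption: lowercase and split on whitespace; a real tokenizer
    # would also deal with punctuation.
    tokenized = [doc.lower().split() for doc in documents]
    # One column per distinct word, sorted so the column order is reproducible.
    vocabulary = sorted({word for tokens in tokenized for word in tokens})
    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)
        matrix.append([counts[word] for word in vocabulary])
    return vocabulary, matrix

# Hypothetical sample messages, not taken from the data set.
docs = ["free prize call now", "call me later", "free free entry"]
vocabulary, matrix = build_feature_matrix(docs)
for row in matrix:
    print(row)
```

Each row has one entry per vocabulary word, so most entries are zero for short messages; this is the sparsity that stemming and stop lists help to reduce.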

http://www.ijettjournal.org

Fig. 2 Control flow of the System

Page 181

IV. DATASETS
The training and testing data used for this work were collected
from [4]. This is the SMS Spam Collection data set, which
consists of 5574 SMS messages in positive and negative categories.
Only the first 200 samples were chosen. The training and test
data consist of 167 positive and 33 negative samples. Three-fold
cross-validation was used to evaluate the performance of
the classifiers.
V. RESULTS
WEKA [5] is an open source collection of machine
learning algorithms. In the WEKA tool, the data set is
loaded first. Under the meta classifiers, the FilteredClassifier
was used, with StringToWordVector as the filter. This filter
breaks each sentence into individual words. A stemmer was used
to convert words such as Driving, Drives, and Drive to the single
word Drive. Stemming reduces the number of features and the
sparsity of the data. A stop list was used to exclude words such
as I, is, the, that, etc. Term frequency was used to count
the number of occurrences of each word in a given sentence.
The parameters used in WEKA for the filter and the decision tree
are highlighted in Fig. 3. The decision tree algorithm used in
WEKA was J48. Under the StringToWordVector filter,
IDFTransform, TFTransform, outputWordCount, and useStopList
were set to True. The stemmer used was IteratedLovinsStemmer.
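The effect of the stop list and of stemming can be illustrated with a small Python sketch. The stop words below are only the examples named in the text, and toy_stem is a hypothetical suffix stripper written for this illustration; it is not the Lovins algorithm behind IteratedLovinsStemmer and will mangle many words that a real stemmer handles properly.

```python
# Only the stop words named in the text; WEKA's own stop list is much longer.
STOP_WORDS = {"i", "is", "the", "that"}
# Hypothetical toy suffix rules for illustration, NOT the Lovins stemmer.
SUFFIXES = ("ing", "es", "e", "s")

def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess(sentence):
    """Lowercase, split on whitespace, drop stop words, then stem."""
    tokens = sentence.lower().split()
    return [toy_stem(token) for token in tokens if token not in STOP_WORDS]

stems = {toy_stem(word) for word in ("driving", "drives", "drive")}
print(stems)
print(preprocess("That is the car I was driving"))
```

The three inflected forms collapse to one stem, so they occupy a single column of the feature matrix instead of three.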

Fig. 4 Weka screen shot for Decision Tree results


The decision tree constructed from the training data set is shown in
Fig. 5.

Fig. 3 Weka Parameters setting


Fig. 4 presents a screen shot of the WEKA tool with the
Decision Tree results. With DT, 87.5% of the instances were
correctly classified and 12.5% were incorrectly classified.


Fig. 5 Decision Tree constructed for the Train Dataset


Fig. 6 presents a screen shot of the WEKA tool with the Support
Vector Machine results. With SVM, 91% of the instances were
correctly classified and 9% were incorrectly classified.

Various tokenizers (which break text into words or features),
such as WordTokenizer, AlphabeticTokenizer, and
NGramTokenizer, were applied for both DT and SVM. For
NGramTokenizer, the minimum N-gram size was 1 and the
maximum was 3. Time taken and accuracy
results are shown in Table III. The results show that SVM
takes less time than DT to build the model with every tokenizer.
SVM was also more accurate than DT with WordTokenizer and
AlphabeticTokenizer, while DT was slightly more accurate with
NGramTokenizer (88% against 87%).
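To see why the N-gram setting produces so many more features, the following Python sketch generates all word N-grams of sizes 1 to 3 from a sentence. The function name word_ngrams and the sample sentence are assumptions for illustration, and the order in which WEKA's NGramTokenizer emits its tokens may differ.

```python
def word_ngrams(text, min_n=1, max_n=3):
    """Return every run of min_n..max_n consecutive words in the text."""
    words = text.split()
    grams = []
    for n in range(min_n, max_n + 1):
        # A text of len(words) words contains len(words) - n + 1 runs of size n.
        for start in range(len(words) - n + 1):
            grams.append(" ".join(words[start:start + n]))
    return grams

# Hypothetical sample sentence, not taken from the data set.
grams = word_ngrams("call me later today")
print(len(grams))
print(grams)
```

A four-word sentence already yields nine features instead of four, which fits the much larger feature count reported for NGramTokenizer.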
TABLE III
TIME FOR BUILDING MODEL AND ACCURACY RESULTS AGAINST VARIOUS TOKENIZERS

Tokenizer    DT         DT Time to    SVM        SVM Time to   No. of Tokens or
             Accuracy   build model   Accuracy   build model   Features extracted
Word         87.5%      1.20 sec      91%        0.08 sec      873
Alphabetic   89%        1.06 sec      92%        0.05 sec      769
NGram        88%        11.78 sec     87%        0.25 sec      6524

Fig. 6: Weka screen shot for SVM results


Cross-validation, also called rotation estimation, is a way to
analyse how a predictive model will perform on an unknown
dataset, i.e., how well the model generalizes. Three-fold
cross-validation was used to evaluate DT and SVM. Three
non-overlapping partitions of the dataset were created, and then three
experiments were carried out in which two partitions were used
for training and the remaining one for testing. Tables I and II
show the cross-validation results obtained for DT and SVM,
respectively, when the tokenizer applied was WordTokenizer.
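The partitioning scheme described above can be sketched as follows. This is a simplified illustration: the function name k_fold_splits is made up, and indices are dealt into folds round-robin, whereas a tool such as WEKA may shuffle and stratify the data before splitting.

```python
def k_fold_splits(n_samples, k=3):
    """Yield (train_indices, test_indices) pairs for k non-overlapping folds."""
    indices = list(range(n_samples))
    # Deal indices round-robin into k folds so fold sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    for held_out in range(k):
        test = folds[held_out]
        # Train on every fold except the one held out for testing.
        train = [i for f in range(k) if f != held_out for i in folds[f]]
        yield train, test

# 200 samples and three folds, as in the experiment described in the text.
splits = list(k_fold_splits(200, k=3))
for train, test in splits:
    print(len(train), len(test))
```

Every sample is used for testing exactly once, so the per-fold predictions can be pooled into a single confusion matrix over all 200 samples.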

VI. CONCLUSIONS
The rapid growth of social networks is giving rise to vast
amounts of online data. Analysis of this data yields useful
information for business intelligence. Analysing unstructured
social network data is a challenging problem. In this
work, a machine learning approach was applied to text analysis.
Support Vector Machines, a supervised machine learning
approach, took less time to build the model and gave more
accurate results on SMS spam text classification than the Decision
Tree approach for two of the three tokenizers tested.

TABLE I
CROSS VALIDATION RESULTS FOR DT

For Decision Trees       Predicted Class
                         Negative   Positive   Total
Actual Class  Negative   9          24         33
              Positive   1          166        167
Total                    10         190        200

TABLE II
CROSS VALIDATION RESULTS FOR SVM

For Support Vector Machine   Predicted Class
                             Negative   Positive   Total
Actual Class  Negative       15         18         33
              Positive       0          167        167
Total                        15         185        200
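The accuracies reported in the text can be re-derived from the counts in Tables I and II. The Python sketch below does this, treating "positive" as the positive class; the function name summarize and the per-class recall figures are additions for illustration and are not reported in the paper.

```python
def summarize(tn, fp, fn, tp):
    """Compute accuracy and per-class recall from confusion-matrix counts."""
    total = tn + fp + fn + tp
    return {
        "accuracy": (tn + tp) / total,
        # Share of actual negatives that were predicted negative.
        "negative_recall": tn / (tn + fp),
        # Share of actual positives that were predicted positive.
        "positive_recall": tp / (tp + fn),
    }

# Counts copied from Table I (DT) and Table II (SVM).
dt = summarize(tn=9, fp=24, fn=1, tp=166)
svm = summarize(tn=15, fp=18, fn=0, tp=167)
print(dt)
print(svm)
```

Both classifiers label most of the 33 negative samples as positive, so the overall accuracy mainly reflects the 167 positive samples.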


REFERENCES
[1] Liu, Bing. "Sentiment analysis and opinion mining."
    Synthesis Lectures on Human Language Technologies,
    Vol 5, no. 1, 2012, pp. 1-167.
[2] Ahmed, Ishtiaq, Donghai Guan, and Tae Choong Chung.
    "SMS Classification Based on Naïve Bayes Classifier
    and Apriori Algorithm Frequent Itemset." International
    Journal of Machine Learning & Computing, Vol 4, no. 2,
    2014.
[3] Rajaram, Ramasamy, and Appavu Balamurugan.
    "Suspicious E-mail detection via decision tree: A data
    mining approach." CIT. Journal of Computing and
    Information Technology, Vol 15, no. 2, 2007, pp. 161-169.
[4] Dataset, SMS Spam Collection, URL
    http://www.dt.fee.unicamp.br/~tiago/smsspamcollection/
    [Last Accessed November 2014]
[5] WEKA tool, URL http://www.cs.waikato.ac.nz/ml/weka/
    [Last Accessed November 2014]
