
CHAPTER 6

Classifying Streaming of Twitter Data based on Sentiment Analysis using Hybridization

6.1 Introduction

Twitter has become a massive social networking website that many users around the
world browse regularly. On it, users express their thoughts and feelings about
real-world events. Sentiment analysis techniques reveal users' feelings and attitudes
towards products from different organizations. Users' reviews, blogs, and posts on
Twitter are analyzed for opinions, which helps organizations in areas such as customer
opinion enrichment, recommender systems, politics, and business. Sentiment analysis
is commonly defined as the process of determining the polarity or intention of user
messages. Because incoming tweets arrive in large volumes, it is very difficult to
identify people's impressions without automatic sentiment analysis. The marked growth
in sentiment classification techniques has encouraged many researchers to develop
new methodologies.
Figure 6.1 shows the process of sentiment analysis, which involves several components:
tokenization, text input, negation handling, stop word filtering, sentiment class,
stemming, and classification. Authors use natural language to express opinions on
particular issues in text data. Opinion mining provides customer insights about an
event in real time. However, Twitter data consists of noisy, poorly structured
sentences with incomplete and ill-formed words, irregular expressions, and
non-dictionary terms. End users express their opinions, emotions, and sentiments about
multiple products in reviews on these websites.
Therefore, business organizations and individuals are able to understand these opinions
better. Different techniques, such as natural language processing (NLP), Support Vector
Machine (SVM), Naïve Bayes (NB), Decision Tree (DT), and Maximum Entropy, have been
proposed for classifying the sentiment of tweets automatically. Furthermore, the
polarity of opinions is predicted by combining machine learning (ML) algorithms with
feature selection methods and analyzing each user's reviews. Emotions are also
classified into three categories: negative, neutral, and positive.

Fig. 6.1 Process Involved in Sentiment Analysis

Sentiment analysis involves five main steps. First, tweets are collected using the
Twitter API; because the data is large and contains varied slang and vocabulary, the
text is analyzed and classified using natural language processing and text analytics.
Second, text preprocessing analyzes the tweets, extracts the relevant data, and removes
non-textual content, emoticons, and numerical values. Third, sentiment detection
inspects the extracted tweets and reviews for opinions and emotion: sentences that
carry individual expressions are retained and objective communication is discarded.
Fourth, the sentiment is classified as ‘Negative’, ‘Neutral’, or ‘Positive’. Finally,
the unstructured text is converted into meaningful information and presented as
output, which is the key objective of sentiment analysis.
In the proposed work, sentiment analysis is performed on the collected tweets. The
text feature is extracted from the dataset and processed through the five steps
discussed above. Finally, the extracted data is used to classify sentiment with the
proposed lexicon-based hybrid model.

6.2 Proposed Work

Fig. 6.2 shows the overall architecture of the proposed hybrid model. Tweets are
collected in the first stage and then tokenized. The training set is generated after
the preprocessing step and given as input to the hybrid algorithm for sentiment
classification. The model is thus trained on the training dataset; streaming tweets
are then preprocessed and applied as test data to obtain the classification accuracy.
The proposed hybrid technique combines three algorithms, namely Genetic Algorithm
(GA), Particle Swarm Optimization (PSO), and Decision Tree (DT). The proposed model
achieves higher accuracy for sentiment classification than the existing classifiers
compared in Section 6.3.

Fig. 6.2 Overall Proposed Architecture

6.2.1 Data Collection

The Twitter streaming API is used to collect tweets, including URLs. The public
streaming API gives access to real-time tweets, but grants access to only 1% of all
public tweets owing to restrictions on protected accounts. The collected tweets are in
JSON format, and each line is parsed as an object. A total of 1 lakh (100,000) tweets
are collected. The collected tweets carry several attributes, which fall into two main
types: tweet-based features and user-based features. The proposed work uses the text
attribute for sentiment classification. The dataset labels are “Positive”, “Neutral”,
and “Negative”.
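
The parsing step described above can be sketched as follows. This is only a minimal
illustration under stated assumptions: the function name `extract_texts` and the sample
records are hypothetical, and the sketch assumes each tweet object stores its message
under a `"text"` key, with lines that are blank, malformed, or not tweets simply skipped.

```python
import json

def extract_texts(lines):
    """Return the text attribute of every well-formed tweet object."""
    texts = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines in the collected stream
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip truncated or malformed records
        # Assumption: only objects that carry a "text" key are tweets.
        if isinstance(record, dict) and "text" in record:
            texts.append(record["text"])
    return texts

# Hypothetical sample lines, one JSON object per line.
sample = [
    '{"id": 1, "text": "Loving the new phone! http://t.co/x", "user": {"screen_name": "a"}}',
    '',
    '{"delete": {"status": {"id": 2}}}',
    '{"id": 3, "text": "RT @bob: worst service ever #fail"}',
    '{"id": 4, "text": "cut off',
]
print(extract_texts(sample))
```

Only the text attribute is kept here because it is the attribute used for sentiment
classification; the user-based attributes are ignored.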

6.2.2 Tokenization

In this process, text is converted into meaningful words, phrases, or symbols known as
tokens. The tokens are used for text parsing or mining. Any error made during
tokenization can cause problems later in the classification process. The first and
most important step is segmenting the text into words, which is done by locating word
boundaries. The start of a word and the end of a statement are treated as word
boundaries. Tokenization should come first in any language processing task. It is easy
when words are separated by spaces. Where no white space exists, a punctuation symbol
is treated as white space during tokenization. Before any processing, the notion of a
token is defined; this notion can be methodological or linguistic. An example of
tokenization is given below.

Input: I am going for shopping, movie and friends house.

Output:
I am going for shopping
movie and friends house
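
The example above can be reproduced with a short sketch. It is an illustration rather
than the tokenizer used in this work: the function name `tokenize` and the choice of
punctuation characters are assumptions. Punctuation is treated as a boundary, and each
resulting segment is then split into word tokens on white space.

```python
import re

def tokenize(text):
    """Split text into segments at punctuation, then into word tokens."""
    # Assumption: these punctuation marks act as boundaries between segments.
    segments = [s.strip() for s in re.split(r"[.,;:!?]+", text) if s.strip()]
    # Within a segment, words are separated by white space.
    return [segment.split() for segment in segments]

tokens = tokenize("I am going for shopping, movie and friends house.")
for segment in tokens:
    print(" ".join(segment))
```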

6.2.3 Preprocessing of Tweets

Social media sites use language that differs from that of the mainstream media and
from the words found in a dictionary. Unusual slang and emojis are used on social
media platforms, and words are emphasized by repeating some of their letters.
Furthermore, Twitter has its own markup: tweets reposted by other users are marked
with “RT”, users are marked with the “@” sign, and topics are marked with “#”. The
preprocessing of tweets consists of the stages shown in Fig. 6.3.

Fig. 6.3 Preprocessing Steps

Figure 6.3 shows the preprocessing steps, which include removal of stop words,
expansion of contractions, correction of misspellings in the text, stemming of words,
identification of tags, and compilation of the positive and negative word lists of
each tweet.
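
Some of these preprocessing steps can be sketched as follows. This is a simplified
illustration under stated assumptions, not the pipeline used in this work: the stop
word list and contraction table are tiny hypothetical samples, the function name
`preprocess` is an assumption, and spelling correction and stemming are left out
because they need external resources.

```python
import re

# Tiny illustrative word lists (assumptions); real lists would be far larger.
STOP_WORDS = {"a", "an", "the", "is", "am", "are", "to", "of", "and", "for", "it", "this"}
CONTRACTIONS = {"can't": "cannot", "don't": "do not", "it's": "it is", "i'm": "i am"}

def preprocess(tweet):
    """Return the cleaned, lower-cased word list of one tweet."""
    text = tweet.lower()
    text = re.sub(r"https?://\S+", " ", text)    # drop URLs
    text = re.sub(r"\brt\b", " ", text)          # drop the retweet marker
    text = re.sub(r"@\w+", " ", text)            # drop user mentions
    text = text.replace("#", " ")                # keep the topic word, drop '#'
    text = text.replace("’", "'")                # normalize curly apostrophes
    for short, full in CONTRACTIONS.items():     # expand contractions
        text = text.replace(short, full)
    text = re.sub(r"(.)\1{2,}", r"\1\1", text)   # "sooooo" becomes "soo"
    words = re.findall(r"[a-z]+", text)          # drops digits and emoticons
    return [w for w in words if w not in STOP_WORDS]

print(preprocess("RT @bob: I can't believe it's sooooo good!!! #happy http://t.co/abc 123"))
```

Runs of three or more identical characters are reduced to two rather than one, so that
legitimate double letters such as the "oo" in "good" are not damaged.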

6.2.4 Feature Generation

The preprocessed tweets are forwarded for feature generation. This stage analyzes the
frequency of positive and negative words, the positive and negative scores, and the
tag count in the text, and produces an overall score for each text. The following
features are used in the learning classifier:

• Word count: the total number of words in each tweet after preprocessing.

• Tag count: the total number of @ tags used in each tweet.

• Negative word count: the total number of negative words in each tweet.

• Positive word count: the total number of positive words in each tweet.

• Positive score: the total obtained by adding the scores of the positive adjectives.

• Negative score: the total obtained by adding the scores of the negative adjectives.

• Score: the final total for each tweet, obtained by subtracting the negative score
from the positive score.
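
The seven features can be illustrated with a short sketch. The lexicon values, the
function name `features`, and the example tweet are hypothetical, and the sketch
assumes that the final score is the positive score minus the negative score and that
the tag count is taken from the raw tweet, before the @ signs are removed.

```python
# Hypothetical lexicon (assumption): word -> sentiment strength.
POSITIVE = {"good": 1.0, "great": 2.0, "love": 2.0, "happy": 1.5}
NEGATIVE = {"bad": 1.0, "worst": 2.0, "hate": 2.0, "sad": 1.5}

def features(raw_tweet, words):
    """Build the seven features from a raw tweet and its preprocessed words."""
    pos_hits = [w for w in words if w in POSITIVE]
    neg_hits = [w for w in words if w in NEGATIVE]
    pos_score = sum(POSITIVE[w] for w in pos_hits)
    neg_score = sum(NEGATIVE[w] for w in neg_hits)
    return {
        "word_count": len(words),
        "tag_count": raw_tweet.count("@"),   # counted on the raw text
        "positive_word_count": len(pos_hits),
        "negative_word_count": len(neg_hits),
        "positive_score": pos_score,
        "negative_score": neg_score,
        "score": pos_score - neg_score,      # assumed sign convention
    }

raw = "@alice I love this phone but the battery is bad"
words = ["i", "love", "phone", "but", "battery", "bad"]  # preprocessed form
print(features(raw, words))
```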

Algorithm: Classification using Hybrid Algorithm

Tokenization and preprocessing steps are applied to the training tweets.

Input: Training set of tweets T
Step 1:  Initialize the population of the training set
Step 2:  For each particle x_t in T do
Step 3:      If f(x_t) == f(Pd_t) then
                 Pd_t = x_t
             End if
Step 4:      If f(x_t) == f(Nd_t) then
                 Nd_t = x_t
             Else
                 Gd_t = (Pd_t, Nd_t)
             End if
         End for
Step 5:  Stop if condition fulfilled or goto Step 3
Step 6:  Update new particles population S
Step 7:  While isNotTerminated() do
             P_p(S) = P(S).selectPositive
             P_n(S) = P(S).selectNegative
Step 8:      Mutate(P_p(S))
Step 9:      Mutate(P_n(S))
Step 10:     Evaluate(P_p(S), P_n(S))
Step 11:     Build new population (P_p(S), P_n(S))
         End while
Step 12: New population as D
Step 13: GenDecTree(D, features F)
Step 14:     If stopping-condition(D, F) == True then
                 Leaf = createNode()
                 Leaf.label = Classify(D)
                 Return Leaf
             End if
Step 15:     root = createNode()
Step 16:     root.test-condition = findBestSplit(D)
Step 17:     Z = {z | z is a possible outcome of root.test-condition}
Step 18:     For each value z ∈ Z:
                 D_z = {d | root.test-condition(d) = z and d ∈ D}
                 Child = GenDecTree(D_z, F)
                 Add Child to root and label the edge (root → negative or positive) as z
Step 19:     Return root

6.3 Results

In this section, the collected tweets are passed to text preprocessing, and the words
are separated into tokens by the tokenization process. Features are then formed from
the words, scores are computed, and a polarity is assigned to each word. The
performance of the proposed hybrid model, PSG-DT, is measured in terms of Precision,
Accuracy, F-measure, and Recall.

• Accuracy refers to the overall correctness of the classification. It is measured as
the ratio of the correctly classified instances to the total number of samples.

• Precision, for a given sentiment, is the fraction of the tweets classified into that
sentiment that are classified correctly.

• Recall is a key performance indicator (KPI) that, unlike accuracy, is computed for a
particular sentiment: it is the fraction of the tweets that truly belong to that
sentiment that are classified into it.

• F-measure is calculated as Eqn. 6.1.

\[
\text{F-measure} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \tag{6.1}
\]
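
The four measures can be computed from a confusion matrix as in the sketch below. The
matrix values are a made-up toy example, not results from this work, and the function
name `per_class_metrics` is an assumption. Rows are taken to be the actual classes and
columns the predicted classes.

```python
def per_class_metrics(matrix, classes):
    """matrix[i][j] = number of tweets of actual class i classified as class j."""
    total = sum(sum(row) for row in matrix)
    results = {}
    for k, name in enumerate(classes):
        tp = matrix[k][k]                         # correctly classified as k
        fn = sum(matrix[k]) - tp                  # class k tweets sent elsewhere
        fp = sum(row[k] for row in matrix) - tp   # other tweets classified as k
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f_measure = 2 * precision * recall / denom if denom else 0.0
        results[name] = (precision, recall, f_measure)
    accuracy = sum(matrix[k][k] for k in range(len(classes))) / total
    return accuracy, results

classes = ["Positive", "Negative", "Neutral"]
toy = [[8, 1, 1],   # actual Positive
       [2, 6, 2],   # actual Negative
       [0, 3, 7]]   # actual Neutral
accuracy, results = per_class_metrics(toy, classes)
print(accuracy, results)
```

In this toy matrix, 21 of the 30 tweets lie on the diagonal, so the accuracy is 0.7.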

6.3.1 Ternary Classification

The polarity and emotions of the tweets are classified using the proposed hybrid
algorithm. The tweets are grouped into three classes: ‘Positive’, ‘Negative’, and
‘Neutral’. The Positive class contains tweets from the emotion classes “Happiness”,
“Fun”, and “Love”, while the Negative class contains tweets from the emotion classes
“Hate”, “Anger”, and “Sadness”. Table 6.1 shows the classification results of the
proposed hybrid technique, which combines genetic algorithm, particle swarm
optimization, and decision tree. The proposed classifier obtains an overall accuracy
of 90.4%, with 93.5% for the positive class and 89.3% for the negative class. The
overall precision is 87.5%, the overall recall is 83.7%, and the overall F-measure is
86.9%. For the neutral class, the F-measure is 84.9% and the accuracy is 88.4%. The
confusion matrix for the overall data is shown in Table 6.2; most negative tweets are
correctly predicted as negative and most positive tweets as positive. The performance
is compared with that of other classifiers, and the proposed work shows higher
classification accuracy.

Table 6.1 Ternary Classification of Tweets

Class      Precision   Accuracy   Recall   F-Measure
Positive   0.894       0.935      0.832    0.896
Negative   0.873       0.893      0.865    0.863
Neutral    0.859       0.884      0.814    0.849
Overall    0.875       0.904      0.837    0.869

Table 6.2 Confusion Matrix of Classified Tweets

            Classified to be
Class       Positive   Negative   Neutral
Neutral     1765       1874       4786
Negative    1478       6135       1786
Positive    6648       1152       976

Figure 6.4 shows the polarity classification of the dataset obtained by applying the
hybrid model to the text of each tweet; the results are broken down into positive,
negative, and neutral.

Fig. 6.4 Polarity based on Proposed Method

Figure 6.5 shows the multi-class classification of emotions. The emotion classes are
“anger”, “fear”, “disgust”, “joy”, “surprise”, “sadness”, and “unknown”. According to
the results of the proposed method in Fig. 6.5, “anger” is the most frequent emotion
in the Twitter dataset.

Fig. 6.5 Classified Emotions of Proposed Work

Figure 6.6 compares the sentiment classification accuracy of existing methods with
that of the proposed hybrid technique. K-nearest neighbor (KNN) has an accuracy of
about 67%, the lowest among the compared classifiers. The hybrid technique that
combines Support Vector Machine (SVM) and KNN has an accuracy of about 76%. The
proposed algorithm, which combines PSO, GA, and DT, gives better results than all the
other classifiers, with an accuracy of about 90.5%.

Fig. 6.6 Overall Comparison of Accuracy with Proposed Work

6.3.2 Comparing Other Works

The overall performance of the proposed work is summarized in terms of F-measure,
Recall, Precision, and Accuracy. Decision tree (DT), support vector machine (SVM),
genetic algorithm (GA), K-nearest neighbor (KNN), SVM+KNN, and particle swarm
optimization (PSO) are analyzed, and their results are compared with those of the
proposed hybrid technique in Table 6.3.

Table 6.3 Proposed vs. Existing Approach

Classifier      Recall (%)   F-measure (%)   Accuracy (%)   Precision (%)
PSO             89.1         86.78           88             88.7
SVM             68.1         68.7            68             69
KNN             69.3         67.9            67             70.5
DT              81.4         80.95           80             81.4
SVM+KNN         68.14        77.56           76             68.45
GA              87.9         87.59           86             87.3
Proposed Work   91.7         91.4            90             91.5

6.4 Summary

In this research, sentiment classification is performed using the proposed hybrid
method, named PSG-DT, which combines PSO, GA, and DT and gives higher performance than
the existing algorithms it is compared with. The hybrid model classifies sentiment
tweets into ‘negative’ or ‘positive’ classes with an accuracy of about 90%. The work
follows three main stages: tokenization, preprocessing, and feature generation. The
proposed model is compared with existing machine learning classifiers and optimization
techniques, and the results show that it achieves the highest accuracy. Future work
includes combining other classifiers with optimization techniques to examine their
effect on the performance of sentiment tweet classification.
