Sentiment Analysis For Government: An Optimized Approach: July 2015
1 Introduction
Twitter, one of the most popular micro-blogging tools, has recently gained significant ground among social network services. Micro-blogging is an innovative form of communication in which users express, in short posts, their feelings or opinions about a variety of subjects or describe their current status.
A remarkable number of studies are aimed at the sentiment analysis (SA) of Twitter messages (tweets). The brevity of these texts (tweets cannot be longer than 140 characters) and the informal nature of social media have encouraged the use of slang, abbreviations, new words, URLs, etc. These factors, together with frequent misspellings and improper punctuation, make it more complex to extract people's opinions and sentiments.
Despite this, numerous recent studies have focused on natural language processing of Twitter messages [1–4], extracting useful information in various fields such as brand evaluation [1], public health [5], natural disaster management [6], social behavior [7], the movie market [8] and the political sphere [9].
This work falls within the context of public administration, with the aim of providing reliable estimates and analyses of what citizens think about institutions, the efficiency of services and infrastructures, and the degree of satisfaction with special events. The paper proposes an optimized sentiment classification approach as a political and social decision support tool for governments.
The dataset used in the experiments contains Italian tweets relating to the public event "Lecce 2019: European Capital of Culture", collected using the Twitter search API between 2 September 2014 and 17 November 2014.
We propose an optimized approach employing a document-level and a dataset-level supervised machine learning classifier to provide accurate results in both individual and aggregated classification. In addition, we identify the particular kinds of features that yield the most accurate sentiment classification for a dataset of Italian tweets in the context of a Public Administration event, considering also the size of the training set and the way it affects the results.
The paper is organized as follows: the research background is described in Sect. 2, followed by a discussion of the public event "Lecce 2019: European Capital of Culture" in Sect. 3, a description of the dataset in Sect. 4 and a description of the machine learning algorithms and optimization employed in Sect. 5. In Sect. 6, the results are presented. Section 7 concludes and provides indications towards future work.
2 Research Background
The related work can be divided into two groups: general SA research and research devoted specifically to the government domain. In recent years, research related to SA and Opinion Mining has been growing. This growing interest is partly due to the variety of application areas: in the commercial field, the analysis of product reviews [12]; in the political field, the identification of the electorate's mood and therefore of voting (or abstention) trends [13]; and so on. In social environments, SA can be used as a survey tool for understanding existing points of view: for example, to understand the opinion that some people have about a subject, to predict the impact of a future event or to analyze the influence of a past occurrence [14]. Big data technologies, observation methods and the analysis of behavior on the web make SA an important decision-making tool for the analysis of social networks, able to develop relations, culture and sociological debate.
3 Lecce 2019: European Capital of Culture

With about 90,000 citizens, Lecce is a mid-sized city and the most important province of Salento, located in the "heel" of the Italian "boot". Lecce is known for its enormous cultural, artistic and naturalistic heritage.
4 Dataset
We collected a corpus of tweets using the Twitter search API between 2 September 2014 and 17 November 2014, the period in which most Twitter messages about the event were posted. We used a query-based search to collect the tweets relating to "#Lecce2019" and "#noisiamolecce2019", the hashtags most used for this topic. The resulting dataset contains 5,000 tweets. Duplicates and retweets were automatically removed, leaving a set of 1,700 tweets with the class distribution shown in Table 1.
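The duplicate and retweet filtering step can be sketched as follows. This is an illustrative sketch, not the authors' implementation; it assumes retweets are recognizable by the conventional "RT @" prefix and that duplicates are exact text matches.

```python
def remove_duplicates_and_retweets(tweets):
    # keep the first occurrence of each distinct text, skipping retweets
    seen = set()
    kept = []
    for t in tweets:
        text = t.strip()
        if text.lower().startswith("rt @"):   # conventional retweet marker
            continue
        if text in seen:                       # exact duplicate
            continue
        seen.add(text)
        kept.append(text)
    return kept
```

Near-duplicates (e.g. the same tweet with a different URL shortener) would need fuzzier matching than this exact-text check.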
To obtain a training set for creating a language model useful for sentiment classification, a manual annotation step was performed. This process involved three annotators and a supervisor. The annotators were asked to identify the sentiment associated with the topic of each tweet; the supervisor developed a classification scheme and created a handbook to train the annotators on how to classify text documents.
The manual coding was performed using the following 3 labels:
• Positive: tweets that carry positive sentiment towards the topic Lecce 2019;
• Negative: tweets that carry negative sentiment towards the topic Lecce 2019;
• Neutral: tweets which do not carry any sentiment towards the topic Lecce 2019 or
tweets which do not have any mention or relation to the topic.
Each annotator evaluated all 1,700 tweets. For the construction and the size of the training set, see the next section.
Analysis of the inter-coder reliability metrics shows that the annotators agreed on more than 80 % of the documents (an average agreement of 0.82).
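The average agreement reported above can be computed as the mean pairwise percent agreement across the three annotators; a minimal sketch (the actual reliability metric used by the authors may differ, e.g. a chance-corrected kappa [20]):

```python
from itertools import combinations

def average_pairwise_agreement(annotations):
    # annotations: one label sequence per annotator, aligned by document
    scores = [sum(a == b for a, b in zip(x, y)) / len(x)
              for x, y in combinations(annotations, 2)]
    return sum(scores) / len(scores)
```

For three annotators this averages the three pairwise agreement rates over the shared set of documents.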
5 Method
• Emoticon replacement. A tweet whose sentiment towards the topic is favorable, but which contains emoticons seen as negative, creates a contrast between the emoticons and the favorable sentiment for that topic.
• URL replacement with the string “URL”;
• Removal of words that do not begin with a letter of the alphabet.
• Removal of numbers and punctuation. Another element that can characterize text polarity is punctuation, in particular exclamation points, question marks and ellipses. Including these elements in the classification process can lead to a more accurate definition of sentiment, also taking into account repetitions that intensify the opinion expressed. However, including punctuation slows the classifier down.
• Stopwords removal. Some word categories are very frequent in texts and are generally not significant for sentiment analysis. This set of textual elements includes articles, conjunctions and prepositions. Common practice in sentiment analysis is to remove these entities from the text before the analysis.
• Removal of tokens with fewer than 2 characters;
• Shortening of repeated characters. Sometimes words are lengthened with the repetition of characters. This feature can be a reliable indicator of intensified emotion. In the Italian language there are no sequences of three or more identical characters in a word, so we can consider such occurrences an extension of the base word. Since the number of repeated characters is not predictable, and because the probability that small differences are significant is low, sequences of three or more repeated characters can be mapped to sequences of only two characters.
• Stemming. Stemming is the process of reducing words to their stem. This can help reduce the vocabulary size and thus increase classifier performance, especially for small datasets. However, stemming can be a double-edged sword: reducing all forms of a word can eliminate the sentiment nuances that, in some cases, make the difference, or it can unify words that have opposite polarities. The benefits of applying a stemmer seem more evident when training documents are few, although the differences are generally imperceptible.
• Part-of-Speech tagging. There are many cases in which there is an interpretation conflict among words with the same representation but a different role in the sentence. This suggests that it may be useful to run a PoS tagger on the data and to use the word-tag pair as a feature. In the literature, a PoS tagger often yields a slight increase in accuracy at the expense of processing speed, since it slows the preprocessing phase.
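Several of the steps above can be combined into a single cleaning function. The sketch below is illustrative, not the authors' pipeline; the stopword list is a small hypothetical subset, and hashtag and emoticon handling (varied per feature set in the experiments) are omitted.

```python
import re

# illustrative subset of Italian stopwords; a real list would be much longer
STOPWORDS = {"di", "e", "il", "lo", "la", "che", "per", "un", "una", "con", "del"}

def preprocess(tweet):
    t = tweet.lower()
    t = re.sub(r"https?://\S+", "URL", t)    # URL replacement with the string "URL"
    t = re.sub(r"@\w+", " ", t)              # user name removal
    t = re.sub(r"(\w)\1{2,}", r"\1\1", t)    # shorten 3+ repeated characters to 2
    t = re.sub(r"[^\w\s]|\d", " ", t)        # drop punctuation and numbers
    # keep tokens that start with a letter, have >= 2 chars and are not stopwords
    return [w for w in t.split()
            if w[0].isalpha() and len(w) >= 2 and w not in STOPWORDS]
```

A stemming step (e.g. an Italian Snowball stemmer) would follow token filtering in the feature sets that use it.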
As specified in the following paragraph, 8 classification stages were performed, each with a different approach to text preprocessing and feature selection, using training sets of different sizes. For each stage a 10-fold cross-validation was performed, dividing the manually annotated dataset into a training set and a fixed test set of 700 tweets. Every 10-fold validation was repeated 10 times. This was necessary to obtain reliable accuracy estimates affected by a minimal amount of error [29].
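The repeated 10-fold splitting can be sketched as follows (an illustrative sketch of repeated k-fold index generation, not the authors' exact protocol, which also holds out a fixed 700-tweet test set):

```python
import random

def repeated_kfold(n_docs, k=10, repeats=10, seed=0):
    # yield (train, test) index lists; within each repetition every
    # document appears in exactly one test fold
    rng = random.Random(seed)
    for _ in range(repeats):
        idx = list(range(n_docs))
        rng.shuffle(idx)
        for fold in range(k):
            test = idx[fold::k]
            held_out = set(test)
            train = [i for i in idx if i not in held_out]
            yield train, test
```

Averaging an accuracy metric over all k × repeats folds gives the stabilized estimate referred to in the text.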
All stages contain the following preprocessing steps: all letters are converted to lowercase; user names, URLs, numbers, punctuation, hashtags, words that do not begin with a letter of the Latin alphabet and those composed of fewer than two characters are removed. In addition to these, the sets of features that characterize the 8 classification stages are the following:
• set 1: uni-grams;
• set 2: bi-grams;
• set 3: uni-grams + bi-grams;
• set 4: set 1 + stopwords removal + repeated letters shortening;
• set 5: set 4 + stemming;
• set 6: set 5 + emoticon inclusion with replacement;
• set 7: set 6 + hashtags inclusion with character “#” removal;
• set 8: set 7 + PoS tag.
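The n-gram distinction between sets 1–3 can be sketched as follows; this is an illustrative sketch, not the authors' feature extractor (sets 4–8 differ only in the preprocessing applied before uni-gram extraction):

```python
def ngrams(tokens, n):
    # contiguous n-grams from a token list
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def extract_features(tokens, feature_set):
    if feature_set == 1:                       # set 1: uni-grams
        return ngrams(tokens, 1)
    if feature_set == 2:                       # set 2: bi-grams
        return ngrams(tokens, 2)
    if feature_set == 3:                       # set 3: uni-grams + bi-grams
        return ngrams(tokens, 1) + ngrams(tokens, 2)
    raise ValueError("sets 4-8 reuse uni-grams with extra preprocessing")
```

Set 3's larger feature space explains the slower classification the results section reports for the joint uni-gram + bi-gram configuration.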
The ReadMe algorithm [25] estimates the proportions of global sentiment of a dataset with a low margin of error, exceeding the predictive performance of the other algorithms.
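As a rough illustration of the idea behind ReadMe-style aggregate estimation: per-category word-presence probabilities are estimated from the labeled set, and the category proportions of the unlabeled set are recovered by solving a constrained least-squares system. This is a heavily simplified sketch, not the actual algorithm of [25], which works on sampled word-profile frequencies rather than individual word presences; all names and data are hypothetical.

```python
import numpy as np

def estimate_proportions(X_labeled, y_labeled, X_unlabeled, n_categories):
    # A[w, c] = P(word w present | category c), from the hand-labeled tweets
    A = np.stack([(X_labeled[y_labeled == c] > 0).mean(axis=0)
                  for c in range(n_categories)], axis=1)
    # b[w] = observed P(word w present) in the unlabeled corpus
    b = (X_unlabeled > 0).mean(axis=0)
    # heavily weighted extra row enforcing that the proportions sum to 1
    A = np.vstack([A, 100.0 * np.ones(n_categories)])
    b = np.append(b, 100.0)
    p, *_ = np.linalg.lstsq(A, b, rcond=None)
    p = np.clip(p, 0.0, None)
    return p / p.sum()
```

Because it solves directly for the aggregate proportions instead of classifying each document, an approach of this kind can be accurate at the dataset level even when individual predictions would be noisy.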
To evaluate the algorithms that classify individual text documents, the following
performance metrics were measured: Accuracy (A), Precision (P), Recall (R),
F1-measure [27].
To compare and evaluate classification algorithms that predict the overall proportions of the sentiment categories, the following statistical metrics are used: Mean Absolute Error (MAE) and Root Mean Square Error (RMSE) [25].
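The metrics above are standard; a minimal self-contained implementation, for reference (the per-class precision/recall/F1 form shown here is one common convention):

```python
import math

def accuracy(y_true, y_pred):
    # fraction of documents whose predicted label matches the annotation
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def precision_recall_f1(y_true, y_pred, cls):
    # per-class precision, recall and F1 from true/false positive counts
    tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
    fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
    fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

def mae_rmse(true_props, pred_props):
    # dataset-level errors on the predicted category proportions
    diffs = [t - p for t, p in zip(true_props, pred_props)]
    mae = sum(abs(d) for d in diffs) / len(diffs)
    rmse = math.sqrt(sum(d * d for d in diffs) / len(diffs))
    return mae, rmse
```

The first two functions score document-level classifiers; `mae_rmse` compares the predicted and annotated positive/negative/neutral proportions for the dataset-level case.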
6 Results
The aim of this analysis is to identify which combination of features and training set
size produces an optimal sentiment classification of short text messages related to the
event “Lecce 2019”.
The Root Mean Square Error value was calculated for the three algorithms (SVM, NB and ReadMe) and for the first three sets of features (1, 2, 3), varying the size of the training set from 100 to 1,000 in steps of 100.
Table 2. Root mean square error values for three different sets of features, varying the size of the training set

Training set size  svm01  nb01   readme01  svm02  nb02   readme02  svm03  nb03   readme03
 100               0.115  0.140  0.115     0.151  0.153  0.150     0.152  0.147  0.116
 200               0.136  0.142  0.094     0.144  0.145  0.112     0.142  0.142  0.098
 300               0.132  0.139  0.089     0.137  0.139  0.093     0.134  0.136  0.086
 400               0.127  0.134  0.081     0.130  0.134  0.083     0.128  0.132  0.079
 500               0.123  0.130  0.075     0.125  0.129  0.077     0.123  0.128  0.073
 600               0.119  0.125  0.071     0.121  0.125  0.073     0.119  0.123  0.071
 700               0.115  0.121  0.069     0.116  0.120  0.070     0.114  0.118  0.069
 800               0.111  0.116  0.067     0.112  0.116  0.068     0.110  0.114  0.067
 900               0.108  0.113  0.066     0.108  0.112  0.067     0.107  0.111  0.066
1000               0.105  0.110  0.065     0.105  0.109  0.065     0.104  0.108  0.065
As seen from Table 2, the Root Mean Square Error values for the three classification algorithms show that, varying the size of the training set, the first three sets of features perform in almost the same way. The use of bi-grams (set 2) generates slightly worse results than the other two feature sets, while the joint use of uni-grams and bi-grams (set 3) produces a greater number of features, which slightly slows down the classification step. For these reasons, the subsequent sets were constructed starting from set 1 and adding other methods of text preprocessing and feature selection.
In all cases, ReadMe appears to be the best algorithm for aggregate (dataset-level) sentiment classification, even with a small training set. This result validates our choice.
108 A. Corallo et al.
Fig. 1. Root mean square error values for ReadMe algorithm using different sets of features and
varying the size of the training set
Figure 1 shows how the Root Mean Square Error of the ReadMe classification varies as the set of features changes.
The Root Mean Square Error decreases as other feature selection and extraction methods are added to unigrams, that is, as the number of words taken into account decreases or the words are characterized in different ways. This reduction is particularly evident from set 4 onwards, where stopwords removal and repeated-characters shortening yield a Root Mean Square Error value of 4.2 %. However, the application of further preprocessing methods (sets 5, 6, 7) is effective only with a small training set. It can also be noted that for sets 4, 5, 6 and 7, increasing the number of training tweets beyond 600 has no effect on the trend of the RMSE.
The reduction of the Root Mean Square Error with the application of set 8 is rather remarkable. For this set, we reach the lowest error (2.8 %) with a training set of 700 documents. However, a training set larger than 400 tweets is not very useful in terms of error reduction, since the further Root Mean Square Error reduction is about 0.1 %.
There is not much difference in computational complexity between creating training sets with feature sets 4, 5, 6 and 7. All these sets perform quite similarly, but set 7 gives the best results. The use of PoS tagging (set 8), instead, introduces a slowdown in the preprocessing stage but achieves the best results among all feature sets.
The Accuracy measures (Fig. 2), carried out for the SVM and NB algorithms using set 7, show almost the same trend for the two algorithms. This result, which sees the weaker NB close to SVM, the best state-of-the-art classification algorithm, is probably due to the kind of documents analyzed, namely short text messages (tweets), and agrees with the literature, which also points out that SVM performs better with longer texts [28].
As shown in Fig. 3 and as already pointed out, the analysis of the accuracy for the other sets of features does not lead to significant increases. For all sets, the accuracy increases with the size of the training set, up to a value of 78 % for the NB algorithm with the use of PoS tagging (set 8).

Fig. 2. Accuracy of SVM and NB using features set 7, varying the training set size

Fig. 3. Accuracy values of NB with different sets of features, varying the training set size
The same trends are observed for the Precision (P), Recall (R) and F1 measures.
In summary, for the dataset-level sentiment analysis of the tweets, the choice of unigram features with stopwords removal, repeated-characters shortening, stemming, emoticon replacement, hashtags inclusion with "#" character removal and PoS tagging (set 8) proved to be the most successful. We believe that the best number of tweets to include in the training set, as a good compromise between error and human labor, is 300 when using feature set 8, or 500 when using feature sets 4, 5, 6 or 7. However, even with 200–300 training tweets good results can be achieved.
For document-level classification, an accuracy of 78 % is achieved with the NB algorithm by using the 1,000-tweet training set with feature set 8; however, for document-level classification it is sufficient to use set 6 or 7, since these generate almost the same accuracy values as set 8 while eliminating the slowness of the PoS tagging step. Here, unlike the dataset-level classification, there seems to be no flattening of the accuracy increase as the training set grows. So, up to a certain limit, the more training tweets are available, the more accurate the sentiment classification. However, we believe that a training set of about 300–400 tweets can generate acceptable results.
In both types of classification, the use of PoS tagging must be weighed according to the type of application: if the goal is real-time sentiment analysis, this preprocessing approach should be avoided; otherwise it can be used.
7 Conclusion
Following the state-of-the-art experience with algorithms for sentiment classification, this paper proposes an optimized approach for the analysis of tweets related to a public administration event. The possibility of extracting opinions from social networks and classifying sentiment using different machine learning algorithms makes this a valuable decision support tool for government.
To meet this need, this paper proposes an approach that considers document-level and dataset-level sentiment classification algorithms to maximize the accuracy of the results in both single-document and aggregated sentiment classification. The work also points out which feature sets produce better results depending on the size of the training set and the level of classification.
We have introduced a new dataset of 1,700 tweets relating to the public event of
“Lecce 2019: European Capital of Culture”. Each tweet in this set has been manually
annotated for positive, negative or neutral sentiment.
An accuracy of 78 % is achieved using the NB document-level sentiment classification algorithm and unigram features with stopwords removal, repeated-characters shortening, stemming, emoticon replacement, hashtags inclusion with "#" character removal and PoS tagging, with a training set of 1,000 tweets. A training set of 300–400 tweets can be a reasonable lower limit for achieving acceptable results.
Our best overall result for dataset-level classification is obtained with the ReadMe approach, using a feature set that also includes PoS tagging and a training set of 700 tweets. Using this optimal set of features, the dataset-level sentiment classification reports a low Root Mean Square Error value, equal to 2.8 %. However, almost the same results can be obtained with a training set of 400 tweets.
In a context such as public administration, the emotional aspect of opinions can be crucial. Future work involves developing algorithms that extract and detect the types of emotions or moods of citizens in order to support public administration decisions.
References
1. Jansen, B., Zhang, M., Sobel, K., Chowdury, A.: Twitter power: tweets as electronic word of
mouth. J. Am. Soc. Inf. Sci. Technol. 60(11), 2169–2188 (2009)
2. O’Connor, B., Balasubramanyan, R., Routledge, B., Smith, N.: From tweets to polls: linking
text sentiment to public opinion time series. In: Proceedings of the Fourth International
Conference on Weblogs and Social Media, ICWSM 2010, Washington, DC, USA (2010)
3. Tumasjan, A., Sprenger, T., Sandner, P., Welpe, I.: Predicting elections with Twitter: what
140 characters reveal about political sentiment. In: Proceedings of the Fourth International
Conference on Weblogs and Social Media, ICWSM 2010, Washington, DC, USA (2010)
4. Kouloumpis, E., Wilson, T., Moore, J.: Twitter sentiment analysis: the good the bad and the
OMG! In: Proceedings of the Fifth International Conference on Weblogs and Social Media,
ICWSM 2011, Barcelona, Catalonia, Spain (2011)
5. Salathe, M., Khandelwal, S.: Assessing vaccination sentiments with online social media:
implications for infectious disease dynamics and control. PLoS Comput. Biol. 7(10),
1002199 (2011)
6. Mandel, B., Culotta, A., Boulahanis, J., Stark, D., Lewis, B., Rodrigue J.: A Demographic
analysis of online sentiment during hurricane irene. In: Proceedings of the Second Workshop
on Language in Social Media, LSM 2012, Stroudsburg (2012)
7. Xu, J.-M., Jun, K.-S., Zhu, X., Bellmore, A.: Learning from bullying traces in social media.
In: HLT-NAACL, pp. 656–666 (2012)
8. Asur, S., Huberman, B.A.: Predicting the future with social media. In: Proceedings of the 2010 International Conference on Web Intelligence and Intelligent Agent Technology, WI-IAT 2010, vol. 01, pp. 492–499. IEEE Computer Society, Washington, D.C., USA (2010)
9. Bakliwal, A., Foster, J., van der Puil, J., O’Brien, R., Tounsi, L., Hughes, M.: Sentiment
analysis of political tweets: towards an accurate classifier. In: Proceedings of the Workshop
on Language in Social Media (LASM 2013), pp. 49–58. Atlanta, Georgia (2013)
10. Sanjiv Das, M.C.: Yahoo! for Amazon: extracting market sentiment from stock message
boards. In: Proceedings of the Asia Pacific Finance Association Annual Conference (APFA)
(2001)
11. Tong, R.M.: An operational system for detecting and tracking opinions in on-line discussion.
In: Proceedings of the SIGIR Workshop on Operational Text Classification (OTC) (2001)
12. Dave, K., Lawrence, S., Pennock, D.M.: Mining the peanut gallery: opinion extraction and
semantic classification of product reviews. In: Proceedings of WWW, pp. 519–528 (2003)
13. Neri, F., Aliprandi, C., Camillo, F.: Mining the web to monitor the political consensus. In:
Wiil, U.K. (ed.) Counterterrorism and Open Source Intelligence. LNSN, pp. 391–412.
Springer, Vienna (2011)
14. Kale, A., Karandikar, A., Kolari, P., Java, A., Finin, T., Joshi, A.: Modeling trust and
influence in the blogosphere using link polarity. In: Proceedings of the International
Conference on Weblogs and Social Media (ICWSM) (2007)
15. Dolicanin, C., Kajan, E., Randjelovic, D.: Handbook of Research on Democratic Strategies
and Citizen-Centered E-Government Services, pp. 231–249. IGI Global, Hersey (2014)
16. Chesbrough, H.: Open Services Innovation. Wiley, New York (2011)
17. http://ec.europa.eu/programmes/creative-europe/actions/capitals-culture_en.htm
18. http://www.capitalicultura.beniculturali.it/index.php?it/108/suggerimenti-per-redigere-una-
proposta-progettuale-di-successo
19. http://www.lecce2019.it/2019/utopie.php
20. Landis, J.R., Koch, G.G.: The measurement of observer agreement for categorical data. Biometrics 33, 159–174 (1977)
21. Hopkins, D., King, G.: Extracting systematic social science meaning from text. Unpublished
manuscript, Harvard University (2007). http://gking.harvard.edu/files/abs/words-abs.shtml
22. Liu, B.: Sentiment analysis and opinion mining. Synthesis Lectures on Human Language
Technologies. Morgan & Claypool Publishers (2012)
23. Narayanan, V., Arora, I., Bhatia, A.: Fast and accurate sentiment classification using an
enhanced naive Bayes model. In: Yin, H., Tang, K., Gao, Y., Klawonn, F., Lee, M., Weise,
T., Li, B., Yao, X. (eds.) IDEAL 2013. LNCS, vol. 8206, pp. 194–201. Springer, Heidelberg
(2013)
24. Yang, Y., Xu, C., Ren, G.: Sentiment Analysis of Text Using SVM. In: Wang, X., Wang, F.,
Zhong, S. (eds.) EIEM 2011. LNEE, vol. 138, pp. 1133–1139. Springer, London (2011)
25. King, G., Hopkins, D.: A method of automated nonparametric content analysis for social science. Am. J. Polit. Sci. 54(1), 229–247 (2010)
26. Medhat, W., Hassan, A., Korashy, H.: Sentiment analysis algorithms and applications: a survey. Ain Shams Eng. J. 5, 1093–1113 (2014)
27. Huang, J.: Performance Measures of Machine Learning. University of Western Ontario,
Ontario (2006)
28. Wang, S., Manning, C.D.: Baselines and bigrams: simple, good sentiment and topic
classification. In: Proceedings of the 50th Annual Meeting of the Association for
Computational Linguistics: Short Papers, vol. 2. Association for Computational
Linguistics (2012)
29. Refaeilzadeh, P., Tang, L., Liu, H.: Cross-validation. In: Liu, L., Ӧzsu, M.T. (eds.)
Encyclopedia of Database Systems, pp. 532–538. Springer, New York (2009)