
Available online at www.sciencedirect.com

ScienceDirect
ICT Express xxx (xxxx) xxx
www.elsevier.com/locate/icte

Sentiment analysis of Malayalam tweets using machine learning techniques


Soumya S. ∗, Pramod K.V.
Department of Computer Applications, Cochin University of Science and Technology, Cochin, Kerala, India
Received 4 February 2020; received in revised form 1 April 2020; accepted 14 April 2020
Available online xxxx

Abstract
This paper presents sentiment analysis of Malayalam tweets using machine learning techniques. The tweets are classified as positive
or negative using Naive Bayes (NB), Support Vector Machine (SVM), and Random Forest (RF) classifiers. Four features are considered
for forming the feature vectors of the input dataset: Bag of Words (BOW), Term Frequency-Inverse Document Frequency (TF-IDF),
Unigram with Sentiwordnet, and Unigram with Sentiwordnet including negation words. The Random Forest classifier achieves the
highest accuracy when Unigram with Sentiwordnet including negation words is used as the feature.
© 2020 The Korean Institute of Communications and Information Sciences (KICS). Publishing services by Elsevier B.V. This is an open access
article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).
Keywords: Machine learning; Malayalam; Sentiment analysis; Sentiwordnet

1. Introduction

Sentiment Analysis (SA) is the computational study of people's opinions, sentiments, attitudes, and evaluations expressed in written language [1]. Nowadays, due to the growth of social media, people's opinions are crucial for decision making. People express opinions in their mother tongue through different social media websites like Facebook, Twitter, Blogs, etc. SA plays a vital role in the film industry, the political field, and marketing. Tweets are messages of up to 280 characters, so sentence-level analysis is most suitable for the SA of tweets.

Jack Dorsey created Twitter in 2006. The length of tweets was restricted to 140 characters until 2017; the limit is now 280 characters. Malayalam, the mother tongue of Keralites, is the language they most commonly use to express their opinions on Twitter. SA of Malayalam Twitter messages is needed, since no automatic sentiment analyzer exists for this language.

The proposed work explains SA of Malayalam tweets, which have been classified as positive or negative using different machine learning algorithms, namely NB, SVM, and RF. One of the significant challenges in the SA of Malayalam tweets is the unavailability of a sentiment tagged corpus. Therefore, the sentiment tagged corpus has been created manually.

The main contributions of this work are:
1. Three thousand one hundred eighty-four (3184) tweets have been retrieved using the Twitter API, based on positive and negative sentiment oriented words in Malayalam. Tweets retrieved using positive words sometimes show negative sentiment and vice versa. So, all the retrieved tweets are manually verified and assigned their actual sentiment.
2. Nine hundred fifty-four (954) positive words, 1318 negative words, 33 negation words, and 145 stop words have been identified from the 3184 retrieved tweets, which contain 38208 unique words.
3. Two feature vectors, Unigram with Sentiwordnet and Unigram with Sentiwordnet including negation words, have been created. Sentiwordnet contains 954 positive and 1318 negative words. The first feature vector, Unigram with Sentiwordnet, includes three attributes: the number of occurrences of positive words, the number of occurrences of negative words, and the sentiment of each tweet. These three attributes, along with the number of occurrences of negation words, constitute the second feature vector, Unigram with Sentiwordnet including negation words.

The remaining sections of this paper are organized as follows: Section 2 explains the major works done in this area, whereas Section 3 describes the proposed method for SA and briefly introduces the machine learning models. Section 4 illustrates the experimental setup. Section 5 discusses the results of the different machine learning models. Finally, Section 6 concludes the paper.

∗ Corresponding author.
E-mail address: soumya@cemunnar.ac.in (Soumya S.).
Peer review under responsibility of The Korean Institute of Communications and Information Sciences (KICS).
https://doi.org/10.1016/j.icte.2020.04.003

Please cite this article as: Soumya S. and Pramod K.V., Sentiment analysis of malayalam tweets using machine learning techniques, ICT Express (2020), https://doi.org/10.1016/j.icte.2020.04.003.

2. Related works

SA has been studied and employed widely for the last two decades. Most of the works in SA are specific to the English language.

Pang et al. applied three different machine learning algorithms, NB, Maximum Entropy, and SVM, with unigram and bigram features for SA of movie reviews in English. They showed that SVM outperforms the other two classifiers [1]. Turney proposed an unsupervised technique for sentiment classification. He used semantic orientation and the Pointwise Mutual Information and Information Retrieval (PMI-IR) method for SA of 410 reviews collected from different domains [2].

Nowadays, deep learning gives promising results in Natural Language Processing. Some works using deep neural networks (DNN) are mentioned here. Cambria (2016) studied the effect of emotions in sentiment analysis and used a hybrid approach of SenticNet and deep learning techniques for polarity detection [3]. Wang et al. proposed a capsule model based on recurrent neural networks for sentiment analysis [4]. Liu et al. (2019) proposed an attention-based sentiment reasoner (AS Reasoner) for aspect-based sentiment analysis. They applied attention mechanisms to assign importance to different words in a sentence. The AS Reasoner model was evaluated on four different datasets in Chinese and English [5].

SA has been done in different Indian languages like Bengali, Hindi, Punjabi, Manipuri, Kannada, Tamil and Malayalam. SA done in the Malayalam language is summarized in Table 1. Malayalam is a highly agglutinative language, making the preprocessing step more challenging compared with other languages. A significant issue in SA of Malayalam is the unavailability of a tagged dataset. All the works mentioned in Table 1 have used their own manually created datasets. Nair et al. [6] used both linear SVM and CRF approaches for the SA of Malayalam movie reviews. Hyperparameter tuning was not done in their work. Soumya et al. [7] have done SA of Malayalam tweets with different DNN models. They considered all the words in the corpus for feature matrix creation. Finally, they showed that GRU performed well compared with other DNN models. Kumar et al. [8] have done SA using deep learning techniques such as Convolutional Neural Networks (CNN) and Long Short Term Memory (LSTM). They considered all the unique words in the corpus for feature matrix formation of the input dataset. The size of the feature matrix was huge, as they did not remove insignificant words. SA of Malayalam text collected from social media was done by Rahul et al. [9]. They used both CRF and SVM classifiers for SA. Preprocessing and feature extraction were done in their work, but it shows lower accuracy because no hyperparameter tuning was done.

In the proposed method, we have created the feature matrix by considering positive and negative sentiment oriented words and negation words. This feature matrix has a higher impact on SA. Fine-tuning of hyperparameters has also been done in this work. Both methods increase the accuracy of prediction.

3. Proposed methods

This section explains the Dataset, Preprocessing Methods, Feature Selection, and Classifiers used in our experimental setup. The architecture of the proposed method is shown in Fig. 1.

Fig. 1. Proposed architecture for SA.

3.1. Dataset

Due to the unavailability of a sentiment tagged dataset in Malayalam, we have created the dataset by retrieving tweets using the Twitter API. Twenty-two (22) positive and 13 negative Malayalam words are identified and used as the hashtags for retrieving tweets. These words are shown in Fig. 2. The dataset contains 3184 tweets.


Table 1
Summary of related work on the SA of Malayalam.

References             Dataset                       Sentiment classification methods
Soumya et al. [7]      Malayalam Tweets              Deep neural network architectures like RNN, LSTM, Bi-LSTM, GRU and CNN models
Kasthoori et al. [10]  Malayalam Online Newspaper    Machine Learning Method ART classifier for domain identification and Fuzzy logic for polarity Classification
Rahul et al. [9]       Malayalam text from social    CRF and SVM
                       media
Kumar et al. [8]       Malayalam Tweets              CNN and LSTM
Ashna et al. [11]      Malayalam Reviews             Lexicon based approach
Thulasi et al. [12]    Malayalam Movie Reviews       Aspect based analysis using Viterbi and HMM model
Nair et al. [6]        Malayalam Movie Reviews       SVM and CRF
Anagha et al. [13]     Malayalam film Reviews        Fuzzy logic
Jayan et al. [14]      Malayalam film Reviews        CRF combined with rule based approach
Nair et al. [15]       Malayalam Movie Reviews       Rule based approach
Mohandas et al. [16]   Malayalam Movie Reviews       SO - PMI - IR

Fig. 2. Positive and Negative polarity words in Malayalam.

3.2. Preprocessing

The retrieved tweets contain hyperlinks, punctuation, special characters, etc. These have been removed using regular expressions in Python. After that, the 3184 tweets are manually verified and labeled with positive or negative sentiment. Among the 3184 tweets, 1586 are positive sentiment oriented and 1598 are negative sentiment oriented. The sample dataset is shown in Fig. 3. Stop words are words that occur very frequently in Malayalam sentences but are less informative for the SA task. From the 38208 unique words, 145 stop words have been identified and removed from the tweets.

Fig. 3. The sample dataset of positive and negative polarity classified Malayalam Tweets.

3.3. Feature selection

Bag of Words, Term Frequency-Inverse Document Frequency, Unigram with Sentiwordnet, and Unigram with Sentiwordnet including negation words have been considered for feature vector formation of the input dataset.
• BOW: In BOW, the text is transformed into a bag of words where each entry corresponds to the number of occurrences of a particular term in the sentence. The feature matrix is created with dimension m × n, where m is the number of sentences and n is the number of unique words in the corpus.
• TF-IDF: TF-IDF is a statistical measure used to evaluate the significance of a particular term in a corpus. It is computed as $\text{tf-idf} = tf_t \times idf_t$, where $tf_t$ is the term frequency and $idf_t$ is the inverse document frequency of term $t$.
• Unigram with Sentiwordnet: This feature vector contains three attributes: the number of positive words, the number of negative words, and the sentiment of the corresponding tweet. The Sentiwordnet is created by identifying 954 positive and 1318 negative words from the 3184 tweets.
• Unigram with Sentiwordnet including negation words: This feature vector contains four attributes: the number of positive words, the number of negative words, the number of negation words, and the sentiment of the corresponding tweet. In this case, 33 negation words are included along with the Sentiwordnet.
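The preprocessing step and the two Sentiwordnet-based feature vectors described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function names, the regular expressions, and the tiny English placeholder word lists are all hypothetical stand-ins for the 954 positive, 1318 negative, 33 negation, and 145 stop words identified in the paper. The sentiment label of each tweet, which the paper lists as the final attribute, would be stored alongside the counts.

```python
import re
import string

# Assumed patterns: hyperlinks, then ASCII punctuation/special characters.
# Only ASCII punctuation is stripped, so Malayalam characters are left untouched.
URL_RE = re.compile(r"https?://\S+")
PUNCT_RE = re.compile("[" + re.escape(string.punctuation) + "]")


def preprocess(tweet, stop_words):
    """Remove hyperlinks and punctuation, split on whitespace, drop stop words."""
    text = URL_RE.sub(" ", tweet)
    text = PUNCT_RE.sub(" ", text)
    return [tok for tok in text.split() if tok not in stop_words]


def lexicon_features(tokens, positive, negative, negation=None):
    """Count lexicon hits: [n_pos, n_neg], plus n_negation when a negation list is given."""
    n_pos = sum(tok in positive for tok in tokens)
    n_neg = sum(tok in negative for tok in tokens)
    if negation is None:
        return [n_pos, n_neg]
    return [n_pos, n_neg, sum(tok in negation for tok in tokens)]


# Hypothetical placeholder lexicons (English tokens, for illustration only).
POSITIVE = {"good", "great"}
NEGATIVE = {"bad"}
NEGATION = {"not"}
STOP_WORDS = {"the", "is"}

tokens = preprocess("the movie is not good, great songs! https://t.co/xyz #bad", STOP_WORDS)
print(tokens)                                                  # ['movie', 'not', 'good', 'great', 'songs', 'bad']
print(lexicon_features(tokens, POSITIVE, NEGATIVE))            # [2, 1]
print(lexicon_features(tokens, POSITIVE, NEGATIVE, NEGATION))  # [2, 1, 1]
```

Compared with BOW or TF-IDF, whose matrices have one column per unique word in the corpus, this representation keeps only two or three count columns per tweet, which is why the paper describes its feature matrix as much smaller.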


Table 2
Precision, Recall and F-score of NB Classifier.

Features                                             Positive                       Negative
                                                     Precision  Recall  F-score     Precision  Recall  F-score
BOW                                                  .8872      .824    .8545       .8168      .882    .848
TF-IDF                                               .884       .798    .839        .7956      .882    .836
Unigram with Sentiwordnet                            .9685      .9035   .9349       .91        .97     .94
Unigram with Sentiwordnet including negation words   .973       .913    .942        .916       .974    .9445

Table 3
Precision, Recall and F-score of SVM (Kernel = linear) Classifier.

Features                                             Positive                       Negative
                                                     Precision  Recall  F-score     Precision  Recall  F-score
BOW                                                  .836       .88     .858        .878       .834    .866
TF-IDF                                               .852       .886    .869        .883       .849    .866
Unigram with Sentiwordnet                            .963       .912    .937        .924       .968    .945
Unigram with Sentiwordnet including negation words   .939       .952    .946        .95        .936    .943
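The per-class precision, recall, and F-score values reported in these tables can be related to each other with a short calculation. The sketch below assumes the F-score is the balanced F1 measure (the harmonic mean of precision and recall); the paper does not spell out the formula, and the function names are illustrative. As a check, the NB classifier's BOW row for the negative class (precision .8168, recall .882) reproduces the reported F-score of about .848.

```python
def precision_recall(tp, fp, fn):
    """Precision and recall for one class from true/false positive and false negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def f_score(precision, recall):
    """Balanced F-score (F1): harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# NB with BOW, negative class: precision .8168 and recall .882 from Table 2.
print(round(f_score(0.8168, 0.882), 3))  # 0.848
```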

3.4. Machine learning classifiers

Three different machine learning algorithms, NB, SVM, and RF, have been applied to predict the sentiment of Malayalam tweets. Selecting the hyperparameters is the most challenging part of achieving accurate predictions.
• Naive Bayes Classifier: A Multinomial NB classifier predicts the sentiment of the test dataset as positive or negative. The classification is based on Bayes' theorem [17].
• Support Vector Machine: SVM is a supervised machine learning algorithm proposed by Vapnik in 1992 [18,19]. SVM finds the linear separator with the maximum margin using support vectors in a high-dimensional space. Both linear and RBF kernel functions have been used for predicting the sentiment of tweets.
• Random Forest: RF is a supervised machine learning algorithm created by Tin Kam Ho in 1995 [20]. It builds multiple decision trees and merges them together for the classification of data. It searches for the best feature among a random subset of features.

Fig. 4. Comparing Accuracy of Test Dataset with different ML Models.
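To make the Naive Bayes item above more concrete, here is a small self-contained sketch of a multinomial NB classifier over token counts. It is an illustration only, not the classifier used in the experiments: the class name is hypothetical, add-one (Laplace) smoothing is an assumption the paper does not state, and the toy English tokens stand in for preprocessed Malayalam tweets.

```python
import math
from collections import Counter, defaultdict


class TinyMultinomialNB:
    """Multinomial Naive Bayes with add-one smoothing (an assumed choice)."""

    def fit(self, docs, labels):
        self.classes = sorted(set(labels))
        self.vocab = {tok for doc in docs for tok in doc}
        self.class_counts = Counter(labels)       # documents per class
        self.token_counts = defaultdict(Counter)  # token counts per class
        for doc, label in zip(docs, labels):
            self.token_counts[label].update(doc)
        self.total_docs = len(labels)
        return self

    def log_posterior(self, doc, label):
        # Bayes' theorem in log space: log P(label) + sum of log P(token | label).
        log_prob = math.log(self.class_counts[label] / self.total_docs)
        denom = sum(self.token_counts[label].values()) + len(self.vocab)
        for tok in doc:
            if tok in self.vocab:  # tokens unseen in training are ignored
                log_prob += math.log((self.token_counts[label][tok] + 1) / denom)
        return log_prob

    def predict(self, doc):
        return max(self.classes, key=lambda c: self.log_posterior(doc, c))


# Toy training data (placeholder tokens, not from the paper's dataset).
train_docs = [["good", "film"], ["great", "songs"], ["bad", "film"], ["boring", "story"]]
train_labels = ["positive", "positive", "negative", "negative"]

model = TinyMultinomialNB().fit(train_docs, train_labels)
print(model.predict(["good", "songs"]))   # positive
print(model.predict(["boring", "film"]))  # negative
```

Working in log space avoids numerical underflow when many token probabilities are multiplied together.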

4. Experimental setup

Tweets are retrieved using the Twitter API based on hashtags. The hyperlinks, punctuation, special characters, etc. present in the tweets are removed in the preprocessing steps. After preprocessing, feature vectors are formed using BOW, TF-IDF, Unigram with Sentiwordnet, and Unigram with Sentiwordnet including negation words. The dataset contains 3184 tweets, which are split into training and test datasets in the ratio 70:30. Different machine learning classifiers, NB, SVM, and RF, are used to create trained models. After model creation, the sentiment of the test dataset is predicted using these classifiers and the accuracy is measured. For NB, a multinomial NB classifier is used for classifying the tweets as positive or negative. SVM uses both linear and RBF kernel functions for model creation. The hyperparameters are fine-tuned using the GridSearchCV function in Python. For SVM, the hyperparameters C and gamma take different combinations of values, and the search returns the best parameters after tuning. C takes the values 0.1, 1, 10, 100, and 1000, and gamma takes the values 0.1, 0.01, 0.001, and 0.0001. After fine-tuning, the SVM gives C as 1 and gamma as 0.1. For RF, the hyperparameters n_estimators and max_depth take combinations of the values 100, 300, 500, 700, 900 and 10, 30, 50, 70, 90, respectively. After fine-tuning, n_estimators and max_depth are selected as 500 and 70, respectively, for the BOW and TF-IDF features. For Unigram with Sentiwordnet and Unigram with Sentiwordnet including negation words, however, n_estimators is selected as 100 and max_depth as 70.

5. Results and discussion

Precision, Recall, F-score, and Accuracy [21] have been measured for the NB, SVM, and RF classifiers considering four different features. The feature matrices with BOW and TF-IDF are created by considering all the unique words in the corpus. In sentiment analysis, some words are insignificant for predicting whether the sentiment is positive or negative. The size of the feature matrix with BOW and TF-IDF is greater compared with the other two features. All three classifiers show better accuracy with Unigram with Sentiwordnet and Unigram with Sentiwordnet including negation words, because sentiment oriented words are significant while predicting the sentiment of sentences. The precision, recall, and F-score obtained by considering all the above-mentioned features using four different classifiers, NB, SVM (Kernel = Linear), SVM (Kernel = RBF), and RF, are shown in Table 2, Table 3, Table 4, and Table 5, respectively. Fivefold cross-validation is done in this work. The validation accuracy of NB, SVM with linear kernel, SVM with RBF kernel, and RF is shown in Table 6. Table 7 depicts the test data accuracy of the four classifiers. Fig. 4 presents a comparative study of the various machine learning models with different features. The features SW and SW+Negation represent Sentiwordnet and Sentiwordnet including negation words, respectively.

Table 4
Precision, Recall and F-score of SVM (Kernel = rbf) Classifier.

Features                                             Positive                       Negative
                                                     Precision  Recall  F-score     Precision  Recall  F-score
BOW                                                  .89        .88     .886        .868       .878    .873
TF-IDF                                               .90        .856    .877        .846       .893    .869
Unigram with Sentiwordnet                            .971       .91     .938        .91        .972    .94
Unigram with Sentiwordnet including negation words   .984       .911    .947        .915       .985    .949

Table 5
Precision, Recall and F-score of RF classifier.

Features                                             Positive                       Negative
                                                     Precision  Recall  F-score     Precision  Recall  F-score
BOW                                                  .91        .885    .90         .87        .898    .886
TF-IDF                                               .914       .88     .898        .873       .91     .89
Unigram with Sentiwordnet                            .964       .935    .949        .933       .963    .948
Unigram with Sentiwordnet including negation words   .983       .93     .956        .93        .983    .956

Table 6
Validation accuracy of NB, SVM with Linear, SVM with RBF, RF.

Classifiers               Validation accuracy
                          BOW     TF-IDF   Unigram with    Unigram with Sentiwordnet
                                           Sentiwordnet    including negation words
NB                        .714    .714     .944            .944
SVM with Linear Kernel    .68     .70      .922            .938
SVM with RBF Kernel       .696    .668     .935            .935
RF                        0.72    0.71     0.944           0.944

Table 7
Test data accuracy of NB, SVM with Linear, SVM with RBF, RF.

Classifiers               Test data accuracy
                          BOW     TF-IDF   Unigram with    Unigram with Sentiwordnet
                                           Sentiwordnet    including negation words
NB                        .851    .838     .937            .944
SVM with Linear Kernel    .857    .867     .941            .945
SVM with RBF Kernel       .88     .87      .939            .948
RF                        .891    .894     .948            .956

6. Conclusion

SA of Malayalam tweets using NB, SVM, and RF is proposed in this work. Four different features, BOW, TF-IDF, Unigram with Sentiwordnet, and Unigram with Sentiwordnet including negation words, are considered for forming the feature vectors of the input dataset. All the classifiers have shown better accuracy with the last two features than with the other features. The RF classifier with Unigram with Sentiwordnet including negation words achieved the highest accuracy, 95.6%.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

CRediT authorship contribution statement

Soumya S.: Writing - original draft. Pramod K.V.: Supervision.

References

[1] Bo Pang, Lillian Lee, Shivakumar Vaithyanathan, Thumbs up?: sentiment classification using machine learning techniques, in: Proceedings of the ACL-02 Conference on Empirical Methods in Natural Language Processing-Volume 10, Association for Computational Linguistics, 2002.
[2] Peter D. Turney, Thumbs up or thumbs down?: semantic orientation applied to unsupervised classification of reviews, in: Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, Association for Computational Linguistics, 2002.
[3] Erik Cambria, Affective computing and sentiment analysis, IEEE Intell. Syst. 31 (2) (2016) 102–107.
[4] Yequan Wang, et al., Sentiment analysis by capsules, in: Proceedings of the 2018 World Wide Web Conference, 2018.
[5] Ning Liu, et al., Attention-based sentiment reasoner for aspect-based sentiment analysis, Hum.-Cent. Comput. Inf. Sci. 9 (1) (2019) 35.
[6] Deepu S. Nair, et al., Sentiment analysis of Malayalam film review using machine learning techniques, in: 2015 International Conference on Advances in Computing, Communications and Informatics, ICACCI, IEEE, 2015.
[7] S. Soumya, K.V. Pramod, Sentiment analysis of Malayalam tweets using different deep neural network models-case study, in: 2019 9th International Conference on Advances in Computing and Communication, ICACC, IEEE, 2019.
[8] S. Sachin Kumar, M. Anand Kumar, K.P. Soman, Sentiment analysis of tweets in Malayalam using long short-term memory units and convolutional neural nets, in: International Conference on Mining Intelligence and Knowledge Exploration, Springer, Cham, 2017.
[9] M. Rahul, R.R. Rajeev, S. Shine, Social Media Sentiment Analysis for Malayalam, 2018.
[10] V. Kasthoori, B. Soniya, V. Jayan, Domain-independent sentiment analysis in Malayalam, in: Computational Intelligence: Theories, Applications and Future Directions-Volume II, Springer, Singapore, 2019, pp. 151–160.
[11] M.P. Ashna, Ancy K. Sunny, Lexicon based sentiment analysis system for Malayalam language, in: 2017 International Conference on Computing Methodologies and Communication, ICCMC, IEEE, 2017.
[12] P.K. Thulasi, K. Usha, Aspect polarity recognition of movie and product reviews in Malayalam, in: 2016 International Conference on Next Generation Intelligent Systems, ICNGIS, IEEE, 2016.
[13] M. Anagha, et al., Fuzzy logic based hybrid approach for sentiment analysis of Malayalam movie reviews, in: 2015 IEEE International Conference on Signal Processing, Informatics, Communication and Energy Systems, SPICES, IEEE, 2015.
[14] P. Jayan, Deepu S. Nair, S. Elizabeth Jisha, A subjective feature extraction for sentiment analysis in Malayalam language, Int. J. Eng. Sci. 14 (2015) 1–4.
[15] Deepu S. Nair, Jisha P. Jayan, Elizabeth Sherly, Sentima-sentiment extraction for Malayalam, in: 2014 International Conference on Advances in Computing, Communications and Informatics, ICACCI, IEEE, 2014.
[16] Neethu Mohandas, Janardhanan P.S. Nair, V. Govindaru, Domain specific sentence level mood extraction from Malayalam text, in: 2012 International Conference on Advances in Computing and Communications, IEEE, 2012.
[17] Daniel Jurafsky, James H. Martin, Classification: naive Bayes, logistic regression, sentiment, Speech Lang. Process. (2015).
[18] Corinna Cortes, Vladimir Vapnik, Support vector networks, Mach. Learn. 20 (1995) 273–297, Kluwer Academic Publishers, Boston.
[19] Bernhard E. Boser, Isabelle M. Guyon, Vladimir N. Vapnik, A training algorithm for optimal margin classifiers, in: Proceedings of the Fifth Annual Workshop on Computational Learning Theory, ACM, 1992.
[20] Tin Kam Ho, Random decision forests, in: Proceedings of 3rd International Conference on Document Analysis and Recognition, vol. 1, IEEE, 1995.
[21] Louise T. Su, Evaluation measures for interactive information retrieval, Inf. Process. Manage. 28 (4) (1992) 503–516.
