ScienceDirect
ICT Express xxx (xxxx) xxx
www.elsevier.com/locate/icte
Abstract
This paper performs sentiment analysis (SA) of Malayalam tweets using machine learning techniques. The tweets are classified
as positive or negative using Naive Bayes (NB), Support Vector Machine (SVM) and Random Forest (RF) classifiers.
Four features, Bag of Words (BOW), Term Frequency-Inverse Document Frequency (TF-IDF), Unigram with Sentiwordnet,
and Unigram with Sentiwordnet including negation words, are considered for feature vector formation of the input dataset.
The Random Forest classifier shows the highest accuracy, 95.6%, when Unigram with Sentiwordnet including negation words
is used as the feature.
© 2020 The Korean Institute of Communications and Information Sciences (KICS). Publishing services by Elsevier B.V. This is an open access
article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).
Keywords: Machine learning; Malayalam; Sentiment analysis; Sentiwordnet
Please cite this article as: Soumya S. and Pramod K.V., Sentiment analysis of malayalam tweets using machine learning techniques, ICT Express (2020), https://doi.org/10.1016/j.icte.2020.04.003.
2 Soumya S. and Pramod K.V. / ICT Express xxx (xxxx) xxx
2. Related works
SA has been studied and employed widely for the last two
decades. Most of the works in SA are specific for the English
language.
Pang et al. applied three machine learning algorithms, NB,
Maximum Entropy, and SVM, with unigram and bigram features
to the SA of movie reviews in English. They showed that SVM
outperforms the other two classifiers [1]. Turney proposed an
unsupervised technique for sentiment classification. He used
semantic orientation computed with the Pointwise Mutual
Information and Information Retrieval (PMI-IR) method for
the SA of 410 reviews collected from different domains [2].
Nowadays, deep learning gives promising results in Natural
Language Processing. Some works using deep neural networks
(DNN) are mentioned here. Cambria (2016) studied the effect
of emotions in sentiment analysis and used a hybrid approach
of SenticNet and deep learning techniques for polarity
detection [3]. Wang et al. proposed a capsule model based on
recurrent neural networks for sentiment analysis [4]. Liu et
al. (2019) proposed an attention-based sentiment reasoner (AS
Reasoner) for aspect-based sentiment analysis. They applied
attention mechanisms to assign importance to the different
words in a sentence. The AS Reasoner model was evaluated on
four different datasets in Chinese and English [5].
SA has been done in different Indian languages like Bengali,
Hindi, Punjabi, Manipuri, Kannada, Tamil and Malayalam. SA
done in the Malayalam language is summarized in Table 1.
Malayalam is a highly agglutinative language, making the
preprocessing step more challenging compared with other
languages. A significant issue in SA of Malayalam is the
unavailability of a tagged dataset. All the works mentioned
in Table 1 have used their own manually created datasets.
Nair et al. [6] used both linear SVM and CRF approaches for
the SA of Malayalam movie reviews. Hyperparameter tuning was
not done in their work. Soumya et al. [7] have done SA of
Malayalam tweets with different DNN models. They considered
all the words in the corpus for feature matrix creation.
Finally, they showed that GRU performed well compared with
other DNN models. Kumar et al. [8] have done SA using deep
learning techniques such as Convolutional Neural Networks
(CNN) and Long Short Term Memory (LSTM). They considered all
the unique words in the corpus for feature matrix formation
of the input dataset. The size of the feature matrix was
huge, as they did not remove insignificant words. SA of
Malayalam text collected from social media was done by Rahul
et al. [9]. They used both CRF and SVM classifiers for SA.
Preprocessing and feature extraction were done in their work,
but it shows less accuracy because no hyperparameter tuning
was done.
In the proposed method, we have created the feature matrix
by considering positive and negative sentiment-oriented words
and negation words. This feature matrix shows a higher impact
in SA. Fine-tuning of hyperparameters has also been done in
this work. Both methods increase the accuracy of prediction.

3. Proposed methods
This section explains the Dataset, Preprocessing Methods,
Feature Selection, and Classifiers used in our experimental
setup. The architecture of the proposed method is shown in
Fig. 1.

Fig. 1. Proposed architecture for SA.

3.1. Dataset
Due to the unavailability of a sentiment-tagged dataset in
Malayalam, we have created the dataset by retrieving tweets
using the Twitter API. Twenty-two (22) positive and 13
negative Malayalam words are identified and used as the
hashtags for retrieving tweets. These words are shown in
Fig. 2. The dataset contains 3184 tweets.
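The feature matrix built from sentiment-oriented words and negation words can be illustrated with a minimal sketch. The word lists below are hypothetical transliterated placeholders (the Sentiwordnet entries and negation words actually used are not listed in this text), and the whitespace tokenizer is an assumption. The sketch only shows the idea that each tweet is mapped to counts over a small lexicon vocabulary rather than over every unique word in the corpus.

```python
from collections import Counter

# Hypothetical, transliterated placeholder lexicons (not the paper's word lists).
POSITIVE_WORDS = ["nalla", "santhosham"]
NEGATIVE_WORDS = ["mosham", "dukham"]
NEGATION_WORDS = ["illa", "alla"]


def build_vocabulary(include_negation):
    """Return the ordered feature vocabulary for the lexicon-based features."""
    vocab = POSITIVE_WORDS + NEGATIVE_WORDS
    if include_negation:
        vocab = vocab + NEGATION_WORDS
    return vocab


def vectorize(tweet, vocab):
    """Count how often each vocabulary word occurs in a whitespace-tokenized tweet."""
    counts = Counter(tweet.lower().split())
    return [counts[word] for word in vocab]


# Toy tweets (assumption); words outside the lexicon get no column.
tweets = ["cinema nalla santhosham", "cinema nalla alla", "kadha mosham"]
vocab = build_vocabulary(include_negation=True)
feature_matrix = [vectorize(t, vocab) for t in tweets]
for row in feature_matrix:
    print(row)
```

Calling build_vocabulary(include_negation=False) gives the variant without negation words. Words that are not in the lexicon, such as "cinema" in the toy tweets, contribute no column, which is what keeps this matrix smaller than a BOW or TF-IDF matrix over all unique words.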
Table 1
Summary of related work on the SA of Malayalam.
References | Dataset | Sentiment classification methods
Soumya et al. [7] | Malayalam Tweets | Deep neural network architectures like RNN, LSTM, Bi-LSTM, GRU and CNN models
Kasthoori et al. [10] | Malayalam Online Newspaper | Machine Learning Method ART classifier for domain identification and Fuzzy logic for polarity classification
Rahul et al. [9] | Malayalam text from social media | CRF and SVM
Kumar et al. [8] | Malayalam Tweets | CNN and LSTM
Ashna et al. [11] | Malayalam Reviews | Lexicon based approach
Thulasi et al. [12] | Malayalam Movie Reviews | Aspect based analysis using Viterbi and HMM model
Nair et al. [6] | Malayalam Movie Reviews | SVM and CRF
Anagha et al. [13] | Malayalam film Reviews | Fuzzy logic
Jayan et al. [14] | Malayalam film Reviews | CRF combined with rule based approach
Nair et al. [15] | Malayalam Movie Reviews | Rule based approach
Mohandas et al. [16] | Malayalam Movie Reviews | SO-PMI-IR
Table 2
Precision, Recall and F-score of NB Classifier.
Features | Positive Precision | Positive Recall | Positive F-score | Negative Precision | Negative Recall | Negative F-score
BOW | .8872 | .824 | .8545 | .8168 | .882 | .848
TF-IDF | .884 | .798 | .839 | .7956 | .882 | .836
Unigram with Sentiwordnet | .9685 | .9035 | .9349 | .91 | .97 | .94
Unigram with Sentiwordnet including negation words | .973 | .913 | .942 | .916 | .974 | .9445
Table 3
Precision, Recall and F-score of SVM (Kernel = linear) Classifier.
Features | Positive Precision | Positive Recall | Positive F-score | Negative Precision | Negative Recall | Negative F-score
BOW | .836 | .88 | .858 | .878 | .834 | .866
TF-IDF | .852 | .886 | .869 | .883 | .849 | .866
Unigram with Sentiwordnet | .963 | .912 | .937 | .924 | .968 | .945
Unigram with Sentiwordnet including negation words | .939 | .952 | .946 | .95 | .936 | .943
4. Experimental setup
Tweets are retrieved using the Twitter API based on a hashtag.
The hyperlinks, punctuation, special characters, etc. present
in the tweets are removed in the preprocessing steps. After
preprocessing, feature vectors are formed using BOW, TF-IDF,
Unigram with Sentiwordnet, and Unigram with Sentiwordnet
including negation words. The dataset contains 3184 tweets,
which are split into training and test datasets in the ratio
70:30. Different machine learning classifiers like NB, SVM,
and RF are used for creating a trained model. After the model
creation, the accuracy on the test dataset is measured for
these classifiers. For the NB classifier, a multinomial NB
classifier is used for classifying the tweets as positive and
negative. SVM uses both linear and RBF kernel functions for
model creation. The fine-tuning of the hyperparameters is
done using the GridSearchCV function in Python. The SVM
hyperparameters C and gamma take different combinations of
values, and the search gives the best parameters after
tuning. C takes values such as 0.1, 1, 10, 100 and 1000, and
gamma takes values such as 0.1, 0.01, 0.001 and 0.0001. The
SVM gives the value of C as 1 and gamma as 0.1 after the
fine-tuning. For RF, the hyperparameters n_estimators and
max_depth take different combinations of values such as 100,
300, 500, 700, 900 and 10, 30, 50, 70, 90, respectively.
After the fine-tuning of hyperparameters, n_estimators and
max_depth are selected as 500 and 70, respectively, for the
BOW and TF-IDF features.
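The split and hyperparameter search described in this section can be sketched with scikit-learn as follows. This is an illustration under assumptions, not the authors' code: the synthetic data from make_classification stands in for the tweet feature vectors, and the random seed, stratified split, 5-fold setting and accuracy scoring are choices made here. Only the 70:30 split, the two kernels, and the candidate values for C, gamma, n_estimators and max_depth are taken from the text.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for the tweet feature matrix (assumption, not the real data).
X, y = make_classification(n_samples=200, n_features=20, random_state=0)

# 70:30 train/test split, as in the experimental setup.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=0, stratify=y
)

# Candidate values for C and gamma quoted in the text.
svm_grid = {
    "C": [0.1, 1, 10, 100, 1000],
    "gamma": [0.1, 0.01, 0.001, 0.0001],
}

# RF candidates quoted in the text. They can be passed to GridSearchCV with a
# RandomForestClassifier in the same way; not fitted here to keep the sketch fast.
rf_grid = {
    "n_estimators": [100, 300, 500, 700, 900],
    "max_depth": [10, 30, 50, 70, 90],
}

# One search per kernel; gamma has no effect on the linear kernel.
results = {}
for kernel in ("linear", "rbf"):
    search = GridSearchCV(SVC(kernel=kernel), svm_grid, cv=5, scoring="accuracy")
    search.fit(X_train, y_train)
    results[kernel] = search
    print(kernel, search.best_params_, search.score(X_test, y_test))
```

GridSearchCV refits the best setting on the whole training split by default, so search.score on the held-out 30% gives a test accuracy comparable in spirit to the test data accuracy reported in the paper.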
Table 4
Precision, Recall and F-score of SVM (Kernel = rbf) Classifier.
Features | Positive Precision | Positive Recall | Positive F-score | Negative Precision | Negative Recall | Negative F-score
BOW | .89 | .88 | .886 | .868 | .878 | .873
TF-IDF | .90 | .856 | .877 | .846 | .893 | .869
Unigram with Sentiwordnet | .971 | .91 | .938 | .91 | .972 | .94
Unigram with Sentiwordnet including negation words | .984 | .911 | .947 | .915 | .985 | .949
Table 5
Precision, Recall and F-score of RF classifier.
Features | Positive Precision | Positive Recall | Positive F-score | Negative Precision | Negative Recall | Negative F-score
BOW | .91 | .885 | .90 | .87 | .898 | .886
TF-IDF | .914 | .88 | .898 | .873 | .91 | .89
Unigram with Sentiwordnet | .964 | .935 | .949 | .933 | .963 | .948
Unigram with Sentiwordnet including negation words | .983 | .93 | .956 | .93 | .983 | .956
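The F-score is conventionally the harmonic mean of precision and recall, F = 2PR / (P + R). That definition is assumed here, since this text does not spell it out. As a quick check, the sketch below recomputes the RF values for Unigram with Sentiwordnet including negation words from the precision and recall given in Table 5.

```python
def f_score(precision, recall):
    """Harmonic mean of precision and recall (F1)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# RF, Unigram with Sentiwordnet including negation words (Table 5).
positive_f = f_score(0.983, 0.93)
negative_f = f_score(0.93, 0.983)
print(round(positive_f, 3), round(negative_f, 3))  # 0.956 0.956
```

Both values round to .956, matching the F-scores listed for that row.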
Table 6
Validation accuracy of NB, SVM with Linear, SVM with RBF, RF.
Classifiers | BOW | TF-IDF | Unigram with Sentiwordnet | Unigram with Sentiwordnet including negation words
NB | .714 | .714 | .944 | .944
SVM with Linear Kernel | .68 | .70 | .922 | .938
SVM with RBF Kernel | .696 | .668 | .935 | .935
RF | .72 | .71 | .944 | .944
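The validation accuracy in Table 6 comes from fivefold cross-validation. The mechanics can be sketched in plain Python as below. The contiguous, unshuffled folds, the toy data and the rule-based stand-in classifier are all assumptions made for illustration; the paper's NB, SVM and RF models would take the place of the fit and predict functions.

```python
def k_fold_indices(n_samples, k=5):
    """Split range(n_samples) into k contiguous folds of near-equal size."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def cross_validate(fit, predict, X, y, k=5):
    """Return the mean validation accuracy over k folds."""
    scores = []
    for fold in k_fold_indices(len(X), k):
        held_out = set(fold)
        train_idx = [i for i in range(len(X)) if i not in held_out]
        model = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        correct = sum(predict(model, X[i]) == y[i] for i in fold)
        scores.append(correct / len(fold))
    return sum(scores) / k


# Stand-in classifier (assumption): no training, label 1 if a tweet has more
# positive-word counts than negative-word counts, else 0.
def fit(X_train, y_train):
    return None


def predict(model, x):
    positive_count, negative_count = x
    return 1 if positive_count > negative_count else 0


# Toy (positive_count, negative_count) features; sample 4 is deliberately mislabeled.
X = [(2, 0), (0, 1), (1, 0), (0, 2), (3, 1), (1, 2), (1, 0), (0, 1), (2, 1), (0, 3)]
y = [1, 0, 1, 0, 0, 0, 1, 0, 1, 0]
validation_accuracy = cross_validate(fit, predict, X, y, k=5)
print(validation_accuracy)
```

Each sample is held out exactly once, so the reported figure averages five per-fold accuracies rather than scoring the model on data it was trained on.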
Table 7
Test data accuracy of NB, SVM with Linear, SVM with RBF, RF.
Classifiers | BOW | TF-IDF | Unigram with Sentiwordnet | Unigram with Sentiwordnet including negation words
NB | .851 | .838 | .937 | .944
SVM with Linear Kernel | .857 | .867 | .941 | .945
SVM with RBF Kernel | .88 | .87 | .939 | .948
RF | .891 | .894 | .948 | .956
For the Unigram with Sentiwordnet and Unigram with
Sentiwordnet including negation words features, the RF
hyperparameter search selects n_estimators as 100 and
max_depth as 70.

5. Results and discussion
Precision, Recall, F-score, and Accuracy [21] have been
measured for NB, SVM, and RF classifiers considering four
different features. Feature matrices with BOW and TF-IDF are
created by considering all the unique words in the corpus. In
sentiment analysis, some words are insignificant while
predicting the sentiment as positive or negative. The size of
the feature matrix with BOW and TF-IDF is greater compared
with the other two features. All three classifiers with
Unigram with Sentiwordnet and Unigram with Sentiwordnet
including negation words show better accuracy because
sentiment-oriented words are significant while predicting the
sentiment of sentences. The precision, recall, and F-score
obtained by considering all the above-mentioned features
using four different classifiers, NB, SVM (Kernel = Linear),
SVM (Kernel = RBF) and RF, are shown in Table 2, Table 3,
Table 4 and Table 5, respectively. Fivefold cross-validation
is done in this work. The validation accuracy of NB, SVM with
linear kernel, SVM with RBF kernel, and RF is shown in
Table 6. Table 7 depicts the test data accuracy of the four
different classifiers modeled. Fig. 4 represents the
comparative study of various Machine Learning models with
different features. The features SW and SW+Negation represent
Sentiwordnet and Sentiwordnet including Negation words,
respectively.

6. Conclusion
SA of Malayalam tweets using NB, SVM, and RF is proposed in
this work.
Four different features, BOW, TF-IDF, Unigram with
Sentiwordnet, and Unigram with Sentiwordnet including
negation words, are considered for feature vector formation
of the input dataset. All the classifiers with the last two
features have shown better accuracy compared with the other
features. The RF classifier with Unigram with Sentiwordnet
including negation words got the highest accuracy, 95.6%.

Declaration of competing interest
The authors declare that they have no known competing
financial interests or personal relationships that could have
appeared to influence the work reported in this paper.

CRediT authorship contribution statement
Soumya S.: Writing - original draft. Pramod K.V.: Supervision.

References
[1] Bo Pang, Lillian Lee, Shivakumar Vaithyanathan, Thumbs up?: sentiment classification using machine learning techniques, in: Proceedings of the ACL-02 Conference on Empirical Methods in Natural Language Processing-Volume 10, Association for Computational Linguistics, 2002.
[2] Peter D. Turney, Thumbs up or thumbs down?: semantic orientation applied to unsupervised classification of reviews, in: Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, Association for Computational Linguistics, 2002.
[3] Erik Cambria, Affective computing and sentiment analysis, IEEE Intell. Syst. 31 (2) (2016) 102–107.
[4] Yequan Wang, et al., Sentiment analysis by capsules, in: Proceedings of the 2018 World Wide Web Conference, 2018.
[5] Ning Liu, et al., Attention-based sentiment reasoner for aspect-based sentiment analysis, Hum.-Cent. Comput. Inf. Sci. 9 (1) (2019) 35.
[6] Deepu S. Nair, et al., Sentiment analysis of malayalam film review using machine learning techniques, in: 2015 International Conference on Advances in Computing, Communications and Informatics, ICACCI, IEEE, 2015.
[7] S. Soumya, K.V. Pramod, Sentiment analysis of malayalam tweets using different deep neural network models-case study, in: 2019 9th International Conference on Advances in Computing and Communication, ICACC, IEEE, 2019.
[8] S. Sachin Kumar, M. Anand Kumar, K.P. Soman, Sentiment analysis of tweets in malayalam using long short-term memory units and convolutional neural nets, in: International Conference on Mining Intelligence and Knowledge Exploration, Springer, Cham, 2017.
[9] M. Rahul, R.R. Rajeev, S. Shine, Social Media Sentiment Analysis for Malayalam, 2018.
[10] V. Kasthoori, B. Soniya, V. Jayan, Domain-independent sentiment analysis in malayalam, in: Computational Intelligence: Theories, Applications and Future Directions-Volume II, Springer, Singapore, 2019, pp. 151–160.
[11] M.P. Ashna, Ancy K. Sunny, Lexicon based sentiment analysis system for malayalam language, in: 2017 International Conference on Computing Methodologies and Communication, ICCMC, IEEE, 2017.
[12] P.K. Thulasi, K. Usha, Aspect polarity recognition of movie and product reviews in Malayalam, in: 2016 International Conference on Next Generation Intelligent Systems, ICNGIS, IEEE, 2016.
[13] M. Anagha, et al., Fuzzy logic based hybrid approach for sentiment analysis of malayalam movie reviews, in: 2015 IEEE International Conference on Signal Processing, Informatics, Communication and Energy Systems, SPICES, IEEE, 2015.
[14] P. Jayan, Deepu S. Nair, S. Elizabeth Jisha, A subjective feature extraction for sentiment analysis in Malayalam language, Int. J. Eng. Sci. 14 (2015) 1–4.
[15] Deepu S. Nair, Jisha P. Jayan, Elizabeth Sherly, Sentima-sentiment extraction for malayalam, in: 2014 International Conference on Advances in Computing, Communications and Informatics, ICACCI, IEEE, 2014.
[16] Neethu Mohandas, Janardhanan P.S. Nair, V. Govindaru, Domain specific sentence level mood extraction from malayalam text, in: 2012 International Conference on Advances in Computing and Communications, IEEE, 2012.
[17] Daniel Jurafsky, James H. Martin, Classification: naive Bayes, logistic regression, sentiment, Speech Lang. Process. (2015).
[18] Corinna Cortes, Vladimir Vapnik, Support vector networks, Mach. Learn. 20 (1995) 273–297, Kluwer Academic Publishers, Boston.
[19] Bernhard E. Boser, Isabelle M. Guyon, Vladimir N. Vapnik, A training algorithm for optimal margin classifiers, in: Proceedings of the Fifth Annual Workshop on Computational Learning Theory, ACM, 1992.
[20] Tin Kam Ho, Random decision forests, in: Proceedings of 3rd International Conference on Document Analysis and Recognition, vol. 1, IEEE, 1995.
[21] Louise T. Su, Evaluation measures for interactive information retrieval, Inf. Process. Manage. 28 (4) (1992) 503–516.