Professional Documents
Culture Documents
Rashedul Amin Tuhin, Bechitra Kumar Paul, Faria Nawrine, Mahbuba Akter, Amit Kumar Das
Department of Computer Science and Engineering
East-West University
Dhaka, Bangladesh
e-mail: mcctuhin@ewubd.edu, bechitra@outlook.com, nawrinefaria@gmail.com, mahbubaakter74@gmail.com,
amit.csedu@gmail.com
Abstract— Sentiment analysis has become a leading context for exists a few research works for Bangla on sentiment analysis
scientific and commercial market research in the field of due to its lack of resources and its complexity. Therefore this
machine learning. Currently, it’s a more prominent research research paper has been done to detect sentiment from
field of Bangla language processing system as there are few Bangla text document.
research works regarding sentiment analysis for this language. The principal obligation of this sentiment analysis
In essence, sentiment analysis is an automated process of text process is to classifying the polarity of a given text whether
mining to determine the emotion from a given text. By using the expressed emotion of a text document on Bangla is
sentiment analysis, a given text can be categorized into several happy/sad/angry/tender/excited/scared concerning training
emotions. This paper deals with six individual emotion classes-
data. We use these six basic emotions as our emotion class
happy, sad, tender, excited, angry and scared. Here, we
proposed two methods of machine learning techniques- Naïve
because each of this emotion class can represent many
Bayes Classification Algorithm and Topical approach to emotions. For example joy, smile, optimistic, laugh, pleased
extract the emotion from any Bangla text. Proposed methods these five emotions can be merged by using happy class and
have been applied for both article and sentence level of scope. so on.
A comparative analysis of the performance between these two Sentiment analysis can be applied at different level of
methods has been done, and the topical approach achieved the scope. In this research paper sentence and document level of
best performance for both levels of magnitude. magnitude for determining emotions has been implemented.
This research work extracts emotion by using Automatic
Keywords- Sentiment analysis, Supervised learning. systems which are opposite of the rule-based systems and
don’t confide in the manually crafted rules. This system
I. INTRODUCTION relies on machine learning techniques to learn from data.
Sentiment Analysis which is also known as opinion In this paper, two approaches are proposed to extract the
mining has become a topic of much interest and development emotion from Bangla text. Those approaches are- Naïve
for research area as it has many practical and empirical Bayes classification algorithm and Topical approach. We
applications. It explores people's sentiments, opinions, used manually created data corpus with 7,500+ sentences as
behavior, attitudes, and emotions towards entities such as learning materials. We operate a comparative study between
individuals, organizations, products, services, issues and these two methods by experimenting with a large number of
their attributes from written or spoken language [1]. Since test dataset with a different combination of texts and the
publicly and privately accessible information over the topical approach achieved the best performance.
internet is always thriving, a massive number of texts The rest of the part of this paper is organized by the
revealing opinions are available in different blogs and other following sections. Section 2 provides a short description of
articles in online journal, social media, and product review some previous works related to our topic. Section 3 gives a
sites. Besides this, with the current progress of machine simple view of the overall working procedure and provides
learning, the competency of algorithms to analyze the text the information regarding dataset preparation. In Section 4,
has progressed numerously. Because of that, the conduct of we briefly describe the methodology for both proposed
sentiment analysis is developing day by day in the field of methods. In Section 5, to compare the result different
product analysis, social media monitoring, and market experiments have been done between these two proposed
analysis and so on. methods. In Section 6, we conclude the paper and provide
Bangla is the fourth most popular language in the world some directions for future works.
and it has approximately 250 million native speakers all over
the world. Numerous research works have been done on II. RELATED WORKS
sentiment analysis for English and other languages such as As far as the author know none of the research work on
Chinese, Hindi, Urdu, and Arabic ([2]- [3]) while sentiment sentiment Analysis from Bangla text has been done to
analysis for Bangla is still at a constructive stage. There determine the emotion for six individual classes by using
361
Excited আিম আজ অেনক আিম ভােলা ফলাফল করায় আিম অেনক সুর িছেলা" and consider ‘িছেলা' this word is missing in the
আনিত সবাই আনিত আনে আিছ
training dataset. Firstly, we will calculate the probability of
Sad other words on this sentence by using equation (2). Then we
আিম সু েখ নই আিম ভােলা নই আিম কে আিছ
will calculate the mean value among those words probability.
Angry আিম তার সােথ স ভােলা কাজ কেরিন আিম তার সােথ Let consider the probability of these three words are-
রাগ কেরিছ রেগ আিছ P(আজেকর) = 0.3, P(িদনটা) = 0.35, P(সুর) = 0.07. Now
Scared আিম তােক ভয় স ভােলা না তাই তােক একা আিছ তাই applying equation (5) and (6) we will calculate the probability
পাই ভয় হয় ভয় হে! of a missing word, Where TW denotes total word in the
sentence and MW denotes the number of missing words.
IV. THE METHODOLOGY OF PROPOSED APPROACHES σ ሺ௧௪ௗ௦ሻ
݁ݑ݈ܽݒ݊ܽ݁ܯൌ (5)
ேሺ௧௪ௗ௦ሻ
A. Naïve Bayes Classification Algorithm P (missing word) = Mean value*(TW-MW) (6)
Naïve Bayes classification is among the most successful ǤଷାǤଷହାǤ
P(িছেলা) = *(4-1)=0.72
known algorithms for learning to classify text documents. In ଷ
this approach, based on probability it provides an emotional 2. Emotion Detection in Article/Document level
score for each word wi correspond to each emotional class cj. In article level of scope firstly we differentiate all the
By using this emotional score, it tries to detect emotion for sentences in the article. Then we applied sentenced level
query sentence. For better understanding, all the steps of this emotion detection technique (explained above equation 1-3)
approach are described here with equation- for each sentence in the article. We get the probability for 6
1. Emotion Detection in Sentence level individual classes for each sentence. We used this probability
Step1: Firstly Probability of each emotion class cj is of each sentence in the article to calculate the emotional
calculated by using equation (1), where NS(cj) denotes the score of the whole article (Ear) for six individual classes by
number of sentences containing ci class and N denotes the applying equation (7). Finally, from the maximum value
total number of sentences in the dataset. among the 6 emotion classes, we detect emotion (Ea) for the
ୗሺୡ୨ሻ
article
ሺ
ሻ ൌ (1) ܧ ሺ݆ܥሻ ൌ ς ܳܧ൫ܿ ൯ ܲ כ൫ܿ ൯ (7)
Step2: The probability of each word wi in the dataset ݊݅ݐ݉ܧሺܧ ሻ ൌ ሾܧ ൫
୨ ൯ሿ (8)
corresponding to each emotion class cj is calculated by using B. Topical Approach
equation (2). Then for each word wi we get probability In this study, we use the generic topical approach to
value for six individual class cj. extract the emotion from an article. In this approach the
ேௌሺ௪ ǡೕ ሻ concept of tf, idf is used to calculate the emotional value of
ܲ൫ݓ ǡ ܿ ൯ ൌ (2) each word in the sentence. Term Frequency-Inverse
ேௌሺೕ ሻ
document frequency in short tf- idf is a numerical statistic
Here, NS (wi,cj) means number of sentences that contains wi that is used to find out the importance of a word in a
word for cj emotion class. document or a corpus. To detect emotion from a text article
Step3: To detect emotion of a query sentence firstly we try by using this approach firstly, for each sentence in the
to find out whether each word of this sentence is available in dataset the idf value of each word is calculated by using the
the training dataset or not. If those words exist in the training equation (9), where wi denotes current word, NS is the total
data, then we determine their probability correspond to six number of sentences in the dataset and NSwi denotes the
individual classes by using equation (2). Then by using the number of sentences containing this word wi.
ேௌ
probability of each word, we calculated the emotional score ݂݅݀ሺݓ ሻ ൌ ሺ ሻ (9)
ேௌೢ
of this query sentence for each emotion class by applying Then the probability of each word wi in the dataset
equation (3). Finally, we detect emotion (ed) for the query corresponding to each emotion class cj is calculated by using
sentence by using the highest value from six individual equation (10) ேௌሺ௪ ǡೕ ሻ
emotion classes. To detect this we use equation (4). ܲ൫ݓ ǡ ܿ ൯ ൌ (10)
ேௌ
ܳܧ൫ܿ ൯ ൌ ς ሺ ݓ ǡ ܿ ሻ (3)
Where NS(wi,cj) means the number of sentences that
ሺୢ ሻ ൌ ሾ൫
୨ ൯ሿ (4) contain wi word for cj emotion class. Then, the idf value of
But the problem will arise if any word from the query each word is distributed to the six emotion classes by
sentence is not available in the dataset. Because the observing the probability of each sentence that contains word
probability for this word will be 0 and when we will multiply cj. This is done by using equation (11)
this with other words probability the product term will be 0. ൫ݓ ǡ ܿ ൯ ൌ ܲ൫ݓ ǡ ܿ ൯ ݂݀݅ כሺݓ ሻ (11)
It will affect the accuracy of the result. To solve this problem, To detect emotion from a query sentence firstly, we need
Binning technique is implemented here. to find out whether all words in the sentence are available in
Binning Technique For better performance, we use the the dataset or not just like Naïve Bayes. If available then, the
binning technique to provide a value for missing words in the emotional value for each word corresponds to six emotion
dataset. We supposed a query sentence "আজেকর িদনটা class will be calculated by using equation (11). After that, for
362
each emotion class, the emotional value of the query Topical Approach is again much higher than the Naïve
sentence will be calculated by using equation (12). And Bayes Classification Algorithm.
finally, the emotion of the query sentence Emotion (QS) will 100%
be detected by observing the maximum value among the
Accuracy
emotion classes by using equation (13). 50% Sentence
ܳܧ൫ܿ ൯ ൌ σ ݊݅ݐ݉ܧ൫ݓ ǡ ܿ ൯ (12)
݊݅ݐ݉ܧሺܳܵሻ ൌ ݉ݑ݉݅ݔܽܯሾܳܧ൫ܿ ൯ሿ (13) 0% Article
To detect emotion for an article firstly for each sentence Naïve Topical
in the article we will calculate the emotion value for six Bayes approach
emotion classes. By using this value, the emotion value for
Figure 3. Performance of Naive Bayes and Topical Approach for both
an entire article will be calculated to six individual categories article and sentence level of scope (for 2 emotion class)
by using equation (14). After that, the emotion of the given
article (QR) will be detected by using equation (15). Another critical factor is that the most frequent words in
݊݅ݐ݉ܧሺܧ ǡ ܿ ሻ ൌ σ ܳܧ൫ܿ ൯ (14) the datasets and their higher probability for any particular
݊݅ݐ݉ܧሺܴܳሻ ൌ ݉ݑ݉݅ݔܽܯሾ݊݅ݐ݉ܧሺܧ ǡ ܿ ሻሿ (15) emotion class affect the performance (Figure 4).
0.8
V. EXPERIMENTAL RESULTS AND DISCUSSION
Happy
In this paper, to evaluate the performance of the proposed 0.6
Probability
Sad
methods four different experiments were performed. In our 0.4
first experiment, we have determined the performance for Tender
sentence level of scope by using both proposed methods for 0.2 Excited
six individual classes. 100 Bangla sentences were used as our
query sentence. Then we calculated the accuracy for both 0 Angry
সব
সবাই
রাগ
কথা
আিম
অেনক
বিশ
ভয়
মানুষ
তা
স
না
ভাল
তাই
আনা
proposed methods. The same thing has been done to measure Scared
the performance in article level of scope. One hundred
articles from different online news portal have been used as
our test data. After that, a comparative analysis has been Figure 4. Most Frequent Words in the dataset and their probability for each
done between these two methods and a comparison graph is emotion class
plotted by showing the accuracy for both proposed methods
(Figure 2). Among these two methods, the accuracy of In our third experiment, we try to figure out whether the
topical approach is much higher than Naïve Bayes number of emotion classes affects the performance or not
classification algorithm for both sentence and article level of and it is explicit that the number of emotion classes is
scope. inversely proportional to the accuracy level. For two emotion
classes, the accuracy level for both methods is much higher
100%
than six emotion classes. (Figure 5).
Accuracy
0% Article 50%
Naïve Bayes Topical 2 classes
Approach 0%
Naïve Bayes Topical
Figure 2. Performance of Naive Bayes and Topical Approach for both Approach
article and sentence level of scope (for 6 emotion class)
Figure 5. Comparison between these two approaches in respect to the
number of classes
In our second experiment, we mapped these six emotion
classes into two higher emotion classes-Positive and
From this three experiment, it is seen that the Bayesian
Negative (Table 3).
Classifier gives less accuracy in the Article level, whereas
TABLE III. MAPPING INTO HIGHER EMOTION CLASS the topical model still works better. The reason for this less
accuracy for the Naive Bayes is pretty straightforward. The
Higher Emotion class Basic Emotion Class
main reason is the multiplication operation in between the
Positive Happy, Excited, Tender items of an item set. However, it is not the case for the
Negative Sad, Angry, scared Topical approach as it takes the decision operation by a
summation operation and comparison.
After that, the same procedure of the first experiment has In our final experiment, we have compared our works
been followed to evaluate the performance of proposed with two other papers regarding sentiment analysis from
approaches for two higher emotion classes. We get another Bangla text. We select these two papers based on their
comparison graph between these proposed methods (Figure working methods and performance level. From Table 4, it is
3). From Figure 3, it is clear that the accuracy level of evident that none of the paper used more than three emotion
363
classes to detect emotion. Beside this, it is visible that our complicated when it needs to classify several classes with
training dataset is much more vibrant than the other research noisy data. On the other hand, for Topical Approach, it is
papers even this dataset can be used for future work. Here, proven that it can work with several classes with reasonable
another important factor is that in paper one the authors used accuracy. Therefore, considering the overall situation, it can
SVM and they achieved higher accuracy for that. But we be said that in some respect the performance of this research
didn’t use SVM. The main reason is that SVM is much paper is much better compared to others for Bangla text.
efficient for separating two classes only, but it becomes more
TABLE IV. COMPARATIVE ANALYSIS WITH OTHER PAPERS REGARDING SENTIMENT ANALYSIS FOR BANGLA TEXT
1. Naive Bayes 6 emotion classes Manually created data corpus with Highest accuracy
Our Paper: Sentiment
Classification (Happy, Sad, 75000 sentences. 7400 sentences acquired by Topical
Analysis from Bangla Text
Algorithm. Tender, Angry, Excited, used as training dataset and 100 Approach (above
Using ML Techniques
2. Topical Approach Scared) sentences used as test data 90%)
Paper 1: 1300 Bangla Tweet used as learning
Performing a sentiment 1. SVM 2 emotion class (Positive or material. 1000 sentences used as Best accuracy attained
Analysis in Bangla 2.Maximum Entropy Negative) training data and 300 used as text by SVM (93%)
Microblog Posts. [8] data
Paper 2: Detecting
1. Used the concept of 1500 short Bangla comment used as
Sentiment from Bangla Text 3 Emotion Class
term frequency and learning material, 1400 used as
using Machine Learning (Positive , Negative or Accuracy was 83%
inverted document training data and 100 used as test
Technique and Feature Neutral )
frequency (Tf-Idf) data
Analysis [5]
364