
See discussions, stats, and author profiles for this publication at:
https://www.researchgate.net/publication/354321125

HOW CAN WE ANALYSE EMOTIONS ON TWITTER DURING AN EPIDEMIC
SITUATION? A FEATURES ENGINEERING APPROACH TO EVALUATE PEOPLE'S
EMOTIONS DURING THE COVID-19 PANDEMIC

Article in Journal of Tianjin University Science and Technology · September 2021


DOI: 10.17605/OSF.IO/U9H52


4 authors: Oumaima Stitini, Ali Twil, Soulaimane Kaloun and Omar Bencharef
(all Cadi Ayyad University)

All content following this page was uploaded by Oumaima Stitini on 02 September 2021.



Tianjin Daxue Xuebao (Ziran Kexue yu Gongcheng Jishu Ban)/
Journal of Tianjin University Science and Technology
ISSN (Online): 0493-2137
E-Publication: Online Open Access
Vol:54 Issue:08:2021
DOI 10.17605/OSF.IO/U9H52

HOW CAN WE ANALYSE EMOTIONS ON TWITTER DURING AN
EPIDEMIC SITUATION? A FEATURES ENGINEERING APPROACH TO
EVALUATE PEOPLE’S EMOTIONS DURING THE COVID-19 PANDEMIC.

Oumaima Stitini1, Ali Twil1, Soulaimane Kaloun3 and Omar Bencharef

L2IS FSTG, Cadi Ayyad University, Marrakesh, Morocco
{oumaima.stitini, ali.twil}@ced.uca.ma, {so.kaloun, o.bencharef}@uca.ma

Abstract: The Coronavirus (COVID-19) pandemic has changed the way we live, and
the way we communicate and interact with others has changed for good. Interpreting
the COVID-19 awareness crisis and assessing the public feelings expressed on social
media during the pandemic have become critical tasks. In this paper, we use
Coronavirus-related tweets to classify public emotions associated with the pandemic;
examining the feelings expressed in these tweets gives an idea of how a person feels
about the pandemic. We give a methodological overview of two approaches: the first
applies machine learning with traditional feature extraction and reaches a low
classification accuracy of 64%; the second uses feature engineering to boost
accuracy. Detecting emotion during an epidemic situation is an emerging research
area that is generating interest but presents particular challenges due to the limited
amount of resources available. We propose an emotion detection model that uses
machine learning algorithms, and feature engineering in particular, to classify the
content of a tweet as joy, fear, anger or sadness. Our system's primary goal is to
figure out how the pandemic has changed people's actions and to interpret the
emotions expressed through Twitter since the beginning of the pandemic.

Keywords: Emotion detection, Sentiment Analysis, COVID-19, Natural Language
Processing, Pandemic.

1 INTRODUCTION
Under the current health circumstances, with a new infection that we have lived
with for almost a year and for which no cure or highly effective antiviral treatment is
available, social distancing is among the strategies that have been adopted to
contain the outbreak. The COVID-19 pandemic has been a source of extreme danger
and confusion, amplified by the information shared on global social networks.
Isolation can help to prevent the virus itself, but not the fear of the virus, so in the
field of natural language processing [31], detecting emotions and understanding feelings


can now be considered one of the most common research topics in the world. Since
the world has been battling COVID-19 for the past few months and most people are
stranded, social media outlets such as Twitter remain a major source of important
data. Twitter is one of the sites where millions of people share their feelings about
various issues. The outbreak of the COVID-19 pandemic has changed our human
relations and our behaviours, and it has driven a surge in social network usage
during lockdown.
One of the most active areas of research in natural language processing and
machine learning is sentiment analysis. It consists of building automated tools
capable of extracting subjective information from natural-language texts in order to
produce structured and actionable knowledge. A key difficulty in sentiment analysis
is the lack of expertise on the emotions correlated with human behaviour. People
communicate their feelings through speech or written messages, and it is important
to know the exact feeling behind a topic rather than a generic feeling.
In this study, we conduct emotion detection on a dataset sampled from Twitter and
obtained from [25]. The paper is organized as follows: Sections 2 and 3 review the
background, the state of the art and related work. Section 4 describes the
methodology used in this study and explains the collection of data, the
pre-processing, and the steps for creating a clean dataset of emotions. Section 5
presents and discusses the results. Finally, we conclude the paper and outline the
future direction of our research.

2 BACKGROUND AND STATE OF THE ART

Sentiment Analysis (SA), also known as Opinion Extraction (OE), is a text analysis
technique that identifies polarity in a text (for example, a positive or negative
opinion); it is the computational study of people's opinions, attitudes and emotions
towards a topic. The terms SA and OE are used interchangeably. Considering the
emotional responses associated with the COVID-19 pandemic is a necessary step
towards understanding people's thoughts and emotions.

2.1 Types of Sentiment Analysis

Opinion analysis has been studied in previous work and has recently drawn renewed
attention from researchers during this pandemic era. We distinguish four types of
sentiment analysis:

▪ Fine-grained sentiment analysis:

It involves determining the polarity of an opinion, classically as positive or negative.
This type can also use a finer scale (for example, very positive, positive, neutral,
negative, and very negative). Zirn et al. [30] presented an accurate annotation
system for detecting emotions in non-traditional textual genres and collected a
corpus of blog posts on three topics in three

languages (Spanish, English and Italian). They use the Markov process to
incorporate the polarity scores of different sentiment lexicons via information about
the relationships between neighboring segments. Their results showed that the
feature engineering structure improved prediction accuracy, reaching accuracy
scores of up to 69%. Corpus building was addressed by Robaldo and Di Caro [33].
They proposed a formalism called Opinion Mining-ML, a new XML-based formalism
for tagging textual expressions conveying opinions on objects that are considered
relevant in the state of affairs. It is a new standard alongside Emotion-ML and
WordNet.
The challenge of fine-grained sentiment analysis is that shorter text segments pose a
more difficult classification problem. There are various approaches to determining
the polarity of text. A common one is to look up terms in a sentiment lexicon to
obtain their polarity scores. The lockdown brought about by COVID-19 has become
synonymous with teleworking for many people, and working remotely has created a
sense of isolation and worry. The goal of this research is therefore to identify the
emotion each person expresses in a tweet, which means that we do not need to
classify a feeling as positive or negative.
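
The lexicon lookup mentioned above can be illustrated with a minimal sketch. The lexicon entries, their scores, the `polarity` and `label` helpers and the 0.1 threshold are all hypothetical values chosen for illustration; they are not taken from any lexicon used in this paper.

```python
# Toy sentiment lexicon: word -> polarity score in [-1, 1].
# The words and scores are illustrative assumptions only.
LEXICON = {
    "good": 0.7, "happy": 0.9, "safe": 0.5,
    "bad": -0.7, "afraid": -0.8, "death": -0.9,
}


def polarity(text):
    """Average the lexicon scores of the known words; 0.0 if none match."""
    tokens = text.lower().split()
    scores = [LEXICON[t] for t in tokens if t in LEXICON]
    return sum(scores) / len(scores) if scores else 0.0


def label(text, threshold=0.1):
    """Map the averaged score to a coarse polarity label."""
    score = polarity(text)
    if score > threshold:
        return "positive"
    if score < -threshold:
        return "negative"
    return "neutral"


print(label("I feel happy and safe"))
print(label("afraid of death"))
```

Such a lookup only yields a positive/negative polarity, which is why the rest of the paper targets specific emotions instead.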

▪ Emotion Detection:

Emotion detection is a natural language processing task that aims to identify,
extract and analyse the specific emotional states expressed in textual data; these
emotions may be explicit or implicit in the sentences.
Plutchik [20] argued that there are eight basic and prototypical emotions: joy,
sadness, anger, fear, trust, disgust, surprise, and anticipation. Emotion detection
(ED) can be viewed as a sentiment analysis task: SA is primarily concerned with
classifying opinions as positive or negative, whereas ED is concerned with detecting
various emotions in a text. For example, a positive opinion may express joy or
surprise, while a negative opinion may express sadness, anger, fear or anguish. ED
can be implemented using ML algorithms or a lexicon-based approach. Sentence-level
ED was proposed by Lu and Lin [18]. They suggested a web-based approach to textual
data to identify the emotion of a particular event described in English sentences.
Their method was based on the probability of reciprocal actions between the aspect
and the entity of the event. They combined web-based text mining and semantic role
labelling techniques with a range of reference entity pairs and hand-crafted emotion
generation rules to build an event emotion detection method.
An approach combining ML and lexicon-based methods was presented by Balahur et
al. [23]. They proposed a system based on commonsense information contained in the
knowledge base of the Emotion Corpus (EmotiNet). They argued that sentiments are
not necessarily conveyed by terms with an affective sense (e.g. positive), but by
descriptions of real-life circumstances that the reader perceives as being linked to
a certain emotion. They used SVM and SVM-SO algorithms to


accomplish their goal. They showed that the EmotiNet-based approach is the
most effective for detecting emotions in situations where there are no affect-related
terms.

▪ Aspect-based sentiment analysis:


Aspect-based sentiment analysis (ABSA) has recently gained growing interest due to
its wide variety of applications. ABSA attempts to recognize the sentiment polarity
expressed towards a particular aspect of a sentence, for instance an opinion on a
particular aspect of a product. In current ABSA datasets, most sentences contain
either one aspect or several aspects with the same sentiment polarity, which allows
the ABSA task to degenerate into sentence-level sentiment analysis. Qingnan Jiang
et al. presented a challenge dataset for aspect-based sentiment analysis in which
each sentence contains multiple aspects with different sentiment polarities [2]. Their
Multi-Aspect Multi-Sentiment (MAMS) dataset prevents aspect-level sentiment
classification from degenerating into sentence-level sentiment classification, which
may push forward research on aspect-based sentiment analysis. Chi Sun et al. built
an auxiliary sentence from the aspect and transformed ABSA into a sentence-pair
classification task, similar to question answering (QA) and natural language
inference (NLI) [7]. Mohammad Erfan Mowlaei et al. proposed lexicon generation
methods that improve on their previous work in designing dynamic aspect-based
lexicon generation methods; dynamic lexicons have two main advantages over their
static counterparts [4].
▪ Intent analysis:

Vraj Desai implemented an Artificial Neural Network (ANN) architecture for
classifying text-based queries, which generates a response that redirects to a
website, and concluded that the method performs slightly better than previous
approaches [13]. The pre-processing gives this model its advantage, as it mainly
searches for words of interest and treats similar words as a single word.
Dirk described a novel approach to identify the intention behind a search solely on
the basis of the entered search query, and mainly on the basis of the frequency of
the words used in the search phrase. It is developed as a web service so that it can
easily be integrated into modern information systems that use graphical
visualizations in particular and are therefore designed to cover a variety of search
tasks, such as targeted, exploratory and analysis searches [14].

3 RELATED WORK
In recent social media research, emotion recognition has become an emerging
subject, and multiple solutions have been suggested for the overall problem of
sentiment analysis. Emotion detection can be treated as a text classification
problem. Prior studies have used n-grams as features for emotion detection.


Ali Shariq Imran [27] presented a detection model for sentiment polarity and
emotions during the initial phase of the COVID-19 pandemic and the lockdown period,
using NLP and deep learning on Twitter posts. Their goal was to find the correlation
between the sentiments and emotions of people in neighbouring countries amidst the
coronavirus (COVID-19) outbreak, based on their tweets. Their results showed a high
correlation between the polarity of tweets originating from the USA and Canada, and
from Pakistan and India. Whereas, despite many cultural similarities.

Jim Samuel [28] provided a comparison of textual classification mechanisms used in
artificial intelligence applications and demonstrated their usefulness for Tweets of
varying lengths. They presented methods with the potential to generate valuable
informational and public sentiment insights, which can be used to develop much
needed motivational solutions and strategies to counter the fast spread of the
Coronavirus.

Yixian Zhang [10] designed a COVID-19 public opinion and emotion monitoring
system based on time series thermal new word mining. They made two innovative
attempts. First, given the characteristics of COVID-19 public opinion communication,
they used an improved dictionary construction method based on the SO-PMI
algorithm to adapt to the COVID-19 network public opinion environment, given the
scarcity of transferable models and the inadequate seed corpus. Second, they drew
on the methodologies of several excellent Chinese sentiment classification models
and built a series of discriminatory systems in this framework.

[31] The authors in [29] explored the emotional reaction of individuals during the
epidemic of the Middle East Respiratory Syndrome (MERS) in South Korea. To
evaluate people's responses, they used eight feelings. Their results showed that 80%
of the tweets were neutral, while the disease-related tweets were dominated by
frustration and terror. Indignation grew over time, mainly directed at the Korean
government, whereas reactions of anxiety and sorrow decreased over time. This
finding is understandable, as the government was taking stringent steps to contain
the outbreak and the number of new MERS cases decreased as time went on. Another
significant result was that surprise, disgust and satisfaction remained more or less
constant.

[32] The research focuses on subjective responses through the exploration of tweets
during the COVID-19 outbreak. A random selection of 18,000 tweets was analyzed for
positive and negative feelings, along with eight emotions: rage, anticipation,
disgust, terror, excitement, sorrow, surprise and confidence. The results revealed
that there were about as many positive emotions as negative ones, as most of the
tweets featured both fear and calming phrases.

In this study, our basic idea for building the emotion detection model is to
examine two approaches in order to improve the accuracy of our system:

● Using machine learning algorithms.
● Using feature engineering (our proposed approach).


4 METHODOLOGY
4.1 Study Objective

The emotions expressed during the pandemic on social media must be identified and
understood. Our interest, however, is not to find a general protocol for classification,
but rather to construct an automatic algorithm capable of classifying people's
emotions during the pandemic. In this section, we explain our general approach to
detecting emotions during the COVID-19 pandemic.

Figure 1 (see Appendix) shows a general outline of our proposed model. The first
step is to preprocess the dataset by cleaning the unstructured data and extracting
the word and character n-gram features. As feature extraction methods, we use
TF-IDF and a count vectorizer, and we determine the feature values for the
corresponding terms in all contents of the training set. Training the classifiers is
the next step. We use five machine learning algorithms, namely Logistic Regression,
Decision Tree, Random Forest, SVM and Naive Bayes. These machine learning
algorithms learn from the training data presented and categorize emotions into
anger, sadness, fear and joy.

4.2 Proposed Approach


This section describes the background knowledge used in the proposed technique. It
explains the data processing method and the use of standard feature extraction to
build the model, and then the use of feature engineering to improve the precision of
classification, as shown in Figure 2 (see Appendix).
Our model can be divided into three major steps: the data collection process,
training the model with traditional feature extraction, and our proposed approach,
which boosts accuracy using feature engineering.
▪ Data Collection Process

The dataset contains the columns tweet_id, tweet_text, username, date,
anger_intensity, fear_intensity, sadness_intensity, joy_intensity,
sentiment_category, emotion_category, tweet_location, city and country. The table
below describes each column.


Table 1 Dataset of emotion detection details.

No  Title               Description
1   tweet_id            The id of the tweet published on Twitter.
2   tweet_text          The text of the tweet published on Twitter.
3   username            The username of the account that published the tweet.
4   date                The publication date of the tweet.
5   anger_intensity     The anger intensity value of the tweet.
6   fear_intensity      The fear intensity value of the tweet.
7   sadness_intensity   The sadness intensity value of the tweet.
8   joy_intensity       The joy intensity value of the tweet.
9   sentiment_category  The sentiment category (Positive, Negative, Neutral).
10  emotion_category    The emotion category (Joy, Anger, Fear, Sadness,
                        no specific emotion).
11  tweet_location      The location from which the tweet was published.
12  city                The city from which the tweet was published.
13  country             The country from which the tweet was published.

▪ Training Using Feature Extraction:


Machine learning is one of the most prominent techniques attracting researchers'
interest because of its adaptability and accuracy. Feature extraction reduces the
number of features in a dataset by generating new features from the existing ones
and then discarding the original features. This new reduced set of features should
then be able to summarize most of the information contained in the original set of

features. In this way, a summarized version of the original features can be created
from a combination of the original set.
In the following, we describe the supervised classification algorithms that we used
to classify emotions in texts. The first two machine learners use integer-valued
feature vectors, while the third classifier only considers binary feature values.
Many supervised learning variants of this technique are used in emotion analysis.
The process consists of several phases: data acquisition, pre-processing, training
data preparation, classification and plotting of the results. The training data
consist of a collection of labelled corpora. A set of feature vectors built from
these data is presented to the classifier. A model is then built from the training
set and used to classify new, unseen documents.

▪ Our Proposed Approach


In this section, a feature engineering technique is used to improve the accuracy of
our classifier. Linear machine learning algorithms (Logistic Regression, baseline
Naive Bayes, linear Support Vector Machine) fit a model in which the prediction is
the weighted sum of the input values. Many of these algorithms find a set of
coefficients to be used in the weighted sum to make a prediction, and these
coefficients can be used directly as a rudimentary feature importance ranking. In
our approach, we use feature importance to generate new features that boost our
models' accuracy.
Feature engineering is a core activity in data preparation for machine learning. It
is the process of using domain knowledge to design suitable features from the raw
attributes so that they improve predictive performance. It creates new features by
applying transformation functions, such as arithmetic and aggregate operators, to
the given features. Transformations help to turn a non-linear relationship between a
feature and the target class into a linear one. Feature engineering is normally
carried out by a data scientist, based on domain knowledge and on iterative testing
and assessment of models and their failures. It can noticeably affect modelling
performance; if done correctly, it helps the model perform very well.
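
To make the idea of transformation functions concrete, the following minimal sketch derives new features from the four intensity columns of Table 1 with arithmetic and aggregate operators. The derived feature names (`max_intensity`, `mean_negative`, `joy_minus_negative`, `dominant`) and the `engineer_features` helper are hypothetical; they only illustrate the kind of operators described above and are not the features engineered in this paper.

```python
EMOTIONS = ("anger", "fear", "sadness", "joy")


def engineer_features(row):
    """Derive illustrative features from one row of intensity values."""
    intensities = {e: row[f"{e}_intensity"] for e in EMOTIONS}
    negative = [intensities[e] for e in ("anger", "fear", "sadness")]
    mean_negative = sum(negative) / len(negative)  # aggregate operator
    return {
        "max_intensity": max(intensities.values()),  # aggregate operator
        "mean_negative": mean_negative,
        # arithmetic operator: contrast between joy and negative emotions
        "joy_minus_negative": intensities["joy"] - mean_negative,
        # name of the emotion with the highest intensity
        "dominant": max(intensities, key=intensities.get),
    }


row = {
    "anger_intensity": 0.2,
    "fear_intensity": 0.8,
    "sadness_intensity": 0.5,
    "joy_intensity": 0.1,
}
features = engineer_features(row)
print(features)
```

The row values are made up for the example; in practice the candidate features would be kept only if they improve validation accuracy.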

5 RESULTS AND DISCUSSION


5.1 Data collection
5.1.1 Data Scraping process
The dataset used in this research is a subset of a large COVID-19 tweets dataset
(Gupta, Raj, Vishwanath, Ajay, and Yang, Yinping, 2020), which consists of
semantically annotated tweets about the COVID-19 pandemic and its aspects. The full
dataset contains more than eight million tweets, with latent semantic attributes
derived for each public tweet using natural language processing techniques and
machine-learning-based

algorithms. Each tweet carries five numerical features that signify the degree of
polarity strength and the emotional intensity across four primary emotions: fear,
anger, sadness, and joy. The public data were anonymized in accordance with
Twitter's terms of use, so we had to retrieve each tweet text using the Twitter API;
a scraping script written in Python with the Scrapy package allowed us to retrieve
the dataset's other attributes.
This section details how the data sources were gathered, cleaned, and adjusted
when necessary. We used the existing tweet dataset, which contains tweet_id,
user_id, sentiment_category, emotion_category, all emotion intensities
(anger_intensity, joy_intensity, ...) and all keywords related to COVID-19.
The purpose is to provide a model that detects and predicts people's emotions from
tweets published before and during the COVID-19 period, focusing on five countries
with different cultures and circumstances (United States, United Kingdom, India,
Canada, South Africa). To that end, we reformulated the existing dataset to obtain
our features, tweet text and location. We chose this particular dataset for training
our model because it is an available, manually labelled, state-of-the-art dataset.
Figure 3 in the Appendix shows a general overview of our proposed model.
The first step is preprocessing the dataset by cleaning the unstructured data. Word
and character n-gram features are then extracted, and a feature matrix is formed to
represent the content. The dataset is split into 90% training data and 10% testing
data. TF and TF-IDF are used as feature extraction techniques to calculate the
feature values corresponding to all words in all contents of the training set. The
next step is training the classifiers. We use three machine learning algorithms,
namely SVM, Naive Bayes and Logistic Regression. These machine learning algorithms
learn from the provided training data.
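
The steps above (90/10 split, TF-IDF n-gram features, three classifiers) can be sketched with scikit-learn as follows. This is a minimal illustration under stated assumptions: the tiny corpus is invented for the example, and the hyperparameters (`ngram_range=(1, 2)`, `max_iter=1000`, `random_state=0`) and the choice of `LinearSVC` and `MultinomialNB` as the SVM and Naive Bayes variants are assumptions, not the settings used in this paper.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Invented toy corpus: five short texts per emotion (not from the real dataset).
corpus = {
    "joy": ["so happy the vaccine works", "great news my family recovered",
            "grateful for the nurses today", "finally home and smiling again",
            "wonderful sunny day with friends"],
    "fear": ["afraid of the rising deaths", "scared the virus is spreading",
             "terrified of catching covid", "worried about my parents health",
             "panic in the crowded hospital"],
    "anger": ["furious about the lockdown rules", "angry at people ignoring masks",
              "outraged by the government response", "this careless behaviour makes me mad",
              "sick of the lies and excuses"],
    "sadness": ["heartbroken after losing my uncle", "so lonely in quarantine",
                "crying for the victims tonight", "missing my friends and family",
                "grief everywhere this week"],
}
texts = [t for emotion_texts in corpus.values() for t in emotion_texts]
labels = [emotion for emotion, emotion_texts in corpus.items() for _ in emotion_texts]

# 90% training data, 10% testing data, as described in the text.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.1, random_state=0
)

classifiers = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "naive_bayes": MultinomialNB(),
    "linear_svm": LinearSVC(),
}

scores = {}
for name, clf in classifiers.items():
    # Word unigram and bigram TF-IDF features feed each classifier.
    model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), clf)
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)  # accuracy on the held-out 10%

print(scores)
```

With only two held-out tweets the accuracies printed here are not meaningful; the sketch only shows how the pieces fit together.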

5.1.2 Data Pre-processing

We applied data preprocessing steps to our existing data to reduce the size of the
actual data. Raw tweet texts are an unstructured source of data that may contain
noisy information. The raw text must therefore be pre-processed before features are
extracted for the models. There are different ways of converting the text into a
modelling-ready form.
We removed the punctuation. Next, we converted capital letters in the text to lower
case. We then eliminated the stop words, which carry little meaning in a language
and produce noise when used as features for text classification. The last step is
to reduce the words to their base form.
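
The cleaning steps described above can be sketched in plain Python. The stop-word list below is a small hypothetical subset chosen for the example, and the final step (reducing words to their base form) is left out because it needs a lemmatizer from an external library; this is an illustration, not the script used in this study.

```python
import string

# Small illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"a", "an", "the", "of", "to", "in", "is", "that", "or", "if", "at", "do"}

# Translation table that turns every ASCII punctuation character into a space.
PUNCT_TABLE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def preprocess(text):
    """Lower-case, strip punctuation and remove stop words."""
    text = text.lower()
    text = text.translate(PUNCT_TABLE)
    tokens = [tok for tok in text.split() if tok not in STOP_WORDS]
    return " ".join(tokens)


print(preprocess("Plenty of 9-5 jobs to do at home!"))
print(preprocess("The FEAR is real."))
```

Note that this sketch keeps digits and only handles ASCII punctuation, so typographic quotes would need an extra rule.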


Table 3 The transition steps in data pre-processing.

Before Preprocessing:
“instead of arguing about getting another stimulus check..... If you’re able find a
Job that’s less of a risk spreading COVID. plenty of 9-5 jobs to do at home or in
seclusion. \n\nIF THAT DOESN'T WORK\n\nGrind, build a brand, flip stuff, becoming
an affiliate marketer ETC”

After Preprocessing:
“instead arguing about getting another stimulus check able find that less risk
spreading covid plenty jobs home seclusion that does work grind build brand flip
stuff becoming affiliate marketer”

The table above shows the transformation of the raw dataset into an understandable
format using the eight steps shown in Figure 2 (see Appendix).

5.1.3 Feature Extraction Techniques:

▪ TF-IDF
We used two feature extraction techniques, namely Term Frequency (TF) and Term
Frequency-Inverse Document Frequency (TF-IDF).
Term Frequency (TF). TF measures how often a term occurs in a document of the
corpus. It is the ratio of the number of occurrences of a word in a document to the
total number of words in that document, so it increases with the number of
appearances of that term in the text. Each document has its own TF. Equation (1)
shows how to compute the term frequency:

$$ tf_{i,j} = \frac{n_{i,j}}{\sum_{k} n_{k,j}} \qquad (1) $$

where $n_{i,j}$ is the number of occurrences of word $i$ in document $j$ and
$\sum_{k} n_{k,j}$ is the total number of words in document $j$.

Inverse Document Frequency (IDF). IDF measures how important a term is.
While computing TF, all words are considered equally important. However, certain
words, like "is," "of" and "that," can occur many times but have little relevance.
Thus, we need to calculate the weight of rare


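words across all documents in the corpus. Terms that rarely appear in the corpus
have a high IDF score. In its standard form, it is given by Equation (2):

$$ idf_{i} = \log \frac{N}{df_{i}} \qquad (2) $$

where $N$ is the total number of documents in the corpus and $df_{i}$ is the number
of documents that contain word $i$.

Term Frequency-Inverse Document Frequency (TF-IDF). Combining TF and IDF, we
obtain the TF-IDF score for a word in a document of the corpus. The TF-IDF score
for word $i$ in document $j$ is:

$$ tfidf_{i,j} = tf_{i,j} \times idf_{i} \qquad (3) $$

where $tf_{i,j}$ is the term frequency of Equation (1).

We used the TfidfVectorizer class of the sklearn.feature_extraction.text module to
generate TF-IDF n-gram features in this research work.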
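
As a worked example of the three scores, the sketch below computes them by hand on a three-document toy corpus, assuming the textbook definitions: TF as the term count divided by the document length, IDF as the logarithm of the number of documents divided by the number of documents containing the term, and TF-IDF as their product. The corpus and function names are invented for illustration. Note that scikit-learn's TfidfVectorizer uses a smoothed IDF and normalizes each document vector by default, so its values differ from these hand-computed ones.

```python
import math

# Invented toy corpus; each document is tokenized by whitespace.
corpus = ["fear of the virus", "joy of the recovery", "the fear and the anger"]
docs = [doc.split() for doc in corpus]


def tf(term, doc_tokens):
    """Occurrences of `term` divided by the number of words in the document."""
    return doc_tokens.count(term) / len(doc_tokens)


def idf(term, all_docs):
    """log(N / df); assumes the term occurs in at least one document."""
    df = sum(1 for doc in all_docs if term in doc)
    return math.log(len(all_docs) / df)


def tf_idf(term, doc_tokens, all_docs):
    """Product of the term frequency and the inverse document frequency."""
    return tf(term, doc_tokens) * idf(term, all_docs)


for term in ("the", "fear", "virus"):
    print(term, round(tf_idf(term, docs[0], docs), 4))
```

The word "the" appears in every document, so its IDF is zero, while the rarer "virus" gets the highest score in the first document.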

▪ N-grams
N-grams are contiguous sequences of n items in a text or speech. The n refers to
the number of items in the sequence; the items can be phonemes, syllables, letters,
words, characters, bytes or any other sequence of data. The n-gram models most
widely used in natural language processing are word-based and character-based
n-grams.

In this research work, we use both word and character n-grams as features. N-grams
are named after their size: an n-gram of one word or character is called a unigram,
two a bigram, three a trigram, and a four-gram is a sequence with n=4.
We use word and character n-grams to represent the content of the tweets, and sets
of n-gram frequency profiles are generated from the training data to represent the
emotion classes.
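
Word and character n-grams, and a frequency profile built from them, can be sketched in a few lines of plain Python. The helper names and the example strings are hypothetical and only illustrate the definitions above.

```python
from collections import Counter


def word_ngrams(text, n):
    """Contiguous sequences of n whitespace-separated words."""
    tokens = text.split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def char_ngrams(text, n):
    """Contiguous sequences of n characters."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


bigrams = word_ngrams("stay home stay safe", 2)
trigrams = char_ngrams("covid", 3)
# Frequency profile: how often each unigram occurs in the text.
profile = Counter(word_ngrams("stay home stay safe", 1))

print(bigrams)
print(trigrams)
print(profile.most_common(1))
```

A text shorter than n items simply yields an empty list.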
▪ Count vectorizer

CountVectorizer tokenizes the text and performs very simple preprocessing
(tokenization involves breaking the sentences into words). Punctuation marks are
removed and all words are converted to lowercase. A vocabulary of known words is
built, which is also used later to encode unseen text. An encoded vector is returned
whose length is the size of the whole vocabulary and whose entries are integer
counts of the number of times each word appears in the text. The following image
illustrates the encoded vector.
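
As a complement, the behaviour described above can be reproduced with scikit-learn's CountVectorizer on two invented sentences. The example texts are assumptions made for illustration; the default settings (lowercasing, tokens of at least two word characters) are those of the library.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Invented training texts for illustration.
train_texts = ["Stay home, stay safe!", "Fear of the virus"]

vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_texts)  # sparse document-term matrix

# vocabulary_ maps each known word to its column index.
vocabulary = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
train_counts = X_train.toarray().tolist()

# Unseen text is encoded with the same vocabulary; unknown words are ignored.
unseen_counts = vectorizer.transform(["Stay calm, the virus fear"]).toarray().tolist()

print(vocabulary)
print(train_counts)
print(unseen_counts)
```

The word "calm" is absent from the fitted vocabulary, so it contributes nothing to the encoded vector of the unseen sentence.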

5.2 Data Analysis

First, we include some preliminary quantitative analysis in this section to explain
how the Emotion Detection Model functions. To test the consistency of the
Emotion Detection Model repository, we then conduct emotion detection using
multiple state-of-the-art models. Given all the fear and uncertainty surrounding
COVID-19, it is natural to worry. As can be seen in Figure 4 (see Appendix), during the


first week of March we can signify the transition of emotion from joy to fear by the
number of new infected cases which has experienced a high increase.
We notice an effect of reciprocal stimulation of emotions. The starting point of fear is
not collective, but each person who reacts spreads the emotional impact (see Figure
5 in Appendix). It's normal to think about all the fear and panic around COVID-19. As
we see in Figure 6 (see in Appendix) that the emotion of fear has increased from
May to June and we can illustrate this emotional transition by the number of deaths
which improves overnight from 5% to 13% also the number of new infected cases
which is still increasing (see Figure 6 in Appendix).
Figure 7 (see Appendix) shows the dominant words per day. For example, on
05/23/2020 the word "death" is the most dominant, which is expected because the
number of deaths on that day was higher. We can observe the dominance of fear over
hope in the life of individuals and collectives in the first months of the pandemic.
We therefore address the question of why fear dominates hope in the life of
individuals and collectives. Reports show that people are panic-buying everything
from toilet paper to ibuprofen, even though there is no outbreak in their nearby
surroundings or a high possibility of a shortage any time soon. People have taken to
pharmacies in batches, pulling masks off the shelf like there is no tomorrow, even as
the research clearly shows that masks are a waste of time for most people. The
Covid-19 pandemic has exposed and heightened fears of a world more divided than
ever before.

5.3 Experimental Results

Figure 8 represents the process of generating the new features. To do this, we built
a TF-IDF matrix from our tweet corpus and then chose the most contributing features
(Fig. 5); for example, the presence of the word “deaths” in a tweet increases the
probability of belonging to the “fear” emotion by 0.0000119 (Table 4).

Table 4 Most contributing features for fear emotion.

Word      Anger            Fear         Sadness          Joy
deaths    -7.029555e-05    0.0000119    -7.485374e-05    -0.0000123
death     -4.018244e-05    0.0000090    5.2575638e-05    -0.0000102
people    1.1708139e-04    0.0000062    5.8724824e-06    -0.0000184
more      -4.936135e-05    0.0000058    5.2450054e-07    -0.0000009
died      -4.619077e-05    0.0000040    9.8235554e-07    -0.0000092
over      6.228965e-07     0.0000024    2.1555039e-05    -0.0000046

These specific words were chosen to generate a new set of features by counting the
occurrences of each word in our tweets. As shown in Figure 5 (see Appendix), some
words with a high occurrence do not necessarily have the same impact on our
classes, perhaps due to a lower number of appearances in the tweets.
This approach aims to create new features that improve the performance of our
algorithm. First, we transform the dataset into a bag-of-words representation that
serves as the training set for the model. This technique creates a binary matrix of
features in which the columns are the words, the rows are the tweet ids, and each
entry is 0 or 1 to indicate the absence or presence of the word, so the result is a
sparse matrix. The newly created features are then added to this matrix; they
represent the counts of the terms that contribute most to the prediction (Figure 5,
see Appendix).
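
The construction described above can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions, not the code used in this work: the selected words are the six rows of Table 4, while the example tweets, the helper names and the `count_` column prefix are hypothetical. Dense lists are used for readability where a real pipeline would keep the matrix sparse.

```python
import re

# Words taken from Table 4; every other name and value here is hypothetical.
SELECTED_WORDS = ["deaths", "death", "people", "more", "died", "over"]


def tokenize(text):
    """Lowercase the text and keep only runs of letters and digits."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_features(tweets, selected_words=SELECTED_WORDS):
    """Return (column names, rows): binary bag-of-words plus count features."""
    tokenized = [tokenize(tweet) for tweet in tweets]
    vocabulary = sorted({tok for toks in tokenized for tok in toks})
    columns = vocabulary + ["count_" + word for word in selected_words]
    rows = []
    for toks in tokenized:
        present = set(toks)
        # Binary part: 1 if the word occurs in the tweet, 0 otherwise.
        binary = [1 if word in present else 0 for word in vocabulary]
        # Engineered part: how many times each selected word occurs.
        counts = [toks.count(word) for word in selected_words]
        rows.append(binary + counts)
    return columns, rows


tweets = [
    "More deaths today, more people died",  # hypothetical tweet
    "Feeling hopeful today",                # hypothetical tweet
]
columns, rows = build_features(tweets)
```

In the first row the binary column for "more" is 1 while the engineered column `count_more` is 2, which is the extra information the count features add to the presence indicators.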

Table 5 Benchmarking of used approaches for Logistic Regression.

Used Approach                                 C        N-gram   Bag-of-Words        Accuracy
Baseline Logistic Regression                  0.6      (1,2)    Count Vectorizer    0.623
Baseline Logistic Regression                  1        (1,2)    Tf-idf Vectorizer   0.631
Baseline Logistic Regression                  1        (1,2)    Count Vectorizer    0.643
Logistic Regression + Features Engineering    0.001    (1,2)    Count Vectorizer    0.733
Logistic Regression + Features Engineering    0.01     (1,2)    Tf-idf Vectorizer   0.752
Logistic Regression + Features Engineering    0.0001   (1,1)    Count Vectorizer    0.761

Table 6 Benchmarking of used approaches for LinearSVC.

Used Approach                                        C       N-gram   Bag-of-Words        Accuracy
Baseline Linear Support Vector Machine               0.8     (1,2)    Count Vectorizer    0.621
Baseline Linear Support Vector Machine               1       (1,2)    Count Vectorizer    0.618
Baseline Linear Support Vector Machine               1.2     (1,3)    Tf-idf Vectorizer   0.643
Linear Support Vector Machine + Features Engineering 0.001   (1,2)    Count Vectorizer    0.742
Linear Support Vector Machine + Features Engineering 0.1     (1,2)    Tf-idf Vectorizer   0.643
Linear Support Vector Machine + Features Engineering 0.01    (1,3)    Count Vectorizer    0.752

Table 7 Benchmarking of used approaches for Naive Bayes.

Used Approach                          Alpha   N-gram   Bag-of-Words        Accuracy
Baseline Naive Bayes                   0.7     (1,3)    Tf-idf Vectorizer   0.561
Baseline Naive Bayes                   1       (1,2)    Tf-idf Vectorizer   0.565
Baseline Naive Bayes                   1.2     (1,2)    Count Vectorizer    0.569
Naive Bayes + Features Engineering     1       (1,2)    Tf-idf Vectorizer   0.741
Naive Bayes + Features Engineering     1.2     (1,2)    Tf-idf Vectorizer   0.748
Naive Bayes + Features Engineering     1.5     (1,3)    Count Vectorizer    0.682


Tables 5, 6, and 7 show a descriptive summary of the results of the supervised
machine learning approaches; the metric used to evaluate their performance is the
accuracy score. A Grid-Search technique was applied over the inverse regularization
strength (C), the Laplacian smoothing parameter (alpha), the N-gram range, and the
Bag-of-words parameters to determine the best classifier.
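
The grid-search idea can be sketched as follows. This is a generic plain-Python illustration, not the search code used in this work: the grid values are a subset of the settings listed in Table 5, and `toy_evaluate` is a hypothetical placeholder that returns an arbitrary number instead of a measured accuracy. In practice it would train the classifier with the given parameters and return its validation accuracy.

```python
from itertools import product


def grid_search(param_grid, evaluate):
    """Score every parameter combination and return (best params, best score)."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = evaluate(params)
        if score > best_score:  # ties keep the first combination found
            best_params, best_score = params, score
    return best_params, best_score


# Candidate values: a subset of the settings that appear in Table 5.
param_grid = {
    "C": [0.0001, 0.001, 0.01],
    "ngram_range": [(1, 1), (1, 2)],
    "vectorizer": ["count", "tfidf"],
}


def toy_evaluate(params):
    """Hypothetical placeholder scorer; it does NOT measure real accuracy."""
    penalty = abs(params["C"] - 0.001)
    bonus = 0.1 if params["ngram_range"] == (1, 2) else 0.0
    return bonus - penalty


best_params, best_score = grid_search(param_grid, toy_evaluate)
n_combinations = len(list(product(*param_grid.values())))
```

The exhaustive loop evaluates 3 x 2 x 2 = 12 combinations here; the cost grows multiplicatively with every parameter added to the grid.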
Table 5 shows that Logistic Regression with the features engineering approach
performed very well, with 76% accuracy, unlike the basic approach using a white-box
Logistic Regression, which yields a maximum accuracy of 64% using the same
parameters. The same process was applied using the Linear Support Vector Machine
and Naive Bayes algorithms, and after adding the feature engineering approach
(Table 6 and Table 7) we again observe a remarkable improvement in the accuracy
score. We notice that the Count Vectorizer feature extraction gives higher accuracy
than TF-IDF, which is expected because in emotion detection we need to know the
term frequency in a sentimental sentence. We conclude that the CountVectorizer
gives better results.
We present experimental findings in the form of plots in this subsection. To better
explain the experimental results, we first introduce the detection results for hot
events during COVID-19. Then, we show the detection results for public emotion
evolution and explain public emotion changes during the pandemic.

“So sorry. Covid 19 is wicked”

Figure 9. An example of analytical tree structure with its sentence.

The automatically created analytical tree structure (ATS) is a labelled, oriented
acyclic graph with a single root (a dependency tree). In the ATS, every word form
and punctuation mark is explicitly represented as a node of the tree. Each node of
the tree is annotated with a set of attribute-value pairs. One of the attributes is
the analytic function, which expresses the syntactic function of the word. The number
of nodes in the graph equals the number of word-form tokens in the sentence, plus
the number of punctuation signs, plus one symbol for the sentence as such (the root
of the tree). The graph edges represent surface syntactic relations within the
sentence, as defined in Fig. 2.
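
The structure described above can be sketched as a small data type. This is an illustration under stated assumptions, not the output of the parser used in this work: the `Node` class, the attachments between words and the `afun` labels are hypothetical and were chosen by hand for the example sentence shown with Figure 9.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One ATS node: a word form or punctuation mark with attribute-value pairs."""
    form: str
    attributes: dict = field(default_factory=dict)
    children: list = field(default_factory=list)

    def add(self, form, **attributes):
        """Attach a new child node and return it so that subtrees can be built."""
        child = Node(form, attributes)
        self.children.append(child)
        return child


def count_nodes(node):
    """Count the node itself plus all of its descendants."""
    return 1 + sum(count_nodes(child) for child in node.children)


def edges(node):
    """Yield (head form, dependent form) pairs, one per tree edge."""
    for child in node.children:
        yield (node.form, child.form)
        yield from edges(child)


# Hypothetical attachments and analytic-function labels (illustrative only).
root = Node("<root>", {"afun": "AuxS"})  # symbol for the sentence as such
sorry = root.add("sorry", afun="ExD")
sorry.add("So", afun="Adv")
root.add(".", afun="AuxK")
verb = root.add("is", afun="Pred")
covid = verb.add("Covid", afun="Sb")
covid.add("19", afun="Atr")
verb.add("wicked", afun="Pnom")

tokens = ["So", "sorry", ".", "Covid", "19", "is", "wicked"]
n_nodes = count_nodes(root)
n_edges = len(list(edges(root)))
```

The sentence has six word forms and one punctuation sign, so the tree has seven token nodes plus the root, and, as in any tree, one edge fewer than nodes.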
Figure 10 (see Appendix) shows the variation of emotion by word and character
count; we notice that people write more about the pandemic when they are angry.

6 FUTURE WORK AND IMPLICATIONS:

The primary purpose of this analysis was to use Twitter, a widely used social media
platform, to monitor the public's emotions during the outbreak of COVID-19 from
the beginning of the pandemic until now and to examine the difference in emotions
from one country to another. The results suggest that the emotions of the public
displayed fascinating migratory trajectories at various stages of COVID-19, and
according to each region, as evidenced by both covariance and transformation
modes. Some studies of public responses to crises have focused on finding the
correlation between the feelings and emotions of people in neighboring countries
and the coronavirus (COVID-19) outbreak from their tweets. However, our study was
devoted only to detecting whether emotions are positive or negative, and our results
show that positive audience emotions (e.g. love, respect, praise) coexisted with and
even exceeded negative emotions, especially after deconfinement during COVID-19,
which is consistent with previous research on other crisis situations.

Our future work will concentrate on adding other features so that the system can
differentiate between negative emotion categories. This study examines the
number of tweets, emotional tones and types of messages on Twitter during the
COVID-19 pandemic. The results of this study reinforce the current view of social
media in the health communication literature: although Twitter poses many
challenges, it had a huge impact during the pandemic as a digital health surveillance
tool. As an epidemic unfolds, the percentage of tweets expressing different emotions
and adopting different types of messages fluctuates. While Twitter can be
considered a reliable source of information for gauging the opinions and emotions of
the public, public health authorities and communication experts should develop
specific communication strategies at each stage to respond to the public based on
timely analysis of Twitter data.

7 CONCLUSION
This research explores the number of tweets, emotional tones and forms of posts on
Twitter during the COVID-19 pandemic. The findings of this study confirm the
current image of social media in the literature on health communication: while Twitter
faces many obstacles, as a digital health monitoring tool it had a significant effect
during the pandemic. The number of tweets voicing various feelings and following
different kinds of messages fluctuates as an outbreak unfolds. Although Twitter can
be considered a credible source of information to measure public attitudes and
feelings, effective communication techniques should be implemented at each point
by public health officials and communication specialists to respond to the public on
the basis of timely review of Twitter data. In future work, this study can be
used to analyze the changing emotions and sentiments of people from some
countries and to check whether there are major shifts in them over time. It
is expected that, as the pandemic spreads further, the sentiments and
emotions in the tweets may change along the lines of what was seen in the case of
China.

REFERENCES
1. Do, H. J., Lim, C.-G., Kim, Y. J., & Choi, H.-J,. Analyzing emotions in twitter
during a crisis: A case study of the 2015 Middle East Respiratory Syndrome
outbreak in Korea. BT - 2016 International Conference on Big Data and Smart
Computing, BigComp 2016, Hong Kong, China, January 18-20, 2016, DOI:
10.1109/BIGCOMP.2016.7425960.
2. Jiang, Q., Chen, L., Xu, R., Ao, X., & Yang, M, A Challenge Dataset and
Effective Models for Aspect-Based Sentiment Analysis. BT - Proceedings of the
2019 Conference on Empirical Methods in Natural Language Processing and the 9th
International Joint Conference on Natural Language Processing, EMNLP-IJC, DOI:
10.18653/v1/D19-1654.

August 2021 | 467


Tianjin Daxue Xuebao (Ziran Kexue yu Gongcheng Jishu Ban)/
Journal of Tianjin University Science and Technology
ISSN (Online): 0493-2137
E-Publication: Online Open Access
Vol:54 Issue:08:2021
DOI 10.17605/OSF.IO/U9H52

3. Burkhardt, D., Pattan, S., Nazemi, K., & Kuijper, A, Search Intention Analysis for
Task- and User-Centered Visualization in Big Data Applications. Procedia Computer
Science, 104, DOI: 10.1016/j.procs.2017.01.170.
4. Mowlaei, M. E., Abadeh, M. S., & Keshavarz, H, Aspect-based sentiment
analysis using adaptive aspect-based lexicons. Expert Syst. Appl., 148, 113234,
DOI: 10.1016/j.eswa.2020.113234.
5. Hamroun, M., & Gouider, M. S, A survey on intention analysis: successful
approaches and open challenges. J. Intell. Inf. Syst., 55(3), DOI: 10.1007/s10844-
020-00604-x.
6. Tang, F., Fu, L., Yao, B., & Xu, W, Aspect based fine-grained sentiment analysis
for online reviews. Inf. Sci., 488, DOI: 10.1016/j.ins.2019.02.064.
7. Sun, C., Huang, L., & Qiu, X. Utilizing BERT for Aspect-Based Sentiment
Analysis via Constructing Auxiliary Sentence,2019, In CoRR: Vol. abs/1903.0.
http://arxiv.org/abs/1903.09588
8. Tan, S., & Wu, Q, A random walk algorithm for automatic construction of
domain-oriented sentiment lexicon. Expert Systems with Applications, 38(10),DOI:
10.1016/j.eswa.2011.02.105.
9. Steinberger, J., Ebrahim, M., Ehrmann, M., Hurriyetoglu, A., Kabadjov, M.,
Lenkova, P., Steinberger, R., Tanev, H., Vázquez, S., & Zavarella, V, Creating
sentiment dictionaries via triangulation. Decision Support Systems, 53(4),DOI:
10.1016/j.dss.2012.05.029.
10. Zhang, Y., Cheng, J., Yang, Y., Li, H., Zheng, X., Chen, X., Liu, B., Ren, T., &
Xiong, N, COVID-19 Public Opinion and Emotion Monitoring System Based on Time
Series Thermal New Word Mining. In Computers, Materials \& Continua (Vol. 64,
Issue 3), DOI: 10.32604/cmc.2020.011316
11. Keshtkar, F., & Inkpen, D, A BOOTSTRAPPING METHOD FOR EXTRACTING
PARAPHRASES OF EMOTION EXPRESSIONS FROM TEXTS. Computational
Intelligence, 29(3), DOI: 10.1111/j.1467-8640.2012.00458.x.
12. Mohammad, S. M, From once upon a time to happily ever after: Tracking
emotions in mail and books. Decision Support Systems, 53(4), DOI:
10.1016/j.dss.2012.05.030.
13. Xianghua, F., Guo, L., Yanyan, G., & Zhiqiang, W, Multi-aspect sentiment
analysis for Chinese online social reviews based on topic modeling and HowNet
lexicon. Knowledge-Based Systems, 37, DOI: 10.1016/j.knosys.2012.08.003.
14. Neviarouskaya, A., Prendinger, H., & Ishizuka, M, Recognition of Affect
Conveyed by Text Messaging in Online Communication. BT - Online Communities
and Social Computing, Second International Conference, OCSC 2007, Held as Part
of HCI International 2007, Beijing, China, July 22-27, 2007, Proceedings, DOI:
10.1007/978-3-540-73257-0_16
15. Montoyo, A., Martínez-Barco, P., & Balahur, A, Subjectivity and sentiment
analysis: An overview of the current state of the area and envisaged developments.
Decision Support Systems, 53(4), DOI: 10.1016/j.dss.2012.05.022.

August 2021 | 468


Tianjin Daxue Xuebao (Ziran Kexue yu Gongcheng Jishu Ban)/
Journal of Tianjin University Science and Technology
ISSN (Online): 0493-2137
E-Publication: Online Open Access
Vol:54 Issue:08:2021
DOI 10.17605/OSF.IO/U9H52

16. Balahur, A., Hermida, J. M., & Montoyo, A, Detecting implicit expressions of
emotion in text: A comparative analysis. Decision Support Systems, 53(4), DOI:
10.1016/j.dss.2012.05.024
17. Neviarouskaya, A., Prendinger, H., & Ishizuka, M, Recognition of Affect,
Judgment, and Appreciation in Text. BT - COLING 2010, 23rd International
Conference on Computational Linguistics, Proceedings of the Conference, 23-27
August 2010, Beijing, China, https://www.aclweb.org/anthology/C10-1091/
18. Lu, C.-Y., Lin, S.-H., Liu, J.-C., Cruz-Lara, S., & Hong, J.-S, Automatic event-
level textual emotion sensing using mutual action histogram between entities. Expert
Systems with Applications, 37(2), DOI: 10.1016/j.eswa.2009.06.099.
19. Neviarouskaya, A., Tsetserukou, D., Prendinger, H., Kawakami, N., Tachi, S., &
Ishizuka, M. (2009). Emerging system for affectively charged interpersonal
communication. 2009 ICCAS-SICE, 3376–3381.
20. PLUTCHIK, R, Chapter 1 - A GENERAL PSYCHOEVOLUTIONARY THEORY
OF EMOTION (R. Plutchik & H. B. T.-T. of E. Kellerman (eds.); Academic Press,
DOI: 10.1016/B978-0-12-558701-3.50007-7.
21. Medhat, W., Hassan, A., & Korashy, H, Sentiment analysis algorithms and
applications: A survey. Ain Shams Engineering Journal, 5(4), DOI:
10.1016/j.asej.2014.04.011.
22. Gupta, A., & Srinivasan, S. M, Constructing a Heterogeneous Training Dataset
for Emotion Classification. Procedia Computer Science, 168, DOI:
10.1016/j.procs.2020.02.259.
23. Boldrini, E., Balahur, A., Martínez-Barco, P., & Montoyo, A, Using EmotiBlog to
annotate and analyse subjectivity in the new textual genres. Data Mining and
Knowledge Discovery, 25(3), DOI: 10.1007/s10618-012-0259-9.
24. Tsytsarau, M., & Palpanas, T, Survey on mining subjective data on the web.
Data Mining and Knowledge Discovery, 24(3), DOI: 10.1007/s10618-011-0238-6.
25. Gupta, Raj, Vishwanath, Ajay, and Yang, Yinping. COVID-19 Twitter Dataset
with Latent Topics, Sentiments and Emotions Attributes. Ann Arbor, MI: Inter-
university Consortium for Political and Social Research [distributor], 2020-09-04.
https://doi.org/10.3886/E120321V5
26. A. V. Krishna Prasad, Rajesh Prabhakar Kaila. INFORMATIONAL FLOW ON
TWITTER - CORONA VIRUS OUTBREAK – TOPIC MODELLING APPROACH.
27. Imran, A.S., Doudpota, S.M., Kastrati, Z., & Bhatra, R. (2020). Cross-Cultural
Polarity and Emotion Detection Using Sentiment Analysis and Deep Learning - a
Case Study on COVID-19. ArXiv, abs/2008.10031
28. Samuel, Jim and Ali, G. G. Md. Nawaz and Rahman, Md. Mokhlesur and Esawi,
Ek and Samuel, Yana, COVID-19 Public Sentiment Insights and Machine Learning
for Tweets Classification, DOI: 10.2139/ssrn.358499.
29. Prabhakar Kaila, Dr. Rajesh and Prasad, Dr. A. V. Krishna, Informational Flow
on Twitter – Corona Virus Outbreak – Topic Modelling Approach (March 31, 2020).
International Journal of Advanced Research in Engineering and Technology
(IJARET), 11 (3), 2020, pp 128-134., Available at SSRN:
https://ssrn.com/abstract=3565169.

August 2021 | 469


Tianjin Daxue Xuebao (Ziran Kexue yu Gongcheng Jishu Ban)/
Journal of Tianjin University Science and Technology
ISSN (Online): 0493-2137
E-Publication: Online Open Access
Vol:54 Issue:08:2021
DOI 10.17605/OSF.IO/U9H52

30. Cäcilia Zirn, Mathias Niepert, Heiner Stuckenschmidt & Michael Strube. (2011).
Fine-Grained Sentiment Analysis with Structural Features. In Proceedings of 5th
International Joint Conference on Natural Language Processing (pp.336–344).Asian
Federation of Natural Language Processing.
31. Neviarouskaya, A., Prendinger, H., & Ishizuka, M. (2009). Compositionality
Principle in Recognition of Fine-Grained Emotions from Text. BT - Proceedings of
the Third International Conference on Weblogs and Social Media, ICWSM 2009, San
Jose, California, USA, May 17-20, 2009.
http://aaai.org/ocs/index.php/ICWSM/09/paper/view/197.
32. Oumaima, S., Soulaimane, K., & Omar, B. (2020). Latest Trends in
Recommender Systems Applied in the Medical Domain: A Systematic Review.
Proceedings of the 3rd International Conference on Networking, Information
Systems & Security, DOI: 10.1145/3386723.3387860.
33. Livio Robaldo, Luigi Di Caro OpinionMining-ML

APPENDIX

Figure 1. The general overview of our model.

Figure 2. Architecture of our proposed approach.

Figure 3. The general overview of our model.

Figure 4. Emotion evaluation per week.

Figure 5. Emotion evaluation per two weeks.

Figure 6. Emotion evaluation per month.

Figure 7. Dominant words per day.

Figure 8. The proposed Features engineering process.

Figure 10. An example of analytical word char count.
