
Analysis of COVID-19 related social

media posts

MTP Report

by

Alpesh Kaushal
17CS30003

Under the guidance of

Prof. Saptarshi Ghosh

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
INDIAN INSTITUTE OF TECHNOLOGY KHARAGPUR

Department of Computer Science and Engineering
Indian Institute of Technology, Kharagpur
India - 721302

CERTIFICATE

This is to certify that the work in the thesis entitled Analysis of COVID-19 related
social media posts, submitted by Alpesh Kaushal (Roll Number: 17CS30003), a
dual degree student of the Department of Computer Science and Engineering,
Indian Institute of Technology Kharagpur, in partial fulfillment for the award of
the Dual Degree, has been carried out by him. We hereby accord our approval of it as a
study carried out and presented in a manner required for its acceptance in partial
fulfillment of the Dual Degree for which it has been submitted. The thesis has fulfilled
all the requirements as per the regulations of the Institute and has reached the standard
needed for submission.

Supervisor

Department of Computer
Science and Engineering
Indian Institute of Technology,
Kharagpur

Place: Kharagpur
Date: 25th April 2022
ACKNOWLEDGEMENTS

I would like to thank Prof. Saptarshi Ghosh, who gave me this golden opportunity to
work on this project. I learned a lot from it. I would also like to express my gratitude
to my mentor Soham Poddar, a Research Scholar under Prof. Saptarshi Ghosh at
IIT Kharagpur, for his support and guidance.

Alpesh Kaushal
IIT Kharagpur
Date: 25/04/2022
ABSTRACT
Authorities must stay alert to potential rises in COVID-19 cases, which were still
increasing in several parts of the world. Forecasting COVID-19 outbreaks can help
prevent harm, offer time to prepare for difficult circumstances, inform applicable
policies, and support public education about vaccination. According to the World
Health Organization, vaccine hesitancy was one of the top ten global health threats in
2019. Nowadays, social media plays a significant role in the distribution of
vaccine-related information, misinformation, and disinformation. A wide variety of
topics and questions fall under the umbrella of vaccination that one might notice people
discussing while browsing social media, such as whether the vaccine is necessary, or
what its side effects are. Among the many social media platforms on which people
express their views, Gab.com is notable for the freedom of speech it provides. We
therefore collect all relevant vaccine-related data from Gab.com and build a classifier to
categorise these posts into the corresponding categories. Further, we use Twitter data to
predict the number of future COVID-19 cases/deaths. The purpose of this study is to see
whether social media signals (from Twitter) can be used to create automatic predictors of
the number of COVID-19 cases that will occur in the future. We develop classifiers that
can reliably identify such symptom-reporting tweets, and then investigate how the
related signals link to the frequency of COVID-19 cases. Through experiments over
worldwide tweets and tweets from India posted in 2020 and 2021, we find social media
signals that exhibit good associations with the number of future COVID-19 cases/deaths.
Table of Contents
Abstract
1. Introduction
1.1. Vaccine Hesitancy
1.2. Predicting future cases/deaths for COVID-19
1.3. Overview of report
2. Related Work
3. Data
3.1. Gab Data
3.1.1. Manual Analysis of Gabs
3.1.2. Labeled Latent Dirichlet Allocation
3.2. Twitter Data for cases prediction
4. Classification
4.1. Traditional Classification Model
4.1.1. For Vaccine Hesitancy
4.1.2. For Symptoms-reporting tweets
4.2. fastText Classification
4.3. BERT Transformer
4.3.1. For Vaccine Hesitancy
4.3.2. For Symptoms-reporting tweets
5. Correlation of different classes
6. Conclusion
6.1. Summary of Work
6.2. Future Scope
6.2.1. For Vaccine Hesitancy
6.2.2. For Symptoms-reporting tweets
7. References
1. Introduction
Nearly 227 million individuals have been affected by the ongoing COVID-19 pandemic,
which has resulted in over 4.6 million deaths worldwide (as of September 2021). To
stop the virus from spreading, strict measures were implemented in many parts of the
world; however, once restrictions were eased, many countries saw an increase in cases.
Although immunizations began in the first quarter of 2021, the risk of a sharp
increase in COVID-19 persists. As a result, authorities must be able to forecast probable
increases in COVID cases/deaths in order to take precautionary measures and prepare for
medical catastrophes.

Apart from that, understanding the reasons for vaccine hesitancy is critical so that
authorities may develop policies to educate people and respond to their worries, as well
as take action against reported side effects. The two problems are linked: an increase in
vaccine hesitancy can lead to an increase in COVID cases, which in turn alters how aid
resources must be distributed. It is clear that monitoring cases is crucial, and that
understanding and addressing vaccination reluctance is beneficial to authorities.

Previous research has shown that social media can be beneficial in gathering situational
information during emergency situations such as natural disasters and epidemics (Imran
et al. 2013; Househ 2016). In addition, social media has been utilised to predict future
pandemic outbreaks (Grover and Aujla 2014). People have been routinely writing on
Twitter or Gab about their experiences as a result of COVID-19, and how their lives have
been altered. People have posted stories about someone (or themselves) getting sick,
suffering from COVID-19 symptoms, or why they are opposed to vaccination.

1.1. Vaccine Hesitancy


Vaccines are a safe and effective public health intervention in the case of virus outbreaks
and pandemics. However, there is widespread scepticism about the Coronavirus disease
(COVID-19) vaccination. Understanding vaccination-related behaviour is therefore
crucial for increasing vaccine coverage and flattening the infection curve.
Though mass vaccination programs have already started globally, their effectiveness
has been affected by a hesitancy to receive the vaccines among populations, where
vaccine hesitancy is defined as the delay in acceptance, or the refusal, of available
vaccines.
Studies have identified several factors associated with COVID-19 vaccine
hesitancy in different domains. We use such categories in this report to
distinguish social media posts.
We used antivax hashtags to pull data from Gab.com, then manually analysed a sample
of gabs to find common terms that indicate which group a gab belongs to. We utilised
L-LDA to find the top 5 keywords for each category using annotated Twitter data. We
also used BERT transformers, as well as traditional multilabel classifiers, to recognise
and categorise these gabs into predetermined categories/reasons for not getting
vaccinated.

1.2. Predicting future cases/deaths for COVID-19

With different degrees of success, certain governments throughout the world have
sought to forecast increases in COVID-19 cases using applications that rely on people
reporting their symptoms or getting diagnosed (e.g., the Aarogya Setu app in India).
Since vast numbers of tweets are posted every day, using social media to
automatically gather such insights on users' symptoms can be a more cost- and
time-effective alternative for predicting potential rises in COVID-19 cases, and in any
future disease outbreak.
In this report, we investigate various social media signals that could be utilised
to anticipate future COVID-19 cases. We focus on 'symptom-reporting tweets',
which provide information on someone who is experiencing COVID-19 symptoms. We
gathered Indian tweets containing symptom keywords (related to a standard set of
COVID-19 symptoms established by the WHO) over a long period, from February 2020
to June 2021. We discovered that a huge percentage of these tweets contain no
information on anyone who is actually suffering the symptoms. As a result, we created a
customised BERT-based classifier to not only identify tweets that actually report
someone experiencing symptoms (symptom-reporting tweets), but also to differentiate
between different subcategories of symptom-reporting tweets: primary/self-reporting
tweets, secondary-reporting tweets, and third-party reporting tweets that report
another person experiencing COVID-19 symptoms. We then show that certain of these
sub-categories have a strong association with Indian COVID-19 case dynamics, notably
in 2021. We also show how these signals may be utilised to create prediction models
for the number of COVID-19 cases in the future.
We developed an accurate 4-class classifier to identify and classify the different
types of symptom-reporting tweets, achieving a macro F-score of 0.79 on this
challenging task.
We extract various signals from these symptom-reporting tweets and compare them to
COVID-19 cases and deaths in India and around the world. In a few instances, we see
strong relationships; in particular, the number of secondary-reporting tweets posted
within a given week has a strong link to the number of COVID cases/deaths that occur
1-2 weeks later. Even when employing simple regression models for prediction, we see
good outcomes. We anticipate that the findings of this study will aid in the development
of improved models for predicting COVID-19 cases and deaths (and other diseases).
This could help governments track future COVID-19 waves or other disease outbreaks
and prevent or mitigate them.

1.3. Overview of report


In this project we gathered data on vaccine hesitancy and trained multiple classifiers to
categorise it by reason. We also used Twitter data to train several classifiers to identify
symptom-reporting tweets, which helps us understand future case/death numbers. In
section 2, we discuss related work. In section 3, we present our data: the results of a
manual analysis of gabs to identify common words, and the top-5 words per category
found using L-LDA; in subsection 3.2 we describe the Twitter data we worked with. In
section 4, we discuss various classifiers for vaccine hesitancy and for finding
symptom-reporting tweets; we also try fastText classification for the Gab data in
subsection 4.2, and in subsection 4.3 we report how we created and improved BERT
models, in 4.3.1 for vaccine hesitancy and in 4.3.2 for symptom-reporting tweets. In
section 5, we build time series from our model's predictions and compute their
correlation with the actual numbers of cases/deaths. In section 6, we conclude the
discussion with a short recap and possible future scope.
2. Related Work

Vaccine Hesitancy

Previous work by Hilary Piedrahita-Valdés et al. (2021) used a hybrid approach to
perform an opinion-mining analysis on 1,499,227 vaccine-related tweets published on
Twitter from 1st June 2011 to 30th April 2019. Their algorithm classified 69.36% of the
tweets as neutral, 21.78% as positive, and 8.86% as negative. The percentage of neutral
tweets showed a decreasing tendency, while the proportions of positive and negative
tweets increased over time.
Jens Lemmens et al. (2022) built a Dutch language model adapted to the domain of
COVID-19 tweets, adapting BERT for vaccine hesitancy and argumentation detection.
Steven Lloyd Wilson et al. (2020) employed a large-n cross-country regression approach
to assess the global impact of social media on vaccination reluctance. They also
discovered a link between social media activity by organisations and public concerns
about vaccine safety. Furthermore, there is a strong link between foreign disinformation
tactics and falling vaccination rates.
An article by Ariana Remmel (2021) mentions that public confidence in the safety of
COVID-19 vaccines in the United States declined after government officials halted
vaccinations with the Johnson & Johnson (J&J) shot in April 2021. During the ten-day
pause, officials investigated whether the vaccine was linked to a rare type of blood clot,
but they finally declared the vaccine safe and granted the go-ahead to resume its use.
Social media plays a huge role in informing people about such investigations and can
cause distress in different countries.
According to a report by the Centre for Countering Digital Hate (CCDH), anti-vaxxers'
social media accounts have grown their following by at least 7-8 million individuals
since 2019. Haiyan Yu et al. (2022) also found that negative-sentiment COVID-19
tweets from public organizations attract more responses from followers and hence
spread to larger audiences.
It is evident from these findings that social media posts hold a lot of potential for
understanding the reasons behind vaccine hesitancy.
In this report we use predefined categories/reasons for classification: neutral,
unnecessary, mandatory, pharma, conspiracy, political, country, rushed, ingredients,
side-effect, ineffective, religious, and none. They are defined in section 3.1.1.
This work uses Gab data, and we developed a classifier to find the reasons why posts
are against COVID vaccination.

Predicting number of Cases/Deaths

Previous works have tried using different indicators to estimate trends in the number of
cases and deaths due to COVID-19.
Karisani and Karisani (2020) were among the first to apply machine learning and
natural language processing (NLP) techniques to detect tweets containing information
about someone who has been infected with COVID-19; they used BERT-based models
for the classification. Shen et al. (2020) used data from Weibo (China's Twitter) to
identify posts containing information about people reporting symptoms and their
diagnoses using traditional machine learning algorithms (such as SVM and random
forest). Klein et al. (2021) again employed BERT-based models, to identify tweets
reporting that someone tested positive for COVID-19.
According to Singh et al. (2020), social media conversations are highly correlated with
COVID-19 cases, with the United States, Italy, and China taking the lead; as a result,
social media conversations could be used as a precursor of COVID-19 cases.
Li et al. (2020a) found correlations between rising cases in China in early 2020 and
search trends (from Google and Baidu) and social media data (from Weibo). Similarly,
Yousefinaghani et al. (2021) used the SH-ESD algorithm to anticipate COVID-19 waves
in the United States and Canada using a search index (from Google) and tweets relating
to symptoms.
(Shen et al. 2020), which works on Weibo postings from early 2020, and (Klein et al.
2021), which works on self-reports from early to mid-2020, are two previous works that
employed signals comparable to the ones we apply. However, none of these studies
attempted to investigate the various sub-categories of symptom-reporting tweets, nor did
they examine tweets from the year 2021. It is crucial to figure out which signals are
reliable predictors of COVID cases/deaths over longer time periods, which no previous
research has looked at.
This work differs from prior studies that attempted to correlate social media signals with
COVID cases/deaths. We designed a customized BERT-based classifier that detects
people reporting COVID-19 symptoms in tweets, and we analyse trends from India over
a much longer period (spanning the first and second COVID waves in India). In addition
to extra features, Part-of-Speech tagging is also tested, along with adding an additional
linear layer to the BERT model, resulting in an improvement over the old model.
3. Data

3.1. Gab Data

Gab is well-known for its tolerance of hate speech. Far-right or alt-right users who have
been banned or suspended from other services have flocked to the site. Torba (Gab's
CEO) said in a Gab post in late July 2021 that he was "being bombarded" with text
messages from members of the US military claiming that if they refused the COVID-19
vaccine, they would be court-martialed. The post received 10,000 likes and shares on
Facebook. Torba also posted documents on Gab's news site containing false information
regarding the COVID-19 vaccine, claiming in an email to The New York Times that
"I'm stating the truth" and that "Your Facebook-funded 'fact checkers' like Graphika are
wrong and are the ones selling disinformation here." All of this makes Gab a fitting
platform for studying antivax posts and understanding people's opinions about
vaccination.
Gab only allows searching for a hashtag or users; a public-search feature existed but was
removed a few years ago. We wrote code to extract the data using the Gab API. For this
purpose we used two types of hashtags to collect data: gabs for about 120 antivax
keywords and 150 provax keywords were collected, along with all vaccine names and
the names of their manufacturers.

Total gabs: approx. 40,000
Provax: 1,830 gabs
Antivax: 21,341 gabs

It was then divided into 4 categories:


● anti-vax gabs posted before February 1, 2020 (pre-COVID times)
● anti-vax gabs posted after February 1, 2020 (COVID times)
● pro-vax gabs posted before February 1, 2020 (pre-COVID times)
● pro-vax gabs posted after February 1, 2020 (COVID times)

For training purposes, about 4,500 annotated tweets were used, manually classified into
the categories mentioned in 3.1.1. This data was then used to train various classifiers
and observe the results.

(Image 1: Number of gabs in each category)

3.1.1. Manual Analysis of Gabs


About 200 randomly selected gabs were personally analysed and categorized into
13 categories, looking for common words that reliably identify a gab as belonging
to a specific category.
These 13 categories are as follows:

neutral: The tweet does NOT indicate hesitancy towards any vaccine
unnecessary: COVID is not dangerous / vaccine not required
mandatory: Against mandatory vaccination
pharma: Against Big Pharma
conspiracy: Deeper conspiracy
political: Political side of vaccines
country: Country of origin
rushed: Rushed process
ingredients: Vaccine ingredients / technology
side-effect: Side effects
ineffective: Vaccine is ineffective
religious: Religious reasons
none: No specific reason stated in the tweet
Observation
Most of the gabs were classified under the side-effect category.
Gabs containing these words can be categorized into side-effect:
Bell's Palsy, Bellspalsy, Blind, Deaf, Throat Paralysis, Tremors, vaccines cause autism,
blood clots, bad headaches, high fever, sore muscles, spike protein, diarrhea, bloating,
high levels of aluminum, autoimmune disease, chest pains, infertility, Myocarditis,
Pericarditis, HeartInflammation, produces toxins, cardiac arrest, Alzheimers, ALS,
Neurological Degenerative Diseases, brain thrombosis, death

Gabs containing these words can be categorized into ingredients:
mRNA, death, gene therapy, DNA, protein, HIV encoding, RNA vaccine

Gabs containing these words can be categorized into mandatory:
NO MANDATORY VACCINES, noVaccineMandates, mandatedVaccines, noMandates,
noVaccine, noMasks

3.1.2. Labeled Latent Dirichlet Allocation


Labeled LDA is a topic model that constrains Latent Dirichlet Allocation by defining a
one-to-one correspondence between LDA's latent topics and user tags, so it can directly
learn topic-tag correspondences.
One of its uses is finding the top terms associated with each topic. L-LDA is
implemented here to find the top 5 terms associated with each category, as sketched
below.
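
As an illustration, here is a minimal sketch of this step with the tomotopy library; the example documents, label names, and hyperparameters are placeholders rather than the exact setup used in this work.

```python
# Minimal sketch (not the exact code used here) of extracting top-5 terms per
# label with Labeled LDA via the tomotopy library (pip install tomotopy).
# `annotated_tweets` is a hypothetical stand-in for the annotated Twitter data.
import tomotopy as tp

annotated_tweets = [
    (["vaccine", "caused", "blood", "clots", "tremors"], ["side-effect"]),
    (["no", "mandatory", "vaccines", "passport"], ["mandatory"]),
    (["big", "pharma", "profit", "money"], ["pharma"]),
    # ... one (tokens, labels) pair per annotated tweet
]

model = tp.LLDAModel(seed=42)   # L-LDA: one topic per observed label
for tokens, labels in annotated_tweets:
    model.add_doc(tokens, labels=labels)

model.train(1000)               # Gibbs-sampling iterations

for topic_id in range(len(model.topic_label_dict)):
    label = model.topic_label_dict[topic_id]
    top5 = [word for word, _ in model.get_topic_words(topic_id, top_n=5)]
    print(label, top5)
```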
Observation

side-effect: ['vaccine', 'pfizer', 'reaction', 'effects', 'deaths', 'adverse']
mandatory: ['vaccine', 'mandatory', 'force', 'passport', 'forced']
political: ['trump', 'vaccine', 'election', 'government', 'fda']
ingredients: ['aborted', 'cells', 'vaccines', 'ingredients', 'made']
pharma: ['vaccine', 'pharma', 'pfizer', 'money', 'profit']
country: ['russian', 'china', 'western', 'russia', 'poisoning']
religious: ['johnson', 'religion', 'catholics', 'god', 'jesus']
none: ['like', 'russian', 'vaccination', 'roulette', 'bank']
unnecessary: ['vaccine', 'covid', 'risk', 'need', 'people']
rushed: ['vaccine', 'pfizer', 'rushed', 'trials', 'experimental']
ineffective: ['vaccine', 'effective', 'still', 'flu', 'even']
neutral: ['get', 'vaccine', 'time', 'birth', 'likely']
conspiracy: ['covid', 'gates', 'bill', 'vaccines', 'world']

3.2. Twitter Data for cases prediction


We got annotated tweets, personally examined by two persons, going through random samples of
250 symptom-keyword tweets from the Indian data. Two types of tweets were observed:
symptom-reporting tweets and non-reporting tweets. Symptom-reporting tweets include
information on the author (the Twitter user who wrote the tweet) or another individual who is
experiencing COVID-19 symptoms, whether or not they have been diagnosed. Note that not all
symptom-keyword tweets are symptom-reporting tweets: non-reporting tweets contain a symptom
keyword but do not provide information about someone who is having COVID-19 symptoms. We
identified three sub-classes of symptom-reporting tweets based on who is being reported to have
COVID symptoms. The subcategories are:
(i) Primary: if the author (user who posted the tweet) is experiencing symptoms;
(ii) Secondary: if someone the author knows personally (e.g., a friend, relative, or neighbour) is
experiencing symptoms; and
(iii) Third-Party: if someone who is not an acquaintance of the author (e.g., a celebrity) is
experiencing symptoms.
The non-reporting category includes a wide range of tweets that include symptom keywords but
don't mention someone who is experiencing symptoms. The following are two of the most
prevalent variants we saw in this class:
(i) General Awareness: the tweet provides useful information/advice on COVID-19 symptoms or
hygiene, with the goal of raising public awareness.
(ii) Irrelevant: the tweet uses a symptom keyword in an entirely different context (for example,
'Saturday Night Fever', 'football fever', and 'lockdown pain').
We have not differentiated between these non-reporting variants because we are only interested in
reporting tweets. Simply retrieving tweets containing symptom keywords is unlikely to be useful
for analysing COVID-19 dynamics, as many of these tweets are non-reporting tweets; in fact, of
the 500 symptom-keyword tweets we personally analysed, up to 67 percent were non-reporting.
As a result, we need a classifier that can tell the difference between the various sub-classes of
symptom-reporting tweets and non-reporting tweets. In the following section, we look at such a
classifier.

Finally, we received an annotated random sample of 4K symptom-keyword tweets, divided into
four categories: (i) Primary: if the author (user who posted the tweet) is experiencing symptoms;
(ii) Secondary: if someone the author knows personally (e.g., a friend, relative, or neighbour) is
experiencing symptoms; (iii) Third-Party: if someone who is not an acquaintance of the author
(e.g., a celebrity) is experiencing symptoms; and (iv) Non-reporting.
For the purpose of getting a time-series dataset of total cases per day, we used the
COVID19-Bharat API. With its help, we found correlations between the time series we created
from the numbers of reporting tweets and the actual numbers.
4. Classification
4.1. Traditional Classification Model
The basic purpose of a classification task is to determine which category or class
new data belongs to.
Traditional single-label classification is concerned with learning from a set of examples,
each associated with a single label l from a set of disjoint labels L, |L| > 1. If |L| = 2,
the learning problem is referred to as binary classification (or filtering, in the case of
textual and online data); if |L| > 2, it is a multi-class classification problem.
In this part we used kNN, Decision Tree, Bagging, Random Forest, Boosting,
Multinomial NB, SVC with squared hinge loss, and Power Set SVC; a sketch of the
setup follows.
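
Below is a minimal sketch of this setup with scikit-learn; the texts and labels are made-up placeholders for the annotated data, and only the squared-hinge-loss linear SVC from the list above is shown.

```python
# Minimal sketch of the traditional-classifier setup with scikit-learn.
# The texts/labels below are made-up placeholders for the annotated data.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

texts = ["the vaccine gave me tremors and a high fever",
         "no mandatory vaccines or vaccine passports",
         "big pharma only cares about profit",
         "rushed experimental trials cannot be trusted",
         "the shot does not even stop the flu",
         "got my second dose today"]
labels = ["side-effect", "mandatory", "pharma", "rushed", "ineffective", "neutral"]

# LinearSVC minimizes squared hinge loss by default ("SVC Sq. Hinge Loss").
clf = make_pipeline(TfidfVectorizer(), LinearSVC())
clf.fit(texts, labels)

print(clf.predict(["forced vaccine passports are coming for us"]))
# Macro-F1 on held-out data: sklearn.metrics.f1_score(y_true, y_pred, average="macro")
```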
(The results shown below are for different datasets and should not be compared with each other.)

4.1.1. For Vaccine Hesitancy

Model Name            Macro-F1 Score
kNN                   0.4795
Decision Tree         0.5697
Bagging               0.5520
Random Forest         0.5416
Boosting              0.5602
Multinomial NB        0.3661
SVC Sq. Hinge Loss    0.6096
Power Set SVC         0.6258


The dataset we have is imbalanced, i.e., some classes appear more often than others.

4.1.2. For Symptoms-reporting tweets

Model Name            Macro-F1 Score
kNN                   0.5399
Decision Tree         0.4359
Bagging               0.5665
Random Forest         0.5674
Boosting              0.5594
Multinomial NB        0.5300
SVC Sq. Hinge Loss    0.5941

4.2. fastText Classification

fastText is a library for efficient learning of word representations and sentence
classification. To be efficient on datasets with a very large number of categories, it uses
a hierarchical classifier instead of a flat structure, in which the different categories are
organized in a tree. It also exploits the fact that classes are imbalanced by using the
Huffman algorithm to build the tree used to represent categories.
The outputs of fastText are the precision at one (P@1) and the recall at one (R@1).
For vaccine hesitancy we observed:
P@1 - 0.241
R@1 - 0.241
A sketch of such a run follows.
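
The sketch below shows one way such a run looks with the fasttext Python package; gabs.train and gabs.valid are hypothetical files in fastText's __label__ format, and the hyperparameters are illustrative rather than the ones used here.

```python
# Sketch of a fastText supervised run. fastText expects one example per line,
# prefixed with "__label__<class>"; gabs.train and gabs.valid are hypothetical
# files, and the hyperparameters are illustrative.
import fasttext

# Example line in gabs.train:
# __label__side-effect the vaccine gave me tremors and a high fever
model = fasttext.train_supervised(input="gabs.train",
                                  epoch=25, lr=0.5, wordNgrams=2)

n, p_at_1, r_at_1 = model.test("gabs.valid")   # returns (N, P@1, R@1)
print(f"N={n}  P@1={p_at_1:.3f}  R@1={r_at_1:.3f}")

print(model.predict("no mandatory vaccines"))  # top label and its probability
```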
4.3. BERT Transformer

BERT is the first deeply bidirectional, unsupervised language representation,
pre-trained using only a plain text corpus.
BERT also learns to model relationships between sentences by pre-training on a very
simple task that can be generated from any text corpus: given two sentences A and B, is
B the actual next sentence that comes after A in the corpus, or just a random sentence?

4.3.1. For Vaccine Hesitancy


We focused on the application of BERT to the problem of multi-label text classification.
Here we used a bert-base-cased model with a batch size of 8 (BERT-Base, Cased:
12 layers, 768 hidden units, 12 heads, 110M parameters); a sketch follows.
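
The sketch below shows one way to set this up with the Hugging Face transformers library; the label list follows section 3.1.1, the training loop itself is omitted, and the untrained classification head would of course give meaningless scores until fine-tuned. This is an illustration under those assumptions, not the exact code used.

```python
# Sketch of the multi-label setup with Hugging Face Transformers. The label
# list follows section 3.1.1; problem_type="multi_label_classification" makes
# the model use a per-label sigmoid with BCE loss during fine-tuning.
import torch
from transformers import BertTokenizer, BertForSequenceClassification

LABELS = ["neutral", "unnecessary", "mandatory", "pharma", "conspiracy",
          "political", "country", "rushed", "ingredients", "side-effect",
          "ineffective", "religious", "none"]

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
model = BertForSequenceClassification.from_pretrained(
    "bert-base-cased",
    num_labels=len(LABELS),
    problem_type="multi_label_classification",
)

enc = tokenizer("yeah that's not how the vaccine works",
                return_tensors="pt", truncation=True)
with torch.no_grad():
    probs = torch.sigmoid(model(**enc).logits)[0]

# Keep only tags at or above the 0.5 threshold used in the report.
print([(label, round(float(p), 4)) for label, p in zip(LABELS, probs) if p >= 0.5])
```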

Observed:
accuracy 0.9431
test loss 0.1834

We examined a small number of predictions to get a sense of how good they are
(qualitative evaluation).

Test comment :

"johnson a vaccine is a good help but pointless if the there is shit still in the water polio for
example"

unnecessary 0.4786
ineffective 0.4927

'Ineffective' and 'unnecessary' both look right; this one is pretty clean.

By thresholding at 0.5, we reduced the noise in the predictions: only tag
predictions greater than or equal to the threshold were considered.

Test comment :
"yeah that's not how the vaccine works there s no evidence on what it actually even does
though so who knows"

ineffective 0.8603545427322388

Our model appears to be doing something reasonable.

4.3.2. For Symptoms-reporting tweets


We tried (i) BERT-Base, which outputs a 768-length embedding (representation) of a text
(tweet); (ii) BERT-Large, an extension of BERT-Base, which outputs a 1024-length
embedding for an input text; and (iii) Covid-Twitter-Bert-v2 (CT-BERT), an extension of
the BERT-Large model pretrained on a large set of tweets related to COVID-19.
After getting the tweet embedding of length d from any of the BERT models, it was
passed through a 'classification layer', a fully-connected linear layer of dimensions
(d × 4), to perform the 4-class classification. The end result of each model was a
probability score of the input tweet belonging to each of the four classes, and we
assigned the tweet to the class with the highest likelihood. We did 5-fold cross-validation
to evaluate the classifier models' performance. For each of the 5 folds, we used 80% of
the data for training and validation and 20% as a test set. We also needed a validation set
for the deep learning classifiers, so the 80 percent training data was randomly split into
the final train and validation sets in a ratio of 75 percent - 25 percent (see the sketch
below). During the training phase, the models were trained for 10 epochs at a learning
rate of 10^-6, with the models being evaluated on the validation set after each epoch. For
generating the classification metrics on the test set, the best-performing model on the
validation set (in terms of validation loss) was chosen. We report the average test set
performance (F1 scores) across the 5 folds for the various classifiers. For every class,
the CT-BERT classifier performs the best, with an average macro-F1 score of 0.774.
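
A minimal sketch of this evaluation protocol follows, assuming scikit-learn for the splits and with dummy data standing in for the 4K annotated tweets.

```python
# Sketch of the evaluation protocol: 5-fold CV, with each fold's 80% portion
# split 75/25 into train and validation sets. Dummy data stands in for the
# ~4K annotated tweets; the actual training loop is summarized in comments.
from sklearn.model_selection import StratifiedKFold, train_test_split

tweets = [f"tweet {i}" for i in range(100)]   # placeholder tweets
labels = [i % 4 for i in range(100)]          # 4 classes: Primary, Secondary,
                                              # Third-Party, Non-reporting

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for fold, (trainval_idx, test_idx) in enumerate(skf.split(tweets, labels)):
    train_idx, val_idx = train_test_split(
        trainval_idx, test_size=0.25, random_state=42,
        stratify=[labels[i] for i in trainval_idx])
    # Train for 10 epochs at lr = 1e-6, evaluate on the validation set after
    # each epoch, then score the checkpoint with the lowest validation loss
    # on the held-out test fold and average macro-F1 across the 5 folds.
    print(f"fold {fold}: {len(train_idx)} train, {len(val_idx)} val, {len(test_idx)} test")
```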

To improve classification, add custom features: We noticed some trends while manually
evaluating several tweets: some keywords are more likely to be associated with one of
the classes. Words like 'I' and 'me', for example, are more likely to appear in Primary
reporting tweets, whereas 'my' and words for familial relations are more likely to appear
in Secondary reporting tweets. We use some handcrafted features to incorporate these
trends into the CT-BERT classifier and boost classification even more. To achieve this,
we used the ten handcrafted features listed in the table below: binary features that are set
to 1 if a tweet contains one of the keywords in the corresponding list, and 0 if none are
present. The CT-BERT encoder generates a 1024-dimensional embedding; appending
the 10 binary features yields a 1034-dimensional embedding. These updated embeddings
were passed through a fully-connected linear layer of dimensions (1034 × 4) to conduct
the classification. This model (CT-BERT + feats) was then cross-validated using the
same methodology as the baseline classifiers. The addition of the features resulted in
minor improvements in all of the classes' scores, with an average macro-F1 score of
0.793.

No.  Keywords                                             Category
1    i, am, me                                            Primary
2    my, we                                               Primary and Secondary
3    brother, family, mother, wife, friend, cousin, etc.  Secondary
4    he, she, his, him, her                               Secondary
5    man, men, women, woman, male, female                 Third-party
6    <URL>                                                Third-party
7    has, have, had, having                               Reporting
8    they, their                                          Reporting
9    test, tested, positive, admitted, symptoms           Reporting
10   you, people                                          Non-Reporting

(Binary features added to the classifier with CT-BERT embeddings)
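
The sketch below shows how these binary features can be appended to the CT-BERT embedding before the (1034 × 4) layer; the shortened keyword sets and whitespace tokenization are simplifications, and the CT-BERT encoder itself is assumed to be loaded elsewhere.

```python
# Sketch of appending the 10 binary keyword features to the 1024-d CT-BERT
# embedding before the (1034 x 4) classification layer. The keyword sets are
# shortened versions of the table above.
import torch
import torch.nn as nn

FEATURE_KEYWORDS = [
    {"i", "am", "me"},
    {"my", "we"},
    {"brother", "family", "mother", "wife", "friend", "cousin"},
    {"he", "she", "his", "him", "her"},
    {"man", "men", "women", "woman", "male", "female"},
    {"<url>"},
    {"has", "have", "had", "having"},
    {"they", "their"},
    {"test", "tested", "positive", "admitted", "symptoms"},
    {"you", "people"},
]

def binary_features(tweet: str) -> torch.Tensor:
    """1.0 for each feature whose keyword set intersects the tweet's tokens."""
    tokens = set(tweet.lower().split())
    return torch.tensor([float(bool(kw & tokens)) for kw in FEATURE_KEYWORDS])

classifier = nn.Linear(1024 + 10, 4)   # the (1034 x 4) classification layer

def classify(ct_bert_embedding: torch.Tensor, tweet: str) -> torch.Tensor:
    x = torch.cat([ct_bert_embedding, binary_features(tweet)])
    return classifier(x)               # logits over the 4 classes

print(classify(torch.randn(1024), "my brother tested positive today"))
```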

Using Part-of-Speech tagging to improve results: Similar to the handcrafted features,
we found POS tags for each tweet and integer-encoded them. We then appended this
encoding to the output of the encoder and the handcrafted features (the embedding size
increased to 1084 by adding 50 POS tags per tweet, padding with 0s or popping from
the end to achieve this length). The resulting input was sent to a linear layer of
dimensions (1084 × 4).
We observed an average macro F1 score of 0.797.

Part-of-speech (POS) tagging is a task in natural language processing that involves labelling
words in context with their grammatical category, such as noun, verb,
preposition, and so on. The universal dependency treebank, a corpus of texts in many
languages annotated with syntactic trees in the dependency frame, morphological features,
and word-level part of speech tags, is the usual benchmark for this task.
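
A minimal sketch of this POS-encoding step follows, assuming NLTK as the tagger (the report does not name the tagger used).

```python
# Sketch of the POS-tag feature (assuming NLTK as the tagger): tag each tweet,
# integer-encode the tags, and pad with 0s or truncate to a fixed length of 50,
# growing the embedding from 1034 to 1084 dimensions.
import nltk
nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)
# (newer NLTK releases may instead need the "punkt_tab" /
#  "averaged_perceptron_tagger_eng" resources)

tag_to_id = {}   # built on the fly; 0 is reserved for padding

def pos_encoding(tweet: str, length: int = 50) -> list:
    tags = [tag for _, tag in nltk.pos_tag(nltk.word_tokenize(tweet))]
    ids = [tag_to_id.setdefault(t, len(tag_to_id) + 1) for t in tags]
    return (ids + [0] * length)[:length]   # pad with 0 or pop from the end

print(pos_encoding("My brother has tested positive for covid"))
```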

Adding an extra Linear layer to improve results: Similar to the handcrafted features,
we added an extra linear layer that takes its input from the old layer. The previous
layer's dimensions were changed to (1034 × 128) and the new layer was set to
(128 × 4): we took the 128-length output from Layer 1 and provided it as input to
Layer 2, as sketched below.
We observed an average macro F1 score of 0.83, and we finally decided to use this
in our final model.
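
A sketch of this two-layer head in PyTorch; the ReLU between the layers is an assumption, as the report does not state the activation used.

```python
# Sketch of the final two-layer head: the single (1034 x 4) layer is split
# into (1034 x 128) and (128 x 4).
import torch
import torch.nn as nn

head = nn.Sequential(
    nn.Linear(1034, 128),  # Layer 1: CT-BERT embedding + 10 features -> 128
    nn.ReLU(),             # assumed nonlinearity between the two layers
    nn.Linear(128, 4),     # Layer 2: 128 -> logits for the 4 classes
)

print(head(torch.randn(1034)).shape)   # torch.Size([4])
```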
Over the Indian tweets that we collected, we applied the (CT-BERT + features +
additional linear layer) classifier, which performed the best. Table 6 shows the
distribution of predicted classes among tweets. The following are some intriguing
findings from this data, which were also observed while personally studying the tweets.
● In comparison to the worldwide sample, India has far fewer primary-reporting tweets.
Because of the social stigma that exists in Indian society, it is possible that the low
number of primary-reporting tweets is related to Indians' reluctance to share publicly
that they have been experiencing COVID-19 symptoms or that they have been
diagnosed (Bhanot et al. 2020).
● In comparison to the worldwide sample, the Indian data contains more third-party
reporting tweets, as individuals routinely share news about celebrities (such as actors
and government figures) being diagnosed with COVID-19.
● The percentage of secondary-reporting tweets posted from India grew dramatically in
2021 (during the second wave of COVID-19 cases) compared to 2020 (the first wave).
Many people were in desperate need of medical supplies during this time because
hospitals were overburdened. Naturally, they began tweeting about a close contact
becoming ill and requesting assistance with basic necessities (such as oxygen cylinders,
medicines, and vacant hospital beds).

We classified about 800K tweets with our final trained model, categorizing them into
the 4 mentioned categories.

Model Name                        Macro-F1 Score
BERT-Base                         0.7578
BERT-Large                        0.7613
CT-BERT                           0.7744
CT-BERT + Feats.                  0.7929
CT-BERT + Feats. + POS Tag        0.7632
CT-BERT + Feats. + extra layer    0.8325

5. Correlation of different classes


Preparing time-series data: To obtain the time-series data, we aggregate the actual
numbers using a 7-day rolling average to smooth out day-to-day volatility. We create a
time series T_c in which the value for a given day d is the average of the actual number
of cases/deaths over the seven days preceding d. We extracted the actual number of
cases/deaths from the COVID19 Bharat API. Thus, if the actual number of cases on day
d is denoted A_c[d], we compute the time series as

    T_c[d] = (1/7) × Σ_{x=d−7}^{d−1} A_c[x]

Next, we create similar time series, based on rolling 7-day averages, for the social media
signals. For example, considering all symptom-keyword tweets, the time series T_sk is
formed by averaging the actual number of symptom-keyword tweets A_sk posted over
the previous 7 days:

    T_sk[d] = (1/7) × Σ_{x=d−7}^{d−1} A_sk[x]

Similarly, we obtain the time series T_pri, T_sec and T_tp for the primary, secondary
and third-party reporting tweets respectively. Apart from the raw numbers of primary,
secondary and third-party reporting tweets, we also consider as signals the percentages
of each sub-class of tweets. After forming the T_sk, T_pri, T_sec and T_tp time series as
stated above, the percentage of tweets in each class on a day d is calculated with respect
to T_sk[d]. For example, the %Third-party time series T_%tp for a day d is calculated as

    T_%tp[d] = (T_tp[d] / T_sk[d]) × 100

The numbers and percentages of primary, secondary, and third-party reporting tweets, as
well as the total number and percentage of all symptom-reporting tweets (summing up
the three sub-classes), were chosen as the social-media signals for calculating the
correlations; the percentage of every class of tweets is computed with respect to T_sk[d]
as stated above. During two time periods, (i) February to November 2020 and
(ii) January to March 2021, we calculated the correlation (r) of each signal with cases
and deaths. The correlations are calculated by shifting the cases/deaths forward by 0 to
4 weeks; for instance, we calculate the association between the percentage of third-party
tweets and the COVID cases T_c shifted by lag weeks (0-4 weeks of lag), as sketched
below.
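
The sketch below illustrates this computation with pandas, using random placeholder series in place of the real tweet and case counts.

```python
# Sketch of the correlation computation with pandas: 7-day rolling means for a
# signal and for cases, then Pearson r with cases shifted forward by 0-4 weeks.
import numpy as np
import pandas as pd

dates = pd.date_range("2021-01-01", "2021-03-31", freq="D")
daily_tweets = pd.Series(np.random.poisson(50, len(dates)), index=dates)   # placeholder
daily_cases = pd.Series(np.random.poisson(500, len(dates)), index=dates)   # placeholder

T_signal = daily_tweets.rolling(7).mean()   # T_sk-style 7-day average
T_cases = daily_cases.rolling(7).mean()     # T_c-style 7-day average

for lag_weeks in range(5):
    # shift(-k) aligns today's signal with the cases k days later
    shifted = T_cases.shift(-7 * lag_weeks)
    print(lag_weeks, round(T_signal.corr(shifted), 3))   # Pearson r
```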
Correlation with cases, 2020-02 to 2020-11 (lag in weeks):

Signal                     0       1       2       3       4
Primary                    0.049   0.034   0.007   -0.026  -0.069
Secondary                  0.077   0.077   0.058   0.034   0.007
Third-Party                0.278   0.331   0.367   0.406   0.452
Non-Reporting              0.342   0.387   0.423   0.465   0.515
Total #symptom-keywords    0.332   0.375   0.410   0.453   0.502

Correlation with deaths, 2020-02 to 2020-11 (lag in weeks):

Signal                     0       1       2       3       4
Primary                    0.046   0.082   0.027   0.019   -0.025
Secondary                  0.024   0.077   0.109   0.139   0.114
Third-Party                0.279   0.324   0.304   0.287   0.337
Non-Reporting              0.438   0.459   0.429   0.398   0.451
Total #symptom-keywords    0.423   0.442   0.411   0.380   0.433

Correlation with cases, 2021-01 to 2021-03 (lag in weeks):

Signal                     0       1       2       3       4
Primary                    0.049   0.799   0.884   0.931   0.948
Secondary                  0.403   0.709   0.743   0.811   0.836
Third-Party                0.447   0.846   0.878   0.865   0.871
Non-Reporting              0.341   0.662   0.751   0.775   0.828
Total #symptom-keywords    0.471   0.790   0.845   0.858   0.898

Correlation with deaths, 2021-01 to 2021-03 (lag in weeks):

Signal                     0       1       2       3       4
Primary                    0.352   0.498   0.674   0.473   0.427
Secondary                  0.331   0.510   0.358   0.148   0.073
Third-Party                0.473   0.915   0.936   0.850   0.796
Non-Reporting              0.431   0.734   0.800   0.747   0.665
Total #symptom-keywords    0.539   0.833   0.865   0.778   0.698
6. Conclusion
6.1. Summary of Work
We extracted data from Gab for vaccine hesitancy and trained various classifiers on the
annotated Twitter data for vaccine hesitancy that we received, observing pretty good
results with the BERT model. Second, we developed various classifiers for future case
prediction based on Twitter data and improved the results by adding an extra linear layer
(the d × 4 layer was split into two linear layers, d × 128 and 128 × 4). We also checked
whether adding extra information to the input via Part-of-Speech tagging helped, but the
output did not vary much. Finally, we selected the CT-BERT model with handcrafted
features and the extra linear layer as our final model for classifying tweets into
4 categories: (i) Primary: if the author (user who posted the tweet) is himself/herself
experiencing symptoms; (ii) Secondary: if someone whom the author personally knows
(e.g., a friend, relative, or neighbour) is reported to experience symptoms; (iii) Third-
Party: if someone who is not an acquaintance of the author (e.g., a celebrity) is reported
to experience the symptoms; and (iv) Non-reporting.
About 800K tweets were classified, and that data was used to prepare time-series data
for finding correlations. We found correlations at 0-4 weeks of lag for data near the
lockdown period in India. We can further use this information to build better predictors
of future numbers of cases/deaths.

These symptom-reporting classes have been demonstrated to be good social media
signals for predicting potential cases in advance. Building such predictors can give
authorities a heads-up so they can take the required precautions, such as setting
restrictions in public places and preparing medical facilities to handle an influx of cases.

6.2. Future Scope


6.2.1. For Vaccine Hesitancy
As is apparent from Image 1, the dataset is imbalanced. The next step towards better
results would be to use the SMOTE algorithm to treat this and then train the models
again to improve accuracy. SMOTE solves the problem by oversampling the examples in
the minority class.
SMOTE first selects a minority-class instance a at random and finds its k nearest
minority-class neighbours. A synthetic instance is then created by choosing one of the
k nearest neighbours, designated b, at random and connecting a and b to form a line
segment in the feature space; the synthetic instances are generated as convex
combinations of the two chosen instances a and b, as sketched below.
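
A minimal sketch with the imbalanced-learn library follows, using a synthetic imbalanced dataset in place of the actual gab features.

```python
# Sketch with imbalanced-learn (pip install imbalanced-learn); a synthetic
# imbalanced dataset stands in for the feature matrix of the gabs.
from collections import Counter
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=500, n_classes=3, n_informative=5,
                           weights=[0.80, 0.15, 0.05], random_state=42)

# k_neighbors is SMOTE's k: synthetic points are convex combinations of a
# minority instance and one of its k nearest minority neighbours.
X_res, y_res = SMOTE(k_neighbors=5, random_state=42).fit_resample(X, y)
print("before:", Counter(y), "after:", Counter(y_res))
```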

6.2.2. For Symptoms-reporting tweets

Additionally, authorities can use regional data (at the city or state level) to better assess
people's needs and track any new issues that arise (for example, the rise of fungal
infection cases in COVID-19 survivors). We have only looked at simple regression
models so far, but more complicated models based on multiple signals can be developed.
By combining these social media signals with other real-world signals, better prediction
models can be created; models that combine these social media signals with real-world
signals like public transport utilisation trends, for example, can be utilised to provide
better regional-level predictions to authorities.

7. References

Ariana Remmel. 2021. Communicating COVID vaccine safety poses a unique challenge.
Available online at:
https://media.nature.com/original/magazine-assets/d41586-021-01257-8/d41586-021-01257-8.pdf
(accessed August 14, 2021).
Bhanot, D.; et al. 2020. Stigma and discrimination during COVID-19 pandemic.
Frontiers in Public Health 8: 829.
Burki, T. 2020. The online anti-vaccine movement in the age of COVID-19. Lancet
Digital Health 2: e504–e505. https://doi.org/10.1016/S2589-7500(20)30227-2.
Devlin, J.; et al. 2018. BERT: Pre-training of deep bidirectional transformers for
language understanding. arXiv preprint arXiv:1810.04805.
Dong, E.; Du, H.; and Gardner, L. 2020. An interactive web-based dashboard to track
COVID-19 in real time. The Lancet Infectious Diseases 20(5).
Dutta, U.; et al. 2021. Analyzing Twitter users' behavior before and after contact by
Russia's Internet Research Agency. Proc. CSCW.
Goran Muric et al. 2021. COVID-19 vaccine hesitancy on social media: Building a
public Twitter data set of antivaccine content, vaccine misinformation, and conspiracies.
Grover, S.; and Aujla, G. S. 2014. Prediction model for influenza epidemic based on
Twitter data. International Journal of Advanced Research in Computer and
Communication Engineering 3(7): 7541–7545.
Higgins, T. S.; et al. 2020. Correlations of online search engine trends with coronavirus
disease (COVID-19) incidence: Infodemiology study. JMIR Public Health and
Surveillance 6(2).
Hilary Piedrahita-Valdés et al. 2021. Vaccine hesitancy on social media: Sentiment
analysis from June 2011 to April 2019.
Jens Lemmens et al. 2022. CoNTACT: A Dutch COVID-19 adapted BERT for vaccine
hesitancy and argumentation detection.
Klein, A. Z.; et al. 2021. Toward using Twitter for tracking COVID-19: A natural
language processing pipeline and exploratory data set. JMIR 23(1): e25314.
Lazarus, J. V.; et al. 2020. A global survey of potential acceptance of a COVID-19
vaccine. Nature Medicine. https://doi.org/10.1038/s41591-020-1124-9.
Singh, L.; et al. 2020. A first look at COVID-19 information and misinformation sharing
on Twitter. arXiv preprint arXiv:2003.13907.
Steven Lloyd Wilson et al. 2020. Social media and vaccine hesitancy.
