
Exploring Digital Narratives of COVID-19 through Twitter1

Panel: Research in the Midst of a Pandemic

Susanna Allés Torrent, Modern Languages and Literatures, University of Miami, susanna_alles@miami.edu
Jerry Bonnell, Computer Science, University of Miami, j.bonnell@miami.edu
Dieyun Song, History, University of Miami, dxs1138@miami.edu

Prepared for delivery at the 2021 Virtual Congress of the Latin American Studies Association,
May 26–29, 2021

Abstract

Digital Narratives of Covid-19 is a cross-disciplinary, multi-institutional, and bilingual digital
humanities initiative conceived for both academic and public audiences. One of our main goals
consists of analyzing public conversations happening on social media, particularly on Twitter. The
project established a digital workflow that has been gathering tweets since late April 2020,
following three main criteria: first, gathering all tweets in Spanish worldwide to analyze public
discourse; second, targeting a bilingual perspective for the area of South Florida, for which we
collect geolocated tweets in English and Spanish to study this particular linguistic ecosystem; third,
tracing narratives and topics in specific Spanish-speaking areas (Argentina, Mexico, Peru,
Colombia, Ecuador, Spain). Our paper focuses on how the team collected, processed, and analyzed
our Twitter corpus through research methods that combine computational techniques, such as
natural language processing, with humanistic interpretation. We will highlight our work with
Twitter’s API, our GitHub dataset repository, and the online corpus (searchable by day, area, or
language). Moreover, we will present our “data interpretation lab,” which uses Jupyter notebooks
on the cloud via Binder, to offer a public space for performing text analysis (e.g., topic modeling
and n-gram frequency) and data visualization. Our ultimate goal is to offer the community a digital
tool that not only makes available a linguistic corpus of tweets related to the Covid-19 pandemic,
but also provides an exploratory space for tracing the different digital narratives happening on the
social web.

Keywords: Twitter; Covid-19; data mining; digital humanities

1 This paper reflects the labor done by all team members of the project Digital Narratives of Covid-19.

1. Digital Narratives of Covid-19: a project to explore public conversations

Digital Narratives of Covid-19 (DHCovid)2 is a digital humanities project that investigates
online conversations about the current pandemic through the mining and examination of Twitter
discourses in English and Spanish worldwide. The project is funded by the University of Miami
(FL) and is developed in collaboration with the National Scientific and Technical Research
Council (CONICET, Argentina). The work was initiated in April 2020, when we started collecting
Covid-19-related tweets.
One of the main goals of our project consists of delving into a corpus of tweets written in
Spanish and English that are related to the coronavirus pandemic. This goal is achievable through
the Twitter API, which offers the possibility of mining the Twitter feed, with some restrictions in
its free version regarding time frame and quantity.3 However, this approach needs some
clarification. In the case of Spanish, we took two factors into consideration: first, there was no
topic-specific Twitter corpus; and second, Spanish is a diverse language spoken in Latin America,
the US, and Spain. Consequently, we decided to create, on the one hand, a collection recovering
all Spanish tweets related to the Covid-19 outbreak and, on the other hand, individual collections
of tweets from specific areas (Argentina, Colombia, Ecuador, Mexico, Peru, Spain). In the case of
English, we ruled out mining all English tweets because several initiatives have been devoted to
this purpose since the beginning of the pandemic.4
We have been producing Twitter datasets structured in three main collections: 1. tweets in
Spanish worldwide (a total of 20,121,933 tweets as of April 2021); 2. geolocated tweets in six
selected Spanish-speaking areas spanning North America (Mexico), South America (Argentina,
Colombia, Ecuador, Peru), and Europe (Spain); and 3. geolocated tweets in English and Spanish
from the greater Miami area in South Florida.5 The rationale behind the mining strategy for the
overall and geo-tagged Spanish corpora is to bring the global and regional views into dialogue
and to shed light on the linguistic nuances of various Spanish-speaking communities across the
globe. These three sections form an interconnected analytical network that contours the
transnational and cross-Atlantic narratives about the Covid-19 pandemic.

2 Our website is available at: https://covid.dh.miami.edu/
3 Twitter’s policy on free tweets: https://developer.twitter.com/en/docs/twitter-api/v1/tweets/search/overview
4 Such is the case of the Panacea Lab, which has been mining the Twitter pandemic conversations since January 2020: http://www.panacealab.org/covid19/


5 The total number of tweets we have gathered is as follows: 2,972,107 for Mexico; 883,620 for Argentina;
1,321,851 for Colombia; 312,758 for Ecuador; 406,140 for Peru; and 1,538,294 for Spain; for the Miami
area, 59,859 in Spanish and 91,062 in English. The strikingly high number of tweets from Mexico, which
far surpasses the other areas, probably comes from bots. Frequency graphs can be found at:
https://covid.dh.miami.edu/charts/

2. Infrastructure

Our digital infrastructure and workflow are structured in different phases. First, to assemble
the Twitter corpus (data collection), a PHP script mines the Twitter data stream through
Twitter’s Application Programming Interface (API) and recovers a series of specific tweet IDs.
Our data mining sampling strategy relies on four main variables: language, keywords, region,
and date.6 Second, the tweet IDs are stored in a MySQL relational database where they are “hydrated,”
that is, all metadata associated with each tweet is recovered, including its body text. Third, an
additional script organizes the tweet IDs in the database by day, language, and region, and creates
a plaintext file for each combination with a list of the corresponding tweet IDs. The script generates
these files daily and organizes them into folders, where each directory represents one day. These
are uploaded directly to our public GitHub repository, where we provide free access to these
datasets.7
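
To make the third phase concrete, here is a minimal Python sketch of the daily export step. It is an illustration only: the project’s actual scripts are written in PHP, and the file naming used here is hypothetical. It groups tweet IDs by day, language, and region and writes one plaintext ID file per combination into a per-day folder:

```python
import os
from collections import defaultdict

# Hypothetical sample rows standing in for records fetched from the
# MySQL database: (tweet_id, date, language, region).
records = [
    ("1385600000000000001", "2021-04-23", "es", "mx"),
    ("1385600000000000002", "2021-04-23", "en", "fl"),
    ("1385600000000000003", "2021-04-23", "es", "all"),
]

# Group tweet IDs by each (date, language, region) combination.
groups = defaultdict(list)
for tweet_id, date, lang, region in records:
    groups[(date, lang, region)].append(tweet_id)

# Write one plaintext ID file per combination, inside a folder per day.
for (date, lang, region), ids in groups.items():
    os.makedirs(date, exist_ok=True)
    with open(os.path.join(date, f"{lang}_{region}.txt"), "w") as f:
        f.write("\n".join(ids))
```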
A first stable version of the dataset, published on May 13, 2020, was released through
Zenodo as a ZIP file containing folders of daily tweets posted between April 24, 2020 and May 12,
2020. A second and final version, with the complete collection of tweet IDs, will be uploaded at
the end of the project in May 2021.8
Once the tweet IDs are recovered and introduced into the database, we proceed to the data
preprocessing phase. In short, we standardize the presentation and format of the data: rendering
everything in lowercase, removing accents, punctuation, and mentions of users (@users) to
protect privacy, and replacing all links with “URL.” This step is especially challenging, yet
crucial, considering the frequent use of accents and special graphemes in Spanish (like the ñ).
Emojis are a tricky challenge: some of them can be transliterated into a UTF-8 charset and
transformed into emoji labels, while others are not recognized and remain in the text.9
Additionally, we decided to unify all the different spellings of Covid-19 under a unique form,
while all other characteristics, including hashtags, are always preserved.10 This step, in short,
allows us to obtain a clean and tidy collection of tweets organized by language, day, and area,
suited to our research purposes.
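
The following minimal sketch illustrates this normalization in Python, assuming the steps listed above; the project’s actual preprocessing script may differ in details such as the exact unified form chosen for Covid-19:

```python
import re
import unicodedata

def preprocess(text: str) -> str:
    """Normalize a tweet: a sketch of the steps described above."""
    text = text.lower()
    # Replace links with the placeholder "URL" and drop @mentions.
    text = re.sub(r"https?://\S+", "URL", text)
    text = re.sub(r"@\w+", "", text)
    # Unify spellings such as covid-19 and covid 19 under one form.
    text = re.sub(r"covid[\s-]?19", "covid19", text)
    # Strip accents; note this also maps the Spanish ñ to n, which is
    # why accent handling needs special care.
    text = "".join(c for c in unicodedata.normalize("NFKD", text)
                   if not unicodedata.combining(c))
    # Remove punctuation, keeping the # that marks hashtags.
    text = re.sub(r"[^\w\s#]", "", text)
    return re.sub(r"\s+", " ", text).strip()

print(preprocess("¡Cuarentena en Miami! COVID-19 info: https://t.co/x @user"))
# -> cuarentena en miami covid19 info URL
```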

6 We only query tweets written in Spanish and English that contain at least one of the words from our two
lists of Covid-19-related keywords in English and Spanish; consequently, only tweets with one of these
words and/or hashtags are selected. The English keywords apply only to the Miami area and include terms
such as “covid,” “coronavirus,” “pandemic,” “quarantine,” “#stayathome,” “outbreak,” “lockdown,” and
“#socialdistancing.” The Spanish keywords include “covid,” “coronavirus,” “pandemia,” “cuarentena,”
“confinamiento,” “#quedateencasa,” “desescalada,” and “#distanciamientosocial” (Allés Torrent, 2020).
7 The GitHub repository can be found at: https://github.com/dh-miami/narratives_covid19/tree/master/twitter-corpus
8 The Zenodo repository is available at: https://zenodo.org/record/3824950
9 It is worth noting that many new emojis have appeared during the pandemic.
10 For example, we find COVID-19, COVID19, covid19, and Covid19, with or without the hyphen, etc.

In addition, we built a WordPress site from which we offer access to our different
repositories, analytical scripts (see section 4), and a series of blog posts where we present our
work in progress, findings, and tutorials on how to use our scripts.11

3. A Corpus in Open Access

The tweet texts are available in open access through our website12 so that any user can
assemble a personalized collection of tweets; for example, users can download tweets from a
specific area on a particular date.

Offering preprocessed texts facilitates the application of Natural Language Processing (NLP)
techniques and the exploration of textual data in a straightforward and consistent way. For example,
preprocessed tweets produced in Mexico from January 1 to April 30, 2021 can be retrieved in
just a few seconds.
To summarize, users can interact with our corpus via three channels. First, researchers can
go to our GitHub repository, where the tweet ID dataset is organized by day and area. Each daily
dataset comprises nine files: tweets for the six Spanish-speaking areas (Mexico, Argentina,
Colombia, Ecuador, Peru, Spain); English- and Spanish-language tweets in the Greater Miami area;
and Spanish-language tweets worldwide. The dataset is updated daily. Second, a copy of this
dataset is published under a DOI, for citation purposes, in the open-access Zenodo repository. Third,
we also offer a beta public interface that allows simpler, more accessible data customization and
retrieval without the need for a query language: users can download tweets by date and by the
location-language pairs listed above that best suit their interests.
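
For instance, a researcher could combine the GitHub channel with the twarc library to rehydrate a day’s tweets. The sketch below assumes a hypothetical file path (check the twitter-corpus folder for the actual day and file naming) and requires the researcher’s own Twitter API credentials:

```python
import requests
from twarc import Twarc  # pip install twarc

# Hypothetical raw-file path; the actual folder/file naming may differ.
BASE = ("https://raw.githubusercontent.com/dh-miami/narratives_covid19/"
        "master/twitter-corpus/")
ids = requests.get(BASE + "2021-03-01/es_fl.txt").text.split()

# Hydration recovers the full tweet objects from their IDs and
# requires valid Twitter API credentials.
t = Twarc("CONSUMER_KEY", "CONSUMER_SECRET", "ACCESS_TOKEN", "ACCESS_SECRET")
for tweet in t.hydrate(ids):
    print(tweet["full_text"])
```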

11 https://covid.dh.miami.edu/blog/
12 https://covid.dh.miami.edu/get/

4. coveet.py and frequency analysis

As a first step to understanding the digital narratives surrounding Covid-19 in Spanish-speaking
areas, this research applies frequency analysis with the aim of examining which words
are used, how often certain phrases appear, and how discourses vary from country to country
(Gelfgren 2016). A Python tool, coveet.py (or, simply, coveet), was built to address these questions
and fulfills three principal tasks: (1) querying Twitter data from the project API based on the
criteria of country, language, and date; (2) tidying the queried data, e.g. by lemmatization and
elimination of stopwords, to render it suitable for downstream textual analysis tasks; and (3)
analyzing the tidied data by applying NLP techniques, i.e., computing frequent words and hashtags
with respect to date. An intentional design choice of coveet is its modularity: each component is
built independently of the others so that new NLP techniques can be incorporated with minimal
effort. This is key if new interpretative demands are to be met, and if the tool is to be used in future
studies. Moreover, these steps form a collective workflow that transforms raw tweets into rich
quantitative data that act as a springboard for interpretation.

In achieving these tasks, coveet follows a three-step pipeline:

(1) Query. A data query to the project API is initiated using three inputs from the user:
location(s), language(s), and the time period desired.13 Using the pandas package in Python,
coveet organizes the results into a data frame where each row contains a tweet and each
column its corresponding metadata: date, language, country, body text, and hashtags.14 The
output format is a CSV file, which can be inspected using spreadsheet software or used as
input to another script.
(2) Tidy. In its second phase, coveet “tidies” the tweet body text so that the text is rendered
suitable for analysis by later NLP techniques, e.g., frequency analysis and topic modeling.
This transformation involves eliminating user-defined common words (or “stopwords”),
lemmatizing inflected forms of a word, and filtering words according to part of speech.
Each tidying step can be toggled depending on the needs of the analysis. Upon completion,
the tidied body text and hashtags are written out to a new CSV file.
(3) NLP Analysis. At the time of this writing, coveet supports two paths for analysis:
(1) frequent n-grams, and (2) unique word retrieval by location-language pair and dates.
Both modes are illustrated in the sketch that follows this discussion.

To conduct n-gram analysis, a dictionary of location-language pairs is generated where
each key is a location-language pair and the value a list of top n-grams under that setting;
e.g., counts[(‘fl’, ‘es’)] would return the results for Spanish tweets written in the Miami
area.15

13 Coveet uses this information to consult the project API for relevant raw tweets, where body text and
hashtags are intertwined and must subsequently be separated by list-processing techniques.
14 Information about the pandas package is available at https://pandas.pydata.org/

In n-gram analysis, we reinterpret the full tweet as context,
not only adjacent neighbors, to allow for the possibility of different and more interesting
results. In the case of bigrams (n = 2), this means considering the occurrence of two words
in a tweet, regardless of whether they are adjacent, as a bigram candidate. Once gathered,
the most frequent n-grams can be computed, with an adjustable number of top results to
return. From frequent words (n = 1) to phrases (n > 1), frequency analysis can lead to
new findings.

The “unique words” mode assists researchers in quickly identifying distinct characteristics,
whether by location, time, or language. The basis of comparison, then, is essential in building
this function. coveet constructs a unique “vocabulary” dictionary for each location-language
pair containing the words that appear in that group and in no other pair available in the
data frame. In a manner similar to stopword elimination, this construct is then used to
filter out any words from tweets that are not present in a location-language pair’s
vocabulary dictionary.
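
The following toy sketch illustrates both modes. It is a simplified stand-in for coveet’s implementation (which operates on data frames queried from the project API), showing the full-tweet-context bigram counting and the unique-vocabulary construction:

```python
from collections import Counter
from itertools import combinations

# Toy tidied tweets, keyed by location-language pairs.
tweets = {
    ("fl", "es"): [["vacuna", "miami", "covid19"], ["vacuna", "cita"]],
    ("fl", "en"): [["vaccine", "miami", "covid19"], ["vaccine", "dose"]],
}

# Bigram counting with the full tweet as context: any two words that
# co-occur anywhere in a tweet count as a bigram candidate.
counts = {}
for pair, docs in tweets.items():
    c = Counter()
    for words in docs:
        c.update(frozenset(b) for b in combinations(set(words), 2))
    counts[pair] = c.most_common(5)
print(counts[("fl", "es")])  # top bigrams for Spanish tweets in Miami

# Unique vocabulary per pair: words appearing in no other pair.
vocab = {pair: {w for doc in docs for w in doc}
         for pair, docs in tweets.items()}
unique = {pair: words - set().union(*(v for p, v in vocab.items() if p != pair))
          for pair, words in vocab.items()}
print(unique[("fl", "es")])  # {'vacuna', 'cita'}
```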

While the quantitative data produced by coveet are important for exploring the Twitter
corpus, they offer limited value to the interpretative demands of this project without proper
visualization (Sinclair and Rockwell 2016). To this end, we apply the following visualization tools,
made possible using matplotlib:16 (1) a matrix of bar charts to visualize top n-grams and
unique words, and (2) concordance views for studying every occurrence of a given word together
with its context. These are encapsulated in a series of Jupyter notebooks and made both
accessible online and interactive through Binder.17 This provides a live, on-demand “data
interpretation lab” for accessing and interacting with coveet and the visualizations it produces.
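
As a small illustration of the first tool, a matrix of bar charts can be produced with a few lines of matplotlib; the counts below are toy stand-ins for the frequencies coveet actually computes:

```python
import matplotlib.pyplot as plt

# Toy top-word counts per location-language pair.
top = {
    ("fl", "en"): [("vaccine", 120), ("covid19", 95), ("miami", 60)],
    ("fl", "es"): [("vacuna", 110), ("covid19", 90), ("cuarentena", 55)],
}

# One bar chart per pair, arranged side by side with a shared y-axis.
fig, axes = plt.subplots(1, len(top), figsize=(10, 4), sharey=True)
for ax, (pair, freqs) in zip(axes, top.items()):
    words, values = zip(*freqs)
    ax.bar(words, values)
    ax.set_title("-".join(pair))
plt.tight_layout()
plt.show()
```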

5. General results and visualizations

We ran some preliminary analyses to demonstrate the coveet scripts using data from the first
quarter of 2021 in the Greater Miami area. We retrieved the top 50 words and top 30 hashtags by
month, in English and Spanish, in the region. In addition to the raw numbers of occurrences, the
visualizations produced by coveet’s frequency analysis function also make the results easier to
showcase.
15 While it would be easier to keep track of only locations, the location-language pair is an important unit
for this study as tweets are collected from South Florida in both Spanish and English. Therefore, the
location-language pairing uniquely identifies a region of interest with a target language.
16 This library for visualization with Python, matplotlib, is available at: https://matplotlib.org/
17 Binder, https://mybinder.org/, transforms any GitHub repository containing a Jupyter notebook into an
executable environment that can be shared live with others.

We believe that the Miami area, characterized by its bi- and multilingualism and its large
Latinx population, can offer a gateway that bridges the geographical and cultural borders between
the northern and southern hemispheres of the Americas. Thus, the South Florida corpus can
illuminate pivotal dimensions of shared and distinctive sociolinguistic patterns, both in
comparison to other geo-tagged areas in the Americas and to the global Spanish corpus. Additionally,
comparative analysis of the English and Spanish corpora within South Florida contributes
valuable insights into how languages and online discourses influence each other.
In prior research, covering tweets from April to September 2020, we confirmed that
English and Spanish discourses in the Miami area both discussed daily new cases, infected patients,
deaths, and testing during this global crisis. However, two major differences arose that could lead
to further investigation.

1. Mentions of foreign countries: the South Florida English corpus is more concerned with
national impact (besides Miami and South Florida, we see Texas, California, New York).
The Spanish corpus, namely in its hashtags, reflects the particular concern of South Florida’s
large population of residents from Latin America for foreign countries: Cuba, Brazil, Mexico;
Venezuela, along with President Maduro, has an especially strong presence.
2. There is a higher presence of news media in the Spanish corpus. Most of these accounts
appear to be bots, and they occupy the top Spanish hashtags. Spanish tweets therefore seem
“to inform” (statistics on deaths and approved measures to control the disease) and lack
social engagement. In English, we find #wearamask, #staysafe, #flattenthecurve, #mask,
#reopening, #wedemandbetter, #workersfirst, #trumpvirus, #mentalhealth, #stopthespread,
#makingwaves, #truthbetold; we also find new terms that are by-products of the pandemic,
such as #covidiots, a sensibility for the Black Lives Matter movement (#blm), and new ways
of social gathering, such as “Zoom” or “Webinar.” Is there more civic activism and social
discussion in the English corpus? Is the Spanish corpus missing a more “personal
engagement,” or is it simply unable to capture those elements due to the large swath of news
outlets? Our results on word frequency might imply that tweets in English have a stronger
intention to interact with readers and influence others’ behavior.

In contrast, the last few months have produced a similar panorama but with some new nuances, as
we can see in the following figures.18

18 For our presentation we prepared a Jupyter notebook, named LASA_2021.ipynb, which is hosted in our
GitHub repository, https://github.com/dh-miami/narratives_covid19/tree/master/scripts/freq_analysis, and can be
executed through Binder (see the “Launch Binder” icon).

Figure 1: Top 50 words in Florida’s English and Spanish corpora, January 1 to March 31, 2021

Figure 2: Top 30 hashtags in Florida’s English and Spanish corpora, January 1 to March 31, 2021

Both the English and Spanish corpora of the first quarter of 2021 show a much more
coherent discussion centered on vaccination, in comparison to the deaths, testing, education, and
issues related to the virus itself that made up the corpus of Summer and Fall 2020. With the first
round of vaccines rolling out at the end of 2020, the 2021 data suggest a more focused and less
anxious, uncertain outlook on the pandemic. Unlike the dissent, anger, and confusion over school
reopenings, the public health crisis, testing distribution, and other topics surrounding the social
and daily disruptions caused by the pandemic in 2020, netizens’ Twitter conversations in 2021,
with treatments improving and a cure on the horizon, express shared hopes and priorities.
Mentions of foreign countries in the Spanish corpus are not as prominent in 2021, for which we
have a few speculations: the increase in vaccine availability shifted the focus from blaming others
to prioritizing domestic

matters; foreign country names might come mainly from news outlets’ tweets; and the 2021 corpus
might comprise a larger set of Twitter users than the 2020 corpus.
We have found hashtags trickier to analyze, primarily because of the automatically
generated tags pushed by news outlets and advertisers. For instance, “#cvspharmtech,”
“tvvnoticias,” and “#evnews” do not represent any substantive trending topics. Bot detection and
a more refined filtering system are currently works in progress.
Frequency analysis by itself does not provide the most meaningful results, but it gives us a
preliminary overview of the thematic contours of the corpus, which would be illegible through
human reading alone. Comparing results among the Spanish-speaking areas, or between the
English and Spanish corpora in Miami, can point researchers to more nuanced sociolinguistic and
geographical trends.

6. Future research

The project is expected to continue through May 2021, though we hope to take it further,
until everyone is vaccinated and Covid-19 is no longer an unfortunate trending topic.
Our hope is to have provided the community with a meaningful corpus in which to dig for
social trends. So far, the Covid-19 pandemic has motivated many appealing research directions
focusing on Twitter, including the role and importance of automated Twitter accounts (or “bots”)
and conspiracy theories (Ferrara, 2020), the increase of politically radical discourse in US social
media (Jiang et al., 2020), and general public perception (Abdo, Alghonaim, & Essam, 2020).
Most of these efforts, however, mine and analyze tweets in English (Banda et al., 2021;
Kerchner & Wrubel, 2020; Lamsal, 2020). The DHCovid initiative brings Spanish Twitter
narratives into the conversation.
In the coming months, we intend to perform two main explorations. On the one hand,
we will provide a concordance interface to recover and visualize words in context. The
possibility of recovering tweets from specific countries within concrete date ranges opens the
way not only to more in-depth analysis but also to other scenarios, including the
classroom. On the other hand, we will apply techniques of sentiment analysis, a
fundamental NLP task that aims to understand the emotional intent of words. Tools that conduct
sentiment analysis have had consequential implications for research processes within the digital
humanities. A future direction of this work will be to make such techniques available so as to
allow richer analyses of the Twitter corpus that go beyond the basic word count. This incorporation
is a feasible next step for the coveet pipeline thanks to its modular design: the frequency analysis
phase is simply swapped out for the sentiment analyzer. To provide this sentiment analysis, we
will look at two dictionary-based methods: VADER and the NRC lexicon. An immediate
challenge to their application, however, is that the lexicons used by both tools are in English and,
therefore, cannot be applied to a corpus of Spanish (Pérez-Rosas, Banea, & Mihalcea, 2012).
Ongoing research is exploring the use of a Spanish lexicon instead, as well as machine translation
as an intermediary step.
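
As a preview of that step, the following minimal sketch uses the VADER implementation from the vaderSentiment package, applied to an English tweet given the lexicon limitation just discussed:

```python
# pip install vaderSentiment
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()

# VADER's lexicon is English-only, hence the Spanish-lexicon or
# machine-translation step contemplated above for the Spanish corpus.
tweet = "Finally got my vaccine appointment, so relieved! #staysafe"
print(analyzer.polarity_scores(tweet))
# -> a dict with 'neg', 'neu', 'pos', and an overall 'compound' score
```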

References

Abdo, M. S., Alghonaim, A. S., & Essam, B. A. (2020). Public perception of COVID-19’s global
health crisis on Twitter until 14 weeks after the outbreak. Digital Scholarship in the
Humanities, fqaa037. DOI: https://doi.org/10.1093/llc/fqaa037
Allés Torrent, S. (2020, May 23). A Twitter Dataset for Digital Narratives. Digital Narratives of
Covid-19. Retrieved April 5, 2021, from https://covid.dh.miami.edu/2020/05/23/twitter-dataset-for-digital-narratives/
Banda, J. M., Tekumalla, R., Wang, G., Yu, J., Liu, T., Ding, Y., Artemova, K., Tutubalina, E.,
& Chowell, G. (2021). A large-scale COVID-19 Twitter chatter dataset for open
scientific research—An international collaboration [Data set]. Zenodo. DOI:
https://doi.org/10.5281/zenodo.4460047
Ferrara, E. (2020). What types of COVID-19 conspiracies are populated by Twitter bots? First
Monday, 25(6). DOI: http://dx.doi.org/10.5210/fm.v25i6.10633
Jiang, J., Chen, E., Yan, S., Lerman, K., & Ferrara, E. (2020). Political Polarization Drives
Online Conversations about COVID-19 in the United States. Human Behavior and
Emerging Technologies, 2(3), 200–211. DOI: https://doi.org/10.1002/hbe2.202
Kerchner, D., & Wrubel, L. (2020). Coronavirus Tweet Ids [Data set]. Harvard Dataverse. DOI:
https://doi.org/10.7910/DVN/LW0BTB
Lamsal, R. (2020). Coronavirus (COVID-19) Tweets Dataset [Data set]. IEEE. https://ieee-dataport.org/open-access/coronavirus-covid-19-tweets-dataset
Pérez-Rosas, V., Banea, C., & Mihalcea, R. (2012). Learning Sentiment Lexicons in Spanish.
Proceedings of the Eighth International Conference on Language Resources and
Evaluation (LREC’12). Istanbul, Turkey: ELRA. http://www.lrec-conf.org/proceedings/lrec2012/pdf/1081_Paper.pdf

