
27/02/2024, 12:51 Text-based Causal Inference. Tutorial on analyzing voter fraud… | by Haaya Naushan | Towards Data Science



HANDS-ON TUTORIALS

Text-based Causal Inference


Tutorial on analyzing voter fraud disinformation by estimating causal effect with text
as treatment and confounder

Haaya Naushan


Published in Towards Data Science
23 min read · Nov 23, 2021


Causal diagram of text (W) as treatment (T) and confounder (Z), with outcome Y and covariates C, where T
and Z are correlated. Image by author.

https://towardsdatascience.com/text-based-causal-inference-86e640efb2af

Science fiction tells us that rampant disinformation is a harbinger of a society’s
descent into dystopia. It could be argued that disinformation destabilizes a
democracy (Morgan 2018, Farkas & Schou 2019). Tangibly, people’s disregard of
medical evidence has a negative impact on public health. For instance, people who
are willing to ignore evidence might choose to refuse vaccines and thereby
endanger others’ lives and their own. Caution is warranted because scientific
disinformation is pervasive, and it is hard to hold people accountable when access to
trustworthy news material is garbled by the influx of fake news. A more insidious
form of disinformation is the subversion of reality for a subset of the population; a
type of mass hysteria that traps the cognitively vulnerable in an alternative
reality. I am speaking of the surreal claims of voter fraud surrounding the last American
election, when Trump’s sycophants refused to accept that he lost. The
fake news around voter fraud had an undeniably incendiary impact on the
January 6th insurrection that followed, a tragic event that screamed dystopian society.

Hannah Arendt, the political philosopher, claimed it was necessary for people to
engage with politics as a part of living a good life. In “The Human Condition”,
Arendt says that it is not enough to work and spend time with those you love; you
must also engage in political life (Arendt, 1958). There are many Americans who
follow this ethos and engage with politics, viewing it as their right and responsibility
as citizens. Unfortunately, some of them are susceptible to fallacious thinking and
fall prey to bizarre conspiracies like QAnon. In “Networked Propaganda” by Benkler,
Faris and Roberts, the authors claim that the propaganda feedback loop is partly
fueled by people’s desire to avoid cognitive discomfort. That is to say, people will
seek out information that reinforces their worldview while ignoring or discounting
evidence to the contrary. In an epidemiological sense, fake news acts as a vector of
disease, spreading dangerous disinformation and saturating the public sphere with
conflicting accounts that make it nigh impossible to discern the truth.

But what does politics have to do with data science? As a researcher interested in
disinformation, I naturally seek to use data science tools to answer questions of
social and political interest. Of immediate interest is understanding the
relationship between social media and fake news. There are claims that the toxic
nature of social media, which is driven by shock-value and upvoting, has an impact
on the dissemination of fake news. More specifically, looking at Twitter, I question
whether fake news has a causal impact on the outcome of retweet count. Does
sharing fake news lead to a higher retweet count? This tutorial is a result of my


attempt to answer that question, and is a follow-up to a previous article about causal
inference using NLP.

In this article, to estimate causal effects using text, I make use of the Twitter
VoterFraud2020 dataset curated by the Jacobs Technion-Cornell Institute. This
dataset was made publicly available by the researchers and shared on this
dashboard; the original paper is attributed to Abilov et al. (2021). I start by
discussing the data and describing the preliminary analysis. Next, I introduce the
causal-text algorithm and walk through the corresponding study on causal effects of
linguistic properties (Pryzant, 2021). Additionally, I cover guidelines for using
observational data for causal inference, and detail the estimation procedure of the
causal-text algorithm. Following that, I lay out the framework for the causal
experiment, which leads directly into a tutorial of how to set up and use the causal-
text tool (which I fork and adapt from the original repo). I also cover the steps
required to address the proposed causal question with the causal-text algorithm.
Lastly, I briefly discuss the results and consider possible extensions.
Data Description
The open source Cornell VoterFraud2020 Twitter dataset contains 7.6M tweets and
25.6M retweets from 2.6M users, all related to the claims of voter fraud between
October 23 and December 16, 2020. Due to Twitter’s privacy policies, only the
tweet ids and user ids are shared; however, the GitHub repository for the dataset
includes scripts to hydrate the data. For this experiment, I focused only on the 7.6M
original tweets. Once the tweets were collected, it was necessary to do some
preprocessing to clean the tweet text and extract urls. The urls were all in Twitter’s
shortened format of “t.co”, and thus had to be resolved. To better understand the
popularity of the resolved urls, each url was given an Alexa rank from Amazon’s
analysis of web traffic statistics.
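The preprocessing step above can be sketched as follows. This is a minimal illustration of pulling the shortened “t.co” links out of raw tweet text with a regular expression; the helper names and the sample tweet are hypothetical, and actually resolving a short link to its destination would require following HTTP redirects over the network, which is omitted here.

```python
import re

# Matches Twitter's "t.co" shortened links inside tweet text.
TCO_PATTERN = re.compile(r"https?://t\.co/\w+")

def extract_tco_urls(tweet_text: str) -> list:
    """Return all t.co short links found in a tweet."""
    return TCO_PATTERN.findall(tweet_text)

def strip_urls(tweet_text: str) -> str:
    """Remove the short links, leaving the cleaned tweet text."""
    return TCO_PATTERN.sub("", tweet_text).strip()

tweet = "Huge voter fraud evidence! https://t.co/abc123"
print(extract_tco_urls(tweet))  # ['https://t.co/abc123']
print(strip_urls(tweet))        # Huge voter fraud evidence!
```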

Media Cloud is an open-source media content analysis tool developed at the
Berkman Klein Center for Internet & Society at Harvard University. This platform
has curated source lists for American media sources, divided by political affiliation,
covering the left, center-left, center, center-right and right. Using these US news
source lists, I cross-referenced the urls resolved from the voter fraud tweets
dataset. This selected only the urls that linked specifically to news articles. Media Cloud
has great investigative capabilities for news media, so I was able to use Media Cloud
queries to determine the media in-link share count and Facebook share count of the


isolated articles. In addition to this article metadata, I scraped the full text of all the
news articles that were still online at the time.

These steps gave me a combined dataset of original tweets that shared news articles,
with the full text of the news articles, article metadata, tweet metadata and Alexa
rankings for the urls. To be clear, the dataset of 7.6M tweets was pared down such
that every tweet had a corresponding news article. The purpose of collecting the full
text of the articles was to perform topic modeling with latent Dirichlet allocation
(LDA) to see if it was possible to isolate the fake news articles. Additionally, the
VoterFraud2020 dataset of 2.6M users also contained each user’s community or
cluster as determined by a community detection algorithm (e.g. Louvain method).
Given the multiple data streams and richness of the resulting tweet-article dataset,
it was necessary to run some preliminary analysis, which is covered next.

Preliminary Analysis
First, considering the tweets themselves: the text of a tweet may have value in
identifying fake news. Hence, I started by topic modeling the tweet text with LDA to
gain a general sense of the content of the Twitter discourse.

Intertopic distance map for select VoterFraud2020 tweets between October 23 and December 16, 2020.
Image by author.

The results of the LDA model highlighted several topics that were expressly about
aspects of the voter fraud disinformation conversation. For instance, one notable


topic that was isolated from the other topics was about the alt-right hashtag
“#stopthesteal”. Also notable was the similarly isolated Wall Street Journal and New
York Times fact-checking topic. Interestingly, tweets that claimed to have evidence
of fraud greatly overlapped with tweets from right-wing news media sources, like
Fox News. Overall, there are several threads of disinformation, ranging from ballot
harvesting and conspiracies about the voting software to affidavits asserting fraud
and rumours of military involvement.

Analysis of the user communities with a community detection algorithm identified
five distinct communities that varied in user count, as shown below.

The amplifier groups were those that pushed the voter fraud agenda, while the
foreign groups represent potential foreign influence and are tiny in comparison.

Focusing on the spread of fake news, these five communities behaved differently
when it came to sharing urls of news articles. The Media Cloud metadata contained
the media in-link count, which represents the number of times an article was linked
to by other media sources. A highly in-linked article could be considered more
mainstream. The graph below shows the trend of media in-link count over time for
the three largest communities.


Media in-link count for news articles shared about voter fraud from October 23 to December 16, 2020. Image
by author.

The above time series suggests that the center-left community tended to share
articles that were more “mainstream” when compared to the two amplifier
communities. Although the center-left community was more likely to share
mainstream articles, its tweets did not garner high retweet counts. This is shown below
in a time series of mean retweet counts by community.

The mean retweet count of urls shared by communities over time. Image by author.
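A per-community time series like the one above can be built with pandas; the dataframe below is a made-up stand-in for the tweet-level data, with one row per tweet and illustrative column names.

```python
# Sketch: mean retweet count per community per day.
import pandas as pd

df = pd.DataFrame({
    "date": pd.to_datetime(["2020-10-23", "2020-10-23", "2020-10-24", "2020-10-24"]),
    "community": ["center-left", "Amplifiers_1", "center-left", "Amplifiers_1"],
    "retweet_count": [3, 120, 5, 250],
})

daily_means = (
    df.groupby(["community", pd.Grouper(key="date", freq="D")])["retweet_count"]
      .mean()
      .unstack(level="community")   # one column per community
)
print(daily_means)
```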


In fact, it is the third largest community “Amplifiers_1” that had the highest retweet
counts for shared urls, despite representing only 11.5% of users. The concern here
is that even if the center-left were trying to fact check those spreading the fake news
of voter fraud, they were not getting much visibility on Twitter. It is also startling to
realize that the relatively small group of “Amplifiers_1” was highly influential in
spreading information, despite not sharing mainstream media.

It is commonly assumed that fake news lives on fringe websites, far outside the
mainstream. Having calculated the Alexa rank, or popularity, of each news article
url, it was possible to look at the relationship between a website’s “fringiness” and
which communities were retweeting these fringe sites. In the heatmap below, the
Alexa rank or “fringe-score” is weighted by retweet count and the community
distribution by topic is mapped.

Heatmap of retweet-weighted Alexa rankings (fringe-scores), showing the community distribution by topic
of the fringe scores. Image by author.


Here we can see that the “Amplifiers_1” group not only garners the largest share of
retweets, but also shares the most fringe websites. Since we are interested
in the causal question of whether the treatment of fake news has a causal effect on
the outcome of retweet count, the relationship between fringe-scores and fake news
articles is also of interest.

At this point, it became necessary to look at the actual text of the news articles to
better classify fake news. Topic modeling the news articles with
LDA resulted in five of the seven topics corresponding explicitly to fake news. This
allowed for labeling each url, and thereby each tweet, as either fake news or not.
This NLP-derived label is used as a proxy label in the setup of the causal experiment
described later. Furthermore, in order to test the usefulness of the causal-text
algorithm, I also manually labeled the 100 most popular urls for fake news. This labeling
covered 18% of the tweet-article dataset, and gave me roughly 28K tweet-article
pairs that had both proxy labels via topic modeling and true labels via manual
annotation. Having both proxy treatment labels and true treatment labels allows for
benchmarking the causal-text algorithm for this task. In the next three sections, I
will discuss the details of the causal-text algorithm and introduce some of the causal
concepts needed to understand the tool.
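As a preview of that benchmarking, agreement between proxy and gold labels can be measured with standard scikit-learn metrics; the two label arrays below are made up for illustration.

```python
# Hypothetical check of how well topic-model proxy labels agree with
# manually annotated gold labels.
from sklearn.metrics import accuracy_score, precision_score, recall_score

t_true = [1, 0, 1, 1, 0, 0, 1, 0]   # manual fake-news annotations
t_proxy = [1, 0, 0, 1, 0, 1, 1, 0]  # topic-model proxy labels

print("accuracy: ", accuracy_score(t_true, t_proxy))
print("precision:", precision_score(t_true, t_proxy))
print("recall:   ", recall_score(t_true, t_proxy))
```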
Causal-text Algorithm
The causal-text algorithm used in this tutorial was created by Pryzant et al. (2021); it
was introduced as “TEXTCAUSE” in a paper titled “Causal Effects of Linguistic
Properties”. This causal algorithm makes use of another tool, CausalBERT, which
was originally designed by Veitch et al. (2020). CausalBERT was developed to
produce text embeddings for causal inference; in essence, the authors designed a
way to use AI language models to adjust for text when testing for causality.

The causal-text algorithm has two components: first, it makes use of distant
supervision to improve the quality of proxy labels; second, CausalBERT is used
to adjust for the text. Pryzant et al. formalize the causal effect of a writer’s
intent, along with establishing the assumptions necessary to identify the causal
effect from observational data. Another contribution of this work is that they
proposed an estimator for this setting where the bias is bounded when adjusting for
text.

The VoterFraud2020 dataset represents observational data, where the tweets were
obtained without an intervention. Since the measurement of causal effect requires


the fulfillment of the ceteris paribus assumption, where all covariates are kept
fixed, we must reason about interventions. Pryzant et al. describe two challenges to
estimating causal effects from observational data. Firstly, there is a need to
“formalize the causal effect of interest by specifying the hypothetical intervention to
which it corresponds.” (Pryzant et al., 2021). This challenge is overcome by
imagining an intervention on the writer of a text, where they are told to use a
different linguistic property.

The second challenge to causal inference is identification, where the actual
linguistic property we are interested in can only be measured by a noisy proxy (e.g.
topic labels). Therefore, the study also established the assumptions needed to
recover the true causal effects of linguistic properties from the noisy proxy labels.
The creators of the causal-text algorithm adjust for confounding in text with
CausalBERT and prove that this process bounds the bias of the causal estimates. In
my previous article about causality and NLP, I discussed in detail the issue of
confounding due to text.

Causal Inference with Observational Data


When discussing causal inference with observational data, it is necessary to talk
about the average treatment effect (ATE). As seen in the image below, the ATE is the
difference in potential outcomes between the real world (T=1) and the
counterfactual world (T=0). I have previously described the potential outcomes
framework in intuitive ways in two articles: Causal inference using NLP, and
CausalML for Econometrics: Causal Forests.

The average treatment effect (ATE) is the difference in potential outcomes between the real world and the
counterfactual world. Image by author.
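In standard potential-outcomes notation (a textbook formulation, not reproduced from the article's image), this reads:

```latex
\mathrm{ATE} \;=\; \mathbb{E}\left[\, Y_i(T=1) \;-\; Y_i(T=0) \,\right]
```

where Y_i(T=1) and Y_i(T=0) are the potential outcomes for unit i under treatment and no treatment, only one of which is ever observed.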


However, as mentioned, we are also concerned with confounding. To deal with
confounders (Wᵢ), the backdoor adjustment formula (Pearl, 2009) can be used to
rewrite the ATE in terms of all the observed variables: Tᵢ for treatment and Yᵢ for
outcome. This confounding relationship is seen in the image below, where the
confounder Wᵢ has an effect on both the treatment and outcome.

The confounder Wi has an effect on both the treatment Ti and the outcome Yi. Source: Veitch et al. (2020).

The confounding effect of Wᵢ results in a spurious correlation, which can also be
referred to as “open backdoor paths that induce non-causal associations” (Pryzant et
al., 2021). I have previously discussed spurious correlations and backdoor paths in
an article about improving NLP models with causality. The backdoor adjustment
formula for the ATE is shown in the image below.

How to calculate the average treatment effect (ATE) with the backdoor adjustment formula. Adapted from
Pryzant et al. (2021).

If we assume the confounder Wᵢ is discrete, then the data can be grouped by the values
of W, the average difference in outcomes between treated and untreated can be
calculated within each group, and lastly, we take the probability-weighted average
over the groups of W.
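This grouped computation can be illustrated on a toy dataset; the numbers below are fabricated so the arithmetic is easy to follow.

```python
# Toy backdoor adjustment with a discrete confounder W: within each value of W,
# take the difference in mean outcomes between T=1 and T=0, then average the
# group-level differences weighted by P(W).
import pandas as pd

df = pd.DataFrame({
    "W": [0, 0, 0, 0, 1, 1, 1, 1],
    "T": [0, 1, 0, 1, 0, 1, 0, 1],
    "Y": [1.0, 2.0, 1.0, 2.0, 3.0, 6.0, 3.0, 6.0],
})

group_effects = df.groupby("W").apply(
    lambda g: g.loc[g["T"] == 1, "Y"].mean() - g.loc[g["T"] == 0, "Y"].mean()
)
weights = df["W"].value_counts(normalize=True)  # P(W)
ate = (group_effects * weights).sum()
print(ate)  # 0.5 * (2 - 1) + 0.5 * (6 - 3) = 2.0
```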

Pryzant et al. (2021) propose the following causal model of text and outcomes:


The ATE of the writer’s intent may be different from the ATE of the reader’s perspective, since a reader does
not know the writer’s intent. Source: Pryzant et al. (2021).

The text is represented by W, which has the linguistic property T (as treatment), and
other qualities Z (as covariates). Here, Z could be the topic, sentiment, length or
other qualities of a text. This causal model is built with the literary argument that
language is subject to two perspectives: the text as intended by the author and the
text as interpreted by the reader. The second perspective of the reader is shown by
T_tilde and Z_tilde — where T_tilde represents the treatment as received by the
reader, and Z_tilde represents the other qualities of the text W as perceived by the
reader. The outcome Y is affected by the tilde variables instead of Z and T directly.
The T_hat variable represents the proxy label as derived from the text W, which
could be topic labels.

The hypothetical intervention on the treatment is to ask the writer to use (or not
use) a linguistic property T, where T is a binary choice. It is not possible to use
observational data to capture the unobserved linguistic characteristics of Z, because
it is correlated with T. It is, however, possible to estimate the linguistic properties as
perceived by a reader, which is represented by the tilde variables. The ATE of the
reader’s perspective is defined as:


The average treatment effect as perceived by the reader, T_tilde is substituted for T. Adapted from Pryzant
et al. (2021).

In order to calculate the causal effect of interest, the ATE of the writer’s perspective,
Pryzant et al. (2021) developed a theorem (Theorem 1) which exploited the ATE of
the reader’s perspective as calculated from T_tilde. They define Z_tilde as a function
of the text W, as seen in the image below, where the potential outcomes of Y are
equivalent given either W or both T_tilde and Z_tilde.

Theorem 1: the confounding information Z_tilde is a function of the text W. Adapted from Pryzant et al.
(2021).

Having defined Z_tilde as such, it is then possible to define the ATE_reader as the
following equation:

The ATE of the reader’s perspective can be written with T_tilde and Z_tilde. Adapted from Pryzant et al.
(2021).

The ATE_reader is said to be equal to the ATE_writer, and the text W splits into the
information that the reader uses to perceive the tilde variables. Z_tilde represents
confounding properties, since it affects the outcome and is correlated with T_tilde.
To be clear, this theorem only holds under three assumptions. Firstly, the observed
text W blocks the backdoor paths between T_tilde and the outcome Y. Secondly, we
need to assume that T = T_tilde, that is to say, there is agreement of intent
(ATE_writer) and perception (ATE_reader). The last assumption is the positivity (or
overlap) assumption, which is that the probability of treatment is strictly between 0
and 1. I provided an intuitive explanation of the positivity assumption in another
article about causality.

A further complication is that we cannot observe the perception of the reader, in
addition to not being able to directly observe the intent of the writer; hence the
need for proxies. For T_tilde it is possible to use a proxy T_hat to calculate the causal

effect of interest, where T_tilde is substituted out by T_hat in the previous
equation to calculate an estimand (ATE_proxy).

Substituting T_tilde with T_hat allows for calculating the proxy ATE. Adapted from Pryzant et al. (2021).

At this point it is necessary to adjust the estimand for confounding, in other words,
adjust the ATE_proxy for Z_tilde. This is made possible using CausalBERT, a pre-trained
language model, to measure T_hat. The other advantage of this approach is that the
bias due to the proxy label is bounded, such that it is benign: “it can only decrease
the magnitude of the effect but it will not change the sign.” Pryzant et al. (2021)
refer to this as Theorem 2, and state that a “more accurate proxy will yield a lower
estimation bias.”
Causal Estimation
Now that we have discussed how to use observational data for causal inference with
text, the practical part is the estimation procedure. The causal-text algorithm has
two important features: improving proxy labels and adjusting for text. The approach
to improving the accuracy of the proxy labels is based on the fact that the bias is
bounded. The proxy labels are improved using distant supervision, which was
inspired by work on lexicon induction and label propagation. The goal is to improve
the recall of proxy labels by training a classifier to predict the proxy label, then
using that classifier to relabel examples that were labeled T=0 but look like T=1.
Essentially, the proxy labels are relabeled if necessary.

The second feature of the causal-text algorithm is that it adjusts for the text using a
pre-trained language model. The ATE_proxy is measured by using the text (W), the
improved proxy labels (T_hat*) and the outcomes (Y). This relies on Theorem 1,
which, as described earlier, shows how to adjust for the confounding parts of text.
Pryzant et al. (2021) use a DistilBERT model to produce a representation of the text
with embeddings and then select the vector corresponding to a prepended
classification token, [CLS]. Pryzant et al. use the Hugging Face transformers
implementation of DistilBERT, which has 66M parameters, and the vectors for text
adjustment, Mₜ, add 3,080 parameters. This model is then optimized such that the
representation b(W) directly approximates the confounding information, Z_tilde.


An estimator, Q, is trained for the expected conditional outcome, as seen in the
image below.

The estimator Q_hat relies on the treatment, t, the model representation b(W), and the covariates, C.
Adapted from Pryzant et al. (2021).

In this equation, the estimator Q is shown to be equivalent to the expected
conditional outcome for Y when given the proxy T_hat, which itself is based on not
only the treatment, t, but also the model representation of Z_tilde (b(W)), and the
covariates C. The proxy estimator, Q_hat, is equivalent to the parameterized sum of
a bias term (b) and two vectors (Mᵇₜ, Mᶜₜ) that rely on the representation b(W) and a
vector c. The vector c is a one-hot encoding of the covariates C, and the two Mₜ
vectors are learned for each value t of the treatment. The training objective of this
model is to optimize:

Training objective used to optimize the causal-text model. Source: Pryzant et al. (2021)

In this equation, 𝛩 is all the parameters of the model, and L(.) is the cross-entropy
loss, which is used with the estimator Q_hat, itself based on the Mₜ vectors. The
original BERT masked language modeling objective (MLM) is represented as R(.),
and the 𝛼 hyperparameter is a penalty weight for the MLM objective. With the Q_hat
estimator, the parameters Mᵇₜ and Mᶜₜ are updated on examples where the improved
proxy label is equivalent to t.
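The shape of the Q_hat head can be sketched in plain numpy. The dimensions and random parameters below are toy stand-ins: in the paper, b(W) is the 768-dimensional DistilBERT [CLS] embedding and the Mₜ vectors are learned by gradient descent, not sampled.

```python
# Numpy sketch of the Q_hat head: for each treatment value t, a pair of
# vectors (M_b, M_c) scores the text representation b(W) and the one-hot
# covariate vector c, plus a bias, passed through a sigmoid.
import numpy as np

rng = np.random.default_rng(0)
dim_b, dim_c = 8, 3  # toy sizes for b(W) and the one-hot covariates

params = {
    t: {
        "bias": 0.0,
        "M_b": rng.normal(size=dim_b),  # weights on the text representation
        "M_c": rng.normal(size=dim_c),  # weights on the covariates
    }
    for t in (0, 1)
}

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def q_hat(t, b_w, c):
    """Estimated outcome probability given treatment t, text rep b(W), covariates c."""
    p = params[t]
    return sigmoid(p["bias"] + p["M_b"] @ b_w + p["M_c"] @ c)

b_w = rng.normal(size=dim_b)   # stand-in for the DistilBERT representation
c = np.array([1.0, 0.0, 0.0])  # one-hot covariate vector
print(q_hat(1, b_w, c), q_hat(0, b_w, c))
```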


This setup is shown in the diagram below, where W represents the text, C represents
the covariates and the CausalBERT model is a representation of the text such that it
is possible to predict the potential outcomes of Y.

Text adjustment with CausalBERT, which predicts Y’s potential outcomes using BERT word embeddings.
Adapted from Pryzant et al. (2021).

In summary, estimation with the full causal-text algorithm requires the improved
proxy labels, and a causal model of the text and outcomes which extracts and
adjusts for the confounding of Z_tilde. The algorithm also allows for the inclusion of
covariates C when estimating the causal effect. As seen in the image above, the
vector c and the model representation b(W) are used to predict the potential
outcomes of Y while using information from T_hat*, the proxy label. The
representation b(W) directly approximates the confounding information (Z_tilde),
which allows it to adjust for the text.

Once the estimator Q_hat is fitted, it is possible to calculate the estimated ATE_proxy, as
seen in the equation below:



The hatted ATE_proxy is estimated with the proxy estimator, Q_hat. Adapted from Pryzant et al. (2021).
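In code, this final step reduces to averaging the difference between the fitted estimator's predictions under T=1 and T=0 over the same texts and covariates; the per-example predictions below are made up.

```python
# Plug-in estimate of ATE_proxy from a fitted Q_hat.
import numpy as np

# Hypothetical per-example predictions from a fitted Q_hat
q1 = np.array([0.80, 0.65, 0.90, 0.55])  # Q_hat(1, b(W_i), c_i)
q0 = np.array([0.40, 0.50, 0.60, 0.45])  # Q_hat(0, b(W_i), c_i)

ate_proxy = (q1 - q0).mean()
print(ate_proxy)  # (0.40 + 0.15 + 0.30 + 0.10) / 4 = 0.2375
```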

The ATE that is derived with this method can be used to determine the causal effect
of the reader’s perspective, which itself is assumed to be equivalent to the causal
effect of the writer’s perspective. The accuracy of this ATE is dependent on how
accurate the proxies are and how well CausalBERT adjusts for the text. The next
section describes the experimental framework used to test for the causal effect of
fake news on retweet count.

Experimental Framework
The causal question is whether fake news has a causal effect on retweet count. A few
years ago, a very popular study in Science claimed that on social media, fake news
spreads more rapidly than real news. This study, however, does not rely on causal
analysis. There is a possibility that the results were based on spurious correlations
between confounders and the retweet count. For instance, the community a user
belonged to was not investigated, nor was the popularity of the news site. Some
communities might be more vulnerable to spreading fake news, and people might
be more likely to share popular news sites. Moreover, language is complex and the
text of tweets can act as a confounder, so we need to control for topic, writing style,
tone and length of the tweet. Therefore, there is value in designing a causal study
where possible confounders are controlled for, and the tweet text itself is adjusted
for confounding qualities.

As previously mentioned, there are two challenges to estimating causal effects from
observational data: interventions and identification. Firstly, we need to reason about
the hypothetical intervention that we would make on the writer’s intent, such that
they would use (or not use) a particular linguistic property. It is necessary to think of
the sharing of fake news as a linguistic property that represents the writer’s intent;
then it can be a treatment, T, that can be intervened upon. More simply, we view the
sharing of a url that links to fake news as a linguistic property, where the
intervention would be to tell the user to share a real news article (T=0) instead of a
fake news article (T=1). While intervening on this treatment, the remaining qualities
of the tweet must be kept constant. We will refer to these other text qualities as Z,
such that Z represents potential confounding factors like topic, style, tone or length.
The tweet text will be referred to as W (or simply “text”), and additional covariates
such as user community or Alexa rank will be denoted as C. This setup is shown in
the image below.


Causal diagram of experimental setup. W is the tweet text, T is the proxy topic label, Z is other confounding
qualities, Y is the retweet count, and C represents either community or Alexa rank. Image by author.

The proxy treatment is a fake news label as determined by topic modeling of the
articles with LDA. Since the dataset has gold labels for the fake news articles, there
are two treatment variables (T_true and T_proxy) so that it is possible to benchmark
the T_proxy against T_true. Lastly, the outcome Y is the retweet count. For a first test,
the C variable is categorical, where a number is used to represent the user
community. All other variables, except the text, are binary numerical indicators (0
or 1). For a second test, the Alexa rank is used as the covariate C, and we look
specifically at a single community: “Amplifiers_1”. In this test, the Alexa rank for
each url is turned into a categorical variable by binning the values by quantile. The
next section details how I adapted the causal-text algorithm for this tutorial, and
explains how to interpret the results.
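The quantile binning for the second test can be done with `pandas.qcut`; the rank values and column names below are illustrative.

```python
# Sketch: turn a continuous Alexa rank into a categorical covariate by
# binning the values into quantiles.
import pandas as pd

df = pd.DataFrame({
    "alexa_rank": [120, 4500, 98000, 250, 1500000, 33000, 780, 60000]
})

# Four quantile bins, labeled 0-3, usable directly as the categorical C
df["C"] = pd.qcut(df["alexa_rank"], q=4, labels=False)
print(df)
```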

Estimating Causal Effect


Pryzant et al. (2021) shared the causal-text algorithm on GitHub, which utilizes a
PyTorch implementation of CausalBERT. For this tutorial, it was necessary to adapt
the original causal-text package since it was specifically tailored to the causal
experiments described in the introductory paper. Furthermore, it does not appear to
be maintained by the authors, so I had to update the requirements. I also


simplified the output and removed the extraneous simulation parts that were not
needed for this tutorial. The rest of the changes were minor and were made in the
process of debugging. Overall, I made very few changes to the original algorithm;
my adaptation can be accessed on GitHub. If you find this algorithm useful, please
star Pryzant et al.’s original repository for the causal-text algorithm.

The tool runs from the command line, and I suggest running it with a GPU to make
use of the speed for deep learning. Here, I explain how to set up Colab (to make use
of the free GPU instance) and run the causal-text algorithm. First though, the data
needs to be in the correct format. The tool accepts a “.tsv” file with five columns, for
five variables: T_proxy, T_true, C, Y, text. The covariates C have to be categorical, and
represented by simple integers. The T_proxy, T_true and outcome, Y, variables must
be binary numerical indicators (0 or 1). The “text” is simply the tweet text. The
adapted causal-text algorithm produces seven different ATE result values.
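As a minimal sketch, the input file can be assembled with pandas; the rows below are invented placeholders, and in practice each row would come from the labeled tweet dataset.

```python
import pandas as pd

# Invented example rows: (T_proxy, T_true, C, Y, text).
rows = [
    (1, 1, 0, 1, "BREAKING: thousands of fake ballots found!"),
    (0, 0, 2, 0, "Election officials certify the final count."),
    (1, 0, 1, 1, "You won't believe what poll workers did..."),
]
df = pd.DataFrame(rows, columns=["T_proxy", "T_true", "C", "Y", "text"])

# causal-text expects a tab-separated file with exactly these five columns.
df.to_csv("my_data.tsv", sep="\t", index=False)
```

Note that T_proxy, T_true and Y are all coded 0/1, while C is an integer category, matching the format described above.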

Using the T_true label, the causal-text algorithm calculates an “oracle” ATE value;
this can be thought of as the true ATE which will act as a baseline. Next, an
“unadjusted” ATE value is calculated as an additional baseline, where the ATE is the
expected difference in outcomes conditioned on T_hat without accounting for
covariates. The next two values are “T-boost” ATE values, where T-boost refers to
treatment boosting by improving proxy labels. The proxy labels are improved in two
ways by two different classifiers. One classifier works only with positive and
unlabeled data, while the other is a straightforward regression, specifically
scikit-learn's stochastic gradient descent classifier. The next ATE value is one where the
text has been adjusted for; this is the "W adjust" value. The last two ATE values
combine the T-boosting with text adjustment, for one ATE value per classifier type.
These last two values represent the full “TEXTCAUSE” algorithm as designed by
Pryzant et al. (2021).
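For intuition, the "unadjusted" estimate is just a difference in mean outcomes between proxy-treated and untreated tweets. A minimal sketch, using toy data invented for illustration:

```python
import numpy as np

def unadjusted_ate(t_proxy, y):
    """Difference in mean outcomes between proxy-treated and untreated units."""
    t_proxy, y = np.asarray(t_proxy), np.asarray(y)
    return y[t_proxy == 1].mean() - y[t_proxy == 0].mean()

# Toy binary outcomes: 2 of 3 treated tweets vs 1 of 3 untreated have Y = 1.
t = [1, 1, 1, 0, 0, 0]
y = [1, 1, 0, 1, 0, 0]
print(unadjusted_ate(t, y))  # 2/3 - 1/3
```

The other estimators refine this baseline by improving the proxy labels, adjusting for the text, or both.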

The first step is to install the needed packages in Colab. This is done with the
following single line of code:

!pip install scikit-learn transformers tensorflow

Next, we check to see if the GPU is available.

import torch

if torch.cuda.is_available():
    device = torch.device("cuda")
    print('There are %d GPU(s) available.' % torch.cuda.device_count())
    print('We will use the GPU:', torch.cuda.get_device_name(0))
    !nvidia-smi
else:
    print('No GPU available, using the CPU instead.')
    device = torch.device("cpu")

The “.tsv” file with the data should be saved in Google Drive for easy access. We
simply mount the drive to get access to the files.

from google.colab import drive

drive.mount('/content/gdrive')

Then we navigate to the folder where the data is saved.

%cd gdrive/My Drive/my_folder

Next we clone the adapted repo for the causal-text algorithm from GitHub.

!git clone https://github.com/haayanau/causal-text.git

After the causal-text package has been cloned, it is necessary to navigate to the
directory where the main script is located.

%cd causal-text/src

Running the algorithm is very simple: run the following command with the path
leading to the ".tsv" file. The "--run_cb" flag means that CausalBERT will be
used to adjust for the text. The models are trained for 3 epochs each.

!python main.py --run_cb --data /content/gdrive/MyDrive/my_folder/my_data.tsv

This command results in the seven types of ATE values described earlier. Pryzant et
al. (2021) do caution that "ATE estimates lose fidelity when the proxy is less than 80%
accurate". They also claim that it is crucial to adjust for the confounding parts of the
text, and that estimates that account for C without adjusting for the text might be
worse than the unadjusted estimate. The next section briefly discusses the results of
the two experiments, and suggests some extensions.
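Given that caution, and since this dataset has gold labels, it is worth checking the proxy's accuracy before trusting the estimates. A quick sketch with invented labels (in practice these would be read from the prepared ".tsv" file):

```python
import pandas as pd

# Invented proxy and gold labels for ten tweets, for illustration only.
df = pd.DataFrame({
    "T_proxy": [1, 0, 1, 1, 0, 1, 0, 0, 1, 0],
    "T_true":  [1, 0, 1, 0, 0, 1, 0, 1, 1, 0],
})

# Fraction of tweets where the topic-model proxy agrees with the gold label.
accuracy = (df["T_proxy"] == df["T_true"]).mean()
print(f"Proxy accuracy: {accuracy:.0%}")
if accuracy < 0.80:
    print("Proxy is below the 80% threshold; ATE estimates may lose fidelity.")
```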

Results and Extensions

For the first test, we are looking at the ATE of fake news (T) on retweet count (Y),
with the user community (C) and tweet text as confounders. There are 15,468
observations and the results are shown below.

The true (oracle) ATE value suggests that there is practically no causal effect, which
is contrary to the popular expectation that fake news spreads quicker than real
news, and would garner higher retweet counts. The unadjusted ATE value also does
not show causal effect, despite not accounting for covariate C. In terms of matching
the true ATE value, the (W adjust) ATE that adjusts for text with CausalBERT comes
the closest. Neither of the values from the full "TEXTCAUSE" algorithm (which
adjusts for the text and improves the labels) is as close to the true ATE as the
"W adjust" value, which does not use the improved labels.

The second test looked only at the “Amplifiers_1” community and accounts for
Alexa rank as a potentially confounding covariate. There are 1,485 observations and
the results are shown below.

Here again, there does not appear to be a causal effect of fake news on retweet
count, once we control for the Alexa rank. The true (oracle) ATE value is mildly
negative, and the full version of the “TEXTCAUSE” algorithm that uses the “pu”
classifier for T-boosting, produces the ATE that most closely matches the true value.
This treatment-boosted value (TextCause pu) not only includes improved proxy
labels but also adjusts for the text with CausalBERT. The unadjusted ATE had the
worst performance; however, all of the other ATE values performed similarly poorly.

There is a strong possibility that there are unobserved confounders that have not
been controlled for in this experiment. This might explain the lack of causal effect
being detected, or conversely, there simply is no causal effect. At this point, we have
not proven that there is a causal effect of fake news on retweet count, nor have we
definitively proven that there is no causal effect. All we have done is bring into
question the assumption popularly claimed by researchers, that fake news spreads
faster than real news on social media. It might be possible that including further
covariates would improve the experiment, however, it will be tricky to determine

https://towardsdatascience.com/text-based-causal-inference-86e640efb2af 21/30
27/02/2024, 12:51 Text-based Causal Inference. Tutorial on analyzing voter fraud… | by Haaya Naushan | Towards Data Science

which covariates to include. There is also a possibility that the sample sizes were not
big enough, especially for the second test where there were only 1,485 observations.

There are several extensions that we could implement. Starting with the first test,
we could substitute the user community with Alexa rank for the covariate, C. For the
second test, we could increase the sample size or even compare across
communities. It would be helpful if the causal-text algorithm could accommodate
more than one covariate (higher dimensionality). Even more useful would be if the
causal-text algorithm could deal with heterogeneous treatment effects and calculate
the conditional average treatment effect (CATE). For instance, we could condition on
user community, to see if there is a difference in CATEs across the groups.
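To sketch the idea, a naive per-community effect can be computed by grouping on the covariate. This toy example ignores text confounding entirely, so it is only a starting point, not a substitute for the full algorithm:

```python
import pandas as pd

# Invented toy data; "C" stands in for the user-community covariate.
df = pd.DataFrame({
    "C":       [0, 0, 0, 0, 1, 1, 1, 1],
    "T_proxy": [1, 1, 0, 0, 1, 1, 0, 0],
    "Y":       [1, 1, 0, 0, 1, 0, 1, 1],
})

# Mean outcome per (community, treatment) cell, then the within-group difference.
means = df.groupby(["C", "T_proxy"])["Y"].mean().unstack("T_proxy")
naive_cate = means[1] - means[0]
print(naive_cate)
```

In this toy data the effect flips sign across the two communities, which is exactly the kind of heterogeneity a CATE-capable version of the algorithm could surface.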

Final thoughts

The intersection of causal inference and NLP is fascinating, and the causal-text
algorithm is a great example of creativity and initiative. My hope is that this research
will continue to push the boundaries of what is possible with regards to methods for
estimating a causal effect with text. In terms of application, the causal-text
algorithm can be applied to various fields such as economics, healthcare,
marketing, public policy and even epidemiology. There is a shift in thinking around
the phenomenon of fake news, for instance, there are calls to treat the issue as a
public health issue (Donovan, 2020). The WHO has taken an epidemiological
approach, referring to incidents of fake news as "infodemics". All of these
changes suggest that it might be time to take a causal approach to disinformation.
Exploring causality could be a way of developing an economics-inspired framework
for looking at the causal effect of disinformation on society (True Costs of
Misinformation Workshop, TaSC, Shorenstein Center). Personally, I am interested in
applying this method to economic research that utilizes open source social media
data.

I welcome questions and feedback, please feel free to connect with me on Linkedin.
