
The Economics of FinTech

Steven Kou
Department of Finance
Questrom School of Business
Boston University

This version Spring 2021

This study source was downloaded by 100000880261197 from CourseHero.com on 02-06-2024 06:17:54 GMT -06:00

https://www.coursehero.com/file/115499588/lec11Causal1pdf/
Chapter 11

Causal Inference for Machine Learning (I)

11.1 A Motivating Example: Boston Housing Data

The Boston housing data consist of the median value of owner-occupied homes
(in $1000s) and 13 covariates. We are interested in two covariates: ‘dis’, the
weighted mean of distances to five Boston employment centres, and ‘nox’, the
nitrogen oxides concentration (parts per 10 million).
1. We shall use a random forest to detect nonlinear effects, in addition to the
linear effects revealed by the standard linear regression model.
2. Note that the two variables, ‘dis’ and ‘nox’, have different causal relations
to home values. More precisely:
(a) Confounding effects may appear if we study the causal effect of ‘nox’ on
home values: there may be other factors that affect both nitrogen oxides
pollution and home values, while nitrogen oxides pollution may have no reverse
impact on these factors.
(b) Mediation effects may emerge if we investigate the causal effect of ‘dis’
on home values: in addition to affecting home values directly, the distance to
city centers may affect other variables, such as crime rates and the proportion
of non-retail business acres per town, which in turn also affect home values.
To avoid bias, causal inference tells us that the two cases above require
different statistical analyses, e.g. when a regression is conducted. Things get
more complicated when machine learning algorithms are applied instead of
regression, as machine learning algorithms are often criticized as black boxes.
This prompts the need to study causal inference for machine learning
algorithms.
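To see why the two cases call for different analyses, the following is a minimal simulation sketch. It does not use the Boston housing data; all variables, coefficients, and helper names (`cov`, `slope`, `slope_controlling`) are hypothetical and chosen only for illustration. In the confounder case, adding the third variable to the regression removes the bias; in the mediator case, adding it removes the indirect part of the effect, so the coefficient no longer measures the total effect.

```python
import random

def cov(a, b):
    """Sample covariance of two equally long lists."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    return sum((u - ma) * (v - mb) for u, v in zip(a, b)) / len(a)

def slope(y, x):
    """Coefficient on x in a least-squares regression of y on x alone."""
    return cov(x, y) / cov(x, x)

def slope_controlling(y, x, z):
    """Coefficient on x in a least-squares regression of y on x and z."""
    cxz = cov(x, z)
    den = cov(x, x) * cov(z, z) - cxz ** 2
    return (cov(x, y) * cov(z, z) - cov(z, y) * cxz) / den

rng = random.Random(0)
n = 20_000

# Case 1 (assumed): Z is a confounder, Z -> X and Z -> Y.
# The true causal effect of X on Y is 1.0.
z = [rng.gauss(0, 1) for _ in range(n)]
x = [zi + rng.gauss(0, 1) for zi in z]
y = [1.0 * xi + 2.0 * zi + rng.gauss(0, 1) for xi, zi in zip(x, z)]
confounded_naive = slope(y, x)                    # biased, close to 2.0
confounded_adjusted = slope_controlling(y, x, z)  # close to the true 1.0

# Case 2 (assumed): M is a mediator, X -> M -> Y plus a direct X -> Y.
# Direct effect 1.0, indirect effect 0.5 * 2.0 = 1.0, total effect 2.0.
x2 = [rng.gauss(0, 1) for _ in range(n)]
m = [0.5 * xi + rng.gauss(0, 1) for xi in x2]
y2 = [1.0 * xi + 2.0 * mi + rng.gauss(0, 1) for xi, mi in zip(x2, m)]
mediated_total = slope(y2, x2)                    # close to 2.0
mediated_direct = slope_controlling(y2, x2, m)    # close to 1.0

print(confounded_naive, confounded_adjusted, mediated_total, mediated_direct)
```

The same regression with a control variable is therefore the right tool in the first case and answers a different question (the direct effect only) in the second.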

11.2 Bias and Causal Inference
Three major problems in statistics are: (1) How to design experiments and
collect data that better and more efficiently address the questions of interest
(Experimental Design and Sampling Design). (2) How to describe the major
features and detect patterns in the data (Exploratory Data Analysis and Data
Mining). (3) How to account for sampling variability and bias and draw reliable
conclusions from the data (Statistical Inference).
Much of statistics is about investigating and controlling the variability in
the third problem. To study variability, one can construct 95% confidence
intervals and conduct statistical hypothesis tests.
The key goal of causal inference for observational data is, however, to reduce
bias. Biased samples can arise in many different ways. For example, in 1983 a
national television news program invited its viewers to participate in a
“phone-in” on the issue of whether the U.N. should continue to be based in the
U.S. The phone-in result was: yes 33% and no 67%, with a sample size of 180,000.
However, a more scientific survey conducted several days later, based on only
about 1,000 randomly sampled people, revealed that about 78% of people in the
U.S. thought that the U.N. should continue to be based in the U.S. Recent U.S.
presidential elections also revealed the potential bias in public opinion polls.
In finance, self-selection bias refers to the bias arising when data
availability leads to certain subsets of stocks (assets) being excluded from the
analysis. For example, if we look at stock returns within the U.S., we may have
a survival bias: stocks that failed in the past are not included, so the
observed returns will be somewhat higher. By the way, which was the 3rd largest
stock exchange in the world in 1900? The answer may surprise some of you.
Survival bias is of particular concern when we analyze high-tech stocks. For a
detailed analysis of survival bias, see Brown, Goetzmann, and Ross (1995,
Journal of Finance) and Li and Xu (2002, Journal of Finance).

11.3 Different types of potential bias


If we want to estimate the magnitude of the impact of X on Y, we can draw a
directed acyclic graph (DAG), with the notation X → Y as in Figure 1. Then
there are 4 different types of potential bias from another set of variables Z
(which could be multiple variables).
1. Confounding factors.
Example (a): Gender (X), car accidents (Y), and mileage (Z).
Example (b): Chocolate consumption (X), Nobel laureates per 10 million people
(Y), socio-economic status (Z). See the paper by Messerli (2012, New England
Journal of Medicine).
See also Figure 1 in the paper by Dablander (2020).
2. Mediators.
Example (a): Economic stress (X), psychological depression (Z), withdrawing
from entrepreneurship (Y).
Example (b): Green finance. Corporate social responsibility (X), financial
performance (Z), buying or selling by institutional investors (Y).
3. Colliders.
Example (a): Ascertainment bias. Beauty (X), acting talent (Y), and Hollywood
stars (Z).
Example (b): Selection bias. Economic conditions (X), heights of enlisted men
(Y), civilian employment (Z).
Example (c): Non-response bias. Social status (X), marriage outcome (Y),
migration to another city (Z).
Conditioning on a collider may lead to selection bias.
4. Feedback. This is the most difficult type for causal inference. More
research is needed here, as this is related to directed cyclic graphs (rather
than directed acyclic graphs) or non-recursive structural equation models.
More complicated causal structures are shown in Figures 2 and 3.
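The collider case (type 3) is perhaps the least intuitive, so here is a small simulation sketch of the Hollywood example. The numbers are invented: beauty and talent are drawn independently, and the selection rule (becoming a star when beauty plus talent exceeds 2) is a hypothetical assumption, not an empirical claim.

```python
import random

def corr(a, b):
    """Pearson correlation of two equally long lists."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    sab = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    saa = sum((u - ma) ** 2 for u in a)
    sbb = sum((v - mb) ** 2 for v in b)
    return sab / (saa * sbb) ** 0.5

rng = random.Random(1)
n = 50_000
beauty = [rng.gauss(0, 1) for _ in range(n)]   # X
talent = [rng.gauss(0, 1) for _ in range(n)]   # Y, independent of X

# Z (collider): assumed rule that both beauty and talent help one become a star.
stars = [(b, t) for b, t in zip(beauty, talent) if b + t > 2.0]

corr_all = corr(beauty, talent)                                  # near zero
corr_stars = corr([b for b, _ in stars], [t for _, t in stars])  # clearly negative

print(round(corr_all, 3), round(corr_stars, 3))
```

In the whole population the two traits are unrelated, but among the selected group they are negatively correlated, purely because of the conditioning on the collider.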

11.4 Basic Ideas of Causal Inference


11.4.1 Three Levels of Causal Inference
There are three levels of causal inference, with the third level being the most
difficult.
"Seeing": the goal is to remove bias in observational studies.
"Doing": the goal is to study the impact of an intervention on the population.
"Imagining": the goal is to answer what-if questions for individuals, i.e. to
study counterfactuals at the individual level.

11.4.2 The notation of counterfactuals and do calculus


Note that the observed outcome Y | X = x is different from the outcome under
intervention, Y | do(X = x). For example, we observe that more chocolate
consumption (X) goes together with more Nobel laureates per 10 million people
(Y); i.e. Y | X = x has a positive slope if we run a linear regression.
However, if we give people more chocolate to eat, i.e. do(X = x') at a higher
level x' > x, then we may not get more Nobel laureates.
To deal with this problem, we note the following points:
(1) Counterfactual notation. For example, Y(0) means Y | do(X = 0) and Y(1)
means Y | do(X = 1). Note that if X is a treatment, with X = 1 denoting
treatment and X = 0 no treatment, we can only observe one of Y(0) and Y(1), as
we cannot both treat and not treat the same individual simultaneously.
This missing data issue creates many problems. For example, how do we estimate
the average causal effect of treatment (ACE), where

$$ \mathrm{ACE} = E[Y(1) - Y(0)]? $$

(2) The do operation requires a different tool of calculation, as

$$ P(Y = y \mid X = x) \neq P(Y = y \mid \mathrm{do}(X = x)). $$
(3) One has to be careful when adjusting for conditioning variables. Sometimes
one has to condition (e.g. in the presence of confounding variables), and at
other times one must not condition (e.g. in the presence of colliders).
(4) One has to separate direct and indirect effects (as in mediation
analysis).
(5) Statistics alone cannot solve causal inference problems. One has to draw
upon expertise from the relevant domain to propose proper causal structures
(e.g. drawing DAGs, setting up structural equations) before choosing proper
statistical tools. In other words, draw your assumptions (sometimes literally,
as in DAGs) before reaching conclusions.
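The missing data issue behind the ACE can be made concrete with a simulation sketch in which, unlike in real data, both potential outcomes are generated for every individual. Everything here is a hypothetical assumption: a binary confounder z raises both the outcome and the chance of treatment, and the true individual effect is set to 1.0.

```python
import random

rng = random.Random(2)
n = 50_000
people = []
for _ in range(n):
    z = rng.random() < 0.5                      # hypothetical binary confounder
    y0 = (2.0 if z else 0.0) + rng.gauss(0, 1)  # potential outcome Y(0)
    y1 = y0 + 1.0                               # potential outcome Y(1)
    x = rng.random() < (0.8 if z else 0.2)      # z makes treatment more likely
    people.append((x, y0, y1))

def mean(values):
    values = list(values)
    return sum(values) / len(values)

# Only available in a simulation: both Y(0) and Y(1) for each individual.
true_ace = mean(y1 - y0 for _, y0, y1 in people)

# What an observational data set offers: Y(1) for the treated only,
# Y(0) for the untreated only.
treated = mean(y1 for x, _, y1 in people if x)
untreated = mean(y0 for x, y0, _ in people if not x)
naive_difference = treated - untreated

print(round(true_ace, 2), round(naive_difference, 2))
```

Under these assumptions the naive difference of observed group means is roughly 2.2, more than twice the true ACE of 1.0, because the treated group contains more individuals with z present.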

11.5 Simpson’s Paradox


Simpson’s paradox is a perfect case to illustrate causal inference.
Example A. In the 1970s there was a lawsuit against the UC Berkeley graduate
admissions office. The lawsuit was primarily based on the fact that in Fall
1973 there were 8442 male and 4321 female applicants to the graduate school,
but 44% of the male and only 35% of the female applicants were admitted. Thus,
there appeared to be discrimination against female applicants. However, UC
Berkeley won the lawsuit by providing more detailed data. A sample of the data
is as follows.

              Men                  Women
Major    # of App.   % Adm.    # of App.   % Adm.
A           825        62         108        82
B           560        63          25        68
C           417        33         375        35
D           191        28         393        34

Therefore, in all four of these majors, female applicants had a better chance
of being admitted, although the overall admission rate for women is lower than
that for men. Incidentally, majors A and B are science/engineering majors, and
C and D are humanities majors.
This is a typical example of Simpson’s paradox. More precisely, Simpson’s
paradox occurs when some overall conclusion concerning a set of objects fails to
hold within each of a collection of subsets of those objects.
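The reversal can be checked directly on the four majors shown in the table. The sketch below uses only those numbers; the pooled rates it computes are for this sample of four majors, not for the whole graduate school, and the function name `pooled_rate` is just an illustrative choice.

```python
# (number of applicants, percent admitted) per major, copied from the table.
men = {"A": (825, 62), "B": (560, 63), "C": (417, 33), "D": (191, 28)}
women = {"A": (108, 82), "B": (25, 68), "C": (375, 35), "D": (393, 34)}

def pooled_rate(group):
    """Admission rate (in percent) after pooling all majors in `group`."""
    applicants = sum(n for n, _ in group.values())
    admitted = sum(n * pct / 100 for n, pct in group.values())
    return 100 * admitted / applicants

women_higher_in_every_major = all(women[m][1] > men[m][1] for m in men)
men_rate = pooled_rate(men)      # about 53%
women_rate = pooled_rate(women)  # about 41%

print(women_higher_in_every_major, round(men_rate, 1), round(women_rate, 1))
```

Women have the higher admission rate in each major, yet the lower pooled rate, because most women in this sample applied to majors C and D, where admission rates are low for everyone.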
Example B. Simpson’s paradox also has implications in finance. For example,
consider investments in two industries.
The OSX index (oil service stocks) had a return of 48% during 2005. Suppose one
stock within the oil service group, OSX-H, has a return of 60%, while a second
stock within the group, OSX-L, has a return of 35%.
The SOXX index (semiconductor stocks) had a return of 10.6% during 2005.
Suppose SOXX-H has a return of 15%, while SOXX-L has a return of 5%.

There are two fund managers, A and B, and suppose that manager A is a much
better stock picker than manager B in every category. More precisely,

A chose OSX-H and SOXX-H,
B chose OSX-L and SOXX-L.

However,

B invested 80% of the money in oil and 20% in semiconductors,
A invested 80% of the money in semiconductors and 20% in oil.

The final outcomes are

The return for Manager A: 80% × 15% + 20% × 60% = 24%
The return for Manager B: 20% × 5% + 80% × 35% = 29%

Now we have Simpson’s paradox again. In other words, although Manager A has
better returns within each industry, Manager B has a better overall return,
simply because B put more money in the right industry sector.
This example shows the importance of asset allocation vs. stock picking.
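The arithmetic for the two managers can be reproduced with a short sketch. The returns and weights are those of Example B; the function name `portfolio_return` is only an illustrative choice.

```python
# Returns in percent, as assumed in Example B.
returns = {"OSX-H": 60.0, "OSX-L": 35.0, "SOXX-H": 15.0, "SOXX-L": 5.0}

def portfolio_return(weights):
    """Weighted average return (in percent) of a portfolio."""
    return sum(w * returns[stock] for stock, w in weights.items())

manager_a = portfolio_return({"OSX-H": 0.2, "SOXX-H": 0.8})
manager_b = portfolio_return({"OSX-L": 0.8, "SOXX-L": 0.2})

a_better_in_oil = returns["OSX-H"] > returns["OSX-L"]
a_better_in_semis = returns["SOXX-H"] > returns["SOXX-L"]

print(round(manager_a, 1), round(manager_b, 1))  # 24.0 29.0
```

Changing the weights passed to `portfolio_return` shows how the comparison between the two managers depends on the allocation across sectors as well as on the stocks picked.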

11.6 Seeing Level: Correcting Bias via Conditioning
At the seeing level, a main objective of causal inference is to reduce bias by
choosing suitable conditioning variables.
(1) Do not condition on a collider, as doing so would create bias.
(2) Condition on all confounding variables.
(3) For mediators, use mediation analysis (several regressions) to separate
direct and indirect effects. This will be covered in the next chapter.
In this section, we focus on (2), namely how to evaluate whether or not a set
of variables are confounding variables. To do so, we shall use d-separation and
the backdoor criterion.

11.6.1 D-separation
A path from X to Y is any consecutive sequence of nodes and edges whose start
and end nodes are X and Y, ignoring the direction of the edges.
We define a path to be either blocked or open according to the following
graphical rules.
1. If no variables are being conditioned on, a path is blocked if and only if
two arrowheads on the path collide at some variable on the path (i.e. there is
a collider on the path and the collider is not conditioned on). For example, in
Figure 2 the path X, Z1, Y is blocked.
2. Any path that contains a non-collider that has been conditioned on is
blocked. For example, in Figure 3, conditioning on Z1, the path X, Z1, Y is
blocked.
3. A collider that has been conditioned on does not block a path. For example,
in Figure 2, conditioning on Z1, the path X, Z1, Y is open.
4. A collider that has a descendant that has been conditioned on does not block
a path. For example, conditioning on Z3 in Figure 3, the path X, Z2, Y is open.
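The four rules can be turned into a small path checker. The sketch below is one possible encoding, not a standard library routine: a DAG is stored as a dictionary from each parent to its children, and `is_blocked` walks along a given path. The example graph is hypothetical (it is not Figure 2 or Figure 3): X → C ← Y is a collider with descendant D, and U is a common cause of X and Y.

```python
def descendants(dag, node):
    """All nodes reachable from `node` by following directed edges."""
    seen, stack = set(), [node]
    while stack:
        for child in dag.get(stack.pop(), ()):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen

def is_blocked(dag, path, conditioned=()):
    """Apply rules 1-4 to one path, given as a list of node names."""
    conditioned = set(conditioned)
    for prev, node, nxt in zip(path, path[1:], path[2:]):
        # A collider has arrows pointing into it from both neighbours.
        collider = node in dag.get(prev, ()) and node in dag.get(nxt, ())
        if collider:
            # Rules 1, 3, 4: a collider blocks the path unless the collider
            # itself or one of its descendants is conditioned on.
            if not ({node} | descendants(dag, node)) & conditioned:
                return True
        elif node in conditioned:
            # Rule 2: a conditioned non-collider blocks the path.
            return True
    return False

# Hypothetical DAG, written as parent -> list of children.
dag = {"X": ["C", "Y"], "Y": ["C"], "U": ["X", "Y"], "C": ["D"]}

print(is_blocked(dag, ["X", "C", "Y"]))          # True  (rule 1)
print(is_blocked(dag, ["X", "C", "Y"], {"C"}))   # False (rule 3)
print(is_blocked(dag, ["X", "C", "Y"], {"D"}))   # False (rule 4)
print(is_blocked(dag, ["X", "U", "Y"], {"U"}))   # True  (rule 2)
```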

11.6.2 Backdoor criterion


In practice, if we believe confounding is likely, a key question is whether we
can determine a set of measured covariates L such that conditioning on L lets
us study the effect of X on Y in an unbiased way. To do this, the backdoor
criterion can be applied to the causal DAG.
A backdoor path is a path that contains an arrow into X. Note that, besides the
arrow into X, the direction of the edges in the path is ignored. For example,
in Figure 2, there are 4 back-door paths from X to Y: (1) X, Z2, Z3, Y; (2) X,
Z2, Z3, Z4, Y; (3) X, Z3, Z4, Y; (4) X, Z3, Z4, Y.
A set of covariates L satisfies the backdoor criterion if all backdoor paths
between X and Y are blocked by conditioning on L and L contains no variables
that are descendants of the treatment X. For example, in Figure 3, if one wants
to study the impact of Z1 on Z3, then although Z2 is on the backdoor path Z1,
X, Z2, Z3, we cannot condition on Z2, because Z2 is a descendant of Z1 (via the
directed path Z1, Y, Z2, Z3).
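The backdoor criterion can also be checked mechanically on a small graph. The following sketch enumerates backdoor paths by brute force and then tests both conditions of the criterion. It is an illustration under stated assumptions rather than an efficient algorithm, and the DAG is hypothetical (it is not one of the lecture's figures): Z is a common cause of X and Y, M is a mediator, and C is a collider on the path X ← A → C ← B → Y.

```python
def descendants(dag, node):
    """All nodes reachable from `node` by following directed edges."""
    seen, stack = set(), [node]
    while stack:
        for child in dag.get(stack.pop(), ()):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen

def is_blocked(dag, path, conditioned):
    """d-separation rules applied to a single path."""
    for prev, node, nxt in zip(path, path[1:], path[2:]):
        collider = node in dag.get(prev, ()) and node in dag.get(nxt, ())
        if collider:
            if not ({node} | descendants(dag, node)) & conditioned:
                return True
        elif node in conditioned:
            return True
    return False

def neighbours(dag, node):
    """Parents and children of `node`, in a fixed (sorted) order."""
    parents = {p for p, kids in dag.items() if node in kids}
    return sorted(parents | set(dag.get(node, ())))

def backdoor_paths(dag, x, y):
    """All simple paths from x to y whose first edge points into x."""
    found = []

    def extend(path):
        if path[-1] == y:
            found.append(path)
            return
        for nxt in neighbours(dag, path[-1]):
            if nxt in path:
                continue
            if len(path) == 1 and x not in dag.get(nxt, ()):
                continue  # the first edge must be an arrow into x
            extend(path + [nxt])

    extend([x])
    return found

def satisfies_backdoor(dag, x, y, covariates):
    covariates = set(covariates)
    if covariates & descendants(dag, x):
        return False  # L must not contain descendants of the treatment
    return all(is_blocked(dag, p, covariates) for p in backdoor_paths(dag, x, y))

# Hypothetical DAG, written as parent -> list of children.
dag = {"Z": ["X", "Y"], "X": ["M"], "M": ["Y"], "A": ["X", "C"], "B": ["C", "Y"]}

print(backdoor_paths(dag, "X", "Y"))
print(satisfies_backdoor(dag, "X", "Y", {"Z"}))        # True
print(satisfies_backdoor(dag, "X", "Y", set()))        # False: X, Z, Y is open
print(satisfies_backdoor(dag, "X", "Y", {"Z", "C"}))   # False: collider C opened
print(satisfies_backdoor(dag, "X", "Y", {"Z", "M"}))   # False: M descends from X
```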

11.7 Doing Level: the Do Calculus


The effects of do(X = x):
1. Setting X = x, or do(X = x), cuts all incoming causal arrows to X. See
Figure 6 in the paper by Dablander (2020).
2. If L, a set of covariates, satisfies the backdoor criterion, then

$$ P(Y = y \mid \mathrm{do}(X = x)) = \sum_{l} P(Y = y \mid X = x, L = l)\, P(L = l). $$

Note that this is different from the standard conditional probability, as

$$ P(Y = y \mid X = x) = \sum_{l} P(Y = y \mid X = x, L = l)\, P(L = l \mid X = x). $$
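The difference between the two sums is only in the weights, P(L = l) versus P(L = l | X = x), and a small numerical sketch makes this visible. All probabilities below are made up for illustration; L is a single binary covariate that is assumed to satisfy the backdoor criterion.

```python
# Hypothetical probabilities for binary L, X, Y.
p_l = {0: 0.5, 1: 0.5}                     # P(L = l)
p_x1_given_l = {0: 0.2, 1: 0.8}            # P(X = 1 | L = l)
p_y1_given_xl = {(0, 0): 0.1, (1, 0): 0.2,
                 (0, 1): 0.6, (1, 1): 0.7}  # P(Y = 1 | X = x, L = l)

def p_x_given_l(x, l):
    return p_x1_given_l[l] if x == 1 else 1 - p_x1_given_l[l]

def p_y1_do_x(x):
    """Backdoor adjustment: weight each stratum by P(L = l)."""
    return sum(p_y1_given_xl[(x, l)] * p_l[l] for l in p_l)

def p_y1_given_x(x):
    """Ordinary conditioning: weight each stratum by P(L = l | X = x)."""
    p_x = sum(p_x_given_l(x, l) * p_l[l] for l in p_l)
    return sum(p_y1_given_xl[(x, l)] * p_x_given_l(x, l) * p_l[l] / p_x
               for l in p_l)

causal_difference = p_y1_do_x(1) - p_y1_do_x(0)            # 0.45 - 0.35
observed_difference = p_y1_given_x(1) - p_y1_given_x(0)    # 0.60 - 0.20

print(round(causal_difference, 2), round(observed_difference, 2))  # 0.1 0.4
```

With these assumed numbers, the observed difference overstates the interventional one by a factor of four, because L = 1 is over-represented among those with X = 1.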

11.8 Doing Level: Performing Causal Inference for Simpson’s Paradox
See the two examples in the paper by Dablander (2020).
Homework.
1. Use the same return numbers for OSX-H, OSX-L, SOXX-H, and SOXX-L
(i.e. 60%, 35%, 15%, 5%, respectively). Suppose manager A invests proportion
p1 in OSX-H and (1 − p1) in SOXX-H, and manager B invests proportion p2 in
OSX-L and (1 − p2) in SOXX-L. Under what conditions on p1 and p2 does
Simpson’s paradox appear?
2. For Example B (asset allocation), calculate the average causal effect of
the two portfolio managers on portfolio returns.

11.9 Imagining: Using Structural Causal Equations
At the imagining level, we are asking what-if questions at the individual
level. The DAG alone is not enough for this, as the DAG focuses on the
population level. Instead, we need to introduce structural causal equations,
such as

$$ Y := \alpha + \beta X + \varepsilon. $$

Note that, unlike the standard equation $Y = \alpha + \beta X + \varepsilon$,
this does not imply that

$$ X := \alpha' + \beta' Y + \varepsilon'. $$

For example, Y can be the grade a student gets and X the number of hours
studied. By the equation, one additional hour of study increases the grade by
$\beta$ on average. But the reverse causality does not hold: if a teacher adds
one point to the score, this does not change how many hours the student has
studied.
See pp. 9-12 of Dablander (2020).
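The asymmetry of a structural equation can be illustrated with a sketch of the grade example. The coefficient values (`ALPHA`, `BETA`), the noise distribution, and the function `student` are hypothetical. Fixing the seed plays the role of fixing one individual: the same exogenous terms are reused, so the comparison is a what-if question about that one student.

```python
import random

ALPHA, BETA = 50.0, 5.0   # hypothetical intercept and effect of one study hour

def student(seed, do_hours=None, grade_bonus=0.0):
    """One student under the structural equation Y := alpha + beta X + eps."""
    rng = random.Random(seed)
    natural_hours = rng.uniform(0, 10)   # exogenous determinant of X
    eps = rng.gauss(0, 3)                # exogenous noise in Y
    # Intervening on X replaces the natural value of hours studied.
    hours = natural_hours if do_hours is None else do_hours
    # Intervening on Y (a teacher adding points) does not feed back into X.
    grade = ALPHA + BETA * hours + eps + grade_bonus
    return hours, grade

hours, grade = student(seed=7)
_, grade_one_more_hour = student(seed=7, do_hours=hours + 1)
hours_after_bonus, grade_after_bonus = student(seed=7, grade_bonus=1.0)

print(grade_one_more_hour - grade)   # approximately BETA
print(hours_after_bonus == hours)    # True: hours studied are unchanged
```

For this simulated student, one more hour of study raises the grade by the assumed coefficient, while adding a point to the grade leaves the hours studied untouched.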

11.10 Dynamic Simpson’s Paradox and Dynamic Causal Inference
Example. Here is another example, related to the U.S. tax cut between 1974 and
1978. In 1974 the U.S. Congress decided to cut taxes for everyone, so the tax
rate was reduced in every tax bracket, as shown in the following table.

Year 1974       Income      Tax     Tax Rate    New Tax Rate
Low Income       380.7      37.2      9.8%         8.5%
Mid Income       470.0      75.0     16.0%        15.9%
High Income       29.4      11.3     38.4%        38.2%
Total            880.1     123.5
Overall                              14.0%

However, 4 years later, the outcome is totally different.

Year 1978       Income      Tax     Tax Rate
Low Income       314.5      26.7      8.5%
Mid Income       865.0     137.9     15.9%
High Income       62.8      24.0     38.2%
Total           1242.3     188.6
Overall                              15.2%

In other words, the overall tax rate increased, even though the tax rates were
lower in each category. The main reason is that many more people belonged to
the middle and high income categories, possibly due to the high inflation
during the period, as the cutoff points for the tax brackets did not change
from 1974 to 1978. For example, due to inflation, a person who belonged to the
low income category in 1974 might belong to the middle income category in 1978,
resulting in a tax rate of 15.9% in 1978 instead of 9.8% in 1974.
This is related to the dynamic Simpson’s paradox, as we are comparing two
outcomes at different time periods. Causal inference for dynamic data is much
more complicated, as the causal relationships and the magnitudes of causal
effects (including the possibility of delayed effects) can change over time.
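The overall rates in the two tables can be recomputed from the bracket totals, which also shows how the composition of income shifted. The sketch below uses only the income and tax figures from the tables; the function names are illustrative.

```python
# (income, tax) per bracket, copied from the two tables.
year_1974 = {"low": (380.7, 37.2), "mid": (470.0, 75.0), "high": (29.4, 11.3)}
year_1978 = {"low": (314.5, 26.7), "mid": (865.0, 137.9), "high": (62.8, 24.0)}

def overall_rate(year):
    """Total tax divided by total income, in percent."""
    income = sum(i for i, _ in year.values())
    tax = sum(t for _, t in year.values())
    return 100 * tax / income

def income_share(year, bracket):
    """Share of total income earned in one bracket, in percent."""
    income = sum(i for i, _ in year.values())
    return 100 * year[bracket][0] / income

rate_1974 = overall_rate(year_1974)   # about 14.0
rate_1978 = overall_rate(year_1978)   # about 15.2
low_share_1974 = income_share(year_1974, "low")   # about 43
low_share_1978 = income_share(year_1978, "low")   # about 25

print(round(rate_1974, 1), round(rate_1978, 1))
print(round(low_share_1974, 1), round(low_share_1978, 1))
```

The share of income taxed in the low bracket falls from roughly 43% to roughly 25%, which is enough to raise the overall rate even though every bracket's rate went down.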
