Lecture 1
ECON3210
Big Data Econometrics
Lecture Slides Week 1
Fangzhou Yu
Administrative data
Big data […] refers to large, diverse, complex, longitudinal, and/or distributed data sets generated from instruments,
sensors, Internet transactions, email, video, click streams, and/or all other digital sources available
The definition of big data is data that contains greater variety, arriving in increasing volumes and with more
velocity.
Online panels
Later, we will have a more detailed discussion of “Big Data” in this course
$p \ll n$: $10^2$ to $10^5$ observations, fewer than 10 or tens of variables
Big n problem ✗
Big n is when the number of observations explodes
Census data / user data in big tech firms (Google, Amazon…) / high-frequency data (stock transactions)
Big p problem ✓
Big p is when the number of variables explodes
The dataset does not have to be big in size
Sometimes due to a rich dataset
Huge challenge for traditional methods (will discuss this
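One way to see why big p is hard for traditional methods: OLS needs to invert $X'X$, and that matrix cannot have full rank once the number of variables exceeds the number of observations. The sketch below uses made-up dimensions (`n = 20`, `p = 50`) and random regressors purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: more variables (p) than observations (n).
n, p = 20, 50
X = rng.normal(size=(n, p))

# rank(X'X) equals rank(X), which can never exceed min(n, p).
rank = np.linalg.matrix_rank(X)

print(f"n = {n}, p = {p}, rank of X (and of X'X) = {rank}")
# Because rank < p, the p x p matrix X'X is singular and the OLS
# coefficients are not uniquely determined.
```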
Research objectives
Descriptive Analytics: summarizing & exploring data
Research objectives
Causal Inference: predicting counterfactuals
What is the impact of combating inflation by raising interest rates on the housing market?
Model | Year | Data size | Parameters | Training data
GPT-2 | 2019 | ~40GB | 1.5B | WebText (45 million pages from diverse web sources)
GPT-3 | 2020 | Hundreds of GBs | 175B | WebText2 (extended and larger version of WebText)
GPT-4 | Knowledge cutoff 2021 | Not available | Not available | Assumed larger and more diverse dataset
“…ML tools are becoming standard across disciplines, so the economist’s toolkit needs to adapt accordingly while
preserving the traditional strengths of applied econometrics.” - Athey and Imbens (2019)
Susan Athey: Professor of Economics at Stanford, President of American Economic Association, Chief Economist at
Microsoft
Guido Imbens: Professor of Economics at Stanford, Nobel Laureate in 2021 for his research in causal inference, Chief
Editor of Econometrica, Husband of Susan Athey
(advertising.csv)
$\hat{\beta}_1 = \dfrac{\mathrm{Cov}(X_i, Y_i)}{\mathrm{Var}(X_i)}$
Or, without assuming linearity, OLS is the best linear predictor of Y given X; under the Gauss-Markov assumptions it is also the Best Linear Unbiased Estimator (BLUE)
$\hat{Y}_i = 7.033 + 0.048 X_i$
A market where $TV_i = 100$ → $\widehat{sales}_i = 11.833$
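A minimal sketch of the two calculations on this slide. The data here are synthetic (advertising.csv is not reproduced, and the generating numbers are made up), so only the mechanics carry over: the slope is the sample covariance over the sample variance, and the prediction plugs $TV_i = 100$ into the fitted line reported above. The helper name `predict_sales` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for advertising.csv (NOT the real data).
tv = rng.uniform(0, 300, size=200)
sales = 7.0 + 0.05 * tv + rng.normal(0, 3, size=200)

# Slope as Cov(X, Y) / Var(X); intercept from the sample means.
beta1 = np.cov(tv, sales, ddof=1)[0, 1] / np.var(tv, ddof=1)
beta0 = sales.mean() - beta1 * tv.mean()

# Cross-check against NumPy's least-squares line fit.
slope_ls, intercept_ls = np.polyfit(tv, sales, deg=1)
print(beta1, slope_ls)

def predict_sales(tv_budget, intercept=7.033, slope=0.048):
    """Prediction from the fitted line reported on the slide."""
    return intercept + slope * tv_budget

print(round(predict_sales(100), 3))
```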
Reverse Causality
or D, E, F …
How does variable Y change if X is changed but all other relevant factors are held constant?
Useful to describe how an experiment would have to be designed to infer the causal effect in question
See this NSW doc and Netflix blog for a summary of experiments or A/B tests
Experiment 1
Impact of back-to-work program on employment
“If a person from the population of those looking for work is given access to a back-to-work program, will that
increase their chance of employment?”
Implicit assumption: all other factors that influence employment (experience, ability, local employment
prospects…) are held fixed
Experiment:
The experiment works because of randomness: the characteristics of people are independent of whether they receive the
program or not
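The independence claim can be illustrated with a small simulation. Everything here is hypothetical: `experience` stands in for any characteristic that affects employment, and the self-selection rule is invented for contrast. With coin-flip assignment the two groups look alike on average; with self-selection they do not.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 10_000

# Hypothetical characteristic that influences employment.
experience = rng.normal(10, 3, size=n)

# Random assignment: a coin flip that ignores experience.
treated = rng.integers(0, 2, size=n).astype(bool)
gap_random = experience[treated].mean() - experience[~treated].mean()

# Contrast: people with more experience are more likely to opt in.
self_selected = (experience + rng.normal(0, 3, size=n)) > 10
gap_selected = experience[self_selected].mean() - experience[~self_selected].mean()

print(f"experience gap, random assignment: {gap_random:.3f}")
print(f"experience gap, self-selection:    {gap_selected:.3f}")
```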
Experiment 2
A/B testing of a website landing page
“If a business rearranges its current website (UI), by how much will this change customer behaviors such as
consumption, time spent on the website…?”
Implicit assumption: all other factors that influence customer behaviors are held fixed
Experiment 3
Measuring returns to education
“If a person is given another year of education, by how much will his or her wage increase?”
Implicit assumption: experience, family background, intelligence etc. are held fixed
Deciding to go to university is a choice and is unlikely to be as if randomly assigned, which leads to a selection problem
$Y_i(1)$ if treated
$Y_i(0)$ if untreated
$TE_i = Y_i(1) - Y_i(0)$
We observe $Y_i = Y_i(1)$ if $D_i = 1$, and $Y_i = Y_i(0)$ if $D_i = 0$
$D_i = 1$: $Y_i(D_i) = Y_i(1) = 1$
$D_i = 0$: $Y_i(D_i) = Y_i(0) = 0$

$D_i = 1$: $Y_i(D_i) = Y_i(1) = 1$
$D_i = 0$: $Y_i(D_i) = Y_i(0) = 1$
Q3: For any individual, the treatment effect cannot be identified/estimated. Why?
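$i$ | $D_i$ | $Y_i$ | $Y_i(1)$ | $Y_i(0)$ | $TE_i$
1 | 0 | 0 | 0 | 0 | 0
2 | 1 | 1 | 1 | 1 | 0
3 | 1 | 0 | 0 | 0 | 0
4 | 0 | 0 | 0 | 0 | 0
5 | 0 | 1 | 0 | 1 | -1
6 | 1 | 1 | 1 | 0 | 1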
With this population, we can define the Average treatment effect (ATE)
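As a worked check, the ATE for the six-unit population can be computed directly. This sketch assumes the table columns are, in order, $i$, $D_i$, $Y_i$, $Y_i(1)$, $Y_i(0)$, $TE_i$; that reading is consistent with $TE_i = Y_i(1) - Y_i(0)$ in every row, but the column labels themselves are an inference.

```python
from fractions import Fraction

# Potential outcomes for units 1..6, read from the table
# (column order is an assumption, see the note above).
y1 = [0, 1, 0, 0, 0, 1]  # Y_i(1)
y0 = [0, 1, 0, 0, 1, 0]  # Y_i(0)

# Individual treatment effects TE_i = Y_i(1) - Y_i(0).
te = [a - b for a, b in zip(y1, y0)]

# ATE is the population average of the individual effects.
ate = Fraction(sum(te), len(te))

print("TE_i:", te)
print("ATE :", ate)
```

Computing the ATE this way needs both potential outcomes for every unit, which is exactly what real data never provide.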
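$i$ | $D_i$ | $Y_i$ | $Y_i(1)$ | $Y_i(0)$ | $TE_i$
1 | 0 | 0 |   | 0 | ?
2 | 1 | 1 | 1 |   | ?
3 | 1 | 0 | 0 |   | ?
4 | 0 | 0 |   | 0 | ?
5 | 0 | 1 |   | 1 | ?
6 | 1 | 1 | 1 |   | ?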
Q3: This is why the treatment effect cannot be identified or estimated at the individual level
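$i$ | $D_i$ | $Y_i$ | $Y_i(1)$ | $Y_i(0)$ | $TE_i$
1 | 0 | 0 |   | 0 | ?
2 | 1 | 1 | 1 |   | ?
3 | 1 | 0 | 0 |   | ?
4 | 0 | 0 |   | 0 | ?
5 | 0 | 1 |   | 1 | ?
6 | 1 | 1 | 1 |   | ?
Mean |   |   | 2/3 | 1/3 |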
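The two averages can be reproduced from the observed data alone. The sketch assumes the second and third table columns are $D_i$ and the observed $Y_i$, so 2/3 is the mean outcome among treated units and 1/3 the mean among untreated units.

```python
from fractions import Fraction

# Observed data for units 1..6 (assumed columns: D_i and Y_i).
d = [0, 1, 1, 0, 0, 1]
y = [0, 1, 0, 0, 1, 1]

treated = [yi for di, yi in zip(d, y) if di == 1]
untreated = [yi for di, yi in zip(d, y) if di == 0]

mean_treated = Fraction(sum(treated), len(treated))        # E[Y_i | D_i = 1]
mean_untreated = Fraction(sum(untreated), len(untreated))  # E[Y_i | D_i = 0]
naive_diff = mean_treated - mean_untreated

print(mean_treated, mean_untreated, naive_diff)
```

The naive difference in means is 1/3 here, even though averaging $Y_i(1) - Y_i(0)$ over the full six-unit table of potential outcomes gives an ATE of 0.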
$Y_i = D_i Y_i(1) + (1 - D_i) Y_i(0)$
The RHS is the estimator we usually use in experiments, right? Why is it not working in this case?
$E[Y_i \mid D_i = 1] - E[Y_i \mid D_i = 0]$
$= E[Y_i(1) \mid D_i = 1] - E[Y_i(0) \mid D_i = 0]$
So for Q4,
But we cannot do this with observational data; this is the challenge we will deal with in later lectures
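To see where observational data go wrong, the difference in means can be split into two pieces by adding and subtracting $E[Y_i(0) \mid D_i = 1]$. This is the standard decomposition; the labels ATT (average treatment effect on the treated) and "selection bias" are the usual names and are not taken from the slides.

```latex
\begin{aligned}
E[Y_i \mid D_i = 1] - E[Y_i \mid D_i = 0]
  &= E[Y_i(1) \mid D_i = 1] - E[Y_i(0) \mid D_i = 0] \\
  &= \underbrace{E[Y_i(1) - Y_i(0) \mid D_i = 1]}_{\text{ATT}}
   + \underbrace{E[Y_i(0) \mid D_i = 1] - E[Y_i(0) \mid D_i = 0]}_{\text{selection bias}}
\end{aligned}
```

Under random assignment, $D_i$ is independent of $(Y_i(1), Y_i(0))$, so the selection bias term is zero and the ATT equals the ATE. When people choose their own treatment, neither simplification is guaranteed.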
Assume $Y_i(0) = \alpha + \epsilon_i(0)$
$Y_i = \alpha + \tau D_i + \epsilon_i(0)$
$E[\epsilon_i \mid D_i] = E[\epsilon_i(0) \mid D_i] = 0$
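A small simulation of this regression setup, as a sketch only: the values of `alpha` and `tau`, the sample size, and the normal error are all made up. With $D_i$ assigned by coin flip, the condition $E[\epsilon_i \mid D_i] = 0$ holds by construction, and the OLS coefficient on $D_i$ coincides with the difference in group means.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5_000
alpha, tau = 2.0, 0.5  # hypothetical parameter values

d = rng.integers(0, 2, size=n)   # randomly assigned treatment
eps0 = rng.normal(0, 1, size=n)  # epsilon_i(0), independent of d
y = alpha + tau * d + eps0       # Y_i = alpha + tau * D_i + epsilon_i(0)

# OLS of Y on a constant and D.
X = np.column_stack([np.ones(n), d])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
alpha_hat, tau_hat = coef

# With a single binary regressor, the OLS slope equals the difference in means.
diff_in_means = y[d == 1].mean() - y[d == 0].mean()

print(f"tau_hat = {tau_hat:.4f}, difference in means = {diff_in_means:.4f}")
```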
This also prepares for our later discussion of the new causal inference methods in econometrics
References
Athey, Susan, and Guido W Imbens. 2019. “Machine Learning Methods That Economists Should Know About.” Annual
Review of Economics 11: 685–725.
Haynes, Laura, Ben Goldacre, David Torgerson, et al. 2012. “Test, Learn, Adapt: Developing Public Policy with
Randomised Controlled Trials.” Cabinet Office-Behavioural Insights Team.