Lecture 1

1 / 48
ECON3210
Big Data Econometrics
Lecture Slides Week 1
Fangzhou Yu
Week 1: Course Overview

Econometrics in a Big Data world
Course overview & themes
Big Data & research design
Econometrics vs ML or Econometrics & ML
Prediction & Causal inference
Prediction methods for causal inference
This course is 20% Prediction + 80% Causal inference
The key takeaway from this course:
Difference between Prediction and Causal inference
1. What is Big Data?
Data, data everywhere

Conventional sources
Statistical agencies such as the Australian Bureau of Statistics (ABS)
Census of Population & Housing
Surveys such as Australian Health Survey
GDP, employment rate, income…
Reserve Bank of Australia (RBA)
Interest rates, exchange rates…
Administrative data
Firms, government agencies routinely record their operations
Sources of data have exploded & become more diverse
Era of Big Data & the data deluge
What exactly is Big Data?

According to NSF and NIH — two of the largest funders of academic research on big data
In their 2012 joint program solicitation:
Big data […] refers to large, diverse, complex, longitudinal, and/or distributed data sets generated from instruments,
sensors, Internet transactions, email, video, click streams, and/or all other digital sources available
Or according to the “3Vs” definition of Big Data
The definition of big data is data that contains greater variety, arriving in increasing volumes and with more
velocity.

“The GDELT Project is an initiative to construct a catalog of human societal-scale behavior and beliefs across all countries of the world, connecting every person, organization, location, count, theme, news source, and event across the planet into a single massive network that captures what’s happening around the world, what its context is and who’s involved, and how the world is feeling about it, every single day.”

Also observe many more “traditional” surveys
Data sources varied & not monopoly of statistical agencies
Data collection costs relatively low
Researchers/Business developing new data sources
Online panels
Lab, field & social experiments
However, the above definitions and examples of Big Data are vague
Later in this course, we will discuss “Big Data” in more detail
Common data in ECON

A common data set we have in Economics
n observations and p variables
$p \ll n$: $10^2$ to $10^5$ observations, fewer than 10 or at most tens of variables
Traditional methods (e.g. OLS, IV) can handle this
Big n problem ✗
Big n is when the number of observations explodes
Census data / user data at big tech firms (Google, Amazon…) / high-frequency data (stock transactions)
The size of the data file is the major challenge
New software/platform to analyze this kind of data: SQL/Hive/Hadoop
But little challenge to the traditional methods
Not discussed in this course
Big p problem ✓
Big p is when the number of variables explodes
Sometimes due to a rich dataset
Sometimes we create a Big p problem by adding polynomials or interactions to capture complex relationships
The dataset does not have to be big in size
Huge challenge for traditional methods (will discuss this next week)
Sometimes also called high-dimensional data
This is the focus of this course
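To see how quickly polynomials and interactions create a Big p problem, the sketch below counts the regressors produced by a full polynomial expansion. The function names and the example sizes (3 and 20 variables) are illustrative assumptions, not part of the course material.

```python
from itertools import combinations_with_replacement
from math import comb


def n_poly_terms(p: int, degree: int) -> int:
    """Number of monomials of total degree 1..degree in p variables (intercept excluded)."""
    return comb(p + degree, degree) - 1


def poly_terms(names, degree):
    """List every monomial up to `degree` as a tuple of variable names."""
    terms = []
    for d in range(1, degree + 1):
        terms.extend(combinations_with_replacement(names, d))
    return terms


# Three raw variables, quadratic expansion: 3 linear terms, 3 squares, 3 interactions
terms = poly_terms(["x1", "x2", "x3"], degree=2)
print(len(terms))            # 9

# Twenty raw variables, cubic expansion
print(n_poly_terms(20, 3))   # 1770
```

With only 20 raw variables, a cubic expansion already yields more regressors than many economic datasets have observations.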
2. Big Data with different research objectives
Research objectives
Descriptive Analytics: summarizing & exploring data
Is there a problem with customer churn?
Inferring “social networks” from social media interactions
Our world in data
Predictive Analytics: forecasting out-of-sample data
Which customers are at the most risk of churning?
What is the GDP/employment rate/electricity consumption in next year?
“Your recommendations” on Amazon, Netflix …
Research objectives
Causal Inference: predicting counterfactuals
Will customers stay if they are offered discounts?
Is it profitable if online advertising is increased?
What is the effect of lockdown during Covid-19?
What is the impact of combating inflation by raising interest rates on the housing market?
How much is the gender wage gap?
Why does incumbency status affect election outcomes?
This kind of research objective is the main focus of Economics
Differs from the other two because correlation is not causation
Big p challenge under research objectives

For Descriptive Analytics and Predictive Analytics
Machine Learning methods are developed to deal with Big p
Example: ChatGPT, a Large Language Model (LLM)
| Model | Release Date | Training Data Size | Parameters | Training Data Source |
|---|---|---|---|---|
| GPT-1 | 2018 | ~800 million words | 117M | BooksCorpus (7,000 unpublished books) |
| GPT-2 | 2019 | ~40GB | 1.5B | WebText (45 million pages from diverse web sources) |
| GPT-3 | 2020 | Hundreds of GBs | 175B | WebText2 (Extended and larger version of WebText) |
| GPT-3.5 | Not available | Not available | Not available | Not available |
| GPT-4 | Knowledge cutoff 2021 | Not available | Not available | Assumed larger and more diverse dataset |
The table above was generated by ChatGPT (GPT-4)
Big p challenge under research objectives

For Causal Inference
Machine Learning methods cannot be applied directly
ML is algorithm-based while Econometrics is model- or design-based
ML stresses accurate predictions $\hat{Y}$
while Econometrics stresses relationships $\hat{\beta}$
and its inference (e.g. hypothesis testing)
ML relies on data-driven model selection
while Econometrics starts with a model based on economic theory
“…ML tools are becoming standard across disciplines, so the economist’s toolkit needs to adapt accordingly while
preserving the traditional strengths of applied econometrics.” - Athey and Imbens (2019)
The main task of ECON3210 is to introduce recent research in causal inference with ML methods
Susan Athey: Professor of Economics at Stanford, President of American Economic Association, Chief Economist at
Microsoft
Guido Imbens: Professor of Economics at Stanford, Nobel Laureate in 2021 for his research in causal inference, Chief
Editor of Econometrica, Husband of Susan Athey
3. Regression as description, prediction and causal inference
Linear regression for descriptive analysis

Linear regression model
$$Y_i = \beta_0 + \beta_1 X_i + \epsilon_i$$
Consider sales ($Y_i$) & TV advertising ($X_i$) (advertising.csv)
Q1: What key features of the data are revealed by the scatter plot?
Without specifying a linear model, the OLS estimator $\hat{\beta}_1$ is just the (standardized) covariance of sales and TV advertising
$$\hat{\beta}_1 = \frac{Cov(X_i, Y_i)}{Var(X_i)}$$
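The covariance formula for the OLS slope can be checked numerically. The sketch below uses simulated data as a stand-in for advertising.csv (the coefficients 7.0 and 0.05 and the noise level are assumptions chosen for illustration) and verifies that the resulting line satisfies the least-squares conditions.

```python
import random
from statistics import fmean

random.seed(0)

# Hypothetical stand-in for advertising.csv: simulated TV budgets and sales
tv = [random.uniform(0, 300) for _ in range(200)]
sales = [7.0 + 0.05 * x + random.gauss(0, 3) for x in tv]


def cov(a, b):
    """Sample covariance; cov(a, a) is the sample variance."""
    ma, mb = fmean(a), fmean(b)
    return sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b)) / (len(a) - 1)


# OLS slope as covariance over variance, intercept from the sample means
beta1 = cov(tv, sales) / cov(tv, tv)
beta0 = fmean(sales) - beta1 * fmean(tv)

# Least-squares conditions: residuals sum to zero and are orthogonal to X
resid = [y - beta0 - beta1 * x for x, y in zip(tv, sales)]
print(round(beta0, 3), round(beta1, 4))
print(abs(sum(resid)) < 1e-6, abs(sum(r * x for r, x in zip(resid, tv))) < 1e-6)
```

No linear model is assumed for the formula itself; the simulation merely supplies data on which to compute it.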
Linear regression for predictive analysis

Assume that the relationship between sales and TV ad is linear
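Or, without assuming linearity, OLS still gives the best linear predictor of sales (minimum mean squared error among linear functions of TV ad); the Gauss-Markov theorem, under which OLS is the Best Linear Unbiased Estimator (BLUE), is a separate result that does assume a linear model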
Either way, we can use TV ad to predict out-of-sample market
$$\hat{Y}_i = 7.033 + 0.048 X_i$$
A market where $TV_i = 100$ → $\widehat{sales}_i = 11.833$
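The prediction is plain arithmetic on the fitted line. A minimal sketch, using the coefficients reported on the slide (the function name is an assumption):

```python
def predict_sales(tv: float, b0: float = 7.033, b1: float = 0.048) -> float:
    """Predicted sales for a market with TV advertising budget `tv`."""
    return b0 + b1 * tv


print(round(predict_sales(100), 3))  # 11.833
```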
Linear regression for causal inference

But may want the model to do even more - causality & “what-if” counterfactuals
What happens to sales in a particular market if TV ad were increased?
Doesn’t our regression answer this question?
At least two threats to interpreting the regression causally
Confounding variables leading to omitted variable bias
What could be the confounding variables?
Reverse Causality
What if markets with low sales increase advertising?
Correlation is not causation

If X and Y are correlated, what could be the true relationship (or, more formally, the data generating process (DGP))?
X could cause Y (causality we are looking for)
or Y could cause X (threat of reverse causality)
or X could cause Y , and C could cause both X and Y (threat of confounding)
or no causation, but C could cause both X and Y (threat of confounding)
or D, E, F …
Q2: What if X and Y are not correlated?
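The pure-confounding case (no causation, but C causes both X and Y) can be illustrated with a small simulation. The data generating process below is an assumption made up for this sketch: X has no effect on Y at all, yet the regression slope of Y on X is far from zero until C is partialled out.

```python
import random
from statistics import fmean


def slope(x, y):
    """OLS slope from a simple regression of y on x."""
    mx, my = fmean(x), fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


def residualize(v, c):
    """Residuals from regressing v on c (partialling out the confounder)."""
    b = slope(c, v)
    mv, mc = fmean(v), fmean(c)
    return [vi - mv - b * (ci - mc) for vi, ci in zip(v, c)]


random.seed(1)
n = 5000
c = [random.gauss(0, 1) for _ in range(n)]
x = [ci + random.gauss(0, 1) for ci in c]        # C -> X
y = [2 * ci + random.gauss(0, 1) for ci in c]    # C -> Y; X has NO effect on Y

naive = slope(x, y)                                        # picks up the C channel
adjusted = slope(residualize(x, c), residualize(y, c))     # C held fixed

print(round(naive, 2), round(adjusted, 2))
```

In this design the naive slope is close to 1 in large samples, while the slope after removing C is close to the true causal effect of 0. In real data the confounder is often unobserved, so this adjustment is not available.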
4. Introduction to causal inference
Causality & notion of ceteris paribus

Definition of causal effect of X on Y
How does variable Y change if X is changed but all other relevant factors are held constant?
In evaluating an intervention or policy change think of counterfactual outcomes
e.g. A person’s wage with & without higher education
Important to define the causal effect of interest
Useful to describe how an experiment would have to be designed to infer the causal effect in question
See this NSW doc and Netflix blog for a summary of experiment or A/B test
Experiment 1
Impact of back-to-work program on employment
“If a person from the population of those looking for work is given access to a back-to-work program, will that increase their chance of employment?”
Implicit assumption: all other factors that influence employment (experience, ability, local employment
prospects…) are held fixed
Experiment:
Choose a group of workers looking for work
Randomly assign them to access the program or not
Compare employment outcomes in next period
Experiment works because of randomness: characteristics of people are independent of whether they receive the program or not
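The three steps of the experiment can be sketched as a simulation. Everything numeric here is an assumption for illustration: the role of experience, the baseline employment probability, and a program effect of 10 percentage points.

```python
import random
from statistics import fmean

random.seed(42)
n = 10_000
TRUE_EFFECT = 0.10  # assumed effect of the program on employment probability

# Step 1: a group of job seekers with a characteristic that also affects employment
experience = [random.uniform(0, 20) for _ in range(n)]

# Step 2: random assignment (a fair coin flip, unrelated to experience)
treated = [random.random() < 0.5 for _ in range(n)]


# Step 3: employment outcome in the next period
def employed(exp, d):
    p = 0.30 + 0.02 * exp + TRUE_EFFECT * d   # stays between 0.30 and 0.80
    return 1 if random.random() < p else 0


outcome = [employed(e, d) for e, d in zip(experience, treated)]


def group_mean(values, flag):
    """Mean of `values` among units whose treatment status equals `flag`."""
    return fmean(v for v, d in zip(values, treated) if d == flag)


# Randomization balances characteristics across the two groups
balance_gap = group_mean(experience, True) - group_mean(experience, False)
# so the difference in employment rates reflects the program alone
effect_hat = group_mean(outcome, True) - group_mean(outcome, False)

print(round(balance_gap, 2), round(effect_hat, 3))
```

The gap in average experience between the two groups is close to zero, which is what makes the simple comparison of employment rates informative about the program.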
RCT evaluating back-to-work program
Haynes et al. (2012)
Experiment 2
A/B testing of a website landing page
“If a business rearranges its current website (UI), by how much will this change customer behaviors such as consumption, time spent on the website…?”
Implicit assumption: all other factors that influence customer behaviors are held fixed
A/B test by Netflix
Experiment 3
Measuring returns to education
“If a person is given another year of education, by how much will his or her wage increase?”
Implicit assumption: experience, family background, intelligence etc. are held fixed
How would you design an experiment of this?
What are the difficulties in your experiment?
Experiments in a Big Data world

Experiments by firms are common

Economists also run experiments

But most empirical analysis in Economics relies on non-experimental/observational data
Experiments can be expensive, and economists rarely have the funds that firms do
Experiments can be unethical, such as random assignment of education years
It is more challenging to identify causal effect in observational data
The difficulty is usually called “selection bias”
Next, need to formalize how we think about impact of a treatment
Potential outcome framework

Previous examples included various types of treatment: advertising, lockdown, raising interest rate…
Will concentrate on binary treatment or policy
Let $D_i$ represent a binary treatment
Customer saw new website ($D_i = 1$) or old ($D_i = 0$)
Person completed a university degree ($D_i = 1$) or not ($D_i = 0$)
In experiments, treatments applied by chance
In observational data, treatments are more likely to be applied by choice
Deciding to go to university is a choice and unlikely to be as if randomly assigned leading to a selection problem
Consider two potential states of the world
Two potential outcomes depending on the treatment status
$Y_i(1)$ if treated
$Y_i(0)$ if untreated
Unit level treatment/causal effect is
$$TE_i = Y_i(1) - Y_i(0)$$
We observe $Y_i = Y_i(1)$ if $D_i = 1$, and $Y_i = Y_i(0)$ if $D_i = 0$

I am having a headache, and I need to make the decision whether to take pill or not
Binary treatment $D_i$: take pill or not
Graph by Brady Neal

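[Figure, built up over several slides: the same person in two parallel worlds, one with $D_i = 1$ and one with $D_i = 0$]
One possibility: $D_i = 1$ gives $Y_i(D_i) = Y_i(1) = 1$, while $D_i = 0$ gives $Y_i(D_i) = Y_i(0) = 0$
Another possibility: $D_i = 1$ gives $Y_i(D_i) = Y_i(1) = 1$, while $D_i = 0$ gives $Y_i(D_i) = Y_i(0) = 1$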
Q3: For any individual, the treatment effect cannot be identified/estimated. Why?
Q4: What if we can observe more individuals?

The true status of the two parallel worlds
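| i | D | Y | Y(1) | Y(0) | Y(1) − Y(0) |
|---|---|---|---|---|---|
| 1 | 0 | 0 | 0 | 0 | 0 |
| 2 | 1 | 1 | 1 | 1 | 0 |
| 3 | 1 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 |
| 5 | 0 | 1 | 0 | 1 | -1 |
| 6 | 1 | 1 | 1 | 0 | 1 |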
With this population, we can define the Average treatment effect (ATE)
$$ATE = E[TE_i] = E[Y_i(1) - Y_i(0)] = ?$$

We can only observe
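| i | D | Y | Y(1) | Y(0) | Y(1) − Y(0) |
|---|---|---|---|---|---|
| 1 | 0 | 0 |   | 0 | ? |
| 2 | 1 | 1 | 1 |   | ? |
| 3 | 1 | 0 | 0 |   | ? |
| 4 | 0 | 0 |   | 0 | ? |
| 5 | 0 | 1 |   | 1 | ? |
| 6 | 1 | 1 | 1 |   | ? |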
Q3: This is why the treatment effect cannot be identified/estimated at the individual level
What about Q4?

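| i | D | Y | Y(1) | Y(0) | Y(1) − Y(0) |
|---|---|---|---|---|---|
| 1 | 0 | 0 |   | 0 | ? |
| 2 | 1 | 1 | 1 |   | ? |
| 3 | 1 | 0 | 0 |   | ? |
| 4 | 0 | 0 |   | 0 | ? |
| 5 | 0 | 1 |   | 1 | ? |
| 6 | 1 | 1 | 1 |   | ? |
| Observed mean |   |   | 2/3 | 1/3 |   |

Try to use the group means we have (a feasible estimator)
$$2/3 - 1/3 = 1/3 \neq ATE$$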
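The numbers in the two tables can be reproduced directly. The sketch below copies the six units from the "two parallel worlds" table and compares the infeasible ATE with the feasible difference in observed group means. The last two quantities use the standard decomposition of the naive difference into the effect on the treated plus a baseline difference in $Y_i(0)$; this goes slightly beyond the slide and is included only as an illustration.

```python
from fractions import Fraction

# (i, D, Y(1), Y(0)) copied from the "two parallel worlds" table
units = [
    (1, 0, 0, 0),
    (2, 1, 1, 1),
    (3, 1, 0, 0),
    (4, 0, 0, 0),
    (5, 0, 0, 1),
    (6, 1, 1, 0),
]


def mean(values):
    """Exact mean as a fraction."""
    values = list(values)
    return Fraction(sum(values), len(values))


# Infeasible: needs both potential outcomes for every unit
ate = mean(y1 - y0 for _, d, y1, y0 in units)

# Feasible: uses only the observed outcome Y = D*Y(1) + (1-D)*Y(0)
observed = [(d, d * y1 + (1 - d) * y0) for _, d, y1, y0 in units]
treated_mean = mean(y for d, y in observed if d == 1)
control_mean = mean(y for d, y in observed if d == 0)
naive = treated_mean - control_mean

# Decomposition: naive = effect on the treated + baseline difference in Y(0)
att = mean(y1 - y0 for _, d, y1, y0 in units if d == 1)
bias = mean(y0 for _, d, _, y0 in units if d == 1) - mean(y0 for _, d, _, y0 in units if d == 0)

print(ate, treated_mean, control_mean, naive)   # 0 2/3 1/3 1/3
print(att, bias)                                # 1/3 0
```

In this small population the naive comparison recovers the average effect among the treated, which differs from the ATE because the units that were treated are not representative of everyone.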

Formally, the observed outcome can be written as
$$Y_i = D_i Y_i(1) + (1 - D_i) Y_i(0)$$
What we did on the previous slide is
$$E[Y_i | D_i = 1] - E[Y_i | D_i = 0] = 1/3$$
And we obtained the conclusion that
$$E[Y_i(1) - Y_i(0)] \neq E[Y_i | D_i = 1] - E[Y_i | D_i = 0]$$
The RHS is the estimator we usually use in experiments, right? Why is it not working in this case?

Because the treatment is not randomly assigned! Otherwise, we would have
$$\begin{aligned}
& E[Y_i | D_i = 1] - E[Y_i | D_i = 0] \\
&= E[D_i Y_i(1) + (1 - D_i) Y_i(0) | D_i = 1] - E[D_i Y_i(1) + (1 - D_i) Y_i(0) | D_i = 0] \\
&= E[Y_i(1) | D_i = 1] - E[Y_i(0) | D_i = 0] \\
&= E[Y_i(1)] - E[Y_i(0)] \quad \text{by random treatment}
\end{aligned}$$
So for Q4,
We can estimate ATE in experiments
But we cannot in observational data; this is the challenge we will deal with in later lectures
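The contrast between assignment by chance and assignment by choice can be illustrated with simulated potential outcomes. The distributions below, and the rule that people opt in when their own gain exceeds 1, are assumptions made up for this sketch.

```python
import random
from statistics import fmean

random.seed(7)
n = 20_000

# Potential outcomes with heterogeneous effects; the true ATE is about 1.0 by construction
y0 = [random.gauss(0, 1) for _ in range(n)]
te = [random.gauss(1, 1) for _ in range(n)]
y1 = [a + b for a, b in zip(y0, te)]
ate = fmean(te)


def diff_in_means(d):
    """E[Y | D = 1] - E[Y | D = 0] using only observed outcomes."""
    y = [di * a + (1 - di) * b for di, a, b in zip(d, y1, y0)]
    treated = [yi for yi, di in zip(y, d) if di == 1]
    control = [yi for yi, di in zip(y, d) if di == 0]
    return fmean(treated) - fmean(control)


d_random = [random.randint(0, 1) for _ in range(n)]   # treatment assigned by chance
d_choice = [1 if t > 1 else 0 for t in te]            # treatment chosen by those who gain most

print(round(ate, 2), round(diff_in_means(d_random), 2), round(diff_in_means(d_choice), 2))
```

Under random assignment the difference in means is close to the ATE; under self-selection the same estimator overstates it, because the treated group consists of those with the largest gains.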
Linear regression in Potential outcome framework

How does the familiar linear regression fit into our discussion?
The observed outcome
$$Y_i = D_i Y_i(1) + (1 - D_i) Y_i(0) = Y_i(0) + (Y_i(1) - Y_i(0)) D_i$$
Need some assumptions to transform this into a linear model
Assume $Y_i(0) = \alpha + \epsilon_i(0)$
Assume treatment effect is constant across individuals, that is $TE_i = Y_i(1) - Y_i(0) = \tau$
Then we have a linear model
$$Y_i = \alpha + \tau D_i + \epsilon_i(0)$$
$\tau$ can be estimated in experiments because the exogeneity assumption in linear regression holds:
$$E[\epsilon_i | D_i] = E[\epsilon_i(0) | D_i] = 0$$
which is implied by the random assignment of treatment in experiments
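A short simulation shows the linear model at work under random assignment. The values of $\alpha$ and $\tau$ and the normal error are assumptions for illustration. With a binary regressor, the OLS slope coincides with the difference in group means.

```python
import random
from statistics import fmean

random.seed(3)
n = 10_000
ALPHA, TAU = 2.0, 0.5   # assumed values for illustration

d = [random.randint(0, 1) for _ in range(n)]            # random assignment
eps0 = [random.gauss(0, 1) for _ in range(n)]           # eps_i(0), drawn independently of D
y = [ALPHA + TAU * di + e for di, e in zip(d, eps0)]    # Y_i = alpha + tau*D_i + eps_i(0)

# OLS slope and intercept from regressing Y on D
d_bar, y_bar = fmean(d), fmean(y)
tau_hat = (
    sum((di - d_bar) * (yi - y_bar) for di, yi in zip(d, y))
    / sum((di - d_bar) ** 2 for di in d)
)
alpha_hat = y_bar - tau_hat * d_bar

# With a binary regressor, the OLS slope equals the difference in group means
diff_means = (
    fmean(yi for yi, di in zip(y, d) if di == 1)
    - fmean(yi for yi, di in zip(y, d) if di == 0)
)

print(round(alpha_hat, 2), round(tau_hat, 2), round(diff_means, 2))
```

Because `eps0` is generated independently of `d`, the exogeneity condition holds by construction and `tau_hat` lands near the assumed $\tau$.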

Summary and Future Plans

Discussed the concept of Big Data
Difference between prediction and causal inference
The role of experiments in causal inference
Difficulties in causal inference with observational data
In Weeks 2 and 3, we introduce two machine learning methods
Prediction problems in Econometrics with Big Data
Also preparation for the later discussion of new causal inference methods in Econometrics
Come back to causal inference after Week 3
Causal inference with machine learning methods
References
Athey, Susan, and Guido W. Imbens. 2019. “Machine Learning Methods That Economists Should Know About.” Annual Review of Economics 11: 685–725.
Haynes, Laura, Ben Goldacre, David Torgerson, et al. 2012. “Test, Learn, Adapt: Developing Public Policy with Randomised Controlled Trials.” Cabinet Office-Behavioural Insights Team.