
Econometrics and Machine Learning in Economics

- Topics 2 and 3 -

Probability Models and Data Generating Processes

Higher School of Economics

A. Duplinskiy
What we have done so far

1 Reviewed some concepts from Time-Series Econometrics

2 Stationarity, Conditional vs Unconditional moments
This week

Ergodicity
Assignment
Multivariate Regression: more than one feature
Clustering: K-Means, Hierarchical, Affinity Propagation
Dimensionality Reduction: PCA, Lasso-Ridge
Logistic Regression

Consider checking out
https://online.stanford.edu/courses/sohs-ystatslearning-statistical-learning
Next Week

1 Advanced time-series models

The models you will implement in your work!
Private companies, central banks, governments, public institutions, research institutions, universities
Simple linear models everyone can estimate! (click buttons)
Econometricians (are expected to) do more!
Econometricians are supposed to be specialists in cutting-edge, state-of-the-art models (not available in software packages).

2 Machine Learning vs Econometrics:
Cases when one has to be careful with ML
Experimental vs observational data
Predictive vs causal models
Homework: Lecture 1

Solutions to Selected Exercises


Exercises 1.2 and 1.3

Exercise 1.2: Make a Venn diagram specifying the relation between the following types of stochastic processes:
(a) iid processes (b) weakly stationary processes (c) strictly stationary processes

Exercise 1.3: Give examples of time series that characterize each set (including each intersection and union) in the Venn diagram elaborated in the previous question.
Exercises 1.2 and 1.3

(1) WS, SS, IID
(2) WS, SS, IID
(3) WS, SS, IID
(4) WS, SS, IID
(5) WS, SS, IID
Exercises 1.2 and 1.3

(1) {Xt} iid with Xt ∼ N(µ, σ²) for every t ≤ t* and Xt ∼ t(µ, σ²) for every t > t*

(2) {Xt} iid with Xt ∼ N(µ, σ²) for every t

(3) (Xt, Xt−1) ∼ N(µ, Σ) for every t, where µ = [0, 0] and $\Sigma = \begin{bmatrix} 1 & 0.5 \\ 0.5 & 1 \end{bmatrix}$

(4) {Xt} iid with Xt ∼ Cauchy for every t

(5) (Xt, Xt−1) ∼ t(µ, Σ, λ) for every t, where µ = [0, 0], $\Sigma = \begin{bmatrix} 1 & 0.5 \\ 0.5 & 1 \end{bmatrix}$, and λ < 2.
Exercise 1.6
Exercise 1.6: Show that ρX (t + h, t) = ρX (h) if the time series
is weakly stationary.
Answer: The ACF is given by

$$\rho_X(t+h, t) = \frac{\mathrm{Cov}(X_{t+h}, X_t)}{\sqrt{\mathrm{Var}(X_{t+h})\,\mathrm{Var}(X_t)}}$$

If {Xt} is weakly/strictly stationary, then, for every (t, h),

$$\mathrm{Cov}(X_{t+h}, X_t) = \gamma_X(h) \quad \text{and} \quad \mathrm{Var}(X_{t+h}) = \mathrm{Var}(X_t) = \sigma_X^2$$

Therefore, we can re-write

$$\rho_X(t+h, t) = \frac{\gamma_X(h)}{\sqrt{\sigma_X^2\,\sigma_X^2}} = \frac{\gamma_X(h)}{\sigma_X^2} = \frac{\gamma_X(h)}{\gamma_X(0)} = \rho_X(h)$$
Exercise 1.11

Exercise 1.11: Derive the autocorrelation function of the random walk starting at t = 1:

$$X_t = \varepsilon_1 + \varepsilon_2 + \ldots + \varepsilon_t \quad \forall\, t \in \mathbb{N}, \quad \text{where } \{\varepsilon_t\} \sim WN(0, \sigma_\varepsilon^2).$$

Recall: Var(Xt) = tσε² and Cov(Xt, Xt−h) = (t − h)σε². Hence:

$$\rho_X(t, t-h) = \frac{\gamma_X(t, t-h)}{\sqrt{\mathrm{Var}(X_t)\,\mathrm{Var}(X_{t-h})}} = \frac{(t-h)\sigma_\varepsilon^2}{\sqrt{(t\sigma_\varepsilon^2)\,((t-h)\sigma_\varepsilon^2)}} = \sqrt{\frac{t-h}{t}}$$

The correlations between elements of {Xt} change over time!
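As a quick sanity check (not part of the original exercise), the derived autocorrelation can be verified by simulation; below is a minimal sketch assuming Gaussian white noise and the hypothetical choice t = 50, h = 10.

```python
import numpy as np

rng = np.random.default_rng(0)
n_sims, t, h = 100_000, 50, 10

# Simulate many random-walk paths X_t = eps_1 + ... + eps_t
eps = rng.standard_normal((n_sims, t))
X = eps.cumsum(axis=1)

# Sample correlation between X_t and X_{t-h} across simulated paths
sample_corr = np.corrcoef(X[:, t - 1], X[:, t - h - 1])[0, 1]
theory_corr = np.sqrt((t - h) / t)
print(f"sample: {sample_corr:.3f}, theory: {theory_corr:.3f}")  # both close to 0.894
```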


Exercise 1.17
(a) Xt = a + bZt + cZt−2, with {Zt} ∼ NID(0, σ²)

E(Xt) = E(a + bZt + cZt−2)
      = E(a) + E(bZt) + E(cZt−2)
      = a + bE(Zt) + cE(Zt−2)
      = a + b · 0 + c · 0 = a.

Var(Xt) = Var(a + bZt + cZt−2)
        = Var(a) + Var(bZt) + Var(cZt−2)
        = 0 + b²Var(Zt) + c²Var(Zt−2)
        = (b² + c²)σ².
Exercise 1.17
(a) (continued)

Cov(Xt, Xt−h) = Cov(a + bZt + cZt−2, a + bZt−h + cZt−h−2)
             = b²Cov(Zt, Zt−h) + bcCov(Zt, Zt−h−2)
               + cbCov(Zt−2, Zt−h) + c²Cov(Zt−2, Zt−h−2)

Now, since Cov(Zt, Zt−h) = 0 ∀ h ≠ 0, we have
γ(0) = (b² + c²)σ²
γ(1) = 0
γ(2) = bcσ²
γ(h) = 0 ∀ h > 2.

Since expectation, variance and covariance are all finite and constant over time, we conclude that {Xt} is weakly stationary.
Exercise 1.18

Exercise 1.18: Suppose {Xt} and {Yt} are uncorrelated weakly stationary sequences. Show that {Xt + Yt} is weakly stationary.

Let {Wt} := {Xt + Yt}. Then,

E(Wt) = E(Xt + Yt) = E(Xt) + E(Yt) = µX + µY

Var(Wt) = Var(Xt + Yt) = Var(Xt) + Var(Yt) = σ²X + σ²Y

Cov(Wt, Wt−h) = Cov(Xt + Yt, Xt−h + Yt−h)
             = Cov(Xt, Xt−h) + Cov(Xt, Yt−h) + Cov(Yt, Xt−h) + Cov(Yt, Yt−h)
             = γX(h) + 0 + 0 + γY(h) = γX(h) + γY(h)

Since expectation, variance and covariance are all finite and constant over time, we conclude that {Wt} is weakly stationary.
Useful info

Note: groups should have three or four students. No more, no less. (Contact me if you do not have partners.)
Deadline for forming groups: Sunday, November 1
Assignment deadlines:
Parts 1 and 2: Monday, December 14, at 18:30.

Important: failure to deliver on time implies zero points on that part!

Good grade: answers should be correct, clear and complete. Deliver a report with appropriate justifications and insightful comments and remarks. Think about your report: too much information is unreadable; too little information is ill-advised.


Assignment and Software

Software: the assignment should be completed in Python or any other software (Matlab, R).

Topics: Clustering, Text Processing

Doing extra: go ahead, but make sure you know what you are doing.


Autosuggestions

About this assignment:

In this assignment you have data about IDs and when and where they were issued. Do not worry: there is no ID number, and I modified the data!

Assignment objectives:
1 Work with text data and learn how to cluster it.
2 Get a feeling for how to calculate the value of your work.
3 Visualize data and use clustering for data cleaning.


Id data

Load the data and process it

Visualise data

Visualise cleaned data


Autosuggestion accuracy

Figure: Proportion of entries whose error is smaller than the value on the x-axis, measured by Levenshtein distance. Axes: x = Levenshtein distance, y = proportion of entries with error smaller than x.
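To make the error metric concrete, here is a minimal hedged sketch (not the assignment's actual code; the column names suggestion and truth are hypothetical) that computes Levenshtein distances with the standard dynamic-programming recursion and the proportion of entries below each threshold.

```python
import pandas as pd

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

# Hypothetical data: suggested vs true issuing-place strings
df = pd.DataFrame({"suggestion": ["MOSCOW", "ST PETERSBURG", "KAZAN"],
                   "truth":      ["MOSCOW", "SAINT PETERSBURG", "KAZAN"]})
df["error"] = [levenshtein(s, t) for s, t in zip(df["suggestion"], df["truth"])]

# Proportion of entries with error smaller than each threshold x (the y-axis of the figure)
for x in range(0, 6):
    print(x, (df["error"] < x).mean())
```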



Multivariate regression

Extension of the univariate regression:

Strictly speaking, the univariate models we worked with often already have two variables: a regressor and a constant.
Mathematically very similar: the same techniques are used for the analysis.
Typically the same packages, but more things to look at.
statsmodels is more statistics-driven; sklearn is more machine-learning oriented.
Similar functionality, slightly different focus and notation (see the sketch below).
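As an illustration of the two interfaces, here is a minimal hedged sketch fitting a two-regressor model with both statsmodels and sklearn; the data are simulated and the variable names (population, horeca, shops) are only placeholders for the shops example that follows.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from sklearn.linear_model import LinearRegression

# Simulated stand-ins for the shops/population/horeca example below
rng = np.random.default_rng(1)
df = pd.DataFrame({"population": rng.uniform(1e3, 1e5, 200),
                   "horeca": rng.poisson(30, 200)})
df["shops"] = 0.002 * df["population"] + 1.5 * df["horeca"] + rng.normal(0, 10, 200)

# statsmodels: explicit constant, rich statistical summary
X_sm = sm.add_constant(df[["population", "horeca"]])
ols_fit = sm.OLS(df["shops"], X_sm).fit()
print(ols_fit.summary())

# sklearn: intercept handled by the estimator, focus on prediction
lr = LinearRegression().fit(df[["population", "horeca"]], df["shops"])
print(lr.intercept_, lr.coef_)
```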



Multivariate regression examples

Number of Shops on Population and Number of Horeca Establishments

Load Packages


Fit Plot

Multivariate Regression

Plot fit

Multivariate Plot

Why is this mess here?

Multivariate Scatter Plot

Multivariate Scatter Plot

Residuals Plot


Clustering

What is it? Why do we do it?

Clustering refers to a very broad set of techniques for finding subgroups, or clusters, in a data set.
We want to partition the data into distinct groups based on the information we have, so that similar observations (e.g. similar shoes) end up in the same group and dissimilar ones in different groups.
What does it mean for shoes to be similar or different?
In non-trivial cases, we need to combine categorical and numerical features and use domain-specific knowledge.


Clustering

What is it? Why do we do it?

An alternative to moving to more sophisticated models: instead of going to semi-nonparametric or non-parametric methods such as ANNs, we can split the data into groups and use a simpler model for each of them.
Netflix and Spotify recommendations are essentially a sophisticated version of two-sided clustering: find people with similar tastes, find products with similar properties.

Clustering and recommender systems


Clustering

What is it? Why do we do it?

An alternative to moving to more sophisticated models: instead of going to semi-nonparametric or non-parametric methods such as ANNs, we can split the data into groups and use a simpler model for each of them.
Netflix and Spotify recommendations are essentially a sophisticated version of two-sided clustering: find people with similar tastes, find products with similar properties.
Clustering also helps with the cold-start problem: if time-series data is not available, how do we predict sales of a new article?
Imagine a map and us trying to guess the value of a property based on its k nearest neighbors.


Clustering Example

Sales and Product data of Shoes:

Suppose we have a large number of shoe characteristics (e.g. number of shoes sold last year, average price, product category (e.g. running, football, outdoor), and so forth) for a large number of shoes.
Our purpose is to group similar shoes together to make a separate demand/promotion model for each group.
This task can be linked to user segmentation: similar people like similar shoes (related to recommender systems).

Clustering Result Plot


K-means Clustering Details

Goal:
Assign a unique index to each observation.
To minimize the within-cluster variation given the number of clusters:

$$WCV(C_k) = \frac{1}{|C_k|} \sum_{x_i, x_{i'} \in C_k} \sum_{j=1}^{p} (x_{ij} - x_{i'j})^2$$


K-means Clustering Details

Procedure:
1 Randomly assign a number, from 1 to K, to each of the
observations. These serve as initial cluster assignments for
the observations.
2 Iterate until the cluster assignments stop changing:
1 For each of the K clusters, compute the cluster centroid.
The kth cluster centroid is the vector of the p feature means
for the observations in the kth cluster.
2 Assign each observation to the cluster whose centroid is
closest (where closest is defined using Euclidean distance).
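A minimal sklearn sketch of this procedure, assuming a numeric feature matrix (simulated here) and K = 3; note that sklearn's KMeans uses 'k-means++' starting centroids rather than purely random initial assignments.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)
# Simulated stand-in for the shoe feature matrix (units sold, average price, ...)
X = np.vstack([rng.normal(loc, 1.0, size=(100, 2)) for loc in (0, 5, 10)])

# Scale features first, then run K-means with K = 3
X_scaled = StandardScaler().fit_transform(X)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_scaled)

print(kmeans.labels_[:10])        # cluster index of each observation
print(kmeans.cluster_centers_)    # centroids in scaled feature space
print(kmeans.inertia_)            # total within-cluster sum of squares
```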



Example by Robert Tibshirani and Trevor Hastie

Figure: Steps of the K-means algorithm. Source: Statistical Learning at https://online.stanford.edu/
Breaking down steps

The progress of the K-means algorithm with K = 3.

Top left: scatter plot of the data.
Top center: Step 1 – randomly assign each observation to a cluster.
Top right: Step 2(a) – compute the cluster centroids.
Bottom left: Step 2(b) – assign each observation to the nearest centroid.
Bottom center: Step 2(a) – compute new cluster centroids.
Bottom right: results after 10 iterations.


Scale Data

Residuals Plot

Residuals Plot

K-means code

Scatter Plot

Clustering Result Plot


K-means Clustering. How to Choose K?

No simple answer:
Number given by domain knowledge.
Elbow method.
Silhouette Score.



Elbow Method



Elbow Method
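Since the original code slides are not reproduced here, below is a hedged sketch of the elbow method on simulated, scaled data: fit K-means for a range of K and plot the total within-cluster sum of squares (inertia), looking for the "elbow" where improvements level off.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# Re-create a scaled feature matrix; three blobs as a stand-in for real data
rng = np.random.default_rng(3)
X_scaled = StandardScaler().fit_transform(
    np.vstack([rng.normal(loc, 1.0, size=(100, 2)) for loc in (0, 5, 10)]))

ks = range(1, 11)
inertias = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_scaled).inertia_
            for k in ks]

plt.plot(ks, inertias, marker="o")
plt.xlabel("number of clusters K")
plt.ylabel("within-cluster sum of squares (inertia)")
plt.title("Elbow method")
plt.show()
```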



Silhouette Score k = 2

Silhouette Score k = 3

Silhouette Score k = 4

Code For Silhouette Score. Part 1

Code For Silhouette Score. Part 2
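The slide's own code is not reproduced, so here is a hedged sketch of the silhouette analysis on the same kind of simulated data; sklearn's silhouette_score averages, for each observation, how close it is to its own cluster relative to the nearest other cluster.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(4)
X_scaled = StandardScaler().fit_transform(
    np.vstack([rng.normal(loc, 1.0, size=(100, 2)) for loc in (0, 5, 10)]))

# Average silhouette: close to +1 means well-separated clusters, near 0 means overlap
for k in (2, 3, 4):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_scaled)
    print(k, round(silhouette_score(X_scaled, labels), 3))
```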



Bonus

Get Cluster Labels

Scatter Plot

Compare with K-means?


Two types of Clustering

Top-down and Bottom-Up:


In K-means clustering, we seek to partition the
observations into a pre-specified number of clusters.
In hierarchical clustering, we do not know in advance how
many clusters we want; in fact, we end up with a tree-like
visual representation of the observations, called a
dendrogram, that allows us to view at once the clusterings
obtained for each possible number of clusters, from 1 to n.
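A hedged sketch of bottom-up (agglomerative) hierarchical clustering and its dendrogram using scipy on simulated data; the Ward linkage used here is just one common choice.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

rng = np.random.default_rng(5)
X = np.vstack([rng.normal(loc, 1.0, size=(20, 2)) for loc in (0, 5, 10)])

# Agglomerative clustering: repeatedly merge the closest clusters (Ward linkage)
Z = linkage(X, method="ward")

# The dendrogram shows the clusterings for every possible number of clusters at once
dendrogram(Z)
plt.title("Hierarchical clustering dendrogram")
plt.show()

# Cut the tree into 3 clusters if a fixed number is needed afterwards
labels = fcluster(Z, t=3, criterion="maxclust")
print(labels)
```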



Affinity Prop

Based on distance matrix

Affinity Prop

What are the clusters here?

Affinity Prop

What are the clusters here?

Affinity Prop

Compare with K-means?

https://scikit-learn.org/stable/modules/clustering.html
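A hedged sketch of Affinity Propagation in scikit-learn on simulated data; unlike K-means, the number of clusters is not fixed in advance but emerges from the pairwise similarities and the preference parameter.

```python
import numpy as np
from sklearn.cluster import AffinityPropagation

rng = np.random.default_rng(6)
X = np.vstack([rng.normal(loc, 1.0, size=(50, 2)) for loc in (0, 5, 10)])

# Affinity Propagation exchanges messages based on pairwise similarities;
# the number of clusters is chosen by the algorithm itself
ap = AffinityPropagation(random_state=0).fit(X)

print("estimated clusters:", len(ap.cluster_centers_indices_))
print(ap.labels_[:10])
```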



Logit

Implementation:
The dependent variable is binary, ∈ {0, 1}; we model the probability of success.
Use statsmodels and sklearn.


Logit with sm ols



Logit with sm ols

Get Odds Ratio and Confidence intervals
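A minimal hedged sketch of a logit fit in statsmodels (the slide's own code is not reproduced; the data are simulated); odds ratios and their confidence intervals are obtained by exponentiating the coefficients and the interval bounds.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(7)
x1, x2 = rng.normal(size=500), rng.normal(size=500)
p = 1 / (1 + np.exp(-(0.5 + 1.2 * x1 - 0.8 * x2)))   # true success probability
y = rng.binomial(1, p)

X = sm.add_constant(pd.DataFrame({"x1": x1, "x2": x2}))
logit_fit = sm.Logit(y, X).fit()
print(logit_fit.summary())

# Odds ratios and their confidence intervals: exponentiate coefficients and bounds
print(np.exp(logit_fit.params))
print(np.exp(logit_fit.conf_int()))
```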



Logit with sklearn
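A hedged sklearn counterpart on the same kind of simulated data, including accuracy and AUC as performance measures; note that sklearn's LogisticRegression applies L2 regularization by default, so its coefficients differ slightly from an unpenalized statsmodels fit.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, roc_auc_score

rng = np.random.default_rng(8)
X = rng.normal(size=(500, 2))
p = 1 / (1 + np.exp(-(0.5 + 1.2 * X[:, 0] - 0.8 * X[:, 1])))
y = rng.binomial(1, p)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# L2-regularized logistic regression (sklearn default)
clf = LogisticRegression().fit(X_train, y_train)

pred = clf.predict(X_test)                 # class labels
prob = clf.predict_proba(X_test)[:, 1]     # predicted probability of success
print("accuracy:", accuracy_score(y_test, pred))
print("AUC:", roc_auc_score(y_test, prob))
```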



Logit Performance measures



Area under Curve



Area under Curve
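For the area-under-curve slides, a hedged, self-contained sketch (simulated data, in-sample for brevity) that plots the ROC curve and reports the AUC.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, roc_auc_score

rng = np.random.default_rng(9)
X = rng.normal(size=(500, 2))
y = rng.binomial(1, 1 / (1 + np.exp(-(0.5 + 1.2 * X[:, 0] - 0.8 * X[:, 1]))))

prob = LogisticRegression().fit(X, y).predict_proba(X)[:, 1]
fpr, tpr, _ = roc_curve(y, prob)

# ROC curve: true-positive rate against false-positive rate; AUC is the area under it
plt.plot(fpr, tpr, label=f"AUC = {roc_auc_score(y, prob):.2f}")
plt.plot([0, 1], [0, 1], linestyle="--")   # chance line
plt.xlabel("false positive rate")
plt.ylabel("true positive rate")
plt.legend()
plt.show()
```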



Lasso
ML objective plus a penalty term $\lambda \sum_{i=1}^{p} \beta_i^2$ (note: as written, this is the L2/Ridge penalty; Lasso uses the L1 penalty $\lambda \sum_{i=1}^{p} |\beta_i|$).
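A hedged sketch of Lasso (L1) and Ridge (L2) regression in sklearn on simulated data; the penalty strength alpha plays the role of λ above, and Lasso's ability to set coefficients exactly to zero is what makes it useful for feature selection and dimensionality reduction.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(10)
n, p = 200, 10
X = rng.normal(size=(n, p))
# Only the first two features actually matter
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(0, 1, n)

lasso = Lasso(alpha=0.1).fit(X, y)   # L1 penalty: sparse coefficients
ridge = Ridge(alpha=1.0).fit(X, y)   # L2 penalty: coefficients shrunk, not zeroed

print("lasso:", np.round(lasso.coef_, 2))
print("ridge:", np.round(ridge.coef_, 2))
```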

Lasso

Lasso


Chapter 3: Probability models

Some more reading material:

1 Davidson (1994), “Stochastic Limit Theory”, Chapters 1.6, 2.3, 3.1 and 7.1
2 Billingsley (1995), “Probability and Measure”, Chapters 2 and 5
3 White (1996), “Estimation, Inference and Specification Analysis”, Chapters 2.1, 2.2 and 20
4 Fan and Yao (2005), “Nonlinear Time-Series”, Chapter 1.3


Probability spaces and random variables

Note: we need to learn some basic concepts of set theory and measure theory in order to find an appropriate definition of probability model and data generating process.

A very brief history of 20th-century mathematics:
1 Late 19th and early 20th centuries: foundations!
2 Gottlob Frege attempted to give proper definitions of numbers, functions and variables. What is the number 2 after all?
3 Days before publication (after 10 years of work), Bertrand Russell pointed out a ‘small inconsistency’ which turned out to destroy Frege’s work.
4 Bertrand Russell (with Alfred North Whitehead) gave foundations to mathematics in ‘Principia Mathematica’, in 3 volumes.
5 Kurt Gödel’s Incompleteness Theorem shows that any sufficiently rich axiomatic system will always be incomplete or contradictory!


Probability space

A probability space is a triplet (E, F, P) where E is the ‘event space’, F is a σ-field defined on the event space E, and P is a probability measure defined on the σ-field F.

The event space E is the collection of all possible outcomes for the random variable.

The probability measure P defines the probability associated with each event and each collection of events in E.

The σ-field F contains all the relevant collections of events.

Note: P : F → [0, 1] maps elements of F to the interval [0, 1].


Probability space

A probability space is a triplet (E, F, P) where E is the ‘event space’, F is a σ-field defined on the event space E, and P is a probability measure defined on the σ-field F.

Examples of event spaces E:
Coin tosses: E = {heads, tails}
Dice tosses: E = {1, 2, 3, 4, 5, 6}
Gaussian random variable: E = R

Question: Why is P defined on collections of sets in F?
Answer: it describes which sets of events are joint and which are disjoint!


Probability space: Coin toss example
Example: A σ-field F of the event space E = {heads, tails} is

F := { ∅, {heads}, {tails}, {heads, tails} }.

Note:
F contains the empty set ∅
F contains each element of E
F contains the event space E = {heads, tails}

Hence: the probability measure P must define a probability of
Nothing happening: P(∅)
Drawing heads: P({heads})
Drawing tails: P({tails})
Drawing either heads or tails: P({heads, tails})
σ-fields (σ-algebras)

Note: there are certain rules that must be followed for constructing a σ-algebra F.

Banach–Tarski Paradox (1924): it is possible to take a ball, cut it into pieces, and re-arrange those pieces in such a manner as to obtain two balls of the exact same size, with no parts missing! Restricting probability to a σ-algebra of measurable sets avoids this problem!

A σ-field F of a set E is a collection of subsets of E satisfying:
(i) E ∈ F.
(ii) If F ∈ F, then F^c ∈ F.
(iii) If {Fn}n∈N is a sequence of sets in F, then ∪n∈N Fn ∈ F.


Measurable spaces and probability measures

A measurable space is just a pair (E, F) composed of an event space E and a respective σ-algebra F.

A probability measure P defined on a measurable space (E, F) is a function P : F → [0, 1] satisfying:
(i) P(F) ≥ 0 ∀ F ∈ F.
(ii) P(E) = 1.
(iii) If {Fn}n∈N is a collection of pairwise disjoint sets in F, then P(∪n∈N Fn) = Σn∈N P(Fn).


Random variable

Given two measurable spaces (A, FA) and (B, FB), a function f : A → B is said to be measurable if every element b ∈ FB satisfies f⁻¹(b) ∈ FA.

Note: the inverse image map f⁻¹ always exists; it may just not be a function! Do you remember the properties of a function?

Given a probability space (E, F, P) and a measurable space (R, FR), a random variable xt is a measurable map xt : E → R that maps elements of E to the real numbers R.


Random variable

Note: the definition of a random variable is very intuitive!

Note: measurability of xt : E → R implies that we can assign probabilities to each interval R ⊆ R of the real line,

$$P_R(R) = P\big(x_t^{-1}(R)\big) = P\big(\{e \in E : x_t(e) \in R\}\big),$$

and we obtain a new probability space (R, FR, PR).

Note: we can now define the cumulative distribution function F that you know so well as

$$F(a) = P_R(x_t \le a) \quad \forall\, a \in \mathbb{R}.$$

Important: xt is a random variable; xt(e) ∈ R is the realization of the random variable produced by the event e ∈ E.


Random vectors and random elements

Note: The concept of random variable is easy to generalize!

Given a probability space (E, F, P) and a measurable space (R^n, FR^n) with n ∈ N, an n-variate random vector xt is a measurable map xt : E → R^n that maps elements of E to R^n.

Given a probability space (E, F, P) and a measurable space (A, FA), a random element at taking values in A is a measurable map at : E → A that maps elements of E to A.


Is this a random variable? Borel σ-algebra

Important: the definition of a random variable depends on the σ-algebra that one is using!

Question: consider the case where xt is a normal random variable, xt ∼ N(0, σ²). Is xt² also a random variable? How about exp(xt)?

Answer: yes, if we use the Borel σ-algebra! (Émile Borel)

Given a set A, the Borel σ-algebra BA is the smallest σ-algebra containing all open sets of A.


Is this a random variable? Continuous functions

Important: all continuous functions are measurable under the Borel σ-algebra! Any continuous transformation f(xt) of a random variable xt is also a random variable!

Note: it is obvious that all continuous functions are measurable! Just look at the definition of a continuous function.

Let (A, TA) and (B, TB) be topological spaces. A function f : A → B is said to be continuous if its preimage map f⁻¹ maps open sets to open sets; i.e. if for every b ∈ TB we have f⁻¹(b) ∈ TA.


What is a probability model?

Question: what exactly is a model?

Example: given T tosses of a coin, it is reasonable to suppose that x1, ..., xT are realizations of T Bernoulli random variables, xt ∼ Bern(θ), with unknown probability parameter θ ∈ [0, 1].

Important: each θ defines a probability distribution for the random vector (x1, ..., xT) taking values in R^T. Our model is a collection of probability distributions on R^T.

This definition of model is the one you have always been using, even if you did not realize it!


What is a probability model?

Question: what exactly is a model?

Example 2: the Gaussian linear AR(1) model,

$$x_t = \alpha + \beta x_{t-1} + \varepsilon_t, \quad \varepsilon_t \sim N(0, \sigma_\varepsilon^2), \quad \forall\, t \in \mathbb{Z}.$$

Important: each θ = (α, β, σε²) defines a distribution for the time series {xt}t∈Z. Our model is a collection of probability distributions on R^∞.

This definition of model is the one you have always been using, even if you did not realize it!


Probability model

Given a measurable space (E, F) and a parameter space Θ, a probability model is a collection PΘ := {Pθ, θ ∈ Θ} of probability measures defined on F.

Given the measurable space (R^∞, FR^∞) and a parameter space Θ, a probability model is a collection PΘ := {Pθ, θ ∈ Θ} of probability measures defined on FR^∞.


Some more useful definitions...

A probability model PΘ := {Pθ, θ ∈ Θ} is said to be:

‘parametric’ if the parameter space Θ is finite dimensional;
‘nonparametric’ if Θ is infinite dimensional;
‘semi-parametric’ if Θ = Θ1 × Θ2, where Θ1 is finite dimensional and Θ2 is infinite dimensional;
‘semi-nonparametric’ if the parameter space ΘT is indexed by the sample size T, forming a sequence of ‘sieves’ {ΘT}T∈N of increasing dimension.

Given a measurable space (E, F) and two parametric models PΘ := {Pθ, θ ∈ Θ} and P*Θ* := {P*θ*, θ* ∈ Θ*}, we say that model PΘ nests model P*Θ* if and only if P*Θ* ⊆ PΘ.
