
Econometrics and Machine Learning in Economics

- Topics 2 and 3 -

Probability Models and Data Generating Processes

Higher School of Economics

A. Duplinskiy
What we have done so far

1 Reviewed some concepts from Time-Series Econometrics

2 Stationarity, Conditional vs Unconditional moments
This week

Ergodicity
Assignment
Multivariate Regression: more than one feature
Clustering: K-Means, Hierarchical, Affinity Propagation
Dimensionality Reduction: PCA, Lasso-Ridge
Logistic Regression

Consider checking out
https://online.stanford.edu/courses/sohs-ystatslearning-statistical-learning
Next Week

1 Advanced time-series models

The models you will implement in your work!
Private companies, central banks, governments, public institutions, research institutions, universities
Simple linear models everyone can estimate! (click buttons)
Econometricians (are expected to) do more!
Econometricians are supposed to be specialists in cutting-edge, state-of-the-art models (not available in software packages).

2 Machine Learning vs Econometrics:
Cases when one has to be careful with ML
Experimental vs observational data
Predictive vs causal models
Homework: Lecture 1

Solutions to Selected Exercises


Exercises 1.2 and 1.3

Exercise 1.2: Make a Venn diagram specifying the relation between the following types of stochastic processes:
(a) iid processes (b) weakly stationary processes (c) strictly stationary processes

Exercise 1.3: Give examples of time series that characterize each set (including each intersection and union) in the Venn diagram elaborated in the previous question.
Exercises 1.2 and 1.3

(1) WS, SS, IID
(2) WS, SS, IID
(3) WS, SS, IID
(4) WS, SS, IID
(5) WS, SS, IID
Exercises 1.2 and 1.3

(1) {Xt} iid with Xt ∼ N(µ, σ²) for every t ≤ t* and Xt ∼ t(µ, σ²) for every t > t*

(2) {Xt} iid with Xt ∼ N(µ, σ²) for every t

(3) (Xt, Xt−1) ∼ N(µ, Σ) for every t, where µ = [0, 0] and $\Sigma = \begin{bmatrix} 1 & 0.5 \\ 0.5 & 1 \end{bmatrix}$

(4) {Xt} iid with Xt ∼ Cauchy for every t

(5) (Xt, Xt−1) ∼ t(µ, Σ, λ) for every t, where µ = [0, 0], $\Sigma = \begin{bmatrix} 1 & 0.5 \\ 0.5 & 1 \end{bmatrix}$, and λ < 2.
Exercise 1.6
Exercise 1.6: Show that ρX (t + h, t) = ρX (h) if the time series
is weakly stationary.
Answer: The ACF is given by

$$\rho_X(t+h, t) = \frac{\mathrm{Cov}(X_{t+h}, X_t)}{\sqrt{\mathrm{Var}(X_{t+h})\,\mathrm{Var}(X_t)}}$$

If {Xt} is weakly/strictly stationary, then, for every (t, h),

$$\mathrm{Cov}(X_{t+h}, X_t) = \gamma_X(h) \quad \text{and} \quad \mathrm{Var}(X_{t+h}) = \mathrm{Var}(X_t) = \sigma_X^2$$

Therefore, we can re-write

$$\rho_X(t+h, t) = \frac{\gamma_X(h)}{\sqrt{\sigma_X^2\,\sigma_X^2}} = \frac{\gamma_X(h)}{\sigma_X^2} = \frac{\gamma_X(h)}{\gamma_X(0)} = \rho_X(h)$$
Exercise 1.11

Exercise 1.11: Derive the autocorrelation function of the random walk starting at t = 1:

$$X_t = \varepsilon_1 + \varepsilon_2 + \ldots + \varepsilon_t \quad \forall\, t \in \mathbb{N}, \quad \text{where } \{\varepsilon_t\} \sim WN(0, \sigma_\varepsilon^2).$$

Recall: Var(Xt) = tσε² and Cov(Xt, Xt−h) = (t − h)σε². Hence:

$$\rho_X(t, t-h) = \frac{\gamma_X(t, t-h)}{\sqrt{\mathrm{Var}(X_t)\,\mathrm{Var}(X_{t-h})}} = \frac{(t-h)\sigma_\varepsilon^2}{\sqrt{(t\sigma_\varepsilon^2)\,((t-h)\sigma_\varepsilon^2)}} = \sqrt{\frac{t-h}{t}}$$

The correlations between elements of {Xt} change over time!
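As a quick sanity check (not part of the original exercise), the derived autocorrelation can be verified by simulation; below is a minimal sketch assuming Gaussian white noise and the hypothetical choice t = 50, h = 10.

```python
import numpy as np

rng = np.random.default_rng(0)
n_sims, t, h = 100_000, 50, 10

# Simulate many random-walk paths X_t = eps_1 + ... + eps_t
eps = rng.standard_normal((n_sims, t))
X = eps.cumsum(axis=1)

# Sample correlation between X_t and X_{t-h} across simulated paths
sample_corr = np.corrcoef(X[:, t - 1], X[:, t - h - 1])[0, 1]
theory_corr = np.sqrt((t - h) / t)
print(f"sample: {sample_corr:.3f}, theory: {theory_corr:.3f}")  # both close to 0.894
```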


Exercise 1.17
(a) Xt = a + bZt + cZt−2, with {Zt} ∼ NID(0, σ²)

E(Xt) = E(a + bZt + cZt−2)
      = E(a) + E(bZt) + E(cZt−2)
      = a + bE(Zt) + cE(Zt−2)
      = a + b · 0 + c · 0 = a.

Var(Xt) = Var(a + bZt + cZt−2)
        = Var(a) + Var(bZt) + Var(cZt−2)
        = 0 + b²Var(Zt) + c²Var(Zt−2)
        = (b² + c²)σ².
Exercise 1.17
(a) (continued)

Cov(Xt, Xt−h) = Cov(a + bZt + cZt−2, a + bZt−h + cZt−h−2)
             = b²Cov(Zt, Zt−h) + bcCov(Zt, Zt−h−2)
               + cbCov(Zt−2, Zt−h) + c²Cov(Zt−2, Zt−h−2)

Now, since Cov(Zt, Zt−h) = 0 ∀ h ≠ 0, we have
γ(0) = (b² + c²)σ²
γ(1) = 0
γ(2) = bcσ²
γ(h) = 0 ∀ h > 2.

Since expectation, variance and covariance are all finite and constant over time, we conclude that {Xt} is weakly stationary.
Exercise 1.18

Exercise 1.18: Suppose {Xt} and {Yt} are uncorrelated weakly stationary sequences. Show that {Xt + Yt} is weakly stationary.

Let {Wt} := {Xt + Yt}. Then,

E(Wt) = E(Xt + Yt) = E(Xt) + E(Yt) = µX + µY

Var(Wt) = Var(Xt + Yt) = Var(Xt) + Var(Yt) = σ²X + σ²Y

Cov(Wt, Wt−h) = Cov(Xt + Yt, Xt−h + Yt−h)
             = Cov(Xt, Xt−h) + Cov(Xt, Yt−h) + Cov(Yt, Xt−h) + Cov(Yt, Yt−h)
             = γX(h) + 0 + 0 + γY(h) = γX(h) + γY(h)

Since expectation, variance and covariance are all finite and constant over time, we conclude that {Wt} is weakly stationary.
Useful info

Note: groups should have three or four students. No more, no less. (Contact me if you do not have partners.)
Deadline for forming groups: Sunday, November 1
Assignment deadlines:
Parts 1 and 2: Monday, December 14, at 18:30.

Important: failure to deliver on time implies zero points on that part!

Good grade: answers should be correct, clear and complete. Deliver a report with appropriate justifications and insightful comments and remarks. Think about your report: too much information is unreadable; too little information is ill-advised.


Assignment and Software

Software: the assignment should be completed in Python or any other software (Matlab, R).

Topics: Clustering, Text Processing

Doing extra: go ahead, but make sure you know what you are doing.


Autosuggestions

About this assignment:

In this assignment you have data about IDs and when and where they were issued. Do not worry: there is no ID number, and I modified the data!

Assignment objectives:
1 Work with text data and learn how to cluster it.
2 Get a feeling for how to calculate the value of your work.
3 Visualize data and use clustering for data cleaning.


Id data

Load the data and process it

Visualise data

Visualise cleaned data


Autosuggestion accuracy

Figure: Proportion of entries whose error is smaller than the value on the x-axis, measured by Levenshtein distance. Axes: x = Levenshtein distance, y = proportion of entries with error smaller than x.
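To make the error metric concrete, here is a minimal hedged sketch (not the assignment's actual code; the column names suggestion and truth are hypothetical) that computes Levenshtein distances with the standard dynamic-programming recursion and the proportion of entries below each threshold.

```python
import pandas as pd

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

# Hypothetical data: suggested vs true issuing-place strings
df = pd.DataFrame({"suggestion": ["MOSCOW", "ST PETERSBURG", "KAZAN"],
                   "truth":      ["MOSCOW", "SAINT PETERSBURG", "KAZAN"]})
df["error"] = [levenshtein(s, t) for s, t in zip(df["suggestion"], df["truth"])]

# Proportion of entries with error smaller than each threshold x (the y-axis of the figure)
for x in range(0, 6):
    print(x, (df["error"] < x).mean())
```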



Multivariate regression

Extension of the univariate regression:

Strictly speaking, the univariate models we worked with often already have two variables: a regressor and a constant.
Mathematically very similar: the same techniques are used for the analysis.
Typically the same packages, but more things to look at.
statsmodels is more statistics-driven; sklearn is more machine-learning oriented.
Similar functionality, slightly different focus and notation (see the sketch below).
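As an illustration of the two interfaces, here is a minimal hedged sketch fitting a two-regressor model with both statsmodels and sklearn; the data are simulated and the variable names (population, horeca, shops) are only placeholders for the shops example that follows.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from sklearn.linear_model import LinearRegression

# Simulated stand-ins for the shops/population/horeca example below
rng = np.random.default_rng(1)
df = pd.DataFrame({"population": rng.uniform(1e3, 1e5, 200),
                   "horeca": rng.poisson(30, 200)})
df["shops"] = 0.002 * df["population"] + 1.5 * df["horeca"] + rng.normal(0, 10, 200)

# statsmodels: explicit constant, rich statistical summary
X_sm = sm.add_constant(df[["population", "horeca"]])
ols_fit = sm.OLS(df["shops"], X_sm).fit()
print(ols_fit.summary())

# sklearn: intercept handled by the estimator, focus on prediction
lr = LinearRegression().fit(df[["population", "horeca"]], df["shops"])
print(lr.intercept_, lr.coef_)
```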



Multivariate regression examples

Number of Shops on Population and Number of Horeca Establishments

Load Packages


Fit Plot

Multivariate Regression

Plot fit

Multivariate Plot

Why is this mess here?

Multivariate Scatter Plot

Multivariate Scatter Plot

Residuals Plot


Clustering

What is it? Why do we do it?

Clustering refers to a very broad set of techniques for finding subgroups, or clusters, in a data set.
We want to partition the data into distinct groups based on the information we have, so that similar observations (e.g. similar shoes) end up in the same group and dissimilar ones in different groups.
What does it mean for shoes to be similar or different?
In non-trivial cases, we need to combine categorical and numerical features and use domain-specific knowledge.


Clustering

What is it? Why do we do it?

An alternative to moving to more sophisticated models: instead of going to semi-nonparametric or non-parametric methods such as ANNs, we can split the data into groups and use a simpler model for each of them.
Netflix and Spotify recommendations are essentially a sophisticated version of two-sided clustering: find people with similar tastes, find products with similar properties.

Clustering and recommender systems


Clustering

What is it? Why do we do it?

An alternative to moving to more sophisticated models: instead of going to semi-nonparametric or non-parametric methods such as ANNs, we can split the data into groups and use a simpler model for each of them.
Netflix and Spotify recommendations are essentially a sophisticated version of two-sided clustering: find people with similar tastes, find products with similar properties.
Clustering also helps with the cold-start problem: if time-series data is not available, how do we predict sales of a new article?
Imagine a map and us trying to guess the value of a property based on its k nearest neighbors.


Clustering Example

Sales and Product data of Shoes:

Suppose we have a large number of shoe characteristics (e.g. number of shoes sold last year, average price, product category (e.g. running, football, outdoor), and so forth) for a large number of shoes.
Our purpose is to group similar shoes together to make a separate demand/promotion model for each group.
This task can be linked to user segmentation: similar people like similar shoes (related to recommender systems).

Clustering Result Plot


K-means Clustering Details

Goal:
Assign a unique index to each observation.
To minimize the within-cluster variation given the number of clusters:

$$WCV(C_k) = \frac{1}{|C_k|} \sum_{x_i, x_{i'} \in C_k} \sum_{j=1}^{p} (x_{ij} - x_{i'j})^2$$


K-means Clustering Details

Procedure:
1 Randomly assign a number, from 1 to K, to each of the
observations. These serve as initial cluster assignments for
the observations.
2 Iterate until the cluster assignments stop changing:
1 For each of the K clusters, compute the cluster centroid.
The kth cluster centroid is the vector of the p feature means
for the observations in the kth cluster.
2 Assign each observation to the cluster whose centroid is
closest (where closest is defined using Euclidean distance).
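A minimal sklearn sketch of this procedure, assuming a numeric feature matrix (simulated here) and K = 3; note that sklearn's KMeans uses 'k-means++' starting centroids rather than purely random initial assignments.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)
# Simulated stand-in for the shoe feature matrix (units sold, average price, ...)
X = np.vstack([rng.normal(loc, 1.0, size=(100, 2)) for loc in (0, 5, 10)])

# Scale features first, then run K-means with K = 3
X_scaled = StandardScaler().fit_transform(X)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_scaled)

print(kmeans.labels_[:10])        # cluster index of each observation
print(kmeans.cluster_centers_)    # centroids in scaled feature space
print(kmeans.inertia_)            # total within-cluster sum of squares
```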



Example by Robert Tibshirani and Trevor Hastie

Figure: Steps of the K-means algorithm. Source: Statistical Learning at https://online.stanford.edu/
Breaking down steps

The progress of the K-means algorithm with K = 3.

Top left: scatter plot of the data.
Top center: Step 1 – randomly assign each observation to a cluster.
Top right: Step 2(a) – compute the cluster centroids.
Bottom left: Step 2(b) – assign each observation to the nearest centroid.
Bottom center: Step 2(a) – compute new cluster centroids.
Bottom right: results after 10 iterations.


Scale Data

Residuals Plot

Residuals Plot

K-means code

Scatter Plot

Clustering Result Plot


K-means Clustering. How to Choose K?

No simple answer:
Number given by domain knowledge.
Elbow method.
Silhouette Score.



Elbow Method



Elbow Method
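Since the original code slides are not reproduced here, below is a hedged sketch of the elbow method on simulated, scaled data: fit K-means for a range of K and plot the total within-cluster sum of squares (inertia), looking for the "elbow" where improvements level off.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# Re-create a scaled feature matrix; three blobs as a stand-in for real data
rng = np.random.default_rng(3)
X_scaled = StandardScaler().fit_transform(
    np.vstack([rng.normal(loc, 1.0, size=(100, 2)) for loc in (0, 5, 10)]))

ks = range(1, 11)
inertias = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_scaled).inertia_
            for k in ks]

plt.plot(ks, inertias, marker="o")
plt.xlabel("number of clusters K")
plt.ylabel("within-cluster sum of squares (inertia)")
plt.title("Elbow method")
plt.show()
```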



Silhouette Score k = 2

Silhouette Score k = 3

Silhouette Score k = 4

Code For Silhouette Score. Part 1

Code For Silhouette Score. Part 2
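The slide's own code is not reproduced, so here is a hedged sketch of the silhouette analysis on the same kind of simulated data; sklearn's silhouette_score averages, for each observation, how close it is to its own cluster relative to the nearest other cluster.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(4)
X_scaled = StandardScaler().fit_transform(
    np.vstack([rng.normal(loc, 1.0, size=(100, 2)) for loc in (0, 5, 10)]))

# Average silhouette: close to +1 means well-separated clusters, near 0 means overlap
for k in (2, 3, 4):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_scaled)
    print(k, round(silhouette_score(X_scaled, labels), 3))
```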



Bonus

Get Cluster Labels

Scatter Plot

Compare with K-means?


Two types of Clustering

Top-down and Bottom-Up:


In K-means clustering, we seek to partition the
observations into a pre-specified number of clusters.
In hierarchical clustering, we do not know in advance how
many clusters we want; in fact, we end up with a tree-like
visual representation of the observations, called a
dendrogram, that allows us to view at once the clusterings
obtained for each possible number of clusters, from 1 to n.
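A hedged sketch of bottom-up (agglomerative) hierarchical clustering and its dendrogram using scipy on simulated data; the Ward linkage used here is just one common choice.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

rng = np.random.default_rng(5)
X = np.vstack([rng.normal(loc, 1.0, size=(20, 2)) for loc in (0, 5, 10)])

# Agglomerative clustering: repeatedly merge the closest clusters (Ward linkage)
Z = linkage(X, method="ward")

# The dendrogram shows the clusterings for every possible number of clusters at once
dendrogram(Z)
plt.title("Hierarchical clustering dendrogram")
plt.show()

# Cut the tree into 3 clusters if a fixed number is needed afterwards
labels = fcluster(Z, t=3, criterion="maxclust")
print(labels)
```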



Affinity Prop

Based on distance matrix

Affinity Prop

What are the clusters here?

Affinity Prop

What are the clusters here?

Affinity Prop

Compare with K-means?

https://scikit-learn.org/stable/modules/clustering.html
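A hedged sketch of Affinity Propagation in scikit-learn on simulated data; unlike K-means, the number of clusters is not fixed in advance but emerges from the pairwise similarities and the preference parameter.

```python
import numpy as np
from sklearn.cluster import AffinityPropagation

rng = np.random.default_rng(6)
X = np.vstack([rng.normal(loc, 1.0, size=(50, 2)) for loc in (0, 5, 10)])

# Affinity Propagation exchanges messages based on pairwise similarities;
# the number of clusters is chosen by the algorithm itself
ap = AffinityPropagation(random_state=0).fit(X)

print("estimated clusters:", len(ap.cluster_centers_indices_))
print(ap.labels_[:10])
```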



Logit

Implementation:
The dependent variable is binary, ∈ {0, 1}; we model the probability of success.
Use statsmodels and sklearn.


Logit with sm ols



Logit with sm ols

Get Odds Ratio and Confidence intervals
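A minimal hedged sketch of a logit fit in statsmodels (the slide's own code is not reproduced; the data are simulated); odds ratios and their confidence intervals are obtained by exponentiating the coefficients and the interval bounds.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(7)
x1, x2 = rng.normal(size=500), rng.normal(size=500)
p = 1 / (1 + np.exp(-(0.5 + 1.2 * x1 - 0.8 * x2)))   # true success probability
y = rng.binomial(1, p)

X = sm.add_constant(pd.DataFrame({"x1": x1, "x2": x2}))
logit_fit = sm.Logit(y, X).fit()
print(logit_fit.summary())

# Odds ratios and their confidence intervals: exponentiate coefficients and bounds
print(np.exp(logit_fit.params))
print(np.exp(logit_fit.conf_int()))
```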



Logit with sklearn
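A hedged sklearn counterpart on the same kind of simulated data, including accuracy and AUC as performance measures; note that sklearn's LogisticRegression applies L2 regularization by default, so its coefficients differ slightly from an unpenalized statsmodels fit.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, roc_auc_score

rng = np.random.default_rng(8)
X = rng.normal(size=(500, 2))
p = 1 / (1 + np.exp(-(0.5 + 1.2 * X[:, 0] - 0.8 * X[:, 1])))
y = rng.binomial(1, p)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# L2-regularized logistic regression (sklearn default)
clf = LogisticRegression().fit(X_train, y_train)

pred = clf.predict(X_test)                 # class labels
prob = clf.predict_proba(X_test)[:, 1]     # predicted probability of success
print("accuracy:", accuracy_score(y_test, pred))
print("AUC:", roc_auc_score(y_test, prob))
```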



Logit Performance measures



Area under Curve



Area under Curve
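For the area-under-curve slides, a hedged, self-contained sketch (simulated data, in-sample for brevity) that plots the ROC curve and reports the AUC.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, roc_auc_score

rng = np.random.default_rng(9)
X = rng.normal(size=(500, 2))
y = rng.binomial(1, 1 / (1 + np.exp(-(0.5 + 1.2 * X[:, 0] - 0.8 * X[:, 1]))))

prob = LogisticRegression().fit(X, y).predict_proba(X)[:, 1]
fpr, tpr, _ = roc_curve(y, prob)

# ROC curve: true-positive rate against false-positive rate; AUC is the area under it
plt.plot(fpr, tpr, label=f"AUC = {roc_auc_score(y, prob):.2f}")
plt.plot([0, 1], [0, 1], linestyle="--")   # chance line
plt.xlabel("false positive rate")
plt.ylabel("true positive rate")
plt.legend()
plt.show()
```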



Lasso
ML objective plus a penalty term $\lambda \sum_{i=1}^{p} \beta_i^2$ (note: as written, this is the L2/Ridge penalty; Lasso uses the L1 penalty $\lambda \sum_{i=1}^{p} |\beta_i|$).
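A hedged sketch of Lasso (L1) and Ridge (L2) regression in sklearn on simulated data; the penalty strength alpha plays the role of λ above, and Lasso's ability to set coefficients exactly to zero is what makes it useful for feature selection and dimensionality reduction.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(10)
n, p = 200, 10
X = rng.normal(size=(n, p))
# Only the first two features actually matter
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(0, 1, n)

lasso = Lasso(alpha=0.1).fit(X, y)   # L1 penalty: sparse coefficients
ridge = Ridge(alpha=1.0).fit(X, y)   # L2 penalty: coefficients shrunk, not zeroed

print("lasso:", np.round(lasso.coef_, 2))
print("ridge:", np.round(ridge.coef_, 2))
```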

Lasso

Lasso


Chapter 3: Probability models

Some more reading material:

1 Davidson (1994), “Stochastic Limit Theory”, Chapters 1.6, 2.3, 3.1 and 7.1
2 Billingsley (1995), “Probability and Measure”, Chapters 2 and 5
3 White (1996), “Estimation, Inference and Specification Analysis”, Chapters 2.1, 2.2 and 20
4 Fan and Yao (2005), “Nonlinear Time-Series”, Chapter 1.3


Probability spaces and random variables

Note: we need to learn some basic concepts of set theory and measure theory in order to find an appropriate definition of probability model and data generating process.

A very brief history of 20th-century mathematics:
1 Late 19th and early 20th centuries: foundations!
2 Gottlob Frege attempted to give proper definitions of numbers, functions and variables. What is the number 2 after all?
3 Days before publication (after 10 years of work), Bertrand Russell pointed out a ‘small inconsistency’ which turned out to destroy Frege’s work.
4 Bertrand Russell (with Alfred North Whitehead) gave foundations to mathematics in ‘Principia Mathematica’, in 3 volumes.
5 Kurt Gödel’s Incompleteness Theorem shows that any sufficiently rich axiomatic system will always be incomplete or contradictory!


Probability space

A probability space is a triplet (E, F, P) where E is the ‘event space’, F is a σ-field defined on the event space E, and P is a probability measure defined on the σ-field F.

The event space E is the collection of all possible outcomes for the random variable.

The probability measure P defines the probability associated with each event and each collection of events in E.

The σ-field F contains all the relevant collections of events.

Note: P : F → [0, 1] maps elements of F to the interval [0, 1].


Probability space

A probability space is a triplet (E, F, P) where E is the ‘event space’, F is a σ-field defined on the event space E, and P is a probability measure defined on the σ-field F.

Examples of event spaces E:
Coin tosses: E = {heads, tails}
Dice tosses: E = {1, 2, 3, 4, 5, 6}
Gaussian random variable: E = R

Question: Why is P defined on collections of sets in F?
Answer: it describes which sets of events are joint and which are disjoint!


Probability space: Coin toss example
Example: A σ-field F of the event space E = {heads, tails} is

F := { ∅, {heads}, {tails}, {heads, tails} }.

Note:
F contains the empty set ∅
F contains each element of E
F contains the event space E = {heads, tails}

Hence: the probability measure P must define a probability of
Nothing happening: P(∅)
Drawing heads: P({heads})
Drawing tails: P({tails})
Drawing either heads or tails: P({heads, tails})
σ-fields (σ-algebras)

Note: there are certain rules that must be followed for constructing a σ-algebra F.

Banach–Tarski Paradox (1924): it is possible to take a ball, cut it into pieces, and re-arrange those pieces in such a manner as to obtain two balls of the exact same size, with no parts missing! Restricting probability to a σ-algebra of measurable sets avoids this problem!

A σ-field F of a set E is a collection of subsets of E satisfying:
(i) E ∈ F.
(ii) If F ∈ F, then F^c ∈ F.
(iii) If {Fn}n∈N is a sequence of sets in F, then ∪n∈N Fn ∈ F.


Measurable spaces and probability measures

A measurable space is just a pair (E, F) composed of an event space E and a respective σ-algebra F.

A probability measure P defined on a measurable space (E, F) is a function P : F → [0, 1] satisfying:
(i) P(F) ≥ 0 ∀ F ∈ F.
(ii) P(E) = 1.
(iii) If {Fn}n∈N is a collection of pairwise disjoint sets in F, then P(∪n∈N Fn) = Σn∈N P(Fn).


Random variable

Given two measurable spaces (A, FA) and (B, FB), a function f : A → B is said to be measurable if every element b ∈ FB satisfies f⁻¹(b) ∈ FA.

Note: the inverse image map f⁻¹ always exists; it may just not be a function! Do you remember the properties of a function?

Given a probability space (E, F, P) and a measurable space (R, FR), a random variable xt is a measurable map xt : E → R that maps elements of E to the real numbers R.


Random variable

Note: the definition of a random variable is very intuitive!

Note: measurability of xt : E → R implies that we can assign probabilities to each interval R ⊆ R of the real line,

$$P_R(R) = P\big(x_t^{-1}(R)\big) = P\big(\{e \in E : x_t(e) \in R\}\big),$$

and we obtain a new probability space (R, FR, PR).

Note: we can now define the cumulative distribution function F that you know so well as

$$F(a) = P_R(x_t \le a) \quad \forall\, a \in \mathbb{R}.$$

Important: xt is a random variable; xt(e) ∈ R is the realization of the random variable produced by the event e ∈ E.


Random vectors and random elements

Note: The concept of random variable is easy to generalize!

Given a probability space (E, F, P) and a measurable space (R^n, FR^n) with n ∈ N, an n-variate random vector xt is a measurable map xt : E → R^n that maps elements of E to R^n.

Given a probability space (E, F, P) and a measurable space (A, FA), a random element at taking values in A is a measurable map at : E → A that maps elements of E to A.


Is this a random variable? Borel σ-algebra

Important: the definition of a random variable depends on the σ-algebra that one is using!

Question: consider the case where xt is a normal random variable, xt ∼ N(0, σ²). Is xt² also a random variable? How about exp(xt)?

Answer: yes, if we use the Borel σ-algebra! (Émile Borel)

Given a set A, the Borel σ-algebra BA is the smallest σ-algebra containing all open sets of A.


Is this a random variable? Continuous functions

Important: all continuous functions are measurable under the Borel σ-algebra! Any continuous transformation f(xt) of a random variable xt is also a random variable!

Note: it is obvious that all continuous functions are measurable! Just look at the definition of a continuous function.

Let (A, TA) and (B, TB) be topological spaces. A function f : A → B is said to be continuous if its preimage map f⁻¹ maps open sets to open sets; i.e. if for every b ∈ TB we have f⁻¹(b) ∈ TA.


What is a probability model?

Question: what exactly is a model?

Example: given T tosses of a coin, it is reasonable to suppose that x1, ..., xT are realizations of T Bernoulli random variables, xt ∼ Bern(θ), with unknown probability parameter θ ∈ [0, 1].

Important: each θ defines a probability distribution for the random vector (x1, ..., xT) taking values in R^T. Our model is a collection of probability distributions on R^T.

This definition of model is the one you have always been using, even if you did not realize it!


What is a probability model?

Question: what exactly is a model?

Example 2: the Gaussian linear AR(1) model,

$$x_t = \alpha + \beta x_{t-1} + \varepsilon_t, \quad \varepsilon_t \sim N(0, \sigma_\varepsilon^2), \quad \forall\, t \in \mathbb{Z}.$$

Important: each θ = (α, β, σε²) defines a distribution for the time series {xt}t∈Z. Our model is a collection of probability distributions on R^∞.

This definition of model is the one you have always been using, even if you did not realize it!


Probability model

Given a measurable space (E, F) and a parameter space Θ, a probability model is a collection PΘ := {Pθ, θ ∈ Θ} of probability measures defined on F.

Given the measurable space (R^∞, FR^∞) and a parameter space Θ, a probability model is a collection PΘ := {Pθ, θ ∈ Θ} of probability measures defined on FR^∞.


Some more useful definitions...

A probability model PΘ := {Pθ, θ ∈ Θ} is said to be:

‘parametric’ if the parameter space Θ is finite dimensional;
‘nonparametric’ if Θ is infinite dimensional;
‘semi-parametric’ if Θ = Θ1 × Θ2, where Θ1 is finite dimensional and Θ2 is infinite dimensional;
‘semi-nonparametric’ if the parameter space ΘT is indexed by the sample size T, forming a sequence of ‘sieves’ {ΘT}T∈N of increasing dimension.

Given a measurable space (E, F) and two parametric models PΘ := {Pθ, θ ∈ Θ} and P*Θ* := {P*θ*, θ* ∈ Θ*}, we say that model PΘ nests model P*Θ* if and only if P*Θ* ⊆ PΘ.
