
Introduction to Machine Learning

(Lectures on Machine Learning for Economists I)

Jesús Fernández-Villaverde
December 28, 2018
University of Pennsylvania
What is machine learning?

• Set of algorithms to detect and learn from patterns in the data and use them
for decision making or to forecast future realizations of random variables.

• Focus on recursive processing of information to improve performance over time.

• Operational definition of learning (i.e. Turing test).

• Contrast with traditional scientific computation (even Monte Carlo methods).

• Contrast with symbolic reasoning, expert systems, and cognitive-model approaches in artificial intelligence.

• Think about the example of how to program a computer to play chess.

Turing test

Why now?

• Many of the ideas of machine learning (e.g., basic neural network by McCulloch
and Pitts, 1943, and perceptron by Rosenblatt, 1958) are decades old.

• Previous waves of excitement: late 1980s and early 1990s, each followed by a backlash.

• Four forces behind the revival:

1. Big data.

2. Long tails.

3. Cheap computational power.

4. Algorithmic advances (influence of Google).

• Likely that these four forces will become stronger over time.

• Exponential growth in industry ⇒ plenty of packages for Python, R, and other languages.
Data sizes

[Figure 1.8, Goodfellow et al.: dataset sizes (number of examples, log scale) over time, from Iris and other hand-compiled datasets of the early 1900s to MNIST, CIFAR-10, ImageNet, Sports-1M, and other modern benchmarks with millions of examples.]
Number of transistors

[Figure: transistor counts per microprocessor, 1971-2016 (log scale), illustrating Moore's law.]
Relation with other fields

• Link with computer science, statistical learning, data science, data mining,
predictive analytics, and optimization: frontiers are often fuzzy.

• Many similarities with econometrics and statistical learning, but the emphasis is somewhat different:

1. No unified approach.

2. Practical algorithms vs. theoretical properties (scalability vs. asymptotic properties).

3. Traditional statistical inference is de-emphasized.

4. More interest in forecasting than in causality assertions (cross-validation, regularization).

Economics and machine learning I

• Recent boom in economics:

1. Data processing: Blumenstock et al. (2017).

2. Alternative empirical models: deep IVs by Hartford et al. (2017).

3. New solution methods: my work with Galo and Samuel.

4. Alternative to older bounded rationality models.

• However, important to distinguish signal from noise.

• Most important lesson for economists from data science: Everything is data.

Library data

Parish and probate data

Satellite imagery

Luminosity

Cell use

TABLE I
MOST PARTISAN PHRASES FROM THE 2005 CONGRESSIONAL RECORD^a

Panel A: Phrases Used More Often by Democrats


Two-Word Phrases
private accounts Rosa Parks workers rights
trade agreement President budget poor people
American people Republican party Republican leader
tax breaks change the rules Arctic refuge
trade deficit minimum wage cut funding
oil companies budget deficit American workers
credit card Republican senators living in poverty
nuclear option privatization plan Senate Republicans
war in Iraq wildlife refuge fuel efficiency
middle class card companies national wildlife
Three-Word Phrases
veterans health care    corporation for public broadcasting    cut health care
congressional black caucus    civil rights movement
VA health care additional tax cuts cuts to child support
billion in tax cuts pay for tax cuts drilling in the Arctic National
credit card companies tax cuts for people victims of gun violence
security trust fund oil and gas companies solvency of social security
social security trust prescription drug bill Voting Rights Act
privatize social security caliber sniper rifles war in Iraq and Afghanistan
American free trade increase in the minimum wage civil rights protections
central American free system of checks and balances credit card debt
middle class families
TABLE I—Continued

Panel B: Phrases Used More Often by Republicans


Two-Word Phrases
stem cell personal accounts retirement accounts
natural gas Saddam Hussein government spending
death tax pass the bill national forest
illegal aliens private property minority leader
class action border security urge support
war on terror President announces cell lines
embryonic stem human life cord blood
tax relief Chief Justice action lawsuits
illegal immigration human embryos economic growth
date the time increase taxes food program
Three-Word Phrases
embryonic stem cell Circuit Court of Appeals Tongass national forest
hate crimes legislation death tax repeal pluripotent stem cells
adult stem cells housing and urban affairs Supreme Court of Texas
oil for food program million jobs created Justice Priscilla Owen
personal retirement accounts national flood insurance Justice Janice Rogers
energy and natural resources oil for food scandal American Bar Association
global war on terror private property rights growth and job creation
hate crimes law temporary worker program natural gas natural
change hearts and minds class action reform Grand Ole Opry
global war on terrorism Chief Justice Rehnquist reform social security

^a The top 60 Democratic and Republican phrases, respectively, are shown ranked by χ²_pl. The phrases are classified as two or three word after dropping common “stopwords” such as “for” and “the.” See Section 3 for details and see Appendix B (online) for a more extensive phrase list.
Economics and machine learning II

• A more general point ⇒ role of causality in economics:

1. Counterfactuals.

2. Welfare.

3. General equilibrium effects.

4. New changes.

5. Less data.

• Another example by Athey (2017): hotel prices and occupancy rates. In the
data, prices and occupancy rates are strongly positively correlated, but what is
the expected impact of a hotel raising its prices on a given day?

References I

• I will lecture from:

1. Machine Learning: A Probabilistic Perspective, Kevin P. Murphy (2012).

2. The Elements of Statistical Learning: Data Mining, Inference, and Prediction (2nd ed.), Trevor Hastie, Robert Tibshirani, and Jerome Friedman (2009).

3. Reinforcement Learning: An Introduction (2nd ed.), Richard S. Sutton and Andrew G. Barto (2018).

4. Stephen Hansen’s notes, available at https://github.com/sekhansen/machine_learning_economics (we might merge our slides in the medium run; I copy some material from him nearly verbatim).

5. NBER Summer Institute 2015 Methods Lectures, available at http://www.nber.org/econometrics_minicourse_2015.

References II

• Other good references:

1. Machine Learning: The Art and Science of Algorithms that Make Sense of Data,
Peter Flach (2012).

2. Deep Learning, Ian Goodfellow, Yoshua Bengio, and Aaron Courville (2016).

3. Andrew Ng’s CS 229: Machine Learning webpage and videos.

A wide range of algorithms

• Machine learning is a catch-all name for a large family of methods.

• Some of them are old-fashioned methods in statistics and econometrics presented under alternative names.

• The no-free-lunch theorems of Devroye (1982) and Wolpert (1996).

• Main division:

1. Supervised learning (conditional density estimation).

2. Unsupervised learning (unconditional density estimation).

3. Reinforcement learning.
Supervised Learning

(Lectures on Machine Learning for Economists II)

Jesús Fernández-Villaverde
December 28, 2018
University of Pennsylvania
Supervised learning I

• Use a training sample

  D = {y_i, x_i}_{i=1}^N

  to classify new observations

  F = {y_i, x_i}_{i=N+1}^M

• xi may potentially belong to a high-dimensional space.

• Standard regression in econometrics is a particular example when y_i is a continuous variable.

• But often, in machine learning, yi is categorical (classification problem).

• Example: classify loans of a bank into good and bad risks.

• Caveat: it can be highly labor intensive!
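A minimal sketch in Python of the training sample D and the set F to classify (scikit-learn, with simulated data standing in for the bank-loan example; the logistic regression is just one off-the-shelf classifier):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))                       # x_i: borrower characteristics (hypothetical)
y = (X[:, 0] - 0.5 * X[:, 1] + rng.normal(size=1000) > 0).astype(int)  # y_i: good/bad risk

# D = {y_i, x_i}_{i=1..N} is the training sample; F = {y_i, x_i}_{i=N+1..M} is classified.
X_D, X_F, y_D, y_F = train_test_split(X, y, test_size=0.3, random_state=0)

clf = LogisticRegression().fit(X_D, y_D)             # one off-the-shelf classifier
print("share of F correctly classified:", clf.score(X_F, y_F))
```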


[Figure 2.6, Hastie et al.: the curse of dimensionality illustrated by a subcubical neighborhood for uniform data in a unit cube. The plot shows the side-length of the subcube needed to capture a fraction r of the volume of the data, for different dimensions p; in ten dimensions, we need to cover 80% of the range of each coordinate to capture 10% of the data.]
Supervised learning II

• Problem of function approximation:

y = f (x)

when we know next to nothing about the function f (·).

• Linked with projection and perturbation methods.

• Also, linked with traditional econometrics.

• Then:

  ŷ = f̂(x) = argmax_{c∈C} p(y = c | x, D)

• In Bayesian terms, this is the mode of the posterior; in machine learning it is called the MAP (maximum a posteriori) estimate.

Example I: K-nearest neighbors algorithm

• We look at the K nearest points to x_i in the training sample.

• Formally:

  p(y = c | x, D, K) = (1/K) Σ_{i∈N_K(x,D)} I(y_i = c)

  where N_K(x, D) are the K nearest points to x in D.

• It generates a Voronoi tessellation.

• Simple and intuitive, but it suffers from curse of dimensionality.
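A minimal sketch in Python (scikit-learn's KNeighborsClassifier on simulated two-class data; the data-generating process is an illustrative assumption):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
# Hypothetical two-class training sample D = {y_i, x_i}
X_train = rng.normal(size=(500, 2))
y_train = (X_train[:, 0] + X_train[:, 1] > 0).astype(int)

# p(y = c | x, D, K) is the share of the K nearest neighbors of x with label c:
# predict_proba() returns these shares, predict() the modal class.
knn = KNeighborsClassifier(n_neighbors=15).fit(X_train, y_train)

x_new = np.array([[0.2, -0.1]])
print(knn.predict(x_new), knn.predict_proba(x_new))
```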

Bias-variance trade-off

• How do we select K? Minimize the misclassification rate on a validation set.

• Bias-variance trade-off is about underfitting vs. overfitting.

• We have:

  y = f(x) + ε,  ε ~ iid(0, σ²)

  and f̂(x) is our estimate of f(x).

• Then:

  E[(y − f̂(x))²] = σ² + (E[f̂(x)] − f(x))² + E[(f̂(x) − E[f̂(x)])²]
                 = σ² + Bias² + Variance

• Traditional econometrics' obsession with BLUE is mysterious.
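A minimal sketch of selecting K on a held-out validation set (simulated data; the split and the grid of K values are illustrative assumptions):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 2))
y = (X[:, 0] ** 2 + X[:, 1] > 0.5).astype(int)       # hypothetical nonlinear boundary

X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

# Small K: low bias, high variance (overfitting); large K: the reverse (underfitting).
errors = {}
for k in [1, 3, 5, 9, 15, 25, 51, 101]:
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    errors[k] = 1.0 - knn.score(X_val, y_val)        # validation misclassification rate

print(errors, "chosen K:", min(errors, key=errors.get))
```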


[Figure, Hastie et al.: schematic of the model space and a restricted model space, showing the truth, the closest fit in population, the closest fit for a realization, and the shrunken fit, together with model bias, estimation bias, and estimation variance.]
Linear Regression of 0/1 Response

[Figure, Hastie et al.: two-class training data and the decision boundary obtained from a linear regression of a 0/1 response.]
1-Nearest Neighbor Classifier

[Figure, Hastie et al.: the same two-class data with the highly irregular decision boundary produced by the 1-nearest-neighbor classifier.]
15-Nearest Neighbor Classifier

[Figure, Hastie et al.: the same two-class data with the smoother decision boundary produced by the 15-nearest-neighbor classifier.]
Bayes Optimal Classifier

[Figure, Hastie et al.: the same two-class data with the optimal Bayes decision boundary, computed from the known data-generating process.]
[Figure, Hastie et al.: training and test misclassification error of the K-nearest-neighbor classifier as a function of K (equivalently, degrees of freedom N/K), with the linear model and the Bayes error shown for comparison.]
[Figure 5.2, Goodfellow et al.: three models fit to a training set generated synthetically by randomly sampling x values and evaluating a quadratic function; the linear fit underfits, while the quadratic model matches the true structure and generalizes well to new data.]
Example II: Regularized regression

• Standard linear regression:

  y_i = x_i'β + ε_i

• When the number of regressors is 3 or more, OLS is not admissible (Stein's phenomenon).

• In particular, when the number of regressors is large, forecasting properties are poor.

• Zvi Griliches: "never trust OLS with more than five regressors."

• This problem is becoming more prominent as we get larger datasets with thousands of covariates.

• Also relevant when we have multiple potential IVs.

• Note difference between problem of learning about coefficients and forecasting!


A general formulation

• Solution: regularization (a.k.a. penalization):

1. Shrink estimates towards zero.

2. Sparse representation of non-zero parameters.

• Long tradition in econometrics: James–Stein estimator.

• Estimator:

  β_reg = argmin_β Σ_{i=1}^N (y_i − x_i'β)² + λ ‖β‖_p^p

  where

  ‖β‖_p = ( Σ_{j=1}^K |β_j|^p )^{1/p}

L0 norm

• Estimator with p = 0 (i.e., the penalty counts the number of non-zero coefficients).


• With an appropriately selected λ, we essentially choose β according to
AIC/BIC criteria, which have strong statistical foundations.
• Challenge: the optimization problem is difficult, and becomes computationally
infeasible when number of coefficients is large.
• Two alternatives:
1. Stepwise forward regression: start with no covariates; add whichever single
covariate improves the fit most; continue until no covariate significantly improves
fit.
2. Stepwise backward regression: start with all covariates; drop single covariate that
leads to the lowest reduction in fit; continue to drop until there is a significant
reduction in fit.

• Once we select the relevant covariates, we can perform OLS, but standard
errors are not correct since we do not condition on model selection.
• Thus, not much used in practice.
Ridge regression

• Estimator with p = 2.

• Closed form:

  β̂_ridge = (X'X + λI)^{-1} X'Y

• Estimates are shrunk towards zero by a factor 1 + λ.

• It does not enforce sparsity and, thus, we do not achieve model selection.

• Bayesian interpretation: λ = σ²/τ², where β_prior ~ N(0, τ²I) and ε_i ~ N(0, σ²).
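A minimal sketch of the closed form with NumPy (simulated sparse data; λ = 10 is an arbitrary illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
N, K = 200, 50
X = rng.normal(size=(N, K))
beta_true = np.zeros(K); beta_true[:5] = 1.0         # hypothetical sparse truth
y = X @ beta_true + rng.normal(size=N)

lam = 10.0
# Closed form: beta_ridge = (X'X + lambda * I)^(-1) X'y
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(K), X.T @ y)
beta_ols = np.linalg.solve(X.T @ X, X.T @ y)

# Ridge shrinks towards zero but does not set any coefficient exactly to zero.
print("OLS |beta| sum:  ", np.abs(beta_ols).sum())
print("ridge |beta| sum:", np.abs(beta_ridge).sum(), "zeros:", np.sum(beta_ridge == 0))
```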

LASSO

• Least absolute shrinkage and selection operator (Tibshirani, 1996):

  min_β Σ_{i=1}^N (y_i − x_i'β)² + λ ‖β‖_1

• LASSO shrinks some coefficients exactly to zero and others towards zero (the relaxed LASSO is a variation that re-estimates the selected coefficients with less shrinkage).

• Algorithms for its implementation are easy to code ⇒ e.g., the glmnet package in R.

• Bayesian interpretation: mode of the posterior given a Laplace prior

  p(β) ∝ exp(−λ ‖β‖_1)

• Bayesian alternative: spike-and-slab prior by Mitchell and Beauchamp (1988).
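A minimal sketch in Python, using scikit-learn's Lasso rather than glmnet (simulated sparse data; the penalty values are arbitrary illustrations):

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
N, K = 200, 50
X = rng.normal(size=(N, K))
beta_true = np.zeros(K); beta_true[:5] = 1.0         # hypothetical sparse truth
y = X @ beta_true + rng.normal(size=N)

lasso = Lasso(alpha=0.1).fit(X, y)                   # alpha plays the role of lambda
ridge = Ridge(alpha=10.0).fit(X, y)

# LASSO sets many coefficients exactly to zero (sparsity); ridge does not.
print("non-zero coefficients, LASSO:", np.sum(lasso.coef_ != 0))
print("non-zero coefficients, ridge:", np.sum(ridge.coef_ != 0))
```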


[Figure 3.11, Hastie et al.: estimation picture for the lasso (left) and ridge regression (right). Shown are the elliptical contours of the least-squares error function and the constraint regions |β_1| + |β_2| ≤ t and β_1² + β_2² ≤ t², respectively.]

[Figure 3.12, Hastie et al.: contours of constant value of Σ_j |β_j|^q for q = 4, 2, 1, 0.5, 0.1. Values of q ∈ (1, 2) suggest a compromise between the lasso and ridge regression, but for q > 1 the penalty |β_j|^q is differentiable at 0 and so does not share the lasso's (q = 1) ability to set coefficients exactly to zero.]
Which penalty?

• The results of LASSO are sensitive to λ.


• Usually chosen via cross-validation by holding out some of the data (see the sketch below):
1. Choose λ to minimize the average error on the held-out data.
2. Choose the largest λ such that the error is within one standard error of the minimum.

• The choice of penalty parameter also affects model selection consistency.


• Meinshausen and Bühlmann (2006): the prediction-optimal choice of λ is
guaranteed NOT to recover the true model asymptotically.
• When λ is chosen by cross validation, we tend to overselect variables and
include false positives.
• We need stronger penalization to guarantee model selection consistency, but in
finite samples the appropriate choice of λ is not clear.
• Bottom line: in applied work, the LASSO is most useful for variable screening
rather than model selection.
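A minimal sketch of both λ-selection rules using scikit-learn's LassoCV (simulated data; the fold count and data-generating process are illustrative assumptions):

```python
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(0)
N, K = 200, 50
X = rng.normal(size=(N, K))
beta_true = np.zeros(K); beta_true[:5] = 1.0         # hypothetical sparse truth
y = X @ beta_true + rng.normal(size=N)

cv = LassoCV(cv=10).fit(X, y)
mean_mse = cv.mse_path_.mean(axis=1)                 # average held-out error per lambda
se_mse = cv.mse_path_.std(axis=1) / np.sqrt(cv.mse_path_.shape[1])

i_min = mean_mse.argmin()                            # rule 1: minimize average CV error
within = mean_mse <= mean_mse[i_min] + se_mse[i_min] # rule 2: one-standard-error rule
lam_1se = cv.alphas_[within].max()                   # largest lambda within one SE

print("lambda_min:", cv.alphas_[i_min], "lambda_1se:", lam_1se)
```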
Statistical inference

• We might also be interested in performing hypothesis testing on the coefficient estimates from LASSO.

• One option is to put the selected variables into an OLS regression. This is highly problematic because there is no conditioning on model selection.
• For example, the first variable selected by LASSO will be the one with the
highest partial correlation with the response.
• Suppose number of regressors is large and covariates are randomly generated.
One of the covariates is likely to be highly correlated with the response; will be
selected by LASSO; and have a significant p-value in an OLS regression.
• This post-selection inference problem is not fully resolved in the statistics
literature.
• Possibility: sample splitting. Estimate LASSO on some data points, perform
OLS regression on selected variables in the remaining data points.
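A minimal sketch of the sample-splitting idea (simulated data; statsmodels is used here only to obtain conventional OLS standard errors on the held-out half):

```python
import numpy as np
import statsmodels.api as sm
from sklearn.linear_model import LassoCV
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
N, K = 400, 50
X = rng.normal(size=(N, K))
beta_true = np.zeros(K); beta_true[:5] = 1.0         # hypothetical sparse truth
y = X @ beta_true + rng.normal(size=N)

# Select variables with LASSO on one half of the data...
X_a, X_b, y_a, y_b = train_test_split(X, y, test_size=0.5, random_state=0)
selected = np.flatnonzero(LassoCV(cv=5).fit(X_a, y_a).coef_)

# ...then run OLS, with conventional standard errors, on the other half.
ols = sm.OLS(y_b, sm.add_constant(X_b[:, selected])).fit()
print("selected columns:", selected)
print(ols.summary().tables[1])
```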

Adaptive LASSO

• Proposed by Zou (2006).

• The LASSO penalty screens out irrelevant variables, but it also biases the
estimates of relevant variables.

• Ideally, we want to weaken the penalty on relevant covariates and vice versa.

• Estimator:

  min_β Σ_{i=1}^N (y_i − x_i'β)² + λ Σ_j ω_j |β_j|

  where

  ω_j = 1 / |β̂_j^OLS|^γ

• You can still use standard software to estimate it.
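A minimal sketch of the standard-software trick: rescale each covariate by |β̂_j^OLS|^γ, run a plain LASSO, and rescale the coefficients back (simulated data, γ = 1):

```python
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(0)
N, K = 400, 20
X = rng.normal(size=(N, K))
beta_true = np.zeros(K); beta_true[:3] = 2.0         # hypothetical sparse truth
y = X @ beta_true + rng.normal(size=N)

# Pilot OLS estimates give the weights w_j = 1 / |beta_j_OLS|^gamma (here gamma = 1).
beta_ols = np.linalg.solve(X.T @ X, X.T @ y)
w = 1.0 / np.abs(beta_ols)

# A plain LASSO on the rescaled covariates X_j / w_j penalizes w_j |beta_j|.
lasso = LassoCV(cv=5).fit(X / w, y)
beta_adaptive = lasso.coef_ / w                      # undo the rescaling

print("non-zero coefficients:", np.flatnonzero(beta_adaptive))
```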


Elastic net

• Advantages and disadvantages of LASSO vs. ridge: saturation in the "large number of regressors, small sample" case.

• We can combine LASSO and ridge:

  min_β Σ_{i=1}^N (y_i − x_i'β)² + λ_1 ‖β‖_1 + λ_2 ‖β‖_2²

  or

  min_β Σ_{i=1}^N (y_i − x_i'β)² + λ (‖β‖_1 + α ‖β‖_2²)

• An elastic net can be expressed as a linear support vector machine.

• Particularly useful because we can use available software for support vector
machines.
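A minimal sketch with scikit-learn's ElasticNet (its l1_ratio argument mixes the L1 and L2 penalties; the correlated-regressor design is an illustrative assumption):

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso

rng = np.random.default_rng(0)
N, K = 100, 200                                      # more regressors than observations
Z = rng.normal(size=(N, 5))
X = np.repeat(Z, 40, axis=1) + 0.1 * rng.normal(size=(N, K))  # correlated covariate groups
y = Z[:, 0] + rng.normal(size=N)

enet = ElasticNet(alpha=0.1, l1_ratio=0.5, max_iter=10000).fit(X, y)
lasso = Lasso(alpha=0.1, max_iter=10000).fit(X, y)

# With correlated regressors, the elastic net tends to keep groups of correlated
# covariates together, while the LASSO tends to pick a few representatives.
print("non-zero coefficients, elastic net:", np.sum(enet.coef_ != 0))
print("non-zero coefficients, LASSO:      ", np.sum(lasso.coef_ != 0))
```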
[Figure, Hastie et al.: contours of constant value of the L_q penalty with q = 1.2 versus the elastic-net penalty with α = 0.2.]
More background

• References:

1. Statistical Learning with Sparsity: The Lasso and Generalizations, by Trevor Hastie, Robert Tibshirani, and Martin Wainwright (2015).

2. "High-Dimensional Methods and Inference on Treatment and Structural Effects in Economics," by Belloni, Chernozhukov, and Hansen (2014).

3. "Inference on Treatment Effects after Selection among High-Dimensional Controls," by Belloni, Chernozhukov, and Hansen (2014).

Example III: Support vector machines

• Developed by Vladimir Vapnik and Alexey Chervonenkis (1963).

• One of the most popular algorithms in supervised learning.

• Nice example of the “kernel trick.”


• Classification problem for {y_i, x_i}_{i=1}^N with y_i ∈ {−1, 1}.

• Divide observations into two groups by a boundary f(x_i).

• Maximum-margin hyperplane principle: we search for the boundary that separates the two regions of data as much as possible (hard margin in the separable case vs. soft margin in the non-separable case).

• We call observations at the margin support vectors.

• Comparison with logit or probit.


[Figure 12.1, Hastie et al.: support vector classifiers. The left panel shows the separable case: the decision boundary x'β + β_0 = 0 is the solid line, and the broken lines bound the shaded maximal margin of width 2M = 2/‖β‖. The right panel shows the nonseparable (overlap) case, where the points labeled ξ*_j lie on the wrong side of their margin.]
Linear boundaries

• A standard choice: linear boundary with one regressor:

  y_i = 1 if β_0 + β_1 x_i ≥ 0
  y_i = −1 otherwise

• Find:

  min_{β_0,β_1} (1/N) Σ_{i=1}^N max[0, 1 − y_i (β_0 + β_1 x_i)] + λ ‖β‖_2²

• The term max[0, 1 − y_i (β_0 + β_1 x_i)] is the hinge loss; λ ‖β‖_2² is an L2 (ridge-type) regularization term.

• Note that by setting λ → 0, we approximate a hard-margin boundary.

• You can solve the primal problem with a sub-gradient descent method or the dual problem with a coordinate descent algorithm.
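A minimal sketch with scikit-learn's LinearSVC, which minimizes the hinge loss plus an L2 penalty (its C parameter plays the role of 1/λ; the simulated data and the two C values are illustrative assumptions):

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
N = 300
X = rng.normal(size=(N, 2))
y = np.where(X[:, 0] + 0.5 * X[:, 1] + 0.3 * rng.normal(size=N) > 0, 1, -1)

# Soft margin: small C (large lambda) tolerates more margin violations;
# large C (lambda -> 0) approximates a hard-margin boundary.
soft = LinearSVC(C=0.1, loss="hinge").fit(X, y)
hard = LinearSVC(C=1e4, loss="hinge").fit(X, y)

print("soft-margin coefficients:", soft.coef_, soft.intercept_)
print("near-hard-margin coefficients:", hard.coef_, hard.intercept_)
```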
[Figure, Hastie et al.: linear support vector classifier on the two-class mixture data with C = 10000. Training error: 0.270, test error: 0.288, Bayes error: 0.210.]
[Figure: the same simulated two-class classification example with C = 0.01.
Training Error: 0.26
Test Error: 0.30
Bayes Error: 0.21]
37
38
Example III: Regression trees

• Classification and Regression Trees, Leo Breiman, Jerome Friedman, Charles J.


Stone, and R.A. Olshen.

• Main idea: recursively partition the space of regressors into subspaces and compute the mean of the dependent variable within each subspace.

• The result is a flexible step function.

• Works surprisingly well in practice (very popular in data mining), but with few theoretical results (asymptotic normality?).

• Cross-validation to determine penalty λ on the number of leaves in the tree.

• “Bet on sparsity” principle.


39
Recursive partitions I

• Compute f(x) = ȳ, the constant prediction that minimizes

    MSE(f(x)) = Σ_{i=1}^{N} (y_i − f(x_i))^2

• For a regressor x_k and a threshold t, find the conditional means:

    ȳ | x_k ≤ t   and   ȳ | x_k > t

  and define

    f_{k,t}(x) = ȳ | x_k ≤ t   if x_k ≤ t
    f_{k,t}(x) = ȳ | x_k > t   if x_k > t

40
Recursive partitions II

• Select the regressor x_k and threshold t that minimize the fit error:

    (x_k^*, t^*) = argmin_{x_k, t} MSE(f_{k,t}(x))

• Keep splitting as long as it reduces the penalized objective

    min { MSE(f_{k,t}(x)) + λ · #leaves }

  (a minimal implementation is sketched below).

41
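To fix ideas, here is a minimal sketch (not the lecture's code) of the greedy splitting rule above, in Python with numpy. The penalty lam per additional leaf and the stopping rule min_obs are illustrative choices.

# Minimal sketch of greedy CART-style recursive partitioning; illustrative only.
import numpy as np

def best_split(X, y):
    """Return (k, t, sse) for the regressor/threshold pair minimizing the squared error."""
    best = (None, None, np.inf)
    for k in range(X.shape[1]):
        for t in np.unique(X[:, k])[:-1]:            # candidate thresholds
            left, right = y[X[:, k] <= t], y[X[:, k] > t]
            sse = ((left - left.mean()) ** 2).sum() + \
                  ((right - right.mean()) ** 2).sum()
            if sse < best[2]:
                best = (k, t, sse)
    return best

def grow_tree(X, y, lam=1.0, min_obs=10):
    """Recursively split while the error reduction exceeds the penalty lam per leaf."""
    k, t, sse = best_split(X, y)
    current_sse = ((y - y.mean()) ** 2).sum()
    if k is None or len(y) < min_obs or current_sse - sse < lam:
        return {"leaf": True, "prediction": y.mean()}
    mask = X[:, k] <= t
    return {"leaf": False, "k": k, "t": t,
            "left": grow_tree(X[mask], y[mask], lam, min_obs),
            "right": grow_tree(X[~mask], y[~mask], lam, min_obs)}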
[Figure: recursive binary partition of the (X1, X2) space into regions R1–R5 by thresholds t1–t4, and the corresponding tree: split on X1 ≤ t1, then X2 ≤ t2 and X1 ≤ t3, then X2 ≤ t4.]
42
Improvements

• Pruning an existing tree.

• Boosting: fit a new regression tree to the residuals ε = y − f (x), i.e., on the data (ε, x), and iterate.

• Bagging (Bootstrap AGGregatING): fit regression trees on bootstrapped samples of the data; the estimate is the average over the bootstrapped trees.

• Random forest.

43
Other algorithms

• Multi-task LASSO.

• Least Angle Regression (LARS).

• LARS LASSO.

• Gaussian Process Regression (GPR).

• Dantzig Selector.

• Ensemble methods.

• Super learners/stacking (asymptotic properties).

44
Unsupervised and Reinforcement Learning

(Lectures on Machine Learning for Economists III)

Jesús Fernández-Villaverde1
December 28, 2018
1
University of Pennsylvania
Unsupervised learning

• Use a sample:

    D = {x_i}_{i=1}^{N}

  to:

1. Group observations in interesting patterns.

2. Describe most important sources of variation in the data.

3. Dimensionality reduction.

• Example: what can we learn about the loan book of a bank without imposing
too much a priori structure?

• More concretely, we search for:


p (xi |θ)

• Clustering and association rules.

1
[Figure 14.5: Simulated data. On the left, K-means clustering (with K = 2) has been applied to the raw data; the two colors indicate the cluster memberships. On the right, the features were first standardized before clustering, which is equivalent to using feature weights 1/[2 · var(Xj)]. The standardization has obscured the two well-separated groups.]
2
Cluster discovery

1. Select K clusters:

    K^* = argmax_K p(K|D)

2. Assign each observation to a cluster:

    z_i^* = argmax_k p(z_i = k|x_i, D)

3. A common method to pin down K is the silhouette. For each observation i, we compute:

    s_i = (b(i) − a(i)) / max(a(i), b(i))

  where a(i) is the average distance between i and all other members of its own cluster, and b(i) is the smallest average distance between i and the members of any other cluster.
3
K-means

• K-means clustering by Steinhaus (1957):

    argmin_S Σ_{i=1}^{k} Σ_{x ∈ S_i} ‖x − μ_i‖^2

• It requires an iterative algorithm for implementation: Lloyd (1957), sketched below.

• Related variations:

  1. k-medians ⇒ uses medians computed through the Taxicab geometry.

  2. k-medoids ⇒ minimizes a sum of pairwise dissimilarities.

  3. k-SVD.

4
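A bare-bones sketch of Lloyd's iterative algorithm for the K-means objective above, in Python with numpy; the random initialization and simple convergence check are illustrative choices, not the lecture's recommendation.

# Bare-bones Lloyd's algorithm for K-means; X is an (N x d) numpy array.
import numpy as np

def k_means(X, K, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=K, replace=False)]   # random initial centroids
    for _ in range(n_iter):
        # Assignment step: nearest centroid in Euclidean distance.
        dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        # Update step: each centroid becomes the mean of its assigned points.
        new_centers = np.array([X[labels == k].mean(axis=0) if (labels == k).any()
                                else centers[k] for k in range(K)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers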
5
6
Other algorithms

• Other clustering methods:

1. Agglomerative clustering.

2. DBSCAN.

3. Birch.

• Principal component analysis.

• Density estimation.

• Gaussian mixture models.

• Association rules and the Apriori algorithm (Agrawal and Srikant, 1994).

7
8
[Figure: simulated data in (X1, X2) with the largest and smallest principal component directions indicated.]
9
Reinforcement learning

• Algorithms that use training information that evaluates the actions taken
instead of deciding whether the action was correct.

• Purely evaluative feedback assesses how good the action taken was, but not whether it was the best feasible action.

• In comparison, in supervised learning, we have purely instructive feedback that


indicates best feasible action regardless of action actually taken.

• Current research challenge: how to handle associative behavior.

• Similar (same?) ideas are called approximate dynamic programming or


neuro-dynamic programming.

10
Example: Multi-armed bandit problem

• You need to choose action a among k available options.

• Each option is associated with a probability distribution of payoffs.

• You want to maximize the expected (discounted) payoffs.

• But you do not know which action is best; you only have estimates (a dual control problem of identification and optimization).

• You can observe actions and payoffs.

• Go back to the study of “sequential design of experiments” by Thompson


(1933, 1934) and Bellman (1956).

11
12
Theory vs. practice

• You can follow two pure strategies:

1. Follow greedy actions: actions with highest expected value. This is known as
exploiting.

2. Follow non-greedy actions: actions with dominated expected value. This is


known as exploring.

• This should remind you of a basic dynamic programming problem: what is the
optimal mix of pure strategies?

• If we impose enough structure on the problem (i.e., distributions of payoffs


belong to some family, stationarity, etc.), we can solve (either theoretically or
applying standard solution techniques) the optimal strategy (at least, up to
some upper bound on computational capabilities).

• But these structures are too restrictive for practical purposes outside the pages
of Econometrica.
13
An action-value method I

• Proposed by Thathachar and Sastry (1985).

• A very simple method that uses the average Q_n(a) of the rewards R_i(a), i = 1, ..., n − 1, actually received:

    Q_n(a) = (1/(n − 1)) Σ_{i=1}^{n−1} R_i(a)

• We start with Q_1(a) = 0 for all a. Here (and later), we randomize among ties.

• We update Q_n(a) thanks to a nice recursive update based on the linearity of means:

    Q_{n+1}(a) = Q_n(a) + (1/n) [R_n(a) − Q_n(a)]

  Averages of actions not picked are not updated.

14
An action-value method II

• How do we pick actions?

1. Pure greedy method: argmaxa Qt (a).

2. -greedy method. Mixed best action with a random trembling.

• Easy to generalize to more sophisticated strategies.

• In particular, we can connect with genetic algorithms (we will discuss AlphaGo
later on).

15
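To make the method concrete, here is a minimal simulation of the sample-average update with ε-greedy action selection on a k-armed bandit. The Gaussian reward distributions, k = 10, and ε = 0.1 are illustrative choices, not taken from the lecture.

# epsilon-greedy action-value method on a k-armed Gaussian bandit (illustrative).
import numpy as np

def run_bandit(k=10, steps=1000, eps=0.1, seed=0):
    rng = np.random.default_rng(seed)
    q_true = rng.normal(size=k)          # true action values q*(a)
    Q = np.zeros(k)                      # estimates, initialized at 0
    N = np.zeros(k)                      # pull counts
    rewards = np.empty(steps)
    for t in range(steps):
        if rng.random() < eps:
            a = rng.integers(k)                           # explore
        else:
            a = rng.choice(np.flatnonzero(Q == Q.max()))  # exploit, break ties at random
        r = rng.normal(q_true[a], 1.0)
        N[a] += 1
        Q[a] += (r - Q[a]) / N[a]        # incremental sample-average update
        rewards[t] = r
    return rewards.mean(), Q

print(run_bandit(eps=0.1)[0])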
[Figure 2.1: an example 10-armed bandit testbed. The reward distribution of each action a is centered at its action value q*(a), a = 1, ..., 10.]
16
[Figure 2.2: Average performance of ε-greedy action-value methods (ε = 0, 0.01, 0.1) on the 10-armed testbed: average reward and % optimal action over 1000 steps.]
17
A more general update rule

• Let's think about a modified update rule:

    Q_{n+1}(a) = Q_n(a) + α [R_n(a) − Q_n(a)]

  for α ∈ (0, 1].

• This is equivalent, by recursive substitution, to:

    Q_{n+1}(a) = (1 − α)^n Q_1(a) + Σ_{i=1}^{n} α(1 − α)^{n−i} R_i(a)

• We can also have a time-varying α_n(a), which ensures convergence with probability 1 as long as:

    Σ_{n=1}^{∞} α_n(a) = ∞   and   Σ_{n=1}^{∞} α_n^2(a) < ∞

18
Improving the algorithm

• We can start with an “optimistic” Q_1 to induce exploration.

• We can implement an upper-confidence-bound (UCB) action selection:

    argmax_a [ Q_n(a) + c √(log n / N_n(a)) ]

• We can have a gradient bandit algorithm based on a softmax choice:

    π_n(a) = P(A_n = a) = e^{H_n(a)} / Σ_{b=1}^{k} e^{H_n(b)}

  where

    H_{n+1}(A_n) = H_n(A_n) + α (1 − π_n(A_n)) (R_n − R̄_n)
    H_{n+1}(a) = H_n(a) − α π_n(a) (R_n − R̄_n)   for all a ≠ A_n

  This is a slightly hidden version of a stochastic gradient algorithm that we will see soon when we talk about deep learning.
19
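A short sketch of the UCB selection rule above. Here Q and N are the value estimates and pull counts (as in the ε-greedy sketch earlier), n is the current step, and c = 2 is an illustrative value.

# Upper-confidence-bound action selection (sketch); Q, N as in the bandit code above.
import numpy as np

def ucb_action(Q, N, n, c=2.0):
    # Untried actions get an infinite bonus and are therefore picked first.
    bonus = np.where(N > 0, c * np.sqrt(np.log(max(n, 1)) / np.maximum(N, 1)), np.inf)
    return int(np.argmax(Q + bonus))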
[Figure 2.3: The effect of optimistic initial action-value estimates on the 10-armed testbed. Optimistic, greedy (Q1 = 5, ε = 0) vs. realistic, ε-greedy (Q1 = 0, ε = 0.1); both methods use a constant step-size parameter α = 0.1.]
20
[Figure 2.4: Average performance of UCB action selection (c = 2) on the 10-armed testbed. UCB generally performs better than ε-greedy action selection (ε = 0.1), except in the first k steps, when it selects randomly among the as-yet-untried actions.]
21
[Figure 2.6: A parameter study of the various bandit algorithms (ε-greedy, UCB, gradient bandit, and greedy with optimistic initialization, α = 0.1). Each point is the average reward over the first 1000 steps with a particular algorithm at a particular setting of its parameter (ε, α, c, or Q0).]
22
Other algorithms

• Monte Carlo prediction.

• Temporal-difference (TD) learning:

    V^{n+1}(s_t) = V^n(s_t) + α (r_{t+1} + β V^n(s_{t+1}) − V^n(s_t))

• SARSA ⇒ on-policy TD control:

    Q^{n+1}(a_t, s_t) = Q^n(a_t, s_t) + α (r_{t+1} + β Q^n(a_{t+1}, s_{t+1}) − Q^n(a_t, s_t))

• Q-learning ⇒ off-policy TD control:

    Q^{n+1}(a_t, s_t) = Q^n(a_t, s_t) + α (r_{t+1} + β max_{a_{t+1}} Q^n(a_{t+1}, s_{t+1}) − Q^n(a_t, s_t))

• Actor-critic methods.
23
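A minimal tabular implementation of the off-policy Q-learning update above; the state/action sizes and the single transition in the usage example are illustrative.

# Tabular Q-learning update (off-policy TD control), following the formula above.
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, beta=0.95):
    """Q is an (n_states x n_actions) array; returns the updated table."""
    td_target = r + beta * Q[s_next].max()
    Q[s, a] += alpha * (td_target - Q[s, a])
    return Q

# Example: a 3-state, 2-action table updated after observing (s=0, a=1, r=1.0, s'=2).
Q = np.zeros((3, 2))
Q = q_learning_update(Q, s=0, a=1, r=1.0, s_next=2)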
Deep Learning

(Lectures on Machine Learning for Economists IV)

Jesús Fernández-Villaverde
December 28, 2018
1
University of Pennsylvania
Deep learning

• Some of the most active research right now thanks to the rediscovery of
artificial neural networks (ANN; also known as connectionist systems) in the
2000s.

• It can deal with very abstract set-ups and, thus, it has more than considerable
option value.

• Key advances:

1. Multilayer neural networks (combinations of generalized linear models).

2. Back-propagation through gradient descent.

• Also, huge computational power requirements that are now feasible.

1
2
AlphaGo

• Big splash: AlphaGo vs. Lee Sedol in March 2016.

• Silver et al. (2018): now applied to chess, shogi, and Go.

• Check also:

1. https://deepmind.com/research/alphago/.

2. https://www.alphagomovie.com/

• Very different than Deep Blue against Kasparov.

• New and surprising strategies.

• However, you need to keep this accomplishment in perspective.

3
[Figure 1 (AlphaGo, Silver et al.): Neural network training pipeline and architecture. a, A fast rollout policy pπ and a supervised learning (SL) policy network pσ are trained to predict human expert moves in a data set of positions. A reinforcement learning (RL) policy network pρ is initialized to the SL policy network and then improved by policy gradient learning to maximize the outcome against previous versions of the policy network. A new data set is generated by self-play with the RL policy network. Finally, a value network vθ is trained by regression to predict the expected outcome (whether the current player wins) in positions from the self-play data set. b, The policy network takes a representation of the board position s as input, passes it through many convolutional layers with parameters σ (SL) or ρ (RL), and outputs a probability distribution over legal moves. The value network similarly uses many convolutional layers with parameters θ, but outputs a scalar value vθ(s′) that predicts the expected outcome in position s′.]
4
[Figure 3 (AlphaGo): Monte Carlo tree search. a, Selection: each simulation traverses the tree by selecting the edge with maximum action value Q plus a bonus u(P) that depends on a stored prior probability P. b, Expansion: the leaf node may be expanded; the new node is processed once by the policy network pσ and the output probabilities are stored as prior probabilities P for each action. c, Evaluation: at the end of a simulation, the leaf node is evaluated in two ways — using the value network vθ and by running a rollout to the end of the game with the fast rollout policy pπ. d, Backup: action values Q are updated to track the mean value of all evaluations r(·) and vθ(·) in the subtree below that action.]
5
[Fig. 1 (Silver et al., Science 362, 1140–1144, 2018): Training AlphaZero for 700,000 steps. Elo ratings were computed from games between different players, each given 1 s per move. (A) Performance of AlphaZero in chess compared with the 2016 TCEC world champion program Stockfish. (B) Performance in shogi compared with the 2017 CSA world champion program Elmo. (C) Performance in Go compared with AlphaGo Lee and AlphaGo Zero (20 blocks over 3 days).]
6
Why deep learning?

• We want to approximate complex data structures with a multitude of factors of


variation. In economics: models with large state spaces, high-dimensional data such as network linkages.

• To do so, we build representations of those data structures.

• Deep learning can build extremely accurate representations by combining


simpler representations recursively using simple linear combinations and
non-linear monotone transformations.

• Some of these simple representations can be, however, highly counter-intuitive


and yet, deliver excellent performance.

• Moreover, ANNs often require less “inside knowledge” from experts in the area.

• Excellent software libraries.

7
[Figure: nested view of the field — deep learning (example: MLPs) within representation learning (example: shallow autoencoders) within machine learning (example: logistic regression) within AI (example: knowledge bases).]
8
9
Limitations of deep learning

• While deep learning can work extremely well, there is no such thing as a silver bullet.

• Clear and serious trade-offs in real-life applications.

• The rule of thumb in the industry is that one needs around 10^7 labeled observations to properly train a complex ANN, with around 10^4 observations in each relevant group.

• Of course, sometimes “observations” are endogenous (we can simulate them), but if your goal is to forecast GDP next quarter, it is unlikely an ANN will beat an ARIMA(p,d,q) (at least with macro variables alone).

10
Neural networks

• Non-linear functional approximation method.

• Much hype around them and over-emphasis of biological interpretation.

• We will follow a more sober, formal treatment (which, in any case, agrees with the approach of state-of-the-art researchers).

• In particular, we will highlight connections with econometrics (e.g., NOLS,


semiparametric regression, and sieves).

• We will start describing the simplest possible neural network.

11
A neuron

• N observables: x1 , x2 ,...,xN . We stack them in x.

• Coefficients (or weights): θ0 (a constant), θ1 , θ2 , ...,θN . We stack them in θ.

• We build a linear combination of observations:

    z = θ_0 + Σ_{n=1}^{N} θ_n x_n

  Theoretically, we could build non-linear combinations, but that is unlikely to be a fruitful idea in general.

• We transform such linear combination with an activation function:

y = g (x; θ) = φ (z)

The activation function might have some coefficients γ on its own.

• Why do we need an activation function?


12
Flow representation

[Diagram: inputs x1, ..., xn with weights θ1, ..., θn are combined into the net input Σ_i θi xi and passed through an activation function (with parameter γ) to produce the perceptron classification output.]

13
The biological analog

14
Activation functions

• Traditionally:

1. A sigmoidal function:

    φ(v) = 1 / (1 + e^{−v})

2. A particular limiting case: step function.

3. Hyperbolic tangent.

4. Identity function: linear regression.

• Other activation functions have gained popularity recently:

1. Rectified linear unit (ReLU):


φ (v ) = max(0, v )

2. Softplus:
φ (v ) = log (1 + e v )
15
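The common activation functions above, and a single neuron built from them, in a few lines of numpy. This is a sketch for intuition only; the input values and coefficients are made up.

# Common activation functions and a single neuron y = phi(theta_0 + theta'x).
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def relu(v):
    return np.maximum(0.0, v)

def softplus(v):
    return np.log1p(np.exp(v))

def neuron(x, theta0, theta, phi=sigmoid):
    z = theta0 + np.dot(theta, x)     # linear combination of the observables
    return phi(z)                     # non-linear activation

print(neuron(np.array([1.0, -2.0]), theta0=0.5, theta=np.array([0.3, 0.1]), phi=relu))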
16
17
Interpretation

• θ0 controls the activation threshold.

• The levels of the θ's control the activation rate (the higher the θ, the steeper the activation).

• Some textbooks separate the activation threshold and a scaling parameter from
θ as different coefficients in φ, but such separation moves notation farther away
from standard econometrics.

• Potential identification problem between θ and more general activation


functions with their own parameters.

• But in practice θ do not have a structural interpretation, so the identification


problem is of secondary importance.

• Closely resembles single index models in econometrics.


18
Combining neurons into a neural network

• As before, we have N observables: x1 , x2 ,...,xN .

• Coefficients (or weights): θ0,m (a constant), θ1,m , θ2,m , ...,θN,m .

• We build M linear combinations of observations:

    z_m = θ_{0,m} + Σ_{n=1}^{N} θ_{n,m} x_n

• We transform and add such linear combinations with an activation function:

    y = g(x; θ) = θ_0 + Σ_{m=1}^{M} θ_m φ(z_m)

Note rich parameterization of coefficients but parsimony of basis functions.

• Also, quasi-linear structure in terms of vectors of observables and coefficients.

• This is known as a single layer network.


19
Two classic (yet remarkable) results I

Universal approximation theorem: Hornik, Stinchcombe, and White (1989)


A neural network with at least one hidden layer can approximate any Borel
measurable function mapping finite-dimensional spaces to any desired degree of
accuracy.

• Intuition of the result.

• Comparison with other results in series approximations.

20
Two classic (yet remarkable) results II

• Assume, as well, that we are dealing with the class of functions for which the
Fourier transform of their gradient is integrable.

Breaking the curse of dimensionality: Barron (1993)


A one-layer NN achieves an integrated square error of order O(1/M), where M is the number of nodes. In comparison, for series approximations, the integrated square error is of order O(1/(M^{2/N})), where N is the dimension of the function to be approximated.

• We actually rely on the theorems by Leshno et al. (1993) and Bach (2017).

• What about Chebyshev polynomials? Splines? Problems of convergence and


extrapolation.

• There is another, yet more subtle curse of dimensionality.

21
22
Training the network

• θ is selected to minimize the quadratic error function E(θ; Y, ŷ):

    θ^* = argmin_θ E(θ; Y, ŷ)
        = argmin_θ Σ_{j=1}^{J} E(θ; y_j, ŷ_j)
        = argmin_θ (1/2) Σ_{j=1}^{J} ‖y_j − g(x_j; θ)‖^2

• Where do the observations Y come from?

• How do we solve this minimization problem?

23
Minimization

• Minibatch gradient descent (a variation of stochastic gradient descent) is the most popular algorithm (see Appendix A for classical direction set methods).

• Why stochastic? Intuition from Monte Carlos.

• Why mini-batch? Intuition from GMM. Note also resilience to scaling.

• In practice, we do not need a global min (≠ likelihood).

• You can flush the algorithm to a graphics processing unit (GPU) or a tensor
processing unit (TPU) instead of a standard CPU.

24
[Figure 2-6: Batch gradient descent is sensitive to saddle points, which can lead to premature convergence.]
25
[Figure 2-7: The stochastic error surface fluctuates with respect to the batch error surface, enabling saddle point avoidance. Instead of a single static error surface, the error surface is dynamic; as a result, descending on this stochastic surface significantly improves our ability to navigate flat regions.]
26
Stochastic gradient descent

• Random multi-trial with initialization from a proposal distribution Θ (typically a Gaussian or uniform):

    θ_0 ∼ Θ

• θ is recursively updated:

    θ_{t+1} = θ_t − ε_t ∇E(θ; y_j, ŷ_j)

  where

    ∇E(θ; y_j, ŷ_j) ≡ [ ∂E(θ; y_j, ŷ_j)/∂θ_0, ∂E(θ; y_j, ŷ_j)/∂θ_1, ..., ∂E(θ; y_j, ŷ_j)/∂θ_{N,M} ]^⊤

  is the gradient of the error function with respect to θ evaluated at (y_j, ŷ_j), until:

    ‖θ_{t+1} − θ_t‖ < ε

• In a minibatch, you use a few observations instead of just one.


27
Some details

• We select the learning rate ε_t > 0 in each iteration by line search to minimize the error function in the direction of the gradient.

• We evaluate the gradient using back-propagation (Rumelhart et al., 1986):

    ∂E(θ; y_j, ŷ_j)/∂θ_0 = −(y_j − g(x_j; θ))
    ∂E(θ; y_j, ŷ_j)/∂θ_m = −(y_j − g(x_j; θ)) φ(z_m),   for all m
    ∂E(θ; y_j, ŷ_j)/∂θ_{0,m} = −(y_j − g(x_j; θ)) θ_m φ′(z_m),   for all m
    ∂E(θ; y_j, ŷ_j)/∂θ_{n,m} = −(y_j − g(x_j; θ)) θ_m x_n φ′(z_m),   for all n, m

  where φ′(z) is the derivative of the activation function.

• This will be particularly important below when we introduce multiple layers (a worked sketch follows this slide).

28
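A compact sketch that puts the last few slides together: the single-layer network g(x; θ), the quadratic error, and minibatch stochastic gradient descent with the back-propagation derivatives above. Everything here (data, dimensions, learning rate, number of epochs) is illustrative, not from the lecture.

# Single-layer network trained by minibatch SGD with hand-coded back-propagation.
import numpy as np

rng = np.random.default_rng(0)
N, M, J = 3, 8, 512                           # inputs, hidden units, observations
X = rng.normal(size=(J, N))
y = np.sin(X @ np.array([1.0, -0.5, 0.25]))   # some target to fit (illustrative)

phi  = lambda z: 1.0 / (1.0 + np.exp(-z))     # sigmoid activation
dphi = lambda z: phi(z) * (1.0 - phi(z))      # its derivative

W  = rng.normal(scale=0.5, size=(M, N))       # theta_{n,m}
b  = np.zeros(M)                              # theta_{0,m}
w2 = rng.normal(scale=0.5, size=M)            # theta_m
b2 = 0.0                                      # theta_0

lr, batch = 0.05, 32
for epoch in range(200):
    idx = rng.permutation(J)
    for start in range(0, J, batch):
        j = idx[start:start + batch]
        z = X[j] @ W.T + b                    # (batch x M) pre-activations
        h = phi(z)
        g = b2 + h @ w2                       # network output g(x; theta)
        err = y[j] - g                        # residuals y_j - g(x_j; theta)
        # Back-propagation of the quadratic error, averaged over the minibatch:
        grad_b2 = -err.mean()
        grad_w2 = -(err[:, None] * h).mean(axis=0)
        delta   = -(err[:, None] * w2[None, :] * dphi(z))    # (batch x M)
        grad_b  = delta.mean(axis=0)
        grad_W  = delta.T @ X[j] / len(j)
        b2 -= lr * grad_b2
        w2 -= lr * grad_w2
        b  -= lr * grad_b
        W  -= lr * grad_W

print("in-sample MSE:", np.mean((y - (b2 + phi(X @ W.T + b) @ w2)) ** 2))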
Alternative minimization algorithms

1. Newton and Quasi-Newton are unlikely to be of much use in practice. Why?


However, perhaps your problem is an exception.

2. MCMC/simulated annealing.

3. Genetic algorithms:

• In fact, much of the research in deep learning incorporates some flavor of genetic
selection.

• Basic idea.

29
Multiple layers I

• The hidden layers can be multiplied without limit in a feed-forward ANN.


• We build K layers:

    z_m^1 = θ_{0,m}^1 + Σ_{n=1}^{N} θ_{n,m}^1 x_n

  and

    z_m^2 = θ_{0,m}^2 + Σ_{m'=1}^{M} θ_{m',m}^2 φ(z_{m'}^1)

    ...

    y = g(x; θ) = θ_0^K + Σ_{m=1}^{M} θ_m^K φ(z_m^{K−1})

30
Flow representation

[Diagram: feed-forward network with an input layer (x1, x2, x3), two hidden layers, and an output layer.]
31
Multiple layers II

• Why do we want to introduce hidden layers?

1. It works! Our brains have six layers. AlphaGo has 12 layers with ReLUs.

2. Hidden layers induce highly nonlinear behavior.

3. Allow for clustering of variables.

• We can have different M’s in each layer ⇒ fewer neurons in higher layers allow
for compression of learning into fewer features.

• We can also add multidimensional outputs.

• Or even to produce, as output, a probability distribution, for example, using a softmax layer:

    y_m = e^{z_m^{K−1}} / Σ_{m'=1}^{M} e^{z_{m'}^{K−1}}

32
Alternative ANNs

• Convolutional neural networks.

• Feedback ANN such as the Hopfield network.

• Self-organizing maps (SOM).

• ANN and reinforcement learning.

33
[Figure: example of a 2-D convolution. A 3×4 input with entries a–l is convolved with a 2×2 kernel (w, x; y, z), producing a 2×3 output with entries aw + bx + ey + fz, bw + cx + fy + gz, cw + dx + gy + hz, ew + fx + iy + jz, fw + gx + jy + kz, and gw + hx + ky + lz.]
34
35
36
Appendix A
Direction set methods

• Suppose a function f is roughly approximated as a quadratic form:

    f(x) ≈ (1/2) x^⊤ A x − b^⊤ x + c

  where A is a known, square, symmetric, positive-definite matrix.

• Then f (x) is minimized by the solution to

Ax = b

• We can, in general, calculate f (P) and ∇f (P) for a given N-dimensional point
P.

• How can we use this additional information?

37
Steepest descent method

• A tempting (but not very good) possibility: the steepest descent method.

• Start at a point P0 . As many times as needed, move from point Pi to the point
Pi+1 by minimizing along the line from Pi in the direction of the local downhill
gradient −∇f (Pi ).

• Risk of over-shooting.

• To avoid it: perform many small steps (perhaps with line search)⇒ Not very
efficient!

38
Steepest descent method

39
Conjugate gradient method

• A better way.

• In R^N, take N steps, each of which attains the minimum in its direction, without undoing previous steps' progress.

• In other words, proceed in a direction that is somehow constructed to be


conjugate to the old gradient, and to all previous directions traversed.

40
Conjugate gradient method

41
Algorithm - linear

1. Let d_0 = r_0 = b − A x_0.

2. For i = 0, 1, 2, ..., N − 1 do:

   • α_i = (r_i^⊤ r_i) / (d_i^⊤ A d_i).

   • x_{i+1} = x_i + α_i d_i.

   • r_{i+1} = r_i − α_i A d_i.

   • β_{i+1} = (r_{i+1}^⊤ r_{i+1}) / (r_i^⊤ r_i).

   • d_{i+1} = r_{i+1} + β_{i+1} d_i.

3. Return x_N.

42
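A direct transcription of the linear conjugate gradient steps above into Python (for a symmetric positive-definite A); the 2×2 example system is purely illustrative.

# Linear conjugate gradient for Ax = b with A symmetric positive definite.
import numpy as np

def conjugate_gradient(A, b, x0=None, tol=1e-10):
    x = np.zeros_like(b) if x0 is None else x0.astype(float)
    r = b - A @ x
    d = r.copy()
    for _ in range(len(b)):
        Ad = A @ d
        alpha = (r @ r) / (d @ Ad)
        x = x + alpha * d
        r_new = r - alpha * Ad
        if np.linalg.norm(r_new) < tol:
            break
        beta = (r_new @ r_new) / (r @ r)
        d = r_new + beta * d
        r = r_new
    return x

A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
print(conjugate_gradient(A, b))          # compare with np.linalg.solve(A, b)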
Algorithm - non-linear

1. Let d_0 = r_0 = −f′(x_0).

2. For i = 0, 1, 2, ..., N − 1 do:

   • Find α_i that minimizes f(x_i + α_i d_i).

   • x_{i+1} = x_i + α_i d_i.

   • r_{i+1} = −f′(x_{i+1}).

   • β_{i+1} = (r_{i+1}^⊤ r_{i+1}) / (r_i^⊤ r_i)   or   β_{i+1} = max{ (r_{i+1}^⊤ (r_{i+1} − r_i)) / (r_i^⊤ r_i), 0 }.

   • d_{i+1} = r_{i+1} + β_{i+1} d_i.

3. Return x_N.
Go Back

43
Text Analysis

(Lectures on Machine Learning for Economists V)

Jesús Fernández-Villaverde1
December 28, 2018
1
University of Pennsylvania
Text analysis

• Text is a form of data.

• Important for economics:

1. Statements by policy makers.

2. Documents in libraries.

3. Electronic news.

4. Verbal surveys.

5. Opinion mining and sentiment analysis from social media.

• How do we use text in statistical methods?

• Large area with many other applications that, at the moment, are of less
interest in economics.
1
[Figure I: EPU Index for the United States (from “Economic Policy Uncertainty,” Quarterly Journal of Economics).]
2
3
What is text?

• Formally, text is an ordered string of characters.

• Some of these may be from the Latin alphabet –‘a’, ‘A’, ’p,’– but there may
also be:

1. Decorated Latin letters (e.g., ö).

2. Non-Latin alphabetic characters (e.g., Chinese and Arabic).

3. Punctuation (e.g., ‘!’).

4. White spaces, tabs, newlines.

5. Numbers.

6. Non-alphanumeric characters (e.g., ‘@’).

• Key Question: How can we obtain an informative, quantitative representation


of these character strings? This is the goal of text analysis.

• First step is to pre-process strings to obtain a cleaner representation.


4
From files to databases

• Turning raw text files into structured databases is often a challenge:

1. Separate metadata from text.

2. Identify relevant portions of text (paragraphs, sections, etc).

3. Remove graphs and charts.

• First step for non-editable files is conversion to editable format, usually with
optical character recognition software.

• With raw text files, we can use regular expressions to identify relevant patterns.

• HTML and XML pages provide structure through tagging.

• If all else fails, manual extraction.

5
Raw text files

The Quartz guide to bad data


I once acquired the complete dog licensing database for Cook County, Illinois.
Instead of requiring the person registering their dog to choose a breed from a list,
the creators of the system had simply given them a text field to type into. As a
result this database contained at least 250 spellings of Chihuahua.

• Issues:

  1. Inconsistent spelling and/or historical changes.

  2. N/A, blank, or null values.

  3. 0 values (or −1, or dates 1900, 1904, 1969, or 1970).

  4. Text is garbled.

  5. Line endings are garbled.

  6. Text comes from optical-character recognition (OCR).

6
Regular expressions I

7
Regular expressions II

• You need to learn a programming language that manipulates regular expressions


efficiently.

• Tye Rattenbury et al. claim that between 50% and 80% of real-life data analysis is spent on data wrangling.

• About regular expressions in general:

1. Tutorial: https://www.regular-expressions.info/reference.html.

2. Online trial: https://regexr.com/.

8
Regular expressions III

• Modern programming languages have powerful regular expressions capabilities.

• In Python: https:
//www.tutorialspoint.com/python/python_reg_expressions.htm.

• In R: https://www.rstudio.com/wp-content/uploads/2016/09/
RegExCheatsheet.pdf.

• Two key packages: dplyr and tidyr, both part of the tidyverse.

• In particular, learn to use the piping command from dplyr to make code more
readable.

• Look also at https://www.tidytextmining.com/ for text mining.

9
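A tiny Python example of the kind of pattern extraction these references cover; the sample string and the patterns are made up for illustration.

# Extracting simple patterns with Python's re module (illustrative example).
import re

text = "CPI rose 0.3% in March 2006; core CPI rose 0.2%."
percentages = re.findall(r"\d+\.\d+%", text)        # ['0.3%', '0.2%']
dates = re.findall(r"[A-Z][a-z]+ \d{4}", text)      # ['March 2006']
cleaned = re.sub(r"[^\w\s%.]", " ", text.lower())   # strip most punctuation
print(percentages, dates, cleaned)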
Pre-processing I: tokenization

• Tokenization is the splitting of a raw character string into individual elements of


interest.

• Often these elements are words, but we may also want to keep numbers or
punctuation as well.

• Simple rules work well, but not perfectly. For example, splitting on white space
and punctuation will separate hyphenated phrases as in ‘risk-averse agent’ and
contractions as in ‘aren‘t’.

• In practice, you should (probably) use a specialized library for tokenization.

10
Pre-processing II: stopword removal

• The frequency distribution of words in natural languages is highly skewed, with


a few dozen words accounting for the bulk of text.

• These stopwords are typically stripped out of the tokenized representation of


text as they take up memory but do not help distinguish one document from
another.

• Examples from English are ‘a’, ‘the’, ‘to’, ‘for,’ and so on.

• No definitive list, but example on


http://snowball.tartarus.org/algorithms/english/stop.txt.

• It is also common to drop rare words, for example those that appear in less than some fixed percentage of documents.

11
Pre-processing III: linguistic roots

• For many applications, the relevant information in tokens is their linguistic root,
not their grammatical form. We may want to treat ‘prefer’, ‘prefers’,
‘preferences’ as equivalent tokens.

• Two options:

  1. Stemming: a deterministic algorithm for removing suffixes. The Porter stemmer is popular. The stem need not be an English word: the Porter stemmer maps ‘inflation’ to ‘inflat’.

  2. Lemmatizing: tag each token with its part of speech, then look up each (word, POS) pair in a dictionary to find the linguistic root. E.g., ‘saw’ tagged as a verb would be converted to ‘see’; ‘saw’ tagged as a noun is left unchanged.

• A related transformation is case-folding each alphabetic token into lowercase. This is not without ambiguity: e.g., ‘US’ and ‘us’ are each mapped into the same token. (A toy pipeline combining these steps is sketched below.)
12
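A toy pre-processing pipeline tying the last three slides together (tokenization, case-folding, stopword removal, stemming). It uses NLTK's tokenizer, stopword list, and Porter stemmer; the corpora may need to be downloaded first, and the sentence is illustrative.

# Toy pre-processing pipeline: tokenize, case-fold, drop stopwords, stem.
# May require nltk.download('punkt') and nltk.download('stopwords') once.
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

def preprocess(text):
    tokens = word_tokenize(text.lower())                  # tokenization + case-folding
    tokens = [t for t in tokens if t.isalpha()]           # keep alphabetic tokens only
    stop = set(stopwords.words("english"))
    tokens = [t for t in tokens if t not in stop]         # stopword removal
    stemmer = PorterStemmer()
    return [stemmer.stem(t) for t in tokens]              # linguistic roots

print(preprocess("We have noticed a change in the relationship between the core CPI measures."))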
Pre-processing IV: multi-word phrases

• Sometimes groups of individual tokens like “Bank Indonesia” or “text mining”


have a specific meaning.

• One ad-hoc strategy is to tabulate the frequency of all unique two-token


(bigram) or three-token (trigram) phrases in the data, and convert the most
common into a single token.

• In FOMC data, most common bigrams include ‘interest rate’, ‘labor market’,
‘basi point’; most common trigrams include ‘feder fund rate’, ‘real interest
rate’, ‘real gdp growth’, ‘unit labor cost’.

13
More systematic approach

• Some phrases have meaning because they stand in for specific names, like
“Bank Indonesia”. One can use named-entity recognition software applied to
raw, tokenized text data to identify these.

• Other phrases have meaning because they denote a recurring concept, like
“housing bubble”. To find these, one can apply part-of-speech tagging, then
tabulate the frequency of the following tag patterns:

AN/NN/AAN/ANN/NAN/NNN/NPN.

• See chapter on collocations in Manning and Schütze’s Foundations of


Statistical Natural Language Processing for more details.

14
Some terminology

• Corpus: the dataset under consideration (e.g., corporate reports, political speeches, statements, court decisions, newspaper articles, tweets, ...).

• Document: each of the components of the corpus.

• Terms: each of the components of a document (usually words).

• Ngrams: Adjacent terms that we may want to count together (“United


States,” “high unemployment”).

• Metadata: covariates associated with each document (not always present).

15
Notation

• The corpus is composed of D documents indexed by d.

• After pre-processing, each document is a finite, length-Nd list of terms


wd = (wd,1 , . . . , wd,Nd ) with generic element wd,n .

• Let w = (w_1, . . . , w_D) be a list of all terms in the corpus, and let N ≡ Σ_d N_d be the total number of terms in the corpus.

• Suppose there are V unique terms in w, where 1 ≤ V ≤ N, each indexed by v .

• We can map each term in the corpus into this index, so that wd,n ∈ {1, . . . , V }.

• Let x_{d,v} ≡ Σ_n 1(w_{d,n} = v) be the count of term v in document d.

16
Example

• Consider three documents:

1. ‘En un lugar’

2. ‘Muchos años después’

3. ‘Después del lugar’

• Set of V = 7 unique terms:

{en, un, lugar, muchos, años, después, del}

• Index:

    en  un  lugar  muchos  años  después  del
    1   2   3      4       5     6        7

• We then have w_1 = (1, 2, 3); w_2 = (4, 5, 6); w_3 = (6, 7, 3).

• Moreover, x_{1,1} = 1, x_{2,1} = 0, x_{3,1} = 0, etc.

17
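The same bookkeeping in a few lines of Python, reproducing the toy corpus above (the code itself is only a sketch).

# Build the term index and the count data x_{d,v} for the toy corpus above.
from collections import Counter

docs = ["en un lugar", "muchos años después", "después del lugar"]
tokens = [d.split() for d in docs]

vocab = []                                   # index terms in order of first appearance
for doc in tokens:
    for t in doc:
        if t not in vocab:
            vocab.append(t)
index = {t: v + 1 for v, t in enumerate(vocab)}     # en=1, un=2, ..., del=7

w = [[index[t] for t in doc] for doc in tokens]     # w1=(1,2,3), w2=(4,5,6), w3=(6,7,3)
x = [Counter(doc) for doc in tokens]                # term counts per document
print(index, w, x[0]["en"])                         # x_{1,1} = 1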
Multinomial distribution I

• Multivariate generalization of the Bernoulli, categorical, and binomial distributions.

• An experiment with n independent trials over order K ≥ 2 and probability parameters β_1, β_2, ..., β_K has probability mass function:

    f(x_1, x_2, ..., x_K | n, β_1, β_2, ..., β_K) = (n! / (x_1! x_2! ... x_K!)) Π_{i=1}^{K} β_i^{x_i}

  where

    Σ_{i=1}^{K} x_i = n   and   Σ_{i=1}^{K} β_i = 1

18
Multinomial distribution II

• Note an alternative, yet equivalent, form of the pdf:

    f(x_1, x_2, ..., x_K | n, β_1, β_2, ..., β_K) = ( Γ(Σ_{i=1}^{K} x_i + 1) / Π_{i=1}^{K} Γ(x_i + 1) ) Π_{i=1}^{K} β_i^{x_i}

• However, this normalization constant is not very important for inference.

• Moments:

    E(X_i) = n β_i
    Var(X_i) = n β_i (1 − β_i)
    Cov(X_i, X_j) = −n β_i β_j   for i ≠ j

19
Dirichlet distribution I

• Multivariate generalization of the beta distribution.

• An experiment with order K ≥ 2 and concentration parameters β_1, β_2, ..., β_K has pdf on the (K − 1)-simplex:

    f(x_1, x_2, ..., x_K | β_1, β_2, ..., β_K) = (1 / B(β_1, β_2, ..., β_K)) Π_{i=1}^{K} x_i^{β_i − 1}

  where

    Σ_{i=1}^{K} x_i = 1  and  x_i ≥ 0  for all i = 1, ..., K

    B(β_1, β_2, ..., β_K) = Π_{i=1}^{K} Γ(β_i) / Γ(Σ_{i=1}^{K} β_i)

    Γ(β) = ∫_0^∞ x^{β−1} e^{−x} dx

• Each order is often called a category.


20
[Figure 1.1: Plots of the densities of Z ∼ Beta(a1, a2) with various parameter values: (i) a1 = a2 = 0.9; (ii) a1 = a2 = 20; (iii) a1 = 0.5 and a2 = 2; (iv) a1 = 5 and a2 = 0.4.]
21
[Figure 2.1: Plots of the densities of (x1, x2) ∼ Dirichlet(a1, a2; a3) on V2 with various parameter values: (i) a1 = a2 = a3 = 2; (ii) a1 = 1, a2 = 5, a3 = 10; (iii) a1 = 10, a2 = 3, a3 = 8; (iv) a1 = 2, a2 = 10, a3 = 4.]
22
Dirichlet distribution II

• Moments:

    E(X_i) = β_i / Σ_{i=1}^{K} β_i

    Var(X_i) = β_i (Σ_{i=1}^{K} β_i − β_i) / [ (Σ_{i=1}^{K} β_i)^2 (Σ_{i=1}^{K} β_i + 1) ]

    Cov(X_i, X_j) = −β_i β_j / [ (Σ_{i=1}^{K} β_i)^2 (Σ_{i=1}^{K} β_i + 1) ]   for i ≠ j

• Conjugate prior of the multinomial distribution.

23
Simple probability model

• Consider the list of terms w = (w1 , . . . , wN ) where wn ∈ {1, . . . , V }.

• Suppose that each term is i.i.d., and that p(wn = v ) = βv ∈ [0, 1].

• Let β = (β1 , . . . , βV ) ∈ ∆V −1 be the parameter vector we are interested in.

• The probability of the ordered data given the parameters is

    p(w|β) = Π_n Σ_v 1(w_n = v) β_v = Π_v β_v^{x_v}

  where x_v is the count of term v in w.

• Note that term counts are a sufficient statistic for w in the estimation of β.
The independence assumption provides statistical foundations for the
bag-of-words model.

24
Bayesian inference

• Why?

• Highly parameterized model.

• Words are not uniformly distributed in texts.

• MCMC methods simplify inference.

• The Dirichlet is a great prior: conjugacy and interpretability.

25
26
Posterior distribution

• We can add term counts to the prior distribution's parameters to form the posterior:

    p(β|w) ∝ p(w|β) p(β) ∝ Π_{v=1}^{V} β_v^{x_v} Π_{v=1}^{V} β_v^{η_v − 1} = Π_{v=1}^{V} β_v^{x_v + η_v − 1}

• The posterior is a Dirichlet with parameters (η′_1, . . . , η′_V), where η′_v ≡ η_v + x_v.

• Dirichlet hyperparameters can be viewed as pseudo-counts.

• Thus, we obtain

    E[β_v|w] = (η_v + x_v) / (Σ_v η_v + N)

  which also corresponds to the predictive distribution p[w_{N+1} = v|w].

• The MAP (mode) estimator of β_v is

    (η_v + x_v − 1) / (Σ_v η_v + N − V)

• Nice asymptotic properties.


27
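The posterior-mean (equivalently, predictive) formula above in a few lines of numpy; the symmetric prior η_v = 1 and the toy counts are illustrative choices.

# Posterior mean E[beta_v | w] = (eta_v + x_v) / (sum_v eta_v + N), with eta_v = 1.
import numpy as np

x = np.array([5, 0, 2, 1])          # term counts x_v for a V = 4 vocabulary
eta = np.ones_like(x, dtype=float)  # symmetric Dirichlet prior (pseudo-counts)
post_mean = (eta + x) / (eta.sum() + x.sum())
print(post_mean, post_mean.sum())   # predictive probabilities, sum to 1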
Generative latent variable model (LVM)

• Documents in a corpus might have a more complex structure.

• k separate categorical distributions (“topics”), each with parameter vector β k .

• θd,k is the share of topic k in document d.

• Thus, each document is represented on a space of topics with θ d ∈ ∆K −1


instead of a raw vocabulary space.

• The probability that topic k generates term v is β_{k,v}.

• General probabilistic structure: x_d ∼ MN(Σ_k θ_{d,k} β_k, N_d).

• Let β = (β 1 , . . . , β K ) and θ = (θ 1 , . . . , θ D ).

28
Topics as urns

[Figure: two urns of word tokens — an “Inflation” topic and a “Labor” topic — each a different distribution over the terms ‘wage’, ‘employ’, ‘price’, and ‘increase’.]
29

How do we model topics?

• Two approaches:

1. Mixture models: every document belongs to single category zd ∈ {1, . . . , K },


which is independent across documents and drawn from p(zd = k) = ρk .

  We then have

      θ_{d,k} = 1 if z_d = k, and 0 otherwise.

2. Mixed-membership models: we can assign each word in each document to a


topic.

Let zd,n ∈ {1, . . . , K } be the topic assignment of wd,n ; zd = (zd,1 , . . . , zd,Nd );


and z = (z1 , . . . , zD ).

30
Mixture model

[Figure: mixture model for a document. The document is assigned to a single topic (zd = 1: Inflation topic, or zd = 2: Labor topic), and every word wd,n is drawn from that topic.]
31
Mixed-membership model

[Figure: mixed-membership model for a document. The document has topic shares θd (here 0.25 on the Inflation topic and 0.75 on the Labor topic); each word wd,n draws its own topic assignment zd,n from θd and is then drawn from the corresponding topic.]
32
A canonical model

• Latent Dirichlet allocation (Blei, Ng, and Jordan 2003).

• Extremely popular and the base for more general models.

• Structure:

1. Draw θ d independently for d = 1, . . . , D from Dirichlet(α).

2. Each word wd,n in document d is generated from a two-step process:

2.1 Draw topic assignment zd,n from θ d .

2.2 Draw wd,n from β zd,n .

• Estimate hyperparameters α and term probabilities β 1 , . . . , β K .

33
A modification of the Latent Dirichlet allocation

• We can slightly modify the previous model to ease, later on, the

• Structure:

1. Draw θ d independently for d = 1, . . . , D from Dirichlet(α).

2. Draw β k independently for k = 1, . . . , K from Dirichlet(η).

3. Each word wd,n in document d is generated from a two-step process:

3.1 Draw topic assignment zd,n from θd .

3.2 Draw wd,n from β zd,n .

• Fix scalar values for α and η.

34
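A simulation of this generative process (the modified LDA above) using numpy's Dirichlet and categorical samplers; D, K, V, Nd, and the scalar hyperparameters are illustrative values, not calibrated ones.

# Simulate the LDA generative process: beta_k ~ Dir(eta), theta_d ~ Dir(alpha),
# z_{d,n} ~ Cat(theta_d), w_{d,n} ~ Cat(beta_{z_{d,n}}). All values are illustrative.
import numpy as np

rng = np.random.default_rng(0)
D, K, V, Nd = 5, 3, 20, 50          # documents, topics, vocabulary size, words per doc
alpha, eta = 0.5, 0.1               # scalar hyperparameters

beta = rng.dirichlet(np.full(V, eta), size=K)      # K topic-term distributions
theta = rng.dirichlet(np.full(K, alpha), size=D)   # D document-topic shares

corpus = []
for d in range(D):
    z = rng.choice(K, size=Nd, p=theta[d])                 # topic assignment of each word
    w = np.array([rng.choice(V, p=beta[k]) for k in z])    # draw each word from its topic
    corpus.append(w)

print(theta[0].round(2), corpus[0][:10])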
LDA I

[Figure (Blei, Ng, and Jordan): graphical model representation of LDA, α → θ → z → w, with w inside a plate of size N nested in a plate of size M. The boxes are “plates” representing replicates: the outer plate represents documents, while the inner plate represents the repeated choice of topics and words within a document.]
35
LDA II

[Figure: the topic simplex (spanned by topics 1, 2, and 3) embedded in the word simplex; documents (marked x) lie on the topic simplex.]
36
LDA III

37
Example statement: Yellen, March 2006, #51

Raw Data → Remove Stop Words → Stemming → Multi-word tokens = Bag of Words
We have noticed a change in the relationship between the core CPI and the chained
core CPI, which suggested to us that maybe something is going on relating to
substitution bias at the upper level of the index. You focused on the nonmarket
component of the PCE, and I wondered if something unusual might be happening
with the core CPI relative to other measures.

38
Example statement: Yellen, March 2006, #51

Allocation (each content word tagged with its sampled topic number)

We have noticed[17] a change[39] in the relationship[39] between[1] the core[25] CPI[25] and the chained[41] core[25] CPI[25], which suggested[25] to us that maybe[36] something[36] is going[38] on relating[43] to substitution[25] bias[20] at the upper[25] level[39] of the index[16]. You focused[23] on the nonmarket[25] component[25] of the PCE[25], and I wondered[32] if something[38] unusual[16] might be happening[4] with the core[25] CPI[25] relative[16] to other measures[25].

41
Distribution of topics

[Figure: distribution of attention across topics for the statement]

42
Topic 25

[Figure: most probable words in Topic 25]

43
Topic 11

[Figure: most probable words in Topic 11]

44
Advantage of Flexibility

• ‘measur’ has probability 0.026 in topic 25, and probability 0.021 in topic 11.

• It gets assigned to topic 25 in this statement consistently due to the presence of other topic 25 words.

• In statements containing words on evidence and numbers, it consistently gets assigned to topic 11.

• The sampling algorithm can help place words in their appropriate context.

45
External Validation: Topics vs. BBD

[Figure: two panels comparing the BBD index with the fraction of FOMC1 & FOMC2 time devoted to Topic 35 and Topic 12, by FOMC meeting date, 1987m7–2007m7.]

46
Pro-cyclical topics

[Figure: two panels plotting the min, median, and max fraction of speaker's time in FOMC2 devoted to these topics, by FOMC meeting date, 1987m7–2008m7.]

47
Counter-cyclical topics

[Figure: two panels plotting the min, median, and max fraction of speaker's time in FOMC2 devoted to these topics, by FOMC meeting date, 1987m7–2008m7.]

48
Distribution of Topics in Iraq Articles

[Figure: time series of the shares of topic0, topic7, and topic9, 1980–2015.]

49
Larsen (2018)

Figure 1. Examples of topic distributions

(a) Macroeconomics   (b) Monetary policy

Note: The 150 words with the highest probabilities are shown; the size of a word corresponds to the probability of that word occurring in the topic distribution. All the word clouds are available at http://www.vegardlarsen.com/Word_clouds/.

50
Figure 2. Aggregate newspaper uncertainty

Note: The black line plots the 300-day backward-looking rolling mean. The series gives the share of uncertainty terms per 1,000,000 words in the newspaper. (ω_i denotes the total number of words in article i.)
51
[Figure panels: (a) Monetary policy uncertainty; (b) Stock market uncertainty; (c) Macroeconomics uncertainty.]

52


[Figure panels: (c) Macroeconomics uncertainty; (d) Fear uncertainty.]

53
Note: The black line plots the 300-day backward-looking mean.

[Figure: the four principal components extracted from 80 topic-based uncertainty measures.]

54
Figure 8. Correlations with alternative measures

[Figure: the correlations are computed at a quarterly frequency; blue indicates positive correlations and the other color negative correlations. The correlation coefficients are reported in the rectangles.]
55
Posterior distribution

• The inference problem in LDA is to compute the posterior distribution over z, θ, and β given the data w and the Dirichlet hyperparameters.

• The posterior distribution of the latent variables, taking the parameters as given, is:

  p(z = z′ | w, θ, β) = p(w | z = z′, θ, β) p(z = z′ | θ, β) / Σ_{z′} p(w | z = z′, θ, β) p(z = z′ | θ, β)

• We can compute the numerator and each element of the denominator.

• But z′ ∈ {1, . . . , K}^N ⇒ there are K^N terms in the sum ⇒ intractable problem.

• For example, a 100-word corpus with 50 topics has ≈ 7.88 × 10^169 terms.
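
• A quick check of this count in Python (a minimal sketch; variable names are illustrative):

    from math import log10

    K, N = 50, 100                              # 50 topics, a 100-word corpus
    terms = K ** N                              # number of possible assignment vectors z'
    print(f"K^N = {float(terms):.3e}")          # ~7.889e+169
    print(f"log10(K^N) = {log10(terms):.1f}")   # ~169.9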

56
Gibbs sampler

• Different MCMC algorithms can handle this problem.

• For a basic implementation, a Gibbs sampler is easy to code and efficient.

• Outline:

1. Sample from a multinomial distribution N times for the topic allocation variables.

2. Sample from a Dirichlet D times for the document-specific mixing probabilities.

3. Sample from a Dirichlet K times for the topic-specific term probabilities.

• We can improve upon the basic Gibbs sampler with collapsed sampling, i.e.,
analytically integrating out some variables in the joint likelihood and sampling
the remainder (Griffiths and Steyvers, 2004, and Hansen, McMahon, and Prat,
2015).
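
• A minimal sketch of the basic (uncollapsed) sampler above in Python; the corpus format (lists of word indices), hyperparameter values, and function name are illustrative assumptions, not the implementation used in the cited papers:

    import numpy as np

    def gibbs_lda(docs, V, K=10, alpha=0.1, eta=0.01, n_iter=200, seed=0):
        """Basic Gibbs sampler for LDA: docs is a list of documents, each a
        list of word indices in {0, ..., V-1}."""
        rng = np.random.default_rng(seed)
        D = len(docs)
        theta = rng.dirichlet(np.full(K, alpha), size=D)   # document-topic shares
        beta = rng.dirichlet(np.full(V, eta), size=K)      # topic-term probabilities
        z = [rng.integers(K, size=len(doc)) for doc in docs]

        for _ in range(n_iter):
            # 1. Sample each topic allocation z_{d,n} from a multinomial.
            for d, doc in enumerate(docs):
                for n, v in enumerate(doc):
                    p = theta[d] * beta[:, v]
                    z[d][n] = rng.choice(K, p=p / p.sum())
            # 2. Sample each theta_d from a Dirichlet with updated counts.
            for d in range(D):
                n_dk = np.bincount(z[d], minlength=K)
                theta[d] = rng.dirichlet(n_dk + alpha)
            # 3. Sample each beta_k from a Dirichlet with updated counts.
            m_kv = np.zeros((K, V))
            for d, doc in enumerate(docs):
                for n, v in enumerate(doc):
                    m_kv[z[d][n], v] += 1
            for k in range(K):
                beta[k] = rng.dirichlet(m_kv[k] + eta)

        return theta, beta, z

• A collapsed sampler would instead integrate θ and β out analytically and resample only z from running count matrices, which typically mixes faster.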

57
Sampling θ_d

• By Bayes' Rule, we have p(θ_d | α, z_d) ∝ p(z_d | θ_d) p(θ_d | α).

• Then

  p(z_d | θ_d) = ∏_n Σ_k 1(z_{d,n} = k) θ_{d,k} = ∏_k θ_{d,k}^{n_{d,k}},

  where n_{d,k} is the number of words in document d allocated to topic k.

• Putting this together, we arrive at

  p(θ_d | α, z_d) ∝ ∏_k θ_{d,k}^{n_{d,k}} ∏_k θ_{d,k}^{α−1} = ∏_k θ_{d,k}^{n_{d,k}+α−1},

  which is a Dirichlet distribution.
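
• A minimal sketch of this conditional draw (the counts and hyperparameter below are illustrative):

    import numpy as np

    rng = np.random.default_rng(0)
    alpha = 0.1                            # symmetric Dirichlet hyperparameter
    n_dk = np.array([12, 0, 3, 7, 0])      # words in document d allocated to each of K = 5 topics
    theta_d = rng.dirichlet(n_dk + alpha)  # theta_d | z_d, alpha ~ Dirichlet(n_dk + alpha)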

58
Sampling β_k

• By Bayes' Rule, we have p(β_k | z, w, η, β_{−k}) ∝ p(z, w | β) p(β_k | η).

• The likelihood function p(z, w | β) takes the form

  p(z, w | β) = ∏_d ∏_n Σ_v Σ_{k′} 1(w_{d,n} = v) 1(z_{d,n} = k′) β_{k′,v}
              = ∏_v ∏_{k′} β_{k′,v}^{m_{k′,v}}
              = ∏_v β_{k,v}^{m_{k,v}} ∏_v ∏_{k′≠k} β_{k′,v}^{m_{k′,v}}
              ∝ ∏_v β_{k,v}^{m_{k,v}},

  where m_{k,v} is the number of times term v is allocated to topic k in the corpus.

• Putting this together, we arrive at

  p(β_k | z, w, η, β_{−k}) ∝ ∏_v β_{k,v}^{m_{k,v}} ∏_v β_{k,v}^{η−1} = ∏_v β_{k,v}^{m_{k,v}+η−1},

  which is a Dirichlet distribution.
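
• A minimal sketch of this conditional draw (the counts and hyperparameter below are illustrative):

    import numpy as np

    rng = np.random.default_rng(0)
    eta = 0.01                             # symmetric Dirichlet hyperparameter
    m_kv = np.array([25, 0, 4, 0, 11, 2])  # times each of V = 6 terms is allocated to topic k
    beta_k = rng.dirichlet(m_kv + eta)     # beta_k | z, w, eta ~ Dirichlet(m_kv + eta)
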
59
Sampling z

• Finally, note that:

  p(z_{d,n} = k | w_{d,n} = v, θ_d, β)
    = p(w_{d,n} = v | z_{d,n} = k, θ_d, β) p(z_{d,n} = k | θ_d, β) / Σ_{k′} p(w_{d,n} = v | z_{d,n} = k′, θ_d, β) p(z_{d,n} = k′ | θ_d, β)
    = θ_{d,k} β_{k,v} / Σ_{k′} θ_{d,k′} β_{k′,v}
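
• A minimal sketch of this categorical draw (the values of θ_d and β[:, v] below are illustrative):

    import numpy as np

    rng = np.random.default_rng(0)
    theta_d = np.array([0.5, 0.3, 0.2])       # document d's topic shares (K = 3)
    beta_v = np.array([0.010, 0.002, 0.040])  # probability of term v under each topic, beta[:, v]
    p = theta_d * beta_v
    z_dn = rng.choice(len(p), p=p / p.sum())  # z_{d,n} | w_{d,n} = v, theta_d, beta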

60
Priors

• There are three parameters to set to run the Gibbs sampling algorithm:
number of topics K and hyperparameters α, η.

• Priors are treated cavalierly in the literature.

• Griffiths and Steyvers recommend η = 200/V and α = 50/K. Smaller values will tend to generate more concentrated distributions. See also Wallach et al. (2009). (A small numerical illustration follows the list below.)

• Methods to choose K :

1. Predict text well → out-of-sample goodness-of-fit.

2. Information criteria.

3. Cohesion (focus on interpretability).
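
• The Griffiths–Steyvers rule of thumb in Python, with illustrative choices of K and V (both values are assumptions):

    K = 50          # number of topics (illustrative)
    V = 10_000      # vocabulary size (illustrative)
    alpha = 50 / K  # symmetric document-topic prior: 1.0
    eta = 200 / V   # symmetric topic-term prior: 0.02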

61
Cross validation

• Fit LDA on training data; obtain estimates β̂_1, . . . , β̂_K.

• For test data, obtain the θ_d distributions via sampling as above, or else use a uniform distribution.

• Compute the log-likelihood of the held-out data as

  ℓ(w | Θ̂) = Σ_{d=1}^{D} Σ_{v=1}^{V} x_{d,v} log( Σ_{k=1}^{K} θ̂_{d,k} β̂_{k,v} ),

  where x_{d,v} is the count of term v in held-out document d.

• Higher values indicate better goodness-of-fit.
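
• A minimal numpy sketch of this held-out log-likelihood (the document-term matrix X, the shapes, and the function name are illustrative):

    import numpy as np

    def heldout_loglik(X, theta_hat, beta_hat):
        """X: D x V held-out term counts; theta_hat: D x K; beta_hat: K x V.
        Assumes all mixture probabilities are strictly positive."""
        mix = theta_hat @ beta_hat           # D x V word probabilities per document
        return float(np.sum(X * np.log(mix)))

    rng = np.random.default_rng(0)
    X = rng.integers(0, 5, size=(3, 8))               # toy held-out counts
    theta_hat = rng.dirichlet(np.ones(4), size=3)     # D = 3 documents, K = 4 topics
    beta_hat = rng.dirichlet(np.ones(8), size=4)      # K = 4 topics, V = 8 terms
    print(heldout_loglik(X, theta_hat, beta_hat))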

62
Information criteria

• Information criteria trade off goodness-of-fit with model complexity.

• There are various forms: AIC, BIC, DIC, etc.

• Erosheva et al. (2007) compare several in the context of an LDA-like model for
survey data, and find that AICM is optimal.

• Let µ_ℓ = (1/S) Σ_s ℓ(w | Θ̂_s) be the average value of the log-likelihood across S draws of a Markov chain.

• Let σ_ℓ² = (1/S) Σ_s ( ℓ(w | Θ̂_s) − µ_ℓ )² be the variance.

• The AICM is 2(µ` − σ`2 ).
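
• A minimal sketch of the AICM computed from posterior log-likelihood draws (the function name and the example values are illustrative):

    import numpy as np

    def aicm(loglik_draws):
        ll = np.asarray(loglik_draws, dtype=float)
        mu = ll.mean()
        sigma2 = ll.var()        # (1/S) * sum((ll - mu)**2), matching the definition above
        return 2.0 * (mu - sigma2)

    print(aicm([-1520.3, -1518.9, -1522.1, -1519.4]))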

63
Formalizing interpretability

• Topics seem subjectively interpretable in many contexts.

• Chang et al. (2009), in “Reading Tea Leaves: How Humans Interpret Topic
Models,” propose an objective way of determining whether topics are
interpretable.

• Two tests:

1. Word intrusion. Form a set of words consisting of the top five words from topic k + a word with low probability in topic k. Ask subjects to identify the inserted word (a small sketch follows at the end of this slide).

2. Topic intrusion. Show subjects a snippet of a document + the top three topics associated to it + a randomly drawn other topic. Ask them to identify the inserted topic.

• Estimate LDA and other topic models on NYT and Wikipedia articles for
K = 50, 100, 150.
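
• A minimal sketch of assembling a word-intrusion set from an estimated β matrix; the function, its arguments, and the rule for picking the intruder are illustrative assumptions, not Chang et al.'s exact procedure:

    import numpy as np

    def word_intrusion_set(beta, k, vocab, rng, n_top=5):
        """Top n_top words of topic k plus an intruder that ranks low in
        topic k but high in another topic; beta has shape (K, V)."""
        K, V = beta.shape
        top_k = list(np.argsort(beta[k])[::-1][:n_top])
        other = (k + 1 + rng.integers(K - 1)) % K              # any topic other than k
        low_in_k = set(np.argsort(beta[k])[: V // 2])          # bottom half of topic k
        intruder = next(v for v in np.argsort(beta[other])[::-1] if v in low_in_k)
        words = [vocab[v] for v in top_k + [intruder]]
        rng.shuffle(words)
        return words

    # e.g. word_intrusion_set(beta_hat, k=3, vocab=vocab, rng=np.random.default_rng(0))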

64
Results
[Figure: distribution of topics]

65
