
Introduction and basics

Adv. Topics in Data Science

David Rossell, UPF


Reading material

• Section 4, asymptotics. van der Vaart Ch 5 (misspecification), Ch 9.3 (normality), Ch 10 (Bayes), Ch 16 (likelihood-ratio test)
• Section 4, graphical models. Hastie-T-F Ch 17
• Section 4, bootstrap. Hastie-T-F Ch 8.2
• Section 4, Bayes. Gelman et al Ch 1, 2, 4 (intro), Ch 3 (asymptotic
posterior)
• Section 5, Bayesian computation. Hastie-T-F Ch 8.6. Gelman et al
Ch 10-11.
• Section 6, random effects. Gelman et al Ch 5 and 15
References
Gelman et al. Bayesian data analysis (3rd ed). CRC press.
Hastie, Tibshirani, Friedman. The Elements of Statistical Learning. Springer
van der Vaart. Asymptotic Statistics. Cambridge University Press, 1998
Outline

1 Course overview

2 Examples

3 Common probability distributions

4 Basics of Statistical inference

5 Intro to Bayesian computation


What is this course about?
Data Science is a vague term. It includes
• Statistics: focus on models & understand the data-generating
process (quantify uncertainty). Large n is great, large p a challenge
• Machine learning: focus on optimization criteria & prediction.
Often does not assume a model (explicitly)
• Computer science: focus on algorithms, software and hardware.
Large n or p create challenges
Goals
1 Familiarize with modern data analysis: high-dim (large p), flexible
models
2 Familiarize with research topics in data analysis
3 Learn to apply to your own research (including software)
Note: Emphasis on explaining/interpreting over predicting
Syllabus

1 Basics: frequentist & Bayesian frameworks, computation


2 High-dimensional linear regression: penalized likelihood, Bayesian
model selection & averaging
3 Extensions: time series, GLMs, discrete/count data
4 Flexible models: mixture/hidden Markov models, topic models,
Bayesian non-parametrics, semi-parametric & robust regression
5 Bits and pieces: dimension reduction, computational shortcuts

Disclaimer: no time for proofs/deep technical arguments. We focus on intuition & interpreting models/results
Emphasis on regression & latent variable models. Simple to understand
and the basis for fancier stuff
• Non-parametric regression, neural networks / deep learning...
• Dimension reduction / factor models
• Graphical models / causality
[MRes Adv Techniques in Applied Econ, by Christian Brownlees]
• Text data analysis [MRes Adv Topics in Applied Econ, by Rubén
Durante]
• ...
Reading papers cover additional topics (e.g. sparse factor models,
advanced text analysis, applications...)

Final project: a chance to deepen into methods / applications


Notation

• x is a scalar, x a vector, x′ its transpose. Always column vectors:

x = (x1 , . . . , xn )′

• X is a matrix, xij its (i, j) element


• Greek letters for parameters, latin for observable quantities (usually)
• p(x): probability density or mass function evaluated at x
• p(x, y ) joint density/mass evaluated at (x, y )
• p(x | y ) = p(x, y )/p(y ) conditional density/mass at x given y
• ∝ means proportional to, e.g. p(x | y ) ∝ p(x, y )
Outline

1 Course overview

2 Examples

3 Common probability distributions

4 Basics of Statistical inference

5 Intro to Bayesian computation


Precision medicine/Health Econ.
(data from Calon et al, Cancer Cell, 2012)
Keyword: regression with many covariates
Gene TGFB is important for cancer progression, and there are
experimental drugs to inhibit it
• n = 262 colon cancer patients
• yi ∈ R: expression of TGFB in patient i
• xi ∈ Rp : expression of p ≈ 20, 000 genes

yi = xi′ β + εi
Goals
1 Understand how other genes are related to TGFB
2 Predict TGFB

We want few genes: practicality, cost of the diagnostic, patentability


House takings and prices
(from Belloni, Chernozhukov, Hansen. Journal Econ Perspect 2014)
Keyword: treatment effect estimation with many controls

US government may take private property. Court may rule it unlawful:


upholds individual property rights, may affect future prices
• Effect of ruling on house prices?
• Adjust for characteristics of 3-judge panel

yct = β1 rct + αc + δt + γc t + β2′ wct + εct

• yct : log home price index at circuit court c, time t


• rct : number of pro-plaintiff decisions
• wct : info about judges (gender, race, religion, education...)
n=183, p=147 (variables + pairwise interactions + cubic effects)
Salary discrimination
Keyword: multiple treatments, high confounding
USA survey from 2010 and 2019 (pre-covid19)
• How is salary associated to potentially discriminatory treatments?
• Evolution over time?
• Overall salary variation associated to treatments?

Data
• ≥ 18 years, income ≥ $1,000/year, work 20-60 h/week
• n = 64, 380 in 2010, n = 58, 885 in 2019
• J = 278 controls: state, household & place of residence, education,
migration, health, financial records, subsidies, sources of income...
• Treatments: gender, black race, hispanic ethnicity, born in Latin
America. Also, interactions with state (204 treatments)
Issue: treatments highly correlated with controls (variance inflation)
Results from Miquel
Torrens’s thesis
• Black: original data
• Grey: adding 50 & 100
artificial controls

Methods
• OLS: least-squares
• DML: double
machine-learning
• BMA: Bayesian model
averaging
• CIL: confounder
importance learning
(Miquel)
Teenager well-being vs Tech use
Orben & Przybylski Nat Human Behavior 2019 (519 citations 27/9/2021)
Keyword: uncertainty in combining many possible models
association between digital technology use and well-being (...) too small to warrant
policy change
association with regularly eating potatoes was nearly as negative as the association
with technology use

Specification Curve Analysis (Simonsohn et al NHB 2020, 10 citations)


We took another look (Christoph Semken & Rossell)
• Each outcome/treatment separately (lonely, depressed, suicidal...)
• Combine models via EBIC/Bayesian model averaging

EDU: Odds of loneliness increase 1.83 [1.74,1.93]. To plan suicide 1.85 [1.72,1.98]
YRBS data: n=66,303, p=11; MCS data: n=8,351, p=20
Large n. But need to combine models (acknowledge uncertainty)
Law of small numbers

(Daniel Kahneman’s Thinking fast & slow Ch10, example by Howard Wainer &
Harris Zwerling)

Study incidence of kidney cancer in 3,141 USA counties.

“counties in which the incidence is lowest are mostly rural, sparsely


populated, and located in traditionally Republican states in the Midwest,
the South and the West”

What do you suspect this is due to?


It turns out that, also,

“counties in which the incidence is highest are mostly rural, sparsely populated, and
located in traditionally Republican states in the Midwest, the South and the West”

Law of small numbers: if sample size is small, estimates are less precise
• ni : population in county i = 1, . . . , 3141
• θi : probability of developing kidney cancer in county i
• yi ∼ Bin(ni , θi ): number of observed cases
A natural estimator (maximum likelihood estimator) θ̂i = yi /ni has

Var(θ̂i ) = θi (1 − θi ) / ni
For instance, what if θ1 = . . . = θ3141 (no differences)?
Issue: we estimate many parameters, for some the sample size is small
Fix: share info across counties
Keyword: hierarchical/spatial random effects, clustering
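A minimal R simulation sketch of the phenomenon (county populations below are made up for illustration, not the real data): under a common θ, the most extreme estimates θ̂i come from the smallest counties.

# Simulate counts under a common true rate theta for all counties
set.seed(1)
ncounty= 3141; theta= 1e-4                      # common probability (illustrative value)
n= round(exp(rnorm(ncounty, mean=10, sd=1.5)))  # hypothetical county populations
y= rbinom(ncounty, size=n, prob=theta)          # observed cases
thetahat= y/n                                   # MLE in each county
plot(n, thetahat, log='x', xlab='County population', ylab='Estimated rate')
abline(h=theta, col='red')                      # extreme estimates concentrate at small n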

Supervised image compression
(from Candès, Wakin. Signal Proc Magazine 2008)
Keyword: sparse high-dimensional regression

Data y = (y1 , . . . , yn )′ contain intensities for n = 10^6 pixels. One can use a so-called wavelet basis to define an n × n matrix X such that

yi = Σ_{j=1}^n xij β̂j ⇒ y = X β̂

But many |β̂j | ≈ 0. Keep 25,000 largest (throws away 97.5%)

[Figure: the image, its wavelet coefficients (most ≈ 0), and the compressed reconstruction]
Intuition
Say the image is m × m and y stacks its columns, so n = m². Set X to be the matrix with columns

  Intercept   c1 . . . cm−1   r1 . . . rm−1   c1 r1   c1 r2 . . . cm−1 rm−1

and rows such as

  1   1 . . . 0   1 . . . 0   1   1 . . . 0
  1   1 . . . 0   1 . . . 0   1   1 . . . 0
      . . .
  1   0 . . . 1   0 . . . 1   0   0 . . . 1

• c1 : n × 1 vector indicating if pixel is in first m/2 columns


• c2 , c3 indicate first m/4, 3m/4
• ...
• Likewise r1 , r2 , . . . for rows
Number of columns in X is 1 + 2(m − 1) + (m − 1)² = m² = n
Note: this is a Haar wavelet basis. Others possible
Wine dataset
Keyword: latent clusters / subpopulations of individuals
n = 178 wines, p = 13 (chemical concentrations) (R package rattle)
• Summarize wines with fewer variables? (principal components)
• Product segmentation: are there wine subtypes? (mixture model)
[Figure: wines plotted on the first two principal components (PC1 vs. PC2)]
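A minimal R sketch of both analyses, assuming the rattle (for the wine data) and mclust (for Gaussian mixtures) packages are installed; here the mixture is fit on the principal component scores.

library(rattle); library(mclust)
data(wine, package='rattle')
X= scale(wine[,-1])               # 13 chemical concentrations (drop the Type label)
pc= prcomp(X)                     # principal components (summarize with fewer variables)
summary(pc)                       # variance explained by each component
fit= Mclust(pc$x[,1:2])           # Gaussian mixture on the first 2 PC scores
summary(fit)                      # number of wine subtypes selected via BIC
plot(pc$x[,1:2], col=fit$classification,
     xlab='First principal component', ylab='Second principal component')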


Topic models

Keyword: latent clusters of variables

• We have documents i = 1, . . . , n (speeches, regions in a country...)


• For each document we count J items (J words, J company types...)
Data: yi = (yi1 , . . . , yiJ )′ counts for i = 1, . . . , n
• There might be unknown document subtypes
• Each document type has different count probabilities
Examples: positive vs. negative reviews, normal emails vs. spam,
technological vs. rural region
Topic model regression
Keyword: latent clusters meets regression
Regress topics on covariates, e.g. (Gadarian & Albertson Political Psychol 2014)

• Goal: how do emotions influence immigration preferences?


• Outcome: answer to open-ended questions (text)
• Treatment: 50% individuals encouraged to worry about
immigration. 50% to just think about immigration
• Covariate: party identification, interaction with treatment

Regress an outcome on text data


• Popularity of political tweets. Outcome is number of re-tweets,
predictors are tweet topic, word usage, political events...
• Effect of Federal reserve statements on housing market
Twitter & German elections 2017
(with R. Knudsen & S. Majó, Oxford Reuters Institute for Investigative Journalism)

4.4 million tweets from candidates/parties. Count likes & favorites
• What topics does each party talk about?
• What parties attract more engagement? What topics?

[Figure: estimated topic shares by party (AfD, CDU, CSU, FDP, Gruene, Linke, SPD) across 45 estimated topics, e.g. crime/violence and migrants, the diesel ban, environmental protection, family and employee rights, pensions, social justice, integration & language, TV duel & Merkel, various campaigning events]

Engagement vs. party, followers, topic etc
Engagement vs. number of followers and party

[Figure: tweet engagement vs. number of followers, by party and candidate]

Increase/decrease in engagement for +10% in Topic 1:

AfD    CDU    CSU    FDP    Gruene   Linke   SPD
0.57   1.38   2.22   1.59   0.64     0.43    0.70
Prenatal care

(Conway & Deb, Journal Health Econ 2005, data from National Maternal & Infant
Health Survey)
Keyword: latent clusters in regression

Does prenatal care (PNC) improve infant health?


• Outcome yi : weight at birth of infant i
• Intervention xi : onset of prenatal care (time when it starts)
• Other adjustment variables (mother characteristics etc.)
Issue
• Randomized studies show PNC is good (negative coefficient)
• Observational studies show little effect

Estimated coef (weighted least squares) β̂1 = −0.132 (not stat. signif.)

Thought: unobserved variable “difficulty of pregnancy”?

Fit mixture-of-regressions model (infer normal vs. complicated births)

        Normal     Complicated
β̂1     −0.308∗    0.377
S&P500 returns

(from www.quantstart.com, depmixS4 package, data from quantmod library)


Keyword: hidden Markov models
Outcome: yt percent log-returns in S&P500 at time t (2005-2020)
Market hypothesized to undergo 2 regimes
• Bullish: positive mean, low variance
• Bearish: slightly negative mean, higher variance
Our model (hidden Markov model)
• Mixture: given regime rt ∈ {1, 2} at time t, yt | rt ∼ N(µrt , φrt )
• Markov dependence: rt depends on rt−1 . pij = p(rt = j | rt−1 = i)

p(yt , rt | y1 , . . . , yt−1 , r1 , . . . , rt−1 ) = p(yt | rt )p(rt | rt−1 )


[Figure: S&P500 log-returns (2004-2020) and posterior regime probabilities for the 2-state hidden Markov model]
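A minimal sketch along these lines, assuming the quantmod and depmixS4 packages (the exact data range and settings used in the quantstart example may differ):

library(quantmod); library(depmixS4)
getSymbols("^GSPC", from="2005-01-01", to="2020-12-31")       # S&P500 prices
ret= data.frame(logret= as.numeric(diff(log(Cl(GSPC)))[-1]))  # daily log-returns
mod= depmix(logret ~ 1, family=gaussian(), nstates=2, data=ret)  # 2-state Gaussian HMM
fit= fit(mod)
summary(fit)                      # state means/variances and transition probabilities
post= posterior(fit)              # posterior regime probabilities (format depends on version)
head(post)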
Recap from examples

Issue 1. Many parameters θ1 , . . . , θp relative to n, but


• A few important ones, the rest are zero/smallish. Sparsity
• They’re similar to each other. Share information
Strategy: shrink θ̂i ’s towards 0 or towards common θ̂

Issue 2. There may be (unobserved) subpopulations. Learning them helps


• Avoid biases in inference
• Understand the process
• Simplify predicting some outcome
• Obtain a more flexible model
Strategy: model yi (or θi ) as arising from subpopulations (mixtures,
hidden Markov models, Bayesian non-parametrics)
Outline

1 Course overview

2 Examples

3 Common probability distributions

4 Basics of Statistical inference

5 Intro to Bayesian computation


For continuous random vectors in Rp : multivariate Normal, Laplace, T
For positive variables in R+ : Gamma, Inverse gamma
For variables in [0, 1] adding up to 1 (e.g. proportions)
• Scalars: Beta
• Vectors: Dirichlet

For count data (discussed later in the course)


• Binomial, Multinomial, Poisson
• Negative Binomial, zero-inflated & hurdle models
Multivariate Normal
For y ∈ Rp . y ∼ N(µ, Σ), mean µ ∈ Rp , p × p covariance Σ

 
p(y | µ, Σ) = N(y; µ, Σ) = (2π)^{−p/2} |Σ|^{−1/2} exp{ −(1/2) (y − µ)′ Σ⁻¹ (y − µ) }

log p(y | µ, Σ) = c − (1/2) (y − µ)′ Σ⁻¹ (y − µ)
Properties
• Linear combinations z = Ay ∼ N(Aµ, AΣA′), if A is q × p full-rank
• Conditional distributions are Normal. If
     
y = (y1 , y2 ) with means (µ1 , µ2 ) and covariance blocks Σ11 , Σ12 , Σ21 = Σ12′ , Σ22 ,

then y1 | y2 ∼ N( µ1 + Σ12 Σ22⁻¹ (y2 − µ2 ), Σ11 − Σ12 Σ22⁻¹ Σ12′ )

Corollary: subsets are Normal, y1 ∼ N(µ1 , Σ11 )


   
Bivariate Normal with µ = (0, 0)′ and Σ = ( 2 1 ; 1 1 )

[Figure: contour plot of this bivariate Normal density over (X1, X2)]

Note: Normal pdf contours are elliptical, but it’s not the only such
distribution (multivariate T, Laplace, logistic...)
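A quick simulation check of the conditional-distribution formula above, assuming the mvtnorm package is installed:

library(mvtnorm)
mu= c(0,0); Sigma= matrix(c(2,1,1,1), nrow=2)
y= rmvnorm(10^5, mean=mu, sigma=Sigma)
sel= abs(y[,2] - 1) < 0.05                       # condition (approximately) on y2 = 1
mean(y[sel,1]); var(y[sel,1])                    # empirical conditional mean and variance
mu[1] + Sigma[1,2]/Sigma[2,2]*(1 - mu[2])        # theoretical conditional mean = 1
Sigma[1,1] - Sigma[1,2]^2/Sigma[2,2]             # theoretical conditional variance = 1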
Double exponential (or Laplace)
Univariate case: y ∼ L(µ, φ) with E (y ) = µ, V (y ) = 2φ
 
p(y | µ, φ) = (1/(2√φ)) exp{ −|y − µ|/√φ }  ⇒  log p(y | µ, φ) = c − |y − µ|/√φ

[Figure: densities of L(0,1) and N(0,2), which have the same mean and variance]
Gamma & Inv. gamma (positive variables)

Gamma: x ∼ G(α, β). Inv gamma. y = 1/x ∼ IG(α, β) where α, β > 0

p(x | α, β) = (β^α / Γ(α)) x^{α−1} e^{−βx} ;    p(y | α, β) = (β^α / Γ(α)) y^{−(α+1)} e^{−β/y}

• G(α, β) generalizes the exponential and chi-square distributions
• E(x) = α/β, V(x) = α/β²
• E(y) = β/(α − 1) for α > 1, V(y) = β² / ((α − 1)²(α − 2)) for α > 2
[Note: Γ(α) is the so-called gamma function. It generalizes the factorial: for integer k ≥ 1, Γ(k) = (k − 1)!]
Main differences: behaviour at origin and tails

[Figure: densities of G(0.5,1), G(2,1), IG(0.5,1) and IG(2,1) on x ∈ (0, 6)]
From Normal to Laplace

Laplace: y ∼ L(µ, Σ) is the marginal of p(y, v ) = p(v )p(y | v )


• v ∼ G(1, 2)
• y | v ∼ N(µ, v Σ)

Common trick to obtain more complex distrib than Normal


• Other p(v ) give other interesting symmetric distributions
• Similar tricks to induce asymmetry (skew or two-piece Normal etc)
• Convenient for Bayesian computation (e.g. MCMC)
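A quick simulation of the trick above. One caveat: the mixing distribution is taken here to be an exponential with mean 2 (one reading of G(1, 2), depending on the rate/scale convention), which yields a standard Laplace.

set.seed(1)
B= 10^5
v= rexp(B, rate=1/2)              # assumed: exponential with mean 2
y= sqrt(v)*rnorm(B)               # y | v ~ N(0, v)
hist(y, breaks=100, freq=FALSE, xlim=c(-6,6))
curve(0.5*exp(-abs(x)), add=TRUE, col='red')   # L(0,1) density: (1/2) exp(-|y|)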
Beta distribution
For y ∈ [0, 1]. Parameters α1 , α2 > 0
p(y | α1 , α2 ) = ( Γ(α1 + α2 ) / (Γ(α1 )Γ(α2 )) ) y^{α1 −1} (1 − y)^{α2 −1}

[Figure: Beta densities for various (α1 , α2 )] (from https://commons.wikimedia.org/w/index.php?curid=74166)
Dirichlet distribution

Generalizes the Beta to y = (y1 , . . . , yp )′ with Σ_{i=1}^p yi = 1. Parameters α = (α1 , . . . , αp )′, αi > 0:

p(y | α) = ( Γ(Σi αi ) / Πi Γ(αi ) ) Π_{i=1}^p yi^{αi −1}   if Σ_{i=1}^p yi = 1, else p(y | α) = 0

Particular case: if α1 = . . . = αp = 1 then y ∼ Unif

Moments:

E(yi ) = αi / Σ_{j=1}^p αj ;    V(yi ) = E(yi ) (1 − E(yi )) / (1 + Σ_{j=1}^p αj )
Dirichlet for various (α1 , α2 , α3 )
(from Wikipedia)
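A small sketch: Dirichlet draws can be obtained by normalizing independent Gamma variables (a standard construction), and the sample means can be checked against the moment formula above.

rdirichlet= function(nsamples, alpha) {
  # Draw from Dirichlet(alpha) by normalizing independent Gamma(alpha_i, 1) variables
  x= matrix(rgamma(nsamples*length(alpha), shape=alpha, rate=1),
            nrow=nsamples, byrow=TRUE)
  x / rowSums(x)
}
set.seed(1)
th= rdirichlet(10^4, alpha=c(2,3,5))
colMeans(th)                      # close to alpha/sum(alpha) = (0.2, 0.3, 0.5)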
Outline

1 Course overview

2 Examples

3 Common probability distributions

4 Basics of Statistical inference

5 Intro to Bayesian computation


Likelihood-based inference

We observe y. Assume prob. model p(y | θ) with parameter θ ∈ Θ. MLE

θ̂ = arg maxθ p(y | θ)

Let y = (y1 , . . . , yn ). Under certain assumptions θ̂ has good properties


1 Fixed p = dim(Θ) and large sample size n → ∞
2 Model is true, y ∼ p(y | θ ∗ ) for some θ ∗ ∈ Θ
3 + regularity assump. (log p(y | θ) is smooth, θ ∗ ∈ interior(Θ)...)
Issues
1 If n is small relative to p: high-dim statistics (p ≫ n)
2 If model is not true
When Assumptions 1-3 hold
√n (θ̂ − θ∗ ) →D N(0, I⁻¹(θ∗ ))

where I(θ∗ ) is Fisher's information matrix at θ = θ∗


• Asymptotically unbiased & efficient
• We can make uncertainty statements (intervals, hypothesis tests...)

If assumed model is not true, say y ∼ p ∗ (y)


√n (θ̂ − θ∗ ) →D N(0, I⁻¹(θ∗ ) V I⁻¹(θ∗ ))
• V : expected cross-products of the gradient of log p(y | θ) under p∗ (y)
• θ∗ minimizes a measure of divergence between p∗ (y) and p(y | θ)
[Note: result from M-estimation theory, not discussed here]

Message: still get reasonable properties. Not so if p = dim(Θ) large


Key insight: MLE can be too “aggressive” for large p. Instead

arg maxθ { log p(y | θ) + h(θ) }

h(θ) is a likelihood penalty (e.g. LASSO) encouraging “simple” solutions

Example. Graphical model. y ∼ N(0, Ω⁻¹ ), p = 10 and Ω as below

[Figure: graph over the 10 variables implied by the non-zero entries of Ω]
Background
     
Let y = (y1 , y2 ) ∼ N(0, Σ), with Σ partitioned into blocks σ11 (scalar), Σ12 , Σ21 , Σ22

• E(y1 | y2 ) = Σ12 Σ22⁻¹ y2 = β′ y2 , where β = Σ22⁻¹ Σ21
• V(y1 | y2 ) = σ11 − β′ Σ21

Key: If βj = 0 then y2j has no effect on E(y1 | y2 ) or V(y1 | y2 ). Then y1 is indep of y2j given the other y2 's

Result: let Σ⁻¹ = Ω, with blocks ω11 , Ω12′ , Ω12 , Ω22 . Then β = −(1/ω11 ) Ω12
• βj = 0 ⇔ the j-th element of Ω12 is 0 (partial correlation = 0)
• Ωij = 0 ⇔ yi indep of yj , conditional on the other y's
Research:
Similar interpretation for elliptical copula models (Rossell & Zwiernik, 2020)
Combinations of continuous/discrete variables
How to estimate Ω

WLOG suppose that yi ’s have zero mean


MLE

Ω̂ = arg maxΩ log p(y | Ω)

Solution: Ω̂ is the inverse of the sample covariance matrix

Graphical LASSO

Ω̂ = arg maxΩ { log p(y | Ω) − λ Σ_{i≠j} |ωij | }

for some adequately chosen λ


Simulate yi ∼ N(0, Ω⁻¹ ), i = 1, . . . , 20 indep. Ω has 55 parameters

[Figure: Ω̂ via MLE vs. Ω̂ via Graphical LASSO]

GLASSO penalizes large ω̂ij ’s. Bias > than MLE, but lower overall error
GLASSO works better if Ω truly has zeroes (sparsity)
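A minimal sketch with the glasso package (the sparse Ω below is a hypothetical example, not the one in the slide's figure):

library(MASS); library(glasso)
set.seed(1)
p= 10
Omega= diag(p); Omega[1,2]= Omega[2,1]= 0.4     # hypothetical sparse precision matrix
y= mvrnorm(n=20, mu=rep(0,p), Sigma=solve(Omega))
S= cov(y)
Omega.mle= solve(S)                 # MLE: inverse of the sample covariance
fit= glasso(S, rho=0.2)             # graphical LASSO for a given penalty lambda=0.2
round(fit$wi, 2)                    # estimated Omega: many off-diagonals exactly 0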
Model misspecification

Besides n ≫ p, classical MLE inference assumes that the model is true

Example: if yi = xi′ θ∗ + εi , εi ∼ N(0, φ∗ ) iid then

θ̂ ∼ N(θ∗ , φ∗ (X′X)⁻¹ )

If the εi are not Normal, independent or identically distributed ⇒ CIs/P-values not valid

Computational alternatives
• Bootstrap: confidence intervals
• Permutation methods: hypothesis tests
Slower, but dispense with some assumptions
Bootstrap

Data y = (y1 , . . . , yn ) truly generated by yi ∼ p ∗ (y). θ̂(y) an estimate.


Algorithm. For b = 1, . . . , B do
1 Sample y^(b) = (y1^(b) , . . . , yn^(b) ) with replacement from y1 , . . . , yn
2 Compute θ̂(y^(b) )

Distribution of θ̂ (b) approximates sampling distribution p ∗ (θ̂)


• A (1 − α) CI is given by the (α/2, 1 − α/2) quantiles of the θ̂^(b)
• Histogram / kernel density estimates
For some fun animations, see www.stat.auckland.ac.nz/~wild/BootAnim
Example: food expenditure vs. household income (Engel, 1857)
Median regr: minβ Σ_{i=1}^n |yi − β0 − β1 xi |

Distrib of (β̂0 , β̂1 ) known if n large and yi = β0 + β1 xi + εi with iid εi


[Figure: food expenditure vs. household income, with the fitted median regression line]

        Estimate   95% CI
β̂0      81.5       (53.3, 114.0)
β̂1      0.56       (0.49, 0.60)
Bootstrapped β̂1

[Figure: histogram of the bootstrapped β̂1 values]

        MLE     95% CI           Bootstrap 95% CI
β̂0      81.5    (53.3, 114.0)    (41.7, 150.3)
β̂1      0.56    (0.49, 0.60)     (0.47, 0.61)
R code

library(quantreg)
data(engel)

fit1= rq(foodexp ~ income, tau = .5, data = engel)


summary(fit1)

beta= matrix(NA,nrow=10^4,ncol=2)
for (i in 1:nrow(beta)) {
bootengel= engel[sample(1:nrow(engel),nrow(engel),replace=TRUE),]
beta[i,]= coef(rq(foodexp ~ income, tau =.5, data=bootengel))
}

hist(beta[,2],main='',xlab=expression(beta[1]))

#95% confidence interval (bootstrap)


quantile(beta[,1],probs=c(.025,0.975))
quantile(beta[,2],probs=c(.025,0.975))
Bayesian framework

Suppose we specify
• Probability model for our data (likelihood): p(y | θ)
• Probability distribution for the parameters (prior): p(θ)
Then Bayes theorem gives the posterior distribution

p(θ | y) = p(y | θ) p(θ) / p(y) ∝ p(y | θ) p(θ)

Contains our updated knowledge on θ after having seen y


• Estimate parameters, obtain intervals
• Test hypotheses (more generally, model choice)
• Statements about future unobserved data
Example

German study on gender birth ratio in mothers with placenta previa


• Goal: estimate probability θ of having a girl
• Observe n births where mother has placenta previa
• Record number of girls y ∼ Bin(n, θ)

Set prior p(θ) = Beta(θ; a, b). Then

p(θ | y) ∝ Bin(y; n, θ) Beta(θ; a, b) = (n choose y) θ^y (1 − θ)^{n−y} × (Γ(a + b)/(Γ(a)Γ(b))) θ^{a−1} (1 − θ)^{b−1} ∝ θ^{y+a−1} (1 − θ)^{n−y+b−1}

That is, p(θ | y ) = Beta(θ; a + y , b + n − y )


Example. Set a = b = 1 (uniform)
Case 1: n = 10, y = 6. Case 2: n = 10, y = 2. Case 3: n = 100, y = 60.

[Figure: posterior densities Beta(1+6, 1+4), Beta(1+2, 1+8) and Beta(1+60, 1+40)]

P(θ > 0.5 | y ) = 0.746; P(θ > 0.5 | y ) = 0.019; P(θ > 0.5 | y ) = 0.978
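These posterior probabilities are easy to check in R, since the posterior is a Beta distribution:

# P(theta > 0.5 | y) = 1 - pbeta(0.5, a+y, b+n-y), with a=b=1
1 - pbeta(0.5, 1+6, 1+4)      # Case 1: n=10, y=6
1 - pbeta(0.5, 1+2, 1+8)      # Case 2: n=10, y=2
1 - pbeta(0.5, 1+60, 1+40)    # Case 3: n=100, y=60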
Parameter estimation
Posterior mean
θ̂ = E(θ | y) = ∫ θ p(θ | y) dθ

Minimizes quadratic loss E((θ − θ̂)² | y)

Posterior mode

θ̂ = arg maxθ p(θ | y) = arg maxθ { log p(y | θ) + log p(θ) }

Minimizes 0-1 loss P( θ ∉ Nε (θ̂) | y ) as ε → 0, where Nε is an ε-neighbourhood

• If p(θ) ∝ 1, posterior mode=MLE.


• If θ has infinite support (say θ ∈ R), p(θ) ∝ 1 not a prob distrib!
• In principle forbidden (improper prior), but some people do it
Example. Beta posterior

Let p(θ | y ) = Beta(θ; a + y , b + n − y ), as in placenta previa example


• E(θ | y) = (a + y) / (a + b + n)
• V(θ | y) = E(θ | y) (1 − E(θ | y)) / (a + b + n + 1)
• Posterior interval. Find (l, u) s.t. P(θ ∈ (l, u) | y) = 0.95 (easy)
Compare to MLE
• θ̂ = y /n
• V (θ̂) = θ̂(1 − θ̂)/n

Quiz: can you give an interpretation to (a, b)?

Intuition: as n → ∞, p(y | θ) becomes concentrated and p(θ) stays


constant. Maybe posterior inference not so different from MLE?
Asymptotic equivalences

Bernstein - von Mises theorem. If p fixed, n → ∞ (+ regularity conditions,


basically when MLE is asymptotically normal)

p(θ | y) →T.V. N(θ; θ̂, I⁻¹(θ̂))

(total variation is stronger than convergence in distribution)

√n (E(θ | y) − θ∗ ) →D N(0, I⁻¹(θ∗ ))

Implication: as n → ∞, same estimates & confidence regions as the MLE

• ||E(θ | y) − θ̂|| = Op (1/n) for the MLE θ̂ (same for the posterior mode)
• P(θ ∈ A | y) ≈ P(z ∈ A) where z ∼ N(θ̂, n⁻¹ I⁻¹(θ̂))
Limitations: may not hold if p ≫ n. Result refers to estimation, but
model selection (hypothesis tests) is fundamentally different
Examples

Linear reg. Assume y = Xθ∗ + ε, ε ∼ N(0, φ∗ I), X′X invertible

The MLE (least-squares) θ̂ = (X′X)⁻¹ X′y is distributed

θ̂ ∼ N(θ∗ , φ∗ (X′X)⁻¹ )

BvM: for a wide class of priors p(θ), as n → ∞

p(θ | y) →T.V. N(θ; θ̂, φ̂ (X′X)⁻¹ )

Binomial. Let y ∼ Bin(n, θ∗ ), θ̂ = y/n ∼ N(θ∗ , θ∗ (1 − θ∗ )/n) approx.

p(θ | y) ≈ N(θ; θ̂, θ̂(1 − θ̂)/n), as n → ∞


Bayesian model selection
Consider models M1 , . . . , MK with param θ1 , . . . , θK .
• Likelihood: p(y | θk , Mk )
• Prior on parameters: p(θk | Mk )
• Prior on models: p(Mk )
Example: y ∼ Bin(n, θ). M1 : θ = 0.5 vs. M2 : θ ≠ 0.5

p(Mk | y) = p(y | Mk ) p(Mk ) / p(y) ∝ p(Mk ) ∫ p(y | θk , Mk ) p(θk | Mk ) dθk

p(y | Mk ) is the integrated or marginal likelihood. To compare 2 models,

p(Mk | y) / p(Mj | y) = [ p(y | Mk ) / p(y | Mj ) ] [ p(Mk ) / p(Mj ) ] = BFkj p(Mk ) / p(Mj )
where BFkj is the Bayes factor comparing Mk vs Mj
BF vs LR
BF based on integrating. Likelihood ratio test on maximizing

BFkj = ∫ p(y | θk , Mk ) p(θk | Mk ) dθk / ∫ p(y | θj , Mj ) p(θj | Mj ) dθj

LRkj = maxθk p(y | θk , Mk ) / maxθj p(y | θj , Mj )

LR favors large models: if Mj ⊂ Mk then LRkj > 1. In contrast, under


quite general conditions as n → ∞
BFkj →P ∞, if Mk is true
BFkj →P 0, if Mj is true

Implication: If the number of models K is fixed then p(Mk∗ | y) →P 1 for the true Mk∗ (more later: how fast, and what if K ≫ n)

Reading paper: Benjamin et al 2017, Nature Human Behavior


Placenta previa example

Number of girls y ∼ Bin(n, θ). Test M1 : θ = 0.5 vs. M2 : θ ≠ 0.5


Suppose n = 20, y = 5, then under M2 the MLE is θ̂ = 1/4

LR21 = Bin(y; n, θ = 1/4) / Bin(y; n, 0.5) = 13.68

Recall: if M1 is true then 2 log LR21 →D χ²₁ (large n)
In our case, 2 log LR21 = 5.23 ⇒ P-value = 0.022. Reject M1

Discussion: how do we interpret a P-value?


Example (cont.)

Now the Bayesian way. p(y | M1 ) = Bin(y ; n, 0.5).


p(y | M2 ) = ∫ p(y | θ) p(θ) dθ = ∫₀¹ Bin(y; n, θ) Beta(θ; a, b) dθ
           = (n choose y) (Γ(a + b)/(Γ(a)Γ(b))) Γ(y + a) Γ(n − y + b) / Γ(n + a + b)

BF21 = p(y | M2 ) / p(y | M1 ) = 3.22  ⇒  p(M2 | y) = ( 1 + BF12 p(M1 )/p(M2 ) )⁻¹ = 0.763

Here p(M1 ) = 1/2. Suppose now p(M1 ) = 3/4, then p(M2 | y ) = 0.518.

• P-value: prob on data given a hypothesis


• Post prob: prob of a hypothesis given data
R code

> n= 20; y= 5; a= b= 1
> lrt= 2*(dbinom(y,n,prob=y/n,log=TRUE) - dbinom(y,n,prob=0.5,log=TRUE))
> 1-pchisq(lrt,df=1)
[1] 0.02216888

> logm1= dbinom(y,n,prob=0.5,log=TRUE)


> logm2= lchoose(n,y) + lgamma(a+b) - lgamma(a) - lgamma(b) +
+   lgamma(y+a) + lgamma(n-y+b) - lgamma(n+a+b)

> priormodel= c(.5,.5)


> exp(logm2)*priormodel[2]/(exp(logm1)*priormodel[1]+exp(logm2)*priormodel[2])
[1] 0.7630669

#same but more numerically stable


> postodds= exp(logm2-logm1) * priormodel[2]/priormodel[1]
> postodds/(1+postodds)
[1] 0.7630669
Other differences
Bayesian model selection
• We can have > 2 models. Non-nested models OK
• Measures uncertainty via p(Mk | y)

Combining with parameter estimation. Strategies


1 Select a model, e.g. k̂ = arg maxk p(Mk | y). Get posterior mean

θ̂k̂ = E(θ | y, Mk̂ )

2 Bayesian model averaging. If θ has same meaning across models


E(θ | y) = Σ_{k=1}^K E(θ | y, Mk ) p(Mk | y) = Σ_{k=1}^K θ̂k p(Mk | y)

Option 1 simpler, but ignores uncertainty (unsure that k̂ is correct)


Example y ∼ Bin(n, θ)

Consider M1 : θ = 0.5, M2 : θ ≠ 0.5


• p(M1 ) = p(M2 ) = 0.5
• p(θ | M2 ) = Beta(θ; 1, 1)

Observe y = 5, n = 20 ⇒ p(M1 | y ) = 0.237, p(M2 | y ) = 0.763

E(θ | y) = E(θ | y, M1 ) p(M1 | y) + E(θ | y, M2 ) p(M2 | y) =
  0.5 × 0.237 + (5 + 1)/(20 + 2) × 0.763 = 0.327

Way closer to 0.5 than MLE θ̂ = 0.25 (shrinkage)
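A quick R check of this model-averaged estimate (a sketch re-using the marginal likelihood from the earlier example):

n= 20; y= 5; a= b= 1
logm1= dbinom(y, n, 0.5, log=TRUE)                      # log p(y | M1)
logm2= lchoose(n,y) + lbeta(y+a, n-y+b) - lbeta(a,b)    # log p(y | M2)
pM2= 1/(1 + exp(logm1 - logm2))                         # p(M2 | y), equal prior model probs
pM1= 1 - pM2
pM1*0.5 + pM2*(y+a)/(n+a+b)                             # BMA estimate, approx 0.327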


Posterior predictive inference

Sometimes the goal is to make statements about future data y∗ given y


Example: 2016 USA elections
• Survey data: y = (y1 , . . . , y51 ), number who say they will vote Clinton in state i out of ni interviews. yi ∼ Bin(ni , θi )
• Parameters: θi true proportion of people who say they will vote Clinton
• Future data: n∗ = (n1∗ , . . . , n51∗ ), number who actually vote. Assume known
• Future data: y∗ = (y1∗ , . . . , y51∗ ), number who vote Clinton, yi∗ ∼ Bin(ni∗ , θi )
For simplicity assume yi indep i = 1, . . . , 51, same for yi∗ . Then

p(y∗ | y) = ∫ p(y∗ | θ, y) p(θ | y) dθ = ∫ ( Π_{i=1}^{51} Bin(yi∗ ; ni∗ , θi ) ) p(θ | y) dθ
USA elections
Set independent priors θi ∼ Beta(1, 1), i = 1, . . . , 51. Then
• Posterior distribution
p(θ | y) = Π_{i=1}^{51} Beta(θi ; 1 + yi , 1 + ni − yi )

• Posterior predictive distribution

p(y∗ | y) = ∫ Π_{i=1}^{51} Bin(yi∗ ; ni∗ , θi ) Beta(θi ; 1 + yi , 1 + ni − yi ) dθi

Integral has closed form. In general it is easier to approximate p(y∗ , θ | y) by sampling.
Posterior sampling
Idea: p(θ | y) is a prob distrib, sample many θ (b) ∼ p(θ | y)
E(g(θ) | y) ≈ (1/B) Σ_{b=1}^B g(θ^(b) )

• Law of large numbers: Ê(g(θ) | y) →a.s. E(g(θ) | y) as B → ∞
• Central limit theorem: Ê(g(θ) | y) = E(g(θ) | y) + Op (1/√B)

Example: P(θi > 0.5 | y). Let g(θi ) = I(θi > 0.5), then

P(θi > 0.5 | y) = E[g(θi ) | y] ≈ (1/B) Σ_{b=1}^B I(θi^(b) > 0.5)

Aka, simulate θi^(1) , . . . , θi^(B) ∼ p(θi | y) and count the proportion that are > 0.5
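For instance, a minimal sketch for a Beta posterior, where the exact answer is available for comparison:

set.seed(1)
n= 10; y= 6; B= 10^5
th= rbeta(B, 1+y, 1+n-y)       # theta^(1),...,theta^(B) ~ p(theta | y)
mean(th > 0.5)                 # Monte Carlo estimate of P(theta > 0.5 | y)
1 - pbeta(0.5, 1+y, 1+n-y)     # exact value, for comparison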
Posterior predictive sampling

Goal: approximate p(y∗ | y) or p(y∗ , θ | y). For b = 1, . . . , B,


1 Sample θ (b) ∼ p(θ | y)
2 Sample (y∗ )(b) ∼ p(y∗ | θ, y)
If we keep only the (y∗ )(b) ’s, we simulated from p(y∗ | y)

Example: take random sample of ni = 500 voters from each state


(data from www.presidency.ucsb.edu/showelection.php?year=2016)

We seek post pred prob of winning state i: P(yi∗ > ni∗ /2 | y)


1 Simulate many yi∗ ’s
2 Count proportion of (yi∗ )(b) > ni∗ /2
R code
Samples from posterior predictive p(θ, y∗ | y)
binBetaIndep= function(y,n,alpha,beta,n.ynew,nsamples=10^4) {
  #Posterior sampler for Binomial-Beta model
  # - y[i] ~ Bin(n[i], theta[i])
  # - theta[i] ~ Beta(alpha,beta)
  #
  # Model for future data (ignored if n.ynew missing)
  # - ynew[i] ~ Bin(n.ynew[i], theta[i])
  #
  # Output: nsamples post samples theta | y, and post pred ynew | y
  if (length(y) != length(n)) stop("y and n must be of same length")
  if (length(y) != length(n.ynew)) stop("y and n.ynew must be of same length")
  nfail= n-y
  theta= ynew= matrix(NA,nrow=nsamples,ncol=length(y))
  for (i in 1:nsamples) {
    theta[i,]= rbeta(length(y), y+alpha, n-y+beta)
    ynew[i,]= rbinom(ncol(ynew), size=n.ynew, prob=theta[i,])
  }
  ans= list(theta=theta,ynew=ynew)
  return(ans)
}
R code

#Take random sample of 500 voters from each state


> usaelec= read.table('usaelections_2016.txt',header=TRUE,sep='\t')
> totalVotes= usaelec[,3] + usaelec[,4] #Clinton + Trump voters
> hillaryVotes= usaelec[,3]
> set.seed(1)
> n= rep(500,nrow(usaelec))
> y= rbinom(nrow(usaelec),n,prob=hillaryVotes/totalVotes)
> names(y)= names(n)= names(totalVotes)= names(hillaryVotes)=
+   as.character(usaelec[,1])

#Obtain posterior predictive samples (y*,theta)


> fit1= binBetaIndep(y,n=n,alpha=1,beta=1,n.ynew=totalVotes,nsamples=10^4)
> colnames(fit1$theta)= colnames(fit1$ynew)= as.character(usaelec[,1])

#Obtain post pred samples for y*> n*/2


> statewin= matrix(NA,nrow=nrow(fit1$ynew),ncol=length(totalVotes))
> colnames(statewin)= colnames(fit1$ynew)
> for (i in 1:nrow(statewin)) statewin[i,]= (fit1$ynew[i,]>0.5*totalVotes)
Post pred of yi∗ in each state

[Figure: posterior predictive distribution of Clinton votes in New York and in Texas, with the 50% votes threshold marked]

> table(statewin[,’New York’])


TRUE
10000
> table(statewin[,’Texas’])
FALSE TRUE
9854 146
Post pred on other quantities

Often we’re interested in some function u = u(y∗ , θ, y). Trivial to sample


from its posterior predictive p(u | y)! For b = 1, . . . , B
1 Simulate θ (b) ∼ p(θ | y), then (y∗ )(b) ∼ p(y∗ | θ (b) )
2 Compute u (b) = u((y∗ )(b) , θ (b) , y)

Example: take random sample of ni = 500 voters from each state


(data from www.presidency.ucsb.edu/showelection.php?year=2016)

Each state has si seats in parliament. Post pred on won seats


u = Σ_{i=1}^{51} si I(yi∗ > ni∗ /2)

where I() is the indicator function


Post pred on number of seats

[Figure: posterior predictive distribution of the number of delegates won by Clinton (out of 435), with the 50% threshold marked]

> nbdel= integer(nrow(fit1$ynew)); totaldel= usaelec$number_voting_delegates


> for (i in 1:length(nbdel)) nbdel[i]= sum(totaldel * statewin[i,])
> table(nbdel > sum(totaldel)/2)
FALSE TRUE
6424 3576
Interview ni = 5,000 per state
Prob winning ≥ 218 delegates: 0.330

[Figure: posterior predictive distribution of the number of delegates won by Clinton (out of 435)]
Interview ni = 50,000 per state
Prob winning ≥ 218 delegates: 0.174

[Figure: posterior predictive distribution of the number of delegates won by Clinton (out of 435)]
Summary. Frequentist vs. Bayes
Estimation
• MLE and Bayes often equivalent as n → ∞, p fixed
• Point estimates (large p). Direct connection between penalized MLE and Bayes
• Uncertainty quantification (large p). No direct connection

Hypothesis testing / model selection


• Frameworks are fundamentally different
• BMS more conservative, more general (> 2 hypotheses, non-nested)

Predictive inference
• Point predictions ŷ∗ often similar; quantifying uncertainty may not
• Bayes can easily predict u(y∗ , θ) and associated uncertainty
Outline

1 Course overview

2 Examples

3 Common probability distributions

4 Basics of Statistical inference

5 Intro to Bayesian computation


Bayesian computation

The object of interest

p(θ | y) ∝ p(y | θ)p(θ)

Sometimes p(θ | y) is a well-known distribution


• y ∼ Bin(n, θ) likelihood + Beta prior ⇒ θ | y ∼ Beta
• y ∼ N(X β, φI ) + Normal-IG prior ⇒ (β, φ) | y ∼ Normal-IG
Otherwise, numerical methods (in increasing accuracy)
1 Variational Bayes, Expectation-propagation
2 Asymptotic Normality (BvM, Laplace approx, nested Laplace...)
3 Posterior sampling (Markov Chain Monte Carlo, sequential MC...)
Variational Bayes

Approx p(θ | y) by something simpler, e.g. split θ = (θ1 , . . . , θS ),


p(θ | y) ≈ q(θ) = Π_{s=1}^S qs (θs )

where the qs are defined by minq KL(q, p) = ∫ log( q(θ)/p(θ) ) q(θ) dθ
• Usually super-fast, e.g. qs reduces to matching moments
• Point estimates often good, but problems with posterior uncertainty
Example (mean-field VB): p(θ | y) ≈ Π_{j=1}^p qj (θj ) (univariate factors)

Expectation-propagation: related algorithm, minq KL(p, q) instead


Asymptotic expansions

Normal approx. If the Bernstein-von Mises theorem applies

p(θ | y) ≈ N(θ; θ̂, H⁻¹(θ̂))

where θ̂ = arg maxθ p(θ | y) and H(θ̂) is the hessian of − log p(θ | y) at θ̂

Properties
• Fairly fast (just requires θ̂, H(θ̂))
• Accurate as n → ∞, p fixed. Not so good when p is large relative to n
Extensions: higher-order/saddlepoint approx, high-dim BvM
Example from Minka (MIT, 2001)

[Figure: exact posterior p(D,w) compared to the Normal (Laplace) approximation, a variational (VB) bound, and the EP approximation]

Normal based on Taylor approx: good locally, fails to capture tails
VB underestimates uncertainty, but good point estimates
EP better in this example, but has fewer theoretical guarantees
Markov Chain Monte Carlo

Idea: init θ = θ (0) . Build Markov Chain with transition probabilities

s(θ (b+1) | θ (b) )

such that its stationary distribution is p(θ | y), regardless of θ (0)


• As b → ∞ we’re sampling from p(θ | y)
• Dependent samples
• Many such s, with varying efficiency (convergence speed, called mixing)

Fundamental algorithm to build s(): Metropolis-Hastings


Metropolis-Hastings
Goal: sample from p(θ | y). Init at any θ (0) . For b = 1, . . . , B
1 Propose new θ ∗ ∼ h(θ | θ (b−1) ) from an arbitrary proposal
distribution h (covering the support of p(θ | y ))
2 Accept θ (b) = θ ∗ with probability min{1, u},

u = [ p(θ∗ | y) / p(θ^(b−1) | y) ] × [ h(θ^(b−1) | θ∗ ) / h(θ∗ | θ^(b−1) ) ]

else θ (b) = θ (b−1) . Set b = b + 1


Quiz. What happens if h(θ ∗ | θ (b−1) ) = p(θ ∗ | y)?
Example: Random walk h(θ | θ (b−1) ) = N(θ; θ (b−1) , S)

u = [ p(θ∗ | y) / p(θ^(b−1) | y) ] × [ N(θ^(b−1) ; θ∗ , S) / N(θ∗ ; θ^(b−1) , S) ] = p(y | θ∗ ) p(θ∗ ) / [ p(y | θ^(b−1) ) p(θ^(b−1) ) ]
Quiz. If u ≥ 1 then θ ∗ is accepted. For what θ ∗ do we get u ≥ 1?
Key: proposal h should explore efficiently regions with high p(θ | y)

Hard for random-walk if dim(θ) large, e.g. most θ ∗ rejected.

Alternatives:
• Gibbs sampling (MH with acceptance prob u = 1)
• Use derivatives of log p(θ | y) (Hamiltonian MC)
• ...

Animations: https://chi-feng.github.io/mcmc-demo
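A minimal random-walk Metropolis-Hastings sketch in R, targeting the Beta(1 + y, 1 + n − y) posterior from the placenta previa example so the output can be checked against the exact answer:

set.seed(1)
n= 20; y= 5
logpost= function(th) ifelse(th>0 & th<1, dbeta(th, 1+y, 1+n-y, log=TRUE), -Inf)
B= 10^4; th= numeric(B); th[1]= 0.5      # arbitrary starting value
sd.prop= 0.1                             # random-walk proposal standard deviation
for (b in 2:B) {
  thstar= rnorm(1, th[b-1], sd.prop)                 # propose
  logu= logpost(thstar) - logpost(th[b-1])           # symmetric proposal: h terms cancel
  th[b]= ifelse(log(runif(1)) < logu, thstar, th[b-1])
}
mean(th); (1+y)/(2+n)                    # MCMC estimate vs. exact posterior mean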
Gibbs sampling
Goal: sample from q(θ1 , . . . , θp ). Init at arbitrary (θ1^(0) , . . . , θp^(0) )
1 Sample θ1^(b) ∼ q(θ1 | θ2^(b−1) , . . . , θp^(b−1) )
2 Sample θ2^(b) ∼ q(θ2 | θ1^(b) , θ3^(b−1) , . . . , θp^(b−1) )
3 ...
4 Sample θp^(b) ∼ q(θp | θ1^(b) , . . . , θp−1^(b) )
Many variations
• MH-within-Gibbs: sample θj^(b) ∼ q(θj | . . .) via MH
• Collapsed Gibbs: sample θ1^(b) ∼ q(θ1 ), then Gibbs on q(θ2^(b) , . . . , θp^(b) | θ1^(b) )


• Block-Gibbs: sample subset of θi ’s all at once
• Random scan: randomize order in which θi ’s are sampled
Example: bivariate normal
Run 4 independent chains from different starting values. Target: Normal with zero means, unit variances, correlation 0.8 (a sketch follows below)
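A minimal Gibbs sampler sketch for this target, using the full conditional θ1 | θ2 ∼ N(ρ θ2 , 1 − ρ²) (and symmetrically for θ2):

set.seed(1)
rho= 0.8; B= 10^4
th= matrix(NA, nrow=B, ncol=2); th[1,]= c(5,5)    # deliberately poor starting value
for (b in 2:B) {
  th[b,1]= rnorm(1, mean=rho*th[b-1,2], sd=sqrt(1-rho^2))   # theta1 | theta2
  th[b,2]= rnorm(1, mean=rho*th[b,1],   sd=sqrt(1-rho^2))   # theta2 | theta1
}
colMeans(th[-(1:1000),]); cor(th[-(1:1000),])     # discard burn-in; approx (0,0) and 0.8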
Example: median regression
Model yi = xi′ β + εi , where εi ∼ L(0, φ) iid

p(y | β, φ) ∝ φ^{−n/2} exp{ −(1/√φ) Σ_{i=1}^n |yi − xi′ β| }

log p(y | β, φ) = c − (n/2) log(φ) − (1/√φ) Σ_{i=1}^n |yi − xi′ β|

An equivalent model. Let ε̃i ∼ N(0, φ), vi ∼ G(1, 2) indep

yi = xi′ β + √vi ε̃i

This defines p̃(y, v | β, φ) with marginal p(y | β, φ). Add priors


1 Sample (β, φ) ∼ p̃(β, φ | y, v) trivial (Normal linear regr.)
2 Sample v ∼ p̃(v | y, β, φ) trivial (e.g. indep inv gamma’s)
Hamiltonian Monte Carlo
Idea: MH where proposal depends on gradient of log p(θ | y)

Introduce artificial momentum variables v ∈ Rp such that


(θ, v) ∼ p(θ | y) N(v; 0, M)

for some matrix M (typically diagonal, ideally ∝ V⁻¹(θ | y))

The algorithm (L and  > 0 are tuning parameters)


1 Sample v(b) ∼ N(0, M)
2 Deterministic proposal from Hamiltonian dynamics (partial diff
equations). Set (θ ∗ , v∗ ) = (θ (b−1) , v(b) ). Repeat L times:
• v∗ = v∗ + (ε/2) ∇θ log p(θ∗ | y)
• θ∗ = θ∗ + ε M⁻¹ v∗
• v∗ = v∗ + (ε/2) ∇θ log p(θ∗ | y)
3 Set θ^(b) = θ∗ with probability min{1, u}, where

u = p(y | θ∗ ) p(θ∗ ) / [ p(y | θ^(b−1) ) p(θ^(b−1) ) ]
Hamiltonian Monte Carlo

Remarks:
• In the limit ε = 0, the acceptance probability is u = 1
• The Stan and numpyro software use a no-U-turn HMC (NUTS) that automatically sets L
• In principle we need the gradient

∇θ { log p(y | θ) + log p(θ) }

but the software provides automatic differentiation


Example. Hierarchical logit model

yi : number of Clinton voters in state i out of ni people

yi ∼ Binom(ni , θi ).

Set normal random effects on the logit voting probabilities


 
ηi = log( θi / (1 − θi ) ) ∼ N(µ, σ²)

and set priors

µ ∼ N(0, 1)
σ ∼ IG(0.1, 0.1).

Intuition: share info across states, useful when ni is small


Posterior p(θ | y) does not have closed-form
• Since θi = 1/(1 + e −ηi ), even if we knew (µ, σ)

p(η | µ, σ, y) ∝ Π_{i=1}^{51} Bin( yi ; ni , 1/(1 + e^{−ηi }) ) N(ηi ; µ, σ²)

• The posterior on η (equiv. on θ) is no longer independent, since

p(η) = ∫ p(η | µ, σ) p(µ, σ) dµ dσ

The prior does not factor across i, hence neither does p(η | y)

Instead, we use Stan


• 4 chains, 2,000 iterations each
• Discard first 1,000 as burn-in
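A sketch of how this model could be coded with rstan (the course's actual code may differ; variable names are assumptions, and newer Stan versions require the array[N] int declaration syntax instead of int n[N]):

library(rstan)
model_code= "
data { int<lower=1> N; int<lower=0> n[N]; int<lower=0> y[N]; }
parameters { real mu; real<lower=0> sigma; vector[N] eta; }
model {
  mu ~ normal(0, 1);
  sigma ~ inv_gamma(0.1, 0.1);
  eta ~ normal(mu, sigma);
  y ~ binomial_logit(n, eta);    // y[i] ~ Bin(n[i], 1/(1+exp(-eta[i])))
}
"
# y, n: Clinton counts and interview sample sizes per state, as simulated earlier
fit= stan(model_code=model_code, data=list(N=length(y), n=n, y=y),
          chains=4, iter=2000, warmup=1000)
print(fit, pars=c('mu','sigma'))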
Estimates for each state

[Figure: posterior estimates and intervals of the voting probability θi for each of the 51 states]
Comparison to MLE
Very similar, with a little shrinkage

[Figure: posterior mean of θi vs. MLE yi /ni for each state]
Random effects distribution
We estimate Ê (µ | y) = −0.074, Ê (σ | y) = 0.621.

[Figure: estimated random effects distribution for the logit voting probabilities, evaluated at (µ̂, σ̂)]


Assessing MCMC convergence
Compare estimated posterior from the 4 chains

[Figure: estimated posterior densities of mu and sigma from each of the 4 chains]


Assessing MCMC convergence
Trace plot for the 4 chains

[Figure: trace plots of mu and sigma for the 4 chains over 1,000 post-burn-in iterations]
Assessing MCMC convergence
Practical strategies
• Check within-chain variance of results as iterations grow
• Check that different chains give similar results
Let R be within-chain variance / between-chain variance.

Example: bivariate Normal (corr =0.8). Goal: 95% interval for θ1 and θ2
Why random effects?

We have a large number p < n parameters, but we expect them to be


informative about each other

Example. Exam score of student i = 1, . . . , I , question j ∈ {1, . . . , 10}

yij = βi + θj + εij , εij ∼ N(0, φ)

We have n = 10I observations, p = I + 10 coef

Assume random effects for students’ abilities

yij | βi , θj , φ ∼ N(βi + θj , φ)
βi | µ, τ ∼ N(µ, τ )
MMLE vs. Bayes

Maximum marginal likelihood for fixed (non-random) effects ω


ω̂ = (θ̂, φ̂, µ̂, τ̂ ) = arg maxω p(y | ω) = arg maxω ∫ p(y | β, ω) p(β | ω) dβ

β̂ = arg maxβ p(y | β, ω̂) p(β | ω̂)

Requires evaluating p(y | ω) (integrating)


Bayesian: similar, but add prior p(ω)

p(β, ω | y) ∝ p(β | ω, y)p(ω | y) ∝ p(y | β, ω)p(β | ω)p(y | ω)p(ω)


Example

(Goldstein, Multilevel Statistical Models Chapter 11; lab1.html)

Grade of student i = 1, . . . , n = 3, 435 in school gi ∈ {1, . . . , 148}

yi ∼ N(βgi , φ) indep i = 1, . . . , n
βj ∼ N(µ, τ ) indep j = 1, . . . , 148

Estimate (µ, τ, φ) via MMLE. Then β̂ by


maxβ p(y | β) p(β | µ̂, τ̂ , φ̂) = minβ { (1/φ̂) Σ_{i=1}^n (yi − βgi )² + (1/τ̂ ) Σ_{j=1}^{148} (βj − µ̂)² }

What happens if τ̂ ≈ 0? And if τ̂ very large?
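In R, the MMLE fit of this model is a standard random-intercept model, e.g. with lme4 (a sketch; the data frame grades and its column names are hypothetical):

library(lme4)
# grades: hypothetical data.frame with columns 'score' (y_i) and 'school' (g_i)
fit= lmer(score ~ 1 + (1 | school), data=grades, REML=FALSE)  # MMLE of (mu, tau, phi)
summary(fit)        # estimated mu (intercept), tau (school variance), phi (residual var)
coef(fit)$school    # shrunken school effects beta-hat_j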


Approximating integrals

Laplace approx. Let θ ∈ Rp , g a function


I = ∫ g(θ) dθ ≈ g(θ̃) (2π)^{p/2} / |H(θ̃)|^{1/2} = Î
where θ̃ = arg maxθ log g (θ), H(θ̃) = −∇2 log g (θ) |θ=θ̃ its hessian
Usually Iˆ/I = 1 + O(1/h) (relative error), where h measures curvature of
log g (θ) (for us usually h ∝ n)

Importance sampling. Let q(θ) be the target density and h(θ) a density we can sample from:

E(θ) = ∫ θ q(θ) dθ = ∫ [ θ q(θ)/h(θ) ] h(θ) dθ

Sample θ^(b) ∼ h(θ). Estimate Ê(θ) = (1/B) Σ_{b=1}^B θ^(b) q(θ^(b) ) / h(θ^(b) )

Then Ê(θ) = E(θ) + Op (1/√B) (additive error)
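A small sketch: estimating the mean of a Beta(3, 9) target q by importance sampling from a uniform proposal h.

set.seed(1)
B= 10^5
th= runif(B)                          # proposal h = Unif(0,1)
w= dbeta(th, 3, 9) / dunif(th)        # importance weights q(theta)/h(theta)
sum(w*th)/B                           # importance sampling estimate of E(theta)
3/(3+9)                               # exact mean = 0.25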
Integrated Nested Laplace approx (INLA)
Suppose there’s many θi ’s (random effects, spatial models) and one can
write
n
Y
p(y | θ, Ψ1 ) = p(yi | θi , Ψ1 )
i=1
p(θ | Ψ2 ) = N (θ; m(Ψ2 ), diag(V (Ψ2 )))

where (Ψ1 , Ψ2 ) ∼ p(Ψ1 , Ψ2 ) not too big (say < 20 parameters)

Example. yij = θi + x0i Ψ1 + ij , where θi ∼ N(0, Ψ2 ), ij ∼ N indep (i, j)

p(Ψ | y) ∝ p(Ψ)p(θ | Ψ)p(y | θ, Ψ)/p(θ | Ψ, y)


Z
p(θi | y) ≈ p̂(θi | Ψ, y)p̂(Ψ | y)dΨ

where p̂ are Laplace approx (several options available)


Bayesian software

R packages https://cran.r-project.org/web/views/Bayesian.html

Stan (MCMC, variational Bayes, penalized likelihood) http://mc-stan.org


Interfaces with R, python, shell, Matlab, Julia

numpyro: Python probabilistic programming library with a Stan-like NUTS sampler, often faster (JAX-based automatic differentiation, use of GPUs)

Nimble (MCMC, sequential MC, convert R code/models to C++)


https://r-nimble.org

INLA (GLMs, random effects, GAM, spatial models, SPDEs...)


http://www.r-inla.org
