1 Course overview
2 Examples
yi = xi′β + εi
Goals
1 Understand how other genes are related to TGFB
2 Predict TGFB
Data
• ≥ 18 years, income ≥ $1,000/year, work 20–60 h/week
• n = 64,380 in 2010, n = 58,885 in 2019
• J = 278 controls: state, household & place of residence, education,
migration, health, financial records, subsidies, sources of income...
• Treatments: gender, black race, hispanic ethnicity, born in Latin
America. Also, interactions with state (204 treatments)
Issue: treatments highly correlated with controls (variance inflation)
Results from Miquel Torrens's thesis
[Figure: black = original data; grey = adding 50 & 100 artificial controls]
Methods
• OLS: least squares
• DML: double machine learning
• BMA: Bayesian model averaging
• CIL: confounder importance learning (Miquel)
Teenager well-being vs Tech use
Orben & Przybylski, Nature Human Behaviour 2019 (519 citations as of 27/9/2021)
Keyword: uncertainty in combining many possible models
“association between digital technology use and well-being (...) too small to warrant policy change”
“association with regularly eating potatoes was nearly as negative as the association with technology use”
EDU: Odds of loneliness increase 1.83 [1.74,1.93]. To plan suicide 1.85 [1.72,1.98]
YRBS data: n=66,303, p=11; MCS data: n=8,351, p=20
Large n. But need to combine models (acknowledge uncertainty)
Law of small numbers
(Daniel Kahneman's Thinking, Fast and Slow, Ch. 10; example by Howard Wainer & Harris Zwerling)
“counties in which the incidence is highest are mostly rural, sparsely populated, and
located in traditionally Republican states in the Midwest, the South and the West”
Law of small numbers: if sample size is small, estimates are less precise
• ni: population in county i = 1, ..., 3141
• θi: probability of developing kidney cancer in county i
• yi ∼ Bin(ni, θi): number of observed cases
A natural estimator (the maximum likelihood estimator) is θ̂i = yi/ni, which has
Var(θ̂i) = θi(1 − θi)/ni
For instance, what if θ1 = . . . = θ3141 (no differences)?
Issue: we estimate many parameters, for some the sample size is small
Fix: share info across counties
Keyword: hierarchical/spatial random effects, clustering
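A quick R simulation of the "no differences" scenario (the county sizes and the 10^{−4} rate below are made up for illustration): even when all θi are equal, the most extreme estimated rates come from the smallest counties.

set.seed(1)
ni <- round(exp(rnorm(3141, mean=8, sd=1.5)))  # hypothetical county populations
theta <- 1e-4                                  # common disease probability (no differences)
y <- rbinom(3141, size=ni, prob=theta)         # observed cases
thetahat <- y/ni                               # per-county MLE
# the most extreme rates (highest and lowest) come from small ni:
plot(ni, thetahat, log="x", xlab="county population", ylab="estimated rate")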
Supervised image compression
(from Candès & Wakin, IEEE Signal Processing Magazine 2008)
Keyword: sparse high-dimensional regression
Data y = (y1, ..., yn)′ contain intensities for n = 10^6 pixels. One can use a so-called wavelet basis to define an n × n matrix X such that
yi = Σ_{j=1}^n xij β̂j ⇒ y = X β̂
But many |β̂j| ≈ 0. Keep the 25,000 largest (throws away 97.5%)
[Figure from the article, panels (a)–(c): an image, its wavelet coefficients, and the reconstruction from the largest coefficients. The surrounding article text defines the coherence µ(Φ, Ψ) = √n · max_{1≤k,j≤n} |⟨φk, ψj⟩| ∈ [1, √n], the largest correlation between any two elements of the bases Φ and Ψ; compressive sampling is mainly concerned with low-coherence pairs, e.g. the spike basis φk(t) = δ(t − k).]
Intuition
Say the image is m × m and y stacks its columns. Then n = m².
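A minimal R sketch of the keep-the-largest-coefficients idea, using a hand-rolled Haar wavelet transform on a 1-D signal (the 10^6-pixel image and the 25,000-coefficient cutoff are the article's; the signal below and the 2.5% cutoff are illustrative):

# orthonormal Haar transform of a signal whose length is a power of 2
haar <- function(x) {
  n <- length(x); coef <- numeric(0)
  while (n > 1) {
    a <- (x[seq(1,n,2)] + x[seq(2,n,2)])/sqrt(2)  # local averages
    d <- (x[seq(1,n,2)] - x[seq(2,n,2)])/sqrt(2)  # local differences
    coef <- c(d, coef); x <- a; n <- n/2
  }
  c(x, coef)
}
ihaar <- function(w) {  # exact inverse of haar()
  x <- w[1]; n <- 1
  while (n < length(w)) {
    d <- w[(n+1):(2*n)]; a <- x; x <- numeric(2*n)
    x[seq(1,2*n,2)] <- (a+d)/sqrt(2); x[seq(2,2*n,2)] <- (a-d)/sqrt(2)
    n <- 2*n
  }
  x
}
set.seed(1)
y <- rep(c(2,-1,3,0), each=256) + rnorm(1024, sd=0.1)  # piecewise-constant signal + noise
w <- haar(y)
w[rank(-abs(w)) > 0.025*length(w)] <- 0  # keep only the largest 2.5% of coefficients
sqrt(mean((y - ihaar(w))^2))             # reconstruction error is small relative to the signal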
[Figure: observations plotted on the first two principal components, two panels (PC1 vs. PC2)]
• What topics does each party talk about?
[Figure: topic share (0–0.10) by party: AfD, FDP, SPD, CSU, CDU, Linke, Gruene]
1 Crime, violence and migrants
2 Prevent the diesel ban
3 Merkel attending events
4 Security & Herrmann
5 Environmental protection & security
6 Strengthen family rights
7 Campaigning
8 Strengthening employee rights
9 Right of return (to fulltime)
10 Invest in education, taxes, limit rents
11 "Solidarity pension" & improving country life
12 Go and vote
13 Bavaria − secure and successful
14 Tax reliefs, listening & "Gefährder"
15 FDP & Lindner
16 (unclear)
17 TV duel & Merkel
18 Campaign events: Guttenberg, Schulz & co
19 Various campaigning events
20 (unclear)
21 (unclear)
22 Failure, change & Gauland
23 Social justice
24 Campaigning, buttons & demolitions
...
42 (unclear)
43 (unclear)
44 Discussions at the Dreschschuppenfest
45 General election vocabulary
Engagement vs. party, followers, topic etc.
Engagement vs. number of followers and party
[Figure: engagement (log scale, 0.1 to 1,000) vs. number of followers, by party]
Increase/decrease in engagement for +10% in Topic 1
AfD 0.57, CDU 1.38, CSU 2.22, FDP 1.59, Gruene 0.64, Linke 0.43, SPD 0.70
Prenatal care
(Conway & Deb, Journal of Health Economics 2005; data from the National Maternal & Infant Health Survey)
Keyword: latent clusters in regression
Estimated coefficient (weighted least squares): β̂1 = −0.132 (not statistically significant)
By latent cluster:   Normal: β̂1 = −0.308*   Complicated: β̂1 = 0.377
S&P500 returns
[Figure: S&P500 returns, 2004-01 through 2020-04, with regime posterior probabilities (Regime 1 vs. Regime 2)]
Recap from examples
1 Course overview
2 Examples
p(y | µ, Σ) = N(y; µ, Σ) = (2π)^{−p/2} |Σ|^{−1/2} exp{ −(1/2)(y − µ)′Σ^{−1}(y − µ) }
log p(y | µ, Σ) = c − (1/2)(y − µ)′Σ^{−1}(y − µ)
Properties
• Linear combinations z = Ay ∼ N(Aµ, AΣA′), if A is q × p full-rank
• Conditional distributions are Normal. If
y = (y1′, y2′)′ ∼ N( (µ1′, µ2′)′ ; ( Σ11 Σ12 ; Σ12′ Σ22 ) )
then y1 | y2 ∼ N( µ1 + Σ12Σ22^{−1}(y2 − µ2), Σ11 − Σ12Σ22^{−1}Σ12′ )
[Figure: elliptical contours of a bivariate Normal density on (X1, X2)]
Note: Normal pdf contours are elliptical, but it’s not the only such
distribution (multivariate T, Laplace, logistic...)
Double exponential (or Laplace)
Univariate case: y ∼ L(µ, φ) with E (y ) = µ, V (y ) = 2φ
p(y | µ, φ) = (1/(2√φ)) exp( −|y − µ|/√φ )
⇒ log p(y | µ, φ) = c − |y − µ|/√φ
[Figure: densities of L(0,1) and N(0,2)]
Gamma & Inv. gamma (positive variables)
[Figure: densities of G(0.5,1), G(2,1), IG(0.5,1), IG(2,1)]
From Normal to Laplace
(from https://commons.wikimedia.org/w/index.php?curid=74166)
Dirichlet distribution
Generalizes the Beta to y = (y1, ..., yp)′ where Σ_{i=1}^p yi = 1. Parameters α = (α1, ..., αp)′, αi > 0.
p(y | α) = ( Γ(Σ_i αi) / Π_i Γ(αi) ) Π_{i=1}^p yi^{αi−1}  if Σ_{i=1}^p yi = 1, else p(y | α) = 0
Particular case: if α1 = . . . = αp = 1 then y ∼ Unif
Moments
E(yi) = αi / Σ_{j=1}^p αj ;   V(yi) = E(yi)(1 − E(yi)) / (1 + Σ_{j=1}^p αj)
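These moments are easy to check by simulation; a Dirichlet draw can be built from independent Gamma variables (a standard construction, not shown on the slides):

# Dirichlet(alpha) draws via normalized independent Gammas
rdirich <- function(B, alpha) {
  g <- matrix(rgamma(B*length(alpha), shape=alpha), nrow=B, byrow=TRUE)
  g/rowSums(g)
}
alpha <- c(2, 3, 5)
y <- rdirich(1e5, alpha)
colMeans(y)       # close to alpha/sum(alpha) = (0.2, 0.3, 0.5)
apply(y, 2, var)  # close to E(y_i)(1 - E(y_i))/(1 + sum(alpha))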
Dirichlet for various (α1 , α2 , α3 )
(from Wikipedia)
Outline
1 Course overview
2 Examples
Background
Let y = (y1, y2′)′ ∼ N( 0, ( σ11 Σ12 ; Σ21 Σ22 ) )
• E(y1 | y2) = Σ12Σ22^{−1}y2 = β′y2, where β = Σ22^{−1}Σ21
Result: let Σ^{−1} = Ω = ( ω11 Ω12′ ; Ω12 Ω22 ). Then β = −(1/ω11)Ω12
• βj = 0 ⇔ the jth element of Ω12 is 0 (partial correlation = 0)
• Ωij = 0 ⇔ yi independent of yj, conditional on the other y's
Research:
Similar interpretation for elliptical copula models (Rossell & Zwiernik, 2020)
Combinations of continuous/discrete variables
How to estimate Ω
Graphical LASSO
Ω̂ = arg max_Ω { log p(y | Ω) − λ Σ_{i≠j} |ωij| }
GLASSO penalizes large ω̂ij's. Larger bias than the MLE, but lower overall error
GLASSO works better if Ω truly has zeroes (sparsity)
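A minimal sketch with the glasso R package (the package choice is an assumption; the slides do not name software):

library(glasso)
set.seed(1)
X <- matrix(rnorm(200*5), ncol=5)
X[,2] <- X[,1] + rnorm(200, sd=0.5)  # induce one strong partial correlation
fit <- glasso(s=cov(X), rho=0.1)     # rho is the L1 penalty lambda
round(fit$wi, 2)                     # estimated Omega: off-diagonals shrunk, some exactly 0

Larger rho yields a sparser Ω̂.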
Model misspecification
θ̂ ∼ N(θ∗, φ∗(X′X)^{−1})
Computational alternatives
• Bootstrap: confidence intervals
• Permutation methods: hypothesis tests
Slower, but they dispense with some assumptions
Bootstrap
[Figure: Engel data, food expenditure (foodexp) vs. income, with fitted median regression line]

Estimate    95% CI
β̂0 = 81.5   (53.3, 114.0)
β̂1 = 0.56   (0.49, 0.60)
Bootstrapped β̂1
[Figure: histogram of the bootstrapped β̂1]
library(quantreg)
data(engel)  # food expenditure vs. income
beta= matrix(NA,nrow=10^4,ncol=2)
for (i in 1:nrow(beta)) {
  # resample rows with replacement, refit the median (tau=0.5) regression
  bootengel= engel[sample(1:nrow(engel),nrow(engel),replace=TRUE),]
  beta[i,]= coef(rq(foodexp ~ income, tau=.5, data=bootengel))
}
hist(beta[,2], main='', xlab=expression(beta[1]))
Suppose we specify
• Probability model for our data (likelihood): p(y | θ)
• Probability distribution for the parameters (prior): p(θ)
Then Bayes theorem gives the posterior distribution
p(θ | y) = p(y | θ)p(θ) / p(y) ∝ p(y | θ)p(θ)
[Figure: posterior densities Beta(1+6,1+4), Beta(1+2,1+8) and Beta(1+60,1+40)]
P(θ > 0.5 | y ) = 0.746; P(θ > 0.5 | y ) = 0.019; P(θ > 0.5 | y ) = 0.978
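Tail probabilities like these are one-line computations in R; a sketch for the first posterior (whether it matches the slide's exact value depends on the data behind each curve):

1 - pbeta(0.5, shape1=1+6, shape2=1+4)  # P(theta > 0.5) under a Beta(1+6,1+4) posterior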
Parameter estimation
Posterior mean
θ̂ = E(θ | y) = ∫ θ p(θ | y) dθ
Posterior mode
Asymptotics (Bernstein–von Mises):
p(θ | y) →T.V. N(θ; θ̂, I^{−1}(θ̂))
√n (E(θ | y) − θ∗) →D N(0, I^{−1}(θ∗))
θ̂ ∼ N(θ∗, φ∗(X′X)^{−1})
p(Mk | y) = p(y | Mk)p(Mk) / p(y) ∝ p(Mk) ∫ p(y | θk, Mk) p(θk | Mk) dθk
Bayes factor:
BFkj = ∫ p(y | θk, Mk) p(θk | Mk) dθk / ∫ p(y | θj, Mj) p(θj | Mj) dθj
Likelihood ratio:
LRkj = max_{θk} p(y | θk, Mk) / max_{θj} p(y | θj, Mj)
Implication: if the number of models K is fixed, then p(Mk∗ | y) →P 1 for the true Mk∗ (more later: how fast, and what happens when K grows with n)
LR21 = Bin(y; n, θ = 1/4) / Bin(y; n, θ = 0.5) = 13.68
Recall: if M1 is true then 2 log LR21 →D χ²1 (large n)
In our case, 2 log LR21 = 5.23 ⇒ P-value = 0.022. Reject M1
BF21 = p(y | M2) / p(y | M1) = 3.22 ⇒ p(M2 | y) = ( 1 + BF12 p(M1)/p(M2) )^{−1} = 0.763
Here p(M1) = 1/2. Suppose now p(M1) = 3/4; then p(M2 | y) = 0.518.
> n= 20; y= 5; a= b= 1
> lrt= 2*(dbinom(y,n,prob=y/n,log=TRUE) - dbinom(y,n,prob=0.5,log=TRUE))
> 1-pchisq(lrt,df=1)
[1] 0.02216888
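Continuing with the same n and y, the Bayes factor has a closed form here, since a Beta(1,1) prior gives marginal likelihood p(y | M2) = 1/(n+1):

m2 <- 1/(n+1)                 # marginal likelihood under M2 (uniform prior)
m1 <- dbinom(y, n, prob=0.5)  # likelihood under the point null M1
BF21 <- m2/m1                 # approx 3.22, as above
1/(1 + 1/BF21)                # p(M2 | y) approx 0.763 when p(M1) = p(M2) = 1/2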
Estimate under the selected model: θ̂ = E(θ | y, Mk̂)
Model-averaged estimate:
E(θ | y) = E(θ | y, M1)p(M1 | y) + E(θ | y, M2)p(M2 | y) = 0.5 × 0.237 + (5+1)/(20+2) × 0.763 = 0.327
p(y∗ | y) = ∫ p(y∗ | θ, y) p(θ | y) dθ = ∫ ( Π_{i=1}^{51} Bin(yi∗; ni∗, θi) ) p(θ | y) dθ
USA elections
Set independent priors θi ∼ Beta(1, 1), i = 1, . . . , 51. Then
• Posterior distribution
p(θ | y) = Π_{i=1}^{51} Beta(θi; 1 + yi, 1 + ni − yi)
• Posterior predictive
p(y∗ | y) = ∫ Π_{i=1}^{51} Bin(yi∗; ni∗, θi) Beta(θi; 1 + yi, 1 + ni − yi) dθi
where the Binomial factors give p(y∗ | θ, y) and the Beta factors give p(θ | y)
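A sketch of sampling from this posterior predictive in R for a single state (the counts below are placeholders, not the course data):

B <- 1e4
yi <- 6e5; ni <- 1e6; nistar <- 1.1e6        # placeholder vote counts and turnout
theta <- rbeta(B, 1 + yi, 1 + ni - yi)       # draws from p(theta_i | y)
ystar <- rbinom(B, size=nistar, prob=theta)  # draws from p(y_i* | y)
quantile(ystar, c(0.025, 0.975))             # 95% posterior predictive interval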
• Law of large numbers: Ê(g(θ) | y) →a.s. E(g(θ) | y) as B → ∞
• Central limit theorem: Ê(g(θ) | y) = E(g(θ) | y) + Op(1/√B)
Example: P(θi > 0.5 | y). Let g(θi) = I(θi > 0.5); then
P(θi > 0.5 | y) = E[g(θi) | y] ≈ (1/B) Σ_{b=1}^B I(θi^(b) > 0.5)
That is, simulate θi^(1), ..., θi^(B) ∼ p(θi | y) and count the proportion that are > 0.5
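The same recipe in R (the Beta parameters are placeholders for 1 + yi and 1 + ni − yi):

draws <- rbeta(1e4, 1 + 6, 1 + 4)  # simulate theta_i ~ p(theta_i | y)
mean(draws > 0.5)                  # Monte Carlo estimate of P(theta_i > 0.5 | y)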
Posterior predictive sampling
[Figure: posterior predictive histograms of votes in New York and Texas (50% votes marked) and of delegates (50% delegates marked)]
Predictive inference
• Point predictions ŷ∗ are often similar across methods; their uncertainty quantification may not be
• Bayes can easily predict u(y∗, θ) and the associated uncertainty
Outline
1 Course overview
2 Examples
Properties
• Fairly fast (just requires θ̂ and H(θ̂))
• Accurate as n → ∞ with p fixed; not so good when p is large relative to n
Extensions: higher-order/saddlepoint approximations, high-dimensional BvM
Example from Minka (MIT, 2001)
[Figure: exact posterior p(D,w) vs. the Laplace approximation, a variational bound (VB), and EP, plotted against w]
Alternatives:
• Gibbs sampling (MH with acceptance prob u = 1)
• Use derivatives of log p(θ | y) (Hamiltonian MC)
• ...
Animations: https://chi-feng.github.io/mcmc-demo
Gibbs sampling
Goal: sample from q(θ1, ..., θp). Initialize at an arbitrary (θ1^(0), ..., θp^(0)). For b = 1, 2, ...:
1 Sample θ1^(b) ∼ q(θ1 | θ2^(b−1), ..., θp^(b−1))
2 Sample θ2^(b) ∼ q(θ2 | θ1^(b), θ3^(b−1), ..., θp^(b−1))
3 ...
4 Sample θp^(b) ∼ q(θp | θ1^(b), ..., θ_{p−1}^(b))
Many variations
• MH-within-Gibbs: sample θj^(b) ∼ q(θj | ...) via MH
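The algorithm above, sketched in R for a bivariate Normal with correlation 0.8 (the example revisited later when assessing convergence); each full conditional of a bivariate Normal is a univariate Normal:

# Gibbs sampler for (theta1, theta2) ~ N(0, [[1, rho], [rho, 1]])
gibbs <- function(B, rho) {
  th <- matrix(NA, B, 2, dimnames=list(NULL, c("theta1","theta2")))
  t1 <- t2 <- 0  # arbitrary initialization
  for (b in 1:B) {
    t1 <- rnorm(1, mean=rho*t2, sd=sqrt(1-rho^2))  # theta1 | theta2
    t2 <- rnorm(1, mean=rho*t1, sd=sqrt(1-rho^2))  # theta2 | theta1
    th[b,] <- c(t1, t2)
  }
  th
}
out <- gibbs(B=1000, rho=0.8)
cor(out)  # sample correlation near 0.8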
p(y | β, φ) ∝ φ^{−n/2} exp{ −(1/√φ) Σ_{i=1}^n |yi − xi′β| }
log p(y | β, φ) = c − (n/2) log(φ) − (1/√φ) Σ_{i=1}^n |yi − xi′β|
yi = xi′β + √vi ε̃i (scale-mixture representation)
Remarks:
• In the limit of step size ε → 0, the acceptance probability is u = 1
• The Stan and NumPyro software use the No-U-Turn Sampler (NUTS), an HMC variant that automatically sets the number of steps L
• In principle we need the gradient of log p(θ | y)
yi ∼ Bin(ni, θi), where θi = 1/(1 + e^{−ηi})
ηi ∼ N(µ, σ²)
µ ∼ N(0, 1)
σ ∼ IG(0.1, 0.1)
p(η | µ, σ, y) ∝ Π_{i=1}^{51} Bin(yi; ni, 1/(1 + e^{−ηi})) N(ηi; µ, σ²)
The prior does not factor across i, hence neither does p(η | y)
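A sketch of this model in Stan, called from R via rstan (the slides mention Stan; the data names y, n below are assumptions):

library(rstan)
model_code <- "
data { int<lower=1> I; int<lower=0> y[I]; int<lower=1> n[I]; }
parameters { vector[I] eta; real mu; real<lower=0> sigma; }
model {
  mu ~ normal(0, 1);
  sigma ~ inv_gamma(0.1, 0.1);
  eta ~ normal(mu, sigma);     // random effects
  y ~ binomial_logit(n, eta);  // theta_i = 1/(1 + exp(-eta_i))
}
"
fit <- stan(model_code=model_code, data=list(I=51, y=y, n=n), chains=4, iter=2000)

NUTS then samples (η, µ, σ) jointly.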
[Figure: posterior distributions of θi by state, Alabama through Wyoming, on the 0.2–1.0 scale]
Comparison to MLE
Very similar, with a little shrinkage
[Figure: posterior mean of each θi vs. its MLE yi/ni]
[Figure: random effects distribution at the estimated (mu, sigma)]
[Figure: trace plots of mu and sigma for 4 chains over 1,000 iterations]
Assessing MCMC convergence
Practical strategies
• Check within-chain variance of results as iterations grow
• Check that different chains give similar results
Let R̂ be the ratio of the pooled (between + within) chain variance to the within-chain variance; R̂ ≈ 1 suggests convergence.
Example: bivariate Normal (corr =0.8). Goal: 95% interval for θ1 and θ2
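A sketch of both checks using the coda package and the gibbs() sampler sketched earlier:

library(coda)
chains <- mcmc.list(lapply(1:4, function(k) mcmc(gibbs(B=1000, rho=0.8))))
gelman.diag(chains)        # potential scale reduction factor; values near 1 are good
summary(chains)$quantiles  # 2.5% and 97.5% columns give 95% intervals for theta1, theta2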
Why random effects?
yij | βi , θj , φ ∼ N(βi + θj , φ)
βi | µ, τ ∼ N(µ, τ )
MMLE vs. Bayes
yi ∼ N(βgi , φ) indep i = 1, . . . , n
βj ∼ N(µ, τ ) indep j = 1, . . . , 148
E(θ) = ∫ θ q(θ) dθ = ∫ ( θ q(θ) / h(θ) ) h(θ) dθ
Sample θ^(b) ∼ h(θ). Estimate Ê(θ) = (1/B) Σ_{b=1}^B θ^(b) q(θ^(b)) / h(θ^(b))
Then Ê(θ) = E(θ) + Op(1/√B) (additive error)
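A numeric sketch in R (target density q = N(0,1), proposal h = Student-t with 3 df; both choices are illustrative):

set.seed(1)
B <- 1e5
theta <- rt(B, df=3)               # draws from the proposal h
w <- dnorm(theta)/dt(theta, df=3)  # importance weights q(theta)/h(theta)
mean(theta*w)                      # estimates E(theta) = 0 under q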
Integrated Nested Laplace approx (INLA)
Suppose there are many θi's (random effects, spatial models) and one can write
p(y | θ, Ψ1) = Π_{i=1}^n p(yi | θi, Ψ1)
p(θ | Ψ2) = N(θ; m(Ψ2), diag(V(Ψ2)))
R packages https://cran.r-project.org/web/views/Bayesian.html