
Hierarchical Bayesian Modeling
Prof. Nicholas Zabaras

Email: nzabaras@gmail.com
URL: https://www.zabaras.com/

September 18, 2020

Statistical Computing and Machine Learning, Fall 2020, N. Zabaras


References
 C. P. Robert, The Bayesian Choice: From Decision-Theoretic Motivations to Computational Implementation, Springer-Verlag, NY, 2001 (online resource).
 A. Gelman, J. B. Carlin, H. S. Stern and D. B. Rubin, Bayesian Data Analysis, Chapman and Hall/CRC Press, 2nd Edition, 2003.
 J. M. Marin and C. P. Robert, The Bayesian Core, Springer-Verlag, 2007 (online resource).
 D. Sivia and J. Skilling, Data Analysis: A Bayesian Tutorial, Oxford University Press, 2006.
 B. Vidakovic, Bayesian Statistics for Engineering, Online Course at Georgia Tech.
 Kevin Murphy, Machine Learning: A Probabilistic Perspective, Chapter 5.



Contents
 Hierarchical Bayesian Models, Modeling Cancer Rates Example

 Empirical Bayes – Evidence Approximation

 ML-II (Empirical Bayes) vs. MAP-II vs. Full Bayes

 James Stein Estimator, Example with Gaussian likelihood and Gaussian prior,
Predicting baseball scores

 The goals of today’s lecture are:


 Learn the fundamental assumptions of hierarchical Bayesian models, and
empirical Bayes and evidence methodologies
 Learn how to derive shrinkage parameter estimates (James Stein
estimator)
Hierarchical Bayesian Models
 It often helps to decompose prior knowledge into several levels particularly when the available
data is hierarchical.

 The hierarchical Bayes method is a powerful tool for expressing rich statistical models that
more fully reflect a given problem.

 Often the prior on 𝜃 depends in turn on other parameters 𝜙 that are not mentioned in the likelihood. So the prior 𝜋(𝜃) must be replaced by a prior 𝜋(𝜃|𝜙), and a prior 𝜋(𝜙) on the newly introduced parameters 𝜙 is required, resulting in the posterior probability 𝜋(𝜃, 𝜙|𝒙):
 (q ,  | x ) ~  ( x | q ,  )  (q ,  ) ~  ( x | q )  (q |  ) ( )
 ( x|q )  (q , )

 This is the simplest example of a hierarchical Bayes model.


 The process may be repeated; e.g., 𝜙 may depend on parameters 𝜓, which will require their own prior. Eventually the process must terminate, with priors that do not depend on any other parameters.



Hierarchical Bayesian Models
 Consider an 𝑚-level hierarchical Bayesian model:
$$\pi(\theta) \;=\; \int_{\Theta_1\times\cdots\times\Theta_m} \pi(\theta\,|\,\theta_1)\,\pi(\theta_1\,|\,\theta_2)\cdots\pi(\theta_{m-1}\,|\,\theta_m)\,\pi(\theta_m)\,d\theta_1\cdots d\theta_m$$

 Two-level hierarchical modeling gives:

 Full posterior:
$$\pi(\theta,\theta_1\,|\,\boldsymbol{x}) \;\propto\; \underbrace{\pi(\boldsymbol{x}\,|\,\theta)\,\pi(\theta\,|\,\theta_1)}_{\propto\;\pi(\theta\,|\,\theta_1,\boldsymbol{x})}\;\pi(\theta_1)$$

 Conditional posterior:
$$\pi(\theta\,|\,\theta_1,\boldsymbol{x}) \;\propto\; \pi(\boldsymbol{x}\,|\,\theta)\,\pi(\theta\,|\,\theta_1)$$

 Marginal posterior:
$$\pi(\theta\,|\,\boldsymbol{x}) \;=\; \int \pi(\theta,\theta_1\,|\,\boldsymbol{x})\,d\theta_1$$
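For concreteness, here is a minimal numerical sketch of the marginalization 𝜋(𝜃|𝒙) = ∫ 𝜋(𝜃, 𝜃₁|𝒙) d𝜃₁ on a grid, using a toy Gaussian two-level model. The model, the numbers, and all variable names are illustrative assumptions added here, not taken from the slides.

```python
import numpy as np

# Toy two-level model (illustrative assumption, not from the slides):
#   x      ~ N(theta, sigma2)    likelihood
#   theta  ~ N(theta1, 1.0)      first-level prior
#   theta1 ~ N(0, 1.0)           hyper-prior
sigma2 = 0.5
x_obs = 1.2

def gauss(z, m, v):
    """Gaussian density N(z | m, v)."""
    return np.exp(-0.5 * (z - m) ** 2 / v) / np.sqrt(2.0 * np.pi * v)

# Grids over theta and the hyper-parameter theta1
theta = np.linspace(-4.0, 4.0, 400)
theta1 = np.linspace(-4.0, 4.0, 400)
dth, dth1 = theta[1] - theta[0], theta1[1] - theta1[0]
T, T1 = np.meshgrid(theta, theta1, indexing="ij")

# Unnormalized joint posterior pi(theta, theta1 | x) ∝ pi(x|theta) pi(theta|theta1) pi(theta1)
joint = gauss(x_obs, T, sigma2) * gauss(T, T1, 1.0) * gauss(T1, 0.0, 1.0)

# Marginal posterior pi(theta | x) = ∫ pi(theta, theta1 | x) dtheta1, then normalize
marginal = joint.sum(axis=1) * dth1
marginal /= marginal.sum() * dth

# For this conjugate toy model the exact posterior mean is x * 2 / (2 + sigma2) = 0.96
print("posterior mean of theta:", (theta * marginal).sum() * dth)
```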



Hierarchical Bayes
 A key requirement for computing the posterior 𝑝(𝜽|𝒟) is the specification of a prior 𝑝(𝜽|𝜼),
where 𝜼 are the hyper-parameters.

 What if we don’t know how to set 𝜼?

 In some cases, we can use uninformative priors.

 A more Bayesian approach is to put a prior on our priors! In terms of graphical models
(showing explicitly dependence relations), we can represent the situation as follows:

𝜼 → 𝜽 → 𝒟
 This is an example of a hierarchical Bayesian model, also called a multi-level model.



Hierarchical Bayes: Modeling Cancer Rates
 Consider the problem of predicting cancer rates in various cities.
 We measure the number of people in various cities, 𝑁𝑖, and the number of people who died of cancer in these cities, 𝑥𝑖. We assume 𝑥𝑖 ~ ℬ𝒾𝓃(𝑁𝑖, 𝜽𝑖) and we want to estimate the cancer rates 𝜽𝑖.
 We can estimate them all separately, but this will suffer from the sparse data problem
(underestimation of the rate of cancer due to small 𝑁𝑖 ).
 We can assume all the 𝜽𝑖 are the same (parameter tying). But the assumption that all the
cities have the same rate is a rather strong one.
 As a compromise we assume that the 𝜽𝑖 are similar, but that there may be city-specific
variations. This can be modeled by assuming 𝜽𝑖 ~ℬℯ𝓉𝒶(𝑎, 𝑏). The full joint distribution can be
written as
$$p(\mathcal{D},\boldsymbol{\theta},\boldsymbol{\eta}) \;=\; p(\boldsymbol{\eta})\prod_{i=1}^{N} \mathrm{Bin}(x_i\,|\,N_i,\theta_i)\,\mathrm{Beta}(\theta_i\,|\,\boldsymbol{\eta}), \qquad \boldsymbol{\eta}=(a,b)$$

 By treating 𝜼 as an unknown (hidden variable), we allow the data-poor cities to borrow statistical strength from data-rich ones.



Hierarchical Bayes: Modeling Cancer Rates
 Compute 𝑝(𝜂, 𝜃|𝒟), then the marginal 𝑝(𝜃|𝒟). The posterior mean is shrunk towards the pooled estimate more strongly for cities with small 𝑁𝑖.

 Compare cities 1 and 20: both have zero observed cancer rates, but city 20 has a smaller population, so its posterior mean 𝔼[𝜃20|𝒟] is pooled more strongly towards the pooled estimate.

[Figure: for each city, the counts 𝑥𝑖, the population 𝑁𝑖, the MLE $\hat{\theta}_i$, and the posterior mean $\mathbb{E}[\theta_i\,|\,\mathcal{D}]$; also shown are the pooled MLE $\hat{\theta} = \sum_i x_i / \sum_i N_i$ and the prior mean $\hat{a}/(\hat{a}+\hat{b})$, with $\hat{a}, \hat{b}$ computed using all the data 𝒟. Code: cancerRatesEb from Kevin Murphy's PMTK.]
Hierarchical Bayes: Modeling Cancer Rates
 95% posterior credible intervals for 𝜽𝑖 .
 City 15, which has a very large population, has small posterior uncertainty. It has the largest
impact on the posterior of 𝜼 which in turn impacts the estimate of the cancer rates for other
cities.
 Cities 10 and 19, which have the highest MLE, also have the highest posterior uncertainty, reflecting the fact that such a high estimate is in conflict with the prior (which is estimated from all the other cities).

[Figure: 95% credible intervals on 𝜃𝑖 (* = median) for each city; cities 10, 15, and 19 are highlighted. Code: cancerRatesEb from Kevin Murphy's PMTK.]


Empirical Bayes – Evidence Procedure
 In hierarchical Bayesian models, we need to compute the posterior on multiple levels of latent
variables. For example, in a two-level model,
p  ,q | D   p  D | q  p q |   p  
 In some cases, we can analytically marginalize out 𝜽; this leaves us with the simpler problem of just computing 𝑝(𝜼|𝒟).

 We can approximate the posterior on the hyper-parameters with a point-estimate,

$$p(\boldsymbol{\eta}\,|\,\mathcal{D}) \;\approx\; \delta_{\hat{\boldsymbol{\eta}}}(\boldsymbol{\eta}), \qquad \hat{\boldsymbol{\eta}} \;=\; \operatorname*{argmax}_{\boldsymbol{\eta}}\, p(\boldsymbol{\eta}\,|\,\mathcal{D}) \;=\; \operatorname*{argmax}_{\boldsymbol{\eta}} \int p(\mathcal{D}\,|\,\boldsymbol{\theta})\,p(\boldsymbol{\theta}\,|\,\boldsymbol{\eta})\,d\boldsymbol{\theta}$$

 Since 𝜼 is typically much smaller than 𝜽 in dimensionality, it is less prone to overfitting, so we can safely use a uniform prior on 𝜼.

 The integral above is the marginal likelihood, often called the evidence. The approach is called empirical Bayes (EB), type-II maximum likelihood, or the evidence procedure.



Empirical Bayes: Summary of Procedures
 Empirical Bayes violates the principle that the prior should be chosen independently of the
data.

 View it as a cheap approximation to inference in a hierarchical Bayesian model.

 We can construct a hierarchy in which the more integrals one performs, the “more Bayesian”
one becomes:
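The hierarchy referred to above can be written out as follows (a reconstruction based on Murphy, Machine Learning: A Probabilistic Perspective, Sec. 5.6; the slide's original table is not reproduced in this extract):

$$\begin{aligned}
\text{Maximum likelihood:}\quad & \hat{\boldsymbol{\theta}} = \operatorname*{argmax}_{\boldsymbol{\theta}}\, p(\mathcal{D}\,|\,\boldsymbol{\theta})\\
\text{MAP estimation:}\quad & \hat{\boldsymbol{\theta}} = \operatorname*{argmax}_{\boldsymbol{\theta}}\, p(\mathcal{D}\,|\,\boldsymbol{\theta})\,p(\boldsymbol{\theta}\,|\,\boldsymbol{\eta})\\
\text{ML-II (empirical Bayes):}\quad & \hat{\boldsymbol{\eta}} = \operatorname*{argmax}_{\boldsymbol{\eta}} \int p(\mathcal{D}\,|\,\boldsymbol{\theta})\,p(\boldsymbol{\theta}\,|\,\boldsymbol{\eta})\,d\boldsymbol{\theta} = \operatorname*{argmax}_{\boldsymbol{\eta}}\, p(\mathcal{D}\,|\,\boldsymbol{\eta})\\
\text{MAP-II:}\quad & \hat{\boldsymbol{\eta}} = \operatorname*{argmax}_{\boldsymbol{\eta}} \int p(\mathcal{D}\,|\,\boldsymbol{\theta})\,p(\boldsymbol{\theta}\,|\,\boldsymbol{\eta})\,p(\boldsymbol{\eta})\,d\boldsymbol{\theta} = \operatorname*{argmax}_{\boldsymbol{\eta}}\, p(\mathcal{D}\,|\,\boldsymbol{\eta})\,p(\boldsymbol{\eta})\\
\text{Full Bayes:}\quad & p(\boldsymbol{\theta},\boldsymbol{\eta}\,|\,\mathcal{D}) \propto p(\mathcal{D}\,|\,\boldsymbol{\theta})\,p(\boldsymbol{\theta}\,|\,\boldsymbol{\eta})\,p(\boldsymbol{\eta})
\end{aligned}$$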



Empirical Bayes
 Let us return to the cancer rates model. We can analytically integrate out 𝜽𝑖 , and write down
the marginal likelihood directly, as follows:

 N i  B  a  xi , b  N i  xi 
p  D | a, b     Bin  xi |  i q i  Beta q i | a, b dq i    
i i  xi  B  a, b 
 Various ways of maximizing this wrt 𝑎 and 𝑏 are discussed in Minka.
 Having estimated 𝑎 and 𝑏, we can plug in the hyper-parameters to compute the posterior $p(\theta_i\,|\,\mathcal{D},\hat{a},\hat{b})$ in the usual conjugate analysis.

 It can be shown that the posterior mean of each 𝜽𝑖 is a weighted average of its local MLE and the prior mean, which depends on 𝜼 = (𝑎, 𝑏).
 Since 𝜼 is estimated using all the data, each 𝜽𝑖 is influenced by all the data.

Minka, T. (2000e). Estimating a Dirichlet distribution, Technical Report.
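As an illustration of the ML-II procedure, here is a minimal sketch (assuming NumPy/SciPy and synthetic counts; the data, function names, and optimizer choice are my own assumptions, not the PMTK cancerRatesEb code): maximize the marginal likelihood 𝑝(𝒟|𝑎, 𝑏) above over the hyper-parameters, then plug the estimates into the conjugate posterior.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import betaln, gammaln

def neg_log_evidence(log_ab, x, N):
    """Negative log marginal likelihood log p(D | a, b) for the beta-binomial model."""
    a, b = np.exp(log_ab)                      # optimize on the log scale to keep a, b > 0
    log_binom = gammaln(N + 1) - gammaln(x + 1) - gammaln(N - x + 1)
    ll = log_binom + betaln(a + x, b + N - x) - betaln(a, b)
    return -np.sum(ll)

# Synthetic counts (illustrative; not the data from the slides)
rng = np.random.default_rng(0)
N = rng.integers(50, 2000, size=20)            # city populations
theta_true = rng.beta(2.0, 200.0, size=20)     # true cancer rates
x = rng.binomial(N, theta_true)                # observed cancer deaths

# ML-II: maximize the evidence p(D | a, b) over the hyper-parameters
res = minimize(neg_log_evidence, x0=np.log([1.0, 1.0]), args=(x, N), method="Nelder-Mead")
a_hat, b_hat = np.exp(res.x)

# Plug-in conjugate posterior for each city: theta_i | D ~ Beta(a_hat + x_i, b_hat + N_i - x_i)
post_mean = (a_hat + x) / (a_hat + b_hat + N)
mle = x / N
print("a_hat, b_hat:", a_hat, b_hat)
print("MLE vs shrunken posterior mean (first 5 cities):")
print(np.c_[mle[:5], post_mean[:5]])
```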



Empirical Bayes: Gaussian-Gaussian Model
 We now consider an example where the data is real-valued. We use a Gaussian likelihood
and a Gaussian prior.

 Suppose we have data from multiple related groups, e.g. 𝑥𝑖𝑗 is the test score for student 𝑖 in
school 𝑗, 𝑗 = 1: 𝐷, 𝑖 = 1: 𝑁𝑗 . We want to estimate the mean score for each school, 𝜃𝑗.

 Since 𝑁𝑗 may be small for some schools, we regularize the problem by using a hierarchical Bayesian model, where 𝜃𝑗 comes from a common prior, 𝒩(𝜇, 𝜏2).

 The joint distribution has the following form:

$$p(\boldsymbol{\theta},\mathcal{D}\,|\,\boldsymbol{\eta},\sigma^2) \;=\; \prod_{j=1}^{D}\left[\prod_{i=1}^{N_j} \mathcal{N}(x_{ij}\,|\,\theta_j,\sigma^2)\right]\mathcal{N}(\theta_j\,|\,\mu,\tau^2), \qquad \boldsymbol{\eta}=(\mu,\tau^2)$$
 We assume for simplicity that 𝜎2 is known.



Empirical Bayes: Gaussian-Gaussian Model
 Exploit the fact that the 𝑁𝑗 Gaussian measurements 𝑥𝑖𝑗 with variance 𝜎2 are equivalent to a single measurement $\bar{x}_j = \frac{1}{N_j}\sum_{i=1}^{N_j} x_{ij}$ with variance $\sigma_j^2 \equiv \sigma^2/N_j$ (the variance of the school's mean, where 𝜎2 is the true within-school variance of the grades).
 This yields the following unnormalized posterior:
$$p(\boldsymbol{\theta},\mathcal{D}\,|\,\hat{\boldsymbol{\eta}},\sigma^2) \;=\; \prod_{j=1}^{D} \mathcal{N}(\theta_j\,|\,\hat{\mu},\hat{\tau}^2)\,\mathcal{N}(\bar{x}_j\,|\,\theta_j,\sigma_j^2),$$
where (computed with the evidence approximation)
$$\hat{\mu} \;=\; \frac{1}{D}\sum_{j=1}^{D}\bar{x}_j \;\equiv\; \bar{x}, \qquad \sigma^2+\hat{\tau}^2 \;=\; \frac{1}{D}\sum_{j=1}^{D}\left(\bar{x}_j-\bar{x}\right)^2$$
 From this, it follows that the posteriors are:
$$p(\theta_j\,|\,\mathcal{D},\hat{\mu},\hat{\tau}^2) \;=\; \mathcal{N}\!\left(\theta_j\,\Big|\,\hat{B}_j\hat{\mu}+(1-\hat{B}_j)\bar{x}_j,\;(1-\hat{B}_j)\,\sigma_j^2\right),$$
$$\hat{\mu} \;=\; \frac{1}{D}\sum_{j=1}^{D}\bar{x}_j \;\equiv\; \bar{x}, \qquad \sigma^2+\hat{\tau}^2 \;=\; \frac{1}{D}\sum_{j=1}^{D}\left(\bar{x}_j-\bar{x}\right)^2, \qquad \hat{B}_j \;=\; \frac{\sigma_j^2}{\sigma_j^2+\hat{\tau}^2}$$



Empirical Bayes: Gaussian-Gaussian Model
 We can compute the evidence:
$$\int \mathcal{N}(\theta_j\,|\,\mu,\tau^2)\,\mathcal{N}(\bar{x}_j\,|\,\theta_j,\sigma_j^2)\,d\theta_j \;=\; \mathcal{N}(\bar{x}_j\,|\,\mu,\sigma_j^2+\tau^2) \;\Rightarrow\; p(\mathcal{D}\,|\,\mu,\tau^2,\sigma^2) \;=\; \prod_{j=1}^{D}\mathcal{N}(\bar{x}_j\,|\,\mu,\sigma_j^2+\tau^2)$$

 For constant 𝜎𝑗2 , we can derive the previously shown estimates using MLE:
$$\hat{\mu} \;=\; \frac{1}{D}\sum_{j=1}^{D}\bar{x}_j \;\equiv\; \bar{x}, \qquad \sigma^2+\hat{\tau}^2 \;=\; \frac{1}{D}\sum_{j=1}^{D}\left(\bar{x}_j-\bar{x}\right)^2 \;\equiv\; s^2$$
 In general, use $\hat{\tau}^2 = \max(0,\, s^2-\sigma^2)$.

 For non-constant 𝜎𝑗2 , you need to use Expectation-Maximization to derive the empirical Bayes
(EB) estimate. Full Bayesian inference is also possible.

 Key insight in Empirical Bayes: Use priors to keep your estimates under control but obtain the
priors empirically from the data itself.
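A minimal sketch of the Gaussian-Gaussian shrinkage computation under the assumptions stated in the comments (synthetic data; the crude moment-matching estimate of $\hat{\tau}^2$ for non-constant $\sigma_j^2$ is my own shortcut, since the slides note that EM is needed for a proper EB estimate in that case):

```python
import numpy as np

def gaussian_eb_shrinkage(xbar, sigma2, Nj):
    """Empirical-Bayes shrinkage for the Gaussian-Gaussian model (sketch only).

    xbar   : array of observed group means x̄_j
    sigma2 : known within-group variance σ²
    Nj     : array of group sizes N_j
    Returns the posterior means B̂_j μ̂ + (1 − B̂_j) x̄_j.
    """
    sigma2_j = sigma2 / Nj                        # variance of each group mean, σ_j² = σ²/N_j
    mu_hat = xbar.mean()                          # μ̂ = x̄
    s2 = ((xbar - mu_hat) ** 2).mean()            # s² = (1/D) Σ (x̄_j − x̄)²
    # Crude moment-matching plug-in for τ̂²; the slides recommend EM when σ_j² varies.
    tau2_hat = max(0.0, s2 - sigma2_j.mean())
    B_hat = sigma2_j / (sigma2_j + tau2_hat)      # B̂_j = σ_j² / (σ_j² + τ̂²)
    return B_hat * mu_hat + (1.0 - B_hat) * xbar

# Illustrative use with synthetic school data (not the slides' data set)
rng = np.random.default_rng(1)
Nj = rng.integers(5, 60, size=8)                              # students per school
theta_true = rng.normal(70.0, 5.0, size=8)                    # true school means
xbar = theta_true + rng.normal(0.0, 10.0 / np.sqrt(Nj))       # observed means, σ = 10
print(np.c_[xbar, gaussian_eb_shrinkage(xbar, 10.0**2, Nj)])
```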



James Stein Estimator
$$p(\theta_j\,|\,\mathcal{D},\hat{\mu},\hat{\tau}^2) \;=\; \mathcal{N}\!\left(\theta_j\,\Big|\,\hat{B}_j\hat{\mu}+(1-\hat{B}_j)\bar{x}_j,\;(1-\hat{B}_j)\,\sigma_j^2\right),$$
$$\hat{\mu} \;=\; \frac{1}{D}\sum_{j=1}^{D}\bar{x}_j \;\equiv\; \bar{x}, \qquad \sigma^2+\hat{\tau}^2 \;=\; \frac{1}{D}\sum_{j=1}^{D}\left(\bar{x}_j-\bar{x}\right)^2, \qquad \hat{B}_j \;=\; \frac{\sigma_j^2}{\sigma_j^2+\hat{\tau}^2}$$

 The quantity $0 \le \hat{B}_j \le 1$ controls the degree of shrinkage towards the overall mean, 𝜇.

 If the data is reliable for group 𝑗, then $\sigma_j^2$ will be small relative to $\hat{\tau}^2$; hence $\hat{B}_j$ will be small, and we will put more weight on $\bar{x}_j$ when we estimate 𝜃𝑗. However, groups with small 𝑁𝑗 will get regularized (shrunk towards the overall mean $\hat{\mu}$) more heavily.

 For $\sigma_j$ constant across 𝑗, the posterior mean becomes (James Stein estimator):
$$\bar{\theta}_j \;=\; \hat{B}\,\bar{x} + (1-\hat{B})\,\bar{x}_j \;=\; \bar{x} + (1-\hat{B})\left(\bar{x}_j-\bar{x}\right), \qquad \hat{B} \;=\; \frac{\sigma^2}{\sigma^2+\hat{\tau}^2}$$
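To see the benefit of this shrinkage, here is a small simulation (an added illustration, not from the slides) comparing the average squared error of the MLE with that of the shrinkage estimate above, with σ² assumed known:

```python
import numpy as np

# Illustrative simulation: compare the average squared error of the MLE x̄_j with
# the shrinkage estimate x̄ + (1 − B̂)(x̄_j − x̄), B̂ = σ²/(σ² + τ̂²), τ̂² = max(0, s² − σ²),
# when θ_j ~ N(μ, τ²) and x̄_j ~ N(θ_j, σ²) with σ² assumed known.
rng = np.random.default_rng(2)
D, sigma2, tau2, mu = 18, 1.0, 0.5, 0.0
n_trials = 5000
mse_mle = mse_shrunk = 0.0

for _ in range(n_trials):
    theta = rng.normal(mu, np.sqrt(tau2), size=D)      # true group means
    x = rng.normal(theta, np.sqrt(sigma2))             # one noisy observation per group
    xbar = x.mean()
    s2 = ((x - xbar) ** 2).mean()                      # estimates σ² + τ²
    B_hat = min(1.0, sigma2 / s2)                      # B̂ = σ² / max(σ², s²)
    theta_shrunk = xbar + (1.0 - B_hat) * (x - xbar)   # James Stein type shrinkage
    mse_mle += ((x - theta) ** 2).mean()
    mse_shrunk += ((theta_shrunk - theta) ** 2).mean()

print("average MSE, MLE      :", mse_mle / n_trials)
print("average MSE, shrinkage:", mse_shrunk / n_trials)
```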



Predicting Baseball Scores
 This is an example of shrinkage applied to baseball batting averages.
 We observe the number of hits for 𝐷 = 18 players during the first 𝑇 = 45 games. Let 𝑏𝑗 be the number of hits for player 𝑗 and assume 𝑏𝑗 ∼ ℬ𝒾𝓃(𝑇, 𝜃𝑗), where 𝜃𝑗 is the "true" batting average for player 𝑗. The goal is to estimate the 𝜃𝑗.
 The MLE is $\hat{\theta}_j = x_j$, where $x_j = b_j/T$ is the empirical batting average. One can use an empirical Bayes approach to do better.
 To apply the Gaussian shrinkage approach described above, we require that the likelihood be
Gaussian, 𝑥𝑗 ∼ 𝒩(𝜃𝑗, 𝜎2) for known 𝜎2.
 However, in this example we have a binomial likelihood. While this has the right mean,
𝔼[𝑥𝑗] = 𝜃𝑗, the variance is not constant:

$$\mathrm{var}[x_j] \;=\; \frac{1}{T^2}\,\mathrm{var}[b_j] \;=\; \frac{T\,\theta_j(1-\theta_j)}{T^2}$$

 Efron, B. and C. Morris (1975). Data analysis using Stein's estimator and its generalizations. J. of the Am. Stat. Assoc. 70(350), 311–319.
Predicting Baseball Scores
 So we apply a variance stabilizing transform to 𝑥𝑗 to better match the Gaussian assumption.
y j  f ( x j )  T arcsin  2 x j  1

 Now we have, approximately*, $y_j \sim \mathcal{N}(f(\theta_j), 1) \equiv \mathcal{N}(\mu_j, 1)$. We use Gaussian shrinkage to estimate the $\mu_j$ using
$$\bar{\mu}_j \;=\; \hat{B}\,\bar{y} + (1-\hat{B})\,y_j \;=\; \bar{y} + (1-\hat{B})\left(y_j-\bar{y}\right)$$

with $\sigma^2 = 1$, and we then transform back to get the shrinkage estimate
$$\bar{\theta}_j \;=\; 0.5\left(\sin\!\left(\frac{\bar{\mu}_j}{\sqrt{T}}\right)+1\right)$$
 The results are shown next.
* Consider a transform $Y=f(X)$ where $\mathbb{E}[X]=\mu$ and $\mathrm{var}[X]=\sigma^2(\mu)$, so that $Y=f(X)\approx f(\mu)+f'(\mu)(X-\mu)$ with $\mathrm{var}[Y]\approx f'(\mu)^2\,\sigma^2(\mu)$. If $f'(\mu)^2\,\sigma^2(\mu)$ is independent of $\mu$, we call $f(X)$ a variance stabilizing transform. Here, with $\mu=\mathbb{E}[x_j]=\theta_j$:
$$f'(\mu)^2\,\sigma^2(\mu) \;=\; \left.\frac{4T}{1-(2x_j-1)^2}\right|_{x_j=\theta_j}\cdot\frac{T\,\theta_j(1-\theta_j)}{T^2} \;=\; 1$$
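A minimal sketch of the full pipeline (transform, shrink, transform back), using hypothetical hit counts rather than the Efron and Morris (1975) data; the function and variable names are my own:

```python
import numpy as np

def shrink_batting_averages(hits, T):
    """Shrinkage estimates of batting averages via the arcsine variance-stabilizing
    transform and Gaussian (James Stein type) shrinkage. Sketch only."""
    x = hits / T                                   # MLEs of theta_j
    y = np.sqrt(T) * np.arcsin(2.0 * x - 1.0)      # y_j approximately N(f(theta_j), 1)
    ybar = y.mean()
    s2 = ((y - ybar) ** 2).mean()                  # estimates sigma^2 + tau^2 with sigma^2 = 1
    B_hat = min(1.0, 1.0 / s2)                     # B̂ = sigma^2 / (sigma^2 + taû^2), clipped
    mu_shrunk = ybar + (1.0 - B_hat) * (y - ybar)  # shrink in the transformed space
    return 0.5 * (np.sin(mu_shrunk / np.sqrt(T)) + 1.0)   # transform back to the theta scale

# Hypothetical hit counts over T = 45 at-bats (NOT the Efron-Morris data)
hits = np.array([18, 17, 16, 15, 14, 13, 12, 11, 10, 9])
T = 45
print(np.c_[hits / T, shrink_batting_averages(hits, T)])
```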
Predicting Baseball Scores
 We plot the MLE $\hat{\theta}_j$ (top) and the shrinkage estimates $\bar{\theta}_j$ (bottom).

 All the estimates have shrunk towards the global mean, 0.265.

[Figure: MLE (top) and shrinkage estimates (bottom). Code: ShrinkageDemoBaseBall from Kevin Murphy's PMTK.]



Predicting Baseball Scores
 We plot the true value $\theta_j$, the MLE $\hat{\theta}_j$, and the posterior mean $\bar{\theta}_j$ for each player.

 The "true" values of $\theta_j$ are estimated from a large number of independent games. On average, the shrunken estimate is much closer to the true parameters than the MLE is.

 The mean squared error,
$$MSE \;=\; \frac{1}{D}\sum_{j=1}^{D}\left(\theta_j-\bar{\theta}_j\right)^2,$$
is over three times smaller using the shrinkage estimates $\bar{\theta}_j$ than using the MLEs $\hat{\theta}_j$ (MSE MLE = 0.0042, MSE shrunk = 0.0013).

[Figure: true value, shrunken estimate, and MLE versus player number. Code: ShrinkageDemoBaseBall from Kevin Murphy's PMTK.]

