Hierarchical Bayesian Modeling
Prof. Nicholas Zabaras
Email: nzabaras@gmail.com
URL: https://www.zabaras.com/
James Stein Estimator, Example with Gaussian likelihood and Gaussian prior,
Predicting baseball scores
The hierarchical Bayes method is a powerful tool for expressing rich statistical models that
more fully reflect a given problem.
Often the prior on 𝜃 depends in turn on other parameters 𝜙 that are not mentioned in the
likelihood. The prior 𝜋(𝜃) must then be replaced by a prior 𝜋(𝜃|𝜙), and a prior 𝜋(𝜙) on the
newly introduced parameters 𝜙 is required, resulting in a posterior probability 𝜋(𝜃, 𝜙|𝒙).
$$\pi(\theta,\phi\,|\,\boldsymbol{x}) \propto \pi(\boldsymbol{x}\,|\,\theta,\phi)\,\pi(\theta,\phi) = \pi(\boldsymbol{x}\,|\,\theta)\,\pi(\theta\,|\,\phi)\,\pi(\phi)$$

With a hierarchy of $m$ levels of hyper-parameters $\theta_1,\ldots,\theta_m$, the prior itself is obtained by marginalization:

$$\pi(\theta\,|\,\boldsymbol{x}) \propto \pi(\boldsymbol{x}\,|\,\theta)\,\pi(\theta), \qquad \pi(\theta) = \int \pi(\theta\,|\,\theta_1)\,\pi(\theta_1\,|\,\theta_2)\cdots\pi(\theta_{m-1}\,|\,\theta_m)\,\pi(\theta_m)\,d\theta_1\cdots d\theta_m$$

Conditional posterior: $\pi(\theta\,|\,\theta_1,\boldsymbol{x}) \propto \pi(\boldsymbol{x}\,|\,\theta)\,\pi(\theta\,|\,\theta_1)$
A more Bayesian approach is to put a prior on our priors! In terms of graphical models
(showing explicitly dependence relations), we can represent the situation as follows:
[Graphical model: 𝜙 → 𝜃 → 𝒟]
This is an example of a hierarchical Bayesian model, also called a multi-level model.
[Figure: observed cases $x_i$, populations $N_i$, MLEs $\hat{\theta}_i$, and posterior means $\mathbb{E}[\theta_i\,|\,\mathcal{D}]$ of the cancer rates for 20 cities; cancerRatesEb from Kevin Murphy's PMTK]

Compare cities 1 and 20: both have zero observed cancer cases ($x_i = 0$), but city 20 has a smaller population. Consequently the posterior mean $\mathbb{E}[\theta_{20}\,|\,\mathcal{D}]$ is pooled more towards the pooled estimate,

$$\text{Pooled MLE: } \hat{\theta} = \frac{\sum_i x_i}{\sum_i N_i}, \qquad \text{Prior mean: } \frac{\hat{a}}{\hat{a}+\hat{b}},$$

with $\hat{a}, \hat{b}$ computed using all the data $\mathcal{D}$.
Hierarchical Bayes: Modeling Cancer Rates
95% posterior credible intervals for 𝜽𝑖 .
City 15, which has a very large population, has small posterior uncertainty. It has the largest
impact on the posterior of 𝜼 which in turn impacts the estimate of the cancer rates for other
cities.
[Figure: 95% credible intervals on $\theta_i$ for the 20 cities, * = median]
The marginal likelihood $p(\mathcal{D}\,|\,\eta) = \int p(\mathcal{D}\,|\,\theta)\,p(\theta\,|\,\eta)\,d\theta$ is often called the evidence. Maximizing it with respect to the hyper-parameters $\eta$ is called empirical Bayes (EB), type-II maximum likelihood, or the evidence procedure.
We can construct a hierarchy in which the more integrals one performs, the “more Bayesian”
one becomes:
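One way to lay out the levels, with $\eta$ denoting the hyper-parameters (a sketch of the standard taxonomy, e.g. $\eta = (a,b)$ in the cancer-rates model below):

$$
\begin{aligned}
\text{ML:}\quad & \hat{\theta} = \operatorname*{argmax}_{\theta}\, p(\mathcal{D}\,|\,\theta) \\
\text{MAP:}\quad & \hat{\theta} = \operatorname*{argmax}_{\theta}\, p(\mathcal{D}\,|\,\theta)\,p(\theta\,|\,\eta) \\
\text{ML-II (EB):}\quad & \hat{\eta} = \operatorname*{argmax}_{\eta} \int p(\mathcal{D}\,|\,\theta)\,p(\theta\,|\,\eta)\,d\theta \\
\text{MAP-II:}\quad & \hat{\eta} = \operatorname*{argmax}_{\eta} \int p(\mathcal{D}\,|\,\theta)\,p(\theta\,|\,\eta)\,p(\eta)\,d\theta \\
\text{Full Bayes:}\quad & p(\theta,\eta\,|\,\mathcal{D}) \propto p(\mathcal{D}\,|\,\theta)\,p(\theta\,|\,\eta)\,p(\eta)
\end{aligned}
$$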
For the beta-binomial cancer-rates model, the evidence is:

$$p(\mathcal{D}\,|\,a,b) = \prod_i \int \mathrm{Bin}(x_i\,|\,N_i,\theta_i)\,\mathrm{Beta}(\theta_i\,|\,a,b)\,d\theta_i = \prod_i \binom{N_i}{x_i}\frac{B(a+x_i,\,b+N_i-x_i)}{B(a,b)}$$
Various ways of maximizing this wrt 𝑎 and 𝑏 are discussed in Minka.
Having estimated $\hat{a}$ and $\hat{b}$, we can plug in the hyper-parameters to compute the posterior
$p(\theta_i\,|\,\mathcal{D},\hat{a},\hat{b})$ in the usual conjugate analysis.
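A minimal numerical sketch of this empirical Bayes procedure (Python with NumPy/SciPy; the counts `x` and populations `N` are hypothetical placeholders, not the PMTK cancer data):

```python
# Empirical Bayes for the beta-binomial model: fit (a, b) by maximizing
# the marginal likelihood, then plug into the conjugate posterior.
import numpy as np
from scipy.optimize import minimize
from scipy.special import betaln, gammaln

x = np.array([0, 1, 2, 0, 3])        # hypothetical case counts per city
N = np.array([50, 60, 80, 20, 90])   # hypothetical populations

def neg_log_evidence(log_ab):
    a, b = np.exp(log_ab)  # optimize on the log scale so a, b > 0
    # log p(D|a,b) = sum_i [log C(N_i,x_i) + log B(a+x_i, b+N_i-x_i) - log B(a,b)]
    log_binom = gammaln(N + 1) - gammaln(x + 1) - gammaln(N - x + 1)
    return -np.sum(log_binom + betaln(a + x, b + N - x) - betaln(a, b))

res = minimize(neg_log_evidence, x0=np.log([1.0, 1.0]), method="Nelder-Mead")
a_hat, b_hat = np.exp(res.x)

# Conjugate posterior for each city: theta_i | D ~ Beta(a_hat+x_i, b_hat+N_i-x_i)
post_mean = (a_hat + x) / (a_hat + b_hat + N)
print(a_hat, b_hat, np.round(post_mean, 4))
```

Note how the posterior means are automatically shrunk towards the prior mean $\hat{a}/(\hat{a}+\hat{b})$, more strongly for cities with small $N_i$.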
Suppose we have data from multiple related groups, e.g. 𝑥𝑖𝑗 is the test score for student 𝑖 in
school 𝑗, 𝑗 = 1: 𝐷, 𝑖 = 1: 𝑁𝑗 . We want to estimate the mean score for each school, 𝜃𝑗.
Since 𝑁𝑗 may be small for some schools, we regularize the problem by using a hierarchical
Bayesian model in which each 𝜃𝑗 comes from a common prior, $\mathcal{N}(\mu,\tau^2)$.
$$p(\boldsymbol{\theta},\mathcal{D}\,|\,\mu,\tau^2,\sigma^2) = \prod_{j=1}^{D}\left[\mathcal{N}(\theta_j\,|\,\mu,\tau^2)\prod_{i=1}^{N_j}\mathcal{N}(x_{ij}\,|\,\theta_j,\sigma^2)\right]$$

We assume for simplicity that $\sigma^2$ is known.
Integrating out $\theta_j$ (with $\bar{x}_j$ the mean score in school $j$ and $\sigma_j^2 = \sigma^2/N_j$):

$$\int \mathcal{N}(\theta_j\,|\,\mu,\tau^2)\,\mathcal{N}(\bar{x}_j\,|\,\theta_j,\sigma_j^2)\,d\theta_j = \mathcal{N}(\bar{x}_j\,|\,\mu,\sigma_j^2+\tau^2) \;\Rightarrow\; p(\mathcal{D}\,|\,\mu,\tau^2,\sigma^2) = \prod_{j=1}^{D}\mathcal{N}(\bar{x}_j\,|\,\mu,\sigma_j^2+\tau^2)$$
For constant $\sigma_j^2 = \sigma^2$, we can derive the previously shown estimates using MLE:

$$\hat{\mu} = \frac{1}{D}\sum_{j=1}^{D}\bar{x}_j \equiv \bar{x}, \qquad \widehat{\sigma^2+\tau^2} = \frac{1}{D}\sum_{j=1}^{D}\left(\bar{x}_j-\bar{x}\right)^2 \equiv s^2$$

In general, use $\hat{\tau}^2 = \max\left(0,\, s^2-\sigma^2\right)$ to guarantee a non-negative estimate.
For non-constant $\sigma_j^2$, you need to use Expectation-Maximization (EM) to derive the empirical Bayes
(EB) estimate. Full Bayesian inference is also possible.
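For the non-constant case, a sketch of what such an EM iteration could look like (the group means and per-group variances are hypothetical illustration values; the $\theta_j$ are the latent variables):

```python
# EM for the empirical Bayes estimate of (mu, tau^2) when the sampling
# variances sigma_j^2 = sigma^2 / N_j differ across groups.
import numpy as np

xbar = np.array([1.2, 0.8, 1.9, 0.3, 1.1])       # hypothetical group means
sigma2_j = np.array([0.5, 0.1, 0.05, 1.0, 0.2])  # known per-group variances

mu, tau2 = xbar.mean(), xbar.var()  # simple initialization
for _ in range(200):
    # E-step: Gaussian posterior of each latent theta_j given (mu, tau2)
    V = 1.0 / (1.0 / tau2 + 1.0 / sigma2_j)   # posterior variances
    m = V * (mu / tau2 + xbar / sigma2_j)     # posterior means
    # M-step: maximize the expected complete-data log-likelihood
    mu = m.mean()
    tau2 = np.mean((m - mu) ** 2 + V)
print(mu, tau2)
```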
Key insight in Empirical Bayes: Use priors to keep your estimates under control but obtain the
priors empirically from the data itself.
The quantity $B_j = \frac{\sigma_j^2}{\sigma_j^2+\hat{\tau}^2}$, with $0 \le B_j \le 1$, controls the degree of shrinkage towards the overall mean $\hat{\mu}$.

If the data is reliable for group $j$, then $\sigma_j^2$ will be small relative to $\hat{\tau}^2$; hence $B_j$ will be small,
and we will put more weight on $\bar{x}_j$ when we estimate $\theta_j$. However, groups with small $N_j$ will get
regularized (shrunk towards the overall mean $\hat{\mu}$) more heavily.
For $\sigma_j$ constant across $j$, the posterior mean becomes the James-Stein estimator:

$$\bar{\theta}_j = B\,\bar{x} + (1-B)\,\bar{x}_j = \bar{x} + (1-B)\left(\bar{x}_j - \bar{x}\right), \qquad B = \frac{\sigma^2}{\sigma^2+\hat{\tau}^2}$$
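A minimal sketch of this estimator in Python (NumPy only; the group means and the known variance $\sigma^2$ are hypothetical illustration values):

```python
# Empirical-Bayes (James-Stein) shrinkage for the Gaussian model with
# constant, known sampling variance sigma^2 for each group mean.
import numpy as np

xbar_j = np.array([1.2, 0.8, 1.9, 0.3, 1.1])  # hypothetical group means
sigma2 = 0.25                                  # known variance of each xbar_j

mu_hat = xbar_j.mean()                         # MLE of the prior mean
s2 = np.mean((xbar_j - mu_hat) ** 2)           # estimates sigma^2 + tau^2
tau2_hat = max(0.0, s2 - sigma2)               # clip at zero, as in the text

B = sigma2 / (sigma2 + tau2_hat)               # shrinkage factor in [0, 1]
theta_bar = mu_hat + (1 - B) * (xbar_j - mu_hat)  # shrunk posterior means
print(B, np.round(theta_bar, 3))
```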
In the baseball example (Efron and Morris, 1975), $x_j = b_j/T$ is the batting average of player $j$, computed from the number of hits $b_j \sim \mathrm{Bin}(T, \theta_j)$ in $T$ at-bats. The sampling variance is therefore not constant:

$$\mathrm{var}[x_j] = \frac{1}{T^2}\,\mathrm{var}[b_j] = \frac{T\,\theta_j(1-\theta_j)}{T^2} = \frac{\theta_j(1-\theta_j)}{T}$$
Efron, B. and C. Morris (1975). Data analysis using Stein's estimator and its generalizations. Journal of the American Statistical Association, 70(350), 311–319.
Predicting Baseball Scores
So we apply a variance stabilizing transform to 𝑥𝑗 to better match the Gaussian assumption.
$$y_j = f(x_j) = \sqrt{T}\,\arcsin\left(2x_j - 1\right)$$

Here, with $\mathbb{E}[x_j] = \theta_j$ and $s^2(\theta_j) \equiv \mathrm{var}[x_j] = \theta_j(1-\theta_j)/T$, the delta method gives

$$\mathrm{var}[y_j] \approx f'(\theta_j)^2\, s^2(\theta_j) = \frac{4T}{1-(2\theta_j-1)^2}\cdot\frac{T\,\theta_j(1-\theta_j)}{T^2} = 1$$

since $1-(2\theta_j-1)^2 = 4\,\theta_j(1-\theta_j)$. The transformed scores $y_j$ are thus approximately Gaussian with unit variance.
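A quick Monte Carlo check of this stabilization (an illustrative check, not part of the original demo), assuming the Efron-Morris setup with $T = 45$ at-bats:

```python
# Verify that y = sqrt(T) * arcsin(2x - 1) has variance close to 1
# for any theta, where x = b/T and b ~ Bin(T, theta).
import numpy as np

rng = np.random.default_rng(0)
T = 45  # at-bats per player, as in Efron and Morris (1975)
for theta in [0.2, 0.265, 0.4]:
    b = rng.binomial(T, theta, size=200_000)  # simulated hit counts
    y = np.sqrt(T) * np.arcsin(2 * (b / T) - 1)
    print(theta, round(y.var(), 3))  # each close to 1
```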
Predicting Baseball Scores
We plot the MLEs $\hat{\theta}_j$ (top) and the shrinkage estimates $\bar{\theta}_j$ (bottom). All the estimates have shrunk towards the global mean, 0.265.

[Figure: MLE (top) and shrinkage estimates (bottom), batting averages between 0.2 and 0.4; ShrinkageDemoBaseBall from Kevin Murphy's PMTK]
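Putting the steps together, a sketch of the kind of pipeline such a demo runs: transform the batting averages, shrink in the transformed (unit-variance Gaussian) space, and map back to the probability scale. The hit counts below are hypothetical placeholders, not the Efron-Morris data.

```python
# James-Stein shrinkage of batting averages via the
# variance-stabilizing transform. Hypothetical data.
import numpy as np

T = 45
hits = np.array([18, 17, 16, 15, 14, 13, 12, 11, 10, 9])  # hypothetical hit counts
x = hits / T                                   # MLEs, theta_hat_j

y = np.sqrt(T) * np.arcsin(2 * x - 1)          # transformed scores, var ~ 1
ybar = y.mean()
s2 = np.mean((y - ybar) ** 2)                  # estimates 1 + tau^2
B = min(1.0, 1.0 / s2)                         # B = sigma^2/(sigma^2 + tau_hat^2) with sigma^2 = 1
y_shrunk = ybar + (1 - B) * (y - ybar)         # shrink towards the grand mean

theta_bar = 0.5 * (np.sin(y_shrunk / np.sqrt(T)) + 1)  # invert the transform
print(np.round(theta_bar, 3))
```

Shrinking in the transformed space is what justifies the Gaussian-Gaussian analysis with $\sigma^2 = 1$; the inverse map $\theta = (\sin(y/\sqrt{T})+1)/2$ returns the shrunk estimates to the batting-average scale.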