
Statistical Science

1998, Vol. 13, No. 2, 163–185

Simulating Normalizing Constants: From Importance Sampling to Bridge Sampling to Path Sampling
Andrew Gelman and Xiao-Li Meng

Abstract. Computing (ratios of) normalizing constants of probability models is a fundamental computational problem for many statistical and scientific studies. Monte Carlo simulation is an effective technique, especially with complex and high-dimensional models. This paper aims to bring to the attention of general statistical audiences some effective methods originating from theoretical physics and at the same time to explore these methods from a more statistical perspective, through establishing theoretical connections and illustrating their uses with statistical problems. We show that the acceptance ratio method and thermodynamic integration are natural generalizations of importance sampling, which is most familiar to statistical audiences. The former generalizes importance sampling through the use of a single "bridge" density and is thus a case of bridge sampling in the sense of Meng and Wong. Thermodynamic integration, which is also known in the numerical analysis literature as Ogata's method for high-dimensional integration, corresponds to the use of infinitely many and continuously connected bridges (and thus a "path"). Our path sampling formulation offers more flexibility, and thus potential efficiency, to thermodynamic integration, and the search for optimal paths turns out to have close connections with the Jeffreys prior density and the Rao and Hellinger distances between two densities. We provide an informative theoretical example as well as two empirical examples (involving 17- to 70-dimensional integrations) to illustrate the potential and implementation of path sampling. We also discuss some open problems.
Key words and phrases: Acceptance ratio method, Hellinger distance,
Jeffreys prior density, Markov chain Monte Carlo, numerical integration,
Rao distance, thermodynamic integration.

Andrew Gelman is Associate Professor, Department of Statistics, Columbia University, New York, New York 10027 (e-mail: gelman@stat.columbia.edu). Xiao-Li Meng is Associate Professor, Department of Statistics, University of Chicago, Chicago, Illinois 60637 (e-mail: meng@galton.uchicago.edu).

1. THE NEED FOR COMPUTING NORMALIZING CONSTANTS

Thanks to powerful Markov chain Monte Carlo (MCMC) methods, we can now simulate from a complex probability model p(ω), where ω is in general a high-dimensional variable, without knowing its normalizing constant. That is, one can evaluate q(ω), an unnormalized density function, but cannot directly calculate z = ∫ q(ω) µ(dω), the normalizing constant, where µ can be a counting measure, a Lebesgue measure or a mixture of them. Distributions for which q(ω) can be easily computed but z is intractable arise in many statistical models, such as spatial models, Bayesian hierarchical models and models for incomplete data. In addition, sometimes a quantity of interest is deliberately formulated as a normalizing constant of a density from which draws can be made.

For example, in likelihood analysis with missing data, it commonly occurs that if one had all the observations, denoted by y_com, the computation of the complete-data likelihood for parameters ψ,


LψŽycom ‘ = pycom Žψ‘, would be straightforward. provides an order of magnitude of improvement. The
This suggests the following method for simulating path sampling, which was not part of their study,
the observed-data likelihood LψŽyobs ‘ = pyobs Žψ‘ has potentials for even more dramatic improvement,
in the cases where it is difficult to calculate as we demonstrate in the current paper.
LψŽyobs ‘ directly (an example is given in Sec- In physics and chemistry, a well-studied problem
tion 5.2). Because of computing normalizing constants is known as free
 pycom Žψ‘ LψŽycom ‘ energy estimation. The problem starts with an un-
(1) p ycom Žyobs ; ψ = ≡ ; normalized density, the system density:
pyobs Žψ‘ LψŽyobs ‘  
Hω; α‘
we can treat the likelihood of interest LψŽyobs ‘ (2) qωŽT; α‘ = exp − ;
kT
as the normalizing constant of pycom Žyobs ; ψ‘, with
the complete-data likelihood LψŽycom ‘ serving as where Hω; α‘ is the energy function of state ω, k is
the unnormalized density. In this formulation, ycom Boltzmann’s constant, T is the temperature and α
plays the role of ω in our general notation. is a vector of system characteristics. The free energy
For instance, in genetic linkage analysis a key F of the system is defined as
step is the computation of the likelihood of ψ, the 
(3) FT; α‘ = −kT log zT; α‘ ;
locations of disease genes relative to a set of mark-
ers, based on the observed data yobs from a pedigree. where zT; α‘ is the normalizing constant of the
The problem turns out to be very difficult for a large system density. Simulation of ω from pωŽT; α‘ =
pedigree with many loci, because of the missing ob- qωŽT; α‘/zT; α‘ is typically carried out via MCMC
servations (e.g., allele types inherited from parents) methods. For detailed discussions of this and re-
from some members of the pedigree. In this exam- lated topics, see, among others, Ciccotti and Hoover
ple, simulating ycom from pycom Žyobs ; ψ‘ is feasible (1986), Ceperley (1995) and Frankel and Smit
though far from trivial, for example, by using the (1996). A more statistically oriented review is given
sequential imputation method (see Irwin, Cox and in Neal (1993).
Kong, 1994, and Kong, Liu and Wong, 1994). Be- In applications in both genetics and physics, the
cause of (1), we can use draws from pycom Žyobs ; ψ‘ real interest is not a single normalizing constant
to estimate LψŽyobs ‘ as a normalizing constant; itself, but rather ratios, or equivalently differences
this is essentially the only known effective method of the logarithms, of them (i.e., differences of log-
for dealing with this problem (see, e.g., Thompson, likelihoods; free energy differences). This is also
1996). An application of bridge sampling, which we true in many other applications, such as comput-
discuss in Section 3, in linkage analysis with large ing observed-data likelihood ratios for the purpose
pedigrees is given by Jensen and Kong (1997). of monitoring convergence of Monte Carlo EM al-
A related general problem is, given an unnormal- gorithms (Meng and Schilling, 1996). Even when
ized joint densityR qω; θ‘, to evaluate the marginal it appears that we need to deal with a single nor-
density pθ‘ = pω; θ‘µdω‘. Marginal densities malizing constant, we can almost always bring in a
can be of interest in physical models (e.g., eval- convenient completely known density on the same
uating the distribution of the energy in a Gibbs space as a reference point, as done in DiCiccio et al.
model at a specified temperature) or in statistics, (1997). Therefore, without loss of generality, we can
as marginal likelihoods or marginal posterior den- consider a class of densities on the same space,
sities (e.g., if θ is a parameter of interest and ω which we denote either by a numerical index t or
is a vector of nuisance parameters; see Section 5.3 by a continuous parameter θ; that is,
for an example). The computation of a Bayes fac-
1 1
tor, which requires the calculation of two proba- (4) pt ω‘ = q ω‘ or pωŽθ‘ = qωŽθ‘:
bilities, each of which is the marginal density un- zt t zθ‘
R
der an individual model, py‘ = pyŽφ‘pφ‘ dφ; We make a convention that whenever one of the
is another problem of this sort. This problem has triplet ”p; q; z• is defined with a proper index, so
received much attention in recent literature; for ex- are the other two with the same index. We also
ample, Gelfand and Dey (1994), Chib (1995), Raftery use λ as a generic notation for the log ratio (e.g.,
(1996), Lewis and Raftery (1997) and DiCiccio, Kass, λ = logz1 /z0 ‘). For some examples, we are inter-
Raftery and Wasserman (1997). In particular, Di- ested in a particular log ratio λ; for others, we wish
Ciccio et al. (1997) provide a comparative study on to evaluate zθ‘, up to an arbitrary multiplicative
a variety of methods, from Laplace approximation to constant, for a continuous range of θ.
bridge sampling, for computing Bayes factors. Their There are three common approaches for approx-
main conclusion is that bridge sampling typically imating analytically intractable normalizing con-
constants: analytic approximation (e.g., DiCiccio et al., 1997), numerical integration (e.g., Evans and Swartz, 1995) and Monte Carlo simulation. Of these, Monte Carlo simulation is widely used in statistics, mainly because of its general applicability and its familiarity to statisticians. Arguably, it is also the only general method available for dealing with complex, high-dimensional problems. Current routine simulation methods in statistics rely on the scheme of importance sampling, either using draws from an approximate density or from one of the p_t(ω) (or p(ω|θ)); see Section 3. However, the theoretical evidence provided in Meng and Wong (1996) and the empirical evidence provided in DiCiccio et al. (1997) and in Meng and Schilling (1996) in the context of bridge sampling show that substantial reductions of Monte Carlo error can be achieved, with little or minor increase in computational effort, by using draws from more than one p_t(ω). The key idea here is to use "bridge" densities to effectively shorten the distances among target densities, distances that are responsible for large Monte Carlo errors with the standard importance sampling methods.

The purpose of this paper is fourfold. First, we describe the method of path sampling for estimating λ unbiasedly (Section 2); the method is a general formulation of the thermodynamic integration method for simulating free energy differences, with the introduction of flexible paths aiming at reduction of Monte Carlo errors. Second, we show that importance sampling, bridge sampling and path sampling represent a natural methodological evolution, from using no bridge densities to using an infinite number of them (Section 3); we thus show that thermodynamic integration is a natural generalization of the acceptance ratio method, another well-known method for free energy estimation, since the latter corresponds to bridge sampling with a single bridge. Third, we investigate the problem of optimal paths, which turns out to be closely related to the Jeffreys prior distribution and the Rao and Hellinger distances between two distributions; we illustrate the theoretical results by a simple yet informative example (Section 4). Fourth, we provide two applications (Section 5) to illustrate the implementation and potential of path sampling for statistical problems.

2. A GENERAL FRAMEWORK FOR PATH SAMPLING

2.1 Basic Identities for Path Sampling

Unless otherwise stated, we assume that densities are indexed by a continuous (vector) parameter θ. This may come naturally from a parametric family, as in many statistical applications. In general, given two unnormalized densities with the same support (not necessarily from the same family), q_0(ω) and q_1(ω), we can always construct a continuous path to link them (the issue of optimizing over the choice of path is discussed later in this paper). For example, as suggested in statistical physics (e.g., Neal, 1993, page 96), we can construct a geometric path using a scalar parameter θ ∈ [0, 1],

(5)  geometric path:  q(ω|θ) = q_0^{1−θ}(ω) q_1^{θ}(ω),

or a harmonic path by analogy to the harmonic mean. (As we show in Section 4.3, the geometric path is in general suboptimal for the purpose of estimating the ratio of normalizing constants.)

To derive the basic identity for path sampling, we first assume that θ is a scalar quantity; without loss of generality, we assume that θ ∈ [0, 1] and that we are interested in computing the ratio r = z(1)/z(0). Taking logarithms and then differentiating both sides of the second equation in (4) with respect to θ yields the standard formula (e.g., Ripley, 1988, page 64), assuming the legitimacy of interchanging integration and differentiation,

(6)  (d/dθ) log z(θ) = (1/z(θ)) ∫ (d/dθ) q(ω|θ) µ(dω) = E_θ[(d/dθ) log q(ω|θ)],

where E_θ denotes the expectation with respect to the sampling distribution p(ω|θ). Identity (6) is a consequence of the fact that the expected score function is zero for any θ. By analogy to the potential in statistical physics, we label

U(ω, θ) = (d/dθ) log q(ω|θ).

Integrating (6) from 0 to 1 yields

(7)  λ = log[z(1)/z(0)] = ∫_0^1 E_θ[U(ω, θ)] dθ.

Now, if we consider θ as a random variable (as in Bayesian analysis) with a uniform distribution on [0, 1], we can interpret the right-hand side of (7) as the expectation of U(ω, θ) over the joint distribution of (ω, θ). More generally, we can introduce a prior density p(θ) for θ ∈ [0, 1] and rewrite (7) as

(8)  λ = E[U(ω, θ)/p(θ)],

where the expectation is with respect to the joint density p(ω, θ) = p(ω|θ)p(θ).
Identity (8) immediately suggests an unbiased estimator of λ:

(9)  λ̂ = (1/n) Σ_{i=1}^{n} U(ω_i, θ_i)/p(θ_i),

using n (not necessarily independent) draws (ω_i, θ_i) from p(ω, θ). In addition, we can estimate log[z(b)/z(a)] for intermediate values a, b ∈ [0, 1] by using just the sample points i for which θ_i ∈ [a, b]. The simulation error of λ̂ depends both on the choice of p(θ) and on how the samples are actually drawn. A key advantage of (8) or (9) is that the summand is on the log scale, which is generally more stable than the ratio scale. This is particularly important when computing a log-likelihood ratio as a weighted sum of log ratios of normalizing constants, as in Meng and Schilling (1996).

Extensions of (8) to multivariate θ are straightforward and in fact suggested to us the term path sampling. Suppose θ is now a d-dimensional parameter vector and we are interested in the ratio z(θ_1)/z(θ_0) for given vectors θ_0 and θ_1. We first select a continuous path in the d-dimensional parameter space that links θ_0 and θ_1: θ(t) = (θ_1(t), …, θ_d(t)), for t ∈ [0, 1], with θ(0) = θ_0 and θ(1) = θ_1. Defining

U_k(ω, θ) = ∂ log q(ω|θ)/∂θ_k,  θ̇_k(t) = dθ_k(t)/dt,  k = 1, …, d,

and applying the same argument as with (7) for t going from 0 to 1, we obtain

(10)  λ = ∫_0^1 E_{θ(t)}[(d/dt) log q(ω|θ(t))] dt = ∫_0^1 E_{θ(t)}[Σ_{k=1}^{d} θ̇_k(t) U_k(ω, θ(t))] dt.

From (10), we can easily construct the corresponding path sampling estimator for λ,

(11)  λ̂ = (1/n) Σ_{i=1}^{n} [Σ_{k=1}^{d} θ̇_k(t_i) U_k(ω_i, θ(t_i))],

where the t_i's are sampled uniformly from [0, 1] and ω_i is a draw from p(ω|θ(t_i)). For any given path, (11) is a consistent (and unbiased) estimator of λ as long as the sample average converges to its population average, a requirement that is met by many MCMC methods. The choice of the path obviously affects the Monte Carlo error, as we shall illustrate later. In searching for optimal estimators, the introduction of a nonuniform density for t on [0, 1] is unnecessary, as such a density can be absorbed into the path function θ(t). In fact, even in the univariate case (i.e., (8) and (9)), we can reexpress the prior density via a path function by solving θ̇(t) = 1/p(θ(t)).

2.2 Thermodynamic Integration and Ogata's Method

Using identity (7) for calculating λ is not a new idea. For example, the thermodynamic integration method uses (7) for computing the free energy difference between two molecular-dynamic systems. As a simple example, using the notation in (2)–(3), we can calculate the free energy difference between two systems with the same temperature T as

(12)  F(T, α_1) − F(T, α_0) = ∫_{α_0}^{α_1} E_{T,α}[∂H(ω, α)/∂α] dα,

where E_{T,α} denotes the expectation with respect to the system density p(ω|T, α) (here α is a scalar quantity, such as the volume). Equation (12) is an application of (7), in conjunction with (3), using log q(ω|θ = α) = −H(ω, α)/(kT). Similarly, we can calculate the free energy difference for systems with different temperatures but the same α; identity (10) also allows for different α's and different T's simultaneously. See Frenkel (1986), Frenkel and Smit (1996) and Neal (1993, Section 6.2) for more discussion of thermodynamic integration, so named because identities such as (12) were originally derived from differential equations describing thermodynamic relationships.

Applying (7), Ogata (1989; also see Ogata and Tanemura, 1984) proposed an innovative method for high-dimensional integration. For simplicity, suppose we are interested in integrating a positive function q(ω_1, …, ω_k) over the k-dimensional cube [a, b]^k that includes the origin (0, …, 0), where k can be very large (e.g., k = 1000). To apply (7), we construct a family of densities indexed by a scale parameter σ,

(13)  p(ω|σ) = q(σω_1, …, σω_k)/z_k(σ),

where

z_k(σ) = ∫_a^b ∫_a^b ⋯ ∫_a^b q(σω_1, …, σω_k) dω_1 dω_2 ⋯ dω_k.

Treating q(σω_1, …, σω_k) ≡ q(σω) as the unnormalized density, we obtain from (7) that

(14)  log z_k(1) − log z_k(0) = ∫_0^1 E_σ[(d/dσ) log q(σω)] dσ,
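To make the construction (13)–(14) concrete, here is a small sketch (our own toy example, not from the paper) that estimates the five-dimensional integral of q(ω) = exp(−Σ_j ω_j²/2) over the cube [−3, 3]^5. Ogata's proposal would run a Metropolis sampler for ω given σ; for this particular q the conditional p(ω|σ) factors into truncated normals, so we can draw exactly instead. Here (d/dσ) log q(σω) = −σ Σ_j ω_j², and z_5(0) = 6^5 q(0):

```python
import math
import random
from statistics import NormalDist

rng = random.Random(7)
phi = NormalDist()           # standard normal cdf / inverse cdf
k, a, n = 5, 3.0, 200_000

def sample_omega(sigma):
    # Exact draw from p(w | sigma) ∝ q(sigma * w) on [-a, a]^k for
    # q(w) = exp(-sum(w_j^2) / 2): each coordinate is N(0, 1/sigma)
    # truncated to [-a, a], sampled by inverting the cdf.
    lo, hi = phi.cdf(-a * sigma), phi.cdf(a * sigma)
    return [phi.inv_cdf(rng.uniform(lo, hi)) / sigma for _ in range(k)]

# Path sampling along the scale parameter, as in (14), with uniform p(sigma):
# U(w, sigma) = d/dsigma log q(sigma * w) = -sigma * sum(w_j^2)
total = 0.0
for _ in range(n):
    sigma = 1.0 - rng.random()            # uniform on (0, 1]
    omega = sample_omega(sigma)
    total += -sigma * sum(w * w for w in omega)
lam_hat = total / n                       # estimates log z_k(1) - log z_k(0)

log_zk0 = k * math.log(2 * a)             # z_k(0) = (2a)^k q(0), with q(0) = 1
log_zk1_hat = lam_hat + log_zk0

# Exact answer for comparison: [integral of exp(-w^2/2) over [-a, a]]^k
log_zk1_true = k * math.log(math.sqrt(2 * math.pi) * (2 * phi.cdf(a) - 1))
print(log_zk1_hat, log_zk1_true)
```

For a q without this product structure, the draws of ω given σ would come from a Metropolis sampler, exactly as in Ogata (1989); the estimator itself is unchanged.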
where E_σ is with respect to the density given in (13). Since z_k(1) is exactly the integral we want, and z_k(0) = (b − a)^k q(0), (14) allows us to estimate z_k(1) by using draws {(ω_i, σ_i), i = 1, …, n} from p(ω|σ)p(σ), where p(ω|σ) is given by (13) and p(σ) = 1 for σ ∈ [0, 1]. Simulations from (13) can be accomplished via the Metropolis algorithm (Metropolis et al., 1953), as illustrated in Ogata (1989). In view of (8), we do not have to simulate σ from a uniform distribution; other densities may provide better Monte Carlo errors (see Section 4). Ogata's (1989) original proposals include the use of deterministic choices of σ_i (e.g., equally spaced) and the use of numerical integration techniques (e.g., the trapezoidal rule) to carry out the one-dimensional integration in (14), in which cases one needs multiple draws of ω for any given σ_i; see Sections 2.3 and 5.1.

It appears that Ogata (1989) had independently discovered the thermodynamic integration method. In a subsequent paper, Ogata (1990, page 408) wrote: "Recently, I learned that such an estimation method of log Z_N(σ), which is called free energy, by the derivative of a suitable scalar parameter σ has been commonly used in the field of statistical physics since late the 1970's (see Binder (1986) for example)." On the other hand, although Ogata's work was motivated by high-dimensional integrations for Bayesian computations (Ogata, 1990), there is no mention of his method in Evans and Swartz's (1995) review article on methods for approximating integrals with special emphasis on Bayesian integration problems, nor is there a mention of thermodynamic integration or other popular MCMC-based methods in physics, such as the acceptance ratio method (see Section 3).

We note this lack of citations not to criticize any author, but rather to emphasize the great need for communication among researchers, especially across fields. Indeed, when we initially worked on this problem we also started from scratch (Gelman and Meng, 1994) because we were not aware of thermodynamic integration or Ogata's method. The lack of communication is particularly unfortunate in this case, because many of us may have missed some of the most effective methods for high-dimensional integration, in view of their routine and successful use in physics. A main purpose of this paper is to bring to the attention of statisticians some of these powerful methods, and at the same time to explore more flexible and statistical formulations aiming at potential further improvements as well as more general applicability. In particular, the formulation given in Section 2.1 allows arbitrary construction of a path, even in distribution spaces, as we explore in Section 4.3.

2.3 Path Sampling Estimates Using Numerical Integration over θ

An alternative to using (9) for estimating λ is to numerically evaluate the integral over θ, which essentially amounts to replacing p(θ) in (9) by inverses of spacings. For example, as in Ogata (1989), one can use the trapezoidal rule when θ is univariate. Specifically, we first order the unique values of the simulation draws θ_i so that θ_(1) < θ_(2) < θ_(3) < ⋯, excluding any duplicates (such as occur if θ is updated using the Metropolis algorithm). For each newly labeled θ_(j), we then compute U_(j) as the average of the values of U(ω_i, θ_i) for all simulation draws i for which θ_i = θ_(j). Suppose we want to estimate the log density ratio λ(a, b) = log[z(b)/z(a)] for 0 ≤ a < b ≤ 1. Let j_a and j_b be the indexes such that θ_(j_a) ≤ a < θ_(j_a+1) < ⋯ < θ_(j_b−1) < b ≤ θ_(j_b). Applying the trapezoidal rule, we estimate λ(a, b) by

(15)  λ̂(a, b) = (1/2)(θ_(j_a+1) − a)(U_(j_a+1) + U_a) + Σ_{j=j_a+1}^{j_b−2} (1/2)(θ_(j+1) − θ_(j))(U_(j+1) + U_(j)) + (1/2)(b − θ_(j_b−1))(U_b + U_(j_b−1)),

where U_a and U_b are obtained via interpolation or extrapolation, wherever necessary. Similarly, one can apply Simpson's rule.

Estimating λ using (15) is particularly useful when θ is evaluated on a fixed grid or when p(θ) is not known. The latter happens, for example, when the draws of (ω, θ) are made jointly via a Metropolis–Hastings algorithm (Hastings, 1970) using our ability to evaluate q(ω|θ), which we now view as an unnormalized density on the joint space (ω, θ). In this case, p(θ) is proportional to ∫ q(ω|θ) µ(dω), which is the unknown normalizing constant z(θ) that we want to estimate. In such cases, (9) is not applicable but (15) is. See Section 5.1 for more discussion of this issue.

In the case that z(θ) is interpreted as an (unnormalized) marginal density, a similar method can be applied to estimate its corresponding cumulative distribution function (cdf). We first estimate, for any 0 < a ≤ 1, the unnormalized cdf G(a) = ∫_0^a z(θ) dθ by

(16)  Ĝ(a) = Σ_{j=0}^{j_a−1} (1/2)(θ_(j+1) − θ_(j)) {exp[λ̂(0, θ_(j+1))] + exp[λ̂(0, θ_(j))]} + (1/2)(a − θ_(j_a)) {exp[λ̂(0, a)] + exp[λ̂(0, θ_(j_a))]},
where θ_(0) = 0 and λ̂(·, ·) is defined by (15). We then estimate the cdf by

(17)  F̂(a) = Ĝ(a)/Ĝ(1).

For multivariate θ, just as in (10), there is no unique way of performing the numerical integrations; we can apply (15) with many different choices of path, and we can even consider combining (e.g., by weighted averages) estimators from different paths (see Section 4). Here, we present a simple method, based on averaging over one component of θ at a time, that turns out to be effective in our example of Section 5.2. For simplicity, we describe the method when θ is two-dimensional and evaluated on a rectangular grid of values (θ_1^i, θ_2^j), i = 1, …, m_1, j = 1, …, m_2. We first estimate the following functions on the grid:

g_1(θ_1, θ_2) = log z(θ_1, θ_2) − log z(θ_1^0, θ_2),
g_2(θ_1, θ_2) = log z(θ_1, θ_2) − log z(θ_1, θ_2^0),

where (θ_1^0, θ_2^0) can be any fixed point on the grid. For each θ_2^j, the function g_1(θ_1, θ_2^j) can be estimated as a function of θ_1 using the path sampling estimate (15), averaging along θ_1. Similarly, g_2(θ_1^i, θ_2) can be estimated by path sampling along θ_2, for each θ_1^i. These estimates can be combined using the following identity:

(18)  log z(θ_1, θ_2) − log z(θ_1^i, θ_2^0) = g_1(θ_1, θ_2) + g_2(θ_1^i, θ_2) − g_1(θ_1^i, θ_2),  for any θ_1^i.

Averaging over all values of θ_1^i yields

(19)  log z(θ_1, θ_2) = g_1(θ_1, θ_2) + (1/m_1) Σ_{i=1}^{m_1} [g_2(θ_1^i, θ_2) − g_1(θ_1^i, θ_2)] + constant.

Of course, the order of θ_1 and θ_2 can be reversed in the above expression, giving an alternative estimate; we find in the example of Section 5.2 that the order of integration can make a practical difference. Section 4 provides a theoretical investigation of the choice of paths.

3. A METHODOLOGICAL EVOLUTION

3.1 Direct Importance Sampling Methods

Two different importance sampling schemes are commonly used for computing normalizing constants. The first approach uses draws from a trial density p̃(ω) that is completely known (e.g., an analytic approximation of the target density p(ω) = q(ω)/z). The importance sampling estimator of z is based on the identity

z = E_p̃[q(ω)/p̃(ω)],

and the corresponding Monte Carlo estimator is

(20)  ẑ = (1/n) Σ_{i=1}^{n} q(ω_i)/p̃(ω_i),

where ω_1, …, ω_n are draws from p̃(ω). For example, Dempster, Selwyn and Weeks (1983) use this method to check an analytic approximation of z for a logistic regression likelihood. As usual with importance sampling, this method is effective only if p̃ is a fairly good approximation to p. For complex models, such as those encountered in free energy estimation, finding an acceptable importance sampling density is often out of the question. In fact, even with various variance-reduction techniques (e.g., using control variates), importance sampling does not provide usable answers for these complex problems; otherwise, the more advanced methods would not be so popular.

The second kind of importance sampling method uses draws from densities that themselves are only known in unnormalized form, and thus (20) cannot be applied directly. This is typically the case with iterative simulation (e.g., the Metropolis algorithm), where one can produce draws from p_t(ω) while only knowing q_t(ω) = z_t p_t(ω), with z_t being the unknown quantity of interest. This is the situation we address in this paper. In such a case, various methods are based on special cases of the following identity, studied in detail by Meng and Wong (1996):

(21)  r ≡ z_1/z_0 = E_0[q_1(ω)α(ω)] / E_1[q_0(ω)α(ω)],

where E_t denotes the expectation with respect to p_t(ω) (t = 0, 1), and α(ω) is an arbitrary function satisfying

(22)  0 < |∫_{Ω_0 ∩ Ω_1} α(ω) p_0(ω) p_1(ω) µ(dω)| < ∞,

where Ω_t is the support of p_t(ω) and we assume µ(Ω_0 ∩ Ω_1) > 0. For example, taking α = q_0^{−1} in (21) leads to the commonly used identity (e.g., Ott, 1979; Geyer and Thompson, 1992; Green, 1992):

(23)  z_1/z_0 = E_0[q_1(ω)/q_0(ω)],  assuming Ω_1 ⊂ Ω_0,

and taking α = [q_0 q_1]^{−1} leads to a generalization of the "harmonic rule" of Newton and Raftery (1994). When z_1 = 1, that is, when q_1(ω) = p_1(ω), (23) leads to the so-called reciprocal importance
sampling method (see Gelfand and Dey, 1994, and DiCiccio et al., 1997).

3.2 Acceptance Ratio Method and Bridge Sampling

While (21) is trivial to verify, it was the key identity underlying the powerful acceptance ratio method of Bennett (1976), who motivated (21) by considering a Metropolis algorithm that allows moves between p_0 and p_1. Here we recast his derivation under the more general Metropolis–Hastings algorithm in order to reveal an explicit relationship between the α function in (21) and the corresponding proposal, or jumping, distribution J(·|·) for the Metropolis–Hastings algorithm.

We start by considering a Metropolis–Hastings algorithm on the joint space of (ω, t), where t = 0 or 1, with target density p(ω, t) ∝ q_t(ω). Clearly, p(ω|t) = p_t(ω), and marginally

(24)  p(t = 1)/p(t = 0) = z_1/z_0 = r,

which is the ratio in which we are interested. Now consider all moves where ω stays the same but t changes (i.e., we are switching from one density to the other with the same argument ω). By the detailed balance requirement of the Metropolis–Hastings algorithm, we have

(25)  q_1(ω) T((ω, 0)|(ω, 1)) = q_0(ω) T((ω, 1)|(ω, 0)),

where the transition kernel T(·|·) is given by (when u ≠ v)

(26)  T((ω, u)|(ω, v)) = J((ω, u)|(ω, v)) min{1, [q_u(ω) J((ω, v)|(ω, u))]/[q_v(ω) J((ω, u)|(ω, v))]}

(27)  = q_u(ω) min{J((ω, u)|(ω, v))/q_u(ω), J((ω, v)|(ω, u))/q_v(ω)}.

It follows then, by integrating (or summing) both sides of (25) with respect to µ(dω), that

(28)  z_1/z_0 = E_0[T((ω, 1)|(ω, 0))] / E_1[T((ω, 0)|(ω, 1))],

which is the same as (21), in view of (27), when we let

(29)  α(ω) = min{J((ω, 1)|(ω, 0))/q_1(ω), J((ω, 0)|(ω, 1))/q_0(ω)}.

The derivation of Bennett (1976) corresponds to choosing J((ω, 1)|(ω, 0)) = J((ω, 0)|(ω, 1)) = 1, that is, always proposing to switch. The probability that a proposal is accepted is then the same as the transition kernel (see (26)), and thus the right-hand side of (28) is the ratio of the (marginal) acceptance probabilities; hence the name acceptance ratio given by Bennett. For general J (and thus α, which corresponds to Bennett's weight function W), Meng and Wong (1996) suggested the name bridge sampling, a term that we explain in Section 3.3. Note that although one could implement the above switching algorithm and then use the empirical proportions to estimate r via (24), it is not necessary, and in fact typically not desirable, to do so, because one can generally estimate r more accurately by using (28) and averaging the acceptance probabilities rather than the acceptance rates; this is a case of "Rao–Blackwellization" (as in Gelfand and Smith, 1990).

Given draws (ω_{0i}, i = 1, …, n_0) from p_0(ω), draws (ω_{1i}, i = 1, …, n_1) from p_1(ω) and a choice of α, the sample version of (21) is

(30)  r̂_α = [(1/n_0) Σ_{i=1}^{n_0} q_1(ω_{0i}) α(ω_{0i})] / [(1/n_1) Σ_{i=1}^{n_1} q_0(ω_{1i}) α(ω_{1i})].

Whereas r̂_α is a consistent estimator of r as long as the sample averages in (30) converge to their corresponding population means, its variance obviously varies with α and with how the draws are made. The question of the optimal choice of α, however, is difficult to answer in general because of the correlations among the draws. A case where the answer is easily obtained is when we have independent draws from both p_0 and p_1; although this assumption is typically violated in practice, it permits useful theoretical explorations, and in fact the optimal estimator obtained under this assumption performs rather well in general (see Bennett, 1976; Meng and Schilling, 1996).

Specifically, under the independence assumption, the optimal α, in the sense of minimizing the asymptotic variance of log(r̂_α) (Bennett, 1976) or equivalently the asymptotic relative variance of r̂_α (Meng and Wong, 1996), is given by

(31)  α_opt(ω) ∝ 1/[s_0 p_0(ω) + s_1 p_1(ω)] ∝ 1/[s_0 r q_0(ω) + s_1 q_1(ω)],  ω ∈ Ω_0 ∩ Ω_1,

where s_t = n_t/(n_0 + n_1), t = 0, 1, are assumed to be asymptotically bounded away from 0 and 1.
The corresponding asymptotic minimal error is

(32)  [1/(n s_0 s_1)] {[∫_{Ω_0 ∩ Ω_1} (p_0 p_1)/(s_0 p_0 + s_1 p_1) µ(dω)]^{−1} − 1}.

Since the optimal α_opt is not directly usable, as it depends on the unknown ratio r, Meng and Wong (1996) construct an iterative estimator,

(33)  r̂_opt^{(t+1)} = [(1/n_0) Σ_{i=1}^{n_0} l_{0i}/(s_0 r̂_opt^{(t)} + s_1 l_{0i})] / [(1/n_1) Σ_{i=1}^{n_1} 1/(s_0 r̂_opt^{(t)} + s_1 l_{1i})],  t = 0, 1, …,

where l_{mi} = q_1(ω_{mi})/q_0(ω_{mi}), m = 0, 1, are calculated before the iteration. They show that each iterate in (33), r̂^{(t+1)}, t ≥ 0, provides a consistent estimator of r, and that the unique limit r̂_opt achieves the asymptotic minimal error given in (32). They also study noniterative choices of α, such as α = 1 and α = (q_0 q_1)^{−1/2}. The empirical results presented in Meng and Schilling (1996) show that these estimators can substantially (e.g., by a factor of 5 to 30) reduce the relative mean-squared errors compared to estimators based on (the same amount of) draws from only one density (e.g., p_0), such as when using (23). Note that Bennett (1976) suggested a graphical method for obtaining r̂_opt, and Geyer (1994) proposed an interesting "profile-likelihood" derivation for r̂_opt.

3.3 Connecting Bridge and Path Sampling to Importance Sampling

The fundamental identity (21) underlying bridge sampling can also be motivated easily from importance sampling, which is familiar to most statistical readers. To see this, define

(34)  α(ω) = q_{1/2}(ω)/[q_0(ω) q_1(ω)],  ω ∈ Ω_0 ∩ Ω_1,

where q_{1/2}(ω) is an arbitrary unnormalized density having support Ω_0 ∩ Ω_1 (thus, condition (22) is satisfied). We use the subscript "1/2" to indicate that we intend to use a density that is "between" q_0 and q_1, in the sense of being overlapped by both of them. Substituting this α into (21) yields

(35)  z_1/z_0 = E_0[q_{1/2}(ω)/q_0(ω)] / E_1[q_{1/2}(ω)/q_1(ω)],

with sample version

(36)  r̂ = [(1/n_0) Σ_{i=1}^{n_0} q_{1/2}(ω_{0i})/q_0(ω_{0i})] / [(1/n_1) Σ_{i=1}^{n_1} q_{1/2}(ω_{1i})/q_1(ω_{1i})].

The two expectations in (35) are simple importance sampling identities for z_{1/2}/z_0 and z_{1/2}/z_1, so we can estimate z_{1/2}/z_0 and z_{1/2}/z_1 and then take the ratio to cancel z_{1/2}. The gain of efficiency arises because, with a sensible choice of the "bridge" density p_{1/2}, there is less nonoverlap between p_t (t = 0, 1) and p_{1/2} than between p_0 and p_1. That is, p_{1/2} serves as a bridge between p_0 and p_1, hence the name bridge sampling. In terms of q_{1/2}, we see from (31) and (34) that the best bridge density is the (weighted) harmonic mean of p_0 and p_1:

q_{1/2}^{opt} = (s_1 p_0^{−1} + s_0 p_1^{−1})^{−1} = (p_0 p_1)/(s_0 p_0 + s_1 p_1),

and, interestingly, its normalizing constant determines the (asymptotic) minimal error given in (32). See Meng and Wong (1996) for a discussion of the relationship between bridge sampling and umbrella sampling, another method developed in computational physics (Torrie and Valleau, 1977), which has been termed ratio importance sampling by Chen and Shao (1997a, b) in the statistical literature. Also see Neal (1993, page 98) for a nice graphical representation of (35).

The idea of creating a bridge can obviously be pushed further. It is possible that the two densities q_0(ω) and q_1(ω) are so far separated that, even with the optimal bridge density q_{1/2}^{opt}, the estimator (36) is too variable to use in practice (or does not even exist, if p_1 and p_0 are completely separated). In such cases, it is useful to construct a finite series of L − 1 intermediate densities from which we can make draws. For simplicity of later derivations, we label the corresponding unnormalized densities as q(ω|θ_l), l = 0, 1, …, L, including the two endpoints. For each pair of consecutive functions q(ω|θ_l) and q(ω|θ_{l+1}), l = 0, …, L − 1, we label the intermediate unnormalized density by q(ω|θ_{l+1/2}), which will be computed but not sampled from. With these 2L + 1 (unnormalized) densities, we can apply identity (35) in a telescoping fashion:

(37)  z_1/z_0 ≡ z(1)/z(0) = ∏_{l=1}^{L} [z(θ_{l−1/2})/z(θ_{l−1})] / [z(θ_{l−1/2})/z(θ_l)] = ∏_{l=1}^{L} E_{θ_{l−1}}[q(ω|θ_{l−1/2})/q(ω|θ_{l−1})] / E_{θ_l}[q(ω|θ_{l−1/2})/q(ω|θ_l)].
z1 z1/2 /z0 E0 ’q1/2 ω‘/q0 ω‘“ As a generalization of (35), this is bridge sam-
(35) r≡ = = ; pling with 2L − 1 spans. (Meng and Wong, 1996,
z0 z1/2 /z1 E1 ’q1/2 ω‘/q1 ω‘“
also present multiple-bridge identities for estimat-
with the corresponding estimator ing more than one ratio of normalizing constants
n0 simultaneously; also see Geyer, 1994, for a “profile-
1/n0 ‘ i=1 ’q1/2 ω0i ‘/q0 ω0i ‘“
(36) r̂ = n1 ; likelihood” approach for estimating several ratios
1/n1 ‘ i=1 ’q1/2 ω1i ‘/q1 ω1i ‘“
simultaneously.) Using intermediate systems (dis-
based on n0 draws ω0i from p0 and n1 draws ω1i tributions) to implement importance sampling is
from p1 . That is, instead of applying (23) to directly also a well-known idea in computational physics
estimate z1 /z0 , we apply it to first estimate z1/2 /z0 (e.g., Neal, 1993, Section 6.2; Ceperley, 1995).
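To make the preceding identities concrete, the following minimal numerical sketch (our illustration, not part of the original development; it assumes two unit-variance normal densities D standard deviations apart, so that the true ratio is r = z1/z0 = 1) implements the bridge sampling estimator (36) with the geometric bridge q_{1/2} = (q0 q1)^{1/2} and then the iterative optimal-bridge sequence (33):

```python
import numpy as np

rng = np.random.default_rng(0)
D = 3.0                       # separation between the two normal densities
n0 = n1 = 50_000
s0 = s1 = 0.5                 # s_t = n_t/(n0 + n1)

# Unnormalized densities; both integrate to sqrt(2*pi), so r = z1/z0 = 1.
def q0(w): return np.exp(-0.5 * w**2)
def q1(w): return np.exp(-0.5 * (w - D)**2)

w0 = rng.normal(0.0, 1.0, n0)     # draws from p0
w1 = rng.normal(D, 1.0, n1)       # draws from p1

# Estimator (36) with the geometric bridge q_half = sqrt(q0*q1).
def q_half(w): return np.sqrt(q0(w) * q1(w))
r_geo = np.mean(q_half(w0) / q0(w0)) / np.mean(q_half(w1) / q1(w1))

# Iterative optimal-bridge estimator (33); l_mi = q1/q0 at the draws,
# computed once before the iteration.
l0 = q1(w0) / q0(w0)
l1 = q1(w1) / q0(w1)
r_opt = r_geo                     # any consistent starting value works
for _ in range(20):
    num = np.mean(l0 / (s0 * r_opt + s1 * l0))
    den = np.mean(1.0 / (s0 * r_opt + s1 * l1))
    r_opt = num / den
```

Both estimates recover log r = 0 to within simulation error, and the iterative sequence typically stabilizes after a handful of iterations, consistent with the consistency of each iterate noted above.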
Given (37), one is tempted to study the limiting case when L → ∞, that is, an infinite number of bridges. This can be easily done by considering the indexes θ_l as corresponding to a parameter θ ∈ [0, 1], indexing a parametric family {q(ω|θ): 0 ≤ θ ≤ 1}, with θ_a = a/L for any a ∈ [0, L]. With this setup, taking logarithms of both sides of (37) yields

(38)   log(z1/z0) = Σ_{l=1}^{L} [ G_{l−1}(1/(2L)) − G_l(−1/(2L)) ],

where the functions G_l are defined by

       G_l(ξ) ≡ log ∫_Ω [ q(ω|θ_l + ξ)/q(ω|θ_l) ] p(ω|θ_l) μ(dω),   l = 0, 1, ..., L.

It is easy to verify that, for any l, G_l(0) = 0 and G′_l(0) = E_{θ_l}[U(ω, θ_l)], using the notation of Section 2.1, under the regularity condition that the support of p(ω|θ) does not depend on θ. Thus, when L → ∞, by the Taylor expansion of the right-hand side of (38), we have

       log(z1/z0) = lim_{L→∞} (1/L) Σ_{l=1}^{L} ( E_{θ_{l−1}}[U(ω, θ_{l−1})] + E_{θ_l}[U(ω, θ_l)] ) / 2 = ∫_0^1 E_θ[U(ω, θ)] dθ,

which is exactly the basic identity (7) underlying path sampling. The foregoing derivation may also be helpful in studying the trade-off of implementation efficiency versus Monte Carlo efficiency in adopting multibridge sampling and path sampling, an open issue of practical interest.

4. A THEORETICAL INVESTIGATION OF PATH SAMPLING

4.1 Optimal Prior Density in One Dimension

The arbitrariness of the prior density p(θ) in (9) allows us to search for optimal estimators in the sense of achieving minimal Monte Carlo variances. Due to the difficulty of establishing general results under arbitrary sampling schemes, we shall assume independent draws for theoretical explorations and guidelines (but not for real implementations, as presented in Section 5).

If (ω_i, θ_i), i = 1, ..., n, in (9) are n independent draws from the joint distribution p(ω, θ) = p(ω|θ)p(θ), then the Monte Carlo variance of λ̂ is

       var(λ̂) = (1/n) [ ∫_0^1 ∫_Ω ( U²(ω, θ)/p²(θ) ) p(ω|θ) p(θ) μ(dω) dθ − λ² ]
(39)           = (1/n) [ ∫_0^1 ( E_θ[U²(ω, θ)]/p(θ) ) dθ − λ² ].

Assume for now that p(ω|θ) is given. Then we seek the marginal (or prior) density p(θ) that minimizes (39), which is equivalent to minimizing the first term in (39).

By the Cauchy–Schwarz inequality,

       ∫_0^1 ( E_θ[U²(ω, θ)]/p(θ) ) dθ = [ ∫_0^1 ( √p(θ) )² dθ ] [ ∫_0^1 ( √(E_θ[U²(ω, θ)]) / √p(θ) )² dθ ]
                                       ≥ ( ∫_0^1 √(E_θ[U²(ω, θ)]) dθ )².

The right-hand side above does not depend on p(θ), and the equality holds when

(40)   p(θ) = √(E_θ[U²(ω, θ)]) / ∫_0^1 √(E_{θ′}[U²(ω, θ′)]) dθ′.

It follows that p(θ) of (40) is the optimal prior density, and the optimal variance of λ̂ is

(41)   var_opt = (1/n) [ ( ∫_0^1 √(E_θ[U²(ω, θ)]) dθ )² − λ² ].

Interestingly, when z(θ) is independent of θ, in which case λ = 0, the optimal density given in (40) is exactly the Jeffreys prior density (based on p(ω|θ)) restricted to θ ∈ [0, 1]. In general, (40) can be viewed as a generalized local Jeffreys prior density based on the unnormalized density q(ω|θ) (the expectation E_θ is with respect to the normalized density p(ω|θ)), and it is proper whenever ∫_0^1 √(E_θ[U²(ω, θ)]) dθ < +∞. One can also view (40) as a "variance-stabilizing" transformation via the equation p(θ(t)) θ̇(t) = 1 between the path function and the prior density, although in the current setting the term "second-moment-stabilizing" transformation would be more appropriate because E_θ[U(ω, θ)] is generally not zero. The second-moment-stabilizing property can be seen by noticing that, with the optimal prior (40), the second moment of each term of (9) is free of θ. This makes sense: the optimal procedure should balance the second sampling moment of U(ω, θ)/p(θ) at different locations of θ, since we intend to minimize the average of them.
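The limiting identity log(z1/z0) = ∫_0^1 E_θ[U(ω, θ)] dθ can be checked numerically. The sketch below (ours, not from the paper) uses the geometric path between N(0, 1) and N(D, 1), for which q(ω|θ) = q0^{1−θ} q1^θ gives p(ω|θ) = N(θD, 1) and U(ω, θ) = D(ω − D/2), and estimates λ = log(z1/z0) = 0 by averaging U over independent draws (θ, ω) with θ uniform on [0, 1]:

```python
import numpy as np

rng = np.random.default_rng(1)
D, n = 3.0, 100_000

# Geometric path between q0 = exp(-w^2/2) and q1 = exp(-(w - D)^2/2):
# log q(w|t) = -(1 - t)*w^2/2 - t*(w - D)^2/2, so
#   p(w|t) = N(t*D, 1)  and  U(w, t) = d/dt log q(w|t) = D*(w - D/2).
t = rng.uniform(0.0, 1.0, n)       # uniform prior p(t) on [0, 1]
w = rng.normal(t * D, 1.0)         # one draw from p(w|t) for each t
U = D * (w - D / 2)

lam_hat = U.mean()                 # estimates log(z1/z0), which is 0 here
```

For this path, ∫_0^1 E_t[U²] dt = D² + D⁴/12, so the Monte Carlo standard error of λ̂ is √((D² + D⁴/12)/n), about 0.013 in this configuration.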
The variance given in (39) is not appropriate for the estimator λ̂(0, 1) given by (15), because of the use of an estimated p(θ). The derivations of the asymptotic variances for (15) and (17) are quite involved, even under the independence assumption, due to the presence of (linear and nonlinear) functions of order statistics, and thus we do not discuss them here.

4.2 Optimal Path in Many Dimensions

Generalization of the above result to multivariate θ is immediate. For any given path, the optimal density for θ over that path is the generalized local Jeffreys prior density on that path. This does not, however, answer the question of which path is optimal in the sense of yielding minimal Monte Carlo variance of (11) among all possible paths. The answer to this question is a problem in the calculus of variations. Specifically, the variance of (11), under independent sampling, is

       var(λ̂) = (1/n) [ ∫_0^1 ∫_Ω ( Σ_{k=1}^{d} θ̇_k(t) U_k(ω, θ(t)) )² p(ω|θ) μ(dω) dt − λ² ]
(42)           = (1/n) [ ∫_0^1 Σ_{i,j=1}^{d} g_{ij}(θ(t)) θ̇_i(t) θ̇_j(t) dt − λ² ],

where g_{ij}(θ) = E_θ[U_i(ω, θ) U_j(ω, θ)]. The path function θ(t) that minimizes the first term on the right-hand side of (42) is the solution of the following Euler–Lagrange equations (e.g., Atkinson and Mitchell, 1981) with the boundary condition θ(t) = θ_t, for t = 0, 1:

(43)   Σ_{i=1}^{d} g_{ik}(θ(t)) θ̈_i(t) + Σ_{i,j=1}^{d} [ij, k] θ̇_i(t) θ̇_j(t) = 0,   k = 1, ..., d,

where θ̈(t) denotes the second derivative with respect to t, and [ij, k] is the Christoffel symbol of the first kind:

       [ij, k] = (1/2) [ ∂g_{ik}(θ)/∂θ_j + ∂g_{jk}(θ)/∂θ_i − ∂g_{ij}(θ)/∂θ_k ],   i, j, k = 1, ..., d.

Similar calculus-of-variations problems arise in the literature on finding the Rao distance between two densities (e.g., Rao, 1945, 1949; Atkinson and Mitchell, 1981; Mitchell, 1992). The Rao distance and the minimal variance of (11) are naturally related because the accuracy of the path sampling estimator depends crucially on the distance between the two unnormalized densities, q0(ω) and q1(ω), and the Rao distance provides the appropriate measure. The Rao distance is constructed by considering the variance of the score function projected to a particular path, and thus the only difference between the Rao distance and the current calculation is that we are dealing with unnormalized densities. For example, (43) differs from (2.7) of Atkinson and Mitchell (1981) only by using log q(ω|θ) instead of log p(ω|θ) in defining the U functions inside g_{ij}. In fact, in the next section we show that the Rao distance in distribution space is directly related to the optimal path in distribution space.

As explored in the literature on the Rao distance, solving (43) is typically difficult. Atkinson and Mitchell (1981) suggested two alternative ways of expressing solutions, via Hamilton's equations and Hamilton–Jacobi equations (e.g., Courant and Hilbert, 1961), and provided a differential geometry argument for finding the Rao distance between two normal densities. Despite these efforts, the general problem remains difficult. Section 4.4 provides a theoretical example for the normal distribution, where (43) has an analytic (but nontrivial) solution.

4.3 Optimal Path in Distribution Space

In the previous section, the family of distributions p(ω|θ) is given. A different problem is to find an optimal path in the space of integrable nonnegative functions that connects the two unnormalized density functions, q0(ω) and q1(ω). That is, we seek to optimize over all nonnegative functions q(ω|θ), with θ a scalar parameter having the range [0, 1], subject to the boundary conditions q(ω|0) = c q0(ω) and q(ω|1) = c q1(ω), where c is an arbitrary positive constant. Without loss of generality we can assume θ has a uniform distribution over [0, 1] because of the "absorption" transformation discussed at the end of Section 2.1. In practice, it might be necessary to define a family of functions because q0(ω) and q1(ω) are not part of a common parametric family. Or, the two distributions might have a common parametric form, but a more efficient path may be possible by leaving the parametric form and moving through general distribution space. Possible general constructions include the geometric path (5) suggested in physics and the scaling path proposed by Ogata (1990, 1994) as an example of possible paths in the general distribution space of the form q(ω|θ) = q0(ω) h_θ(ω):

(44)   scaling path:   q(ω|θ) = q0(ω) q1(θω)/q0(θω).
When using this path to estimate λ, we need to adjust for a known bias, log(q0(0)/q1(0)). In our theoretical example in Section 4.4, the geometric path and the scaling path lead to identical Monte Carlo error, which compares favorably to that from optimal bridge sampling but can be improved substantially within the path sampling framework. It is of great practical interest to find general simple paths with good properties.

Finding the optimal path in the whole distribution space turns out to be an easier mathematical problem than the optimization problem described in the previous section. We start by writing the path density as q(ω|θ) = p(ω|θ) z(θ) and expressing

(45)   ∫_0^1 E_θ[ ( d log q(ω|θ)/dθ )² ] dθ = ∫_0^1 ( d log z(θ)/dθ )² dθ + ∫_0^1 E_θ[ ( d log p(ω|θ)/dθ )² ] dθ

(the cross term vanishes because the score function d log p(ω|θ)/dθ has mean zero under p(ω|θ)). To minimize the left-hand side of (45) over q(ω|θ) we can separately minimize the two terms on the right-hand side, both under the appropriate boundary conditions. For the first term on the right-hand side, the following simple result, a consequence of the Cauchy–Schwarz inequality, provides the answer.

Lemma 1. If z(θ) is a positive function on θ ∈ [0, 1] such that log z(1) − log z(0) = λ, then

       ∫_0^1 ( d log z(θ)/dθ )² dθ ≥ λ²,

with the equality holding if and only if z(θ) equals

(46)   z_opt(θ) = z0^{1−θ} z1^{θ} ∝ exp(λθ),

almost surely with respect to the Lebesgue measure on [0, 1].

This result implies that any q(ω|θ) that yields z(θ) = ∫ q(ω|θ) μ(dω) different from (46) cannot be optimal in the distributional space, because q̃(ω|θ) = q(ω|θ)[z_opt(θ)/z(θ)] dominates q(ω|θ) (here we are discussing theoretical optimality, not implementation feasibility). For example, the geometric path (5) is suboptimal in general because, for that path, z(θ) = ∫ q0^{1−θ}(ω) q1^{θ}(ω) μ(dω) < z0^{1−θ} z1^{θ} for 0 < θ < 1.

The second term on the right-hand side of (45) is simply ∫_0^1 I(θ) dθ, where I(θ) is the Fisher information for p(ω|θ). Thus minimizing ∫_0^1 I(θ) dθ is the same as finding the Rao geodesic distance in the distribution space, a problem that can be solved by a differential geometry approach, as reviewed in Burbea (1989); also see Kass and Vos (1997). It turns out that, somewhat unexpectedly, the problem can also be solved via the Cauchy–Schwarz inequality, as we show in the Appendix. The result is (of course) identical to the result on the Rao distance in distribution space, though our expressions are more convenient for the path sampling application.

Lemma 2. Let I(θ) be the Fisher information for p(ω|θ), where p(ω|0) ≡ p0(ω) and p(ω|1) ≡ p1(ω) are given. Let

(47)   α_H = arctan( H(p0, p1) / √(4 − H²(p0, p1)) ),

where H(p0, p1) = [ ∫_Ω ( √p1(ω) − √p0(ω) )² μ(dω) ]^{1/2} is the Hellinger distance between p0 and p1. Then

(48)   ∫_0^1 I(θ) dθ ≥ 16 α_H²,

and the equality in (48) holds if and only if

(49)   p(ω|θ) = { √p0(ω) [ cos((2θ−1)α_H)/(2 cos α_H) − sin((2θ−1)α_H)/(2 sin α_H) ]
               + √p1(ω) [ cos((2θ−1)α_H)/(2 cos α_H) + sin((2θ−1)α_H)/(2 sin α_H) ] }²,

almost surely with respect to the product measure formed by μ and the Lebesgue measure on [0, 1].

Applying the lemma with (45) and (46), we see that the optimal q(ω|θ) = p(ω|θ) z(θ) in the distributional space is given by

(50)   q(ω|θ) ∝ e^{λθ} { √p0(ω) [ cos((2θ−1)α_H)/(2 cos α_H) − sin((2θ−1)α_H)/(2 sin α_H) ]
                       + √p1(ω) [ cos((2θ−1)α_H)/(2 cos α_H) + sin((2θ−1)α_H)/(2 sin α_H) ] }².

The corresponding minimal variance is given by

(51)   var(λ̂) = 16 α_H²/n,

which is a simple function of H(p0, p1). This intrinsic connection with the Hellinger distance also appears in the bridge sampling context, as Meng and Wong (1996) show that the optimal bridge-sampling error given in (32) is bounded below and above by simple functions of H(p0, p1) when s0 = s1.
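For two unit-variance normal densities D standard deviations apart, the squared Hellinger distance has the closed form H²(p0, p1) = 2(1 − e^{−D²/8}), so the lower bound implied by (47) and (51) can be evaluated directly. A small sketch (ours; the quantity returned below is 4α_H, the bound on the √(n · var) scale):

```python
import numpy as np

def hellinger_sq(D):
    # Closed form of H^2 between N(0,1) and N(D,1).
    return 2.0 * (1.0 - np.exp(-D**2 / 8.0))

def path_lower_bound(D):
    # sqrt(n * var) implied by (51): 4 * alpha_H, with alpha_H from (47).
    H2 = hellinger_sq(D)
    alpha_H = np.arctan(np.sqrt(H2) / np.sqrt(4.0 - H2))
    return 4.0 * alpha_H
```

As D → ∞, H² → 2 and α_H → arctan(1) = π/4, so the bound saturates at π; that is, the minimal variance (51) never exceeds π²/n, no matter how far apart the two densities are.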
Unlike bridge sampling, however, where the optimal error is achieved by the iterative solution found with (33), it is unclear whether the optimal error in (51) is achievable (asymptotically) in practice, since the optimal solution given in (50) assumes knowledge of the unknown normalizing constants (and the ability to make independent draws from (50)). Using an adaptive method (e.g., iteratively estimating λ) may lead to an increase in variance. In fact, we doubt that (51) is achievable, as it is bounded above by π²/n even if p0 and p1 are infinitely far apart. An interesting and empirically relevant problem is to figure out the achievable minimal error and how it varies with H(p0, p1) (or some other distance measures). This is an important issue because an unachievable theoretical optimal error could misguide the choice of methods (e.g., contrast the theoretical comparisons in Chen and Shao, 1997a, with the empirical comparisons in Chen and Shao, 1997b, when the actual computational time needed by the ratio importance sampling method is taken into account).

Our interests in exploring these theoretical results lie in finding useful insights and practical guidelines about the potential and limits of path sampling. As we shall demonstrate in Section 4.4, where we solve (43) for a family of normal distributions, an optimal path can reduce the Monte Carlo variance by orders of magnitude when compared to some "natural" nonoptimal choices, a gain that is especially important when the two densities are far apart. We emphasize, however, that it is not necessary to find an optimal path in order to gain a substantial reduction in variance; for example, in the normal example, very simple paths reduce the variance by orders of magnitude compared to previous methods. The empirical implementations presented in Section 5 also demonstrate the superiority of path sampling with simple choices of paths.

4.4 A Theoretical Illustration

To illustrate theoretically the potential of path sampling for reducing Monte Carlo variances, we adopt the following example, which was used by Meng and Wong (1996) for illustrating bridge sampling. The example is of "toy" nature, but the findings are not, and in fact they somewhat surprised us. Let q0(ω) = exp(−ω²/2) and q1(ω) = exp(−(ω − D)²/2), where D > 0, and thus the true λ being "estimated" is zero. For the purpose of path sampling, we consider p0 and p1 as two points in the family of unnormalized normal densities:

(52)   q(ω|θ) = exp( −(ω − μ)²/(2σ²) ),

with θ = (μ, σ), θ0 = (0, 1) and θ1 = (D, 1).

In order to make (nearly) fair comparisons, we assume that (i) with importance sampling, we make n draws from N(0, 1); (ii) with bridge sampling, we make n/2 (assume n is even) draws from each of N(0, 1) and N(D, 1); and (iii) with path sampling, we first draw t_i, i = 1, ..., n, uniformly from [0, 1]; then for each t_i we make one draw ω from N(μ(t_i), σ²(t_i)), where θ(t) = (μ(t), σ(t)) is a given path. All draws are independent within each scheme. In addition, since importance sampling and bridge sampling estimate the ratio r whereas path sampling estimates the log-ratio λ, we convert the estimates of r to the scale of λ by letting λ̂ = log r̂. Under this conversion, the variance of λ̂ is asymptotically the same as the squared relative error of r̂ (i.e., E(r̂ − r)²/r²). Under such a setting, Table 1 compares six estimators of λ, where the computations of √(n E(λ̂ − λ)²) are exact for the path sampling estimators and correct to terms of O(n^{-1}) for the others.

Table 1
Comparison of theoretical Monte Carlo errors of importance, bridge and path sampling estimators for two normal densities spaced D standard deviations apart

  Method                                              √(n E(λ̂ − λ)²)
  (I)   Importance sampling                           [exp(D²) − 1]^{1/2}
  (II)  Bridge sampling with geometric bridge         2 [exp(D²/4) − 1]^{1/2}
  (III) Bridge sampling with (optimal)
        harmonic bridge                               2 [ D exp(D²/8)/(√(2π) β(D)) − 1 ]^{1/2}
  (IV)  Path sampling with geometric
        (and scaling) path                            D² (1/12 + 1/D²)^{1/2}
  (V)   Path sampling with optimal path
        in μ-space                                    D
  (VI)  Path sampling with optimal path
        in (μ, σ)-space                               √12 log( D/√12 + (1 + D²/12)^{1/2} )

  Note: In (III), β(D) = (1/π) ∫_0^∞ exp[−x²/(2D²)]/cosh(x/2) dx, with the property β(D) ≤ 1 and lim_{D→∞} β(D) = 1.

In Table 1, estimator (I) is the importance sampling estimator using (23); estimators (II) and (III) are bridge sampling using (35) with q_{1/2} = (q0 q1)^{1/2} and q_{1/2} = (p0^{-1} + p1^{-1})^{-1}, respectively, and the corresponding variance computations are from Meng and Wong (1996).
Estimator (IV) is the path sampling estimator using the geometric path (5), which in this case leads to the identical estimator as that from the scaling path (44). Estimator (V) is the optimal univariate path sampling estimator based on (7), by considering σ to be fixed at 1 and letting μ vary in [0, D]. It is easy to verify that, in the current example, the (generalized) local Jeffreys prior density defined by (40) is uniform, so the optimal path function is given by μ(t) = Dt. Estimator (VI) is the optimal multivariate path sampling estimator based on (10) when both μ and σ are allowed to vary freely in their two-dimensional space; the form of the optimal path will be discussed shortly.

Figure 1 plots the six expressions in Table 1 as functions of D ∈ [0, 10], where the dotted line plots 4α_H (see (47)) with H²(p0, p1) = 2(1 − exp(−D²/8)), which is the lower bound from (51). As we discussed in Section 4.3, we doubt this bound can be achieved in reality, though we can easily improve estimator (VI) by using the optimal normalizing constants, given by (46), for the densities in the path. This entails using σ^{-1} exp{−(ω − μ)²/(2σ²)} in place of (52), and the resulting Monte Carlo error is obtained by replacing the values "12" with "8" in row (VI) of Table 1; the result is lower for all values of D but still is not optimal in distribution space.

Fig. 1. Relative Monte Carlo errors for various simulation-based estimates of log(z1/z0), comparing N(0, 1) to N(D, 1) densities, using (I) importance sampling, (II) bridge sampling with geometric bridge, (III) bridge sampling with optimal bridge, (IV) path sampling with geometric (or scaling) path, (V) optimal path sampling in μ-space and (VI) optimal path sampling in (μ, σ)-space. The dotted line is the lower bound given by (51).

The optimal path in (μ, σ)-space, denoted by θ(t) = (μ(t), σ(t)) (with t ∈ [0, 1]), turns out to be quite interesting and informative. Figure 2a, b plots μ(t) and σ(t) with D = 5 as the boldfaced segments; the general expressions for μ(t) and σ(t) and their derivations are given in the Appendix. This amounts to a half-ellipsoid curve in (μ, σ)-space:

(53)   (μ − D/2)² + 3σ² = 3 + D²/4,   σ > 0,

as displayed, in boldface, in Figure 2c with D = 5. The optimal path thus increases the variances of the normal densities in the middle, which makes sense, as we want the middle densities to have large overlaps with the two endpoint densities. However, the variances of the intermediate densities are not allowed to be arbitrarily large, because that would introduce too much sampling variability at each given value of σ. The optimal path is a result of such a trade-off. This can be seen more clearly from Figure 2d, which displays the normalized normal densities corresponding to q(ω|μ(t), σ(t)) for t = 0, 0.1, 0.2, ..., 1, with the two end densities, N(0, 1) and N(5, 1), shown as boldfaced lines.

5. PRACTICAL IMPLEMENTATION AND EXAMPLES

5.1 Issues in Implementing Path Sampling

To implement path sampling, we need draws (ω, θ) from a joint distribution that can be written as p(ω, θ) = q(ω|θ)/c(θ). The marginal distribution of θ in these draws is

       p(θ) = ∫ p(ω, θ) μ(dω) = z(θ)/c(θ).

We have the freedom to specify p(θ) or c(θ), but not both (or else z(θ) would already be known). We emphasize that, depending on the nature of c(θ), p(θ) can be completely unrelated to z(θ), which we want to compute, or can be proportional or even identical to z(θ).

We distinguish between two kinds of implementations of path sampling. In the first kind, which is the subject of most of the preceding discussion, we specify p(θ), with the particular form chosen typically for reasons of convenience and perceived optimality. The simplest method of obtaining simulation draws is direct sampling, in which we draw (ω, θ) by first drawing θ from the known p(θ) (or choosing θ systematically over a grid, as in Ogata, 1989, and in our example of Section 5.2), then drawing ω from q(ω|θ). The step of sampling θ is easy when we have the freedom to specify p(θ). Drawing ω given θ is usually more difficult, however: in many problems for which we would like to apply path sampling, there is no easy way to directly sample ω. Instead, the preferred method is some form of iterative simulation, such as the Metropolis algorithm. To implement path sampling using iterative simulation, we can proceed directly by using nested loops: for each simulated (or chosen) θ, run an iterative simulation algorithm (until approximate convergence).
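In the running normal example, a nested-loop scheme over a grid of θ might be sketched as follows (our illustration; direct draws stand in for the iterative sampler that a real problem would require, and the trapezoidal rule accumulates estimates of log z(t)/z(0) along the grid, in the spirit of Ogata's grid-based implementation):

```python
import numpy as np

rng = np.random.default_rng(2)
D, G, m = 3.0, 21, 200              # separation, grid size, draws per grid point

# Geometric path: p(w|t) = N(t*D, 1) and U(w, t) = D*(w - D/2);
# the truth here is log z(t)/z(0) = -t*(1 - t)*D**2/2.
ts = np.linspace(0.0, 1.0, G)
Ubar = np.empty(G)
for j, t in enumerate(ts):
    w = rng.normal(t * D, 1.0, m)        # inner loop: draws from p(w|t)
    Ubar[j] = np.mean(D * (w - D / 2))   # average of U at this grid point

# Trapezoidal rule over the grid: cumulative estimates of log z(t)/z(0).
steps = (Ubar[1:] + Ubar[:-1]) / 2 * np.diff(ts)
log_z = np.concatenate([[0.0], np.cumsum(steps)])
```

Here log_z[-1] estimates log(z1/z0) = 0, and the value at t = 0.5 estimates −D²/8 = −1.125; the whole curve log z(t)/z(0) comes out of one pass over the grid.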
Fig. 2. Optimal path from N(0, 1) to N(5, 1) in (μ, σ)-space: (a) parameterization of μ(t); (b) parameterization of σ(t); (c) optimal path; (d) normalized densities along the optimal path.

The result is a large number of draws ω for each θ, but the estimator (9) is still applicable. The nested simulation approach may be attractive in a parallel computing environment. The nested-loop method has a flavor of the more elaborate Metropolis-coupled Markov chain method of Geyer (1991).

In the second kind of implementation, we specify c(θ) (up to a multiplicative constant). This means we can write the joint density of (ω, θ), up to a constant, and so can draw from this density, combining the simulations of ω and θ in a single loop of iterative simulation. The most natural approach here is to alternately update ω and θ in a Gibbs or Metropolis-type algorithm. This is essentially a special case of simulated tempering (see Marinari and Parisi, 1992, and Geyer and Thompson, 1995), with θ being viewed as the temperature variable; it is also similar to the multicanonical algorithms proposed in the statistical physics literature (e.g., Berg and Neuhaus, 1991; Berg and Celik, 1992); see Neal (1993, page 94) and Geyer and Thompson (1995) for more discussion.

Once the draws are made, we can apply (15) (in conjunction with (19) when θ is multivariate) to estimate z(θ) as a function of θ, as discussed in Section 2.3. (Interestingly, we can still directly estimate relative values of z(θ) using (15), without having to make any adjustment for c(θ).) A special case of this kind of implementation is when drawing from a joint density given an unnormalized density q(ω, θ), which means we have set c(θ) ≡ constant (and thus p(θ) ∝ z(θ)); we illustrate this implementation with an example in Section 5.3.
When using the single-loop method, a new problem can arise when z(θ) is not to be used as a marginal density. The difficulty is that z(θ) can vary over several orders of magnitude in the region of θ of interest; this is not much of a problem when z(θ) is used as an (unnormalized) marginal density, if only regions of relatively high marginal mass are of interest. Simply sampling (ω, θ) from the joint distribution proportional to q(ω|θ) (i.e., setting c(θ) ≡ constant) would leave very few draws of θ in regions of low marginal density and thus very little ability to compute z(θ) in those regions using path sampling or any other method. (For example, to estimate z(b)/z(a), both (9) and (15) require draws of θ in the interval [a, b].) Fortunately, in single-loop sampling we have the ability to choose c(θ) to reduce the variance of our estimators. Since we cannot, in general, easily compute the optimal p(θ) (the generalized Jeffreys prior density), we aim for the simpler goal of a uniform p(θ) [i.e., the goal is c(θ) proportional to z(θ)], which at least avoids the problem that some regions have far fewer draws than others. Several noniterative approaches are available for creating an approximation to z(θ), including Laplace's method (e.g., DiCiccio et al., 1997), the method of coding for conditional distributions (Besag, 1974) and various numerical methods (e.g., Evans and Swartz, 1995). The approximation here is used as c(θ) for making draws, not as our final estimate of z(θ), so the inaccuracy in the approximation does not bias our estimates from path sampling.

In cases where a reasonable approximation to z(θ) is not immediately available, we can update the function c(θ) iteratively. We start with some initial guess, say, c(θ) ≡ 1. We then run the simulation of (ω, θ) using the Metropolis–Hastings algorithm. Occasionally, say, every few hundred iterations, we stop and estimate the function z(θ), either using path sampling (15) or some other density estimate of p(θ) = z(θ)/c(θ) based directly on the simulated values of θ. In either case, we update c(θ) to equal the current estimate of z(θ) and then continue the iteration. In the limit, under suitable mixing conditions, c → z, and so p(θ) converges to uniformity. This iterative scheme is similar to the iterative method proposed in Geyer and Thompson (1995) for adjusting a pseudoprior needed for implementing simulated tempering. We emphasize that, because we are using the estimator (15), which does not require p(θ) to be computed, the convergence of c(θ) is not actually required for the path-sampling estimators to be valid. This is reminiscent of the iterative sequence (33) for the optimal single-bridge estimator, where each iterate provides a valid estimate of r, and the iteration is needed only for the purpose of optimality.

One nice feature of the above iterative procedure is that the empirical distribution of the simulated θ values converges to a known distribution: the uniform on [0, 1]. We can thus monitor the convergence of the simulations by comparing to this known distribution, which is far easier than the usual task of monitoring convergence to an unknown target distribution. In general, one can construct checks on the convergence of an iterative simulation as a by-product of path sampling. For example, when c(θ) = 1, we can compare the estimate of F(a) in (17) to the empirical distribution of the simulated values of θ (see Section 5.3); a discordance between these two distributions (as measured by some criterion of practical concern, such as a comparison of the 95% central posterior intervals) indicates a lack of convergence in the simulation (or an error in the implementation of the sampler or the path sampling estimate). Similar procedures are available with arbitrarily chosen (known) c(θ). We can use such procedures to check the convergence of any parameter in the model (i.e., any parameter can take the role of θ, with the others taking the role of ω in the analysis) by merely changing the derivative in U and recomputing the path sampling estimate.

5.2 Example 1: Censored Data in Spatial Statistics

The problem of high-dimensional integration commonly arises with missing or censored data. It is often the case that, given the uncensored data ω, the likelihood function L(θ|ω) = p(ω|θ) is easy to compute. On the other hand, the likelihood based on the censored data y cannot be calculated directly. To fix ideas, we assume that ω = (ω1, ..., ωd) is a vector of real numbers, and the censored data y = (y1, ..., yd) are given by yj = max(ωj, 0) for j = 1, ..., d. The likelihood based on the censored data is then

(54)   p(y|θ) = ∫_{-∞}^{0} ··· ∫_{-∞}^{0} p(ω|θ) Π_{j: yj=0} dωj,

integrating over all the censored components. Treating p(y|θ) as the normalizing constant of p(ω|y, θ), with the complete-data likelihood p(ω|θ) as the unnormalized density, we are in the setting of (1).

For a particular example, we consider a stationary model in spatial statistics described by Stein (1992). In this example, each yj is observed at a location xj in two-dimensional space.
178 A. GELMAN AND X.-L. MENG

location xj in two-dimensional space. The vector of


uncensored data ω is modeled by a joint normal dis-
tribution, in which each component ωj has mean m
and variance c, and the correlation between any two
components, ωi and ωj , is exp−Žxi − xj Ž‘. Figure 1
of Stein (1992) presents a set of simulated data on
a 6 × 6 grid evenly spread over the square ’0; 1“2 ,
in which 17 of the 36 components yj equal 0. The
goal is to compute the likelihood of the parameter
vector θ = m; log c‘. Were it not for the censored
data, the likelihood would be trivial to compute from
the joint normal density; however, because of the
spatial dependence among the 36 observations, the
17-dimensional integral (54) cannot be calculated
analytically.
Stein (1992) used importance sampling to compute the relative values of the marginal likelihood p(y|θ) on a 21 × 21 grid in the space of θ. At each point θ on the grid, Stein used a decomposition of a truncated multivariate normal distribution to construct an approximation h_θ(ω) to p(ω|y, θ), with known normalizing constant. He sampled 1000 draws of ω at each point of θ and estimated p(y|θ) by importance sampling. A crucial step that makes Stein's method work is that he used the same 1000 pseudorandom numbers for all draws at each θ, which was feasible because Stein sampled from the approximate densities h using an inverse-cdf approach. This introduces desirable positive dependence between the importance sampling estimates at the different points on the grid of θ, which, as Stein noted, greatly reduces the Monte Carlo error of the resulting ratio estimator.

Fig. 3. Estimated negative loglikelihood for spatial statistics example (replication of Figure 3 of Stein, 1992) using path sampling, with a 21 × 21 grid and only 100 draws of ω at each point. The two plots show the estimates based on two different paths.

Here we replicate Stein's results using path sampling with nested loops, which is computationally straightforward thanks to the simplicity of the Gibbs sampler in this case. For each value of θ = (m, log c) in the 21 × 21 grid, we use the Gibbs sampler to simulate from the conditional distribution of the uncensored values, p(ω|y, θ). We monitored the convergence of parallel runs of the Gibbs sampler using the method of Gelman and Rubin (1992) and found that the simulations had reached approximate convergence after 100 iterations. We discard the first half of each simulation (i.e., we use 50 draws at each value of θ). We then use (15) in conjunction with (19) to estimate the function log[p(y|θ)/p(y|θ₀)] on the 21 × 21 grid of (m, log c), where θ₀ is the maximum likelihood estimate. Figure 3a gives the contour plot of the estimated negative log-likelihood ratio when we integrate log c first in applying (19). Figure 3b shows the corresponding plot when we integrate m first. The effects of the two different paths are quite visible in this case, as Figure 3b gives a much smoother answer and is almost identical to the plot given in Figure 3 of Stein (1992). The path sampling method used here does not involve constructing approximate densities and does not need to use the same pseudorandom numbers at different points of θ.

5.3 Example 2: Heteroscedastic Regression Models for Election Forecasting

5.3.1 Statistical model and substantive background. Consider the heteroscedastic regression model,

(55) y | β, σ, θ ∼ N(Xβ, σ²r^(−θ)),

where r is a vector of weights, and θ ∈ [0, 1] is a model parameter. Boscardin and Gelman (1996) use this model for forecasting U.S. Presidential elections, with units i representing states and election years, yi the Democratic Party's share of the vote in the state in that year, X a matrix of predictors and ri proportional to the number of voters in the state in that year. The predictors X used in the regression are chosen based on existing regression models used in political science.

The values θ = 0 and 1 represent two extreme models that have been considered, implicitly or explicitly, in political science. Setting θ = 0 corresponds to equal residual variances for the 50 states, which is generally assumed in forecasting and regression models of elections in political science research, perhaps for convenience as much as any other reason. Setting θ = 1, so that variance is inversely proportional to the number of voters, has theoretical appeal as a generalization of the binomial model, which is implicit in many game-theoretic models of voting. One of the major trends in recent research in political science is to unify empirical and theoretical analyses; here, the value of θ is an issue that needs to be resolved, for reasons both applied (obtaining efficient forecasts and regression estimates) and theoretical (understanding the variability of voters in the aggregate). See Gelman, King and Boscardin (1998) for a discussion of these issues, along with many references from political science and economics on these models.

At this point, statistical practice suggests several different ways of using the data to assess the information in the data about the parameter θ. Classical approaches include (a) using significance tests to accept or reject the null hypothesis θ = 0 against the alternative, θ = 1 (or vice-versa); and (b) obtaining an approximately unbiased point estimate of θ, considering it as a nonlinear estimation problem with nuisance parameters. Bayesian approaches include (c) choosing between θ = 0 and θ = 1, using the Bayes factor to assess the relative evidence in favor of the two possibilities; and (d) including θ as a continuous parameter in the model (taking the range [0, 1]) and computing its posterior distribution.

Of these four approaches, (a) and (c) involve nearly identical computations, since both are based on the distribution of the likelihood ratio under the two candidate models. In the context of simulation-based inference, approach (c) would be more natural, and it would involve the computation of a ratio of marginal densities, which, as discussed in Section 1, is equivalent to a ratio of normalizing constants. This computation could be done using any of the methods described in this paper; to the extent that the likelihoods under the two models (θ = 0 and θ = 1) are far apart (which will generally be the case as the number of data points increases), it would be advisable to consider path sampling. A natural choice of path is (55) with θ varying from 0 to 1.

However, in this application, we prefer to consider θ to be a continuous parameter from the start, because we wish to consider the possibilities of models that fall between the two extremes θ = 0 and θ = 1: that is, perhaps there is some truth in both of the existing approaches. (A theoretical argument for allowing θ to vary is that its appropriate value might very well depend on the set of explanatory variables X used in the model, so that, e.g., the game-theoretic descriptions might be more or less accurate depending on what information is assumed to be known.) Now that we are allowing θ to be uncertain on a continuous range, the classical estimation approach has some serious problems, most notably that the likelihood can be extremely flat (parameters of the variance model, such as θ in this example, can often be poorly identified in data sets of moderate size) so that no point estimate is an accurate summary. Along with this is the possibility that the point estimate could be outside [0, 1] just due to high variability or, if the estimate is constrained, that it could be on the boundary. For example, it is possible to have a point estimate at θ = 0 even though θ = 1 is also well-supported by the data. These problems get more serious in the presence of nuisance parameters (e.g., when β contains random effects components). For all these reasons, we prefer a Bayesian approach of summarizing the information about θ by a posterior distribution (or, to use non-Bayesian terminology, a marginal likelihood, since we shall use the uniform prior distribution, p(θ) = 1). As discussed in Section 1, determining the marginal posterior density of θ is mathematically equivalent to computing a normalizing constant parameterized by θ ∈ [0, 1]. Because we are constraining θ ∈ [0, 1], it is also important to examine the behavior of the likelihood near the boundary to see if there is evidence that θ < 0 or θ > 1.

We consider two methods of computing the posterior density, or marginal likelihood, or normalizing constant, as a function of θ: (i) the usual approach of Bayesian simulation, which is to consider θ as a parameter in the model and then summarize its posterior distribution by the empirical distribution of its simulation draws, as was done by Boscardin and Gelman (1996); and (ii) path sampling. In fact, we shall use the simulation draws from (i) to implement path sampling and then compare the estimated marginal posterior distribution for θ under path sampling to the direct estimates from the simulation draws.

5.3.2 Path sampling for the nonhierarchical model. To check the performance of path sampling, we first consider the simple nonhierarchical model, which assigns a uniform prior distribution to (β, log σ, θ). As discussed in Boscardin and Gelman
Fig. 4. Estimates of the density function and cdf of the heteroscedasticity parameter θ for the election forecasting example with non-
hierarchical model: (a) dashed line is exact density, dotted line is estimated density from path sampling and histogram is from 1000 iid
simulation draws; (b) dashed line is exact cdf, dotted line (almost exactly on top of dashed line) is estimated cdf from path sampling and
solid line is empirical cdf from simulation draws.
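The nested-loop computation behind estimates like these, draws of ω at each point of a grid in θ with the averages of U then integrated numerically over the grid, can be sketched on a toy family whose log ratio of normalizing constants is known exactly; the family, function names and grid sizes below are our own illustrative choices, not the paper's code:

```python
import math
import random

def u_func(omega, t, s0, s1):
    """U(omega, t) = d/dt log q_t(omega) for the unnormalized family
    q_t(omega) = exp(-omega^2 / (2 sigma_t^2)), sigma_t = s0^(1-t) s1^t."""
    sigma_t = s0 ** (1 - t) * s1 ** t
    return (omega ** 2 / sigma_t ** 2) * math.log(s1 / s0)

def path_sampling_log_ratio(s0, s1, n_grid=21, n_draws=200, seed=2):
    """Nested-loop path sampling estimate of log(z(1)/z(0)): draw omega
    from p_t at each grid point t, average U there, then integrate over
    t by the trapezoidal rule.  True value here is log(s1/s0)."""
    rng = random.Random(seed)
    ts = [i / (n_grid - 1) for i in range(n_grid)]
    means = []
    for t in ts:
        sigma_t = s0 ** (1 - t) * s1 ** t
        draws = [rng.gauss(0.0, sigma_t) for _ in range(n_draws)]
        means.append(sum(u_func(w, t, s0, s1) for w in draws) / n_draws)
    h = 1.0 / (n_grid - 1)
    return h * (0.5 * means[0] + sum(means[1:-1]) + 0.5 * means[-1])

if __name__ == "__main__":
    est = path_sampling_log_ratio(1.0, 3.0)
    print(est, math.log(3.0))  # estimate should be close to log 3
```

In the election example the draws at each grid point come from the posterior simulations rather than from direct sampling, but the averaging-then-integrating structure is the same.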
(1996), the (unnormalized) marginal posterior density for θ in this model can be written analytically, and posterior draws for the vector of parameters can be obtained directly by first drawing θ from a discrete approximation to its numerically calculated marginal posterior distribution, then drawing (β, σ) from their normal-inverse-χ² posterior distribution conditional on the drawn θ. For the election example, this was done using 1000 independent draws of θ and 2 draws of (β, σ) for each draw of θ. In our general notation, ω = (β, σ), which is 20-dimensional because β has 19 components. To compute the path sampling estimate of p(θ|y), we must first determine the function U(ω, θ); the differentiation is easy and yields

(56) U(ω, θ) = −(1/(2σ²)) Σ_i (log r_i) r_i^θ (y − Xβ)_i².

We use the simulation draws, which have the nested-loop form, to compute the path sampling estimate of p(θ|y). Figure 4a shows the results, comparing the exact density (smooth line), path sampling estimate (slightly jagged line) and the histogram from the 1000 simulation draws. The path sampling estimate is (of course) worse than the exact density but compares much more favorably to the histogram estimate. Another comparison is afforded by Figure 4b, which shows the corresponding cdf's. Here, the jagged line is the empirical cdf of the 1000 draws, and the smooth line represents both the path sampling estimate using (17) and the exact cdf—the differences are barely visible! Obviously, in this case, one can simply use the exact formula, but it is informative and encouraging to be able to confirm that path sampling is capable of producing such an accurate approximation to a 20-dimensional integration indexed by an entire curve with only 2000 draws in total.

5.3.3 Path sampling for the hierarchical model. We now move to a more realistic, and thus more complicated, model fitted by Boscardin and Gelman (1996) in which the marginal density for θ cannot be computed analytically. In this model, 50 additional components of β are added (and thus we are now dealing with a 70-dimensional integration), along with a hierarchical regression model and additional variance components. For this expanded model, the marginal posterior density of θ cannot be computed exactly, and posterior simulations are obtained using the Gibbs sampler and the Metropolis algorithm, alternating between Metropolis jumps for θ and Gibbs draws for the remaining parameters. Approximately overdispersed starting points for the algorithm are obtained by a t₄ approximation to the posterior distribution. The Gibbs sampler draws are performed using linear regression operations and simulations of normal and χ² random variables, and the Metropolis steps use a univariate normal jumping kernel with a scale set to 2.38 times the estimated standard deviation of θ from the initial approximation (motivated by Gelman, Roberts and Gilks, 1996).

For the election example, 10 sequences, each of length 500, were sufficient for approximate convergence of the simulations as monitored using the methods of Gelman and Rubin (1992). To compute the path sampling estimate of p(θ|y), we again need the function U, which actually has the same form (56) as before, since the added hierarchical part of the model does not involve the parameter θ. In this case, the simulation draws have the single-loop
Fig. 5. Estimates of the density function and cdf of the heteroscedasticity parameter θ for the election forecasting example with hier-
archical model: (a) dotted line is estimated density from path sampling and histogram is from 1250 simulation draws from Metropolis
algorithm; (b) smooth line is estimated cdf from path sampling, and jagged line is empirical cdf from simulation draws.
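The Metropolis step described above, a univariate normal jumping kernel with scale 2.38 times an initial estimate of the posterior standard deviation, can be sketched as follows (a generic random-walk Metropolis sampler of our own, demonstrated on a standard normal target rather than on the election model):

```python
import math
import random

def metropolis_theta(log_post, theta0, sd_approx, n_iter, seed=3):
    """Random-walk Metropolis for a scalar parameter, with the jumping
    kernel scaled to 2.38 times an initial estimate of the posterior
    standard deviation (Gelman, Roberts and Gilks, 1996)."""
    rng = random.Random(seed)
    scale = 2.38 * sd_approx
    theta, lp = theta0, log_post(theta0)
    chain = []
    for _ in range(n_iter):
        prop = theta + rng.gauss(0.0, scale)
        lp_prop = log_post(prop)
        if math.log(rng.random()) < lp_prop - lp:  # accept/reject step
            theta, lp = prop, lp_prop
        chain.append(theta)
    return chain

if __name__ == "__main__":
    # Toy target: a standard normal log posterior (our own assumption).
    chain = metropolis_theta(lambda th: -0.5 * th * th, 0.0, 1.0, 20_000)
    mean = sum(chain) / len(chain)
    var = sum((x - mean) ** 2 for x in chain) / len(chain)
    print(round(mean, 2), round(var, 2))  # should be near 0 and 1
```

In the hierarchical model this scalar update for θ alternates with Gibbs draws for the remaining parameters, as described in the text.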
form, meaning that we can only learn about p(θ|y) for the range of θ's that were obtained in the simulation. Figure 5a shows the estimated marginal posterior density from path sampling and the histogram of simulation draws, and Figure 5b shows the corresponding estimated cdf's. The path sampling estimates are far smoother. Based on the evidence given in Figure 4a, b we can be quite confident that the path sampling estimates are very close to the truth, especially for the cdf. For both these models, the smoothness of the path sampling estimates is an intrinsic property of the estimation procedure—despite their appearance, no smoothing was used in creating the estimates.

6. SUMMARY AND FURTHER RESEARCH

This paper attempts to bring to the attention of statistical researchers some useful methods for computing normalizing constants for complex, high-dimensional probability models, or more generally computing high-dimensional integrations with complicated integrands. Both bridge and path sampling are rooted in popular methods in theoretical physics, namely, the acceptance ratio method and thermodynamic integration. Due to extremely challenging and important computational problems in physics and chemistry, some of which are far from being resolved (e.g., minimum energy configurations of protein molecules), there is a huge literature in theoretical and computational physics and chemistry on creative methods for high-dimensional integration and optimization. For an excellent, though not necessarily "statistician friendly," recent review of a good number of these powerful methods, see Ceperley's (1995) long review article on path integrals in the theory of condensed helium. In particular, the methods discussed in Section V of Ceperley (1995) are potentially very useful for implementing path sampling in general—indeed, the phrase "path sampling" is used there to describe sampling methods for simulating path integrals for the so-called thermal density matrix.

Given its great success in theoretical physics for dealing with complex integrations, as well as Ogata's (1990, 1994) successful applications to Bayesian computations, we believe that path sampling can be generally useful in statistical computations for dealing with complex integrations. The method is not only capable of producing remarkably accurate results but is also quite straightforward, in the class of methods that are useful for high-dimensional complex integrations. As Frenkel (1986, page 169) states in his review chapter on free-energy estimation: "Thermodynamic integration (TI) is undoubtedly the method most widely used to compute absolute free energies and free-energy differences. The reason is that, although it may be more time consuming than some of the sophisticated methods described above, it is straightforward, accurate and does not run into special problems at high densities or for large system sizes." We hope our simple (though not trivial) empirical illustrations, as well as the theoretical example, have helped to convey these messages to statistical researchers; the quote also makes it clear that thermodynamic integration (and hence path sampling) does not dominate other methods. Furthermore, we hope our derivation of how path sampling relates to importance sampling via bridge sampling will help general statistical readers to understand the method intuitively and thus be able to apply it with more confidence. In a statistical context, path sampling also gives an alternative
method of estimating marginal distributions and offers an effective check on the convergence of Monte Carlo simulations.

Our general formulation and investigation also reveal that further research is needed in order to explore fully the potential of path sampling. For example, the construction of efficient yet simple general paths is of great importance for routine application of path sampling. Our theoretical results (e.g., the general suboptimality of the geometric path and, for the normal example, the optimal path that curves through the space of (µ, σ)) show that the best paths are not always obvious. The question of achievable optimal error is not only of theoretical interest but also of practical relevance if we can construct an easily implementable iterative procedure, just as with bridge sampling, to compute the optimal estimate. Such questions are inherently statistical, and thus we statisticians should be able to contribute substantially to the study of efficient implementation of path sampling, especially in view of the theoretical relations between optimal paths and the Jeffreys prior and the Rao and Hellinger distances. With path sampling, as with Markov chain Monte Carlo methods—another statistical tool that originated in computational physics—there is the potential not only to benefit from a powerful method but also to make it more efficient and applicable, thus broadening the range of statistical models that we can use routinely.

APPENDIX

A.1 An Elementary Proof of Lemma 2

Proof. Let g(θ) be a differentiable positive function on [0, 1] such that g(0) = g(1) = 1, and let h(ω|θ) = p(ω|θ)g(θ). Then, by Fubini's theorem, we can verify that

(57) ∫₀¹ I(θ) dθ = 4 ∫ [∫₀¹ ((∂√h(ω|θ)/∂θ) / √g(θ))² dθ] µ(dω) − ∫₀¹ (d log g(θ)/dθ)² dθ.

By the Cauchy–Schwarz inequality, the first term on the right-hand side of (57) is bounded below by 4H²(p₀, p₁)/∫₀¹ g(θ) dθ, with the bound achieved if and only if

(58) (∂√h(ω|θ)/∂θ) / √g(θ) = b(ω)√g(θ) a.s. (µ × Lebesgue on [0, 1]),

where b(ω) is a positive function to be determined. Solving (58) for p(ω|θ) = h(ω|θ)/g(θ) with the given boundary condition yields

(59) p(ω|θ) = {√p₀(ω) (1 − G(θ)/G(1)) + √p₁(ω) (G(θ)/G(1))}² / g(θ),

where G(θ) = ∫₀^θ g(ξ) dξ. The freedom in choosing g(θ) allows us to ensure that p(ω|θ) of (59) is a proper density for any θ ∈ [0, 1], a requirement that leads to a differential equation for G(θ):

(60) G′(θ) = 1 − H²(p₀, p₁)/4 + H²(p₀, p₁) (G(θ)/G(1) − 1/2)².

Solving (60) for g(θ) = G′(θ) with the boundary condition g(0) = g(1) = 1 yields

g(θ) = cos²(α_H) / cos²((2θ − 1)α_H).

The rest of the proof follows by simple algebraic manipulation.

A.2 Derivation of the Optimal Path in (µ, σ)-Space for the Normal Example

Here we derive the optimal path in the normal family example of Section 4.4 in the general case of any endpoints θ₀ = (µ₀, σ₀) and θ₁ = (µ₁, σ₁); without loss of any generality, we assume µ₁ ≥ µ₀. We start by noting that, with the normal family (52), the variance formula (42) becomes

(61) var(λ̂) = (1/n)[∫₀¹ (µ̇²(t) + 3σ̇²(t))/σ²(t) dt − λ²].

The corresponding Euler–Lagrange equations (43) for the optimal path can be simplified into

(62) µ̇(t) − c₀σ²(t) = 0,
(63) 3σ̈(t)σ(t) − 3σ̇²(t) + µ̇²(t) = 0,

where c₀ is a constant to be determined by the boundary conditions: µ(t) = µ_t, σ(t) = σ_t, t = 0, 1. This differential equation can be solved using a differential geometric argument developed in Atkinson and Mitchell (1981) or directly as follows.

We first substitute (62) into (63) and obtain

(64) 3σ̈(t)σ(t) − 3σ̇²(t) + c₀²σ⁴(t) = 0.

We then let v = σ̇(t) and express v as a function v(σ) by inverting σ = σ(t), which yields

(65) σ̈(t) = dv/dt = (dv/dσ)(dσ/dt) = v̇(σ)v(σ).
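The sech-shaped solution obtained below in (68) can be checked against (64) by finite differences; the constants c₀, c₁, c₂ in this sketch are arbitrary illustrative values of our own choosing:

```python
import math

def ode64_residual(sigma, t, c0, h=1e-4):
    """Residual of (64): 3*sigma''*sigma - 3*(sigma')^2 + c0^2*sigma^4,
    with derivatives approximated by central finite differences."""
    s = sigma(t)
    sp = (sigma(t + h) - sigma(t - h)) / (2 * h)
    spp = (sigma(t + h) - 2 * s + sigma(t - h)) / (h * h)
    return 3 * spp * s - 3 * sp ** 2 + c0 ** 2 * s ** 4

if __name__ == "__main__":
    c0, c1, c2 = 0.7, 1.3, 0.4  # arbitrary constants for illustration
    sigma = lambda t: (math.sqrt(3 * c1) / c0) / math.cosh(math.sqrt(c1) * (t - c2))
    # Residual should be numerically zero along the curve:
    print(max(abs(ode64_residual(sigma, t, c0)) for t in (0.0, 0.25, 0.5, 0.75, 1.0)))
```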
Combining (65) with (64) gives

(66) (d/dσ)(v²(σ)/σ²) + (2/3)c₀²σ = 0,

which implies

(67) σ̇(t) ≡ v(σ) = √(c₁σ² − (c₀²/3)σ⁴),

where c₁ is a constant to be determined.

When c₀ ≠ 0, (67) leads to

t = ∫ dσ/√(c₁σ² − (c₀²/3)σ⁴) = (1/√c₁) cosh⁻¹(√(3c₁)/(c₀σ(t))) + c₂,

where c₂ is another constant to be determined. It follows that

(68) σ(t) = (√(3c₁)/c₀) sech[√c₁(t − c₂)].

Finally, combining (68) with (62) yields

(69) µ(t) = c₀ ∫ σ²(t) dt = (3√c₁/c₀) tanh[√c₁(t − c₂)] + c₃,

where the constant c₃ is determined, along with c₀, c₁ and c₂, by the boundary conditions. The solution is then given by

(70) µ(t) = R tanh[φ₀(1 − t) + φ₁t] + C,
     σ(t) = (R/√3) sech[φ₀(1 − t) + φ₁t],

where

(71) R² = ((µ₀ − µ₁)/2)² + (3/2)(σ₁² + σ₀²) + (9/4)((σ₁² − σ₀²)/(µ₁ − µ₀))²,
     C = (µ₀ + µ₁)/2 + (3/2)(σ₁² − σ₀²)/(µ₁ − µ₀),
     φ_t = tanh⁻¹((µ_t − C)/R) = (1/2) log((R + µ_t − C)/(R − µ_t + C)) for t = 0, 1.

This implies a path in (µ, σ)-space of the form

(72) (µ − C)² + 3σ² = R²,

which reduces to (53) when σ₀ = σ₁ = 1, µ₀ = 0 and µ₁ = D. The case of c₀ = 0 corresponds to the special case µ₀ = µ₁, in which case the solution is µ(t) ≡ µ₀, σ(t) = σ₁ᵗσ₀^(1−t) for 0 ≤ t ≤ 1.

The solution (70) also induces the optimal prior densities on the optimal path; for µ, it is

(73) p(µ) = (1/(R(φ₁ − φ₀))) · 1/(1 − [(µ − C)/R]²), µ₀ ≤ µ ≤ µ₁,

and for σ,

(74) p(σ) = (2/(φ₁ − φ₀)) · 1/(σ√(1 − 3σ²/R²)), min(σ₀, σ₁) ≤ σ ≤ σ_max,

where

σ_max = R/√3 if µ₀ ≤ C ≤ µ₁; max(σ₀, σ₁), otherwise.

The density of σ has an asymptote at σ_max because σ̇(t) = 0 at σ = σ_max. (In our example, σ_max² = 1 + D²/12.) The optimal error associated with the optimal path can be easily obtained, using the fact that, on the path, the integrand inside (61) is free of t due to the "second-moment-stabilizing" transformation (see Section 4.1). The optimal error is given, in general, by

var_opt = (1/n)[3(φ₁ − φ₀)² − (log(σ₁/σ₀))²],

where φ_t, t = 0, 1, are given by (71). In our example, it is simplified to

var_opt = (12/n)[log(D/√12 + √(1 + D²/12))]²,

because µ₀ = 0, µ₁ = D and σ₀ = σ₁ = 1.

ACKNOWLEDGMENTS

We thank D. Ceperley, P. McCullagh, R. Neal, M. Stein and W. Wong for helpful conversations, and Y. Ogata for sending (pre)reprints. Gelman's research was supported in part by NSF Grants DMS-94-04305, SBR-97-08424 and Young Investigator Award DMS-94-57824, and by fellowship F/96/9 of Katholieke Universiteit Leuven. Meng's research was supported in part by NSF Grants DMS-95-05043 and DMS-96-26691, and NSA Grant MDA 904-96-1-0007.

REFERENCES

Atkinson, C. and Mitchell, A. F. S. (1981). Rao's distance measure. Sankhyā Ser. A 43 345–365.
Bennett, C. H. (1976). Efficient estimation of free energy differences from Monte Carlo data. J. Comput. Phys. 22 245–268.
Berg, B. and Celik, T. (1992). New approach to spin-glass simulations. Phys. Rev. Lett. 69 2292–2295.
Berg, B. and Neuhaus, T. (1991). Multicanonical algorithms for the first order phase transitions. Phys. Lett. B 267 249–253.
Besag, J. (1974). Spatial interaction and the statistical analysis of lattice systems (with discussion). J. Roy. Statist. Soc. Ser. B 36 192–236.
Binder, K. (1986). Introduction: theory and technical aspects of Monte Carlo simulations. In Monte Carlo Methods in Statistical Physics (K. Binder, ed.). Topics in Current Physics 7. Springer, Berlin.
Boscardin, W. J. and Gelman, A. (1996). Bayesian regression with parametric models for heteroscedasticity. Advances in Econometrics 11A 87–109.
Burbea, J. (1989). Rao distance. In Encyclopedia of Statistical Science, supplement volume, 128–130. Wiley, New York.
Ceperley, D. M. (1995). Path integrals in the theory of condensed helium. Rev. Modern Phys. 67 279–355.
Chen, M.-H. and Shao, Q. M. (1997a). On Monte Carlo methods for estimating ratios of normalizing constants. Ann. Statist. 25 1563–1594.
Chen, M.-H. and Shao, Q. M. (1997b). Estimating ratios of normalizing constants for densities with different dimensions. Statist. Sinica 7 607–630.
Chib, S. (1995). Marginal likelihood from the Gibbs output. J. Amer. Statist. Assoc. 90 1313–1321.
Ciccotti, G. and Hoover, W. G., eds. (1986). Molecular-Dynamics Simulation of Statistical-Mechanical Systems. North-Holland, Amsterdam.
Courant, R. and Hilbert, D. (1961). Methods of Mathematical Physics 2. Wiley, New York.
Dempster, A. P., Selwyn, M. R. and Weeks, B. J. (1983). Combining historical and randomized controls for assessing trends in proportions. J. Amer. Statist. Assoc. 78 221–227.
DiCiccio, T. J., Kass, R. E., Raftery, A. and Wasserman, L. (1997). Computing Bayes factors by combining simulation and asymptotic approximations. J. Amer. Statist. Assoc. 92 903–915.
Evans, M. and Swartz, T. (1995). Methods for approximating integrals in statistics with special emphasis on Bayesian integration problems. Statist. Sci. 10 254–272.
Frankel, D. and Smit, B. (1996). Understanding Molecular Simulation. Academic Press, New York.
Frenkel, D. (1986). Free-energy computation and first-order phase transition. In Molecular-Dynamics Simulation of Statistical-Mechanical Systems (G. Ciccotti and W. G. Hoover, eds.) 151–188. North-Holland, Amsterdam.
Gelfand, A. E. and Dey, D. K. (1994). Bayesian model choice: asymptotic and exact calculations. J. Roy. Statist. Soc. Ser. B 56 501–514.
Gelfand, A. E. and Smith, A. F. M. (1990). Sampling-based approaches to calculating marginal densities. J. Amer. Statist. Assoc. 85 398–409.
Gelman, A., King, G. and Boscardin, J. (1998). Estimating the probability of events that have never occurred: when is your vote decisive? J. Amer. Statist. Assoc. 93 1–9.
Gelman, A. and Meng, X. L. (1994). Path sampling for computing normalizing constants: identities and theory. Technical Report 376, Dept. Statistics, Univ. Chicago.
Gelman, A., Roberts, G. O. and Gilks, W. R. (1996). Efficient Metropolis jumping rules. In Bayesian Statistics 5 (J. O. Berger, J. M. Bernardo, D. V. Lindley and A. F. M. Smith, eds.) 599–607. Oxford Univ. Press.
Gelman, A. and Rubin, D. B. (1992). Inference from iterative simulation using multiple sequences (with discussion). Statist. Sci. 7 457–511.
Geman, S. and Geman, D. (1984). Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence 6 721–741.
Geyer, C. J. (1991). Markov chain Monte Carlo maximum likelihood. In Computing Science and Statistics: Proceedings of the 23rd Symposium on the Interface (E. M. Keramidas, ed.) 156–163. Interface Foundation, Fairfax Station, VA.
Geyer, C. J. (1994). Estimating normalizing constants and reweighting mixtures in Markov chain Monte Carlo. Technical Report 568, School of Statistics, Univ. Minnesota.
Geyer, C. J. and Thompson, E. A. (1992). Constrained Monte Carlo maximum likelihood for dependent data (with discussion). J. Roy. Statist. Soc. Ser. B 54 657–699.
Geyer, C. J. and Thompson, E. A. (1995). Annealing Markov chain Monte Carlo with applications to ancestral inference. J. Amer. Statist. Assoc. 90 909–920.
Green, P. J. (1992). Discussion of "Constrained Monte Carlo maximum likelihood for dependent data" by C. J. Geyer and E. A. Thompson. J. Roy. Statist. Soc. Ser. B 54 683–684.
Hastings, W. K. (1970). Monte Carlo sampling methods using Markov chains and their applications. Biometrika 57 97–109.
Irwin, M., Cox, N. and Kong, A. (1994). Sequential imputation for multilocus linkage analysis. Proc. Nat. Acad. Sci. U.S.A. 91 11684–11688.
Jensen, C. S. and Kong, A. (1998). Blocking Gibbs sampling for linkage analysis in large pedigrees with many loops. American Journal of Human Genetics. To appear.
Kass, R. E. and Vos, P. W. (1997). Geometrical Foundations of Asymptotic Inference. Wiley, New York.
Kong, A., Liu, J. and Wong, W. H. (1994). Sequential imputations and Bayesian missing data problems. J. Amer. Statist. Assoc. 89 278–288.
Lewis, S. M. and Raftery, A. E. (1997). Estimating Bayes factors via posterior simulation with Laplace–Metropolis estimator. J. Amer. Statist. Assoc. 92 648–663.
Marinari, E. and Parisi, G. (1992). Simulated tempering: a new Monte Carlo scheme. Europhys. Lett. 19 451–458.
Meng, X. L. and Schilling, S. (1996). Fitting full-information factor models and an empirical investigation of bridge sampling. J. Amer. Statist. Assoc. 91 1254–1267.
Meng, X. L. and Wong, W. H. (1996). Simulating ratios of normalizing constants via a simple identity: a theoretical exploration. Statist. Sinica 6 831–860.
Metropolis, N., Rosenbluth, A. W., Rosenbluth, M. N., Teller, A. H. and Teller, E. (1953). Equation of state calculations by fast computing machines. Journal of Chemical Physics 21 1087–1092.
Mitchell, A. F. S. (1992). Estimative and predictive distances. Test 1 105–121.
Neal, R. M. (1993). Probabilistic inference using Markov chain Monte Carlo methods. Technical Report CRG-TR-93-1, Dept. Computer Science, Univ. Toronto.
Newton, M. A. and Raftery, A. E. (1994). Approximate Bayesian inference and the weighted likelihood bootstrap (with discussion). J. Roy. Statist. Soc. Ser. B 56 3–48.
Ogata, Y. (1989). A Monte Carlo method for high dimensional integration. Numer. Math. 55 137–157.
Ogata, Y. (1990). A Monte Carlo method for an objective Bayesian procedure. Ann. Inst. Statist. Math. 42 403–433.
Ogata, Y. (1994). Evaluation of Bayesian visualization models—Two computational methods. Research Memorandum 503, Institute of Statistical Mathematics, Tokyo.
Ogata, Y. and Tanemura, M. (1984). Likelihood analysis of spatial point patterns. J. Roy. Statist. Soc. Ser. B 46 496–518.
Ott, J. (1979). Maximum likelihood estimation by counting methods under polygenic and mixed models in human pedigrees. American Journal of Human Genetics 31 161–175.
Raftery, A. E. (1996). Hypothesis testing and model selection via posterior simulation. In Practical Markov Chain Monte Carlo (W. Gilks, S. Richardson and D. J. Spiegelhalter, eds.) 163–187. Chapman and Hall, London.
Rao, C. R. (1945). Information and the accuracy attainable in the estimation of statistical parameters. Bull. Calcutta Math. Soc. 37 81–91.
Rao, C. R. (1949). On the distance between two populations. Sankhyā 9 246–248.
Ripley, B. D. (1988). Statistical Inference for Spatial Processes. Cambridge Univ. Press.
Stein, M. (1992). Prediction and inference for truncated spatial data. J. Comput. Graph. Statist. 1 91–110.
Thompson, E. A. (1996). Likelihood and linkage: from Fisher to the future. Ann. Statist. 24 449–465.
Torrie, G. M. and Valleau, J. P. (1977). Nonphysical sampling distributions in Monte Carlo free-energy estimation: umbrella sampling. J. Comput. Phys. 23 187–199.