
THE JOURNAL OF CHEMICAL PHYSICS 129, 034103 (2008)

A statistical analysis of the precision of reweighting-based simulations


Tongye Shen1,a) and Donald Hamelberg2
1 Theoretical Biology & Biophysics Group, Los Alamos National Laboratory, Los Alamos, New Mexico 87545, USA, and Center for Nonlinear Studies, Los Alamos National Laboratory, Los Alamos, New Mexico 87545, USA
2 Department of Chemistry, Georgia State University, Atlanta, Georgia 30302–4098, USA
(Received 19 March 2008; accepted 21 May 2008; published online 17 July 2008)

Various advanced simulation techniques, which are used to sample the statistical ensemble of systems with complex Hamiltonians, such as those displayed in condensed matter and biomolecular systems, rely heavily on successfully reweighting the sampled configurations. The sampled points of a system from an elevated thermal environment or on a modified Hamiltonian are reused with different statistical weights to evaluate its properties at the initial desired temperature or of the original Hamiltonian. Often, the decrease of accuracy induced by this procedure is ignored and the final results can be far from what is expected. We have addressed the reasons behind such a phenomenon and have provided a quantitative method to estimate the number of sampled points required in the crucial reweighting step of these advanced simulation methods. We also provide examples from temperature histogram reweighting and accelerated molecular dynamics reweighting to illustrate this idea, which can be generalized to dynamic reweighting as well. The study shows that this analysis may provide a priori guidance for the strategy of setting up the parameters of advanced simulations before a lengthy one is carried out. The method can therefore provide insights for optimizing the parameters for high accuracy simulations with a finite amount of computational resources. © 2008 American Institute of Physics. [DOI: 10.1063/1.2944250]

I. INTRODUCTION

Applying simulation-based methods to calculate the thermodynamic properties of soft matter and biomolecular systems has been an integral part of research tools for almost half a century.1–4 One persistent theme in the theoretical research community is to find a fast, yet accurate, sampling scheme to achieve the desired results with a finite amount of computational resources. It often turns out that the (free) energy landscape of a complex system is difficult to sample directly with constant energy molecular dynamics (MD), stochastic dynamic simulation such as Langevin dynamics, or Metropolis Monte Carlo simulations. In recent years, various research efforts have been devoted to searching for a better sampling method and have provided methods such as simulated tempering, multicanonical method, replica exchange method, and accelerated dynamics and hyperdynamics, such as those listed in Refs. 5–10. These methods have vastly improved the abilities of traditional methods in many aspects. Quite often, an advanced sampling method can be split into two parts: (1) performing the simulation, for example, with an altered temperature or Hamiltonian, which is directly faithful to the correct Boltzmann statistics and (2) using a reweighting method to recover the statistics of the goal Hamiltonian or of the system at the desired temperature.7

It is important to stress that the ultimate goal of a simulation is to accurately sample the target system. However, sometimes the procedure of the altered dynamics used in step (1) is only designed to sample enough of the altered system itself, whereas very little consideration is given to the statistical accuracy of the reweighting process in recovering the target system.

Practically, researchers face a dilemma as to how "aggressively" they should alter the target system. Often a very strong deviation from the target system and/or environment makes step (1), the sampling of the altered system, easier and quicker to achieve even though it amplifies the error introduced after reweighting the final amount of sample points.

In this article, a general scheme of identifying the condition at which insufficient statistics may arise because of the reweighting procedure is presented. We then present two concrete examples: (A) Hamiltonian reweighting for equilibrium properties and (B) temperature reweighting (TR) for equilibrium properties. The manner by which this scheme is straightforwardly applicable to (C) reweighting for dynamic properties and other revelations from these methods are also discussed generally.

We believe that this type of error analysis is critical for a number of advanced simulation methods that use altered dynamics to accelerate sampling. It may provide an alternative and more rational means than an ad hoc convergence of the final results itself. Proper execution of these types of analyses can provide a quick guesstimate of how long a simulation should be run with the consideration of how to wisely use the "boost," a symbolic term used in this study to measure how far the target dynamics is from the altered dynamics. The precise meaning will be detailed in the examples shown later in this article. Following this analysis, one can see that although a small boost does not tremendously help with sampling, a huge boost destroys the statistics after reweighting.

a) Electronic mail: tshen@lanl.gov.




II. THE ORIGIN OF ERROR AMPLIFICATION OF REWEIGHTING PROCESSES

First, for an unbiased simulation preserving the properties of a canonical ensemble, the free energy profile is obtained by $F(r) = -\beta^{-1} \ln N(r)$. Here $N(r)$ refers to the number of uncorrelated samples falling in the predefined conformation $r$ during the simulation. $N(r)$ can be related to the free energy accuracy $\epsilon$ of the simulation. An estimate of the level of the average fluctuation of $N(r)$ is $\sqrt{N}$. Thus, from the average number $N$ at a particular $r$, the free energy of the state obtained is $-k_B T \ln N$, and the typical variation of the free energy measurement is set approximately as $\epsilon \simeq k_B T [\ln(N + \sqrt{N}) - \ln(N)]$. Finally, inverting the above equation, the average sampling number $N$ required for the desired accuracy of measurement is expressed as

$$N = (e^{\beta\epsilon} - 1)^{-2}. \tag{1}$$

Ideally, the actual sampling number should be much larger than $N$ to satisfy the desired accuracy $\epsilon$. For a profile of $N(r)$, the bottleneck of the profile has the smallest $N$ and thus the highest free energy; quite often, that location of $r$ indicates the transition state of the profile.

Why might reweighting increase the error of a simulation? A reweighting tags each of the initially equally weighted data points with a different weight. As a result, some points have very large weights, whereas others may have small weights. The effective number of independent measures thus decreases because many small weighted points are no longer relevant, and the results are dictated by the points with large weights. A reweighting procedure changes a collection of sample points: The $i$th sample point will be weighted by $w_i$. This situation is schematically shown in Fig. 1. An altered dynamics simulation is performed to obtain the statistics, followed by a reweighting procedure that is used to get the correct statistics of the target system. Statistics can be obtained by using the statistical factors $s_i = \ln w_i$, where $w_i$ is the weight of the data points. To be more specific, a Gaussian distribution of $s$ is used as an example so that $p(s)$ can be completely described by the first two moments: $p(s) = \exp[-(s - \langle s \rangle)^2 / (2\Delta^2)] / (\sqrt{2\pi}\Delta)$. Here, the square root of the variance is $\Delta := \langle (s - \langle s \rangle)^2 \rangle^{1/2}$. The reason why $p(s)$, instead of $p(w)$, is directly studied is twofold. First, from the theoretical aspect, it can be seen later that $p(s)$ is much more directly connected to the cumulant expansion method. Second, from a practical point of view, the distribution function $p(s)$ is well approximated by a Gaussian for several examples shown below and is thus more closely connected to the equations derived in this section. Note that the method presented below is completely general and does not rely on the Gaussian properties of the statistics. Nonetheless, in the case of the Gaussian distribution, everything can be easily expressed analytically. For more general cases, numerical evaluation will be required.

FIG. 1. An illustration of sampling the altered Hamiltonian and reweighting to recover the statistics of the target Hamiltonian. Note the potential loss of sample points during reweighting because different weights (represented by the size of the circles) are assigned to each data point.

The first step of the analysis is to locate the value of the most dominant point, that is, the one with the largest $s$ that shows up in a set of sample points. Here we follow the approach used in the argument of the random energy model (REM). Similar to the procedure used in the REM to identify the position of the glass transition temperature, one can identify the most likely positions of the extreme weights. The data points with the largest reweighting factor $s_h$ and the smallest reweighting factor $s_l$ can be found with the equation $p(s) \times N_x = 1$, where $N_x$ is the total number of sampled points at a particular location or bin. Physically it means that, on average, $p(s)$ is too small when $s > s_h$ or $s < s_l$ for the total number of sample points $N_x$ to have even a single point show up in this rare region. This gives, for the Gaussian case of $p(s)$,

$$s_{h/l} = \langle s \rangle \pm \Delta \sqrt{\ln\left(\frac{N_x^2}{2\pi\Delta^2}\right)}. \tag{2}$$

This expression is the first central result of this article. This idea of using $N_x \times p(s) = 1$ to identify extreme $s$ is of the same origin as the cutoff of energy population of the REM (Ref. 11) and the onset of the glass temperature for a complex system.12,13

FIG. 2. An illustration of the definition of the effective number.

A shorthand notation $\delta s := \Delta \times \sqrt{\ln[N_x^2 / (2\pi\Delta^2)]}$ is adopted; note that it explicitly depends on $N_x$ as well. Factors $s_i$ that are much lower than $s_h$ carry very little relevance compared to those close to $s_h$, and thus will not contribute much to the effective number of data points $N_e$ that remain useful after reweighting. This point is illustrated in Fig. 2. Thus the second essential aspect of the article is to define this effective number $N_e$. This effective number at a specific location or bin is quantitatively expressed as

$$N_e = N_x \times \int_{s_l}^{s_h} p(s) \exp(s - s_h)\, ds, \tag{3}$$

which can be further reduced to error functions erf( ) in the Gaussian case of $p(s)$, that is,

$$\frac{N_e}{N_x} = \frac{1}{2} e^{\Delta^2/2 - \delta s} \left[ \operatorname{erf}\left(\frac{\delta s - \Delta^2}{\sqrt{2}\Delta}\right) + \operatorname{erf}\left(\frac{\delta s + \Delta^2}{\sqrt{2}\Delta}\right) \right]. \tag{4}$$
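As a rough numerical illustration of Eqs. (1)–(4), the following Python sketch evaluates the required sample number, the shorthand $\delta s$, and the ratio $N_e/N_x$ for a Gaussian $p(s)$, and checks the closed form of Eq. (4) against a direct midpoint-rule quadrature of Eq. (3). The function names, the chosen parameter values, and the quadrature step count are assumptions made for illustration only; they are not taken from the article.

```python
import math


def required_samples(eps, k_t):
    """Eq. (1): samples needed in a bin for accuracy eps (same units as k_t)."""
    return (math.exp(eps / k_t) - 1.0) ** -2


def delta_s(n_x, delta):
    """Shorthand of Eq. (2): half-width of the populated range of s."""
    log_arg = n_x ** 2 / (2.0 * math.pi * delta ** 2)
    if log_arg <= 1.0:
        raise ValueError("n_x is too small for the Gaussian estimate")
    return delta * math.sqrt(math.log(log_arg))


def effective_ratio(n_x, delta):
    """Eq. (4): closed form of N_e / N_x for a Gaussian p(s)."""
    ds = delta_s(n_x, delta)
    scale = math.sqrt(2.0) * delta
    erf_sum = math.erf((ds - delta ** 2) / scale) + math.erf((ds + delta ** 2) / scale)
    return 0.5 * math.exp(0.5 * delta ** 2 - ds) * erf_sum


def effective_ratio_numeric(n_x, delta, steps=20000):
    """Eq. (3) divided by N_x, integrated with the midpoint rule (u = s - <s>)."""
    ds = delta_s(n_x, delta)
    h = 2.0 * ds / steps
    norm = math.sqrt(2.0 * math.pi) * delta
    total = 0.0
    for k in range(steps):
        u = -ds + (k + 0.5) * h
        total += math.exp(-u * u / (2.0 * delta ** 2)) / norm * math.exp(u - ds) * h
    return total


if __name__ == "__main__":
    for delta in (0.5, 1.0, 2.0):
        for n_x in (1e3, 1e5):
            ratio = effective_ratio(n_x, delta)
            print(f"Delta={delta:3.1f}  N_x={n_x:8.0f}  N_e/N_x={ratio:.4e}  N_e={n_x * ratio:10.1f}")
```

Under these assumptions the ratio falls as either $\Delta$ or $N_x$ grows, while $N_e$ itself still rises with $N_x$, which is the sublinear behavior discussed in the text.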

Normally $\operatorname{erf}[(\delta s + \Delta^2) / (\sqrt{2}\Delta)] \approx 1$ when the effect of the lower bound $s_l$ can be ignored. Finally, the sampling length of the whole simulation $N$, in terms of the number of time steps, is related through $n_r$ to the accuracy $\epsilon$ as

$$\frac{N}{n_r} \times \exp(-\beta\Delta F) \times \int_{s_l}^{s_h} p(s)\, e^{s - s_h}\, ds = (e^{\beta\epsilon} - 1)^{-2}. \tag{5}$$

Here, $n_r$ is the inverse of the data collection frequency and should be large enough to ensure the independence of the sample points. The more spread out the distribution $p(s)$ is, the smaller the effective number of sample points will be. Only an ideal δ-function distribution $p(s) = \delta(s - \bar{s})$ keeps the original number of data points.

Figure 3 plots the relation between the effective number and the original number before reweighting for several sets of parameters of the distribution $p(s)$. We found that generally for a Gaussian distribution of $p(s)$, the effective number $N_e$ increases with the total input number $N_x$. However, the ratio $N_e / N_x$ decreases with increasing $N_x$. Thus, for such a situation, it can be concluded that although increasing the total number of sample points improves the precision of the target system, the increase is sublinear. Note that the sublinear behavior arises from the nature of the Gaussian distribution. When one samples the conformations of the system really thoroughly, one will eventually find a deviation of the realistic $p(s)$ from an ideal Gaussian distribution, and that deviation will cause the distribution to be more tightly bounded than a Gaussian. In that regime, the sampling efficiency becomes linear in the number of input points. Still, practically for a large complex system, the Gaussian distribution of the reweighted scaling factors is always encountered unless an extremely long sampling or a very subtle perturbation is used.

FIG. 3. (a) The effective number $N_e$ as a function of the input number $N_x$ for three values of $\Delta$ = 0.5, 1.0, 2.0. (b) The ratio $N_e / N_x$ as a function of $\Delta$ for three sets of $N_x$.

III. EXAMPLES AND DISCUSSIONS

The examples used in this article are the sampling of conformations of a small peptide, alanine dipeptide, which has been rigorously studied by various computational methods over the last three decades. This system was chosen as it is simple, yet nontrivial. Any accurate calculation on a practical model requires both a good sampling and an accurate Hamiltonian (so-called molecular force field). We stress that we are only concerned with the efficiency of different sampling methods, rather than whether the subtle features presented under sufficient sampling of this particular model Hamiltonian [AMBER 8.0,14 PARM99,15 and generalized Born solvation16] reflect results from experiments. Generally speaking, when the sampling problem that clouds simulations of complex systems has been lifted, one can then focus on the force-field problem. We also want to stress the fact that the peptide system has been picked for the error analysis not because of a tremendous sampling problem in the peptide simulation field but because it is probably one of the systems that is most accurately sampled and most directly comparable to experiments in the biomolecular simulation field.17 This makes it easy to compare the results of advanced methods to the direct method, which is difficult for those even more complicated and sampling-problem-ridden systems. We would like to stress also that the levels of perturbation used in all the examples are relatively small. As a result, we have very good energy overlap between the target and original Hamiltonians.

The property calculated here is the two-dimensional (2D) free energy profile of the protein backbone torsional
angles of this peptide. The so-called Ramachandran plot indicates the position preference of the backbone angles $\phi$ and $\psi$. The values of these two angles are collected from snapshots every 20 fs and are sorted into 2D bins with the resolution of 10° × 10° per bin for statistical analysis.

It is important to stress that this collection interval (every 20 fs) ensures that the data points collected are uncorrelated for the various reweighting dynamics. A very high collection frequency would be futile, as it would basically just multiply the number of points in each bin by a global constant and would not improve the sampling. Thus, a study of the bin dwell time has also been performed to check that the interval of collection that has been used is proper. A dwell time is defined as the average time the conformation of the molecule stays in any particular bin before it moves to a new bin. This definition is dependent on the bin size. Also, a higher frequency of data collection (here, every 10 fs was chosen) was required for this bin dwell-time calculation. As reported later, the dwell times for various setups are indeed shorter than the collection interval. From the dynamic viewpoint, all the modified-simulation-reweighting based methods set the goal of speeding up the dynamics and thus shorten the correlation length of the data. For a given fixed total time, this will increase the usable data collection frequency and thus speed up the calculation.

Another remark is that more sophisticated methods18,19 of statistical analysis for distributions and free energy profiles exist, other than the simple binning method that has been used in the current study. For example, kernel density estimation replaces each data point with a softened local distribution and thus avoids the harsh binning process. However, the main point of this study is the precision reduction because of reweighting, which is orthogonal to the precision of the statistical analysis. The ideas presented here are equally suitable for these more sophisticated methods.

An extremely long (half millisecond) normal MD simulation was performed to provide the correct solution for various setups for comparison purposes. As shown in Fig. 4, the profile (standard solution) is quite smooth for the region of the relatively low free energy part, up to 6 kcal/mol above the global minimum, which is also the region that will be focused on for comparison. The contours shown in the plot are the iso-free energy lines. The lines of the same definition, that is, above the global minimum by the same series of values (0.1, 0.5, 1.0, 1.5, 2.0, …) kcal/mol are used throughout the following display of various profiles generated. As shown in Fig. 4, one can identify many features such as the minima and the saddle points connecting them. The higher free energy part of this plot is not explicitly drawn and is of no interest for the current study.

FIG. 4. (Color) The free energy profile at 350 K with conventional MD sampling. The contours are 0.1 (red), 0.5 (orange), 1.0 (orange), 1.5 (green), 2.0 (green), 2.5 (cyan), 3.0 (cyan), 3.5 (blue), 4.0 (blue), 5.0 (purple), and 6.0 (black) kcal/mol above the global minimum.

To answer the question pertaining to when a simulation has sufficiently sampled the landscape, a criterion for the $\epsilon$ used in Eq. (5) is required. For the current purpose, a 2D free energy profile will suffice. The criterion of a smooth profile as defined from the standard solution displayed in Fig. 4 has been used in this study. This is a necessary condition but not always a sufficient condition. This point is further illustrated with examples of cumulant methods later. More specifically, we will see below that poorer sampling will lead to an apparently rough profile. There exists a crossover, a type of softened deroughening transition, when the data points are gradually increased until all the iso-free-energy lines are quite smooth. The microscopic criterion for smooth isolines is that the statistical error of the free energy of each coarse-grained coordinate (each bin) is less than the gradient of the free energy profile times the bin size, that is, $\epsilon(\phi, \psi) \ll |\nabla F(\phi, \psi)| \times 10°$. Below we will use a simplified, global $\epsilon$ approximated as 0.1 kcal/mol to ensure this relation. Of course, these parameters are from an educated guess, which requires that one has some knowledge of the overall landscape.

A. Altering energy landscape for equilibrium properties

The results of altering the Hamiltonian and reweighting with the weight factor $w = \exp(+\beta\Delta U)$ are presented first. Here, $\Delta U$ is the difference of the potential energies between the modified potential energy landscape $U'$, which was actually used in the simulation, and the original potential $U$. Typically a good reference system is designed to make the local energy minima of the target system shallower, so that the system is less likely to be stuck at those conformations and in turn explores the landscape more efficiently.

As the torsional angles are indisputably the most important terms of the conformations of macromolecules, the function for this term was altered to facilitate the sampling. The goal was to raise the low energy basins and thus accelerate the transitions across basins. Specifically, an altered potential targeting the total 38 torsional angle terms of the system was used. The modification function is of the form

$$\Delta U = U' - U = \Theta(E_o - U) \times (E_o - U)^2 / (\alpha + E_o - U). \tag{6}$$

Here, the Heaviside function $\Theta(x) = 1$ if $x > 0$ and zero otherwise. This setup guarantees $\Delta U \geq 0$. This functional form was designed to alter the original energy function efficiently while at the same time keeping the smoothness up to the second order derivative of the energy.9 Various other forms were designed previously in similar spirits of attacking various
systems such as those listed in Refs. 5, 6, and 20–23. The boost energy parameter $E_o$ controls how the torsional energy term gets altered and the shape parameter $\alpha$ controls the aggressiveness of the alterations.

Several sets of parameters are used to illustrate that aggressive sampling does not always give the best precision as a result of energy reweighting (ER). For setups ER-1, ER-2, and ER-3, the boost energy parameter $E_o$ was set at 39.8 kcal/mol for all, and the shape parameter was set as $\alpha$ = 15.0, 30.0, 60.0 kcal/mol, respectively. Relatively speaking, a set with lower $\alpha$ is a more aggressive alteration of the energy function, which shows up as an ensemble of scaling factors $s$ with larger values. This is clearly demonstrated in the distributions $p(s)$ shown in Fig. 5. On the basis of the boost distributions $p(s)$ of ER-1, ER-2, and ER-3 displayed, one can see that the ensemble of scaling factors has both increasing mean $\langle s \rangle$ and increasing deviation from the mean $\Delta$ as the boost is increased by lowering $\alpha$.

FIG. 5. (Color) The distribution of reweighting factor $p(s)$ for four sets of energy reweighting and two sets of TR simulations.

For comparison, a very low boost setup ER-4 with the parameters $E_o$ = 19.8 kcal/mol and $\alpha$ = 2.0 kcal/mol is also shown. In such a case, $p(s)$ has a delta-distributed-like component at zero and the integral $\int p(s)\, ds$ = 94.6% for the region $s > 0$.
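The trend described above, a larger mean and a larger spread of $s$ as $\alpha$ is lowered, can be reproduced with a small Python sketch of Eq. (6) and $s = \ln w = \beta\Delta U$. The energy grid below is a purely hypothetical stand-in for sampled torsional energies, and the function names and the value of the Boltzmann constant in kcal/(mol K) are assumptions of this sketch rather than part of the original setup.

```python
import math
import statistics

K_B = 0.0019872041  # Boltzmann constant in kcal/(mol K), assumed unit convention


def boost_energy(u, e_o, alpha):
    """Eq. (6): boost added to the torsional energy u; zero when u >= e_o."""
    x = e_o - u
    if x <= 0.0:
        return 0.0
    return x * x / (alpha + x)


def scaling_factors(energies, e_o, alpha, temperature):
    """s_i = ln w_i = beta * Delta U_i for the weight w = exp(+beta * Delta U)."""
    beta = 1.0 / (K_B * temperature)
    return [beta * boost_energy(u, e_o, alpha) for u in energies]


# Hypothetical torsional energies in kcal/mol (an even grid, not simulation data).
energies = [20.0 + 0.5 * k for k in range(41)]

summary = {}
for alpha in (60.0, 30.0, 15.0):
    s_values = scaling_factors(energies, e_o=39.8, alpha=alpha, temperature=350.0)
    summary[alpha] = (statistics.mean(s_values), statistics.pstdev(s_values))
    print(f"alpha={alpha:5.1f}  <s>={summary[alpha][0]:7.3f}  Delta={summary[alpha][1]:7.3f}")
```

Because the slope of $\Delta U$ with respect to $E_o - U$ is larger at every point for smaller $\alpha$, both the mean and the standard deviation of $s$ grow as $\alpha$ decreases for any fixed set of energies, not only for this grid.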
Among the four ER setups, it turns out ER-3 has the best performance in recovering the original free energy profile though it does not have the largest (or the smallest) ensemble of the reweighting factor $s$. This point can be demonstrated by showing the profiles obtained with simulations of equal length and the same number of collected data points for various setups of ER simulations, as seen in Fig. 6.

The fact that ER-3 outperforms ER-4 is expected, and can be understood as it changes the landscape more aggressively and accelerates the dynamics to sample the conformations faster. On the other hand, how it outperformed ER-1 and ER-2 can only be understood under the current theory of reweighting precision. More specifically, the ensemble with larger reweighting factors $s$ also has a larger deviation from the mean. The first component, a larger mean reweighting factor, facilitates efficient sampling, whereas the second component, a larger deviation of the distribution $p(s)$, hurts the precision. Apparently, this larger deviation wins over in this case and leads to a lower effective number and thus poor precision according to Eq. (4).

FIG. 6. The comparison of different setups of the shape parameter $\alpha$ of the energy reweighting. (a) ER-1, (b) ER-2, (c) ER-3. Here $E_o$ = 39.8 kcal/mol and $\alpha$ = 15, 30, 60 kcal/mol for the three cases, respectively. The simulation length used in this plot is 0.1 ms with data collection every 20 fs for all three cases, i.e., total 1 × 10^6 points for each case.

According to the definition in the previous section, the bin dwell times for setups ER-1, ER-2, and ER-3 are 16.09, 16.13, and 16.26 fs, respectively. The dwell time for the normal MD is 16.80 fs. The change in the bin dwell time signifies the effect of the boost on the diffusion coefficient on the energy landscape24,25 and is different from the escape time over large energy barriers, for which the change is usually more dramatic in accelerated MD. The escape times for the β-strand to the α-helical transition along the $\psi$ angle for normal MD, ER-1, ER-2, and ER-3 are 13.04, 3.60, 4.07, and 5.24 ps, respectively. These small bin dwell times validate the uncorrelated data assumption made with the data collection interval of 20 fs for all the analyses (the exception is the calculation of the dwell times themselves, of course). This shortening of the dwell time with increasing boost also confirms that a high boost facilitates the dynamics of sampling. This speeding up of the dynamics with an altered energy landscape was previously observed and was taken advantage of to calculate dynamic properties of complex systems.24,26,27
The threshold of sufficient sampling with a given setup can also be calculated. Applying the method developed in the previous section, with statistical information and $p(s)$ obtained from Fig. 5, we obtained $\Delta$ = 1.944 for ER-3. Recall also that $\epsilon$ = 0.1 kcal/mol was previously estimated. If one wants the critical number that ensures sufficient sampling of all the regions below the (purple) contour line of $\Delta F$ = 5 kcal/mol, then feeding all these variables into Eq. (5) gives the threshold $n^* = N^* / n_r \simeq 2.1 \times 10^7$ for the (purple) contour line and the regions below. Indeed, as shown in the panels of Fig. 7, with numbers of data points two orders and one order of magnitude below the predicted threshold ($n = n^* \times 10^{-2}$ and $n = n^* \times 10^{-1}$) the sampling around the contour line of 5 kcal/mol is insufficient, whereas around the threshold ($n = n^*$) it is ample.
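A minimal Python sketch of how such a threshold could be obtained from Eq. (5) is given below. It assumes that the bin population entering Eq. (4) is $N_x = n \exp(-\beta\Delta F)$, uses an assumed value of the Boltzmann constant in kcal/(mol K), and solves for $n$ by bisection on a logarithmic scale; the function names and the bracketing interval are illustrative choices, not the authors' procedure.

```python
import math

K_B = 0.0019872041  # Boltzmann constant in kcal/(mol K), assumed unit convention


def effective_ratio(n_x, delta):
    """Eq. (4): N_e / N_x for a Gaussian p(s) with standard deviation delta."""
    log_arg = n_x ** 2 / (2.0 * math.pi * delta ** 2)
    if log_arg <= 1.0:
        raise ValueError("n_x is too small for the Gaussian estimate")
    ds = delta * math.sqrt(math.log(log_arg))
    scale = math.sqrt(2.0) * delta
    erf_sum = math.erf((ds - delta ** 2) / scale) + math.erf((ds + delta ** 2) / scale)
    return 0.5 * math.exp(0.5 * delta ** 2 - ds) * erf_sum


def critical_number(delta, eps, d_free, temperature, lo=1e5, hi=1e12):
    """Solve Eq. (5) for n = N / n_r by bisection in log space.

    Assumes the effective number grows monotonically with n on [lo, hi]
    and that the root is bracketed by lo and hi.
    """
    beta = 1.0 / (K_B * temperature)
    target = (math.exp(beta * eps) - 1.0) ** -2   # right-hand side of Eq. (5)
    boltzmann = math.exp(-beta * d_free)          # relative population at d_free

    def n_effective(n):
        n_x = n * boltzmann                       # assumed bin population
        return n_x * effective_ratio(n_x, delta)

    if not n_effective(lo) < target < n_effective(hi):
        raise ValueError("root is not bracketed by [lo, hi]")
    for _ in range(200):
        mid = math.sqrt(lo * hi)
        if n_effective(mid) < target:
            lo = mid
        else:
            hi = mid
    return math.sqrt(lo * hi)


n_star = critical_number(delta=1.944, eps=0.1, d_free=5.0, temperature=350.0)
print(f"n* is approximately {n_star:.2e}")
```

With the ER-3 inputs quoted above, this sketch returns a value of roughly 2 × 10^7, the same order as the threshold stated in the text.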

FIG. 7. The comparison of different lengths of simulations for the setup ER-3. (a) 4 μs, about 1/100 of the critical length, (b) 0.04 ms, about 1/10 of the critical length, and (c) 0.4 ms, about the critical length for sufficient sampling of the region of 5 kcal/mol and below.

B. Altering temperature for equilibrium properties

The histogram reweighting by changing an environmental parameter, such as the temperature, is now studied specifically in this subsection. Here, the idea is to use high temperature to facilitate sampling based on the fact that the system will have larger kinetic energy at higher temperature, and so can more easily overcome barriers and escape from local traps. Simulations at a constant temperature $T_r$ were performed and the physical properties at $T$ were obtained through TR. In such a case, the scaling factor for TR is defined as $s_i = (\beta - \beta_r) E_i$ with $\beta_r^{-1} = k_B T_r$. Two sets of $T_r$ are used here, 450 and 700 K, for recovering the statistics of the peptide at $T$ = 350 K. As the potential energy fluctuation of the sampling is approximately $(m/2) k_B T_r$, where $m$ is a phenomenological constant symbolizing the independent, excited degrees of freedom for this system, $m$ should be very weakly dependent on temperature in the temperature range suited for this study. Thus we can obtain a rough estimation of the variance of the distribution $p(s)$, $\Delta \simeq m/2 \times (\beta - \beta_r) / \beta_r \propto (T_r - T) / T$. Indeed, as shown in Fig. 5, the two sets of simulations at 450 and 700 K have their variance $\Delta$ approximately equal to the phenomenological estimates. From the above assumption, one can derive the ratio $\Delta_{TR700} / \Delta_{TR450} = 3.5$, whereas the actual measured values give the ratio of the two variances as 5.0284 : 1.4384 ≃ 3.496.

It is clear from the observation that $\Delta$ increases with $T_r - T$ that simulation at 700 K is not an effective way to recover the results at 350 K. Again, this low efficiency of convergence is not because of the dynamics of the sampling step. On the contrary, the bin dwell time is 13.47 fs, the shortest time of all the ER and TR simulations. As a comparison, the original normal MD has a bin dwell time of 16.80 fs. We do admit that it is slightly unfair to conclude here that 700 K reweighting is inefficient compared to the original simulation. The shorter dwell time warrants a higher frequency of data collection, which potentially improves the statistics for a given total time. Further procedures of obtaining the effective number of sampling data points of reweighting are similar to the previous subsection. In the case of 450 K reweighting, the critical number $n^* \simeq 3.75 \times 10^6$ is obtained for free energy up to 5 kcal/mol above the
⌬F = 0. The resulting critical number n*, will accurately es-


timate the precision of the kth bin. The disadvantage of a
local theory compared to the global theory is obviously a
problem of the resolution, and thus more information is
needed for the estimation.
The statistical reweighting is not limited to the change of
temperature alone. For example one can use a collection of
sample points from a constant force ensemble28 simulation, a
molecule subject to tension by pulling its ends with the mag-
nitude of balanced force f. One can use reweighting factors
to obtain the ensemble of another magnitude of f ⬘ with the
reweighting factor w = e−共f ⬘−f兲⌬l. Here, the displacement of
the ends, ⌬l, is the conjugate variable of force. The methods
discussed previously can be applied to these situations
straightforwardly.

C. Alternating the setups for dynamic properties


and cumulant expansion
The common theme of both subsections presented earlier
is obtaining equilibrium properties with a simulation of a
Hamiltonian system 共possibly altered兲 in an equilibrium en-
vironment 共possibly altered兲. Section III A focused on the
alternation of the former, whereas Sec. III B was based on
the latter. Besides the obvious and straightforward combina-
tion of these two types of alternations that are also applicable
for the current topic, we want to extend the discussion in a
new direction.
Nowadays, equilibrium properties are not the only things
researchers pursue with dynamical simulations; indeed a va-
riety of dynamical properties can be investigated with simu-
lations too, for example, from a simple relaxation of tor-
sional angle rotations24 to complex gating dynamics of
enzymes.29 There is a greater challenge to obtain the dynami-
cal properties of the system with reweighting methods. Simi-
lar to the previous subsections, these methods involve a
simulation of an altered setup 共with altered Hamiltonian, al-
tered environmental conditions, or could even be nonequilib-
rium dynamics rule for the evolution of the conformation兲
and a reweighting to recover the dynamical property. In pur-
suit of dynamical properties and/or with nonequilibrium
rules governing the dynamics, even more attention should be
paid to the central issues that have been discussed in this
article.
FIG. 8. The comparison of different length of simulations for the setup TR450. (a) 0.075 μs, about 1/100 of the critical length, (b) 0.75 μs, about 1/10 of the critical length, and (c) 75 μs, about the critical length for sufficient sampling of the region of 5 kcal/mol and below.

global minima. As shown in Fig. 8, profiles obtained from various numbers to illustrate this rough-to-smooth transition were plotted.
Both evaluations of $n^*$, the one used here and that in the previous subsection, are based on a simple method with only the global distribution of $p(s)$ and $\Delta F$ as input. A more specific theory can be extended to any specific region with a local distribution $p_k(s)$ where $k$ is the location such as the bin index. For such local version of the equation, one can set

For example, one can use a path integral formulation to calculate the dynamical perturbation of a dynamic trajectory, similar to the thermal dynamical perturbation of the snapshots.30–32 In such a calculation, an ensemble of easier dynamic histories will be generated with the altered setup and a reweighting factor for each history is recorded. The final, ensemble-averaged dynamics is calculated with weights of that factor. One can see that each reweighting factor now is an integral of (or a discrete summation in practice) the Lagrangian difference that will exacerbate the potential problem of reduction of resolution after reweighting, as the distribution of reweighting will be very broad.
This can be easily seen because the summation of random variables has a much larger variation than the variation of each random variable. Each path history being discretized
with $M$ time slices has a variance $\Delta \propto M\Delta_s$. Here, $M$ will be a large number, unless the dynamics is an ultrashort, collision-like process. $\Delta_s$ is the variance of the scaling factor for each time point, which will not be small for a complex system under any moderate alteration of the setup. Thus, one needs to take the reweighting statistics into consideration when planning such simulations of complex systems for dynamic properties, and may again face the following dilemma. On the one hand, a very mild alteration of the setup relative to a conventional simulation will offer little acceleration of the dynamics; on the other hand, a more aggressive alteration will face the aforementioned statistical problem. Thus, an optimum window for such dynamic reweighting or nonequilibrium setup to accurately capture the desired properties of a complex system is much smaller, if it exists at all, than those of the equilibrium properties already discussed.
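The growth of the path-weight variance with the number of time slices can be illustrated with a minimal numerical sketch. It rests on assumptions that are not part of the original analysis: the per-slice contributions to the log-weight are taken as independent Gaussian variables with a hypothetical variance `var_per_slice`, and the loss of statistics is measured with the Kish effective sample size, which is a generic stand-in and not necessarily the effective number defined in this article.

```python
import numpy as np

def path_log_weights(n_paths, n_slices, var_per_slice, rng):
    """Sum n_slices i.i.d. Gaussian increments to get one log-weight per path."""
    increments = rng.normal(0.0, np.sqrt(var_per_slice), size=(n_paths, n_slices))
    return increments.sum(axis=1)

def kish_effective_fraction(log_w):
    """Kish effective sample size divided by the raw number of samples."""
    w = np.exp(log_w - log_w.max())  # shift by the maximum for numerical stability
    return (w.sum() ** 2) / (w ** 2).sum() / len(w)

rng = np.random.default_rng(0)
var_per_slice = 0.05  # hypothetical per-slice variance (Delta_s), an assumption
results = {}
for n_slices in (1, 10, 100):
    s = path_log_weights(20000, n_slices, var_per_slice, rng)
    # store (sample variance of the path log-weight, effective fraction)
    results[n_slices] = (s.var(), kish_effective_fraction(s))
    print(n_slices, results[n_slices])
```

Under these assumptions the sample variance of the path log-weight should scale roughly linearly with the number of slices, while the effective fraction of contributing paths shrinks rapidly, which is the dilemma described above.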
A related issue is using a nonequilibrium setup to recover equilibrium properties.33–35 Our approach to studying the reweighting distribution is once again important for any practical application. A strong nonequilibrium setup will produce a large variance in the distribution of the reweighting factor, whereas a very weak setup will provide little acceleration. The method to locate a working window of parameters of the setup for such methods is not obvious.
Sometimes researchers also adopt a cumulant expansion technique36,37 for the reweighting, especially when large statistical errors are present after straightforward reweighting of each data point by its weight $e^{s_i}$. On the basis of the cumulant expansion equation, one can use the information on the moments of $p_k(s)$, the distribution for the $k$th bin, to express the dimensionless free energy as $-\beta F_k = \sum_{j=1}^{z} C_j/j! + \ln N_k$, where $C_j$ is the $j$th cumulant of the distribution $p_k(s)$ and $N_k$ is the total raw number of sample points dropped into the $k$th bin. Here $z$ is the order of cutoff and $C_1 = \langle s \rangle$, $C_2 = \langle s^2 \rangle - C_1^2$, and $C_3 = \langle s^3 \rangle - 3C_1\langle s^2 \rangle + 2C_1^3$. Generally, $C_i$ is closely related to, but for higher orders not the same as, the central moment $D_i = \langle (s - \langle s \rangle)^i \rangle$.
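The truncated expansion can be compared with direct exponential averaging in a short sketch. Everything specific here is an assumption made for illustration: the log-weights in a single bin are drawn from a hypothetical Gaussian with mean `mu` and standard deviation `sigma`, for which the expansion terminates exactly at second order, and the function names are not taken from any existing package.

```python
from math import factorial
import numpy as np

def raw_to_cumulants(s):
    """Return [C1, C2, C3] of the sample s, computed from raw moments."""
    m1, m2, m3 = np.mean(s), np.mean(s**2), np.mean(s**3)
    return [m1, m2 - m1**2, m3 - 3.0 * m1 * m2 + 2.0 * m1**3]

def cumulant_estimate(s, z):
    """-beta*F_k from the expansion truncated at order z (1 <= z <= 3)."""
    if not 1 <= z <= 3:
        raise ValueError("this sketch only implements cutoff orders 1 to 3")
    c = raw_to_cumulants(s)
    return sum(c[j - 1] / factorial(j) for j in range(1, z + 1)) + np.log(len(s))

def direct_estimate(s):
    """-beta*F_k = ln(sum_i exp(s_i)), evaluated with a log-sum-exp shift."""
    s_max = np.max(s)
    return s_max + np.log(np.sum(np.exp(s - s_max)))

rng = np.random.default_rng(1)
mu, sigma = 0.3, 2.0                   # hypothetical Gaussian p_k(s), an assumption
s = rng.normal(mu, sigma, size=5000)   # synthetic log-weights falling in one bin
exact = mu + 0.5 * sigma**2 + np.log(len(s))  # exact value for a Gaussian p_k(s)
estimates = {z: cumulant_estimate(s, z) for z in (1, 2, 3)}
direct = direct_estimate(s)
print("exact", exact, "direct", direct, "cumulant", estimates)
```

For this Gaussian test case the second-order estimate depends only on the sample mean and variance, whereas the direct estimate is dominated by the few largest values of $s$. For a non-Gaussian $p_k(s)$ a low cutoff can be smooth and still biased, which is the caveat raised in the text.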
It is easy to understand from these analyses that the cumulant expansion with a low cutoff order may smooth apparently rough data. It effectively forces the distribution $p(s)$ to have only the several lowest-order cumulants nonzero, and this effectively narrows the otherwise broader distribution.
The broader distribution is the reason for the decrease of the effective number of sampled data points. It is not surprising that the lower the cutoff order $z$ is, the smoother the free energy profiles will be. As shown in Fig. 9, the cumulant expansion is used to process the same set of data that was used to plot Fig. 6(c) with the direct reweighting of exponential factors. We see that a lower cutoff does correspond to a smoother free energy profile. However, it is not always easy to tell whether this is a purely cosmetic effect of smoothing or whether it gives a more accurate physical result. It appears that the relation among the total number of sample points $N$, the cutoff order $z$, and the total statistical error $R(z,N) = \int dr\,[F_{z,N}(r) - F_{\mathrm{exact}}(r)]^2$ follows an asymptotic series. In other words, for any given $N$, there is an optimal $z$ that minimizes $R(z,N)$, and a larger cutoff will increase $R$, whereas a larger $N$ will shift the optimal $z$ to higher values. A more quantitative understanding of these problems, and an answer to whether one can easily identify an optimal procedure for cumulant expansion, deserve further study.

FIG. 9. The effects of cumulant expansion with different cutoff order, up to first (a), second (b), and third (c) order. The data set is from the ER-3 simulation, total 0.1 ms with data collection every 20 fs. The exponential form of average was displayed in Fig. 6(c).

IV. CONCLUSIONS

We have identified why methods that rely on strong alteration of the initial setup of the system, such as those used to obtain the free energy profiles of molecular conformations, might encounter a severe problem of accuracy after reweighting. A more aggressively altered setup of the system may facilitate the exploration of the phase space of the system but at the same time suffers much more during the reweighting step. The main reason is that many logged visits of the phase space do not count as much as the few dominating points, which have high weights.
From these ideas, a model was built to define the effective number of sampled points after reweighting. This article presents a method to estimate the level of accuracy of free energy calculations based on perturbation and reweighting procedures, using the distribution of the reweighting factor $p(s)$. This distribution is typically much easier to obtain, making the method potentially suitable for the a priori design of a long simulation.
Though only demonstrated in two types of situations for the calculation of equilibrium properties, namely reweighting of an altered energy function and of an altered temperature, the analysis is equally suited to more general situations, such as the reweighting of dynamical properties. We also discussed a related issue: the free energy perturbation calculation by the cumulant expansion method. It is becoming increasingly evident that these reweighting issues are going to be an indispensable part of many simulation (or even some experimental) methods.

ACKNOWLEDGMENTS

We would like to thank Dr. J. A. McCammon, Dr. S. Gnanakaran, and Dr. P. G. Wolynes for their kind support and encouragement, and M. Fajer and Dr. C. Zong for reading the manuscript.

1 D. Frenkel and B. Smit, Understanding Molecular Simulation: From Algorithms to Applications, 2nd ed. (Academic, San Diego, 2002).
2 D. P. Landau and K. Binder, A Guide to Monte Carlo Simulations in Statistical Physics, 2nd ed. (Cambridge University Press, Cambridge, 2005).
3 M. P. Allen and D. J. Tildesley, Computer Simulation of Liquids (Oxford University Press, Oxford, 1997).
4 J. A. McCammon and S. C. Harvey, Dynamics of Proteins and Nucleic Acids (Cambridge University Press, Cambridge, 1987).
5 H. Grubmüller, Phys. Rev. E 52, 2893 (1995).
6 A. F. Voter, Phys. Rev. Lett. 78, 3908 (1997).
7 U. H. E. Hansmann and Y. Okamoto, Phys. Rev. E 56, 2228 (1997).
8 K. K. Bhattacharya and J. P. Sethna, Phys. Rev. E 57, 2553 (1998).
9 D. Hamelberg, J. Mongan, and J. A. McCammon, J. Chem. Phys. 120, 11919 (2004).
10 Y. Q. Gao, J. Chem. Phys. 128, 064105 (2008).
11 B. Derrida, Phys. Rev. Lett. 45, 79 (1980).
12 J. D. Bryngelson and P. G. Wolynes, Proc. Natl. Acad. Sci. U.S.A. 84, 7524 (1987).
13 K. K. Koretke, Z. Luthey-Schulten, and P. G. Wolynes, Proc. Natl. Acad. Sci. U.S.A. 95, 2932 (1998).
14 D. A. Pearlman, D. A. Case, J. W. Caldwell, W. S. Ross, T. E. Cheatham, S. Debolt, D. Ferguson, G. Seibel, and P. Kollman, Comput. Phys. Commun. 91, 1 (1995).
15 W. D. Cornell, P. Cieplak, C. I. Bayly, I. R. Gould, K. M. Merz, D. M. Ferguson, D. C. Spellmeyer, T. Fox, J. W. Caldwell, and P. A. Kollman, J. Am. Chem. Soc. 117, 5179 (1995).
16 A. Onufriev, D. Bashford, and D. A. Case, J. Phys. Chem. B 104, 3712 (2000).
17 S. Gnanakaran, H. Nymeyer, J. Portman, K. Y. Sanbonmatsu, and A. E. Garcia, Curr. Opin. Struct. Biol. 13, 168 (2003).
18 B. Silverman, Density Estimation for Statistics and Data Analysis (Chapman and Hall, New York, 1986).
19 T. E. Holy, Phys. Rev. Lett. 79, 3545 (1997).
20 M. M. Steiner, P. A. Genilloud, and J. W. Wilkins, Phys. Rev. B 57, 10236 (1998).
21 S. Pal and K. A. Fichthorn, Chem. Eng. J. 74, 77 (1999).
22 J. Rahman and J. C. Tully, J. Chem. Phys. 116, 8750 (2002).
23 L. Yang, M. P. Grubb, and Y. Q. Gao, J. Chem. Phys. 126, 125102 (2007).
24 D. Hamelberg, T. Shen, and J. A. McCammon, J. Chem. Phys. 122, 241103 (2005).
25 D. Hamelberg, T. Shen, and J. A. McCammon, J. Chem. Phys. 125, 094905 (2006).
26 C. Xing and I. Andricioaei, J. Chem. Phys. 124, 034110 (2006).
27 J. Xing, Phys. Rev. Lett. 99, 168103 (2007).
28 T. Shen, D. Hamelberg, and J. A. McCammon, Phys. Rev. E 73, 041908 (2006).
29 T. Y. Shen, K. Tai, and J. A. McCammon, Phys. Rev. E 63, 041902 (2001).
30 L. S. Schulman, Techniques and Applications of Path Integration (Dover, New York, 2005).
31 H. Kleinert, Path Integrals in Quantum Mechanics, Statistics, Polymer Physics, and Financial Markets, 3rd ed. (World Scientific, Singapore, 2003).
32 L. Y. Chen and N. J. M. Horing, J. Chem. Phys. 126, 224103 (2007).
33 R. D. Astumian, Am. J. Phys. 74, 683 (2006).
34 C. Jarzynski, Phys. Rev. Lett. 78, 2690 (1997).
35 G. N. Bochkov and Y. E. Kuzovlev, Physica A 106, 443 (1981).
36 M. P. Eastwood, C. Hardin, Z. Luthey-Schulten, and P. G. Wolynes, J. Chem. Phys. 117, 4602 (2002).
37 S. Park, F. Khalili-Araghi, E. Tajkhorshid, and K. Schulten, J. Chem. Phys. 119, 3559 (2003).
