[gmx-users] Very large fluctuations in dg/dl
John D. Chodera
jchodera at gmail.com
Tue May 8 19:45:22 CEST 2007
Hi Patrick,
I like your plots. They nicely demonstrate the difficulty of
convergence of the estimate of <dg/dl>.
It looks like there may be some other oddities in the plot, such as
switching between conformations or some other effect that has a long
correlation time. In particular, at lambda = 0.55, there is a big
switch about 500 ps in.
We (David Mobley and I) have found it helpful to examine the dg/dl
timeseries plots directly to look for the potential presence of an
initial non-equilibrated state (which would look like dg/dl spending
time near some value A_1 and then switching to fluctuate about A_2
for the remainder of the simulation) or multiple states with a long
correlation time (which would look like hops between fluctuations
about different values). In this case, long correlation times may
require you to run much longer.
One of the best diagnostic tools besides examination of the
timeseries is to compute the correlation time and statistical
inefficiency for each dg/dl timeseries. That Janke article I
mentioned previously describes how to do this, and is available
online here:
http://www.fz-juelich.de/nic-series/volume10/janke2.pdf
Also, there is a discussion of how to do so efficiently in my recent
paper on the analysis of parallel tempering simulations using WHAM
(see Sections 2.4 and 5.2):
http://www.dillgroup.ucsf.edu/~jchodera/pubs/pdf/replica-exchange-wham.pdf
The method I described for computing the uncertainties is termed
"correlation analysis", as it relies on computation (and integration)
of the autocorrelation function for the timeseries. This was first
applied to molecular dynamics simulations by Bill Swope when he was
with Hans Andersen:
W. C. Swope, H. C. Andersen, P. H. Berens, and K. R. Wilson. A
computer simulation method for the calculation of equilibrium
constants for the formation of physical clusters of molecules:
Application to small water clusters. J. Chem. Phys., 76(1):637–649,
1982.
(This is actually the same paper where he introduces the velocity
Verlet integrator -- it's a good read!)
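In case it is useful, here is a bare-bones Python/numpy sketch of such a
correlation analysis (illustrative only, not production code): it
integrates the normalized autocorrelation function of a timeseries,
truncated at its first zero crossing, to estimate the statistical
inefficiency g:

import numpy as np

def statistical_inefficiency(x):
    """Estimate g = 1 + 2*tau by integrating the normalized
    autocorrelation function C(t), truncating at the first zero
    crossing (a common convention)."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    dx = x - x.mean()
    var = np.dot(dx, dx) / n
    if var == 0.0:
        return 1.0
    g = 1.0
    for t in range(1, n - 1):
        c = np.dot(dx[:-t], dx[t:]) / ((n - t) * var)  # C(t)
        if c <= 0.0:
            break
        g += 2.0 * c * (1.0 - float(t) / n)
    return max(g, 1.0)

# Quick check on synthetic AR(1) data with a known answer: for
# x[i] = 0.95*x[i-1] + noise, g should come out near
# (1 + 0.95)/(1 - 0.95) = 39.
rng = np.random.default_rng(0)
x = np.empty(50000)
x[0] = 0.0
for i in range(1, len(x)):
    x[i] = 0.95 * x[i - 1] + rng.normal()
print(statistical_inefficiency(x))

The effective number of independent samples in your dg/dl timeseries
is then N/g.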
Block averaging should give equivalent uncertainty estimates to the
correlation analysis method (to about an order of magnitude) if the
block sizes are chosen appropriately, but it usually requires either
calculation of the statistical inefficiency to determine the block
size first, or application of an iterative method like that of
Flyvbjerg (cited in my WHAM paper) to determine the statistical
uncertainty from consideration of many block sizes. Wolfhard Janke's
paper does a great job of discussing how various methods for
estimating the statistical uncertainty compare.
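Again purely as an illustration (this is not the algorithm inside
g_analyze -ee), the block averaging idea fits in a few lines of numpy:
the standard error estimated from block means grows with the block size
and then levels off once blocks are longer than the correlation time,
and the plateau value is the uncertainty in the mean:

import numpy as np

def block_error(x, block_size):
    """Standard error of the mean estimated from the means of
    consecutive non-overlapping blocks of length block_size."""
    x = np.asarray(x, dtype=float)
    n_blocks = len(x) // block_size
    blocks = x[:n_blocks * block_size].reshape(n_blocks, block_size)
    return blocks.mean(axis=1).std(ddof=1) / np.sqrt(n_blocks)

# Correlated test data (the same AR(1) process as in the sketch above).
rng = np.random.default_rng(0)
x = np.empty(50000)
x[0] = 0.0
for i in range(1, len(x)):
    x[i] = 0.95 * x[i - 1] + rng.normal()

for block_size in (1, 10, 100, 1000, 5000):
    print(block_size, block_error(x, block_size))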
Finally, if there are unequilibrated regions of your dataset, there
is now a method to automatically determine the boundary between
unequilibrated and equilibrated regions:
W. Yang, R. Bitetti-Putzer, and M. Karplus. Free energy simulations:
Use of reverse cumulative averaging to determine the equilibrated
region and the time required for convergence. J. Chem. Phys.,
120(6):2618–2628, 2004.
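The quantity at the heart of that method is easy to compute yourself:
average the timeseries backwards from the end of the run, so that
equilibrated data traces a flat curve and the onset of drift marks the
unequilibrated region. A minimal sketch (the paper adds a statistical
test to place the cutoff automatically; this only produces the curve
to plot):

import numpy as np

def reverse_cumulative_average(x):
    """rca[i] is the mean of x[i:], i.e. the running average
    accumulated backwards from the end of the run."""
    rev = np.asarray(x, dtype=float)[::-1]
    return (np.cumsum(rev) / np.arange(1, len(rev) + 1))[::-1]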
To address your observation that some authors use only a few hundred
ps of simulation time, while your system seems to require at least 1
ns: Different systems have different correlation times and variances
of dg/dl, both of which affect the statistical uncertainty. Some
systems are much "easier" to converge than others. David Mobley has
found that application of restraints in free energy calculations
which are later removed can actually transform a "very hard" problem
into an "easy" problem -- see the publication below. But the basic
answer is that this is why it is extremely important to *always*
compute the correlation time for whatever it is you are averaging --
this may be very different from system to system, or even from lambda
value to lambda value.
D. L. Mobley, J. D. Chodera, and K. A. Dill, "Confine and Release:
Obtaining Correct Binding Free Energies in the Presence of Protein
Conformational Change", accepted, Journal of Chemical Theory and
Computation.
http://www.dillgroup.ucsf.edu/~dmobley/papers/flex.pdf
Best of luck!
- John
--
John Chodera <jchodera at gmail.com> | Mobile : 415 867-7384
Postdoctoral researcher, Pande lab | Lab phone : 650.723.1097
Department of Chemistry, Stanford University | Lab fax : 650.724.4021
http://www.dillgroup.ucsf.edu/~jchodera
On May 8, 2007, at 3:20 AM, Patrick Fuchs wrote:
> Hi John,
> thanks a lot for your reply.
> Indeed, the standard deviation I presented in my previous post is
> that of the dg/dl samples. I was just surprised that the std. dev.
> is always larger than the value itself (since I'm starting out with
> free energy calculations, I had no expectation of what the behavior
> would be and needed confirmation).
> I was also surprised by the convergence of the mean. I put a plot
> for each lambda value at the URL:
> http://condor.ebgm.jussieu.fr/~fuchs/download/convergence.png
> (at each time step, I recalculate the mean from the beginning). My
> first observation was that 1 ns seemed to be a minimum for certain
> lambda values
> (e.g. lambda=0.70). I have sometimes read in the literature that
> some authors used only a few hundred ps, which seemed (to me)
> insufficient for proper convergence.
> Now, coming back to the error estimate of the mean: I found (for
> lambda=0.00) 0.2 kJ/mol using block averaging (the -ee option of
> g_analyze), which I imagine is reasonable (even if higher precisions
> have been described in the literature). I'm not sure whether this is
> the same way of calculating the uncertainty as the one you proposed.
> Can you confirm?
> I will have a look at the book you mentioned, thanks for the pointer.
> Cheers,
>
> Patrick
>
> On Mon, 7 May 2007, John D. Chodera wrote:
>
>> Hi Patrick,
>>
>>> I find a reasonable DeltaGsol value of 8.6 kJ/mol for methane
>>> (compared to 8.7 in Geerke & van Gunsteren, ChemPhysChem 2006, 7,
>>> 671–678) but
>>> I get really huge fluctuations in the values of dg/dl:
>>> lambda=0.00: 5.0 +/- 10.8 (mean +/- standard deviation)
>>> lambda=0.05: 4.3 +/- 11.2
>>
>> Is the standard deviation you quote here the standard deviation of
>> the dg/dl samples, or the standard deviation of the mean?
>>
>> If the former, then this behavior is totally expected: While the
>> standard deviation of a random variable (your dg/dl samples) may
>> be large, with enough sampling, we can get a very precise estimate
>> of the mean. More sampling will not change the standard deviation
>> of the dg/dl samples, but it will reduce the standard error in the
>> mean, which is what we need for precise estimates of free energy
>> differences.
>>
>> The uncertainty in the estimate of <dg/dl> is given simply by
>>
>> d<dg/dl> = sigma / sqrt(N / g)
>>
>> where here, sigma is the standard deviation of your dg/dl samples,
>> N is the number of data points you have collected, and g is
>> something called the "statistical inefficiency", which can be
>> estimated from the correlation time of your dg/dl samples. More
>> information on this sort of analysis can be found in reference [1]
>> below.
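To put some illustrative numbers on this formula: taking sigma = 10.8
kJ/mol from the lambda=0.00 window above, N = 5000 samples (1 ns with
energies written every 0.2 ps), and a guessed statistical inefficiency
of g = 10 (g must of course be measured from the actual data):

import math

sigma, N, g = 10.8, 5000, 10.0   # g here is only a guess
print(sigma / math.sqrt(N / g))  # ~0.48 kJ/mol uncertainty in <dg/dl>

So even a large sigma can still yield a sub-kJ/mol standard error of
the mean, given enough independent samples.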
>>
>> Once you have the uncertainty in each estimate of <dg/dl>, you
>> still have to combine these to get the uncertainty estimate for
>> the integrated free energy difference using standard propagation
>> of error. This depends on your choice of quadrature for TI.
>> David Mobley has done a lot of this, and I'm sure would be willing
>> to help if you had trouble figuring it out.
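A minimal sketch of that propagation for simple trapezoidal quadrature
over evenly spaced lambda values (the numbers below are placeholders,
not real data): since DeltaG = sum_i w_i <dg/dl>_i, the uncertainties
of independent windows combine as d(DeltaG) = sqrt(sum_i (w_i d_i)^2):

import numpy as np

lam = np.linspace(0.0, 1.0, 21)   # 21 windows: 0.00, 0.05, ..., 1.00
dgdl = np.zeros(21)               # per-window <dg/dl> (placeholder)
err = np.full(21, 0.2)            # per-window uncertainty (placeholder)

h = lam[1] - lam[0]
w = np.full_like(lam, h)          # trapezoid weights: h inside,
w[0] = w[-1] = h / 2.0            # h/2 at the two endpoints

dG = np.sum(w * dgdl)                  # integrated free energy difference
ddG = np.sqrt(np.sum((w * err) ** 2))  # propagated uncertainty
print(dG, "+/-", ddG)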
>>
>> Good luck!
>>
>> - John
>>
>> [1] W. Janke. Statistical analysis of simulations: Data
>> correlations and error estimation. In J. Grotendorst, D. Marx, and
>> A. Muramatsu, editors, Quantum Simulations of Complex Many-Body
>> Systems: From Theory to Algorithms, volume 10, pages 423–445. John
>> von Neumann Institute for Computing, 2002.
>>
>> --
>> John Chodera <jchodera at gmail.com> | Mobile : 415 867-7384
>> Postdoctoral researcher, Pande lab | Lab phone : 650.723.1097
>> Department of Chemistry, Stanford University | Lab fax : 650.724.4021
>> http://www.dillgroup.ucsf.edu/~jchodera
>> --
>>
>> Date: Mon, 07 May 2007 16:41:26 +0200
>> From: Patrick Fuchs <Patrick.Fuchs at ebgm.jussieu.fr>
>> Subject: [gmx-users] Very large fluctuations in dg/dl
>> To: gmx-users at gromacs.org
>> Message-ID: <463F3A96.9050206 at ebgm.jussieu.fr>
>> Content-Type: text/plain; charset=windows-1252; format=flowed
>>
>> Hi Gromacs users,
>> I have a few questions related to solvation free energy
>> calculation via
>> thermodynamic integration.
>> I'm trying to reproduce some literature data (on e.g. methane,
>> methanol...) using the GROMOS G53a6 force field. I followed the
>> tutorial of David Mobley (thanks to him BTW), but I used the
>> standard nonbonded options of the G53a6 force field (instead of
>> OPLS). For each lambda
>> value I do a minimization, a 10 ps NVT followed by a 20 ps NPT
>> equilibration, and a 1 ns NVT production using the sd integrator.
>> I used 21 lambda values (0.00, 0.05...1.00).
>> Here's my topology file:
>> ----------------beginning of methane.top------------------------
>> ; topology for a methane molecule
>>
>> ; include GROMOS53a6 force field
>> #include "ffG53a6.itp"
>>
>> ;;;;;;; begin methane definition ;;;;;;;
>> [ moleculetype ]
>> ; Name nrexcl
>> METH 3
>>
>> [ atoms ]
>> ;nr type resnr residue atom cgnr charge mass typeB chargeB massB
>> 1 CH4 1 METH C1 0 0.0000 16.0430 DUM 0.0000 16.04300
>> ;;;;;; end methane definition ;;;;;;;;
>>
>> ; include water topology
>> #ifdef FLEX_SPC
>> #include "flexspc.itp"
>> #else
>> #include "spc.itp"
>> #endif
>>
>> [ system ]
>> ; name
>> 1 methane molecule in water
>>
>> [ molecules ]
>> ; name number
>> METH 1
>> SOL 893
>> -----------------end of methane.top------------------------
>>
>> And here is my mdp file for lambda=0:
>> ---------------beginning of prod.mdp---------------------
>> title = production NVT methane/water
>> cpp = /lib/cpp
>> ; OPTIONS FOR BOND CONSTRAINTS
>> constraints = all-bonds
>> ; RUN CONTROL PARAMETERS
>> integrator = sd
>> tinit = 0
>> dt = 0.002
>> nsteps = 500000 ; 1000 ps
>> ; NUMBER OF STEPS FOR CENTER OF MASS MOTION REMOVAL
>> nstcomm = 100
>> ; OUTPUT CONTROL OPTIONS
>> nstxout = 500
>> nstvout = 500
>> nstfout = 0
>> nstlog = 500
>> nstenergy = 100
>> nstxtcout = 5000
>> xtc-precision = 1000
>> ; NON BONDED STUFF
>> ns_type = grid
>> nstlist = 5
>> rlist = 0.8
>> coulombtype = generalized-reaction-field
>> rcoulomb = 1.4
>> rvdw = 1.4
>> epsilon_rf = 54.0
>> ;OPTIONS FOR TEMPERATURE COUPLING
>> tc_grps = system
>> tau_t = 0.1 ; inverse langevin friction cst
>> ref_t = 300
>> ;OPTIONS FOR PRESSURE COUPLING
>> Pcoupl = no
>> tau_p = 0.5
>> compressibility = 4.5e-5
>> ref_p = 1.0
>> ; FREE ENERGY CONTROL STUFF
>> free_energy = yes
>> init_lambda = 0.00
>> delta_lambda = 0
>> sc_alpha = 0.5
>> sc-power = 1.0
>> sc-sigma = 0.3
>> ; VELOCITY GENERATION
>> gen_vel = yes
>> gen_temp = 300
>> gen_seed = -1
>> -----------------end of prod.mdp------------------------
>> I find a reasonable DeltaGsol value of 8.6 kJ/mol for methane
>> (compared to 8.7 in Geerke & van Gunsteren, ChemPhysChem 2006, 7,
>> 671–678) but
>> I get really huge fluctuations in the values of dg/dl:
>> lambda=0.00: 5.0 +/- 10.8 (mean +/- standard deviation)
>> lambda=0.05: 4.3 +/- 11.2
>> ...
>> lambda=1.00: -0.3 +/- 4.0
>> Furthermore, each of these mean values converges very slowly (1 ns
>> seems a minimum for certain lambda values...).
>> I can't get reasonable fluctuations even if I sample more. In
>> addition,
>> there are very frequent warnings in the log file such as:
>> ----
>> Large VCM(group rest): 0.01363, 0.00818, 0.01147,
>> ekin-cm: 3.09490e+00
>> ----
>> Here are my questions:
>> 1) Does anyone have an idea of what could cause these [very] large
>> fluctuations? Does this come from my setup, or is this normal
>> behavior?
>> 2) Are these 'Large VCM(group rest)' warnings related to the use of
>> the sd integrator (when I switch to the md integrator, I no longer
>> get these warnings)?
>> Thanks for your answer,
>>
>> Patrick
>>