[gmx-users] Very large fluctuations in dg/dl

Wed May 9 18:12:31 CEST 2007

Hi John and David,
I reply to you both. Thank you very much for your complete answers as 
well as for all the pointers you mentioned. I will have a look at these 
references. Free energy calculation is definitely a challenging issue !
Cheers,

Patrick

John D. Chodera a écrit :
> Hi Patrick,
> 
> I like your plots.  They nicely demonstrate the difficulty of 
> convergence of the estimate of <dg/dl>.
> 
> It looks like there may be some other oddities in the plot, such as 
> switching between conformations or some other effect that has a long 
> correlation time.  In particular, at lambda = 0.55, there is a big 
> switch about 500 ps in.
> 
> We (David Mobley and I) have found it helpful to examine the dg/dl 
> timeseries plots directly to look for the potential presence of an 
> initial non-equilibrated state (which would look like dg/dl spending 
> time near some value A_1 and then switching to fluctuate about A_2 for 
> the remainder of the simulation) or multiple states with a long 
> correlation time (which would look like hops between fluctuations about 
> different values).  In this case, long correlation times may require you 
> to run much longer.
> 
> One of the best diagnostic tools besides examination of the timeseries 
> is to compute the correlation time and statistical inefficiency for each 
> dg/dl timeseries.  That Janke article I mentioned previously describes 
> how to do this, and is available online here:
> 
> http://www.fz-juelich.de/nic-series/volume10/janke2.pdf
> 
> Also, there is a discussion of how to do so efficiently in my recent 
> paper on the analysis of parallel tempering simulations using WHAM (see 
> Sections 2.4 and 5.2):
> 
> http://www.dillgroup.ucsf.edu/~jchodera/pubs/pdf/replica-exchange-wham.pdf
> 
> The method I described for computing the uncertainties is termed 
> "correlation analysis", as it relies on computation (and integration) of 
> the autocorrelation function for the timeseries.  This was first applied 
> to molecular dynamics simulations by Bill Swope when he was with Hans 
> Andersen:
> 
> W. C. Swope, H. C. Andersen, P. H. Berens, and K. R. Wilson. A computer 
> simulation method for the calculation of equilibrium constants for the 
> formation of physical clusters of molecules: Application to small water 
> clusters. J. Chem. Phys., 76(1):637–649, 1982.
> 
> (This is actually the same paper where he introduces the velocity Verlet 
> integrator -- it's a good read!)
> 
> Block averaging should give equivalent uncertainty estimates to the 
> correlation analysis method (to about an order of magnitude) if the 
> block sizes are chosen appropriately, but it usually requires either 
> calculation of the statistical inefficiency to determine the block size 
> first, or application of an iterative method like that of Flyvbjerg 
> (cited in my WHAM paper) to determine the statistical uncertainty from 
> consideration of many block sizes.  Wolfhard Janke's paper does a great 
> job of discussing how various methods for estimating the statistical 
> uncertainty compare.
> 
> Finally, if there are unequilibrated regions of your dataset, there is 
> now a method to automatically determine the boundary between 
> unequilibrated and equilibrated regions:
> W. Yang, R. Bitetti-Putzer, and M. Karplus. Free energy simulations: Use 
> of reverse cumulative averaging to determine the equilibrated region and 
> the time required for convergence. J. Chem. Phys., 120(6):2618–2628, 2004.
> 
> To address your observation that some authors use only a few hundred ps 
> of simulation time, while your system seems to require at least 1 ns: 
> Different systems have different correlation times and variances of 
> dg/dl, both of which affect the statistical uncertainty. Some systems 
> are much "easier" to converge than others.  David Mobley has found that 
> application of restraints in free energy calculations which are later 
> removed can actually transform a "very hard" problem into an "easy" 
> problem -- see the publication below.  But the basic answer is that this 
> is why it is extremely important to *always* compute the correlation 
> time for whatever it is you are averaging -- this may be very different 
> from system to system, or even from lambda value to lambda value.
> 
> D. L. Mobley, J. D. Chodera, and K. A. Dill, "Confine and Release: 
> Obtaining Correct Binding Free Energies in the Presence of Protein 
> Conformational Change", accepted, Journal of Chemical Theory and 
> Computation.
> http://www.dillgroup.ucsf.edu/~dmobley/papers/flex.pdf
> 
> Best of luck!
> 
> - John
> 
> -- 
> John Chodera <jchodera at gmail.com>             | Mobile    : 415 867-7384
> Postdoctoral researcher, Pande lab            | Lab phone : 650.723.1097
> Department of Chemistry, Stanford University  | Lab fax   : 650.724.4021
> http://www.dillgroup.ucsf.edu/~jchodera
> 
> 
> On May 8, 2007, at 3.20 AM, Patrick Fuchs wrote:
> 
>> Hi John,
>> thanks a lot for your reply.
>> Indeed, the standard deviation I presented in my previous post is the 
>> one of dg/dl samples. I was just surprised by the fact the std. dev. 
>> is always larger than the value itself (since I'm starting with FE 
>> calculation I had no expectation of what the behavior would be and 
>> needed a confirmation).
>> I was also suprised about the convergence of the mean. I put a plot 
>> for each lambda value at the URL:
>> http://condor.ebgm.jussieu.fr/~fuchs/download/convergence.png
>> (at each time step, I recalculate the mean from the beginning). My 
>> first observation was that 1 ns seemed to be a minimum for certain 
>> lambda values
>> (e.g. lambda=0.70). I sometimes read in literature that some authors 
>> used a few hundreds of ps, which seemed (to me) not sufficient for 
>> proper convergence.
>> Now, if we come back to the error estimate of the mean, I found (for 
>> lambda=0.00) 0.2 kJ/mol using block averaging (using the -ee option of 
>> g_analyze), which is reasonable I imagine (even if higher precisions 
>> have been described in literature). I'm not sure whether this is the same
>> way of calculating the uncertainty compared to what you proposed.
>> Can you confirm?
>> I will have a look to the book you mentioned, thanks for the pointer.
>> Cheers,
>>
>> Patrick
>>
>> On Mon, 7 May 2007, John D. Chodera wrote:
>>
>>> Hi Patrick,
>>>
>>>> I find a reasonable DeltaGsol value of 8.6 kJ/mol for methane (compared
>>>> to 8.7 in Geerke & van Gunsteren, ChemPhysChem 2006, 7, 671 ? 678) but
>>>> I get really huge fluctuations in the values of dg/dl:
>>>> lambda=0.00: 5.0 +/- 10.8 (mean +/- standard deviation)
>>>> lambda=0.05: 4.3 +/- 11.2
>>>
>>> Is the standard deviation you quote here the standard deviation of 
>>> the dg/dl samples, or the standard deviation of the mean?
>>>
>>> If the former, then this behavior is totally expected: While the 
>>> standard deviation of a random variable (your dg/dl samples) may be 
>>> large, with enough sampling, we can get a very precise estimate of 
>>> the mean.  More sampling will not change the standard deviation of 
>>> the dg/dl samples, but it will reduce the standard error in the mean, 
>>> which is what we need for precise estimates of free energy differences.
>>>
>>> The uncertainty in the estimate of <dg/dl> is given simply by
>>>
>>> d<dg/dl> = sigma / sqrt(N / g)
>>>
>>> where here, sigma is the standard deviation of your dg/dl samples, N 
>>> is the number of data points you have collected, and g is something 
>>> called the "statistical inefficiency", which can be estimated from 
>>> the correlation time of your dg/dl samples.  More information on this 
>>> sort of analysis can be found in reference [1] below.
>>>
>>> Once you have the uncertainty in each estimate of <dg/dl>, you still 
>>> have to combine these to get the uncertainty estimate for the 
>>> integrated free energy difference using standard propagation of 
>>> error.  This depends on your choice of quadrature for TI.  David 
>>> Mobley has done a lot of this, and I'm sure would be willing to help 
>>> if you had trouble figuring it out.
>>>
>>> Good luck!
>>>
>>> - John
>>>
>>> [1] W. Janke. Statistical analysis of simulations: Data correlations 
>>> and error estimation. In J. Grotendorst, D. Marx, and A. Murmatsu, 
>>> editors, Quantum Simulations of Complex Many-Body Systems: From 
>>> Theory to Algorithms, volume 10, pages 423?445. John von Neumann 
>>> Institute for Computing, 2002.
>>>
>>> -- 
>>> John Chodera <jchodera at gmail.com>             | Mobile    : 415 867-7384
>>> Postdoctoral researcher, Pande lab            | Lab phone : 650.723.1097
>>> Department of Chemistry, Stanford University  | Lab fax   : 650.724.4021
>>> http://www.dillgroup.ucsf.edu/~jchodera
>>> -- 
>>>
>>> Date: Mon, 07 May 2007 16:41:26 +0200
>>> From: Patrick Fuchs <Patrick.Fuchs at ebgm.jussieu.fr>
>>> Subject: [gmx-users] Very large fluctuations in dg/dl
>>> To: gmx-users at gromacs.org
>>> Message-ID: <463F3A96.9050206 at ebgm.jussieu.fr>
>>> Content-Type: text/plain; charset=windows-1252; format=flowed
>>>
>>> Hi Gromacs users,
>>> I have a few questions related to solvation free energy calculation via
>>> thermodynamic integration.
>>> I'm trying to reproduce some literature data (on e.g. methane,
>>> methanol...) using the GROMOS G53a6 force field. I followed the tutorial
>>> of David Mobley (thanks to him BTW), but I used the standard non bonded
>>> options of the G53a6 force field (instead of OPLS). For each lambda
>>> value I do a minimization, a 10 ps NVT followed by a 20 ps NPT
>>> equilibration, and a 1 ns NVT production using the sd integrator. I used
>>> 21 lambda values (0.00, 0.05...1.00).
>>> Here's my topology file:
>>> ----------------begining of methane.top------------------------
>>> ; topology for a methane molecule
>>>
>>> ; include GROMOS53a6 force field
>>> #include "ffG53a6.itp"
>>>
>>> ;;;;;;; begin methane definition ;;;;;;;
>>> [ moleculetype ]
>>> ; Name           nrexcl
>>> METH             3
>>>
>>> [ atoms ]
>>> ;nr type resnr residue atom cgnr charge mass    typeB chargeB massB
>>> 1  CH4  1     METH    C1   0    0.0000 16.0430 DUM   0.0000  16.04300
>>> ;;;;;; end methane definition ;;;;;;;;
>>>
>>> ; include water topology
>>> #ifdef FLEX_SPC
>>> #include "flexspc.itp"
>>> #else
>>> #include "spc.itp"
>>> #endif
>>>
>>> [ system ]
>>> ; name
>>> 1 methane molecule in water
>>>
>>> [ molecules ]
>>> ; name  number
>>> METH    1
>>> SOL     893
>>> -----------------end of methane.top------------------------
>>>
>>> And here is my mdp file for lambda=0:
>>> ---------------begining of prod.mdp---------------------
>>> title               = production NVT methane/water
>>> cpp                 = /lib/cpp
>>> ; OPTIONS FOR BOND CONSTRAINTS
>>> constraints         = all-bonds
>>> ; RUN CONTROL PARAMETERS
>>> integrator          = sd
>>> tinit               = 0
>>> dt                  = 0.002
>>> nsteps              = 500000 ; 1000 ps
>>> ; NUMBER OF STEPS FOR CENTER OF MASS MOTION REMOVAL
>>> nstcomm             = 100
>>> ; OUTPUT CONTROL OPTIONS
>>> nstxout                  = 500
>>> nstvout                  = 500
>>> nstfout                  = 0
>>> nstlog                   = 500
>>> nstenergy                = 100
>>> nstxtcout                = 5000
>>> xtc-precision            = 1000
>>> ; NON BONDED STUFF
>>> ns_type             =  grid
>>> nstlist                = 5
>>> rlist                  = 0.8
>>> coulombtype            = generalized-reaction-field
>>> rcoulomb               = 1.4
>>> rvdw                   = 1.4
>>> epsilon_rf             = 54.0
>>> ;OPTIONS FOR TEMPERATURE COUPLING
>>> tc_grps                  = system
>>> tau_t                    = 0.1 ; inverse langevin friction cst
>>> ref_t                    = 300
>>> ;OPTIONS FOR PRESSURE COUPLING
>>> Pcoupl                   = no
>>> tau_p                    = 0.5
>>> compressibility          = 4.5e-5
>>> ref_p                    = 1.0
>>> ; FREE ENERGY CONTROL STUFF
>>> free_energy              = yes
>>> init_lambda              = 0.00
>>> delta_lambda             = 0
>>> sc_alpha                 = 0.5
>>> sc-power                 = 1.0
>>> sc-sigma                 = 0.3
>>> ; VELOCITY GENERATION
>>> gen_vel                  = yes
>>> gen_temp                 = 300
>>> gen_seed                 = -1
>>> -----------------end of prod.mdp------------------------
>>> I find a reasonable DeltaGsol value of 8.6 kJ/mol for methane (compared
>>> to 8.7 in Geerke & van Gunsteren, ChemPhysChem 2006, 7, 671 ? 678) but
>>> I get really huge fluctuations in the values of dg/dl:
>>> lambda=0.00: 5.0 +/- 10.8 (mean +/- standard deviation)
>>> lambda=0.05: 4.3 +/- 11.2
>>> ...
>>> lambda=1.00: -0.3 +/- 4.0
>>> Furthermore, each of these mean value is very slow at converging (1 ns
>>> seems a minimum for certain lambda values...).
>>> I can't get reasonable fluctuations even if I sample more. In addition,
>>> there are very frequent warnings in the log file such as:
>>> ----
>>> Large VCM(group rest):      0.01363,      0.00818,      0.01147,
>>> ekin-cm:  3.09490e+00
>>> ----
>>> Here are my questions:
>>> 1) Has someone an idea of what could be the cause of these [very] large
>>> fluctuations? Does it come from my setup, or is this a normal behavior?
>>> 2) Are these 'Large VCM(group rest)' warnings related to the use of sd
>>> integrator (when I switch to md integrator, I no longer get these
>>> warnings) ?
>>> Thanks for your answer,
>>>
>>> Patrick
>>>
> 
> 
> 
> 

-- 
_______________________________________________________
Patrick FUCHS
Equipe de Bioinformatique Genomique et Moleculaire
INSERM U726, Universite Paris 7
Case Courrier 7113
2, place Jussieu, 75251 Paris Cedex 05, FRANCE
Tel : +33 (0)1-44-27-77-16 - Fax : +33 (0)1-43-26-38-30
E-mail : Patrick.Fuchs at ebgm.jussieu.fr
Web Site: http://www.ebgm.jussieu.fr/~fuchs