[gmx-users] Very large fluctuations in dg/dl
Patrick Fuchs
Patrick.Fuchs at ebgm.jussieu.fr
Wed May 9 18:12:31 CEST 2007
Hi John and David,
I reply to you both. Thank you very much for your complete answers as
well as for all the pointers you mentioned. I will have a look at these
references. Free energy calculation is definitely a challenging issue !
Cheers,
Patrick
John D. Chodera a écrit :
> Hi Patrick,
>
> I like your plots. They nicely demonstrate the difficulty of
> convergence of the estimate of <dg/dl>.
>
> It looks like there may be some other oddities in the plot, such as
> switching between conformations or some other effect that has a long
> correlation time. In particular, at lambda = 0.55, there is a big
> switch about 500 ps in.
>
> We (David Mobley and I) have found it helpful to examine the dg/dl
> timeseries plots directly to look for the potential presence of an
> initial non-equilibrated state (which would look like dg/dl spending
> time near some value A_1 and then switching to fluctuate about A_2 for
> the remainder of the simulation) or multiple states with a long
> correlation time (which would look like hops between fluctuations about
> different values). In this case, long correlation times may require you
> to run much longer.
>
> One of the best diagnostic tools besides examination of the timeseries
> is to compute the correlation time and statistical inefficiency for each
> dg/dl timeseries. That Janke article I mentioned previously describes
> how to do this, and is available online here:
>
> http://www.fz-juelich.de/nic-series/volume10/janke2.pdf
>
> Also, there is a discussion of how to do so efficiently in my recent
> paper on the analysis of parallel tempering simulations using WHAM (see
> Sections 2.4 and 5.2):
>
> http://www.dillgroup.ucsf.edu/~jchodera/pubs/pdf/replica-exchange-wham.pdf
>
> The method I described for computing the uncertainties is termed
> "correlation analysis", as it relies on computation (and integration) of
> the autocorrelation function for the timeseries. This was first applied
> to molecular dynamics simulations by Bill Swope when he was with Hans
> Andersen:
>
> W. C. Swope, H. C. Andersen, P. H. Berens, and K. R. Wilson. A computer
> simulation method for the calculation of equilibrium constants for the
> formation of physical clusters of molecules: Application to small water
> clusters. J. Chem. Phys., 76(1):637–649, 1982.
>
> (This is actually the same paper where he introduces the velocity Verlet
> integrator -- it's a good read!)
>
> Block averaging should give equivalent uncertainty estimates to the
> correlation analysis method (to about an order of magnitude) if the
> block sizes are chosen appropriately, but it usually requires either
> calculation of the statistical inefficiency to determine the block size
> first, or application of an iterative method like that of Flyvbjerg
> (cited in my WHAM paper) to determine the statistical uncertainty from
> consideration of many block sizes. Wolfhard Janke's paper does a great
> job of discussing how various methods for estimating the statistical
> uncertainty compare.
>
> Finally, if there are unequilibrated regions of your dataset, there is
> now a method to automatically determine the boundary between
> unequilibrated and equilibrated regions:
> W. Yang, R. Bitetti-Putzer, and M. Karplus. Free energy simulations: Use
> of reverse cumulative averaging to determine the equilibrated region and
> the time required for convergence. J. Chem. Phys., 120(6):2618–2628, 2004.
>
> To address your observation that some authors use only a few hundred ps
> of simulation time, while your system seems to require at least 1 ns:
> Different systems have different correlation times and variances of
> dg/dl, both of which affect the statistical uncertainty. Some systems
> are much "easier" to converge than others. David Mobley has found that
> application of restraints in free energy calculations which are later
> removed can actually transform a "very hard" problem into an "easy"
> problem -- see the publication below. But the basic answer is that this
> is why it is extremely important to *always* compute the correlation
> time for whatever it is you are averaging -- this may be very different
> from system to system, or even from lambda value to lambda value.
>
> D. L. Mobley, J. D. Chodera, and K. A. Dill, "Confine and Release:
> Obtaining Correct Binding Free Energies in the Presence of Protein
> Conformational Change", accepted, Journal of Chemical Theory and
> Computation.
> http://www.dillgroup.ucsf.edu/~dmobley/papers/flex.pdf
>
> Best of luck!
>
> - John
>
> --
> John Chodera <jchodera at gmail.com> | Mobile : 415 867-7384
> Postdoctoral researcher, Pande lab | Lab phone : 650.723.1097
> Department of Chemistry, Stanford University | Lab fax : 650.724.4021
> http://www.dillgroup.ucsf.edu/~jchodera
>
>
> On May 8, 2007, at 3.20 AM, Patrick Fuchs wrote:
>
>> Hi John,
>> thanks a lot for your reply.
>> Indeed, the standard deviation I presented in my previous post is the
>> one of dg/dl samples. I was just surprised by the fact the std. dev.
>> is always larger than the value itself (since I'm starting with FE
>> calculation I had no expectation of what the behavior would be and
>> needed a confirmation).
>> I was also suprised about the convergence of the mean. I put a plot
>> for each lambda value at the URL:
>> http://condor.ebgm.jussieu.fr/~fuchs/download/convergence.png
>> (at each time step, I recalculate the mean from the beginning). My
>> first observation was that 1 ns seemed to be a minimum for certain
>> lambda values
>> (e.g. lambda=0.70). I sometimes read in literature that some authors
>> used a few hundreds of ps, which seemed (to me) not sufficient for
>> proper convergence.
>> Now, if we come back to the error estimate of the mean, I found (for
>> lambda=0.00) 0.2 kJ/mol using block averaging (using the -ee option of
>> g_analyze), which is reasonable I imagine (even if higher precisions
>> have been described in literature). I'm not sure whether this is the same
>> way of calculating the uncertainty compared to what you proposed.
>> Can you confirm?
>> I will have a look to the book you mentioned, thanks for the pointer.
>> Cheers,
>>
>> Patrick
>>
>> On Mon, 7 May 2007, John D. Chodera wrote:
>>
>>> Hi Patrick,
>>>
>>>> I find a reasonable DeltaGsol value of 8.6 kJ/mol for methane (compared
>>>> to 8.7 in Geerke & van Gunsteren, ChemPhysChem 2006, 7, 671 ? 678) but
>>>> I get really huge fluctuations in the values of dg/dl:
>>>> lambda=0.00: 5.0 +/- 10.8 (mean +/- standard deviation)
>>>> lambda=0.05: 4.3 +/- 11.2
>>>
>>> Is the standard deviation you quote here the standard deviation of
>>> the dg/dl samples, or the standard deviation of the mean?
>>>
>>> If the former, then this behavior is totally expected: While the
>>> standard deviation of a random variable (your dg/dl samples) may be
>>> large, with enough sampling, we can get a very precise estimate of
>>> the mean. More sampling will not change the standard deviation of
>>> the dg/dl samples, but it will reduce the standard error in the mean,
>>> which is what we need for precise estimates of free energy differences.
>>>
>>> The uncertainty in the estimate of <dg/dl> is given simply by
>>>
>>> d<dg/dl> = sigma / sqrt(N / g)
>>>
>>> where here, sigma is the standard deviation of your dg/dl samples, N
>>> is the number of data points you have collected, and g is something
>>> called the "statistical inefficiency", which can be estimated from
>>> the correlation time of your dg/dl samples. More information on this
>>> sort of analysis can be found in reference [1] below.
>>>
>>> Once you have the uncertainty in each estimate of <dg/dl>, you still
>>> have to combine these to get the uncertainty estimate for the
>>> integrated free energy difference using standard propagation of
>>> error. This depends on your choice of quadrature for TI. David
>>> Mobley has done a lot of this, and I'm sure would be willing to help
>>> if you had trouble figuring it out.
>>>
>>> Good luck!
>>>
>>> - John
>>>
>>> [1] W. Janke. Statistical analysis of simulations: Data correlations
>>> and error estimation. In J. Grotendorst, D. Marx, and A. Murmatsu,
>>> editors, Quantum Simulations of Complex Many-Body Systems: From
>>> Theory to Algorithms, volume 10, pages 423?445. John von Neumann
>>> Institute for Computing, 2002.
>>>
>>> --
>>> John Chodera <jchodera at gmail.com> | Mobile : 415 867-7384
>>> Postdoctoral researcher, Pande lab | Lab phone : 650.723.1097
>>> Department of Chemistry, Stanford University | Lab fax : 650.724.4021
>>> http://www.dillgroup.ucsf.edu/~jchodera
>>> --
>>>
>>> Date: Mon, 07 May 2007 16:41:26 +0200
>>> From: Patrick Fuchs <Patrick.Fuchs at ebgm.jussieu.fr>
>>> Subject: [gmx-users] Very large fluctuations in dg/dl
>>> To: gmx-users at gromacs.org
>>> Message-ID: <463F3A96.9050206 at ebgm.jussieu.fr>
>>> Content-Type: text/plain; charset=windows-1252; format=flowed
>>>
>>> Hi Gromacs users,
>>> I have a few questions related to solvation free energy calculation via
>>> thermodynamic integration.
>>> I'm trying to reproduce some literature data (on e.g. methane,
>>> methanol...) using the GROMOS G53a6 force field. I followed the tutorial
>>> of David Mobley (thanks to him BTW), but I used the standard non bonded
>>> options of the G53a6 force field (instead of OPLS). For each lambda
>>> value I do a minimization, a 10 ps NVT followed by a 20 ps NPT
>>> equilibration, and a 1 ns NVT production using the sd integrator. I used
>>> 21 lambda values (0.00, 0.05...1.00).
>>> Here's my topology file:
>>> ----------------begining of methane.top------------------------
>>> ; topology for a methane molecule
>>>
>>> ; include GROMOS53a6 force field
>>> #include "ffG53a6.itp"
>>>
>>> ;;;;;;; begin methane definition ;;;;;;;
>>> [ moleculetype ]
>>> ; Name nrexcl
>>> METH 3
>>>
>>> [ atoms ]
>>> ;nr type resnr residue atom cgnr charge mass typeB chargeB massB
>>> 1 CH4 1 METH C1 0 0.0000 16.0430 DUM 0.0000 16.04300
>>> ;;;;;; end methane definition ;;;;;;;;
>>>
>>> ; include water topology
>>> #ifdef FLEX_SPC
>>> #include "flexspc.itp"
>>> #else
>>> #include "spc.itp"
>>> #endif
>>>
>>> [ system ]
>>> ; name
>>> 1 methane molecule in water
>>>
>>> [ molecules ]
>>> ; name number
>>> METH 1
>>> SOL 893
>>> -----------------end of methane.top------------------------
>>>
>>> And here is my mdp file for lambda=0:
>>> ---------------begining of prod.mdp---------------------
>>> title = production NVT methane/water
>>> cpp = /lib/cpp
>>> ; OPTIONS FOR BOND CONSTRAINTS
>>> constraints = all-bonds
>>> ; RUN CONTROL PARAMETERS
>>> integrator = sd
>>> tinit = 0
>>> dt = 0.002
>>> nsteps = 500000 ; 1000 ps
>>> ; NUMBER OF STEPS FOR CENTER OF MASS MOTION REMOVAL
>>> nstcomm = 100
>>> ; OUTPUT CONTROL OPTIONS
>>> nstxout = 500
>>> nstvout = 500
>>> nstfout = 0
>>> nstlog = 500
>>> nstenergy = 100
>>> nstxtcout = 5000
>>> xtc-precision = 1000
>>> ; NON BONDED STUFF
>>> ns_type = grid
>>> nstlist = 5
>>> rlist = 0.8
>>> coulombtype = generalized-reaction-field
>>> rcoulomb = 1.4
>>> rvdw = 1.4
>>> epsilon_rf = 54.0
>>> ;OPTIONS FOR TEMPERATURE COUPLING
>>> tc_grps = system
>>> tau_t = 0.1 ; inverse langevin friction cst
>>> ref_t = 300
>>> ;OPTIONS FOR PRESSURE COUPLING
>>> Pcoupl = no
>>> tau_p = 0.5
>>> compressibility = 4.5e-5
>>> ref_p = 1.0
>>> ; FREE ENERGY CONTROL STUFF
>>> free_energy = yes
>>> init_lambda = 0.00
>>> delta_lambda = 0
>>> sc_alpha = 0.5
>>> sc-power = 1.0
>>> sc-sigma = 0.3
>>> ; VELOCITY GENERATION
>>> gen_vel = yes
>>> gen_temp = 300
>>> gen_seed = -1
>>> -----------------end of prod.mdp------------------------
>>> I find a reasonable DeltaGsol value of 8.6 kJ/mol for methane (compared
>>> to 8.7 in Geerke & van Gunsteren, ChemPhysChem 2006, 7, 671 ? 678) but
>>> I get really huge fluctuations in the values of dg/dl:
>>> lambda=0.00: 5.0 +/- 10.8 (mean +/- standard deviation)
>>> lambda=0.05: 4.3 +/- 11.2
>>> ...
>>> lambda=1.00: -0.3 +/- 4.0
>>> Furthermore, each of these mean value is very slow at converging (1 ns
>>> seems a minimum for certain lambda values...).
>>> I can't get reasonable fluctuations even if I sample more. In addition,
>>> there are very frequent warnings in the log file such as:
>>> ----
>>> Large VCM(group rest): 0.01363, 0.00818, 0.01147,
>>> ekin-cm: 3.09490e+00
>>> ----
>>> Here are my questions:
>>> 1) Has someone an idea of what could be the cause of these [very] large
>>> fluctuations? Does it come from my setup, or is this a normal behavior?
>>> 2) Are these 'Large VCM(group rest)' warnings related to the use of sd
>>> integrator (when I switch to md integrator, I no longer get these
>>> warnings) ?
>>> Thanks for your answer,
>>>
>>> Patrick
>>>
>
>
>
>
--
_______________________________________________________
Patrick FUCHS
Equipe de Bioinformatique Genomique et Moleculaire
INSERM U726, Universite Paris 7
Case Courrier 7113
2, place Jussieu, 75251 Paris Cedex 05, FRANCE
Tel : +33 (0)1-44-27-77-16 - Fax : +33 (0)1-43-26-38-30
E-mail : Patrick.Fuchs at ebgm.jussieu.fr
Web Site: http://www.ebgm.jussieu.fr/~fuchs
More information about the gromacs.org_gmx-users
mailing list