[gmx-users] Re: Possible bug: energy changes with the number of nodes for energy minimization

Wed May 30 17:51:46 CEST 2012

Hi Justin and Mark,

Thanks once again for getting back.

I've found the problem - it's actually a known bug already:

http://redmine.gromacs.org/issues/901

The dispersion correction is multiplied by the number of processes (I found
this after taking a closer look at my md.log files to see where the energy
was changing)! I guess this means I should use the serial version for
meaningful binding energies. It also looks like it will be fixed for
version 4.5.6.
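
As a rough check of that explanation, the thread-count energies quoted
below can be differenced. This is only a sketch: it assumes the total energy
has the form E(N) = E_rest + N * E_disp, where E_rest and E_disp are
illustrative names (not GROMACS quantities) for the correctly computed part
and the dispersion correction, and it assumes the values are in GROMACS's
default kJ/mol. If the correction is counted once per process, each
additional thread should shift the energy by the same constant.

```python
# Energies copied from the "#threads" table quoted later in this message.
threads = [1, 2, 3, 4]
energy = [
    -2.41936409202696e+04,
    -2.43726425776792e+04,
    -2.45516442350804e+04,
    -2.47306458924815e+04,
]

# Change in total energy for each additional thread.
steps = [energy[i + 1] - energy[i] for i in range(len(energy) - 1)]
for n, step in zip(threads[1:], steps):
    print(f"{n - 1} -> {n} threads: {step:.6f}")

# Under the assumed form E(N) = E_rest + N * E_disp, every step equals
# E_disp, so the spread between steps should be negligible.
spread = max(steps) - min(steps)
mean_step = sum(steps) / len(steps)
print(f"mean step: {mean_step:.6f}, spread between steps: {spread:.2e}")
```

With these numbers each step comes out at roughly -179.0017, constant to
well under 1e-6, which is what a term counted once per thread would produce
and not what rounding noise would look like. The first three MPI values in
the table follow the same step; the 4-process MPI value does not, and this
check says nothing about why.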

Thank you again, I really appreciate your help.

Steve

> On 30/05/2012 9:42 PM, Stephen Cox wrote:
> > Hi Justin,
> >
> > Thanks for getting back and posting the links.
> >
> >
> >     On 5/29/12 6:22 AM, Stephen Cox wrote:
> >     > Hi,
> >     >
> >     > I'm running a number of energy minimizations on a clathrate
> >     > supercell and I get quite significantly different values for the
> >     > total energy depending on the number of mpi processes / number of
> >     > threads I use. More specifically, some numbers I get are:
> >     >
> >     > #cores      energy
> >     > 1        -2.41936409202696e+04
> >     > 2        -2.43726425776809e+04
> >     > 3        -2.45516442350804e+04
> >     > 4        -2.47003944216983e+04
> >     >
> >     > #threads    energy
> >     > 1        -2.41936409202696e+04
> >     > 2        -2.43726425776792e+04
> >     > 3        -2.45516442350804e+04
> >     > 4        -2.47306458924815e+04
> >     >
> >     >
> >     > I'd expect some numerical noise, but these differences seem too
> >     > large for that.
> >
> >     The difference is only 2%, which by MD standards, is quite good,
> >     I'd say ;)
> >     Consider the discussion here:
> >
> >
> > I agree for MD this wouldn't be too bad, but I'd expect energy
> > minimization to get very close to the same local minimum from a given
> > starting configuration. The thing is I want to compute a binding curve
> > for my clathrate and compare to DFT values for the binding energy
> > (amongst other things), and the difference in energy between different
> > number of cores is rather significant for this purpose.
>
> Given the usual roughness of the PE surface to which you are minimizing,
> some variation in end point is expected.
>

> >
> > Furthermore, if I only calculate the energy for nsteps = 0 (i.e. a
> > single point energy for identical structures) I get the same trend as
> > above (both mpi/openmp with domain/particle decomposition). Surely
> > there shouldn't be such a large difference in energy for a single
> > point calculation?
>
> nsteps = 0 is not strictly a single-point energy, since the constraints
> act before EM step 0. mdrun -s -rerun will give a single point. This
> probably won't change your observations. You should also be sure you're
> making observations with the latest release (4.5.5).
>

> If you can continue to observe this trend for more processors
> (overallocating?), then you may have evidence of a problem - but a full
> system description and an .mdp file will be in order also.
>
> Mark

>
> >
> >     http://www.gromacs.org/Documentation/Terminology/Reproducibility
> >
> >     To an extent, the information here may also be relevant:
> >
> >
> >     http://www.gromacs.org/Documentation/How-tos/Extending_Simulations#Exact_vs_binary_identical_continuation
> >
> >     > Before submitting a bug report, I'd like to check:
> >     > a) if someone has seen something similar;
> >
> >     Sure.  Energies can be different due to a whole host of factors
> >     (discussed above), and MPI only complicates matters.
> >
> >     > b) should I just trust the serial version?
> >
> >     Maybe, but I don't know that there's evidence to say that any of
> >     the above tests are more or less accurate than the others.  What
> >     happens if you run with mdrun -reprod on all your tests?
> >
> >
> > Running with -reprod produces the same trend as above. If it was
> > numerical noise, I would have thought that the numbers would fluctuate
> > around some average value, not follow a definite trend where the
> > energy decreases with the number of cores/threads...
> >
> >
> >     > c) have I simply done something stupid (grompp.mdp appended below);
> >     >
> >
> >     Nope, looks fine.
> >
> >     -Justin
> >
> > Thanks again for getting back to me.
> >
> >
>
> ------------------------------
>
> Message: 2
> Date: Wed, 30 May 2012 07:51:02 -0400
> From: "Justin A. Lemkul" <jalemkul at vt.edu>
> Subject: Re: [gmx-users] Re: Possible bug: energy changes with the
>        number of nodes for energy minimization
> To: Discussion list for GROMACS users <gmx-users at gromacs.org>
> Message-ID: <4FC609A6.4090702 at vt.edu>
> Content-Type: text/plain; charset=ISO-8859-1; format=flowed
>
>
>
> On 5/30/12 7:42 AM, Stephen Cox wrote:
> > Hi Justin,
> >
> > Thanks for getting back and posting the links.
> >
> >
> >     On 5/29/12 6:22 AM, Stephen Cox wrote:
> >      > Hi,
> >      >
> >      > I'm running a number of energy minimizations on a clathrate
> >      > supercell and I get quite significantly different values for the
> >      > total energy depending on the number of mpi processes / number of
> >      > threads I use. More specifically, some numbers I get are:
> >      >
> >      > #cores      energy
> >      > 1        -2.41936409202696e+04
> >      > 2        -2.43726425776809e+04
> >      > 3        -2.45516442350804e+04
> >      > 4        -2.47003944216983e+04
> >      >
> >      > #threads    energy
> >      > 1        -2.41936409202696e+04
> >      > 2        -2.43726425776792e+04
> >      > 3        -2.45516442350804e+04
> >      > 4        -2.47306458924815e+04
> >      >
> >      >
> >      > I'd expect some numerical noise, but these differences seem too
> >      > large for that.
> >
> >     The difference is only 2%, which by MD standards, is quite good, I'd
> >     say ;)
> >     Consider the discussion here:
> >
> >
> > I agree for MD this wouldn't be too bad, but I'd expect energy
> > minimization to get very close to the same local minimum from a given
> > starting configuration. The thing is I want to compute a binding curve
> > for my clathrate and compare to DFT values for the binding energy
> > (amongst other things), and the difference in energy between different
> > number of cores is rather significant for this purpose.
> >
>
> I think the real issue comes down to how you're going to calculate binding
> energy.  I would still expect that with sufficient MD sampling, the
> differences should be small or statistically insignificant given the nature
> of MD calculations.  EM will likely be very sensitive to the nature of how
> it is run (MPI vs. serial, etc) since even the tiny rounding errors and
> other factors described below will cause changes in how the EM algorithm
> proceeds.  For most purposes, such differences are irrelevant as EM is only
> a preparatory step for more intense calculations.
>

> > Furthermore, if I only calculate the energy for nsteps = 0 (i.e. a
> > single point energy for identical structures) I get the same trend as
> > above (both mpi/openmp with domain/particle decomposition). Surely
> > there shouldn't be such a large difference in energy for a single
> > point calculation?
> >
>
> That depends.  Are you using the same .mdp file, just setting "nsteps =
> 0"?  If so, that's not a good test.  EM algorithms will make a change at
> step 0, the magnitude of which will again reflect the differences you're
> seeing.  If you use the md integrator with a zero-step evaluation, that's
> a better test.
>

> -Justin
>
>
>