[gmx-users] Fwd: related to bug 1222

Wed May 28 18:34:02 CEST 2014

---------- Forwarded message ----------
From: albert ardevol <albert.ardevol at gmail.com>
Date: 2014-05-28 18:26 GMT+02:00
Subject: related to bug 1222
To: gromacs.org_gmx-users at maillist.sys.kth.se, pszilard at kth.se

Dear users/developers,

  I am running replica exchange MD (REMD) with GROMACS 4.6.5 on Piz Daint
using GPU/CPU. I have two different systems.

  System 1 is small (< 16000 atoms) with 32 replicas. I am running it using
4 nodes without any problem.

  System 2 is big (< 49000 atoms) with 32 replicas too. I am running it
using 8 nodes, but after some steps, the simulation becomes unstable and
the jobs crash. Restarting from the previous checkpoint, the simulation
continues for some steps until it becomes unstable again (at a different
point) and the job crashes again. The number of steps that the job can run
until the simulation becomes unstable ranges from 5,000 to 1,170,000 steps.
The GROMACS output file gives me the following error:

-------------------------------------------------------
Program mdrun_mpi, VERSION 4.6.5
Source code file:
/apps/daint/sandbox/lucamar/src/gromacs-4.6.5/src/mdlib/pme.c, line: 851

Fatal error:
2 particles communicated to PME node 2 are more than 2/3 times the cut-off
out of the domain decomposition cell of their charge group in dimension x.
This usually means that your system is not well equilibrated.
For more information and tips for troubleshooting, please check the GROMACS
website at http://www.gromacs.org/Documentation/Errors
-------------------------------------------------------

  I found this bug report, http://redmine.gromacs.org/issues/1222, in which
they seem to have a similar problem and say that it depends on the number
of GPUs used. So I launched the job again using only 4 nodes, and it ran
without problems for 2,259,000 steps. This bug was supposed to affect
version 4.6.1 and to be fixed by version 4.6.5 (the one I am using).

  Notice that I had previously equilibrated each of the replicas
(separately, i.e. not using replica exchange) for 5,000,000 steps using 1
node per run without any problem.

  That makes me wonder whether the bug is really fixed or not.

  Best regards,
  Albert.