[gmx-users] Multi-node Replica Exchange Segfault

Mark Abraham mark.j.abraham at gmail.com
Fri Oct 30 14:50:08 CET 2015


Hi,

It looks like the segfault occurs while redoing the domain
decomposition after a successful replica exchange. We have a test case for
that (even NPT REMD), but perhaps nothing in it changes enough to hit the
problem. Can you please file an issue at http://redmine.gromacs.org and
attach your .tprs, so we can reproduce where the segfault happens?

However, I notice you've had several exchange attempts with ~zero probability,
and then suddenly a probability of 1 because the energy difference between the
two replicas has changed sign (the exchange is accepted with probability
min(1, exp(-d)), so any negative d is always accepted). That's a bit unlikely,
and the volume doesn't look like it changed much. You could try
a) NVT REMD (to see whether that has the same problem)
b) more equilibration before NPT REMD (in case the system just isn't stable
yet, though even then there'd be a problem for GROMACS to fix)
c) inspecting the energy components near the final step to see which energy
term has jumped, in case there's a bug somewhere before REMD got involved (a
quick sketch of this is below)
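
For (c), here is a minimal sketch of how one might pull the energy terms out of
one replica's .edr around the failing exchange. The directory and file names
below are placeholders for whatever your -multidir layout uses, and the -b/-e
window is taken from the "step 11000 time 22" line in your log:

    # Inspect replica 0's energy terms over the last few ps before the exchange.
    # "replica_0" and "ener.edr" are hypothetical names - adjust to your setup.
    cd replica_0
    gmx energy -f ener.edr -b 20 -e 22 -o energy_terms.xvg
    # At the interactive prompt, select e.g. Potential, Pressure and Volume,
    # then check whether any term jumps just before step 11000.

For (a), regenerating the same systems with pcoupl = no in the .mdp and a fresh
grompp pass per replica would give an NVT version of the test.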

Thanks!

Mark

On Fri, Oct 30, 2015 at 2:32 PM Barnett, James W <jbarnet4 at tulane.edu>
wrote:

> I added the -debug flag to my two-replica test. The end of my mdrun log files
> looks like this (right before it segfaults):
>
>     Replica exchange at step 11000 time 22.00000
>     Repl 0 <-> 1  dE_term = -6.536e-01 (kT)
>       dpV = -3.524e-05  d = -6.537e-01
>     Repl ex  0 x  1
>     Repl pr   1.0
>
>     Replica Exchange Order
>     Replica 0:
>     Replica 1:
>     Atom distribution over 16 domains: av 1207 stddev 26 min 1165 max 1238
>
> On Fri, 2015-10-30 at 13:08 +0000, Barnett, James W wrote:
> > Hey Mark,
> >
> > On Fri, 2015-10-30 at 08:14 +0000, Mark Abraham wrote:
> > > Hi,
> > >
> > > I've never heard of that. You could try a multisim without -replex, to
> > > help diagnose.
> >
> >
> > A multidir simulation runs without issue when -replex is omitted.
> >
> > >
> > > On Fri, 30 Oct 2015 03:33 Barnett, James W <jbarnet4 at tulane.edu>
> wrote:
> > >
> > > > Good evening here,
> > > >
> > > > I get a segmentation fault with my GROMACS 5.1 install only for replica
> > > > exchange simulations, right at the first successful exchange on a
> > > > multi-node run. Normal simulations across multiple nodes work fine, and
> > > > replica exchange simulations on one node work fine.
> > > >
> > > > I've reproduced the problem with just 2 replicas on 2 nodes with GPUs
> > > > disabled (-nb cpu). Each node has 20 CPUs, so I'm using 20 MPI ranks on
> > > > each (OpenMPI).
> > > >
> > > > I get a segfault right when the first exchange is successful.
> > > >
> > > > The only other error I sometimes get is that the InfiniBand connection
> > > > timed out retrying the communication between nodes at the exact same
> > > > moment as the segfault, but I don't get that every time, and it's
> > > > usually with all replicas going (my goal is to do 30 replicas on 120
> > > > CPUs). There are no other error logs, and mdrun's log does not indicate
> > > > an error.
> > > >
> > > > PBS log: http://bit.ly/1P8Vs49
> > > > mdrun log: http://bit.ly/1RD0ViQ
> > > >
> > > > I'm currently troubleshooting this with the sysadmin, but I wanted to
> > > > check whether anyone has had a similar issue or can suggest further
> > > > troubleshooting steps. I've also searched the mailing list and used my
> > > > Google-fu, but it has failed me so far.
> > > >
> > > > Thanks for your help.
> > > >
> >
>
> --
> James "Wes" Barnett, Ph.D. Candidate
> Louisiana Board of Regents Fellow
>
> Chemical and Biomolecular Engineering
> Tulane University
> 341-B Lindy Boggs Center for Energy and Biotechnology
> 6823 St. Charles Ave
> New Orleans, Louisiana 70118-5674
> jbarnet4 at tulane.edu

