[gmx-users] Multi-node Replica Exchange Segfault

Barnett, James W jbarnet4 at tulane.edu
Fri Oct 30 20:38:15 CET 2015


On Fri, 2015-10-30 at 13:49 +0000, Mark Abraham wrote:
> Hi,
> 
> That looks like the segfault is coming while re-doing the domain
> decomposition after successful replica exchange. We have a test case for
> that (even NPT REMD), but perhaps nothing in it changes enough to hit the
> problem. Can you please file an issue at http://redmine.gromacs.org and
> attach your .tprs, so we can reproduce where the segfault happens?

Thanks, I'll file a bug report.

> 
> However, I notice you've had several attempts with ~zero probability, and
> then suddenly 1 because the energy difference between the two replicas has
> changed sign. That's a bit unlikely. Volume doesn't look like it changed
> much. You could try
> a) NVT REMD (to see if that has the problem)

I get the same issue with NVT: a segfault right at the first successful exchange.
The energy difference between the two replicas changed sign here again as well.
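
For reference, my understanding of the criterion behind the "Repl pr" line (my
reading, not something taken from the code) is the usual Metropolis test: the
exchange is accepted with probability

    P_acc = \min(1, e^{-\Delta}), where
    \Delta = (\beta_1 - \beta_2)(U_2 - U_1) + (\beta_1 P_1 - \beta_2 P_2)(V_2 - V_1)

and the second term is the dpV contribution that vanishes for NVT. So once the
energy-difference term flips sign, \Delta goes negative and the probability is
clamped to 1, which matches the "Repl pr   1.0" in the log quoted below.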

> b) more equilibration before NPT REMD (in case life is just not stable yet,
> but still there'd be a problem for GROMACS to fix)

I will try this and report if it helps on the bug report.
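
In case it helps anyone following along, the plan is just to run a longer plain
NPT equilibration for each replica before turning -replex back on, along these
lines (the file names here are placeholders, not my actual setup):

    # extend each replica's equilibration by another 1000 ps, then continue it
    gmx convert-tpr -s equil.tpr -extend 1000 -o equil_ext.tpr
    gmx mdrun -s equil_ext.tpr -cpi equil.cpt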

> c) inspecting the energy components near the final step to see what energy
> component has jumped (in case there's a bug somewhere before REMD got
> involved)

Looking through the energy file with gmx energy, nothing seems to stand out near the final step.
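
For the record, this is roughly how I've been pulling the components out (the
file name, term names, and the -b window are just examples):

    # dump a few energy terms for the last couple of ps before the exchange at t = 22 ps
    echo "Potential Pressure Volume" | gmx energy -f ener.edr -b 20 -o last_steps.xvg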

I originally thought replica exchange was running successfully on a single node,
but after further testing I get the segmentation fault there as well, as soon as
it reaches a successful exchange.
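
For completeness, the launch lines I've been testing look roughly like this (the
binary name, replica directory names, and -replex interval are placeholders, not
the exact values from the runs above):

    # two replicas on one node, GPUs disabled
    mpirun -np 20 gmx_mpi mdrun -multidir sim0 sim1 -replex 500 -nb cpu

    # two replicas across two nodes, 20 MPI ranks per node (host list from PBS)
    mpirun -np 40 gmx_mpi mdrun -multidir sim0 sim1 -replex 500 -nb cpu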

> 
> 
> On Fri, Oct 30, 2015 at 2:32 PM Barnett, James W <jbarnet4 at tulane.edu>
> wrote:
> 
> > I added the -debug flag to my two-replica test. The end of my mdrun log
> > files looks like this (right before it segfaults):
> > 
> >     Replica exchange at step 11000 time 22.00000
> >     Repl 0 <-> 1  dE_term = -6.536e-01 (kT)
> >       dpV = -3.524e-05  d = -6.537e-01
> >     Repl ex  0 x  1
> >     Repl pr   1.0
> > 
> >     Replica Exchange Order
> >     Replica 0:
> >     Replica 1:
> >     Atom distribution over 16 domains: av 1207 stddev 26 min 1165 max 1238
> > 
> > On Fri, 2015-10-30 at 13:08 +0000, Barnett, James W wrote:
> > > Hey Mark,
> > > 
> > > On Fri, 2015-10-30 at 08:14 +0000, Mark Abraham wrote:
> > > > Hi,
> > > > 
> > > > I've never heard of such. You could try a multisim without -replex, to
> > > > help diagnose.
> > > 
> > > 
> > > A multidir simulation runs without issue when -replex is omitted.
> > > 
> > > > 
> > > > On Fri, 30 Oct 2015 03:33 Barnett, James W <jbarnet4 at tulane.edu> wrote:
> > > > 
> > > > > Good evening here,
> > > > > 
> > > > > I get a segmentation fault with my GROMACS 5.1 install only for
> > > > > replica exchange simulations right at the first successful exchange
> > > > > on a multi-node run. Normal simulations across multiple nodes work
> > > > > fine, and replica exchange simulations on one node work fine.
> > > > > 
> > > > > I've reproduced the problem with just 2 replicas on 2 nodes with GPUs
> > > > > disabled (-nb cpu). Each node has 20 CPUs, so I'm using 20 MPI ranks
> > > > > on each (OpenMPI).
> > > > > 
> > > > > I get a segfault right when the first exchange is successful.
> > > > > 
> > > > > The only other error I sometimes get is that the InfiniBand
> > > > > connection times out while retrying communication between nodes at
> > > > > the exact same moment as the segfault, but I don't get that every
> > > > > time, and it's usually with all replicas going (my goal is to run 30
> > > > > replicas on 120 CPUs). There are no other error logs, and mdrun's log
> > > > > does not indicate an error.
> > > > > 
> > > > > PBS log: http://bit.ly/1P8Vs49
> > > > > mdrun log: http://bit.ly/1RD0ViQ
> > > > > 
> > > > > I'm currently troubleshooting this with the sysadmin, but I wanted
> > > > > to check whether anyone has had a similar issue or can suggest
> > > > > further troubleshooting steps. I've also searched the mailing list
> > > > > and used my Google-fu, but it has failed me so far.
> > > > > 
> > > > > Thanks for your help.
> > > > > 
> > > 

-- 
James "Wes" Barnett, Ph.D. Candidate
Louisiana Board of Regents Fellow

Chemical and Biomolecular Engineering
Tulane University
341-B Lindy Boggs Center for Energy and Biotechnology
6823 St. Charles Ave
New Orleans, Louisiana 70118-5674
jbarnet4 at tulane.edu

