[gmx-users] Multi-node Replica Exchange Segfault

Barnett, James W jbarnet4 at tulane.edu
Fri Oct 30 14:31:39 CET 2015


I added the -debug flag to my two-replica test. The end of each mdrun log file
looks like this (right before the segfault):

    Replica exchange at step 11000 time 22.00000
    Repl 0 <-> 1  dE_term = -6.536e-01 (kT)
      dpV = -3.524e-05  d = -6.537e-01
    Repl ex  0 x  1
    Repl pr   1.0

    Replica Exchange Order
    Replica 0:
    Replica 1:
    Atom distribution over 16 domains: av 1207 stddev 26 min 1165 max 1238
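
As a sanity check on the log above, the reported exchange probability is what
a standard Metropolis acceptance test would give. The following is a minimal
sketch, not mdrun's actual code: it assumes that d is the sum of dE_term and
dpV (both in units of kT) and that the swap is accepted with probability
min(1, exp(-d)). The function name exchange_probability is hypothetical.

```python
import math


def exchange_probability(de_term: float, dpv: float) -> float:
    """Metropolis acceptance probability for a proposed swap (terms in kT).

    Assumption: the total difference is de_term + dpv, and a non-positive
    total is always accepted.
    """
    delta = de_term + dpv
    if delta <= 0:
        return 1.0
    return math.exp(-delta)


# Values copied from the "Repl 0 <-> 1" lines in the log above.
prob = exchange_probability(-6.536e-01, -3.524e-05)
print(f"Repl pr {prob:.1f}")
```

Because d is negative here, the exchange is always accepted, which matches the
"Repl pr 1.0" line and the "0 x 1" marker. This is the first successful
exchange, and it is the point at which the segfault occurs.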

On Fri, 2015-10-30 at 13:08 +0000, Barnett, James W wrote:
> Hey Mark,
> 
> On Fri, 2015-10-30 at 08:14 +0000, Mark Abraham wrote:
> > Hi,
> > 
> > I've never heard of such. You could try a multisim without -replex, to help
> > diagnose.
> 
> 
> A multidir simulation runs without issue when -replex is omitted.
> 
> > 
> > On Fri, 30 Oct 2015 03:33 Barnett, James W <jbarnet4 at tulane.edu> wrote:
> > 
> > > Good evening here,
> > > 
> > > I get a segmentation fault with my GROMACS 5.1 install only for replica
> > > exchange simulations right at the first successful exchange on a
> > > multi-node run. Normal simulations across multiple nodes work fine, and
> > > replica exchange simulations on one node work fine.
> > > 
> > > I've reproduced the problem with just 2 replicas on 2 nodes with GPUs
> > > disabled (-nb cpu). Each node has 20 CPUs, so I'm using 20 MPI ranks on
> > > each (OpenMPI).
> > > 
> > > I get a segfault right when the first exchange is successful.
> > > 
> > > The only other error I sometimes get is that the InfiniBand connection
> > > timed out retrying the communication between nodes, at the exact same
> > > moment as the segfault. I don't get that every time, and it usually
> > > happens with all replicas running (my goal is 30 replicas on 120 CPUs).
> > > There are no other error logs, and mdrun's log does not indicate an
> > > error.
> > > 
> > > PBS log: http://bit.ly/1P8Vs49
> > > mdrun log: http://bit.ly/1RD0ViQ
> > > 
> > > I'm currently troubleshooting this some with the sysadmin, but I wanted to
> > > check to see if anyone has had a similar issue or any further steps to
> > > troubleshoot. I've also searched the mailing list and used my Google-fu,
> > > but it has failed me so far.
> > > 
> > > Thanks for your help.
> > > 
> 
> -- 
> James "Wes" Barnett, Ph.D. Candidate
> Louisiana Board of Regents Fellow
> 
> Chemical and Biomolecular Engineering
> Tulane University
> 341-B Lindy Boggs Center for Energy and Biotechnology
> 6823 St. Charles Ave
> New Orleans, Louisiana 70118-5674
> jbarnet4 at tulane.edu


