[gmx-users] Multi-node Replica Exchange Segfault

Barnett, James W jbarnet4 at tulane.edu
Fri Oct 30 03:32:40 CET 2015


Good evening,

With my GROMACS 5.1 install, replica exchange simulations that run across
multiple nodes crash with a segmentation fault right at the first successful
exchange. Normal simulations across multiple nodes work fine, and replica
exchange simulations on a single node work fine.

I've reproduced the problem with just 2 replicas on 2 nodes with GPUs disabled
(-nb cpu). Each node has 20 CPUs, so I'm using 20 MPI ranks on each (Open MPI).

I get a segfault right when the first exchange is successful. 

The only other error I sometimes get is an InfiniBand connection timeout while
retrying communication between nodes, at the exact same moment as the segfault.
I don't get it every time, and it usually appears when all replicas are running
(my goal is 30 replicas on 120 CPUs). There are no other error logs, and
mdrun's log does not indicate an error.
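To see which exchanges have to cross the InfiniBand link, it can help to work out the rank layout. The sketch below does this for the two setups mentioned: 2 replicas on 2 x 20 cores, and 30 replicas on 120 CPUs, which is 6 nodes at 20 per node. It rests on two assumptions that are not stated in the post. First, mdrun gives each replica an equal, contiguous block of MPI ranks. Second, the launcher packs consecutive ranks onto the same node. It also treats each replica's first rank as the endpoint of exchange traffic. The helper names `replica_layout` and `cross_node_pairs` are hypothetical.

```python
"""Sketch: how MPI ranks might be split among replicas and nodes.

Assumptions (not taken from the post): mdrun gives each replica an equal,
contiguous block of MPI ranks, and the launcher packs consecutive ranks onto
the same node. Verify the real placement with your MPI's reporting options.
"""


def replica_layout(n_replicas, n_nodes, cores_per_node):
    """Return (ranks per replica, {replica: [node indices it occupies]})."""
    total_ranks = n_nodes * cores_per_node
    if total_ranks % n_replicas != 0:
        raise ValueError("ranks must divide evenly among replicas")
    per_replica = total_ranks // n_replicas
    layout = {}
    for rep in range(n_replicas):
        first = rep * per_replica          # first MPI rank of this replica
        last = first + per_replica - 1     # last MPI rank of this replica
        layout[rep] = list(range(first // cores_per_node,
                                 last // cores_per_node + 1))
    return per_replica, layout


def cross_node_pairs(layout):
    """Neighbouring replica pairs whose first ranks sit on different nodes
    (assumed here to be where exchange data is sent between replicas)."""
    return [(i, i + 1) for i in range(len(layout) - 1)
            if layout[i][0] != layout[i + 1][0]]


if __name__ == "__main__":
    for n_rep, n_nodes in [(2, 2), (30, 6)]:
        per, lay = replica_layout(n_rep, n_nodes, cores_per_node=20)
        print(f"{n_rep} replicas on {n_nodes} nodes: {per} ranks each, "
              f"inter-node pairs {cross_node_pairs(lay)}")
```

Under these assumptions, every exchange in the 2-replica test goes between nodes. In the 30-replica layout, only the neighbour pairs at node boundaries do. This would be consistent with single-node replica exchange running cleanly, but it is only a hypothesis to check.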

PBS log: http://bit.ly/1P8Vs49
mdrun log: http://bit.ly/1RD0ViQ

I'm currently troubleshooting this with the sysadmin, but I wanted to check
whether anyone has had a similar issue or can suggest further troubleshooting
steps. I've also searched the mailing list and used my Google-fu, but it has
failed me so far.

Thanks for your help.

-- 
James "Wes" Barnett, Ph.D. Candidate
Louisiana Board of Regents Fellow

Chemical and Biomolecular Engineering
Tulane University
341-B Lindy Boggs Center for Energy and Biotechnology
6823 St. Charles Ave
New Orleans, Louisiana 70118-5674
jbarnet4 at tulane.edu

