[gmx-users] GROMACS (w. OpenMPI) fails to run with -np larger than 10

Wed Apr 11 15:03:42 CEST 2012

Hello,

I have a very peculiar problem: I have a micro cluster with three nodes (18
cores total); the nodes are clones of each other and connected to a
frontend via Ethernet. I am using Debian squeeze as the OS for all nodes.

I have compiled a GROMACS v.4.5.5 environment with FFTW v.3.3.1 and OpenMPI
v.1.4.2 (the standard OpenMPI version for Debian). On the nodes,
individually, I can run simulations of any size and complexity; however, as
soon as I try to run a parallel job the whole thing crashes.
Since my own simulations may not be ideal for parallel runs, I have
used the gmxbench d.dppc files; this is the result I get:

For a simple parallel job I use: path/mpirun -hostfile path/machinefile -np
XX path/mdrun_mpi -p tcp -s path/topol.tpr -o path/output.trr
For -np XX less than or equal to 10 it works; however, as soon as I use
11 or more the whole thing crashes and I get:
[host:xxxx] Signal: Bus error (7)
[host:xxxx] Signal code: Non-existant physical address (2)
[host:xxxx] Lots of lines with libmpi.so.0
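
One way to pin down exactly where the runs start failing is to step through process counts and stop at the first one that fails. The sketch below is only an illustration: first_failing_np and launch are hypothetical helper names, and it assumes that mpirun exits with a non-zero status when a rank dies from a signal.

```shell
#!/bin/sh
# Sketch: find the smallest process count at which a launcher command fails.
# first_failing_np and launch are hypothetical names, not GROMACS or Open MPI tools.

# Usage: first_failing_np LOW HIGH LAUNCHER
# LAUNCHER is a command or shell function that takes the process count as its
# only argument and returns non-zero when the run fails.
first_failing_np() {
    low=$1
    high=$2
    launcher=$3
    np=$low
    while [ "$np" -le "$high" ]; do
        if ! "$launcher" "$np" >/dev/null 2>&1; then
            echo "$np"      # first count that failed
            return 0
        fi
        np=$((np + 1))
    done
    return 1                # every count in the range succeeded
}

# Example launcher built from the command line in this post. The paths are
# the post's own placeholders and must be filled in before use:
# launch() {
#     path/mpirun -hostfile path/machinefile -np "$1" \
#         path/mdrun_mpi -p tcp -s path/topol.tpr -o path/output.trr
# }
# first_failing_np 1 18 launch
```

Running the scan once against the full machinefile and once with each node on its own would show whether the threshold depends on the total process count or on which nodes are involved.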

I have tried different versions of OpenMPI, from v.1.4.5 all the way
to beta v.1.5.5, and they all behave exactly the same. This makes no sense.

When I use threads over the OpenMPI interface I can get all cores engaged
and the simulation works until 5-7 minutes in, when it gives the error
“Cannot rename checkpoint file; maybe you are out of quota?” even though I
have more than 500 GB free on each node.
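
The checkpoint message suggests that a file rename failed in the run directory, which is not necessarily the same thing as running out of disk space. A minimal sketch for checking both from the shell follows. free_kb and check_rename are hypothetical helper names, and the directory argument stands for wherever mdrun writes its output.

```shell
#!/bin/sh
# Sketch: check free space and whether a rename succeeds in a directory.
# free_kb and check_rename are hypothetical helpers, not part of GROMACS.

# Print the available space, in kilobytes, on the filesystem holding DIR.
free_kb() {
    df -Pk "$1" | awk 'NR == 2 { print $4 }'
}

# Create a scratch file in DIR, rename it, and clean up.
# Returns 0 if the rename worked, 1 otherwise.
check_rename() {
    tmp=$(mktemp "$1/rename_test.XXXXXX") || return 1
    if mv "$tmp" "$tmp.renamed"; then
        rm -f "$tmp.renamed"
        return 0
    fi
    rm -f "$tmp"
    return 1
}

# Example (run from the directory that holds the checkpoint files):
# free_kb .
# check_rename . && echo "rename works here"
```

If the rename check fails, or only fails on some nodes, that would point to permissions or the filesystem rather than to free space.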

I hope somebody can help me figure out what is wrong and suggest a possible
solution.

regards,
Mohtadin