[gmx-users] GROMACS (w. OpenMPI) fails to run with -np larger than 10

Mark Abraham Mark.Abraham at anu.edu.au
Wed Apr 11 17:07:27 CEST 2012


On 11/04/2012 11:03 PM, Seyyed Mohtadin Hashemi wrote:
> Hello,
>
> I have a very peculiar problem: I have a micro cluster with three 
> nodes (18 cores total); the nodes are clones of each other and 
> connected to a frontend via Ethernet. I am using Debian squeeze as the 
> OS for all nodes.
>
> I have compiled a GROMACS v.4.5.5 environment with FFTW v.3.3.1 and 
> OpenMPI v.1.4.2 (the OpenMPI version that is standard for Debian). On 
> each node individually I can run simulations of any size and 
> complexity; however, as soon as I attempt a parallel job across nodes 
> the whole thing crashes.
> Since my own simulations may be non-ideal for a parallel setup, I have 
> used the gmxbench d.dppc files; this is the result I get:
>
> For a simple parallel job I use: path/mpirun --hostfile 
> path/machinefile --np XX path/mdrun_mpi --p tcp --s path/topol.tpr --o 
> path/output.trr
> With --np XX less than or equal to 10 it works; however, as soon as I 
> use 11 or more, the whole thing crashes and I get:
> [host:xxxx] Signal: Bus error (7)
> [host:xxxx] Signal code: Non-existant physical address (2)
> [host:xxxx] Lots of lines with libmpi.so.0
>
> I have tried different versions of OpenMPI, from v.1.4.5 all the way 
> to the beta v.1.5.5; they all behave exactly the same. This makes no 
> sense to me.

Sounds like an MPI configuration problem. I'd get a simple MPI test 
program running on all 18 cores before worrying about anything else.
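For example, something like this minimal hello-world (a sketch I'm 
adding here, not from the original thread; the hostfile path is 
assumed to match yours) will show whether all 18 ranks actually start 
across the three nodes:

/* mpi_hello.c -- minimal sanity check for the MPI setup.
 * Build:  mpicc mpi_hello.c -o mpi_hello
 * Run:    mpirun --hostfile path/machinefile -np 18 ./mpi_hello
 */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int  rank, size, len;
    char name[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(name, &len);

    /* Each rank reports which host it landed on. */
    printf("rank %d of %d on %s\n", rank, size, name);

    MPI_Finalize();
    return 0;
}

If this also dies with a bus error above 10 ranks, the problem is in 
the MPI or network setup, not in GROMACS.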

>
> When I use threads over the OpenMPI interface, I can get all cores 
> engaged and the simulation works until 5-7 minutes in; then it gives 
> the error "Cannot rename checkpoint file; maybe you are out of 
> quota?", even though I have more than 500 GB free on each node.

Sounds like a filesystem availability problem. That error is raised 
when mdrun fails to rename the previous checkpoint file, which usually 
points to a flaky network mount rather than a lack of space. 
Checkpoint files are written in the working directory, so the 
available local disk space is not strictly relevant.
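One way to test that (again a sketch of mine, not part of the original 
exchange) is a tiny program that performs the same kind of rename() 
that mdrun does when it rotates its checkpoint, run from the 
simulation's working directory:

/* rename_test.c -- exercise rename() in the current directory,
 * the operation behind the "out of quota?" checkpoint error.
 * Build:  cc rename_test.c -o rename_test
 */
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    FILE *fp = fopen("cpt_test.tmp", "w");
    if (fp == NULL) {
        perror("fopen");
        return EXIT_FAILURE;
    }
    fputs("checkpoint rename test\n", fp);
    fclose(fp);

    /* If the working directory is on an unreliable network mount,
     * this is the call that fails. */
    if (rename("cpt_test.tmp", "cpt_test_prev.tmp") != 0) {
        perror("rename");
        return EXIT_FAILURE;
    }
    remove("cpt_test_prev.tmp");
    printf("rename() works in this directory\n");
    return 0;
}

If it fails intermittently while the job runs, look at the NFS export 
and mount options for that directory.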

Mark

>
> I hope somebody can help me figure out what is wrong and suggest a 
> possible solution.
>
> regards,
> Mohtadin
>
>
