[gmx-users] Re: GROMACS (w. OpenMPI) fails to run with -np larger than 10

Mark Abraham Mark.Abraham at anu.edu.au
Wed Apr 11 18:16:55 CEST 2012


On 12/04/2012 1:42 AM, haadah wrote:
> Could you clarify what you mean by "Sounds like an MPI configuration problem.
> I'd get a test program running on 18 cores before worrying about anything
> else."? My problem is that I can't get anything to work with -np set to more
> than 10.

Only you can configure your hardware to use your MPI software correctly. 
You should use some simple non-GROMACS test program to probe whether it 
is working correctly. Only then is testing GROMACS a reasonable thing to 
do, given your existing observations.
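
As a rough illustration of that kind of probe, here is a minimal Python
sketch that launches a trivial command under mpirun with an increasing
process count and reports the first count that fails. It assumes Open MPI's
mpirun is on the PATH; the hostfile name "machinefile" and both function
names are placeholders, not anything from your setup. Note that `hostname`
only exercises the launcher. To exercise the MPI library itself, pass a
compiled MPI hello-world program (one that calls MPI_Init) as `program`.

```python
import subprocess
from typing import Callable, List, Optional, Sequence


def launch(nprocs: int,
           hostfile: str = "machinefile",          # placeholder path
           program: Sequence[str] = ("hostname",)  # swap in an MPI hello-world
           ) -> bool:
    """Start `program` on `nprocs` slots; return True if mpirun exits with 0."""
    cmd: List[str] = ["mpirun", "--hostfile", hostfile,
                      "-np", str(nprocs), *program]
    try:
        result = subprocess.run(cmd, capture_output=True, text=True, timeout=60)
    except (OSError, subprocess.TimeoutExpired):
        # mpirun missing, not executable, or hung: treat as a failure.
        return False
    return result.returncode == 0


def first_failing_np(max_np: int,
                     run: Callable[[int], bool] = launch) -> Optional[int]:
    """Return the smallest process count that fails, or None if all succeed."""
    for nprocs in range(1, max_np + 1):
        if not run(nprocs):
            return nprocs
    return None


if __name__ == "__main__":
    failing = first_failing_np(18)
    if failing is None:
        print("All process counts from 1 to 18 launched cleanly.")
    else:
        print(f"First failure at -np {failing}.")
```

If a probe like this also breaks at 11 processes, the fault lies in the MPI
setup rather than in GROMACS.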

Mark

>
>
> The “Cannot rename checkpoint file; maybe you are out of quota?” problem is
> fixed; I needed to set the NFS to sync instead of async.
>
>
>
> Mark Abraham wrote
>> On 11/04/2012 11:03 PM, Seyyed Mohtadin Hashemi wrote:
>>> Hello,
>>>
>>> I have a very peculiar problem: I have a micro cluster with three
>>> nodes (18 cores total); the nodes are clones of each other and
>>> connected to a frontend via Ethernet. I am using Debian squeeze as the
>>> OS for all nodes.
>>>
>>> I have compiled a GROMACS v.4.5.5 environment with FFTW v.3.3.1 and
>>> OpenMPI v.1.4.2 (OpenMPI version that is standard for Debian). On the
>>> nodes, individually, I can do simulations of any size and complexity;
>>> however, as soon as I want to do a parallel job the whole thing crashes.
>>> Since my own simulations can be non-ideal for a parallel situation, I
>>> have used the gmxbench d.dppc files; this is the result I get:
>>>
>>> For a simple parallel job I use: path/mpirun --hostfile
>>> path/machinefile --np XX path/mdrun_mpi --p tcp --s path/topol.tpr --o
>>> path/output.trr
>>> For --np XX less than or equal to 10 it works; however, as soon as I
>>> use 11 or more the whole thing crashes and I get:
>>> [host:xxxx] Signal: Bus error (7)
>>> [host:xxxx] Signal code: Non-existant physical address (2)
>>> [host:xxxx] Lots of lines with libmpi.so.0
>>>
>>> I have tried using different versions of OpenMPI, v.1.4.5 and all the
>>> way to beta v.1.5.5, they all behave exactly the same. This is making
>>> no sense.
>> Sounds like an MPI configuration problem. I'd get a test program running
>> on 18 cores before worrying about anything else.
>>
>>> When I use threads over the OpenMPI interface I can get all cores
>>> engaged and the simulation works until 5-7 minutes in, when it gives the
>>> error "Cannot rename checkpoint file; maybe you are out of quota?"
>>> even though I have more than 500 GB left on each node.
>> Sounds like a filesystem availability problem. Checkpoint files are
>> written in the working directory, so the available local disk space is
>> not strictly relevant.
>>
>> Mark
>>
>>> I hope somebody can help me figure out what is wrong and maybe a
>>> possible solution.
>>>
>>> regards,
>>> Mohtadin
>>>
>>>