[gmx-users] Re: GROMACS (w. OpenMPI) fails to run with -np larger than 10
szilard.pall at cbr.su.se
Wed Apr 11 18:29:41 CEST 2012
On Wed, Apr 11, 2012 at 6:16 PM, Mark Abraham <Mark.Abraham at anu.edu.au> wrote:
> On 12/04/2012 1:42 AM, haadah wrote:
>> Could you clarify what you mean by "Sounds like an MPI configuration
>> problem. I'd get a test program running on 18 cores before worrying about
>> anything else."? My problem is that I can't get anything to work with -np
>> set to larger than 10.
> Only you can configure your hardware to use your MPI software correctly. You
> should use some simple non-GROMACS test program to probe whether it is
> working correctly. Only then is testing GROMACS a reasonable thing to do,
> given your existing observations.
That's a good point: first test one or a few simple programs, though not
ones as trivial as "hello world". If those work, it might be a GROMACS issue.
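One way to do this systematically is to step the process count up until the launch fails. Below is a minimal Python sketch of such a driver, not a definitive tool. The names `mpirun_cmd`, `first_failing_np` and `run_test` are made up for illustration, and "machinefile" and "./mpi_test" are placeholders for your own hostfile and a small compiled MPI test program that does some real communication.

```python
import subprocess


def mpirun_cmd(np, hostfile, program, program_args=()):
    """Build an Open MPI launch command as an argument list."""
    return ["mpirun", "--hostfile", hostfile, "-np", str(np),
            program, *program_args]


def first_failing_np(counts, run):
    """Return the first process count for which run(np) is falsy, else None."""
    for np in counts:
        if not run(np):
            return np
    return None


def run_test(np, hostfile="machinefile", program="./mpi_test"):
    """Launch the test program on np processes; True if it exits cleanly.

    The hostfile and program defaults are placeholders (assumptions).
    """
    result = subprocess.run(mpirun_cmd(np, hostfile, program))
    return result.returncode == 0


# Example use on an 18-core cluster (needs mpirun on the PATH):
#     print(first_failing_np(range(2, 19), run_test))
```

If the non-GROMACS test program already fails at the same count as mdrun_mpi does, the problem is in the MPI setup rather than in GROMACS.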
With more than 10 processes, separate PME nodes are used. Try the -npme 0
option to see if the error occurs without separate PME nodes on >10
processes.
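To make the comparison concrete, here is a small Python sketch that builds the two command lines to run side by side: one with mdrun's default PME node choice and one with -npme 0. The helper names `mdrun_cmd` and `comparison_runs` are hypothetical, and the hostfile and .tpr paths are placeholders to be replaced with your own.

```python
def mdrun_cmd(np, hostfile, tpr, npme=None, mdrun="mdrun_mpi"):
    """Build an mpirun + mdrun_mpi command line as an argument list.

    npme=None leaves the choice of separate PME nodes to mdrun;
    npme=0 asks for no separate PME nodes.
    """
    cmd = ["mpirun", "--hostfile", hostfile, "-np", str(np),
           mdrun, "-s", tpr]
    if npme is not None:
        cmd += ["-npme", str(npme)]
    return cmd


def comparison_runs(np, hostfile, tpr):
    """Return the two runs to compare at the same process count."""
    return {
        "default": mdrun_cmd(np, hostfile, tpr),
        "no_separate_pme": mdrun_cmd(np, hostfile, tpr, npme=0),
    }


# Example: print both command lines for 11 processes (paths are placeholders).
for label, cmd in comparison_runs(11, "machinefile", "topol.tpr").items():
    print(label + ":", " ".join(cmd))
```

If the default run crashes at 11 processes and the -npme 0 run does not, the separate PME nodes are involved; if both crash, look at the MPI setup first.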
>> The “Cannot rename checkpoint file; maybe you are out of quota?” problem is
>> fixed; I needed to set the NFS to sync instead of async.
>> Mark Abraham wrote
>>> On 11/04/2012 11:03 PM, Seyyed Mohtadin Hashemi wrote:
>>>> I have a very peculiar problem: I have a micro cluster with three
>>>> nodes (18 cores total); the nodes are clones of each other and
>>>> connected to a frontend via Ethernet. I am using Debian squeeze as the
>>>> OS for all nodes.
>>>> I have compiled a GROMACS v.4.5.5 environment with FFTW v.3.3.1 and
>>>> OpenMPI v.1.4.2 (OpenMPI version that is standard for Debian). On the
>>>> nodes, individually, I can do simulations of any size and complexity;
>>>> however, as soon as I want to do a parallel job the whole thing crashes.
>>>> Since my own simulations can be non-ideal for a parallel situation, I
>>>> have used the gmxbench d.dppc files; this is the result I get:
>>>> For a simple parallel job I use: path/mpirun --hostfile
>>>> path/machinefile --np XX path/mdrun_mpi --p tcp --s path/topol.tpr --o
>>>> For --np XX less than or equal to 10 it works; however, as soon as I
>>>> use 11 or more the whole thing crashes and I get:
>>>> [host:xxxx] Signal: Bus error (7)
>>>> [host:xxxx] Signal code: Non-existant physical address (2)
>>>> [host:xxxx] Lots of lines with libmpi.so.0
>>>> I have tried using different versions of OpenMPI, v.1.4.5 and all the
>>>> way to beta v.1.5.5, they all behave exactly the same. This is making
>>>> no sense.
>>> Sounds like an MPI configuration problem. I'd get a test program running
>>> on 18 cores before worrying about anything else.
>>>> When I use threads over the OpenMPI interface I can get all cores
>>>> engaged and the simulation works until 5-7 minutes in, then it gives the
>>>> error "Cannot rename checkpoint file; maybe you are out of quota?"
>>>> even though I have more than 500 GB left on each node.
>>> Sounds like a filesystem availability problem. Checkpoint files are
>>> written in the working directory, so the available local disk space is
>>> not strictly relevant.
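A quick way to check the working directory itself is to mimic the write-then-rename step that the error message refers to. The Python sketch below is only a probe under that assumption; it does not reproduce GROMACS's actual checkpoint code, and the function name `probe_write_and_rename` and the file names are made up for illustration.

```python
import os
import tempfile


def probe_write_and_rename(directory, size=1 << 20):
    """Write a file in `directory`, force it to disk, rename it, verify it.

    Returns True if the renamed file holds the data that was written.
    An OSError from the write or the rename propagates to the caller,
    which is itself the diagnostic.
    """
    fd, tmp_path = tempfile.mkstemp(prefix="probe_", suffix=".tmp",
                                    dir=directory)
    final_path = os.path.join(directory, "probe_renamed.dat")
    payload = os.urandom(size)
    try:
        with os.fdopen(fd, "wb") as handle:
            handle.write(payload)
            handle.flush()
            os.fsync(handle.fileno())      # push the data to the filesystem
        os.replace(tmp_path, final_path)   # the rename step under test
        with open(final_path, "rb") as handle:
            ok = handle.read() == payload
    finally:
        # Clean up whichever of the two files still exists.
        for path in (tmp_path, final_path):
            if os.path.exists(path):
                os.remove(path)
    return ok


# Example: probe the current working directory.
#     print(probe_write_and_rename(os.getcwd()))
```

Run it from each node in the directory where mdrun writes its output; a failure on only some nodes points at the shared filesystem rather than at free disk space.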
>>>> I hope somebody can help me figure out what is wrong and maybe a
>>>> possible solution.
>>> gmx-users mailing list gmx-users at gromacs.org
>>> Please search the archive at
>>> http://www.gromacs.org/Support/Mailing_Lists/Search before posting!
>>> Please don't post (un)subscribe requests to the list. Use the
>>> www interface or send it to gmx-users-request at gromacs.org.
>>> Can't post? Read http://www.gromacs.org/Support/Mailing_Lists