[gmx-users] Re: Gromacs 4.5.4 on multi-node cluster

Dimitris Dellis ntelll at gmail.com
Thu Dec 8 11:06:10 CET 2011


Hi.
This is an Open MPI issue.

You probably have the virbr0 interface (IP 192.168.122.1) active on the 
nodes. Stop and disable the libvirtd (and probably libvirt-guests) 
service if you don't need it.
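For example, checking for the interface and disabling the service on a 
compute node might look something like this (a sketch only; it assumes 
the Red Hat-style service/chkconfig tools that Rocks provides):

    /sbin/ifconfig virbr0          # is the bridge up with 192.168.122.1?
    service libvirtd stop          # stop the service now
    chkconfig libvirtd off         # keep it from starting at boot
    service libvirt-guests stop    # only if this service is installed
    chkconfig libvirt-guests off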

Alternatively,
1. add --mca btl_tcp_if_exclude lo,virbr0 to the mpirun flags (see the 
example below),
or
2. add the following line to 
/home/gromacs/.Installed/openmpi/etc/openmpi-mca-params.conf :
btl_tcp_if_exclude = lo,virbr0

Either way excludes virbr0 from the list of interfaces that Open MPI 
may use for communication. (If virbr1 etc. are also present, add them 
to the exclude list as well.)
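With the command from your message (paths unchanged), option 1 would 
look something like this:

    /home/gromacs/.Installed/openmpi/bin/mpirun \
        --mca btl_tcp_if_exclude lo,virbr0 \
        -np 2 -machinefile machines \
        /home/gromacs/.Installed/gromacs/bin/mdrun_mpi \
        -s md_run.tpr -o md_traj.trr -c md_confs.gro -e md.edr -g md.log -v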


On 12/08/2011 11:44 AM, Nikos Papadimitriou wrote:
>
>>
>> email message attachment
>>
>>> -------- Forwarded Message --------
>>> *From*: Nikos Papadimitriou <nikpap at ipta.demokritos.gr>
>>> *To*: gmx-users at gromacs.org
>>> *Subject*: [gmx-users] Gromacs 4.5.4 on multi-node cluster
>>> *Date*: Wed, 7 Dec 2011 16:26:46 +0200
>>>
>>> Dear All,
>>>
>>> I had been running Gromacs 4.0.7 on a 12-node cluster (Intel 
>>> i7-920, 4 cores per node) with OS Rocks 5.4.2. Recently, I upgraded 
>>> the cluster OS to Rocks 5.4.3 and installed Gromacs 4.5.4 from the 
>>> Bio Roll repository. When running in parallel on a single node, 
>>> everything works fine. However, when I try to run on more than one 
>>> node, the run stalls immediately with the following output:
>>>
>>> [gromacs at tornado Test]$ /home/gromacs/.Installed/openmpi/bin/mpirun 
>>> -np 2 -machinefile machines 
>>> /home/gromacs/.Installed/gromacs/bin/mdrun_mpi -s md_run.tpr -o 
>>> md_traj.trr -c md_confs.gro -e md.edr -g md.log -v
>>> NNODES=2, MYRANK=0, HOSTNAME=compute-1-1.local
>>> NNODES=2, MYRANK=1, HOSTNAME=compute-1-2.local
>>> NODEID=0 argc=12
>>> NODEID=1 argc=12
>>>
>>> The mdrun_mpi process seems to start on both nodes, but the run 
>>> does not proceed and no files are produced. It looks as if the 
>>> nodes are waiting for some kind of communication between them. The 
>>> problem occurs even for the simplest case (i.e. an NVT simulation 
>>> of 1000 argon atoms without Coulombic interactions). Open MPI and 
>>> the networking between the nodes seem to work fine, since there are 
>>> no problems with other software that runs with MPI.
>>>
>>> In an attempt to find a solution, I manually compiled and installed 
>>> Gromacs 4.5.5 (with --enable-mpi) after installing the latest 
>>> versions of openmpi and fftw3, and no errors occurred during the 
>>> installation. However, when trying to run on two different nodes, 
>>> exactly the same problem appears.
>>>
>>> Do you have any idea what might be causing this?
>>> Thank you in advance!
>> email message attachment
>>
>>> -------- Forwarded Message --------
>>> *From*: Mark Abraham <Mark.Abraham at anu.edu.au>
>>> *Reply-to*: "Discussion list for GROMACS users" <gmx-users at gromacs.org>
>>> *To*: Discussion list for GROMACS users <gmx-users at gromacs.org>
>>> *Subject*: [gmx-users] Gromacs 4.5.4 on multi-node cluster
>>> *Date*: Wed, 7 Dec 2011 16:53:49 +0200
>>>
>>> On 8/12/2011 1:26 AM, Nikos Papadimitriou wrote:
>>>> [original message quoted in full above -- snipped]
>>>
>>> Can you run a 2-processor MPI test program with that machine file?
>>>
>>> Mark
>>>
> "Unfortunately", other MPI programs run fine on 2 or more nodes. There 
> seems to be no problem with MPI.


