[gmx-users] Problems with simulation on multi-nodes cluster

Mark Abraham Mark.Abraham at anu.edu.au
Mon Apr 2 15:06:15 CEST 2012


On 2/04/2012 7:13 PM, James Starlight wrote:
> Mark,
>
> As I mentioned previously, I have problems running the simulation in
> multi-node mode.

Yup, and my bet is that you can't run any other software on multiple MPI 
nodes either, because your MPI system is not set up correctly, or is 
perhaps too old. We can't help with that, since it has nothing to do with 
GROMACS.

>
> I checked the logs of such simulations and found entries like this:
>
> Will use 10 particle-particle and 6 PME only nodes
> This is a guess, check the performance at the end of the log file
> Using 6 separate PME nodes
>
> This simulation was run on 2 nodes (2*8 CPUs). I've never seen the
> same messages about PME nodes when I've launched my systems on a
> single node.

Not surprising. Running in parallel is a lot trickier than running in 
serial, and so there is a lot of software engineering that supports it. 
See sections 3.15 and 3.17.5 of the manual. Running at near-maximum 
efficiency in parallel requires that you understand some of that, but by 
default it will "just run" almost all the time.

> Might it be that some special options for the PME nodes need to be
> defined in the .mdp file?

Not in the sense you mean. There are normally no .mdp changes necessary 
to support parallelism, and you get told about them when they arise. The 
traceback below clearly indicates that the problem occurs as GROMACS 
sets up its parallel communication infrastructure, which has nothing 
directly to do with the .mdp contents.
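You can see that in frames 11 through 9 of the trace: gmx_setup_nodecomm 
calls MPI_Comm_split, and the crash happens inside Open MPI's own 
components below that. A standalone MPI_Comm_split test over both nodes 
would confirm it; here is a minimal sketch, built and run like the 
hello-world test above:

    /* comm_split_check.c - exercises the MPI call that failed in the trace.
     * Compile:  mpicc comm_split_check.c -o comm_split_check
     * Run it over both nodes with the same mpiexec line you use for mdrun.
     */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char *argv[])
    {
        int rank, size, color;
        MPI_Comm subcomm;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* Split the world into two groups; GROMACS performs a similar
         * split when it sets up its node communicators. */
        color = (rank < size / 2) ? 0 : 1;
        MPI_Comm_split(MPI_COMM_WORLD, color, rank, &subcomm);

        printf("rank %d of %d ended up in group %d\n", rank, size, color);

        MPI_Comm_free(&subcomm);
        MPI_Finalize();
        return 0;
    }

If that segfaults the same way, it is an Open MPI problem (version or 
build), and your system administrator is the right person to ask.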

Mark

>
> James
>
> On 20 March 2012 at 18:02, Mark Abraham 
> <Mark.Abraham at anu.edu.au <mailto:Mark.Abraham at anu.edu.au>> wrote:
>
>     On 20/03/2012 10:35 PM, James Starlight wrote:
>
>         Could someone tell me what the error below means?
>
>         Getting Loaded...
>         Reading file MD_100.tpr, VERSION 4.5.4 (single precision)
>         Loaded with Money
>
>
>         Will use 30 particle-particle and 18 PME only nodes
>         This is a guess, check the performance at the end of the log file
>         [ib02:22825] *** Process received signal ***
>         [ib02:22825] Signal: Segmentation fault (11)
>         [ib02:22825] Signal code: Address not mapped (1)
>         [ib02:22825] Failing at address: 0x10
>         [ib02:22825] [ 0]
>         /lib/x86_64-linux-gnu/libpthread.so.0(+0xf030) [0x7f535903e03$
>         [ib02:22825] [ 1]
>         /usr/lib/openmpi/lib/openmpi/mca_pml_ob1.so(+0x7e23) [0x7f535$
>         [ib02:22825] [ 2]
>         /usr/lib/openmpi/lib/openmpi/mca_pml_ob1.so(+0x8601) [0x7f535$
>         [ib02:22825] [ 3]
>         /usr/lib/openmpi/lib/openmpi/mca_pml_ob1.so(+0x8bab) [0x7f535$
>         [ib02:22825] [ 4]
>         /usr/lib/openmpi/lib/openmpi/mca_btl_sm.so(+0x42af) [0x7f5353$
>         [ib02:22825] [ 5]
>         /usr/lib/libopen-pal.so.0(opal_progress+0x5b) [0x7f535790506b]
>         [ib02:22825] [ 6] /usr/lib/libmpi.so.0(+0x37755) [0x7f5359282755]
>         [ib02:22825] [ 7]
>         /usr/lib/openmpi/lib/openmpi/mca_coll_tuned.so(+0x1c3a) [0x7f$
>         [ib02:22825] [ 8]
>         /usr/lib/openmpi/lib/openmpi/mca_coll_tuned.so(+0x7fae) [0x7f$
>         [ib02:22825] [ 9] /usr/lib/libmpi.so.0(ompi_comm_split+0xbf)
>         [0x7f535926de8f]
>         [ib02:22825] [10] /usr/lib/libmpi.so.0(MPI_Comm_split+0xdb)
>         [0x7f535929dc2b]
>         [ib02:22825] [11]
>         /usr/lib/libgmx_mpi_d.openmpi.so.6(gmx_setup_nodecomm+0x19b) $
>         [ib02:22825] [12] mdrun_mpi_d.openmpi(mdrunner+0x46a) [0x40be7a]
>         [ib02:22825] [13] mdrun_mpi_d.openmpi(main+0x1256) [0x407206]
>         [ib02:22825] [14]
>         /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xfd) [0x7f$
>         [ib02:22825] [15] mdrun_mpi_d.openmpi() [0x407479]
>         [ib02:22825] *** End of error message ***
>         --------------------------------------------------------------------------
>         mpiexec noticed that process rank 36 with PID 22825 on node
>         ib02 exited on sign$
>         --------------------------------------------------------------------------
>
>
>         I got this when I tried to run my system on a
>         multi-node machine (there is no problem on a single node). Is
>         this a problem with the cluster system, or is something wrong
>         with the parameters of my simulation?
>
>
>     The traceback suggests your MPI system is not configured
>     correctly for your hardware.
>
>     Mark
>
