[gmx-users] parallel runs with lam

Lubos Vrbka lubos.vrbka at gmail.com
Tue Dec 27 19:52:01 CET 2005


hi guys,

i experience strange behavior when running parallel gromacs (3.3) using 
lam (version 7.1.1, iirc).

is there any limitation on the number of processors used for the md 
simulation based on the total number of atoms in the system?

when i run a simulation of 896 water molecules, i can use 20 
processors at a time without problems (i'm running over a myrinet2000 
interconnect).
when i run a simulation with only 192 water molecules, i can run on up 
to 12 processors, but not more. the log file for processor 0 and the 
standard and error outputs follow (14 processors used; i tried to remove 
unnecessary stuff from the outputs, so if you miss something, let me know)

<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
LOGFILE
=======
...
CPU=  0, lastcg=   13, targetcg=  109, myshift=    7
CPU=  1, lastcg=   27, targetcg=  123, myshift=    7
CPU=  2, lastcg=   41, targetcg=  137, myshift=    7
CPU=  3, lastcg=   55, targetcg=  151, myshift=    7
CPU=  4, lastcg=   69, targetcg=  165, myshift=    7
CPU=  5, lastcg=   83, targetcg=  179, myshift=    8
CPU=  6, lastcg=   96, targetcg=    0, myshift=    8
CPU=  7, lastcg=  110, targetcg=   14, myshift=    7
CPU=  8, lastcg=  123, targetcg=   28, myshift=    7
CPU=  9, lastcg=  137, targetcg=   42, myshift=    7
CPU= 10, lastcg=  150, targetcg=   54, myshift=    7
CPU= 11, lastcg=  164, targetcg=   68, myshift=    7
CPU= 12, lastcg=  177, targetcg=   82, myshift=    7
CPU= 13, lastcg=  191, targetcg=   96, myshift=    7
nsb->shift =   8, nsb->bshift=  0
Listing Scalars
nsb->nodeid:         0
nsb->nnodes:     14
nsb->cgtotal:   192
nsb->natoms:   1152
nsb->shift:       8
nsb->bshift:      0
Nodeid   index  homenr  cgload  workload
      0       0      84      14        14
      1      84      84      28        28
      2     168      84      42        42
      3     252      84      56        56
      4     336      84      70        70
      5     420      84      84        84
      6     504      78      97        97
      7     582      84     111       111
      8     666      78     124       124
      9     744      84     138       138
     10     828      78     151       151
     11     906      84     165       165
     12     990      78     178       178
     13    1068      84     192       192

.....
Parallelized PME sum used.
PARALLEL FFT DATA:
    local_nx:                   1  local_x_start:                   0
    local_ny_after_transpose:   2  local_y_start_after_transpose    0
Center of mass motion removal mode is Linear
We have the following groups for center of mass motion removal:
   0:  System, initial mass: 3458.96
There are: 42 Atoms
There are: 42 VSites
Removing pbc first time
Done rmpbc

Constraining the starting coordinates (step -2)

++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
B. Hess and H. Bekker and H. J. C. Berendsen and J. G. E. M. Fraaije
LINCS: A Linear Constraint Solver for molecular simulations
J. Comp. Chem. 18 (1997) pp. 1463-1472
-------- -------- --- Thank You --- -------- --------


Initializing LINear Constraint Solver
   number of constraints is 42
   average number of constraints coupled to one constraint is 2.0

    Rel. Constraint Deviation:  Max    between atoms     RMS
        Before LINCS         0.001124     61     66   0.000824
         After LINCS         0.000091     79     84   0.000066


Constraining the coordinates at t0-dt (step -1)
    Rel. Constraint Deviation:  Max    between atoms     RMS
        Before LINCS         0.000110     50     54   0.000070
         After LINCS         0.000172     67     68   0.000073

Started mdrun on node 0 Tue Dec 27 19:23:41 2005
Initial temperature: 246.98 K
            Step           Time         Lambda
               0        0.00000        0.00000

Grid: 4 x 4 x 30 cells
Configuring nonbonded kernels...

and here it stops

<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
STDOUT
======
it doesn't contain anything unusual

<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
STDERR
======

first i get a warning from grompp:

checking input for internal consistency...
WARNING 1 [file _freeze.mdp, line unknown]:

Generated 10 of the 10 non-bonded parameter combinations
Excluding 1 bonded neighbours for NE6               192
processing coordinates...
double-checking input for internal consistency...
Cleaning up constraints and constant bonded interactions with virtual sites
renumbering atomtypes...
converting bonded parameters..
...
then it reads the trr and edr files from the previous part of the 
simulation (it was run on another machine, a single-processor amd64; 
i'm now running on a cluster with xeons)
...
Walking down the molecule graph to make shake-blocks
There are 192 charge group borders and 960 shake borders
There are 192 total borders
Division over nodes in atoms:
       84      84      84      84      84      84      78      84 
78      84      78      84      78      84
writing run input file...
There was 1 warning

gcq#252: "Uh-oh .... Right Again" (Laurie Anderson)
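(side note on the numbers: the "Division over nodes in atoms" line above 
is consistent with an even split of the 192 charge groups, 6 atoms each 
counting the vsites, over the 14 nodes. a minimal python sketch of that 
arithmetic, not necessarily mdrun's exact per-node ordering:)

```python
# Rough sketch (not mdrun's actual algorithm) of splitting 192 charge
# groups of 6 atoms each over 14 nodes as evenly as possible.
ncg, atoms_per_cg, nnodes = 192, 6, 14

base, extra = divmod(ncg, nnodes)  # 13 cg per node, 10 nodes get one more
cg_per_node = [base + (1 if i < extra else 0) for i in range(nnodes)]
atoms_per_node = [n * atoms_per_cg for n in cg_per_node]

assert sum(atoms_per_node) == 1152      # nsb->natoms from the log
print(sorted(set(atoms_per_node)))      # nodes carry either 78 or 84 atoms
```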

and then mpirun is invoked with the appropriate parameters

NNODES=14, MYRANK=10, HOSTNAME=skirit40.ics.muni.cz
NNODES=14, MYRANK=11, HOSTNAME=skirit40.ics.muni.cz
NNODES=14, MYRANK=8, HOSTNAME=skirit38.ics.muni.cz
NNODES=14, MYRANK=9, HOSTNAME=skirit38.ics.muni.cz
NNODES=14, MYRANK=12, HOSTNAME=skirit41.ics.muni.cz
NNODES=14, MYRANK=13, HOSTNAME=skirit41.ics.muni.cz
NNODES=14, MYRANK=4, HOSTNAME=skirit35.ics.muni.cz
NNODES=14, MYRANK=5, HOSTNAME=skirit35.ics.muni.cz
NNODES=14, MYRANK=6, HOSTNAME=skirit36.ics.muni.cz
NNODES=14, MYRANK=7, HOSTNAME=skirit36.ics.muni.cz
NNODES=14, MYRANK=0, HOSTNAME=skirit33.ics.muni.cz
NNODES=14, MYRANK=1, HOSTNAME=skirit33.ics.muni.cz
NNODES=14, MYRANK=2, HOSTNAME=skirit34.ics.muni.cz
NNODES=14, MYRANK=3, HOSTNAME=skirit34.ics.muni.cz
NODEID=2 argc=15
NODEID=3 argc=15
NODEID=0 argc=15
NODEID=4 argc=15
NODEID=1 argc=15
NODEID=5 argc=15
NODEID=6 argc=15
NODEID=8 argc=15
NODEID=7 argc=15
NODEID=12 argc=15
NODEID=9 argc=15
NODEID=13 argc=15
NODEID=10 argc=15
NODEID=11 argc=15
                          :-)  G  R  O  M  A  C  S  (-:
...
-----------------------------------------------------------------------------
One of the processes started by mpirun has exited with a nonzero exit
code.  This typically indicates that the process finished in error.
If your process did not finish in error, be sure to include a "return
0" or "exit(0)" in your C code before exiting the application.

PID 24349 failed on node n2 (xxx.xxx.x.xxx) due to signal 11.
-----------------------------------------------------------------------------

<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<

i checked that node and there is a core file. there are 2 log files on 
that node (2 processors), and they end with

Initializing LINear Constraint Solver
   number of constraints is 42
   average number of constraints coupled to one constraint is 2.0

and

Started mdrun on node 4 Tue Dec 27 19:23:41 2005
Grid: 4 x 4 x 30 cells
Configuring nonbonded kernels...

i'm now running another simulation (12 processors); 2 of its processes 
are running on the node where the 14-processor run crashed, so i don't 
think it is a hardware problem...

i'm not sure how to debug a core file coming from a parallel run 
(actually i don't know how to do it even for a single-processor task :))
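
(the usual recipe for a core file is to load it into gdb together with 
the binary that produced it; a sketch, assuming mdrun was built with 
debug symbols and with the paths being guesses:)

```
$ gdb /path/to/mdrun core   # load the mdrun binary together with the core dump
(gdb) bt                    # backtrace shows where signal 11 was raised
(gdb) frame 0               # inspect the innermost (crashing) frame
```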

does anyone know what could be causing this problem? i would be very 
glad for any hints/information/suggestions... could that warning from 
grompp be the problem?

thank you in advance. with best regards,

-- 
Lubos
