[gmx-users] parallel runs with lam

David van der Spoel spoel at xray.bmc.uu.se
Wed Dec 28 01:07:03 CET 2005


Lubos Vrbka wrote:
> hi guys,
> 
> I experience strange behavior when running parallel GROMACS (3.3) using 
> LAM (version 7.1.1, IIRC).
> 
> Is there any limitation on the number of processors that can be used for 
> an MD simulation, based on the total number of atoms in the system?
> 
> When I run a simulation of 896 water molecules, I can use 20 processors 
> at a time without problems (I'm running over a Myrinet 2000 
> interconnect).
> When I run a simulation with only 192 water molecules, I can run on up 
> to 12 processors, but not more. The log file for processor 0 and the 
> standard and error outputs follow (14 processors used; I tried to remove 
> unnecessary stuff from the outputs; if you miss something, let me know).

That looks weird: your grid is 4 x 4 x 30.
Is your box also very long?
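A quick way to check (assuming the starting structure is a .gro file; the 
file names below are only placeholders) is to look at the box vectors on 
the last line of the .gro file, or to dump them from the run input file:

  # last line of a .gro file holds the box vectors (in nm)
  tail -n 1 conf.gro

  # or dump the box matrix from the .tpr (output format may differ a bit)
  gmxdump -s topol.tpr | grep -A 3 "box ("

With a 4 x 4 x 30 grid one would expect the z box vector to be roughly 
7-8 times longer than x and y.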

It should work anyway, though: if it runs on 1 or a few processors, it 
should also run on 14, or quit with a decent error message. Obviously it 
is not going to be efficient to run such a small system on many processors.

If you can reproduce this, please file a bug on Bugzilla.
> 
> <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
> LOGFILE
> =======
> ...
> CPU=  0, lastcg=   13, targetcg=  109, myshift=    7
> CPU=  1, lastcg=   27, targetcg=  123, myshift=    7
> CPU=  2, lastcg=   41, targetcg=  137, myshift=    7
> CPU=  3, lastcg=   55, targetcg=  151, myshift=    7
> CPU=  4, lastcg=   69, targetcg=  165, myshift=    7
> CPU=  5, lastcg=   83, targetcg=  179, myshift=    8
> CPU=  6, lastcg=   96, targetcg=    0, myshift=    8
> CPU=  7, lastcg=  110, targetcg=   14, myshift=    7
> CPU=  8, lastcg=  123, targetcg=   28, myshift=    7
> CPU=  9, lastcg=  137, targetcg=   42, myshift=    7
> CPU= 10, lastcg=  150, targetcg=   54, myshift=    7
> CPU= 11, lastcg=  164, targetcg=   68, myshift=    7
> CPU= 12, lastcg=  177, targetcg=   82, myshift=    7
> CPU= 13, lastcg=  191, targetcg=   96, myshift=    7
> nsb->shift =   8, nsb->bshift=  0
> Listing Scalars
> nsb->nodeid:         0
> nsb->nnodes:     14
> nsb->cgtotal:   192
> nsb->natoms:   1152
> nsb->shift:       8
> nsb->bshift:      0
> Nodeid   index  homenr  cgload  workload
>      0       0      84      14        14
>      1      84      84      28        28
>      2     168      84      42        42
>      3     252      84      56        56
>      4     336      84      70        70
>      5     420      84      84        84
>      6     504      78      97        97
>      7     582      84     111       111
>      8     666      78     124       124
>      9     744      84     138       138
>     10     828      78     151       151
>     11     906      84     165       165
>     12     990      78     178       178
>     13    1068      84     192       192
> 
> .....
> Parallelized PME sum used.
> PARALLEL FFT DATA:
>    local_nx:                   1  local_x_start:                   0
>    local_ny_after_transpose:   2  local_y_start_after_transpose    0
> Center of mass motion removal mode is Linear
> We have the following groups for center of mass motion removal:
>   0:  System, initial mass: 3458.96
> There are: 42 Atoms
> There are: 42 VSites
> Removing pbc first time
> Done rmpbc
> 
> Constraining the starting coordinates (step -2)
> 
> ++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
> B. Hess and H. Bekker and H. J. C. Berendsen and J. G. E. M. Fraaije
> LINCS: A Linear Constraint Solver for molecular simulations
> J. Comp. Chem. 18 (1997) pp. 1463-1472
> -------- -------- --- Thank You --- -------- --------
> 
> 
> Initializing LINear Constraint Solver
>   number of constraints is 42
>   average number of constraints coupled to one constraint is 2.0
> 
>    Rel. Constraint Deviation:  Max    between atoms     RMS
>        Before LINCS         0.001124     61     66   0.000824
>         After LINCS         0.000091     79     84   0.000066
> 
> 
> Constraining the coordinates at t0-dt (step -1)
>    Rel. Constraint Deviation:  Max    between atoms     RMS
>        Before LINCS         0.000110     50     54   0.000070
>         After LINCS         0.000172     67     68   0.000073
> 
> Started mdrun on node 0 Tue Dec 27 19:23:41 2005
> Initial temperature: 246.98 K
>            Step           Time         Lambda
>               0        0.00000        0.00000
> 
> Grid: 4 x 4 x 30 cells
> Configuring nonbonded kernels...
> 
> and here it stops
> 
> <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
> STDOUT
> ======
> It doesn't contain anything unusual.
> 
> <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
> STDERR
> ======
> 
> First, I get a warning from grompp:
> 
> checking input for internal consistency...
> WARNING 1 [file _freeze.mdp, line unknown]:
> 
> Generated 10 of the 10 non-bonded parameter combinations
> Excluding 1 bonded neighbours for NE6               192
> processing coordinates...
> double-checking input for internal consistency...
> Cleaning up constraints and constant bonded interactions with virtual sites
> renumbering atomtypes...
> converting bonded parameters..
> ...
> Then it reads the trr and edr files from the previous part of the 
> simulation (which was run on another machine, a single-processor amd64; 
> now I'm running on a cluster with Xeons).
> ...
> Walking down the molecule graph to make shake-blocks
> There are 192 charge group borders and 960 shake borders
> There are 192 total borders
> Division over nodes in atoms:
>       84      84      84      84      84      84      78      84
>       78      84      78      84      78      84
> writing run input file...
> There was 1 warning
> 
> gcq#252: "Uh-oh .... Right Again" (Laurie Anderson)
> 
> And then mpirun is run with the appropriate parameters.
> 
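For reference, a typical LAM + GROMACS 3.3 parallel start looks roughly 
like the following; the host file, the input file names and the 
mdrun_mpi binary name are placeholders, not the actual command used here:

  # boot the LAM runtime on the nodes listed in the boot schema
  lamboot -v hostfile

  # preprocess for 14 nodes, then launch the MPI-enabled mdrun
  grompp -np 14 -f md.mdp -c conf.gro -p topol.top -o topol.tpr
  mpirun -np 14 mdrun_mpi -np 14 -s topol.tpr -v
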
> NNODES=14, MYRANK=10, HOSTNAME=skirit40.ics.muni.cz
> NNODES=14, MYRANK=11, HOSTNAME=skirit40.ics.muni.cz
> NNODES=14, MYRANK=8, HOSTNAME=skirit38.ics.muni.cz
> NNODES=14, MYRANK=9, HOSTNAME=skirit38.ics.muni.cz
> NNODES=14, MYRANK=12, HOSTNAME=skirit41.ics.muni.cz
> NNODES=14, MYRANK=13, HOSTNAME=skirit41.ics.muni.cz
> NNODES=14, MYRANK=4, HOSTNAME=skirit35.ics.muni.cz
> NNODES=14, MYRANK=5, HOSTNAME=skirit35.ics.muni.cz
> NNODES=14, MYRANK=6, HOSTNAME=skirit36.ics.muni.cz
> NNODES=14, MYRANK=7, HOSTNAME=skirit36.ics.muni.cz
> NNODES=14, MYRANK=0, HOSTNAME=skirit33.ics.muni.cz
> NNODES=14, MYRANK=1, HOSTNAME=skirit33.ics.muni.cz
> NNODES=14, MYRANK=2, HOSTNAME=skirit34.ics.muni.cz
> NNODES=14, MYRANK=3, HOSTNAME=skirit34.ics.muni.cz
> NODEID=2 argc=15
> NODEID=3 argc=15
> NODEID=0 argc=15
> NODEID=4 argc=15
> NODEID=1 argc=15
> NODEID=5 argc=15
> NODEID=6 argc=15
> NODEID=8 argc=15
> NODEID=7 argc=15
> NODEID=12 argc=15
> NODEID=9 argc=15
> NODEID=13 argc=15
> NODEID=10 argc=15
> NODEID=11 argc=15
>                          :-)  G  R  O  M  A  C  S  (-:
> ...
> ----------------------------------------------------------------------------- 
> 
> One of the processes started by mpirun has exited with a nonzero exit
> code.  This typically indicates that the process finished in error.
> If your process did not finish in error, be sure to include a "return
> 0" or "exit(0)" in your C code before exiting the application.
> 
> PID 24349 failed on node n2 (xxx.xxx.x.xxx) due to signal 11.
> ----------------------------------------------------------------------------- 
> 
> 
> <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
> 
> I checked that node and there is a core file. There are 2 log files on 
> that node (2 processors), and they end with
> 
> Initializing LINear Constraint Solver
>  number of constraints is 42
>   average number of constraints coupled to one constraint is 2.0
> 
> and
> 
> Started mdrun on node 4 Tue Dec 27 19:23:41 2005
> Grid: 4 x 4 x 30 cells
> Configuring nonbonded kernels...
> 
> I'm now running another simulation (12 processors); 2 of its processes 
> are running on the node where it crashed with 14 processors, so I don't 
> think it is a hardware problem...
> 
> I'm not sure how to debug the core file coming from a parallel run 
> (actually, I don't know how to do it even for a single-processor task :))
> 
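One way to take a first look at such a core file, assuming the mdrun 
binary was built with debugging symbols, is to load it into gdb and ask 
for a backtrace (the paths below are placeholders):

  # open the core dump together with the binary that produced it
  gdb /path/to/mdrun_mpi core

  # at the gdb prompt, show where the crash (signal 11) occurred
  (gdb) bt
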
> Does anyone know what could be causing this problem? I would be very 
> glad for any hints/information/suggestions... Could that warning from 
> grompp be the problem?
> 
> Thank you in advance. With best regards,
> 


-- 
David.
________________________________________________________________________
David van der Spoel, PhD, Assoc. Prof., Molecular Biophysics group,
Dept. of Cell and Molecular Biology, Uppsala University.
Husargatan 3, Box 596,  	75124 Uppsala, Sweden
phone:	46 18 471 4205		fax: 46 18 511 755
spoel at xray.bmc.uu.se	spoel at gromacs.org   http://xray.bmc.uu.se/~spoel
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++


