[gmx-users] parallel runs with lam
David van der Spoel
spoel at xray.bmc.uu.se
Wed Dec 28 01:07:03 CET 2005
Lubos Vrbka wrote:
> Hi guys,
>
> I'm seeing strange behavior when running GROMACS (3.3) in parallel with
> LAM (version 7.1.1, IIRC).
>
> Is there any limit on the number of processors that can be used for an MD
> simulation, based on the total number of atoms in the system?
>
> When I run a simulation of 896 water molecules, I can use 20
> processors at a time without problems (the interconnect is Myrinet 2000).
> When I run a simulation with only 192 water molecules, however, I can use up
> to 12 processors, but not more. The log file for processor 0 and the standard
> and error outputs follow (14 processors used; I tried to strip
> unnecessary material from the outputs, so if anything is missing, let me know).
That looks odd: your grid is 4 x 4 x 30.
Is your box also very elongated?
It should work regardless, though: if it runs on one or a few processors, it
should either run on 14 or quit with a decent error message. Obviously it is
not going to be efficient to run such a small system on many processors.
If you can reproduce this, please file a bugzilla report.
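For example, the last line of a .gro file holds the box vectors (in nm), so
something like this (assuming your starting structure is called conf.gro)
will show whether the box is unusually long along z:

  # last line of a .gro file = box vectors in nm
  tail -n 1 conf.gro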
>
> <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
> LOGFILE
> =======
> ...
> CPU= 0, lastcg= 13, targetcg= 109, myshift= 7
> CPU= 1, lastcg= 27, targetcg= 123, myshift= 7
> CPU= 2, lastcg= 41, targetcg= 137, myshift= 7
> CPU= 3, lastcg= 55, targetcg= 151, myshift= 7
> CPU= 4, lastcg= 69, targetcg= 165, myshift= 7
> CPU= 5, lastcg= 83, targetcg= 179, myshift= 8
> CPU= 6, lastcg= 96, targetcg= 0, myshift= 8
> CPU= 7, lastcg= 110, targetcg= 14, myshift= 7
> CPU= 8, lastcg= 123, targetcg= 28, myshift= 7
> CPU= 9, lastcg= 137, targetcg= 42, myshift= 7
> CPU= 10, lastcg= 150, targetcg= 54, myshift= 7
> CPU= 11, lastcg= 164, targetcg= 68, myshift= 7
> CPU= 12, lastcg= 177, targetcg= 82, myshift= 7
> CPU= 13, lastcg= 191, targetcg= 96, myshift= 7
> nsb->shift = 8, nsb->bshift= 0
> Listing Scalars
> nsb->nodeid: 0
> nsb->nnodes: 14
> nsb->cgtotal: 192
> nsb->natoms: 1152
> nsb->shift: 8
> nsb->bshift: 0
> Nodeid index homenr cgload workload
> 0 0 84 14 14
> 1 84 84 28 28
> 2 168 84 42 42
> 3 252 84 56 56
> 4 336 84 70 70
> 5 420 84 84 84
> 6 504 78 97 97
> 7 582 84 111 111
> 8 666 78 124 124
> 9 744 84 138 138
> 10 828 78 151 151
> 11 906 84 165 165
> 12 990 78 178 178
> 13 1068 84 192 192
>
> .....
> Parallelized PME sum used.
> PARALLEL FFT DATA:
> local_nx: 1 local_x_start: 0
> local_ny_after_transpose: 2 local_y_start_after_transpose 0
> Center of mass motion removal mode is Linear
> We have the following groups for center of mass motion removal:
> 0: System, initial mass: 3458.96
> There are: 42 Atoms
> There are: 42 VSites
> Removing pbc first time
> Done rmpbc
>
> Constraining the starting coordinates (step -2)
>
> ++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
> B. Hess and H. Bekker and H. J. C. Berendsen and J. G. E. M. Fraaije
> LINCS: A Linear Constraint Solver for molecular simulations
> J. Comp. Chem. 18 (1997) pp. 1463-1472
> -------- -------- --- Thank You --- -------- --------
>
>
> Initializing LINear Constraint Solver
> number of constraints is 42
> average number of constraints coupled to one constraint is 2.0
>
> Rel. Constraint Deviation: Max between atoms RMS
> Before LINCS 0.001124 61 66 0.000824
> After LINCS 0.000091 79 84 0.000066
>
>
> Constraining the coordinates at t0-dt (step -1)
> Rel. Constraint Deviation: Max between atoms RMS
> Before LINCS 0.000110 50 54 0.000070
> After LINCS 0.000172 67 68 0.000073
>
> Started mdrun on node 0 Tue Dec 27 19:23:41 2005
> Initial temperature: 246.98 K
> Step Time Lambda
> 0 0.00000 0.00000
>
> Grid: 4 x 4 x 30 cells
> Configuring nonbonded kernels...
>
> and here it stops
>
> <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
> STDOUT
> ======
> it doesn't contain anything unusual
>
> <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
> STDERR
> ======
>
> First, I get a warning from grompp:
>
> checking input for internal consistency...
> WARNING 1 [file _freeze.mdp, line unknown]:
>
> Generated 10 of the 10 non-bonded parameter combinations
> Excluding 1 bonded neighbours for NE6 192
> processing coordinates...
> double-checking input for internal consistency...
> Cleaning up constraints and constant bonded interactions with virtual sites
> renumbering atomtypes...
> converting bonded parameters..
> ...
> then it reads the trr and edr files from the previous part of the simulation
> (which was run on another machine, a single-processor AMD64 box; now it is
> running on a cluster of Xeons)
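> (the tpr was prepared with something like the following, using grompp's
> -t and -e options to continue from the old run; the filenames are just
> placeholders:
>
>   grompp -f md.mdp -c conf.gro -p topol.top \
>          -t prev.trr -e prev.edr -np 14 -o topol.tpr
>
> )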
> ...
> Walking down the molecule graph to make shake-blocks
> There are 192 charge group borders and 960 shake borders
> There are 192 total borders
> Division over nodes in atoms:
> 84 84 84 84 84 84 78 84 78
> 84 78 84 78 84
> writing run input file...
> There was 1 warning
>
> gcq#252: "Uh-oh .... Right Again" (Laurie Anderson)
>
> and then mpirun is run with appropriate parameters
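> (roughly along these lines; the binary and file names are just placeholders:
>
>   lamboot hostfile                    # start the LAM daemons on the nodes
>   mpirun -np 14 mdrun_mpi -np 14 -s topol.tpr -g md.log
>
> )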
>
> NNODES=14, MYRANK=10, HOSTNAME=skirit40.ics.muni.cz
> NNODES=14, MYRANK=11, HOSTNAME=skirit40.ics.muni.cz
> NNODES=14, MYRANK=8, HOSTNAME=skirit38.ics.muni.cz
> NNODES=14, MYRANK=9, HOSTNAME=skirit38.ics.muni.cz
> NNODES=14, MYRANK=12, HOSTNAME=skirit41.ics.muni.cz
> NNODES=14, MYRANK=13, HOSTNAME=skirit41.ics.muni.cz
> NNODES=14, MYRANK=4, HOSTNAME=skirit35.ics.muni.cz
> NNODES=14, MYRANK=5, HOSTNAME=skirit35.ics.muni.cz
> NNODES=14, MYRANK=6, HOSTNAME=skirit36.ics.muni.cz
> NNODES=14, MYRANK=7, HOSTNAME=skirit36.ics.muni.cz
> NNODES=14, MYRANK=0, HOSTNAME=skirit33.ics.muni.cz
> NNODES=14, MYRANK=1, HOSTNAME=skirit33.ics.muni.cz
> NNODES=14, MYRANK=2, HOSTNAME=skirit34.ics.muni.cz
> NNODES=14, MYRANK=3, HOSTNAME=skirit34.ics.muni.cz
> NODEID=2 argc=15
> NODEID=3 argc=15
> NODEID=0 argc=15
> NODEID=4 argc=15
> NODEID=1 argc=15
> NODEID=5 argc=15
> NODEID=6 argc=15
> NODEID=8 argc=15
> NODEID=7 argc=15
> NODEID=12 argc=15
> NODEID=9 argc=15
> NODEID=13 argc=15
> NODEID=10 argc=15
> NODEID=11 argc=15
> :-) G R O M A C S (-:
> ...
> -----------------------------------------------------------------------------
>
> One of the processes started by mpirun has exited with a nonzero exit
> code. This typically indicates that the process finished in error.
> If your process did not finish in error, be sure to include a "return
> 0" or "exit(0)" in your C code before exiting the application.
>
> PID 24349 failed on node n2 (xxx.xxx.x.xxx) due to signal 11.
> -----------------------------------------------------------------------------
>
>
> <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
>
> I checked that node and there is a core file. There are two log files on
> that node (two processors), and they end with
>
> Initializing LINear Constraint Solver
> number of constraints is 42
> average number of constraints coupled to one constraint is 2.0
>
> and
>
> Started mdrun on node 4 Tue Dec 27 19:23:41 2005
> Grid: 4 x 4 x 30 cells
> Configuring nonbonded kernels...
>
> I'm now running another simulation (12 processors), with 2 processes
> running on the node where the 14-processor run crashed, so I don't
> think it is a hardware problem...
>
> I'm not sure how to debug the core file coming from a parallel run
> (actually, I don't know how to do it even for a single-processor job :))
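> (would something like
>
>   gdb ./mdrun_mpi core    # then type "bt" at the gdb prompt for a backtrace
>
> with the actual mdrun binary and core file in place of the names above, give
> a useful backtrace, assuming mdrun was compiled with debugging symbols?
> I have never tried it.)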
>
> Does anyone know what could be causing this problem? I would be very glad
> for any hints, information, or suggestions... Could that warning from grompp
> be part of the problem?
>
> Thank you in advance. With best regards,
>
--
David.
________________________________________________________________________
David van der Spoel, PhD, Assoc. Prof., Molecular Biophysics group,
Dept. of Cell and Molecular Biology, Uppsala University.
Husargatan 3, Box 596, 75124 Uppsala, Sweden
phone: 46 18 471 4205 fax: 46 18 511 755
spoel at xray.bmc.uu.se spoel at gromacs.org http://xray.bmc.uu.se/~spoel
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++