[gmx-users] parallel runs with lam
Lubos Vrbka
lubos.vrbka at gmail.com
Tue Dec 27 19:52:01 CET 2005
hi guys,
i experience strange behavior when running parallel gromacs (3.3) using
lam (version 7.1.1, iirc).
is there any limitation on the number of processors used for the md
simulation based on the total number of atoms in the system?
when i run a simulation of 896 water molecules, i can use 20
processors at a time without problems (i'm running over a myrinet2000
interconnect).
when i run a simulation with only 192 water molecules, i can use up to
12 processors, but no more. the log file for processor 0 plus the
standard output and standard error follow (14 processors used; i tried
to remove unnecessary stuff from the outputs - if you miss something, let me know)
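(side by side that means roughly 896/20 = 45 waters per processor when it
works, 192/12 = 16 in the largest run of the small system that still works,
and only 192/14 = 14 when it crashes - just plain arithmetic on the numbers above.)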
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
LOGFILE
=======
...
CPU= 0, lastcg= 13, targetcg= 109, myshift= 7
CPU= 1, lastcg= 27, targetcg= 123, myshift= 7
CPU= 2, lastcg= 41, targetcg= 137, myshift= 7
CPU= 3, lastcg= 55, targetcg= 151, myshift= 7
CPU= 4, lastcg= 69, targetcg= 165, myshift= 7
CPU= 5, lastcg= 83, targetcg= 179, myshift= 8
CPU= 6, lastcg= 96, targetcg= 0, myshift= 8
CPU= 7, lastcg= 110, targetcg= 14, myshift= 7
CPU= 8, lastcg= 123, targetcg= 28, myshift= 7
CPU= 9, lastcg= 137, targetcg= 42, myshift= 7
CPU= 10, lastcg= 150, targetcg= 54, myshift= 7
CPU= 11, lastcg= 164, targetcg= 68, myshift= 7
CPU= 12, lastcg= 177, targetcg= 82, myshift= 7
CPU= 13, lastcg= 191, targetcg= 96, myshift= 7
nsb->shift = 8, nsb->bshift= 0
Listing Scalars
nsb->nodeid: 0
nsb->nnodes: 14
nsb->cgtotal: 192
nsb->natoms: 1152
nsb->shift: 8
nsb->bshift: 0
Nodeid index homenr cgload workload
0 0 84 14 14
1 84 84 28 28
2 168 84 42 42
3 252 84 56 56
4 336 84 70 70
5 420 84 84 84
6 504 78 97 97
7 582 84 111 111
8 666 78 124 124
9 744 84 138 138
10 828 78 151 151
11 906 84 165 165
12 990 78 178 178
13 1068 84 192 192
.....
Parallelized PME sum used.
PARALLEL FFT DATA:
local_nx: 1 local_x_start: 0
local_ny_after_transpose: 2 local_y_start_after_transpose 0
Center of mass motion removal mode is Linear
We have the following groups for center of mass motion removal:
0: System, initial mass: 3458.96
There are: 42 Atoms
There are: 42 VSites
Removing pbc first time
Done rmpbc
Constraining the starting coordinates (step -2)
++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
B. Hess and H. Bekker and H. J. C. Berendsen and J. G. E. M. Fraaije
LINCS: A Linear Constraint Solver for molecular simulations
J. Comp. Chem. 18 (1997) pp. 1463-1472
-------- -------- --- Thank You --- -------- --------
Initializing LINear Constraint Solver
number of constraints is 42
average number of constraints coupled to one constraint is 2.0
Rel. Constraint Deviation: Max between atoms RMS
Before LINCS 0.001124 61 66 0.000824
After LINCS 0.000091 79 84 0.000066
Constraining the coordinates at t0-dt (step -1)
Rel. Constraint Deviation: Max between atoms RMS
Before LINCS 0.000110 50 54 0.000070
After LINCS 0.000172 67 68 0.000073
Started mdrun on node 0 Tue Dec 27 19:23:41 2005
Initial temperature: 246.98 K
Step Time Lambda
0 0.00000 0.00000
Grid: 4 x 4 x 30 cells
Configuring nonbonded kernels...
and here it stops
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
STDOUT
======
it doesn't contain anything unusual
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
STDERR
======
firstly i get a warning from grompp:
checking input for internal consistency...
WARNING 1 [file _freeze.mdp, line unknown]:
Generated 10 of the 10 non-bonded parameter combinations
Excluding 1 bonded neighbours for NE6 192
processing coordinates...
double-checking input for internal consistency...
Cleaning up constraints and constant bonded interactions with virtual sites
renumbering atomtypes...
converting bonded parameters..
...
then it reads the trr and edr files from the previous part of the simulation
(that part was run on another machine - a single amd64 processor; the current
run is on a cluster with xeons)
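i pass the old files to grompp via its continuation options, i.e. roughly the
following (the file names here are placeholders, not my exact command line):

  grompp -np 14 -f run.mdp -c conf.gro -p topol.top \
         -t prev.trr -e prev.edr -o topol.tpr
  # -t/-e make grompp pick up coordinates, velocities and restart information
  # from the previous part instead of the plain structure file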
...
Walking down the molecule graph to make shake-blocks
There are 192 charge group borders and 960 shake borders
There are 192 total borders
Division over nodes in atoms:
84 84 84 84 84 84 78 84
78 84 78 84 78 84
writing run input file...
There was 1 warning
gcq#252: "Uh-oh .... Right Again" (Laurie Anderson)
and then mpirun is run with appropriate parameters
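the launch itself is roughly the following (the hostfile and the name of the
mpi-enabled mdrun binary are placeholders, not my exact command line):

  lamboot hostfile                                 # start the lam daemons on the nodes
  mpirun -np 14 mdrun_mpi -np 14 -s topol.tpr -v   # -np given to mdrun must match grompp -np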
NNODES=14, MYRANK=10, HOSTNAME=skirit40.ics.muni.cz
NNODES=14, MYRANK=11, HOSTNAME=skirit40.ics.muni.cz
NNODES=14, MYRANK=8, HOSTNAME=skirit38.ics.muni.cz
NNODES=14, MYRANK=9, HOSTNAME=skirit38.ics.muni.cz
NNODES=14, MYRANK=12, HOSTNAME=skirit41.ics.muni.cz
NNODES=14, MYRANK=13, HOSTNAME=skirit41.ics.muni.cz
NNODES=14, MYRANK=4, HOSTNAME=skirit35.ics.muni.cz
NNODES=14, MYRANK=5, HOSTNAME=skirit35.ics.muni.cz
NNODES=14, MYRANK=6, HOSTNAME=skirit36.ics.muni.cz
NNODES=14, MYRANK=7, HOSTNAME=skirit36.ics.muni.cz
NNODES=14, MYRANK=0, HOSTNAME=skirit33.ics.muni.cz
NNODES=14, MYRANK=1, HOSTNAME=skirit33.ics.muni.cz
NNODES=14, MYRANK=2, HOSTNAME=skirit34.ics.muni.cz
NNODES=14, MYRANK=3, HOSTNAME=skirit34.ics.muni.cz
NODEID=2 argc=15
NODEID=3 argc=15
NODEID=0 argc=15
NODEID=4 argc=15
NODEID=1 argc=15
NODEID=5 argc=15
NODEID=6 argc=15
NODEID=8 argc=15
NODEID=7 argc=15
NODEID=12 argc=15
NODEID=9 argc=15
NODEID=13 argc=15
NODEID=10 argc=15
NODEID=11 argc=15
:-) G R O M A C S (-:
...
-----------------------------------------------------------------------------
One of the processes started by mpirun has exited with a nonzero exit
code. This typically indicates that the process finished in error.
If your process did not finish in error, be sure to include a "return
0" or "exit(0)" in your C code before exiting the application.
PID 24349 failed on node n2 (xxx.xxx.x.xxx) due to signal 11.
-----------------------------------------------------------------------------
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
i checked that node and there is a core file on it. there are 2 log files on
that node (2 processors) and they end with
Initializing LINear Constraint Solver
number of constraints is 42
average number of constraints coupled to one constraint is 2.0
and
Started mdrun on node 4 Tue Dec 27 19:23:41 2005
Grid: 4 x 4 x 30 cells
Configuring nonbonded kernels...
i'm now running another simulation (12 processors); 2 of its processes are
running on the node where the 14-processor run crashed, so i don't
think it is a hardware problem...
i'm not sure how to debug a core file coming from a parallel run
(actually i don't know how to do that even for a single-processor job :))
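i guess the first thing to try would be opening the core in gdb and looking at
the backtrace, something like this (with the binary path and the core file name
adjusted to whatever is actually on the node):

  gdb /path/to/mdrun_mpi core   # the core may also be named core.<pid>, e.g. core.24349
  (gdb) bt                      # print the stack at the point of the segfault
  (gdb) quit
  # the backtrace is only really readable if mdrun was built with debugging symbols (-g)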
does anyone know what could be causing this problem? i would be very glad
for any hints/information/suggestions... could that warning from grompp be
the problem?
thank you in advance. with best regards,
--
Lubos