[gmx-users] MPI_Recv invalid count and system explodes for large but not small parallelization on power6 but not opterons
chris.neale at utoronto.ca
Wed Mar 4 00:19:34 CET 2009
Hello,
I am currently testing a large system on a Power6 cluster. I have
compiled GROMACS 4.0.4 successfully, and it appears to work fine on
up to 64 "cores" (sic, see below). First, I notice that it runs at
approximately half the speed it obtains on some older Opterons,
which is unfortunate but acceptable. Second, I run into strange
problems at higher task counts. Since each node has 32 cores with
simultaneous multithreading, 64 tasks fit inside a single box; runs
beyond that must span nodes, so I realize that these problems could
be MPI related.
Some background:
This test system is stable for >100 ns on an Opteron cluster, so I am
quite confident that the problem is not with my topology or starting
structure.
Compilation with -O2 succeeded only after I modified the ./configure
file as follows; otherwise I got a stray ')' and a linking error:
[cneale at tcs-f11n05]$ diff configure.000 configure
5052a5053
> ac_cv_f77_libs="-L/scratch/cneale/exe/fftw-3.1.2_aix/exec/lib -lxlf90 -L/usr/lpp/xlf/lib -lxlopt -lxlf -lxlomp_ser -lpthreads -lm -lc"
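(For anyone else who hits this: instead of editing configure, it
should also be possible to preset the cache variable on the configure
command line, roughly as below. I have only verified the
edited-configure route myself, and the library paths are of course
specific to my installation.)

  ./configure ac_cv_f77_libs="-L/scratch/cneale/exe/fftw-3.1.2_aix/exec/lib -lxlf90 -L/usr/lpp/xlf/lib -lxlopt -lxlf -lxlomp_ser -lpthreads -lm -lc" [other options]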
The error messages:
For N=1,2,4,8,16,32, and 64, the system runs properly.
For N=200, I get the error: "ERROR: 0032-103 Invalid count (-8388608)
in MPI_Recv, task 37". (Note that -8388608 = -2^23, which makes me
suspect an integer overflow in a message size somewhere.)
For N=196, my system explodes with repeated SETTLE/LINCS warnings
followed by a crash.
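(For reference, every run was launched the same way, varying only the
task count; roughly the following, where the mpirun line is a
stand-in for our site's actual launcher:)

  for N in 1 2 4 8 16 32 64 196 200; do
      mpirun -np $N mdrun_mpi -deffnm temp -g temp_${N}.log
  done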
Here are the log file and stderr snippets.
On 200 cores:
The log file appears normal but is truncated.
## stderr:
...
Will use 112 particle-particle and 88 PME only nodes
This is a guess, check the performance at the end of the log file
NOTE: For optimal PME load balancing at high parallelization
PME grid_x (175) and grid_y (175) should be divisible by #PME_nodes (88)
Making 3D domain decomposition 4 x 7 x 4
starting mdrun 'Big Box'
500 steps, 1.0 ps.
ERROR: 0032-103 Invalid count (-8388608) in MPI_Recv, task 37
#####################
On 196 cores:
...
Initializing Domain Decomposition on 196 nodes
Dynamic load balancing: auto
Will sort the charge groups at every domain (re)decomposition
Initial maximum inter charge-group distances:
two-body bonded interactions: 0.556 nm, LJ-14, atoms 25035 25038
multi-body bonded interactions: 0.556 nm, Proper Dih., atoms 25035 25038
Minimum cell size due to bonded interactions: 0.612 nm
Maximum distance for 5 constraints, at 120 deg. angles, all-trans: 0.820 nm
Estimated maximum distance required for P-LINCS: 0.820 nm
This distance will limit the DD cell size, you can override this with -rcon
Guess for relative PME load: 0.37
Will use 108 particle-particle and 88 PME only nodes
This is a guess, check the performance at the end of the log file
Using 88 separate PME nodes
Scaling the initial minimum size with 1/0.8 (option -dds) = 1.25
Optimizing the DD grid for 108 cells with a minimum initial size of 1.025 nm
The maximum allowed number of cells is: X 16 Y 16 Z 14
Domain decomposition grid 6 x 6 x 3, separate PME nodes 88
Interleaving PP and PME nodes
This is a particle-particle only node
Domain decomposition nodeid 0, coordinates 0 0 0
Using two step summing over 4 groups of on average 27.0 processes
...
##### And to stderr, I get:
...
Back Off! I just backed up temp.log to ./#temp.log.2#
Reading file temp.tpr, VERSION 4.0.4 (single precision)
Will use 108 particle-particle and 88 PME only nodes
This is a guess, check the performance at the end of the log file
NOTE: For optimal PME load balancing at high parallelization
PME grid_x (175) and grid_y (175) should be divisible by #PME_nodes (88)
Making 3D domain decomposition 6 x 6 x 3
starting mdrun 'Big Box'
500 steps, 1.0 ps.
Step 61, time 0.122 (ps) LINCS WARNING
relative constraint deviation after LINCS:
rms 0.002765, max 0.028338 (between atoms 46146 and 46145)
bonds that rotated more than 30 degrees:
atom 1 atom 2 angle previous, current, constraint length
46148 46146 89.9 0.1480 0.1499 0.1480
46050 46049 32.0 0.1470 0.1475 0.1470
t = 0.122 ps: Water molecule starting at atom 62389 can not be settled.
Check for bad contacts and/or reduce the timestep.
t = 0.122 ps: Water molecule starting at atom 706505 can not be settled.
...
### And the system then proceeds to explode.
###################
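One thing I have not yet tried, prompted by the PME note above:
forcing a PME node count that divides the 175x175 grid evenly, using
mdrun's -npme option, e.g. 35 (since 175/35 = 5). This is only a
guess at a workaround, and the launcher line below is just our
site's convention:

  mpirun -np 196 mdrun_mpi -npme 35 -deffnm temp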
I am happy to provide more information, and I apologize if what I
have posted here is incomplete. The full log files are large, though,
and I tried to keep this first post as short as possible.
Thanks for any assistance,
Chris.