[gmx-users] MPI_Recv invalid count and system explodes for large but not small parallelization on power6 but not opterons

chris.neale at utoronto.ca
Wed Mar 4 00:19:34 CET 2009


Hello,

I am currently testing a large system on a Power6 cluster. I have
compiled gromacs 4.0.4 successfully, and it appears to work fine
for <64 "cores" (sic; see the note on multithreading below). First, I
notice that it runs at approximately half the speed it obtains on some
older Opterons, which is unfortunate but acceptable. Second, I run into
some strange issues at higher core counts. Since there are 32 cores per
node and simultaneous multithreading doubles that to 64 tasks inside
one box, I realize that these problems could be MPI related.

Some background:
This test system is stable for >100 ns on an Opteron, so I am quite
confident that the problem is not with my topology or starting
structure.

Compilation with -O2 succeeded only after I modified the ./configure
script as follows; otherwise I got a stray ')' and a linking error:
[cneale at tcs-f11n05]$ diff configure.000 configure
5052a5053
> ac_cv_f77_libs="-L/scratch/cneale/exe/fftw-3.1.2_aix/exec/lib -lxlf90 -L/usr/lpp/xlf/lib -lxlopt -lxlf -lxlomp_ser -lpthreads -lm -lc"
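
(In case it is useful to anyone else on AIX: rather than editing
configure directly, presetting the cache variable in the environment
should be equivalent, assuming the script honors preset autoconf cache
variables the way stock autoconf does:

export ac_cv_f77_libs="-L/scratch/cneale/exe/fftw-3.1.2_aix/exec/lib \
-lxlf90 -L/usr/lpp/xlf/lib -lxlopt -lxlf -lxlomp_ser -lpthreads -lm -lc"
./configure <your usual options>

configure checks whether ac_cv_f77_libs is already set before probing
the Fortran runtime libraries, so the preset value bypasses the test
that produced the stray ')'.)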

The error messages:
For N = 1, 2, 4, 8, 16, 32, and 64, the system runs properly.
For N = 200, I get the error: "ERROR: 0032-103 Invalid count  (-8388608)
in MPI_Recv, task 37"
For N = 196, my system explodes with repeated SETTLE/LINCS warnings
followed by a crash.
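
In case it helps with diagnosis (this is pure speculation on my part):
-8388608 is exactly -2^23, which is what a byte count of 4286578688
looks like after wrapping around in a signed 32-bit integer. So I
wonder whether some message size overflows an int at this level of
parallelization. The arithmetic, for anyone who wants to check it:

$ echo $(( 4286578688 - 4294967296 ))   # hypothetical byte count minus 2^32
-8388608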

Here are the log file and stderr snippets:

On 200 cores:

The log file appears normal but is truncated.

## stderr:
...
Will use 112 particle-particle and 88 PME only nodes
This is a guess, check the performance at the end of the log file

NOTE: For optimal PME load balancing at high parallelization
       PME grid_x (175) and grid_y (175) should be divisible by #PME_nodes (88)

Making 3D domain decomposition 4 x 7 x 4

starting mdrun 'Big Box'
500 steps,      1.0 ps.
ERROR: 0032-103 Invalid count  (-8388608) in MPI_Recv, task 37


#####################

On 196 cores:

...
Initializing Domain Decomposition on 196 nodes
Dynamic load balancing: auto
Will sort the charge groups at every domain (re)decomposition
Initial maximum inter charge-group distances:
     two-body bonded interactions: 0.556 nm, LJ-14, atoms 25035 25038
   multi-body bonded interactions: 0.556 nm, Proper Dih., atoms 25035 25038
Minimum cell size due to bonded interactions: 0.612 nm
Maximum distance for 5 constraints, at 120 deg. angles, all-trans: 0.820 nm
Estimated maximum distance required for P-LINCS: 0.820 nm
This distance will limit the DD cell size, you can override this with -rcon
Guess for relative PME load: 0.37
Will use 108 particle-particle and 88 PME only nodes
This is a guess, check the performance at the end of the log file
Using 88 separate PME nodes
Scaling the initial minimum size with 1/0.8 (option -dds) = 1.25
Optimizing the DD grid for 108 cells with a minimum initial size of 1.025 nm
The maximum allowed number of cells is: X 16 Y 16 Z 14
Domain decomposition grid 6 x 6 x 3, separate PME nodes 88
Interleaving PP and PME nodes
This is a particle-particle only node

Domain decomposition nodeid 0, coordinates 0 0 0

Using two step summing over 4 groups of on average 27.0 processes
...


##### And to stderr, I get:
...
Back Off! I just backed up temp.log to ./#temp.log.2#
Reading file temp.tpr, VERSION 4.0.4 (single precision)

Will use 108 particle-particle and 88 PME only nodes
This is a guess, check the performance at the end of the log file

NOTE: For optimal PME load balancing at high parallelization
       PME grid_x (175) and grid_y (175) should be divisible by #PME_nodes (88)

Making 3D domain decomposition 6 x 6 x 3

starting mdrun 'Big Box'
500 steps,      1.0 ps.

Step 61, time 0.122 (ps)  LINCS WARNING
relative constraint deviation after LINCS:
rms 0.002765, max 0.028338 (between atoms 46146 and 46145)
bonds that rotated more than 30 degrees:
  atom 1 atom 2  angle  previous, current, constraint length
   46148  46146   89.9    0.1480   0.1499      0.1480
   46050  46049   32.0    0.1470   0.1475      0.1470

t = 0.122 ps: Water molecule starting at atom 62389 can not be settled.
Check for bad contacts and/or reduce the timestep.

t = 0.122 ps: Water molecule starting at atom 706505 can not be settled.

...

### And the system then proceeds to explode.

###################

I am happy to provide more information, and I apologize if what I have
posted here is incomplete; the log files are large, and I tried to keep
this first post as short as possible.

Thanks for any assistance,
Chris.



