[gmx-developers] Problem with simulation in 8 nodes

Jose Duarte duarte at molgen.mpg.de
Mon Aug 4 14:30:16 CEST 2008

I'm running a simulation with the latest version of GROMACS from CVS. My 
protein is 90 residues long. I add waters and ions as usual and then 
perform energy minimization, position-restrained equilibration and a 
molecular dynamics run. This all works perfectly fine on 1 CPU (standard 
mdrun executable) and on 4 and 6 CPUs (using mdrun_mpi), but mysteriously 
fails on 8 CPUs. I've tried this on several setups: LAM/MPI on a single 
multi-core Linux box, LAM/MPI on several nodes of a cluster, and 
Open MPI on a multi-core Mac. I always get exactly the same behaviour: 
everything works on 4 or 6 CPUs but fails on 8. GROMACS is compiled 
with default parameters (single precision).

The failure occurs in the energy minimization step. I run a fairly 
standard EM with PME for electrostatic interactions. This is the error 
message mdrun prints:

Making 2D domain decomposition 4 x 2 x 1
Steepest Descents:
   Tolerance (Fmax)   =  1.00000e+01
   Number of steps    =         5000

A list of missing interactions:
            G96Angle of   1304 missing     -1
         Proper Dih. of    510 missing     -2
       Improper Dih. of    409 missing     -1

Program mdrun_mpi, VERSION 3.3.99_development_20080718
Source code file: domdec_top.c, line: 88

Software inconsistency error:
Some interactions seem to be assigned multiple times


Error on node 3, will try to stop all the nodes
Halting parallel program mdrun_mpi on CPU 3 out of 8

The one difference I notice compared with running on 4 or 6 CPUs is 
that in those cases the domain decomposition is 1D instead of 2D. I 
have no idea whether that's relevant.
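
To make the 1D versus 2D observation concrete, here is a minimal Python 
sketch that lists every grid shape a given number of ranks could be 
split into. The function name grid_shapes is hypothetical, and this is 
not how mdrun picks its decomposition; it only enumerates the 
possibilities for 4, 6 and 8 ranks.

```python
def grid_shapes(n_ranks):
    """Return every (nx, ny, nz) with nx * ny * nz == n_ranks, nx >= ny >= nz.

    Illustration only: this enumerates candidate grids and says nothing
    about which one mdrun would actually select.
    """
    shapes = []
    for nx in range(1, n_ranks + 1):
        if n_ranks % nx:
            continue
        rest = n_ranks // nx
        for ny in range(1, rest + 1):
            if rest % ny:
                continue
            nz = rest // ny
            # Keep one representative per shape by ordering the axes.
            if nx >= ny >= nz:
                shapes.append((nx, ny, nz))
    return shapes


for n in (4, 6, 8):
    print(n, grid_shapes(n))
```

The 4 x 2 x 1 grid reported in the output above is one of the shapes 
listed for 8 ranks. Note that 4 and 6 ranks also admit 2D grids, so 
the rank count alone does not explain why only the 8-CPU run ended up 
with a 2D decomposition.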

Looking at the log file produced by mdrun, the simulation actually 
seems to run properly until step 403, after which this error is reported:

Not all bonded interactions have been properly assigned to the domain 
decomposition cells

A list of missing interactions:
            G96Angle of   1304 missing     -1
         Proper Dih. of    510 missing     -2
       Improper Dih. of    409 missing     -1
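
For anyone comparing several such logs, the list above can be 
tabulated with a short Python sketch. The regular expression and the 
function name parse_missing are hypothetical and based only on the 
three lines shown here. Reading the number after "of" as the expected 
total, and a negative "missing" count as interactions assigned more 
often than expected, is an assumption on my part, though it would be 
consistent with the "assigned multiple times" message.

```python
import re

# Pattern inferred from the three log lines in this message only;
# other GROMACS versions may format this list differently.
LINE = re.compile(
    r"^\s*(?P<kind>.+?)\s+of\s+(?P<total>\d+)\s+missing\s+(?P<missing>-?\d+)\s*$"
)


def parse_missing(log_text):
    """Return (interaction type, total, missing) tuples found in log_text."""
    rows = []
    for line in log_text.splitlines():
        m = LINE.match(line)
        if m:
            rows.append(
                (m.group("kind"), int(m.group("total")), int(m.group("missing")))
            )
    return rows


sample = """\
A list of missing interactions:
            G96Angle of   1304 missing     -1
         Proper Dih. of    510 missing     -2
       Improper Dih. of    409 missing     -1
"""

rows = parse_missing(sample)
# Assumption: a negative "missing" count means surplus assignments.
surplus = [(kind, -missing) for kind, _total, missing in rows if missing < 0]
print(rows)
print(surplus)
```

On the excerpt above this gives four surplus assignments in total: one 
G96 angle, two proper dihedrals and one improper dihedral.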

I have also run the same procedure on another protein, and the 
problem doesn't arise at all, so it seems to be related to this 
particular protein. I can send the PDB file if that's helpful.

Any ideas? Is this a bug?



Jose M. Duarte
Max Planck Institute for Molecular Genetics
Ihnestr. 63-73
14195 Berlin
