[gmx-developers] parallelization-dependent crashes in CVS Gromacs
Peter Kasson
kasson at stanford.edu
Thu Jun 19 21:38:05 CEST 2008

I've been encountering crashes in CVS gromacs that depend on the
parallelization.  If I run an up-to-date CVS mdrun (or any of various
snapshots from the past two months), I get a rapid crash, as detailed
below.  However, the job completes successfully if I run with
-rdd 1.6 [somewhat arbitrarily chosen], with -dd 2 2 1, or with
single-processor mdrun.  I have also had errors on similar small test
systems, where the error is an nsgrid failure (again in parallel but
not on a single processor).  Larger test systems have been working OK
for me, although I haven't tried simply replicating this box.
Any ideas?  Has anyone else encountered something similar?
(If you want input files, drop me a line.)
Thanks,
--Peter

mpiexec -np 4 /array10/software/gmx/src/kernel/mdrun -v -dlb -deffnm frame0
[...]
Reading file frame0.tpr, VERSION 3.3.99_development_20080208 (single precision)
Note: tpx file_version 54, software version 56
Loaded with Money
Making 1D domain decomposition 4 x 1 x 1
starting mdrun 'CMG in water'
1000000 steps,   4000.0 ps.
step 0
t = 0.144 ps: Water molecule starting at atom 28978 can not be settled.
Check for bad contacts and/or reduce the timestep.
Back Off! I just backed up step36b_n1.pdb to ./#step36b_n1.pdb.1#
Back Off! I just backed up step36c_n1.pdb to ./#step36c_n1.pdb.1#
Wrote pdb files with previous and current coordinates
t = 0.148 ps: Water molecule starting at atom 28978 can not be settled.
Check for bad contacts and/or reduce the timestep.
Back Off! I just backed up step37b_n1.pdb to ./#step37b_n1.pdb.1#
Back Off! I just backed up step37c_n1.pdb to ./#step37c_n1.pdb.1#
Wrote pdb files with previous and current coordinates
t = 0.152 ps: Water molecule starting at atom 22435 can not be settled.
Check for bad contacts and/or reduce the timestep.
Back Off! I just backed up step38b_n1.pdb to ./#step38b_n1.pdb.1#
Back Off! I just backed up step38c_n1.pdb to ./#step38c_n1.pdb.1#
Wrote pdb files with previous and current coordinates
t = 0.156 ps: Water molecule starting at atom 9895 can not be settled.
Check for bad contacts and/or reduce the timestep.
Back Off! I just backed up step39b_n0.pdb to ./#step39b_n0.pdb.1#
t = 0.156 ps: Water molecule starting at atom 22435 can not be settled.
Check for bad contacts and/or reduce the timestep.
Back Off! I just backed up step39b_n1.pdb to ./#step39b_n1.pdb.1#
Back Off! I just backed up step39c_n1.pdb to ./#step39c_n1.pdb.1#
Back Off! I just backed up step39c_n0.pdb to ./#step39c_n0.pdb.1#
Wrote pdb files with previous and current coordinates
Wrote pdb files with previous and current coordinates
t = 0.160 ps: Water molecule starting at atom 9895 can not be settled.
Check for bad contacts and/or reduce the timestep.
Back Off! I just backed up step40b_n0.pdb to ./#step40b_n0.pdb.1#
t = 0.160 ps: Water molecule starting at atom 39988 can not be settled.
Check for bad contacts and/or reduce the timestep.
Back Off! I just backed up step40b_n1.pdb to ./#step40b_n1.pdb.1#
Back Off! I just backed up step40c_n1.pdb to ./#step40c_n1.pdb.1#
Back Off! I just backed up step40c_n0.pdb to ./#step40c_n0.pdb.1#
Wrote pdb files with previous and current coordinates
Wrote pdb files with previous and current coordinates
-------------------------------------------------------
Program mdrun, VERSION 3.3.99_development_200800503
Source code file: pme.c, line: 510
Fatal error:
1 particles communicated to PME node 1 are more than a cell length out of the domain decomposition cell of their charge group
-------------------------------------------------------
    
    