[gmx-users] Hardware-specific crash with 4.5.1

Justin A. Lemkul jalemkul at vt.edu
Mon Sep 27 22:59:22 CEST 2010


Hi All,

I'm hoping I might get some tips on tracking down the source of an issue that 
appears to be hardware-specific, leading to crashes in my simulations.  The 
failures occur on our supercomputer (Mac OS X 10.3, PowerPC).  Running the same 
.tpr file on my laptop (Mac OS X 10.5.8, Intel Core2Duo) and on another 
workstation (Ubuntu 10.04, AMD64) produces identical, successful results.  I 
suspect the problem stems from unsuccessful energy minimization, which then 
leads to a crash when running full MD.  All jobs were run in parallel on two 
cores.  The supercomputer does not support threading, so MPI is invoked using 
MPICH-1.2.5 (the native MPI implementation on the cluster).
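
For completeness, the jobs were prepared and launched roughly as follows (input 
file names are illustrative; the program suffixes match the installations 
appended below):

$ grompp_4.5.1_s -f em.mdp -c system.gro -p topol.top -o em.tpr

$ mpirun -np 2 mdrun_4.5.1_mpi -deffnm em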


Details as follows:

EM md.log file: successful run (Intel Core2Duo or AMD64)

Steepest Descents converged to Fmax < 1000 in 7 steps
Potential Energy  = -4.8878180e+04
Maximum force     =  8.7791553e+02 on atom 5440
Norm of force     =  1.1781271e+02


EM md.log file: unsuccessful run (PowerPC)

Steepest Descents converged to Fmax < 1000 in 1 steps
Potential Energy  = -2.4873273e+04
Maximum force     =  0.0000000e+00 on atom 0
Norm of force     =            nan
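
The nan force norm suggests the coordinates themselves have gone to nan.  Since 
.gro files are plain text, a quick grep of the minimized structure should 
confirm this (file name illustrative):

$ grep -ci nan em.gro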


MD started from the minimized structure generated on my laptop or the AMD64 
workstation runs successfully (at least for a few hundred steps in my test), 
but MD on the PowerPC cluster fails immediately:

            Step           Time         Lambda
               0        0.00000        0.00000

    Energies (kJ/mol)
             U-B    Proper Dih.  Improper Dih.      CMAP Dih. GB Polarization
     7.93559e+03    9.34958e+03    2.24036e+02   -2.47750e+03   -7.83599e+04
           LJ-14     Coulomb-14        LJ (SR)   Coulomb (SR)      Potential
     7.70042e+03    9.94520e+04   -1.17168e+04   -5.79783e+04   -2.55780e+04
     Kinetic En.   Total Energy    Temperature Pressure (bar)   Constr. rmsd
             nan            nan            nan    0.00000e+00            nan
   Constr.2 rmsd
             nan

DD  step 9 load imb.: force  3.0%


-------------------------------------------------------
Program mdrun_4.5.1_mpi, VERSION 4.5.1
Source code file: nsgrid.c, line: 601

Range checking error:
Explanation: During neighborsearching, we assign each particle to a grid
based on its coordinates. If your system contains collisions or parameter
errors that give particles very high velocities you might end up with some
coordinates being +-Infinity or NaN (not-a-number). Obviously, we cannot
put these on a grid, so this is usually where we detect those errors.
Make sure your system is properly energy-minimized and that the potential
energy seems reasonable before trying again.
Variable ind has value 7131. It should have been within [ 0 .. 7131 ]

For more information and tips for troubleshooting, please check the GROMACS
website at http://www.gromacs.org/Documentation/Errors
-------------------------------------------------------

As an aside, the error message itself looks off: if the reported range 
[ 0 .. 7131 ] is really inclusive, a value of 7131 shouldn't trigger the check 
at all.  Presumably the upper bound is actually exclusive and the message is 
misleading, but the underlying problem is still the nan coordinates.

Running with the all-vs-all kernels works, but the performance is horrendously 
slow (<300 ps per day for this 7131-atom system), so I am attempting to use 
long cutoffs (2.0 nm) instead, as others on the list have suggested.
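
For reference, the all-vs-all runs used the same md.mdp, but with the cutoffs 
zeroed out, which (as I understand it) is what triggers the all-vs-all kernels 
in 4.5:

rlist		= 0
rcoulomb	= 0
rvdw		= 0
rgbradii	= 0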

Details of the installations and .mdp files are appended below.

-Justin

=== em.mdp ===
; Run parameters
integrator	= steep         ; EM
emstep      = 0.005
emtol       = 1000
nsteps      = 50000
nstcomm		= 1
comm_mode   = angular       ; non-periodic system
; Bond parameters
constraint_algorithm 	= lincs
constraints		= all-bonds
continuation 	= no		; starting up
; required cutoffs for implicit
nstlist		= 1
ns_type		= grid
rlist		= 2.0
rcoulomb	= 2.0
rvdw		= 2.0
; cutoffs required for qq and vdw
coulombtype	= cut-off
vdwtype     = cut-off
; temperature coupling
tcoupl		= no
; Pressure coupling is off
Pcoupl		= no
; Periodic boundary conditions are off for implicit
pbc		    = no
; Settings for implicit solvent
implicit_solvent    = GBSA
gb_algorithm        = OBC
rgbradii            = 2.0


=== md.mdp ===

; Run parameters
integrator	= sd            ; velocity Langevin dynamics
dt		    = 0.002
nsteps		= 2500000		; 5000 ps (5 ns)
nstcomm		= 1
comm_mode   = angular       ; non-periodic system
; Output parameters
nstxout		= 0             ; nst[xvf]out = 0 to suppress useless .trr output
nstvout		= 0
nstfout		= 0
nstlog      = 5000          ; 10 ps
nstenergy   = 5000          ; 10 ps
nstxtcout   = 5000          ; 10 ps
; Bond parameters
constraint_algorithm 	= lincs
constraints		= all-bonds
continuation 	= no		; starting up
; required cutoffs for implicit
nstlist		= 10
ns_type		= grid
rlist		= 2.0
rcoulomb	= 2.0
rvdw		= 2.0
; cutoffs required for qq and vdw
coulombtype	= cut-off
vdwtype     = cut-off
; temperature coupling
tc_grps		= System
tau_t		= 1.0   ; inverse friction coefficient for Langevin (ps)
ref_t		= 310
; Pressure coupling is off
Pcoupl		= no
; Generate velocities is on
gen_vel		= yes		
gen_temp	= 310
gen_seed	= 173529
; Periodic boundary conditions are off for implicit
pbc		    = no
; Free energy must be off to use all-vs-all kernels
; default, but just for the sake of being pedantic
free_energy = no
; Settings for implicit solvent
implicit_solvent    = GBSA
gb_algorithm        = OBC
rgbradii            = 2.0


=== Installation commands for the cluster ===

$ ./configure --prefix=/home/rdiv1001/gromacs-4.5 \
  CPPFLAGS="-I/home/rdiv1001/fftw-3.0.1-osx/include" \
  LDFLAGS="-L/home/rdiv1001/fftw-3.0.1-osx/lib" \
  --disable-threads --without-x --program-suffix=_4.5.1_s

$ make

$ make install

$ make distclean

$ ./configure --prefix=/home/rdiv1001/gromacs-4.5 \
  CPPFLAGS="-I/home/rdiv1001/fftw-3.0.1-osx/include" \
  LDFLAGS="-L/home/rdiv1001/fftw-3.0.1-osx/lib" \
  --disable-threads --without-x --program-suffix=_4.5.1_mpi --enable-mpi \
  CXXCPP="/nfs/compilers/mpich-1.2.5/bin/mpicxx -E"

$ make mdrun

$ make install-mdrun


-- 
========================================

Justin A. Lemkul
Ph.D. Candidate
ICTAS Doctoral Scholar
MILES-IGERT Trainee
Department of Biochemistry
Virginia Tech
Blacksburg, VA
jalemkul[at]vt.edu | (540) 231-9080
http://www.bevanlab.biochem.vt.edu/Pages/Personal/justin

========================================


