[gmx-users] Hardware-specific crash with 4.5.1
Justin A. Lemkul
jalemkul at vt.edu
Mon Sep 27 22:59:22 CEST 2010
Hi All,
I'm hoping I might get some tips in tracking down the source of an issue that
appears to be hardware-specific, leading to crashes in my system. The failures
are occurring on our supercomputer (Mac OS X 10.3, PowerPC). Running the same
.tpr file on my laptop (Mac OS X 10.5.8, Intel Core2Duo) and on another
workstation (Ubuntu 10.04, AMD64) produces identical results. I suspect the
problem stems from unsuccessful energy minimization, which then leads to a crash
when running full MD. All jobs were run in parallel on two cores. The
supercomputer does not support threading, so MPI is invoked using MPICH-1.2.5
(native MPI implementation on the cluster).
Details as follows:
EM md.log file: successful run (Intel Core2Duo or AMD64)
Steepest Descents converged to Fmax < 1000 in 7 steps
Potential Energy = -4.8878180e+04
Maximum force = 8.7791553e+02 on atom 5440
Norm of force = 1.1781271e+02
EM md.log file: unsuccessful run (PowerPC)
Steepest Descents converged to Fmax < 1000 in 1 steps
Potential Energy = -2.4873273e+04
Maximum force = 0.0000000e+00 on atom 0
Norm of force = nan
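The failed minimization can be caught before any MD is started by scanning the EM summary for non-finite values. What follows is only a minimal sketch, assuming the summary lines look like the excerpts above; the function name check_em_summary and the regular expression are illustrative and not part of GROMACS.

```python
import math
import re

# Matches summary lines such as "Maximum force = 8.7791553e+02 on atom 5440".
# The pattern is an assumption based on the log excerpts in this post.
SUMMARY_RE = re.compile(
    r"^\s*(Potential Energy|Maximum force|Norm of force)\s*=\s*(\S+)"
)


def check_em_summary(log_text):
    """Return (values, problems) for the EM summary found in log_text."""
    values = {}
    problems = []
    for line in log_text.splitlines():
        match = SUMMARY_RE.match(line)
        if not match:
            continue
        name, raw = match.groups()
        value = float(raw)  # float() also parses "nan" and "inf"
        values[name] = value
        if not math.isfinite(value):
            problems.append(f"{name} is not finite ({raw})")
    # A maximum force of exactly zero is treated as suspicious here,
    # since a real system is very unlikely to have no force on any atom.
    if values.get("Maximum force") == 0.0:
        problems.append("Maximum force is exactly zero")
    return values, problems


if __name__ == "__main__":
    powerpc_log = """\
Steepest Descents converged to Fmax < 1000 in 1 steps
Potential Energy = -2.4873273e+04
Maximum force = 0.0000000e+00 on atom 0
Norm of force = nan
"""
    _, found = check_em_summary(powerpc_log)
    for problem in found:
        print(problem)
```

Run against the PowerPC summary, this flags both the nan norm and the zero maximum force, while the Intel/AMD64 summary passes cleanly.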
MD invoked from the minimized structure generated on my laptop or AMD64 runs
successfully (at least for a few hundred steps in my test), but the MD on the
PowerPC cluster fails immediately:
             Step             Time           Lambda
                0          0.00000          0.00000

   Energies (kJ/mol)
              U-B      Proper Dih.    Improper Dih.        CMAP Dih.  GB Polarization
      7.93559e+03      9.34958e+03      2.24036e+02     -2.47750e+03     -7.83599e+04
            LJ-14       Coulomb-14          LJ (SR)     Coulomb (SR)        Potential
      7.70042e+03      9.94520e+04     -1.17168e+04     -5.79783e+04     -2.55780e+04
      Kinetic En.     Total Energy      Temperature   Pressure (bar)     Constr. rmsd
              nan              nan              nan      0.00000e+00              nan
    Constr.2 rmsd
              nan

DD step 9 load imb.: force 3.0%
-------------------------------------------------------
Program mdrun_4.5.1_mpi, VERSION 4.5.1
Source code file: nsgrid.c, line: 601
Range checking error:
Explanation: During neighborsearching, we assign each particle to a grid
based on its coordinates. If your system contains collisions or parameter
errors that give particles very high velocities you might end up with some
coordinates being +-Infinity or NaN (not-a-number). Obviously, we cannot
put these on a grid, so this is usually where we detect those errors.
Make sure your system is properly energy-minimized and that the potential
energy seems reasonable before trying again.
Variable ind has value 7131. It should have been within [ 0 .. 7131 ]
For more information and tips for troubleshooting, please check the GROMACS
website at http://www.gromacs.org/Documentation/Errors
-------------------------------------------------------
If the reported range is inclusive, a value of 7131 should pass the check, so
it seems as if the crash really shouldn't be happening.
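Whether 7131 is acceptable depends entirely on how the upper bound is treated. The sketch below contrasts the two conventions; it is an illustration only, not the check in nsgrid.c, and both function names are hypothetical. If the check behind the message treats the upper bound as exclusive (an assumption, not something the message itself confirms), a value equal to the printed upper limit would fail even though the brackets suggest otherwise.

```python
def in_closed_range(value, low, high):
    """True if low <= value <= high (upper bound inclusive)."""
    return low <= value <= high


def in_half_open_range(value, low, high):
    """True if low <= value < high (upper bound exclusive)."""
    return low <= value < high


# Values taken from the error message above.
ind, low, high = 7131, 0, 7131

print(in_closed_range(ind, low, high))     # True
print(in_half_open_range(ind, low, high))  # False
```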
Running with all-vs-all kernels works, but the performance is horrendously slow
(<300 ps per day for a 7131-atom system) so I am attempting to use long cutoffs
(2.0 nm) as others on the list have suggested.
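To put the all-vs-all figure in perspective, here is a back-of-the-envelope estimate using dt and nsteps from the md.mdp appended below, with 300 ps per day taken as an upper bound on throughput. The function names are purely illustrative.

```python
def run_length_ps(nsteps, dt_ps):
    """Total simulated time in ps for nsteps steps of dt_ps each."""
    return nsteps * dt_ps


def days_needed(total_ps, rate_ps_per_day):
    """Wall-clock days needed at a given throughput in ps per day."""
    return total_ps / rate_ps_per_day


dt_ps = 0.002             # dt from md.mdp
nsteps = 2_500_000        # nsteps from md.mdp
rate_ps_per_day = 300.0   # upper bound quoted for the all-vs-all kernels

total_ps = run_length_ps(nsteps, dt_ps)
min_days = days_needed(total_ps, rate_ps_per_day)
print(f"{total_ps:.0f} ps at {rate_ps_per_day:.0f} ps/day "
      f"-> at least {min_days:.1f} days")
```

Since the actual rate is below 300 ps per day, the 5 ns run would take more than roughly 16.7 days of wall-clock time with the all-vs-all kernels.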
Details of the installations and .mdp files are appended below.
-Justin
=== em.mdp ===
; Run parameters
integrator = steep ; EM
emstep = 0.005
emtol = 1000
nsteps = 50000
nstcomm = 1
comm_mode = angular ; non-periodic system
; Bond parameters
constraint_algorithm = lincs
constraints = all-bonds
continuation = no ; starting up
; required cutoffs for implicit
nstlist = 1
ns_type = grid
rlist = 2.0
rcoulomb = 2.0
rvdw = 2.0
; cutoffs required for qq and vdw
coulombtype = cut-off
vdwtype = cut-off
; temperature coupling
tcoupl = no
; Pressure coupling is off
Pcoupl = no
; Periodic boundary conditions are off for implicit
pbc = no
; Settings for implicit solvent
implicit_solvent = GBSA
gb_algorithm = OBC
rgbradii = 2.0
=== md.mdp ===
; Run parameters
integrator = sd ; velocity Langevin dynamics
dt = 0.002
nsteps = 2500000 ; 5000 ps (5 ns)
nstcomm = 1
comm_mode = angular ; non-periodic system
; Output parameters
nstxout = 0 ; nst[xvf]out = 0 to suppress useless .trr output
nstvout = 0
nstfout = 0
nstlog = 5000 ; 10 ps
nstenergy = 5000 ; 10 ps
nstxtcout = 5000 ; 10 ps
; Bond parameters
constraint_algorithm = lincs
constraints = all-bonds
continuation = no ; starting up
; required cutoffs for implicit
nstlist = 10
ns_type = grid
rlist = 2.0
rcoulomb = 2.0
rvdw = 2.0
; cutoffs required for qq and vdw
coulombtype = cut-off
vdwtype = cut-off
; temperature coupling
tc_grps = System
tau_t = 1.0 ; inverse friction coefficient for Langevin (ps)
ref_t = 310
; Pressure coupling is off
Pcoupl = no
; Generate velocities is on
gen_vel = yes
gen_temp = 310
gen_seed = 173529
; Periodic boundary conditions are off for implicit
pbc = no
; Free energy must be off to use all-vs-all kernels
; default, but just for the sake of being pedantic
free_energy = no
; Settings for implicit solvent
implicit_solvent = GBSA
gb_algorithm = OBC
rgbradii = 2.0
=== Installation commands for the cluster ===
$ ./configure --prefix=/home/rdiv1001/gromacs-4.5 \
    CPPFLAGS="-I/home/rdiv1001/fftw-3.0.1-osx/include" \
    LDFLAGS="-L/home/rdiv1001/fftw-3.0.1-osx/lib" \
    --disable-threads --without-x \
    --program-suffix=_4.5.1_s
$ make
$ make install
$ make distclean
$ ./configure --prefix=/home/rdiv1001/gromacs-4.5 \
    CPPFLAGS="-I/home/rdiv1001/fftw-3.0.1-osx/include" \
    LDFLAGS="-L/home/rdiv1001/fftw-3.0.1-osx/lib" \
    --disable-threads --without-x \
    --program-suffix=_4.5.1_mpi --enable-mpi \
    CXXCPP="/nfs/compilers/mpich-1.2.5/bin/mpicxx -E"
$ make mdrun
$ make install-mdrun
--
========================================
Justin A. Lemkul
Ph.D. Candidate
ICTAS Doctoral Scholar
MILES-IGERT Trainee
Department of Biochemistry
Virginia Tech
Blacksburg, VA
jalemkul[at]vt.edu | (540) 231-9080
http://www.bevanlab.biochem.vt.edu/Pages/Personal/justin
========================================