[gmx-users] Hardware-specific crash with 4.5.1

Roland Schulz roland at utk.edu
Tue Sep 28 02:20:45 CEST 2010


Justin,

I think the non-bonded interaction kernel is not OK on your PowerPC machine.
I infer that from: 1) the force appears to be zero in the minimization
output, and 2) when you use the all-vs-all kernel, which has no
PowerPC-specific implementation, mdrun automatically falls back to the plain
C kernel and then everything works.
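
A quick way to test this hypothesis without recompiling: if I remember
correctly, mdrun in 4.5 honors the GMX_NOOPTIMIZEDKERNELS environment
variable, which forces the plain C kernels at runtime (I'm writing the name
from memory, so please double-check it against the environment variables
page on the website). Something along these lines:

$ export GMX_NOOPTIMIZEDKERNELS=1
$ mpirun -np 2 mdrun_4.5.1_mpi -deffnm em

(with "-deffnm em" standing in for however you normally launch the run). If
minimization is then correct, albeit slower, that pretty much pins the
problem on the optimized PowerPC kernels.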

Which kernel are you using? It should say in the log file. Look for lines
such as: "Configuring single precision IBM Power6-specific Fortran kernels"
or "Testing Altivec/VMX support".

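For example, something like

$ grep -iE 'kernel|altivec|power' md.log

(with md.log standing in for whatever your log file is actually called)
should pull out the lines where mdrun reports which non-bonded kernels it
configured.
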
You can also check in config.h whether GMX_POWER6 and/or GMX_PPC_ALTIVEC is
set. I suggest you try compiling with one or both of them deactivated and
see whether that solves it. This will also make mdrun slower, so if this
does turn out to be the problem, you will probably want to figure out why
the fastest kernel doesn't work correctly in order to get good performance
back.
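
Roughly along these lines (the --disable option name is from memory, so
check ./configure --help for the exact spelling in your tree):

$ grep -E 'GMX_POWER6|GMX_PPC_ALTIVEC' config.h
$ ./configure --help | grep -iE 'altivec|power'
$ ./configure <your usual options> --disable-ppc-altivec
$ make && make install

If neither define shows up in config.h, the specialized kernels were never
enabled in the first place and the problem lies elsewhere.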

Roland


On Mon, Sep 27, 2010 at 4:59 PM, Justin A. Lemkul <jalemkul at vt.edu> wrote:

>
> Hi All,
>
> I'm hoping I might get some tips on tracking down the source of an issue
> that appears to be hardware-specific, leading to crashes in my system.  The
> failures occur on our supercomputer (Mac OSX 10.3, PowerPC).  Running the
> same .tpr file on my laptop (Mac OSX 10.5.8, Intel Core2Duo) and on another
> workstation (Ubuntu 10.04, AMD64) produces identical results.  I suspect
> the problem stems from unsuccessful energy minimization, which then leads
> to a crash when running full MD.  All jobs were run in parallel on two
> cores.  The supercomputer does not support threading, so MPI is invoked
> using MPICH-1.2.5 (the native MPI implementation on the cluster).
>
>
> Details as follows:
>
> EM md.log file: successful run (Intel Core2Duo or AMD64)
>
> Steepest Descents converged to Fmax < 1000 in 7 steps
> Potential Energy  = -4.8878180e+04
> Maximum force     =  8.7791553e+02 on atom 5440
> Norm of force     =  1.1781271e+02
>
>
> EM md.log file: unsuccessful run (PowerPC)
>
> Steepest Descents converged to Fmax < 1000 in 1 steps
> Potential Energy  = -2.4873273e+04
> Maximum force     =  0.0000000e+00 on atom 0
> Norm of force     =            nan
>
>
> MD started from the minimized structure generated on my laptop or the AMD64
> workstation runs successfully (at least for a few hundred steps in my test),
> but MD on the PowerPC cluster fails immediately:
>
>           Step           Time         Lambda
>              0        0.00000        0.00000
>
>   Energies (kJ/mol)
>            U-B    Proper Dih.  Improper Dih.      CMAP Dih.GB Polarization
>    7.93559e+03    9.34958e+03    2.24036e+02   -2.47750e+03   -7.83599e+04
>          LJ-14     Coulomb-14        LJ (SR)   Coulomb (SR)      Potential
>    7.70042e+03    9.94520e+04   -1.17168e+04   -5.79783e+04   -2.55780e+04
>    Kinetic En.   Total Energy    Temperature Pressure (bar)   Constr. rmsd
>            nan            nan            nan    0.00000e+00            nan
>  Constr.2 rmsd
>            nan
>
> DD  step 9 load imb.: force  3.0%
>
>
> -------------------------------------------------------
> Program mdrun_4.5.1_mpi, VERSION 4.5.1
> Source code file: nsgrid.c, line: 601
>
> Range checking error:
> Explanation: During neighborsearching, we assign each particle to a grid
> based on its coordinates. If your system contains collisions or parameter
> errors that give particles very high velocities you might end up with some
> coordinates being +-Infinity or NaN (not-a-number). Obviously, we cannot
> put these on a grid, so this is usually where we detect those errors.
> Make sure your system is properly energy-minimized and that the potential
> energy seems reasonable before trying again.
> Variable ind has value 7131. It should have been within [ 0 .. 7131 ]
>
> For more information and tips for troubleshooting, please check the GROMACS
> website at http://www.gromacs.org/Documentation/Errors
> -------------------------------------------------------
>
> It seems as if the crash really shouldn't be happening, at least if the
> reported value range is inclusive.
>
> Running with the all-vs-all kernels works, but the performance is
> horrendously slow (<300 ps per day for a 7131-atom system), so instead I
> am attempting to use long cutoffs (2.0 nm), as others on the list have
> suggested.
>
> Details of the installations and .mdp files are appended below.
>
> -Justin
>
> === em.mdp ===
> ; Run parameters
> integrator      = steep         ; EM
> emstep      = 0.005
> emtol       = 1000
> nsteps      = 50000
> nstcomm         = 1
> comm_mode   = angular       ; non-periodic system
> ; Bond parameters
> constraint_algorithm    = lincs
> constraints             = all-bonds
> continuation    = no            ; starting up
> ; required cutoffs for implicit
> nstlist         = 1
> ns_type         = grid
> rlist           = 2.0
> rcoulomb        = 2.0
> rvdw            = 2.0
> ; cutoffs required for qq and vdw
> coulombtype     = cut-off
> vdwtype     = cut-off
> ; temperature coupling
> tcoupl          = no
> ; Pressure coupling is off
> Pcoupl          = no
> ; Periodic boundary conditions are off for implicit
> pbc                 = no
> ; Settings for implicit solvent
> implicit_solvent    = GBSA
> gb_algorithm        = OBC
> rgbradii            = 2.0
>
>
> === md.mdp ===
>
> ; Run parameters
> integrator      = sd            ; velocity Langevin dynamics
> dt                  = 0.002
> nsteps          = 2500000               ; 5000 ps (5 ns)
> nstcomm         = 1
> comm_mode   = angular       ; non-periodic system
> ; Output parameters
> nstxout         = 0             ; nst[xvf]out = 0 to suppress useless .trr output
> nstvout         = 0
> nstfout         = 0
> nstlog      = 5000          ; 10 ps
> nstenergy   = 5000          ; 10 ps
> nstxtcout   = 5000          ; 10 ps
> ; Bond parameters
> constraint_algorithm    = lincs
> constraints             = all-bonds
> continuation    = no            ; starting up
> ; required cutoffs for implicit
> nstlist         = 10
> ns_type         = grid
> rlist           = 2.0
> rcoulomb        = 2.0
> rvdw            = 2.0
> ; cutoffs required for qq and vdw
> coulombtype     = cut-off
> vdwtype     = cut-off
> ; temperature coupling
> tc_grps         = System
> tau_t           = 1.0   ; inverse friction coefficient for Langevin (ps^-1)
> ref_t           = 310
> ; Pressure coupling is off
> Pcoupl          = no
> ; Generate velocities is on
> gen_vel         = yes
> gen_temp        = 310
> gen_seed        = 173529
> ; Periodic boundary conditions are off for implicit
> pbc                 = no
> ; Free energy must be off to use all-vs-all kernels
> ; default, but just for the sake of being pedantic
> free_energy = no
> ; Settings for implicit solvent
> implicit_solvent    = GBSA
> gb_algorithm        = OBC
> rgbradii            = 2.0
>
>
> === Installation commands for the cluster ===
>
> $ ./configure --prefix=/home/rdiv1001/gromacs-4.5
> CPPFLAGS="-I/home/rdiv1001/fftw-3.0.1-osx/include"
> LDFLAGS="-L/home/rdiv1001/fftw-3.0.1-osx/lib" --disable-threads --without-x
> --program-suffix=_4.5.1_s
>
> $ make
>
> $ make install
>
> $ make distclean
>
> $ ./configure --prefix=/home/rdiv1001/gromacs-4.5
> CPPFLAGS="-I/home/rdiv1001/fftw-3.0.1-osx/include"
> LDFLAGS="-L/home/rdiv1001/fftw-3.0.1-osx/lib" --disable-threads --without-x
> --program-suffix=_4.5.1_mpi --enable-mpi
> CXXCPP="/nfs/compilers/mpich-1.2.5/bin/mpicxx -E"
>
> $ make mdrun
>
> $ make install-mdrun
>
>
> --
> ========================================
>
> Justin A. Lemkul
> Ph.D. Candidate
> ICTAS Doctoral Scholar
> MILES-IGERT Trainee
> Department of Biochemistry
> Virginia Tech
> Blacksburg, VA
> jalemkul[at]vt.edu | (540) 231-9080
> http://www.bevanlab.biochem.vt.edu/Pages/Personal/justin
>
> ========================================
>



-- 
ORNL/UT Center for Molecular Biophysics cmb.ornl.gov
865-241-1537, ORNL PO BOX 2008 MS6309

