[gmx-users] Hardware-specific crash with 4.5.1
Justin A. Lemkul
jalemkul at vt.edu
Tue Sep 28 03:10:17 CEST 2010
Roland Schulz wrote:
> Justin,
>
> I think the interaction kernel is not OK on your PowerPC machine. I infer
> that from: 1) the force appears to be zero in the minimization output, and
> 2) when you use the all-vs-all kernel, which has no PowerPC-specific
> implementation, it automatically falls back to the C kernel and then it
> works.
>
Sounds about right.
> What is the kernel you are using? It should say in the log file. Look
> for: "Configuring single precision IBM Power6-specific Fortran kernels"
> or "Testing Altivec/VMX support"
>
I'm not finding either in the config.log - weird?
> You can also look in config.h to see whether GMX_POWER6
> and/or GMX_PPC_ALTIVEC is set. I suggest you try compiling with
> one or both of them deactivated and see whether that solves it. This will
> also make things slower, so if this is indeed the problem you will
> probably want to figure out why the fastest kernel doesn't work
> correctly in order to get good performance.
>
It looks like GMX_PPC_ALTIVEC is set. I suppose I could re-compile with this
turned off.
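For the record, this is roughly how I plan to check and rebuild; the exact
configure option name is my assumption (I'll go with whatever ./configure --help
actually lists for the Altivec kernels):

$ grep -E 'GMX_POWER6|GMX_PPC_ALTIVEC' config.h    # see which accelerated kernels were enabled
$ ./configure --help | grep -i altivec             # confirm the exact name of the disable option
$ ./configure --disable-ppc-altivec [same options as before]   # assumed flag; rebuild with Altivec kernels off
$ make && make install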
Here's what's even weirder. The problematic version was compiled using the
standard autoconf procedure. If I use a CMake-compiled version, the energy
minimization runs fine, giving the same results (energy and force) as the two
systems I know are good. So I guess there's something wrong with the way
autoconf installed Gromacs. Perhaps this isn't of concern since Gromacs will
require CMake in subsequent releases, but I figure I should at least report it
in case it affects anyone else.
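As a quick cross-check between the two builds, I can also diff the energy files
directly with gmxcheck (the -deffnm names below are just placeholders I'd use to
keep the outputs apart):

$ ./mdrun_4.5.1_s -s em.tpr -deffnm em_autoconf        # autoconf-built binary
$ ./mdrun_4.5_cmake_mpi -s em.tpr -deffnm em_cmake     # CMake-built binary
$ gmxcheck -e em_autoconf.edr -e2 em_cmake.edr         # reports any energy terms that differ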
If I may tack one more question on here: I'm wondering why my CMake installation
doesn't actually appear to be using MPI. I get the right result, but I also get a
.log, .edr, and .trr file for every processor in use, as if each process is
running its own copy of the job rather than sharing the work.
Here's how I compiled my MPI mdrun, version 4.5.1:
$ cmake ../gromacs-4.5.1 \
    -DFFTW3F_LIBRARIES=/home/rdiv1001/fftw-3.0.1-osx/lib/libfftw3f.a \
    -DFFTW3F_INCLUDE_DIR=/home/rdiv1001/fftw-3.0.1-osx/include/ \
    -DCMAKE_INSTALL_PREFIX=/home/rdiv1001/gromacs-4.5_cmake-osx \
    -DGMX_BINARY_SUFFIX=_4.5_cmake_mpi -DGMX_THREADS=OFF -DBUILD_SHARED_LIBS=OFF \
    -DGMX_X11=OFF -DGMX_MPI=ON \
    -DMPI_COMPILER=/home/rdiv1001/compilers/openmpi-1.2.3-osx/bin/mpicxx \
    -DMPI_INCLUDE_PATH=/home/rdiv1001/compilers/openmpi-1.2.3-osx/include
$ make mdrun
$ make install-mdrun
Is there anything obviously wrong with those commands? Is there any way I
should know (before actually using mdrun) whether or not I've done things right?
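In case it is useful, here is how I intend to verify the build before trusting
it; the grep targets are my best recollection of what the CMake cache and the
4.5 log header contain, so treat this as a sketch:

$ grep -i 'MPI' CMakeCache.txt                  # GMX_MPI should be ON and the MPI compiler/libraries found
$ otool -L /home/rdiv1001/gromacs-4.5_cmake-osx/bin/mdrun_4.5_cmake_mpi | grep -i mpi
                                                # should list the Open MPI libraries (if linked dynamically)
$ mpirun -np 2 mdrun_4.5_cmake_mpi -s em.tpr    # short test run
$ head -n 20 md.log                             # header should report nnodes: 2, and there should be
                                                # only one set of .log/.edr/.trr files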
-Justin
> Roland
>
>
> On Mon, Sep 27, 2010 at 4:59 PM, Justin A. Lemkul <jalemkul at vt.edu> wrote:
>
>
> Hi All,
>
> I'm hoping I might get some tips in tracking down the source of an
> issue that appears to be hardware-specific, leading to crashes in my
> system. The failures are occurring on our supercomputer (Mac OSX
> 10.3, PowerPC). Running the same .tpr file on my laptop (Mac OSX
> 10.5.8, Intel Core2Duo) and on another workstation (Ubuntu 10.04,
>     AMD64) produces identical results. I suspect the problem stems from
> unsuccessful energy minimization, which then leads to a crash when
> running full MD. All jobs were run in parallel on two cores. The
> supercomputer does not support threading, so MPI is invoked using
> MPICH-1.2.5 (native MPI implementation on the cluster).
>
>
> Details as follows:
>
> EM md.log file: successful run (Intel Core2Duo or AMD64)
>
> Steepest Descents converged to Fmax < 1000 in 7 steps
> Potential Energy = -4.8878180e+04
> Maximum force = 8.7791553e+02 on atom 5440
> Norm of force = 1.1781271e+02
>
>
> EM md.log file: unsuccessful run (PowerPC)
>
> Steepest Descents converged to Fmax < 1000 in 1 steps
> Potential Energy = -2.4873273e+04
> Maximum force = 0.0000000e+00 on atom 0
> Norm of force = nan
>
>
> MD invoked from the minimized structure generated on my laptop or
> AMD64 runs successfully (at least for a few hundred steps in my
> test), but the MD on the PowerPC cluster fails immediately:
>
> Step Time Lambda
> 0 0.00000 0.00000
>
> Energies (kJ/mol)
>               U-B    Proper Dih.  Improper Dih.      CMAP Dih.  GB Polarization
>       7.93559e+03    9.34958e+03    2.24036e+02   -2.47750e+03     -7.83599e+04
>             LJ-14     Coulomb-14        LJ (SR)   Coulomb (SR)        Potential
>       7.70042e+03    9.94520e+04   -1.17168e+04   -5.79783e+04     -2.55780e+04
>       Kinetic En.   Total Energy    Temperature Pressure (bar)     Constr. rmsd
>               nan            nan            nan    0.00000e+00              nan
>     Constr.2 rmsd
>               nan
>
> DD step 9 load imb.: force 3.0%
>
>
> -------------------------------------------------------
> Program mdrun_4.5.1_mpi, VERSION 4.5.1
> Source code file: nsgrid.c, line: 601
>
> Range checking error:
>     Explanation: During neighborsearching, we assign each particle to a grid
>     based on its coordinates. If your system contains collisions or parameter
>     errors that give particles very high velocities you might end up with some
>     coordinates being +-Infinity or NaN (not-a-number). Obviously, we cannot
>     put these on a grid, so this is usually where we detect those errors.
>     Make sure your system is properly energy-minimized and that the potential
>     energy seems reasonable before trying again.
>     Variable ind has value 7131. It should have been within [ 0 .. 7131 ]
>
>     For more information and tips for troubleshooting, please check the GROMACS
>     website at http://www.gromacs.org/Documentation/Errors
> -------------------------------------------------------
>
> It seems as if the crash really shouldn't be happening, if the value
> range is inclusive.
>
> Running with all-vs-all kernels works, but the performance is
> horrendously slow (<300 ps per day for a 7131-atom system) so I am
> attempting to use long cutoffs (2.0 nm) as others on the list have
> suggested.
>
> Details of the installations and .mdp files are appended below.
>
> -Justin
>
> === em.mdp ===
> ; Run parameters
> integrator = steep ; EM
> emstep = 0.005
> emtol = 1000
> nsteps = 50000
> nstcomm = 1
> comm_mode = angular ; non-periodic system
> ; Bond parameters
> constraint_algorithm = lincs
> constraints = all-bonds
> continuation = no ; starting up
> ; required cutoffs for implicit
> nstlist = 1
> ns_type = grid
> rlist = 2.0
> rcoulomb = 2.0
> rvdw = 2.0
> ; cutoffs required for qq and vdw
> coulombtype = cut-off
> vdwtype = cut-off
> ; temperature coupling
> tcoupl = no
> ; Pressure coupling is off
> Pcoupl = no
> ; Periodic boundary conditions are off for implicit
> pbc = no
> ; Settings for implicit solvent
> implicit_solvent = GBSA
> gb_algorithm = OBC
> rgbradii = 2.0
>
>
> === md.mdp ===
>
> ; Run parameters
> integrator = sd ; velocity Langevin dynamics
> dt = 0.002
> nsteps = 2500000 ; 5000 ps (5 ns)
> nstcomm = 1
> comm_mode = angular ; non-periodic system
> ; Output parameters
>     nstxout = 0 ; nst[xvf]out = 0 to suppress useless .trr output
> nstvout = 0
> nstfout = 0
> nstlog = 5000 ; 10 ps
> nstenergy = 5000 ; 10 ps
> nstxtcout = 5000 ; 10 ps
> ; Bond parameters
> constraint_algorithm = lincs
> constraints = all-bonds
> continuation = no ; starting up
> ; required cutoffs for implicit
> nstlist = 10
> ns_type = grid
> rlist = 2.0
> rcoulomb = 2.0
> rvdw = 2.0
> ; cutoffs required for qq and vdw
> coulombtype = cut-off
> vdwtype = cut-off
> ; temperature coupling
> tc_grps = System
>     tau_t = 1.0 ; inverse friction coefficient for Langevin (ps^-1)
> ref_t = 310
> ; Pressure coupling is off
> Pcoupl = no
> ; Generate velocities is on
> gen_vel = yes
> gen_temp = 310
> gen_seed = 173529
> ; Periodic boundary conditions are off for implicit
> pbc = no
> ; Free energy must be off to use all-vs-all kernels
> ; default, but just for the sake of being pedantic
> free_energy = no
> ; Settings for implicit solvent
> implicit_solvent = GBSA
> gb_algorithm = OBC
> rgbradii = 2.0
>
>
> === Installation commands for the cluster ===
>
>     $ ./configure --prefix=/home/rdiv1001/gromacs-4.5 \
>         CPPFLAGS="-I/home/rdiv1001/fftw-3.0.1-osx/include" \
>         LDFLAGS="-L/home/rdiv1001/fftw-3.0.1-osx/lib" \
>         --disable-threads --without-x --program-suffix=_4.5.1_s
>
> $ make
>
> $ make install
>
> $ make distclean
>
>     $ ./configure --prefix=/home/rdiv1001/gromacs-4.5 \
>         CPPFLAGS="-I/home/rdiv1001/fftw-3.0.1-osx/include" \
>         LDFLAGS="-L/home/rdiv1001/fftw-3.0.1-osx/lib" \
>         --disable-threads --without-x --program-suffix=_4.5.1_mpi \
>         --enable-mpi CXXCPP="/nfs/compilers/mpich-1.2.5/bin/mpicxx -E"
>
> $ make mdrun
>
> $ make install-mdrun
>
>
> --
> ========================================
>
> Justin A. Lemkul
> Ph.D. Candidate
> ICTAS Doctoral Scholar
> MILES-IGERT Trainee
> Department of Biochemistry
> Virginia Tech
> Blacksburg, VA
> jalemkul[at]vt.edu | (540) 231-9080
> http://www.bevanlab.biochem.vt.edu/Pages/Personal/justin
>
> ========================================
>
>
>
>
> --
> ORNL/UT Center for Molecular Biophysics cmb.ornl.gov
> 865-241-1537, ORNL PO BOX 2008 MS6309
--
========================================
Justin A. Lemkul
Ph.D. Candidate
ICTAS Doctoral Scholar
MILES-IGERT Trainee
Department of Biochemistry
Virginia Tech
Blacksburg, VA
jalemkul[at]vt.edu | (540) 231-9080
http://www.bevanlab.biochem.vt.edu/Pages/Personal/justin
========================================