[gmx-users] Hardware-specific crash with 4.5.1

Justin A. Lemkul jalemkul at vt.edu
Tue Sep 28 03:39:03 CEST 2010



Mark Abraham wrote:
> 
> 
> ----- Original Message -----
> From: "Justin A. Lemkul" <jalemkul at vt.edu>
> Date: Tuesday, September 28, 2010 11:11
> Subject: Re: [gmx-users] Hardware-specific crash with 4.5.1
> To: Gromacs Users' List <gmx-users at gromacs.org>
> 
>  >
>  >
>  > Roland Schulz wrote:
>  > >Justin,
>  > >
>  > > I think the interaction kernel is not OK on your PowerPC machine.
>  > > I assume that from: 1) The force seems to be zero (minimization
>  > > output). 2) When you use the all-to-all kernel, which is not
>  > > available as a PowerPC-specific kernel, it automatically falls back
>  > > to the C kernel and then it works.
>  > >
>  >
>  > Sounds about right.
>  >
>  > > What is the kernel you are using? It should say in the log file.
>  > > Look for: "Configuring single precision IBM Power6-specific Fortran
>  > > kernels" or "Testing Altivec/VMX support".
>  > >
>  >
>  > I'm not finding either in the config.log - weird?
> 
> You were meant to look in the mdrun.log for runtime confirmation of what 
> kernels GROMACS has decided to use.
>  

That seems entirely obvious, now that you mention it :)  Conveniently, I find 
the following in the md.log file from the (failing) autoconf-assembled mdrun:

Configuring nonbonded kernels...
Configuring standard C nonbonded kernels...
Testing Altivec/VMX support... present.
Configuring PPC/Altivec nonbonded kernels...

The (non)MPI CMake build shows the following:

Configuring nonbonded kernels...
Configuring standard C nonbonded kernels...

So it seems clear to me that autoconf built faulty PPC/Altivec nonbonded 
kernels, while the CMake build only configured the plain C kernels and 
therefore works.
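
For anyone else trying to track this down, the checks involved are nothing 
fancier than grepping the run logs and the generated config.h; a rough sketch 
(the exact wording of the kernel-setup lines may differ between versions):

   # Which nonbonded kernel setup lines did a given mdrun write to its log?
   grep -iE "nonbonded kernels|Altivec|Power6" md.log

   # What did the autoconf build enable? (run in the autoconf build directory)
   grep -E "GMX_POWER6|GMX_PPC_ALTIVEC" config.h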

>  > > You can also look in config.h to see whether GMX_POWER6 and/or
>  > > GMX_PPC_ALTIVEC is set. I suggest you try to compile with one or
>  > > both of them deactivated and see whether that solves it. This will
>  > > make it slower too, so if this is indeed the problem, you will
>  > > probably want to figure out why the fastest kernel doesn't work
>  > > correctly in order to get good performance.
>  > >
>  >
>  > It looks like GMX_PPC_ALTIVEC is set.  I suppose I could recompile
>  > with this turned off.
> 
> This is supposed to be fine for Mac, as I understand it.
> 
>  > Here's what's even weirder.  The problematic version was
>  > compiled using the standard autoconf procedure.  If I use a
>  > CMake-compiled version, the energy minimization runs fine,
>  > giving the same results (energy and force) as the two systems I
>  > know are good.  So I guess there's something wrong with the
>  > way autoconf installed Gromacs.  Perhaps this isn't of
>  > concern since Gromacs will require CMake in subsequent releases,
>  > but I figure I should at least report it in case it affects
>  > anyone else.
>  >
>  > If I may tack one more question on here, I'm wondering why my CMake
>  > installation doesn't actually appear to be using MPI.  I get the
>  > right result, but the problem is that I get a .log, .edr, and .trr
>  > for every processor that's being used, as if each processor is being
>  > given its own job rather than sharing the work. Here's how I compiled
>  > my MPI mdrun, version 4.5.1:
> 
> At the start and end of the .log files you should get indicators about 
> how many MPI processes were actually being used.
>  

That explains it (sort of).  It looks like mdrun thinks it's only being run on 
1 node, just several times over, and the log also contains a bunch of junk 
(build-time placeholders) that isn't getting written properly:

Log file opened on Mon Sep 27 21:36:00 2010
Host: n235  pid: 6857  nodeid: 0  nnodes:  1
The Gromacs distribution was built @TMP_TIME@ by
jalemkul at sysx2.arc-int.vt.edu [CMAKE] (@TMP_MACHINE@)

Frustrating.
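
For the record, the only sanity checks I have at the moment are the ones below 
(a rough sketch; I'm assuming the installed binary ends up named 
mdrun_4.5_cmake_mpi per the GMX_BINARY_SUFFIX in the cmake command quoted 
below, and the otool check is only meaningful if MPI was linked in as a 
shared library):

   # Every per-rank log should report the same rank count; with a working MPI
   # build started as "mpirun -np 2" I'd expect "nnodes:  2" here, not 1.
   grep "nnodes" *.log

   # Does the binary reference the Open MPI library at all?
   cd /home/rdiv1001/gromacs-4.5_cmake-osx/bin
   otool -L mdrun_4.5_cmake_mpi | grep -i mpi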

>  > cmake ../gromacs-4.5.1 \
>  >   -DFFTW3F_LIBRARIES=/home/rdiv1001/fftw-3.0.1-osx/lib/libfftw3f.a \
>  >   -DFFTW3F_INCLUDE_DIR=/home/rdiv1001/fftw-3.0.1-osx/include/ \
>  >   -DCMAKE_INSTALL_PREFIX=/home/rdiv1001/gromacs-4.5_cmake-osx \
>  >   -DGMX_BINARY_SUFFIX=_4.5_cmake_mpi \
>  >   -DGMX_THREADS=OFF \
>  >   -DBUILD_SHARED_LIBS=OFF \
>  >   -DGMX_X11=OFF \
>  >   -DGMX_MPI=ON \
>  >   -DMPI_COMPILER=/home/rdiv1001/compilers/openmpi-1.2.3-osx/bin/mpicxx \
>  >   -DMPI_INCLUDE_PATH=/home/rdiv1001/compilers/openmpi-1.2.3-osx/include
>  >
>  > $ make mdrun
>  >
>  > $ make install-mdrun
>  >
>  > Is there anything obviously wrong with those commands?  Is
>  > there any way I should know (before actually using mdrun)
>  > whether or not I've done things right?
> 
> I think there ought to be, but IMO not enough preparation and testing 
> has gone into the CMake switch for it to be usable.
> 

I agree.  After hours of hacking CMake to try to make it work (and thinking I 
had gotten it squared away), MPI still doesn't seem to function.  The "old" way 
of doing things worked flawlessly, except that somewhere between 4.0.7 and 
4.5.1, the nonbonded kernels that used to work on our architecture somehow got 
hosed.  So now I'm in limbo.

-Justin

> Mark
>  
>  > -Justin
>  >
>  > >Roland
>  > >
>  > >
>  > > On Mon, Sep 27, 2010 at 4:59 PM, Justin A. Lemkul <jalemkul at vt.edu> wrote:
>  > >
>  > >
>  > >    Hi All,
>  > >
>  > >    I'm hoping I might get some tips in tracking down the source of
>  > >    an issue that appears to be hardware-specific, leading to crashes
>  > >    in my system.  The failures are occurring on our supercomputer
>  > >    (Mac OSX 10.3, PowerPC).  Running the same .tpr file on my laptop
>  > >    (Mac OSX 10.5.8, Intel Core2Duo) and on another workstation
>  > >    (Ubuntu 10.04, AMD64) produces identical results.  I suspect the
>  > >    problem stems from unsuccessful energy minimization, which then
>  > >    leads to a crash when running full MD.  All jobs were run in
>  > >    parallel on two cores.  The supercomputer does not support
>  > >    threading, so MPI is invoked using MPICH-1.2.5 (the native MPI
>  > >    implementation on the cluster).
>  > >
>  > >
>  > >    Details as follows:
>  > >
>  > >    EM md.log file: successful run (Intel Core2Duo or AMD64)
>  > >
>  > >    Steepest Descents converged to Fmax < 1000 in 7 steps
>  > >    Potential Energy  = -4.8878180e+04
>  > >    Maximum force     =  8.7791553e+02 on atom 5440
>  > >    Norm of force     =  1.1781271e+02
>  > >
>  > >
>  > >    EM md.log file: unsuccessful run (PowerPC)
>  > >
>  > >    Steepest Descents converged to Fmax < 1000 in 1 steps
>  > >    Potential Energy  = -2.4873273e+04
>  > >    Maximum force     =  0.0000000e+00 on atom 0
>  > >    Norm of force     =            nan
>  > >
>  > >
>  > >    MD invoked from the minimized structure generated on my laptop
>  > >    or AMD64 runs successfully (at least for a few hundred steps in
>  > >    my test), but the MD on the PowerPC cluster fails immediately:
>  > >
>  > >               Step           Time         Lambda
>  > >                  0        0.00000        0.00000
>  > >
>  > >    Energies (kJ/mol)
>  > >            U-B    Proper Dih.  Improper Dih.      CMAP Dih.  GB Polarization
>  > >    7.93559e+03    9.34958e+03    2.24036e+02   -2.47750e+03     -7.83599e+04
>  > >          LJ-14     Coulomb-14        LJ (SR)   Coulomb (SR)        Potential
>  > >    7.70042e+03    9.94520e+04   -1.17168e+04   -5.79783e+04     -2.55780e+04
>  > >    Kinetic En.   Total Energy    Temperature Pressure (bar)     Constr. rmsd
>  > >            nan            nan            nan    0.00000e+00              nan
>  > >    Constr.2 rmsd
>  > >            nan
>  > >    DD  step 9 load imb.: force  3.0%
>  > >
>  > >
>  > >    -------------------------------------------------------
>  > >    Program mdrun_4.5.1_mpi, VERSION 4.5.1
>  > >    Source code file: nsgrid.c, line: 601
>  > >
>  > >    Range checking error:
>  > >    Explanation: During neighborsearching, we assign each particle to a grid
>  > >    based on its coordinates. If your system contains collisions or parameter
>  > >    errors that give particles very high velocities you might end up with some
>  > >    coordinates being +-Infinity or NaN (not-a-number). Obviously, we cannot
>  > >    put these on a grid, so this is usually where we detect those errors.
>  > >    Make sure your system is properly energy-minimized and that the potential
>  > >    energy seems reasonable before trying again.
>  > >    Variable ind has value 7131. It should have been within [ 0 .. 7131 ]
>  > >
>  > >    For more information and tips for troubleshooting, please check the GROMACS
>  > >    website at http://www.gromacs.org/Documentation/Errors
>  > >    -------------------------------------------------------
>  > >
>  > >    It seems as if the crash really shouldn't be happening, if the
>  > >    value range is inclusive.
>  > >
>  > >    Running with all-vs-all kernels works, but the performance is
>  > >    horrendously slow (<300 ps per day for a 7131-atom system), so I
>  > >    am attempting to use long cutoffs (2.0 nm) as others on the list
>  > >    have suggested.
>  > >
>  > >    Details of the installations and .mdp files are appended below.
>  > >
>  > >    -Justin
>  > >
>  > >    === em.mdp ===
>  > >    ; Run parameters
>  > >    integrator      = steep         ; EM
>  > >    emstep          = 0.005
>  > >    emtol           = 1000
>  > >    nsteps          = 50000
>  > >    nstcomm         = 1
>  > >    comm_mode       = angular       ; non-periodic system
>  > >    ; Bond parameters
>  > >    constraint_algorithm    = lincs
>  > >    constraints             = all-bonds
>  > >    continuation    = no            ; starting up
>  > >    ; required cutoffs for implicit
>  > >    nstlist         = 1
>  > >    ns_type         = grid
>  > >    rlist           = 2.0
>  > >    rcoulomb        = 2.0
>  > >    rvdw            = 2.0
>  > >    ; cutoffs required for qq and vdw
>  > >    coulombtype     = cut-off
>  > >    vdwtype         = cut-off
>  > >    ; temperature coupling
>  > >    tcoupl          = no
>  > >    ; Pressure coupling is off
>  > >    Pcoupl          = no
>  > >    ; Periodic boundary conditions are off for implicit
>  > >    pbc             = no
>  > >    ; Settings for implicit solvent
>  > >    implicit_solvent    = GBSA
>  > >    gb_algorithm        = OBC
>  > >    rgbradii            = 2.0
>  > >
>  > >
>  > >    === md.mdp ===
>  > >
>  > >    ; Run parameters
>  > >    integrator      = sd            ; velocity Langevin dynamics
>  > >    dt              = 0.002
>  > >    nsteps          = 2500000       ; 5000 ps (5 ns)
>  > >    nstcomm         = 1
>  > >    comm_mode       = angular       ; non-periodic system
>  > >    ; Output parameters
>  > >    nstxout         = 0             ; nst[xvf]out = 0 to suppress useless .trr output
>  > >    nstvout         = 0
>  > >    nstfout         = 0
>  > >    nstlog          = 5000          ; 10 ps
>  > >    nstenergy       = 5000          ; 10 ps
>  > >    nstxtcout       = 5000          ; 10 ps
>  > >    ; Bond parameters
>  > >    constraint_algorithm    = lincs
>  > >    constraints             = all-bonds
>  > >    continuation    = no            ; starting up
>  > >    ; required cutoffs for implicit
>  > >    nstlist         = 10
>  > >    ns_type         = grid
>  > >    rlist           = 2.0
>  > >    rcoulomb        = 2.0
>  > >    rvdw            = 2.0
>  > >    ; cutoffs required for qq and vdw
>  > >    coulombtype     = cut-off
>  > >    vdwtype         = cut-off
>  > >    ; temperature coupling
>  > >    tc_grps         = System
>  > >    tau_t           = 1.0   ; inverse friction coefficient for Langevin (ps^-1)
>  > >    ref_t           = 310
>  > >    ; Pressure coupling is off
>  > >    Pcoupl          = no
>  > >    ; Generate velocities is on
>  > >    gen_vel         = yes
>  > >    gen_temp        = 310
>  > >    gen_seed        = 173529
>  > >    ; Periodic boundary conditions are off for implicit
>  > >    pbc             = no
>  > >    ; Free energy must be off to use all-vs-all kernels
>  > >    ; default, but just for the sake of being pedantic
>  > >    free_energy     = no
>  > >    ; Settings for implicit solvent
>  > >    implicit_solvent    = GBSA
>  > >    gb_algorithm        = OBC
>  > >    rgbradii            = 2.0
>  > >
>  > >
>  > >    === Installation commands for the cluster ===
>  > >
>  > >    $ ./configure --prefix=/home/rdiv1001/gromacs-4.5 \
>  > >        CPPFLAGS="-I/home/rdiv1001/fftw-3.0.1-osx/include" \
>  > >        LDFLAGS="-L/home/rdiv1001/fftw-3.0.1-osx/lib" \
>  > >        --disable-threads --without-x --program-suffix=_4.5.1_s
>  > >
>  > >    $ make
>  > >
>  > >    $ make install
>  > >
>  > >    $ make distclean
>  > >
>  > >    $ ./configure --prefix=/home/rdiv1001/gromacs-4.5 \
>  > >        CPPFLAGS="-I/home/rdiv1001/fftw-3.0.1-osx/include" \
>  > >        LDFLAGS="-L/home/rdiv1001/fftw-3.0.1-osx/lib" \
>  > >        --disable-threads --without-x --program-suffix=_4.5.1_mpi \
>  > >        --enable-mpi CXXCPP="/nfs/compilers/mpich-1.2.5/bin/mpicxx -E"
>  > >
>  > >    $ make mdrun
>  > >
>  > >    $ make install-mdrun
>  > >
>  > >
>  >
> 

-- 
========================================

Justin A. Lemkul
Ph.D. Candidate
ICTAS Doctoral Scholar
MILES-IGERT Trainee
Department of Biochemistry
Virginia Tech
Blacksburg, VA
jalemkul[at]vt.edu | (540) 231-9080
http://www.bevanlab.biochem.vt.edu/Pages/Personal/justin

========================================


