[gmx-users] Hardware-specific crash with 4.5.1

Mark Abraham mark.abraham at anu.edu.au
Tue Sep 28 03:21:02 CEST 2010



----- Original Message -----
From: "Justin A. Lemkul" <jalemkul at vt.edu>
Date: Tuesday, September 28, 2010 11:11
Subject: Re: [gmx-users] Hardware-specific crash with 4.5.1
To: Gromacs Users' List <gmx-users at gromacs.org>

> 
> 
> Roland Schulz wrote:
> > Justin,
> >
> > I think the interaction kernel is not OK on your PowerPC machine.
> > I assume that from: 1) the force seems to be zero (minimization
> > output); 2) when you use the all-to-all kernel, which is not
> > available as a PowerPC-specific kernel, it automatically falls
> > back to the C kernel and then it works.
> >
> 
> Sounds about right.
> 
> > What is the kernel you are using? It should say in the log file.
> > Look for: "Configuring single precision IBM Power6-specific
> > Fortran kernels" or "Testing Altivec/VMX support".
> >
> 
> I'm not finding either in the config.log - weird?

You were meant to look in the log file that mdrun writes for runtime confirmation of which kernels GROMACS decided to use.
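For example, something like this should show what was detected and which nonbonded kernels were chosen (a rough sketch - the exact wording of those lines varies between versions and builds, so treat the search patterns as guesses):

$ grep -i "kernel" md.log
$ grep -i -E "altivec|power6" md.log

Substitute whatever log file name your run actually produced.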
 
> > You can also look in config.h to see whether GMX_POWER6 and/or
> > GMX_PPC_ALTIVEC is set. I suggest you try to compile with one or
> > both of them deactivated and see whether that solves it. This will
> > also make it slower, so if this is indeed the problem you will
> > probably want to figure out why the fastest kernel doesn't work
> > correctly in order to get good performance back.
> >
> 
> It looks like GMX_PPC_ALTIVEC is set.  I suppose I could recompile
> with it turned off.

This is supposed to be fine on Macs, as I understand it.
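If you do want to try Roland's suggestion, something along these lines is a reasonable sketch (I haven't checked the exact configure switch names, so look at ./configure --help first):

$ grep -E "GMX_POWER6|GMX_PPC_ALTIVEC" config.h
$ ./configure --help | grep -i -E "altivec|power"

If configure offers a switch to disable Altivec, rebuild with it; failing that, commenting out the #define GMX_PPC_ALTIVEC line in config.h and rebuilding (without re-running configure) should force the generic C kernels for a quick test.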

> Here's what's even weirder.  The problematic version was 
> compiled using the standard autoconf procedure.  If I use a 
> CMake-compiled version, the energy minimization runs fine, 
> giving the same results (energy and force) as the two systems I 
> know are good.  So I guess there's something wrong with the 
> way autoconf installed Gromacs.  Perhaps this isn't of 
> concern since Gromacs will require CMake in subsequent releases, 
> but I figure I should at least report it in case it affects 
> anyone else.
> 
> If I may tack one more question on here, I'm wondering why my CMake
> installation doesn't actually appear to be using MPI.  I get the
> right result, but I get a .log, .edr, and .trr file for every
> processor that's being used, as if each processor is being given its
> own job rather than sharing the work.  Here's how I compiled my MPI
> mdrun, version 4.5.1:

At the start and end of the .log files you should get indicators about how many MPI processes were actually being used.
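A quick way to check (assuming the 4.5 log header still prints a node count - I haven't verified the exact string, so adjust the pattern if needed):

$ grep -i -E "nnodes|nodes" md.log | head

If each of those per-processor log files reports only one node, then mpirun is launching several independent serial runs rather than one parallel run.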
 
> cmake ../gromacs-4.5.1 \
>   -DFFTW3F_LIBRARIES=/home/rdiv1001/fftw-3.0.1-osx/lib/libfftw3f.a \
>   -DFFTW3F_INCLUDE_DIR=/home/rdiv1001/fftw-3.0.1-osx/include/ \
>   -DCMAKE_INSTALL_PREFIX=/home/rdiv1001/gromacs-4.5_cmake-osx \
>   -DGMX_BINARY_SUFFIX=_4.5_cmake_mpi \
>   -DGMX_THREADS=OFF \
>   -DBUILD_SHARED_LIBS=OFF \
>   -DGMX_X11=OFF \
>   -DGMX_MPI=ON \
>   -DMPI_COMPILER=/home/rdiv1001/compilers/openmpi-1.2.3-osx/bin/mpicxx \
>   -DMPI_INCLUDE_PATH=/home/rdiv1001/compilers/openmpi-1.2.3-osx/include
> 
> $ make mdrun
> 
> $ make install-mdrun
> 
> Is there anything obviously wrong with those commands?  Is there any
> way to tell (before actually using mdrun) whether or not I've done
> things right?

I think there ought to be, but IMO not enough preparation and testing have gone into the CMake switch for it to be usable yet.
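One sanity check you can do before running anything is to see whether the binary got linked against your MPI library at all. A sketch, assuming a dynamically linked Open MPI and that the install put the binary where your prefix and suffix suggest - adjust the path to whatever actually got installed:

$ otool -L /home/rdiv1001/gromacs-4.5_cmake-osx/bin/mdrun_4.5_cmake_mpi | grep -i mpi

If nothing MPI-related shows up (or MPI was linked statically), running a short job under mpirun and checking the node count printed at the top of md.log is the more reliable test.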

Mark
 
> -Justin
> 
> >Roland
> >
> >
> >On Mon, Sep 27, 2010 at 4:59 PM, Justin A. Lemkul <jalemkul at vt.edu> wrote:
> >
> >
> >    Hi All,
> >
> >    I'm hoping I might get some tips in tracking down the source of
> >    an issue that appears to be hardware-specific, leading to
> >    crashes in my system.  The failures are occurring on our
> >    supercomputer (Mac OSX 10.3, PowerPC).  Running the same .tpr
> >    file on my laptop (Mac OSX 10.5.8, Intel Core2Duo) and on
> >    another workstation (Ubuntu 10.04, AMD64) produces identical
> >    results.  I suspect the problem stems from unsuccessful energy
> >    minimization, which then leads to a crash when running full MD.
> >    All jobs were run in parallel on two cores.  The supercomputer
> >    does not support threading, so MPI is invoked using MPICH-1.2.5
> >    (the native MPI implementation on the cluster).
> >
> >
> >    Details as follows:
> >
> >    EM md.log file: successful run (Intel Core2Duo or AMD64)
> >
> >    Steepest Descents converged to Fmax < 1000 in 7 steps
> >    Potential Energy  = -4.8878180e+04
> >    Maximum force     =  8.7791553e+02 on atom 5440
> >    Norm of force     =  1.1781271e+02
> >
> >
> >    EM md.log file: unsuccessful run (PowerPC)
> >
> >    Steepest Descents converged to Fmax < 1000 in 1 steps
> >    Potential Energy  = -2.4873273e+04
> >    Maximum force     =  0.0000000e+00 on atom 0
> >    Norm of force     =            nan
> >
> >
> >    MD invoked from the minimized structure generated on my laptop
> >    or AMD64 runs successfully (at least for a few hundred steps in
> >    my test), but the MD on the PowerPC cluster fails immediately:
> >
> >              Step           Time         Lambda
> >                 0        0.00000        0.00000
> >
> >       Energies (kJ/mol)
> >             U-B    Proper Dih.  Improper Dih.      CMAP Dih.  GB Polarization
> >     7.93559e+03    9.34958e+03    2.24036e+02   -2.47750e+03     -7.83599e+04
> >           LJ-14     Coulomb-14        LJ (SR)   Coulomb (SR)        Potential
> >     7.70042e+03    9.94520e+04   -1.17168e+04   -5.79783e+04     -2.55780e+04
> >     Kinetic En.   Total Energy    Temperature Pressure (bar)     Constr. rmsd
> >             nan            nan            nan    0.00000e+00              nan
> >    Constr.2 rmsd
> >              nan
> >
> >    DD  step 9 load imb.: force  3.0%
> >
> >
> >    -------------------------------------------------------
> >    Program mdrun_4.5.1_mpi, VERSION 4.5.1
> >    Source code file: nsgrid.c, line: 601
> >
> >    Range checking error:
> >    Explanation: During neighborsearching, we assign each particle to a grid
> >    based on its coordinates. If your system contains collisions or parameter
> >    errors that give particles very high velocities you might end up with some
> >    coordinates being +-Infinity or NaN (not-a-number). Obviously, we cannot
> >    put these on a grid, so this is usually where we detect those errors.
> >    Make sure your system is properly energy-minimized and that the potential
> >    energy seems reasonable before trying again.
> >    Variable ind has value 7131. It should have been within [ 0 .. 7131 ]
> >
> >    For more information and tips for troubleshooting, please check the
> >    GROMACS website at http://www.gromacs.org/Documentation/Errors
> >    -------------------------------------------------------
> >
> >    It seems as if the crash really shouldn't be happening, if the
> >    value range is inclusive.
> >
> >    Running with the all-vs-all kernels works, but the performance
> >    is horrendously slow (<300 ps per day for a 7131-atom system),
> >    so I am attempting to use long cutoffs (2.0 nm) as others on
> >    the list have suggested.
> >
> >    Details of the installations and .mdp files are appended below.
> >
> >    -Justin
> >
> >    === em.mdp ===
> >    ; Run parameters
> >    integrator      = steep         ; EM
> >    emstep          = 0.005
> >    emtol           = 1000
> >    nsteps          = 50000
> >    nstcomm         = 1
> >    comm_mode       = angular       ; non-periodic system
> >    ; Bond parameters
> >    constraint_algorithm    = lincs
> >    constraints             = all-bonds
> >    continuation    = no            ; starting up
> >    ; required cutoffs for implicit
> >    nstlist         = 1
> >    ns_type         = grid
> >    rlist           = 2.0
> >    rcoulomb        = 2.0
> >    rvdw            = 2.0
> >    ; cutoffs required for qq and vdw
> >    coulombtype     = cut-off
> >    vdwtype         = cut-off
> >    ; temperature coupling
> >    tcoupl          = no
> >    ; Pressure coupling is off
> >    Pcoupl          = no
> >    ; Periodic boundary conditions are off for implicit
> >    pbc             = no
> >    ; Settings for implicit solvent
> >    implicit_solvent    = GBSA
> >    gb_algorithm        = OBC
> >    rgbradii            = 2.0
> >
> >
> >    === md.mdp ===
> >
> >    ; Run parameters
> >    integrator      = sd            ; velocity Langevin dynamics
> >    dt              = 0.002
> >    nsteps          = 2500000       ; 5000 ps (5 ns)
> >    nstcomm         = 1
> >    comm_mode       = angular       ; non-periodic system
> >    ; Output parameters
> >    nstxout         = 0             ; nst[xvf]out = 0 to suppress useless .trr output
> >    nstvout         = 0
> >    nstfout         = 0
> >    nstlog          = 5000          ; 10 ps
> >    nstenergy       = 5000          ; 10 ps
> >    nstxtcout       = 5000          ; 10 ps
> >    ; Bond parameters
> >    constraint_algorithm    = lincs
> >    constraints             = all-bonds
> >    continuation    = no            ; starting up
> >    ; required cutoffs for implicit
> >    nstlist         = 10
> >    ns_type         = grid
> >    rlist           = 2.0
> >    rcoulomb        = 2.0
> >    rvdw            = 2.0
> >    ; cutoffs required for qq and vdw
> >    coulombtype     = cut-off
> >    vdwtype         = cut-off
> >    ; temperature coupling
> >    tc_grps         = System
> >    tau_t           = 1.0           ; inverse friction coefficient for Langevin (ps^-1)
> >    ref_t           = 310
> >    ; Pressure coupling is off
> >    Pcoupl          = no
> >    ; Generate velocities is on
> >    gen_vel         = yes
> >    gen_temp        = 310
> >    gen_seed        = 173529
> >    ; Periodic boundary conditions are off for implicit
> >    pbc             = no
> >    ; Free energy must be off to use all-vs-all kernels
> >    ; default, but just for the sake of being pedantic
> >    free_energy     = no
> >    ; Settings for implicit solvent
> >    implicit_solvent    = GBSA
> >    gb_algorithm        = OBC
> >    rgbradii            = 2.0
> >
> >
> >    === Installation commands for the cluster ===
> >
> >    $ ./configure --prefix=/home/rdiv1001/gromacs-4.5 \
> >        CPPFLAGS="-I/home/rdiv1001/fftw-3.0.1-osx/include" \
> >        LDFLAGS="-L/home/rdiv1001/fftw-3.0.1-osx/lib" \
> >        --disable-threads --without-x --program-suffix=_4.5.1_s
> >
> >    $ make
> >
> >    $ make install
> >
> >    $ make distclean
> >
> >    $ ./configure --prefix=/home/rdiv1001/gromacs-4.5 \
> >        CPPFLAGS="-I/home/rdiv1001/fftw-3.0.1-osx/include" \
> >        LDFLAGS="-L/home/rdiv1001/fftw-3.0.1-osx/lib" \
> >        --disable-threads --without-x --program-suffix=_4.5.1_mpi \
> >        --enable-mpi CXXCPP="/nfs/compilers/mpich-1.2.5/bin/mpicxx -E"
> >
> >    $ make mdrun
> >
> >    $ make install-mdrun
> >
> >
> >
> >--
> >ORNL/UT Center for Molecular Biophysics  http://cmb.ornl.gov
> >865-241-1537, ORNL PO BOX 2008 MS6309
> 
> -- 
> ========================================
> 
> Justin A. Lemkul
> Ph.D. Candidate
> ICTAS Doctoral Scholar
> MILES-IGERT Trainee
> Department of Biochemistry
> Virginia Tech
> Blacksburg, VA
> jalemkul[at]vt.edu | (540) 231-9080
> http://www.bevanlab.biochem.vt.edu/Pages/Personal/justin
> 
> ========================================