[gmx-users] Hardware-specific crash with 4.5.1
Mark Abraham
mark.abraham at anu.edu.au
Tue Sep 28 03:21:02 CEST 2010
----- Original Message -----
From: "Justin A. Lemkul" <jalemkul at vt.edu>
Date: Tuesday, September 28, 2010 11:11
Subject: Re: [gmx-users] Hardware-specific crash with 4.5.1
To: Gromacs Users' List <gmx-users at gromacs.org>
>
>
> Roland Schulz wrote:
> >Justin,
> >
> > I think the interaction kernel is not OK on your PowerPC machine. I assume
> > that from: 1) the force seems to be zero (minimization output); 2) when you
> > use the all-to-all kernel, which is not available as a PowerPC-specific
> > kernel, it automatically falls back to the C kernel and then it works.
> >
>
> Sounds about right.
>
> > What is the kernel you are using? It should say in the log file. Look for:
> > "Configuring single precision IBM Power6-specific Fortran kernels" or
> > "Testing Altivec/VMX support".
> >
>
> I'm not finding either in the config.log - weird?
You were meant to look in the mdrun.log for runtime confirmation of what kernels GROMACS has decided to use.
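For example, something along these lines should show which nonbonded kernels mdrun actually selected at run time (the exact wording of that log line depends on the build, so treat the search pattern as a guess rather than the literal message):

$ grep -i "kernel" md.log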
> > You can also check in config.h whether GMX_POWER6 and/or GMX_PPC_ALTIVEC is
> > set. I suggest you try to compile with one or both of them deactivated and
> > see whether that solves it. This will make it slower too. Thus, if this is
> > indeed the problem, you will probably want to figure out why the fastest
> > kernel doesn't work correctly in order to get good performance.
> >
>
> It looks like GMX_PPC_ALTIVEC is set. I suppose I could recompile with this
> turned off.
This is supposed to be fine for Mac, as I understand.
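If you want to double-check what the autoconf build enabled before recompiling, a quick look at the generated config.h should settle it (this assumes you still have the build directory in which configure was run):

$ grep -E "GMX_POWER6|GMX_PPC_ALTIVEC" config.h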
> Here's what's even weirder. The problematic version was compiled using the
> standard autoconf procedure. If I use a CMake-compiled version, the energy
> minimization runs fine, giving the same results (energy and force) as the two
> systems I know are good. So I guess there's something wrong with the way
> autoconf installed Gromacs. Perhaps this isn't of concern, since Gromacs will
> require CMake in subsequent releases, but I figure I should at least report
> it in case it affects anyone else.
>
> If I may tack one more question on here, I'm wondering why my CMake
> installation doesn't actually appear to be using MPI. I get the right result,
> but the problem is that I get a .log, .edr, and .trr for every processor
> that's being used, as if each processor is being given its own job rather
> than sharing the work. Here's how I compiled my MPI mdrun, version 4.5.1:
At the start and end of the .log files you should get indicators about how many MPI processes were actually being used.
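For instance, the header of each .log file records how many nodes mdrun thought it was running on; something like the following should reveal whether it really started with more than one MPI process (the field name is from memory of the 4.5 log format, so adjust the pattern if it doesn't match):

$ grep -i "nnodes" *.log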
> $ cmake ../gromacs-4.5.1 \
>     -DFFTW3F_LIBRARIES=/home/rdiv1001/fftw-3.0.1-osx/lib/libfftw3f.a \
>     -DFFTW3F_INCLUDE_DIR=/home/rdiv1001/fftw-3.0.1-osx/include/ \
>     -DCMAKE_INSTALL_PREFIX=/home/rdiv1001/gromacs-4.5_cmake-osx \
>     -DGMX_BINARY_SUFFIX=_4.5_cmake_mpi \
>     -DGMX_THREADS=OFF \
>     -DBUILD_SHARED_LIBS=OFF \
>     -DGMX_X11=OFF \
>     -DGMX_MPI=ON \
>     -DMPI_COMPILER=/home/rdiv1001/compilers/openmpi-1.2.3-osx/bin/mpicxx \
>     -DMPI_INCLUDE_PATH=/home/rdiv1001/compilers/openmpi-1.2.3-osx/include
>
> $ make mdrun
>
> $ make install-mdrun
>
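One quick sanity check on a configuration like that is to look at what CMake actually recorded in its cache, e.g. run from the build directory (the variable names below are what I would expect to find there, so take the pattern as a starting point rather than gospel):

$ grep -E "GMX_MPI|MPI_COMPILER" CMakeCache.txt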
> Is there anything obviously wrong with those commands? Is there any way I
> should know (before actually using mdrun) whether or not I've done things
> right?
I think there ought to be, but IMO not enough preparation and testing has gone into the CMake switch for it to be usable.
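Failing that, one crude check on OS X is to see whether the installed binary is linked against the MPI libraries at all (otool -L is the Mac analogue of ldd; the path below just combines your install prefix and binary suffix, so adjust it if the install landed elsewhere, and note that a statically linked MPI will not show up this way):

$ otool -L /home/rdiv1001/gromacs-4.5_cmake-osx/bin/mdrun_4.5_cmake_mpi | grep -i mpi

If nothing MPI-related appears and every rank still writes its own full set of output files, the build has most likely produced a serial mdrun.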
Mark
> -Justin
>
> >Roland
> >
> >
> > On Mon, Sep 27, 2010 at 4:59 PM, Justin A. Lemkul <jalemkul at vt.edu> wrote:
> >
> >
> > Hi All,
> >
> > I'm hoping I might get some tips in tracking down the source of an issue
> > that appears to be hardware-specific, leading to crashes in my system. The
> > failures are occurring on our supercomputer (Mac OSX 10.3, PowerPC).
> > Running the same .tpr file on my laptop (Mac OSX 10.5.8, Intel Core2Duo)
> > and on another workstation (Ubuntu 10.04, AMD64) produces identical
> > results. I suspect the problem stems from unsuccessful energy minimization,
> > which then leads to a crash when running full MD. All jobs were run in
> > parallel on two cores. The supercomputer does not support threading, so
> > MPI is invoked using MPICH-1.2.5 (the native MPI implementation on the
> > cluster).
> >
> >
> > Details as follows:
> >
> > EM md.log file: successful run (Intel Core2Duo or AMD64)
> >
> > Steepest Descents converged to Fmax < 1000 in 7 steps
> > Potential Energy  = -4.8878180e+04
> > Maximum force     =  8.7791553e+02 on atom 5440
> > Norm of force     =  1.1781271e+02
> >
> >
> > EM md.log file: unsuccessful run (PowerPC)
> >
> > Steepest Descents converged to Fmax < 1000 in 1 steps
> > Potential Energy  = -2.4873273e+04
> > Maximum force     =  0.0000000e+00 on atom 0
> > Norm of force     =  nan
> >
> >
> > MD invoked from the minimized structure generated on my laptop or AMD64
> > runs successfully (at least for a few hundred steps in my test), but the
> > MD on the PowerPC cluster fails immediately:
> >
> >            Step           Time         Lambda
> >               0        0.00000        0.00000
> >
> >    Energies (kJ/mol)
> >            U-B    Proper Dih.  Improper Dih.      CMAP Dih.  GB Polarization
> >    7.93559e+03    9.34958e+03    2.24036e+02   -2.47750e+03   -7.83599e+04
> >          LJ-14     Coulomb-14        LJ (SR)   Coulomb (SR)      Potential
> >    7.70042e+03    9.94520e+04   -1.17168e+04   -5.79783e+04   -2.55780e+04
> >    Kinetic En.   Total Energy    Temperature Pressure (bar)   Constr. rmsd
> >            nan            nan            nan    0.00000e+00            nan
> >  Constr.2 rmsd
> >            nan
> >
> > DD step 9 load imb.: force 3.0%
> >
> >
> > -------------------------------------------------------
> > Program mdrun_4.5.1_mpi, VERSION 4.5.1
> > Source code file: nsgrid.c, line: 601
> >
> > Range checking error:
> > Explanation: During neighborsearching, we assign each particle to a grid
> > based on its coordinates. If your system contains collisions or parameter
> > errors that give particles very high velocities you might end up with some
> > coordinates being +-Infinity or NaN (not-a-number). Obviously, we cannot
> > put these on a grid, so this is usually where we detect those errors.
> > Make sure your system is properly energy-minimized and that the potential
> > energy seems reasonable before trying again.
> > Variable ind has value 7131. It should have been within [ 0 .. 7131 ]
> >
> > For more information and tips for troubleshooting, please check the
> > GROMACS website at http://www.gromacs.org/Documentation/Errors
> > -------------------------------------------------------
> >
> > It seems as if the crash really shouldn't be happening, if the value range
> > is inclusive.
> >
> > Running with all-vs-all kernels works, but the performance is horrendously
> > slow (<300 ps per day for a 7131-atom system), so I am attempting to use
> > long cutoffs (2.0 nm) as others on the list have suggested.
> >
> > Details of the installations and .mdp files are appended below.
> >
> > -Justin
> >
> > === em.mdp ===
> > ; Run parameters
> > integrator           = steep      ; EM
> > emstep               = 0.005
> > emtol                = 1000
> > nsteps               = 50000
> > nstcomm              = 1
> > comm_mode            = angular    ; non-periodic system
> > ; Bond parameters
> > constraint_algorithm = lincs
> > constraints          = all-bonds
> > continuation         = no         ; starting up
> > ; required cutoffs for implicit
> > nstlist              = 1
> > ns_type              = grid
> > rlist                = 2.0
> > rcoulomb             = 2.0
> > rvdw                 = 2.0
> > ; cutoffs required for qq and vdw
> > coulombtype          = cut-off
> > vdwtype              = cut-off
> > ; temperature coupling
> > tcoupl               = no
> > ; Pressure coupling is off
> > Pcoupl               = no
> > ; Periodic boundary conditions are off for implicit
> > pbc                  = no
> > ; Settings for implicit solvent
> > implicit_solvent     = GBSA
> > gb_algorithm         = OBC
> > rgbradii             = 2.0
> >
> >
> > === md.mdp ===
> >
> > ; Run parameters
> > integrator           = sd         ; velocity Langevin dynamics
> > dt                   = 0.002
> > nsteps               = 2500000    ; 5000 ps (5 ns)
> > nstcomm              = 1
> > comm_mode            = angular    ; non-periodic system
> > ; Output parameters
> > nstxout              = 0          ; nst[xvf]out = 0 to suppress useless .trr output
> > nstvout              = 0
> > nstfout              = 0
> > nstlog               = 5000       ; 10 ps
> > nstenergy            = 5000       ; 10 ps
> > nstxtcout            = 5000       ; 10 ps
> > ; Bond parameters
> > constraint_algorithm = lincs
> > constraints          = all-bonds
> > continuation         = no         ; starting up
> > ; required cutoffs for implicit
> > nstlist              = 10
> > ns_type              = grid
> > rlist                = 2.0
> > rcoulomb             = 2.0
> > rvdw                 = 2.0
> > ; cutoffs required for qq and vdw
> > coulombtype          = cut-off
> > vdwtype              = cut-off
> > ; temperature coupling
> > tc_grps              = System
> > tau_t                = 1.0        ; inverse friction coefficient for Langevin (ps^-1)
> > ref_t                = 310
> > ; Pressure coupling is off
> > Pcoupl               = no
> > ; Generate velocities is on
> > gen_vel              = yes
> > gen_temp             = 310
> > gen_seed             = 173529
> > ; Periodic boundary conditions are off for implicit
> > pbc                  = no
> > ; Free energy must be off to use all-vs-all kernels
> > ; default, but just for the sake of being pedantic
> > free_energy          = no
> > ; Settings for implicit solvent
> > implicit_solvent     = GBSA
> > gb_algorithm         = OBC
> > rgbradii             = 2.0
> >
> >
> > === Installation commands for the cluster ===
> >
> > $ ./configure --prefix=/home/rdiv1001/gromacs-4.5 \
> >     CPPFLAGS="-I/home/rdiv1001/fftw-3.0.1-osx/include" \
> >     LDFLAGS="-L/home/rdiv1001/fftw-3.0.1-osx/lib" \
> >     --disable-threads --without-x --program-suffix=_4.5.1_s
> >
> > $ make
> >
> > $ make install
> >
> > $ make distclean
> >
> > $ ./configure --prefix=/home/rdiv1001/gromacs-4.5 \
> >     CPPFLAGS="-I/home/rdiv1001/fftw-3.0.1-osx/include" \
> >     LDFLAGS="-L/home/rdiv1001/fftw-3.0.1-osx/lib" \
> >     --disable-threads --without-x --program-suffix=_4.5.1_mpi \
> >     --enable-mpi CXXCPP="/nfs/compilers/mpich-1.2.5/bin/mpicxx -E"
> >
> > $ make mdrun
> >
> > $ make install-mdrun
> >
> >
> > --
> > ========================================
> > Justin A. Lemkul
> > Ph.D. Candidate
> > ICTAS Doctoral Scholar
> > MILES-IGERT Trainee
> > Department of Biochemistry
> > Virginia Tech
> > Blacksburg, VA
> > jalemkul[at]vt.edu | (540) 231-9080
> > http://www.bevanlab.biochem.vt.edu/Pages/Personal/justin
> > ========================================
> >
> >
> >
> > --
> > ORNL/UT Center for Molecular Biophysics cmb.ornl.gov
> > 865-241-1537, ORNL PO BOX 2008 MS6309
>
> --
> ========================================
>
> Justin A. Lemkul
> Ph.D. Candidate
> ICTAS Doctoral Scholar
> MILES-IGERT Trainee
> Department of Biochemistry
> Virginia Tech
> Blacksburg, VA
> jalemkul[at]vt.edu | (540) 231-9080
> http://www.bevanlab.biochem.vt.edu/Pages/Personal/justin
>
> ========================================
> --
> gmx-users mailing list gmx-users at gromacs.org
> http://lists.gromacs.org/mailman/listinfo/gmx-users
> Please search the archive at
> http://www.gromacs.org/Support/Mailing_Lists/Search before posting!
> Please don't post (un)subscribe requests to the list. Use the
> www interface or send it to gmx-users-request at gromacs.org.
> Can't post? Read http://www.gromacs.org/Support/Mailing_Lists