[gmx-users] Hardware-specific crash with 4.5.1
Justin A. Lemkul
jalemkul at vt.edu
Tue Sep 28 03:10:17 CEST 2010
Roland Schulz wrote:
> Justin,
>
> I think the interaction kernel is not OK on your PowerPC machine. I infer
> that from: 1) the force appears to be zero in the minimization output, and
> 2) when you use the all-vs-all kernel, which has no PowerPC-specific
> implementation, it automatically falls back to the C kernel and then it
> works.
>
Sounds about right.
> What is the kernel you are using? It should say in the log file. Look
> for: "Configuring single precision IBM Power6-specific Fortran kernels"
> or "Testing Altivec/VMX support"
>
I'm not finding either in the config.log - weird?
> You can also look in config.h to see whether GMX_POWER6
> and/or GMX_PPC_ALTIVEC is set. I suggest you try compiling with
> one or both of them deactivated and see whether that solves it. This will
> also make things slower, so if this is indeed the problem you will
> probably want to figure out why the fastest kernel doesn't work
> correctly in order to get good performance.
>
It looks like GMX_PPC_ALTIVEC is set. I suppose I could re-compile with this
turned off.
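For the record, this is roughly how I plan to check and rebuild; the exact
configure option name is my assumption (I'll go with whatever ./configure --help
actually lists for the Altivec kernels):

$ grep -E 'GMX_POWER6|GMX_PPC_ALTIVEC' config.h    # see which accelerated kernels were enabled
$ ./configure --help | grep -i altivec             # confirm the exact name of the disable option
$ ./configure --disable-ppc-altivec [same options as before]   # assumed flag; rebuild with Altivec kernels off
$ make && make install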
Here's what's even weirder. The problematic version was compiled using the
standard autoconf procedure. If I use a CMake-compiled version, the energy
minimization runs fine, giving the same results (energy and force) as the two
systems I know are good. So I guess there's something wrong with the way
autoconf installed Gromacs. Perhaps this isn't of concern since Gromacs will
require CMake in subsequent releases, but I figure I should at least report it
in case it affects anyone else.
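As a quick cross-check between the two builds, I can also diff the energy files
directly with gmxcheck (the -deffnm names below are just placeholders I'd use to
keep the outputs apart):

$ ./mdrun_4.5.1_s -s em.tpr -deffnm em_autoconf        # autoconf-built binary
$ ./mdrun_4.5_cmake_mpi -s em.tpr -deffnm em_cmake     # CMake-built binary
$ gmxcheck -e em_autoconf.edr -e2 em_cmake.edr         # reports any energy terms that differ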
If I may tack one more question on here: I'm wondering why my CMake installation
doesn't actually appear to be using MPI. I get the right result, but I also get a
.log, .edr, and .trr file for every processor in use, as if each process is
running its own copy of the job rather than sharing the work.
Here's how I compiled my MPI mdrun, version 4.5.1:
$ cmake ../gromacs-4.5.1 \
    -DFFTW3F_LIBRARIES=/home/rdiv1001/fftw-3.0.1-osx/lib/libfftw3f.a \
    -DFFTW3F_INCLUDE_DIR=/home/rdiv1001/fftw-3.0.1-osx/include/ \
    -DCMAKE_INSTALL_PREFIX=/home/rdiv1001/gromacs-4.5_cmake-osx \
    -DGMX_BINARY_SUFFIX=_4.5_cmake_mpi -DGMX_THREADS=OFF -DBUILD_SHARED_LIBS=OFF \
    -DGMX_X11=OFF -DGMX_MPI=ON \
    -DMPI_COMPILER=/home/rdiv1001/compilers/openmpi-1.2.3-osx/bin/mpicxx \
    -DMPI_INCLUDE_PATH=/home/rdiv1001/compilers/openmpi-1.2.3-osx/include
$ make mdrun
$ make install-mdrun
Is there anything obviously wrong with those commands? Is there any way I
should know (before actually using mdrun) whether or not I've done things right?
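In case it is useful, here is how I intend to verify the build before trusting
it; the grep targets are my best recollection of what the CMake cache and the
4.5 log header contain, so treat this as a sketch:

$ grep -i 'MPI' CMakeCache.txt                  # GMX_MPI should be ON and the MPI compiler/libraries found
$ otool -L /home/rdiv1001/gromacs-4.5_cmake-osx/bin/mdrun_4.5_cmake_mpi | grep -i mpi
                                                # should list the Open MPI libraries (if linked dynamically)
$ mpirun -np 2 mdrun_4.5_cmake_mpi -s em.tpr    # short test run
$ head -n 20 md.log                             # header should report nnodes: 2, and there should be
                                                # only one set of .log/.edr/.trr files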
-Justin
> Roland
>
>
> On Mon, Sep 27, 2010 at 4:59 PM, Justin A. Lemkul <jalemkul at vt.edu> wrote:
>
>
> Hi All,
>
> I'm hoping I might get some tips in tracking down the source of an
> issue that appears to be hardware-specific, leading to crashes in my
> system. The failures are occurring on our supercomputer (Mac OSX
> 10.3, PowerPC). Running the same .tpr file on my laptop (Mac OSX
> 10.5.8, Intel Core2Duo) and on another workstation (Ubuntu 10.04,
>     AMD64) produces identical results. I suspect the problem stems from
> unsuccessful energy minimization, which then leads to a crash when
> running full MD. All jobs were run in parallel on two cores. The
> supercomputer does not support threading, so MPI is invoked using
> MPICH-1.2.5 (native MPI implementation on the cluster).
>
>
> Details as follows:
>
> EM md.log file: successful run (Intel Core2Duo or AMD64)
>
> Steepest Descents converged to Fmax < 1000 in 7 steps
> Potential Energy = -4.8878180e+04
> Maximum force = 8.7791553e+02 on atom 5440
> Norm of force = 1.1781271e+02
>
>
> EM md.log file: unsuccessful run (PowerPC)
>
> Steepest Descents converged to Fmax < 1000 in 1 steps
> Potential Energy = -2.4873273e+04
> Maximum force = 0.0000000e+00 on atom 0
> Norm of force = nan
>
>
> MD invoked from the minimized structure generated on my laptop or
> AMD64 runs successfully (at least for a few hundred steps in my
> test), but the MD on the PowerPC cluster fails immediately:
>
> Step Time Lambda
> 0 0.00000 0.00000
>
> Energies (kJ/mol)
>               U-B    Proper Dih.  Improper Dih.      CMAP Dih.  GB Polarization
>       7.93559e+03    9.34958e+03    2.24036e+02   -2.47750e+03     -7.83599e+04
>             LJ-14     Coulomb-14        LJ (SR)   Coulomb (SR)        Potential
>       7.70042e+03    9.94520e+04   -1.17168e+04   -5.79783e+04     -2.55780e+04
>       Kinetic En.   Total Energy    Temperature Pressure (bar)     Constr. rmsd
>               nan            nan            nan    0.00000e+00              nan
>     Constr.2 rmsd
>               nan
>
> DD step 9 load imb.: force 3.0%
>
>
> -------------------------------------------------------
> Program mdrun_4.5.1_mpi, VERSION 4.5.1
> Source code file: nsgrid.c, line: 601
>
> Range checking error:
>     Explanation: During neighborsearching, we assign each particle to a grid
>     based on its coordinates. If your system contains collisions or parameter
>     errors that give particles very high velocities you might end up with some
>     coordinates being +-Infinity or NaN (not-a-number). Obviously, we cannot
>     put these on a grid, so this is usually where we detect those errors.
>     Make sure your system is properly energy-minimized and that the potential
>     energy seems reasonable before trying again.
>     Variable ind has value 7131. It should have been within [ 0 .. 7131 ]
>
>     For more information and tips for troubleshooting, please check the GROMACS
>     website at http://www.gromacs.org/Documentation/Errors
> -------------------------------------------------------
>
> It seems as if the crash really shouldn't be happening, if the value
> range is inclusive.
>
> Running with all-vs-all kernels works, but the performance is
> horrendously slow (<300 ps per day for a 7131-atom system) so I am
> attempting to use long cutoffs (2.0 nm) as others on the list have
> suggested.
>
> Details of the installations and .mdp files are appended below.
>
> -Justin
>
> === em.mdp ===
> ; Run parameters
> integrator = steep ; EM
> emstep = 0.005
> emtol = 1000
> nsteps = 50000
> nstcomm = 1
> comm_mode = angular ; non-periodic system
> ; Bond parameters
> constraint_algorithm = lincs
> constraints = all-bonds
> continuation = no ; starting up
> ; required cutoffs for implicit
> nstlist = 1
> ns_type = grid
> rlist = 2.0
> rcoulomb = 2.0
> rvdw = 2.0
> ; cutoffs required for qq and vdw
> coulombtype = cut-off
> vdwtype = cut-off
> ; temperature coupling
> tcoupl = no
> ; Pressure coupling is off
> Pcoupl = no
> ; Periodic boundary conditions are off for implicit
> pbc = no
> ; Settings for implicit solvent
> implicit_solvent = GBSA
> gb_algorithm = OBC
> rgbradii = 2.0
>
>
> === md.mdp ===
>
> ; Run parameters
> integrator = sd ; velocity Langevin dynamics
> dt = 0.002
> nsteps = 2500000 ; 5000 ps (5 ns)
> nstcomm = 1
> comm_mode = angular ; non-periodic system
> ; Output parameters
>     nstxout = 0 ; nst[xvf]out = 0 to suppress useless .trr output
> nstvout = 0
> nstfout = 0
> nstlog = 5000 ; 10 ps
> nstenergy = 5000 ; 10 ps
> nstxtcout = 5000 ; 10 ps
> ; Bond parameters
> constraint_algorithm = lincs
> constraints = all-bonds
> continuation = no ; starting up
> ; required cutoffs for implicit
> nstlist = 10
> ns_type = grid
> rlist = 2.0
> rcoulomb = 2.0
> rvdw = 2.0
> ; cutoffs required for qq and vdw
> coulombtype = cut-off
> vdwtype = cut-off
> ; temperature coupling
> tc_grps = System
>     tau_t = 1.0 ; inverse friction coefficient for Langevin (ps^-1)
> ref_t = 310
> ; Pressure coupling is off
> Pcoupl = no
> ; Generate velocities is on
> gen_vel = yes
> gen_temp = 310
> gen_seed = 173529
> ; Periodic boundary conditions are off for implicit
> pbc = no
> ; Free energy must be off to use all-vs-all kernels
> ; default, but just for the sake of being pedantic
> free_energy = no
> ; Settings for implicit solvent
> implicit_solvent = GBSA
> gb_algorithm = OBC
> rgbradii = 2.0
>
>
> === Installation commands for the cluster ===
>
>     $ ./configure --prefix=/home/rdiv1001/gromacs-4.5 \
>         CPPFLAGS="-I/home/rdiv1001/fftw-3.0.1-osx/include" \
>         LDFLAGS="-L/home/rdiv1001/fftw-3.0.1-osx/lib" \
>         --disable-threads --without-x --program-suffix=_4.5.1_s
>
> $ make
>
> $ make install
>
> $ make distclean
>
>     $ ./configure --prefix=/home/rdiv1001/gromacs-4.5 \
>         CPPFLAGS="-I/home/rdiv1001/fftw-3.0.1-osx/include" \
>         LDFLAGS="-L/home/rdiv1001/fftw-3.0.1-osx/lib" \
>         --disable-threads --without-x --program-suffix=_4.5.1_mpi \
>         --enable-mpi CXXCPP="/nfs/compilers/mpich-1.2.5/bin/mpicxx -E"
>
> $ make mdrun
>
> $ make install-mdrun
>
>
> --
> ========================================
>
> Justin A. Lemkul
> Ph.D. Candidate
> ICTAS Doctoral Scholar
> MILES-IGERT Trainee
> Department of Biochemistry
> Virginia Tech
> Blacksburg, VA
> jalemkul[at]vt.edu | (540) 231-9080
> http://www.bevanlab.biochem.vt.edu/Pages/Personal/justin
>
> ========================================
>
>
>
>
> --
> ORNL/UT Center for Molecular Biophysics cmb.ornl.gov
> 865-241-1537, ORNL PO BOX 2008 MS6309
--
========================================
Justin A. Lemkul
Ph.D. Candidate
ICTAS Doctoral Scholar
MILES-IGERT Trainee
Department of Biochemistry
Virginia Tech
Blacksburg, VA
jalemkul[at]vt.edu | (540) 231-9080
http://www.bevanlab.biochem.vt.edu/Pages/Personal/justin
========================================