[gmx-users] Hardware-specific crash with 4.5.1

Justin A. Lemkul jalemkul at vt.edu
Tue Sep 28 17:25:44 CEST 2010



Roland Schulz wrote:
> 
> 
> On Mon, Sep 27, 2010 at 9:58 PM, Mark Abraham <mark.abraham at anu.edu.au> wrote:
> 
> 
> 
>     ----- Original Message -----
>     From: "Justin A. Lemkul" <jalemkul at vt.edu>
>     Date: Tuesday, September 28, 2010 11:39
>     Subject: Re: [gmx-users] Hardware-specific crash with 4.5.1
>     To: Discussion list for GROMACS users <gmx-users at gromacs.org>
> 
>      >
>      >
>      > Mark Abraham wrote:
>      > >
>      > >
>      > >----- Original Message -----
>      > >From: "Justin A. Lemkul" <jalemkul at vt.edu>
>      > >Date: Tuesday, September 28, 2010 11:11
>      > >Subject: Re: [gmx-users] Hardware-specific crash with 4.5.1
>      > >To: Gromacs Users' List <gmx-users at gromacs.org>
>      > >
>      > > >
>      > > >
>      > > > Roland Schulz wrote:
>      > > > >Justin,
>      > > > >
>      > > > >I think the interaction kernel is not OK on your PowerPC
>      > > > >machine. I assume that from: 1) The force seems to be zero
>      > > > >(minimization output). 2) When you use the all-to-all kernel
>      > > > >which is not available for the powerpc kernel, it automatically
>      > > > >falls back to the C kernel and then it works.
>      > > > >
>      > > >
>      > > > Sounds about right.
>      > > >
>      > > > >What is the kernel you are using? It should say in the log
>      > > > >file. Look for: "Configuring single precision IBM Power6-specific
>      > > > >Fortran kernels" or "Testing Altivec/VMX support"
>      > > > >
>      > > >
>      > > > I'm not finding either in the config.log - weird?
>      > >
>      > >You were meant to look in the mdrun.log for runtime
>      > >confirmation of what kernels GROMACS has decided to use.
>      > >
>      >
>      > That seems entirely obvious, now that you mention it :) 
>      > Conveniently, I find the following in the md.log file from the
>      > (failing) autoconf-assembled mdrun:
>      >
>      > Configuring nonbonded kernels...
>      > Configuring standard C nonbonded kernels...
>      > Testing Altivec/VMX support... present.
>      > Configuring PPC/Altivec nonbonded kernels...
>      >
>      > The (non)MPI CMake build shows the following:
>      >
>      > Configuring nonbonded kernels...
>      > Configuring standard C nonbonded kernels...
>      >
>      > So it seems clear to me that autoconf built faulty nonbonded
>      > kernels, and CMake didn't.
> 
>     OK, so assuming that PPC/Altivec kernels are supposed to be good for
>     Mac (as they were in 4.0.x, I believe):
> 
>     1) CMake doesn't detect that it should be using those kernels, and
>     so appears to work, but does an inefficient run. autoconf detects
>     that it should use those kernels, but the mdrun fails for reasons
>     that are not yet clear.
> 
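The kernel comparison above can be repeated on any log with a small helper. This is only a sketch: the function names are made up for illustration, the log path is whatever your run produced, and the patterns are the messages quoted in this thread, which other builds may word differently.

```shell
# Print the nonbonded-kernel setup lines that mdrun wrote to a log file.
# Patterns are taken from the log excerpts quoted in this thread (assumption:
# your build prints the same wording).
list_kernels() {
    # $1: path to an mdrun log file
    grep -E 'Configuring .*nonbonded kernels|Testing Altivec/VMX support' "$1"
}

# Succeed only if the log shows the PPC/Altivec kernels were configured.
uses_altivec() {
    # $1: path to an mdrun log file
    grep -q 'Configuring PPC/Altivec nonbonded kernels' "$1"
}
```

For example, `uses_altivec md.log && echo "Altivec kernels selected"` would flag the autoconf build described above and stay silent for the CMake build.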
>      
>      > > > >You can also look in the config.h whether GMX_POWER6
>      > > > >and/or GMX_PPC_ALTIVEC is set. I suggest you try to compile with
>      > > > >one/both of them deactivated and see whether that solves it.
>      > > > >This will make it slower too. Thus if this is indeed the
>      > > > >problem, you will probably want to figure out why the fastest
>      > > > >kernel doesn't work correctly to get good performance.
>      > > > >
>      > > >
>      > > > It looks like GMX_PPC_ALTIVEC is set.  I suppose I could
>      > > > re-compile with this turned off.
>      > >
>      > >This is supposed to be fine for Mac, as I understand.
>      > >
>      > > > Here's what's even weirder.  The problematic version was
>      > > > compiled using the standard autoconf procedure.  If I use a
>      > > > CMake-compiled version, the energy minimization runs fine,
>      > > > giving the same results (energy and force) as the two systems I
>      > > > know are good.  So I guess there's something wrong with the
>      > > > way autoconf installed Gromacs.  Perhaps this isn't of
>      > > > concern since Gromacs will require CMake in subsequent releases,
>      > > > but I figure I should at least report it in case it affects
>      > > > anyone else.
>      > > >
>      > > > If I may tack one more question on here, I'm wondering why my
>      > > > CMake installation  doesn't actually appear to be using
>      > > > MPI.  I get the right result, but the problem is, I get a
>      > > > .log, .edr, and .trr for every processor that's being used, as
>      > > > if each processor is being given its own job and not
>      > > > distributing the work. Here's how I compiled my MPI mdrun,
>      > > > version 4.5.1:
>      > >
>      > >At the start and end of the .log files you should get
>      > >indicators about how many MPI processes were actually being used.
>      > >
>      >
>      > That explains it (sort of).  It looks like mdrun thinks
>      > it's only being run over 1 node, just several times over, and a
>      > bunch of junk that isn't getting written properly:
>      >
>      > Log file opened on Mon Sep 27 21:36:00 2010
>      > Host: n235  pid: 6857  nodeid: 0  nnodes:  1
>      > The Gromacs distribution was built @TMP_TIME@ by
>      > jalemkul at sysx2.arc-int.vt.edu [CMAKE] (@TMP_MACHINE@)
>      >
>      > Frustrating.
> 
>     You can set the GMX_NOOPTIMIZEDKERNELS environment variable with
>     your autoconf build to see whether the MPI issue is CMake-dependent.
>     Normally, I'd say your supercomputer MPI environment isn't being
>     invoked correctly, but presumably you already know how to do that
>     right...
> 
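A quick way to catch the "one node, several times over" symptom is to read the process count out of the log header and compare it with what was requested from the MPI launcher. The sketch below assumes the header format quoted above ("nodeid: 0  nnodes:  1"); the function names are hypothetical and not part of GROMACS.

```shell
# Extract the process count from the header of an mdrun log.
# Assumes a header line of the form quoted in this thread:
#   Host: n235  pid: 6857  nodeid: 0  nnodes:  1
log_nnodes() {
    # $1: path to an mdrun log file
    sed -n 's/.*nnodes:[[:space:]]*\([0-9][0-9]*\).*/\1/p' "$1" | head -n 1
}

# Compare the logged count with the number of processes that were requested.
check_nnodes() {
    # $1: log file, $2: expected process count
    found=$(log_nnodes "$1")
    if [ "$found" = "$2" ]; then
        echo "OK: mdrun reports $found process(es)"
    else
        echo "MISMATCH: expected $2, log reports ${found:-nothing}"
        return 1
    fi
}
```

If a job launched on 8 processes leaves a log where `check_nnodes md.log 8` reports a mismatch, each process most likely ran its own serial copy of mdrun, which matches the per-processor .log, .edr, and .trr files described above.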
> 
>      > > > cmake ../gromacs-4.5.1 \
>      > > >   -DFFTW3F_LIBRARIES=/home/rdiv1001/fftw-3.0.1-osx/lib/libfftw3f.a \
>      > > >   -DFFTW3F_INCLUDE_DIR=/home/rdiv1001/fftw-3.0.1-osx/include/ \
>      > > >   -DCMAKE_INSTALL_PREFIX=/home/rdiv1001/gromacs-4.5_cmake-osx \
>      > > >   -DGMX_BINARY_SUFFIX=_4.5_cmake_mpi -DGMX_THREADS=OFF \
>      > > >   -DBUILD_SHARED_LIBS=OFF -DGMX_X11=OFF -DGMX_MPI=ON \
>      > > >   -DMPI_COMPILER=/home/rdiv1001/compilers/openmpi-1.2.3-osx/bin/mpicxx \
>      > > >   -DMPI_INCLUDE_PATH=/home/rdiv1001/compilers/openmpi-1.2.3-osx/include
>      > > >
>      > > > $ make mdrun
>      > > >
>      > > > $ make install-mdrun
>      > > >
>      > > > Is there anything obviously wrong with those commands?  Is
>      > > > there any way I should know (before actually using mdrun)
>      > > > whether or not I've done things right?
>      > >
>      > >I think there ought to be, but IMO not enough preparation and
>      > >testing has gone into the CMake switch for it to be usable.
>      > >
>      >
>      > I agree.  After hours of hacking CMake to try to make it
>      > work (and thinking I had gotten it squared away), the MPI
>      > doesn't seem to function.  The "old" way of doing things
>      > worked flawlessly, except that somewhere between 4.0.7 and
>      > 4.5.1, the nonbonded kernels that used to work on our
>      > architecture somehow got hosed. So now I'm in limbo.
> 
> Well, that's why Autoconf is still supported for 4.5: to allow a smooth 
> transition. It is difficult to get enough feedback for cmake from just 
> the development version. Only now, with cmake support in the 
> released version, is there enough feedback to make the uncommon cases 
> work. But you can still use autoconf if cmake doesn't work. Thus it 
> should work for everyone and we get the required feedback to use cmake 
> for 5.0.
>  
> 
>     Sounds like Bugzilla time.
> 
> 
> I agree you should file 3 bugs. 
> 1) CMake + MPI -> MPI doesn't work, 
> 2) CMake + Altivec -> Altivec isn't detected
> 3) Altivec produces 0 forces 
> 

Bugs 572-574 have been filed, for anyone that wants to follow them.

-Justin

> As far as I know the POWER6 kernel should work too for your hardware and 
> should be a little bit faster than the standard C kernel. You might want 
> to try that until the Altivec kernel is fixed.
> 
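Before rebuilding with a different kernel, it may help to confirm which of the two macros mentioned earlier in the thread (GMX_POWER6, GMX_PPC_ALTIVEC) the generated config.h actually defines. A minimal sketch, assuming the usual autoconf-style "#define NAME" lines; the function names are illustrative and the location of config.h depends on your build tree.

```shell
# Succeed if a macro is defined (not commented out or #undef'd) in config.h.
macro_defined() {
    # $1: macro name, $2: path to config.h
    grep -Eq "^[[:space:]]*#[[:space:]]*define[[:space:]]+$1([[:space:]]|\$)" "$2"
}

# Report the state of the two kernel-related macros named in this thread.
report_kernel_macros() {
    # $1: path to config.h
    for m in GMX_POWER6 GMX_PPC_ALTIVEC; do
        if macro_defined "$m" "$1"; then
            echo "$m: set"
        else
            echo "$m: not set"
        fi
    done
}
```

Running `report_kernel_macros config.h` in the build directory prints one line per macro, which makes it easy to check that a rebuild really switched kernels.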
> Roland
> 
> 
> 
> -- 
> ORNL/UT Center for Molecular Biophysics cmb.ornl.gov <http://cmb.ornl.gov>
> 865-241-1537, ORNL PO BOX 2008 MS6309
> 

-- 
========================================

Justin A. Lemkul
Ph.D. Candidate
ICTAS Doctoral Scholar
MILES-IGERT Trainee
Department of Biochemistry
Virginia Tech
Blacksburg, VA
jalemkul[at]vt.edu | (540) 231-9080
http://www.bevanlab.biochem.vt.edu/Pages/Personal/justin

========================================


