[gmx-users] Possible free energy bug?

Justin A. Lemkul jalemkul at vt.edu
Fri Mar 11 04:04:52 CET 2011


Hi Matt,

Thanks for the extensive explanation and tips.  I'll work through things and 
report back.  It will take a while to get things going, though (unless one of 
the early solutions works!) since I have no admin access to install new 
compilers, libraries, etc. and for some reason the only thing I can ever get to 
work in my home directory is Gromacs itself.  The joys of an aging cluster.

We recently got access to gcc-4.4.5 on Linux, but we're stuck with 3.3 on OS X, 
so there's at least a bit of hope for one partition.

Thanks again.

-Justin

Matthew Zwier wrote:
> Hi Justin,
> 
> I should have specified that the segfault happened for us after we got
> similar warnings and errors (DD and/or LINCS), so the segfault may
> have been tangential.  Given that everything about your system worked
> before GROMACS 4.5, it's possible that your older compilers are
> generating code that's incompatible with the GROMACS assembly loops
> (which you are likely running with, as they are the default option on
> most mainstream processors).  The bug you mentioned in your original
> post also has my antennae twitching about bad machine code.
> 
> If that's indeed happening, it's almost certainly some bizarre
> alignment issue: something like half of a float getting overwritten
> on the way into or out of the assembly code, and that kind of
> corruption would trigger the results you describe.  It's also
> distantly possible that
> GROMACS is working fine, but your copy of FFTW or BLAS/LAPACK (more
> likely the latter) has alignment problems.  One final possibility
> (which would explain the failure on YellowDog but unfortunately not
> the failure on OS X) is that GCC is generating badly-aligned code for
> auto-vectorized Altivec loops, which is still a problem for Intel's
> SIMD instructions on 32-bit x86 architectures even with GCC 4.4.  I've
> also seen MPI gather/reduce operations foul up alignment (or
> rigidly enforce it where badly compiled code is relying on broken
> alignment) under exceedingly rare circumstances, usually involving
> different libraries compiled with different compilers (which is
> generally a bad idea for scientific code anyway).
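> 
> If you want a quick sanity check on that last point before rebuilding
> anything, GCC stamps its version into the .comment section of every
> ELF object it produces, so on the Linux partition you can see which
> compiler built each piece of the stack.  A minimal sketch, with purely
> hypothetical library paths (point it at wherever your FFTW,
> BLAS/LAPACK, MPI, and mdrun actually live; it won't work on the
> OS X/Mach-O side):
> 
>     # which GCC built each piece? (reads the ELF .comment section)
>     for f in /usr/lib/libfftw3f.so /usr/lib/liblapack.so \
>              /usr/lib/libmpi.so $(which mdrun_4.5.3_gcc_mpi); do
>         echo "== $f =="
>         readelf -p .comment "$f" 2>/dev/null | grep -i gcc | sort -u
>     done
> 
> If different pieces report different GCC versions, that's a strong
> hint that the full stack rebuild in #4 below is where you'll end up.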
> 
> Okay...so all of that said, there are a few things to try:
> 
> 1) Recompile GROMACS using -O2 instead of -O3; that'll turn off the
> automatic vectorizer (on Yellow Dog) and various other relatively
> risky optimizations (on both platforms).  CFLAGS="-O2 -march=powerpc"
> in the environment AND on the configure command line would do that.
> Check your build logs to make sure it took, though, because if you
> don't do it exactly right, configure will ignore your directives and
> merrily set up GROMACS to compile with -O3, which is the most likely
> culprit for badly-aligned code.
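> 
> For concreteness, something along these lines is what I mean -- the
> prefix, the -j count, and the --enable-mpi/--program-suffix choices
> are placeholders for however you normally build, and on PowerPC your
> gcc may want -mcpu=... rather than -march=..., so check its man page:
> 
>     export CFLAGS="-O2 -march=powerpc"
>     ./configure --prefix=$HOME/gromacs-4.5.3-O2 --enable-mpi \
>                 --program-suffix=_4.5.3_gcc_mpi \
>                 CFLAGS="$CFLAGS"
>     make -j4 2>&1 | tee build.log && make install
>     grep -c -e '-O3' build.log   # should print 0 if the override took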
> 
> 2) Recompile GROMACS specifying a forced alignment flag.  I have no
> experience with PowerPC, but -malign-natural and -malign-power look
> like good initial guesses.  That's probably going to cause more
> problems than it solves, but if you have a screwy BLAS/LAPACK or MPI,
> it might help. I only suggest it because if you've already tried #1,
> it will only take another half hour or hour of your time to recompile
> GROMACS again.  Other than that, tinkering with alignment flags is a
> really easy way to REALLY break code, so you might consider skipping
> this and moving straight on to #3.
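> 
> If you do try it, it's the same invocation as in #1 with the alignment
> flag added to CFLAGS; whether -malign-natural is even the right
> spelling for your gcc is something to confirm against its man page
> first:
> 
>     export CFLAGS="-O2 -march=powerpc -malign-natural"
>     ./configure --prefix=$HOME/gromacs-4.5.3-align --enable-mpi \
>                 CFLAGS="$CFLAGS"
>     make -j4 && make install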
> 
> 3) Snag GCC 4.4.4 or 4.4.5 and compile it, and use that to compile
> GROMACS, again with -O2.  GCC takes forever to compile, but beyond
> that, it's not as difficult as it could be.  Nothing preventing you
> from installing it in your home directory, either, assuming you set
> PATH and LD_LIBRARY_PATH (or DYLD_LIBRARY_PATH on OS X) properly.  You
> might need to snag a new copy of binutils as well, if gcc refuses to
> compile with the system ld.  This option would also probably get you
> threading, since you certainly have hardware support for it.
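> 
> A rough sketch of a home-directory GCC build -- version numbers and
> paths are only illustrative, and you'll need GMP and MPFR visible to
> GCC's configure as prerequisites:
> 
>     tar xjf gcc-4.4.5.tar.bz2
>     mkdir gcc-build && cd gcc-build
>     ../gcc-4.4.5/configure --prefix=$HOME/opt/gcc-4.4.5 \
>         --enable-languages=c,c++,fortran --disable-multilib
>     make && make install          # this is the part that takes forever
>     cd ..
> 
>     # then, in your shell profile:
>     export PATH=$HOME/opt/gcc-4.4.5/bin:$PATH
>     export LD_LIBRARY_PATH=$HOME/opt/gcc-4.4.5/lib:$LD_LIBRARY_PATH
>     # (DYLD_LIBRARY_PATH instead of LD_LIBRARY_PATH on OS X)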
> 
> 4) Rebuild your entire GROMACS stack, including FFTW, BLAS/LAPACK,
> MPI, and GROMACS itself with the same compiler (preferably GCC from
> #3) and the same compiler options (which again should be -O2, and
> definitely NOT any sort of alignment flag).  Put them in their own
> tree (like "/opt/sci"), and definitely not in /usr (which is generally
> managed by the system) or /usr/local (which tends to accumulate
> cruft).  ATLAS is a good choice for BLAS, and there are directions on
> the ATLAS website for building a complete and optimized LAPACK based
> on BLAS.
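> 
> In outline, that rebuild looks something like the following.  Versions,
> the /opt/sci prefix, and the choice of Open MPI are placeholders, the
> ATLAS/LAPACK step is deliberately left to the ATLAS install notes, and
> the exact --with-... spellings for the GROMACS configure are best
> double-checked against ./configure --help:
> 
>     export PREFIX=/opt/sci            # or any tree you control
>     export CC=gcc CXX=g++ F77=gfortran
>     export CFLAGS="-O2" CXXFLAGS="-O2" FFLAGS="-O2"
> 
>     # FFTW, single precision to match a single-precision GROMACS
>     (cd fftw-3.2.2 && ./configure --prefix=$PREFIX --enable-float \
>         && make && make install)
> 
>     # MPI, built with the same compiler
>     (cd openmpi-1.4.3 && ./configure --prefix=$PREFIX \
>         && make && make install)
> 
>     # ATLAS + full LAPACK: follow the ATLAS install notes, into $PREFIX
> 
>     # GROMACS, pointed at everything above
>     export PATH=$PREFIX/bin:$PATH
>     export CPPFLAGS="-I$PREFIX/include" LDFLAGS="-L$PREFIX/lib"
>     (cd gromacs-4.5.3 && ./configure --prefix=$PREFIX --enable-mpi \
>         --with-fft=fftw3 --with-external-blas --with-external-lapack \
>         && make && make install)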
> 
> In practice, I've found I've had to do #4 for every piece of
> scientific software our group uses, because pretty much nothing works
> right with OS-installed versions of compilers, BLAS/LAPACK, and MPI.
> It takes forever, and it pretty much defines the phrase "learning
> experience," but it also essentially *never* breaks once it works
> (because OS updates never overwrite anything you've hand-tuned to run
> correctly).  But...with luck option #1 will fix things quickly enough
> to get you running without devoting two days to rebuilding your
> software stack from scratch.
> 
> Hope that helps,
> Matt Z.
> 
> 
> On Thu, Mar 10, 2011 at 8:54 PM, Justin A. Lemkul <jalemkul at vt.edu> wrote:
>> Hi Matt,
>>
>> Thanks for the reply.  I can't trace the problem to a specific compiler.  We
>> have a PowerPC cluster with two partitions - one running Mac OS X 10.3 with
>> gcc-3.3, the other running YellowDog Linux with gcc-4.2.2.  The problem
>> happens on both partitions.  There are no seg faults, the runs just exit
>> (MPI_ABORT) after the fatal error (either "too many LINCS warnings" or the
>> DD-related error I posted before).
>>
>> We are using MPI: mpich-1.2.5 on OSX and OpenMPI-1.2.3 on Linux.  All of the
>> above has been the same since my successful 3.3.3 TI calculations (as well
>> as all of my simulations with Gromacs, ever).  Our hardware and compilers
>> are somewhat (very) outdated, so threading is not supported; we always
>> use MPI.
>>
>> Gromacs was compiled in single precision using standard options through
>> autoconf.  The cmake build system still does not work on our cluster due to
>> several outstanding bugs.
>>
>> -Justin
>>
>> Matthew Zwier wrote:
>>> Dear Justin,
>>>
>>> We recently experienced a similar problem (LINCS errors, step*.pdb
>>> files), and then GROMACS usually segfaulted.  The cause was a
>>> miscompiled copy of GROMACS.  Another member of our group had compiled
>>> GROMACS on an Intel Core2 quad (gcc -march=core2) and tried to run the
>>> copy without modification on an AMD Magny Cours machine.
>>> Recompilation with the correct subarchitecture type (-march=amdfam10)
>>> fixed the problem.  Don't really know why it didn't die with SIGILL or
>>> SIGBUS instead of SIGSEGV, but that's probably a question for the
>>> hardware gurus.
>>>
>>> So...are you observing segfaults?  What compiler are you using (and on
>>> what OS)?  What were the compilation parameters for 4.5.3?  Also, are
>>> you really running across nodes with MPI, or running on the same node
>>> with MPI?
>>>
>>> Cheers,
>>> Matt Zwier
>>>
>>> On Thu, Mar 10, 2011 at 1:55 PM, Justin A. Lemkul <jalemkul at vt.edu> wrote:
>>>> Hi All,
>>>>
>>>> I've been troubleshooting a problem for some time now, and I wanted to
>>>> report it here and solicit some feedback, to see if there's anything
>>>> else I can try before I submit a bug report.
>>>>
>>>> Here's the situation: I ran some free energy calculations (thermodynamic
>>>> integration) a long time ago using version 3.3.3 to determine the
>>>> hydration
>>>> free energy of a series of small molecules.  Results were good and they
>>>> ended up as part of a paper, so I'm trying to reproduce the methodology
>>>> with
>>>> 4.5.3 (using BAR) to see if I understand the workflow completely.  The
>>>> problem is my systems are crashing.  The runs simply stop randomly
>>>> (usually
>>>> within a few hundred ps) with lots of LINCS warnings and step*.pdb files
>>>> being written.
>>>>
>>>> I know the parameters are good, and produce stable trajectories, since I
>>>> spent months on them some years ago. The system prep is steepest descents
>>>> EM
>>>> to Fmax < 100 (always achieved), NVT at 298 K for 100 ps, NPT at 298K/1
>>>> bar
>>>> for 100 ps, then 5 ns of data collection under NPT conditions.  Here's
>>>> the
>>>> rundown of what I'm seeing:
>>>>
>>>> 1. All LJ transformations work fine.  The problem only comes when I have
>>>> a
>>>> molecule with full LJ interaction and I am "charging" it (i.e.,
>>>> introducing
>>>> charges to the partially-interacting species).
>>>>
>>>> 2. Simulations at lambda=1 (full interaction) work fine.
>>>>
>>>> 3. Simulations with the free energy code off entirely work fine under all
>>>> conditions.
>>>>
>>>> 4. I cannot run in serial due to http://redmine.gromacs.org/issues/715.
>>>>  The
>>>> bug seems to affect other systems and is not specifically related to my
>>>> free
>>>> energy calculations.
>>>>
>>>> 5. Running with DD fails because my system is relatively small (more on
>>>> this
>>>> in a moment).
>>>>
>>>> 6. Running with mdrun -pd 2 works, but mdrun -pd 4 crashes for any value
>>>> of
>>>> lambda != 1.
>>>>
>>>> 7. I created a larger system (instead of a 3x3x3-nm cube of water with my
>>>> molecule, I used 4x4x4) and ran on 4 CPU's with DD (lambda = 0, i.e. full
>>>> vdW, no intermolecular Coulombic interactions - .mdp file is below).
>>>>  This
>>>> run also crashed with some warnings about DD cell size:
>>>>
>>>> DD  load balancing is limited by minimum cell size in dimension X
>>>> DD  step 329999  vol min/aver 0.748! load imb.: force 31.5%
>>>>
>>>> ...and then the actual crash:
>>>>
>>>> -------------------------------------------------------
>>>> Program mdrun_4.5.3_gcc_mpi, VERSION 4.5.3
>>>> Source code file: domdec_con.c, line: 693
>>>>
>>>> Fatal error:
>>>> DD cell 0 0 0 could only obtain 14 of the 15 atoms that are connected via
>>>> constraints from the neighboring cells. This probably means your
>>>> constraint
>>>> lengths are too long compared to the domain decomposition cell size.
>>>> Decrease the number of domain decomposition grid cells or lincs-order or
>>>> use
>>>> the -rcon option of mdrun.
>>>> For more information and tips for troubleshooting, please check the
>>>> GROMACS
>>>> website at http://www.gromacs.org/Documentation/Errors
>>>> -------------------------------------------------------
>>>>
>>>> Watching the trajectory doesn't seem to give any useful information.  The
>>>> small molecule of interest is at a periodic boundary when the crash
>>>> happens, but it crosses the boundary several times before the crash
>>>> without incident, so the issue does not appear to be related to PBC.
>>>>
>>>> 8. I initially thought the problem might be related to the barostat, but
>>>> switching from P-R to Berendsen does not alleviate the problem, nor does
>>>> increasing tau_p (tested 0.5, 1.0, 2.0, and 5.0 - all crash).  Longer
>>>> tau_p
>>>> simply delays the crash, but does not prevent it.
>>>>
>>>> So after all that, I'm wondering if (1) anyone has seen the same, or (2)
>>>> if
>>>> there's anything else I can try (environment variables, hidden tricks,
>>>> etc)
>>>> that I can use to get to the bottom of this before I give up and file a
>>>> bug
>>>> report.
>>>>
>>>> If you made it this far, thanks for reading my novel and hopefully
>>>> someone
>>>> can give me some ideas.  The .mdp file I'm using is below, but it is just
>>>> one of many that I've tried.  In theory, it should work, since the
>>>> parameters are the same as my successful 3.3.3 runs, with the exception
>>>> of
>>>> the new free energy features in 4.5.3 and obvious keyword changes related
>>>> to
>>>> the difference in version.
>>>>
>>>> -Justin
>>>>
>>>> --- .mdp file ---
>>>>
>>>> ; Run control
>>>> integrator               = sd       ; Langevin dynamics
>>>> tinit                    = 0
>>>> dt                       = 0.002
>>>> nsteps                   = 2500000  ; 5 ns
>>>> nstcomm                  = 100
>>>> ; Output control
>>>> nstxout                  = 500
>>>> nstvout                  = 500
>>>> nstfout                  = 0
>>>> nstlog                   = 500
>>>> nstenergy                = 500
>>>> nstxtcout                = 0
>>>> xtc-precision            = 1000
>>>> ; Neighborsearching and short-range nonbonded interactions
>>>> nstlist                  = 5
>>>> ns_type                  = grid
>>>> pbc                      = xyz
>>>> rlist                    = 0.9
>>>> ; Electrostatics
>>>> coulombtype              = PME
>>>> rcoulomb                 = 0.9
>>>> ; van der Waals
>>>> vdw-type                 = cutoff
>>>> rvdw                     = 1.4
>>>> ; Apply long range dispersion corrections for Energy and Pressure
>>>> DispCorr                  = EnerPres
>>>> ; Spacing for the PME/PPPM FFT grid
>>>> fourierspacing           = 0.12
>>>> ; EWALD/PME/PPPM parameters
>>>> pme_order                = 4
>>>> ewald_rtol               = 1e-05
>>>> epsilon_surface          = 0
>>>> optimize_fft             = no
>>>> ; Temperature coupling
>>>> ; tcoupl is implicitly handled by the sd integrator
>>>> tc_grps                  = system
>>>> tau_t                    = 1.0
>>>> ref_t                    = 298
>>>> ; Pressure coupling is on for NPT
>>>> Pcoupl                   = Berendsen
>>>> tau_p                    = 2.0
>>>> compressibility          = 4.5e-05
>>>> ref_p                    = 1.0
>>>> ; Free energy control stuff
>>>> free_energy              = yes
>>>> init_lambda              = 0.00
>>>> delta_lambda             = 0
>>>> foreign_lambda           = 0.05
>>>> sc-alpha                 = 0
>>>> sc-power                 = 1.0
>>>> sc-sigma                 = 0
>>>> couple-moltype           = MOR      ; name of moleculetype to couple
>>>> couple-lambda0           = vdw      ; vdW interactions
>>>> couple-lambda1           = vdw-q    ; turn on everything
>>>> couple-intramol          = no
>>>> dhdl_derivatives         = yes      ; this line (and the next two) are defaults
>>>> separate_dhdl_file       = yes      ; included only for pedantry
>>>> nstdhdl                  = 10
>>>> ; Do not generate velocities
>>>> gen_vel                  = no
>>>> ; options for bonds
>>>> constraints              = all-bonds
>>>> ; Type of constraint algorithm
>>>> constraint-algorithm     = lincs
>>>> ; Constrain the starting configuration
>>>> ; since we are continuing from NPT
>>>> continuation             = yes
>>>> ; Highest order in the expansion of the constraint coupling matrix
>>>> lincs-order              = 4
>>>>
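>>>> For completeness, the driver commands around this .mdp are just the
>>>> standard grompp/mdrun/g_bar workflow, roughly as below -- file names,
>>>> the binary suffix, and the -np count are placeholders:
>>>>
>>>>     # per lambda window (init_lambda/foreign_lambda edited per window)
>>>>     grompp -f charging_l0.00.mdp -c npt.gro -p topol.top \
>>>>            -o charging_l0.00.tpr
>>>>     mpirun -np 4 mdrun_4.5.3_gcc_mpi -deffnm charging_l0.00
>>>>
>>>>     # once all windows have finished, BAR over the dhdl files:
>>>>     g_bar -f charging_l*.xvg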
>>>>
> 

-- 
========================================

Justin A. Lemkul
Ph.D. Candidate
ICTAS Doctoral Scholar
MILES-IGERT Trainee
Department of Biochemistry
Virginia Tech
Blacksburg, VA
jalemkul[at]vt.edu | (540) 231-9080
http://www.bevanlab.biochem.vt.edu/Pages/Personal/justin

========================================


