[gmx-users] Possible free energy bug?

Michael Shirts mrshirts at gmail.com
Fri Mar 11 04:54:05 CET 2011


Hi, all-

Have you tried running

constraints = h-bonds?

That might eliminate some of the constraint issues.  LINCS is much less
likely to break or run into DD issues if only the hydrogen-containing bonds
are constrained, and 2 fs is not that big a deal for the remaining
heavy-atom bonds.
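
For example, relative to the .mdp in the quoted message below, that would
just mean replacing constraints = all-bonds with:

    constraints              = h-bonds   ; constrain only bonds to hydrogen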

Best,
Michael

On Thu, Mar 10, 2011 at 8:04 PM, Justin A. Lemkul <jalemkul at vt.edu> wrote:
>
> Hi Matt,
>
> Thanks for the extensive explanation and tips.  I'll work through things and
> report back.  It will take a while to get things going, though (unless one of
> the early solutions works!), since I have no admin access to install new
> compilers, libraries, etc., and for some reason the only thing I can ever get
> to work in my home directory is Gromacs itself.  The joys of an aging
> cluster.
>
> We recently got access to gcc-4.4.5 on Linux, but we're stuck with 3.3 on OS
> X, so there's at least a bit of hope for one partition.
>
> Thanks again.
>
> -Justin
>
> Matthew Zwier wrote:
>>
>> Hi Justin,
>>
>> I should have specified that the segfault happened for us after we got
>> similar warnings and errors (DD and/or LINCS), so the segfault may
>> have been tangential.  Given that everything about your system worked
>> before GROMACS 4.5, it's possible that your older compilers are
>> generating code that's incompatible with the GROMACS assembly loops
>> (which you are likely running with, as they are the default option on
>> most mainstream processors).  The bug you mentioned in your original
>> post also has my antennae twitching about bad machine code.
>>
>> If that's indeed happening, it's almost certainly some bizarre alignment
>> issue: something like half of a float getting overwritten on the way into
>> or out of the assembly code, and that kind of corruption would trigger the
>> results you describe.  It's also distantly possible that GROMACS is working
>> fine, but your copy of FFTW or BLAS/LAPACK (more likely the latter) has
>> alignment problems.  One final possibility (which would explain the failure
>> on YellowDog but unfortunately not the one on OS X) is that GCC is
>> generating badly-aligned code for auto-vectorized Altivec loops; that is
>> still a problem for Intel's SIMD instructions on 32-bit x86 even with GCC
>> 4.4.  I've also seen MPI gather/reduce operations foul up alignment (or
>> rigidly enforce it where badly compiled code relies on broken alignment)
>> under exceedingly rare circumstances, usually involving different libraries
>> compiled with different compilers (which is generally a bad idea for
>> scientific code anyway).
>>
>> Okay...so all of that said, there are a few things to try:
>>
>> 1) Recompile GROMACS using -O2 instead of -O3; that'll turn off the
>> automatic vectorizer (on Yellow Dog) and various other relatively
>> risky optimizations (on both platforms).  CFLAGS="-O2 -march=powerpc"
>> in the environment AND on the configure command line would do that.
>> Check your build logs to make sure it took, though, because if you
>> don't do it exactly right, configure will ignore your directives and
>> merrily set up GROMACS to compile with -O3, which is the most likely
>> culprit for badly-aligned code.
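>>
>> As a concrete sketch of #1 (assuming the standard autoconf build; the
>> install prefix is just a placeholder):
>>
>>   export CFLAGS="-O2 -march=powerpc"
>>   ./configure --prefix=$HOME/gromacs-4.5.3-O2 --enable-mpi \
>>       CFLAGS="-O2 -march=powerpc"
>>   make && make install
>>
>> Afterwards, grep the make output (or config.log) for "-O3"; if it still
>> shows up, the override didn't take.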
>>
>> 2) Recompile GROMACS specifying a forced alignment flag.  I have no
>> experience with PowerPC, but -malign-natural and -malign-power look
>> like good initial guesses.  That's probably going to cause more
>> problems than it solves, but if you have a screwy BLAS/LAPACK or MPI,
>> it might help.  I suggest it only because, if you've already tried #1, it
>> will take just another half hour or an hour of your time to recompile
>> GROMACS again.  Other than that, tinkering with alignment flags is a
>> really easy way to REALLY break code, so you might consider skipping
>> this and moving straight on to #3.
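>>
>> (Concretely, that would be the same configure line as in #1, just with,
>> e.g., CFLAGS="-O2 -malign-natural"; the flag choice is purely a guess for
>> your PowerPC toolchain.)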
>>
>> 3) Snag GCC 4.4.4 or 4.4.5, compile it, and use that to compile GROMACS,
>> again with -O2.  GCC takes forever to compile, but beyond that, it's not as
>> difficult as it could be.  There's nothing preventing you from installing
>> it in your home directory, either, assuming you set PATH and
>> LD_LIBRARY_PATH (or DYLD_LIBRARY_PATH on OS X) properly.  You might need to
>> snag a new copy of binutils as well, if gcc refuses to compile with the
>> system ld.  This option would also probably get you threading, since you
>> certainly have hardware support for it.
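>>
>> In case it's useful, the home-directory route is roughly (a sketch; the
>> version number and paths are placeholders, and GCC 4.4 also wants GMP and
>> MPFR available before it will configure):
>>
>>   tar xjf gcc-4.4.5.tar.bz2
>>   mkdir gcc-build && cd gcc-build
>>   ../gcc-4.4.5/configure --prefix=$HOME/gcc-4.4.5 --enable-languages=c,c++
>>   make && make install
>>   export PATH=$HOME/gcc-4.4.5/bin:$PATH
>>   export LD_LIBRARY_PATH=$HOME/gcc-4.4.5/lib:$LD_LIBRARY_PATH
>>   # (DYLD_LIBRARY_PATH instead of LD_LIBRARY_PATH on OS X)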
>>
>> 4) Rebuild your entire GROMACS stack, including FFTW, BLAS/LAPACK,
>> MPI, and GROMACS itself with the same compiler (preferably GCC from
>> #3) and the same compiler options (which again should be -O2, and
>> definitely NOT any sort of alignment flag).  Put them in their own
>> tree (like "/opt/sci"), and definitely not in /usr (which is generally
>> managed by the system) or /usr/local (which tends to accumulate
>> cruft).  ATLAS is a good choice for BLAS, and there are directions on
>> the ATLAS website for building a complete and optimized LAPACK based
>> on BLAS.
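>>
>> A rough outline of that (prefix and versions are placeholders; the point
>> is one compiler and one set of flags throughout):
>>
>>   export CC=$HOME/gcc-4.4.5/bin/gcc
>>   export CFLAGS="-O2"
>>
>>   # single-precision FFTW 3.x to match a single-precision GROMACS
>>   cd fftw-3.2.2
>>   ./configure --prefix=/opt/sci --enable-float && make && make install
>>
>>   # then GROMACS against that tree
>>   cd ../gromacs-4.5.3
>>   ./configure --prefix=/opt/sci --enable-mpi \
>>       CPPFLAGS=-I/opt/sci/include LDFLAGS=-L/opt/sci/lib
>>   make && make install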
>>
>> In practice, I've found I've had to do #4 for every piece of
>> scientific software our group uses, because pretty much nothing works
>> right with OS-installed versions of compilers, BLAS/LAPACK, and MPI.
>> It takes forever, and it pretty much defines the phrase "learning
>> experience," but it also essentially *never* breaks once it works
>> (because OS updates never overwrite anything you've hand-tuned to run
>> correctly).  But...with luck option #1 will fix things quickly enough
>> to get you running without devoting two days to rebuilding your
>> software stack from scratch.
>>
>> Hope that helps,
>> Matt Z.
>>
>>
>> On Thu, Mar 10, 2011 at 8:54 PM, Justin A. Lemkul <jalemkul at vt.edu> wrote:
>>>
>>> Hi Matt,
>>>
>>> Thanks for the reply.  I can't trace the problem to a specific compiler.
>>> We have a PowerPC cluster with two partitions: one running Mac OS X 10.3
>>> with gcc-3.3, the other running YellowDog Linux with gcc-4.2.2.  The
>>> problem happens on both partitions.  There are no seg faults; the runs
>>> just exit (MPI_ABORT) after the fatal error (either "too many LINCS
>>> warnings" or the DD-related error I posted before).
>>>
>>> We are using MPI: mpich-1.2.5 on OS X and OpenMPI-1.2.3 on Linux.  All of
>>> the above has been the same since my successful 3.3.3 TI calculations (as
>>> well as all of my simulations with Gromacs, ever).  Our hardware and
>>> compilers are somewhat (very) outdated, so threading is not supported; we
>>> always use MPI.
>>>
>>> Gromacs was compiled in single precision using standard options through
>>> autoconf.  The cmake build system still does not work on our cluster due
>>> to several outstanding bugs.
>>>
>>> -Justin
>>>
>>> Matthew Zwier wrote:
>>>>
>>>> Dear Justin,
>>>>
>>>> We recently experienced a similar problem (LINCS errors, step*.pdb
>>>> files), and then GROMACS usually segfaulted.  The cause was a
>>>> miscompiled copy of GROMACS.  Another member of our group had compiled
>>>> GROMACS on an Intel Core2 quad (gcc -march=core2) and tried to run the
>>>> copy without modification on an AMD Magny Cours machine.
>>>> Recompilation with the correct subarchitecture type (-march=amdfam10)
>>>> fixed the problem.  I don't really know why it didn't die with SIGILL
>>>> or SIGBUS instead of SIGSEGV, but that's probably a question for the
>>>> hardware gurus.
>>>>
>>>> So...are you observing segfaults?  What compiler are you using (and on
>>>> what OS)?  What were the compilation parameters for 4.5.3?  Also, are
>>>> you really running across nodes with MPI, or running on the same node
>>>> with MPI?
>>>>
>>>> Cheers,
>>>> Matt Zwier
>>>>
>>>> On Thu, Mar 10, 2011 at 1:55 PM, Justin A. Lemkul <jalemkul at vt.edu>
>>>> wrote:
>>>>>
>>>>> Hi All,
>>>>>
>>>>> I've been troubleshooting a problem for some time now and I wanted to
>>>>> report it here and solicit some feedback before I submit a bug report, to
>>>>> see if there's anything else I can try.
>>>>>
>>>>> Here's the situation: I ran some free energy calculations (thermodynamic
>>>>> integration) a long time ago using version 3.3.3 to determine the
>>>>> hydration free energy of a series of small molecules.  The results were
>>>>> good and ended up as part of a paper, so I'm trying to reproduce the
>>>>> methodology with 4.5.3 (using BAR) to see if I understand the workflow
>>>>> completely.  The problem is that my systems are crashing.  The runs
>>>>> simply stop randomly (usually within a few hundred ps) with lots of LINCS
>>>>> warnings and step*.pdb files being written.
>>>>>
>>>>> I know the parameters are good and produce stable trajectories, since I
>>>>> spent months on them some years ago.  The system prep is steepest-descent
>>>>> EM to Fmax < 100 (always achieved), NVT at 298 K for 100 ps, NPT at
>>>>> 298 K/1 bar for 100 ps, then 5 ns of data collection under NPT
>>>>> conditions.  Here's the rundown of what I'm seeing:
>>>>>
>>>>> 1. All LJ transformations work fine.  The problem only arises when I have
>>>>> a molecule with full LJ interactions and I am "charging" it (i.e.,
>>>>> introducing charges to the partially interacting species).
>>>>>
>>>>> 2. Simulations at lambda=1 (full interaction) work fine.
>>>>>
>>>>> 3. Simulations with the free energy code off entirely work fine under all
>>>>> conditions.
>>>>>
>>>>> 4. I cannot run in serial due to http://redmine.gromacs.org/issues/715.
>>>>> The bug seems to affect other systems and is not specifically related to
>>>>> my free energy calculations.
>>>>>
>>>>> 5. Running with DD fails because my system is relatively small (more on
>>>>> this in a moment).
>>>>>
>>>>> 6. Running with mdrun -pd 2 works, but mdrun -pd 4 crashes for any value
>>>>> of lambda != 1.
>>>>>
>>>>> 7. I created a larger system (instead of a 3x3x3-nm cube of water with my
>>>>> molecule, I used 4x4x4 nm) and ran on 4 CPUs with DD (lambda = 0, i.e.
>>>>> full vdW, no intermolecular Coulombic interactions; the .mdp file is
>>>>> below).  This run also crashed with some warnings about DD cell size:
>>>>>
>>>>> DD  load balancing is limited by minimum cell size in dimension X
>>>>> DD  step 329999  vol min/aver 0.748! load imb.: force 31.5%
>>>>>
>>>>> ...and then the actual crash:
>>>>>
>>>>> -------------------------------------------------------
>>>>> Program mdrun_4.5.3_gcc_mpi, VERSION 4.5.3
>>>>> Source code file: domdec_con.c, line: 693
>>>>>
>>>>> Fatal error:
>>>>> DD cell 0 0 0 could only obtain 14 of the 15 atoms that are connected
>>>>> via constraints from the neighboring cells. This probably means your
>>>>> constraint lengths are too long compared to the domain decomposition
>>>>> cell size. Decrease the number of domain decomposition grid cells or
>>>>> lincs-order or use the -rcon option of mdrun.
>>>>> For more information and tips for troubleshooting, please check the
>>>>> GROMACS website at http://www.gromacs.org/Documentation/Errors
>>>>> -------------------------------------------------------
>>>>>
>>>>> Watching the trajectory doesn't seem to give any useful information.  The
>>>>> small molecule of interest is at a periodic boundary when the crash
>>>>> happens, but it crosses the boundary several times before the crash
>>>>> without incident, so the issue does not appear to be related to PBC.
>>>>>
>>>>> 8. I initially thought the problem might be related to the barostat, but
>>>>> switching from P-R to Berendsen does not alleviate the problem, nor does
>>>>> increasing tau_p (tested 0.5, 1.0, 2.0, and 5.0; all crash).  A longer
>>>>> tau_p simply delays the crash, but does not prevent it.
>>>>>
>>>>> So after all that, I'm wondering (1) if anyone has seen the same, or (2)
>>>>> if there's anything else I can try (environment variables, hidden tricks,
>>>>> etc.) to get to the bottom of this before I give up and file a bug
>>>>> report.
>>>>>
>>>>> If you made it this far, thanks for reading my novel, and hopefully
>>>>> someone can give me some ideas.  The .mdp file I'm using is below, but it
>>>>> is just one of many that I've tried.  In theory, it should work, since
>>>>> the parameters are the same as in my successful 3.3.3 runs, with the
>>>>> exception of the new free energy features in 4.5.3 and the obvious
>>>>> keyword changes between versions.
>>>>>
>>>>> -Justin
>>>>>
>>>>> --- .mdp file ---
>>>>>
>>>>> ; Run control
>>>>> integrator               = sd       ; Langevin dynamics
>>>>> tinit                    = 0
>>>>> dt                       = 0.002
>>>>> nsteps                   = 2500000  ; 5 ns
>>>>> nstcomm                  = 100
>>>>> ; Output control
>>>>> nstxout                  = 500
>>>>> nstvout                  = 500
>>>>> nstfout                  = 0
>>>>> nstlog                   = 500
>>>>> nstenergy                = 500
>>>>> nstxtcout                = 0
>>>>> xtc-precision            = 1000
>>>>> ; Neighborsearching and short-range nonbonded interactions
>>>>> nstlist                  = 5
>>>>> ns_type                  = grid
>>>>> pbc                      = xyz
>>>>> rlist                    = 0.9
>>>>> ; Electrostatics
>>>>> coulombtype              = PME
>>>>> rcoulomb                 = 0.9
>>>>> ; van der Waals
>>>>> vdw-type                 = cutoff
>>>>> rvdw                     = 1.4
>>>>> ; Apply long range dispersion corrections for Energy and Pressure
>>>>> DispCorr                  = EnerPres
>>>>> ; Spacing for the PME/PPPM FFT grid
>>>>> fourierspacing           = 0.12
>>>>> ; EWALD/PME/PPPM parameters
>>>>> pme_order                = 4
>>>>> ewald_rtol               = 1e-05
>>>>> epsilon_surface          = 0
>>>>> optimize_fft             = no
>>>>> ; Temperature coupling
>>>>> ; tcoupl is implicitly handled by the sd integrator
>>>>> tc_grps                  = system
>>>>> tau_t                    = 1.0
>>>>> ref_t                    = 298
>>>>> ; Pressure coupling is on for NPT
>>>>> Pcoupl                   = Berendsen
>>>>> tau_p                    = 2.0
>>>>> compressibility          = 4.5e-05
>>>>> ref_p                    = 1.0
>>>>> ; Free energy control stuff
>>>>> free_energy              = yes
>>>>> init_lambda              = 0.00
>>>>> delta_lambda             = 0
>>>>> foreign_lambda           = 0.05
>>>>> sc-alpha                 = 0
>>>>> sc-power                 = 1.0
>>>>> sc-sigma                 = 0
>>>>> couple-moltype           = MOR      ; name of moleculetype to couple
>>>>> couple-lambda0           = vdw      ; vdW interactions
>>>>> couple-lambda1           = vdw-q    ; turn on everything
>>>>> couple-intramol          = no
>>>>> dhdl_derivatives         = yes      ; this line (and the next two) are defaults
>>>>> separate_dhdl_file       = yes      ; included only for pedantry
>>>>> nstdhdl                  = 10
>>>>> ; Do not generate velocities
>>>>> gen_vel                  = no
>>>>> ; options for bonds
>>>>> constraints              = all-bonds
>>>>> ; Type of constraint algorithm
>>>>> constraint-algorithm     = lincs
>>>>> ; Constrain the starting configuration
>>>>> ; since we are continuing from NPT
>>>>> continuation             = yes
>>>>> ; Highest order in the expansion of the constraint coupling matrix
>>>>> lincs-order              = 4
>>>>>
>>>>>
>>
>
> --
> ========================================
>
> Justin A. Lemkul
> Ph.D. Candidate
> ICTAS Doctoral Scholar
> MILES-IGERT Trainee
> Department of Biochemistry
> Virginia Tech
> Blacksburg, VA
> jalemkul[at]vt.edu | (540) 231-9080
> http://www.bevanlab.biochem.vt.edu/Pages/Personal/justin
>
> ========================================
>


