[gmx-users] Possible free energy bug?
Justin A. Lemkul
jalemkul at vt.edu
Fri Mar 11 04:56:22 CET 2011
Michael Shirts wrote:
> Hi, all-
>
> Have you tried running
>
> constraints = hbonds?
>
> That might eliminate some of the constraint issues. Much less likely
> for LINCS to break or have DD issues if only the hbonds are
> constrained. 2 fs is not that big a deal for the heteroatom bonds.
>
I haven't yet, but I'll add it to my to-do list. I was trying to keep as many
things consistent between my 3.3.3 and 4.5.3 input files as possible, so I could
diagnose any issues, but at this point, anything is worth a shot.
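For reference, a minimal sketch of that one-line change, assuming the production input lives in a plain .mdp file and only the constraints line should differ. The function name and file names are placeholders, and the value is written h-bonds, which is how the GROMACS manual spells it.

```shell
#!/bin/sh
# Write a copy of an .mdp file with "constraints" switched to h-bonds.
# Every other line, including constraint-algorithm, is passed through unchanged.
to_hbonds() {
    sed 's/^\([[:space:]]*constraints[[:space:]]*=[[:space:]]*\)[A-Za-z_-]*/\1h-bonds/' "$1" > "$2"
}

# Example call (hypothetical file names):
#   to_hbonds md.mdp md_hbonds.mdp
```

The new file would then go through grompp as usual, with everything else held fixed so any difference can be pinned on the constraints.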
Thanks!
-Justin
> Best,
> Michael
>
> On Thu, Mar 10, 2011 at 8:04 PM, Justin A. Lemkul <jalemkul at vt.edu> wrote:
>> Hi Matt,
>>
>> Thanks for the extensive explanation and tips. I'll work through things and
>> report back. It will take a while to get things going through (unless one
>> of the early solutions works!) since I have no admin access to install new
>> compilers, libraries, etc. and for some reason the only thing I can ever get
>> to work in my home directory is Gromacs itself. The joys of an aging
>> cluster.
>>
>> We recently got access to gcc-4.4.5 on Linux, but we're stuck with 3.3 on OS
>> X, so there's at least a bit of hope for one partition.
>>
>> Thanks again.
>>
>> -Justin
>>
>> Matthew Zwier wrote:
>>> Hi Justin,
>>>
>>> I should have specified that the segfault happened for us after we got
>>> similar warnings and errors (DD and/or LINCS), so the segfault may
>>> have been tangential. Given that everything about your system worked
>>> before GROMACS 4.5, it's possible that your older compilers are
>>> generating code that's incompatible with the GROMACS assembly loops
>>> (which you are likely running with, as they are the default option on
>>> most mainstream processors). The bug you mentioned in your original
>>> post also has my antennae twitching about bad machine code.
>>>
>>> If that's indeed happening, it's almost certainly some bizarre
>>> alignment issue, something like half of a float getting overwritten
>>> on the way into or out of the assembly code; that kind of corruption
>>> would trigger the results you describe. It's also distantly possible that
>>> GROMACS is working fine, but your copy of FFTW or BLAS/LAPACK (more
>>> likely the latter) has alignment problems. One final possibility
>>> (which would explain the failure on YellowDog but unfortunately not
>>> the failure on OS X) is that GCC is generating badly-aligned code for
>>> auto-vectorized Altivec loops, which is still a problem for Intel's
>>> SIMD instructions on 32-bit x86 architectures even with GCC 4.4. I've
>>> also observed MPI gather/reduce operations to foul up alignment (or
>>> rigidly enforce it where badly compiled code is relying on broken
>>> alignment) under exceedingly rare circumstances, usually involving
>>> different libraries compiled with different compilers (which is
>>> generally a bad idea for scientific code anyway).
>>>
>>> Okay...so all of that said, there are a few things to try:
>>>
>>> 1) Recompile GROMACS using -O2 instead of -O3; that'll turn off the
>>> automatic vectorizer (on Yellow Dog) and various other relatively
>>> risky optimizations (on both platforms). CFLAGS="-O2 -mcpu=powerpc"
>>> in the environment AND on the configure command line would do that.
>>> Check your build logs to make sure it took, though, because if you
>>> don't do it exactly right, configure will ignore your directives and
>>> merrily set up GROMACS to compile with -O3, which is the most likely
>>> culprit for badly-aligned code.
>>>
>>> 2) Recompile GROMACS specifying a forced alignment flag. I have no
>>> experience with PowerPC, but -malign-natural and -malign-power look
>>> like good initial guesses. That's probably going to cause more
>>> problems than it solves, but if you have a screwy BLAS/LAPACK or MPI,
>>> it might help. I only suggest it because if you've already tried #1,
>>> it will only take another half hour or hour of your time to recompile
>>> GROMACS again. Other than that, tinkering with alignment flags is a
>>> really easy way to REALLY break code, so you might consider skipping
>>> this and moving straight on to #3.
>>>
>>> 3) Snag GCC 4.4.4 or 4.4.5 and compile it, and use that to compile
>>> GROMACS, again with -O2. GCC takes forever to compile, but beyond
>>> that, it's not as difficult as it could be. Nothing preventing you
>>> from installing it in your home directory, either, assuming you set
>>> PATH and LD_LIBRARY_PATH (or DYLD_LIBRARY_PATH on OS X) properly. You
>>> might need to snag a new copy of binutils as well, if gcc refuses to
>>> compile with the system ld. This option would also probably get you
>>> threading, since you certainly have hardware support for it.
>>>
>>> 4) Rebuild your entire GROMACS stack, including FFTW, BLAS/LAPACK,
>>> MPI, and GROMACS itself with the same compiler (preferably GCC from
>>> #3) and the same compiler options (which again should be -O2, and
>>> definitely NOT any sort of alignment flag). Put them in their own
>>> tree (like "/opt/sci"), and definitely not in /usr (which is generally
>>> managed by the system) or /usr/local (which tends to accumulate
>>> cruft). ATLAS is a good choice for BLAS, and there are directions on
>>> the ATLAS website for building a complete and optimized LAPACK based
>>> on BLAS.
>>>
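Options 1 and 3 above can be sketched roughly as follows. This is only an outline under assumptions: the GCC prefix under $HOME and the build-log name are hypothetical, the architecture flag is left out on purpose, and the configure line is printed rather than executed so it can be reviewed first.

```shell
#!/bin/sh
# Hypothetical install prefix for a privately built GCC (option 3).
GCC_PREFIX="$HOME/opt/gcc-4.4.5"

# Put the private compiler and its runtime libraries first in the search paths.
# The library directory may be lib64 on some systems; on OS X the variable
# would be DYLD_LIBRARY_PATH instead.
PATH="$GCC_PREFIX/bin:$PATH"
LD_LIBRARY_PATH="$GCC_PREFIX/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
export PATH LD_LIBRARY_PATH

# Option 1: request -O2 both in the environment and on the configure line.
CFLAGS="-O2"
export CFLAGS

# Print the configure command instead of running it.
echo "./configure --enable-mpi CFLAGS=\"$CFLAGS\""

# Report whether a saved build log still contains -O3 anywhere.
check_opt_level() {
    if grep -q -e '-O3' "$1"; then
        echo "-O3 still present in $1"
        return 1
    fi
    echo "no -O3 found in $1"
}
```

After a build captured with something like "make > make.log 2>&1", calling check_opt_level make.log gives a quick answer on whether the -O2 request actually took.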
>>> In practice, I've found I've had to do #4 for every piece of
>>> scientific software our group uses, because pretty much nothing works
>>> right with OS-installed versions of compilers, BLAS/LAPACK, and MPI.
>>> It takes forever, and it pretty much defines the phrase "learning
>>> experience," but it also essentially *never* breaks once it works
>>> (because OS updates never overwrite anything you've hand-tuned to run
>>> correctly). But...with luck option #1 will fix things quickly enough
>>> to get you running without devoting two days to rebuilding your
>>> software stack from scratch.
>>>
>>> Hope that helps,
>>> Matt Z.
>>>
>>>
>>> On Thu, Mar 10, 2011 at 8:54 PM, Justin A. Lemkul <jalemkul at vt.edu> wrote:
>>>> Hi Matt,
>>>>
>>>> Thanks for the reply. I can't trace the problem to a specific compiler. We
>>>> have a PowerPC cluster with two partitions - one running Mac OS X 10.3 with
>>>> gcc-3.3, the other running YellowDog Linux with gcc-4.2.2. The problem
>>>> happens on both partitions. There are no seg faults, the runs just exit
>>>> (MPI_ABORT) after the fatal error (either "too many LINCS warnings" or the
>>>> DD-related error I posted before).
>>>>
>>>> We are using MPI: mpich-1.2.5 on OSX and OpenMPI-1.2.3 on Linux. All of the
>>>> above has been the same since my successful 3.3.3 TI calculations (as well
>>>> as all of my simulations with Gromacs, ever). Our hardware and compilers
>>>> are somewhat (very) outdated so threading is not supported, we always use
>>>> MPI.
>>>>
>>>> Gromacs was compiled in single precision using standard options through
>>>> autoconf. The cmake build system still does not work on our cluster due to
>>>> several outstanding bugs.
>>>>
>>>> -Justin
>>>>
>>>> Matthew Zwier wrote:
>>>>> Dear Justin,
>>>>>
>>>>> We recently experienced a similar problem (LINCS errors, step*.pdb
>>>>> files), and then GROMACS usually segfaulted. The cause was a
>>>>> miscompiled copy of GROMACS. Another member of our group had compiled
>>>>> GROMACS on an Intel Core2 quad (gcc -march=core2) and tried to run the
>>>>> copy without modification on an AMD Magny Cours machine.
>>>>> Recompilation with the correct subarchitecture type (-march=amdfam10)
>>>>> fixed the problem. Don't really know why it didn't die with SIGILL or
>>>>> SIGBUS instead of SIGSEGV, but that's probably a question for the
>>>>> hardware gurus.
>>>>>
>>>>> So...are you observing segfaults? What compiler are you using (and on
>>>>> what OS)? What were the compilation parameters for 4.5.3? Also, are
>>>>> you really running across nodes with MPI, or running on the same node
>>>>> with MPI?
>>>>>
>>>>> Cheers,
>>>>> Matt Zwier
>>>>>
>>>>> On Thu, Mar 10, 2011 at 1:55 PM, Justin A. Lemkul <jalemkul at vt.edu> wrote:
>>>>>> Hi All,
>>>>>>
>>>>>> I've been troubleshooting a problem for some time now and I wanted to report
>>>>>> it here and solicit some feedback before I submit a bug report to see if
>>>>>> there's anything else I can try.
>>>>>>
>>>>>> Here's the situation: I ran some free energy calculations (thermodynamic
>>>>>> integration) a long time ago using version 3.3.3 to determine the hydration
>>>>>> free energy of a series of small molecules. Results were good and they
>>>>>> ended up as part of a paper, so I'm trying to reproduce the methodology with
>>>>>> 4.5.3 (using BAR) to see if I understand the workflow completely. The
>>>>>> problem is my systems are crashing. The runs simply stop randomly (usually
>>>>>> within a few hundred ps) with lots of LINCS warnings and step*.pdb files
>>>>>> being written.
>>>>>>
>>>>>> I know the parameters are good, and produce stable trajectories, since I
>>>>>> spent months on them some years ago. The system prep is steepest descents EM
>>>>>> to Fmax < 100 (always achieved), NVT at 298 K for 100 ps, NPT at 298K/1 bar
>>>>>> for 100 ps, then 5 ns of data collection under NPT conditions. Here's the
>>>>>> rundown of what I'm seeing:
>>>>>>
>>>>>> 1. All LJ transformations work fine. The problem only comes when I have a
>>>>>> molecule with full LJ interaction and I am "charging" it (i.e., introducing
>>>>>> charges to the partially-interacting species).
>>>>>>
>>>>>> 2. Simulations at lambda=1 (full interaction) work fine.
>>>>>>
>>>>>> 3. Simulations with the free energy code off entirely work fine under all
>>>>>> conditions.
>>>>>>
>>>>>> 4. I cannot run in serial due to http://redmine.gromacs.org/issues/715. The
>>>>>> bug seems to affect other systems and is not specifically related to my free
>>>>>> energy calculations.
>>>>>>
>>>>>> 5. Running with DD fails because my system is relatively small (more on this
>>>>>> in a moment).
>>>>>>
>>>>>> 6. Running with mdrun -pd 2 works, but mdrun -pd 4 crashes for any value of
>>>>>> lambda != 1.
>>>>>>
>>>>>> 7. I created a larger system (instead of a 3x3x3-nm cube of water with my
>>>>>> molecule, I used 4x4x4) and ran on 4 CPUs with DD (lambda = 0, i.e. full
>>>>>> vdW, no intermolecular Coulombic interactions - .mdp file is below). This
>>>>>> run also crashed with some warnings about DD cell size:
>>>>>>
>>>>>> DD load balancing is limited by minimum cell size in dimension X
>>>>>> DD step 329999 vol min/aver 0.748! load imb.: force 31.5%
>>>>>>
>>>>>> ...and then the actual crash:
>>>>>>
>>>>>> -------------------------------------------------------
>>>>>> Program mdrun_4.5.3_gcc_mpi, VERSION 4.5.3
>>>>>> Source code file: domdec_con.c, line: 693
>>>>>>
>>>>>> Fatal error:
>>>>>> DD cell 0 0 0 could only obtain 14 of the 15 atoms that are connected via
>>>>>> constraints from the neighboring cells. This probably means your constraint
>>>>>> lengths are too long compared to the domain decomposition cell size.
>>>>>> Decrease the number of domain decomposition grid cells or lincs-order or use
>>>>>> the -rcon option of mdrun.
>>>>>> For more information and tips for troubleshooting, please check the GROMACS
>>>>>> website at http://www.gromacs.org/Documentation/Errors
>>>>>> -------------------------------------------------------
>>>>>>
>>>>>> Watching the trajectory doesn't seem to give any useful information. The
>>>>>> small molecule of interest is at a periodic boundary when the crash happens,
>>>>>> but there are several crosses prior to the crash without incident, so I
>>>>>> don't know if the issue is related to PBC or not, but it appears not.
>>>>>>
>>>>>> 8. I initially thought the problem might be related to the barostat, but
>>>>>> switching from P-R to Berendsen does not alleviate the problem, nor does
>>>>>> increasing tau_p (tested 0.5, 1.0, 2.0, and 5.0 - all crash). Longer tau_p
>>>>>> simply delays the crash, but does not prevent it.
>>>>>>
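The remedies named in the fatal error can be sketched as mdrun command lines. This only prints the commands: the binary name comes from the error output, while the -deffnm prefix, the 2 x 2 x 1 grid, and the -rcon value are illustrative guesses rather than tested settings.

```shell
#!/bin/sh
# Binary name taken from the error output; the -deffnm prefix is a placeholder.
MDRUN=mdrun_4.5.3_gcc_mpi
DEFFNM=md_lambda0

# Build (but do not run) a 4-rank mdrun command line with extra options appended.
mdrun_cmd() {
    echo "mpirun -np 4 $MDRUN -deffnm $DEFFNM $*"
}

# Fewer DD cells along any one dimension: request a 2 x 2 x 1 grid explicitly.
mdrun_cmd -dd 2 2 1

# Set the P-LINCS communication distance by hand (value in nm is a guess).
mdrun_cmd -rcon 1.0
```

Lowering lincs-order is the third suggestion in the message, but that is an .mdp change rather than an mdrun option.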
>>>>>> So after all that, I'm wondering if (1) anyone has seen the same, or (2) if
>>>>>> there's anything else I can try (environment variables, hidden tricks, etc)
>>>>>> that I can use to get to the bottom of this before I give up and file a bug
>>>>>> report.
>>>>>>
>>>>>> If you made it this far, thanks for reading my novel and hopefully someone
>>>>>> can give me some ideas. The .mdp file I'm using is below, but it is just
>>>>>> one of many that I've tried. In theory, it should work, since the
>>>>>> parameters are the same as my successful 3.3.3 runs, with the exception of
>>>>>> the new free energy features in 4.5.3 and obvious keyword changes related to
>>>>>> the difference in version.
>>>>>>
>>>>>> -Justin
>>>>>>
>>>>>> --- .mdp file ---
>>>>>>
>>>>>> ; Run control
>>>>>> integrator = sd ; Langevin dynamics
>>>>>> tinit = 0
>>>>>> dt = 0.002
>>>>>> nsteps = 2500000 ; 5 ns
>>>>>> nstcomm = 100
>>>>>> ; Output control
>>>>>> nstxout = 500
>>>>>> nstvout = 500
>>>>>> nstfout = 0
>>>>>> nstlog = 500
>>>>>> nstenergy = 500
>>>>>> nstxtcout = 0
>>>>>> xtc-precision = 1000
>>>>>> ; Neighborsearching and short-range nonbonded interactions
>>>>>> nstlist = 5
>>>>>> ns_type = grid
>>>>>> pbc = xyz
>>>>>> rlist = 0.9
>>>>>> ; Electrostatics
>>>>>> coulombtype = PME
>>>>>> rcoulomb = 0.9
>>>>>> ; van der Waals
>>>>>> vdw-type = cutoff
>>>>>> rvdw = 1.4
>>>>>> ; Apply long range dispersion corrections for Energy and Pressure
>>>>>> DispCorr = EnerPres
>>>>>> ; Spacing for the PME/PPPM FFT grid
>>>>>> fourierspacing = 0.12
>>>>>> ; EWALD/PME/PPPM parameters
>>>>>> pme_order = 4
>>>>>> ewald_rtol = 1e-05
>>>>>> epsilon_surface = 0
>>>>>> optimize_fft = no
>>>>>> ; Temperature coupling
>>>>>> ; tcoupl is implicitly handled by the sd integrator
>>>>>> tc_grps = system
>>>>>> tau_t = 1.0
>>>>>> ref_t = 298
>>>>>> ; Pressure coupling is on for NPT
>>>>>> Pcoupl = Berendsen
>>>>>> tau_p = 2.0
>>>>>> compressibility = 4.5e-05
>>>>>> ref_p = 1.0
>>>>>> ; Free energy control stuff
>>>>>> free_energy = yes
>>>>>> init_lambda = 0.00
>>>>>> delta_lambda = 0
>>>>>> foreign_lambda = 0.05
>>>>>> sc-alpha = 0
>>>>>> sc-power = 1.0
>>>>>> sc-sigma = 0
>>>>>> couple-moltype = MOR ; name of moleculetype to couple
>>>>>> couple-lambda0 = vdw ; vdW interactions
>>>>>> couple-lambda1 = vdw-q ; turn on everything
>>>>>> couple-intramol = no
>>>>>> dhdl_derivatives = yes ; this line (and the next two) are defaults
>>>>>> separate_dhdl_file = yes ; included only for pedantry
>>>>>> nstdhdl = 10
>>>>>> ; Do not generate velocities
>>>>>> gen_vel = no
>>>>>> ; options for bonds
>>>>>> constraints = all-bonds
>>>>>> ; Type of constraint algorithm
>>>>>> constraint-algorithm = lincs
>>>>>> ; Do not constrain the starting configuration
>>>>>> ; since we are continuing from NPT
>>>>>> continuation = yes
>>>>>> ; Highest order in the expansion of the constraint coupling matrix
>>>>>> lincs-order = 4
>>>>>>
>>>>>>
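As a quick sanity check on the run length, the sketch below reads dt and nsteps from an .mdp file and prints the simulated time in ps. The function name is illustrative, and it assumes one "key = value ; comment" setting per line, as in the file above.

```shell
#!/bin/sh
# Print dt * nsteps (in ps) for the given .mdp file.
sim_time_ps() {
    awk -F'[=;]' '
        # Split each line on "=" or ";" and strip blanks from key and value.
        { key = $1; val = $2; gsub(/[ \t]/, "", key); gsub(/[ \t]/, "", val) }
        key == "dt"     { dt = val }
        key == "nsteps" { nsteps = val }
        END { printf "%.0f\n", dt * nsteps }
    ' "$1"
}

# Example call (hypothetical file name):
#   sim_time_ps md.mdp
```

For dt = 0.002 and nsteps = 2500000 this gives 5000 ps, matching the "5 ns" comment. The same arithmetic puts DD step 329999 at about 660 ps into the run.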
>>>>>> --
>>>>>> ========================================
>>>>>>
>>>>>> Justin A. Lemkul
>>>>>> Ph.D. Candidate
>>>>>> ICTAS Doctoral Scholar
>>>>>> MILES-IGERT Trainee
>>>>>> Department of Biochemistry
>>>>>> Virginia Tech
>>>>>> Blacksburg, VA
>>>>>> jalemkul[at]vt.edu | (540) 231-9080
>>>>>> http://www.bevanlab.biochem.vt.edu/Pages/Personal/justin
>>>>>>
>>>>>> ========================================
>>>>>> --
>>>>>> gmx-users mailing list gmx-users at gromacs.org
>>>>>> http://lists.gromacs.org/mailman/listinfo/gmx-users
>>>>>> Please search the archive at
>>>>>> http://www.gromacs.org/Support/Mailing_Lists/Search before posting!
>>>>>> Please don't post (un)subscribe requests to the list. Use the www
>>>>>> interface
>>>>>> or send it to gmx-users-request at gromacs.org.
>>>>>> Can't post? Read http://www.gromacs.org/Support/Mailing_Lists
>>>>>>
>
--
========================================
Justin A. Lemkul
Ph.D. Candidate
ICTAS Doctoral Scholar
MILES-IGERT Trainee
Department of Biochemistry
Virginia Tech
Blacksburg, VA
jalemkul[at]vt.edu | (540) 231-9080
http://www.bevanlab.biochem.vt.edu/Pages/Personal/justin
========================================