[gmx-users] Possible free energy bug?
Justin A. Lemkul
jalemkul at vt.edu
Fri Mar 11 04:04:52 CET 2011
Thanks for the extensive explanation and tips. I'll work through things and
report back. It will take a while to get things going through (unless one of
the early solutions works!) since I have no admin access to install new
compilers, libraries, etc. and for some reason the only thing I can ever get to
work in my home directory is Gromacs itself. The joys of an aging cluster.
We recently got access to gcc-4.4.5 on Linux, but we're stuck with 3.3 on OS X,
so there's at least a bit of hope for one partition.
Matthew Zwier wrote:
> Hi Justin,
> I should have specified that the segfault happened for us after we got
> similar warnings and errors (DD and/or LINCS), so the segfault may
> have been tangential. Given that everything about your system worked
> before GROMACS 4.5, it's possible that your older compilers are
> generating code that's incompatible with the GROMACS assembly loops
> (which you are likely running with, as they are the default option on
> most mainstream processors). The bug you mentioned in your original
> post also has my antennae twitching about bad machine code.
> If that's indeed happening, it's almost certainly some bizarre
> alignment issue, something like half of a float is getting overwritten
> on the way into or out of the assembly code, which corruption would
> trigger the results you describe. It's also distantly possible that
> GROMACS is working fine, but your copy of FFTW or BLAS/LAPACK (more
> likely the latter) has alignment problems. One final possibility
> (which would explain the failure on YellowDog but unfortunately not
> the failure on OS X) is that GCC is generating badly-aligned code for
> auto-vectorized Altivec loops, which is still a problem for Intel's
> SIMD instructions on 32-bit x86 architectures even with GCC 4.4. I've
> also observed MPI gather/reduce operations to foul up alignment (or
> rigidly enforce it where badly compiled code is relying on broken
> alignment) under exceedingly rare circumstances, usually involving
> different libraries compiled with different compilers (which is
> generally a bad idea for scientific code anyway).
> Okay...so all of that said, there are a few things to try:
> 1) Recompile GROMACS using -O2 instead of -O3; that'll turn off the
> automatic vectorizer (on Yellow Dog) and various other relatively
> risky optimizations (on both platforms). CFLAGS="-O2 -mcpu=powerpc"
> in the environment AND on the configure command line would do that.
> Check your build logs to make sure it took, though, because if you
> don't do it exactly right, configure will ignore your directives and
> merrily set up GROMACS to compile with -O3, which is the most likely
> culprit for badly-aligned code.
> 2) Recompile GROMACS specifying a forced alignment flag. I have no
> experience with PowerPC, but -malign-natural and -malign-power look
> like good initial guesses. That's probably going to cause more
> problems than it solves, but if you have a screwy BLAS/LAPACK or MPI,
> it might help. I only suggest it because if you've already tried #1,
> it will only take another half hour or hour of your time to recompile
> GROMACS again. Other than that, tinkering with alignment flags is a
> really easy way to REALLY break code, so you might consider skipping
> this and moving straight on to #3.
> 3) Snag GCC 4.4.4 or 4.4.5 and compile it, and use that to compile
> GROMACS, again with -O2. GCC takes forever to compile, but beyond
> that, it's not as difficult as it could be. Nothing preventing you
> from installing it in your home directory, either, assuming you set
> PATH and LD_LIBRARY_PATH (or DYLD_LIBRARY_PATH on OS X) properly. You
> might need to snag a new copy of binutils as well, if gcc refuses to
> compile with the system ld. This option would also probably get you
> threading, since you certainly have hardware support for it.
> 4) Rebuild your entire GROMACS stack, including FFTW, BLAS/LAPACK,
> MPI, and GROMACS itself with the same compiler (preferably GCC from
> #3) and the same compiler options (which again should be -O2, and
> definitely NOT any sort of alignment flag). Put them in their own
> tree (like "/opt/sci"), and definitely not in /usr (which is generally
> managed by the system) or /usr/local (which tends to accumulate
> cruft). ATLAS is a good choice for BLAS, and there are directions on
> the ATLAS website for building a complete and optimized LAPACK based
> on BLAS.
> In practice, I've found I've had to do #4 for every piece of
> scientific software our group uses, because pretty much nothing works
> right with OS-installed versions of compilers, BLAS/LAPACK, and MPI.
> It takes forever, and it pretty much defines the phrase "learning
> experience," but it also essentially *never* breaks once it works
> (because OS updates never overwrite anything you've hand-tuned to run
> correctly). But...with luck option #1 will fix things quickly enough
> to get you running without devoting two days to rebuilding your
> software stack from scratch.
> Hope that helps,
> Matt Z.
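
One way to follow the advice in option 1 about checking the build logs is to scan the captured make output for the optimization level that actually reached the compiler. Below is a minimal Python sketch, not a definitive tool. It assumes the build output was saved as plain text with one compile command per line. The function name and the sample lines are hypothetical. It relies on GCC honouring the last -O flag on a command line.

```python
import re
from collections import Counter

# Matches GCC-style optimization flags (-O0..-O3, -Os) as whole tokens,
# so the lowercase "-o output.o" option is never picked up.
OPT_FLAG = re.compile(r"(?<!\S)-O([0-3s])(?!\S)")


def count_opt_levels(log_text):
    """Count the effective optimization flag on each line of build output.

    When several -O flags appear on one command line, GCC uses the last
    one, so only the last match per line is counted.
    """
    counts = Counter()
    for line in log_text.splitlines():
        flags = OPT_FLAG.findall(line)
        if flags:
            counts["-O" + flags[-1]] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical sample lines standing in for a real build log.
    sample = "\n".join([
        "cc -O2 -c nonbonded.c -o nonbonded.o",
        "cc -O3 -O2 -c md.c -o md.o",
        "cc -O3 -c domdec_con.c -o domdec_con.o",
    ])
    for flag, n in sorted(count_opt_levels(sample).items()):
        print(flag, n)
```

Any nonzero count for -O3 would suggest that configure did not pick up the CFLAGS setting for at least some source files.
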
> On Thu, Mar 10, 2011 at 8:54 PM, Justin A. Lemkul <jalemkul at vt.edu> wrote:
>> Hi Matt,
>> Thanks for the reply. I can't trace the problem to a specific compiler. We
>> have a PowerPC cluster with two partitions - one running Mac OS X 10.3 with
>> gcc-3.3, the other running YellowDog Linux with gcc-4.2.2. The problem
>> happens on both partitions. There are no seg faults, the runs just exit
>> (MPI_ABORT) after the fatal error (either "too many LINCS warnings" or the
>> DD-related error I posted before).
>> We are using MPI: mpich-1.2.5 on OSX and OpenMPI-1.2.3 on Linux. All of the
>> above has been the same since my successful 3.3.3 TI calculations (as well
>> as all of my simulations with Gromacs, ever). Our hardware and compilers
>> are somewhat (very) outdated so threading is not supported; we always use
>> MPI. Gromacs was compiled in single precision using standard options through
>> autoconf. The cmake build system still does not work on our cluster due to
>> several outstanding bugs.
>> Matthew Zwier wrote:
>>> Dear Justin,
>>> We recently experienced a similar problem (LINCS errors, step*.pdb
>>> files), and then GROMACS usually segfaulted. The cause was a
>>> miscompiled copy of GROMACS. Another member of our group had compiled
>>> GROMACS on an Intel Core2 quad (gcc -march=core2) and tried to run the
>>> copy without modification on an AMD Magny Cours machine.
>>> Recompilation with the correct subarchitecture type (-march=amdfam10)
>>> fixed the problem. Don't really know why it didn't die with SIGILL or
>>> SIGBUS instead of SIGSEGV, but that's probably a question for the
>>> hardware gurus.
>>> So...are you observing segfaults? What compiler are you using (and on
>>> what OS)? What were the compilation parameters for 4.5.3? Also, are
>>> you really running across nodes with MPI, or running on the same node
>>> with MPI?
>>> Matt Zwier
>>> On Thu, Mar 10, 2011 at 1:55 PM, Justin A. Lemkul <jalemkul at vt.edu> wrote:
>>>> Hi All,
>>>> I've been troubleshooting a problem for some time now and I wanted to post
>>>> it here and solicit some feedback before I submit a bug report to see if
>>>> there's anything else I can try.
>>>> Here's the situation: I ran some free energy calculations (thermodynamic
>>>> integration) a long time ago using version 3.3.3 to determine the
>>>> free energy of a series of small molecules. Results were good and they
>>>> ended up as part of a paper, so I'm trying to reproduce the methodology in
>>>> 4.5.3 (using BAR) to see if I understand the workflow completely. The
>>>> problem is my systems are crashing. The runs simply stop randomly
>>>> (within a few hundred ps) with lots of LINCS warnings and step*.pdb files
>>>> being written.
>>>> I know the parameters are good, and produce stable trajectories, since I
>>>> spent months on them some years ago. The system prep is steepest descents
>>>> to Fmax < 100 (always achieved), NVT at 298 K for 100 ps, NPT at 298 K/1 bar
>>>> for 100 ps, then 5 ns of data collection under NPT conditions. Here's a
>>>> rundown of what I'm seeing:
>>>> 1. All LJ transformations work fine. The problem only comes when I have a
>>>> molecule with full LJ interaction and I am "charging" it (i.e., adding
>>>> charges to the partially-interacting species).
>>>> 2. Simulations at lambda=1 (full interaction) work fine.
>>>> 3. Simulations with the free energy code off entirely work fine under all
>>>> conditions.
>>>> 4. I cannot run in serial due to http://redmine.gromacs.org/issues/715. This
>>>> bug seems to affect other systems and is not specifically related to my free
>>>> energy calculations.
>>>> 5. Running with DD fails because my system is relatively small (more on that
>>>> in a moment).
>>>> 6. Running with mdrun -pd 2 works, but mdrun -pd 4 crashes for any value of
>>>> lambda != 1.
>>>> 7. I created a larger system (instead of a 3x3x3-nm cube of water with my
>>>> molecule, I used 4x4x4) and ran on 4 CPUs with DD (lambda = 0, i.e. full
>>>> vdW, no intermolecular Coulombic interactions - .mdp file is below). That
>>>> run also crashed with some warnings about DD cell size:
>>>> DD load balancing is limited by minimum cell size in dimension X
>>>> DD step 329999 vol min/aver 0.748! load imb.: force 31.5%
>>>> ...and then the actual crash:
>>>> Program mdrun_4.5.3_gcc_mpi, VERSION 4.5.3
>>>> Source code file: domdec_con.c, line: 693
>>>> Fatal error:
>>>> DD cell 0 0 0 could only obtain 14 of the 15 atoms that are connected via
>>>> constraints from the neighboring cells. This probably means your constraint
>>>> lengths are too long compared to the domain decomposition cell size.
>>>> Decrease the number of domain decomposition grid cells or lincs-order or use
>>>> the -rcon option of mdrun.
>>>> For more information and tips for troubleshooting, please check the GROMACS
>>>> website at http://www.gromacs.org/Documentation/Errors
>>>> Watching the trajectory doesn't seem to give any useful information. The
>>>> small molecule of interest is at a periodic boundary when the crash occurs,
>>>> but there are several crossings prior to the crash without incident, so I
>>>> don't know if the issue is related to PBC or not, but it appears not.
>>>> 8. I initially thought the problem might be related to the barostat, but
>>>> switching from P-R to Berendsen does not alleviate the problem, nor does
>>>> increasing tau_p (tested 0.5, 1.0, 2.0, and 5.0 - all crash). Longer tau_p
>>>> simply delays the crash, but does not prevent it.
>>>> So after all that, I'm wondering if (1) anyone has seen the same, or (2)
>>>> there's anything else I can try (environment variables, hidden tricks, etc.)
>>>> that I can use to get to the bottom of this before I give up and file a
>>>> bug report.
>>>> If you made it this far, thanks for reading my novel and hopefully someone
>>>> can give me some ideas. The .mdp file I'm using is below, but it is just
>>>> one of many that I've tried. In theory, it should work, since the
>>>> parameters are the same as my successful 3.3.3 runs, with the exception of
>>>> the new free energy features in 4.5.3 and obvious keyword changes related to
>>>> the difference in version.
>>>> --- .mdp file ---
>>>> ; Run control
>>>> integrator = sd ; Langevin dynamics
>>>> tinit = 0
>>>> dt = 0.002
>>>> nsteps = 2500000 ; 5 ns
>>>> nstcomm = 100
>>>> ; Output control
>>>> nstxout = 500
>>>> nstvout = 500
>>>> nstfout = 0
>>>> nstlog = 500
>>>> nstenergy = 500
>>>> nstxtcout = 0
>>>> xtc-precision = 1000
>>>> ; Neighborsearching and short-range nonbonded interactions
>>>> nstlist = 5
>>>> ns_type = grid
>>>> pbc = xyz
>>>> rlist = 0.9
>>>> ; Electrostatics
>>>> coulombtype = PME
>>>> rcoulomb = 0.9
>>>> ; van der Waals
>>>> vdw-type = cutoff
>>>> rvdw = 1.4
>>>> ; Apply long range dispersion corrections for Energy and Pressure
>>>> DispCorr = EnerPres
>>>> ; Spacing for the PME/PPPM FFT grid
>>>> fourierspacing = 0.12
>>>> ; EWALD/PME/PPPM parameters
>>>> pme_order = 4
>>>> ewald_rtol = 1e-05
>>>> epsilon_surface = 0
>>>> optimize_fft = no
>>>> ; Temperature coupling
>>>> ; tcoupl is implicitly handled by the sd integrator
>>>> tc_grps = system
>>>> tau_t = 1.0
>>>> ref_t = 298
>>>> ; Pressure coupling is on for NPT
>>>> Pcoupl = Berendsen
>>>> tau_p = 2.0
>>>> compressibility = 4.5e-05
>>>> ref_p = 1.0
>>>> ; Free energy control stuff
>>>> free_energy = yes
>>>> init_lambda = 0.00
>>>> delta_lambda = 0
>>>> foreign_lambda = 0.05
>>>> sc-alpha = 0
>>>> sc-power = 1.0
>>>> sc-sigma = 0
>>>> couple-moltype = MOR ; name of moleculetype to couple
>>>> couple-lambda0 = vdw ; vdW interactions
>>>> couple-lambda1 = vdw-q ; turn on everything
>>>> couple-intramol = no
>>>> dhdl_derivatives = yes ; this line (and the next two) are
>>>> separate_dhdl_file = yes ; included only for pedantry
>>>> nstdhdl = 10
>>>> ; Do not generate velocities
>>>> gen_vel = no
>>>> ; options for bonds
>>>> constraints = all-bonds
>>>> ; Type of constraint algorithm
>>>> constraint-algorithm = lincs
>>>> ; Constrain the starting configuration
>>>> ; since we are continuing from NPT
>>>> continuation = yes
>>>> ; Highest order in the expansion of the constraint coupling matrix
>>>> lincs-order = 4
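
When many .mdp variants are in play, as described above, it can help to read the settings back programmatically and confirm the basics, such as the total run length implied by nsteps and dt. The following is a minimal Python sketch under stated assumptions, not a GROMACS tool. The function names are hypothetical. Treating dashes and underscores in keys as equivalent is a choice made here, because the file above mixes both spellings. The sample text is a small excerpt of the file above.

```python
def parse_mdp(text):
    """Parse 'key = value ; comment' lines into a dict of strings."""
    params = {}
    for raw in text.splitlines():
        # Everything after the first ';' is a comment.
        line = raw.split(";", 1)[0].strip()
        if not line or "=" not in line:
            continue
        key, value = line.split("=", 1)
        # Normalise keys so 'sc-alpha' and 'sc_alpha' compare equal.
        params[key.strip().lower().replace("-", "_")] = value.strip()
    return params


def run_length_ps(params):
    """Total simulated time in ps: number of steps times the time step."""
    return int(params["nsteps"]) * float(params["dt"])


if __name__ == "__main__":
    # Excerpt of the .mdp file quoted above.
    sample = """
    dt = 0.002
    nsteps = 2500000 ; 5 ns
    init_lambda = 0.00
    foreign_lambda = 0.05
    nstdhdl = 10
    """
    p = parse_mdp(sample)
    print(f"run length: {run_length_ps(p) / 1000:.1f} ns")
    print("lambda:", p["init_lambda"], "foreign:", p["foreign_lambda"])
```

The computed run length should agree with the "5 ns" comment next to nsteps. A mismatch would point to an edited dt or nsteps in one of the variants.
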
>>>> Justin A. Lemkul
>>>> Ph.D. Candidate
>>>> ICTAS Doctoral Scholar
>>>> MILES-IGERT Trainee
>>>> Department of Biochemistry
>>>> Virginia Tech
>>>> Blacksburg, VA
>>>> jalemkul[at]vt.edu | (540) 231-9080
>>>> gmx-users mailing list gmx-users at gromacs.org
>>>> Please search the archive at
>>>> http://www.gromacs.org/Support/Mailing_Lists/Search before posting!
>>>> Please don't post (un)subscribe requests to the list. Use the www interface
>>>> or send it to gmx-users-request at gromacs.org.
>>>> Can't post? Read http://www.gromacs.org/Support/Mailing_Lists
Justin A. Lemkul
ICTAS Doctoral Scholar
Department of Biochemistry
jalemkul[at]vt.edu | (540) 231-9080