[gmx-users] Looking for the source of the H-REMD slowdown when decoupling a lot of atoms

Sat Apr 2 22:59:08 CEST 2016

Hi,

On Thu, 31 Mar 2016 19:00 Michael Shirts <mrshirts at gmail.com> wrote:

> > One more question: why does the free energy code have to use its own
> kernel?

There are some aspects that are only used in (some kinds of) free-energy
calculations e.g. soft-core options. So maybe coulomb-only transitions have
an optimization available.

> I realize that I'm going to sound like an idiot here, but why can't one
> just tweak parameters outside of the kernel and then have the optimized
> kernel do the dynamics? I presume that one has to step outside of the
> kernel to do the replica exchanges, so why can't the code just use the
> optimized kernel to do the dynamics between exchanges?
>

I think that would make good sense in some cases, I don't know whether
that's little enough work to be worth doing, in the context of likely being
able to do a proper job of SIMD kernels in the future. To do the best job
would require just-in-time compilation. Once a .tpr is read, much of the
mdrun logic is constant - we know it's PME with whatever coupling, etc.
Every conditional branch is very expensive for the CPU - Chris may not have
been able to see the calls to gmx_waste_time_here() but there is one every
few lines in the free-energy kernel :-) Being able to compile such that the
compiler knows that one branch will always be taken will allow it to
eliminate the check and thus the speed penalty. gcc has early-stage support
for such things, but we haven't tried to use it yet. The alternative is to
make kernels for something like all-against-all option combinations, and
that runs to hundreds of kernels.

The optimized energy code only codes very specific interactions;
> vanilla Coulomb, and Lennard-Jones.  If it coded the more general
> interactions (such as soft-core) it would be significantly slower.
>

Nitpick - each of the ~100 kernels targets a specific algorithmic
combination for the hardware selected at cmake time. There is a generic
kernel, but it probably runs about as badly as the free-energy one.

We've come up with some ways to handle free energy calculations with
> optimized inner loops (I think we're talking about the same thing Mark
> -- though we should coordinate to be sure), but that will take a
> little time.
>

I was talking about software engineering along the lines of the existing
kernels. Your work to use tables is also attractive (and orthogonal).

One alternate possibility would be to create N replicas of the system,
> with slightly different parameters, but where all systems use the
> standard functional form, and then combine the results internall.  For
> something like REST2, then this could work, since one is only linearly
> scaling the interactions between two parts A and B of the system; so
> it would be possible to represent each system as a separate physical
> system. This could work fairly well for small numbers of replicas,
> though might have some problems with large numbers of replicas. If
> each replica was an entire new system, it could make replica exchange
> quite slow, since calculating the energies of a given configuration
> with a different replica's energy function would be very costly.
>

In principle, we should be able to have multiple model physics available
where the coordinates already are, to do such calculation with fairly low
overhead. But making that work reliably needs a lot more code modularity
than we have.

Mark

Right now, for alchemical interactions, there are only two
> representations of the system, and then lambda interpolates between
> the two representations of the systems to create the alchemical
> intermediates. Changing the way the code is structured would again
> require some significant changes and time.
>
> On Thu, Mar 31, 2016 at 10:49 AM, Christopher Neale
> <chris.neale at alum.utoronto.ca> wrote:
> > Dear Szilárd, Mark, and Michael:
> >
> > Thank you for your suggestions.
> >
> > 1. I did use the icc compiler
> > 2. I tested usage of nstlist=10 by setting verlet-buffer-tolerance=-1 (I
> had already set nstlist=10 but mdrun changes that to 25 in my previous
> setup) -- this did not improve the performance
> > 3. I tested increasing .mdp option nstdhdl from 200 to 20,000 both with
> and without also increasing the mdrun -replex option from 200 to 20,000 --
> this did not improve the performance
> > 4. I tried to test more than one openMP thread per rank. I am fuzzy on
> this part of the usage, but I tried "ibrun -np 4 gmx_mpi mdrun -ntomp 6"
> and it was a lot slower (I think it only used 4 cores?); I also tried as
> above but with also -ntmpi 4 and it errored with a message about not being
> able to set thread-MPI since I didn't compile with it. In any event, I
> doubted that an optimization here would get back most of the 12x
> performance loss so I didn't pursue this further.
> >
> > The good news is that there is a beta version of the H-REMD fork of
> Plumed that works with gromacs 5.1, so I have a viable route for now.
> >
> > One more question: why does the free energy code have to use its own
> kernel? I realize that I'm going to sound like an idiot here, but why can't
> one just tweak parameters outside of the kernel and then have the optimized
> kernel do the dynamics? I presume that one has to step outside of the
> kernel to do the replica exchanges, so why can't the code just use the
> optimized kernel to do the dynamics between exchanges?
> >
> > Thank you again for all of your help,
> > Chris.
> >
> > ________________________________________
> > From: gromacs.org_gmx-users-bounces at maillist.sys.kth.se <
> gromacs.org_gmx-users-bounces at maillist.sys.kth.se> on behalf of Szilárd
> Páll <pall.szilard at gmail.com>
> > Sent: 30 March 2016 12:59
> > To: Discussion list for GROMACS users
> > Subject: Re: [gmx-users] Looking for the source of the H-REMD slowdown
> when decoupling a lot of atoms
> >
> > Chris,
> >
> > I can suggest two possible tweaks, these won't do miracles, but may give
> > you a bit better performance.
> >
> > * icc is better than gcc at optimizing the naiive free energy kernel
> code.
> > I observed in the past up to 1.5x faster free energy kernel performance
> > with icc 15 vs gcc 4.9 or so.
> >
> > * The default nstlist heuristic/suggestion assumes fast non-bondeds. In
> > your case, depending on how long is the list buffer with the nstlist you
> > use (25?), you may be able to gain a bit of performance by shifting load
> > back to the search with a smaller nstlist, e.g. 10.
> >
> > These two combined could give in best case 2x, I guess.
> >
> > What I find slightly unusual in the logs you posted on redmine is the
> 3-4x
> > slowdown in PME and constraints (I'd expect ~2x in PME and less in
> > constraints), but it could be that this too is simply because most
> > interaction in the system are perturbed which trigger non-optimized
> > code-paths.
> >
> > Cheers,
> > --
> > Szilárd
> >
> > On Wed, Mar 30, 2016 at 3:05 PM, Mark Abraham <mark.j.abraham at gmail.com>
> > wrote:
> >
> >> Hi,
> >>
> >> Yeah, unfortunately Michael's pretty much right - the free-energy
> kernel is
> >> currently that cousin nobody talks about (or to). It's essentially
> >> unchanged since GROMACS 4.0 days, except that the Verlet scheme has some
> >> kludge so that it can call the same kernel that the group scheme used to
> >> call. But it has none of the optimizations and also some simplifying
> >> pessimizations. The result is highly unsuitable for Chris's use case,
> but
> >> not horrible for the more normal case of perturbing a small part of a
> >> system. For now, I can only suggest trying a run with more than one
> OpenMP
> >> thread per rank, but there's nothing in the log file snippets that
> fills me
> >> with any hope that it would be noticeably faster.
> >>
> >> We have a pile of infrastructure built and in the works that will lead
> to
> >> being able to offer similarly optimized free-energy kernels, but they
> won't
> >> see the light of day until next year, I'm afraid. A set of sample .tprs
> at
> >> Redmine 742 would be most welcome, however - it's very good for us to
> know
> >> when/that we're optimizing a workflow someone actually wants to run, but
> >> currently has reason to avoid.
> >>
> >> Mark
> >>
> >> On Wed, Mar 30, 2016 at 8:05 AM Michael Shirts <mrshirts at gmail.com>
> wrote:
> >>
> >> > Hi, Chris: I'm pretty sure that it's because the nonbonded free
> >> > energies are much slower than the standard free energies.  You state:
> >> >
> >> > > I took a look at gmxlib/nonbonded/nb_free_energy.c in v.5.1.2, but I
> >> was
> >> > unable to find a function called "gmx_waste_time_here()" and beyond
> that
> >> I
> >> > was out of my depth.
> >> >
> >> > But it's much more the fact that the non-free energy nonbondeds are
> >> > SUPER optimized.
> >> >
> >> > I don't see a particularly viable way around this for now. The only
> >> > thing I can think of splitting the neighborlists into two force calls,
> >> > and scaling the forces and energies coming out of those.  That would
> >> > be a huge pain.
> >> >
> >> > On Tue, Mar 29, 2016 at 9:50 PM, Christopher Neale
> >> > <chris.neale at alum.utoronto.ca> wrote:
> >> > > Dear Users:
> >> > >
> >> > > I am trying to do some Hamiltonian replica exchange (H-REMD) in
> gromacs
> >> > 5.1.2 and am running up against really large slowdowns when
> decoupling a
> >> > large number of atoms. I am decoupling 5360 atoms out of the 15520
> atoms
> >> in
> >> > my system. The goal is not to get a PMF, but to enhance sampling using
> >> the
> >> > REST approach to partially decouple lipids in a bilayer. This approach
> >> > enhances lipid relaxation times (
> >> > http://pubs.acs.org/doi/pdf/10.1021/ct500305u ) though the authors of
> >> > that paper modified the gromacs code to do their own H-REMD in order
> to
> >> > avoid the really slow speed they also got when decoupling lots of
> atoms
> >> via
> >> > the free energy code.
> >> > >
> >> > > I have already posted details here
> >> http://redmine.gromacs.org/issues/742
> >> > , which includes .mdp options and some timing output. I compare the
> >> timing
> >> > output to a standard temperature REMD (T-REMD) run. For my usage, the
> >> > slowdown is about 12x for H-REMD vs. T-REMD.
> >> > >
> >> > > I am motivated to find a solution within gromacs because the
> >> alternative
> >> > is to use gromacs 4.6.7 with plumed (or with the aforementioned
> modified
> >> > code, which is also gromacs v4). Normally that would be a viable
> option,
> >> > but I am using the charmm force field and the charmm TIP3P water and I
> >> > would rather not give up the speed boost that I see in gromacs v5.1.2,
> >> > which allows the use of the verlet cutoff scheme and has been tested
> and
> >> > shown to give the correct reproduction of charmm forces (vs. the
> forces
> >> one
> >> > would get using the charmm simulation software).
> >> > >
> >> > > I took a look at gmxlib/nonbonded/nb_free_energy.c in v.5.1.2, but I
> >> was
> >> > unable to find a function called "gmx_waste_time_here()" and beyond
> that
> >> I
> >> > was out of my depth.
> >> > >
> >> > > Thank you for any pointers,
> >> > > Chris.
> >> > >
> >> > >
> >> > > --
> >> > > Gromacs Users mailing list
> >> > >
> >> > > * Please search the archive at
> >> > http://www.gromacs.org/Support/Mailing_Lists/GMX-Users_List before
> >> > posting!
> >> > >
> >> > > * Can't post? Read http://www.gromacs.org/Support/Mailing_Lists
> >> > >
> >> > > * For (un)subscribe requests visit
> >> > > https://maillist.sys.kth.se/mailman/listinfo/gromacs.org_gmx-users
> or
> >> > send a mail to gmx-users-request at gromacs.org.
> >> > --
> >> > Gromacs Users mailing list
> >> >
> >> > * Please search the archive at
> >> > http://www.gromacs.org/Support/Mailing_Lists/GMX-Users_List before
> >> > posting!
> >> >
> >> > * Can't post? Read http://www.gromacs.org/Support/Mailing_Lists
> >> >
> >> > * For (un)subscribe requests visit
> >> > https://maillist.sys.kth.se/mailman/listinfo/gromacs.org_gmx-users or
> >> > send a mail to gmx-users-request at gromacs.org.
> >> >
> >> --
> >> Gromacs Users mailing list
> >>
> >> * Please search the archive at
> >> http://www.gromacs.org/Support/Mailing_Lists/GMX-Users_List before
> >> posting!
> >>
> >> * Can't post? Read http://www.gromacs.org/Support/Mailing_Lists
> >>
> >> * For (un)subscribe requests visit
> >> https://maillist.sys.kth.se/mailman/listinfo/gromacs.org_gmx-users or
> >> send a mail to gmx-users-request at gromacs.org.
> >>
> > --
> > Gromacs Users mailing list
> >
> > * Please search the archive at
> http://www.gromacs.org/Support/Mailing_Lists/GMX-Users_List before
> posting!
> >
> > * Can't post? Read http://www.gromacs.org/Support/Mailing_Lists
> >
> > * For (un)subscribe requests visit
> > https://maillist.sys.kth.se/mailman/listinfo/gromacs.org_gmx-users or
> send a mail to gmx-users-request at gromacs.org.
> > --
> > Gromacs Users mailing list
> >
> > * Please search the archive at
> http://www.gromacs.org/Support/Mailing_Lists/GMX-Users_List before
> posting!
> >
> > * Can't post? Read http://www.gromacs.org/Support/Mailing_Lists
> >
> > * For (un)subscribe requests visit
> > https://maillist.sys.kth.se/mailman/listinfo/gromacs.org_gmx-users or
> send a mail to gmx-users-request at gromacs.org.
> --
> Gromacs Users mailing list
>
> * Please search the archive at
> http://www.gromacs.org/Support/Mailing_Lists/GMX-Users_List before
> posting!
>
> * Can't post? Read http://www.gromacs.org/Support/Mailing_Lists
>
> * For (un)subscribe requests visit
> https://maillist.sys.kth.se/mailman/listinfo/gromacs.org_gmx-users or
> send a mail to gmx-users-request at gromacs.org.