[gmx-users] Looking for the source of the H-REMD slowdown when decoupling a lot of atoms

Fri Apr 1 23:39:12 CEST 2016

Dear developers:

Is it possible with only a small amount of work to modify calc_delta() in src/programs/mdrun/repl_ex.cpp to add another case to re->type such that the program sends the coordinates to the alternate .tpr information and evaluates the energy completely? This would allow for arbitrary exchanges, including bizarre things like exchanging the cutoff distance or whatever (not what I want to do -- I'm just emphasizing the generality of this approach). It will certainly cost some communication and compute cycles to do this, but for e.g. I currently have a 12x slowdown using the free energy architecture while decoupling 5K atoms. So if I do an exchange attempt every 500 steps, we're still breaking even if this complete energy re-evaluation takes 5,500x longer than a regular integration step -- and I don't see how it could possibly take anywhere near that long assuming bMultiEx==FALSE. 

Possible problems that I can imagine are:
(a) the functions for a complete re-evaluation don't exist (either the communication of coordinates or the energy evaluation)
(b) some issues with changes in temperature or pressure 
(c) the coordinates that are required to do this evaluation no longer exist because the energy evaluation functions are coupled to the timestep so we really need to pass them the coordinates of the previous timestep.

Any suggestions are really appreciated.

Thank you,
Chris.

________________________________________
From: gromacs.org_gmx-users-bounces at maillist.sys.kth.se <gromacs.org_gmx-users-bounces at maillist.sys.kth.se> on behalf of Michael Shirts <mrshirts at gmail.com>
Sent: 31 March 2016 14:00
To: Discussion list for GROMACS users
Subject: Re: [gmx-users] Looking for the source of the H-REMD slowdown when decoupling a lot of atoms

> One more question: why does the free energy code have to use its own kernel? I realize that I'm going to sound like an idiot here, but why can't one just tweak parameters outside of the kernel and then have the optimized kernel do the dynamics? I presume that one has to step outside of the kernel to do the replica exchanges, so why can't the code just use the optimized kernel to do the dynamics between exchanges?

The optimized energy code only codes very specific interactions;
vanilla Coulomb, and Lennard-Jones.  If it coded the more general
interactions (such as soft-core) it would be significantly slower.

We've come up with some ways to handle free energy calculations with
optimized inner loops (I think we're talking about the same thing Mark
-- though we should coordinate to be sure), but that will take a
little time.

One alternate possibility would be to create N replicas of the system,
with slightly different parameters, but where all systems use the
standard functional form, and then combine the results internall.  For
something like REST2, then this could work, since one is only linearly
scaling the interactions between two parts A and B of the system; so
it would be possible to represent each system as a separate physical
system. This could work fairly well for small numbers of replicas,
though might have some problems with large numbers of replicas. If
each replica was an entire new system, it could make replica exchange
quite slow, since calculating the energies of a given configuration
with a different replica's energy function would be very costly.

Right now, for alchemical interactions, there are only two
representations of the system, and then lambda interpolates between
the two representations of the systems to create the alchemical
intermediates. Changing the way the code is structured would again
require some significant changes and time.

On Thu, Mar 31, 2016 at 10:49 AM, Christopher Neale
<chris.neale at alum.utoronto.ca> wrote:
> Dear Szilárd, Mark, and Michael:
>
> Thank you for your suggestions.
>
> 1. I did use the icc compiler
> 2. I tested usage of nstlist=10 by setting verlet-buffer-tolerance=-1 (I had already set nstlist=10 but mdrun changes that to 25 in my previous setup) -- this did not improve the performance
> 3. I tested increasing .mdp option nstdhdl from 200 to 20,000 both with and without also increasing the mdrun -replex option from 200 to 20,000 -- this did not improve the performance
> 4. I tried to test more than one openMP thread per rank. I am fuzzy on this part of the usage, but I tried "ibrun -np 4 gmx_mpi mdrun -ntomp 6" and it was a lot slower (I think it only used 4 cores?); I also tried as above but with also -ntmpi 4 and it errored with a message about not being able to set thread-MPI since I didn't compile with it. In any event, I doubted that an optimization here would get back most of the 12x performance loss so I didn't pursue this further.
>
> The good news is that there is a beta version of the H-REMD fork of Plumed that works with gromacs 5.1, so I have a viable route for now.
>
> One more question: why does the free energy code have to use its own kernel? I realize that I'm going to sound like an idiot here, but why can't one just tweak parameters outside of the kernel and then have the optimized kernel do the dynamics? I presume that one has to step outside of the kernel to do the replica exchanges, so why can't the code just use the optimized kernel to do the dynamics between exchanges?
>
> Thank you again for all of your help,
> Chris.
>
> ________________________________________
> From: gromacs.org_gmx-users-bounces at maillist.sys.kth.se <gromacs.org_gmx-users-bounces at maillist.sys.kth.se> on behalf of Szilárd Páll <pall.szilard at gmail.com>
> Sent: 30 March 2016 12:59
> To: Discussion list for GROMACS users
> Subject: Re: [gmx-users] Looking for the source of the H-REMD slowdown when decoupling a lot of atoms
>
> Chris,
>
> I can suggest two possible tweaks, these won't do miracles, but may give
> you a bit better performance.
>
> * icc is better than gcc at optimizing the naiive free energy kernel code.
> I observed in the past up to 1.5x faster free energy kernel performance
> with icc 15 vs gcc 4.9 or so.
>
> * The default nstlist heuristic/suggestion assumes fast non-bondeds. In
> your case, depending on how long is the list buffer with the nstlist you
> use (25?), you may be able to gain a bit of performance by shifting load
> back to the search with a smaller nstlist, e.g. 10.
>
> These two combined could give in best case 2x, I guess.
>
> What I find slightly unusual in the logs you posted on redmine is the 3-4x
> slowdown in PME and constraints (I'd expect ~2x in PME and less in
> constraints), but it could be that this too is simply because most
> interaction in the system are perturbed which trigger non-optimized
> code-paths.
>
> Cheers,
> --
> Szilárd
>
> On Wed, Mar 30, 2016 at 3:05 PM, Mark Abraham <mark.j.abraham at gmail.com>
> wrote:
>
>> Hi,
>>
>> Yeah, unfortunately Michael's pretty much right - the free-energy kernel is
>> currently that cousin nobody talks about (or to). It's essentially
>> unchanged since GROMACS 4.0 days, except that the Verlet scheme has some
>> kludge so that it can call the same kernel that the group scheme used to
>> call. But it has none of the optimizations and also some simplifying
>> pessimizations. The result is highly unsuitable for Chris's use case, but
>> not horrible for the more normal case of perturbing a small part of a
>> system. For now, I can only suggest trying a run with more than one OpenMP
>> thread per rank, but there's nothing in the log file snippets that fills me
>> with any hope that it would be noticeably faster.
>>
>> We have a pile of infrastructure built and in the works that will lead to
>> being able to offer similarly optimized free-energy kernels, but they won't
>> see the light of day until next year, I'm afraid. A set of sample .tprs at
>> Redmine 742 would be most welcome, however - it's very good for us to know
>> when/that we're optimizing a workflow someone actually wants to run, but
>> currently has reason to avoid.
>>
>> Mark
>>
>> On Wed, Mar 30, 2016 at 8:05 AM Michael Shirts <mrshirts at gmail.com> wrote:
>>
>> > Hi, Chris: I'm pretty sure that it's because the nonbonded free
>> > energies are much slower than the standard free energies.  You state:
>> >
>> > > I took a look at gmxlib/nonbonded/nb_free_energy.c in v.5.1.2, but I
>> was
>> > unable to find a function called "gmx_waste_time_here()" and beyond that
>> I
>> > was out of my depth.
>> >
>> > But it's much more the fact that the non-free energy nonbondeds are
>> > SUPER optimized.
>> >
>> > I don't see a particularly viable way around this for now. The only
>> > thing I can think of splitting the neighborlists into two force calls,
>> > and scaling the forces and energies coming out of those.  That would
>> > be a huge pain.
>> >
>> > On Tue, Mar 29, 2016 at 9:50 PM, Christopher Neale
>> > <chris.neale at alum.utoronto.ca> wrote:
>> > > Dear Users:
>> > >
>> > > I am trying to do some Hamiltonian replica exchange (H-REMD) in gromacs
>> > 5.1.2 and am running up against really large slowdowns when decoupling a
>> > large number of atoms. I am decoupling 5360 atoms out of the 15520 atoms
>> in
>> > my system. The goal is not to get a PMF, but to enhance sampling using
>> the
>> > REST approach to partially decouple lipids in a bilayer. This approach
>> > enhances lipid relaxation times (
>> > http://pubs.acs.org/doi/pdf/10.1021/ct500305u ) though the authors of
>> > that paper modified the gromacs code to do their own H-REMD in order to
>> > avoid the really slow speed they also got when decoupling lots of atoms
>> via
>> > the free energy code.
>> > >
>> > > I have already posted details here
>> http://redmine.gromacs.org/issues/742
>> > , which includes .mdp options and some timing output. I compare the
>> timing
>> > output to a standard temperature REMD (T-REMD) run. For my usage, the
>> > slowdown is about 12x for H-REMD vs. T-REMD.
>> > >
>> > > I am motivated to find a solution within gromacs because the
>> alternative
>> > is to use gromacs 4.6.7 with plumed (or with the aforementioned modified
>> > code, which is also gromacs v4). Normally that would be a viable option,
>> > but I am using the charmm force field and the charmm TIP3P water and I
>> > would rather not give up the speed boost that I see in gromacs v5.1.2,
>> > which allows the use of the verlet cutoff scheme and has been tested and
>> > shown to give the correct reproduction of charmm forces (vs. the forces
>> one
>> > would get using the charmm simulation software).
>> > >
>> > > I took a look at gmxlib/nonbonded/nb_free_energy.c in v.5.1.2, but I
>> was
>> > unable to find a function called "gmx_waste_time_here()" and beyond that
>> I
>> > was out of my depth.
>> > >
>> > > Thank you for any pointers,
>> > > Chris.
>> > >
>> > >
>> > > --
>> > > Gromacs Users mailing list
>> > >
>> > > * Please search the archive at
>> > http://www.gromacs.org/Support/Mailing_Lists/GMX-Users_List before
>> > posting!
>> > >
>> > > * Can't post? Read http://www.gromacs.org/Support/Mailing_Lists
>> > >
>> > > * For (un)subscribe requests visit
>> > > https://maillist.sys.kth.se/mailman/listinfo/gromacs.org_gmx-users or
>> > send a mail to gmx-users-request at gromacs.org.
>> > --
>> > Gromacs Users mailing list
>> >
>> > * Please search the archive at
>> > http://www.gromacs.org/Support/Mailing_Lists/GMX-Users_List before
>> > posting!
>> >
>> > * Can't post? Read http://www.gromacs.org/Support/Mailing_Lists
>> >
>> > * For (un)subscribe requests visit
>> > https://maillist.sys.kth.se/mailman/listinfo/gromacs.org_gmx-users or
>> > send a mail to gmx-users-request at gromacs.org.
>> >
>> --
>> Gromacs Users mailing list
>>
>> * Please search the archive at
>> http://www.gromacs.org/Support/Mailing_Lists/GMX-Users_List before
>> posting!
>>
>> * Can't post? Read http://www.gromacs.org/Support/Mailing_Lists
>>
>> * For (un)subscribe requests visit
>> https://maillist.sys.kth.se/mailman/listinfo/gromacs.org_gmx-users or
>> send a mail to gmx-users-request at gromacs.org.
>>
> --
> Gromacs Users mailing list
>
> * Please search the archive at http://www.gromacs.org/Support/Mailing_Lists/GMX-Users_List before posting!
>
> * Can't post? Read http://www.gromacs.org/Support/Mailing_Lists
>
> * For (un)subscribe requests visit
> https://maillist.sys.kth.se/mailman/listinfo/gromacs.org_gmx-users or send a mail to gmx-users-request at gromacs.org.
> --
> Gromacs Users mailing list
>
> * Please search the archive at http://www.gromacs.org/Support/Mailing_Lists/GMX-Users_List before posting!
>
> * Can't post? Read http://www.gromacs.org/Support/Mailing_Lists
>
> * For (un)subscribe requests visit
> https://maillist.sys.kth.se/mailman/listinfo/gromacs.org_gmx-users or send a mail to gmx-users-request at gromacs.org.
--
Gromacs Users mailing list

* Please search the archive at http://www.gromacs.org/Support/Mailing_Lists/GMX-Users_List before posting!

* Can't post? Read http://www.gromacs.org/Support/Mailing_Lists

* For (un)subscribe requests visit
https://maillist.sys.kth.se/mailman/listinfo/gromacs.org_gmx-users or send a mail to gmx-users-request at gromacs.org.