[gmx-users] Prospective research areas of GROMACS

Tue Sep 22 21:59:21 CEST 2015

Hi,

On Mon, Sep 21, 2015 at 7:10 PM Sabyasachi Sahoo <ssahoo.iisc at gmail.com>
wrote:

> Dear Gromacs users and developers,
>
> I am a parallel programming researcher and would like to contribute to
> Gromacs molecular dynamics software by helping to nail down any bottlenecks
> that occur in scaling of the software on multiple CPUs (and probably
> improved performance in exa-scale era.) I am already in process of
> collecting profiling results to identify the phases that can be improved
> (and also for few other MD softwares).
>

Welcome! (And good luck... optimizing software that's had over 10 years of
performance-focused effort is... challenging!)

Profiling is itself problematic. Assuming one can get GROMACS to build with
whatever constraints the tool adds, and can benchmark on a machine where
you do not suffer contention for network resources, one needs to produce
more useful information than can already be obtained at the end of the .log
file. Such data is most interesting only when near the strong scaling limit
(e.g. < 100 particles per CPU core) and thus at very short per-MD-step
times (several wallclock milliseconds). Neither function instrumentation
nor sampling tends to work well in this regime. But we'd love to be
surprised :-)

> Hence, I would request all of you to please suggest me some possible areas
> of research, on which we can work on, for better scaling of Gromacs,
> (and/or MD softwares in general.) Going through the official website
> documentation helps me realise that implementing a truly parallel FFT (or
> making it scale better) in Gromacs will be truly helpful.

Yes and no. The FFT is intrinsically global, and most of the time spent
doing the PME component of MD is organizing the data to go into the FFT,
rather than doing the computation. The current implementation of spreading
charges onto the FFT grid is known to scale poorly with increasing
numbers of OpenMP threads per MPI rank. That would be a high-impact problem
to fix - but start by getting a thorough understanding of the algorithm
before looking at the code (because the form of the code will not help you
understand anything). Some profiling with well-targeted function
instrumentation could be worthwhile here.
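
To illustrate why spreading is awkward to thread, here is a minimal Python
sketch. It is not the GROMACS implementation: it assumes a periodic 1D grid,
positions given in grid units, and simple linear (two-point) weights, and the
function names are hypothetical. Neighbouring particles write to the same
grid cells, so threads either need synchronization or private grids followed
by a reduction, and the reduction cost grows with the thread count.

```python
import math


def spread_serial(positions, charges, n_grid):
    """Spread point charges onto a periodic 1D grid with linear weights."""
    grid = [0.0] * n_grid
    for x, q in zip(positions, charges):
        base = math.floor(x)
        frac = x - base
        i = base % n_grid
        # Each charge touches two neighbouring cells; different particles
        # can touch the same cell, which is the source of write conflicts.
        grid[i] += q * (1.0 - frac)
        grid[(i + 1) % n_grid] += q * frac
    return grid


def spread_thread_local(positions, charges, n_grid, n_threads):
    """Same result, with each (simulated) thread filling a private grid.

    The final reduction costs O(n_threads * n_grid) regardless of how few
    particles each thread handled.
    """
    local_grids = [
        spread_serial(positions[t::n_threads], charges[t::n_threads], n_grid)
        for t in range(n_threads)
    ]
    return [sum(cells) for cells in zip(*local_grids)]
```

Both functions conserve the total charge on the grid; the second one trades
write conflicts for extra memory and a reduction step.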

Because of this, at scale it is often necessary to use a subset of the MPI
ranks to handle the PME part, MPMD style (see reference manual, and/or
GROMACS 4 paper). However, the implementation of that requires that the
user choose a division in advance, and without running some external
optimizer (like gmx tune_pme) that choice is difficult because there also
has to be a PME domain decomposition, and now an extra communication phase
mapping from one DD to the other, and that can't be efficient unless
various constraints are satisfied... Approaches that would take variables
out of user space could be quite useful here. For a trivial example, it
might be best
* to interleave PME ranks with PP ranks, to spread them out over the
network to maximize communication bandwidth when doing the all-to-all for
the 3DFFT, and hopefully minimize communication latency when doing the
transformation from PP DD to PME DD by having ranks very close, or
* to pack PME ranks together on the network to minimize external network
contention during the all-to-all, but to do so in a way that doesn't lose
all the benefits by instead taking the latency hit at the PP<->PME stages...
Currently the user has to choose one of these "up front." The latter works
well only in the presence of knowledge about the network topology, which is
unavailable until someone adds (e.g.) netloc support.
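
The two placements can be sketched as follows. This is a toy model in Python,
not GROMACS code: the function names are hypothetical, and it assumes that
distance in rank index is a usable proxy for distance on the network, which
is exactly the knowledge that is missing without topology information.

```python
def interleaved_layout(n_pp, n_pme):
    """Spread the PME ranks evenly among the PP ranks."""
    n = n_pp + n_pme
    return [
        "PME" if (r + 1) * n_pme // n > r * n_pme // n else "PP"
        for r in range(n)
    ]


def packed_layout(n_pp, n_pme):
    """Place all PME ranks together after the PP ranks."""
    return ["PP"] * n_pp + ["PME"] * n_pme


def max_pp_to_pme_distance(layout):
    """Largest rank-index distance from any PP rank to its nearest PME rank."""
    pme = [r for r, role in enumerate(layout) if role == "PME"]
    pp = [r for r, role in enumerate(layout) if role == "PP"]
    return max(min(abs(r - p) for p in pme) for r in pp)


if __name__ == "__main__":
    for make in (interleaved_layout, packed_layout):
        layout = make(6, 2)
        print(make.__name__, layout, max_pp_to_pme_distance(layout))
```

Under that assumption, interleaving keeps every PP rank close to a PME rank,
while packing keeps the PME ranks close to each other for the all-to-all.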

Replacing our home-grown hardware detection support with hwloc support
would perhaps be higher reward for effort, however.

Avoiding PME entirely is another avenue (but there are two fast-multipole
projects running here already).

> The latest paper
> on Gromacs 5.0 concludes saying an algorithm implementing preempting fine
> grained tasks based on priority can lead to improvements. I am also trying
> to look into it and would want to know your take on this.
>

We don't think the current approach of

1. spread this task over n threads per rank, then
2. do something serial
3. spread the next task over the same n threads per rank, then
4. do something serial
5. spread the next task over the same n threads per rank, then
... continue like this

is going to cut it in the long term. You need to be able e.g. to fire off
the necessary PME communication, then go back to doing bonded work, then 20
microseconds later when communication arrives drop everything to handle the
next phase of PME ASAP, then go back to the bondeds, but preferably don't
trash all your cache in the meantime, etc. But there's a lot of boring code
transformation that has to happen before we can do much about it. Current
thinking is that we want to move in the direction of encapsulated tasks
that we might be able to write a custom TBB thread scheduler to handle in a
way that's automatic and efficient.
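
As a rough illustration of the idea, here is a minimal Python simulation of
one worker draining a priority queue. It is only a sketch under stated
assumptions: it is not TBB and not GROMACS code, the task names and the
run_step function are hypothetical, time is counted in abstract ticks
(non-negative integers), and switching happens only between small bonded
chunks, so it is cooperative rather than truly preemptive.

```python
import heapq

PME, BONDED = 0, 1  # lower number = higher priority


def run_step(n_bonded_chunks, pme_arrivals):
    """Return the order in which one worker executes tasks.

    pme_arrivals maps a tick to the name of the PME phase whose
    communication completes at that tick.
    """
    ready = []
    seq = 0  # tie-breaker that keeps insertion order within a priority
    for c in range(n_bonded_chunks):
        heapq.heappush(ready, (BONDED, seq, f"bonded[{c}]"))
        seq += 1

    pending = dict(pme_arrivals)
    trace = []
    tick = 0
    while ready or pending:
        if tick in pending:
            # Communication has arrived: the PME phase jumps the queue.
            heapq.heappush(ready, (PME, seq, pending.pop(tick)))
            seq += 1
        if ready:
            _, _, name = heapq.heappop(ready)
            trace.append(name)
        else:
            trace.append("idle")
        tick += 1
    return trace
```

The smaller the bonded chunks, the sooner an arriving PME phase gets picked
up, at the price of more scheduling overhead and more pressure on the cache.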

Mark

> You could also direct me to the link on the website, or any person
> concerned with this. You could also point me to any link in developer zone
> that I might have missed. Any more insight into matter will be really
> appreciated.
>
> Thanks in advance.
>
> --
> Yours sincerely,
> Sabyasachi Sahoo
> Supercomputer Education & Research Center
> Indian Institute of Science - Bangalore