[gmx-developers] expressing high-level mdrun code

Mark Abraham mark.j.abraham at gmail.com
Thu Apr 9 01:36:09 CEST 2015


Hi gmx-devs,

We periodically have discussions on gerrit about whether certain kinds of
code refactoring remove opportunities for inlining, or might degrade
performance because we call a function that merely checks a simple
condition (e.g. that this is not a neighbour-search step) and then
returns straight to do_md(). So, I thought we might benefit from a group
discussion. In particular, insights from people who've tried to do things
in mdrun and struggled/failed/succeeded-at-ruinous-expense would be
valuable for the community. There are a lot of related topics I'll talk
about here (sorry), but tackling only some of them wouldn't do justice to
the complexity of the problem...

The alternative of expressing every condition about everything in raw code
in do_md() and do_force() is obviously untenable. Function calls are
required. Such code is rather better than it used to be, and will improve
further with the death of the group scheme, but there is no good reason for
do_md() to still be 1500 lines long.

The kind of code I'm talking about is high-level control code, so there's
going to be a partial or full pipeline stall from some branch, cache miss,
or other function call in the next few lines of code. Even if the run-time
penalty of refactoring

if (bDoThis) { someExternFunction(); }
callNextFunction();

to

someExternFunction();
callNextFunction();

// defined elsewhere as

void someExternFunction()
{
   if (!bDoThis) { return; }
...
}

(or later doing similar things with virtual functions that might have an
empty body) is a thousand CPU cycles, that is still under 1 microsecond
of wallclock, so we can do scores of them per CPU core per step before
having a noticeable impact on a target MD iteration time of under 1
ms/step. We can get under 1 ms/step now, but pretty much nobody can
afford to run their hardware that inefficiently, so real-world per-step
iteration times are well above our cutting-edge strong-scaling numbers,
and thus even further from the point where such overheads would be felt.
And a penalty of 1000 cycles is pessimistic - a DRAM load on x86 costs
100-200 cycles (e.g.
http://www3.informatik.uni-erlangen.de/Lehre/CAMA/SS2014/caches.pdf,
http://idarkside.org/posts/numbers-you-should-know/).
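
To make the virtual-function variant concrete, here is a minimal sketch
(the class and all its names are hypothetical, purely to illustrate the
shape; it is not existing GROMACS code):

class StepTask
{
public:
   virtual ~StepTask() {}
   virtual void run() = 0;
};

class NullStepTask : public StepTask
{
public:
   // empty body; calling it costs only an indirect call
   virtual void run() {}
};

class RealStepTask : public StepTask
{
public:
   virtual void run() { someExternFunction(); }
};

// the concrete object is chosen once at setup, so do_md() just does:
//   stepTask->run();
//   callNextFunction();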

We can certainly measure that such things as moving control-flow
conditionals inside functions have negligible effects in practice, but
demanding that for every change seems to me a fairly Chicken Little
approach to development. In particular, we can't afford to dedicate a
machine to automatically testing all possible performance regressions, so
the alternative is manual testing that nobody really wants to do multiple
times per patch over the code-review process. Maybe a weekly automated
performance-regression test is feasible, though, which would catch
unexpected regressions (the kind of thing you don't catch in manual
testing anyway).

Currently, the CUDA code path is expressed in the non-GPU build via
macros that transform function declarations into inline-able empty static
functions. Those make life painful for Doxygen, compiler warnings, and
automated checkers, which we could live with if the macros provided real
value. But most (all?) of the empty function calls that get inlined away
sit behind runtime checks for whether a GPU is in use, and those checks
are still present in the non-GPU build (and in the GPU build when not
using a GPU). So I think this is an example where we tried hard to write
the code so that we could prove both code paths are as fast as we intend,
but some of that effort was wasted - we could have just had normal C code
compiling non-inlined empty functions that would never be called. Now
that we have three implementations on the table (CUDA, OpenCL, null) and
a C++ project, we need to move to a real C++ interface. By design, only
one definition of any virtual function will be compiled into any given
build, so GCC 4.9 devirtualization will just work when we turn it on
(there's a long blog series from the guy who made this work, worth a read
if the topic interests you
http://hubicka.blogspot.cz/2014/01/devirtualization-in-c-part-1.html).
But devirt is just icing - I think the performance cost of turning ~10
GPU-related function calls per MD step into virtual calls is not worth
the time it would take to measure, particularly when you consider the
opportunity cost of not doing something more useful with that developer
time.
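
As a rough sketch of what such an interface could look like (all names
here are invented for illustration, not a proposal for the actual API),
a single abstract class with a do-nothing implementation replaces the
macro machinery:

class GpuNonbondedInterface
{
public:
   virtual ~GpuNonbondedInterface() {}
   virtual void launchKernel() = 0;
   virtual void waitForResults() = 0;
};

class NullGpuNonbonded : public GpuNonbondedInterface
{
public:
   virtual void launchKernel() {}
   virtual void waitForResults() {}
};

// CudaGpuNonbonded and OpenclGpuNonbonded would provide the real
// implementations; only one of them is compiled into a given build,
// which is what lets gcc 4.9 devirtualize the calls.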

Deploying task parallelism is going to be worse - tasks are virtual
function calls in TBB and inlining is simply not a feature of the landscape
until you get inside a task. The price of expressing parallelism right now
is already that we have extra function calls every time we open an OpenMP
region.
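
For example, even a trivial TBB task is reached through a virtual
execute() method or a type-erased functor inside the scheduler, so
nothing crosses the task boundary by inlining. A minimal sketch, with a
hypothetical work function:

#include <tbb/task_group.h>

void computeBondedForces(); // placeholder for some real work

void scheduleStep()
{
   tbb::task_group g;
   // the lambda is type-erased inside TBB, so the scheduler reaches it
   // through an indirect call; inlining only happens inside the body
   g.run([] { computeBondedForces(); });
   g.wait();
}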

Conversely, whenever the cost of calling nearly-empty functions starts to
look noteworthy, remember that every MD step already has lots of branches
for
* is this a NS step
* which integrator is this
* which coupling algorithm might be active
* is it multisim
* is it rerun
* is it PME
* is GPU active
* is DD active
* etc.
So far, I think nobody has proposed specializing the MD loop so that we
eliminate those checks (branch less, call fewer functions). Anton does
things like this, but they're a few orders of magnitude faster than us.
In part we haven't done this because the group scheme is still around,
and because the added complexity is probably too high for the benefit,
but it's also because do_md() and do_force() are an awful mess of
everything being expressed in raw code, so nobody dares to touch anything
because it's all too big to hold in your head... Profile-guided
optimization, JIT, link-time optimization and compiler devirtualization
are things we don't do much of yet, but e.g. recent versions of gcc have
made massive progress here. There are a lot of C++ compiler consumers
with much bigger problems than ours...
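
If we ever did want to specialize, the natural C++ tool would be
compile-time flags chosen once at setup. A purely hypothetical sketch:

// one step body is instantiated per combination we care about; inside
// each instantiation the flags are compile-time constants, so the
// branches and any empty calls on the dead paths disappear
template<bool haveGpu, bool haveDD>
void doMdStep()
{
   if (haveGpu)
   {
       // launch GPU work
   }
   if (haveDD)
   {
       // domain-decomposition communication
   }
   // integrate, apply coupling, write output, ...
}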

There's a price for having a main loop that is thousands of lines long
because every conditional is inlined "for performance". These days,
do_md() has 70 lines just for declaring nearly undocumented local
variables, and that's after a sustained cleanup campaign, mostly from
me... We don't feel the price of such unwieldy code much, because we
don't really see the people it drives away: they develop their new
feature in some more "developer-friendly" MD code, or write their own,
and then don't do enough sampling on a real problem for anyone to trust
their shiny new method. Or they try to develop in GROMACS and silently
fail, or kind-of succeed but never trust their code/results and don't
dare try to contribute their feature.

I'm perfectly willing to trade a couple of percent of raw performance on
plain-vanilla MD in exchange for raising the abstraction level. I would
like to see a main loop of a few hundred lines that a grad student new to
the code can recognize as expressing the MD algorithm, so that they know
where it makes sense to look further when they want to change things. I
think the important future optimizations are going to be algorithmic, not
squeezing a few more percent here and there, and we should be building a
tool that makes such things possible to code in reasonable time and to
test on real-world simulation problems, because a fast sampling back end
is available.
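
To make that concrete, the kind of shape I have in mind is roughly the
following (every name here is invented for illustration; none of these
functions exist under these names today):

void do_md(MdContext &ctx)
{
   for (int64_t step = ctx.firstStep(); !ctx.isLastStep(step); step++)
   {
       maybeRepartitionDomains(ctx, step);
       computeForces(ctx, step);
       integratePositionsAndVelocities(ctx, step);
       applyConstraints(ctx, step);
       applyCouplings(ctx, step);
       writeOutputIfNeeded(ctx, step);
   }
}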

What do other people think?

Mark