[gmx-developers] expressing high-level mdrun code

Mon Apr 20 11:32:37 CEST 2015

On 2015-04-09 01:36, Mark Abraham wrote:
> Hi gmx-devs,
>
> We periodically have discussion on gerrit that certain kinds of code
> refactoring removes opportunities for inlining, or might degrade
> performance while we call a function that just returns to do_md() after
> checking a simple condition, because we're not e.g. doing a
> neighbour-search step right now. So, I thought we might benefit from a
> group discussion. In particular, insights from people who've tried to do
> things in mdrun and struggled/failed/succeeded-at-ruinous-expense would
> be valuable for the community. There are a lot of related topics I'll talk
> about here (sorry), but just tackling some of them doesn't do justice to
> the complexity of the problem...
>
> The alternative of expressing every condition about everything in raw
> code in do_md() and do_force() is obviously untenable. Function calls
> are required. Such code is rather better than it used to be, and will
> improve further with the death of the group scheme, but there is no good
> reason for do_md() to still be 1500 lines long.
>
> The kind of code I'm talking about is high-level control code, so
> there's going to be a partial or full pipeline stall from some branch,
> cache miss, or other function call in the next few lines of code. Even
> if the run-time penalty of refactoring
>
> if (bDoThis) { someExternFunction(); }
> callNextFunction();
>
> to
>
> someExternFunction();
> callNextFunction();
>
> // defined elsewhere as
>
> void someExternFunction()
> {
>     if (!bDoThis) { return; }
> ...
> }
>
> (or later doing similar things with virtual functions that might have an
> empty body) is a thousand CPU cycles, that's still under 1 microsecond
> of wallclock, and so we can do scores of them per CPU core per step
> before having a noticeable impact on a target MD iteration time under 1
> ms/step. We can get under 1 ms/step now, but I think pretty much nobody
> can afford to be that inefficient with their hardware, so the real-world
> per-step iteration time is noticeably further away from performance
> impact than our cutting-edge strong scaling. And penalties of 1000
> cycles are pessimistic - a DRAM memory load on x86 is 100-200 cycles
> (e.g.
> http://www3.informatik.uni-erlangen.de/Lehre/CAMA/SS2014/caches.pdf,
> http://idarkside.org/posts/numbers-you-should-know/).
>
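As a rough check of the arithmetic above, here is a minimal C++ sketch. The 3 GHz clock and the figure of 40 calls per step are assumptions chosen purely for illustration; only the 1000-cycle penalty and the 1 ms/step target come from the text.

```cpp
#include <cstdio>

// Assumed values for illustration only; none of these are measurements.
constexpr double clockHz       = 3.0e9;  // assumed 3 GHz core clock
constexpr double penaltyCycles = 1000.0; // pessimistic cost per extra call (from the text)
constexpr double stepSeconds   = 1.0e-3; // target MD iteration time of 1 ms/step
constexpr int    callsPerStep  = 40;     // "scores" of such calls, assumed to be 40

// Wall-clock cost of one penalized call, in seconds.
constexpr double penaltySeconds()
{
    return penaltyCycles / clockHz;
}

// Fraction of one MD step spent on all penalized calls together.
constexpr double overheadFraction()
{
    return callsPerStep * penaltySeconds() / stepSeconds;
}

// Compile-time checks of the two claims in the text, under the assumptions above.
static_assert(penaltySeconds() < 1.0e-6, "one call costs under 1 microsecond");
static_assert(overheadFraction() < 0.02, "total overhead stays under 2% of a step");

// Prints the per-call cost in microseconds and the per-step overhead in percent.
void printBudget()
{
    std::printf("%.2f us per call, %.2f%% of a step\n",
                penaltySeconds() * 1.0e6, overheadFraction() * 100.0);
}
```

With these assumed numbers, one call costs about 0.33 microseconds and 40 of them take roughly 1.3% of a 1 ms step. A slower clock or more calls per step shifts the result proportionally.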
> We can certainly measure that such things as putting control-flow
> conditionals inside functions have negligible effects in practice, but
> that seems to me a fairly Chicken Little approach to development. In
> particular, we can't afford to dedicate a machine capable of
> automatically testing all possible performance regressions, so the
> alternative is manual testing that nobody really wants to do multiple
> times per patch over the code-review process. Maybe a weekly automated
> performance-regression test is feasible, though, which would catch
> unexpected regressions (which is the kind of thing you don't catch in
> manual testing, anyway).
>
> Currently, the CUDA code path is expressed in the non-GPU build via
> macros that transform function declarations into inline-able empty
> static functions. Those make life painful for Doxygen, compiler
> warnings, and automated checkers, which we could live with if there was
> value provided by the use of those macros. But most (all?) of the empty
> function calls that are inlined away are behind runtime checks for
> whether a GPU is in use, and those are still present in the non-GPU
> build (and the GPU build when not using a GPU). So, I think this is an
> example where we've tried hard to write the code so that we can prove
> both code paths are as fast as we intend, but some effort was wasted -
> we could have just had normal C code that compiled non-inlined empty
> functions that would never have been called. Now that we have three
> implementations on the table (CUDA, OpenCL, null) and a C++ project, we
> need to move to using a real C++ interface. By design, only one
> definition of any virtual function is going to be compiled in any build,
> so GCC 4.9 devirtualization will just work when we turn it on (there's a
> long blog series from the guy who made this work, worth a read if the
> topic interests you
> http://hubicka.blogspot.cz/2014/01/devirtualization-in-c-part-1.html).
> But devirt is just icing - I think the performance cost of transforming
> ~10 GPU-related functions per MD step into virtual functions is not
> worth the time to measure, and particularly not when you consider the
> cost of not doing something else useful with that developer time.
>
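To make the "real C++ interface" idea concrete, here is a minimal sketch of an abstract backend with a null implementation. All class and function names are hypothetical and are not GROMACS API. CountingBackend only stands in for a CUDA or OpenCL implementation and does no real work.

```cpp
#include <memory>

// Hypothetical interface; the names are illustrative, not GROMACS API.
class NonbondedBackend
{
public:
    virtual ~NonbondedBackend() = default;
    // Launches the nonbonded work for this step; returns true if work was offloaded.
    virtual bool launchNonbonded(int step) = 0;
    // Blocks until the results of the last launch are available.
    virtual void waitForResults() = 0;
};

// Null implementation for builds or runs without a GPU: every call is a no-op.
class NullBackend final : public NonbondedBackend
{
public:
    bool launchNonbonded(int /*step*/) override { return false; }
    void waitForResults() override {}
};

// Stand-in for a CUDA or OpenCL implementation; it only counts launches.
class CountingBackend final : public NonbondedBackend
{
public:
    bool launchNonbonded(int /*step*/) override
    {
        ++launches_;
        return true;
    }
    void waitForResults() override {}
    int launches() const { return launches_; }

private:
    int launches_ = 0;
};

// The run-time check for GPU use happens once, when the backend is chosen.
std::unique_ptr<NonbondedBackend> makeBackend(bool useGpu)
{
    if (useGpu)
    {
        return std::unique_ptr<NonbondedBackend>(new CountingBackend);
    }
    return std::unique_ptr<NonbondedBackend>(new NullBackend);
}

// The high-level loop calls the interface unconditionally, with no GPU macros.
// Returns the number of steps whose work was offloaded.
int runSteps(NonbondedBackend& backend, int numSteps)
{
    int offloaded = 0;
    for (int step = 0; step < numSteps; ++step)
    {
        if (backend.launchNonbonded(step))
        {
            ++offloaded;
        }
        backend.waitForResults();
    }
    return offloaded;
}
```

The calling code stays identical for every backend, at the cost of a couple of virtual calls per step. That cost is of the same order as the function-call penalties estimated earlier in the thread.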
> Deploying task parallelism is going to be worse - tasks are virtual
> function calls in TBB and inlining is simply not a feature of the
> landscape until you get inside a task. The price of expressing
> parallelism right now is already that we have extra function calls every
> time we open an OpenMP region.
>
> Conversely, if we ever observe that the cost of calling nearly-empty
> functions has become noteworthy, remember that every MD step already
> has lots of branches for
> * is this a NS step
> * which integrator is this
> * which coupling algorithm might be active
> * is it multisim
> * is it rerun
> * is it PME
> * is GPU active
> * is DD active
> * etc.
> So far, I think nobody has proposed specializing the MD loop so that we
> eliminate those checks (branch less, call fewer functions). Anton does
> things like this, but they're a few orders of magnitude faster than us.
> In part, we haven't done this because the group scheme is still around,
> and also the added complexity is probably too high for some of the
> benefits, but it's also because do_md() and do_force() are an awful mess
> of everything being expressed in raw code, so nobody dares to touch
> anything because it's all too big to have in your head... Profile-guided
> optimization, JIT, link-time-optimization and compiler devirtualization
> are things we don't do much of yet, but e.g. recent versions of gcc have
> made massive progress here. There's a lot of C++ compiler consumers who
> have much bigger problems than we do...
>
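One possible way to specialize the MD loop, sketched below under stated assumptions, is to turn per-run conditions into template parameters and dispatch once before the loop starts. The option names, the counters, and the choice of just two conditions are hypothetical; real code would call domain-decomposition and PME routines where this sketch only increments counters.

```cpp
// Hypothetical run-time options that stay fixed for the whole run.
struct MdOptions
{
    bool useDomainDecomposition;
    bool usePme;
};

// Counters standing in for the real work, so that the behaviour can be checked.
struct StepCounters
{
    int ddCalls  = 0;
    int pmeCalls = 0;
    int steps    = 0;
};

// The loop body is compiled once per combination of flags. Each condition is a
// compile-time constant, so the compiler can drop the branches that are not taken.
template <bool useDD, bool usePme>
void mdLoop(int numSteps, StepCounters& counters)
{
    for (int step = 0; step < numSteps; ++step)
    {
        if (useDD)
        {
            ++counters.ddCalls; // placeholder for domain-decomposition work
        }
        if (usePme)
        {
            ++counters.pmeCalls; // placeholder for PME work
        }
        ++counters.steps;
    }
}

// A single run-time dispatch before the loop replaces the per-step checks.
StepCounters runMd(const MdOptions& options, int numSteps)
{
    StepCounters counters;
    if (options.useDomainDecomposition)
    {
        if (options.usePme) { mdLoop<true, true>(numSteps, counters); }
        else                { mdLoop<true, false>(numSteps, counters); }
    }
    else
    {
        if (options.usePme) { mdLoop<false, true>(numSteps, counters); }
        else                { mdLoop<false, false>(numSteps, counters); }
    }
    return counters;
}
```

The number of instantiations doubles with every flag, which is one form of the added complexity mentioned above. Conditions that change from step to step, such as whether this is a neighbour-search step, cannot be moved out of the loop this way.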
> There's a price for having a main loop that is thousands of lines long,
> because every conditional is inlined "for performance." These days,
> there's just 70 lines in do_md() for declaring nearly undocumented local
> variables, and that's after a sustained cleanup campaign, mostly from
> me... We don't feel the price of such unwieldy code much, because we
> don't really see whether people develop their new feature in some
> "developer-friendly" MD code, or write their own MD code, so that they
> don't do enough sampling on a real problem for anyone to trust their
> shiny new method. Or they try to develop in GROMACS and silently fail,
> or kind-of succeed but never trust their code/results and don't dare try
> to contribute their feature.
>
> I'm perfectly willing to trade a couple of percent of raw performance on
> plain-vanilla MD while raising the abstraction level. I would like to
> see a main loop in a few hundred lines that a new-coder grad student can
> observe is expressing the MD algorithm, and so they know where it makes
> sense to look further in order to change things. I think that the
> important future optimizations are going to be algorithmic, not
> squeezing a few more percent here and there, and we should be building a
> tool that can make these things possible to code in reasonable time, and
> so they can be tested on real-world simulation problems because a fast
> sampling back end can be used.
>
> What do other people think?
>
> Mark
>
>
Thanks for the long insightful mail.
I'm not so much into performance tuning anymore, but maintainability is 
crucial for future extensions and applications and therefore I support 
further cleanups.

There is something of a catch-22, though, in that much of the mess you describe
in do_md could be expressed more elegantly and readably using C++. However, that
requires a more extensive use of classes throughout the code which makes 
such an undertaking rather overwhelming. For now, cleaning up before the 
C++ transition (the code equivalent of throwing away old junk before you 
move to a new place) seems to be the way to go.

So for what reason do we still need the group scheme?
- Shell/Drude code?
- Tabulated potentials?
- More?

I couldn't find a redmine issue for "completely replacing group scheme 
functionality", it might be useful to have one.

Cheers,
-- 
David van der Spoel, Ph.D., Professor of Biology
Dept. of Cell & Molec. Biol., Uppsala University.
Box 596, 75124 Uppsala, Sweden. Phone:	+46184714205.
spoel at xray.bmc.uu.se    http://folding.bmc.uu.se