[gmx-developers] [RFC] thread affinity in mdrun

Mon Sep 23 22:13:59 CEST 2013

On Thu, Sep 19, 2013 at 7:53 PM, Szilárd Páll <szilard.pall at cbr.su.se> wrote:
> Hi,
>
> I would like to get feedback on an issue (or more precisely a set of
> issues) related to thread/process affinities and
> i) the way we should (or should not) tweak the current behavior and
> ii) the way we should proceed in the future.
>
>
> Brief introduction, skip this if you are familiar with the
> implementation details:
> Currently, mdrun always sets per-thread affinity if the number of
> threads is equal to the number of "CPUs" detected (reported by the OS
> ~ number of hardware threads supported). However, if this is not the
> case, e.g. one wants to leave some cores empty (run multiple
> simulations per node) or avoid using HT, thread pinning will not be
> done. This can have quite harsh consequences on the performance -
> especially when OpenMP parallelization is used (most notably with
> GPUs).
> Additionally, we try hard to not override externally set affinities
> which means that if mdrun detects non-default affinity, it will not
> pin threads (not even if -pin on is used). This happens if the job
> scheduler sets the affinity, or if the user sets it e.g. with
> KMP_AFFINITY/GOMP_CPU_AFFINITY, taskset, etc., but even if the MPI
> implementation sets only its thread's affinity.
>
>
> On the one hand, there was a request (see
> http://redmine.gromacs.org/issues/1122) that we should allow forcing
> the affinity setting by mdrun either by "-pin on" acquiring more
> aggressive behavior or using a "-pin force" option. Please check out
> the discussion on the issue page and express your opinion on whether
> you agree/which behavior you support.

I've added my voice to that thread - in favour of the current
behaviour, plus a new option "mdrun -pin force" for the times your
problem wants a hammer ;-) It would be nice for users not to have to
know about this, but until HPC kernels are routinely set up to set
suitable affinities, it's a user-space decision :-(
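
To make the options concrete, here is a minimal sketch of the decision being discussed: the behaviour Szilárd describes above, plus the proposed "force" setting. All names (PinOption, PinContext, shouldPinThreads) are hypothetical and this is not mdrun's actual code; Auto and Off are included on the assumption that they mirror mdrun's existing -pin values.

```cpp
// Hypothetical sketch of the pinning policy under discussion.
enum class PinOption { Auto, On, Off, Force };

struct PinContext {
    int  numThreads;           // threads to be started on this node
    int  numHardwareThreads;   // "CPUs" reported by the OS
    bool externalAffinitySet;  // non-default affinity mask found at startup
};

bool shouldPinThreads(PinOption option, const PinContext& ctx)
{
    switch (option) {
    case PinOption::Off:
        return false;
    case PinOption::Force:
        // Proposed: pin even if a scheduler, MPI or the user set a mask.
        return true;
    case PinOption::On:
        // Current: an externally set affinity is never overridden.
        return !ctx.externalAffinitySet;
    case PinOption::Auto:
        // Current default: pin only when every hardware thread is used.
        return !ctx.externalAffinitySet
               && ctx.numThreads == ctx.numHardwareThreads;
    }
    return false;
}
```

With 8 threads on a 16-way node, Auto does not pin, On pins unless something else has already set a mask, and Force always pins.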

> On the other hand, more generally, I would like to get feedback on
> what people's experience is with affinity setting. I'll just list a
> few aspects of this issue that should be considered, but feel free to
> raise other issues:
> - per-process vs per-thread affinity;
> - affinity set by, or required (for optimal performance) by, the
> MPI/communication software stack;
> - GPU/accelerator NUMA aspects;
> - hwloc;
> - leaving a core empty, for interrupts (AMD/Cray?), MPI, NIC or GPU
> driver thread.
>
> Note that this part of the discussion is aimed more at the behavior of
> mdrun in the future. This is especially relevant as the next major (?)
> version is being planned/developed and new tasking/parallelization
> design options are being explored.

These are important issues. Currently, mdrun both is and is not
cache-oblivious - the hit with scaling OpenMP across sockets
(particularly on AMD and BlueGene/Q) illustrates that there are
fundamental problems to address. (To be fair, there are places we
explicitly prohibit false sharing through over-allocation, and we do
use thread-local force accumulation buffers that get reduced later.)

Moving forward, it seems to me that continuing to improve strong
scaling will require us to be more explicit in managing these details.
Our current-generation home-grown mostly-x86-specific hardware
detection scheme is probably fine for now, but I question whether it's
worth using/extending it to provide information on cache line sizes.

Projects like hwloc seem to do that job pretty well, and should also
help with knowing the locality of I/O and network hardware. Coverage of
relevant HPC and desktop hardware and OS seems great, including CUDA,
BlueGene/Q and Xeon Phi; not sure about Fujitsu machines. License is
BSD, so we could distribute it if we wanted to. As suggested by one of
the CMake devs on this list back in May or so, such bundling is
straightforward to manage because Kitware does it a lot. Bundling
avoids a lot of versioning and compatibility issues. I am even tempted
to bundle and bootstrap CMake, but that is another debate...

Why do we care about fine hardware details? In the task-parallelism
framework being planned, I think it will be important to be able to
use/implement a cache-aware concurrent data structure that isolates
the details of maximizing performance. Using a plain vector (or even a
concurrency-aware vector), if two cores in different sockets need to
accumulate to the force for the same atom, the first will invalidate
the whole cache line for the second. If a core back on the first
socket now needs a further write, then it might miss cache as well!
Currently we avoid this by writing to thread-local force arrays and
synchronizing over all cores when reducing. If threads do not have
affinities set then all of this optimization is at the mercy of the
kernel scheduling!
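
The thread-local-buffer approach described above can be sketched roughly as follows. This is an illustration only, not the GROMACS implementation: the class name, the 64-byte line size and the flat x/y/z layout are all assumptions. Each thread's region is a whole number of cache lines, so no line is ever written by two threads.

```cpp
#include <cstddef>
#include <vector>

// Assumed cache-line size; real code would query the hardware.
constexpr std::size_t kLineBytes     = 64;
constexpr std::size_t kFloatsPerLine = kLineBytes / sizeof(float);

// One cache line of force components. Over-aligned allocation inside
// std::vector is only guaranteed from C++17 on.
struct alignas(kLineBytes) ForceLine {
    float v[kFloatsPerLine] = {};
};

// Hypothetical container: one padded force buffer per thread.
class ThreadLocalForces {
public:
    ThreadLocalForces(std::size_t numThreads, std::size_t numAtoms)
        : numThreads_(numThreads),
          numAtoms_(numAtoms),
          linesPerThread_((3 * numAtoms + kFloatsPerLine - 1) / kFloatsPerLine),
          lines_(numThreads * linesPerThread_) {}

    // Must be called only by the thread that owns buffer `thread`.
    void add(std::size_t thread, std::size_t atom, int dim, float value) {
        const std::size_t i = 3 * atom + dim;
        lines_[thread * linesPerThread_ + i / kFloatsPerLine]
            .v[i % kFloatsPerLine] += value;
    }

    // Run after all threads have synchronized: sum every thread's buffer.
    std::vector<float> reduce() const {
        std::vector<float> total(3 * numAtoms_, 0.0f);
        for (std::size_t t = 0; t < numThreads_; ++t) {
            for (std::size_t i = 0; i < total.size(); ++i) {
                const ForceLine& line =
                    lines_[t * linesPerThread_ + i / kFloatsPerLine];
                total[i] += line.v[i % kFloatsPerLine];
            }
        }
        return total;
    }

private:
    std::size_t numThreads_;
    std::size_t numAtoms_;
    std::size_t linesPerThread_;
    std::vector<ForceLine> lines_;
};
```

The padding only pays off if a thread stays on the core whose cache holds its buffer, which is where affinity comes in.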

We should expect to live in L2 cache at worst (e.g. 256K on recent
Intel) because even at 500 atoms/core, there are 500*3*4 bytes (about
6K) for positions, the same again for each of velocities and forces,
and a similar amount for nonbonded tables. This leaves huge amounts of
space for other data structures, and we want to target rather fewer
atoms/core than that! Living in 32K L1 may not be a dream...
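
As a back-of-envelope check of that estimate, here is a small sketch. It assumes single-precision x/y/z triples and counts the nonbonded tables as one more array of the same size as positions; the function names are made up for illustration.

```cpp
#include <cstddef>

// Bytes for one per-atom array of x/y/z floats.
constexpr std::size_t vectorBytes(std::size_t atoms) {
    return atoms * 3 * sizeof(float);
}

// Positions + velocities + forces + (assumed) one equal share for tables.
constexpr std::size_t workingSetBytes(std::size_t atoms) {
    return 4 * vectorBytes(atoms);
}

constexpr std::size_t kL1Bytes = 32 * 1024;
constexpr std::size_t kL2Bytes = 256 * 1024;

static_assert(vectorBytes(500) == 6000, "500*3*4 bytes per array");
static_assert(workingSetBytes(500) == 24000, "four such arrays");
static_assert(workingSetBytes(500) < kL1Bytes, "fits in a 32K L1");
static_assert(workingSetBytes(500) < kL2Bytes / 10, "under a tenth of L2");
```

If the tables are closer in size to the three arrays combined, the total is about 36000 bytes, which no longer fits in a 32K L1 at 500 atoms/core but would at somewhat lower atom counts.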

I think one effective implementation of a container for forces will be
to construct "lazy" L2-local shadow vectors
* only when required (since the task schedule will be dynamic),
* probably with granularity related to the L2 cache-line size, and
* with reduction when required for integration (i.e. when the
integrate-the-forces task for a given atom runs, and only if the
container reports that in fact multiple shadow vectors were used).
This should work fine with the existing SIMD-aware coordinate layouts.
However, it requires that a thread know where it is in NUMA space, how
big L2 lines are, and that the force container can keep track of which
shadow vectors need to be (and have been) spawned. As far as I can
see, the only use for a contiguous-storage-for-all-atoms-in-a-set
force array might be for communication (MPI, I/O, CUDA?); constructing
that should only be done when required. Similar considerations pertain
to velocities and positions, IMO.
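
A rough, single-threaded sketch of the bookkeeping such a container would need is below. Everything in it is hypothetical: the class and method names, the integer "domain" id standing in for an L2/NUMA locality, and the block size of 16 atoms (48 floats, i.e. three assumed 64-byte lines). A real version would need concurrent access and hardware-derived sizes.

```cpp
#include <array>
#include <cstddef>
#include <map>

// Assumed granularity: 16 atoms * 3 floats = three 64-byte cache lines.
constexpr std::size_t kAtomsPerBlock = 16;

using Block = std::array<float, 3 * kAtomsPerBlock>;

// Hypothetical force container with lazily created shadow blocks.
class ShadowForces {
public:
    // Accumulate into the shadow owned by `domain`, spawning it (zeroed)
    // the first time that domain touches this block of atoms.
    void add(int domain, std::size_t atom, int dim, float value) {
        Block& block = shadows_[atom / kAtomsPerBlock][domain];
        block[3 * (atom % kAtomsPerBlock) + dim] += value;
    }

    // How many shadows exist for the block containing `atom`; an
    // integration task can skip the reduction when this is 0 or 1.
    std::size_t shadowCount(std::size_t atom) const {
        auto it = shadows_.find(atom / kAtomsPerBlock);
        return it == shadows_.end() ? 0 : it->second.size();
    }

    // Total force on `atom`, summed over only the shadows that exist.
    std::array<float, 3> reduce(std::size_t atom) const {
        std::array<float, 3> total = {0.0f, 0.0f, 0.0f};
        auto it = shadows_.find(atom / kAtomsPerBlock);
        if (it == shadows_.end()) {
            return total;
        }
        const std::size_t offset = 3 * (atom % kAtomsPerBlock);
        for (const auto& entry : it->second) {
            for (int d = 0; d < 3; ++d) {
                total[d] += entry.second[offset + d];
            }
        }
        return total;
    }

private:
    // block index -> (domain id -> shadow block)
    std::map<std::size_t, std::map<int, Block>> shadows_;
};
```

Blocks that no domain ever touches cost nothing, and blocks touched by a single domain need no reduction at all.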

Using something like hwloc gives us a portable way to have access to
the information we'll need, without having to do much dirty work
ourselves, so it seems like a no-brainer for post-5.0 development. The
one cloud I can see is that if we use Intel's TBB as our task
framework, it aims to be generically cache friendly without being
cache aware, so we can expect no explicit help from it (whether or not
we use hwloc).

Cheers,

Mark

> Cheers,
> --
> Szilárd