[gmx-developers] [RFC] thread affinity in mdrun

Erik Lindahl erik.lindahl at scilifelab.se
Mon Sep 23 22:19:49 CEST 2013


Hi,

Just a super-short follow-up to Mark's excellent long mail:

Getting L1/L2/L3 cache size information (and associativity, etc.) is actually very straightforward on x86 - if somebody can use that for better performance, we could easily add it for 5.0 as part of the CPUID code.

However, the big shortcoming is that there is no way whatsoever to get any information about NUMA domains that way - for the long run we need hwloc.

Cheers,

Erik

On Sep 23, 2013, at 10:13 PM, Mark Abraham <mark.j.abraham at gmail.com> wrote:

> On Thu, Sep 19, 2013 at 7:53 PM, Szilárd Páll <szilard.pall at cbr.su.se> wrote:
>> Hi,
>> 
>> I would like to get feedback on an issue (or more precisely a set of
>> issues) related to thread/process affinities and
>> i) the way we should (or should not) tweak the current behavior and
>> ii) the way we should proceed in the future.
>> 
>> 
>> Brief introduction, skip this if you are familiar with the
>> implementation details:
>> Currently, mdrun always sets per-thread affinity if the number of
>> threads equals the number of "CPUs" the OS reports (roughly the
>> number of hardware threads). If this is not the case, e.g. when one
>> wants to leave some cores empty (to run multiple simulations per
>> node) or to avoid using HT, thread pinning is not done. This can
>> have quite harsh performance consequences, especially when OpenMP
>> parallelization is used (most notably with GPUs).
>> Additionally, we try hard not to override externally set affinities,
>> which means that if mdrun detects a non-default affinity, it will
>> not pin threads (not even if -pin on is used). This happens if the
>> job scheduler sets the affinity, or if the user sets it e.g. with
>> KMP_AFFINITY/GOMP_CPU_AFFINITY, taskset, etc., and even when the MPI
>> implementation sets the affinity of only its own threads.
>> 
>> 
>> On the one hand, there was a request (see
>> http://redmine.gromacs.org/issues/1122) that we should allow forcing
>> the affinity setting by mdrun, either by making "-pin on" more
>> aggressive or by adding a "-pin force" option. Please check out the
>> discussion on the issue page and express your opinion on whether you
>> agree and which behavior you support.
> 
> I've added my voice to that thread - in favour of the current
> behaviour, plus a new option "mdrun -pin force" for the times your
> problem wants a hammer ;-) It would be nice for users not to have to
> know about this, but until HPC kernels are routinely set up to set
> suitable affinities, it's a user-space decision :-(
> 
>> On the other hand, more generally, I would like to get feedback on
>> what people's experience is with affinity setting. I'll just list a
>> few aspects of this issue that should be considered, but feel free to
>> raise other issues:
>> - per-process vs per-thread affinity;
>> - affinity set by or required (for optimal performance)
>> MPI/communication software stack;
>> - GPU/accelerator NUMA aspects;
>> - hwloc;
>> - leaving a core empty, for interrupts (AMD/Cray?), MPI, NIC or GPU
>> driver thread.
>> 
>> Note that this part of the discussion is aimed more at the behavior of
>> mdrun in the future. This is especially relevant as the next major (?)
>> version is being planned/developed and new tasking/parallelization
>> design options are being explored.
> 
> These are important issues. Currently, mdrun both is and is not
> cache-oblivious - the performance hit when scaling OpenMP across
> sockets (particularly on AMD and BlueGene/Q) illustrates that there
> are fundamental problems to address. (To be fair, there are places
> where we explicitly prevent false sharing through over-allocation,
> and we do use thread-local force-accumulation buffers that get
> reduced later.)
> 
> Moving forward, it seems to me that continuing to improve strong
> scaling will require us to be more explicit in managing these details.
> Our current-generation home-grown mostly-x86-specific hardware
> detection scheme is probably fine for now, but I question whether it's
> worth using/extending it to provide information on cache line sizes.
> 
> Projects like hwloc seem to do that job pretty well. It should also
> help with knowing the locality of I/O and network hardware. Coverage
> of relevant HPC and desktop hardware and OSes seems great, including
> CUDA, BlueGene/Q and Xeon Phi; I'm not sure about Fujitsu machines.
> The license is BSD, so we could distribute it if we wanted to. As
> suggested by one of the CMake devs on this list back in May or so,
> such bundling is straightforward to manage because Kitware does it a
> lot. Bundling avoids a lot of versioning and compatibility issues. I
> am even tempted to bundle and bootstrap CMake, but that is another
> debate...
> 
> Why do we care about fine hardware details? In the task-parallelism
> framework being planned, I think it will be important to be able to
> use/implement a cache-aware concurrent data structure that isolates
> the details of maximizing performance. With a plain vector (or even a
> concurrency-aware vector), if two cores on different sockets need to
> accumulate to the force for the same atom, the first write will
> invalidate the whole cache line for the second. If a core back on the
> first socket then needs a further write, it might miss cache as well!
> Currently we avoid this by writing to thread-local force arrays and
> synchronize over all cores when reducing. If threads do not have
> affinities set then all of this optimization is at the mercy of the
> kernel scheduling!
> 
> We should expect to live in L2 cache at worst (e.g. 256K on recent
> Intel), because even at 500 atoms/core there are 500*3*4 bytes for
> positions, the same again for velocities and forces, and a similar
> amount for nonbonded tables. This leaves huge amounts of space for
> other data structures, and we want to target rather fewer atoms/core
> than that! Fitting into 32K of L1 may not be just a dream...
> 
> I think one effective implementation of a container for forces will be
> to construct "lazy" L2-local shadow vectors
> * only when required (since the task schedule will be dynamic),
> * probably with granularity related to the L2 cache-line size, and
> * with reduction when required for integration (i.e. when the
> integrate-the-forces task for a given atom runs, and only if the
> container reports that in fact multiple shadow vectors were used).
> This should work fine with the existing SIMD-aware coordinate layouts.
> However, it requires that a thread know where it is in NUMA space, how
> big L2 lines are, and that the force container can keep track of which
> shadow vectors need to be (and have been) spawned. As far as I can
> see, the only use for a contiguous-storage-for-all-atoms-in-a-set
> force array might be for communication (MPI, I/O, CUDA?); constructing
> that should only be done when required. Similar considerations pertain
> to velocities and positions, IMO.
> 
> Using something like hwloc gives us a portable way to have access to
> the information we'll need, without having to do much dirty work
> ourselves, so it seems like a no-brainer for post-5.0 development. The
> one cloud I can see is that if we use Intel's TBB as our task
> framework, it aims to be generically cache friendly without being
> cache aware, so we can expect no explicit help from it (whether or not
> we use hwloc).
> 
> Cheers,
> 
> Mark
> 
>> Cheers,
>> --
>> Szilárd
>> --
>> gmx-developers mailing list
>> gmx-developers at gromacs.org
>> http://lists.gromacs.org/mailman/listinfo/gmx-developers
>> Please don't post (un)subscribe requests to the list. Use the
>> www interface or send it to gmx-developers-request at gromacs.org.
