[gmx-developers] [RFC] thread affinity in mdrun
Szilárd Páll
szilard.pall at cbr.su.se
Fri Sep 27 00:53:29 CEST 2013
On Mon, Sep 23, 2013 at 10:13 PM, Mark Abraham <mark.j.abraham at gmail.com> wrote:
> On Thu, Sep 19, 2013 at 7:53 PM, Szilárd Páll <szilard.pall at cbr.su.se> wrote:
>> Hi,
>>
>> I would like to get feedback on an issue (or more precisely a set of
>> issues) related to thread/process affinities and
>> i) the way we should (or should not) tweak the current behavior and
>> ii) the way we should proceed in the future.
>>
>>
>> Brief introduction, skip this if you are familiar with the
>> implementation details:
>> Currently, mdrun always sets per-thread affinity if the number of
>> threads is equal to the number of "CPUs" detected (reported by the OS
>> ~ number of hardware threads supported). However, if this is not the
>> case, e.g. one wants to leave some cores empty (run multiple
>> simulations per node) or avoid using HT, thread pinning will not be
>> done. This can have quite harsh consequences on the performance -
>> especially when OpenMP parallelization is used (most notably with
>> GPUs).
>> Additionally, we try hard to not override externally set affinities
>> which means that if mdrun detects non-default affinity, it will not
>> pin threads (not even if -pin on is used). This happens if the job
>> scheduler sets the affinity, or if the user sets it e.g. with
>> KMP_AFFINITY/GOMP_CPU_AFFINITY, taskset, etc., and even if the MPI
>> implementation sets the affinity of only its own thread.
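
To illustrate the "non-default affinity" check described above, here is a minimal sketch and not mdrun's actual code. On Linux, a process can compare the set of CPUs it is allowed to run on with the number of CPUs the OS reports. The helper names `has_external_affinity` and `current_allowed_cpus` are hypothetical, and the sketch assumes a Linux host where Python's `os.sched_getaffinity` is available, with a fallback elsewhere.

```python
import os


def has_external_affinity(allowed_cpus, n_cpus_detected):
    """Return True if the allowed-CPU mask is narrower than the full machine.

    A narrower mask suggests that something outside the program (job
    scheduler, taskset, an OpenMP or MPI runtime) already set an affinity.
    """
    return len(allowed_cpus) < n_cpus_detected


def current_allowed_cpus():
    """CPUs the calling process may run on (Linux); all CPUs as a fallback."""
    if hasattr(os, "sched_getaffinity"):
        return set(os.sched_getaffinity(0))
    return set(range(os.cpu_count() or 1))
```

For instance, a process launched under `taskset -c 0-3` on a 16-CPU node would see four allowed CPUs, so `has_external_affinity(current_allowed_cpus(), os.cpu_count())` would report an externally set mask.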
>>
>>
>> On the one hand, there was a request (see
>> http://redmine.gromacs.org/issues/1122) that we should allow forcing
>> the affinity setting by mdrun either by "-pin on" acquiring more
>> aggressive behavior or using a "-pin force" option. Please check out
>> the discussion on the issue page and express your opinion on whether
>> you agree/which behavior you support.
First of all, thanks for the comments!
> I've added my voice to that thread - in favour of the current
> behaviour, plus a new option "mdrun -pin force" for the times your
> problem wants a hammer ;-) It would be nice for users not to have to
> know about this, but until HPC kernels are routinely set up to set
> suitable affinities, it's a user-space decision :-(
The change is in gerrit:
https://gerrit.gromacs.org/#/c/2633/
It would be great if people could test the code behavior and let me
know (on redmine #1122) if something does not work as expected.
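
The pinning policy under discussion can be summarized as a small decision function. This is a hedged sketch of the behavior as described in this thread, not the code in the gerrit change: the function name `should_pin` is hypothetical, "force" is the proposed option, and treating "on" as "pin even when only part of the node is used" is an assumption.

```python
def should_pin(pin, n_threads, n_cpus_detected, external_affinity):
    """Decide whether per-thread affinities should be set.

    pin is one of "auto", "on", "off", or the proposed "force".
    """
    if pin == "off":
        return False
    if pin == "force":
        # Proposed option: pin even over an externally set affinity.
        return True
    if external_affinity:
        # Behavior described in the thread: an external mask is respected,
        # even when "-pin on" is given.
        return False
    if pin == "on":
        # Assumption: "on" pins regardless of how many threads are used.
        return True
    # "auto": pin only when every detected hardware thread is used.
    return n_threads == n_cpus_detected
```

With this reading, `should_pin("on", 8, 16, True)` is false while `should_pin("force", 8, 16, True)` is true, which is the difference requested in the redmine issue.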
>
>> On the other hand, more generally, I would like to get feedback on
>> what people's experience is with affinity setting. I'll just list a
>> few aspects of this issue that should be considered, but feel free to
>> raise other issues:
>> - per-process vs per-thread affinity;
>> - affinity set by, or required (for optimal performance) by, the
>> MPI/communication software stack;
>> - GPU/accelerator NUMA aspects;
>> - hwloc;
>> - leaving a core empty, for interrupts (AMD/Cray?), MPI, NIC or GPU
>> driver thread.
>>
>> Note that this part of the discussion is aimed more at the behavior of
>> mdrun in the future. This is especially relevant as the next major (?)
>> version is being planned/developed and new tasking/parallelization
>> design options are being explored.
>
> These are important issues. Currently, mdrun both is and is not
> cache-oblivious - the hit with scaling OpenMP across sockets
> (particularly on AMD and BlueGene/Q) illustrates that there are
> fundamental problems to address. (To be fair, there are places we
> explicitly prohibit false sharing through over-allocation, and we do
> use thread-local force accumulation buffers that get reduced later.)
>
> Moving forward, it seems to me that continuing to improve strong
> scaling will require us to be more explicit in managing these details.
> Our current-generation home-grown mostly-x86-specific hardware
> detection scheme is probably fine for now, but I question whether it's
> worth using/extending it to provide information on cache line sizes.
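
For context on what such a home-grown query might involve, here is a minimal Linux-only sketch that reads a cache line size from sysfs. The function name `cache_line_size` is hypothetical, the 64-byte default is an assumption, and which cache level `index0` refers to depends on the machine.

```python
from pathlib import Path


def cache_line_size(cpu=0, index=0, default=64):
    """Best-effort cache line size in bytes for one CPU, read from Linux sysfs."""
    path = Path(
        f"/sys/devices/system/cpu/cpu{cpu}/cache/index{index}/coherency_line_size"
    )
    try:
        return int(path.read_text().strip())
    except (OSError, ValueError):
        # Not Linux, sysfs unavailable, or unexpected file contents.
        return default
```

The fallback path shows why a portable library is attractive: every OS and architecture needs its own variant of this query.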
>
> Projects like hwloc seem to do that job pretty well. hwloc should also
> help with knowing the locality of I/O and network hardware. Coverage of
> relevant HPC and desktop hardware and OS seems great, including CUDA,
> BlueGene/Q and Xeon Phi; not sure about Fujitsu machines. License is
> BSD, so we could distribute it if we wanted to. As suggested by one of
> the CMake devs on this list back in May or so, such bundling is
> straightforward to manage because Kitware does it lots. Bundling
> avoids a lot of versioning and compatibility issues. I am even tempted
> to bundle and bootstrap CMake, but that is another debate...
>
> Why do we care about fine hardware details? In the task-parallelism
> framework being planned, I think it will be important to be able to
> use/implement a cache-aware concurrent data structure that isolates
> the details of maximizing performance. Using a plain vector (or even a
> concurrency-aware vector), if two cores in different sockets need to
> accumulate to the force for the same atom, the first will invalidate
> the whole cache line for the second. If a core back on the first
> socket now needs a further write, then it might miss cache as well!
> Currently we avoid this by writing to thread-local force arrays and
> synchronizing over all cores when reducing. If threads do not have
> affinities set then all of this optimization is at the mercy of the
> kernel scheduling!
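
The thread-local accumulation and later reduction mentioned above follows a simple data flow, sketched here under stated assumptions. The loop over `contributions_per_thread` stands in for real threads, both function names are hypothetical, and the sketch shows only which buffer each write goes to, not the performance effect.

```python
def accumulate_thread_local(n_atoms, contributions_per_thread):
    """Build one private force buffer per thread.

    contributions_per_thread[t] is a list of (atom_index, (fx, fy, fz)).
    Each thread writes only to its own buffer, so no cache line is shared
    between writers during accumulation.
    """
    buffers = []
    for contributions in contributions_per_thread:
        local = [[0.0, 0.0, 0.0] for _ in range(n_atoms)]
        for atom, f in contributions:
            for d in range(3):
                local[atom][d] += f[d]
        buffers.append(local)
    return buffers


def reduce_forces(buffers):
    """Sum the per-thread buffers into one force array (the synchronized step)."""
    n_atoms = len(buffers[0])
    total = [[0.0, 0.0, 0.0] for _ in range(n_atoms)]
    for local in buffers:
        for atom in range(n_atoms):
            for d in range(3):
                total[atom][d] += local[atom][d]
    return total
```

Two threads contributing to the same atom touch different buffers until `reduce_forces` runs, which is the point at which all cores must synchronize.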
>
> We should expect to live in L2 cache at worst (e.g. 256K on recent
> Intel) because even at 500 atoms/core, there's 500*3*4 = 6000 bytes for
> positions, the same again for each of velocities and forces, and a
> similar amount for nonbonded tables. This leaves huge amounts of space
> for other data structures, and we want to target rather fewer
> atoms/core than that! Living in 32K L1 may not be a dream...
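
The estimate above can be checked with a few lines of arithmetic. This is a sketch using the figures from the paragraph; the function name `working_set_bytes` is hypothetical, and "a similar amount for nonbonded tables" is taken to mean one more array of the same size.

```python
BYTES_PER_REAL = 4   # the "4" in 500*3*4
DIMS = 3             # x, y, z

L1_BYTES = 32 * 1024
L2_BYTES = 256 * 1024


def working_set_bytes(n_atoms, n_arrays=3, table_bytes=None):
    """Bytes for positions, velocities and forces plus nonbonded tables."""
    per_array = n_atoms * DIMS * BYTES_PER_REAL
    if table_bytes is None:
        # Assumption: tables take about as much as one per-atom array.
        table_bytes = per_array
    return n_arrays * per_array + table_bytes


total = working_set_bytes(500)
fraction_of_l2 = total / L2_BYTES
```

At 500 atoms/core this gives 24000 bytes, under a tenth of a 256K L2 and still below 32K, which supports the remark about L1.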
>
> I think one effective implementation of a container for forces will be
> to construct "lazy" L2-local shadow vectors
> * only when required (since the task schedule will be dynamic),
> * probably with granularity related to the L2 cache-line size, and
> * with reduction when required for integration (i.e. when the
> integrate-the-forces task for a given atom runs, and only if the
> container reports that in fact multiple shadow vectors were used).
> This should work fine with the existing SIMD-aware coordinate layouts.
> However, it requires that a thread know where it is in NUMA space, how
> big L2 lines are, and that the force container can keep track of which
> shadow vectors need to be (and have been) spawned. As far as I can
> see, the only use for a contiguous-storage-for-all-atoms-in-a-set
> force array might be for communication (MPI, I/O, CUDA?); constructing
> that should only be done when required. Similar considerations pertain
> to velocities and positions, IMO.
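
One way to picture the lazy shadow-vector container proposed above is the following sketch. It is an illustration under assumptions, not a design from the thread: the class name `ShadowForceContainer`, the `domain` label (standing for an L2 or NUMA location) and the default block size of 16 atoms are all hypothetical.

```python
class ShadowForceContainer:
    """Force storage with per-domain shadow blocks created on first write."""

    def __init__(self, n_atoms, atoms_per_block=16):
        self.n_atoms = n_atoms
        self.atoms_per_block = atoms_per_block
        # shadows[block][domain] -> list of [fx, fy, fz] for that block
        self.shadows = {}

    def add(self, domain, atom, f):
        """Accumulate force f on atom into the shadow owned by domain."""
        block, offset = divmod(atom, self.atoms_per_block)
        per_domain = self.shadows.setdefault(block, {})
        if domain not in per_domain:
            # Lazy spawn: allocate only when this domain first writes here.
            per_domain[domain] = [[0.0] * 3 for _ in range(self.atoms_per_block)]
        vec = per_domain[domain][offset]
        for d in range(3):
            vec[d] += f[d]

    def n_shadows(self, atom):
        """Number of shadow blocks that currently cover this atom."""
        return len(self.shadows.get(atom // self.atoms_per_block, {}))

    def force(self, atom):
        """Total force on atom, reducing only if several shadows exist."""
        block, offset = divmod(atom, self.atoms_per_block)
        per_domain = self.shadows.get(block, {})
        if len(per_domain) == 1:
            (only,) = per_domain.values()
            return list(only[offset])
        total = [0.0] * 3
        for shadow in per_domain.values():
            for d in range(3):
                total[d] += shadow[offset][d]
        return total
```

The integrate-the-forces task would call `force(atom)` and pay for a reduction only where `n_shadows(atom)` is greater than one. A real implementation would also need thread-safe spawning and cache-line-aligned blocks, which this sketch leaves out.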
>
> Using something like hwloc gives us a portable way to have access to
> the information we'll need, without having to do much dirty work
> ourselves, so it seems like a no-brainer for post-5.0 development. The
> one cloud I can see is that if we use Intel's TBB as our task
> framework, it aims to be generically cache friendly without being
> cache aware, so we can expect no explicit help from it (whether or not
> we use hwloc).
>
> Cheers,
>
> Mark
>
>> Cheers,
>> --
>> Szilárd
>> --
>> gmx-developers mailing list
>> gmx-developers at gromacs.org
>> http://lists.gromacs.org/mailman/listinfo/gmx-developers
>> Please don't post (un)subscribe requests to the list. Use the
>> www interface or send it to gmx-developers-request at gromacs.org.