[gmx-developers] [RFC] thread affinity in mdrun
szilard.pall at cbr.su.se
Fri Sep 27 02:02:06 CEST 2013
I'm afraid overriding affinities set through the OpenMP interface
(GOMP_CPU_AFFINITY/KMP_AFFINITY) doesn't work as expected. In some
cases I see large performance degradation when overriding (otherwise
incorrect) OpenMP-set affinities. For details, see the redmine page.
Does anybody have an idea why this is happening?
On Fri, Sep 27, 2013 at 12:53 AM, Szilárd Páll <szilard.pall at cbr.su.se> wrote:
> On Mon, Sep 23, 2013 at 10:13 PM, Mark Abraham <mark.j.abraham at gmail.com> wrote:
>> On Thu, Sep 19, 2013 at 7:53 PM, Szilárd Páll <szilard.pall at cbr.su.se> wrote:
>>> I would like to get feedback on an issue (or more precisely a set of
>>> issues) related to thread/process affinities and
>>> i) the way we should (or should not) tweak the current behavior and
>>> ii) the way we should proceed in the future.
>>> Brief introduction, skip this if you are familiar with the
>>> implementation details:
>>> Currently, mdrun always sets per-thread affinity if the number of
>>> threads is equal to the number of "CPUs" detected (reported by the OS
>>> ~ number of hardware threads supported). However, if this is not the
>>> case, e.g. one wants to leave some cores empty (run multiple
>>> simulations per node) or avoid using HT, thread pinning will not be
>>> done. This can have quite harsh consequences on the performance -
>>> especially when OpenMP parallelization is used (most notably with
>>> Additionally, we try hard to not override externally set affinities
>>> which means that if mdrun detects non-default affinity, it will not
>>> pin threads (not even if -pin on is used). This happens if the job
>>> scheduler sets the affinity, or if the user sets it e.g. with
>>> KMP_AFFINITY/GOMP_CPU_AFFINITY, taskset, etc., and even if the MPI
>>> implementation sets the affinity of only its own thread.
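One way to picture the "non-default affinity" check described above is sketched here. This is an illustration only, not mdrun's actual detection code: the function names are hypothetical, and it assumes that "default" means the calling process may run on every CPU the OS reports. The probe relies on os.sched_getaffinity, which Python only provides on Linux.

```python
import os

def affinity_is_default(mask, n_cpus):
    """Return True if the mask allows every CPU id in 0..n_cpus-1.

    Assumption: a mask that covers all CPUs means nobody (job scheduler,
    taskset, OpenMP runtime, MPI library) has restricted it.
    """
    return set(range(n_cpus)) <= set(mask)

def external_affinity_detected():
    """Probe the current process; returns None where the query is unavailable."""
    if not hasattr(os, "sched_getaffinity"):
        return None  # e.g. macOS or Windows
    n_cpus = os.cpu_count() or 1
    return not affinity_is_default(os.sched_getaffinity(0), n_cpus)

if __name__ == "__main__":
    print("external affinity detected:", external_affinity_detected())
```

Running the same script under "taskset -c 0" on Linux should flip the result on a multi-core machine, which is the situation in which mdrun declines to pin.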
>>> On the one hand, there was a request (see
>>> http://redmine.gromacs.org/issues/1122) that we should allow forcing
>>> the affinity setting by mdrun, either by making "-pin on" more
>>> aggressive or by adding a "-pin force" option. Please check out
>>> the discussion on the issue page and express your opinion on whether
>>> you agree/which behavior you support.
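The behaviour under discussion can be summarised as a small decision function. This is a sketch under stated assumptions, not the mdrun implementation: the function name and arguments are hypothetical, "auto" and "off" are labels used here for the default and the disabled case, and the rule that "on" pins regardless of thread count when no external affinity is present is an assumption. Only the two points stated in the thread are taken as given: externally set affinities are respected even with "-pin on", and the proposed "-pin force" would override them.

```python
def should_pin(pin_option, n_threads, n_hw_threads, external_affinity):
    """Decide whether to set per-thread affinities (illustrative only).

    pin_option        -- "auto", "on", "off", or the proposed "force"
    n_threads         -- total number of threads started
    n_hw_threads      -- hardware threads reported by the OS
    external_affinity -- True if a non-default affinity mask was detected
    """
    if pin_option == "off":
        return False
    if external_affinity:
        # Current behaviour: never override, not even with "on".
        # Proposed behaviour: only "force" overrides.
        return pin_option == "force"
    if pin_option in ("on", "force"):
        return True  # assumption: explicit request pins for any thread count
    # Default: pin only when the whole node is used.
    return n_threads == n_hw_threads

if __name__ == "__main__":
    for args in [("auto", 8, 8, False), ("auto", 4, 8, False),
                 ("on", 8, 8, True), ("force", 8, 8, True)]:
        print(args, "->", should_pin(*args))
```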
> First of all, thanks for the comments!
>> I've added my voice to that thread - in favour of the current
>> behaviour, plus a new option "mdrun -pin force" for the times your
>> problem wants a hammer ;-) It would be nice for users not to have to
>> know about this, but until HPC kernels are routinely set up to set
>> suitable affinities, it's a user-space decision :-(
> The change is in gerrit:
> It would be great if people could test the code behavior and let me
> know (on redmine #1122) if something does not work as expected.
>>> On the other hand, more generally, I would like to get feedback on
>>> what people's experience is with affinity setting. I'll just list a
>>> few aspects of this issue that should be considered, but feel free to
>>> raise other issues:
>>> - per-process vs per-thread affinity;
>>> - affinity set by, or required (for optimal performance) by, the
>>> MPI/communication software stack;
>>> - GPU/accelerator NUMA aspects;
>>> - hwloc;
>>> - leaving a core empty, for interrupts (AMD/Cray?), MPI, NIC or GPU
>>> driver thread.
>>> Note that this part of the discussion is aimed more at the behavior of
>>> mdrun in the future. This is especially relevant as the next major (?)
>>> version is being planned/developed and new tasking/parallelization
>>> design options are being explored.
>> These are important issues. Currently, mdrun both is and is not
>> cache-oblivious - the hit with scaling OpenMP across sockets
>> (particularly on AMD and BlueGene/Q) illustrates that there are
>> fundamental problems to address. (To be fair, there are places we
>> explicitly prohibit false sharing through over-allocation, and we do
>> use thread-local force accumulation buffers that get reduced later.)
>> Moving forward, it seems to me that continuing to improve strong
>> scaling will require us to be more explicit in managing these details.
>> Our current-generation home-grown mostly-x86-specific hardware
>> detection scheme is probably fine for now, but I question whether it's
>> worth using/extending it to provide information on cache line sizes.
>> Projects like hwloc seem to do that job pretty well. Also should help
>> with knowing the locality of I/O and network hardware. Coverage of
>> relevant HPC and desktop hardware and OS seems great, including CUDA,
>> BlueGene/Q and Xeon Phi; not sure about Fujitsu machines. License is
>> BSD, so we could distribute it if we wanted to. As suggested by one of
>> the CMake devs on this list back in May or so, such bundling is
>> straightforward to manage because Kitware does it lots. Bundling
>> avoids a lot of versioning and compatibility issues. I am even tempted
>> to bundle and bootstrap CMake, but that is another debate...
>> Why do we care about fine hardware details? In the task-parallelism
>> framework being planned, I think it will be important to be able to
>> use/implement a cache-aware concurrent data structure that isolates
>> the details of maximizing performance. Using a plain vector (or even a
>> concurrency-aware vector), if two cores in different sockets need to
>> accumulate to the force for the same atom, the first will invalidate
>> the whole cache line for the second. If a core back on the first
>> socket now needs a further write, then it might miss cache as well!
>> Currently we avoid this by writing to thread-local force arrays and
>> synchronizing over all cores when reducing. If threads do not have
>> affinities set then all of this optimization is at the mercy of the
>> kernel scheduling!
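The thread-local accumulation and later reduction mentioned above has roughly the following shape. This is a structural sketch only: the function name is hypothetical, forces are scalars rather than 3-vectors for brevity, and Python lists say nothing about cache lines, so it shows the pattern rather than its performance effect.

```python
from concurrent.futures import ThreadPoolExecutor

def accumulate_forces(n_atoms, contributions, n_threads):
    """Sum (atom_index, force) pairs using one private buffer per thread."""
    # Static split of the work; each thread gets every n_threads-th pair.
    chunks = [contributions[t::n_threads] for t in range(n_threads)]

    def work(chunk):
        local = [0.0] * n_atoms  # private buffer, never written by other threads
        for atom, force in chunk:
            local[atom] += force
        return local

    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        buffers = list(pool.map(work, chunks))

    # Reduction step: the only point where all threads' data meet.
    return [sum(per_atom) for per_atom in zip(*buffers)]

if __name__ == "__main__":
    pairs = [(0, 1.0), (1, 2.0), (0, 3.0), (2, 0.5)]
    print(accumulate_forces(3, pairs, n_threads=2))
```

Because no two threads write to the same buffer, no write invalidates another thread's cache lines during accumulation; the cost is the extra memory and the synchronization at the reduction.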
>> We should expect to live in L2 cache at worst (e.g. 256K on recent
>> Intel) because even at 500 atoms/core, there's 500*3*4 bytes for
>> positions, same again for velocities and forces, a similar amount for
>> nonbonded tables. This leaves huge amounts of space for other data
>> structures, and we want to target rather fewer atoms/core than that!
>> Living in 32K L1 may not be a dream...
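The estimate above can be written out explicitly. The sketch assumes single precision (4 bytes per component), takes "a similar amount" for the nonbonded tables to mean one more array of the same size, and uses the 32K and 256K cache sizes quoted in the text.

```python
ATOMS_PER_CORE = 500
BYTES_PER_VECTOR = 3 * 4  # x, y, z in single precision

per_array = ATOMS_PER_CORE * BYTES_PER_VECTOR  # 6000 bytes
# positions + velocities + forces + (assumed) equal share for nonbonded tables
working_set = 4 * per_array                    # 24000 bytes

L1_BYTES = 32 * 1024
L2_BYTES = 256 * 1024

print(f"working set: {working_set} bytes")
print(f"fraction of L1: {working_set / L1_BYTES:.2f}")
print(f"fraction of L2: {working_set / L2_BYTES:.2f}")
```

Under these assumptions the core arrays take about 24 kB, below the 32K L1 size and under a tenth of a 256K L2.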
>> I think one effective implementation of a container for forces will be
>> to construct "lazy" L2-local shadow vectors
>> * only when required (since the task schedule will be dynamic),
>> * probably with granularity related to the L2 cache-line size, and
>> * with reduction when required for integration (i.e. when the
>> integrate-the-forces task for a given atom runs, and only if the
>> container reports that in fact multiple shadow vectors were used).
>> This should work fine with the existing SIMD-aware coordinate layouts.
>> However, it requires that a thread know where it is in NUMA space, how
>> big L2 lines are, and that the force container can keep track of which
>> shadow vectors need to be (and have been) spawned. As far as I can
>> see, the only use for a contiguous-storage-for-all-atoms-in-a-set
>> force array might be for communication (MPI, I/O, CUDA?); constructing
>> that should only be done when required. Similar considerations pertain
>> to velocities and positions, IMO.
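A toy version of the lazy shadow-vector idea might look like the following. Everything here is hypothetical: the class and method names, the "domain" argument (standing in for whichever L2 or NUMA domain the writing thread belongs to), and the block_atoms granularity are invented for illustration, and the sketch is not thread-safe. It only shows the bookkeeping: shadows are created on first write, and a reduction happens only when more than one shadow exists for the block.

```python
class ShadowForceContainer:
    """Force accumulator with lazily created per-domain shadow blocks."""

    def __init__(self, n_atoms, block_atoms=16):
        self.n_atoms = n_atoms
        self.block_atoms = block_atoms  # stand-in for a cache-line-sized chunk
        self._shadows = {}              # block index -> {domain: list of forces}

    def add(self, domain, atom, force):
        """Accumulate a force contribution written from the given domain."""
        block, offset = divmod(atom, self.block_atoms)
        per_domain = self._shadows.setdefault(block, {})
        if domain not in per_domain:
            # Spawn the shadow only when a domain first touches this block.
            per_domain[domain] = [0.0] * self.block_atoms
        per_domain[domain][offset] += force

    def n_shadows(self):
        return sum(len(d) for d in self._shadows.values())

    def total(self, atom):
        """Force on one atom, reducing only if several shadows were used."""
        block, offset = divmod(atom, self.block_atoms)
        shadows = list(self._shadows.get(block, {}).values())
        if not shadows:
            return 0.0
        if len(shadows) == 1:
            return shadows[0][offset]  # single writer: no reduction needed
        return sum(s[offset] for s in shadows)

if __name__ == "__main__":
    forces = ShadowForceContainer(n_atoms=64)
    forces.add(domain=0, atom=5, force=1.0)
    forces.add(domain=1, atom=5, force=2.5)
    forces.add(domain=0, atom=20, force=4.0)
    print(forces.total(5), forces.total(20), forces.n_shadows())
```

A real container would also need the NUMA position of each thread and the cache-line size, which is where something like hwloc would come in.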
>> Using something like hwloc gives us a portable way to have access to
>> the information we'll need, without having to do much dirty work
>> ourselves, so it seems like a no-brainer for post-5.0 development. The
>> one cloud I can see is that if we use Intel's TBB as our task
>> framework, it aims to be generically cache friendly without being
>> cache aware, so we can expect no explicit help from it (whether or not
>> we use hwloc).
>>> gmx-developers mailing list
>>> gmx-developers at gromacs.org
>>> Please don't post (un)subscribe requests to the list. Use the
>>> www interface or send it to gmx-developers-request at gromacs.org.