[gmx-developers] [RFC] thread affinity in mdrun

Szilárd Páll szilard.pall at cbr.su.se
Thu Sep 19 19:53:04 CEST 2013


I would like to get feedback on an issue (or more precisely a set of
issues) related to thread/process affinities and
i) the way we should (or should not) tweak the current behavior and
ii) the way we should proceed in the future.

Brief introduction, skip this if you are familiar with the
implementation details:
Currently, mdrun always sets per-thread affinity if the number of
threads is equal to the number of "CPUs" detected (as reported by the
OS, i.e. roughly the number of hardware threads supported). However, if
this is not the case, e.g. because one wants to leave some cores empty
(to run multiple simulations per node) or to avoid using HT, thread
pinning is not done. This can have quite harsh consequences for
performance, especially when OpenMP parallelization is used.

Additionally, we try hard not to override externally set affinities,
which means that if mdrun detects a non-default affinity, it will not
pin threads (not even if -pin on is used). This happens if the job
scheduler sets the affinity, or if the user sets it, e.g. with
KMP_AFFINITY/GOMP_CPU_AFFINITY, taskset, etc., but also if the MPI
implementation sets the affinity of only its own thread.

On the one hand, there was a request (see
http://redmine.gromacs.org/issues/1122) that we should allow forcing
the affinity setting by mdrun, either by making "-pin on" more
aggressive or by adding a "-pin force" option. Please check out the
discussion on the issue page and express your opinion on whether you
agree and which behavior you support.

On the other hand, more generally, I would like to get feedback on
what people's experience is with affinity setting. I'll just list a
few aspects of this issue that should be considered, but feel free to
raise other issues:
- per-process vs per-thread affinity;
- affinity set by, or required for optimal performance by, the
MPI/communication software stack;
- GPU/accelerator NUMA aspects;
- hwloc;
- leaving a core empty for interrupt handling (AMD/Cray?), or for an
MPI, NIC, or GPU driver thread.

Note that this part of the discussion is aimed more at the behavior of
mdrun in the future. This is especially relevant as the next major (?)
version is being planned/developed and new tasking/parallelization
design options are being explored.
