[gmx-developers] [RFC] thread affinity in mdrun

Szilárd Páll szilard.pall at cbr.su.se
Sun Sep 22 22:32:52 CEST 2013


On Sun, Sep 22, 2013 at 8:29 PM, Alexey Shvetsov
<alexxy at omrb.pnpi.spb.ru> wrote:
> Hi!
>
> Szilárd Páll wrote on 22-09-2013 19:49:
>
>> Hi,
>>
>> On Fri, Sep 20, 2013 at 7:06 AM, Alexey Shvetsov
>> <alexxy at omrb.pnpi.spb.ru> wrote:
>>>
>>> Hi!
>>>
>>> I saw issues with a demo Numascale system[1] and the default (without
>>> external MPI) mdrun behavior -- it pins all 128 threads to the first
>>> ~20 cores. The version with external MPI (Numascale provides an
>>> OpenMPI offload module) works fine.
>>
>>
>> That's possible. I don't know of any testing done on Numascale systems
>> - at least not for 4.6. Feel free to file a bug report! However,
>> somebody with access to the machine would need to contribute a patch
>> or at least help figure out what does not work correctly in the
>> current hardware detection code.
>
>
> Currently I don't have access to NumaScale hardware, but we plan to get a
> small system (2x2x2 3D torus) by the end of this year, so it will be
> possible to check what's going wrong. As far as I know, even the hwloc
> code currently doesn't work well on NumaScale.

In that case, since GROMACS (I assume) uses an approach similar to hwloc's
- reading package/core IDs via cpuid to map the hardware thread layout
onto OS CPU ids - a fix will probably require either
- detecting a Numascale system, disabling pinning, and leaving it to the
user, or
- implementing a special case for Numascale with a hard-coded OS CPU id
to hardware thread mapping (which is probably just a linear 1-1 mapping);
a rough sketch of such a fallback follows below.
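To make the two options concrete, here is a minimal, hypothetical sketch
(not mdrun code). The is_numascale_system() helper is purely illustrative -
the real detection would have to be written by someone with access to the
hardware, e.g. from DMI/SMBIOS vendor strings - and the "linear 1-1" branch
simply trusts the OS CPU numbering instead of the cpuid-derived topology.

/* Hypothetical sketch only -- not mdrun's hardware detection code. */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

static int is_numascale_system(void)
{
    /* Placeholder detection; the real check is left to someone with
     * access to Numascale hardware. */
    return 0;
}

/* Fill map[i] with the hardware thread that OS CPU i corresponds to.
 * Returns 0 on success, -1 if pinning should be disabled instead. */
static int build_cpu_to_hwthread_map(int *map, int ncpus)
{
    int i;

    if (is_numascale_system())
    {
#ifdef NUMASCALE_DISABLE_PINNING
        /* Option 1: bail out and leave affinity to the user/scheduler. */
        return -1;
#else
        /* Option 2: assume a linear 1-1 OS CPU id -> hw thread mapping. */
        for (i = 0; i < ncpus; i++)
        {
            map[i] = i;
        }
        return 0;
#endif
    }
    /* ... the usual cpuid-based topology detection would go here ... */
    for (i = 0; i < ncpus; i++)
    {
        map[i] = i;
    }
    return 0;
}

int main(void)
{
    int  ncpus = (int)sysconf(_SC_NPROCESSORS_ONLN);
    int *map;

    if (ncpus < 1)
    {
        ncpus = 1;
    }
    map = malloc(ncpus * sizeof(*map));
    if (build_cpu_to_hwthread_map(map, ncpus) == 0)
    {
        printf("built mapping for %d logical CPUs\n", ncpus);
    }
    else
    {
        printf("pinning disabled on this platform\n");
    }
    free(map);
    return 0;
}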

>
>
>> Cheers,
>> --
>> Szilárd
>>
>>>
>>> [1] http://numascale.com/numa_access.php
>>>
>>> Szilárd Páll wrote on 19-09-2013 21:53:
>>>
>>>> Hi,
>>>>
>>>> I would like to get feedback on an issue (or more precisely a set of
>>>> issues) related to thread/process affinities and
>>>> i) the way we should (or should not) tweak the current behavior and
>>>> ii) the way we should proceed in the future.
>>>>
>>>>
>>>> Brief introduction, skip this if you are familiar with the
>>>> implementation details:
>>>> Currently, mdrun always sets per-thread affinity if the number of
>>>> threads is equal to the number of "CPUs" detected (as reported by the
>>>> OS, i.e. roughly the number of hardware threads supported). However,
>>>> if this is not the case, e.g. because one wants to leave some cores
>>>> empty (to run multiple simulations per node) or to avoid using HT,
>>>> thread pinning is not done. This can have quite harsh consequences
>>>> for performance - especially when OpenMP parallelization is used
>>>> (most notably with GPUs).
>>>> Additionally, we try hard not to override externally set affinities,
>>>> which means that if mdrun detects a non-default affinity, it will not
>>>> pin threads (not even if -pin on is used). This happens if the job
>>>> scheduler sets the affinity, or if the user sets it e.g. with
>>>> KMP_AFFINITY/GOMP_CPU_AFFINITY, taskset, etc., and even if the MPI
>>>> implementation sets the affinity of only its own thread.
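As an illustration of the two rules described above, here is a minimal,
Linux-only sketch (not the actual mdrun implementation): it pins only when
the thread count matches the number of logical CPUs the OS reports, and it
backs off when the process already inherited a narrower affinity mask (set
by a job scheduler, taskset, KMP_AFFINITY/GOMP_CPU_AFFINITY, or the MPI
runtime). The names and the fixed thread count are illustrative only.

/* Hedged sketch only, not mdrun's implementation (Linux-specific). */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <unistd.h>

/* Return 1 if the inherited affinity mask is narrower than the node,
 * i.e. something external (scheduler, taskset, MPI, ...) already set it. */
static int affinity_set_externally(int ncpus_online)
{
    cpu_set_t mask;

    CPU_ZERO(&mask);
    if (sched_getaffinity(0, sizeof(mask), &mask) != 0)
    {
        return 0; /* cannot tell; assume the default mask */
    }
    return CPU_COUNT(&mask) < ncpus_online;
}

/* Pin the calling thread to a single logical CPU. */
static int pin_this_thread(int cpu)
{
    cpu_set_t mask;

    CPU_ZERO(&mask);
    CPU_SET(cpu, &mask);
    return pthread_setaffinity_np(pthread_self(), sizeof(mask), &mask);
}

int main(void)
{
    int ncpus    = (int)sysconf(_SC_NPROCESSORS_ONLN);
    int nthreads = ncpus; /* in mdrun this would come from -nt/OMP_NUM_THREADS */

    if (nthreads != ncpus || affinity_set_externally(ncpus))
    {
        printf("not pinning: nthreads=%d, ncpus=%d\n", nthreads, ncpus);
        return 0;
    }
    /* In a real run each worker thread would pin itself to its own index;
     * here we only pin the main thread to CPU 0 as a demonstration. */
    pin_this_thread(0);
    printf("pinned main thread to CPU 0\n");
    return 0;
}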
>>>>
>>>>
>>>> On the one hand, there was a request (see
>>>> http://redmine.gromacs.org/issues/1122) that we should allow forcing
>>>> affinity setting by mdrun, either by making "-pin on" more aggressive
>>>> or by adding a "-pin force" option. Please check out the discussion
>>>> on the issue page and state whether you agree and which behavior you
>>>> support.
>>>>
>>>>
>>>> On the other hand, more generally, I would like to get feedback on
>>>> what people's experience is with affinity setting. I'll just list a
>>>> few aspects of this issue that should be considered, but feel free to
>>>> raise other issues:
>>>> - per-process vs per-thread affinity;
>>>> - affinity set by, or required for optimal performance by, the
>>>> MPI/communication software stack;
>>>> - GPU/accelerator NUMA aspects;
>>>> - hwloc;
>>>> - leaving a core empty for interrupts (AMD/Cray?), MPI, NIC, or GPU
>>>> driver threads.
>>>>
>>>> Note that this part of the discussion is aimed more at the future
>>>> behavior of mdrun. This is especially relevant as the next major (?)
>>>> version is being planned/developed and new tasking/parallelization
>>>> design options are being explored.
>>>>
>>>> Cheers,
>>>> --
>>>> Szilárd
>>>
>
>
> --
> Best Regards,
> Alexey 'Alexxy' Shvetsov
> Petersburg Nuclear Physics Institute, NRC Kurchatov Institute, Gatchina,
> Russia
> Department of Molecular and Radiation Biophysics
> Gentoo Team Ru
> Gentoo Linux Dev
> mailto:alexxyum at gmail.com
> mailto:alexxy at gentoo.org
> mailto:alexxy at omrb.pnpi.spb.ru


