[gmx-users] the importance of process/thread affinity, especially in node sharing setups [fork of Re: performance]

Szilárd Páll pall.szilard at gmail.com
Mon Sep 18 16:07:01 CEST 2017


>>> However, note that if you are sharing a node with others, if their jobs
> are not correctly affinitized, those processes will affect the performance
> of your job.
>
> Yes exactly. In this case I would need to manually set pinoffset, but this
> can be a bit frustrating if other GROMACS users are not binding :)
> Would it be possible to fix this in the default algorithm, though I am
> unaware of other issues it might cause? Also -multidir is not convenient
> sometimes when a job crashes in the middle and automatic restart from the
> cpt file would be difficult.

Let me be very explicit and clear about this to avoid misunderstandings:

This is *not a problem* in GROMACS, but rather a property of any
modern multicore system: either you set the right affinities for the
use case (considering workload, node utilization, hardware locality,
and scaling concerns), or job/process/thread locality -- where a job
runs and where its data lives within a node -- is left to the
operating system, becomes a matter of luck, and will rarely be
optimal.
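On Linux you can always check what affinity mask a process has
inherited; a minimal sketch using standard util-linux/procfs tools,
nothing GROMACS-specific:

```shell
# Show the CPU affinity mask of the current shell -- and hence of any
# process launched from it (util-linux 'taskset', Linux only):
taskset -cp $$

# The same information via procfs:
grep Cpus_allowed_list /proc/self/status
```

If the mask covers all cores of the node even though your job was
granted only a few, nothing is preventing thread migration.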

While mdrun tries to help users obtain good and consistent performance
by either setting affinities itself (when it can assume it runs on the
full node) or helping to set them, it is ultimately the responsibility
of users and job schedulers to get job placement right -- especially in
node-sharing setups. At allocation time the job scheduler knows which
resources (cores, memory, GPUs) the user has been granted, and it
should place and affinitize jobs accordingly -- which mdrun does
respect.
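With SLURM, for example, this means requesting resources and binding at
launch time. A hypothetical job-script sketch follows; the exact flags
and resource names depend on your site's SLURM setup, so treat this
only as an illustration, not a recipe:

```shell
# Write a hypothetical SLURM job script for one 4-rank mdrun job:
cat > job.sh <<'EOF'
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=4          # 4 MPI ranks
#SBATCH --cpus-per-task=2   # 2 OpenMP threads per rank
#SBATCH --gres=gpu:2        # the GPUs granted to this job

# srun binds each rank to its allocated cores; mdrun respects the
# inherited affinity mask and pins its threads within it.
srun --cpu-bind=cores gmx_mpi mdrun -s test -ntomp 2 -pin on
EOF
# sbatch job.sh   # submit (commented out: requires a SLURM cluster)
```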


A bit more technical detail for the curious: data resides in different
levels of memory (from global memory down through the L3, L2, and L1
caches), and if a job starts running on, say, cores 0-3, its working
set will be pulled into the private and shared caches closest to those
cores. If the job is not affinitized, two threads running on cores 0
and 1 could later be moved to, say, cores 9-10. As a result, these two
unlucky threads will "lose" their private caches -- or, even worse, if
cores 9-10 are on the second socket, they will also lose the shared
cache and the ability to share data quickly with the other two threads
of the same mdrun run. For this reason, if your job is meant to run on
four cores, say cores 0-3, its process affinity mask should be set
accordingly to prevent its threads from migrating to other cores.
Note that this is a simplified example specific to a use case that can
hurt the performance of GROMACS runs; different affinity patterns will
be optimal for other types of compute workloads.
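To make the four-core example concrete: on Linux the process-level mask
can be set with taskset, and mdrun's own -pin options do the same per
thread. The mdrun lines below are illustrative only and assume two
four-thread runs sharing one node:

```shell
# Restrict a process (and all threads it spawns) to cores 0-3 so the
# kernel cannot migrate them to another socket. The child process
# reports the restricted mask it inherited:
taskset -c 0-3 sh -c 'grep Cpus_allowed_list /proc/self/status'

# mdrun can pin per-thread itself; e.g. two 4-thread runs sharing a
# node without overlapping core sets (see mdrun's -pin/-pinoffset docs):
#   gmx mdrun -ntomp 4 -pin on -pinoffset 0 ...
#   gmx mdrun -ntomp 4 -pin on -pinoffset 4 ...
```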

Cheers,
--
Szilárd



On Thu, Sep 14, 2017 at 1:02 PM, gromacs query <gromacsquery at gmail.com> wrote:
> Hi Szilárd,
>
> Here are my replies:
>
>>> Did you run the "fast" single job on an otherwise empty node? That might
> explain it as, when most of the CPU cores are left empty, modern CPUs
> increase clocks (turbo boost) on the used cores higher than they could with
> all cores busy.
>
> Yes, the "fast" single job was on an empty node. Sorry, I don't get it when
> you say 'modern CPUs increase clocks' -- do you mean the ns/day I get is not
> representative in that case?
>
>>> and if you post an actual log I can certainly give more informed comments
>
> Sure, if it's ok, can I post it off-list to you?
>
>>> However, note that if you are sharing a node with others, if their jobs
> are not correctly affinitized, those processes will affect the performance
> of your job.
>
> Yes exactly. In this case I would need to manually set pinoffset, but this
> can be a bit frustrating if other GROMACS users are not binding :)
> Would it be possible to fix this in the default algorithm, though I am
> unaware of other issues it might cause? Also -multidir is not convenient
> sometimes when a job crashes in the middle and automatic restart from the
> cpt file would be difficult.
>
> -J
>
>
> On Thu, Sep 14, 2017 at 11:26 AM, Szilárd Páll <pall.szilard at gmail.com>
> wrote:
>
>> On Wed, Sep 13, 2017 at 11:14 PM, gromacs query <gromacsquery at gmail.com>
>> wrote:
>> > Hi Szilárd,
>> >
>> > Thanks again. I tried now with -multidir like this:
>> >
>> > mpirun -np 16 gmx_mpi mdrun -s test -ntomp 2 -maxh 0.1 -multidir t1 t2
>> t3 t4
>> >
>> > So this runs 4 jobs on the same node, so for each job np = 16/4, and
>> > each job uses 2 GPUs. I now get quite improved and equal performance for
>> > each job (~220 ns/day), though still slightly less than a single
>> > independent job (where I get 300 ns/day). I can live with that, but -
>>
>> That is not normal; it is more likely a benchmarking
>> discrepancy: you are likely not comparing apples to apples. Did you
>> run the "fast" single job on an otherwise empty node? That might
>> explain it: when most of the CPU cores are left empty, modern CPUs
>> increase clocks (turbo boost) on the used cores higher than they could
>> with all cores busy.
>>
>> > Surprised: There are at most 40 cores and 8 GPUs per node, and thus my 4
>> > jobs should consume 8 GPUs.
>>
>> Note that even if those are 40 real cores (rather than 20 cores with
>> HyperThreading), the current GROMACS release is unlikely to run
>> efficiently with fewer than 6-8 cores per GPU. This will likely change
>> with the next release.
>>
>> > So I am a bit surprised by the fact that the same node on which my four
>> > jobs were running was already occupied with jobs by some other user,
>> > which I think should not happen (maybe a slurm.conf admin issue?).
>> > Either some of my jobs should have gone into the queue or run on another
>> > node if one was free.
>>
>> Sounds like a job scheduler issue (you can always check in the log the
>> detected hardware) -- and if you post an actual log I can certainly
>> give more informed comments.
>>
>> > What to do: Importantly, though, as an individual user I can submit a
>> > -multidir job, but suppose, as is normally the case, there are many
>> > other unknown users who each submit one or two jobs; in that case
>> > performance will be an issue (which is equivalent to my case when I
>> > submit many jobs without -multi/-multidir).
>>
>> Not sure I follow: if you always have a number of similar runs to do,
>> submit them together and benefit from not having to do manual hardware
>> assignment. Otherwise, if your cluster relies on node sharing, you
>> will have to make sure that you specify the affinity/binding
>> arguments to your job scheduler correctly (or work around it with
>> manual offset calculation). However, note that if you are sharing a
>> node with others and their jobs are not correctly affinitized, those
>> processes will affect the performance of your job.
>>
>> > I think still they will need -pinoffset. Could you
>> > please suggest what best can be done in such case?
>>
>> See above.
>>
>> Cheers,
>> --
>> Szilárd
>>
>> >
>> > -Jiom
>> >
>> >
>> >
>> >
>> > On Wed, Sep 13, 2017 at 9:15 PM, Szilárd Páll <pall.szilard at gmail.com>
>> > wrote:
>> >
>> >> Hi,
>> >>
>> >> First off, have you considered option 2), using multi-sim? That would
>> >> allow you to not have to bother manually setting offsets. Can you not
>> >> submit your jobs such that you fill at least a node?
>> >>
>> >> How many threads/cores does your node have? Can you share log files?
>> >>
>> >> Cheers,
>> >> --
>> >> Szilárd
>> >>
>> >>
>> >> On Wed, Sep 13, 2017 at 9:14 PM, gromacs query <gromacsquery at gmail.com>
>> >> wrote:
>> >> > Hi Szilárd,
>> >> >
>> >> > Sorry, I was a bit quick to say it's working with pinoffset. I just
>> >> > submitted four identical jobs (2 GPUs, 4 nprocs) on the same node with
>> >> > -pin on and different -pinoffset values of 0, 5, 10, 15 (the numbers
>> >> > should be fine as there are 40 cores on the node). Still I don't get
>> >> > the same performance (all variably less than 50%) as expected from a
>> >> > single independent job. Now I am wondering if it's still related to
>> >> > overlap of cores, as pin on should lock the cores for the same job.
>> >> >
>> >> > -J
>> >> >
>> >> > On Wed, Sep 13, 2017 at 7:33 PM, gromacs query <
>> gromacsquery at gmail.com>
>> >> > wrote:
>> >> >
>> >> >> Hi Szilárd,
>> >> >>
>> >> >> Thanks, option 3 was in my mind, but now I need to figure out how :)
>> >> >> Manually fixing pinoffset seems to be working, based on some quick
>> >> >> tests. I think option 1 would require asking the admin, but I can try
>> >> >> option 3 myself. As there are other users from different places who
>> >> >> may not bother using option 3, I think I would need to ask the admin
>> >> >> to enforce option 1, but before that I will try option 3.
>> >> >>
>> >> >> JIom
>> >> >>
>> >> >> On Wed, Sep 13, 2017 at 7:10 PM, Szilárd Páll <
>> pall.szilard at gmail.com>
>> >> >> wrote:
>> >> >>
>> >> >>> J,
>> >> >>>
>> >> >>> You have a few options:
>> >> >>>
>> >> >>> * Use SLURM to assign not only the set of GPUs, but also the correct
>> >> >>> set of CPU cores to each mdrun process. If you do so, mdrun will
>> >> >>> respect the affinity mask it inherits and your two mdrun jobs
>> >> >>> should be running on the right set of cores. This has the drawback
>> >> >>> that (AFAIK) SLURM/aprun (or srun) will not allow you to bind each
>> >> >>> application thread to a core/hardware thread (which is what mdrun
>> >> >>> does), only a process to a group of cores/hw threads, which can
>> >> >>> sometimes lead to performance loss. (You might be able to compensate
>> >> >>> using some OpenMP library environment variables, though.)
>> >> >>>
>> >> >>> * Run multiple jobs with mdrun "-multi"/"-multidir" (either two per
>> >> >>> node or multiple across nodes) and benefit from the rank/thread to
>> >> >>> core/hw thread assignment that is supported also across the multiple
>> >> >>> simulations of a multi-run; e.g.:
>> >> >>> mpirun -np 4 gmx mdrun -multi 4 -ntomp N -multidir my_input_dir{1,2,3,4}
>> >> >>> will launch 4 ranks and start 4 simulations, one in each of the four
>> >> >>> directories passed.
>> >> >>>
>> >> >>> * Write a wrapper script around gmx mdrun which will be what you
>> >> >>> launch with SLURM; you can then inspect the node and decide what
>> >> >>> pinoffset value to pass to your mdrun launch command.
>> >> >>>
>> >> >>>
>> >> >>> I hope one of these will deliver the desired results :)
>> >> >>>
>> >> >>> Cheers,
>> >> >>> --
>> >> >>> Szilárd
>> >> >>>
>> >> >>>
>> >> >>> On Wed, Sep 13, 2017 at 7:47 PM, gromacs query <
>> gromacsquery at gmail.com
>> >> >
>> >> >>> wrote:
>> >> >>> > Hi Szilárd,
>> >> >>> >
>> >> >>> > Thanks for your reply. This is useful, but now I am thinking that
>> >> >>> > because slurm launches jobs in an automated way, it is not really
>> >> >>> > in my control to choose the node. So the following things can
>> >> >>> > happen; say for two mdrun jobs I set -pinoffset 0 and -pinoffset 4:
>> >> >>> >
>> >> >>> > - if they are running on the same node, this is good
>> >> >>> > - if the jobs run on different nodes (partially occupied or free),
>> >> >>> > it is unclear whether these chosen pinoffsets make sense, as I
>> >> >>> > don't know what pinoffset I would need to set
>> >> >>> > - if I have to submit many jobs together and slurm itself chooses
>> >> >>> > different/same nodes, then I think it is difficult to define a
>> >> >>> > pinoffset.
>> >> >>> >
>> >> >>> > -
>> >> >>> > J
>> >> >>> >
>> >> >>> > On Wed, Sep 13, 2017 at 6:14 PM, Szilárd Páll <
>> >> pall.szilard at gmail.com>
>> >> >>> > wrote:
>> >> >>> >
>> >> >>> >> My guess is that the two jobs are using the same cores -- either
>> all
>> >> >>> >> cores/threads or only half of them, but the same set.
>> >> >>> >>
>> >> >>> >> You should use -pinoffset; see:
>> >> >>> >>
>> >> >>> >> - Docs and example:
>> >> >>> >> http://manual.gromacs.org/documentation/2016/user-guide/
>> >> >>> >> mdrun-performance.html
>> >> >>> >>
>> >> >>> >> - More explanation on the thread pinning behavior on the old
>> >> website:
>> >> >>> >> http://www.gromacs.org/Documentation/Acceleration_
>> >> >>> >> and_parallelization#Pinning_threads_to_physical_cores
>> >> >>> >>
>> >> >>> >> Cheers,
>> >> >>> >> --
>> >> >>> >> Szilárd
>> >> >>> >>
>> >> >>> >>
>> >> >>> >> On Wed, Sep 13, 2017 at 6:35 PM, gromacs query <
>> >> gromacsquery at gmail.com
>> >> >>> >
>> >> >>> >> wrote:
>> >> >>> >> > Sorry, forgot to add: we thought the two jobs were using the
>> >> >>> >> > same GPU ids, but CUDA_VISIBLE_DEVICES shows the two jobs are
>> >> >>> >> > using different ids (0,1 and 2,3)
>> >> >>> >> >
>> >> >>> >> > -
>> >> >>> >> > J
>> >> >>> >> >
>> >> >>> >> > On Wed, Sep 13, 2017 at 5:33 PM, gromacs query <
>> >> >>> gromacsquery at gmail.com>
>> >> >>> >> > wrote:
>> >> >>> >> >
>> >> >>> >> >> Hi All,
>> >> >>> >> >>
>> >> >>> >> >> I have some issues with GROMACS performance. There are many
>> >> >>> >> >> nodes, each node has a number of GPUs, and the batch process is
>> >> >>> >> >> controlled by slurm. I get good performance with some settings
>> >> >>> >> >> of the number of GPUs and nprocs, but when I submit the same
>> >> >>> >> >> job twice on the same node, the performance is reduced
>> >> >>> >> >> drastically, e.g.:
>> >> >>> >> >>
>> >> >>> >> >> For 2 GPUs I get 300 ns/day when there is no other job running
>> >> >>> >> >> on the node. When I submit the same job twice on the same node,
>> >> >>> >> >> at the same time, I get only 17 ns/day for both jobs. I am
>> >> >>> >> >> using this:
>> >> >>> >> >>
>> >> >>> >> >> mpirun -np 4 gmx_mpi mdrun -deffnm test -ntomp 2 -maxh 0.12
>> >> >>> >> >>
>> >> >>> >> >> Any suggestions highly appreciated.
>> >> >>> >> >>
>> >> >>> >> >> Thanks
>> >> >>> >> >>
>> >> >>> >> >> Jiom
>> >> >>> >> >>
>> >> >>> >> > --
>> >> >>> >> > Gromacs Users mailing list
>> >> >>> >> >
>> >> >>> >> > * Please search the archive at http://www.gromacs.org/
>> >> >>> >> Support/Mailing_Lists/GMX-Users_List before posting!
>> >> >>> >> >
>> >> >>> >> > * Can't post? Read http://www.gromacs.org/
>> Support/Mailing_Lists
>> >> >>> >> >
>> >> >>> >> > * For (un)subscribe requests visit
>> >> >>> >> > https://maillist.sys.kth.se/mailman/listinfo/gromacs.org_
>> >> gmx-users
>> >> >>> or
>> >> >>> >> send a mail to gmx-users-request at gromacs.org.
>> >> >>
>> >> >>