[gmx-users] multi-replica runs with GPUs [fork of Re: Gromacs 2018 and GPU PME]

Szilárd Páll pall.szilard at gmail.com
Tue Feb 20 17:40:25 CET 2018


Hi Dan,

On Fri, Feb 9, 2018 at 4:56 PM, Daniel Kozuch <dan.kozuch at gmail.com> wrote:

> Szilárd,
>
> If I may jump in on this conversation,



Let's fork the thread so the topics stay clear and discoverable by others,
please.

> I am having the reverse problem
> (which I assume others may encounter also) where I am attempting a large
> REMD run (84 replicas) and I have access to say 12 GPUs and 84 CPUs.
>

OK. That's a useful case to clarify. If you have exactly 84 CPUs for your
84 runs, and since that number is divisible by the number of GPUs too, it
should be as simple as running
mpirun -np 84 gmx_mpi mdrun -multi 84
and the automatic mapping of ranks/threads to hardware should work just
fine.
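
If you ever need to spell out the mapping explicitly, here is a hedged
sketch (assuming, purely for illustration, that 7 replicas land on each
node and each node has a single GPU, device 0; PME is kept on the CPU so
each rank contributes exactly one GPU task):
mpirun -np 84 gmx_mpi mdrun -multi 84 -nb gpu -pme cpu -gputasks 0000000  # 7 tasks per node, all on GPU 0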

> Basically I have fewer GPUs than simulations. Is there a logical approach to
> using gputasks and other new options in GROMACS 2018 for this setup? I read
> through the available documentation, but as you mentioned it seems to be
> targeted at single-GPU runs.
>


TL;DR
Correction: PME GPU offload is tuned for single-GPU (more precisely,
single-domain, i.e. no domain decomposition) simulations. The
-gpu_id/-gputasks options have been fairly well thought through to be
useful for all current and most future use-cases ;) Therefore, the use of
-gputasks is valid and useful in multi-rank and multi-sim/replica runs as
well: it will still map GPUs to tasks/ranks within a node.


You'll find a somewhat more detailed write-up below which might refresh &
clarify the details of mdrun's internal workings.


----

First, a brief recap of the basics (for those inexperienced or needing a
reminder):
GROMACS uses heterogeneous offload parallelization, i.e. CPU & GPU are both
used. This has a number of benefits and drawbacks; what is relevant in this
context is that for a highly tuned MD engine it is generally difficult to
achieve perfect load balance and 100% utilization of both CPU and GPU (in
fact, the same applies to the ranks of an MPI-parallel run). Consequently,
part of the runtime will be spent waiting for the GPU on the CPU or vice
versa. GPUs are easy to share among multiple dependent or independent jobs
(unlike CPUs/cores), and this can help make use of otherwise idle GPU time.
By setting up threads/processes to share GPUs (e.g. ranks in a multi-sim
run or even independent program executions), these can fill the GPU
utilization "gaps", generally resulting in better overall efficiency, e.g.
higher aggregate ns/day. GPU sharing is of moderate importance in
multi-GPU/node parallel runs (i.e. it is useful to run a few ranks per
GPU), but it is even more important in throughput-type use with a few to
many independent runs -- as a reminder, we've talked about this in our
paper and most of that information is still useful and valid; see Fig 5
and the related discussion of https://goo.gl/FvkGC7
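
To make the sharing concrete, here is a minimal sketch (the core counts and
file names are assumptions for illustration; I'm assuming a 12-core node
without hyperthreading, otherwise adjust -pinoffset/-pinstride): two
independent single-rank runs pinned to separate halves of the node, both
mapped to GPU 0 so that each can fill the other's idle GPU time:
# two independent runs sharing GPU 0, each on 6 cores (assumed node layout)
gmx mdrun -deffnm run1 -ntmpi 1 -ntomp 6 -pin on -pinoffset 0 -gpu_id 0 &
gmx mdrun -deffnm run2 -ntmpi 1 -ntomp 6 -pin on -pinoffset 6 -gpu_id 0 &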


Back to the use of -gpu_id / -gputasks. The slight change is that the roles
of -gpu_id have been separated: previously it specified *both* the IDs of
the devices to use as well as the mapping of the devices to the tasks/ranks
in a simulation. In the 2018 release -gpu_id *only specifies which devices
to use*, while -gputasks provides the mapping (and it is in most cases
optional).
E.g. these three are equivalent, but the first is not valid in v2018:
mdrun -ntmpi 4 -gpu_id 0011 -nb gpu  # map two GPUs to four ranks -- in v2016
mdrun -ntmpi 4 [-gpu_id 01] -gputasks 0011 -nb gpu [-pme cpu]  # use GPUs 0 and 1 and map them two-by-two to the four PP ranks -- in v2018
mdrun -ntmpi 4  # lazy default for both v2016 and v2018 -- assuming there are only two GPUs and given that PME runs on the CPU by default with DD

Now to PME-GPU: before v2018 there was only a single task type offloaded,
so -gpu_id mapped GPUs to the PP tasks within a node (combined PP+PME or
separate PP ranks); for multiple ranks using the same GPU, the ID was
simply repeated. With the additional PME task to offload (which can
"reside" either in the same rank as the PP task or in a separate rank),
the mapping has to account for PME too. What may not be crystal clear from
the docs is that the order of tasks is generally PP first, PME next (both
within a rank and across ranks); also, PME ranks are by default interleaved
(unless changed with -ddorder), so in an 8-rank, 6 PP / 2 PME setup the
rank order is 3 PP / 1 PME / 3 PP / 1 PME (this is, however, not supported
with GPUs!)
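
To illustrate that ordering (a sketch; the rank count and GPU IDs are just
example values): with 4 thread-MPI ranks of which one is a separate PME
rank, the -gputasks string lists the three PP tasks first and the PME task
last, so it is the final digit that maps the PME task:
gmx mdrun -ntmpi 4 -npme 1 -nb gpu -pme gpu -gputasks 0001  # PP tasks on GPU 0, PME task on GPU 1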

A few examples leading up to the multi-replica runs:
* a single GPU / single rank run
gmx mdrun -ntmpi 1 -gputasks 00 -nb gpu -pme gpu # verbose command line
gmx mdrun -ntmpi 1 -gpu_id 0 # will do the same as above
gmx mdrun -ntmpi 1 # will also do the same as above ;)

* single node, 2 GPUs, with separate PME -- note the limitations of this
offload mode discussed previously (in the mail I've just forked!)
gmx mdrun [-pme gpu -nb gpu] -ntmpi 8 -ntomp 6 -npme 1 -gputasks 00000001  # assuming 24 cores / 48 threads / 2 GPUs
gmx mdrun [-pme gpu -nb gpu] -ntmpi 12 -ntomp 4 -npme 1 -gputasks 000000000011  # could be more efficient than the above

* single node, 2 GPUs, 4 replicas, 1 rank each, 2-way sharing
mpirun -np 4 gmx_mpi mdrun -multi 4 -pme gpu -nb gpu -gputasks 0011
mpirun -np 4 gmx_mpi mdrun -multi 4  # equivalent to the above, assuming 2 GPUs

It can be worth trying at least 2-4 sims per GPU (especially if there are
enough replicas and individual run performance is less important). What you
*need* to ensure, for performance reasons, is that you have at least 1
core per GPU (cores, not hardware threads); see the sketch below.
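
A hedged sketch of such 4-way sharing (the replica count, node layout, and
GPU IDs are assumed for illustration; PME is kept on the CPU here so that
each rank contributes exactly one GPU task to the mapping):
mpirun -np 8 gmx_mpi mdrun -multi 8 -nb gpu -pme cpu -gputasks 00001111  # 8 replicas on one node, 4 per GPU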

Cheers,
--
Szilárd



>
> Thanks so much,
> Dan
>
>
>
> On Fri, Feb 9, 2018 at 10:27 AM, Szilárd Páll <pall.szilard at gmail.com>
> wrote:
>
> > On Fri, Feb 9, 2018 at 4:25 PM, Szilárd Páll <pall.szilard at gmail.com>
> > wrote:
> >
> > > Hi,
> > >
> > > First of all, have you read the docs (admittedly somewhat brief)?
> > > http://manual.gromacs.org/documentation/2018/user-guide/mdrun-performance.html#types-of-gpu-tasks
> > >
> > > The current PME GPU implementation was optimized for single-GPU runs.
> > > Using multiple GPUs with PME offloaded works, but this mode hasn't been
> > > an optimization target and it will often not give very good performance.
> > > Using multiple GPUs requires a separate PME rank (as you have realized),
> > > only one such rank can be used (as we don't support PME decomposition on
> > > GPUs), and it comes with some inherent scaling drawbacks. For this
> > > reason, unless you _need_ your single run to be as fast as possible,
> > > you'll be better off running multiple simulations side-by-side.
> > >
> >
> > PS: You can of course also run on two GPUs and run two simulations
> > side-by-side (each on half of the cores) to improve the overall
> > aggregate throughput you get out of the hardware.
> >
> >
> > >
> > > A few tips for tuning the performance of a multi-GPU run with PME
> > > offload:
> > > * expect at best ~1.5x scaling to 2 GPUs (rarely to 3, if the tasks
> > > allow)
> > > * generally it's best to use about the same decomposition that you'd
> > > use with nonbonded-only offload, e.g. in your case 6-8 ranks
> > > * map the PME GPU task alone, or at most together with 1 PP rank, to a
> > > GPU, i.e. use the new -gputasks option
> > > e.g. for your case I'd expect the following to work ~best:
> > > gmx mdrun -v -deffnm md -pme gpu -nb gpu -ntmpi 8 -ntomp 6 -npme 1
> > > -gputasks 00000001
> > > or
> > > gmx mdrun -v -deffnm md -pme gpu -nb gpu -ntmpi 8 -ntomp 6 -npme 1
> > > -gputasks 00000011
> > >
> > >
> > > Let me know if that gave some improvement.
> > >
> > > Cheers,
> > >
> > > --
> > > Szilárd
> > >
> > > On Fri, Feb 9, 2018 at 8:51 AM, Gmx QA <gmxquestions at gmail.com> wrote:
> > >
> > >> Hi list,
> > >>
> > >> I am trying out the new GROMACS 2018 (really nice so far), but have a
> > >> few questions about what command line options I should specify,
> > >> specifically with the new GPU PME implementation.
> > >>
> > >> My computer has two CPUs (with 12 cores each, 24 with hyperthreading)
> > >> and two GPUs, and I currently (with 2018) start simulations like this:
> > >>
> > >> $ gmx mdrun -v -deffnm md -pme gpu -nb gpu -ntmpi 2 -npme 1 -ntomp 24
> > >> -gpu_id 01
> > >>
> > >> this works, but GROMACS prints a message that 24 OpenMP threads per
> > >> MPI rank is likely inefficient. However, when trying to reduce the
> > >> number of OpenMP threads I see a reduction in performance. Is this
> > >> message no longer relevant with GPU PME, or am I overlooking something?
> > >>
> > >> Thanks
> > >> /PK