[gmx-users] simulation on 2 gpus

Kevin Boyd kevin.boyd at uconn.edu
Fri Jul 26 04:30:49 CEST 2019


Hi,

I've done a lot of research/experimentation on this, so I can maybe get you
started - if anyone has any questions about the essay to follow, feel free
to email me personally, and I'll link it to the email thread if it ends up
being pertinent.

First, there are some more internet resources to check out. See Mark's talk at
https://bioexcel.eu/webinar-performance-tuning-and-optimization-of-gromacs/
Gromacs development moves fast, but a lot of it is still relevant.

I'll expand a bit here, with the caveat that Gromacs GPU development is
moving very fast and so the correct commands for optimal performance are
both system-dependent and a moving target between versions. This is a good
thing - GPUs have revolutionized the field, and with each iteration we make
better use of them. The downside is that it's unclear exactly what sort of
CPU-GPU balance you should look to purchase to take advantage of future
developments, though the trend is certainly that more and more computation
is being offloaded to the GPUs.

The most important consideration is that to get maximum total throughput
performance, you should be running not one but multiple simulations
simultaneously. You can do this through the -multidir option, but I don't
recommend that in this case, as it requires compiling with MPI and limits
some of your options. My run scripts usually use "gmx mdrun ... &" to
initiate subprocesses, with combinations of -ntomp, -ntmpi, -pin,
-pinoffset, and -gputasks. I can give specific examples if you're
interested.
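
For instance, a minimal sketch of such a run script (assuming two GPUs,
16 cores set aside for these two runs, and hypothetical inputs
run0.tpr/run1.tpr - adapt the thread counts and pin offsets to your own
topology, and check the pinning report mdrun writes to the log):

#!/bin/bash
# two independent runs, one per GPU, each pinned to its own block of cores;
# with -ntmpi 1 and -pme gpu each run has two GPU tasks (nonbonded + PME),
# hence the two-digit -gputasks strings
gmx mdrun -deffnm run0 -ntmpi 1 -ntomp 8 -pme gpu -gputasks 00 \
    -pin on -pinoffset 0 > run0.out 2>&1 &
gmx mdrun -deffnm run1 -ntmpi 1 -ntomp 8 -pme gpu -gputasks 11 \
    -pin on -pinoffset 8 > run1.out 2>&1 &
wait    # keep the script alive until both runs finish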

Another important point is that you can run more simulations than the
number of GPUs you have. Depending on CPU-GPU balance and quality, you
won't double your throughput by e.g. putting 4 simulations on 2 GPUs, but
you might increase it up to 1.5x. This would involve targeting the same GPU
with -gputasks.
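
As a rough sketch (same caveats as above - names, thread counts, and
offsets are placeholders), the only real change is that pairs of runs
point their -gputasks at the same GPU:

# four runs sharing two GPUs: two target GPU 0, two target GPU 1
gmx mdrun -deffnm run0 -ntmpi 1 -ntomp 8 -pme gpu -gputasks 00 \
    -pin on -pinoffset 0 &
gmx mdrun -deffnm run1 -ntmpi 1 -ntomp 8 -pme gpu -gputasks 00 \
    -pin on -pinoffset 8 &
gmx mdrun -deffnm run2 -ntmpi 1 -ntomp 8 -pme gpu -gputasks 11 \
    -pin on -pinoffset 16 &
gmx mdrun -deffnm run3 -ntmpi 1 -ntomp 8 -pme gpu -gputasks 11 \
    -pin on -pinoffset 24 &
wait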

Within a simulation, you should set up a benchmarking script to figure out
the best combination of thread-MPI ranks and OpenMP threads - this can
have pretty drastic effects on performance. For example, if you want to use
your entire machine for one simulation (not recommended for maximal
efficiency), you have a lot of decomposition options (ignoring PME - which
is important, see below):

-ntmpi 2 -ntomp 32 -gputasks 01
-ntmpi 4 -ntomp 16 -gputasks 0011
-ntmpi 8 -ntomp 8  -gputasks 00001111
-ntmpi 16 -ntomp 4 -gputasks 0000000011111111
(and a few others - note that ntmpi * ntomp should equal the total number
of threads available, 64 in this case)

In my experience, you need to scan the options in a benchmarking script for
each simulation size/content you want to simulate, and the difference
between the best and the worst can be up to a factor of 2-4 in terms of
performance. If you're splitting your machine among multiple simulations, I
suggest running 1 thread-MPI rank (-ntmpi 1) per simulation, unless your
benchmarking suggests that the optimal performance lies elsewhere.
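
As a hypothetical sketch of such a scan (bench.tpr, the step count, and
the 64-thread total are placeholders), assuming only the short-range work
is offloaded as in the list above:

#!/bin/bash
# scan thread-MPI rank / OpenMP thread combinations for one benchmark system
total_threads=64
for ntmpi in 2 4 8 16; do
    ntomp=$((total_threads / ntmpi))
    # build a -gputasks string that puts half the ranks on each GPU
    gputasks=""
    for ((i = 0; i < ntmpi; i++)); do
        if (( i < ntmpi / 2 )); then gputasks+="0"; else gputasks+="1"; fi
    done
    # short run; -resethway restarts the timers halfway through so startup
    # and initial load balancing don't skew the numbers
    gmx mdrun -s bench.tpr -deffnm bench_ntmpi${ntmpi} -ntmpi ${ntmpi} \
        -ntomp ${ntomp} -gputasks ${gputasks} -nsteps 20000 -resethway -pin on
    grep Performance bench_ntmpi${ntmpi}.log    # ns/day, printed near the end
done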

Things get more complicated when you start putting PME on the GPUs. For the
machines I work on, putting PME on GPUs absolutely improves performance,
but I'm not fully confident in that assessment without testing your
specific machine - you have a lot of cores with that threadripper, and this
is another area where I expect Gromacs 2020 might shift the GPU-CPU optimal
balance.

The issue with PME on GPUs is that we can (currently) only have one rank
doing GPU PME work. So, if we have a machine with, say, 20 cores and 2 GPUs,
and I run the following:

gmx mdrun .... -ntomp 10 -ntmpi 2 -pme gpu -npme 1 -gputasks 01

then two ranks will be started: one, with cores 0-9, will work on the
short-range interactions, offloading what it can to GPU 0, and the PME
rank (cores 10-19) will offload to GPU 1. There is one significant problem
(and one minor problem) with this setup. First, it is massively inefficient
in terms of load balance. In a typical system (there are exceptions), PME
takes up ~1/3 of the computation that short-range interactions take. So, we
are offloading 1/4 of our interactions to one GPU and 3/4 to the other,
which leads to imbalance. In this specific case (2 GPUs and sufficient
cores), the optimal solution is often (but not always) to run with
-ntmpi 4 (in this example, then -ntomp 5), as the PME rank then gets 1/4 of
the GPU instructions, proportional to the computation needed.
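
Concretely, on that 20-core/2-GPU example the command would be something
like the line below; note that which -gputasks digit ends up on the PME
rank depends on how mdrun orders the ranks, so it's worth confirming the
task assignment it prints at the top of the log:

gmx mdrun .... -ntmpi 4 -ntomp 5 -npme 1 -pme gpu -gputasks 0011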

The second (less critical - don't worry about this unless you're
CPU-limited) problem is that a GPU PME rank only uses 1 CPU core in its
calculations. So, with a node of 20 cores and 2 GPUs, if I run a simulation
with -ntmpi 4 -ntomp 5 -npme 1 -pme gpu, each one of those ranks
will have 5 CPU cores, but the PME rank will only use one of them. You can
specify the number of PME cores per rank with -ntomp_pme. This is useful in
restricted cases. For example, given the above architecture setup (20
cores, 2 GPUs), I could maximally exploit my CPUs with the following
commands:

gmx mdrun .... -ntmpi 4 -ntomp 3 -ntomp_pme 1 -pme gpu -npme 1 \
    -gputasks 0000 -pin on -pinoffset 0 &
gmx mdrun .... -ntmpi 4 -ntomp 3 -ntomp_pme 1 -pme gpu -npme 1 \
    -gputasks 1111 -pin on -pinoffset 10

where the first 10 cores are assigned as (0-2: PP) (3-5: PP) (6-8: PP)
(9: PME), and similarly for the other 10 cores.

There are a few other parameters to scan for minor improvements - for
example nstlist, which I typically scan in a range between 80 and 140 for
GPU simulations, with an effect of around 2-5% on performance.
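
A quick hypothetical sketch of that scan, reusing the bench.tpr
placeholder from above (-nstlist on the mdrun command line overrides the
value from the .mdp file):

# scan neighbour-list update intervals
for nstl in 80 100 120 140; do
    gmx mdrun -s bench.tpr -deffnm nstl${nstl} -nstlist ${nstl} \
        -nsteps 20000 -resethway -pin on
    grep Performance nstl${nstl}.log
done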

I'm happy to expand the discussion with anyone who's interested.

Kevin


On Thu, Jul 25, 2019 at 1:47 PM Stefano Guglielmo <
stefano.guglielmo at unito.it> wrote:

> Dear all,
> I am trying to run simulations with Gromacs 2019.2 on a workstation with an
> amd Threadripper cpu (32 cores, 64 threads), 128 GB RAM, and two rtx 2080
> ti cards with an nvlink bridge. I read the user's guide section regarding
> performance and I am exploring some possible combinations of cpu/gpu work
> to run as fast as possible. I was wondering if any of you have experience
> running on more than one gpu with several cores and can give some hints as
> a starting point.
> Thanks
> Stefano
>
>
> --
> Stefano GUGLIELMO PhD
> Assistant Professor of Medicinal Chemistry
> Department of Drug Science and Technology
> Via P. Giuria 9
> 10125 Turin, ITALY
> ph. +39 (0)11 6707178