[gmx-users] simulation on 2 gpus

Szilárd Páll pall.szilard at gmail.com
Wed Aug 21 13:44:09 CEST 2019


Hi Stefano,


On Tue, Aug 20, 2019 at 3:29 PM Stefano Guglielmo
<stefano.guglielmo at unito.it> wrote:
>
> Dear Szilard,
>
> thanks for the very clear answer.
> Following your suggestion I tried to run without DD; for the same system I
> run two simulations on two gpus:
>
> gmx mdrun -deffnm run -nb gpu -pme gpu -ntomp 28 -ntmpi 1 -npme 0
> -gputasks 00 -pin on -pinoffset 0 -pinstride 1
>
> gmx mdrun -deffnm run2 -nb gpu -pme gpu -ntomp 28 -ntmpi 1 -npme 0
> -gputasks 11 -pin on -pinoffset 28 -pinstride 1
>
> but again the system crashed; with this I mean that after few minutes the
> machine goes off (power off) without any error message, even without using
> all the threads.

That is not normal and I strongly recommend investigating it as it
could be a sign of an underlying system/hardware instability or fault
which could ultimately lead to incorrect simulation results.

Are you sure that:
- your machine is stable and reliable at high loads; is the PSU sufficient?
- your hardware has been thoroughly stress-tested and it does not show
instabilities?

Does the crash also happen with GROMACS running on the CPU only (using
all cores)?
I'd recommend running some stress-tests that fully load the machine
for a few hours to see if the error persists.

> I then tried running the two simulations on the same gpu without DD:
>
> gmx mdrun -deffnm run -nb gpu -pme gpu -ntomp 28 -ntmpi 1 -npme 0
> -gputasks 00 -pin on -pinoffset 0 -pinstride 1
>
> gmx mdrun -deffnm run2 -nb gpu -pme gpu -ntomp 28 -ntmpi 1 -npme 0
> -gputasks 00 -pin on -pinoffset 28 -pinstride 1
>
> and I obtained better performance (about 70 ns/day) with a massive use of
> the gpu (around 90%), comparing to the two runs on two gpus I reported in
> the previous post
> (gmx mdrun -deffnm run -nb gpu -pme gpu -ntomp 4 -ntmpi 7 -npme 1 -gputasks
> 0000000 -pin on -pinoffset 0 -pinstride 1
>  gmx mdrun -deffnm run2 -nb gpu -pme gpu -ntomp 4 -ntmpi 7 -npme 1
> -gputasks 1111111 -pin on -pinoffset 28 -pinstride 1).

That is expected; domain-decomposition on a single GPU is unnecessary
and introduces overheads that limit performance.
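
The pattern behind the two no-DD command lines quoted above (a single
rank, both GPU tasks on one device, the second run offset by the thread
count of the first) can be sketched as follows. This is only an
illustration: mdrun_commands is a hypothetical helper, not a GROMACS
tool, and it uses only the flags already shown in this thread.

```python
def mdrun_commands(names, gpu_ids, nthreads=28):
    """Build one single-rank (no DD) mdrun command line per run.

    Hypothetical helper for illustration; it only formats strings.
    """
    cmds = []
    for i, (name, gpu) in enumerate(zip(names, gpu_ids)):
        # With -pinstride 1 each run occupies nthreads consecutive
        # hardware threads, so run i starts at i * nthreads.
        offset = i * nthreads
        cmds.append(
            f"gmx mdrun -deffnm {name} -nb gpu -pme gpu "
            f"-ntomp {nthreads} -ntmpi 1 -npme 0 "
            # Two GPU tasks (nonbonded and PME), both on the same device.
            f"-gputasks {gpu}{gpu} -pin on "
            f"-pinoffset {offset} -pinstride 1"
        )
    return cmds


# Two independent runs sharing GPU 0, as in the quoted commands.
for cmd in mdrun_commands(["run", "run2"], [0, 0]):
    print(cmd)
```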

> As for pinning, cpu topology according to log file is:
> hardware topology: Basic
>     Sockets, cores, and logical processors:
>       Socket  0: [   0  32] [   1  33] [   2  34] [   3  35]
>                  [   4  36] [   5  37] [   6  38] [   7  39]
>                  [  16  48] [  17  49] [  18  50] [  19  51]
>                  [  20  52] [  21  53] [  22  54] [  23  55]
>                  [   8  40] [   9  41] [  10  42] [  11  43]
>                  [  12  44] [  13  45] [  14  46] [  15  47]
>                  [  24  56] [  25  57] [  26  58] [  27  59]
>                  [  28  60] [  29  61] [  30  62] [  31  63]
> If I understand well (absolutely not sure) it should not be that convenient
> to pin to consecutive threads,

On the contrary, pinning to consecutive threads is the recommended
behavior. More generally, application threads are expected to be
pinned to consecutive cores, as threading parallelization benefits
from the resulting cache access patterns. However, CPU cores can have
multiple hardware threads, and whether using one or multiple of them
per core performs better determines whether a stride of 1 or 2 is
best. Typically, when most work is offloaded to a GPU and many CPU
cores are available, 1 thread/core is best.

Note that the above topology mapping simply means that the indexed
entities that the operating system calls "CPU", grouped in "[]",
correspond to hardware threads of the same core, i.e. core 0 is [0
32], core 1 is [1 33], etc. Pinning with a stride indexes into this map:
- with -pinstride 1 the thread mapping will be (app thread->hardware
thread): 0->0, 1->32, 2->1, 3->33, ...
- with -pinstride 2 the thread mapping will be (app thread->hardware
thread): 0->0, 1->1, 2->2, 3->3, ...
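
As a minimal sketch of the mapping described above (this is not GROMACS
code): pin_map is a hypothetical helper, and the topology is simplified
so that core i holds hardware threads i and i+32. On the real machine
the cores would be listed in the order reported in the log file.

```python
def pin_map(cores, nthreads, offset=0, stride=1):
    """Return the hardware thread each application thread is pinned to.

    Hypothetical helper that models the stride-based indexing only.
    """
    # Flatten the per-core groups in topology order: [0, 32, 1, 33, ...]
    slots = [hw for core in cores for hw in core]
    # Application thread i takes slot offset + i * stride.
    return [slots[offset + i * stride] for i in range(nthreads)]


# Assumed simplified topology: core i = [i, i + 32], 32 cores in total.
cores = [(i, i + 32) for i in range(32)]

print(pin_map(cores, 4, stride=1))  # [0, 32, 1, 33]
print(pin_map(cores, 4, stride=2))  # [0, 1, 2, 3]
```

With a stride of 1 both hardware threads of a core are filled before
moving to the next core; with a stride of 2 every second slot is
skipped, so each application thread gets a core of its own.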

> and indeed I found a subtle degradation of
> performance for a single simulation, switching from:
> gmx mdrun -deffnm run -nb gpu -pme gpu -ntomp 28 -ntmpi 1 -npme 0 -gputasks
> 00 -pin on
> to
> gmx mdrun -deffnm run -nb gpu -pme gpu -ntomp 28 -ntmpi 1 -npme 0 -gputasks
> 00 -pin on -pinoffset 0 -pinstride 1.

If you compare the log files of the two, you should notice that the
former used a pinstride of 2, resulting in the use of 28 cores, while
the latter used only 14 cores; the likely reason for only a small
difference is that there is not enough CPU work to scale to 28 cores
and, additionally, these specific Threadripper (TR) CPUs are tricky to
scale across with wide multi-threaded parallelization.
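
The 28 vs. 14 core counts follow from the same indexing. A small
sketch, assuming two hardware threads per core and identifying cores by
their position in the topology list rather than by OS numbering
(cores_used is a hypothetical helper, not part of GROMACS):

```python
def cores_used(nthreads, offset=0, stride=1, threads_per_core=2):
    """Return the set of core positions occupied by a pinned run."""
    # Slot k in the flattened topology belongs to core k // threads_per_core.
    return {(offset + i * stride) // threads_per_core for i in range(nthreads)}


# 28 OpenMP threads pinned with a stride of 2 vs. a stride of 1.
print(len(cores_used(28, stride=2)))  # 28
print(len(cores_used(28, stride=1)))  # 14

# Two concurrent 28-thread runs with -pinoffset 0 and 28 at stride 1
# land on disjoint sets of cores.
run_a = cores_used(28, offset=0, stride=1)
run_b = cores_used(28, offset=28, stride=1)
print(run_a.isdisjoint(run_b))  # True
```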

Cheers,
--
Szilárd


>
> Thanks again
> Stefano
>
>
>
>
> On Fri, Aug 16, 2019 at 17:48, Szilárd Páll <pall.szilard at gmail.com> wrote:
>
> > On Mon, Aug 5, 2019 at 5:00 PM Stefano Guglielmo
> > <stefano.guglielmo at unito.it> wrote:
> > >
> > > Dear Paul,
> > > thanks for suggestions. Following them I managed to run 91 ns/day for the
> > > system I referred to in my previous post with the configuration:
> > > gmx mdrun -deffnm run -nb gpu -pme gpu -ntomp 4 -ntmpi 7 -npme 1
> > > -gputasks 0000111 -pin on (still 28 threads seems to be the best choice)
> > >
> > > and 56 ns/day for two independent runs:
> > > gmx mdrun -deffnm run -nb gpu -pme gpu -ntomp 4 -ntmpi 7 -npme 1
> > > -gputasks 0000000 -pin on -pinoffset 0 -pinstride 1
> > > gmx mdrun -deffnm run2 -nb gpu -pme gpu -ntomp 4 -ntmpi 7 -npme 1
> > > -gputasks 1111111 -pin on -pinoffset 28 -pinstride 1
> > > which is a fairly good result.
> >
> > Use no DD in single-GPU runs, i.e. for the latter, just simply
> > gmx mdrun -deffnm run -nb gpu -pme gpu -ntomp 28 -ntmpi 1 -npme 0
> > -gputasks 00 -pin on -pinoffset 0 -pinstride 1
> >
> > You can also have mdrun's multidir functionality manage an ensemble of
> > jobs (related or not) so you don't have to manually start, calculate
> > pinning, etc.
> >
> >
> > > I am still wondering if somehow I should pin the threads in some
> > > different way in order to reflect the cpu topology and if this can
> > > influence performance (if I remember well NAMD allows the user to
> > > indicate explicitly the cpu core/threads to use in a computation).
> >
> > Your pinning does reflect the CPU topology -- the 4x7=28 threads are
> > pinned to consecutive hardware threads (because of -pinstride 1, i.e.
> > don't skip the second hardware thread of the core). The mapping of
> > software to hardware threads happens based on the topology-based
> > hardware thread indexing; see the hardware detection report in the log
> > file.
> >
> > > When I tried to run two simulations with the following configuration:
> > > gmx mdrun -deffnm run -nb gpu -pme gpu -ntomp 4 -ntmpi 8 -npme 1
> > > -gputasks 00001111 -pin on -pinoffset 0 -pinstride 1
> > > gmx mdrun -deffnm run2 -nb gpu -pme gpu -ntomp 4 -ntmpi 8 -npme 1
> > > -gputasks 00001111 -pin on -pinoffset 0 -pinstride 32
> > > the system crashed down. Probably this is normal and I am missing
> > > something quite obvious.
> >
> > Not really. What do you mean by "crashed down"? The machine should not
> > crash, nor should the simulation. Even though your machine has 32
> > cores / 64 threads, using all of these may not always be beneficial, as
> > using more threads where there is too little work to scale will have
> > an overhead. Have you tried using all cores but only 1 thread / core
> > (i.e. 32 threads in total with pinstride 2)?
> >
> > Cheers,
> > --
> > Szilárd
> >
> > >
> > > Thanks again for the valuable advice
> > > Stefano
> > >
> > >
> > >
> > > On Sun, Aug 4, 2019 at 01:40, paul buscemi <pbuscemi at q.com> wrote:
> > >
> > > > Stefano,
> > > >
> > > > A recent run with 140000 atoms, including 10000 isopropanol molecules
> > > > on top of an end-restrained PDMS surface of 74000 atoms in a
> > > > 20 x 20 x 30 nm box, ran at 67 ns/d NVT with the mdrun conditions I
> > > > posted. It took 120 ns for 100 molecules of an adsorbate to go from
> > > > solution to the surface. I don't think this will set the world ablaze
> > > > with any benchmarks but it is acceptable to get some work done.
> > > >
> > > > Linux Mint Mate 18, AMD Threadripper 32 core 2990wx 4.2 GHz, 32GB DDR4,
> > > > 2x RTX 2080TI, gmx2019 in the simplest gmx configuration for gpus, CUDA
> > > > version 10, Nvidia 410.7p loaded from the repository
> > > >
> > > > Paul
> > > >
> > > > > On Aug 3, 2019, at 12:58 PM, paul buscemi <pbuscemi at q.com> wrote:
> > > > >
> > > > > Stefano,
> > > > >
> > > > > Here is a typical run
> > > > >
> > > > > for minimization: gmx mdrun -deffnm grofile. -nb gpu
> > > > >
> > > > > and for other runs for a 32 core
> > > > >
> > > > > gmx mdrun -deffnm grofile.nvt -nb gpu -pme gpu -ntomp 8 -ntmpi 8 -npme 1
> > > > > -gputasks 0000000011111111 -pin on
> > > > >
> > > > > Depending on the molecular system/model, -ntomp 4 -ntmpi 16 may be
> > > > > faster - of course adjusting -gputasks
> > > > >
> > > > > Rarely do I find that not using ntomp and ntmpi is faster, but it is
> > > > > never bad
> > > > >
> > > > > Let me know how it goes.
> > > > >
> > > > > Paul
> > > > >
> > > > > >> On Aug 3, 2019, at 4:41 AM, Stefano Guglielmo <stefano.guglielmo at unito.it> wrote:
> > > > >>
> > > > >> Hi Paul,
> > > > > >> thanks for the reply. Would you mind posting the command you used or
> > > > > >> telling me how you balanced the work between cpu and gpu?
> > > > > >>
> > > > > >> What about pinning? Does anyone know how to deal with a cpu topology
> > > > > >> like the one reported in my previous post and if it is relevant for
> > > > > >> performance?
> > > > >> Thanks
> > > > >> Stefano
> > > > >>
> > > > > >> On Saturday, August 3, 2019, Paul Buscemi <pbuscemi at q.com> wrote:
> > > > >>
> > > > > >>> I run the same system and setup but no nvlink. Maestro runs both gpus
> > > > > >>> at 100 percent. Gromacs typically runs at 50-60 percent and can do
> > > > > >>> 600 ns/d on 20000 atoms
> > > > >>>
> > > > >>> PB
> > > > >>>
> > > > > >>>> On Jul 25, 2019, at 9:30 PM, Kevin Boyd <kevin.boyd at uconn.edu> wrote:
> > > > >>>>
> > > > >>>> Hi,
> > > > >>>>
> > > > > >>>> I've done a lot of research/experimentation on this, so I can
> > > > > >>>> maybe get you started - if anyone has any questions about the
> > > > > >>>> essay to follow, feel free to email me personally, and I'll link
> > > > > >>>> it to the email thread if it ends up being pertinent.
> > > > > >>>>
> > > > > >>>> First, there are some more internet resources to check out. See
> > > > > >>>> Mark's talk at -
> > > > > >>>> https://bioexcel.eu/webinar-performance-tuning-and-optimization-of-gromacs/
> > > > > >>>> Gromacs development moves fast, but a lot of it is still relevant.
> > > > > >>>>
> > > > > >>>> I'll expand a bit here, with the caveat that Gromacs GPU
> > > > > >>>> development is moving very fast and so the correct commands for
> > > > > >>>> optimal performance are both system-dependent and a moving target
> > > > > >>>> between versions. This is a good thing - GPUs have revolutionized
> > > > > >>>> the field, and with each iteration we make better use of them.
> > > > > >>>> The downside is that it's unclear exactly what sort of CPU-GPU
> > > > > >>>> balance you should look to purchase to take advantage of future
> > > > > >>>> developments, though the trend is certainly that more and more
> > > > > >>>> computation is being offloaded to the GPUs.
> > > > > >>>>
> > > > > >>>> The most important consideration is that to get maximum total
> > > > > >>>> throughput performance, you should be running not one but
> > > > > >>>> multiple simulations simultaneously. You can do this through the
> > > > > >>>> -multidir option, but I don't recommend that in this case, as it
> > > > > >>>> requires compiling with MPI and limits some of your options. My
> > > > > >>>> run scripts usually use "gmx mdrun ... &" to initiate
> > > > > >>>> subprocesses, with combinations of -ntomp, -ntmpi, -pin,
> > > > > >>>> -pinoffset, and -gputasks. I can give specific examples if you're
> > > > > >>>> interested.
> > > > > >>>>
> > > > > >>>> Another important point is that you can run more simulations than
> > > > > >>>> the number of GPUs you have. Depending on CPU-GPU balance and
> > > > > >>>> quality, you won't double your throughput by e.g. putting 4
> > > > > >>>> simulations on 2 GPUs, but you might increase it up to 1.5x. This
> > > > > >>>> would involve targeting the same GPU with -gputasks.
> > > > > >>>>
> > > > > >>>> Within a simulation, you should set up a benchmarking script to
> > > > > >>>> figure out the best combination of thread-mpi ranks and open-mp
> > > > > >>>> threads - this can have pretty drastic effects on performance.
> > > > > >>>> For example, if you want to use your entire machine for one
> > > > > >>>> simulation (not recommended for maximal
> > > > >>>
> > > > >>> --
> > > > >>> Gromacs Users mailing list
> > > > >>>
> > > > > >>> * Please search the archive at
> > > > > >>> http://www.gromacs.org/Support/Mailing_Lists/GMX-Users_List before posting!
> > > > > >>>
> > > > > >>> * Can't post? Read http://www.gromacs.org/Support/Mailing_Lists
> > > > > >>>
> > > > > >>> * For (un)subscribe requests visit
> > > > > >>> https://maillist.sys.kth.se/mailman/listinfo/gromacs.org_gmx-users or
> > > > > >>> send a mail to gmx-users-request at gromacs.org.
> > > > >>>
> > > > >>
> > > > >>
> > > > >> --
> > > > >> Stefano GUGLIELMO PhD
> > > > >> Assistant Professor of Medicinal Chemistry
> > > > >> Department of Drug Science and Technology
> > > > >> Via P. Giuria 9
> > > > >> 10125 Turin, ITALY
> > > > >> ph. +39 (0)11 6707178
> > > > >> --
> > > > >> Gromacs Users mailing list
> > > > >>
> > > > >> * Please search the archive at
> > > > http://www.gromacs.org/Support/Mailing_Lists/GMX-Users_List before
> > > > posting!
> > > > >>
> > > > >> * Can't post? Read http://www.gromacs.org/Support/Mailing_Lists
> > > > >>
> > > > >> * For (un)subscribe requests visit
> > > > >> https://maillist.sys.kth.se/mailman/listinfo/gromacs.org_gmx-users
> > or
> > > > send a mail to gmx-users-request at gromacs.org.
> > > > >
> > > > > --
> > > > > Gromacs Users mailing list
> > > > >
> > > > > * Please search the archive at
> > > > http://www.gromacs.org/Support/Mailing_Lists/GMX-Users_List before
> > > > posting!
> > > > >
> > > > > * Can't post? Read http://www.gromacs.org/Support/Mailing_Lists
> > > > >
> > > > > * For (un)subscribe requests visit
> > > > > https://maillist.sys.kth.se/mailman/listinfo/gromacs.org_gmx-users
> > or
> > > > send a mail to gmx-users-request at gromacs.org.
> > > >
> > > >
> > >
> > >
>
>
>

