[gmx-users] simulation on 2 gpus
Stefano Guglielmo
stefano.guglielmo at unito.it
Tue Jul 30 09:29:44 CEST 2019
Kevin, Mark,
thanks for sharing your advice and experience.
I am facing some strange behaviour when trying to run on the two GPUs: some
combinations "simply" make the system crash (the workstation turns off after a
few seconds of running); in particular, the following run:
gmx mdrun -deffnm run (-gpu_id 01) -pin on
which produces the following log
"GROMACS version: 2019.2
Precision: single
Memory model: 64 bit
MPI library: thread_mpi
OpenMP support: enabled (GMX_OPENMP_MAX_THREADS = 64)
GPU support: CUDA
SIMD instructions: AVX2_256
FFT library: fftw-3.3.8-sse2-avx-avx2-avx2_128
RDTSCP usage: enabled
TNG support: enabled
Hwloc support: disabled
Tracing support: disabled
C compiler: /usr/bin/cc GNU 4.8.5
C compiler flags: -mavx2 -mfma -O3 -DNDEBUG -funroll-all-loops
-fexcess-precision=fast
C++ compiler: /usr/bin/c++ GNU 4.8.5
C++ compiler flags: -mavx2 -mfma -std=c++11 -O3 -DNDEBUG
-funroll-all-loops -fexcess-precision=fast
CUDA compiler: /usr/local/cuda/bin/nvcc nvcc: NVIDIA (R) Cuda compiler
driver;Copyright (c) 2005-2019 NVIDIA Corporation;Built on
Wed_Apr_24_19:10:27_PDT_2019;Cuda compilation tools, release 10.1, V10.1.168
CUDA compiler
flags:-gencode;arch=compute_30,code=sm_30;-gencode;arch=compute_35,code=sm_35;-gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_52,code=sm_52;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=compute_75;-use_fast_math;;;
;-mavx2;-mfma;-std=c++11;-O3;-DNDEBUG;-funroll-all-loops;-fexcess-precision=fast;
CUDA driver: 10.10
CUDA runtime: 10.10
Running on 1 node with total 32 cores, 64 logical cores, 2 compatible GPUs
Hardware detected:
CPU info:
Vendor: AMD
Brand: AMD Ryzen Threadripper 2990WX 32-Core Processor
Family: 23 Model: 8 Stepping: 2
Features: aes amd apic avx avx2 clfsh cmov cx8 cx16 f16c fma htt lahf
misalignsse mmx msr nonstop_tsc pclmuldq pdpe1gb popcnt pse rdrnd rdtscp
sha sse2 sse3 sse4a sse4.1 sse4.2 ssse3
Hardware topology: Basic
Sockets, cores, and logical processors:
Socket 0: [ 0 32] [ 1 33] [ 2 34] [ 3 35] [ 4 36] [ 5 37] [ 6 38] [ 7 39]
          [ 16 48] [ 17 49] [ 18 50] [ 19 51] [ 20 52] [ 21 53] [ 22 54] [ 23 55]
          [ 8 40] [ 9 41] [ 10 42] [ 11 43] [ 12 44] [ 13 45] [ 14 46] [ 15 47]
          [ 24 56] [ 25 57] [ 26 58] [ 27 59] [ 28 60] [ 29 61] [ 30 62] [ 31 63]
GPU info:
Number of GPUs detected: 2
#0: NVIDIA GeForce RTX 2080 Ti, compute cap.: 7.5, ECC: no, stat:
compatible
#1: NVIDIA GeForce RTX 2080 Ti, compute cap.: 7.5, ECC: no, stat:
compatible
...
Using 16 MPI threads
Using 4 OpenMP threads per tMPI thread
On host pcpharm018 2 GPUs selected for this run.
Mapping of GPU IDs to the 16 GPU tasks in the 16 ranks on this node:
PP:0,PP:0,PP:0,PP:0,PP:0,PP:0,PP:0,PP:0,PP:1,PP:1,PP:1,PP:1,PP:1,PP:1,PP:1,PP:1
PP tasks will do (non-perturbed) short-ranged and most bonded interactions
on the GPU
Pinning threads with an auto-selected logical core stride of 1"
Running the following command seems to work without crashing, with 1 tMPI
thread and 32 OpenMP threads on 1 GPU only:
gmx mdrun -deffnm run -gpu_id 01 -pin on -pinstride 1 -pinoffset 32 -ntmpi 1
The most efficient single run seems to be obtained with:
gmx mdrun -deffnm run -gpu_id 0 -ntmpi 1 -ntomp 28
which gives 86 ns/day for a system of about 100K atoms (a 1000-residue protein
with membrane and water).
I also tried to run two independent simulations, and again with the
following commands the system crashes:
gmx mdrun -deffnm run1 -gpu_id 1 -pin on -pinstride 1 -pinoffset 32 -ntomp
32 -ntmpi 1
gmx mdrun -deffnm run0 -gpu_id 0 -pin on -pinstride 1 -pinoffset 0 -ntomp
32 -ntmpi 1
with the log
"Running on 1 node with total 32 cores, 64 logical cores, 2 compatible GPUs
...
Using 1 MPI thread
Using 32 OpenMP threads
1 GPU selected for this run.
Mapping of GPU IDs to the 2 GPU tasks in the 1 rank on this node:
PP:1,PME:1
PP tasks will do (non-perturbed) short-ranged interactions on the GPU
PME tasks will do all aspects on the GPU
Applying core pinning offset 32."
Two runs can be carried out with the commands:
gmx mdrun -deffnm run1 -gpu_id 1 -pin on -pinstride 1 -pinoffset 14 -ntmpi
1 -ntomp 28
gmx mdrun -deffnm run0 -gpu_id 0 -pin on -pinstride 1 -pinoffset 0 -ntmpi 1
-ntomp 28
"Using 1 MPI thread
Using 28 OpenMP threads
1 GPU selected for this run.
Mapping of GPU IDs to the 2 GPU tasks in the 1 rank on this node:
PP:1,PME:1
PP tasks will do (non-perturbed) short-ranged interactions on the GPU
PME tasks will do all aspects on the GPU
Applying core pinning offset 14
Pinning threads with a user-specified logical core stride of 1"
or
gmx mdrun -deffnm run1 -gpu_id 1 -pin on -ntmpi 1 -ntomp 28
gmx mdrun -deffnm run0 -gpu_id 0 -pin on -ntmpi 1 -ntomp 28
"Using 1 MPI thread
Using 28 OpenMP threads
1 GPU selected for this run.
Mapping of GPU IDs to the 2 GPU tasks in the 1 rank on this node:
PP:1,PME:1
PP tasks will do (non-perturbed) short-ranged interactions on the GPU
PME tasks will do all aspects on the GPU
Pinning threads with an auto-selected logical core stride of 2"
Disappointingly, in both situations there was a substantial degradation of
performance: about 35-40 ns/day for the same system, with a GPU usage of
25-30% (compared to 50-55% for the single run on a single GPU) and much below
the power cap.
I hope this has not been too confusing; I would be grateful for any suggestions.
Thanks
Stefano
On Fri, Jul 26, 2019 at 3:00 PM Kevin Boyd <kevin.boyd at uconn.edu>
wrote:
> Sure - you can do it 2 ways with normal Gromacs. Either run the simulations
> in separate terminals, or use ampersands to run them in the background of 1
> terminal.
>
> I'll give a concrete example for your threadripper, using 32 of your cores,
> so that you could run some other computation on the other 32. I typically
> make a bash variable with all the common arguments.
>
> Given tprs run1.tpr ...run4.tpr
>
> gmx_common="gmx mdrun -ntomp 8 -ntmpi 1 -pme gpu -nb gpu -pin on -pinstride 1"
> $gmx_common -deffnm run1 -pinoffset 32 -gputasks 00 &
> $gmx_common -deffnm run2 -pinoffset 40 -gputasks 00 &
> $gmx_common -deffnm run3 -pinoffset 48 -gputasks 11 &
> $gmx_common -deffnm run4 -pinoffset 56 -gputasks 11
>
> So run1 will run on cores 32-39, on GPU 0, run2 on cores 40-47 on the same
> GPU, and the other 2 runs will use GPU 1. Note the ampersands on the first
> 3 runs, so they'll go off in the background
>
> I should also have mentioned one peculiarity with running with -ntmpi 1 and
> -pme gpu, in that even though there's now only one rank (with nonbonded and
> PME both running on it), you still need 2 gpu tasks for that one rank, one
> for each type of interaction.
>
> As for multidir, I forget what troubles I ran into exactly, but I was
> unable to run some subset of simulations. Anyhow if you aren't running on a
> cluster, I see no reason to compile with MPI and have to use srun or slurm,
> and need to use gmx_mpi rather than gmx. The built-in thread-mpi gives you
> up to 64 threads, and can have a minor (<5% in my experience) performance
> benefit over MPI.
>
> Kevin
>
> On Fri, Jul 26, 2019 at 8:21 AM Gregory Man Kai Poon <gpoon at gsu.edu>
> wrote:
>
> > Hi Kevin,
> > Thanks for your very useful post. Could you give a few command line
> > examples on how to start multiple runs at different times (e.g.,
> allocate a
> > subset of CPU/GPU to one run, and start another run later using another
> > subset of yet-unallocated CPU/GPU). Also, could you elaborate on the
> > drawbacks of the MPI compilation that you hinted at?
> > Gregory
> >
> > From: Kevin Boyd<mailto:kevin.boyd at uconn.edu>
> > Sent: Thursday, July 25, 2019 10:31 PM
> > To: gmx-users at gromacs.org<mailto:gmx-users at gromacs.org>
> > Subject: Re: [gmx-users] simulation on 2 gpus
> >
> > Hi,
> >
> > I've done a lot of research/experimentation on this, so I can maybe get
> you
> > started - if anyone has any questions about the essay to follow, feel
> free
> > to email me personally, and I'll link it to the email thread if it ends
> up
> > being pertinent.
> >
> > First, there are some more internet resources to check out. See Mark's talk
> > at
> > https://bioexcel.eu/webinar-performance-tuning-and-optimization-of-gromacs/
> > Gromacs development moves fast, but a lot of it is still relevant.
> >
> > I'll expand a bit here, with the caveat that Gromacs GPU development is
> > moving very fast and so the correct commands for optimal performance are
> > both system-dependent and a moving target between versions. This is a
> good
> > thing - GPUs have revolutionized the field, and with each iteration we
> make
> > better use of them. The downside is that it's unclear exactly what sort
> of
> > CPU-GPU balance you should look to purchase to take advantage of future
> > developments, though the trend is certainly that more and more
> computation
> > is being offloaded to the GPUs.
> >
> > The most important consideration is that to get maximum total throughput
> > performance, you should be running not one but multiple simulations
> > simultaneously. You can do this through the -multidir option, but I don't
> > recommend that in this case, as it requires compiling with MPI and limits
> > some of your options. My run scripts usually use "gmx mdrun ... &" to
> > initiate subprocesses, with combinations of -ntomp, -ntmpi, -pin
> > -pinoffset, and -gputasks. I can give specific examples if you're
> > interested.
> >
> > Another important point is that you can run more simulations than the
> > number of GPUs you have. Depending on CPU-GPU balance and quality, you
> > won't double your throughput by e.g. putting 4 simulations on 2 GPUs, but
> > you might increase it up to 1.5x. This would involve targeting the same
> GPU
> > with -gputasks.
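> >
> > As a rough sketch of that idea (my own illustration, assuming two tprs,
> > runA.tpr and runB.tpr, and 8 cores per run - adjust to your core count),
> > two runs could share GPU 0 like this:
> >
> > # both the PP and PME task of each run target GPU 0; pin to disjoint cores
> > gmx mdrun -deffnm runA -ntmpi 1 -ntomp 8 -nb gpu -pme gpu -gputasks 00 \
> >     -pin on -pinstride 1 -pinoffset 0 &
> > gmx mdrun -deffnm runB -ntmpi 1 -ntomp 8 -nb gpu -pme gpu -gputasks 00 \
> >     -pin on -pinstride 1 -pinoffset 8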
> >
> > Within a simulation, you should set up a benchmarking script to figure
> out
> > the best combination of thread-mpi ranks and open-mp threads - this can
> > have pretty drastic effects on performance. For example, if you want to
> use
> > your entire machine for one simulation (not recommended for maximal
> > efficiency), you have a lot of decomposition options (ignoring PME -
> which
> > is important, see below):
> >
> > -ntmpi 2 -ntomp 32 -gputasks 01
> > -ntmpi 4 -ntomp 16 -gputasks 0011
> > -ntmpi 8 -ntomp 8 -gputasks 00001111
> > -ntmpi 16 -ntomp 4 -gputasks 0000000011111111
> > (and a few others - note that ntmpi * ntomp = total threads available)
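> >
> > A minimal benchmarking loop for such a scan (just a sketch, assuming a tpr
> > named bench.tpr and 64 hardware threads; add the matching -gputasks string
> > from the list above if you want to fix the GPU mapping) might be:
> >
> > for ntmpi in 2 4 8 16; do
> >     ntomp=$((64 / ntmpi))
> >     # short run; -resetstep excludes the first half from the timings,
> >     # then compare the ns/day reported in each .log
> >     gmx mdrun -s bench.tpr -deffnm bench_${ntmpi}x${ntomp} \
> >         -ntmpi $ntmpi -ntomp $ntomp -pin on \
> >         -nsteps 20000 -resetstep 10000
> > done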
> >
> > In my experience, you need to scan the options in a benchmarking script
> for
> > each simulation size/content you want to simulate, and the difference
> > between the best and the worst can be up to a factor of 2-4 in terms of
> > performance. If you're splitting your machine among multiple
> simulations, I
> > suggest running 1 mpi thread (-ntmpi 1) per simulation, unless your
> > benchmarking suggests that the optimal performance lies elsewhere.
> >
> > Things get more complicated when you start putting PME on the GPUs. For
> the
> > machines I work on, putting PME on GPUs absolutely improves performance,
> > but I'm not fully confident in that assessment without testing your
> > specific machine - you have a lot of cores with that threadripper, and
> this
> > is another area where I expect Gromacs 2020 might shift the GPU-CPU
> optimal
> > balance.
> >
> > The issue with PME on GPUs is that we can (currently) only have one rank
> > doing GPU PME work. So, if we have a machine with say 20 cores and 2 GPUs
> > and I run the following
> >
> > gmx mdrun .... -ntomp 10 -ntmpi 2 -pme gpu -npme 1 -gputasks 01
> >
> > , two ranks will be started - one with cores 0-9, will work on the
> > short-range interactions, offloading where it can to GPU 0, and the PME
> > rank (cores 10-19) will offload to GPU 1. There is one significant
> problem
> > (and one minor problem) with this setup. First, it is massively
> inefficient
> > in terms of load balance. In a typical system (there are exceptions), PME
> > takes up ~1/3 of the computation that short-range interactions take. So,
> we
> > are offloading 1/4 of our interactions to one GPU and 3/4 to the other,
> > which leads to imbalance. In this specific case (2 GPUs and sufficient
> > cores), the most optimal solution is often (but not always) to run with
> > -ntmpi 4 (in this example, then -ntomp 5), as the PME rank then gets 1/4
> of
> > the GPU instructions, proportional to the computation needed.
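> >
> > Spelled out for that 20-core/2-GPU example (a sketch only - check the
> > "Mapping of GPU IDs" lines in your log, since the PME rank is assumed here
> > to end up as the last GPU task), that would look something like:
> >
> > gmx mdrun ... -ntmpi 4 -ntomp 5 -pme gpu -npme 1 -gputasks 0011
> >
> > i.e. two PP ranks on GPU 0 and one PP rank plus the PME rank on GPU 1, so
> > each GPU gets roughly half of the offloaded work.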
> >
> > The second (less critical - don't worry about this unless you're
> > CPU-limited) problem is that PME-GPU mpi ranks only use 1 CPU core in their
> > calculations. So, with a node of 20 cores and 2 GPUs, if I run a simulation
> > with -ntmpi 4 -ntomp 5 -pme gpu -npme 1, each one of those ranks
> > will have 5 CPUs, but the PME rank will only use one of them. You can
> > specify the number of PME cores per rank with -ntomp_pme. This is useful
> in
> > restricted cases. For example, given the above architecture setup (20
> > cores, 2 GPUs), I could maximally exploit my CPUs with the following
> > commands:
> >
> > gmx mdrun .... -ntmpi 4 -ntomp 3 -ntomp_pme 1 -pme gpu -npme 1 -gputasks
> > 0000 -pin on -pinoffset 0 &
> > gmx mdrun .... -ntmpi 4 -ntomp 3 -ntomp_pme 1 -pme gpu -npme 1 -gputasks
> > 1111 -pin on -pinoffset 10
> >
> > where the 1st 10 cores are (0-2 - PP) (3-5 - PP) (6-8 -PP) (9 - PME)
> > and similar for the other 10 cores.
> >
> > There are a few other parameters to scan for minor improvements - for
> > example nstlist, which I typically scan in a range between 80-140 for GPU
> > simulations, with an effect of 2-5% on performance.
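> >
> > A quick way to scan that (again just a sketch, assuming a tpr named
> > bench.tpr) is something like:
> >
> > for nl in 80 100 120 140; do
> >     # -nstlist overrides the neighbour-list update interval for this run
> >     gmx mdrun -s bench.tpr -deffnm bench_nstlist${nl} -nstlist $nl \
> >         -nsteps 20000 -resetstep 10000
> > done
> >
> > and then compare the ns/day reported at the end of each .log file.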
> >
> > I'm happy to expand the discussion with anyone who's interested.
> >
> > Kevin
> >
> >
> > On Thu, Jul 25, 2019 at 1:47 PM Stefano Guglielmo <
> > stefano.guglielmo at unito.it> wrote:
> >
> > > Dear all,
> > > I am trying to run simulations with Gromacs 2019.2 on a workstation with
> > > an AMD Threadripper CPU (32 cores, 64 threads), 128 GB RAM, and two RTX
> > > 2080 Ti GPUs with an NVLink bridge. I read the user's guide section on
> > > performance and I am exploring some possible combinations of CPU/GPU work
> > > to run as fast as possible. I was wondering if some of you have experience
> > > running on more than one GPU with several cores and can give some hints as
> > > a starting point.
> > > Thanks
> > > Stefano
> > >
> > >
> > > --
> > > Stefano GUGLIELMO PhD
> > > Assistant Professor of Medicinal Chemistry
> > > Department of Drug Science and Technology
> > > Via P. Giuria 9
> > > 10125 Turin, ITALY
> > > ph. +39 (0)11 6707178
--
Stefano GUGLIELMO PhD
Assistant Professor of Medicinal Chemistry
Department of Drug Science and Technology
Via P. Giuria 9
10125 Turin, ITALY
ph. +39 (0)11 6707178