[gmx-users] Optimising mdrun

Kevin Boyd kevin.boyd at uconn.edu
Thu Apr 2 05:59:01 CEST 2020


Hi -

> This setting is using 16 MPI processes and 2 OpenMP threads per MPI
process

With -ntomp 1 you should only be getting one OpenMP thread per rank, so I'm
not sure why that's not happening. Can you post a link to a log file?

For a small system like that and a powerful GPU, you're likely going to
have some inefficiency. Can you run replicates? You'd want to do
something like this (though I'm not sure about the mpirun part, I usually
do non-MPI GROMACS; also note that -nt and -ntmpi are thread-MPI options,
so they don't apply to an MPI build like gmx_mpi):
mpirun -np 1 gmx_mpi mdrun -ntomp 8 -pin on -pinoffset 0 -nb gpu -pme gpu
mpirun -np 1 gmx_mpi mdrun -ntomp 8 -pin on -pinoffset 8 -nb gpu -pme gpu

where the first job runs on cores 0-7 and the second job runs on cores
8-15, sharing the same GPU.
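If it helps, the two replicates can be launched from one small script. This
is only a sketch, written for a non-MPI (thread-MPI) build since that's what
I normally use; the run0/run1 output names and GPU id 0 are placeholder
assumptions. It prints the two commands rather than running them, so you can
check the pinning first; drop the "echo" to actually launch them.

```shell
# Sketch only: two replicates on a 16-core node, each pinned to its own
# half of the cores, both sharing GPU 0. run0/run1 are placeholder -deffnm
# names. Remove "echo" to launch for real (each job backgrounded with &,
# then "wait" blocks until both finish).
NCORES=8                       # cores per replicate
for i in 0 1; do
  echo "gmx mdrun -deffnm run$i -ntmpi 1 -ntomp $NCORES -pin on" \
       "-pinoffset $(( i * NCORES )) -nb gpu -pme gpu -gpu_id 0 &"
done
echo "wait"
```

The key detail is that the second job's -pinoffset equals the first job's
core count, so the two jobs never share a core.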

> The log states "Will do PME sum in reciprocal space for electrostatic
interactions"

Reciprocal space does not indicate whether PME ran on the CPU or the GPU;
that line just describes the math of the PME algorithm (the long-range sum
is done in reciprocal space), not where it executes.
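To see where PME actually landed, look at the task-assignment lines near the
top of the log instead. A quick grep like the one below works; treat it as a
sketch, since md.log is a placeholder name and the exact log wording varies
between GROMACS versions.

```shell
# Sketch: pull the task-assignment and PME-related lines out of the mdrun
# log. md.log is a placeholder; exact phrasing differs across versions.
grep -iE "pme|gpu task|rank" md.log
```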

On Wed, Apr 1, 2020 at 1:03 AM Robert Cordina <robert.cordina at strath.ac.uk>
wrote:

>
>
> Hi,
>
> I'm trying to use GROMACS 2019.3 on two different machines, simulating the
> same system.  The system is made up of 100 identical molecules, consisting
> of 61 united atoms each, in a non-aqueous system.  I am however getting
> vastly different performance on the two machines, and I don't really
> understand why.  The two machines' specs are as follows:
>
> Machine 1 (stand-alone machine)
>
>
>   *   40-core CPU (Intel Xeon Gold 6148 2.4GHz, 3.7GHz, 20C, 10.4GT/s
> 2UPI, 27M Cache, HT (150W) DDR4-2666) - so 20 hyper-threaded cores to give
> a total of 40, although in my case only 20 cores are being used given the
> size of the system (see below)
>   *   1 GPU (Nvidia Quadro RTX5000, 16GB, 4DP, VirtualLink (XX20T))
>   *   Memory (512GB (8x64GB) 2666MHz DDR4 LRDIMM ECC) - although in
> reality only about 10 GB are used during my simulation
>   *   CUDA 10.10
>
> When I simulate my system, I've found that the optimal mdrun settings are
> extremely simple
>
> mdrun -pin on
>
> This setting uses 1 MPI thread and 20 OpenMP threads, while the PP and PME
> tasks are carried out automatically on the GPU.  With this I'm getting
> about 800 ns/day.
>
>
> Machine 2 (part of a HPC cluster - of which I can only use 1 node at a
> time - the below details are for a GPU node)
>
>
>   *   16 core CPU (Intel Xeon E5-2600, 2.2 GHz)
>   *   1 GPU (NVIDIA Tesla V100-PCIE-32GB)
>   *   Memory (64GB RAM)
>   *   CUDA 10.0
>
> I've tried reading the GROMACS documentation, repeatedly, but I really
> can't understand much about how to go about choosing the right mdrun
> settings for this machine.  I've tried several, and the best performance I
> could get is with
>
> mpirun -np 16 gmx_mpi mdrun -ntomp 1
>
> Even then, I get a performance of about 45-50 ns/day, which is way lower
> than the 800 ns/day I get with the stand-alone machine.  This setting is
> using 16 MPI processes and 2 OpenMP threads per MPI process.  Also, only
> PP tasks are being carried out on the GPU.  The log states "Will do PME sum
> in reciprocal space for electrostatic interactions".  I also get a warning,
> "On rank 0: oversubscribing the available 16 logical CPU cores per node
> with 32 threads. This will cause considerable performance loss", with the
> note "Oversubscribing the CPU, will not pin threads".
>
>
> Does anyone have any guidance as to what I should try to get good
> performance on the GPU node of the cluster please?  Any help will be really
> appreciated.
>
> Thanks,
>
> Robert
> --
> Gromacs Users mailing list
>
> * Please search the archive at
> http://www.gromacs.org/Support/Mailing_Lists/GMX-Users_List before
> posting!
>
> * Can't post? Read http://www.gromacs.org/Support/Mailing_Lists
>
> * For (un)subscribe requests visit
> https://maillist.sys.kth.se/mailman/listinfo/gromacs.org_gmx-users or
> send a mail to gmx-users-request at gromacs.org.
>

