[gmx-users] Optimising mdrun

Robert Cordina robert.cordina at strath.ac.uk
Wed Apr 1 10:03:36 CEST 2020


Hi,

I'm trying to use GROMACS 2019.3 on two different machines, simulating the same system.  The system consists of 100 identical molecules of 61 united atoms each, in a non-aqueous environment.  However, I am getting vastly different performance on the two machines, and I don't really understand why.  The two machines' specs are as follows:

Machine 1 (stand-alone machine)


  *   40-core CPU (Intel Xeon Gold 6148 2.4GHz, 3.7GHz, 20C, 10.4GT/s 2UPI, 27M Cache, HT (150W) DDR4-2666) - i.e. 20 physical cores with hyper-threading for 40 logical cores, although in my case only 20 are used given the size of the system (see below)
  *   1 GPU (Nvidia Quadro RTX5000, 16GB, 4DP, VirtualLink (XX20T))
  *   Memory (512GB (8x64GB) 2666MHz DDR4 LRDIMM ECC) - although in reality only about 10 GB are used during my simulation
  *   CUDA 10.10

When I simulate my system, I've found that the optimal mdrun settings are extremely simple:

mdrun -pin on

This setting uses 1 MPI thread and 20 OpenMP threads, while both the PP and PME tasks are offloaded automatically to the GPU.  With this I'm getting about 800 ns/day.


Machine 2 (a GPU node in an HPC cluster, of which I can only use one node at a time - the details below are for a GPU node)


  *   16-core CPU (Intel Xeon E5-2600, 2.2 GHz)
  *   1 GPU (NVIDIA Tesla V100-PCIE-32GB)
  *   Memory (64GB RAM)
  *   CUDA 10.0

I've read the GROMACS documentation repeatedly, but I still can't work out how to choose the right mdrun settings for this machine.  I've tried several combinations, and the best performance I could get is with:

mpirun -np 16 gmx_mpi mdrun -ntomp 1

Even then, I only get about 45-50 ns/day, far lower than the 800 ns/day I get on the stand-alone machine.  This run uses 16 MPI processes with 2 OpenMP threads per MPI process, and only the PP tasks are carried out on the GPU.  The log states "Will do PME sum in reciprocal space for electrostatic interactions".  I also get the warning "On rank 0: oversubscribing the available 16 logical CPU cores per node with 32 threads. This will cause considerable performance loss", together with the note "Oversubscribing the CPU, will not pin threads".
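For reference, one variant I was considering trying is the following - this is only a sketch, assuming our GROMACS 2019 build allows PME offload to the GPU when a single rank handles PME, which I haven't verified on this cluster:

```shell
# Sketch, not a verified fix: run a single MPI rank so the 16 CPU cores
# are not oversubscribed, and ask mdrun to offload both the PP (non-bonded)
# and PME tasks to the V100, avoiding the CPU-side PME sum the log reports.
mpirun -np 1 gmx_mpi mdrun -ntomp 16 -nb gpu -pme gpu -pin on
```

If a thread-MPI build (`gmx` rather than `gmx_mpi`) is also installed on the node, I assume the equivalent would be `gmx mdrun -ntmpi 1 -ntomp 16 -nb gpu -pme gpu -pin on`, which is closer to what works on Machine 1.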


Does anyone have any guidance on what I should try to get good performance on the cluster's GPU node, please?  Any help would be really appreciated.

Thanks,

Robert
