[gmx-users] Maximising Hardware Performance on Local node: Optimal settings

Wed Dec 4 18:36:04 CET 2019

Hi Matt,

Here are a few points that might help you; maybe other experts can contribute more.

If you're running on a single machine, using thread-MPI rather than a real MPI library is a good choice.

Setting "-pin on", which pins the mdrun threads to cores, might also help.

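As a rough sketch of the two points above, a single-rank thread-MPI launch with explicit thread counts and pinning could look like the following. The file prefix "topol" is a placeholder, and 24 OpenMP threads is only an assumption based on the 24 physical cores in the quoted hardware; the script prints the command and runs it only when RUN=1 is set.

```shell
#!/bin/sh
# Sketch only: single-node launch with one thread-MPI rank and pinned threads.
# Assumptions (not from the thread): run input files use the prefix "topol",
# and 24 OpenMP threads matches the 24 physical cores of the machine.
ntomp=24
cmd="gmx mdrun -deffnm topol -ntmpi 1 -ntomp $ntomp -pin on"

# Print the command so it can be inspected first.
echo "$cmd"

# Execute only when explicitly requested, e.g. RUN=1 sh launch.sh
if [ "${RUN:-0}" = "1" ]; then
    $cmd
fi
```
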
60k atoms is not very large. There are some other systems ready to benchmark at https://www.mpibpc.mpg.de/grubmueller/bench 
that should be able to tell you more about your performance on a range of systems.
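
For whichever input you benchmark, a short scan over OpenMP thread counts might be a quick way to see whether 24 or 48 threads suits your machine better. This is a sketch under assumptions: "topol.tpr" is a placeholder input name, and the step count is arbitrary. "-resethway" resets the timers halfway through the run and "-noconfout" skips writing the final configuration, so short runs give cleaner timings.

```shell
#!/bin/sh
# Sketch only: scan a few OpenMP thread counts with short benchmark runs.
# Assumptions (not from the thread): input is topol.tpr; 20000 steps is an
# arbitrary short length; 12/24/48 are simply candidates for a 24-core,
# 48-logical-core machine.
tpr=topol.tpr
count=0
for ntomp in 12 24 48; do
    cmd="gmx mdrun -s $tpr -deffnm bench_ntomp$ntomp -ntmpi 1 -ntomp $ntomp -pin on -nsteps 20000 -resethway -noconfout"
    echo "$cmd"
    count=$((count + 1))
    # Execute only when explicitly requested with RUN=1.
    if [ "${RUN:-0}" = "1" ]; then
        $cmd
    fi
done
```

The ns/day figure for each setting is then reported at the end of the corresponding bench_ntomp*.log file.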

It is normal that the GPU is not fully utilized. The newest GROMACS release should be able to make more use of the GPU, 
so you might want to try out the beta-3 version to get an idea. Please don't use it for production, though; wait until 
January, when GROMACS-2020 is released.

If you want to maximise sampling, include running multiple simulations simultaneously in your benchmark set (mdrun 
-multidir makes this easy). Most often this is what you actually want, and it can give you a drastic increase in 
output from your hardware (as a long-shot guess, you might get 4 * 150 ns/day).
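
A minimal sketch of such a multi-simulation launch follows. As far as I know, -multidir needs an MPI-enabled build (here assumed to be installed as gmx_mpi and started with mpirun) rather than thread-MPI. The directory names sim1 to sim4 are hypothetical, each is assumed to hold its own topol.tpr, and the 150 ns/day per simulation is just the guess from above, not a measurement.

```shell
#!/bin/sh
# Sketch only: four simulations sharing one node and one GPU via -multidir.
# Assumptions (not from the thread): an MPI build named gmx_mpi, an mpirun
# launcher, and directories sim1..sim4 that each contain a topol.tpr.
dirs="sim1 sim2 sim3 sim4"
set -- $dirs
nsim=$#                            # number of simulations (4 here)
threads_per_sim=$((24 / nsim))     # split the 24 physical cores evenly

cmd="mpirun -np $nsim gmx_mpi mdrun -multidir $dirs -ntomp $threads_per_sim -pin on"
echo "$cmd"

# Aggregate throughput if each simulation reached the guessed 150 ns/day.
per_sim=150
aggregate=$((nsim * per_sim))
echo "guessed aggregate: $aggregate ns/day"

# Execute only when explicitly requested with RUN=1.
if [ "${RUN:-0}" = "1" ]; then
    $cmd
fi
```

That guess would amount to 600 ns/day in total, compared with the roughly 210 ns/day of the single run.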

I assume you have already had a look at these, but for reference:

http://manual.gromacs.org/documentation/current/user-guide/mdrun-performance.html

http://manual.gromacs.org/documentation/current/onlinehelp/gmx-mdrun.html
http://manual.gromacs.org/documentation/current/user-guide/mdrun-features.html

https://onlinelibrary.wiley.com/doi/abs/10.1002/jcc.26011

Best,

Christian

On 2019-12-04 17:53, Matthew Fisher wrote:
> Dear all,
>
> We're currently running some experiments with a new hardware configuration and attempting to maximise performance from it. Our system contains 1x V100 and 2x 12 core (24 logical) Xeon Silver 4214 CPUs which, after optimisation of CUDA drivers & kernels etc., we've been able to get a performance of 210 ns/day for 60k atoms with GROMACS 2019.3 (allowing mdrun to select threads, which has surprised us as it only creates 24 OpenMP threads for our 48 logical core system). Furthermore we have a surprising amount of wasted GPU time. Therefore, we were wondering if anyone had any advice on how we could maximise our hardware output? We've enclosed the real cycle and time accounting display below.
>
> Any help will be massively appreciated
>
> Thanks,
> Matt
>
>       R E A L   C Y C L E   A N D   T I M E   A C C O U N T I N G
>
> On 1 MPI rank, each using 24 OpenMP threads
>
>   Computing:          Num   Num      Call    Wall time         Giga-Cycles
>                       Ranks Threads  Count      (s)         total sum    %
> -----------------------------------------------------------------------------
>   Neighbor search        1   24      12501      32.590       1716.686   3.2
>   Launch GPU ops.        1   24    2500002     105.169       5539.764  10.2
>   Force                  1   24    1250001     140.283       7389.414  13.6
>   Wait PME GPU gather    1   24    1250001      79.714       4198.902   7.7
>   Reduce GPU PME F       1   24    1250001      25.159       1325.260   2.4
>   Wait GPU NB local      1   24    1250001     264.961      13956.769  25.7
>   NB X/F buffer ops.     1   24    2487501     177.862       9368.871  17.3
>   Write traj.            1   24        252       5.748        302.799   0.6
>   Update                 1   24    1250001      81.151       4274.601   7.9
>   Constraints            1   24    1250001      70.231       3699.389   6.8
>   Rest                                          47.521       2503.167   4.6
> -----------------------------------------------------------------------------
>   Total                                       1030.389      54275.623 100.0
> -----------------------------------------------------------------------------
>
>                 Core t (s)   Wall t (s)        (%)
>         Time:    24729.331     1030.389     2400.0
>                   (ns/day)    (hour/ns)
> Performance:      209.630        0.114