[gmx-users] Maximising Hardware Performance on Local node: Optimal settings

Kevin Boyd kevin.boyd at uconn.edu
Sat Dec 7 23:45:40 CET 2019


Hi,

I also wrote up some examples on optimizing for multiple simulations on the
same node; see

https://mailman-1.sys.kth.se/pipermail/gromacs.org_gmx-users/2019-July/126007.html
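
A minimal sketch of that idea, running two independent simulations side by
side on one node (the run names, thread counts, and pin offsets below are
only illustrative and need adjusting to your own core layout):

    gmx mdrun -deffnm run1 -ntmpi 1 -ntomp 12 -pin on -pinoffset 0  &
    gmx mdrun -deffnm run2 -ntmpi 1 -ntomp 12 -pin on -pinoffset 12 &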

On Wed, Dec 4, 2019 at 9:36 AM Christian Blau <blau at kth.se> wrote:

> Hi Matt,
>
>
> Here are a few bullet points that might help you; maybe other experts can
> contribute more.
>
>
> If you're running on a single machine, using thread-MPI rather than MPI is
> a good choice.
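>
> A minimal single-node example might look like the line below ("md" is just
> a placeholder run name, and the thread counts are a guess for your
> hardware, so it is worth benchmarking a few combinations):
>
>     gmx mdrun -deffnm md -ntmpi 1 -ntomp 24 -nb gpu -pme gpu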
>
> "-pin on" might help you.
>
> 60k atoms is not very large; there are some other systems ready to
> benchmark at https://www.mpibpc.mpg.de/grubmueller/bench
> that will be able to tell you more about your performance on a range of
> systems.
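>
> A short timed run of one of those inputs could look roughly like this
> (benchMEM.tpr stands for one of the downloaded input files; -resethway and
> -noconfout keep the timings clean):
>
>     gmx mdrun -s benchMEM.tpr -nsteps 10000 -resethway -noconfout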
>
> It is normal that the GPU is not fully utilized; the newest GROMACS
> release should be able to make more use of the GPU.
> You might want to try out the beta-3 version to get an idea, but please
> don't use it for production; wait until
> January when GROMACS 2020 is released.
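>
> With the 2020 beta you can additionally try moving the update and
> constraint steps onto the GPU, e.g. something along these lines (again, for
> testing only, not production):
>
>     gmx mdrun -deffnm md -nb gpu -pme gpu -bonded gpu -update gpu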
>
> If you want to maximise sampling, incorporate running multiple simulations
> simultaneously in your benchmark set (mdrun
> -multidir makes things easy here; see the sketch below). Most often this is
> what you actually want, and it can give you a drastic increase in
> output from your hardware (as a rough guess, you might get 4 * 150
> ns/day).
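>
> A sketch of that setup (note that -multidir needs an MPI-enabled build,
> i.e. gmx_mpi rather than thread-MPI, and the directory names are
> placeholders):
>
>     mpirun -np 4 gmx_mpi mdrun -multidir sim1 sim2 sim3 sim4 -ntomp 12 -pin on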
>
>
> I assume you have already had a look at these, but for reference check here:
>
>
> http://manual.gromacs.org/documentation/current/user-guide/mdrun-performance.html
>
> http://manual.gromacs.org/documentation/current/onlinehelp/gmx-mdrun.html
>
> http://manual.gromacs.org/documentation/current/user-guide/mdrun-features.html
>
> https://onlinelibrary.wiley.com/doi/abs/10.1002/jcc.26011
>
>
> Best,
>
> Christian
>
> On 2019-12-04 17:53, Matthew Fisher wrote:
> > Dear all,
> >
> > We're currently running some experiments with a new hardware
> > configuration and attempting to maximise performance from it. Our system
> > contains 1x V100 and 2x 12-core (24 logical) Xeon Silver 4214 CPUs, with
> > which, after optimisation of CUDA drivers & kernels etc., we've been able
> > to get 210 ns/day for 60k atoms with GROMACS 2019.3 (allowing mdrun to
> > select threads, which has surprised us as it only creates 24 OpenMP
> > threads for our 48-logical-core system). Furthermore, we have a surprising
> > amount of wasted GPU time. Therefore, we were wondering if anyone had any
> > advice on how we could maximise our hardware output? We've enclosed the
> > real cycle and time accounting display below.
> >
> > Any help will be massively appreciated
> >
> > Thanks,
> > Matt
> >
> >       R E A L   C Y C L E   A N D   T I M E   A C C O U N T I N G
> >
> > On 1 MPI rank, each using 24 OpenMP threads
> >
> >  Computing:            Num   Num      Call    Wall time   Giga-Cycles
> >                        Ranks Threads  Count      (s)       total sum    %
> > -----------------------------------------------------------------------------
> >  Neighbor search          1    24      12501      32.590     1716.686   3.2
> >  Launch GPU ops.          1    24    2500002     105.169     5539.764  10.2
> >  Force                    1    24    1250001     140.283     7389.414  13.6
> >  Wait PME GPU gather      1    24    1250001      79.714     4198.902   7.7
> >  Reduce GPU PME F         1    24    1250001      25.159     1325.260   2.4
> >  Wait GPU NB local        1    24    1250001     264.961    13956.769  25.7
> >  NB X/F buffer ops.       1    24    2487501     177.862     9368.871  17.3
> >  Write traj.              1    24        252       5.748      302.799   0.6
> >  Update                   1    24    1250001      81.151     4274.601   7.9
> >  Constraints              1    24    1250001      70.231     3699.389   6.8
> >  Rest                                             47.521     2503.167   4.6
> > -----------------------------------------------------------------------------
> >  Total                                          1030.389    54275.623 100.0
> > -----------------------------------------------------------------------------
> >
> >                Core t (s)   Wall t (s)        (%)
> >        Time:    24729.331     1030.389     2400.0
> >                  (ns/day)    (hour/ns)
> > Performance:      209.630        0.114

