[gmx-users] GPU waits for CPU, any remedies?
Szilárd Páll
pall.szilard at gmail.com
Wed Sep 17 16:18:13 CEST 2014
Dear Michael,
I checked and indeed, the Ryckaert-Bellemans dihedrals are not SIMD
accelerated - that's why they are quite slow. While your CPU is the
bottleneck and you're quite right that the PP-PME balancing can't do
much about this kind of imbalance, the good news is that it can be
faster - even without a new CPU.
With SIMD this will accelerate quite well and will likely cut down
your bonded time by a lot (I'd guess at least 3-4x with AVX, maybe
more with FMA). This code has not been SIMD-optimized yet, mostly
because in typical runs the RB computation takes relatively little
time, and additionally because rewriting these kernels for SIMD
acceleration is not exactly developer-friendly.
However, it will likely get implemented soon, which in your case
will bring big improvements.
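
To make the SIMD point a bit more concrete, below is a minimal sketch,
not the actual GROMACS kernel and with purely illustrative coefficients:
it evaluates the RB energy V(psi) = C0 + C1*cos(psi) + ... + C5*cos(psi)^5
with Horner's scheme, once in plain scalar C and once for eight dihedrals
at a time with AVX intrinsics.

/*
 * Minimal sketch, not the GROMACS kernel: evaluate the Ryckaert-Bellemans
 * energy  V(psi) = C0 + C1*cos(psi) + ... + C5*cos(psi)^5  for a batch of
 * dihedrals, once in scalar C and once with AVX, to show where the SIMD
 * speed-up would come from. The coefficients are illustrative only.
 * Build with e.g.:  gcc -O2 -mavx rb_sketch.c -o rb_sketch
 */
#include <immintrin.h>
#include <stdio.h>

#define NDIH 8   /* one AVX register worth of single-precision values */

/* Illustrative RB coefficients (kJ/mol) */
static const float C[6] = { 9.28f, 12.16f, -13.12f, -3.06f, 26.24f, -31.5f };

/* Scalar reference: Horner's scheme in cos(psi) */
static float rb_energy_scalar(float cospsi)
{
    float v = C[5];
    for (int n = 4; n >= 0; n--)
    {
        v = v*cospsi + C[n];
    }
    return v;
}

/* AVX version: the same Horner recurrence, 8 dihedrals per iteration */
static void rb_energy_avx(const float *cospsi, float *v, int ndih)
{
    for (int i = 0; i < ndih; i += 8)
    {
        __m256 c = _mm256_loadu_ps(cospsi + i);
        __m256 e = _mm256_set1_ps(C[5]);
        for (int n = 4; n >= 0; n--)
        {
            e = _mm256_add_ps(_mm256_mul_ps(e, c), _mm256_set1_ps(C[n]));
        }
        _mm256_storeu_ps(v + i, e);
    }
}

int main(void)
{
    float cospsi[NDIH], v_avx[NDIH];

    for (int i = 0; i < NDIH; i++)
    {
        cospsi[i] = -1.0f + 2.0f*i/(NDIH - 1);   /* spread cos(psi) over [-1,1] */
    }
    rb_energy_avx(cospsi, v_avx, NDIH);
    for (int i = 0; i < NDIH; i++)
    {
        printf("cos(psi) = %6.3f   scalar = %8.3f   avx = %8.3f\n",
               cospsi[i], rb_energy_scalar(cospsi[i]), v_avx[i]);
    }
    return 0;
}

The real kernel of course also needs the forces (dV/dpsi and the chain
rule back to the four atom positions), which is what makes rewriting it
in SIMD form more tedious than this energy-only sketch.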
Cheers,
--
Szilárd
On Wed, Sep 17, 2014 at 3:01 PM, Michael Brunsteiner <mbx0009 at yahoo.com> wrote:
>
> Dear Szilard,
> yes it seems i just should have done a bit more research regarding
> the optimal CPU/GPU combination ... and as you point out, the
> bonded interactions are the culprits ... most often people probably
> simulate aqueous systems, in which LINCS does most of this work ...
> here i have a polymer glass ... different story ...
> the flops table you miss was in my previous mail (see below for another
> copy) and indeed it tells me that 65% of the CPU load is "Force" while
> only 15.5% is for PME mesh, and i assume only the latter is what can
> be modified by dynamic load balancing ... i assume this means
> there is no way to improve things ... i guess i just have to live
> with the fact that for this type of system my slow CPU is the
> bottleneck ... if you have any other ideas please let me know...
> regards
> mic
>
>
>
> :
>
> Computing:            Num   Num      Call     Wall time     Giga-Cycles
>                       Ranks Threads  Count       (s)       total sum     %
> -----------------------------------------------------------------------------
> Neighbor search         1    12        251       0.574        23.403    2.1
> Launch GPU ops.         1    12      10001       0.627        25.569    2.3
> Force                   1    12      10001      17.392       709.604   64.5
> PME mesh                1    12      10001       4.172       170.234   15.5
> Wait GPU local          1    12      10001       0.206         8.401    0.8
> NB X/F buffer ops.      1    12      19751       0.239         9.736    0.9
> Write traj.             1    12         11       0.381        15.554    1.4
> Update                  1    12      10001       0.303        12.365    1.1
> Constraints             1    12      10001       1.458        59.489    5.4
> Rest                                              1.621        66.139    6.0
> -----------------------------------------------------------------------------
> Total                                            26.973      1100.493  100.0
>
> ===============================
>
> Why be happy when you could be normal?
>
> --------------------------------------------
> On Tue, 9/16/14, Szilárd Páll <pall.szilard at gmail.com> wrote:
>
> Subject: Re: [gmx-users] GPU waits for CPU, any remedies?
> To: "Michael Brunsteiner" <mbx0009 at yahoo.com>
> Cc: "Discussion list for GROMACS users" <gmx-users at gromacs.org>, "gromacs.org_gmx-users at maillist.sys.kth.se" <gromacs.org_gmx-users at maillist.sys.kth.se>
> Date: Tuesday, September 16, 2014, 6:52 PM
>
> Well, it looks like you are i) unlucky and ii) limited by the huge
> bonded workload.
>
> i) As your system is quite small, mdrun thinks that there are no
> convenient grids between 32x32x32 and 28x28x28 (see the PP-PME tuning
> output). As the latter corresponds to quite a big jump in cut-off
> (from 1.296 to 1.482 nm), which more than doubles the non-bonded
> workload and is slower than the former, mdrun sticks to using 1.296 nm
> as coulomb cut-off. You may be able to gain some performance by
> tweaking your fourier grid spacing a bit to help mdrun generate some
> additional grids that could give more cut-off settings in the 1.3-1.48
> range (a sketch of such a tweak is appended below). However, on second
> thought, I guess there aren't any more convenient grid sizes between
> 28 and 32.
>
> ii) The primary issue, however, is that your bonded workload is much
> higher than it normally is. I'm not fully familiar with the
> implementation, but I think this may be due to the RB term, which is
> quite slow. This time it's the flops table that could confirm this,
> but as you still have not shared the entire log file, we/I can't tell.
>
> Cheers,
> --
> Szilárd
>
>
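
For reference, the kind of .mdp tweak meant in point i) of the quoted
message would look roughly like the lines below; the spacing value is
illustrative only, and whether an intermediate grid actually becomes
available (and pays off) depends on the box and on what the PP-PME tuner
reports in the log.

; Illustrative sketch only, not a recommendation: coarsen the PME grid
; spacing slightly so that the PP-PME tuner has more grid/cut-off pairs
; to try between the 32x32x32 and 28x28x28 grids mentioned above.
fourierspacing  = 0.13    ; nm; the GROMACS default is 0.12
; The grid can also be pinned explicitly with fourier-nx/-ny/-nz,
; as long as the sizes are compatible with the box dimensions.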