[gmx-users] GPU waits for CPU, any remedies?
Michael Brunsteiner
mbx0009 at yahoo.com
Thu Sep 18 10:29:01 CEST 2014
Dear Szilard,
thanks for your reply!
one more question ... you wrote that SIMD-optimized RB dihedrals might get implemented
soon ... is there perhaps a link on gerrit.gromacs.org that i can use to follow the progress there?
cheers
michael
===============================
Why be happy when you could be normal?
>________________________________
> From: Szilárd Páll <pall.szilard at gmail.com>
>To: Michael Brunsteiner <mbx0009 at yahoo.com>
>Cc: Discussion list for GROMACS users <gmx-users at gromacs.org>; "gromacs.org_gmx-users at maillist.sys.kth.se" <gromacs.org_gmx-users at maillist.sys.kth.se>
>Sent: Wednesday, September 17, 2014 4:18 PM
>Subject: Re: [gmx-users] GPU waits for CPU, any remedies?
>
>
>Dear Michael,
>
>I checked and indeed, the Ryckaert-Bellemans dihedrals are not SIMD
>accelerated - that's why they are quite slow. While your CPU is the
>bottleneck, and you're quite right that the PP-PME balancing can't do
>much about this kind of imbalance, the good news is that it can be
>faster - even without a new CPU.
>
>With SIMD this term will accelerate quite well and will likely cut
>down your bonded time by a lot (I'd guess at least 3-4x with AVX,
>maybe more with FMA). This code has not been SIMD-optimized yet,
>mostly because in typical runs the RB computation takes relatively
>little time, and additionally the way these kernels need to be
>rewritten for SIMD acceleration is not very developer-friendly.
>However, it will likely get implemented soon, which in your case
>will bring big improvements.
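
[To see why the RB term vectorizes well: the Ryckaert-Bellemans energy is
V(psi) = C0 + C1*cos(psi) + ... + C5*cos^5(psi), a degree-5 polynomial in
cos(psi), so several dihedrals can share one Horner evaluation. The sketch
below is an illustration only, not the GROMACS kernel: it computes only the
energy, eight dihedrals at a time, and assumes cos(psi) and the six
coefficients are already available per dihedral; the function name and array
layout are made up for the example.]

/* Illustration only -- not the GROMACS kernel.
 * Evaluates the Ryckaert-Bellemans energy
 *     V(psi) = sum_{n=0..5} C_n cos^n(psi)
 * for eight dihedrals at a time with AVX/FMA.
 * Compile with e.g.: gcc -O2 -mavx2 -mfma
 * cospsi[] and the coefficient arrays c0..c5[] hold one entry per dihedral. */
#include <immintrin.h>

void rb_energy_avx(int n, const float *cospsi,
                   const float *c0, const float *c1, const float *c2,
                   const float *c3, const float *c4, const float *c5,
                   float *v)
{
    for (int i = 0; i < n; i += 8)   /* assumes n is a multiple of 8 */
    {
        __m256 x = _mm256_loadu_ps(cospsi + i);
        /* Horner's scheme: ((((C5*x + C4)*x + C3)*x + C2)*x + C1)*x + C0 */
        __m256 e = _mm256_loadu_ps(c5 + i);
        e = _mm256_fmadd_ps(e, x, _mm256_loadu_ps(c4 + i));
        e = _mm256_fmadd_ps(e, x, _mm256_loadu_ps(c3 + i));
        e = _mm256_fmadd_ps(e, x, _mm256_loadu_ps(c2 + i));
        e = _mm256_fmadd_ps(e, x, _mm256_loadu_ps(c1 + i));
        e = _mm256_fmadd_ps(e, x, _mm256_loadu_ps(c0 + i));
        _mm256_storeu_ps(v + i, e);
    }
}

[With FMA, the polynomial costs only a handful of fused multiply-adds per
eight dihedrals, which makes the 3-4x estimate above plausible.]
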
>
>Cheers,
>--
>Szilárd
>
>
>
>On Wed, Sep 17, 2014 at 3:01 PM, Michael Brunsteiner <mbx0009 at yahoo.com> wrote:
>>
>> Dear Szilard,
>> yes it seems i just should have done a bit more research regarding
>> the optimal CPU/GPU combination ... and as you point out, the
>> bonded interactions are the culprits ... most often people probably
>> simulate aqueous systems, in which LINCS does most of this work ...
>> here i have a polymer glass ... different story ...
>> the flops table you were missing was in my previous mail (see below
>> for another copy) and indeed it tells me that 65% of the CPU load is
>> "Force" while only 15.5% is for PME mesh, and i assume only the
>> latter is what can be modified by dynamic load balancing ... i
>> assume this means there is no way to improve things ... i guess i
>> just have to live with the fact that for this type of system my
>> slow CPU is the bottleneck ... if you have any other ideas please
>> let me know ...
>> regards
>> mic
>>
>>
>>
>>
>>
>> Computing:          Num   Num      Call    Wall time       Giga-Cycles
>>                     Ranks Threads  Count      (s)        total sum    %
>> -----------------------------------------------------------------------------
>> Neighbor search        1    12       251       0.574        23.403    2.1
>> Launch GPU ops.        1    12     10001       0.627        25.569    2.3
>> Force                  1    12     10001      17.392       709.604   64.5
>> PME mesh               1    12     10001       4.172       170.234   15.5
>> Wait GPU local         1    12     10001       0.206         8.401    0.8
>> NB X/F buffer ops.     1    12     19751       0.239         9.736    0.9
>> Write traj.            1    12        11       0.381        15.554    1.4
>> Update                 1    12     10001       0.303        12.365    1.1
>> Constraints            1    12     10001       1.458        59.489    5.4
>> Rest                                            1.621        66.139    6.0
>> -----------------------------------------------------------------------------
>> Total                                          26.973      1100.493  100.0
>>
>> ===============================
>>
>> Why be happy when you could be normal?
>>
>> --------------------------------------------
>> On Tue, 9/16/14, Szilárd Páll <pall.szilard at gmail.com> wrote:
>>
>> Subject: Re: [gmx-users] GPU waits for CPU, any remedies?
>> To: "Michael Brunsteiner" <mbx0009 at yahoo.com>
>> Cc: "Discussion list for GROMACS users" <gmx-users at gromacs.org>, "gromacs.org_gmx-users at maillist.sys.kth.se" <gromacs.org_gmx-users at maillist.sys.kth.se>
>> Date: Tuesday, September 16, 2014, 6:52 PM
>>
>> Well, it looks like you are i) unlucky, and ii) limited by the huge
>> bonded workload.
>>
>> i) As your system is quite small, mdrun thinks that there are no
>> convenient grids between 32x32x32 and 28x28x28 (see the PP-PME
>> tuning output). As the latter corresponds to quite a big jump in
>> cut-off (from 1.296 to 1.482 nm), which more than doubles the
>> non-bonded workload and is slower than the former, mdrun sticks to
>> using 1.296 nm as the Coulomb cut-off. You may be able to gain some
>> performance by tweaking your fourier grid spacing a bit to help
>> mdrun generate some additional grids that could give more cut-off
>> settings in the 1.3-1.48 nm range. However, on second thought,
>> there aren't more convenient grid sizes between 28 and 32, I guess.
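
[A quick sanity check on those numbers - a back-of-the-envelope sketch only,
assuming the tuner scales the Coulomb cut-off in proportion to the grid
spacing so that PME accuracy stays roughly constant: going from a 32x32x32 to
a 28x28x28 grid implies a cut-off of about 1.296 * 32/28 = 1.48 nm, matching
the 1.482 nm quoted above.]

/* Illustration only: how the cut-off scales with the FFT grid if cut-off
 * and grid spacing are scaled together. Numbers are from the thread above. */
#include <stdio.h>

int main(void)
{
    const double rc_base = 1.296;  /* nm, cut-off used with the 32x32x32 grid */
    const int    n_base  = 32;     /* grid points per dimension               */
    const int    n_next  = 28;     /* next smaller grid the tuner tried       */

    /* fewer grid points -> coarser spacing -> proportionally larger cut-off */
    double rc_next = rc_base * (double)n_base / (double)n_next;
    printf("cut-off with %dx%dx%d grid: %.3f nm\n",
           n_next, n_next, n_next, rc_next);
    return 0;
}
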
>>
>> ii) The primary issue, however, is that your bonded workload is much
>> higher than it normally is. I'm not fully familiar with the
>> implementation, but I think this may be due to the RB term, which is
>> quite slow. This time it's the flops table that could confirm this,
>> but as you still have not shared the entire log file, I can't tell.
>>
>> Cheers,
>> --
>> Szilárd
>>
>>
>
>