[gmx-users] GPU waits for CPU, any remedies?

Szilárd Páll pall.szilard at gmail.com
Thu Sep 18 11:20:04 CEST 2014


For now I created a redmine issue:
http://redmine.gromacs.org/issues/1598, you can track the status
there.
--
Szilárd


On Thu, Sep 18, 2014 at 10:28 AM, Michael Brunsteiner <mbx0009 at yahoo.com> wrote:
>
> Dear Szilard,
> thanks for your reply!
> one more question ... you wrote that SIMD optimized RB-dihedrals might get
> implemented
> soon ... is there perhaps a link on gerrit.gromacs.org that i can use to
> follow the progress there?
> cheers
> michael
>
>
>
> ===============================
>
>
> Why be happy when you could be normal?
>
> ________________________________
> From: Szilárd Páll <pall.szilard at gmail.com>
> To: Michael Brunsteiner <mbx0009 at yahoo.com>
> Cc: Discussion list for GROMACS users <gmx-users at gromacs.org>;
> "gromacs.org_gmx-users at maillist.sys.kth.se"
> <gromacs.org_gmx-users at maillist.sys.kth.se>
> Sent: Wednesday, September 17, 2014 4:18 PM
>
> Subject: Re: [gmx-users] GPU waits for CPU, any remedies?
>
> Dear Michael,
>
> I checked and indeed, the Ryckaert-Bellemans dihedrals are not SIMD
> accelerated - that's why they are quite slow. Your CPU is the
> bottleneck, and you're quite right that the PP-PME balancing can't do
> much about this kind of imbalance; the good news, however, is that it
> can get faster - even without a new CPU.
>
> With SIMD this term will vectorize quite well, which will likely cut
> down your bonded time by a lot (I'd guess at least 3-4x with AVX,
> maybe more with FMA). This code has not been SIMD optimized yet,
> mostly because in typical runs the RB computation takes relatively
> little time, and additionally the way these kernels need to be
> (re)written for SIMD acceleration is not very developer-friendly.
> However, it will likely get implemented soon, which in your case
> should bring a big improvement.
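>
> To give a rough idea of why SIMD helps here: the RB potential is just
> a fifth-order polynomial in cos(psi), V(psi) = sum_{n=0..5} C_n*cos^n(psi),
> evaluated independently for every dihedral, so several dihedrals can
> be processed per instruction. The toy sketch below (plain AVX
> intrinsics, not the actual GROMACS kernel, and it leaves out the
> expensive part, computing the angle from the coordinates) illustrates
> the idea:
>
> #include <immintrin.h>
>
> /* scalar reference: Horner evaluation of V = sum_{n=0}^{5} c[n]*x^n */
> static float rb_scalar(const float c[6], float x)
> {
>     float v = c[5];
>     for (int n = 4; n >= 0; n--)
>     {
>         v = v*x + c[n];
>     }
>     return v;
> }
>
> /* AVX version: the same Horner scheme applied to 8 cos(psi) values
>  * at a time; ndih is assumed to be a multiple of 8 for simplicity. */
> static void rb_avx(const float c[6], const float *cos_psi,
>                    float *v, int ndih)
> {
>     for (int i = 0; i < ndih; i += 8)
>     {
>         __m256 x  = _mm256_loadu_ps(cos_psi + i);
>         __m256 vv = _mm256_set1_ps(c[5]);
>         for (int n = 4; n >= 0; n--)
>         {
>             /* vv = vv*x + c[n]; with -mfma this is a single FMA */
>             vv = _mm256_add_ps(_mm256_mul_ps(vv, x), _mm256_set1_ps(c[n]));
>         }
>         _mm256_storeu_ps(v + i, vv);
>     }
> }
>
> Built with -mavx (or -mfma), the inner loop handles 8 dihedrals per
> iteration, which is the kind of gain behind my 3-4x guess, assuming
> the angle calculation gets vectorized as well.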
>
> Cheers,
> --
> Szilárd
>
>
> On Wed, Sep 17, 2014 at 3:01 PM, Michael Brunsteiner <mbx0009 at yahoo.com>
> wrote:
>>
>> Dear Szilard,
>> yes it seems i just should have done a bit more research regarding
>> the optimal CPU/GPU combination ... and as you point out, the
>> bonded interactions are the culprits ... most often people probably
>> simulate aqueous systems, in which LINCS does most of this work ...
>> here i have a polymer glass ... different story ...
>> the flops table you were missing was in my previous mail (see below
>> for another copy) and indeed it tells me that 65% of the CPU load is
>> "Force" while only 15.5% is for PME mesh, and i assume only the
>> latter is what can be modified by dynamic load balancing ... i assume
>> this means there is no way to improve things ... i guess i just have
>> to live with the fact that for this type of system my slow CPU is the
>> bottleneck ... if you have any other ideas please let me know ...
>> regards
>> mic
>>
>>
>>
>>
>>  Computing:          Num   Num      Call    Wall time      Giga-Cycles
>>                      Ranks Threads  Count      (s)       total sum     %
>> -----------------------------------------------------------------------------
>>  Neighbor search        1    12       251       0.574        23.403   2.1
>>  Launch GPU ops.        1    12     10001       0.627        25.569   2.3
>>  Force                  1    12     10001      17.392       709.604  64.5
>>  PME mesh               1    12     10001       4.172       170.234  15.5
>>  Wait GPU local         1    12     10001       0.206         8.401   0.8
>>  NB X/F buffer ops.     1    12     19751       0.239         9.736   0.9
>>  Write traj.            1    12        11       0.381        15.554   1.4
>>  Update                 1    12     10001       0.303        12.365   1.1
>>  Constraints            1    12     10001       1.458        59.489   5.4
>>  Rest                                            1.621        66.139   6.0
>> -----------------------------------------------------------------------------
>>  Total                                          26.973      1100.493 100.0
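>>
>> (as a rough sanity check of the numbers above: even if the load
>> balancer could move the whole PME mesh part off the CPU for free, the
>> Amdahl-style upper bound on the speed-up is small ... the timings in
>> the little C snippet below are copied straight from the table, the
>> arithmetic is just my back-of-the-envelope estimate)
>>
>> #include <stdio.h>
>>
>> int main(void)
>> {
>>     /* wall-clock seconds from the cycle accounting table above */
>>     const double total    = 26.973;  /* Total                      */
>>     const double pme_mesh =  4.172;  /* PME mesh, the only part    */
>>                                      /* PP-PME balancing can shift */
>>
>>     /* best case: all PME mesh work disappears from the CPU side */
>>     printf("upper bound on speed-up: %.2fx\n",
>>            total / (total - pme_mesh));
>>     return 0;
>> }
>>
>> which comes out at about 1.18x, i.e. the ~65% spent in "Force" really
>> is the hard limit here.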
>>
>> ===============================
>>
>> Why be happy when you could be normal?
>>
>> --------------------------------------------
>> On Tue, 9/16/14, Szilárd Páll <pall.szilard at gmail.com> wrote:
>>
>>  Subject: Re: [gmx-users] GPU waits for CPU, any remedies?
>>  To: "Michael Brunsteiner" <mbx0009 at yahoo.com>
>>  Cc: "Discussion list for GROMACS users" <gmx-users at gromacs.org>,
>> "gromacs.org_gmx-users at maillist.sys.kth.se"
>> <gromacs.org_gmx-users at maillist.sys.kth.se>
>>  Date: Tuesday, September 16, 2014, 6:52 PM
>>
>>  Well, it looks like you are i) unlucky, ii) limited by the huge
>>  bonded workload.
>>
>>  i) As your system is quite small, mdrun thinks that there are no
>>  convenient grids between 32x32x32 and 28x28x28 (see the PP-PME
>>  tuning output). As the latter corresponds to quite a big jump in
>>  cut-off (from 1.296 to 1.482 nm), which more than doubles the
>>  non-bonded workload and ends up slower than the former, mdrun sticks
>>  to using 1.296 nm as coulomb cut-off. You may be able to gain some
>>  performance by tweaking your fourier grid spacing a bit, to help
>>  mdrun generate some additional grids that could allow more cut-off
>>  settings in the 1.3-1.48 nm range. However, on second thought, I
>>  guess there aren't more convenient grid sizes between 28 and 32.
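>>
>>  To make the grid-vs-spacing relation concrete, here is a small
>>  stand-alone sketch. It is not mdrun's actual tuning code: the box
>>  length is made up, and the simple "round up to a product of 2, 3, 5
>>  and 7" rule is my assumption of the main constraint, while mdrun
>>  applies further restrictions when it builds its candidate list:
>>
>>  #include <math.h>
>>  #include <stdio.h>
>>
>>  /* FFT-friendly sizes factor into the small primes 2, 3, 5 and 7 */
>>  static int fft_friendly(int n)
>>  {
>>      const int primes[] = { 2, 3, 5, 7 };
>>      for (int i = 0; i < 4; i++)
>>      {
>>          while (n % primes[i] == 0)
>>          {
>>              n /= primes[i];
>>          }
>>      }
>>      return n == 1;
>>  }
>>
>>  /* smallest FFT-friendly grid with at least box/spacing points */
>>  static int grid_size(double box, double spacing)
>>  {
>>      int n = (int)ceil(box / spacing);
>>      while (!fft_friendly(n))
>>      {
>>          n++;
>>      }
>>      return n;
>>  }
>>
>>  int main(void)
>>  {
>>      const double box = 4.7;  /* nm, made-up cubic box edge */
>>      for (double spacing = 0.12; spacing < 0.185; spacing += 0.01)
>>      {
>>          printf("fourier spacing %.2f nm -> grid of %d points\n",
>>                 spacing, grid_size(box, spacing));
>>      }
>>      return 0;
>>  }
>>
>>  For a small box only a handful of grid sizes are reachable at all,
>>  which is why nudging fourierspacing (or setting fourier-nx/ny/nz
>>  directly) is worth a try, but may not buy you much.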
>>
>>  ii) The primary issue, however, is that your bonded workload is much
>>  higher than it normally is. I'm not fully familiar with the
>>  implementation, but I think this may be due to the RB term, which is
>>  quite slow. This time it's the flops table that could confirm this,
>>  but as you still have not shared the entire log file, we/I can't
>>  tell.
>>
>>  Cheers,
>>  --
>>  Szilárd
>>
>>
>
>

