[gmx-users] GPU waits for CPU, any remedies?
Michael Brunsteiner
mbx0009 at yahoo.com
Thu Sep 18 10:29:01 CEST 2014
Dear Szilard,
thanks for your reply!
one more question ... you wrote that SIMD-optimized RB dihedrals might get implemented
soon ... is there perhaps a link on gerrit.gromacs.org that i can use to follow the progress there?
cheers
michael
===============================
Why be happy when you could be normal?
>________________________________
> From: Szilárd Páll <pall.szilard at gmail.com>
>To: Michael Brunsteiner <mbx0009 at yahoo.com>
>Cc: Discussion list for GROMACS users <gmx-users at gromacs.org>; "gromacs.org_gmx-users at maillist.sys.kth.se" <gromacs.org_gmx-users at maillist.sys.kth.se>
>Sent: Wednesday, September 17, 2014 4:18 PM
>Subject: Re: [gmx-users] GPU waits for CPU, any remedies?
>
>
>Dear Michael,
>
>I checked and indeed, the Ryckaert-Bellemans dihedrals are not SIMD
>accelerated - that's why they are quite slow. While your CPU is the
>bottleneck, and you're quite right that the PP-PME balancing can't do
>much about this kind of imbalance, the good news is that it can be
>faster - even without a new CPU.
>
>With SIMD this term will accelerate quite well and will likely cut
>down your bonded time by a lot (I'd guess at least 3-4x with AVX,
>maybe more with FMA). This code has not been SIMD-optimized yet,
>mostly because in typical runs the RB computation takes relatively
>little time, and additionally the way these kernels need to be
>rewritten for SIMD acceleration is not very developer-friendly.
>However, it will likely get implemented soon, which in your case
>will bring big improvements.
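
[To see why the RB term vectorizes well: the Ryckaert-Bellemans energy is
V(psi) = C0 + C1*cos(psi) + ... + C5*cos^5(psi), a degree-5 polynomial in
cos(psi), so several dihedrals can share one Horner evaluation. The sketch
below is an illustration only, not the GROMACS kernel: it computes only the
energy, eight dihedrals at a time, and assumes cos(psi) and the six
coefficients are already available per dihedral; the function name and array
layout are made up for the example.]

/* Illustration only -- not the GROMACS kernel.
 * Evaluates the Ryckaert-Bellemans energy
 *     V(psi) = sum_{n=0..5} C_n cos^n(psi)
 * for eight dihedrals at a time with AVX/FMA.
 * Compile with e.g.: gcc -O2 -mavx2 -mfma
 * cospsi[] and the coefficient arrays c0..c5[] hold one entry per dihedral. */
#include <immintrin.h>

void rb_energy_avx(int n, const float *cospsi,
                   const float *c0, const float *c1, const float *c2,
                   const float *c3, const float *c4, const float *c5,
                   float *v)
{
    for (int i = 0; i < n; i += 8)   /* assumes n is a multiple of 8 */
    {
        __m256 x = _mm256_loadu_ps(cospsi + i);
        /* Horner's scheme: ((((C5*x + C4)*x + C3)*x + C2)*x + C1)*x + C0 */
        __m256 e = _mm256_loadu_ps(c5 + i);
        e = _mm256_fmadd_ps(e, x, _mm256_loadu_ps(c4 + i));
        e = _mm256_fmadd_ps(e, x, _mm256_loadu_ps(c3 + i));
        e = _mm256_fmadd_ps(e, x, _mm256_loadu_ps(c2 + i));
        e = _mm256_fmadd_ps(e, x, _mm256_loadu_ps(c1 + i));
        e = _mm256_fmadd_ps(e, x, _mm256_loadu_ps(c0 + i));
        _mm256_storeu_ps(v + i, e);
    }
}

[With FMA, the polynomial costs only a handful of fused multiply-adds per
eight dihedrals, which makes the 3-4x estimate above plausible.]
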
>
>Cheers,
>--
>Szilárd
>
>
>
>On Wed, Sep 17, 2014 at 3:01 PM, Michael Brunsteiner <mbx0009 at yahoo.com> wrote:
>>
>> Dear Szilard,
>> yes it seems i just should have done a bit more research regarding
>> the optimal CPU/GPU combination ... and as you point out, the
>> bonded interactions are the culprits ... most often people probably
>> simulate aqueous systems, in which LINCS does most of this work ...
>> here i have a polymer glass ... different story ...
>> the flops table you were missing was in my previous mail (see below
>> for another copy) and indeed it tells me that 65% of the CPU load is
>> "Force" while only 15.5% is for PME mesh, and i assume only the
>> latter is what can be modified by dynamic load balancing ... i
>> assume this means there is no way to improve things ... i guess i
>> just have to live with the fact that for this type of system my
>> slow CPU is the bottleneck ... if you have any other ideas please
>> let me know ...
>> regards
>> mic
>>
>>
>>
>>
>>
>> Computing:          Num   Num      Call    Wall time       Giga-Cycles
>>                     Ranks Threads  Count      (s)        total sum    %
>> -----------------------------------------------------------------------------
>> Neighbor search        1    12       251       0.574        23.403    2.1
>> Launch GPU ops.        1    12     10001       0.627        25.569    2.3
>> Force                  1    12     10001      17.392       709.604   64.5
>> PME mesh               1    12     10001       4.172       170.234   15.5
>> Wait GPU local         1    12     10001       0.206         8.401    0.8
>> NB X/F buffer ops.     1    12     19751       0.239         9.736    0.9
>> Write traj.            1    12        11       0.381        15.554    1.4
>> Update                 1    12     10001       0.303        12.365    1.1
>> Constraints            1    12     10001       1.458        59.489    5.4
>> Rest                                            1.621        66.139    6.0
>> -----------------------------------------------------------------------------
>> Total                                          26.973      1100.493  100.0
>>
>> ===============================
>>
>> Why be happy when you could be normal?
>>
>> --------------------------------------------
>> On Tue, 9/16/14, Szilárd Páll <pall.szilard at gmail.com> wrote:
>>
>> Subject: Re: [gmx-users] GPU waits for CPU, any remedies?
>> To: "Michael Brunsteiner" <mbx0009 at yahoo.com>
>> Cc: "Discussion list for GROMACS users" <gmx-users at gromacs.org>, "gromacs.org_gmx-users at maillist.sys.kth.se" <gromacs.org_gmx-users at maillist.sys.kth.se>
>> Date: Tuesday, September 16, 2014, 6:52 PM
>>
>> Well, it looks like you are i) unlucky, and ii) limited by the huge
>> bonded workload.
>>
>> i) As your system is quite small, mdrun thinks that there are no
>> convenient grids between 32x32x32 and 28x28x28 (see the PP-PME
>> tuning output). As the latter corresponds to quite a big jump in
>> cut-off (from 1.296 to 1.482 nm), which more than doubles the
>> non-bonded workload and is slower than the former, mdrun sticks to
>> using 1.296 nm as the Coulomb cut-off. You may be able to gain some
>> performance by tweaking your fourier grid spacing a bit to help
>> mdrun generate some additional grids that could give more cut-off
>> settings in the 1.3-1.48 nm range. However, on second thought,
>> there aren't more convenient grid sizes between 28 and 32, I guess.
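
[A quick sanity check on those numbers - a back-of-the-envelope sketch only,
assuming the tuner scales the Coulomb cut-off in proportion to the grid
spacing so that PME accuracy stays roughly constant: going from a 32x32x32 to
a 28x28x28 grid implies a cut-off of about 1.296 * 32/28 = 1.48 nm, matching
the 1.482 nm quoted above.]

/* Illustration only: how the cut-off scales with the FFT grid if cut-off
 * and grid spacing are scaled together. Numbers are from the thread above. */
#include <stdio.h>

int main(void)
{
    const double rc_base = 1.296;  /* nm, cut-off used with the 32x32x32 grid */
    const int    n_base  = 32;     /* grid points per dimension               */
    const int    n_next  = 28;     /* next smaller grid the tuner tried       */

    /* fewer grid points -> coarser spacing -> proportionally larger cut-off */
    double rc_next = rc_base * (double)n_base / (double)n_next;
    printf("cut-off with %dx%dx%d grid: %.3f nm\n",
           n_next, n_next, n_next, rc_next);
    return 0;
}
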
>>
>> ii) The primary issue, however, is that your bonded workload is much
>> higher than it normally is. I'm not fully familiar with the
>> implementation, but I think this may be due to the RB term, which is
>> quite slow. This time it's the flops table that could confirm this,
>> but as you still have not shared the entire log file, I can't tell.
>>
>> Cheers,
>> --
>> Szilárd
>>
>>
>
>