[gmx-users] RB dihedral performance greatly improved! [was: Re: GPU waits for CPU, any remedies?]
Szilárd Páll
pall.szilard at gmail.com
Thu Oct 2 03:11:31 CEST 2014
Hi,
The SIMD-accelerated RB dihedrals were implemented a few days ago, and as the
change turned out to be a relatively minor addition, we accepted it for the
5.0 series; it even made it into today's release!
Expect a considerable performance improvement in GPU-accelerated simulations of:
* systems that contain a large number of RB dihedrals;
* inhomogeneous systems that contain some RB dihedrals, when running
in parallel (due to the decreased load imbalance).
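For those not familiar with the term: the Ryckaert-Bellemans dihedral
potential, as defined in the GROMACS manual, is a fifth-order polynomial in
the cosine of the dihedral angle,

    V_RB(psi) = sum_{n=0}^{5} C_n * cos(psi)^n,   with psi = phi - 180 deg,

so evaluating it is plain, branch-free arithmetic per dihedral, which is
exactly the kind of work that maps well onto SIMD units.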
Cheers,
--
Szilárd
On Thu, Sep 18, 2014 at 10:28 AM, Michael Brunsteiner <mbx0009 at yahoo.com> wrote:
>
> Dear Szilard,
> thanks for your reply!
> one more question ... you wrote that SIMD-optimized RB dihedrals might get
> implemented soon ... is there perhaps a link on gerrit.gromacs.org that I
> can use to follow the progress?
> cheers
> michael
>
>
>
> ===============================
>
>
> Why be happy when you could be normal?
>
> ________________________________
> From: Szilárd Páll <pall.szilard at gmail.com>
> To: Michael Brunsteiner <mbx0009 at yahoo.com>
> Cc: Discussion list for GROMACS users <gmx-users at gromacs.org>;
> "gromacs.org_gmx-users at maillist.sys.kth.se"
> <gromacs.org_gmx-users at maillist.sys.kth.se>
> Sent: Wednesday, September 17, 2014 4:18 PM
>
> Subject: Re: [gmx-users] GPU waits for CPU, any remedies?
>
> Dear Michael,
>
> I checked and indeed, the Ryckaert-Bellemans dihedrals are not SIMD
> accelerated - that's why they are quite slow. While your CPU is the
> bottleneck, and you're quite right that the PP-PME balancing can't do
> much about this kind of imbalance, the good news is that things can
> get faster - even without a new CPU.
>
> With SIMD this term will accelerate quite well and will likely cut
> down your bonded time by a lot (I'd guess at least 3-4x with AVX,
> maybe more with FMA). This code has not been SIMD-optimized yet,
> mostly because in typical runs the RB computation takes relatively
> little time, and additionally the way these kernels need to be
> written/rewritten for SIMD acceleration is not very
> developer-friendly. However, it will likely get implemented soon,
> which in your case should bring a big improvement.
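> To illustrate why this vectorizes so well, here is a minimal, stand-alone
> C sketch (not the actual GROMACS kernel; the function names and the
> coefficients are made up for the example) that evaluates the RB polynomial
> for a batch of dihedrals with a Horner scheme. The loop iterations are
> independent and branch-free, so a compiler can auto-vectorize them, and the
> real kernels express the same idea through the GROMACS SIMD layer:
>
> /* rb_sketch.c -- illustrative only, compile with: cc rb_sketch.c -lm */
> #include <math.h>
> #include <stdio.h>
>
> #ifndef M_PI
> #define M_PI 3.14159265358979323846
> #endif
>
> #define NTERMS 6 /* C0..C5 */
>
> /* Sum of V(psi_i) = sum_n C_n * cos(psi_i)^n over all dihedrals;
>  * the caller has already applied psi = phi - 180 deg and taken cos(). */
> static double rb_energy(const double *cos_psi, int ndih,
>                         const double c[NTERMS])
> {
>     double vtot = 0.0;
>     for (int i = 0; i < ndih; i++) /* independent iterations: SIMD-friendly */
>     {
>         const double x = cos_psi[i];
>         double       v = c[NTERMS - 1];
>         for (int n = NTERMS - 2; n >= 0; n--)
>         {
>             v = v*x + c[n]; /* Horner: ((((C5*x+C4)*x+C3)*x+C2)*x+C1)*x+C0 */
>         }
>         vtot += v;
>     }
>     return vtot;
> }
>
> int main(void)
> {
>     /* made-up coefficients (kJ/mol), purely for illustration */
>     const double c[NTERMS] = { 9.28, 12.16, -13.12, -3.06, 26.24, -31.5 };
>     double cos_psi[4];
>     for (int i = 0; i < 4; i++)
>     {
>         double phi = i*30.0*M_PI/180.0;   /* phi = 0, 30, 60, 90 deg */
>         cos_psi[i] = cos(phi - M_PI);     /* psi = phi - 180 deg     */
>     }
>     printf("V_RB = %g kJ/mol\n", rb_energy(cos_psi, 4, c));
>     return 0;
> }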
>
> Cheers,
> --
> Szilárd
>
>
> On Wed, Sep 17, 2014 at 3:01 PM, Michael Brunsteiner <mbx0009 at yahoo.com>
> wrote:
>>
>> Dear Szilard,
>> yes, it seems I should have done a bit more research regarding the
>> optimal CPU/GPU combination ... and as you point out, the bonded
>> interactions are the culprit ... most often people probably simulate
>> aqueous systems, in which LINCS does most of this work; here I have a
>> polymer glass ... a different story ...
>> the flops table you were missing was in my previous mail (see below
>> for another copy), and indeed it tells me that 65% of the CPU load is
>> "Force" while only 15.5% is PME mesh, and I assume only the latter is
>> what can be modified by dynamic load balancing ... I assume this
>> means there is no way to improve things ... I guess I just have to
>> live with the fact that for this type of system my slow CPU is the
>> bottleneck ... if you have any other ideas, please let me know...
>> regards
>> mic
>>
>>
>>
>> Here is the table again:
>>
>> Computing:             Num    Num      Call    Wall time    Giga-Cycles
>>                        Ranks  Threads  Count      (s)       total sum     %
>> -----------------------------------------------------------------------------
>> Neighbor search           1     12       251      0.574        23.403    2.1
>> Launch GPU ops.           1     12     10001      0.627        25.569    2.3
>> Force                     1     12     10001     17.392       709.604   64.5
>> PME mesh                  1     12     10001      4.172       170.234   15.5
>> Wait GPU local            1     12     10001      0.206         8.401    0.8
>> NB X/F buffer ops.        1     12     19751      0.239         9.736    0.9
>> Write traj.               1     12        11      0.381        15.554    1.4
>> Update                    1     12     10001      0.303        12.365    1.1
>> Constraints               1     12     10001      1.458        59.489    5.4
>> Rest                                               1.621        66.139    6.0
>> -----------------------------------------------------------------------------
>> Total                                             26.973      1100.493  100.0
>>
>> ===============================
>>
>> Why be happy when you could be normal?
>>
>> --------------------------------------------
>> On Tue, 9/16/14, Szilárd Páll <pall.szilard at gmail.com> wrote:
>>
>> Subject: Re: [gmx-users] GPU waits for CPU, any remedies?
>> To: "Michael Brunsteiner" <mbx0009 at yahoo.com>
>> Cc: "Discussion list for GROMACS users" <gmx-users at gromacs.org>,
>> "gromacs.org_gmx-users at maillist.sys.kth.se"
>> <gromacs.org_gmx-users at maillist.sys.kth.se>
>> Date: Tuesday, September 16, 2014, 6:52 PM
>>
>> Well, it looks like you are i) unlucky and ii) limited by the huge
>> bonded workload.
>>
>> i) As your system is quite small, mdrun thinks there are no
>> convenient grids between 32x32x32 and 28x28x28 (see the PP-PME tuning
>> output). As the latter corresponds to quite a big jump in cut-off
>> (from 1.296 to 1.482 nm), which more than doubles the non-bonded
>> workload and is slower than the former, mdrun sticks to using
>> 1.296 nm as the Coulomb cut-off. You may be able to gain some
>> performance by tweaking your fourier grid spacing a bit to help mdrun
>> generate some additional grids that could give more cut-off settings
>> in the 1.3-1.48 nm range. However, on second thought, I guess there
>> aren't any more convenient grid sizes between 28 and 32.
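>>
>> For example, something along these lines in the .mdp could be worth a
>> quick test (the values are only illustrative; I can't tell what
>> spacing your run actually used without the log):
>>
>>   fourierspacing = 0.125   ; vs. the 0.12 nm default
>>   rcoulomb       = 1.2     ; starting cut-off, scaled up by the PP-PME tuning
>>
>> A slightly different spacing changes which grid sizes mdrun considers
>> convenient, and thus which cut-off/grid pairs the tuning can try.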
>>
>> ii) The primary issue, however, is that your bonded workload is much
>> higher than it normally is. I'm not fully familiar with the
>> implementation, but I think this may be due to the RB term, which is
>> quite slow. This time it's the flops table that could confirm this,
>> but as you still have not shared the entire log file, we/I can't
>> tell.
>>
>> Cheers,
>> --
>> Szilárd
>>
>>
>
>