[gmx-users] GPU waits for CPU, any remedies?
Michael Brunsteiner
mbx0009 at yahoo.com
Wed Sep 17 15:01:17 CEST 2014
Dear Szilard,
yes, it seems I just should have done a bit more research regarding
the optimal CPU/GPU combination ... and as you point out, the bonded
interactions are the culprits. Most often people probably simulate
aqueous systems, in which LINCS does most of this work; here I have
a polymer glass, which is a different story.
The flops table you were missing was in my previous mail (see below
for another copy), and indeed it tells me that 65% of the CPU load is
"Force" while only 15.5% is PME mesh, and I assume only the latter is
what can be shifted by dynamic load balancing. I assume this means
there is no way to improve things, and that I just have to live with
the fact that for this type of system my slow CPU is the bottleneck.
If you have any other ideas please let me know.
regards
mic
 Computing:            Num   Num      Call    Wall time     Giga-Cycles
                       Ranks Threads  Count      (s)       total sum      %
-----------------------------------------------------------------------------
 Neighbor search          1    12       251       0.574        23.403    2.1
 Launch GPU ops.          1    12     10001       0.627        25.569    2.3
 Force                    1    12     10001      17.392       709.604   64.5
 PME mesh                 1    12     10001       4.172       170.234   15.5
 Wait GPU local           1    12     10001       0.206         8.401    0.8
 NB X/F buffer ops.       1    12     19751       0.239         9.736    0.9
 Write traj.              1    12        11       0.381        15.554    1.4
 Update                   1    12     10001       0.303        12.365    1.1
 Constraints              1    12     10001       1.458        59.489    5.4
 Rest                                             1.621        66.139    6.0
-----------------------------------------------------------------------------
 Total                                           26.973      1100.493  100.0
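
(Just to put a number on that for myself: a rough Amdahl-style estimate,
treating the 15.5% PME-mesh fraction as the only work that load balancing
could shift off the CPU, gives

$$ S_{\max} \;\le\; \frac{1}{1 - f_{\mathrm{PME\,mesh}}} \;=\; \frac{1}{1 - 0.155} \;\approx\; 1.18 $$

so even in the best case I would gain less than 20%, while the 64.5% spent
in "Force" stays on the CPU.)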
===============================
Why be happy when you could be normal?
--------------------------------------------
On Tue, 9/16/14, Szilárd Páll <pall.szilard at gmail.com> wrote:
Subject: Re: [gmx-users] GPU waits for CPU, any remedies?
To: "Michael Brunsteiner" <mbx0009 at yahoo.com>
Cc: "Discussion list for GROMACS users" <gmx-users at gromacs.org>, "gromacs.org_gmx-users at maillist.sys.kth.se" <gromacs.org_gmx-users at maillist.sys.kth.se>
Date: Tuesday, September 16, 2014, 6:52 PM
Well, it looks like you are i) unlucky and ii) limited by the huge bonded
workload.
i) As your system is quite small, mdrun thinks that there are no convenient
grids between 32x32x32 and 28x28x28 (see the PP-PME tuning output). As the
latter corresponds to quite a big jump in cut-off (from 1.296 to 1.482 nm),
which more than doubles the non-bonded workload and is slower than the
former, mdrun sticks to using 1.296 nm as the Coulomb cut-off. You may be
able to gain some performance by tweaking your fourier grid spacing a bit,
to help mdrun generate some additional grids that would allow more cut-off
settings in the 1.3-1.48 nm range. However, on second thought, I guess
there aren't any more convenient grid sizes between 28 and 32.
ii) The primary issue, however, is that your bonded workload is much higher
than it normally is. I'm not fully familiar with the implementation, but I
think this may be due to the RB term, which is quite slow. This time it's
the flops table that could confirm this, but as you still have not shared
the entire log file, we/I can't tell.
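
(For reference, the Ryckaert-Bellemans form, as given in the GROMACS manual,
is a fifth-order polynomial in the cosine of the dihedral angle, evaluated
for every RB dihedral at every step:

$$ V_{\mathrm{rb}}(\phi_{ijkl}) \;=\; \sum_{n=0}^{5} C_n \bigl(\cos\psi\bigr)^{n}, \qquad \psi = \phi - 180^{\circ} $$

so a densely bonded polymer with many such terms puts a correspondingly
heavy bonded load on the CPU.)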
Cheers,
--
Szilárd