Dear Szilard,
yes, it seems I should have done a bit more research regarding
the optimal CPU/GPU combination ... and as you point out, the
bonded interactions are the culprits ... most people probably
simulate aqueous systems, in which LINCS does most of this work;
here I have a polymer glass ... different story ...
The flops table you were missing was in my previous mail (see below for
another copy), and indeed it tells me that about 65% of the CPU load is
"Force" while only 15.5% is for PME mesh. I assume only the latter can
be modified by dynamic load balancing, so I assume this means
there is no way to improve things ... I guess I just have to live
with the fact that for this type of system my slow CPU is the
bottleneck ... if you have any other ideas, please let me know ...


 Computing:          Num   Num      Call    Wall time         Giga-Cycles
                     Ranks Threads  Count      (s)         total sum    %
 Neighbor search        1   12        251       0.574         23.403   2.1
 Launch GPU ops.        1   12      10001       0.627         25.569   2.3
 Force                  1   12      10001      17.392        709.604  64.5
 PME mesh               1   12      10001       4.172        170.234  15.5
 Wait GPU local         1   12      10001       0.206          8.401   0.8
 NB X/F buffer ops.     1   12      19751       0.239          9.736   0.9
 Write traj.            1   12         11       0.381         15.554   1.4
 Update                 1   12      10001       0.303         12.365   1.1
 Constraints            1   12      10001       1.458         59.489   5.4
 Rest                                           1.621         66.139   6.0
 Total                                         26.973       1100.493 100.0
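
As a quick sanity check on the percentages quoted above, here is a minimal Python sketch that recomputes each row's share of the wall time. The numbers are copied by hand from the table; the names `rows` and `shares` are just illustrative, and this is not something mdrun itself provides.

```python
# Wall times (s) copied by hand from the cycle accounting table above.
rows = [
    ("Neighbor search", 0.574),
    ("Launch GPU ops.", 0.627),
    ("Force", 17.392),
    ("PME mesh", 4.172),
    ("Wait GPU local", 0.206),
    ("NB X/F buffer ops.", 0.239),
    ("Write traj.", 0.381),
    ("Update", 0.303),
    ("Constraints", 1.458),
    ("Rest", 1.621),
]

# Total wall time; should reproduce the "Total" row of the table.
total = sum(t for _, t in rows)

# Share of the total wall time per row, in percent.
shares = {name: 100.0 * t / total for name, t in rows}

# Print the rows from most to least expensive.
for name, pct in sorted(shares.items(), key=lambda kv: -kv[1]):
    print(f"{name:<20s} {pct:5.1f} %")
```

The wall-time shares agree with the Giga-Cycles percentage column: "Force" comes out at roughly 64.5% and "PME mesh" at roughly 15.5%.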


Why be happy when you could be normal?

 Well, it looks like you are i) unlucky and ii) limited by the huge
 bonded workload.
 i) As your system is quite small, mdrun thinks that there are no
 convenient grids between 32x32x32 and 28x28x28 (see the PP-PME tuning
 output). As the latter corresponds to quite a big jump in cut-off
 (from 1.296 to 1.482 nm), which more than doubles the non-bonded
 workload and is slower than the former, mdrun sticks to using 1.296 nm
 as the Coulomb cut-off. You may be able to gain some performance by
 adjusting your Fourier grid spacing a bit to help mdrun generate
 additional grids that could give more cut-off settings in the
 1.3-1.48 nm range. However, on second thought, there aren't more
 convenient grid sizes between 28 and 32, I guess.
 ii) The primary issue, however, is that your bonded workload is much
 higher than it normally is. I'm not fully familiar with the
 implementation, but I think this may be due to the RB term, which is
 quite slow. This time it's the flops table that could confirm
 this, but as you still have not shared the entire log file, we/I can't

