[gmx-users] The question of performance of GPU acceleration
Szilárd Páll
pall.szilard at gmail.com
Wed Jul 6 13:58:08 CEST 2016
Have you tested different ways to launch the run (different numbers of
ranks and threads)? With 12 ranks you seem to be getting quite a lot of
load imbalance, though that may or may not matter. Why pme_order=6?
Please share the full log files; they contain much more information
than just the bit you pasted here.
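For instance, instead of 12 single-threaded ranks you could try fewer
ranks with several OpenMP threads each. Something along these lines
(untested, and assuming the two K10 boards show up as four GPU devices
numbered 0-3):

  # 4 thread-MPI ranks, one per GPU, 3 OpenMP threads each
  gmx mdrun -ntmpi 4 -ntomp 3 -gpu_id 0123 -pin on -deffnm md

  # 12 thread-MPI ranks, 3 ranks sharing each GPU (roughly your current setup)
  gmx mdrun -ntmpi 12 -ntomp 1 -gpu_id 000111222333 -pin on -deffnm md

Timing a few thousand steps with each setup (e.g. -nsteps 5000
-resetstep 2500) is usually enough to see which launch configuration
performs best on your hardware.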
--
Szilárd
On Wed, Jul 6, 2016 at 5:56 AM, DeChang Li <li.dc06 at gmail.com> wrote:
> Dear all,
>
> I am using GPU acceleration in Gromacs-5.0.4 and would like to know
> whether the acceleration performance I am getting is reasonable.
>
>
> Here is my hardware:
>
> 2 x Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40 GHz, 12 physical cores in total.
> 2 NVIDIA Tesla K10 boards, 6144 GPU cores in total.
> 32 GB DDR4-2133 memory
>
>
> My simulation system contains about 480,000 atoms. I use PME with a
> Fourier grid spacing of 0.16 nm, pme_order = 6, a non-bonded cut-off of
> 1 nm, and nstlist = 40 (approximate .mdp lines below).
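>
> For reference, the corresponding .mdp settings should look roughly like
> this (assuming the Verlet cut-off scheme, since the non-bonded kernels
> run on the GPU):
>
>   cutoff-scheme   = Verlet
>   nstlist         = 40
>   rcoulomb        = 1.0
>   rvdw            = 1.0
>   coulombtype     = PME
>   fourierspacing  = 0.16
>   pme-order       = 6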
>
> The following is the performance report from the log file:
>
> M E G A - F L O P S A C C O U N T I N G
>
> NB=Group-cutoff nonbonded kernels NxN=N-by-N cluster Verlet kernels
> RF=Reaction-Field VdW=Van der Waals QSTab=quadratic-spline table
> W3=SPC/TIP3p W4=TIP4p (single or pairs)
> V&F=Potential and force V=Potential only F=Force only
>
> Computing: M-Number M-Flops % Flops
> -----------------------------------------------------------------------------
> NB VdW [V&F] 30048.030048 30048.030 0.0
> Pair Search distance check 1643106.927152 14787962.344 0.1
> NxN Ewald Elec. + LJ [F] 422678896.374912 27896807160.744 95.8
> NxN Ewald Elec. + LJ [V&F] 4269950.969088 456884753.692 1.6
> 1,4 nonbonded interactions 43309.043309 3897813.898 0.0
> Calc Weights 1426624.426623 51358479.358 0.2
> Spread Q Bspline 30434654.434624 60869308.869 0.2
> Gather F Bspline 30434654.434624 182607926.608 0.6
> 3D-FFT 44406686.579502 355253492.636 1.2
> Solve PME 138238.986240 8847295.119 0.0
> Reset In Box 11888.525000 35665.575 0.0
> CG-CoM 11889.000541 35667.002 0.0
> Propers 36440.036440 8344768.345 0.0
> Impropers 2746.002746 571168.571 0.0
> Virial 19043.716081 342786.889 0.0
> Stop-CM 4755.885541 47558.855 0.0
> Calc-Ekin 95109.151082 2567947.079 0.0
> Lincs 27924.896602 1675493.796 0.0
> Lincs-Mat 809415.182240 3237660.729 0.0
> Constraint-V 539622.857861 4316982.863 0.0
> Constraint-Vir 20468.413005 491241.912 0.0
> Settle 161257.688219 52086233.295 0.2
> (null) 1060.001060 0.000 0.0
> -----------------------------------------------------------------------------
> Total 29105097416.211 100.0
> -----------------------------------------------------------------------------
>
>
> D O M A I N D E C O M P O S I T I O N S T A T I S T I C S
>
> av. #atoms communicated per step for force: 2 x 341414.5
> av. #atoms communicated per step for LINCS: 2 x 36121.5
>
> Average load imbalance: 75.3 %
> Part of the total run time spent waiting due to load imbalance: 4.1 %
> Steps where the load balancing was limited by -rdd, -rcon and/or -dds: Y 2 %
>
>
> R E A L C Y C L E A N D T I M E A C C O U N T I N G
>
> On 12 MPI ranks
>
> Computing: Num Num Call Wall time Giga-Cycles
> Ranks Threads Count (s) total sum %
> -----------------------------------------------------------------------------
> Domain decomp. 12 1 25000 1011.294 29057.964 2.9
> DD comm. load 12 1 25000 85.817 2465.806 0.2
> DD comm. bounds 12 1 25000 124.974 3590.946 0.4
> Neighbor search 12 1 25001 483.458 13891.413 1.4
> Launch GPU ops. 12 1 2000002 124.094 3565.638 0.4
> Comm. coord. 12 1 975000 1905.608 54754.706 5.5
> Force 12 1 1000001 1489.965 42811.850 4.3
> Wait + Comm. F 12 1 1000001 435.575 12515.570 1.3
> PME mesh 12 1 1000001 19507.755 560525.285 56.2
> Wait GPU nonlocal 12 1 1000001 17.722 509.211 0.1
> Wait GPU local 12 1 1000001 5.608 161.146 0.0
> NB X/F buffer ops. 12 1 3950002 458.601 13177.195 1.3
> COM pull force 12 1 1000001 640.120 18392.870 1.8
> Write traj. 12 1 539 17.620 506.289 0.1
> Update 12 1 1000001 1912.108 54941.466 5.5
> Constraints 12 1 1000001 5255.916 151020.645 15.1
> Comm. energies 12 1 100001 916.317 26328.958 2.6
> Rest 300.654 8638.816 0.9
> -----------------------------------------------------------------------------
> Total 34693.203 996855.772 100.0
> -----------------------------------------------------------------------------
> Breakdown of PME mesh computation
> -----------------------------------------------------------------------------
> PME redist. X/F 12 1 2000002 7236.634 207933.530 20.9
> PME spread/gather 12 1 2000002 7925.435 227725.169 22.8
> PME 3D-FFT 12 1 2000002 2616.866 75191.620 7.5
> PME 3D-FFT Comm. 12 1 2000002 1506.177 43277.664 4.3
> PME solve Elec 12 1 1000001 217.415 6247.077 0.6
> -----------------------------------------------------------------------------
>
> Core t (s) Wall t (s) (%)
> Time: 415256.517 34693.203 1196.9
> 9h38:13
> (ns/day) (hour/ns)
> Performance: 4.981 4.818
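>
> (As a sanity check on these numbers: Core t / Wall t = 415256.5 s /
> 34693.2 s is about 12, consistent with 12 single-threaded ranks, and
> 24 / 4.981 ns/day gives the quoted 4.818 hours/ns.)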