[gmx-users] The question of performance of GPU acceleration

DeChang Li li.dc06 at gmail.com
Wed Jul 6 05:56:36 CEST 2016

Dear all,

I am running GROMACS 5.0.4 with GPU acceleration and would like to know
whether the performance I am getting is reasonable for my hardware.

Here is my hardware:

2 x Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40GHz, 12 physical cores in total.
2 x NVIDIA Tesla K10 GPU boards, 6144 GPU processor cores in total.
32GB DDR4 2133MHz memory

My simulation system contains about 480,000 atoms. I use PME with a grid
spacing of 0.16, pme_order = 6, a non-bonded cut-off of 1 nm, and nstlist = 40.

Here is the performance section of the log file:

M E G A - F L O P S   A C C O U N T I N G

 NB=Group-cutoff nonbonded kernels    NxN=N-by-N cluster Verlet kernels
 RF=Reaction-Field  VdW=Van der Waals  QSTab=quadratic-spline table
 W3=SPC/TIP3p  W4=TIP4p (single or pairs)
 V&F=Potential and force  V=Potential only  F=Force only

 Computing:                               M-Number         M-Flops  % Flops
 NB VdW [V&F]                         30048.030048       30048.030     0.0
 Pair Search distance check         1643106.927152    14787962.344     0.1
 NxN Ewald Elec. + LJ [F]         422678896.374912 27896807160.744    95.8
 NxN Ewald Elec. + LJ [V&F]         4269950.969088   456884753.692     1.6
 1,4 nonbonded interactions           43309.043309     3897813.898     0.0
 Calc Weights                       1426624.426623    51358479.358     0.2
 Spread Q Bspline                  30434654.434624    60869308.869     0.2
 Gather F Bspline                  30434654.434624   182607926.608     0.6
 3D-FFT                            44406686.579502   355253492.636     1.2
 Solve PME                           138238.986240     8847295.119     0.0
 Reset In Box                         11888.525000       35665.575     0.0
 CG-CoM                               11889.000541       35667.002     0.0
 Propers                              36440.036440     8344768.345     0.0
 Impropers                             2746.002746      571168.571     0.0
 Virial                               19043.716081      342786.889     0.0
 Stop-CM                               4755.885541       47558.855     0.0
 Calc-Ekin                            95109.151082     2567947.079     0.0
 Lincs                                27924.896602     1675493.796     0.0
 Lincs-Mat                           809415.182240     3237660.729     0.0
 Constraint-V                        539622.857861     4316982.863     0.0
 Constraint-Vir                       20468.413005      491241.912     0.0
 Settle                              161257.688219    52086233.295     0.2
 (null)                                1060.001060           0.000     0.0
 Total                                             29105097416.211   100.0
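
As a quick sanity check on the table above, the "% Flops" column can be recomputed from the M-Flops column. The sketch below is only illustrative: the values are copied from the table, and the names `mflops`, `total_mflops` and `flops_share` are made up for this example. It shows that almost all floating-point work is in the NxN Verlet kernels, which (as far as I understand) are the part that GROMACS 5.0 offloads to the GPU.

```python
# M-Flops values copied from selected rows of the accounting table above.
mflops = {
    "NxN Ewald Elec. + LJ [F]": 27896807160.744,
    "NxN Ewald Elec. + LJ [V&F]": 456884753.692,
    "3D-FFT": 355253492.636,
    "Gather F Bspline": 182607926.608,
    "Spread Q Bspline": 60869308.869,
}
total_mflops = 29105097416.211  # the "Total" row


def flops_share(name):
    """Return one row's percentage of the total M-Flops."""
    return 100.0 * mflops[name] / total_mflops


for name in mflops:
    print(f"{name:28s} {flops_share(name):5.1f} %")
```

The flop count says nothing about where the wall time goes, so it should be read together with the cycle and time accounting further down.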

    D O M A I N   D E C O M P O S I T I O N   S T A T I S T I C S

 av. #atoms communicated per step for force:  2 x 341414.5
 av. #atoms communicated per step for LINCS:  2 x 36121.5

 Average load imbalance: 75.3 %
 Part of the total run time spent waiting due to load imbalance: 4.1 %
 Steps where the load balancing was limited by -rdd, -rcon and/or -dds: Y 2

     R E A L   C Y C L E   A N D   T I M E   A C C O U N T I N G

On 12 MPI ranks

 Computing:          Num   Num      Call    Wall time         Giga-Cycles
                     Ranks Threads  Count      (s)         total sum    %
 Domain decomp.        12    1      25000    1011.294      29057.964   2.9
 DD comm. load         12    1      25000      85.817       2465.806   0.2
 DD comm. bounds       12    1      25000     124.974       3590.946   0.4
 Neighbor search       12    1      25001     483.458      13891.413   1.4
 Launch GPU ops.       12    1    2000002     124.094       3565.638   0.4
 Comm. coord.          12    1     975000    1905.608      54754.706   5.5
 Force                 12    1    1000001    1489.965      42811.850   4.3
 Wait + Comm. F        12    1    1000001     435.575      12515.570   1.3
 PME mesh              12    1    1000001   19507.755     560525.285  56.2
 Wait GPU nonlocal     12    1    1000001      17.722        509.211   0.1
 Wait GPU local        12    1    1000001       5.608        161.146   0.0
 NB X/F buffer ops.    12    1    3950002     458.601      13177.195   1.3
 COM pull force        12    1    1000001     640.120      18392.870   1.8
 Write traj.           12    1        539      17.620        506.289   0.1
 Update                12    1    1000001    1912.108      54941.466   5.5
 Constraints           12    1    1000001    5255.916     151020.645  15.1
 Comm. energies        12    1     100001     916.317      26328.958   2.6
 Rest                                         300.654       8638.816   0.9
 Total                                      34693.203     996855.772 100.0
 Breakdown of PME mesh computation
 PME redist. X/F       12    1    2000002    7236.634     207933.530  20.9
 PME spread/gather     12    1    2000002    7925.435     227725.169  22.8
 PME 3D-FFT            12    1    2000002    2616.866      75191.620   7.5
 PME 3D-FFT Comm.      12    1    2000002    1506.177      43277.664   4.3
 PME solve Elec        12    1    1000001     217.415       6247.077   0.6
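
To see where the wall time goes, the rows of the table above can be ranked by their share of the total. This is a minimal sketch with values copied from the table; the names `wall_s`, `total_wall_s` and `wall_share` are hypothetical. It suggests that the CPU-side PME mesh and constraints account for most of the run time, while the ranks spend almost no time waiting for the GPU.

```python
# Wall times in seconds, copied from selected rows of the table above.
wall_s = {
    "PME mesh": 19507.755,
    "Constraints": 5255.916,
    "Update": 1912.108,
    "Comm. coord.": 1905.608,
    "Force": 1489.965,
    "Wait GPU nonlocal": 17.722,
    "Wait GPU local": 5.608,
}
total_wall_s = 34693.203  # the "Total" row


def wall_share(name):
    """Return one row's percentage of the total wall time."""
    return 100.0 * wall_s[name] / total_wall_s


# Rank the rows from most to least expensive.
ranked = sorted(wall_s, key=wall_s.get, reverse=True)
for name in ranked:
    print(f"{name:20s} {wall_share(name):5.1f} %")

# Combined time the CPU ranks spent waiting for GPU results.
gpu_wait_pct = wall_share("Wait GPU nonlocal") + wall_share("Wait GPU local")
print(f"GPU wait total: {gpu_wait_pct:.2f} %")
```

A GPU wait below 0.1 % of the wall time would be consistent with the GPU finishing its work well before the CPU does, so the CPU side looks like the limiting factor in this run.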

               Core t (s)   Wall t (s)        (%)
       Time:   415256.517    34693.203     1196.9
                 (ns/day)    (hour/ns)
Performance:        4.981        4.818
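
The summary figures above can be cross-checked against each other. The sketch below uses only numbers from the log, plus one assumption: that the run consisted of 1,000,000 steps, inferred from the call count of 1000001 for "Force". All variable names are made up for this example.

```python
wall_s = 34693.203      # "Wall t (s)"
core_s = 415256.517     # "Core t (s)"
ns_per_day = 4.981      # "Performance: (ns/day)"
n_steps = 1_000_000     # assumption: "Force" call count 1000001 minus step 0

# hour/ns is the reciprocal of ns/day, scaled by 24 hours per day.
hours_per_ns = 24.0 / ns_per_day

# Core time over wall time gives the CPU utilisation across all ranks.
cpu_utilisation_pct = 100.0 * core_s / wall_s

# Simulated time covered by the run, and the time step that implies.
simulated_ns = ns_per_day * wall_s / 86400.0
timestep_fs = simulated_ns * 1e6 / n_steps

# Average wall-clock cost of a single MD step.
ms_per_step = 1000.0 * wall_s / n_steps

print(f"hour/ns          : {hours_per_ns:.3f}")
print(f"CPU utilisation  : {cpu_utilisation_pct:.1f} %")
print(f"simulated time   : {simulated_ns:.2f} ns")
print(f"implied time step: {timestep_fs:.1f} fs")
print(f"wall time / step : {ms_per_step:.1f} ms")
```

Under that step-count assumption the run covers about 2 ns, which implies a time step of roughly 2 fs and about 35 ms of wall time per step.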
