[gmx-users] load imbalance in multiple GPU simulations

yunshi11 . yunshi09 at gmail.com
Sun Dec 8 19:11:07 CET 2013


Hi all,

My conventional MD run (equilibration) of a protein in TIP3P water reports
"Average load imbalance: 59.4 %" when running with 3 GPUs + 12 CPU cores,
so I wonder how to tweak parameters to optimize the performance.
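
From mdrun -h, these look like the relevant knobs for load balancing; the
values below are just a sketch for a 3-GPU / 12-core node, not settings I
have verified:

```shell
# Hypothetical mdrun invocation for 3 GPUs + 12 CPU cores (GROMACS 4.6 flags);
# the values are placeholders, not measured-best settings.
# -ntmpi 3     : three thread-MPI ranks, one per GPU
# -ntomp 4     : four OpenMP threads per rank (3 x 4 = 12 cores)
# -gpu_id 012  : map the three ranks onto GPUs 0, 1 and 2
# -dlb yes     : force dynamic load balancing on (default is auto)
# -tunepme yes : let mdrun shift electrostatics work between PME grid and cut-off
mdrun -deffnm equil -ntmpi 3 -ntomp 4 -gpu_id 012 -dlb yes -tunepme yes
```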

The end of the log file reads:

......
        M E G A - F L O P S   A C C O U N T I N G

 NB=Group-cutoff nonbonded kernels    NxN=N-by-N cluster Verlet kernels
 RF=Reaction-Field  VdW=Van der Waals  QSTab=quadratic-spline table
 W3=SPC/TIP3p  W4=TIP4p (single or pairs)
 V&F=Potential and force  V=Potential only  F=Force only

 Computing:                               M-Number         M-Flops  % Flops
-----------------------------------------------------------------------------
 Pair Search distance check          78483.330336      706349.973    0.1
 NxN QSTab Elec. + VdW [F]        11321254.234368   464171423.609   95.1
 NxN QSTab Elec. + VdW [V&F]        114522.922048     6756852.401    1.4
 1,4 nonbonded interactions           1645.932918      148133.963    0.0
 Calc Weights                        25454.159073      916349.727    0.2
 Spread Q Bspline                   543022.060224     1086044.120    0.2
 Gather F Bspline                   543022.060224     3258132.361    0.7
 3D-FFT                            1138719.444112     9109755.553    1.9
 Solve PME                             353.129616       22600.295    0.0
 Reset In Box                          424.227500        1272.682    0.0
 CG-CoM                                424.397191        1273.192    0.0
 Bonds                                 330.706614       19511.690    0.0
 Angles                               1144.322886      192246.245    0.0
 Propers                              1718.934378      393635.973    0.1
 Impropers                             134.502690       27976.560    0.0
 Pos. Restr.                           321.706434       16085.322    0.0
 Virial                                424.734826        7645.227    0.0
 Stop-CM                                85.184882         851.849    0.0
 P-Coupling                           8484.719691       50908.318    0.0
 Calc-Ekin                             848.794382       22917.448    0.0
 Lincs                                 313.720420       18823.225    0.0
 Lincs-Mat                            1564.146576        6256.586    0.0
 Constraint-V                         8651.865815       69214.927    0.0
 Constraint-Vir                        417.065668       10009.576    0.0
 Settle                               2674.808325      863963.089    0.2
-----------------------------------------------------------------------------
 Total                                              487878233.910  100.0
-----------------------------------------------------------------------------


    D O M A I N   D E C O M P O S I T I O N   S T A T I S T I C S

 av. #atoms communicated per step for force:  2 x 63413.7
 av. #atoms communicated per step for LINCS:  2 x 3922.5

 Average load imbalance: 59.4 %
 Part of the total run time spent waiting due to load imbalance: 5.0 %
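
Reading those two numbers together: if I understand the manual, the imbalance
is measured on the per-domain force-computation time, while the 5.0 % is wait
time relative to the whole step (which also includes PME, communication,
etc.), so it can be much smaller than the imbalance itself. A toy calculation
(numbers invented, not taken from this run):

```python
# Hypothetical per-step force times (ms) for 3 domain-decomposition ranks.
force_t = [10.0, 6.0, 5.0]

avg = sum(force_t) / len(force_t)           # 7.0 ms
imbalance = (max(force_t) - avg) / avg      # (10 - 7) / 7

# Time lost is the fast ranks waiting, as a fraction of the *full* step time
# (here a made-up 50 ms step including PME, communication, etc.).
step_t = 50.0
lost = (max(force_t) - avg) / step_t        # 3 / 50

print(round(imbalance * 100, 1))  # 42.9
print(round(lost * 100, 1))       # 6.0
```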


     R E A L   C Y C L E   A N D   T I M E   A C C O U N T I N G

 Computing:         Nodes   Th.     Count  Wall t (s)     G-Cycles     %
-----------------------------------------------------------------------------
 Domain decomp.         3    4       2500      42.792     1300.947     4.4
 DD comm. load          3    4         31       0.000        0.014     0.0
 Neighbor search        3    4       2501      33.076     1005.542     3.4
 Launch GPU ops.        3    4     100002       6.537      198.739     0.7
 Comm. coord.           3    4      47500      20.349      618.652     2.1
 Force                  3    4      50001      75.093     2282.944     7.8
 Wait + Comm. F         3    4      50001      24.850      755.482     2.6
 PME mesh               3    4      50001     597.925    18177.760    62.0
 Wait GPU nonlocal      3    4      50001       9.862      299.813     1.0
 Wait GPU local         3    4      50001       0.262        7.968     0.0
 NB X/F buffer ops.     3    4     195002      33.578     1020.833     3.5
 Write traj.            3    4         12       0.506       15.385     0.1
 Update                 3    4      50001      23.243      706.611     2.4
 Constraints            3    4      50001      70.972     2157.657     7.4
 Comm. energies         3    4       2501       0.386       11.724     0.0
 Rest                   3                      24.466      743.803     2.5
-----------------------------------------------------------------------------
 Total                  3                     963.899    29303.873   100.0
-----------------------------------------------------------------------------
-----------------------------------------------------------------------------
 PME redist. X/F        3    4     100002     121.844     3704.214    12.6
 PME spread/gather      3    4     100002     300.759     9143.486    31.2
 PME 3D-FFT             3    4     100002     111.366     3385.682    11.6
 PME 3D-FFT Comm.       3    4     100002      55.347     1682.636     5.7
 PME solve              3    4      50001       8.199      249.246     0.9
-----------------------------------------------------------------------------

               Core t (s)   Wall t (s)        (%)
       Time:    11533.900      963.899     1196.6
                 (ns/day)    (hour/ns)
Performance:        8.964        2.677
Finished mdrun on node 0 Sun Dec  8 11:04:48 2013
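
For completeness, the reported performance is just the simulated time over
the wall time; it checks out for 50000 steps at a 2 fs time step (which is
what the numbers imply; the .mdp is not shown here):

```python
# Sanity check on the "Performance" line, assuming dt = 2 fs.
n_steps = 50000          # the log shows 50001 force evaluations (incl. step 0)
dt_fs = 2.0
wall_s = 963.899         # "Wall t (s)" from the log

ns_simulated = n_steps * dt_fs * 1e-6       # 0.1 ns
ns_per_day = ns_simulated / wall_s * 86400

print(round(ns_per_day, 3))   # 8.964, matching the log
```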



For reference, I set rlist = rvdw = rcoulomb = 1.0 (nm).

Is there any documentation that details what those entries, e.g. "NxN QSTab
Elec. + VdW [V&F]", mean?

Thanks,
Yun

