[gmx-users] load imbalance in multiple GPU simulations

Szilárd Páll pall.szilard at gmail.com
Sun Dec 8 23:48:46 CET 2013


Hi,

That's unfortunate, but not unexpected. You are getting a 3x1x1
decomposition in which the "middle" cell contains most of the protein,
and hence most of the bonded forces to calculate, while the cells on
either side have few (or none).

Currently, the only thing you can do is to try using more domains,
perhaps with a manual decomposition chosen so that as many of the
initial domains as possible contain protein. This may not help much,
though. In extreme cases (e.g. a small system), even using only two of
the three GPUs can improve performance.
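
As a rough illustration only (option names as in mdrun from the GROMACS
4.6.x series; the rank/thread/GPU counts and the -deffnm name below are
assumptions you would adapt to your own run, see mdrun -h for your
version), something along these lines could be tried:

  # Spread the work over more, smaller domains: 6 thread-MPI ranks with
  # 2 OpenMP threads each, two ranks sharing each of the three GPUs,
  # and a 6x1x1 decomposition grid requested by hand.
  mdrun -ntmpi 6 -ntomp 2 -gpu_id 001122 -dd 6 1 1 -deffnm equil

  # Or, for a small system, leave the third GPU idle altogether.
  mdrun -ntmpi 2 -ntomp 6 -gpu_id 01 -deffnm equil

A few short benchmark runs comparing the reported ns/day for different
-dd grids and GPU counts is the simplest way to see which setup actually
pays off.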

Cheers,
--
Szilárd


On Sun, Dec 8, 2013 at 8:10 PM, yunshi11 . <yunshi09 at gmail.com> wrote:
> Hi all,
>
> My conventional MD run (an equilibration of a protein in TIP3P water) reported
> "Average load imbalance: 59.4 %" when running with 3 GPUs + 12 CPU cores,
> so I am wondering how to tweak the parameters to improve performance.
>
> End of the log file reads:
>
> ......
>         M E G A - F L O P S   A C C O U N T I N G
>
>  NB=Group-cutoff nonbonded kernels    NxN=N-by-N cluster Verlet kernels
>  RF=Reaction-Field  VdW=Van der Waals  QSTab=quadratic-spline table
>  W3=SPC/TIP3p  W4=TIP4p (single or pairs)
>  V&F=Potential and force  V=Potential only  F=Force only
>
>  Computing:                               M-Number         M-Flops  % Flops
> -----------------------------------------------------------------------------
>  Pair Search distance check           78483.330336    706349.973     0.1
>  NxN QSTab Elec. + VdW [F]         11321254.234368   464171423.609    95.1
>  NxN QSTab Elec. + VdW [V&F]         114522.922048     6756852.401     1.4
>  1,4 nonbonded interactions            1645.932918    148133.963     0.0
>  Calc Weights                         25454.159073    916349.727     0.2
>  Spread Q Bspline                    543022.060224     1086044.120     0.2
>  Gather F Bspline                    543022.060224     3258132.361     0.7
>  3D-FFT                             1138719.444112     9109755.553     1.9
>  Solve PME                              353.129616     22600.295     0.0
>  Reset In Box                           424.227500        1272.682     0.0
>  CG-CoM                                 424.397191        1273.192     0.0
>  Bonds                                  330.706614     19511.690     0.0
>  Angles                                1144.322886    192246.245     0.0
>  Propers                               1718.934378    393635.973     0.1
>  Impropers                              134.502690     27976.560     0.0
>  Pos. Restr.                            321.706434       16085.322     0.0
>  Virial                                 424.734826        7645.227     0.0
>  Stop-CM                                 85.184882         851.849     0.0
>  P-Coupling                            8484.719691       50908.318     0.0
>  Calc-Ekin                              848.794382     22917.448     0.0
>  Lincs                                  313.720420     18823.225     0.0
>  Lincs-Mat                             1564.146576        6256.586     0.0
>  Constraint-V                          8651.865815     69214.927     0.0
>  Constraint-Vir                         417.065668     10009.576     0.0
>  Settle                                2674.808325      863963.089     0.2
> -----------------------------------------------------------------------------
>  Total                                               487878233.910   100.0
> -----------------------------------------------------------------------------
>
>
>     D O M A I N   D E C O M P O S I T I O N   S T A T I S T I C S
>
>  av. #atoms communicated per step for force:  2 x 63413.7
>  av. #atoms communicated per step for LINCS:  2 x 3922.5
>
>  Average load imbalance: 59.4 %
>  Part of the total run time spent waiting due to load imbalance: 5.0 %
>
>
>      R E A L   C Y C L E   A N D   T I M E   A C C O U N T I N G
>
>  Computing:         Nodes   Th.     Count  Wall t (s)     G-Cycles     %
> -----------------------------------------------------------------------------
>  Domain decomp.         3    4       2500      42.792     1300.947     4.4
>  DD comm. load          3    4         31       0.000        0.014     0.0
>  Neighbor search        3    4       2501      33.076     1005.542     3.4
>  Launch GPU ops.        3    4     100002       6.537      198.739     0.7
>  Comm. coord.           3    4      47500      20.349      618.652     2.1
>  Force                  3    4      50001      75.093     2282.944     7.8
>  Wait + Comm. F         3    4      50001      24.850      755.482     2.6
>  PME mesh               3    4      50001     597.925    18177.760    62.0
>  Wait GPU nonlocal      3    4      50001       9.862      299.813     1.0
>  Wait GPU local         3    4      50001       0.262        7.968     0.0
>  NB X/F buffer ops.     3    4     195002      33.578     1020.833     3.5
>  Write traj.            3    4         12       0.506       15.385     0.1
>  Update                 3    4      50001      23.243      706.611     2.4
>  Constraints            3    4      50001      70.972     2157.657     7.4
>  Comm. energies         3    4       2501       0.386       11.724     0.0
>  Rest                   3                      24.466      743.803     2.5
> -----------------------------------------------------------------------------
>  Total                  3                     963.899    29303.873   100.0
> -----------------------------------------------------------------------------
> -----------------------------------------------------------------------------
>  PME redist. X/F        3    4     100002     121.844     3704.214    12.6
>  PME spread/gather      3    4     100002     300.759     9143.486    31.2
>  PME 3D-FFT             3    4     100002     111.366     3385.682    11.6
>  PME 3D-FFT Comm.       3    4     100002      55.347     1682.636     5.7
>  PME solve              3    4      50001       8.199      249.246     0.9
> -----------------------------------------------------------------------------
>
>                Core t (s)   Wall t (s)        (%)
>        Time:    11533.900      963.899     1196.6
>                  (ns/day)    (hour/ns)
> Performance:        8.964        2.677
> Finished mdrun on node 0 Sun Dec  8 11:04:48 2013
>
>
>
> And I set rlist = rvdw = rcoulomb = 1.0.
>
> Is there any documentation that details what those values, e.g. VdW [V&F],
> mean?
>
> Thanks,
> Yun

