[gmx-developers] Which part of the runtime cost do "Wait GPU NB nonloc" and "Wait GPU NB local" actually count?

Berk Hess hess at kth.se
Tue Apr 14 10:52:54 CEST 2020


Hi,

Those timers report the time the CPU is waiting for results to arrive
from the local and non-local non-bonded calculations on the GPU. When
the CPU has few or no forces to compute, this wait time can be a large
part of the total run time. With domain decomposition, the non-local
results are waited for first, since those forces have to be sent back
to the neighboring domains; by the time the local results are needed
they are usually already available, which is why the non-local wait
absorbs most of the GPU time and the local wait is short.
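
For illustration, the counter essentially measures how long the CPU
blocks in a stream synchronization before it can use the GPU non-bonded
output. Below is a minimal stand-alone sketch of that idea (not the
actual GROMACS code; the kernel and stream names are made up):

#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder for the real non-bonded force kernel.
__global__ void nonbondedKernelStub(float* f, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
    {
        f[i] = 0.0f;
    }
}

int main()
{
    const int n = 1 << 20;
    float*    forces;
    cudaMalloc(&forces, n * sizeof(float));

    cudaStream_t nbStream;
    cudaStreamCreate(&nbStream);

    // Launch the GPU work asynchronously; the CPU continues immediately.
    nonbondedKernelStub<<<(n + 255) / 256, 256, 0, nbStream>>>(forces, n);

    // ... CPU-side work (bonded forces, communication, ...) goes here ...

    // Only the time spent blocked in this synchronization is counted,
    // i.e. the GPU time that was not hidden behind the CPU work above.
    auto t0 = std::chrono::steady_clock::now();
    cudaStreamSynchronize(nbStream);
    auto t1 = std::chrono::steady_clock::now();

    std::printf("Wait GPU NB: %.6f s\n",
                std::chrono::duration<double>(t1 - t0).count());

    cudaStreamDestroy(nbStream);
    cudaFree(forces);
    return 0;
}

The more CPU work overlaps with the kernel, the closer to zero that
synchronization time gets, which is what you see for the local wait in
your tables.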

Cheers,

Berk

On 2020-04-14 10:37, 张驭洲 wrote:
>
> Hello GROMACS developers,
>
>
> I'm using GROMACS 2020.1 on a node with 2 Intel(R) Xeon(R) Gold 6142 
> CPUs and 4 NVIDIA Tesla V100-PCIE-32GB GPUs.
>
> With the command line as follows:
>
>     gmx mdrun -s p16.tpr -o p16.trr -c p16_out.gro -e p16.edr -g p16.log -pin on -ntmpi 4 -ntomp 6 -nb gpu -bonded gpu -pme gpu -npme 1
>
> I got the following performance results:
>
>      R E A L   C Y C L E   A N D   T I M E   A C C O U N T I N G
>
> On 3 MPI ranks doing PP, each using 6 OpenMP threads, and
> on 1 MPI rank doing PME, using 6 OpenMP threads
>
>  Computing:          Num   Num      Call    Wall time         Giga-Cycles
>                      Ranks Threads  Count      (s)         total sum    %
> -----------------------------------------------------------------------------
>  Domain decomp.         3    6       2001      15.290        715.584   6.4
>  DD comm. load          3    6        245       0.008          0.377   0.0
>  DD comm. bounds        3    6         48       0.003          0.151   0.0
>  Send X to PME          3    6     200001       9.756        456.559   4.1
>  Neighbor search        3    6       2001      12.184        570.190   5.1
>  Launch GPU ops.        3    6     400002      17.929        839.075   7.5
>  Force                  3    6     200001       3.912        183.082   1.6
>  Wait + Comm. F         3    6      40001       4.229        197.913   1.8
>  PME mesh *             1    6     200001      16.733        261.027   2.3
>  PME wait for PP *                            162.467       2534.449  22.7
>  Wait + Recv. PME F     3    6     200001      18.827        881.091   7.9
>  Wait PME GPU gather    3    6     200001       2.896        135.522   1.2
>  Wait Bonded GPU        3    6       2001       0.003          0.122   0.0
>  Wait GPU NB nonloc.    3    6     200001      15.328        717.330   6.4
>  Wait GPU NB local      3    6     200001       0.175          8.169   0.1
>  Wait GPU state copy    3    6     160000      26.204       1226.327  11.0
>  NB X/F buffer ops.     3    6     798003       7.023        328.655   2.9
>  Write traj.            3    6         21       0.182          8.540   0.1
>  Update                 3    6     200001       6.685        312.856   2.8
>  Comm. energies         3    6      40001       6.684        312.796   2.8
>  Rest                                          31.899       1492.851  13.3
> -----------------------------------------------------------------------------
>  Total                                        179.216      11182.921 100.0
> -----------------------------------------------------------------------------
> (*) Note that with separate PME ranks, the walltime column actually sums to
>     twice the total reported, but the cycle count total and % are correct.
> -----------------------------------------------------------------------------
>
>
>                Core t (s)   Wall t (s)        (%)
>        Time:     4301.031      179.216     2399.9
>                  (ns/day)    (hour/ns)
> Performance:       96.421        0.249
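>
> (Core t / Wall t = 4301.031 s / 179.216 s ≈ 24.0, i.e. 4 ranks × 6
> OpenMP threads, which is the 2399.9 % shown.)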
>
>
> Using two nodes and the following command:
>
>   gmx_mpi mdrun -s p16.tpr -o p16.trr -c p16_out.gro -e p16.edr -g p16.log -ntomp 6 -nb gpu -bonded gpu -pme gpu -npme 1
>
> I got these results:
>
>
>      R E A L   C Y C L E   A N D   T I M E   A C C O U N T I N G
>
> On 6 MPI ranks doing PP, each using 6 OpenMP threads, and
> on 1 MPI rank doing PME, using 6 OpenMP threads
>
>  Computing:          Num   Num      Call    Wall time         Giga-Cycles
>                      Ranks Threads  Count      (s)         total sum    %
> -----------------------------------------------------------------------------
>  Domain decomp.         6    6       2001       8.477        793.447   3.7
>  DD comm. load          6    6        256       0.005          0.449   0.0
>  DD comm. bounds        6    6         60       0.002          0.216   0.0
>  Send X to PME          6    6     200001      32.588       3050.168  14.1
>  Neighbor search        6    6       2001       6.639        621.393   2.9
>  Launch GPU ops.        6    6     400002      14.686       1374.563   6.4
>  Comm. coord.           6    6     198000      36.691       3434.263  15.9
>  Force                  6    6     200001       2.913        272.694   1.3
>  Wait + Comm. F         6    6     200001      32.024       2997.400  13.9
>  PME mesh *             1    6     200001      77.479       1208.657   5.6
>  PME wait for PP *                            119.009       1856.517   8.6
>  Wait + Recv. PME F     6    6     200001      14.328       1341.122   6.2
>  Wait PME GPU gather    6    6     200001      11.115       1040.397   4.8
>  Wait Bonded GPU        6    6       2001       0.003          0.279   0.0
>  Wait GPU NB nonloc.    6    6     200001      27.604       2583.729  11.9
>  Wait GPU NB local      6    6     200001       0.548         51.333   0.2
>  NB X/F buffer ops.     6    6     796002      11.095       1038.515   4.8
>  Write traj.            6    6         21       0.105          9.851   0.0
>  Update                 6    6     200001       3.498        327.440   1.5
>  Comm. energies         6    6      40001       2.947        275.863   1.3
> -----------------------------------------------------------------------------
>  Total                                        198.094      21631.660 100.0
> -----------------------------------------------------------------------------
> (*) Note that with separate PME ranks, the walltime column actually sums to
>     twice the total reported, but the cycle count total and % are correct.
> -----------------------------------------------------------------------------
>
>
>                Core t (s)   Wall t (s)        (%)
>        Time:     8319.867      198.094     4200.0
>                  (ns/day)    (hour/ns)
> Performance:       87.232        0.275
>
>
> I'm curious about the "Wait GPU NB nonloc." and "Wait GPU NB local"
> entries. As you can see, in both cases the wall time of Wait GPU NB
> local is very short while that of nonloc. is quite long, and the wall
> time of Force is much shorter than that of Wait GPU NB nonloc. Could
> you please explain these timing terms? I would also very much
> appreciate any suggestions for reducing the time spent in this waiting!
>
>
> Sincerely,
>
> Zhang
>
