[gmx-developers] Which part of runtime cost does "Wait GPU NB nonloc" and "Wait GPU NB local" actually count?

Berk Hess hess at kth.se
Tue Apr 14 13:37:33 CEST 2020


Hi,

GPU results are only collected locally. "Non-local" refers to interactions 
between atoms some or all of which have their home on other MPI ranks. 
We compute these non-local interactions with higher priority on the GPU 
and wait on those forces first so that we can communicate them to the 
home ranks of those atoms. Thus the wait time on the non-local forces 
can be long when the CPU has relatively little work. The local forces 
are often finished quickly after the non-local forces have been 
transferred, so that wait time is often short.
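
As a rough illustration of that ordering, here is a minimal CUDA sketch, not 
the actual GROMACS code; the kernel launches and the MPI force exchange are 
only placeholder comments:

    #include <cuda_runtime.h>

    // Sketch: non-local non-bonded work goes into a higher-priority stream,
    // and the CPU waits on its completion first so the resulting forces can
    // be sent back to the home ranks before the local results are needed.
    void scheduleNonbondedSketch()
    {
        int leastPriority, greatestPriority;
        cudaDeviceGetStreamPriorityRange(&leastPriority, &greatestPriority);

        cudaStream_t localStream, nonLocalStream;
        cudaStreamCreateWithPriority(&localStream, cudaStreamNonBlocking, leastPriority);
        cudaStreamCreateWithPriority(&nonLocalStream, cudaStreamNonBlocking, greatestPriority);

        // launchNonbondedKernel(nonLocalStream, ...);  // non-local pair lists (placeholder)
        // launchNonbondedKernel(localStream, ...);     // local pair lists (placeholder)

        cudaEvent_t nonLocalDone, localDone;
        cudaEventCreateWithFlags(&nonLocalDone, cudaEventDisableTiming);
        cudaEventCreateWithFlags(&localDone, cudaEventDisableTiming);
        cudaEventRecord(nonLocalDone, nonLocalStream);
        cudaEventRecord(localDone, localStream);

        cudaEventSynchronize(nonLocalDone); // roughly what "Wait GPU NB nonloc." counts
        // exchangeNonLocalForcesWithHomeRanks(...);    // MPI halo exchange (placeholder)
        cudaEventSynchronize(localDone);    // roughly what "Wait GPU NB local" counts

        cudaEventDestroy(nonLocalDone);
        cudaEventDestroy(localDone);
        cudaStreamDestroy(nonLocalStream);
        cudaStreamDestroy(localStream);
    }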

Cheers,

Berk

On 2020-04-14 11:33 , 张驭洲 wrote:
>
>
> Hello Berk,
>
>
> Thanks for your reply! I want to ask one more question. As the wall 
> time of Wait GPU NB nonloc. is relatively long while that of Force and 
> Wait GPU NB local is very short, does it mean that the communication 
> between a CPU and its non-local GPUs slows down the run? Or, in other 
> words, the force kernel is fast, and it is the hardware connecting the 
> CPUs and GPUs, or their topology, that restricts the performance?
>
>
> Sincerely,
>
> Zhang
>
>     -----Original Message-----
>     *From:* "Berk Hess" <hess at kth.se>
>     *Sent:* 2020-04-14 16:52:51 (Tuesday)
>     *To:* gmx-developers at gromacs.org
>     *Cc:*
>     *Subject:* Re: [gmx-developers] Which part of runtime cost does "Wait
>     GPU NB nonloc" and "Wait GPU NB local" actually count?
>
>     Hi,
>
>     Those timers report the time the CPU is waiting for results to
>     arrive from the local and non-local non-bonded calculations on the
>     GPU. When the CPU has few or no forces to compute, this wait time
>     can be a large part of the total run time.
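>
>     As a minimal sketch (illustrative C++/CUDA, not the actual GROMACS
>     wallcycle code, and the helper name is hypothetical), such a wait
>     timer essentially brackets a blocking synchronization on the GPU
>     stream that runs the non-bonded kernels:
>
>         #include <chrono>
>         #include <cuda_runtime.h>
>
>         // Wall time the CPU spends blocked until the non-bonded kernels
>         // queued in 'stream' have finished (hypothetical helper).
>         double waitForGpuNonbonded(cudaStream_t stream)
>         {
>             auto t0 = std::chrono::steady_clock::now();
>             cudaStreamSynchronize(stream); // CPU blocks here until the GPU is done
>             auto t1 = std::chrono::steady_clock::now();
>             return std::chrono::duration<double>(t1 - t0).count();
>         }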
>
>     Cheers,
>
>     Berk
>
>     On 2020-04-14 10:37 , 张驭洲 wrote:
>>
>>     Hello GROMACS developers,
>>
>>
>>     I'm using GROMACS 2020.1 on a node with 2 Intel(R) Xeon(R) Gold
>>     6142 CPUs and 4 NVIDIA Tesla V100-PCIE-32GB GPUs.
>>
>>     With the command line as follows:
>>
>>         gmx mdrun -s p16.tpr -o p16.trr -c p16_out.gro -e p16.edr -g
>>     p16.log -pin on -ntmpi 4 -ntomp 6 -nb gpu -bonded gpu -pme gpu
>>     -npme 1
>>
>>     I got the following performance results:
>>
>>
>>          R E A L   C Y C L E   A N D   T I M E   A C C O U N T I N G
>>
>>     On 3 MPI ranks doing PP, each using 6 OpenMP threads, and
>>     on 1 MPI rank doing PME, using 6 OpenMP threads
>>
>>      Computing:          Num   Num      Call    Wall time         Giga-Cycles
>>                          Ranks Threads  Count      (s)         total sum    %
>>     -----------------------------------------------------------------------------
>>      Domain decomp.         3    6       2001      15.290        715.584   6.4
>>      DD comm. load          3    6        245       0.008          0.377   0.0
>>      DD comm. bounds        3    6         48       0.003          0.151   0.0
>>      Send X to PME          3    6     200001       9.756        456.559   4.1
>>      Neighbor search        3    6       2001      12.184        570.190   5.1
>>      Launch GPU ops.        3    6     400002      17.929        839.075   7.5
>>      Force                  3    6     200001       3.912        183.082   1.6
>>      Wait + Comm. F         3    6      40001       4.229        197.913   1.8
>>      PME mesh *             1    6     200001      16.733        261.027   2.3
>>      PME wait for PP *                            162.467       2534.449  22.7
>>      Wait + Recv. PME F     3    6     200001      18.827        881.091   7.9
>>      Wait PME GPU gather    3    6     200001       2.896        135.522   1.2
>>      Wait Bonded GPU        3    6       2001       0.003          0.122   0.0
>>      Wait GPU NB nonloc.    3    6     200001      15.328        717.330   6.4
>>      Wait GPU NB local      3    6     200001       0.175          8.169   0.1
>>      Wait GPU state copy    3    6     160000      26.204       1226.327  11.0
>>      NB X/F buffer ops.     3    6     798003       7.023        328.655   2.9
>>      Write traj.            3    6         21       0.182          8.540   0.1
>>      Update                 3    6     200001       6.685        312.856   2.8
>>      Comm. energies         3    6      40001       6.684        312.796   2.8
>>      Rest                                          31.899       1492.851  13.3
>>     -----------------------------------------------------------------------------
>>      Total                                        179.216      11182.921 100.0
>>     -----------------------------------------------------------------------------
>>     (*) Note that with separate PME ranks, the walltime column actually sums to
>>         twice the total reported, but the cycle count total and % are correct.
>>     -----------------------------------------------------------------------------
>>
>>
>>                    Core t (s)   Wall t (s)        (%)
>>            Time:     4301.031      179.216     2399.9
>>                      (ns/day)    (hour/ns)
>>     Performance:       96.421        0.249
>>
>>
>>     Using two nodes and the following command:
>>
>>       gmx_mpi mdrun -s p16.tpr -o p16.trr -c p16_out.gro -e p16.edr
>>     -g p16.log -ntomp 6 -nb gpu -bonded gpu -pme gpu -npme 1
>>
>>     I got these results:
>>
>>
>>
>>          R E A L   C Y C L E   A N D   T I M E   A C C O U N T I N G
>>
>>     On 6 MPI ranks doing PP, each using 6 OpenMP threads, and
>>     on 1 MPI rank doing PME, using 6 OpenMP threads
>>
>>      Computing:          Num   Num      Call    Wall time         Giga-Cycles
>>                          Ranks Threads  Count      (s)         total sum    %
>>     -----------------------------------------------------------------------------
>>      Domain decomp.         6    6       2001       8.477        793.447   3.7
>>      DD comm. load          6    6        256       0.005          0.449   0.0
>>      DD comm. bounds        6    6         60       0.002          0.216   0.0
>>      Send X to PME          6    6     200001      32.588       3050.168  14.1
>>      Neighbor search        6    6       2001       6.639        621.393   2.9
>>      Launch GPU ops.        6    6     400002      14.686       1374.563   6.4
>>      Comm. coord.           6    6     198000      36.691       3434.263  15.9
>>      Force                  6    6     200001       2.913        272.694   1.3
>>      Wait + Comm. F         6    6     200001      32.024       2997.400  13.9
>>      PME mesh *             1    6     200001      77.479       1208.657   5.6
>>      PME wait for PP *                            119.009       1856.517   8.6
>>      Wait + Recv. PME F     6    6     200001      14.328       1341.122   6.2
>>      Wait PME GPU gather    6    6     200001      11.115       1040.397   4.8
>>      Wait Bonded GPU        6    6       2001       0.003          0.279   0.0
>>      Wait GPU NB nonloc.    6    6     200001      27.604       2583.729  11.9
>>      Wait GPU NB local      6    6     200001       0.548         51.333   0.2
>>      NB X/F buffer ops.     6    6     796002      11.095       1038.515   4.8
>>      Write traj.            6    6         21       0.105          9.851   0.0
>>      Update                 6    6     200001       3.498        327.440   1.5
>>      Comm. energies         6    6      40001       2.947        275.863   1.3
>>     -----------------------------------------------------------------------------
>>      Total                                        198.094      21631.660 100.0
>>     -----------------------------------------------------------------------------
>>     (*) Note that with separate PME ranks, the walltime column actually sums to
>>         twice the total reported, but the cycle count total and % are correct.
>>     -----------------------------------------------------------------------------
>>
>>
>>                    Core t (s)   Wall t (s)        (%)
>>            Time:     8319.867      198.094     4200.0
>>                      (ns/day)    (hour/ns)
>>     Performance:       87.232        0.275
>>
>>
>>     I'm curious about the "Wait GPU NB nonloc." and "Wait GPU NB
>>     local" entries. As you can see, in both cases the wall time of
>>     Wait GPU NB local is very short but that of Wait GPU NB nonloc. is
>>     pretty long, and the wall time of Force is much shorter than that
>>     of Wait GPU NB nonloc. Could you please explain these timing
>>     terms? I would also appreciate any suggestions for reducing the
>>     time spent in that waiting!
>>
>>
>>     Sincerely,
>>
>>     Zhang
>>
>>
>>
>>
>>
>
>
>
