[gmx-developers] Which part of runtime cost does "Wait GPU NB nonloc" and "Wait GPU NB local" actually count?

张驭洲 zhangyuzhou15 at mails.ucas.edu.cn
Tue Apr 14 11:34:06 CEST 2020


Hello Berk,




Thanks for your reply! I have one more question. Since the wall time of Wait GPU NB nonloc. is relatively long while those of Force and Wait GPU NB local are very short, does that mean the communication between a CPU and the GPUs holding its non-local data is slowing down the run? In other words, the force kernel itself is fast, and it is the hardware connecting the CPUs and GPUs, or their topology, that limits performance?
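
For reference, one way to check how the CPUs and GPUs are wired together, and (on a single node) to try the opt-in GPU direct-communication path that GROMACS 2020 offers for thread-MPI builds, is sketched below. The environment variables are the experimental switches from the 2020 release; the exact combination is only an illustration, not a tuned recipe:

    # Show the PCIe/NVLink topology: whether two GPUs share a PCIe switch or
    # must cross the inter-socket link largely decides the transfer cost.
    nvidia-smi topo -m

    # GROMACS 2020, single node with thread-MPI: opt into GPU direct
    # communication and GPU-resident update (experimental in this release).
    export GMX_GPU_DD_COMMS=true
    export GMX_GPU_PME_PP_COMMS=true
    gmx mdrun -s p16.tpr -pin on -ntmpi 4 -ntomp 6 \
              -nb gpu -bonded gpu -pme gpu -npme 1 -update gpu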




Sincerely,

Zhang
 

-----Original Message-----
From: "Berk Hess" <hess at kth.se>
Sent: 2020-04-14 16:52:51 (Tuesday)
To: gmx-developers at gromacs.org
Cc:
Subject: Re: [gmx-developers] Which part of runtime cost does "Wait GPU NB nonloc" and "Wait GPU NB local" actually count?


Hi,

Those timers report the time the CPU is waiting for results to arrive from the local and non-local non-bonded calculations on the GPU. When the CPU has few or no forces to compute, this wait time can be a large part of the total run time.
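
As an aside, one way to see this wait directly is to trace a short run with NVIDIA Nsight Systems: the CPU then shows up as blocked in the CUDA stream synchronization call while the non-local kernel and its device-to-host result copy are still in flight. A minimal sketch, assuming nsys is installed and reusing the same input:

    # Illustration only: profile a shortened run and inspect the timeline
    # for the gap where the PP rank waits on the non-local stream.
    nsys profile -o nb_wait_trace --trace=cuda,osrt \
        gmx mdrun -s p16.tpr -ntmpi 4 -ntomp 6 \
                  -nb gpu -bonded gpu -pme gpu -npme 1 -nsteps 2000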

Cheers,

Berk

On 2020-04-14 10:37, 张驭洲 wrote:


Hello GROMACS developers,




I'm using GROMACS 2020.1 on a node with 2 Intel(R) Xeon(R) Gold 6142 CPUs and 4 NVIDIA Tesla V100-PCIE-32GB GPUs.

With the following command line:

    gmx mdrun -s p16.tpr -o p16.trr -c p16_out.gro -e p16.edr -g p16.log -pin on -ntmpi 4 -ntomp 6 -nb gpu -bonded gpu -pme gpu -npme 1

I got the following performance results:




     R E A L   C Y C L E   A N D   T I M E   A C C O U N T I N G

On 3 MPI ranks doing PP, each using 6 OpenMP threads, and
on 1 MPI rank doing PME, using 6 OpenMP threads

 Computing:          Num   Num      Call    Wall time         Giga-Cycles
                     Ranks Threads  Count      (s)         total sum    %
-----------------------------------------------------------------------------
 Domain decomp.         3    6       2001      15.290        715.584   6.4
 DD comm. load          3    6        245       0.008          0.377   0.0
 DD comm. bounds        3    6         48       0.003          0.151   0.0
 Send X to PME          3    6     200001       9.756        456.559   4.1
 Neighbor search        3    6       2001      12.184        570.190   5.1
 Launch GPU ops.        3    6     400002      17.929        839.075   7.5
 Force                  3    6     200001       3.912        183.082   1.6
 Wait + Comm. F         3    6      40001       4.229        197.913   1.8
 PME mesh *             1    6     200001      16.733        261.027   2.3
 PME wait for PP *                            162.467       2534.449  22.7
 Wait + Recv. PME F     3    6     200001      18.827        881.091   7.9
 Wait PME GPU gather    3    6     200001       2.896        135.522   1.2
 Wait Bonded GPU        3    6       2001       0.003          0.122   0.0
 Wait GPU NB nonloc.    3    6     200001      15.328        717.330   6.4
 Wait GPU NB local      3    6     200001       0.175          8.169   0.1
 Wait GPU state copy    3    6     160000      26.204       1226.327  11.0
 NB X/F buffer ops.     3    6     798003       7.023        328.655   2.9
 Write traj.            3    6         21       0.182          8.540   0.1
 Update                 3    6     200001       6.685        312.856   2.8
 Comm. energies         3    6      40001       6.684        312.796   2.8
 Rest                                          31.899       1492.851  13.3
-----------------------------------------------------------------------------
 Total                                        179.216      11182.921 100.0
-----------------------------------------------------------------------------
(*) Note that with separate PME ranks, the walltime column actually sums to
    twice the total reported, but the cycle count total and % are correct.
-----------------------------------------------------------------------------

               Core t (s)   Wall t (s)        (%)
       Time:     4301.031      179.216     2399.9
                 (ns/day)    (hour/ns)
Performance:       96.421        0.249




Using two nodes and the following command:

  gmx_mpi mdrun -s p16.tpr -o p16.trr -c p16_out.gro -e p16.edr -g p16.log -ntomp 6 -nb gpu -bonded gpu -pme gpu -npme 1

I got these results:







     R E A L   C Y C L E   A N D   T I M E   A C C O U N T I N G

On 6 MPI ranks doing PP, each using 6 OpenMP threads, and
on 1 MPI rank doing PME, using 6 OpenMP threads

 Computing:          Num   Num      Call    Wall time         Giga-Cycles
                     Ranks Threads  Count      (s)         total sum    %
-----------------------------------------------------------------------------
 Domain decomp.         6    6       2001       8.477        793.447   3.7
 DD comm. load          6    6        256       0.005          0.449   0.0
 DD comm. bounds        6    6         60       0.002          0.216   0.0
 Send X to PME          6    6     200001      32.588       3050.168  14.1
 Neighbor search        6    6       2001       6.639        621.393   2.9
 Launch GPU ops.        6    6     400002      14.686       1374.563   6.4
 Comm. coord.           6    6     198000      36.691       3434.263  15.9
 Force                  6    6     200001       2.913        272.694   1.3
 Wait + Comm. F         6    6     200001      32.024       2997.400  13.9
 PME mesh *             1    6     200001      77.479       1208.657   5.6
 PME wait for PP *                            119.009       1856.517   8.6
 Wait + Recv. PME F     6    6     200001      14.328       1341.122   6.2
 Wait PME GPU gather    6    6     200001      11.115       1040.397   4.8
 Wait Bonded GPU        6    6       2001       0.003          0.279   0.0
 Wait GPU NB nonloc.    6    6     200001      27.604       2583.729  11.9
 Wait GPU NB local      6    6     200001       0.548         51.333   0.2
 NB X/F buffer ops.     6    6     796002      11.095       1038.515   4.8
 Write traj.            6    6         21       0.105          9.851   0.0
 Update                 6    6     200001       3.498        327.440   1.5
 Comm. energies         6    6      40001       2.947        275.863   1.3
-----------------------------------------------------------------------------
 Total                                        198.094      21631.660 100.0
-----------------------------------------------------------------------------
(*) Note that with separate PME ranks, the walltime column actually sums to
    twice the total reported, but the cycle count total and % are correct.
-----------------------------------------------------------------------------

               Core t (s)   Wall t (s)        (%)
       Time:     8319.867      198.094     4200.0
                 (ns/day)    (hour/ns)
Performance:       87.232        0.275




I'm curious about the "Wait GPU NB nonloc." and "Wait GPU NB local" entries. As you can see, in both cases the wall time of Wait GPU NB local is very short while that of nonloc. is quite long, and the wall time of Force is much shorter than that of Wait GPU NB nonloc. Could you please explain these timing terms? I would also very much appreciate any suggestions for reducing the time spent in that wait!




Sincerely,

Zhang









