[gmx-developers] Which part of runtime cost does "Wait GPU NB nonloc" and "Wait GPU NB local" actually count?
张驭洲
zhangyuzhou15 at mails.ucas.edu.cn
Tue Apr 14 13:54:56 CEST 2020
Hi Berk,
Thank you very much for the detailed explanation!
Sincerely,
Zhang
-----Original Message-----
From: "Berk Hess" <hess at kth.se>
Sent: 2020-04-14 19:37:31 (Tuesday)
To: gmx-developers at gromacs.org
Cc:
Subject: Re: [gmx-developers] Which part of runtime cost does "Wait GPU NB nonloc" and "Wait GPU NB local" actually count?
Hi,
GPUs are only used locally; "non-local" refers to interactions between atoms of which some or all have their home on other MPI ranks. We compute these non-local interactions with higher priority on the GPU and wait on those forces first, so that we can communicate them to the home ranks of those atoms. Thus the wait time on the non-local forces can be long when the CPU has relatively little work. The local forces are often finished quickly after the non-local forces have been transferred, so that wait time is often short.
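To illustrate the pattern, here is a simplified CUDA sketch of that scheduling (placeholder kernel and function names, not the actual GROMACS source): the non-local kernel goes onto a high-priority stream, the CPU synchronizes on that stream first so the halo forces can be sent off, and only afterwards waits on the local stream, which has usually finished by then.

// Simplified sketch of the scheduling described above; the kernels and the
// helper function are hypothetical placeholders, not GROMACS code.
#include <cuda_runtime.h>

__global__ void nbKernelNonLocal(float *f) { /* non-local pair interactions */ }
__global__ void nbKernelLocal(float *f)    { /* local pair interactions */ }

void launchNonbondedAndWait(float *d_forces)
{
    int leastPrio, greatestPrio;
    cudaDeviceGetStreamPriorityRange(&leastPrio, &greatestPrio);

    cudaStream_t nonLocalStream, localStream;
    // The non-local work gets the higher stream priority so it is scheduled first.
    cudaStreamCreateWithPriority(&nonLocalStream, cudaStreamNonBlocking, greatestPrio);
    cudaStreamCreateWithPriority(&localStream,    cudaStreamNonBlocking, leastPrio);

    nbKernelNonLocal<<<128, 128, 0, nonLocalStream>>>(d_forces);
    nbKernelLocal<<<128, 128, 0, localStream>>>(d_forces);

    // "Wait GPU NB nonloc.": the CPU blocks here until the non-local forces are
    // ready, so they can be communicated to the home ranks (MPI calls not shown).
    cudaStreamSynchronize(nonLocalStream);

    // "Wait GPU NB local": usually short, because the local kernel has been
    // running concurrently while the non-local results were being handled.
    cudaStreamSynchronize(localStream);

    cudaStreamDestroy(nonLocalStream);
    cudaStreamDestroy(localStream);
}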
Cheers,
Berk
On 2020-04-14 11:33, 张驭洲 wrote:
Hello Berk,
Thanks for your reply! I want to ask one more question. Since the wall time of Wait GPU NB nonloc. is relatively long while that of Force and Wait GPU NB local is very short, does it mean that the communication between a CPU and its non-local GPUs slows down the run? In other words, is the force kernel itself fast, and is it the hardware connecting the CPUs and GPUs, or their topology, that restricts the performance?
Sincerely,
Zhang
-----Original Message-----
From: "Berk Hess" <hess at kth.se>
Sent: 2020-04-14 16:52:51 (Tuesday)
To: gmx-developers at gromacs.org
Cc:
Subject: Re: [gmx-developers] Which part of runtime cost does "Wait GPU NB nonloc" and "Wait GPU NB local" actually count?
Hi,
Those timers report the time the CPU is waiting for results to arrive from the local and non-local non-bonded calculations on the GPU. When the CPU has few or no forces to compute, this wait time can be a large part of the total run time.
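As a rough illustration of what such a wait counter measures (a hypothetical helper written against the CUDA runtime, not the actual GROMACS timing code), the accumulated wall time is essentially the time the CPU spends blocked in the stream synchronization call after the GPU work has been launched:

// Minimal sketch; waitForStreamSeconds is a made-up helper, not a GROMACS function.
#include <chrono>
#include <cuda_runtime.h>

double waitForStreamSeconds(cudaStream_t stream)
{
    auto t0 = std::chrono::steady_clock::now();
    // The CPU blocks here until all work queued on this stream has finished.
    cudaStreamSynchronize(stream);
    auto t1 = std::chrono::steady_clock::now();
    // This duration is what ends up in counters like "Wait GPU NB local".
    return std::chrono::duration<double>(t1 - t0).count();
}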
Cheers,
Berk
On 2020-04-14 10:37, 张驭洲 wrote:
Hello GROMACS developers,
I'm using GROMACS 2020.1 on a node with 2 Intel(R) Xeon(R) Gold 6142 CPUs and 4 NVIDIA Tesla V100-PCIE-32GB GPUs.
With the following command line:
gmx mdrun -s p16.tpr -o p16.trr -c p16_out.gro -e p16.edr -g p16.log -pin on -ntmpi 4 -ntomp 6 -nb gpu -bonded gpu -pme gpu -npme 1
I got the following performance results:
R E A L   C Y C L E   A N D   T I M E   A C C O U N T I N G
On 3 MPI ranks doing PP, each using 6 OpenMP threads, and
on 1 MPI rank doing PME, using 6 OpenMP threads
 Computing:            Num   Num      Call    Wall time     Giga-Cycles
                       Ranks Threads  Count      (s)       total sum    %
-----------------------------------------------------------------------------
 Domain decomp.           3    6       2001      15.290        715.584   6.4
 DD comm. load            3    6        245       0.008          0.377   0.0
 DD comm. bounds          3    6         48       0.003          0.151   0.0
 Send X to PME            3    6     200001       9.756        456.559   4.1
 Neighbor search          3    6       2001      12.184        570.190   5.1
 Launch GPU ops.          3    6     400002      17.929        839.075   7.5
 Force                    3    6     200001       3.912        183.082   1.6
 Wait + Comm. F           3    6      40001       4.229        197.913   1.8
 PME mesh *               1    6     200001      16.733        261.027   2.3
 PME wait for PP *                               162.467       2534.449  22.7
 Wait + Recv. PME F       3    6     200001      18.827        881.091   7.9
 Wait PME GPU gather      3    6     200001       2.896        135.522   1.2
 Wait Bonded GPU          3    6       2001       0.003          0.122   0.0
 Wait GPU NB nonloc.      3    6     200001      15.328        717.330   6.4
 Wait GPU NB local        3    6     200001       0.175          8.169   0.1
 Wait GPU state copy      3    6     160000      26.204       1226.327  11.0
 NB X/F buffer ops.       3    6     798003       7.023        328.655   2.9
 Write traj.              3    6         21       0.182          8.540   0.1
 Update                   3    6     200001       6.685        312.856   2.8
 Comm. energies           3    6      40001       6.684        312.796   2.8
 Rest                                             31.899       1492.851  13.3
-----------------------------------------------------------------------------
 Total                                           179.216      11182.921 100.0
-----------------------------------------------------------------------------
(*) Note that with separate PME ranks, the walltime column actually sums to
twice the total reported, but the cycle count total and % are correct.
-----------------------------------------------------------------------------
               Core t (s)   Wall t (s)        (%)
       Time:     4301.031      179.216     2399.9
                 (ns/day)    (hour/ns)
Performance:       96.421        0.249
Using two nodes and the following command:
gmx_mpi mdrun -s p16.tpr -o p16.trr -c p16_out.gro -e p16.edr -g p16.log -ntomp 6 -nb gpu -bonded gpu -pme gpu -npme 1
I got these results:
R E A L   C Y C L E   A N D   T I M E   A C C O U N T I N G
On 6 MPI ranks doing PP, each using 6 OpenMP threads, and
on 1 MPI rank doing PME, using 6 OpenMP threads
 Computing:            Num   Num      Call    Wall time     Giga-Cycles
                       Ranks Threads  Count      (s)       total sum    %
-----------------------------------------------------------------------------
 Domain decomp.           6    6       2001       8.477        793.447   3.7
 DD comm. load            6    6        256       0.005          0.449   0.0
 DD comm. bounds          6    6         60       0.002          0.216   0.0
 Send X to PME            6    6     200001      32.588       3050.168  14.1
 Neighbor search          6    6       2001       6.639        621.393   2.9
 Launch GPU ops.          6    6     400002      14.686       1374.563   6.4
 Comm. coord.             6    6     198000      36.691       3434.263  15.9
 Force                    6    6     200001       2.913        272.694   1.3
 Wait + Comm. F           6    6     200001      32.024       2997.400  13.9
 PME mesh *               1    6     200001      77.479       1208.657   5.6
 PME wait for PP *                               119.009       1856.517   8.6
 Wait + Recv. PME F       6    6     200001      14.328       1341.122   6.2
 Wait PME GPU gather      6    6     200001      11.115       1040.397   4.8
 Wait Bonded GPU          6    6       2001       0.003          0.279   0.0
 Wait GPU NB nonloc.      6    6     200001      27.604       2583.729  11.9
 Wait GPU NB local        6    6     200001       0.548         51.333   0.2
 NB X/F buffer ops.       6    6     796002      11.095       1038.515   4.8
 Write traj.              6    6         21       0.105          9.851   0.0
 Update                   6    6     200001       3.498        327.440   1.5
 Comm. energies           6    6      40001       2.947        275.863   1.3
-----------------------------------------------------------------------------
 Total                                           198.094      21631.660 100.0
-----------------------------------------------------------------------------
(*) Note that with separate PME ranks, the walltime column actually sums to
twice the total reported, but the cycle count total and % are correct.
-----------------------------------------------------------------------------
               Core t (s)   Wall t (s)        (%)
       Time:     8319.867      198.094     4200.0
                 (ns/day)    (hour/ns)
Performance:       87.232        0.275
I'm curious about the "Wait GPU NB nonloc." and "Wait GPU NB local" entries. As you can see, in both cases the wall time of Wait GPU NB local is very short while that of Wait GPU NB nonloc. is rather long, and the wall time of Force is in both cases much shorter than that of Wait GPU NB nonloc. Could you please explain these timing terms? I would also very much appreciate any suggestions for reducing that waiting time!
Sincerely,
Zhang