[gmx-developers] Which parts of the runtime cost do "Wait GPU NB nonloc" and "Wait GPU NB local" actually count?
张驭洲
zhangyuzhou15 at mails.ucas.edu.cn
Tue Apr 14 10:37:48 CEST 2020
Hello GROMACS developers,
I'm using GROMACS 2020.1 on a node with 2 Intel(R) Xeon(R) Gold 6142 CPUs and 4 NVIDIA Tesla V100-PCIE-32GB GPUs.
Using the following command line:
gmx mdrun -s p16.tpr -o p16.trr -c p16_out.gro -e p16.edr -g p16.log -pin on -ntmpi 4 -ntomp 6 -nb gpu -bonded gpu -pme gpu -npme 1
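If I read the launch correctly, this splits into 3 PP ranks plus 1 PME rank with 6 OpenMP threads each, so 24 of the node's 32 physical cores are in use (assuming the usual 16 cores per Gold 6142 socket) and each rank gets its own V100. A minimal sketch of that accounting, just my own arithmetic:

    # Back-of-the-envelope accounting for the single-node launch
    # (assumes 16 physical cores per Xeon Gold 6142 socket).
    ntmpi, ntomp, npme = 4, 6, 1              # from the mdrun command line
    sockets, cores_per_socket = 2, 16
    n_gpus = 4

    pp_ranks = ntmpi - npme                   # 3 PP ranks + 1 PME rank
    threads_used = ntmpi * ntomp              # 24 threads
    cores_total = sockets * cores_per_socket  # 32 physical cores

    print(f"{pp_ranks} PP + {npme} PME ranks, "
          f"{threads_used} of {cores_total} cores used, "
          f"{n_gpus} GPUs -> one per rank")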
I got the following performance results:
     R E A L   C Y C L E   A N D   T I M E   A C C O U N T I N G

On 3 MPI ranks doing PP, each using 6 OpenMP threads, and
on 1 MPI rank doing PME, using 6 OpenMP threads

 Computing:               Num   Num      Call    Wall time     Giga-Cycles
                          Ranks Threads  Count      (s)       total sum    %
-----------------------------------------------------------------------------
 Domain decomp.              3    6       2001      15.290        715.584   6.4
 DD comm. load               3    6        245       0.008          0.377   0.0
 DD comm. bounds             3    6         48       0.003          0.151   0.0
 Send X to PME               3    6     200001       9.756        456.559   4.1
 Neighbor search             3    6       2001      12.184        570.190   5.1
 Launch GPU ops.             3    6     400002      17.929        839.075   7.5
 Force                       3    6     200001       3.912        183.082   1.6
 Wait + Comm. F              3    6      40001       4.229        197.913   1.8
 PME mesh *                  1    6     200001      16.733        261.027   2.3
 PME wait for PP *                                  162.467       2534.449  22.7
 Wait + Recv. PME F          3    6     200001      18.827        881.091   7.9
 Wait PME GPU gather         3    6     200001       2.896        135.522   1.2
 Wait Bonded GPU             3    6       2001       0.003          0.122   0.0
 Wait GPU NB nonloc.         3    6     200001      15.328        717.330   6.4
 Wait GPU NB local           3    6     200001       0.175          8.169   0.1
 Wait GPU state copy         3    6     160000      26.204       1226.327  11.0
 NB X/F buffer ops.          3    6     798003       7.023        328.655   2.9
 Write traj.                 3    6         21       0.182          8.540   0.1
 Update                      3    6     200001       6.685        312.856   2.8
 Comm. energies              3    6      40001       6.684        312.796   2.8
 Rest                                                31.899       1492.851  13.3
-----------------------------------------------------------------------------
 Total                                              179.216      11182.921 100.0
-----------------------------------------------------------------------------
(*) Note that with separate PME ranks, the walltime column actually sums to
    twice the total reported, but the cycle count total and % are correct.
-----------------------------------------------------------------------------

               Core t (s)   Wall t (s)        (%)
       Time:     4301.031      179.216     2399.9
                 (ns/day)    (hour/ns)
Performance:       96.421        0.249
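As a quick sanity check on my side (not part of the log), Core t, Wall t, and the (%) column are consistent with 4 ranks times 6 OpenMP threads:

    # Consistency check for the single-node run: Core t should be close to
    # Wall t times the total thread count (4 ranks * 6 OpenMP threads = 24).
    wall_t = 179.216      # s, from the log
    core_t = 4301.031     # s, from the log
    threads = 4 * 6

    print(f"Core t / Wall t  = {core_t / wall_t:.1f}")       # ~24.0, i.e. the reported 2399.9 %
    print(f"Wall t * threads = {wall_t * threads:.1f} s")     # ~4301.2 s, close to Core t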
Using two nodes and the following command:
gmx_mpi mdrun -s p16.tpr -o p16.trr -c p16_out.gro -e p16.edr -g p16.log -ntomp 6 -nb gpu -bonded gpu -pme gpu -npme 1
I got these results:
     R E A L   C Y C L E   A N D   T I M E   A C C O U N T I N G

On 6 MPI ranks doing PP, each using 6 OpenMP threads, and
on 1 MPI rank doing PME, using 6 OpenMP threads

 Computing:               Num   Num      Call    Wall time     Giga-Cycles
                          Ranks Threads  Count      (s)       total sum    %
-----------------------------------------------------------------------------
 Domain decomp.              6    6       2001       8.477        793.447   3.7
 DD comm. load               6    6        256       0.005          0.449   0.0
 DD comm. bounds             6    6         60       0.002          0.216   0.0
 Send X to PME               6    6     200001      32.588       3050.168  14.1
 Neighbor search             6    6       2001       6.639        621.393   2.9
 Launch GPU ops.             6    6     400002      14.686       1374.563   6.4
 Comm. coord.                6    6     198000      36.691       3434.263  15.9
 Force                       6    6     200001       2.913        272.694   1.3
 Wait + Comm. F              6    6     200001      32.024       2997.400  13.9
 PME mesh *                  1    6     200001      77.479       1208.657   5.6
 PME wait for PP *                                  119.009       1856.517   8.6
 Wait + Recv. PME F          6    6     200001      14.328       1341.122   6.2
 Wait PME GPU gather         6    6     200001      11.115       1040.397   4.8
 Wait Bonded GPU             6    6       2001       0.003          0.279   0.0
 Wait GPU NB nonloc.         6    6     200001      27.604       2583.729  11.9
 Wait GPU NB local           6    6     200001       0.548         51.333   0.2
 NB X/F buffer ops.          6    6     796002      11.095       1038.515   4.8
 Write traj.                 6    6         21       0.105          9.851   0.0
 Update                      6    6     200001       3.498        327.440   1.5
 Comm. energies              6    6      40001       2.947        275.863   1.3
-----------------------------------------------------------------------------
 Total                                              198.094      21631.660 100.0
-----------------------------------------------------------------------------
(*) Note that with separate PME ranks, the walltime column actually sums to
    twice the total reported, but the cycle count total and % are correct.
-----------------------------------------------------------------------------

               Core t (s)   Wall t (s)        (%)
       Time:     8319.867      198.094     4200.0
                 (ns/day)    (hour/ns)
Performance:       87.232        0.275
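The same check for the two-node run works out with 7 ranks times 6 OpenMP threads:

    # Consistency check for the two-node run:
    # 6 PP ranks + 1 PME rank, each with 6 OpenMP threads = 42 threads.
    wall_t = 198.094      # s, from the log
    core_t = 8319.867     # s, from the log
    threads = 7 * 6

    print(f"Core t / Wall t  = {core_t / wall_t:.1f}")       # ~42.0, i.e. the reported 4200.0 %
    print(f"Wall t * threads = {wall_t * threads:.1f} s")     # ~8319.9 s, close to Core t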
I'm curious about the "Wait GPU NB nonloc." and "Wait GPU NB local" entries. As you can see, in both cases the wall time of "Wait GPU NB local" is very short while that of "Wait GPU NB nonloc." is quite long, and in both cases the wall time of "Force" is much shorter than that of "Wait GPU NB nonloc." (a quick tally of these numbers is included below). Could you please explain what exactly these timing terms count? I would also very much appreciate any suggestions for reducing the time spent in this waiting.
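To make "quite long" concrete, this is how I tallied those entries against the total wall time of each run (just my own arithmetic on numbers copied from the two tables above):

    # Share of wall time spent waiting for GPU non-bonded results,
    # numbers copied from the two accounting tables above.
    runs = {
        "single node": {"total": 179.216, "force": 3.912,
                        "nb_local": 0.175, "nb_nonlocal": 15.328},
        "two nodes":   {"total": 198.094, "force": 2.913,
                        "nb_local": 0.548, "nb_nonlocal": 27.604},
    }

    for name, t in runs.items():
        share = 100.0 * t["nb_nonlocal"] / t["total"]
        print(f"{name}: Wait GPU NB nonloc. {t['nb_nonlocal']:.3f} s "
              f"({share:.1f}% of wall time), "
              f"Wait GPU NB local {t['nb_local']:.3f} s, "
              f"Force {t['force']:.3f} s")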
Sincerely,
Zhang