[gmx-developers] Which part of the runtime cost do "Wait GPU NB nonloc" and "Wait GPU NB local" actually count?
Berk Hess
hess at kth.se
Tue Apr 14 10:52:54 CEST 2020
Hi,
Those timers report the time the CPU spends waiting for the results of the
local and non-local non-bonded calculations to arrive from the GPU. With
domain decomposition, the non-local results are waited for first, because
they are needed for the force communication between ranks, so nearly all of
the GPU wait shows up in the non-local counter; the local results are
usually already available by the time they are needed. When the CPU has few
or no forces to compute itself, this wait time can be a large part of the
total run time.
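
For illustration, here is a minimal sketch (CUDA C++; not the actual GROMACS
source, and the function and stream names are made up) of what such a wait
counter conceptually measures: the wall-clock time the CPU spends blocked in a
stream synchronization until the GPU non-bonded kernel and its device-to-host
force copy have completed.

// Minimal sketch only: times how long the CPU is blocked waiting for the GPU.
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: block until all work queued on 'stream' (e.g. the
// non-local non-bonded kernel plus its force copy back to the host) is done,
// and return the CPU wall time spent waiting.
static double waitForGpuNonbonded(cudaStream_t stream)
{
    auto t0 = std::chrono::steady_clock::now();
    cudaStreamSynchronize(stream);  // the CPU idles here while the GPU finishes
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(t1 - t0).count();
}

int main()
{
    cudaStream_t nonlocalStream;
    cudaStreamCreate(&nonlocalStream);
    // ... launch the non-bonded kernel and async force copy on nonlocalStream ...
    double waitSeconds = waitForGpuNonbonded(nonlocalStream);
    std::printf("Wait GPU NB nonloc.: %.3f s\n", waitSeconds);
    cudaStreamDestroy(nonlocalStream);
    return 0;
}

If the CPU finishes its own force work quickly, almost the entire remaining
GPU kernel and transfer time accumulates in this wait, which is why
"Wait GPU NB nonloc." can be much larger than "Force" even though the GPU is
doing most of the work.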
Cheers,
Berk
On 2020-04-14 10:37, 张驭洲 wrote:
>
> Hello GROMACS developers,
>
>
> I'm using GROMACS 2020.1 on a node with 2 Intel(R) Xeon(R) Gold 6142
> CPUs and 4 NVIDIA Tesla V100-PCIE-32GB GPUs.
>
> With the command line as follows:
>
> gmx mdrun -s p16.tpr -o p16.trr -c p16_out.gro -e p16.edr -g
> p16.log -pin on -ntmpi 4 -ntomp 6 -nb gpu -bonded gpu -pme gpu -npme 1
>
> I got the following performance results:
>
> R E A L   C Y C L E   A N D   T I M E   A C C O U N T I N G
>
> On 3 MPI ranks doing PP, each using 6 OpenMP threads, and
> on 1 MPI rank doing PME, using 6 OpenMP threads
>
> Computing:             Num  Num      Call   Wall time    Giga-Cycles
>                      Ranks Threads  Count      (s)         total sum     %
> -----------------------------------------------------------------------------
> Domain decomp.           3    6      2001      15.290        715.584   6.4
> DD comm. load            3    6       245       0.008          0.377   0.0
> DD comm. bounds          3    6        48       0.003          0.151   0.0
> Send X to PME            3    6    200001       9.756        456.559   4.1
> Neighbor search          3    6      2001      12.184        570.190   5.1
> Launch GPU ops.          3    6    400002      17.929        839.075   7.5
> Force                    3    6    200001       3.912        183.082   1.6
> Wait + Comm. F           3    6     40001       4.229        197.913   1.8
> PME mesh *               1    6    200001      16.733        261.027   2.3
> PME wait for PP *                             162.467       2534.449  22.7
> Wait + Recv. PME F       3    6    200001      18.827        881.091   7.9
> Wait PME GPU gather      3    6    200001       2.896        135.522   1.2
> Wait Bonded GPU          3    6      2001       0.003          0.122   0.0
> Wait GPU NB nonloc.      3    6    200001      15.328        717.330   6.4
> Wait GPU NB local        3    6    200001       0.175          8.169   0.1
> Wait GPU state copy      3    6    160000      26.204       1226.327  11.0
> NB X/F buffer ops.       3    6    798003       7.023        328.655   2.9
> Write traj.              3    6        21       0.182          8.540   0.1
> Update                   3    6    200001       6.685        312.856   2.8
> Comm. energies           3    6     40001       6.684        312.796   2.8
> Rest                                           31.899       1492.851  13.3
> -----------------------------------------------------------------------------
> Total                                         179.216      11182.921 100.0
> -----------------------------------------------------------------------------
> (*) Note that with separate PME ranks, the walltime column actually sums to
> twice the total reported, but the cycle count total and % are correct.
> -----------------------------------------------------------------------------
>
>
>                Core t (s)   Wall t (s)        (%)
>        Time:     4301.031      179.216     2399.9
>                  (ns/day)    (hour/ns)
> Performance:       96.421        0.249
>
>
> Using two nodes and the following command:
>
> gmx_mpi mdrun -s p16.tpr -o p16.trr -c p16_out.gro -e p16.edr -g
> p16.log -ntomp 6 -nb gpu -bonded gpu -pme gpu -npme 1
>
> I got these results:
>
>
> R E A L   C Y C L E   A N D   T I M E   A C C O U N T I N G
>
> On 6 MPI ranks doing PP, each using 6 OpenMP threads, and
> on 1 MPI rank doing PME, using 6 OpenMP threads
>
> Computing:             Num  Num      Call   Wall time    Giga-Cycles
>                      Ranks Threads  Count      (s)         total sum     %
> -----------------------------------------------------------------------------
> Domain decomp.           6    6      2001       8.477        793.447   3.7
> DD comm. load            6    6       256       0.005          0.449   0.0
> DD comm. bounds          6    6        60       0.002          0.216   0.0
> Send X to PME            6    6    200001      32.588       3050.168  14.1
> Neighbor search          6    6      2001       6.639        621.393   2.9
> Launch GPU ops.          6    6    400002      14.686       1374.563   6.4
> Comm. coord.             6    6    198000      36.691       3434.263  15.9
> Force                    6    6    200001       2.913        272.694   1.3
> Wait + Comm. F           6    6    200001      32.024       2997.400  13.9
> PME mesh *               1    6    200001      77.479       1208.657   5.6
> PME wait for PP *                             119.009       1856.517   8.6
> Wait + Recv. PME F       6    6    200001      14.328       1341.122   6.2
> Wait PME GPU gather      6    6    200001      11.115       1040.397   4.8
> Wait Bonded GPU          6    6      2001       0.003          0.279   0.0
> Wait GPU NB nonloc.      6    6    200001      27.604       2583.729  11.9
> Wait GPU NB local        6    6    200001       0.548         51.333   0.2
> NB X/F buffer ops.       6    6    796002      11.095       1038.515   4.8
> Write traj.              6    6        21       0.105          9.851   0.0
> Update                   6    6    200001       3.498        327.440   1.5
> Comm. energies           6    6     40001       2.947        275.863   1.3
> -----------------------------------------------------------------------------
> Total                                         198.094      21631.660 100.0
> -----------------------------------------------------------------------------
> (*) Note that with separate PME ranks, the walltime column actually sums to
> twice the total reported, but the cycle count total and % are correct.
> -----------------------------------------------------------------------------
>
>
>                Core t (s)   Wall t (s)        (%)
>        Time:     8319.867      198.094     4200.0
>                  (ns/day)    (hour/ns)
> Performance:       87.232        0.275
>
>
> I'm curious about the "Wait GPU NB nonloc." and "Wait GPU NB local"
> entries. As you can see, in both cases the wall time of "Wait GPU NB
> local" is very short while that of "Wait GPU NB nonloc." is quite long,
> and in both cases the wall time of "Force" is much shorter than that of
> "Wait GPU NB nonloc.". Could you please explain these timing terms? I
> would also very much appreciate any suggestions for reducing the time
> spent in that waiting!
>
>
> Sincerely,
>
> Zhang
>
>
>
>
>