[gmx-users] Too much PME mesh wall time.

Mark Abraham mark.j.abraham at gmail.com
Mon Aug 25 13:38:25 CEST 2014


On Sun, Aug 24, 2014 at 2:19 AM, Yunlong Liu <yliu120 at jh.edu> wrote:

> Hi gromacs users,
>
> I ran into a problem with too much PME mesh time in my simulation; my
> time accounting is below. I am running the simulation on 2 nodes, each
> with 16 CPU cores and 1 NVIDIA Tesla K20m GPU.
>
> And my mdrun command is ibrun /work/03002/yliu120/gromacs-5/bin/mdrun_mpi
> -pin on -ntomp 8 -dlb no -deffnm pi3k-wt-charm-4 -gpu_id 00.
>
> I manually turned off dlb because the simulation crashes when it is
> turned on. I have reported this on both mailing lists and talked to Roland.
>

Hmm. This shouldn't happen. Can you please open an issue at
http://redmine.gromacs.org/ and upload enough info for us to replicate it?


>      R E A L   C Y C L E   A N D   T I M E   A C C O U N T I N G
>
> On 4 MPI ranks, each using 8 OpenMP threads
>
>  Computing:          Num   Num      Call    Wall time     Giga-Cycles
>                      Ranks Threads  Count      (s)         total sum    %
> -----------------------------------------------------------------------------
>  Domain decomp.         4    8     150000    1592.099     137554.334   2.2
>  DD comm. load          4    8        751       0.057          4.947   0.0
>  Neighbor search        4    8     150001     665.072      57460.919   0.9
>  Launch GPU ops.        4    8   15000002     967.023      83548.916   1.3
>  Comm. coord.           4    8    7350000    2488.263     214981.185   3.5
>  Force                  4    8    7500001    7037.401     608018.042   9.8
>  Wait + Comm. F         4    8    7500001    3931.222     339650.132   5.5
>  PME mesh               4    8    7500001   40799.937    3525036.971  56.7
>  Wait GPU nonlocal      4    8    7500001    1985.151     171513.300   2.8
>  Wait GPU local         4    8    7500001      68.365       5906.612   0.1
>  NB X/F buffer ops.     4    8   29700002    1229.406     106218.328   1.7
>  Write traj.            4    8        830      28.245       2440.304   0.0
>  Update                 4    8    7500001    2479.611     214233.669   3.4
>  Constraints            4    8    7500001    7041.030     608331.635   9.8
>  Comm. energies         4    8     150001      14.250       1231.154   0.0
>  Rest                                         1601.588     138374.139   2.2
> -----------------------------------------------------------------------------
>  Total                                       71928.719    6214504.588 100.0
> -----------------------------------------------------------------------------
>  Breakdown of PME mesh computation
> -----------------------------------------------------------------------------
>  PME redist. X/F        4    8   15000002    8362.454     722500.151  11.6
>  PME spread/gather      4    8   15000002   14836.350    1281832.463  20.6
>  PME 3D-FFT             4    8   15000002    8985.776     776353.949  12.5
>  PME 3D-FFT Comm.       4    8   15000002    7547.935     652127.220  10.5
>  PME solve Elec         4    8    7500001    1025.249      88579.550   1.4
> -----------------------------------------------------------------------------
>
> First, I would like to know whether this is a big problem, and second,
> how I can improve my performance.
>

"Too much" mesh time is not really possible. With the GPU doing the
short-ranged work, the only work for the CPU to do is the bondeds (in
"Force" above) and long-range (PME mesh). Those ought to dominate the run
time, and roughly in that ratio for a typical biomolecular system.

> Does it mean that my GPU is running too fast and the CPU is waiting?


Looks balanced - if the GPU had too much work then the Wait GPU times would
be appreciable. What did the PP-PME load balancing at the start of the run
look like?
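
If you want to separate the effect of that automatic tuning from everything
else, one thing you could try (just a suggestion, untested for your system)
is a short benchmark run with the tuning disabled, e.g.

ibrun /work/03002/yliu120/gromacs-5/bin/mdrun_mpi -pin on -ntomp 8 -dlb no -deffnm pi3k-wt-charm-4 -gpu_id 00 -notunepme -nsteps 50000 -resethway

and compare its timing table against the same short run with the default
tuning left on. -notunepme, -nsteps and -resethway are standard mdrun
options; -resethway resets the cycle counters halfway through the run so
that start-up effects do not distort them.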


> BTW, what does the wait GPU nonlocal refer to?
>

When using DD and GPUs, the short-ranged work on each PP rank is decomposed
into a set whose resulting forces are needed by other ("non-local") PP
ranks, and the rest. The non-local work is done first, so that once the PME
mesh work is done, the PP<->PP MPI communication can be overlapped with the
local short-ranged GPU work. The 0.1% time for "Wait GPU local" indicates
that the communication took longer than the remaining local work, perhaps
because there was not much of the latter or it was already complete.
Unfortunately, it is not always possible to get timing information from
CUDA without slowing down the run. What actually happens is strongly
dependent on the hardware and simulation system.
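
In case a concrete picture helps, here is a toy CUDA sketch of that general
pattern. This is not GROMACS source code; every name in it is invented for
the illustration.

/* Toy sketch only -- NOT GROMACS source. It illustrates the two-stream
 * pattern described above: the non-local short-ranged work is launched
 * first so its forces can be communicated to peer ranks while the local
 * work is still running. Compile with: nvcc overlap_sketch.cu */
#include <cuda_runtime.h>
#include <stdio.h>

__global__ void dummy_force_kernel(float *f, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        f[i] += 1.0f;            /* stand-in for a short-ranged force kernel */
}

int main(void)
{
    const int nNonLocal = 1 << 18;   /* atoms whose forces other ranks need */
    const int nLocal    = 1 << 20;   /* atoms only this rank needs          */
    float *fNonLocal, *fLocal;
    cudaMalloc((void **)&fNonLocal, nNonLocal * sizeof(float));
    cudaMalloc((void **)&fLocal,    nLocal    * sizeof(float));

    cudaStream_t sNonLocal, sLocal;
    cudaStreamCreate(&sNonLocal);
    cudaStreamCreate(&sLocal);

    /* Non-local part first: other ranks are waiting for these forces. */
    dummy_force_kernel<<<(nNonLocal + 255) / 256, 256, 0, sNonLocal>>>(fNonLocal, nNonLocal);
    /* Local part runs concurrently in its own stream. */
    dummy_force_kernel<<<(nLocal + 255) / 256, 256, 0, sLocal>>>(fLocal, nLocal);

    /* The CPU does its own work here (e.g. PME mesh), then... */

    /* ..."Wait GPU nonlocal": block until the non-local forces are ready,
     * so the PP<->PP MPI communication can start while the local stream
     * is possibly still busy. */
    cudaStreamSynchronize(sNonLocal);
    /* ...the non-local force communication would happen here... */

    /* "Wait GPU local": often near zero, because by this point the local
     * work has usually finished. */
    cudaStreamSynchronize(sLocal);

    printf("both streams done\n");
    cudaStreamDestroy(sNonLocal);
    cudaStreamDestroy(sLocal);
    cudaFree(fNonLocal);
    cudaFree(fLocal);
    return 0;
}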

Mark

> Thank you.
> Yunlong
>
> --
>
> ========================================
> Yunlong Liu, PhD Candidate
> Computational Biology and Biophysics
> Department of Biophysics and Biophysical Chemistry
> School of Medicine, The Johns Hopkins University
> Email: yliu120 at jhmi.edu
> Address: 725 N Wolfe St, WBSB RM 601, 21205
> ========================================
>