[gmx-users] GMX GPU Rest Time
Mark Abraham
mark.j.abraham at gmail.com
Thu Jun 8 10:33:47 CEST 2017
Hi,
On Thu, Jun 8, 2017 at 8:55 AM Daniel Kozuch <dkozuch at princeton.edu> wrote:
> Hello,
>
> I recently changed the number of cpus I was pairing with each gpu and I
> noticed a significant slowdown, more than I would have expected simply due
> to a reduction in the number of cpus.
>
> From the log file it appears that the GPU is resting for a large amount of
> time. Is there something I can do about this?
>
That's not just the GPUs resting; clearly many replicas are stalled waiting
at the exchange-attempt synchronization.
> I have attached parts of the log file. For reference, this is a REMD
> simulation with 60 replicas on 360 CPUs and 60 GPUs.
The run reports 15 nodes with 28 cores per node but only 24 compatible GPUs
in total (and some nodes with none), so mapping your simulation to your
hardware is the first thing to focus on. With 60 replicas on 24 GPUs, at
least some replicas are sharing GPUs and so run at different speeds, and
because all replicas can only progress at the rate of the slowest one, the
replicas that finish each step first sit waiting for the others.
If e.g. this cluster has 4 GPUs and 28 cores per node, then you want to
place 4 replicas per node, e.g. 1 MPI rank per replica with 7 OpenMP
threads per rank. See https://arxiv.org/abs/1507.00898 for further clues.
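A rough sketch of such a launch, assuming the job is restricted to nodes that
each expose 4 of these GPUs and 28 cores, and assuming your MPI launcher or
batch system is told to place 4 ranks per node (e.g. Open MPI's
--map-by ppr:4:node, or your scheduler's tasks-per-node setting; those details
are cluster-specific), would be

  mpirun -np 60 gmx_514_gpu mdrun -v -deffnm 1msi_eq \
      -multidir $(seq 1 60) \
      -ntomp 7 -gpu_id 0123 -pin on

Here -ntomp 7 gives each of the 4 ranks on a node 7 OpenMP threads (4 x 7 =
28 cores), -gpu_id 0123 maps those 4 PP ranks onto the node's GPUs 0-3, and
$(seq 1 60) is just shorthand for listing the 60 replica directories as in
your original command line.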
GROMACS 2016 does slightly improve the implementation of the coupling
between simulations in replica exchange, but you need to address all of the
above issues first.
Mark
> I have set the environment
> variable OMP_NUM_THREADS to six in order to assign 6 CPUs to each replica
> and avoid domain decomposition for my small system (as recommended in an
> earlier correspondence).
>
> Any help is appreciated,
> Dan
>
>
> -----------------------------------------------------------------------------------------------------------------------------
>
> GROMACS: gmx mdrun, VERSION 5.1.4
> Executable: /home/dkozuch/programs/gromacs_514_gpu/bin/gmx_514_gpu
> Data prefix: /home/dkozuch/programs/gromacs_514_gpu
> Command line:
> gmx_514_gpu mdrun -v -deffnm 1msi_eq -multidir 1 2 3 4 5 6 7 8 9 10 11 12
> 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37
> 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 -pin
> on
>
> GROMACS version: VERSION 5.1.4
> Precision: single
> Memory model: 64 bit
> MPI library: MPI
> OpenMP support: enabled (GMX_OPENMP_MAX_THREADS = 32)
> GPU support: enabled
> OpenCL support: disabled
> invsqrt routine: gmx_software_invsqrt(x)
> SIMD instructions: AVX2_256
> FFT library: fftw-3.3.4-sse2-avx
> RDTSCP usage: enabled
> C++11 compilation: disabled
> TNG support: enabled
> Tracing support: disabled
> Built on: Mon May 22 18:29:21 EDT 2017
> Built by: dkozuch at tigergpu.princeton.edu [CMAKE]
> Build OS/arch: Linux 3.10.0-514.16.1.el7.x86_64 x86_64
> Build CPU vendor: GenuineIntel
> Build CPU brand: Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz
> Build CPU family: 6 Model: 79 Stepping: 1
> Build CPU features: aes apic avx avx2 clfsh cmov cx8 cx16 f16c fma htt
> lahf_lm mmx msr nonstop_tsc pcid pclmuldq pdcm pdpe1gb popcnt pse rdrnd
> rdtscp sse2 sse3 sse4.1 sse4.2 ssse3 tdt x2apic
> C compiler: /usr/bin/cc GNU 4.8.5
> C compiler flags: -march=core-avx2 -Wextra
> -Wno-missing-field-initializers -Wno-sign-compare -Wpointer-arith -Wall
> -Wno-unused -Wunused-value -Wunused-parameter -O3 -DNDEBUG
> -funroll-all-loops -fexcess-precision=fast -Wno-array-bounds
> C++ compiler: /usr/bin/c++ GNU 4.8.5
> C++ compiler flags: -march=core-avx2 -Wextra
> -Wno-missing-field-initializers -Wpointer-arith -Wall -Wno-unused-function
> -O3 -DNDEBUG -funroll-all-loops -fexcess-precision=fast -Wno-array-bounds
> Boost version: 1.53.0 (external)
> CUDA compiler: /usr/local/cuda-8.0/bin/nvcc nvcc: NVIDIA (R) Cuda
> compiler driver;Copyright (c) 2005-2016 NVIDIA Corporation;Built on
> Sun_Sep__4_22:14:01_CDT_2016;Cuda compilation tools, release 8.0, V8.0.44
> CUDA compiler flags:
> -gencode;arch=compute_20,code=sm_20;-gencode;arch=compute_30,code=sm_30;-gencode;arch=compute_35,code=sm_35;-gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_52,code=sm_52;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_60,code=compute_60;-gencode;arch=compute_61,code=compute_61;-use_fast_math;;
> ;-march=core-avx2;-Wextra;-Wno-missing-field-initializers;-Wpointer-arith;-Wall;-Wno-unused-function;-O3;-DNDEBUG;-funroll-all-loops;-fexcess-precision=fast;-Wno-array-bounds;
> CUDA driver: 8.0
> CUDA runtime: 8.0
>
>
> Number of logical cores detected (28) does not match the number reported by
> OpenMP (6).
> Consider setting the launch configuration manually!
>
> Running on 15 nodes with total 420 cores, 420 logical cores, 24 compatible
> GPUs
> Cores per node: 28
> Logical cores per node: 28
> Compatible GPUs per node: 0 - 4
> Different nodes have different type(s) and/or order of GPUs
> Hardware detected on host tiger-i20g2 (the node of MPI rank 4):
> CPU info:
> Vendor: GenuineIntel
> Brand: Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz
> Family: 6 model: 79 stepping: 1
> CPU features: aes apic avx avx2 clfsh cmov cx8 cx16 f16c fma htt
> lahf_lm mmx msr nonstop_tsc pcid pclmuldq pdcm pdpe1gb popcnt pse rdrnd
> rdtscp sse2 sse3 sse4.1 sse4.2 ssse3 tdt x2apic
> SIMD instructions most likely to fit this hardware: AVX2_256
> SIMD instructions selected at GROMACS compile time: AVX2_256
> GPU info:
> Number of GPUs detected: 4
> #0: NVIDIA Tesla P100-PCIE-16GB, compute cap.: 6.0, ECC: yes, stat:
> compatible
> #1: NVIDIA Tesla P100-PCIE-16GB, compute cap.: 6.0, ECC: yes, stat:
> compatible
> #2: NVIDIA Tesla P100-PCIE-16GB, compute cap.: 6.0, ECC: yes, stat:
> compatible
> #3: NVIDIA Tesla P100-PCIE-16GB, compute cap.: 6.0, ECC: yes, stat:
> compatible
>
> This is simulation 4 out of 60 running as a composite GROMACS
> multi-simulation job. Setup for this simulation:
>
> Using 1 MPI process
> Using 6 OpenMP threads
>
> 4 compatible GPUs are present, with IDs 0,1,2,3
> 4 GPUs auto-selected for this run.
> Mapping of GPU IDs to the 4 PP ranks in this node: 0,1,2,3
>
> Will do PME sum in reciprocal space for electrostatic interactions.
>
> ++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
> U. Essmann, L. Perera, M. L. Berkowitz, T. Darden, H. Lee and L. G.
> Pedersen
> A smooth particle mesh Ewald method
> J. Chem. Phys. 103 (1995) pp. 8577-8592
> -------- -------- --- Thank You --- -------- --------
>
> Will do ordinary reciprocal space Ewald sum.
> Using a Gaussian width (1/beta) of 0.288146 nm for Ewald
> Cut-off's: NS: 1.003 Coulomb: 0.9 LJ: 0.9
> System total charge: -0.000
> Generated table with 1001 data points for Ewald.
> Tabscale = 500 points/nm
> Generated table with 1001 data points for LJ6.
> Tabscale = 500 points/nm
> Generated table with 1001 data points for LJ12.
> Tabscale = 500 points/nm
> Generated table with 1001 data points for 1-4 COUL.
> Tabscale = 500 points/nm
> Generated table with 1001 data points for 1-4 LJ6.
> Tabscale = 500 points/nm
> Generated table with 1001 data points for 1-4 LJ12.
> Tabscale = 500 points/nm
> Potential shift: LJ r^-12: -3.541e+00 r^-6: -1.882e+00, Ewald -1.000e-05
> Initialized non-bonded Ewald correction tables, spacing: 8.85e-04 size:
> 1018
>
>
> NOTE: GROMACS was configured without NVML support hence it can not exploit
> application clocks of the detected Tesla P100-PCIE-16GB GPU to
> improve performance.
> Recompile with the NVML library (compatible with the driver used) or
> set application clocks manually.
>
>
> Using GPU 8x8 non-bonded kernels
>
> Removing pbc first time
>
> Overriding thread affinity set outside gmx_514_gpu
>
> Pinning threads with an auto-selected logical core stride of 1
>
> Initializing LINear Constraint Solver
>
> Center of mass motion removal mode is Linear
> We have the following groups for center of mass motion removal:
> 0: rest
> There are: 7898 Atoms
> There are: 2300 VSites
>
> M E G A - F L O P S A C C O U N T I N G
>
> NB=Group-cutoff nonbonded kernels NxN=N-by-N cluster Verlet kernels
> RF=Reaction-Field VdW=Van der Waals QSTab=quadratic-spline table
> W3=SPC/TIP3p W4=TIP4p (single or pairs)
> V&F=Potential and force V=Potential only F=Force only
>
> Computing: M-Number M-Flops % Flops
>
> -----------------------------------------------------------------------------
> Pair Search distance check 3465.668704 31191.018 0.1
> NxN Ewald Elec. + LJ [F] 415473.225344 27421232.873 94.0
> NxN Ewald Elec. + LJ [V&F] 4204.264896 449856.344 1.5
> 1,4 nonbonded interactions 131.602632 11844.237 0.0
> Calc Weights 1529.730594 55070.301 0.2
> Spread Q Bspline 32634.252672 65268.505 0.2
> Gather F Bspline 32634.252672 195805.516 0.7
> 3D-FFT 102116.765540 816934.124 2.8
> Solve PME 79.966400 5117.850 0.0
> Shift-X 25.505198 153.031 0.0
> Angles 92.051841 15464.709 0.1
> Propers 143.502870 32862.157 0.1
> Impropers 11.050221 2298.446 0.0
> Virial 51.225243 922.054 0.0
> Stop-CM 5.119396 51.194 0.0
> P-Coupling 50.990000 305.940 0.0
> Calc-Ekin 102.000396 2754.011 0.0
> Lincs 50.203012 3012.181 0.0
> Lincs-Mat 1104.666276 4418.665 0.0
> Constraint-V 445.417816 3563.343 0.0
> Constraint-Vir 39.527904 948.670 0.0
> Settle 115.006900 37147.229 0.1
> Virtual Site 3 126.504600 4680.670 0.0
>
> -----------------------------------------------------------------------------
> Total 29160903.068 100.0
>
> -----------------------------------------------------------------------------
>
>
> R E A L C Y C L E A N D T I M E A C C O U N T I N G
>
> On 1 MPI rank, each using 6 OpenMP threads
>
> Computing: Num Num Call Wall time Giga-Cycles
> Ranks Threads Count (s) total sum %
>
> -----------------------------------------------------------------------------
> Vsite constr. 1 6 50001 0.960 13.824 0.7
> Neighbor search 1 6 2501 2.908 41.869 2.1
> Launch GPU ops. 1 6 50001 2.081 29.973 1.5
> Force 1 6 50001 4.203 60.525 3.0
> PME mesh 1 6 50001 19.931 287.004 14.2
> Wait GPU local 1 6 50001 0.722 10.398 0.5
> NB X/F buffer ops. 1 6 97501 0.830 11.957 0.6
> Vsite spread 1 6 55002 1.189 17.116 0.8
> Write traj. 1 6 6 0.040 0.580 0.0
> Update 1 6 50001 3.417 49.202 2.4
> Constraints 1 6 50001 4.544 65.428 3.2
> *Rest 99.392 1431.228 70.9*
>
> -----------------------------------------------------------------------------
> Total 140.217 2019.104 100.0
>
> -----------------------------------------------------------------------------
> Breakdown of PME mesh computation
>
> -----------------------------------------------------------------------------
> PME spread/gather 1 6 100002 11.795 169.845 8.4
> PME 3D-FFT 1 6 100002 6.712 96.651 4.8
> PME solve Elec 1 6 50001 1.320 19.010 0.9
>
> -----------------------------------------------------------------------------
>
> GPU timings
>
> -----------------------------------------------------------------------------
> Computing: Count Wall t (s) ms/step %
>
> -----------------------------------------------------------------------------
> Pair list H2D 2501 0.174 0.069 0.7
> X / q H2D 50001 1.052 0.021 4.3
> Nonbonded F kernel 47500 21.005 0.442 86.5
> Nonbonded F+prune k. 2000 0.920 0.460 3.8
> Nonbonded F+ene+prune k. 501 0.254 0.507 1.0
> F D2H 50001 0.873 0.017 3.6
>
> -----------------------------------------------------------------------------
> Total 24.277 0.486 100.0
>
> -----------------------------------------------------------------------------
>
> Force evaluation time GPU/CPU: 0.486 ms/0.483 ms = 1.006
> For optimal performance this ratio should be close to 1!
>
> Core t (s) Wall t (s) (%)
> Time: 839.987 140.217 599.1
> (ns/day) (hour/ns)
> Performance: 123.240 0.195
> Finished mdrun on rank 0 Wed Jun 7 10:58:29 2017