[gmx-users] GMX GPU Rest Time

Daniel Kozuch dkozuch at princeton.edu
Thu Jun 8 08:55:13 CEST 2017


Hello,

I recently changed the number of CPUs I was pairing with each GPU and
noticed a significant slowdown, more than I would have expected from the
reduction in CPU count alone.

From the log file it appears that the GPU is resting for a large fraction
of the time. Is there something I can do about this?

I have attached parts of the log file. For reference, this is a REMD
simulation with 60 replicas on 360 CPUs and 60 GPUs. I have set the
environment variable OMP_NUM_THREADS to 6 in order to assign 6 CPUs to each
replica and avoid domain decomposition for my small system (as recommended
in an earlier correspondence).
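
For context, here is a minimal sketch of how the job is launched (the
mpirun invocation and the seq expansion are illustrative stand-ins for my
actual scheduler script, not the exact commands):

    # 60 replicas x 6 OpenMP threads = 360 cores; GPUs auto-selected per node
    export OMP_NUM_THREADS=6
    mpirun -np 60 gmx_514_gpu mdrun -v -deffnm 1msi_eq \
        -multidir $(seq 1 60) -ntomp 6 -pin on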

Any help is appreciated,
Dan

-----------------------------------------------------------------------------------------------------------------------------

GROMACS:      gmx mdrun, VERSION 5.1.4
Executable:   /home/dkozuch/programs/gromacs_514_gpu/bin/gmx_514_gpu
Data prefix:  /home/dkozuch/programs/gromacs_514_gpu
Command line:
  gmx_514_gpu mdrun -v -deffnm 1msi_eq -multidir 1 2 3 4 5 6 7 8 9 10 11 12
13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37
38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 -pin on

GROMACS version:    VERSION 5.1.4
Precision:          single
Memory model:       64 bit
MPI library:        MPI
OpenMP support:     enabled (GMX_OPENMP_MAX_THREADS = 32)
GPU support:        enabled
OpenCL support:     disabled
invsqrt routine:    gmx_software_invsqrt(x)
SIMD instructions:  AVX2_256
FFT library:        fftw-3.3.4-sse2-avx
RDTSCP usage:       enabled
C++11 compilation:  disabled
TNG support:        enabled
Tracing support:    disabled
Built on:           Mon May 22 18:29:21 EDT 2017
Built by:           dkozuch at tigergpu.princeton.edu [CMAKE]
Build OS/arch:      Linux 3.10.0-514.16.1.el7.x86_64 x86_64
Build CPU vendor:   GenuineIntel
Build CPU brand:    Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz
Build CPU family:   6   Model: 79   Stepping: 1
Build CPU features: aes apic avx avx2 clfsh cmov cx8 cx16 f16c fma htt
lahf_lm mmx msr nonstop_tsc pcid pclmuldq pdcm pdpe1gb popcnt pse rdrnd
rdtscp sse2 sse3 sse4.1 sse4.2 ssse3 tdt x2apic
C compiler:         /usr/bin/cc GNU 4.8.5
C compiler flags:    -march=core-avx2    -Wextra
-Wno-missing-field-initializers -Wno-sign-compare -Wpointer-arith -Wall
-Wno-unused -Wunused-value -Wunused-parameter  -O3 -DNDEBUG
-funroll-all-loops -fexcess-precision=fast  -Wno-array-bounds
C++ compiler:       /usr/bin/c++ GNU 4.8.5
C++ compiler flags:  -march=core-avx2    -Wextra
-Wno-missing-field-initializers -Wpointer-arith -Wall -Wno-unused-function
 -O3 -DNDEBUG -funroll-all-loops -fexcess-precision=fast  -Wno-array-bounds
Boost version:      1.53.0 (external)
CUDA compiler:      /usr/local/cuda-8.0/bin/nvcc nvcc: NVIDIA (R) Cuda
compiler driver;Copyright (c) 2005-2016 NVIDIA Corporation;Built on
Sun_Sep__4_22:14:01_CDT_2016;Cuda compilation tools, release 8.0, V8.0.44
CUDA compiler
flags:-gencode;arch=compute_20,code=sm_20;-gencode;arch=compute_30,code=sm_30;-gencode;arch=compute_35,code=sm_35;-gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_52,code=sm_52;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_60,code=compute_60;-gencode;arch=compute_61,code=compute_61;-use_fast_math;;
;-march=core-avx2;-Wextra;-Wno-missing-field-initializers;-Wpointer-arith;-Wall;-Wno-unused-function;-O3;-DNDEBUG;-funroll-all-loops;-fexcess-precision=fast;-Wno-array-bounds;
CUDA driver:        8.0
CUDA runtime:       8.0


Number of logical cores detected (28) does not match the number reported by
OpenMP (6).
Consider setting the launch configuration manually!

Running on 15 nodes with total 420 cores, 420 logical cores, 24 compatible
GPUs
  Cores per node:           28
  Logical cores per node:   28
  Compatible GPUs per node:  0 -  4
  Different nodes have different type(s) and/or order of GPUs
Hardware detected on host tiger-i20g2 (the node of MPI rank 4):
  CPU info:
    Vendor: GenuineIntel
    Brand:  Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz
    Family:  6  model: 79  stepping:  1
    CPU features: aes apic avx avx2 clfsh cmov cx8 cx16 f16c fma htt
lahf_lm mmx msr nonstop_tsc pcid pclmuldq pdcm pdpe1gb popcnt pse rdrnd
rdtscp sse2 sse3 sse4.1 sse4.2 ssse3 tdt x2apic
    SIMD instructions most likely to fit this hardware: AVX2_256
    SIMD instructions selected at GROMACS compile time: AVX2_256
  GPU info:
    Number of GPUs detected: 4
    #0: NVIDIA Tesla P100-PCIE-16GB, compute cap.: 6.0, ECC: yes, stat:
compatible
    #1: NVIDIA Tesla P100-PCIE-16GB, compute cap.: 6.0, ECC: yes, stat:
compatible
    #2: NVIDIA Tesla P100-PCIE-16GB, compute cap.: 6.0, ECC: yes, stat:
compatible
    #3: NVIDIA Tesla P100-PCIE-16GB, compute cap.: 6.0, ECC: yes, stat:
compatible

This is simulation 4 out of 60 running as a composite GROMACS
multi-simulation job. Setup for this simulation:

Using 1 MPI process
Using 6 OpenMP threads

4 compatible GPUs are present, with IDs 0,1,2,3
4 GPUs auto-selected for this run.
Mapping of GPU IDs to the 4 PP ranks in this node: 0,1,2,3

Will do PME sum in reciprocal space for electrostatic interactions.

++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
U. Essmann, L. Perera, M. L. Berkowitz, T. Darden, H. Lee and L. G.
Pedersen
A smooth particle mesh Ewald method
J. Chem. Phys. 103 (1995) pp. 8577-8592
-------- -------- --- Thank You --- -------- --------

Will do ordinary reciprocal space Ewald sum.
Using a Gaussian width (1/beta) of 0.288146 nm for Ewald
Cut-off's:   NS: 1.003   Coulomb: 0.9   LJ: 0.9
System total charge: -0.000
Generated table with 1001 data points for Ewald.
Tabscale = 500 points/nm
Generated table with 1001 data points for LJ6.
Tabscale = 500 points/nm
Generated table with 1001 data points for LJ12.
Tabscale = 500 points/nm
Generated table with 1001 data points for 1-4 COUL.
Tabscale = 500 points/nm
Generated table with 1001 data points for 1-4 LJ6.
Tabscale = 500 points/nm
Generated table with 1001 data points for 1-4 LJ12.
Tabscale = 500 points/nm
Potential shift: LJ r^-12: -3.541e+00 r^-6: -1.882e+00, Ewald -1.000e-05
Initialized non-bonded Ewald correction tables, spacing: 8.85e-04 size: 1018


NOTE: GROMACS was configured without NVML support hence it can not exploit
      application clocks of the detected Tesla P100-PCIE-16GB GPU to
improve performance.
      Recompile with the NVML library (compatible with the driver used) or
set application clocks manually.


Using GPU 8x8 non-bonded kernels

Removing pbc first time

Overriding thread affinity set outside gmx_514_gpu

Pinning threads with an auto-selected logical core stride of 1

Initializing LINear Constraint Solver

Center of mass motion removal mode is Linear
We have the following groups for center of mass motion removal:
  0:  rest
There are: 7898 Atoms
There are: 2300 VSites

M E G A - F L O P S   A C C O U N T I N G

 NB=Group-cutoff nonbonded kernels    NxN=N-by-N cluster Verlet kernels
 RF=Reaction-Field  VdW=Van der Waals  QSTab=quadratic-spline table
 W3=SPC/TIP3p  W4=TIP4p (single or pairs)
 V&F=Potential and force  V=Potential only  F=Force only

 Computing:                               M-Number         M-Flops  % Flops
-----------------------------------------------------------------------------
 Pair Search distance check            3465.668704       31191.018     0.1
 NxN Ewald Elec. + LJ [F]            415473.225344    27421232.873    94.0
 NxN Ewald Elec. + LJ [V&F]            4204.264896      449856.344     1.5
 1,4 nonbonded interactions             131.602632       11844.237     0.0
 Calc Weights                          1529.730594       55070.301     0.2
 Spread Q Bspline                     32634.252672       65268.505     0.2
 Gather F Bspline                     32634.252672      195805.516     0.7
 3D-FFT                              102116.765540      816934.124     2.8
 Solve PME                               79.966400        5117.850     0.0
 Shift-X                                 25.505198         153.031     0.0
 Angles                                  92.051841       15464.709     0.1
 Propers                                143.502870       32862.157     0.1
 Impropers                               11.050221        2298.446     0.0
 Virial                                  51.225243         922.054     0.0
 Stop-CM                                  5.119396          51.194     0.0
 P-Coupling                              50.990000         305.940     0.0
 Calc-Ekin                              102.000396        2754.011     0.0
 Lincs                                   50.203012        3012.181     0.0
 Lincs-Mat                             1104.666276        4418.665     0.0
 Constraint-V                           445.417816        3563.343     0.0
 Constraint-Vir                          39.527904         948.670     0.0
 Settle                                 115.006900       37147.229     0.1
 Virtual Site 3                         126.504600        4680.670     0.0
-----------------------------------------------------------------------------
 Total                                                29160903.068   100.0
-----------------------------------------------------------------------------


     R E A L   C Y C L E   A N D   T I M E   A C C O U N T I N G

On 1 MPI rank, each using 6 OpenMP threads

 Computing:          Num   Num      Call    Wall time         Giga-Cycles
                     Ranks Threads  Count      (s)         total sum    %
-----------------------------------------------------------------------------
 Vsite constr.          1    6      50001       0.960         13.824   0.7
 Neighbor search        1    6       2501       2.908         41.869   2.1
 Launch GPU ops.        1    6      50001       2.081         29.973   1.5
 Force                  1    6      50001       4.203         60.525   3.0
 PME mesh               1    6      50001      19.931        287.004  14.2
 Wait GPU local         1    6      50001       0.722         10.398   0.5
 NB X/F buffer ops.     1    6      97501       0.830         11.957   0.6
 Vsite spread           1    6      55002       1.189         17.116   0.8
 Write traj.            1    6          6       0.040          0.580   0.0
 Update                 1    6      50001       3.417         49.202   2.4
 Constraints            1    6      50001       4.544         65.428   3.2
 *Rest                                          99.392       1431.228  70.9*
-----------------------------------------------------------------------------
 Total                                        140.217       2019.104 100.0
-----------------------------------------------------------------------------
 Breakdown of PME mesh computation
-----------------------------------------------------------------------------
 PME spread/gather      1    6     100002      11.795        169.845   8.4
 PME 3D-FFT             1    6     100002       6.712         96.651   4.8
 PME solve Elec         1    6      50001       1.320         19.010   0.9
-----------------------------------------------------------------------------

 GPU timings
-----------------------------------------------------------------------------
 Computing:                         Count  Wall t (s)      ms/step       %
-----------------------------------------------------------------------------
 Pair list H2D                       2501       0.174        0.069     0.7
 X / q H2D                          50001       1.052        0.021     4.3
 Nonbonded F kernel                 47500      21.005        0.442    86.5
 Nonbonded F+prune k.                2000       0.920        0.460     3.8
 Nonbonded F+ene+prune k.             501       0.254        0.507     1.0
 F D2H                              50001       0.873        0.017     3.6
-----------------------------------------------------------------------------
 Total                                         24.277        0.486   100.0
-----------------------------------------------------------------------------

Force evaluation time GPU/CPU: 0.486 ms/0.483 ms = 1.006
For optimal performance this ratio should be close to 1!

               Core t (s)   Wall t (s)        (%)
       Time:      839.987      140.217      599.1
                 (ns/day)    (hour/ns)
Performance:      123.240        0.195
Finished mdrun on rank 0 Wed Jun  7 10:58:29 2017

