[gmx-users] Poor GPU Performance with GROMACS 5.1.4

Daniel Kozuch dkozuch at princeton.edu
Wed May 24 21:09:12 CEST 2017


Hello,

I'm running GROMACS 5.1.4 on 8 CPU cores and 1 GPU for a system of ~8000 atoms in
a dodecahedron box, and I'm having trouble getting good performance out of
the GPU. Specifically, it appears that a significant fraction of the run time is
lost to wait times ("Wait + Comm. F" and "Wait GPU nonlocal"). I have pasted the
relevant parts of the log file below. I suspect that I have set up my
ranks/threads badly, but I am unsure where the issue is. I have tried
changing the environment variable OMP_NUM_THREADS from 1 to 2, per the note
generated by GROMACS, but this severely slows down the simulation, to the
point where it takes 10 minutes to simulate a few picoseconds.
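
For concreteness, these are the launch variations I was planning to try next.
This is just a sketch based on how I read the mdrun options for an MPI build;
the mpirun syntax and the particular -ntomp/-nstlist values are my own guesses,
not anything GROMACS recommended beyond the notes in the log:

  # current setup: 8 MPI ranks x 1 OpenMP thread each, all ranks sharing GPU 0
  mpirun -np 8 gmx_514_gpu mdrun -deffnm 1ucs_npt -ntomp 1 -gpu_id 0

  # fewer ranks with more OpenMP threads per rank, still 8 cores in total
  mpirun -np 4 gmx_514_gpu mdrun -deffnm 1ucs_npt -ntomp 2 -pin on -gpu_id 0
  mpirun -np 2 gmx_514_gpu mdrun -deffnm 1ucs_npt -ntomp 4 -pin on -gpu_id 0

  # larger neighbor-list interval, since the log suggests trying several nstlist values
  mpirun -np 4 gmx_514_gpu mdrun -deffnm 1ucs_npt -ntomp 2 -pin on -nstlist 40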

I have tried browsing through the mailing lists, but I haven't found a
solution to this particular problem.

Any help is appreciated,
Dan

-----------------------------------------------------------------------------


GROMACS:      gmx mdrun, VERSION 5.1.4
Executable:   /home/dkozuch/programs/gromacs_514_gpu/bin/gmx_514_gpu
Data prefix:  /home/dkozuch/programs/gromacs_514_gpu
Command line:
  gmx_514_gpu mdrun -deffnm 1ucs_npt -ntomp 1

GROMACS version:    VERSION 5.1.4
Precision:          single
Memory model:       64 bit
MPI library:        MPI
OpenMP support:     enabled (GMX_OPENMP_MAX_THREADS = 32)
GPU support:        enabled
OpenCL support:     disabled
invsqrt routine:    gmx_software_invsqrt(x)
SIMD instructions:  AVX2_256
FFT library:        fftw-3.3.4-sse2-avx
RDTSCP usage:       enabled
C++11 compilation:  disabled
TNG support:        enabled
Tracing support:    disabled
Built on:           Mon May 22 18:29:21 EDT 2017
Built by:           dkozuch at tigergpu.princeton.edu [CMAKE]
Build OS/arch:      Linux 3.10.0-514.16.1.el7.x86_64 x86_64
Build CPU vendor:   GenuineIntel
Build CPU brand:    Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz
Build CPU family:   6   Model: 79   Stepping: 1
Build CPU features: aes apic avx avx2 clfsh cmov cx8 cx16 f16c fma htt
lahf_lm mmx msr nonstop_tsc pcid pclmuldq pdcm pdpe1gb popcnt pse rdrnd
rdtscp sse2 sse3 sse4.1 sse4.2 ssse3 tdt x2apic
C compiler:         /usr/bin/cc GNU 4.8.5
C compiler flags:    -march=core-avx2    -Wextra
-Wno-missing-field-initializers -Wno-sign-compare -Wpointer-arith -Wall
-Wno-unused -Wunused-value -Wunused-parameter  -O3 -DNDEBUG
-funroll-all-loops -fexcess-precision=fast  -Wno-array-bounds
C++ compiler:       /usr/bin/c++ GNU 4.8.5
C++ compiler flags:  -march=core-avx2    -Wextra
-Wno-missing-field-initializers -Wpointer-arith -Wall -Wno-unused-function
 -O3 -DNDEBUG -funroll-all-loops -fexcess-precision=fast  -Wno-array-bounds
Boost version:      1.53.0 (external)
CUDA compiler:      /usr/local/cuda-8.0/bin/nvcc nvcc: NVIDIA (R) Cuda
compiler driver;Copyright (c) 2005-2016 NVIDIA Corporation;Built on
Sun_Sep__4_22:14:01_CDT_2016;Cuda compilation tools, release 8.0, V8.0.44
CUDA compiler flags:-gencode;arch=compute_20,code=sm_20;-gencode;arch=compute_30,code=sm_30;
-gencode;arch=compute_35,code=sm_35;-gencode;arch=compute_37,code=sm_37;
-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_52,code=sm_52;
-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;
-gencode;arch=compute_60,code=compute_60;-gencode;arch=compute_61,code=compute_61;
-use_fast_math;;;-march=core-avx2;-Wextra;-Wno-missing-field-initializers;
-Wpointer-arith;-Wall;-Wno-unused-function;-O3;-DNDEBUG;-funroll-all-loops;
-fexcess-precision=fast;-Wno-array-bounds;
CUDA driver:        8.0
CUDA runtime:       8.0


Number of logical cores detected (28) does not match the number reported by
OpenMP (1).
Consider setting the launch configuration manually!

Running on 1 node with total 28 logical cores, 1 compatible GPU
Hardware detected on host tiger-i23g10 (the node of MPI rank 0):
  CPU info:
    Vendor: GenuineIntel
    Brand:  Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz
    Family:  6  model: 79  stepping:  1
    CPU features: aes apic avx avx2 clfsh cmov cx8 cx16 f16c fma htt
lahf_lm mmx msr nonstop_tsc pcid pclmuldq pdcm pdpe1gb popcnt pse rdrnd
rdtscp sse2 sse3 sse4.1 sse4.2 ssse3 tdt x2apic
    SIMD instructions most likely to fit this hardware: AVX2_256
    SIMD instructions selected at GROMACS compile time: AVX2_256
  GPU info:
    Number of GPUs detected: 1
    #0: NVIDIA Tesla P100-PCIE-16GB, compute cap.: 6.0, ECC: yes, stat:
compatible


For optimal performance with a GPU nstlist (now 10) should be larger.
The optimum depends on your CPU and GPU resources.
You might want to try several nstlist values.
Changing nstlist from 10 to 25, rlist from 1.014 to 1.098

Initializing Domain Decomposition on 8 ranks
Dynamic load balancing: auto
Will sort the charge groups at every domain (re)decomposition
Initial maximum inter charge-group distances:
    two-body bonded interactions: 0.417 nm, LJ-14, atoms 514 517
  multi-body bonded interactions: 0.417 nm, Proper Dih., atoms 514 517
Minimum cell size due to bonded interactions: 0.459 nm
Maximum distance for 5 constraints, at 120 deg. angles, all-trans: 0.820 nm
Estimated maximum distance required for P-LINCS: 0.820 nm
This distance will limit the DD cell size, you can override this with -rcon
Using 0 separate PME ranks, per user request
Scaling the initial minimum size with 1/0.8 (option -dds) = 1.25
Optimizing the DD grid for 8 cells with a minimum initial size of 1.025 nm
The maximum allowed number of cells is: X 3 Y 3 Z 3
Domain decomposition grid 2 x 2 x 2, separate PME ranks 0
PME domain decomposition: 2 x 4 x 1
Domain decomposition rank 0, coordinates 0 0 0

Using 8 MPI processes
Using 1 OpenMP thread per MPI process

On host [redacted] 1 compatible GPU is present, with ID 0
On host [redacted] 1 GPU auto-selected for this run.
Mapping of GPU ID to the 8 PP ranks in this node: 0,0,0,0,0,0,0,0


NOTE: Your choice of number of MPI ranks and amount of resources results in
using 1 OpenMP threads per rank, which is most likely inefficient. The
optimum is usually between 2 and 6 threads per rank.

Will do PME sum in reciprocal space for electrostatic interactions.

Will do ordinary reciprocal space Ewald sum.
Using a Gaussian width (1/beta) of 0.320163 nm for Ewald
Cut-off's:   NS: 1.098   Coulomb: 1   LJ: 1
System total charge: -0.000
Generated table with 1049 data points for Ewald.
Tabscale = 500 points/nm
Generated table with 1049 data points for LJ6.
Tabscale = 500 points/nm
Generated table with 1049 data points for LJ12.
Tabscale = 500 points/nm
Generated table with 1049 data points for 1-4 COUL.
Tabscale = 500 points/nm
Generated table with 1049 data points for 1-4 LJ6.
Tabscale = 500 points/nm
Generated table with 1049 data points for 1-4 LJ12.
Tabscale = 500 points/nm
Potential shift: LJ r^-12: -1.000e+00 r^-6: -1.000e+00, Ewald -1.000e-05
Initialized non-bonded Ewald correction tables, spacing: 9.33e-04 size: 1073


NOTE: GROMACS was configured without NVML support hence it can not exploit
      application clocks of the detected Tesla P100-PCIE-16GB GPU to
improve performance.
      Recompile with the NVML library (compatible with the driver used) or
set application clocks manually.


Using GPU 8x8 non-bonded kernels

Removing pbc first time

Non-default thread affinity set probably by the OpenMP library,
disabling internal thread affinity

Linking all bonded interactions to atoms

The initial number of communication pulses is: X 1 Y 1 Z 1
The initial domain decomposition cell size is: X 1.82 nm Y 1.82 nm Z 1.58 nm

The maximum allowed distance for charge groups involved in interactions is:
                 non-bonded interactions           1.098 nm
(the following are initial values, they could change due to box deformation)
            two-body bonded interactions  (-rdd)   1.098 nm
          multi-body bonded interactions  (-rdd)   1.098 nm
  atoms separated by up to 5 constraints  (-rcon)  1.578 nm

When dynamic load balancing gets turned on, these settings will change to:
The maximum number of communication pulses is: X 1 Y 1 Z 1
The minimum size for domain decomposition cells is 1.098 nm
The requested allowed shrink of DD cells (option -dds) is: 0.80
The allowed shrink of domain decomposition cells is: X 0.60 Y 0.60 Z 0.70
The maximum allowed distance for charge groups involved in interactions is:
                 non-bonded interactions           1.098 nm
            two-body bonded interactions  (-rdd)   1.098 nm
          multi-body bonded interactions  (-rdd)   1.098 nm
  atoms separated by up to 5 constraints  (-rcon)  1.098 nm


Making 3D domain decomposition grid 2 x 2 x 2, home cell index 0 0 0

Center of mass motion removal mode is Linear
We have the following groups for center of mass motion removal:
  0:  rest
There are: 8081 Atoms
Atom distribution over 8 domains: av 1010 stddev 44 min 939 max 1056

NOTE: DLB will not turn on during the first phase of PME tuning

M E G A - F L O P S   A C C O U N T I N G

 NB=Group-cutoff nonbonded kernels    NxN=N-by-N cluster Verlet kernels
 RF=Reaction-Field  VdW=Van der Waals  QSTab=quadratic-spline table
 W3=SPC/TIP3p  W4=TIP4p (single or pairs)
 V&F=Potential and force  V=Potential only  F=Force only

 Computing:                               M-Number         M-Flops  % Flops
-----------------------------------------------------------------------------
 Pair Search distance check          200315.445632     1802839.011     0.1
 NxN Ewald Elec. + LJ [F]          35433155.735168  2338588278.521    93.7
 NxN Ewald Elec. + LJ [V&F]          357920.995008    38297546.466     1.5
 1,4 nonbonded interactions            5205.001041      468450.094     0.0
 Calc Weights                        121215.024243     4363740.873     0.2
 Spread Q Bspline                   2585920.517184     5171841.034     0.2
 Gather F Bspline                   2585920.517184    15515523.103     0.6
 3D-FFT                            10217812.797720    81742502.382     3.3
 Solve PME                            31999.376000     2047960.064     0.1
 Reset In Box                          1616.208081        4848.624     0.0
 CG-CoM                                1616.216162        4848.648     0.0
 Angles                                4245.000849      713160.143     0.0
 Propers                               2345.000469      537005.107     0.0
 Impropers                             1235.000247      256880.051     0.0
 Virial                                4220.508441       75969.152     0.0
 Stop-CM                                404.066162        4040.662     0.0
 P-Coupling                            4040.500000       24243.000     0.0
 Calc-Ekin                            16162.016162      436374.436     0.0
 Lincs                                 6775.647238      406538.834     0.0
 Lincs-Mat                           115022.349612      460089.398     0.0
 Constraint-V                         56090.458032      448723.664     0.0
 Constraint-Vir                        4931.491948      118355.807     0.0
 Settle                               14179.724888     4580051.139     0.2
-----------------------------------------------------------------------------
 Total                                              2496069810.214     100.0
-----------------------------------------------------------------------------


    D O M A I N   D E C O M P O S I T I O N   S T A T I S T I C S

 av. #atoms communicated per step for force:  2 x 23872.9
 av. #atoms communicated per step for LINCS:  2 x 2065.3

 Average load imbalance: 0.9 %
 Part of the total run time spent waiting due to load imbalance: 0.3 %
 Steps where the load balancing was limited by -rdd, -rcon and/or -dds: X 0 % Y 0 % Z 1 %




 R E A L   C Y C L E   A N D   T I M E   A C C O U N T I N G

On 8 MPI ranks

 Computing:          Num   Num      Call    Wall time         Giga-Cycles
                     Ranks Threads  Count      (s)         total sum    %
-----------------------------------------------------------------------------
 Domain decomp.          8    1     200001     181.288       3480.731   2.1
 DD comm. load           8    1     198450      12.116        232.631   0.1
 DD comm. bounds         8    1     198401       1.401         26.898   0.0
 Neighbor search         8    1     200001     164.440       3157.257   1.9
 Launch GPU ops.         8    1   10000002     254.944       4894.923   3.0
 Comm. coord.            8    1    4800000     399.361       7667.747   4.7
 Force                   8    1    5000001     144.644       2777.170   1.7
 Wait + Comm. F          8    1    5000001    2355.957      45234.421 *27.7
 PME mesh                8    1    5000001    2226.183      42742.751  26.2
 Wait GPU nonlocal       8    1    5000001    1621.582      31134.402 *19.1
 Wait GPU local          8    1    5000001      18.061        346.780   0.2
 NB X/F buffer ops.      8    1   19600002     140.943       2706.099   1.7
 Write traj.             8    1       5009       0.569         10.930   0.0
 Update                  8    1    5000001     208.399       4001.266   2.5
 Constraints             8    1    5000001     658.189      12637.242   7.7
 Comm. energies          8    1    1000001      65.254       1252.872   0.8
 Rest                                            51.772        994.016   0.6
-----------------------------------------------------------------------------
 Total                                         8505.104     163298.138 100.0
-----------------------------------------------------------------------------
 Breakdown of PME mesh computation
-----------------------------------------------------------------------------
 PME redist. X/F         8    1   10000002     458.829       8809.522   5.4
 PME spread/gather       8    1   10000002     795.109      15266.106   9.3
 PME 3D-FFT              8    1   10000002     506.799       9730.551   6.0
 PME 3D-FFT Comm.        8    1   20000004     355.387       6823.444   4.2
 PME solve Elec          8    1    5000001     103.450       1986.247   1.2
-----------------------------------------------------------------------------

               Core t (s)   Wall t (s)        (%)
       Time:    68031.078     8505.104      799.9
                         2h21:45
                 (ns/day)    (hour/ns)
Performance:      152.379        0.158
Finished mdrun on rank 0 Tue May 23 23:32:35 2017

