[gmx-users] Poor GPU Performance with GROMACS 5.1.4

Daniel Kozuch dkozuch at princeton.edu
Thu May 25 03:49:23 CEST 2017


Szilárd,

I think I must be misunderstanding your advice. If I disable domain
decomposition and turn on thread pinning as suggested by Mark, using:

gmx_gpu mdrun -deffnm my_tpr -dd 1 -pin on

then I get very poor performance and the following note:

NOTE: Affinity setting for 6/6 threads failed. This can cause performance
degradation.
      If you think your settings are correct, ask on the gmx-users list.
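
Possibly relevant: further down, the log also reports "Overriding thread
affinity set outside gmx_514_gpu", so something outside GROMACS (presumably
the batch system) is already setting an affinity mask. For what it's worth,
one can check what mask the job shell actually receives with plain Linux
tools, nothing GROMACS-specific, e.g.:

# print the affinity list of the current shell ($$ is its own PID)
taskset -cp $$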

I am running only one rank with 6 OpenMP threads (I do not want to use all
28 cores on the node because I hope to run several of these jobs per node in
the near future; my rough plan is sketched below).
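
For reference, the plan for packing several runs onto one node was something
like the lines below; the job names and offsets are only placeholders and I
have not tested this yet:

# two 6-thread runs sharing the node, each pinned to its own set of cores
# (job_a/job_b and the offset values are placeholders for my actual setup)
gmx_gpu mdrun -deffnm job_a -ntomp 6 -pin on -pinstride 1 -pinoffset 0 &
gmx_gpu mdrun -deffnm job_b -ntomp 6 -pin on -pinstride 1 -pinoffset 6 &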

Thanks for the help,
Dan


-----------------------------------------------------------------------------------------------------------------------------
Log File:

GROMACS version:    VERSION 5.1.4
Precision:          single
Memory model:       64 bit
MPI library:        MPI
OpenMP support:     enabled (GMX_OPENMP_MAX_THREADS = 32)
GPU support:        enabled
OpenCL support:     disabled
invsqrt routine:    gmx_software_invsqrt(x)
SIMD instructions:  AVX2_256
FFT library:        fftw-3.3.4-sse2-avx
RDTSCP usage:       enabled
C++11 compilation:  disabled
TNG support:        enabled
Tracing support:    disabled
Built on:           Mon May 22 18:29:21 EDT 2017
Built by:           dkozuch at tigergpu.princeton.edu [CMAKE]
Build OS/arch:      Linux 3.10.0-514.16.1.el7.x86_64 x86_64
Build CPU vendor:   GenuineIntel
Build CPU brand:    Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz
Build CPU family:   6   Model: 79   Stepping: 1
Build CPU features: aes apic avx avx2 clfsh cmov cx8 cx16 f16c fma htt
lahf_lm mmx msr nonstop_tsc pcid pclmuldq pdcm pdpe1gb popcnt pse rdrnd
rdtscp sse2 sse3 sse4.1 sse4.2 ssse3 tdt x2apic
C compiler:         /usr/bin/cc GNU 4.8.5
C compiler flags:    -march=core-avx2    -Wextra
-Wno-missing-field-initializers
-Wno-sign-compare -Wpointer-arith -Wall -Wno-unused -Wunused-value
-Wunused-parameter  -O3 -DNDEBUG -funroll-all-loops -fexcess-precision=fast
 -Wno-array-bounds
C++ compiler:       /usr/bin/c++ GNU 4.8.5
C++ compiler flags:  -march=core-avx2    -Wextra
-Wno-missing-field-initializers
-Wpointer-arith -Wall -Wno-unused-function  -O3 -DNDEBUG -funroll-all-loops
-fexcess-precision=fast  -Wno-array-bounds
Boost version:      1.53.0 (external)
CUDA compiler:      /usr/local/cuda-8.0/bin/nvcc nvcc: NVIDIA (R) Cuda
compiler driver;Copyright (c) 2005-2016 NVIDIA Corporation;Built on
Sun_Sep__4_22:14:01_CDT_2016;Cuda compilation tools, release 8.0, V8.0.44
CUDA compiler flags:-gencode;arch=compute_20,code=sm_20;
-gencode;arch=compute_30,code=sm_30;-gencode;arch=compute_35,code=sm_35;
-gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;
-gencode;arch=compute_52,code=sm_52;-gencode;arch=compute_60,code=sm_60;
-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_60,code=compute_60;
-gencode;arch=compute_61,code=compute_61;-use_fast_math;;;
-march=core-avx2;-Wextra;-Wno-missing-field-initializers;-Wpointer-arith;
-Wall;-Wno-unused-function;-O3;-DNDEBUG;-funroll-all-loops;
-fexcess-precision=fast;-Wno-array-bounds;
CUDA driver:        8.0
CUDA runtime:       8.0


Number of logical cores detected (28) does not match the number reported by
OpenMP (1).
Consider setting the launch configuration manually!

Running on 1 node with total 28 logical cores, 1 compatible GPU
Hardware detected on host tiger-i23g14 (the node of MPI rank 0):
  CPU info:
    Vendor: GenuineIntel
    Brand:  Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz
    Family:  6  model: 79  stepping:  1
    CPU features: aes apic avx avx2 clfsh cmov cx8 cx16 f16c fma htt
lahf_lm mmx msr nonstop_tsc pcid pclmuldq pdcm pdpe1gb popcnt pse rdrnd
rdtscp sse2 sse3 sse4.1 sse4.2 ssse3 tdt x2apic
    SIMD instructions most likely to fit this hardware: AVX2_256
    SIMD instructions selected at GROMACS compile time: AVX2_256
  GPU info:
    Number of GPUs detected: 1
    #0: NVIDIA Tesla P100-PCIE-16GB, compute cap.: 6.0, ECC: yes, stat: compatible

Using 1 MPI process
Using 6 OpenMP threads

1 compatible GPU is present, with ID 0
1 GPU auto-selected for this run.
Mapping of GPU ID to the 1 PP rank in this node: 0

NOTE: GROMACS was configured without NVML support hence it can not exploit
      application clocks of the detected Tesla P100-PCIE-16GB GPU to improve
      performance.
      Recompile with the NVML library (compatible with the driver used) or
      set application clocks manually.


Using GPU 8x8 non-bonded kernels

Removing pbc first time

Overriding thread affinity set outside gmx_514_gpu

Pinning threads with an auto-selected logical core stride of 4

NOTE: Affinity setting for 6/6 threads failed. This can cause performance
degradation.
      If you think your settings are correct, ask on the gmx-users list.

Initializing LINear Constraint Solver

 R E A L   C Y C L E   A N D   T I M E   A C C O U N T I N G

On 1 MPI rank, each using 6 OpenMP threads

 Computing:          Num   Num      Call    Wall time         Giga-Cycles
                     Ranks Threads  Count      (s)         total sum    %
-----------------------------------------------------------------------------
 Neighbor search        1    6        201       1.402         20.185   9.4
 Launch GPU ops.        1    6       5001       0.216          3.116   1.5
 Force                  1    6       5001       1.070         15.402   7.2
 PME mesh               1    6       5001       5.538         79.745  37.1
 Wait GPU local         1    6       5001       0.072          1.043   0.5
 NB X/F buffer ops.     1    6       9801       0.396          5.706   2.7
 Write traj.            1    6          2       0.022          0.310   0.1
 Update                 1    6       5001       1.683         24.232  11.3
 Constraints            1    6       5001       2.488         35.833  16.7
 Rest                                           2.031         29.247  13.6
-----------------------------------------------------------------------------
 Total                                         14.918        214.819 100.0
-----------------------------------------------------------------------------
 Breakdown of PME mesh computation
-----------------------------------------------------------------------------
 PME spread/gather      1    6      10002       4.782         68.865  32.1
 PME 3D-FFT             1    6      10002       0.654          9.411   4.4
 PME solve Elec         1    6       5001       0.024          0.352   0.2
-----------------------------------------------------------------------------

 GPU timings
-----------------------------------------------------------------------------
 Computing:                         Count  Wall t (s)      ms/step       %
-----------------------------------------------------------------------------
 Pair list H2D                        201       0.020        0.099     0.3
 X / q H2D                           5001       0.090        0.018     1.5
 Nonbonded F kernel                  4800       5.617        1.170    92.8
 Nonbonded F+prune k.                 150       0.186        1.240     3.1
 Nonbonded F+ene+prune k.              51       0.064        1.257     1.1
 F D2H                               5001       0.075        0.015     1.2
-----------------------------------------------------------------------------
 Total                                          6.052        1.210   100.0
-----------------------------------------------------------------------------

Force evaluation time GPU/CPU: 1.210 ms/1.321 ms = 0.916
For optimal performance this ratio should be close to 1!

               Core t (s)   Wall t (s)        (%)
       Time:       23.471       14.918      157.3
                 (ns/day)    (hour/ns)
Performance:       86.893        0.276
Finished mdrun on rank 0 Wed May 24 21:36:47 2017

