[gmx-users] Poor GPU Performance with GROMACS 5.1.4
Daniel Kozuch
dkozuch at princeton.edu
Thu May 25 04:48:19 CEST 2017
I apologize for the confusion, but I found my error: I was failing to
request the appropriate number of cpus-per-task, so the scheduler was having
trouble assigning the threads. Speed is now ~400 ns/day with a 3 fs
timestep, which seems reasonable.
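For anyone who hits the same problem, the relevant part of my Slurm script
now looks roughly like this (exact directives will vary by cluster; the
input name is the same placeholder as in my earlier command):

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=6    # this was the missing request
#SBATCH --gres=gpu:1

# one rank with six pinned OpenMP threads
gmx_gpu mdrun -deffnm my_tpr -ntomp 6 -pin on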
Thanks for all the help,
Dan
On Wed, May 24, 2017 at 9:48 PM, Daniel Kozuch <dkozuch at princeton.edu>
wrote:
> Szilárd,
>
> I think I must be misunderstanding your advice. If I disable domain
> decomposition and turn pinning on, as suggested by Mark, using:
>
> gmx_gpu mdrun -deffnm my_tpr -dd 1 -pin on
>
> then I get very poor performance and the following error:
>
> NOTE: Affinity setting for 6/6 threads failed. This can cause performance
> degradation.
> If you think your settings are correct, ask on the gmx-users list.
>
> I am running only one rank with 6 threads (I do not want to use all 28
> available cores on the node, because I hope to run several of these jobs
> per node in the near future).
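>
> For the multiple-jobs case, my rough plan (untested so far) is to keep each
> run on its own set of cores with -pinoffset and -pinstride, along these
> lines, with job1/job2 as placeholder names:
>
> gmx_gpu mdrun -deffnm job1 -ntomp 6 -pin on -pinstride 1 -pinoffset 0 &
> gmx_gpu mdrun -deffnm job2 -ntomp 6 -pin on -pinstride 1 -pinoffset 6 &
>
> Both runs would have to share the single GPU on this node, so I don't
> expect perfect scaling.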
>
> Thanks for the help,
> Dan
>
>
> ---------------------------------------------------------------------------
> Log File:
>
> GROMACS version: VERSION 5.1.4
> Precision: single
> Memory model: 64 bit
> MPI library: MPI
> OpenMP support: enabled (GMX_OPENMP_MAX_THREADS = 32)
> GPU support: enabled
> OpenCL support: disabled
> invsqrt routine: gmx_software_invsqrt(x)
> SIMD instructions: AVX2_256
> FFT library: fftw-3.3.4-sse2-avx
> RDTSCP usage: enabled
> C++11 compilation: disabled
> TNG support: enabled
> Tracing support: disabled
> Built on: Mon May 22 18:29:21 EDT 2017
> Built by: dkozuch at tigergpu.princeton.edu [CMAKE]
> Build OS/arch: Linux 3.10.0-514.16.1.el7.x86_64 x86_64
> Build CPU vendor: GenuineIntel
> Build CPU brand: Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz
> Build CPU family: 6 Model: 79 Stepping: 1
> Build CPU features: aes apic avx avx2 clfsh cmov cx8 cx16 f16c fma htt
> lahf_lm mmx msr nonstop_tsc pcid pclmuldq pdcm pdpe1gb popcnt pse rdrnd
> rdtscp sse2 sse3 sse4.1 sse4.2 ssse3 tdt x2apic
> C compiler: /usr/bin/cc GNU 4.8.5
> C compiler flags: -march=core-avx2 -Wextra
> -Wno-missing-field-initializers -Wno-sign-compare -Wpointer-arith -Wall
> -Wno-unused -Wunused-value -Wunused-parameter -O3 -DNDEBUG
> -funroll-all-loops -fexcess-precision=fast -Wno-array-bounds
> C++ compiler: /usr/bin/c++ GNU 4.8.5
> C++ compiler flags: -march=core-avx2 -Wextra
> -Wno-missing-field-initializers -Wpointer-arith -Wall
> -Wno-unused-function -O3 -DNDEBUG -funroll-all-loops
> -fexcess-precision=fast -Wno-array-bounds
> Boost version: 1.53.0 (external)
> CUDA compiler: /usr/local/cuda-8.0/bin/nvcc nvcc: NVIDIA (R) Cuda
> compiler driver;Copyright (c) 2005-2016 NVIDIA Corporation;Built on
> Sun_Sep__4_22:14:01_CDT_2016;Cuda compilation tools, release 8.0, V8.0.44
> CUDA compiler flags: -gencode;arch=compute_20,code=sm_20;
> -gencode;arch=compute_30,code=sm_30; -gencode;arch=compute_35,code=sm_35;
> -gencode;arch=compute_37,code=sm_37; -gencode;arch=compute_50,code=sm_50;
> -gencode;arch=compute_52,code=sm_52; -gencode;arch=compute_60,code=sm_60;
> -gencode;arch=compute_61,code=sm_61; -gencode;arch=compute_60,code=compute_60;
> -gencode;arch=compute_61,code=compute_61; -use_fast_math;
> -march=core-avx2; -Wextra; -Wno-missing-field-initializers; -Wpointer-arith;
> -Wall; -Wno-unused-function; -O3; -DNDEBUG; -funroll-all-loops;
> -fexcess-precision=fast; -Wno-array-bounds;
> CUDA driver: 8.0
> CUDA runtime: 8.0
>
>
> Number of logical cores detected (28) does not match the number reported
> by OpenMP (1).
> Consider setting the launch configuration manually!
>
> Running on 1 node with total 28 logical cores, 1 compatible GPU
> Hardware detected on host tiger-i23g14 (the node of MPI rank 0):
> CPU info:
> Vendor: GenuineIntel
> Brand: Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz
> Family: 6 model: 79 stepping: 1
> CPU features: aes apic avx avx2 clfsh cmov cx8 cx16 f16c fma htt
> lahf_lm mmx msr nonstop_tsc pcid pclmuldq pdcm pdpe1gb popcnt pse rdrnd
> rdtscp sse2 sse3 sse4.1 sse4.2 ssse3 tdt x2apic
> SIMD instructions most likely to fit this hardware: AVX2_256
> SIMD instructions selected at GROMACS compile time: AVX2_256
> GPU info:
> Number of GPUs detected: 1
> #0: NVIDIA Tesla P100-PCIE-16GB, compute cap.: 6.0, ECC: yes, stat:
> compatible
>
> Using 1 MPI process
> Using 6 OpenMP threads
>
> 1 compatible GPU is present, with ID 0
> 1 GPU auto-selected for this run.
> Mapping of GPU ID to the 1 PP rank in this node: 0
>
> NOTE: GROMACS was configured without NVML support hence it can not exploit
> application clocks of the detected Tesla P100-PCIE-16GB GPU to
> improve performance.
> Recompile with the NVML library (compatible with the driver used) or
> set application clocks manually.
>
>
> Using GPU 8x8 non-bonded kernels
>
> Removing pbc first time
>
> Overriding thread affinity set outside gmx_514_gpu
>
> Pinning threads with an auto-selected logical core stride of 4
>
> NOTE: Affinity setting for 6/6 threads failed. This can cause performance
> degradation.
> If you think your settings are correct, ask on the gmx-users list.
>
> Initializing LINear Constraint Solver
>
> R E A L   C Y C L E   A N D   T I M E   A C C O U N T I N G
>
> On 1 MPI rank, each using 6 OpenMP threads
>
> Computing:             Num    Num     Call     Wall time   Giga-Cycles
>                        Ranks  Threads Count    (s)         total sum      %
> ----------------------------------------------------------------------------
> Neighbor search        1      6          201       1.402        20.185   9.4
> Launch GPU ops.        1      6         5001       0.216         3.116   1.5
> Force                  1      6         5001       1.070        15.402   7.2
> PME mesh               1      6         5001       5.538        79.745  37.1
> Wait GPU local         1      6         5001       0.072         1.043   0.5
> NB X/F buffer ops.     1      6         9801       0.396         5.706   2.7
> Write traj.            1      6            2       0.022         0.310   0.1
> Update                 1      6         5001       1.683        24.232  11.3
> Constraints            1      6         5001       2.488        35.833  16.7
> Rest                                                2.031        29.247  13.6
> ----------------------------------------------------------------------------
> Total                                              14.918       214.819 100.0
> ----------------------------------------------------------------------------
> Breakdown of PME mesh computation
> ----------------------------------------------------------------------------
> PME spread/gather      1      6        10002       4.782        68.865  32.1
> PME 3D-FFT             1      6        10002       0.654         9.411   4.4
> PME solve Elec         1      6         5001       0.024         0.352   0.2
> ----------------------------------------------------------------------------
>
> GPU timings
> ----------------------------------------------------------------------------
> Computing:                    Count   Wall t (s)   ms/step       %
> ----------------------------------------------------------------------------
> Pair list H2D                   201        0.020     0.099     0.3
> X / q H2D                      5001        0.090     0.018     1.5
> Nonbonded F kernel             4800        5.617     1.170    92.8
> Nonbonded F+prune k.            150        0.186     1.240     3.1
> Nonbonded F+ene+prune k.         51        0.064     1.257     1.1
> F D2H                          5001        0.075     0.015     1.2
> ----------------------------------------------------------------------------
> Total                                      6.052     1.210   100.0
> ----------------------------------------------------------------------------
>
> Force evaluation time GPU/CPU: 1.210 ms/1.321 ms = 0.916
> For optimal performance this ratio should be close to 1!
>
>                Core t (s)   Wall t (s)      (%)
>        Time:       23.471       14.918    157.3
>                  (ns/day)    (hour/ns)
> Performance:       86.893        0.276
> Finished mdrun on rank 0 Wed May 24 21:36:47 2017
>