[gmx-users] Poor GPU Performance with GROMACS 5.1.4
Daniel Kozuch
dkozuch at princeton.edu
Thu May 25 03:49:23 CEST 2017
Szilárd,
I think I must be misunderstanding your advice. If I remove the domain
decomposition and turn pinning on, as Mark suggested, using:
gmx_gpu mdrun -deffnm my_tpr -dd 1 -pin on
then I get very poor performance and the following message:
NOTE: Affinity setting for 6/6 threads failed. This can cause performance
degradation.
If you think your settings are correct, ask on the gmx-users list.
I am running only one rank with 6 threads (I do not want to use all 28
cores available on the node because I hope to run several of these jobs
per node in the near future).
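A minimal sketch of the multi-job layout described above, assuming four concurrent 6-thread jobs on the 28-core node, hyperthreading off (so consecutive logical cores are distinct physical cores), and the mdrun options `-ntomp`, `-pinoffset` and `-pinstride` used to keep the jobs on disjoint cores. The helper `mdrun_commands` and the run names `job0`, `job1`, ... are hypothetical; the script only prints command lines and does not launch anything.

```python
def mdrun_commands(n_jobs, threads_per_job, total_cores, exe="gmx_gpu"):
    """Build one mdrun command line per job, each pinned to its own block of cores."""
    if n_jobs * threads_per_job > total_cores:
        raise ValueError("jobs would oversubscribe the node")
    commands = []
    for job in range(n_jobs):
        # First logical core for this job; blocks do not overlap.
        offset = job * threads_per_job
        commands.append(
            f"{exe} mdrun -deffnm job{job} -ntomp {threads_per_job} "
            f"-pin on -pinoffset {offset} -pinstride 1"
        )
    return commands


if __name__ == "__main__":
    for line in mdrun_commands(n_jobs=4, threads_per_job=6, total_cores=28):
        print(line)
```

Whether sharing the single GPU between several such jobs performs well is a separate question that this sketch does not address.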
Thanks for the help,
Dan
-----------------------------------------------------------------------------------------------------------------------------
Log File:
GROMACS version: VERSION 5.1.4
Precision: single
Memory model: 64 bit
MPI library: MPI
OpenMP support: enabled (GMX_OPENMP_MAX_THREADS = 32)
GPU support: enabled
OpenCL support: disabled
invsqrt routine: gmx_software_invsqrt(x)
SIMD instructions: AVX2_256
FFT library: fftw-3.3.4-sse2-avx
RDTSCP usage: enabled
C++11 compilation: disabled
TNG support: enabled
Tracing support: disabled
Built on: Mon May 22 18:29:21 EDT 2017
Built by: dkozuch at tigergpu.princeton.edu [CMAKE]
Build OS/arch: Linux 3.10.0-514.16.1.el7.x86_64 x86_64
Build CPU vendor: GenuineIntel
Build CPU brand: Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz
Build CPU family: 6 Model: 79 Stepping: 1
Build CPU features: aes apic avx avx2 clfsh cmov cx8 cx16 f16c fma htt
lahf_lm mmx msr nonstop_tsc pcid pclmuldq pdcm pdpe1gb popcnt pse rdrnd
rdtscp sse2 sse3 sse4.1 sse4.2 ssse3 tdt x2apic
C compiler: /usr/bin/cc GNU 4.8.5
C compiler flags: -march=core-avx2 -Wextra
-Wno-missing-field-initializers
-Wno-sign-compare -Wpointer-arith -Wall -Wno-unused -Wunused-value
-Wunused-parameter -O3 -DNDEBUG -funroll-all-loops -fexcess-precision=fast
-Wno-array-bounds
C++ compiler: /usr/bin/c++ GNU 4.8.5
C++ compiler flags: -march=core-avx2 -Wextra
-Wno-missing-field-initializers
-Wpointer-arith -Wall -Wno-unused-function -O3 -DNDEBUG -funroll-all-loops
-fexcess-precision=fast -Wno-array-bounds
Boost version: 1.53.0 (external)
CUDA compiler: /usr/local/cuda-8.0/bin/nvcc nvcc: NVIDIA (R) Cuda
compiler driver;Copyright (c) 2005-2016 NVIDIA Corporation;Built on
Sun_Sep__4_22:14:01_CDT_2016;Cuda compilation tools, release 8.0, V8.0.44
CUDA compiler flags:-gencode;arch=compute_20,code=sm_20;
-gencode;arch=compute_30,code=sm_30;-gencode;arch=compute_35,code=sm_35;
-gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;
-gencode;arch=compute_52,code=sm_52;-gencode;arch=compute_60,code=sm_60;
-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_60,code=compute_60;
-gencode;arch=compute_61,code=compute_61;-use_fast_math;;
;-march=core-avx2;-Wextra;-Wno-missing-field-initializers;-Wpointer-arith;
-Wall;-Wno-unused-function;-O3;-DNDEBUG;-funroll-all-loops;
-fexcess-precision=fast;-Wno-array-bounds;
CUDA driver: 8.0
CUDA runtime: 8.0
Number of logical cores detected (28) does not match the number reported by
OpenMP (1).
Consider setting the launch configuration manually!
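The mismatch reported above (28 logical cores detected, 1 reported by OpenMP) is what one might see if the process starts with a restricted CPU affinity mask, for example one set by a batch scheduler. That is an assumption here; the log does not confirm it. A small Linux-only sketch for checking which cores a process launched the same way is allowed to use (the helper `cpu_view` is hypothetical):

```python
import os


def cpu_view():
    """Return (logical cores in the machine, cores this process may run on)."""
    total = os.cpu_count()
    # sched_getaffinity is only available on some platforms (e.g. Linux).
    if hasattr(os, "sched_getaffinity"):
        allowed = sorted(os.sched_getaffinity(0))
    else:
        allowed = None
    return total, allowed


if __name__ == "__main__":
    total, allowed = cpu_view()
    print("logical cores in machine:", total)
    print("cores allowed for this process:", allowed)
```

If the second line lists fewer cores than the job is meant to use, the restriction comes from outside mdrun.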
Running on 1 node with total 28 logical cores, 1 compatible GPU
Hardware detected on host tiger-i23g14 (the node of MPI rank 0):
CPU info:
Vendor: GenuineIntel
Brand: Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz
Family: 6 model: 79 stepping: 1
CPU features: aes apic avx avx2 clfsh cmov cx8 cx16 f16c fma htt
lahf_lm mmx msr nonstop_tsc pcid pclmuldq pdcm pdpe1gb popcnt pse rdrnd
rdtscp sse2 sse3 sse4.1 sse4.2 ssse3 tdt x2apic
SIMD instructions most likely to fit this hardware: AVX2_256
SIMD instructions selected at GROMACS compile time: AVX2_256
GPU info:
Number of GPUs detected: 1
#0: NVIDIA Tesla P100-PCIE-16GB, compute cap.: 6.0, ECC: yes, stat:
compatible
Using 1 MPI process
Using 6 OpenMP threads
1 compatible GPU is present, with ID 0
1 GPU auto-selected for this run.
Mapping of GPU ID to the 1 PP rank in this node: 0
NOTE: GROMACS was configured without NVML support hence it can not exploit
application clocks of the detected Tesla P100-PCIE-16GB GPU to
improve performance.
Recompile with the NVML library (compatible with the driver used) or
set application clocks manually.
Using GPU 8x8 non-bonded kernels
Removing pbc first time
Overriding thread affinity set outside gmx_514_gpu
Pinning threads with an auto-selected logical core stride of 4
NOTE: Affinity setting for 6/6 threads failed. This can cause performance
degradation.
If you think your settings are correct, ask on the gmx-users list.
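The auto-selected stride of 4 is consistent with spreading 6 threads over 28 logical cores by integer division (28 // 6 = 4). That this is how the stride was chosen is an assumption, as is the pin offset of 0. Under those assumptions the threads would be requested on the following logical cores (the helper `pinned_cores` is hypothetical):

```python
def pinned_cores(n_threads, n_logical_cores, offset=0):
    """Return (stride, list of logical core indices) for evenly spread threads."""
    stride = (n_logical_cores - offset) // n_threads
    cores = [offset + i * stride for i in range(n_threads)]
    return stride, cores


stride, cores = pinned_cores(n_threads=6, n_logical_cores=28)
print(stride, cores)  # 4 [0, 4, 8, 12, 16, 20]
```

If the process is not allowed to run on those cores, the pinning requests could fail, which would be one possible reading of the "6/6 threads failed" note.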
Initializing LINear Constraint Solver
R E A L   C Y C L E   A N D   T I M E   A C C O U N T I N G
On 1 MPI rank, each using 6 OpenMP threads
Computing:              Num     Num     Call   Wall time         Giga-Cycles
                      Ranks Threads    Count         (s)    total sum      %
-----------------------------------------------------------------------------
Neighbor search           1       6      201       1.402       20.185    9.4
Launch GPU ops.           1       6     5001       0.216        3.116    1.5
Force                     1       6     5001       1.070       15.402    7.2
PME mesh                  1       6     5001       5.538       79.745   37.1
Wait GPU local            1       6     5001       0.072        1.043    0.5
NB X/F buffer ops.        1       6     9801       0.396        5.706    2.7
Write traj.               1       6        2       0.022        0.310    0.1
Update                    1       6     5001       1.683       24.232   11.3
Constraints               1       6     5001       2.488       35.833   16.7
Rest                                               2.031       29.247   13.6
-----------------------------------------------------------------------------
Total                                             14.918      214.819  100.0
-----------------------------------------------------------------------------
Breakdown of PME mesh computation
-----------------------------------------------------------------------------
PME spread/gather         1       6    10002       4.782       68.865   32.1
PME 3D-FFT                1       6    10002       0.654        9.411    4.4
PME solve Elec            1       6     5001       0.024        0.352    0.2
-----------------------------------------------------------------------------
GPU timings
-----------------------------------------------------------------------------
Computing:                   Count  Wall t (s)     ms/step       %
-----------------------------------------------------------------------------
Pair list H2D                  201       0.020       0.099     0.3
X / q H2D                     5001       0.090       0.018     1.5
Nonbonded F kernel            4800       5.617       1.170    92.8
Nonbonded F+prune k.           150       0.186       1.240     3.1
Nonbonded F+ene+prune k.        51       0.064       1.257     1.1
F D2H                         5001       0.075       0.015     1.2
-----------------------------------------------------------------------------
Total                                    6.052       1.210   100.0
-----------------------------------------------------------------------------
Force evaluation time GPU/CPU: 1.210 ms/1.321 ms = 0.916
For optimal performance this ratio should be close to 1!
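The GPU/CPU ratio above can be reproduced from the two timing tables. The sketch below assumes the CPU figure is the sum of the "Force" and "PME mesh" wall times divided by the 5001 calls; that reading matches the printed 1.321 ms but is not stated in the log itself.

```python
def force_time_ratio(gpu_total_s, cpu_force_s, cpu_pme_s, n_calls):
    """Return (GPU ms/step, CPU ms/step, GPU/CPU ratio) from wall times in seconds."""
    gpu_ms = gpu_total_s / n_calls * 1000.0
    cpu_ms = (cpu_force_s + cpu_pme_s) / n_calls * 1000.0
    return gpu_ms, cpu_ms, gpu_ms / cpu_ms


gpu_ms, cpu_ms, ratio = force_time_ratio(
    gpu_total_s=6.052,  # "Total" row of the GPU timings table
    cpu_force_s=1.070,  # "Force" row of the cycle accounting table
    cpu_pme_s=5.538,    # "PME mesh" row of the cycle accounting table
    n_calls=5001,
)
print(f"{gpu_ms:.3f} ms / {cpu_ms:.3f} ms = {ratio:.3f}")
```

On this reading, most of the CPU-side force time is PME mesh work, and within that the spread/gather step (4.782 s of 5.538 s).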
               Core t (s)   Wall t (s)        (%)
       Time:       23.471       14.918      157.3
                 (ns/day)    (hour/ns)
Performance:       86.893        0.276
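The summary figures above can be cross-checked with simple arithmetic. The sketch assumes that "Core t" is CPU time summed over the 6 threads, which the log does not spell out; under that assumption the last value is the average share of a core each thread received. The helper `summarize` is hypothetical.

```python
def summarize(ns_per_day, core_t, wall_t, n_threads):
    """Return (hours per ns, CPU time as % of wall time, average % per thread)."""
    hours_per_ns = 24.0 / ns_per_day
    cpu_percent = core_t / wall_t * 100.0
    per_thread_percent = cpu_percent / n_threads
    return hours_per_ns, cpu_percent, per_thread_percent


hours_per_ns, cpu_percent, per_thread = summarize(
    ns_per_day=86.893, core_t=23.471, wall_t=14.918, n_threads=6
)
print(f"{hours_per_ns:.3f} hour/ns")
print(f"{cpu_percent:.1f} % CPU time relative to wall time")
print(f"{per_thread:.1f} % of a core per thread on average")
```

Six threads that each had a core to themselves would give a figure near 600 %, so 157.3 % would fit with the threads not running on six separate cores.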
Finished mdrun on rank 0 Wed May 24 21:36:47 2017