[gmx-users] Poor GPU Performance with GROMACS 5.1.4
Daniel Kozuch
dkozuch at princeton.edu
Thu May 25 04:48:19 CEST 2017
I apologize for the confusion, but I found my error: I was failing to
request the appropriate number of cpus-per-task, so the scheduler was having
trouble assigning the threads. Speed is now ~400 ns/day with a 3 fs
timestep, which seems reasonable.
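For anyone who hits the same problem, the relevant part of my Slurm script
now looks roughly like this (exact directives will vary by cluster; the
input name is the same placeholder as in my earlier command):

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=6    # this was the missing request
#SBATCH --gres=gpu:1

# one rank with six pinned OpenMP threads
gmx_gpu mdrun -deffnm my_tpr -ntomp 6 -pin on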
Thanks for all the help,
Dan
On Wed, May 24, 2017 at 9:48 PM, Daniel Kozuch <dkozuch at princeton.edu>
wrote:
> Szilárd,
>
> I think I must be misunderstanding your advice. If I disable domain
> decomposition and turn pinning on, as suggested by Mark, using:
>
> gmx_gpu mdrun -deffnm my_tpr -dd 1 -pin on
>
> then I get very poor performance and the following error:
>
> NOTE: Affinity setting for 6/6 threads failed. This can cause performance
> degradation.
> If you think your settings are correct, ask on the gmx-users list.
>
> I am running only one rank with 6 threads (I do not want to use all 28
> available cores on the node, because I hope to run several of these jobs
> per node in the near future).
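>
> For the multiple-jobs case, my rough plan (untested so far) is to keep each
> run on its own set of cores with -pinoffset and -pinstride, along these
> lines, with job1/job2 as placeholder names:
>
> gmx_gpu mdrun -deffnm job1 -ntomp 6 -pin on -pinstride 1 -pinoffset 0 &
> gmx_gpu mdrun -deffnm job2 -ntomp 6 -pin on -pinstride 1 -pinoffset 6 &
>
> Both runs would have to share the single GPU on this node, so I don't
> expect perfect scaling.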
>
> Thanks for the help,
> Dan
>
>
> ---------------------------------------------------------------------------
> Log File:
>
> GROMACS version: VERSION 5.1.4
> Precision: single
> Memory model: 64 bit
> MPI library: MPI
> OpenMP support: enabled (GMX_OPENMP_MAX_THREADS = 32)
> GPU support: enabled
> OpenCL support: disabled
> invsqrt routine: gmx_software_invsqrt(x)
> SIMD instructions: AVX2_256
> FFT library: fftw-3.3.4-sse2-avx
> RDTSCP usage: enabled
> C++11 compilation: disabled
> TNG support: enabled
> Tracing support: disabled
> Built on: Mon May 22 18:29:21 EDT 2017
> Built by: dkozuch at tigergpu.princeton.edu [CMAKE]
> Build OS/arch: Linux 3.10.0-514.16.1.el7.x86_64 x86_64
> Build CPU vendor: GenuineIntel
> Build CPU brand: Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz
> Build CPU family: 6 Model: 79 Stepping: 1
> Build CPU features: aes apic avx avx2 clfsh cmov cx8 cx16 f16c fma htt
> lahf_lm mmx msr nonstop_tsc pcid pclmuldq pdcm pdpe1gb popcnt pse rdrnd
> rdtscp sse2 sse3 sse4.1 sse4.2 ssse3 tdt x2apic
> C compiler: /usr/bin/cc GNU 4.8.5
> C compiler flags: -march=core-avx2 -Wextra
> -Wno-missing-field-initializers -Wno-sign-compare -Wpointer-arith -Wall
> -Wno-unused -Wunused-value -Wunused-parameter -O3 -DNDEBUG
> -funroll-all-loops -fexcess-precision=fast -Wno-array-bounds
> C++ compiler: /usr/bin/c++ GNU 4.8.5
> C++ compiler flags: -march=core-avx2 -Wextra
> -Wno-missing-field-initializers -Wpointer-arith -Wall
> -Wno-unused-function -O3 -DNDEBUG -funroll-all-loops
> -fexcess-precision=fast -Wno-array-bounds
> Boost version: 1.53.0 (external)
> CUDA compiler: /usr/local/cuda-8.0/bin/nvcc nvcc: NVIDIA (R) Cuda
> compiler driver;Copyright (c) 2005-2016 NVIDIA Corporation;Built on
> Sun_Sep__4_22:14:01_CDT_2016;Cuda compilation tools, release 8.0, V8.0.44
> CUDA compiler flags: -gencode;arch=compute_20,code=sm_20;
> -gencode;arch=compute_30,code=sm_30; -gencode;arch=compute_35,code=sm_35;
> -gencode;arch=compute_37,code=sm_37; -gencode;arch=compute_50,code=sm_50;
> -gencode;arch=compute_52,code=sm_52; -gencode;arch=compute_60,code=sm_60;
> -gencode;arch=compute_61,code=sm_61; -gencode;arch=compute_60,code=compute_60;
> -gencode;arch=compute_61,code=compute_61; -use_fast_math;
> -march=core-avx2; -Wextra; -Wno-missing-field-initializers; -Wpointer-arith;
> -Wall; -Wno-unused-function; -O3; -DNDEBUG; -funroll-all-loops;
> -fexcess-precision=fast; -Wno-array-bounds;
> CUDA driver: 8.0
> CUDA runtime: 8.0
>
>
> Number of logical cores detected (28) does not match the number reported
> by OpenMP (1).
> Consider setting the launch configuration manually!
>
> Running on 1 node with total 28 logical cores, 1 compatible GPU
> Hardware detected on host tiger-i23g14 (the node of MPI rank 0):
> CPU info:
> Vendor: GenuineIntel
> Brand: Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz
> Family: 6 model: 79 stepping: 1
> CPU features: aes apic avx avx2 clfsh cmov cx8 cx16 f16c fma htt
> lahf_lm mmx msr nonstop_tsc pcid pclmuldq pdcm pdpe1gb popcnt pse rdrnd
> rdtscp sse2 sse3 sse4.1 sse4.2 ssse3 tdt x2apic
> SIMD instructions most likely to fit this hardware: AVX2_256
> SIMD instructions selected at GROMACS compile time: AVX2_256
> GPU info:
> Number of GPUs detected: 1
> #0: NVIDIA Tesla P100-PCIE-16GB, compute cap.: 6.0, ECC: yes, stat:
> compatible
>
> Using 1 MPI process
> Using 6 OpenMP threads
>
> 1 compatible GPU is present, with ID 0
> 1 GPU auto-selected for this run.
> Mapping of GPU ID to the 1 PP rank in this node: 0
>
> NOTE: GROMACS was configured without NVML support hence it can not exploit
> application clocks of the detected Tesla P100-PCIE-16GB GPU to
> improve performance.
> Recompile with the NVML library (compatible with the driver used) or
> set application clocks manually.
>
>
> Using GPU 8x8 non-bonded kernels
>
> Removing pbc first time
>
> Overriding thread affinity set outside gmx_514_gpu
>
> Pinning threads with an auto-selected logical core stride of 4
>
> NOTE: Affinity setting for 6/6 threads failed. This can cause performance
> degradation.
> If you think your settings are correct, ask on the gmx-users list.
>
> Initializing LINear Constraint Solver
>
> R E A L   C Y C L E   A N D   T I M E   A C C O U N T I N G
>
> On 1 MPI rank, each using 6 OpenMP threads
>
> Computing:             Num    Num     Call     Wall time   Giga-Cycles
>                        Ranks  Threads Count    (s)         total sum      %
> ----------------------------------------------------------------------------
> Neighbor search        1      6          201       1.402        20.185   9.4
> Launch GPU ops.        1      6         5001       0.216         3.116   1.5
> Force                  1      6         5001       1.070        15.402   7.2
> PME mesh               1      6         5001       5.538        79.745  37.1
> Wait GPU local         1      6         5001       0.072         1.043   0.5
> NB X/F buffer ops.     1      6         9801       0.396         5.706   2.7
> Write traj.            1      6            2       0.022         0.310   0.1
> Update                 1      6         5001       1.683        24.232  11.3
> Constraints            1      6         5001       2.488        35.833  16.7
> Rest                                                2.031        29.247  13.6
> ----------------------------------------------------------------------------
> Total                                              14.918       214.819 100.0
> ----------------------------------------------------------------------------
> Breakdown of PME mesh computation
> ----------------------------------------------------------------------------
> PME spread/gather      1      6        10002       4.782        68.865  32.1
> PME 3D-FFT             1      6        10002       0.654         9.411   4.4
> PME solve Elec         1      6         5001       0.024         0.352   0.2
> ----------------------------------------------------------------------------
>
> GPU timings
> ----------------------------------------------------------------------------
> Computing:                    Count   Wall t (s)   ms/step       %
> ----------------------------------------------------------------------------
> Pair list H2D                   201        0.020     0.099     0.3
> X / q H2D                      5001        0.090     0.018     1.5
> Nonbonded F kernel             4800        5.617     1.170    92.8
> Nonbonded F+prune k.            150        0.186     1.240     3.1
> Nonbonded F+ene+prune k.         51        0.064     1.257     1.1
> F D2H                          5001        0.075     0.015     1.2
> ----------------------------------------------------------------------------
> Total                                      6.052     1.210   100.0
> ----------------------------------------------------------------------------
>
> Force evaluation time GPU/CPU: 1.210 ms/1.321 ms = 0.916
> For optimal performance this ratio should be close to 1!
>
>                Core t (s)   Wall t (s)      (%)
>        Time:       23.471       14.918    157.3
>                  (ns/day)    (hour/ns)
> Performance:       86.893        0.276
> Finished mdrun on rank 0 Wed May 24 21:36:47 2017
>