[gmx-users] GMX GPU Rest Time
Daniel Kozuch
dkozuch at princeton.edu
Thu Jun 8 08:55:13 CEST 2017
Hello,
I recently changed the number of CPUs I was pairing with each GPU and
noticed a significant slowdown, more than I would have expected simply
from the reduction in the number of CPUs.
From the log file it appears that the GPU is resting for a large fraction
of the time. Is there something I can do about this?
I have attached parts of the log file. For reference, this is a REMD
simulation with 60 replicas on 360 CPUs and 60 GPUs. I have set the
environment variable OMP_NUM_THREADS to 6 in order to assign 6 CPUs to
each replica and avoid domain decomposition for my small system (as
recommended in an earlier correspondence).
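For context, the job is launched roughly as sketched below. This is only an
illustrative sketch: the SLURM directives, the use of srun, and the seq
shorthand are placeholders rather than my actual submission script; the
mdrun command itself is the one shown in the log further down.

#!/bin/bash
#SBATCH --nodes=15            # placeholder: enough nodes for 60 replicas
#SBATCH --ntasks=60           # one MPI rank per replica
#SBATCH --cpus-per-task=6     # 6 CPU cores reserved per replica
#SBATCH --gres=gpu:4          # placeholder: GPUs available per node

# 6 OpenMP threads per rank, so each replica stays on a single domain
# (no domain decomposition for the small system)
export OMP_NUM_THREADS=6

# srun (or mpirun -np 60, depending on the MPI stack) starts one rank per
# replica directory; $(seq 1 60) is shorthand for the explicit "1 2 ... 60"
srun gmx_514_gpu mdrun -v -deffnm 1msi_eq -multidir $(seq 1 60) -pin on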
Any help is appreciated,
Dan
-----------------------------------------------------------------------------------------------------------------------------
GROMACS: gmx mdrun, VERSION 5.1.4
Executable: /home/dkozuch/programs/gromacs_514_gpu/bin/gmx_514_gpu
Data prefix: /home/dkozuch/programs/gromacs_514_gpu
Command line:
gmx_514_gpu mdrun -v -deffnm 1msi_eq -multidir 1 2 3 4 5 6 7 8 9 10 11 12
13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37
38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 -pin on
GROMACS version: VERSION 5.1.4
Precision: single
Memory model: 64 bit
MPI library: MPI
OpenMP support: enabled (GMX_OPENMP_MAX_THREADS = 32)
GPU support: enabled
OpenCL support: disabled
invsqrt routine: gmx_software_invsqrt(x)
SIMD instructions: AVX2_256
FFT library: fftw-3.3.4-sse2-avx
RDTSCP usage: enabled
C++11 compilation: disabled
TNG support: enabled
Tracing support: disabled
Built on: Mon May 22 18:29:21 EDT 2017
Built by: dkozuch at tigergpu.princeton.edu [CMAKE]
Build OS/arch: Linux 3.10.0-514.16.1.el7.x86_64 x86_64
Build CPU vendor: GenuineIntel
Build CPU brand: Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz
Build CPU family: 6 Model: 79 Stepping: 1
Build CPU features: aes apic avx avx2 clfsh cmov cx8 cx16 f16c fma htt
lahf_lm mmx msr nonstop_tsc pcid pclmuldq pdcm pdpe1gb popcnt pse rdrnd
rdtscp sse2 sse3 sse4.1 sse4.2 ssse3 tdt x2apic
C compiler: /usr/bin/cc GNU 4.8.5
C compiler flags: -march=core-avx2 -Wextra
-Wno-missing-field-initializers -Wno-sign-compare -Wpointer-arith -Wall
-Wno-unused -Wunused-value -Wunused-parameter -O3 -DNDEBUG
-funroll-all-loops -fexcess-precision=fast -Wno-array-bounds
C++ compiler: /usr/bin/c++ GNU 4.8.5
C++ compiler flags: -march=core-avx2 -Wextra
-Wno-missing-field-initializers -Wpointer-arith -Wall -Wno-unused-function
-O3 -DNDEBUG -funroll-all-loops -fexcess-precision=fast -Wno-array-bounds
Boost version: 1.53.0 (external)
CUDA compiler: /usr/local/cuda-8.0/bin/nvcc nvcc: NVIDIA (R) Cuda
compiler driver;Copyright (c) 2005-2016 NVIDIA Corporation;Built on
Sun_Sep__4_22:14:01_CDT_2016;Cuda compilation tools, release 8.0, V8.0.44
CUDA compiler
flags:-gencode;arch=compute_20,code=sm_20;-gencode;arch=compute_30,code=sm_30;-gencode;arch=compute_35,code=sm_35;-gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_52,code=sm_52;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_60,code=compute_60;-gencode;arch=compute_61,code=compute_61;-use_fast_math;;
;-march=core-avx2;-Wextra;-Wno-missing-field-initializers;-Wpointer-arith;-Wall;-Wno-unused-function;-O3;-DNDEBUG;-funroll-all-loops;-fexcess-precision=fast;-Wno-array-bounds;
CUDA driver: 8.0
CUDA runtime: 8.0
Number of logical cores detected (28) does not match the number reported by
OpenMP (6).
Consider setting the launch configuration manually!
Running on 15 nodes with total 420 cores, 420 logical cores, 24 compatible
GPUs
Cores per node: 28
Logical cores per node: 28
Compatible GPUs per node: 0 - 4
Different nodes have different type(s) and/or order of GPUs
Hardware detected on host tiger-i20g2 (the node of MPI rank 4):
CPU info:
Vendor: GenuineIntel
Brand: Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz
Family: 6 model: 79 stepping: 1
CPU features: aes apic avx avx2 clfsh cmov cx8 cx16 f16c fma htt
lahf_lm mmx msr nonstop_tsc pcid pclmuldq pdcm pdpe1gb popcnt pse rdrnd
rdtscp sse2 sse3 sse4.1 sse4.2 ssse3 tdt x2apic
SIMD instructions most likely to fit this hardware: AVX2_256
SIMD instructions selected at GROMACS compile time: AVX2_256
GPU info:
Number of GPUs detected: 4
#0: NVIDIA Tesla P100-PCIE-16GB, compute cap.: 6.0, ECC: yes, stat:
compatible
#1: NVIDIA Tesla P100-PCIE-16GB, compute cap.: 6.0, ECC: yes, stat:
compatible
#2: NVIDIA Tesla P100-PCIE-16GB, compute cap.: 6.0, ECC: yes, stat:
compatible
#3: NVIDIA Tesla P100-PCIE-16GB, compute cap.: 6.0, ECC: yes, stat:
compatible
This is simulation 4 out of 60 running as a composite GROMACS
multi-simulation job. Setup for this simulation:
Using 1 MPI process
Using 6 OpenMP threads
4 compatible GPUs are present, with IDs 0,1,2,3
4 GPUs auto-selected for this run.
Mapping of GPU IDs to the 4 PP ranks in this node: 0,1,2,3
Will do PME sum in reciprocal space for electrostatic interactions.
++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
U. Essmann, L. Perera, M. L. Berkowitz, T. Darden, H. Lee and L. G.
Pedersen
A smooth particle mesh Ewald method
J. Chem. Phys. 103 (1995) pp. 8577-8592
-------- -------- --- Thank You --- -------- --------
Will do ordinary reciprocal space Ewald sum.
Using a Gaussian width (1/beta) of 0.288146 nm for Ewald
Cut-off's: NS: 1.003 Coulomb: 0.9 LJ: 0.9
System total charge: -0.000
Generated table with 1001 data points for Ewald.
Tabscale = 500 points/nm
Generated table with 1001 data points for LJ6.
Tabscale = 500 points/nm
Generated table with 1001 data points for LJ12.
Tabscale = 500 points/nm
Generated table with 1001 data points for 1-4 COUL.
Tabscale = 500 points/nm
Generated table with 1001 data points for 1-4 LJ6.
Tabscale = 500 points/nm
Generated table with 1001 data points for 1-4 LJ12.
Tabscale = 500 points/nm
Potential shift: LJ r^-12: -3.541e+00 r^-6: -1.882e+00, Ewald -1.000e-05
Initialized non-bonded Ewald correction tables, spacing: 8.85e-04 size: 1018
NOTE: GROMACS was configured without NVML support hence it can not exploit
application clocks of the detected Tesla P100-PCIE-16GB GPU to
improve performance.
Recompile with the NVML library (compatible with the driver used) or
set application clocks manually.
Using GPU 8x8 non-bonded kernels
Removing pbc first time
Overriding thread affinity set outside gmx_514_gpu
Pinning threads with an auto-selected logical core stride of 1
Initializing LINear Constraint Solver
Center of mass motion removal mode is Linear
We have the following groups for center of mass motion removal:
0: rest
There are: 7898 Atoms
There are: 2300 VSites
M E G A - F L O P S A C C O U N T I N G
NB=Group-cutoff nonbonded kernels NxN=N-by-N cluster Verlet kernels
RF=Reaction-Field VdW=Van der Waals QSTab=quadratic-spline table
W3=SPC/TIP3p W4=TIP4p (single or pairs)
V&F=Potential and force V=Potential only F=Force only
Computing: M-Number M-Flops % Flops
-----------------------------------------------------------------------------
Pair Search distance check 3465.668704 31191.018 0.1
NxN Ewald Elec. + LJ [F] 415473.225344 27421232.873 94.0
NxN Ewald Elec. + LJ [V&F] 4204.264896 449856.344 1.5
1,4 nonbonded interactions 131.602632 11844.237 0.0
Calc Weights 1529.730594 55070.301 0.2
Spread Q Bspline 32634.252672 65268.505 0.2
Gather F Bspline 32634.252672 195805.516 0.7
3D-FFT 102116.765540 816934.124 2.8
Solve PME 79.966400 5117.850 0.0
Shift-X 25.505198 153.031 0.0
Angles 92.051841 15464.709 0.1
Propers 143.502870 32862.157 0.1
Impropers 11.050221 2298.446 0.0
Virial 51.225243 922.054 0.0
Stop-CM 5.119396 51.194 0.0
P-Coupling 50.990000 305.940 0.0
Calc-Ekin 102.000396 2754.011 0.0
Lincs 50.203012 3012.181 0.0
Lincs-Mat 1104.666276 4418.665 0.0
Constraint-V 445.417816 3563.343 0.0
Constraint-Vir 39.527904 948.670 0.0
Settle 115.006900 37147.229 0.1
Virtual Site 3 126.504600 4680.670 0.0
-----------------------------------------------------------------------------
Total 29160903.068 100.0
-----------------------------------------------------------------------------
 R E A L   C Y C L E   A N D   T I M E   A C C O U N T I N G

 On 1 MPI rank, each using 6 OpenMP threads

 Computing:             Num     Num     Call    Wall time     Giga-Cycles
                        Ranks  Threads  Count      (s)       total sum    %
-----------------------------------------------------------------------------
 Vsite constr.            1       6     50001       0.960        13.824   0.7
 Neighbor search          1       6      2501       2.908        41.869   2.1
 Launch GPU ops.          1       6     50001       2.081        29.973   1.5
 Force                    1       6     50001       4.203        60.525   3.0
 PME mesh                 1       6     50001      19.931       287.004  14.2
 Wait GPU local           1       6     50001       0.722        10.398   0.5
 NB X/F buffer ops.       1       6     97501       0.830        11.957   0.6
 Vsite spread             1       6     55002       1.189        17.116   0.8
 Write traj.              1       6         6       0.040         0.580   0.0
 Update                   1       6     50001       3.417        49.202   2.4
 Constraints              1       6     50001       4.544        65.428   3.2
 *Rest                                             99.392      1431.228  70.9*
-----------------------------------------------------------------------------
 Total                                            140.217      2019.104 100.0
-----------------------------------------------------------------------------

 Breakdown of PME mesh computation
-----------------------------------------------------------------------------
 PME spread/gather        1       6    100002      11.795       169.845   8.4
 PME 3D-FFT               1       6    100002       6.712        96.651   4.8
 PME solve Elec           1       6     50001       1.320        19.010   0.9
-----------------------------------------------------------------------------
 GPU timings
-----------------------------------------------------------------------------
 Computing:                       Count  Wall t (s)    ms/step       %
-----------------------------------------------------------------------------
 Pair list H2D                     2501       0.174      0.069     0.7
 X / q H2D                        50001       1.052      0.021     4.3
 Nonbonded F kernel               47500      21.005      0.442    86.5
 Nonbonded F+prune k.              2000       0.920      0.460     3.8
 Nonbonded F+ene+prune k.           501       0.254      0.507     1.0
 F D2H                            50001       0.873      0.017     3.6
-----------------------------------------------------------------------------
 Total                                       24.277      0.486   100.0
-----------------------------------------------------------------------------
Force evaluation time GPU/CPU: 0.486 ms/0.483 ms = 1.006
For optimal performance this ratio should be close to 1!
               Core t (s)   Wall t (s)        (%)
       Time:      839.987      140.217      599.1
                 (ns/day)    (hour/ns)
Performance:      123.240        0.195
Finished mdrun on rank 0 Wed Jun 7 10:58:29 2017