[gmx-users] Poor GPU Performance with GROMACS 5.1.4
Daniel Kozuch
dkozuch at princeton.edu
Wed May 24 21:09:12 CEST 2017
Hello,
I'm using GROMACS 5.1.4 on 8 CPU cores and 1 GPU for a system of ~8000 atoms
in a dodecahedron box, and I'm having trouble getting good performance out of
the GPU. Specifically, it appears that a significant amount of time is lost to
waits ("Wait + Comm. F" and "Wait GPU nonlocal"). I have pasted the relevant
parts of the log file below. I suspect that I have set up my ranks/threads
badly, but I am unsure where the issue is. I have tried changing the
environment variable OMP_NUM_THREADS from 1 to 2, per the note generated by
GROMACS, but this severely slows down the simulation, to the point where it
takes 10 minutes to get a few picoseconds.
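For concreteness, here is roughly how I am launching mdrun now and the kind of
variants I understand the note to be suggesting. The mpirun line is only a
sketch of what our scheduler actually submits, so treat the launcher syntax
and core counts as my assumptions:

  # current launch: 8 MPI ranks, 1 OpenMP thread per rank, all ranks sharing GPU 0
  export OMP_NUM_THREADS=1
  mpirun -np 8 gmx_514_gpu mdrun -deffnm 1ucs_npt -ntomp 1 -gpu_id 0

  # variants along the lines of the mdrun note (2+ OpenMP threads per rank),
  # keeping ranks x threads at 8 cores; so far only the OMP_NUM_THREADS change
  # described above has actually been run
  export OMP_NUM_THREADS=2
  mpirun -np 4 gmx_514_gpu mdrun -deffnm 1ucs_npt -ntomp 2 -gpu_id 0 -pin on
  mpirun -np 2 gmx_514_gpu mdrun -deffnm 1ucs_npt -ntomp 4 -gpu_id 0 -pin on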
I have tried browsing through the mailing lists, but I haven't found a
solution to this particular problem.
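(Separately, I noticed the NOTE further down in the log about NVML and
application clocks on the P100. If it is relevant, I was planning to try
setting the clocks by hand along these lines; the clock values would come
from whatever nvidia-smi reports for this card, so the bracketed values below
are placeholders, not real numbers:)

  # list the application clocks this GPU actually supports
  nvidia-smi -q -d SUPPORTED_CLOCKS

  # pin memory,graphics application clocks to one of the supported pairs
  # (<mem_clock> and <graphics_clock> are placeholders; typically needs root)
  sudo nvidia-smi -ac <mem_clock>,<graphics_clock>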
Any help is appreciated,
Dan
-----------------------------------------------------------------------------
GROMACS: gmx mdrun, VERSION 5.1.4
Executable: /home/dkozuch/programs/gromacs_514_gpu/bin/gmx_514_gpu
Data prefix: /home/dkozuch/programs/gromacs_514_gpu
Command line:
gmx_514_gpu mdrun -deffnm 1ucs_npt -ntomp 1
GROMACS version: VERSION 5.1.4
Precision: single
Memory model: 64 bit
MPI library: MPI
OpenMP support: enabled (GMX_OPENMP_MAX_THREADS = 32)
GPU support: enabled
OpenCL support: disabled
invsqrt routine: gmx_software_invsqrt(x)
SIMD instructions: AVX2_256
FFT library: fftw-3.3.4-sse2-avx
RDTSCP usage: enabled
C++11 compilation: disabled
TNG support: enabled
Tracing support: disabled
Built on: Mon May 22 18:29:21 EDT 2017
Built by: dkozuch at tigergpu.princeton.edu [CMAKE]
Build OS/arch: Linux 3.10.0-514.16.1.el7.x86_64 x86_64
Build CPU vendor: GenuineIntel
Build CPU brand: Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz
Build CPU family: 6 Model: 79 Stepping: 1
Build CPU features: aes apic avx avx2 clfsh cmov cx8 cx16 f16c fma htt
lahf_lm mmx msr nonstop_tsc pcid pclmuldq pdcm pdpe1gb popcnt pse rdrnd
rdtscp sse2 sse3 sse4.1 sse4.2 ssse3 tdt x2apic
C compiler: /usr/bin/cc GNU 4.8.5
C compiler flags: -march=core-avx2 -Wextra
-Wno-missing-field-initializers -Wno-sign-compare -Wpointer-arith -Wall
-Wno-unused -Wunused-value -Wunused-parameter -O3 -DNDEBUG
-funroll-all-loops -fexcess-precision=fast -Wno-array-bounds
C++ compiler: /usr/bin/c++ GNU 4.8.5
C++ compiler flags: -march=core-avx2 -Wextra
-Wno-missing-field-initializers -Wpointer-arith -Wall -Wno-unused-function
-O3 -DNDEBUG -funroll-all-loops -fexcess-precision=fast -Wno-array-bounds
Boost version: 1.53.0 (external)
CUDA compiler: /usr/local/cuda-8.0/bin/nvcc nvcc: NVIDIA (R) Cuda
compiler driver;Copyright (c) 2005-2016 NVIDIA Corporation;Built on
Sun_Sep__4_22:14:01_CDT_2016;Cuda compilation tools, release 8.0, V8.0.44
CUDA compiler flags:-gencode;arch=compute_20,code=sm_20;
    -gencode;arch=compute_30,code=sm_30;-gencode;arch=compute_35,code=sm_35;
    -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;
    -gencode;arch=compute_52,code=sm_52;-gencode;arch=compute_60,code=sm_60;
    -gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_60,code=compute_60;
    -gencode;arch=compute_61,code=compute_61;-use_fast_math;;;
    -march=core-avx2;-Wextra;-Wno-missing-field-initializers;-Wpointer-arith;
    -Wall;-Wno-unused-function;-O3;-DNDEBUG;-funroll-all-loops;
    -fexcess-precision=fast;-Wno-array-bounds;
CUDA driver: 8.0
CUDA runtime: 8.0
Number of logical cores detected (28) does not match the number reported by
OpenMP (1).
Consider setting the launch configuration manually!
Running on 1 node with total 28 logical cores, 1 compatible GPU
Hardware detected on host tiger-i23g10 (the node of MPI rank 0):
CPU info:
Vendor: GenuineIntel
Brand: Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz
Family: 6 model: 79 stepping: 1
CPU features: aes apic avx avx2 clfsh cmov cx8 cx16 f16c fma htt
lahf_lm mmx msr nonstop_tsc pcid pclmuldq pdcm pdpe1gb popcnt pse rdrnd
rdtscp sse2 sse3 sse4.1 sse4.2 ssse3 tdt x2apic
SIMD instructions most likely to fit this hardware: AVX2_256
SIMD instructions selected at GROMACS compile time: AVX2_256
GPU info:
Number of GPUs detected: 1
#0: NVIDIA Tesla P100-PCIE-16GB, compute cap.: 6.0, ECC: yes, stat:
compatible
For optimal performance with a GPU nstlist (now 10) should be larger.
The optimum depends on your CPU and GPU resources.
You might want to try several nstlist values.
Changing nstlist from 10 to 25, rlist from 1.014 to 1.098
Initializing Domain Decomposition on 8 ranks
Dynamic load balancing: auto
Will sort the charge groups at every domain (re)decomposition
Initial maximum inter charge-group distances:
two-body bonded interactions: 0.417 nm, LJ-14, atoms 514 517
multi-body bonded interactions: 0.417 nm, Proper Dih., atoms 514 517
Minimum cell size due to bonded interactions: 0.459 nm
Maximum distance for 5 constraints, at 120 deg. angles, all-trans: 0.820 nm
Estimated maximum distance required for P-LINCS: 0.820 nm
This distance will limit the DD cell size, you can override this with -rcon
Using 0 separate PME ranks, per user request
Scaling the initial minimum size with 1/0.8 (option -dds) = 1.25
Optimizing the DD grid for 8 cells with a minimum initial size of 1.025 nm
The maximum allowed number of cells is: X 3 Y 3 Z 3
Domain decomposition grid 2 x 2 x 2, separate PME ranks 0
PME domain decomposition: 2 x 4 x 1
Domain decomposition rank 0, coordinates 0 0 0
Using 8 MPI processes
Using 1 OpenMP thread per MPI process
On host [redacted] 1 compatible GPU is present, with ID 0
On host [redacted] 1 GPU auto-selected for this run.
Mapping of GPU ID to the 8 PP ranks in this node: 0,0,0,0,0,0,0,0
NOTE: Your choice of number of MPI ranks and amount of resources results in
using 1 OpenMP threads per rank, which is most likely inefficient. The
optimum is usually between 2 and 6 threads per rank.
Will do PME sum in reciprocal space for electrostatic interactions.
Will do ordinary reciprocal space Ewald sum.
Using a Gaussian width (1/beta) of 0.320163 nm for Ewald
Cut-off's: NS: 1.098 Coulomb: 1 LJ: 1
System total charge: -0.000
Generated table with 1049 data points for Ewald.
Tabscale = 500 points/nm
Generated table with 1049 data points for LJ6.
Tabscale = 500 points/nm
Generated table with 1049 data points for LJ12.
Tabscale = 500 points/nm
Generated table with 1049 data points for 1-4 COUL.
Tabscale = 500 points/nm
Generated table with 1049 data points for 1-4 LJ6.
Tabscale = 500 points/nm
Generated table with 1049 data points for 1-4 LJ12.
Tabscale = 500 points/nm
Potential shift: LJ r^-12: -1.000e+00 r^-6: -1.000e+00, Ewald -1.000e-05
Initialized non-bonded Ewald correction tables, spacing: 9.33e-04 size: 1073
NOTE: GROMACS was configured without NVML support hence it can not exploit
application clocks of the detected Tesla P100-PCIE-16GB GPU to
improve performance.
Recompile with the NVML library (compatible with the driver used) or
set application clocks manually.
Using GPU 8x8 non-bonded kernels
Removing pbc first time
Non-default thread affinity set probably by the OpenMP library,
disabling internal thread affinity
Linking all bonded interactions to atoms
The initial number of communication pulses is: X 1 Y 1 Z 1
The initial domain decomposition cell size is: X 1.82 nm Y 1.82 nm Z 1.58 nm
The maximum allowed distance for charge groups involved in interactions is:
non-bonded interactions 1.098 nm
(the following are initial values, they could change due to box deformation)
two-body bonded interactions (-rdd) 1.098 nm
multi-body bonded interactions (-rdd) 1.098 nm
atoms separated by up to 5 constraints (-rcon) 1.578 nm
When dynamic load balancing gets turned on, these settings will change to:
The maximum number of communication pulses is: X 1 Y 1 Z 1
The minimum size for domain decomposition cells is 1.098 nm
The requested allowed shrink of DD cells (option -dds) is: 0.80
The allowed shrink of domain decomposition cells is: X 0.60 Y 0.60 Z 0.70
The maximum allowed distance for charge groups involved in interactions is:
non-bonded interactions 1.098 nm
two-body bonded interactions (-rdd) 1.098 nm
multi-body bonded interactions (-rdd) 1.098 nm
atoms separated by up to 5 constraints (-rcon) 1.098 nm
Making 3D domain decomposition grid 2 x 2 x 2, home cell index 0 0 0
Center of mass motion removal mode is Linear
We have the following groups for center of mass motion removal:
0: rest
There are: 8081 Atoms
Atom distribution over 8 domains: av 1010 stddev 44 min 939 max 1056
NOTE: DLB will not turn on during the first phase of PME tuning
M E G A - F L O P S A C C O U N T I N G
NB=Group-cutoff nonbonded kernels NxN=N-by-N cluster Verlet kernels
RF=Reaction-Field VdW=Van der Waals QSTab=quadratic-spline table
W3=SPC/TIP3p W4=TIP4p (single or pairs)
V&F=Potential and force V=Potential only F=Force only
Computing: M-Number M-Flops % Flops
-----------------------------------------------------------------------------
Pair Search distance check 200315.445632 1802839.011 0.1
NxN Ewald Elec. + LJ [F] 35433155.735168 2338588278.521 93.7
NxN Ewald Elec. + LJ [V&F] 357920.995008 38297546.466 1.5
1,4 nonbonded interactions 5205.001041 468450.094 0.0
Calc Weights 121215.024243 4363740.873 0.2
Spread Q Bspline 2585920.517184 5171841.034 0.2
Gather F Bspline 2585920.517184 15515523.103 0.6
3D-FFT 10217812.797720 81742502.382 3.3
Solve PME 31999.376000 2047960.064 0.1
Reset In Box 1616.208081 4848.624 0.0
CG-CoM 1616.216162 4848.648 0.0
Angles 4245.000849 713160.143 0.0
Propers 2345.000469 537005.107 0.0
Impropers 1235.000247 256880.051 0.0
Virial 4220.508441 75969.152 0.0
Stop-CM 404.066162 4040.662 0.0
P-Coupling 4040.500000 24243.000 0.0
Calc-Ekin 16162.016162 436374.436 0.0
Lincs 6775.647238 406538.834 0.0
Lincs-Mat 115022.349612 460089.398 0.0
Constraint-V 56090.458032 448723.664 0.0
Constraint-Vir 4931.491948 118355.807 0.0
Settle 14179.724888 4580051.139 0.2
-----------------------------------------------------------------------------
 Total                                       2496069810.214       100.0
-----------------------------------------------------------------------------
D O M A I N D E C O M P O S I T I O N S T A T I S T I C S
av. #atoms communicated per step for force: 2 x 23872.9
av. #atoms communicated per step for LINCS: 2 x 2065.3
Average load imbalance: 0.9 %
Part of the total run time spent waiting due to load imbalance: 0.3 %
Steps where the load balancing was limited by -rdd, -rcon and/or -dds: X 0 % Y 0 % Z 1 %
R E A L C Y C L E A N D T I M E A C C O U N T I N G
On 8 MPI ranks
Computing: Num Num Call Wall time Giga-Cycles
Ranks Threads Count (s) total sum %
-----------------------------------------------------------------------------
 Domain decomp.         8    1     200001     181.288      3480.731    2.1
 DD comm. load          8    1     198450      12.116       232.631    0.1
 DD comm. bounds        8    1     198401       1.401        26.898    0.0
 Neighbor search        8    1     200001     164.440      3157.257    1.9
 Launch GPU ops.        8    1   10000002     254.944      4894.923    3.0
 Comm. coord.           8    1    4800000     399.361      7667.747    4.7
 Force                  8    1    5000001     144.644      2777.170    1.7
 Wait + Comm. F         8    1    5000001    2355.957     45234.421  *27.7
 PME mesh               8    1    5000001    2226.183     42742.751   26.2
 Wait GPU nonlocal      8    1    5000001    1621.582     31134.402  *19.1
 Wait GPU local         8    1    5000001      18.061       346.780    0.2
 NB X/F buffer ops.     8    1   19600002     140.943      2706.099    1.7
 Write traj.            8    1       5009       0.569        10.930    0.0
 Update                 8    1    5000001     208.399      4001.266    2.5
 Constraints            8    1    5000001     658.189     12637.242    7.7
 Comm. energies         8    1    1000001      65.254      1252.872    0.8
 Rest                                           51.772       994.016    0.6
-----------------------------------------------------------------------------
 Total                                        8505.104    163298.138  100.0
-----------------------------------------------------------------------------
Breakdown of PME mesh computation
-----------------------------------------------------------------------------
PME redist. X/F 8 1 10000002 458.829 8809.522 5.4
PME spread/gather 8 1 10000002 795.109 15266.106 9.3
PME 3D-FFT 8 1 10000002 506.799 9730.551 6.0
PME 3D-FFT Comm. 8 1 20000004 355.387 6823.444 4.2
PME solve Elec 8 1 5000001 103.450 1986.247 1.2
-----------------------------------------------------------------------------
Core t (s) Wall t (s) (%)
Time: 68031.078 8505.104 799.9
2h21:45
(ns/day) (hour/ns)
Performance: 152.379 0.158
Finished mdrun on rank 0 Tue May 23 23:32:35 2017