[gmx-users] Help on MD performance, GPU has less load than CPU.
Davide Bonanni
davide.bonanni at unito.it
Mon Jul 10 17:01:40 CEST 2017
Hi,
I am working on a node with Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz, 16
physical core, 32 logical core and 1 GPU NVIDIA GeForce GTX 980 Ti.
I am running a series of 2 ns molecular dynamics simulations of a system
of 60000 atoms.
I have tried various combinations of settings, but the best performance I
obtained was with the command:
"gmx mdrun -deffnm md_LIG -cpt 1 -cpo restart1.cpt -pin on"
which uses 1 MPI thread, 32 OpenMP threads, and the GPU.
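For reference, the other kinds of launch configurations I tried were along
these lines (the exact flags and thread counts here are only illustrative,
not a record of every run); none of them performed better:

    gmx mdrun -deffnm md_LIG -ntmpi 2 -ntomp 16 -gpu_id 00 -pin on   # 2 thread-MPI ranks sharing GPU 0
    gmx mdrun -deffnm md_LIG -ntmpi 1 -ntomp 16 -pin on              # only the 16 physical cores
    gmx mdrun -deffnm md_LIG -nstlist 80 -pin on                     # longer pair-list interval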
At the end of the .log file of the production run I get this message:
"NOTE: The GPU has >25% less load than the CPU. This imbalance causes
performance loss."
I don't know how I can reduce the load on the CPU, or how I can increase
the load on the GPU, to fix this imbalance. Do you have any suggestions?
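(For context, the force-evaluation timing reported at the end of the log
below is GPU 2.239 ms/step vs. CPU 9.677 ms/step, a ratio of about 0.23, so
per step the CPU spends roughly four times as long on forces as the GPU.)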
Thank you in advance.
Cheers,
Davide Bonanni
The initial and final parts of the log file are below:
Log file opened on Sun Jul 9 04:02:44 2017
Host: bigblue pid: 16777 rank ID: 0 number of ranks: 1
:-) GROMACS - gmx mdrun, VERSION 5.1.4 (-:
GROMACS: gmx mdrun, VERSION 5.1.4
Executable: /usr/bin/gmx
Data prefix: /usr/local/gromacs
Command line:
gmx mdrun -deffnm md_fluo_7 -cpt 1 -cpo restart1.cpt -pin on
GROMACS version: VERSION 5.1.4
Precision: single
Memory model: 64 bit
MPI library: thread_mpi
OpenMP support: enabled (GMX_OPENMP_MAX_THREADS = 32)
GPU support: enabled
OpenCL support: disabled
invsqrt routine: gmx_software_invsqrt(x)
SIMD instructions: AVX2_256
FFT library: fftw-3.3.4-sse2-avx
RDTSCP usage: enabled
C++11 compilation: disabled
TNG support: enabled
Tracing support: disabled
Built on: Tue 8 Nov 12:26:14 CET 2016
Built by: root at bigblue [CMAKE]
Build OS/arch: Linux 3.10.0-327.el7.x86_64 x86_64
Build CPU vendor: GenuineIntel
Build CPU brand: Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz
Build CPU family: 6 Model: 63 Stepping: 2
Build CPU features: aes apic avx avx2 clfsh cmov cx8 cx16 f16c fma htt
lahf_lm mmx msr nonstop_tsc pcid pclmuldq pdcm pdpe1gb popcnt pse rdrnd
rdtscp sse2 sse3 sse4.1 sse4.2 ssse3 tdt x2apic
C compiler: /bin/cc GNU 4.8.5
C compiler flags: -march=core-avx2 -Wextra
-Wno-missing-field-initializers
-Wno-sign-compare -Wpointer-arith -Wall -Wno-unused -Wunused-value
-Wunused-parameter -O3 -DNDEBUG -funroll-all-loops -fexcess-precision=fast
-Wno-array-bounds
C++ compiler: /bin/c++ GNU 4.8.5
C++ compiler flags: -march=core-avx2 -Wextra
-Wno-missing-field-initializers
-Wpointer-arith -Wall -Wno-unused-function -O3 -DNDEBUG -funroll-all-loops
-fexcess-precision=fast -Wno-array-bounds
Boost version: 1.55.0 (internal)
CUDA compiler: /usr/local/cuda/bin/nvcc nvcc: NVIDIA (R) Cuda compiler
driver;Copyright (c) 2005-2016 NVIDIA Corporation;Built on
Sun_Sep__4_22:14:01_CDT_2016;Cuda compilation tools, release 8.0, V8.0.44
CUDA compiler flags: -gencode;arch=compute_20,code=sm_20;
-gencode;arch=compute_30,code=sm_30;-gencode;arch=compute_35,code=sm_35;
-gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;
-gencode;arch=compute_52,code=sm_52;-gencode;arch=compute_60,code=sm_60;
-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_60,code=compute_60;
-gencode;arch=compute_61,code=compute_61;-use_fast_math;
-march=core-avx2;-Wextra;-Wno-missing-field-initializers;-Wpointer-arith;
-Wall;-Wno-unused-function;-O3;-DNDEBUG;-funroll-all-loops;
-fexcess-precision=fast;-Wno-array-bounds;
CUDA driver: 8.0
CUDA runtime: 8.0
Running on 1 node with total 16 cores, 32 logical cores, 1 compatible GPU
Hardware detected:
CPU info:
Vendor: GenuineIntel
Brand: Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz
Family: 6 model: 63 stepping: 2
CPU features: aes apic avx avx2 clfsh cmov cx8 cx16 f16c fma htt
lahf_lm mmx msr nonstop_tsc pcid pclmuldq pdcm pdpe1gb popcnt pse rdrnd
rdtscp sse2 sse3 sse4.1 sse4.2 ssse3 tdt x2apic
SIMD instructions most likely to fit this hardware: AVX2_256
SIMD instructions selected at GROMACS compile time: AVX2_256
GPU info:
Number of GPUs detected: 1
#0: NVIDIA GeForce GTX 980 Ti, compute cap.: 5.2, ECC: no, stat: compatible
Changing nstlist from 20 to 40, rlist from 1.2 to 1.2
Input Parameters:
integrator = sd
tinit = 0
dt = 0.002
nsteps = 1000000
init-step = 0
simulation-part = 1
comm-mode = Linear
nstcomm = 100
bd-fric = 0
ld-seed = 57540858
emtol = 10
emstep = 0.01
niter = 20
fcstep = 0
nstcgsteep = 1000
nbfgscorr = 10
rtpi = 0.05
nstxout = 5000
nstvout = 500
nstfout = 0
nstlog = 500
nstcalcenergy = 100
nstenergy = 1000
nstxout-compressed = 0
compressed-x-precision = 1000
cutoff-scheme = Verlet
nstlist = 40
ns-type = Grid
pbc = xyz
periodic-molecules = FALSE
verlet-buffer-tolerance = 0.005
rlist = 1.2
rlistlong = 1.2
nstcalclr = 20
coulombtype = PME
coulomb-modifier = Potential-shift
rcoulomb-switch = 0
rcoulomb = 1.2
epsilon-r = 1
epsilon-rf = inf
vdw-type = Cut-off
vdw-modifier = Potential-switch
rvdw-switch = 1
rvdw = 1.2
DispCorr = EnerPres
table-extension = 1
fourierspacing = 0.12
fourier-nx = 72
fourier-ny = 72
fourier-nz = 72
pme-order = 6
ewald-rtol = 1e-06
ewald-rtol-lj = 0.001
lj-pme-comb-rule = Geometric
ewald-geometry = 0
epsilon-surface = 0
implicit-solvent = No
gb-algorithm = Still
nstgbradii = 1
rgbradii = 1
gb-epsilon-solvent = 80
gb-saltconc = 0
gb-obc-alpha = 1
gb-obc-beta = 0.8
gb-obc-gamma = 4.85
gb-dielectric-offset = 0.009
sa-algorithm = Ace-approximation
sa-surface-tension = 2.05016
tcoupl = No
nsttcouple = -1
nh-chain-length = 0
print-nose-hoover-chain-variables = FALSE
pcoupl = Parrinello-Rahman
pcoupltype = Isotropic
nstpcouple = 20
tau-p = 1
Using 1 MPI thread
Using 32 OpenMP threads
1 compatible GPU is present, with ID 0
1 GPU auto-selected for this run.
Mapping of GPU ID to the 1 PP rank in this node: 0
Will do PME sum in reciprocal space for electrostatic interactions.
++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
U. Essmann, L. Perera, M. L. Berkowitz, T. Darden, H. Lee and L. G.
Pedersen
A smooth particle mesh Ewald method
J. Chem. Phys. 103 (1995) pp. 8577-8592
-------- -------- --- Thank You --- -------- --------
Will do ordinary reciprocal space Ewald sum.
Using a Gaussian width (1/beta) of 0.34693 nm for Ewald
Cut-off's: NS: 1.2 Coulomb: 1.2 LJ: 1.2
Long Range LJ corr.: <C6> 3.2003e-04
System total charge, top. A: -0.000 top. B: -0.000
Generated table with 1100 data points for Ewald.
Tabscale = 500 points/nm
Generated table with 1100 data points for LJ6Switch.
Tabscale = 500 points/nm
Generated table with 1100 data points for LJ12Switch.
Tabscale = 500 points/nm
Generated table with 1100 data points for 1-4 COUL.
Tabscale = 500 points/nm
Generated table with 1100 data points for 1-4 LJ6.
Tabscale = 500 points/nm
Generated table with 1100 data points for 1-4 LJ12.
Tabscale = 500 points/nm
Potential shift: LJ r^-12: 0.000e+00 r^-6: 0.000e+00, Ewald -1.000e-06
Initialized non-bonded Ewald correction tables, spacing: 9.71e-04 size: 1237
Using GPU 8x8 non-bonded kernels
NOTE: With GPUs, reporting energy group contributions is not supported
There are 39 atoms and 39 charges for free energy perturbation
Pinning threads with an auto-selected logical core stride of 1
Initializing LINear Constraint Solver
-------- -------- --- Thank You --- -------- --------
There are: 59559 Atoms
Initial temperature: 301.342 K
Started mdrun on rank 0 Sun Jul 9 04:02:47 2017
Step Time Lambda
0 0.00000 0.35000
.....
M E G A - F L O P S A C C O U N T I N G
NB=Group-cutoff nonbonded kernels NxN=N-by-N cluster Verlet kernels
RF=Reaction-Field VdW=Van der Waals QSTab=quadratic-spline table
W3=SPC/TIP3p W4=TIP4p (single or pairs)
V&F=Potential and force V=Potential only F=Force only
Computing: M-Number M-Flops % Flops
-----------------------------------------------------------------------------
NB Free energy kernel 7881861.469518 7881861.470 0.1
Pair Search distance check 211801.978992 1906217.811 0.0
NxN Ewald Elec. + LJ [F] 61644114.490880 5732902647.652 91.3
NxN Ewald Elec. + LJ [V&F] 622729.312576 79086622.697 1.3
1,4 nonbonded interactions 15157.138733 1364142.486 0.0
Calc Weights 178677.178677 6432378.432 0.1
Spread Q Bspline 25729513.729488 51459027.459 0.8
Gather F Bspline 25729513.729488 154377082.377 2.5
3D-FFT 27628393.815424 221027150.523 3.5
Solve PME 10366.046848 663426.998 0.0
Shift-X 1489.034559 8934.207 0.0
Angles 10513.850597 1766326.900 0.0
Propers 18191.018191 4165743.166 0.1
Impropers 1133.001133 235664.236 0.0
Virial 2980.259604 53644.673 0.0
Update 59559.059559 1846330.846 0.0
Stop-CM 595.649559 5956.496 0.0
Calc-Ekin 5956.019118 160812.516 0.0
Lincs 11610.011610 696600.697 0.0
Lincs-Mat 588728.588728 2354914.355 0.0
Constraint-V 130824.130824 1046593.047 0.0
Constraint-Vir 2980.409607 71529.831 0.0
Settle 35868.035868 11585375.585 0.2
-----------------------------------------------------------------------------
Total 6281098984.459 100.0
-----------------------------------------------------------------------------
R E A L C Y C L E A N D T I M E A C C O U N T I N G
On 1 MPI rank, each using 32 OpenMP threads
Computing: Num Num Call Wall time Giga-Cycles
Ranks Threads Count (s) total sum %
-----------------------------------------------------------------------------
Neighbor search 1 32 25001 170.606 13073.577 1.5
Launch GPU ops. 1 32 1000001 97.251 7452.377 0.8
Force 1 32 1000001 2462.595 188709.029 21.0
PME mesh 1 32 1000001 7214.132 552819.972 61.5
Wait GPU local 1 32 1000001 22.963 1759.683 0.2
NB X/F buffer ops. 1 32 1975001 303.888 23287.017 2.6
Write traj. 1 32 2190 41.970 3216.155 0.4
Update 1 32 2000002 374.895 28728.243 3.2
Constraints 1 32 2000002 718.184 55034.545 6.1
Rest 315.793 24199.295 2.7
-----------------------------------------------------------------------------
Total 11722.279 898279.893 100.0
-----------------------------------------------------------------------------
Breakdown of PME mesh computation
-----------------------------------------------------------------------------
PME spread/gather 1 32 4000004 5659.890 433718.207 48.3
PME 3D-FFT 1 32 4000004 1447.568 110927.319 12.3
PME solve Elec 1 32 2000002 85.838 6577.816 0.7
-----------------------------------------------------------------------------
GPU timings
-----------------------------------------------------------------------------
Computing: Count Wall t (s) ms/step %
-----------------------------------------------------------------------------
Pair list H2D 25001 14.012 0.560 0.6
X / q H2D 1000001 171.474 0.171 7.7
Nonbonded F kernel 970000 1852.997 1.910 82.8
Nonbonded F+ene k. 5000 13.053 2.611 0.6
Nonbonded F+prune k. 20000 47.018 2.351 2.1
Nonbonded F+ene+prune k. 5001 15.825 3.164 0.7
F D2H 1000001 124.521 0.125 5.6
-----------------------------------------------------------------------------
Total 2238.898 2.239 100.0
-----------------------------------------------------------------------------
Force evaluation time GPU/CPU: 2.239 ms/9.677 ms = 0.231
For optimal performance this ratio should be close to 1!
NOTE: The GPU has >25% less load than the CPU. This imbalance causes
performance loss.
Core t (s) Wall t (s) (%)
Time: 374361.605 11722.279 3193.6
3h15:22
(ns/day) (hour/ns)
Performance: 14.741 1.628
Finished mdrun on rank 0 Sun Jul 9 07:18:10 2017