[gmx-users] Help on MD performance, GPU has less load than CPU.

Davide Bonanni davide.bonanni at unito.it
Mon Jul 10 17:01:40 CEST 2017


Hi,

I am working on a node with an Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz
(16 physical cores, 32 logical cores) and one NVIDIA GeForce GTX 980 Ti GPU.
I am running a series of 2 ns molecular dynamics simulations of a system
of 60000 atoms.
I tried various combinations of settings, but I obtained the best
performance with the command:

"gmx mdrun  -deffnm md_LIG -cpt 1 -cpo restart1.cpt -pin on"

which uses 32 OpenMP threads, 1 MPI thread, and the GPU.
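For comparison, one variant I could still benchmark is splitting the same
32 hardware threads over several thread-MPI ranks that share the single
GPU. A sketch (standard mdrun 5.1 flags; the -nsteps/-resethway/-noconfout
part is just my way of making a short, clean benchmark run):

  # four thread-MPI ranks x 8 OpenMP threads = 32 threads;
  # -gpu_id 0000 maps all four PP ranks onto GPU 0;
  # short run with counters reset halfway, no final conf written
  gmx mdrun -deffnm md_LIG -ntmpi 4 -ntomp 8 -gpu_id 0000 -pin on \
            -nsteps 40000 -resethway -noconfout

but so far the single-rank launch above has been the fastest of what I tried.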
At the end of the .log file of the molecular dynamics production run I
get this message:

"NOTE: The GPU has >25% less load than the CPU. This imbalance causes
      performance loss."

I don't know how I can reduce the load on the CPU any further, or how I
can increase the load on the GPU. Do you have any suggestions?
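In case it is relevant: with a single rank all of the PME mesh work runs
on the CPU, and two of my PME settings (see the Input Parameters section
of the log below) are stricter than the GROMACS defaults. A sketch of the
relaxed values I could test, assuming my force field setup tolerates them:

   pme-order  = 4      ; currently 6; CPU spread/gather work per charge
                       ; grows roughly as pme-order^3
   ewald-rtol = 1e-05  ; currently 1e-06; 1e-05 is the GROMACS default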

Thank you in advance.

Cheers,

Davide Bonanni


The initial and final parts of the log file are shown below:

Log file opened on Sun Jul  9 04:02:44 2017
Host: bigblue  pid: 16777  rank ID: 0  number of ranks:  1
                   :-) GROMACS - gmx mdrun, VERSION 5.1.4 (-:



GROMACS:      gmx mdrun, VERSION 5.1.4
Executable:   /usr/bin/gmx
Data prefix:  /usr/local/gromacs
Command line:
  gmx mdrun -deffnm md_fluo_7 -cpt 1 -cpo restart1.cpt -pin on

GROMACS version:    VERSION 5.1.4
Precision:          single
Memory model:       64 bit
MPI library:        thread_mpi
OpenMP support:     enabled (GMX_OPENMP_MAX_THREADS = 32)
GPU support:        enabled
OpenCL support:     disabled
invsqrt routine:    gmx_software_invsqrt(x)
SIMD instructions:  AVX2_256
FFT library:        fftw-3.3.4-sse2-avx
RDTSCP usage:       enabled
C++11 compilation:  disabled
TNG support:        enabled
Tracing support:    disabled
Built on:           Tue  8 Nov 12:26:14 CET 2016
Built by:           root at bigblue [CMAKE]
Build OS/arch:      Linux 3.10.0-327.el7.x86_64 x86_64
Build CPU vendor:   GenuineIntel
Build CPU brand:    Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz
Build CPU family:   6   Model: 63   Stepping: 2
Build CPU features: aes apic avx avx2 clfsh cmov cx8 cx16 f16c fma htt lahf_lm mmx msr nonstop_tsc pcid pclmuldq pdcm pdpe1gb popcnt pse rdrnd rdtscp sse2 sse3 sse4.1 sse4.2 ssse3 tdt x2apic
C compiler:         /bin/cc GNU 4.8.5
C compiler flags:    -march=core-avx2    -Wextra -Wno-missing-field-initializers -Wno-sign-compare -Wpointer-arith -Wall -Wno-unused -Wunused-value -Wunused-parameter  -O3 -DNDEBUG -funroll-all-loops -fexcess-precision=fast  -Wno-array-bounds
C++ compiler:       /bin/c++ GNU 4.8.5
C++ compiler flags:  -march=core-avx2    -Wextra -Wno-missing-field-initializers -Wpointer-arith -Wall -Wno-unused-function  -O3 -DNDEBUG -funroll-all-loops -fexcess-precision=fast  -Wno-array-bounds
Boost version:      1.55.0 (internal)
CUDA compiler:      /usr/local/cuda/bin/nvcc nvcc: NVIDIA (R) Cuda compiler driver;Copyright (c) 2005-2016 NVIDIA Corporation;Built on Sun_Sep__4_22:14:01_CDT_2016;Cuda compilation tools, release 8.0, V8.0.44
CUDA compiler flags:-gencode;arch=compute_20,code=sm_20;-gencode;arch=compute_30,code=sm_30;-gencode;arch=compute_35,code=sm_35;-gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_52,code=sm_52;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_60,code=compute_60;-gencode;arch=compute_61,code=compute_61;-use_fast_math;;;-march=core-avx2;-Wextra;-Wno-missing-field-initializers;-Wpointer-arith;-Wall;-Wno-unused-function;-O3;-DNDEBUG;-funroll-all-loops;-fexcess-precision=fast;-Wno-array-bounds;
CUDA driver:        8.0
CUDA runtime:       8.0


Running on 1 node with total 16 cores, 32 logical cores, 1 compatible GPU
Hardware detected:
  CPU info:
    Vendor: GenuineIntel
    Brand:  Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz
    Family:  6  model: 63  stepping:  2
    CPU features: aes apic avx avx2 clfsh cmov cx8 cx16 f16c fma htt lahf_lm mmx msr nonstop_tsc pcid pclmuldq pdcm pdpe1gb popcnt pse rdrnd rdtscp sse2 sse3 sse4.1 sse4.2 ssse3 tdt x2apic
    SIMD instructions most likely to fit this hardware: AVX2_256
    SIMD instructions selected at GROMACS compile time: AVX2_256
  GPU info:
    Number of GPUs detected: 1
    #0: NVIDIA GeForce GTX 980 Ti, compute cap.: 5.2, ECC:  no, stat: compatible



Changing nstlist from 20 to 40, rlist from 1.2 to 1.2

Input Parameters:
   integrator                     = sd
   tinit                          = 0
   dt                             = 0.002
   nsteps                         = 1000000
   init-step                      = 0
   simulation-part                = 1
   comm-mode                      = Linear
   nstcomm                        = 100
   bd-fric                        = 0
   ld-seed                        = 57540858
   emtol                          = 10
   emstep                         = 0.01
   niter                          = 20
   fcstep                         = 0
   nstcgsteep                     = 1000
   nbfgscorr                      = 10
   rtpi                           = 0.05
   nstxout                        = 5000
   nstvout                        = 500
   nstfout                        = 0
   nstlog                         = 500
   nstcalcenergy                  = 100
   nstenergy                      = 1000
   nstxout-compressed             = 0
   compressed-x-precision         = 1000
   cutoff-scheme                  = Verlet
   nstlist                        = 40
   ns-type                        = Grid
   pbc                            = xyz
   periodic-molecules             = FALSE
   verlet-buffer-tolerance        = 0.005
   rlist                          = 1.2
   rlistlong                      = 1.2
   nstcalclr                      = 20
   coulombtype                    = PME
   coulomb-modifier               = Potential-shift
   rcoulomb-switch                = 0
   rcoulomb                       = 1.2
   epsilon-r                      = 1
   epsilon-rf                     = inf
   vdw-type                       = Cut-off
   vdw-modifier                   = Potential-switch
   rvdw-switch                    = 1
   rvdw                           = 1.2
   DispCorr                       = EnerPres
   table-extension                = 1
   fourierspacing                 = 0.12
   fourier-nx                     = 72
   fourier-ny                     = 72
   fourier-nz                     = 72
   pme-order                      = 6
   ewald-rtol                     = 1e-06
   ewald-rtol-lj                  = 0.001
   lj-pme-comb-rule               = Geometric
   ewald-geometry                 = 0
   epsilon-surface                = 0
   implicit-solvent               = No
   gb-algorithm                   = Still
   nstgbradii                     = 1
   rgbradii                       = 1
   gb-epsilon-solvent             = 80
   gb-saltconc                    = 0
   gb-obc-alpha                   = 1
   gb-obc-beta                    = 0.8
   gb-obc-gamma                   = 4.85
   gb-dielectric-offset           = 0.009
   sa-algorithm                   = Ace-approximation
   sa-surface-tension             = 2.05016
   tcoupl                         = No
   nsttcouple                     = -1
   nh-chain-length                = 0
   print-nose-hoover-chain-variables = FALSE
   pcoupl                         = Parrinello-Rahman
   pcoupltype                     = Isotropic
   nstpcouple                     = 20
   tau-p                          = 1


Using 1 MPI thread
Using 32 OpenMP threads

1 compatible GPU is present, with ID 0
1 GPU auto-selected for this run.
Mapping of GPU ID to the 1 PP rank in this node: 0

Will do PME sum in reciprocal space for electrostatic interactions.

++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
U. Essmann, L. Perera, M. L. Berkowitz, T. Darden, H. Lee and L. G. Pedersen
A smooth particle mesh Ewald method
J. Chem. Phys. 103 (1995) pp. 8577-8592
-------- -------- --- Thank You --- -------- --------

Will do ordinary reciprocal space Ewald sum.
Using a Gaussian width (1/beta) of 0.34693 nm for Ewald
Cut-off's:   NS: 1.2   Coulomb: 1.2   LJ: 1.2
Long Range LJ corr.: <C6> 3.2003e-04
System total charge, top. A: -0.000 top. B: -0.000
Generated table with 1100 data points for Ewald.
Tabscale = 500 points/nm
Generated table with 1100 data points for LJ6Switch.
Tabscale = 500 points/nm
Generated table with 1100 data points for LJ12Switch.
Tabscale = 500 points/nm
Generated table with 1100 data points for 1-4 COUL.
Tabscale = 500 points/nm
Generated table with 1100 data points for 1-4 LJ6.
Tabscale = 500 points/nm
Generated table with 1100 data points for 1-4 LJ12.
Tabscale = 500 points/nm
Potential shift: LJ r^-12: 0.000e+00 r^-6: 0.000e+00, Ewald -1.000e-06
Initialized non-bonded Ewald correction tables, spacing: 9.71e-04 size: 1237


Using GPU 8x8 non-bonded kernels


NOTE: With GPUs, reporting energy group contributions is not supported

There are 39 atoms and 39 charges for free energy perturbation
Pinning threads with an auto-selected logical core stride of 1

Initializing LINear Constraint Solver

-------- -------- --- Thank You --- -------- --------

There are: 59559 Atoms
Initial temperature: 301.342 K

Started mdrun on rank 0 Sun Jul  9 04:02:47 2017
           Step           Time         Lambda
              0        0.00000        0.35000



.....
.....
.....
.....
.....




M E G A - F L O P S   A C C O U N T I N G

 NB=Group-cutoff nonbonded kernels    NxN=N-by-N cluster Verlet kernels
 RF=Reaction-Field  VdW=Van der Waals  QSTab=quadratic-spline table
 W3=SPC/TIP3p  W4=TIP4p (single or pairs)
 V&F=Potential and force  V=Potential only  F=Force only

 Computing:                               M-Number         M-Flops  % Flops
-----------------------------------------------------------------------------
 NB Free energy kernel              7881861.469518     7881861.470     0.1
 Pair Search distance check          211801.978992     1906217.811     0.0
 NxN Ewald Elec. + LJ [F]          61644114.490880  5732902647.652    91.3
 NxN Ewald Elec. + LJ [V&F]          622729.312576    79086622.697     1.3
 1,4 nonbonded interactions           15157.138733     1364142.486     0.0
 Calc Weights                        178677.178677     6432378.432     0.1
 Spread Q Bspline                  25729513.729488    51459027.459     0.8
 Gather F Bspline                  25729513.729488   154377082.377     2.5
 3D-FFT                            27628393.815424   221027150.523     3.5
 Solve PME                            10366.046848      663426.998     0.0
 Shift-X                               1489.034559        8934.207     0.0
 Angles                               10513.850597     1766326.900     0.0
 Propers                              18191.018191     4165743.166     0.1
 Impropers                             1133.001133      235664.236     0.0
 Virial                                2980.259604       53644.673     0.0
 Update                               59559.059559     1846330.846     0.0
 Stop-CM                                595.649559        5956.496     0.0
 Calc-Ekin                             5956.019118      160812.516     0.0
 Lincs                                11610.011610      696600.697     0.0
 Lincs-Mat                           588728.588728     2354914.355     0.0
 Constraint-V                        130824.130824     1046593.047     0.0
 Constraint-Vir                        2980.409607       71529.831     0.0
 Settle                               35868.035868    11585375.585     0.2
-----------------------------------------------------------------------------
 Total                                              6281098984.459   100.0
-----------------------------------------------------------------------------


     R E A L   C Y C L E   A N D   T I M E   A C C O U N T I N G

On 1 MPI rank, each using 32 OpenMP threads

 Computing:          Num   Num      Call    Wall time         Giga-Cycles
                     Ranks Threads  Count      (s)         total sum    %
-----------------------------------------------------------------------------
 Neighbor search        1   32      25001     170.606      13073.577   1.5
 Launch GPU ops.        1   32    1000001      97.251       7452.377   0.8
 Force                  1   32    1000001    2462.595     188709.029  21.0
 PME mesh               1   32    1000001    7214.132     552819.972  61.5
 Wait GPU local         1   32    1000001      22.963       1759.683   0.2
 NB X/F buffer ops.     1   32    1975001     303.888      23287.017   2.6
 Write traj.            1   32       2190      41.970       3216.155   0.4
 Update                 1   32    2000002     374.895      28728.243   3.2
 Constraints            1   32    2000002     718.184      55034.545   6.1
 Rest                                         315.793      24199.295   2.7
-----------------------------------------------------------------------------
 Total                                      11722.279     898279.893 100.0
-----------------------------------------------------------------------------
 Breakdown of PME mesh computation
-----------------------------------------------------------------------------
 PME spread/gather      1   32    4000004    5659.890     433718.207  48.3
 PME 3D-FFT             1   32    4000004    1447.568     110927.319  12.3
 PME solve Elec         1   32    2000002      85.838       6577.816   0.7
-----------------------------------------------------------------------------

 GPU timings
-----------------------------------------------------------------------------
 Computing:                         Count  Wall t (s)      ms/step       %
-----------------------------------------------------------------------------
 Pair list H2D                      25001      14.012        0.560     0.6
 X / q H2D                        1000001     171.474        0.171     7.7
 Nonbonded F kernel                970000    1852.997        1.910    82.8
 Nonbonded F+ene k.                  5000      13.053        2.611     0.6
 Nonbonded F+prune k.               20000      47.018        2.351     2.1
 Nonbonded F+ene+prune k.            5001      15.825        3.164     0.7
 F D2H                            1000001     124.521        0.125     5.6
-----------------------------------------------------------------------------
 Total                                       2238.898        2.239   100.0
-----------------------------------------------------------------------------

Force evaluation time GPU/CPU: 2.239 ms/9.677 ms = 0.231
For optimal performance this ratio should be close to 1!


NOTE: The GPU has >25% less load than the CPU. This imbalance causes
      performance loss.

               Core t (s)   Wall t (s)        (%)
       Time:   374361.605    11722.279     3193.6
                         3h15:22
                 (ns/day)    (hour/ns)
Performance:       14.741        1.628
Finished mdrun on rank 0 Sun Jul  9 07:18:10 2017

