[gmx-users] Help on MD performance, GPU has less load than CPU.

Mark Abraham mark.j.abraham at gmail.com
Tue Jul 11 15:24:26 CEST 2017


Hi,

I'm genuinely curious why people set ewald_rtol to smaller values (which is
unlikely to be useful, because the accumulation of forces in single
precision has round-off error, so the approximation to the correct sum is
not reliably accurate to better than about 1 part in 1e5), and thus
pme_order to large values - this is the second time I've seen it in 24
hours. Is there data somewhere that shows this is useful?

In any case, it a) causes a lot more work on the CPU, and b) only pme_order
= 4 (and, to a lesser extent, 5) is optimized for performance (because
there is no data showing that higher orders are useful). And for a
free-energy calculation, that extra expense accrues for each lambda state.
See the "PME mesh" parts of the performance report.
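
For reference, a minimal sketch of the PME-related .mdp settings at their
stock GROMACS defaults (only the values shown below are the defaults; I'm
not claiming anything about the rest of your input):

  fourierspacing = 0.12   ; default PME grid spacing (nm)
  pme-order      = 4      ; the interpolation order the PME kernels are tuned for
  ewald-rtol     = 1e-5   ; default relative strength of the direct-space potential at rcoulomb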

Guessing wildly, the cost of your simulation is probably at least double
what the defaults would give, and for that cost, I'd want to know why.
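
If you want to quantify that, one possible recipe (the .mdp and file names
below are placeholders, not taken from your log) is to regenerate the tpr
with the default PME settings and time a short benchmark against your
current setup:

  gmx grompp -f md_default_pme.mdp -c conf.gro -p topol.top -o bench_default.tpr
  gmx mdrun -s bench_default.tpr -nsteps 20000 -resethway -noconfout -pin on

Then compare the "PME mesh" rows and the ns/day between the two log files.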

Mark

On Mon, Jul 10, 2017 at 5:02 PM Davide Bonanni <davide.bonanni at unito.it>
wrote:

> Hi,
>
> I am working on a node with an Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz (16
> physical cores, 32 logical cores) and one NVIDIA GeForce GTX 980 Ti GPU.
> I am launching a series of 2 ns molecular dynamics simulations of a system
> of 60000 atoms.
> I have tried various combinations of settings, but I obtained the best
> performance with the command:
>
> "gmx mdrun  -deffnm md_LIG -cpt 1 -cpo restart1.cpt -pin on"
>
> which uses 32 OpenMP threads, 1 MPI thread, and the GPU.
> At the end of the log file from the molecular dynamics production run I get
> this message:
>
> "NOTE: The GPU has >25% less load than the CPU. This imbalance causes
>       performance loss."
>
> I don't know how I can improve the load on the CPU any further, or how I can
> decrease the load on the GPU. Do you have any suggestions?
>
> Thank you in advance.
>
> Cheers,
>
> Davide Bonanni
>
>
> The initial and final parts of the log file are below:
>
> Log file opened on Sun Jul  9 04:02:44 2017
> Host: bigblue  pid: 16777  rank ID: 0  number of ranks:  1
>                    :-) GROMACS - gmx mdrun, VERSION 5.1.4 (-:
>
>
>
> GROMACS:      gmx mdrun, VERSION 5.1.4
> Executable:   /usr/bin/gmx
> Data prefix:  /usr/local/gromacs
> Command line:
>   gmx mdrun -deffnm md_fluo_7 -cpt 1 -cpo restart1.cpt -pin on
>
> GROMACS version:    VERSION 5.1.4
> Precision:          single
> Memory model:       64 bit
> MPI library:        thread_mpi
> OpenMP support:     enabled (GMX_OPENMP_MAX_THREADS = 32)
> GPU support:        enabled
> OpenCL support:     disabled
> invsqrt routine:    gmx_software_invsqrt(x)
> SIMD instructions:  AVX2_256
> FFT library:        fftw-3.3.4-sse2-avx
> RDTSCP usage:       enabled
> C++11 compilation:  disabled
> TNG support:        enabled
> Tracing support:    disabled
> Built on:           Tue  8 Nov 12:26:14 CET 2016
> Built by:           root at bigblue [CMAKE]
> Build OS/arch:      Linux 3.10.0-327.el7.x86_64 x86_64
> Build CPU vendor:   GenuineIntel
> Build CPU brand:    Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz
> Build CPU family:   6   Model: 63   Stepping: 2
> Build CPU features: aes apic avx avx2 clfsh cmov cx8 cx16 f16c fma htt
> lahf_lm mmx msr nonstop_tsc pcid pclmuldq pdcm pdpe1gb popcnt pse rdrnd
> rdtscp sse2 sse3 sse4.1 sse4.2 ssse3 tdt x2apic
> C compiler:         /bin/cc GNU 4.8.5
> C compiler flags:    -march=core-avx2    -Wextra
> -Wno-missing-field-initializers
> -Wno-sign-compare -Wpointer-arith -Wall -Wno-unused -Wunused-value
> -Wunused-parameter  -O3 -DNDEBUG -funroll-all-loops -fexcess-precision=fast
>  -Wno-array-bounds
> C++ compiler:       /bin/c++ GNU 4.8.5
> C++ compiler flags:  -march=core-avx2    -Wextra
> -Wno-missing-field-initializers
> -Wpointer-arith -Wall -Wno-unused-function  -O3 -DNDEBUG -funroll-all-loops
> -fexcess-precision=fast  -Wno-array-bounds
> Boost version:      1.55.0 (internal)
> CUDA compiler:      /usr/local/cuda/bin/nvcc nvcc: NVIDIA (R) Cuda compiler driver;Copyright (c) 2005-2016 NVIDIA Corporation;Built on Sun_Sep__4_22:14:01_CDT_2016;Cuda compilation tools, release 8.0, V8.0.44
> CUDA compiler flags:-gencode;arch=compute_20,code=sm_20;-gencode;arch=compute_30,code=sm_30;-gencode;arch=compute_35,code=sm_35;-gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_52,code=sm_52;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_60,code=compute_60;-gencode;arch=compute_61,code=compute_61;-use_fast_math;; ;-march=core-avx2;-Wextra;-Wno-missing-field-initializers;-Wpointer-arith;-Wall;-Wno-unused-function;-O3;-DNDEBUG;-funroll-all-loops;-fexcess-precision=fast;-Wno-array-bounds;
> CUDA driver:        8.0
> CUDA runtime:       8.0
>
>
> Running on 1 node with total 16 cores, 32 logical cores, 1 compatible GPU
> Hardware detected:
>   CPU info:
>     Vendor: GenuineIntel
>     Brand:  Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz
>     Family:  6  model: 63  stepping:  2
>     CPU features: aes apic avx avx2 clfsh cmov cx8 cx16 f16c fma htt
> lahf_lm mmx msr nonstop_tsc pcid pclmuldq pdcm pdpe1gb popcnt pse rdrnd
> rdtscp sse2 sse3 sse4.1 sse4.2 ssse3 tdt x2apic
>     SIMD instructions most likely to fit this hardware: AVX2_256
>     SIMD instructions selected at GROMACS compile time: AVX2_256
>   GPU info:
>     Number of GPUs detected: 1
>     #0: NVIDIA GeForce GTX 980 Ti, compute cap.: 5.2, ECC:  no, stat: compatible
>
>
>
> Changing nstlist from 20 to 40, rlist from 1.2 to 1.2
>
> Input Parameters:
>    integrator                     = sd
>    tinit                          = 0
>    dt                             = 0.002
>    nsteps                         = 1000000
>    init-step                      = 0
>    simulation-part                = 1
>    comm-mode                      = Linear
>    nstcomm                        = 100
>    bd-fric                        = 0
>    ld-seed                        = 57540858
>    emtol                          = 10
>    emstep                         = 0.01
>    niter                          = 20
>    fcstep                         = 0
>    nstcgsteep                     = 1000
>    nbfgscorr                      = 10
>    rtpi                           = 0.05
>    nstxout                        = 5000
>    nstvout                        = 500
>    nstfout                        = 0
>    nstlog                         = 500
>    nstcalcenergy                  = 100
>    nstenergy                      = 1000
>    nstxout-compressed             = 0
>    compressed-x-precision         = 1000
>    cutoff-scheme                  = Verlet
>    nstlist                        = 40
>    ns-type                        = Grid
>    pbc                            = xyz
>    periodic-molecules             = FALSE
>    verlet-buffer-tolerance        = 0.005
>    rlist                          = 1.2
>    rlistlong                      = 1.2
>    nstcalclr                      = 20
>    coulombtype                    = PME
>    coulomb-modifier               = Potential-shift
>    rcoulomb-switch                = 0
>    rcoulomb                       = 1.2
>    epsilon-r                      = 1
>    epsilon-rf                     = inf
>    vdw-type                       = Cut-off
>    vdw-modifier                   = Potential-switch
>    rvdw-switch                    = 1
>    rvdw                           = 1.2
>    DispCorr                       = EnerPres
>    table-extension                = 1
>    fourierspacing                 = 0.12
>    fourier-nx                     = 72
>    fourier-ny                     = 72
>    fourier-nz                     = 72
>    pme-order                      = 6
>    ewald-rtol                     = 1e-06
>    ewald-rtol-lj                  = 0.001
>    lj-pme-comb-rule               = Geometric
>    ewald-geometry                 = 0
>    epsilon-surface                = 0
>    implicit-solvent               = No
>    gb-algorithm                   = Still
>    nstgbradii                     = 1
>    rgbradii                       = 1
>    gb-epsilon-solvent             = 80
>    gb-saltconc                    = 0
>    gb-obc-alpha                   = 1
>    gb-obc-beta                    = 0.8
>    gb-obc-gamma                   = 4.85
>    gb-dielectric-offset           = 0.009
>    sa-algorithm                   = Ace-approximation
>    sa-surface-tension             = 2.05016
>    tcoupl                         = No
>    nsttcouple                     = -1
>    nh-chain-length                = 0
>    print-nose-hoover-chain-variables = FALSE
>    pcoupl                         = Parrinello-Rahman
>    pcoupltype                     = Isotropic
>    nstpcouple                     = 20
>    tau-p                          = 1
>
>
> Using 1 MPI thread
> Using 32 OpenMP threads
>
> 1 compatible GPU is present, with ID 0
> 1 GPU auto-selected for this run.
> Mapping of GPU ID to the 1 PP rank in this node: 0
>
> Will do PME sum in reciprocal space for electrostatic interactions.
>
> ++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
> U. Essmann, L. Perera, M. L. Berkowitz, T. Darden, H. Lee and L. G.
> Pedersen
> A smooth particle mesh Ewald method
> J. Chem. Phys. 103 (1995) pp. 8577-8592
> -------- -------- --- Thank You --- -------- --------
>
> Will do ordinary reciprocal space Ewald sum.
> Using a Gaussian width (1/beta) of 0.34693 nm for Ewald
> Cut-off's:   NS: 1.2   Coulomb: 1.2   LJ: 1.2
> Long Range LJ corr.: <C6> 3.2003e-04
> System total charge, top. A: -0.000 top. B: -0.000
> Generated table with 1100 data points for Ewald.
> Tabscale = 500 points/nm
> Generated table with 1100 data points for LJ6Switch.
> Tabscale = 500 points/nm
> Generated table with 1100 data points for LJ12Switch.
> Tabscale = 500 points/nm
> Generated table with 1100 data points for 1-4 COUL.
> Tabscale = 500 points/nm
> Generated table with 1100 data points for 1-4 LJ6.
> Tabscale = 500 points/nm
> Generated table with 1100 data points for 1-4 LJ12.
> Tabscale = 500 points/nm
> Potential shift: LJ r^-12: 0.000e+00 r^-6: 0.000e+00, Ewald -1.000e-06
> Initialized non-bonded Ewald correction tables, spacing: 9.71e-04 size: 1237
>
>
> Using GPU 8x8 non-bonded kernels
>
>
> NOTE: With GPUs, reporting energy group contributions is not supported
>
> There are 39 atoms and 39 charges for free energy perturbation
> Pinning threads with an auto-selected logical core stride of 1
>
> Initializing LINear Constraint Solver
>
> -------- -------- --- Thank You --- -------- --------
>
> There are: 59559 Atoms
> Initial temperature: 301.342 K
>
> Started mdrun on rank 0 Sun Jul  9 04:02:47 2017
>            Step           Time         Lambda
>               0        0.00000        0.35000
>
>
>
> .....
> .....
> .....
> .....
> .....
>
>
>
>
> M E G A - F L O P S   A C C O U N T I N G
>
>  NB=Group-cutoff nonbonded kernels    NxN=N-by-N cluster Verlet kernels
>  RF=Reaction-Field  VdW=Van der Waals  QSTab=quadratic-spline table
>  W3=SPC/TIP3p  W4=TIP4p (single or pairs)
>  V&F=Potential and force  V=Potential only  F=Force only
>
>  Computing:                               M-Number         M-Flops  % Flops
> -----------------------------------------------------------------------------
>  NB Free energy kernel              7881861.469518     7881861.470     0.1
>  Pair Search distance check          211801.978992     1906217.811     0.0
>  NxN Ewald Elec. + LJ [F]          61644114.490880  5732902647.652    91.3
>  NxN Ewald Elec. + LJ [V&F]          622729.312576    79086622.697     1.3
>  1,4 nonbonded interactions           15157.138733     1364142.486     0.0
>  Calc Weights                        178677.178677     6432378.432     0.1
>  Spread Q Bspline                  25729513.729488    51459027.459     0.8
>  Gather F Bspline                  25729513.729488   154377082.377     2.5
>  3D-FFT                            27628393.815424   221027150.523     3.5
>  Solve PME                            10366.046848      663426.998     0.0
>  Shift-X                               1489.034559        8934.207     0.0
>  Angles                               10513.850597     1766326.900     0.0
>  Propers                              18191.018191     4165743.166     0.1
>  Impropers                             1133.001133      235664.236     0.0
>  Virial                                2980.259604       53644.673     0.0
>  Update                               59559.059559     1846330.846     0.0
>  Stop-CM                                595.649559        5956.496     0.0
>  Calc-Ekin                             5956.019118      160812.516     0.0
>  Lincs                                11610.011610      696600.697     0.0
>  Lincs-Mat                           588728.588728     2354914.355     0.0
>  Constraint-V                        130824.130824     1046593.047     0.0
>  Constraint-Vir                        2980.409607       71529.831     0.0
>  Settle                               35868.035868    11585375.585     0.2
> -----------------------------------------------------------------------------
>  Total                                              6281098984.459   100.0
> -----------------------------------------------------------------------------
>
>
>      R E A L   C Y C L E   A N D   T I M E   A C C O U N T I N G
>
> On 1 MPI rank, each using 32 OpenMP threads
>
>  Computing:          Num   Num      Call    Wall time         Giga-Cycles
>                      Ranks Threads  Count      (s)         total sum    %
> -----------------------------------------------------------------------------
>  Neighbor search        1   32      25001     170.606      13073.577   1.5
>  Launch GPU ops.        1   32    1000001      97.251       7452.377   0.8
>  Force                  1   32    1000001    2462.595     188709.029  21.0
>  PME mesh               1   32    1000001    7214.132     552819.972  61.5
>  Wait GPU local         1   32    1000001      22.963       1759.683   0.2
>  NB X/F buffer ops.     1   32    1975001     303.888      23287.017   2.6
>  Write traj.            1   32       2190      41.970       3216.155   0.4
>  Update                 1   32    2000002     374.895      28728.243   3.2
>  Constraints            1   32    2000002     718.184      55034.545   6.1
>  Rest                                         315.793      24199.295   2.7
> -----------------------------------------------------------------------------
>  Total                                      11722.279     898279.893 100.0
> -----------------------------------------------------------------------------
>  Breakdown of PME mesh computation
> -----------------------------------------------------------------------------
>  PME spread/gather      1   32    4000004    5659.890     433718.207  48.3
>  PME 3D-FFT             1   32    4000004    1447.568     110927.319  12.3
>  PME solve Elec         1   32    2000002      85.838       6577.816   0.7
> -----------------------------------------------------------------------------
>
>  GPU timings
> -----------------------------------------------------------------------------
>  Computing:                         Count  Wall t (s)      ms/step       %
> -----------------------------------------------------------------------------
>  Pair list H2D                      25001      14.012        0.560     0.6
>  X / q H2D                        1000001     171.474        0.171     7.7
>  Nonbonded F kernel                970000    1852.997        1.910    82.8
>  Nonbonded F+ene k.                  5000      13.053        2.611     0.6
>  Nonbonded F+prune k.               20000      47.018        2.351     2.1
>  Nonbonded F+ene+prune k.            5001      15.825        3.164     0.7
>  F D2H                            1000001     124.521        0.125     5.6
> -----------------------------------------------------------------------------
>  Total                                       2238.898        2.239   100.0
> -----------------------------------------------------------------------------
>
> Force evaluation time GPU/CPU: 2.239 ms/9.677 ms = 0.231
> For optimal performance this ratio should be close to 1!
>
>
> NOTE: The GPU has >25% less load than the CPU. This imbalance causes
>       performance loss.
>
>                Core t (s)   Wall t (s)        (%)
>        Time:   374361.605    11722.279     3193.6
>                          3h15:22
>                  (ns/day)    (hour/ns)
> Performance:       14.741        1.628
> Finished mdrun on rank 0 Sun Jul  9 07:18:10 2017

