[gmx-users] GPU waits for CPU, any remedies?

Szilárd Páll pall.szilard at gmail.com
Tue Sep 16 18:52:53 CEST 2014


Well, it looks like you are i) unlucky and ii) limited by the huge bonded workload.

i) As your system is quite small, mdrun finds no convenient grids between
32x32x32 and 28x28x28 (see the PP-PME tuning output). Since the latter
corresponds to quite a big jump in cut-off (from 1.296 nm to 1.482 nm),
which increases the non-bonded workload considerably (the pair count scales
roughly with the cube of the cut-off) and is in fact slower than the
former, mdrun sticks to 1.296 nm as the Coulomb cut-off. You may be able to
gain some performance by tweaking your Fourier grid spacing a bit to help
mdrun generate additional grids, and hence more cut-off settings, in the
1.3-1.48 nm range. On second thought, though, I don't think there are more
convenient grid sizes between 28 and 32.
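
To put rough numbers on the cut-off scaling, here is a quick
back-of-the-envelope sketch in Python; the (grid, cut-off) pairs are copied
from the tuning output in your log, and the cost model is simply
pair count ~ cut-off^3, so treat the figures as estimates:

rc0 = 0.900  # original coulomb/vdw cut-off from the .mdp (nm)
# (pme grid, coulomb cut-off in nm) pairs reported by the PP-PME tuner
tuned = [(48, 0.900), (42, 0.988), (36, 1.152),
         (32, 1.296), (28, 1.482), (25, 1.659)]
for grid, rc in tuned:
    # relative short-range pair count vs. the original cut-off
    print("grid %2d^3  rc = %.3f nm  relative NB cost ~ %.1fx"
          % (grid, rc, (rc / rc0) ** 3))

The last step alone (32^3 -> 28^3) means roughly another 50% more pairs for
no gain in total time, which is why the tuner stops at 1.296 nm.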

ii) The primary issue, however, is that your bonded workload is much
higher than it normally is. I'm not fully familiar with the implementation,
but I think this may be due to the Ryckaert-Bellemans term, which is quite
slow. This time it is the flop accounting table that could confirm it, but
as you still have not shared the entire log file, I can't tell.
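
For reference, the RB form is V(psi) = sum_{n=0..5} C_n cos^n(psi) with
psi = phi - 180 deg, i.e. a fifth-order polynomial in cos(psi) per dihedral
(plus the matching force contribution), so a topology with many RB
dihedrals adds up to a lot of CPU work. A minimal sketch of just the energy
term (not the actual GROMACS kernel; the coefficients below are made up for
illustration):

import math

def rb_energy(phi_deg, C):
    # V(psi) = sum_{n=0}^{5} C_n * cos(psi)^n, with psi = phi - 180 deg
    psi = math.radians(phi_deg - 180.0)
    return sum(Cn * math.cos(psi) ** n for n, Cn in enumerate(C))

# purely illustrative coefficients (kJ/mol)
print(rb_energy(60.0, [9.28, 12.16, -13.12, -3.06, 26.24, -31.5]))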

Cheers,
--
Szilárd



On Tue, Sep 16, 2014 at 6:04 PM, Michael Brunsteiner <mbx0009 at yahoo.com> wrote:
>
> Szilárd Páll wrote:
>
>> The PP-PME load balancing done in the beginning of the run should
>> attempt to shift work from the CPU to GPU. The amount of performance
>> improvement this can bring is limited, but normally it should still do
>> its job and decrease the PME load.
>>
>> However, the PP-PME load balancing output, which could provide a clue
>> on why do you end up with CPU-GPU load imbalance is missing from your
>> post! Please post a full log file and not just parts that seem useful.
>
>
> see below ... as I wrote, I guess this is difficult, as only a small part
> of the CPU's workload actually goes into the reciprocal part of PME (as
> can be seen in my previous mail), so there is not much room for improving
> things by merely changing the cut-off and PME grid ... I was hoping that
> there are other ways to shift more of the work to the GPU ...
>
> cheers
> michael
>
>
> the md.log file ...
>
> GROMACS:      gmx mdrun, VERSION 5.0.1
> Executable:   /usr/local/gromacs/bin/gmx
> Library dir:  /usr/local/gromacs/share/gromacs/top
> Command line:
>   mdrun -v -s topol.tpr
>
> Gromacs version:    VERSION 5.0.1
> Precision:          single
> Memory model:       64 bit
> MPI library:        thread_mpi
> OpenMP support:     enabled
> GPU support:        enabled
> invsqrt routine:    gmx_software_invsqrt(x)
> SIMD instructions:  AVX_256
> FFT library:        fftw-3.3.3-sse2
> RDTSCP usage:       enabled
> C++11 compilation:  disabled
> TNG support:        enabled
> Tracing support:    disabled
> Built on:           Mon Sep 15 14:16:02 CEST 2014
> Built by:           root at rcpe-sbd-node01 [CMAKE]
> Build OS/arch:      Linux 3.14-2-amd64 x86_64
> Build CPU vendor:   GenuineIntel
> Build CPU brand:    Intel(R) Core(TM) i7-4930K CPU @ 3.40GHz
> Build CPU family:   6   Model: 62   Stepping: 4
> Build CPU features: aes apic avx clfsh cmov cx8 cx16 f16c htt lahf_lm mmx
> msr nonstop_tsc pcid pclmuldq pdcm pdpe1gb popcnt pse rdrnd rdtscp sse2 sse3
> sse4.1 sse4.2 ssse3 tdt x2apic
> C compiler:         /usr/bin/cc GNU 4.9.1
> C compiler flags:    -mavx   -Wno-maybe-uninitialized -Wextra
> -Wno-missing-field-initializers -Wno-sign-compare -Wpointer-arith -Wall
> -Wno-unused -Wunused-value -Wunused-parameter   -fomit-frame-pointer
> -funroll-all-loops -fexcess-precision=fast  -Wno-array-bounds  -O3 -DNDEBUG
> C++ compiler:       /usr/bin/c++ GNU 4.9.1
> C++ compiler flags:  -mavx   -Wextra -Wno-missing-field-initializers
> -Wpointer-arith -Wall -Wno-unused-function   -fomit-frame-pointer
> -funroll-all-loops -fexcess-precision=fast  -Wno-array-bounds  -O3 -DNDEBUG
> Boost version:      1.55.0 (external)
> CUDA compiler:      /usr/bin/nvcc nvcc: NVIDIA (R) Cuda compiler
> driver;Copyright (c) 2005-2013 NVIDIA Corporation;Built on
> Wed_Jul_17_18:36:13_PDT_2013;Cuda compilation tools, release 5.5, V5.5.0
> CUDA compiler
> flags:-gencode;arch=compute_20,code=sm_20;-gencode;arch=compute_20,code=sm_21;-gencode;arch=compute_30,code=sm_30;-gencode;arch=compute_35,code=sm_35;-gencode;arch=compute_35,code=compute_35;-use_fast_math;-Xcompiler;-fPIC
> ;
> ;-mavx;-Wextra;-Wno-missing-field-initializers;-Wpointer-arith;-Wall;-Wno-unused-function;-fomit-frame-pointer;-funroll-all-loops;-fexcess-precision=fast;-Wno-array-bounds;-O3;-DNDEBUG
> CUDA driver:        6.50
> CUDA runtime:       5.50
>
> Changing nstlist from 20 to 40, rlist from 0.9 to 0.921
>
> Input Parameters:
>    integrator                     = md
>    tinit                          = 0
>    dt                             = 0.002
>    nsteps                         = 10000
>    init-step                      = 0
>    simulation-part                = 1
>    comm-mode                      = Linear
>    nstcomm                        = 100
>    bd-fric                        = 0
>    ld-seed                        = 4224805756
>    emtol                          = 10
>    emstep                         = 0.01
>    niter                          = 20
>    fcstep                         = 0
>    nstcgsteep                     = 1000
>    nbfgscorr                      = 10
>    rtpi                           = 0.05
>    nstxout                        = 1000
>    nstvout                        = 0
>    nstfout                        = 0
>    nstlog                         = 1000
>    nstcalcenergy                  = 100
>    nstenergy                      = 1000
>    nstxout-compressed             = 0
>    compressed-x-precision         = 1000
>    cutoff-scheme                  = Verlet
>    nstlist                        = 40
>    ns-type                        = Grid
>    pbc                            = xyz
>    periodic-molecules             = FALSE
>    verlet-buffer-tolerance        = 0.005
>    rlist                          = 0.921
>    rlistlong                      = 0.921
>    nstcalclr                      = 20
>    coulombtype                    = PME
>    coulomb-modifier               = Potential-shift
>    rcoulomb-switch                = 0
>    rcoulomb                       = 0.9
>    epsilon-r                      = 1
>    epsilon-rf                     = inf
>    vdw-type                       = Cut-off
>    vdw-modifier                   = Potential-shift
>    rvdw-switch                    = 0
>    rvdw                           = 0.9
>    DispCorr                       = EnerPres
>    table-extension                = 1
>    fourierspacing                 = 0.12
>    fourier-nx                     = 48
>    fourier-ny                     = 48
>    fourier-nz                     = 48
>    pme-order                      = 4
>    ewald-rtol                     = 1e-05
>    ewald-rtol-lj                  = 0.001
>    lj-pme-comb-rule               = Geometric
>    ewald-geometry                 = 0
>    epsilon-surface                = 0
>    implicit-solvent               = No
>    gb-algorithm                   = Still
>    nstgbradii                     = 1
>    rgbradii                       = 1
>    gb-epsilon-solvent             = 80
>    gb-saltconc                    = 0
>    gb-obc-alpha                   = 1
>    gb-obc-beta                    = 0.8
>    gb-obc-gamma                   = 4.85
>    gb-dielectric-offset           = 0.009
>    sa-algorithm                   = Ace-approximation
>    sa-surface-tension             = 2.05016
>    tcoupl                         = Berendsen
>    nsttcouple                     = 20
>    nh-chain-length                = 0
>    print-nose-hoover-chain-variables = FALSE
>    pcoupl                         = Berendsen
>    pcoupltype                     = Isotropic
>    nstpcouple                     = 20
>    tau-p                          = 0.5
>    compressibility (3x3):
>       compressibility[    0]={ 1.00000e-05,  0.00000e+00,  0.00000e+00}
>       compressibility[    1]={ 0.00000e+00,  1.00000e-05,  0.00000e+00}
>       compressibility[    2]={ 0.00000e+00,  0.00000e+00,  1.00000e-05}
>    ref-p (3x3):
>       ref-p[    0]={ 1.00000e+00,  0.00000e+00,  0.00000e+00}
>       ref-p[    1]={ 0.00000e+00,  1.00000e+00,  0.00000e+00}
>       ref-p[    2]={ 0.00000e+00,  0.00000e+00,  1.00000e+00}
>    refcoord-scaling               = No
>    posres-com (3):
>       posres-com[0]= 0.00000e+00
>       posres-com[1]= 0.00000e+00
>       posres-com[2]= 0.00000e+00
>    posres-comB (3):
>       posres-comB[0]= 0.00000e+00
>       posres-comB[1]= 0.00000e+00
>       posres-comB[2]= 0.00000e+00
>    QMMM                           = FALSE
>    QMconstraints                  = 0
>    QMMMscheme                     = 0
>    MMChargeScaleFactor            = 1
> qm-opts:
>    ngQM                           = 0
>    constraint-algorithm           = Lincs
>    continuation                   = FALSE
>    Shake-SOR                      = FALSE
>    shake-tol                      = 0.0001
>    lincs-order                    = 4
>    lincs-iter                     = 1
>    lincs-warnangle                = 30
>    nwall                          = 0
>    wall-type                      = 9-3
>    wall-r-linpot                  = -1
>    wall-atomtype[0]               = -1
>    wall-atomtype[1]               = -1
>    wall-density[0]                = 0
>    wall-density[1]                = 0
>    wall-ewald-zfac                = 3
>    pull                           = no
>    rotation                       = FALSE
>    interactiveMD                  = FALSE
>    disre                          = No
>    disre-weighting                = Conservative
>    disre-mixed                    = FALSE
>    dr-fc                          = 1000
>    dr-tau                         = 0
>    nstdisreout                    = 100
>    orire-fc                       = 0
>    orire-tau                      = 0
>    nstorireout                    = 100
>    free-energy                    = no
>    cos-acceleration               = 0
>    deform (3x3):
>       deform[    0]={ 0.00000e+00,  0.00000e+00,  0.00000e+00}
>       deform[    1]={ 0.00000e+00,  0.00000e+00,  0.00000e+00}
>       deform[    2]={ 0.00000e+00,  0.00000e+00,  0.00000e+00}
>    simulated-tempering            = FALSE
>    E-x:
>       n = 0
>    E-xt:
>       n = 0
>    E-y:
>       n = 0
>    E-yt:
>       n = 0
>    E-z:
>       n = 0
>    E-zt:
>       n = 0
>    swapcoords                     = no
>    adress                         = FALSE
>    userint1                       = 0
>    userint2                       = 0
>    userint3                       = 0
>    userint4                       = 0
>    userreal1                      = 0
>    userreal2                      = 0
>    userreal3                      = 0
>    userreal4                      = 0
> grpopts:
>    nrdf:       42647
>    ref-t:         298
>    tau-t:         0.2
> annealing:          No
> annealing-npoints:           0
>    acc:            0           0           0
>    nfreeze:           N           N           N
>    energygrp-flags[  0]: 0
> Using 1 MPI thread
> Using 12 OpenMP threads
>
> Detecting CPU SIMD instructions.
> Present hardware specification:
> Vendor: GenuineIntel
> Brand:  Intel(R) Core(TM) i7-4930K CPU @ 3.40GHz
> Family:  6  Model: 62  Stepping:  4
> Features: aes apic avx clfsh cmov cx8 cx16 f16c htt lahf_lm mmx msr
> nonstop_tsc pcid pclmuldq pdcm pdpe1gb popcnt pse rdrnd rdtscp sse2 sse3
> sse4.1 sse4.2 ssse3 tdt x2apic
> SIMD instructions most likely to fit this hardware: AVX_256
> SIMD instructions selected at GROMACS compile time: AVX_256
>
> 1 GPU detected:
>   #0: NVIDIA GeForce GTX 780, compute cap.: 3.5, ECC:  no, stat: compatible
>
> 1 GPU auto-selected for this run.
> Mapping of GPU to the 1 PP rank in this node: #0
>
> Will do PME sum in reciprocal space for electrostatic interactions.
>
> ++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
> U. Essmann, L. Perera, M. L. Berkowitz, T. Darden, H. Lee and L. G. Pedersen
> A smooth particle mesh Ewald method
> J. Chem. Phys. 103 (1995) pp. 8577-8592
> -------- -------- --- Thank You --- -------- --------
>
> Will do ordinary reciprocal space Ewald sum.
> Using a Gaussian width (1/beta) of 0.288146 nm for Ewald
> Cut-off's:   NS: 0.921   Coulomb: 0.9   LJ: 0.9
> Long Range LJ corr.: <C6> 7.5684e-04
> System total charge: -0.000
> Generated table with 960 data points for Ewald.
> Tabscale = 500 points/nm
> Generated table with 960 data points for LJ6.
> Tabscale = 500 points/nm
> Generated table with 960 data points for LJ12.
> Tabscale = 500 points/nm
> Generated table with 960 data points for 1-4 COUL.
> Tabscale = 500 points/nm
> Generated table with 960 data points for 1-4 LJ6.
> Tabscale = 500 points/nm
> Generated table with 960 data points for 1-4 LJ12.
> Tabscale = 500 points/nm
>
> Using CUDA 8x8 non-bonded kernels
>
> Potential shift: LJ r^-12: -3.541e+00 r^-6: -1.882e+00, Ewald -1.000e-05
> Initialized non-bonded Ewald correction tables, spacing: 5.87e-04 size: 1536
>
> Removing pbc first time
> Pinning threads with an auto-selected logical core stride of 1
>
> Initializing LINear Constraint Solver
>
> The number of constraints is 9808
> Center of mass motion removal mode is Linear
> We have the following groups for center of mass motion removal:
>   0:  System
>
> There are: 17486 Atoms
>
> Constraining the starting coordinates (step 0)
>
> Constraining the coordinates at t0-dt (step 0)
> RMS relative constraint deviation after constraining: 2.70e-06
> Initial temperature: 240.59 K
>
> Started mdrun on rank 0 Mon Sep 15 14:42:54 2014
>            Step           Time         Lambda
>               0        0.00000        0.00000
>
>    Energies (kJ/mol)
>            Bond          Angle Ryckaert-Bell.          LJ-14     Coulomb-14
>     1.44497e+04    4.39736e+04    2.61328e+04    1.85762e+04   -2.40721e+04
>         LJ (SR)  Disper. corr.   Coulomb (SR)   Coul. recip.      Potential
>    -2.98504e+04   -7.84599e+03   -2.85147e+04    3.75691e+03    1.66059e+04
>     Kinetic En.   Total Energy    Temperature Pres. DC (bar) Pressure (bar)
>     4.67109e+04    6.33168e+04    2.63465e+02   -7.71214e+02   -1.70587e+02
>    Constr. rmsd
>     3.47554e-06
>
> step   80: timed with pme grid 48 48 48, coulomb cutoff 0.900: 412.7 M-cycles
> step  160: timed with pme grid 42 42 42, coulomb cutoff 0.988: 397.0 M-cycles
> step  240: timed with pme grid 36 36 36, coulomb cutoff 1.152: 363.9 M-cycles
> step  320: timed with pme grid 32 32 32, coulomb cutoff 1.296: 346.9 M-cycles
> step  400: timed with pme grid 28 28 28, coulomb cutoff 1.482: 349.9 M-cycles
> step  480: timed with pme grid 25 25 25, coulomb cutoff 1.659: 348.8 M-cycles
> step  560: timed with pme grid 20 20 20, coulomb cutoff 2.074: 482.5 M-cycles
> step  640: timed with pme grid 40 40 40, coulomb cutoff 1.037: 370.3 M-cycles
> step  720: timed with pme grid 36 36 36, coulomb cutoff 1.152: 363.8 M-cycles
> step  800: timed with pme grid 32 32 32, coulomb cutoff 1.296: 346.3 M-cycles
> step  880: timed with pme grid 28 28 28, coulomb cutoff 1.482: 360.2 M-cycles
> step  960: timed with pme grid 25 25 25, coulomb cutoff 1.659: 347.4 M-cycles
>            Step           Time         Lambda
>            1000        2.00000        0.00000
>
>    Energies (kJ/mol)
>            Bond          Angle Ryckaert-Bell.          LJ-14     Coulomb-14
>     1.45899e+04    4.39271e+04    2.57760e+04    1.85182e+04   -2.40151e+04
>         LJ (SR)  Disper. corr.   Coulomb (SR)   Coul. recip.      Potential
>    -9.43624e+03   -7.92188e+03   -2.51041e+04    3.21161e+02    3.66551e+04
>     Kinetic En.   Total Energy    Temperature Pres. DC (bar) Pressure (bar)
>     5.30986e+04    8.97537e+04    2.99494e+02   -7.86181e+02   -1.15836e+03
>    Constr. rmsd
>     3.59579e-06
>
> step 1040: timed with pme grid 24 24 24, coulomb cutoff 1.728: 366.0 M-cycles
>               optimal pme grid 32 32 32, coulomb cutoff 1.296
>            Step           Time         Lambda
>            2000        4.00000        0.00000
>
>    Energies (kJ/mol)
>            Bond          Angle Ryckaert-Bell.          LJ-14     Coulomb-14
>     1.47786e+04    4.39581e+04    2.59885e+04    1.85884e+04   -2.40729e+04
>         LJ (SR)  Disper. corr.   Coulomb (SR)   Coul. recip.      Potential
>    -2.51579e+04   -7.93781e+03   -2.56234e+04    9.03152e+02    2.14247e+04
>     Kinetic En.   Total Energy    Temperature Pres. DC (bar) Pressure (bar)
>     5.29243e+04    7.43490e+04    2.98510e+02   -7.89341e+02   -3.46741e+02
>    Constr. rmsd
>     3.55274e-06
>
>            Step           Time         Lambda
>            3000        6.00000        0.00000
>
> etc ...
>
>
>
>
>
> On Tue, Sep 16, 2014 at 3:19 PM, Michael Brunsteiner <mbx0009 at yahoo.com> wrote:
>>
>>
>> hi,
>>
>> Testing a new computer we just got, I found that for the system I use,
>> performance is sub-optimal: the GPU appears to be about 50% faster than
>> the CPU (see below for details). The dynamic load balancing that is
>> performed automatically at the beginning of each simulation does not
>> seem to improve things much, giving, for example:
>>
>> Force evaluation time GPU/CPU: 1.198 ms/2.156 ms = 0.556
>>
>> I guess this is so because only 15% of the CPU load is used for the
>> PME mesh, and the rest for something else (are these the bonded forces??)
>>
>> if I make the initial rcoulomb in the mdp file larger, the load balance
>> improves to a value closer to 1, e.g.:
>>
>> Force evaluation time GPU/CPU: 2.720 ms/2.502 ms = 1.087
>>
>> but the overall performance gets, in fact, worse ...
>>
>> any suggestions ?? (mdp file included at the bottom of this mail)
>>
>> thanks,
>> michael
>>
>>
>>
>> the timing:
>>
>>
>>  Computing:            Num    Num     Call    Wall time    Giga-Cycles
>>                       Ranks Threads  Count      (s)       total sum     %
>> -----------------------------------------------------------------------------
>>  Neighbor search          1     12      251      0.574        23.403   2.1
>>  Launch GPU ops.          1     12    10001      0.627        25.569   2.3
>>  Force                    1     12    10001     17.392       709.604  64.5
>>  PME mesh                 1     12    10001      4.172       170.234  15.5
>>  Wait GPU local           1     12    10001      0.206         8.401   0.8
>>  NB X/F buffer ops.       1     12    19751      0.239         9.736   0.9
>>  Write traj.              1     12       11      0.381        15.554   1.4
>>  Update                   1     12    10001      0.303        12.365   1.1
>>  Constraints              1     12    10001      1.458        59.489   5.4
>>  Rest                                            1.621        66.139   6.0
>> -----------------------------------------------------------------------------
>>  Total                                          26.973      1100.493 100.0
>> -----------------------------------------------------------------------------
>>  Breakdown of PME mesh computation
>> -----------------------------------------------------------------------------
>>  PME spread/gather        1     12    20002      3.319       135.423  12.3
>>  PME 3D-FFT               1     12    20002      0.616        25.138   2.3
>>  PME solve Elec           1     12    10001      0.198         8.066   0.7
>> -----------------------------------------------------------------------------
>>
>>  GPU timings
>> -----------------------------------------------------------------------------
>>  Computing:                          Count  Wall t (s)    ms/step       %
>> -----------------------------------------------------------------------------
>>  Pair list H2D                         251       0.036      0.144     0.3
>>  X / q H2D                           10001       0.317      0.032     2.6
>>  Nonbonded F kernel                   9500      10.492      1.104    87.6
>>  Nonbonded F+ene k.                    250       0.404      1.617     3.4
>>  Nonbonded F+ene+prune k.              251       0.476      1.898     4.0
>>  F D2H                               10001       0.258      0.026     2.2
>> -----------------------------------------------------------------------------
>>  Total                                          11.984      1.198   100.0
>> -----------------------------------------------------------------------------
>>
>> Force evaluation time GPU/CPU: 1.198 ms/2.156 ms = 0.556
>> For optimal performance this ratio should be close to 1!
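>>
>> (As far as I can tell, the 0.556 above is simply the per-step GPU
>> non-bonded time divided by the per-step CPU force time, i.e. Force + PME
>> mesh; a quick check in Python with the numbers from the tables above:
>>
>> steps  = 10001
>> cpu_ms = 1e3 * (17.392 + 4.172) / steps  # Force + PME mesh -> ~2.156 ms/step
>> gpu_ms = 1e3 * 11.984 / steps            # GPU total        -> ~1.198 ms/step
>> print(gpu_ms / cpu_ms)                   # ~0.556
>>
>> so most of the CPU-side time sits in the "Force" row rather than in the
>> PME mesh.)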
>>
>>
>>
>> md.mdp:
>> integrator              = md
>> dt                      = 0.002
>> nsteps                  = 10000
>> comm-grps                = System
>> ;
>> nstxout                  = 1000
>> nstvout                  = 0
>> nstfout                  = 0
>> nstlog                  = 1000
>> nstenergy                = 1000
>> ;
>> nstlist                  = 20
>> ns_type                  = grid
>> pbc                      = xyz
>> rlist                    = 1.1
>> cutoff-scheme            = Verlet
>> ;
>> coulombtype              = PME
>> rcoulomb                = 0.9
>> vdw_type                = cut-off
>> rvdw                    = 0.9
>> DispCorr                = EnerPres
>> ;
>> tcoupl                  = Berendsen
>> tc-grps                  = System
>> tau_t                    = 0.2
>> ref_t                    = 298.0
>> ;
>> gen-vel                  = yes
>> gen-temp                = 240.0
>> gen-seed                = -1
>> continuation            = no
>> ;
>> Pcoupl                  = berendsen
>> Pcoupltype              = isotropic
>> tau_p                    = 0.5
>> compressibility          = 1.0e-5
>> ref_p                    = 1.0
>> ;
>> constraints              = hbonds
>
>

