[gmx-developers] free energies on GPUs?

Igor Leontyev ileontyev at ucdavis.edu
Wed Feb 22 17:54:31 CET 2017


>
> What CPU vs GPU time per step gets reported at the end of the log
> file?

Thank you, Berk, for the prompt response. Here is my log file; it provides
all the details.

=================================================
Host: compute-0-113.local  pid: 12081  rank ID: 0  number of ranks:  1
                       :-) GROMACS - gmx mdrun, 2016.2 (-:

                             GROMACS is written by:
...........................................................

GROMACS:      gmx mdrun, version 2016.2
Executable:   /home/leontyev/programs/bin/gromacs/gromacs-2016.2/bin/gmx_avx2_gpu
Data prefix:  /home/leontyev/programs/bin/gromacs/gromacs-2016.2
Working dir:  /share/COMMON2/MDRUNS/GROMACS/MUTATIONS/PROTEINS/coc-Flu_A-B_LIGs/MDRUNS/InP/fluA/Output_test/6829_6818_9/Gromacs.571690
Command line:
   gmx_avx2_gpu mdrun -nb gpu -gpu_id 3 -pin on -nt 8 -s 6829_6818-liq_0.tpr -e /state/partition1/Gromacs.571690.0//6829_6818-liq_0.edr -dhdl /state/partition1/Gromacs.571690.0//6829_6818-liq_0.xvg -o /state/partition1/Gromacs.571690.0//6829_6818-liq_0.trr -x /state/partition1/Gromacs.571690.0//6829_6818-liq_0.xtc -cpo /state/partition1/Gromacs.571690.0//6829_6818-liq_0.cpt -c 6829_6818-liq_0.gro -g 6829_6818-liq_0.log

GROMACS version:    2016.2
Precision:          single
Memory model:       64 bit
MPI library:        thread_mpi
OpenMP support:     enabled (GMX_OPENMP_MAX_THREADS = 32)
GPU support:        CUDA
SIMD instructions:  AVX2_256
FFT library:        fftw-3.3.4-sse2-avx
RDTSCP usage:       enabled
TNG support:        enabled
Hwloc support:      disabled
Tracing support:    disabled
Built on:           Mon Feb 20 18:26:54 PST 2017
Built by:           leontyev at cluster01.interxinc.com [CMAKE]
Build OS/arch:      Linux 2.6.32-642.el6.x86_64 x86_64
Build CPU vendor:   Intel
Build CPU brand:    Intel(R) Xeon(R) CPU E5-1620 0 @ 3.60GHz
Build CPU family:   6   Model: 45   Stepping: 7
Build CPU features: aes apic avx clfsh cmov cx8 cx16 htt lahf mmx msr 
nonstop_tsc pcid pclmuldq pdcm pdpe1gb popcnt pse rdtscp sse2 sse3 
sse4.1 sse4.2 ssse3 tdt x2apic
C compiler:         /share/apps/devtoolset-1.1/root/usr/bin/gcc GNU 4.7.2
C compiler flags:    -march=core-avx2   -static-libgcc -static-libstdc++   -O3 -DNDEBUG -funroll-all-loops -fexcess-precision=fast
C++ compiler:       /share/apps/devtoolset-1.1/root/usr/bin/g++ GNU 4.7.2
C++ compiler flags:  -march=core-avx2    -std=c++0x   -O3 -DNDEBUG -funroll-all-loops -fexcess-precision=fast
CUDA compiler:      /share/apps/cuda-8.0/bin/nvcc nvcc: NVIDIA (R) Cuda compiler driver;Copyright (c) 2005-2016 NVIDIA Corporation;Built on Sun_Sep__4_22:14:01_CDT_2016;Cuda compilation tools, release 8.0, V8.0.44
CUDA compiler flags:-gencode;arch=compute_35,code=sm_35;-gencode;arch=compute_52,code=sm_52;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_35,code=compute_35;-gencode;arch=compute_52,code=compute_52;-gencode;arch=compute_61,code=compute_61;-use_fast_math;;;-Xcompiler;,-march=core-avx2,,,,,,;-Xcompiler;-O3,-DNDEBUG,-funroll-all-loops,-fexcess-precision=fast,,;

CUDA driver:        8.0
CUDA runtime:       8.0


Running on 1 node with total 24 cores, 24 logical cores, 4 compatible GPUs
Hardware detected:
   CPU info:
     Vendor: Intel
     Brand:  Intel(R) Xeon(R) CPU E5-2670 v3 @ 2.30GHz
     Family: 6   Model: 63   Stepping: 2
     Features: aes apic avx avx2 clfsh cmov cx8 cx16 f16c fma htt lahf 
mmx msr nonstop_tsc pcid pclmuldq pdcm pdpe1gb popcnt pse rdrnd rdtscp 
sse2 sse3 sse4.1 sse4.2 ssse3 tdt x2apic
     SIMD instructions most likely to fit this hardware: AVX2_256
     SIMD instructions selected at GROMACS compile time: AVX2_256

   Hardware topology: Basic
     Sockets, cores, and logical processors:
       Socket  0: [   0] [   1] [   2] [   3] [   4] [   5] [   6] [   7] [   8] [   9] [  10] [  11]
       Socket  1: [  12] [  13] [  14] [  15] [  16] [  17] [  18] [  19] [  20] [  21] [  22] [  23]
   GPU info:
     Number of GPUs detected: 4
     #0: NVIDIA GeForce GTX 1080, compute cap.: 6.1, ECC:  no, stat: compatible
     #1: NVIDIA GeForce GTX 1080, compute cap.: 6.1, ECC:  no, stat: compatible
     #2: NVIDIA GeForce GTX 1080, compute cap.: 6.1, ECC:  no, stat: compatible
     #3: NVIDIA GeForce GTX 1080, compute cap.: 6.1, ECC:  no, stat: compatible


For optimal performance with a GPU nstlist (now 10) should be larger.
The optimum depends on your CPU and GPU resources.
You might want to try several nstlist values.
Changing nstlist from 10 to 40, rlist from 0.9 to 0.932

Input Parameters:
    integrator                     = sd
    tinit                          = 0
    dt                             = 0.001
    nsteps                         = 10000
    init-step                      = 0
    simulation-part                = 1
    comm-mode                      = Linear
    nstcomm                        = 100
    bd-fric                        = 0
    ld-seed                        = 1103660843
    emtol                          = 10
    emstep                         = 0.01
    niter                          = 20
    fcstep                         = 0
    nstcgsteep                     = 1000
    nbfgscorr                      = 10
    rtpi                           = 0.05
    nstxout                        = 10000000
    nstvout                        = 10000000
    nstfout                        = 0
    nstlog                         = 20000
    nstcalcenergy                  = 100
    nstenergy                      = 1000
    nstxout-compressed             = 5000
    compressed-x-precision         = 1000
    cutoff-scheme                  = Verlet
    nstlist                        = 40
    ns-type                        = Grid
    pbc                            = xyz
    periodic-molecules             = false
    verlet-buffer-tolerance        = 0.005
    rlist                          = 0.932
    coulombtype                    = PME
    coulomb-modifier               = Potential-shift
    rcoulomb-switch                = 0.9
    rcoulomb                       = 0.9
    epsilon-r                      = 1
    epsilon-rf                     = inf
    vdw-type                       = Cut-off
    vdw-modifier                   = Potential-shift
    rvdw-switch                    = 0.9
    rvdw                           = 0.9
    DispCorr                       = EnerPres
    table-extension                = 1
    fourierspacing                 = 0.12
    fourier-nx                     = 42
    fourier-ny                     = 42
    fourier-nz                     = 40
    pme-order                      = 6
    ewald-rtol                     = 1e-05
    ewald-rtol-lj                  = 0.001
    lj-pme-comb-rule               = Geometric
    ewald-geometry                 = 0
    epsilon-surface                = 0
    implicit-solvent               = No
    gb-algorithm                   = Still
    nstgbradii                     = 1
    rgbradii                       = 1
    gb-epsilon-solvent             = 80
    gb-saltconc                    = 0
    gb-obc-alpha                   = 1
    gb-obc-beta                    = 0.8
    gb-obc-gamma                   = 4.85
    gb-dielectric-offset           = 0.009
    sa-algorithm                   = Ace-approximation
    sa-surface-tension             = 2.05016
    tcoupl                         = No
    nsttcouple                     = 5
    nh-chain-length                = 0
    print-nose-hoover-chain-variables = false
    pcoupl                         = Parrinello-Rahman
    pcoupltype                     = Isotropic
    nstpcouple                     = 5
    tau-p                          = 0.5
    compressibility (3x3):
       compressibility[    0]={ 5.00000e-05,  0.00000e+00,  0.00000e+00}
       compressibility[    1]={ 0.00000e+00,  5.00000e-05,  0.00000e+00}
       compressibility[    2]={ 0.00000e+00,  0.00000e+00,  5.00000e-05}
    ref-p (3x3):
       ref-p[    0]={ 1.01325e+00,  0.00000e+00,  0.00000e+00}
       ref-p[    1]={ 0.00000e+00,  1.01325e+00,  0.00000e+00}
       ref-p[    2]={ 0.00000e+00,  0.00000e+00,  1.01325e+00}
    refcoord-scaling               = All
    posres-com (3):
       posres-com[0]= 0.00000e+00
       posres-com[1]= 0.00000e+00
       posres-com[2]= 0.00000e+00
    posres-comB (3):
       posres-comB[0]= 0.00000e+00
       posres-comB[1]= 0.00000e+00
       posres-comB[2]= 0.00000e+00
    QMMM                           = false
    QMconstraints                  = 0
    QMMMscheme                     = 0
    MMChargeScaleFactor            = 1
qm-opts:
    ngQM                           = 0
    constraint-algorithm           = Lincs
    continuation                   = false
    Shake-SOR                      = false
    shake-tol                      = 0.0001
    lincs-order                    = 12
    lincs-iter                     = 1
    lincs-warnangle                = 30
    nwall                          = 0
    wall-type                      = 9-3
    wall-r-linpot                  = -1
    wall-atomtype[0]               = -1
    wall-atomtype[1]               = -1
    wall-density[0]                = 0
    wall-density[1]                = 0
    wall-ewald-zfac                = 3
    pull                           = false
    rotation                       = false
    interactiveMD                  = false
    disre                          = No
    disre-weighting                = Conservative
    disre-mixed                    = false
    dr-fc                          = 1000
    dr-tau                         = 0
    nstdisreout                    = 100
    orire-fc                       = 0
    orire-tau                      = 0
    nstorireout                    = 100
    free-energy                    = yes
    init-lambda                    = -1
    init-lambda-state              = 0
    delta-lambda                   = 0
    nstdhdl                        = 100
    n-lambdas                      = 13
    separate-dvdl:
        fep-lambdas =   FALSE
       mass-lambdas =   FALSE
       coul-lambdas =   TRUE
        vdw-lambdas =   TRUE
     bonded-lambdas =   TRUE
  restraint-lambdas =   FALSE
temperature-lambdas =   FALSE
all-lambdas:
        fep-lambdas =      0     0     0     0     0     0     0     0     0     0     0     0     0
       mass-lambdas =      0     0     0     0     0     0     0     0     0     0     0     0     0
       coul-lambdas =      0  0.03   0.1   0.2   0.3   0.4   0.5   0.6   0.7   0.8   0.9  0.97     1
        vdw-lambdas =      0  0.03   0.1   0.2   0.3   0.4   0.5   0.6   0.7   0.8   0.9  0.97     1
     bonded-lambdas =      0  0.03   0.1   0.2   0.3   0.4   0.5   0.6   0.7   0.8   0.9  0.97     1
  restraint-lambdas =      0     0     0     0     0     0     0     0     0     0     0     0     0
temperature-lambdas =      0     0     0     0     0     0     0     0     0     0     0     0     0
    calc-lambda-neighbors          = -1
    dhdl-print-energy              = potential
    sc-alpha                       = 0.1
    sc-power                       = 1
    sc-r-power                     = 6
    sc-sigma                       = 0.3
    sc-sigma-min                   = 0.3
    sc-coul                        = true
    dh-hist-size                   = 0
    dh-hist-spacing                = 0.1
    separate-dhdl-file             = yes
    dhdl-derivatives               = yes
    cos-acceleration               = 0
    deform (3x3):
       deform[    0]={ 0.00000e+00,  0.00000e+00,  0.00000e+00}
       deform[    1]={ 0.00000e+00,  0.00000e+00,  0.00000e+00}
       deform[    2]={ 0.00000e+00,  0.00000e+00,  0.00000e+00}
    simulated-tempering            = false
    E-x:
       n = 0
    E-xt:
       n = 0
    E-y:
       n = 0
    E-yt:
       n = 0
    E-z:
       n = 0
    E-zt:
       n = 0
    swapcoords                     = no
    userint1                       = 0
    userint2                       = 0
    userint3                       = 0
    userint4                       = 0
    userreal1                      = 0
    userreal2                      = 0
    userreal3                      = 0
    userreal4                      = 0
grpopts:
    nrdf:     6332.24     62.9925     18705.8
    ref-t:      298.15      298.15      298.15
    tau-t:           1           1           1
annealing:          No          No          No
annealing-npoints:           0           0           0
    acc:	           0           0           0
    nfreeze:           N           N           N
    energygrp-flags[  0]: 0

Using 1 MPI thread
Using 8 OpenMP threads

1 GPU user-selected for this run.
Mapping of GPU ID to the 1 PP rank in this node: 3

Will do PME sum in reciprocal space for electrostatic interactions.

++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
U. Essmann, L. Perera, M. L. Berkowitz, T. Darden, H. Lee and L. G. 
Pedersen
A smooth particle mesh Ewald method
J. Chem. Phys. 103 (1995) pp. 8577-8592
-------- -------- --- Thank You --- -------- --------

Will do ordinary reciprocal space Ewald sum.
Using a Gaussian width (1/beta) of 0.288146 nm for Ewald
Cut-off's:   NS: 0.932   Coulomb: 0.9   LJ: 0.9
Long Range LJ corr.: <C6> 3.6183e-04
System total charge, top. A: 7.000 top. B: 7.000
Generated table with 965 data points for Ewald.
Tabscale = 500 points/nm
Generated table with 965 data points for LJ6.
Tabscale = 500 points/nm
Generated table with 965 data points for LJ12.
Tabscale = 500 points/nm
Generated table with 965 data points for 1-4 COUL.
Tabscale = 500 points/nm
Generated table with 965 data points for 1-4 LJ6.
Tabscale = 500 points/nm
Generated table with 965 data points for 1-4 LJ12.
Tabscale = 500 points/nm
Potential shift: LJ r^-12: -3.541e+00 r^-6: -1.882e+00, Ewald -1.000e-05
Initialized non-bonded Ewald correction tables, spacing: 8.85e-04 size: 1018


Using GPU 8x8 non-bonded kernels

Using Lorentz-Berthelot Lennard-Jones combination rule

There are 21 atoms and 21 charges for free energy perturbation
Removing pbc first time
Pinning threads with an auto-selected logical core stride of 1

++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
S. Miyamoto and P. A. Kollman
SETTLE: An Analytical Version of the SHAKE and RATTLE Algorithms for Rigid
Water Models
J. Comp. Chem. 13 (1992) pp. 952-962
-------- -------- --- Thank You --- -------- --------

Intra-simulation communication will occur every 5 steps.
Initial vector of lambda components:[     0.0000     0.0000     0.0000     0.0000     0.0000     0.0000     0.0000 ]
Center of mass motion removal mode is Linear
We have the following groups for center of mass motion removal:
   0:  rest

++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
N. Goga and A. J. Rzepiela and A. H. de Vries and S. J. Marrink and H. J. C.
Berendsen
Efficient Algorithms for Langevin and DPD Dynamics
J. Chem. Theory Comput. 8 (2012) pp. 3637--3649
-------- -------- --- Thank You --- -------- --------

There are: 11486 Atoms

Constraining the starting coordinates (step 0)

Constraining the coordinates at t0-dt (step 0)
RMS relative constraint deviation after constraining: 0.00e+00
Initial temperature: 291.365 K

Started mdrun on rank 0 Wed Feb 22 02:11:02 2017
            Step           Time
               0        0.00000

    Energies (kJ/mol)
            Bond          Angle    Proper Dih. Ryckaert-Bell.  Improper Dih.
     2.99018e+03    4.09043e+03    5.20416e+03    4.32600e+01    2.38045e+02
           LJ-14     Coulomb-14        LJ (SR)  Disper. corr.   Coulomb (SR)
     2.04778e+03    1.45523e+04    1.59846e+04   -2.41317e+03   -1.92125e+05
    Coul. recip. Position Rest.      Potential    Kinetic En.   Total Energy
     1.58368e+03    2.08367e-09   -1.47804e+05    3.03783e+04   -1.17425e+05
     Temperature Pres. DC (bar) Pressure (bar)      dVcoul/dl       dVvdw/dl
     2.91118e+02   -3.53694e+02   -3.01252e+02    4.77627e+02    1.41810e+01
     dVbonded/dl
    -2.15074e+01

step   80: timed with pme grid 42 42 40, coulomb cutoff 0.900: 391.6 M-cycles
step  160: timed with pme grid 36 36 36, coulomb cutoff 1.043: 595.7 M-cycles
step  240: timed with pme grid 40 36 36, coulomb cutoff 1.022: 401.1 M-cycles
step  320: timed with pme grid 40 40 36, coulomb cutoff 0.963: 318.8 M-cycles
step  400: timed with pme grid 40 40 40, coulomb cutoff 0.938: 349.9 M-cycles
step  480: timed with pme grid 42 40 40, coulomb cutoff 0.920: 319.9 M-cycles
               optimal pme grid 40 40 36, coulomb cutoff 0.963
            Step           Time
           10000       10.00000

Writing checkpoint, step 10000 at Wed Feb 22 02:11:41 2017


    Energies (kJ/mol)
            Bond          Angle    Proper Dih. Ryckaert-Bell.  Improper Dih.
     2.99123e+03    4.14451e+03    5.19572e+03    2.56045e+01    2.74109e+02
           LJ-14     Coulomb-14        LJ (SR)  Disper. corr.   Coulomb (SR)
     2.01371e+03    1.45326e+04    1.55974e+04   -2.43903e+03   -1.88805e+05
    Coul. recip. Position Rest.      Potential    Kinetic En.   Total Energy
     1.26353e+03    7.39689e+01   -1.45132e+05    3.14390e+04   -1.13693e+05
     Temperature Pres. DC (bar) Pressure (bar)      dVcoul/dl       dVvdw/dl
     3.01283e+02   -3.61306e+02    1.35461e+02    3.46732e+02    1.03533e+01
     dVbonded/dl
    -1.08537e+01

	<======  ###############  ==>
	<====  A V E R A G E S  ====>
	<==  ###############  ======>

	Statistics over 10001 steps using 101 frames

    Energies (kJ/mol)
            Bond          Angle    Proper Dih. Ryckaert-Bell.  Improper Dih.
     3.01465e+03    4.25438e+03    5.23249e+03    3.47157e+01    2.59375e+02
           LJ-14     Coulomb-14        LJ (SR)  Disper. corr.   Coulomb (SR)
     2.02486e+03    1.45795e+04    1.58085e+04   -2.42589e+03   -1.89788e+05
    Coul. recip. Position Rest.      Potential    Kinetic En.   Total Energy
     1.28411e+03    6.08802e+01   -1.45660e+05    3.09346e+04   -1.14726e+05
     Temperature Pres. DC (bar) Pressure (bar)      dVcoul/dl       dVvdw/dl
     2.96448e+02   -3.57435e+02    3.32252e+01    4.36060e+02    1.77368e+01
     dVbonded/dl
    -1.82384e+01

           Box-X          Box-Y          Box-Z
     4.99607e+00    4.89654e+00    4.61444e+00

    Total Virial (kJ/mol)
     1.00345e+04    5.03211e+01   -1.17351e+02
     4.69630e+01    1.04021e+04    1.73033e+02
    -1.16637e+02    1.75781e+02    1.01673e+04

    Pressure (bar)
     7.67740e+01   -1.32678e+01    3.58518e+01
    -1.22810e+01   -2.15571e+01   -5.79828e+01
     3.56420e+01   -5.87931e+01    4.44585e+01

       T-Protein          T-LIG          T-SOL
     2.98707e+02    2.97436e+02    2.95680e+02


        P P   -   P M E   L O A D   B A L A N C I N G

  PP/PME load balancing changed the cut-off and PME settings:
            particle-particle                    PME
             rcoulomb  rlist            grid      spacing   1/beta
    initial  0.900 nm  0.932 nm      42  42  40   0.119 nm  0.288 nm
    final    0.963 nm  0.995 nm      40  40  36   0.128 nm  0.308 nm
  cost-ratio           1.22             0.82
  (note that these numbers concern only part of the total PP and PME load)


	M E G A - F L O P S   A C C O U N T I N G

  NB=Group-cutoff nonbonded kernels    NxN=N-by-N cluster Verlet kernels
  RF=Reaction-Field  VdW=Van der Waals  QSTab=quadratic-spline table
  W3=SPC/TIP3p  W4=TIP4p (single or pairs)
  V&F=Potential and force  V=Potential only  F=Force only

  Computing:                               M-Number         M-Flops  % Flops
-----------------------------------------------------------------------------
  NB Free energy kernel                20441.549154       20441.549     0.3
  Pair Search distance check             289.750448        2607.754     0.0
  NxN Ewald Elec. + LJ [F]             78217.065728     5162326.338    85.6
  NxN Ewald Elec. + LJ [V&F]             798.216192       85409.133     1.4
  1,4 nonbonded interactions              55.597769        5003.799     0.1
  Calc Weights                           344.614458       12406.120     0.2
  Spread Q Bspline                     49624.481952       99248.964     1.6
  Gather F Bspline                     49624.481952      297746.892     4.9
  3D-FFT                               36508.030372      292064.243     4.8
  Solve PME                               31.968000        2045.952     0.0
  Shift-X                                  2.882986          17.298     0.0
  Bonds                                   21.487804        1267.780     0.0
  Angles                                  38.645175        6492.389     0.1
  Propers                                 58.750116       13453.777     0.2
  Impropers                                4.270427         888.249     0.0
  RB-Dihedrals                             0.445700         110.088     0.0
  Pos. Restr.                              0.900090          45.005     0.0
  Virial                                  23.073531         415.324     0.0
  Update                                 114.871486        3561.016     0.1
  Stop-CM                                  1.171572          11.716     0.0
  Calc-Ekin                               45.966972        1241.108     0.0
  Constraint-V                           187.108062        1496.864     0.0
  Constraint-Vir                          18.717354         449.216     0.0
  Settle                                  62.372472       20146.308     0.3
-----------------------------------------------------------------------------
  Total                                                 6028896.883   100.0
-----------------------------------------------------------------------------


      R E A L   C Y C L E   A N D   T I M E   A C C O U N T I N G

On 1 MPI rank, each using 8 OpenMP threads

  Computing:          Num   Num      Call    Wall time         Giga-Cycles
                      Ranks Threads  Count      (s)         total sum    %
-----------------------------------------------------------------------------
  Neighbor search        1    8        251       0.530          9.754   1.4
  Launch GPU ops.        1    8      10001       0.509          9.357   1.3
  Force                  1    8      10001      10.634        195.662  27.3
  PME mesh               1    8      10001      22.173        407.991  57.0
  Wait GPU local         1    8      10001       0.073          1.338   0.2
  NB X/F buffer ops.     1    8      19751       0.255          4.690   0.7
  Write traj.            1    8          3       0.195          3.587   0.5
  Update                 1    8      20002       1.038         19.093   2.7
  Constraints            1    8      20002       0.374          6.887   1.0
  Rest                                           3.126         57.513   8.0
-----------------------------------------------------------------------------
  Total                                         38.906        715.871 100.0
-----------------------------------------------------------------------------
  Breakdown of PME mesh computation
-----------------------------------------------------------------------------
  PME spread/gather      1    8      40004      19.289        354.929  49.6
  PME 3D-FFT             1    8      40004       2.319         42.665   6.0
  PME solve Elec         1    8      20002       0.518          9.538   1.3
-----------------------------------------------------------------------------

  GPU timings
-----------------------------------------------------------------------------
  Computing:                         Count  Wall t (s)      ms/step       %
-----------------------------------------------------------------------------
  Pair list H2D                        251       0.023        0.090     1.1
  X / q H2D                          10001       0.269        0.027    12.5
  Nonbonded F kernel                  9700       1.615        0.166    75.0
  Nonbonded F+ene k.                    50       0.014        0.273     0.6
  Nonbonded F+prune k.                 200       0.039        0.196     1.8
  Nonbonded F+ene+prune k.              51       0.016        0.323     0.8
  F D2H                              10001       0.177        0.018     8.2
-----------------------------------------------------------------------------
  Total                                          2.153        0.215   100.0
-----------------------------------------------------------------------------

Average per-step force GPU/CPU evaluation time ratio: 0.215 ms/3.280 ms = 0.066
For optimal performance this ratio should be close to 1!


NOTE: The GPU has >25% less load than the CPU. This imbalance causes
       performance loss.

                Core t (s)   Wall t (s)        (%)
        Time:      311.246       38.906      800.0
                  (ns/day)    (hour/ns)
Performance:       22.210        1.081
=================================================
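To put the log's imbalance note in numbers: the cycle accounting shows the CPU spending 10.6 s on force work (which includes the CPU free-energy kernel) and 22.2 s on the PME mesh over 10001 steps, while the GPU needs only 2.2 s in total. Below is a minimal Python sketch of this back-of-the-envelope arithmetic; all timings are copied from the log above, and the "if PME left the CPU" figure is a rough hypothetical upper bound, not something the 2016 code can actually do.

```python
# Back-of-the-envelope load-balance check; all timings copied from the
# cycle accounting above (10001 steps on 8 OpenMP threads + 1 GTX 1080).
total_wall = 38.906   # s, total wall time
cpu_force  = 10.634   # s, "Force" (includes the CPU free-energy kernel)
cpu_pme    = 22.173   # s, "PME mesh"
gpu_total  =  2.153   # s, GPU nonbonded work, overlapped with CPU work
steps      = 10001

cpu_ms = (cpu_force + cpu_pme) / steps * 1e3   # ms per step on the CPU
gpu_ms = gpu_total / steps * 1e3               # ms per step on the GPU

print(f"CPU force+PME: {cpu_ms:.3f} ms/step")  # ~3.280 ms/step
print(f"GPU nonbonded: {gpu_ms:.3f} ms/step")  # ~0.215 ms/step
print(f"GPU/CPU ratio: {gpu_ms / cpu_ms:.3f}") # ~0.066, as the log reports

# Hypothetical upper bound: if the PME mesh work could also be taken
# off the CPU's critical path, wall time could shrink toward
# (total_wall - cpu_pme), i.e. a best-case speedup of roughly:
print(f"best case if PME left the CPU: {total_wall / (total_wall - cpu_pme):.2f}x")
```

This reproduces the 0.066 ratio the log reports and makes clear that the GTX 1080 sits idle for most of each step, consistent with the 1-2% GPU utilization mentioned below.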

On 2/22/2017 1:04 AM, Igor Leontyev wrote:
> Hi.
> I am having a hard time accelerating free energy (FE) simulations on
> my high-end GPU. I am not sure whether this is normal for my smaller
> systems or whether I am doing something wrong.
>
> The efficiency of GPU acceleration seems to decrease with system
> size, right? The typical box in FE simulations in water is 32x32x32
> A^3 (~3.5K atoms), and in protein it is about 60x60x60 A^3 (~25K
> atoms). A larger MD box is rarely required in FE simulations.
>
> For my system (11K atoms), on 8 CPUs and with a GTX 1080 GPU I am
> getting only up to a 50% speedup. GPU utilization during the
> simulation is only 1-2%. Does that sound right? (I am using the
> current gmx version 2016.2 and CUDA driver 8.0; on request I will
> attach log files with all the details.)
>
> BTW, regarding how much the perturbed interactions cost: in my case
> the simulation with "free_energy = no" runs about TWICE as fast.
>
> Igor
>
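
As a rough check of the atom counts quoted above, here is a minimal sketch assuming a cubic box of pure water at its usual number density; the helper name and the density constant are illustrative (mine, not from the log), and a protein box packs atoms more densely, so the second figure is only approximate:

```python
# Rough atom counts for typical FE box sizes, assuming a cubic box of
# pure water at ~33.4 molecules/nm^3 (3 atoms per molecule).
WATER_MOLECULES_PER_NM3 = 33.4

def atoms_in_water_box(edge_nm: float) -> int:
    """Approximate atom count for a cubic water box with the given edge (nm)."""
    return round(edge_nm ** 3 * WATER_MOLECULES_PER_NM3 * 3)

print(atoms_in_water_box(3.2))  # 32x32x32 A^3  -> ~3.3K atoms
print(atoms_in_water_box(6.0))  # 60x60x60 A^3  -> ~21.6K atoms
```

These land in the same ballpark as the ~3.5K and ~25K figures above.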
>> On 2/13/17, 1:32 AM,
>> "gromacs.org_gmx-developers-bounces at maillist.sys.kth.se on behalf of
>> Berk Hess" <gromacs.org_gmx-developers-bounces at maillist.sys.kth.se on
>> behalf of hess at kth.se> wrote:
>>
>>     That depends on what you mean by this.
>>     With free-energy all non-perturbed non-bonded interactions can
>>     run on the GPU. The perturbed ones currently cannot. For a large
>>     system with a few perturbed atoms this is no issue. For smaller
>>     systems the free-energy kernel can be the limiting factor. I
>>     think there is a lot of gain to be had in making the extremely
>>     complex CPU free-energy kernel faster. Initially I thought SIMD
>>     would not help there. But since any perturbed i-particle will
>>     have perturbed interactions with all j's, this will help a lot.
>>
>>     Cheers,
>>
>>     Berk
>>
>>     On 2017-02-13 01:08, Michael R Shirts wrote:
>>     > What's the current state of the free energy code on GPUs, and
>>     > what are the roadblocks?
>>     >
>>     > Thanks!
>>     > ~~~~~~~~~~~~~~~~
>>     > Michael Shirts
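
To illustrate Berk's point about vectorizing over the j-particles, below is a minimal NumPy sketch of a soft-core Lennard-Jones evaluation for one perturbed i-particle against all of its j-neighbours, using a Beutler-style soft-core with sc-r-power = 6, sc-power = 1, and the sc-alpha/sc-sigma values from the mdp above. This shows the data layout only, not the actual GROMACS kernel; the function name, the single-end-state simplification, and the pair constants are assumptions.

```python
import numpy as np

def softcore_lj(r2, c6, c12, lam, alpha=0.1, sigma6=0.3**6):
    """Soft-core LJ for one end state at coupling (1 - lam).

    Beutler-style form with r-power 6 and sc-power 1: the interaction
    distance is inflated as r_sc^6 = alpha * sigma^6 * lam + r^6, which
    removes the singularity as atoms (de)couple. Simplified; the real
    kernel also handles perturbed charges, exclusions and both end states.
    """
    rsc6 = alpha * sigma6 * lam + r2 ** 3      # r_sc^6, vectorized over j
    rsc12 = rsc6 * rsc6
    return (1.0 - lam) * (c12 / rsc12 - c6 / rsc6)

# One perturbed i-particle interacts with *all* j within the cut-off,
# so the inner loop is long and uniform -- exactly the shape that SIMD
# (or NumPy vectorization, as here) exploits.
rng = np.random.default_rng(0)
xj = rng.uniform(0.3, 0.9, size=(200, 3))      # j coordinates (nm), made up
xi = np.zeros(3)                               # the perturbed i-particle
r2 = np.sum((xj - xi) ** 2, axis=1)            # squared distances to all j
energy = softcore_lj(r2, c6=2.5e-3, c12=2.5e-6, lam=0.5)
print(energy.sum())
```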

