[gmx-developers] free energies on GPUs?
Mark Abraham
mark.j.abraham at gmail.com
Wed Feb 22 18:41:56 CET 2017
Hi,
Are you trying to run several of these at once? If so, you need to manage
the pinning details yourself, because just launching three 8-core runs with
pinning enabled and different GPU ids will leave 16 cores idle (each run
pins its threads to the same first eight cores). See the examples at
http://manual.gromacs.org/documentation/2016.2/user-guide/mdrun-performance.html.
Or better, use an MPI-enabled mdrun to run a multi-simulation and let it get
the details right.
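
For example (an untested sketch; the tpr names, directories, GPU ids and
offsets are placeholders for your own setup), three independent pinned runs
could look like

  # three independent 8-core runs, pinned to disjoint cores, one GPU each
  gmx mdrun -nt 8 -pin on -pinoffset 0  -gpu_id 0 -s run0.tpr -deffnm run0 &
  gmx mdrun -nt 8 -pin on -pinoffset 8  -gpu_id 1 -s run1.tpr -deffnm run1 &
  gmx mdrun -nt 8 -pin on -pinoffset 16 -gpu_id 2 -s run2.tpr -deffnm run2 &

  # or, with an MPI build and one input per directory, a single
  # multi-simulation (GPU ids 0,1,2 mapped to the three simulations):
  mpirun -np 3 gmx_mpi mdrun -multidir run0 run1 run2 -ntomp 8 -pin on -gpu_id 012
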
Mark
On Wed, 22 Feb 2017 17:54 Igor Leontyev <ileontyev at ucdavis.edu> wrote:
> >
> > What CPU vs GPU time per step gets reported at the end of the log
> > file?
>
> Thank you, Berk, for the prompt response. Here is my log file with all
> the details.
>
> =================================================
> Host: compute-0-113.local pid: 12081 rank ID: 0 number of ranks: 1
> :-) GROMACS - gmx mdrun, 2016.2 (-:
>
> GROMACS is written by:
> ...........................................................
>
> GROMACS: gmx mdrun, version 2016.2
> Executable:  /home/leontyev/programs/bin/gromacs/gromacs-2016.2/bin/gmx_avx2_gpu
> Data prefix: /home/leontyev/programs/bin/gromacs/gromacs-2016.2
> Working dir: /share/COMMON2/MDRUNS/GROMACS/MUTATIONS/PROTEINS/coc-Flu_A-B_LIGs/MDRUNS/InP/fluA/Output_test/6829_6818_9/Gromacs.571690
> Command line:
> gmx_avx2_gpu mdrun -nb gpu -gpu_id 3 -pin on -nt 8 -s 6829_6818-liq_0.tpr -e /state/partition1/Gromacs.571690.0//6829_6818-liq_0.edr -dhdl /state/partition1/Gromacs.571690.0//6829_6818-liq_0.xvg -o /state/partition1/Gromacs.571690.0//6829_6818-liq_0.trr -x /state/partition1/Gromacs.571690.0//6829_6818-liq_0.xtc -cpo /state/partition1/Gromacs.571690.0//6829_6818-liq_0.cpt -c 6829_6818-liq_0.gro -g 6829_6818-liq_0.log
>
> GROMACS version: 2016.2
> Precision: single
> Memory model: 64 bit
> MPI library: thread_mpi
> OpenMP support: enabled (GMX_OPENMP_MAX_THREADS = 32)
> GPU support: CUDA
> SIMD instructions: AVX2_256
> FFT library: fftw-3.3.4-sse2-avx
> RDTSCP usage: enabled
> TNG support: enabled
> Hwloc support: disabled
> Tracing support: disabled
> Built on: Mon Feb 20 18:26:54 PST 2017
> Built by: leontyev at cluster01.interxinc.com [CMAKE]
> Build OS/arch: Linux 2.6.32-642.el6.x86_64 x86_64
> Build CPU vendor: Intel
> Build CPU brand: Intel(R) Xeon(R) CPU E5-1620 0 @ 3.60GHz
> Build CPU family: 6 Model: 45 Stepping: 7
> Build CPU features: aes apic avx clfsh cmov cx8 cx16 htt lahf mmx msr
> nonstop_tsc pcid pclmuldq pdcm pdpe1gb popcnt pse rdtscp sse2 sse3
> sse4.1 sse4.2 ssse3 tdt x2apic
> C compiler: /share/apps/devtoolset-1.1/root/usr/bin/gcc GNU 4.7.2
> C compiler flags: -march=core-avx2 -static-libgcc -static-libstdc++
> -O3 -DNDEBUG -funroll-all-loops -fexcess-precision=fast
> C++ compiler: /share/apps/devtoolset-1.1/root/usr/bin/g++ GNU 4.7.2
> C++ compiler flags: -march=core-avx2 -std=c++0x -O3 -DNDEBUG
> -funroll-all-loops -fexcess-precision=fast
> CUDA compiler: /share/apps/cuda-8.0/bin/nvcc nvcc: NVIDIA (R) Cuda
> compiler driver;Copyright (c) 2005-2016 NVIDIA Corporation;Built on
> Sun_Sep__4_22:14:01_CDT_2016;Cuda compilation tools, release 8.0, V8.0.44
> CUDA compiler
>
> flags:-gencode;arch=compute_35,code=sm_35;-gencode;arch=compute_52,code=sm_52;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_35,code=compute_35;-gencode;arch=compute_52,code=compute_52;-gencode;arch=compute_61,code=compute_61;-use_fast_math;;;-Xcompiler;,-march=core-avx2,,,,,,;-Xcompiler;-O3,-DNDEBUG,-funroll-all-loops,-fexcess-precision=fast,,;
>
> CUDA driver: 8.0
> CUDA runtime: 8.0
>
>
> Running on 1 node with total 24 cores, 24 logical cores, 4 compatible GPUs
> Hardware detected:
> CPU info:
> Vendor: Intel
> Brand: Intel(R) Xeon(R) CPU E5-2670 v3 @ 2.30GHz
> Family: 6 Model: 63 Stepping: 2
> Features: aes apic avx avx2 clfsh cmov cx8 cx16 f16c fma htt lahf
> mmx msr nonstop_tsc pcid pclmuldq pdcm pdpe1gb popcnt pse rdrnd rdtscp
> sse2 sse3 sse4.1 sse4.2 ssse3 tdt x2apic
> SIMD instructions most likely to fit this hardware: AVX2_256
> SIMD instructions selected at GROMACS compile time: AVX2_256
>
> Hardware topology: Basic
> Sockets, cores, and logical processors:
> Socket 0: [ 0] [ 1] [ 2] [ 3] [ 4] [ 5] [ 6] [ 7] [ 8] [ 9] [ 10] [ 11]
> Socket 1: [ 12] [ 13] [ 14] [ 15] [ 16] [ 17] [ 18] [ 19] [ 20] [ 21] [ 22] [ 23]
> GPU info:
> Number of GPUs detected: 4
> #0: NVIDIA GeForce GTX 1080, compute cap.: 6.1, ECC: no, stat: compatible
> #1: NVIDIA GeForce GTX 1080, compute cap.: 6.1, ECC: no, stat: compatible
> #2: NVIDIA GeForce GTX 1080, compute cap.: 6.1, ECC: no, stat: compatible
> #3: NVIDIA GeForce GTX 1080, compute cap.: 6.1, ECC: no, stat: compatible
>
>
> For optimal performance with a GPU nstlist (now 10) should be larger.
> The optimum depends on your CPU and GPU resources.
> You might want to try several nstlist values.
> Changing nstlist from 10 to 40, rlist from 0.9 to 0.932
>
> Input Parameters:
> integrator = sd
> tinit = 0
> dt = 0.001
> nsteps = 10000
> init-step = 0
> simulation-part = 1
> comm-mode = Linear
> nstcomm = 100
> bd-fric = 0
> ld-seed = 1103660843
> emtol = 10
> emstep = 0.01
> niter = 20
> fcstep = 0
> nstcgsteep = 1000
> nbfgscorr = 10
> rtpi = 0.05
> nstxout = 10000000
> nstvout = 10000000
> nstfout = 0
> nstlog = 20000
> nstcalcenergy = 100
> nstenergy = 1000
> nstxout-compressed = 5000
> compressed-x-precision = 1000
> cutoff-scheme = Verlet
> nstlist = 40
> ns-type = Grid
> pbc = xyz
> periodic-molecules = false
> verlet-buffer-tolerance = 0.005
> rlist = 0.932
> coulombtype = PME
> coulomb-modifier = Potential-shift
> rcoulomb-switch = 0.9
> rcoulomb = 0.9
> epsilon-r = 1
> epsilon-rf = inf
> vdw-type = Cut-off
> vdw-modifier = Potential-shift
> rvdw-switch = 0.9
> rvdw = 0.9
> DispCorr = EnerPres
> table-extension = 1
> fourierspacing = 0.12
> fourier-nx = 42
> fourier-ny = 42
> fourier-nz = 40
> pme-order = 6
> ewald-rtol = 1e-05
> ewald-rtol-lj = 0.001
> lj-pme-comb-rule = Geometric
> ewald-geometry = 0
> epsilon-surface = 0
> implicit-solvent = No
> gb-algorithm = Still
> nstgbradii = 1
> rgbradii = 1
> gb-epsilon-solvent = 80
> gb-saltconc = 0
> gb-obc-alpha = 1
> gb-obc-beta = 0.8
> gb-obc-gamma = 4.85
> gb-dielectric-offset = 0.009
> sa-algorithm = Ace-approximation
> sa-surface-tension = 2.05016
> tcoupl = No
> nsttcouple = 5
> nh-chain-length = 0
> print-nose-hoover-chain-variables = false
> pcoupl = Parrinello-Rahman
> pcoupltype = Isotropic
> nstpcouple = 5
> tau-p = 0.5
> compressibility (3x3):
> compressibility[ 0]={ 5.00000e-05, 0.00000e+00, 0.00000e+00}
> compressibility[ 1]={ 0.00000e+00, 5.00000e-05, 0.00000e+00}
> compressibility[ 2]={ 0.00000e+00, 0.00000e+00, 5.00000e-05}
> ref-p (3x3):
> ref-p[ 0]={ 1.01325e+00, 0.00000e+00, 0.00000e+00}
> ref-p[ 1]={ 0.00000e+00, 1.01325e+00, 0.00000e+00}
> ref-p[ 2]={ 0.00000e+00, 0.00000e+00, 1.01325e+00}
> refcoord-scaling = All
> posres-com (3):
> posres-com[0]= 0.00000e+00
> posres-com[1]= 0.00000e+00
> posres-com[2]= 0.00000e+00
> posres-comB (3):
> posres-comB[0]= 0.00000e+00
> posres-comB[1]= 0.00000e+00
> posres-comB[2]= 0.00000e+00
> QMMM = false
> QMconstraints = 0
> QMMMscheme = 0
> MMChargeScaleFactor = 1
> qm-opts:
> ngQM = 0
> constraint-algorithm = Lincs
> continuation = false
> Shake-SOR = false
> shake-tol = 0.0001
> lincs-order = 12
> lincs-iter = 1
> lincs-warnangle = 30
> nwall = 0
> wall-type = 9-3
> wall-r-linpot = -1
> wall-atomtype[0] = -1
> wall-atomtype[1] = -1
> wall-density[0] = 0
> wall-density[1] = 0
> wall-ewald-zfac = 3
> pull = false
> rotation = false
> interactiveMD = false
> disre = No
> disre-weighting = Conservative
> disre-mixed = false
> dr-fc = 1000
> dr-tau = 0
> nstdisreout = 100
> orire-fc = 0
> orire-tau = 0
> nstorireout = 100
> free-energy = yes
> init-lambda = -1
> init-lambda-state = 0
> delta-lambda = 0
> nstdhdl = 100
> n-lambdas = 13
> separate-dvdl:
> fep-lambdas = FALSE
> mass-lambdas = FALSE
> coul-lambdas = TRUE
> vdw-lambdas = TRUE
> bonded-lambdas = TRUE
> restraint-lambdas = FALSE
> temperature-lambdas = FALSE
> all-lambdas:
> fep-lambdas =          0 0 0 0 0 0 0 0 0 0 0 0 0
> mass-lambdas =         0 0 0 0 0 0 0 0 0 0 0 0 0
> coul-lambdas =         0 0.03 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 0.97 1
> vdw-lambdas =          0 0.03 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 0.97 1
> bonded-lambdas =       0 0.03 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 0.97 1
> restraint-lambdas =    0 0 0 0 0 0 0 0 0 0 0 0 0
> temperature-lambdas =  0 0 0 0 0 0 0 0 0 0 0 0 0
> calc-lambda-neighbors = -1
> dhdl-print-energy = potential
> sc-alpha = 0.1
> sc-power = 1
> sc-r-power = 6
> sc-sigma = 0.3
> sc-sigma-min = 0.3
> sc-coul = true
> dh-hist-size = 0
> dh-hist-spacing = 0.1
> separate-dhdl-file = yes
> dhdl-derivatives = yes
> cos-acceleration = 0
> deform (3x3):
> deform[ 0]={ 0.00000e+00, 0.00000e+00, 0.00000e+00}
> deform[ 1]={ 0.00000e+00, 0.00000e+00, 0.00000e+00}
> deform[ 2]={ 0.00000e+00, 0.00000e+00, 0.00000e+00}
> simulated-tempering = false
> E-x:
> n = 0
> E-xt:
> n = 0
> E-y:
> n = 0
> E-yt:
> n = 0
> E-z:
> n = 0
> E-zt:
> n = 0
> swapcoords = no
> userint1 = 0
> userint2 = 0
> userint3 = 0
> userint4 = 0
> userreal1 = 0
> userreal2 = 0
> userreal3 = 0
> userreal4 = 0
> grpopts:
> nrdf: 6332.24 62.9925 18705.8
> ref-t: 298.15 298.15 298.15
> tau-t: 1 1 1
> annealing: No No No
> annealing-npoints: 0 0 0
> acc: 0 0 0
> nfreeze: N N N
> energygrp-flags[ 0]: 0
>
> Using 1 MPI thread
> Using 8 OpenMP threads
>
> 1 GPU user-selected for this run.
> Mapping of GPU ID to the 1 PP rank in this node: 3
>
> Will do PME sum in reciprocal space for electrostatic interactions.
>
> ++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
> U. Essmann, L. Perera, M. L. Berkowitz, T. Darden, H. Lee and L. G.
> Pedersen
> A smooth particle mesh Ewald method
> J. Chem. Phys. 103 (1995) pp. 8577-8592
> -------- -------- --- Thank You --- -------- --------
>
> Will do ordinary reciprocal space Ewald sum.
> Using a Gaussian width (1/beta) of 0.288146 nm for Ewald
> Cut-off's: NS: 0.932 Coulomb: 0.9 LJ: 0.9
> Long Range LJ corr.: <C6> 3.6183e-04
> System total charge, top. A: 7.000 top. B: 7.000
> Generated table with 965 data points for Ewald.
> Tabscale = 500 points/nm
> Generated table with 965 data points for LJ6.
> Tabscale = 500 points/nm
> Generated table with 965 data points for LJ12.
> Tabscale = 500 points/nm
> Generated table with 965 data points for 1-4 COUL.
> Tabscale = 500 points/nm
> Generated table with 965 data points for 1-4 LJ6.
> Tabscale = 500 points/nm
> Generated table with 965 data points for 1-4 LJ12.
> Tabscale = 500 points/nm
> Potential shift: LJ r^-12: -3.541e+00 r^-6: -1.882e+00, Ewald -1.000e-05
> Initialized non-bonded Ewald correction tables, spacing: 8.85e-04 size: 1018
>
>
> Using GPU 8x8 non-bonded kernels
>
> Using Lorentz-Berthelot Lennard-Jones combination rule
>
> There are 21 atoms and 21 charges for free energy perturbation
> Removing pbc first time
> Pinning threads with an auto-selected logical core stride of 1
>
> ++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
> S. Miyamoto and P. A. Kollman
> SETTLE: An Analytical Version of the SHAKE and RATTLE Algorithms for Rigid
> Water Models
> J. Comp. Chem. 13 (1992) pp. 952-962
> -------- -------- --- Thank You --- -------- --------
>
> Intra-simulation communication will occur every 5 steps.
> Initial vector of lambda components:[ 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 ]
> Center of mass motion removal mode is Linear
> We have the following groups for center of mass motion removal:
> 0: rest
>
> ++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
> N. Goga and A. J. Rzepiela and A. H. de Vries and S. J. Marrink and H. J.
> C.
> Berendsen
> Efficient Algorithms for Langevin and DPD Dynamics
> J. Chem. Theory Comput. 8 (2012) pp. 3637--3649
> -------- -------- --- Thank You --- -------- --------
>
> There are: 11486 Atoms
>
> Constraining the starting coordinates (step 0)
>
> Constraining the coordinates at t0-dt (step 0)
> RMS relative constraint deviation after constraining: 0.00e+00
> Initial temperature: 291.365 K
>
> Started mdrun on rank 0 Wed Feb 22 02:11:02 2017
> Step Time
> 0 0.00000
>
> Energies (kJ/mol)
>            Bond          Angle    Proper Dih.  Ryckaert-Bell.  Improper Dih.
>     2.99018e+03    4.09043e+03    5.20416e+03     4.32600e+01    2.38045e+02
>           LJ-14     Coulomb-14        LJ (SR)   Disper. corr.   Coulomb (SR)
>     2.04778e+03    1.45523e+04    1.59846e+04    -2.41317e+03   -1.92125e+05
>    Coul. recip. Position Rest.      Potential     Kinetic En.   Total Energy
>     1.58368e+03    2.08367e-09   -1.47804e+05     3.03783e+04   -1.17425e+05
>     Temperature Pres. DC (bar) Pressure (bar)       dVcoul/dl       dVvdw/dl
>     2.91118e+02   -3.53694e+02   -3.01252e+02     4.77627e+02    1.41810e+01
>     dVbonded/dl
>    -2.15074e+01
>
> step 80: timed with pme grid 42 42 40, coulomb cutoff 0.900: 391.6 M-cycles
> step 160: timed with pme grid 36 36 36, coulomb cutoff 1.043: 595.7 M-cycles
> step 240: timed with pme grid 40 36 36, coulomb cutoff 1.022: 401.1 M-cycles
> step 320: timed with pme grid 40 40 36, coulomb cutoff 0.963: 318.8 M-cycles
> step 400: timed with pme grid 40 40 40, coulomb cutoff 0.938: 349.9 M-cycles
> step 480: timed with pme grid 42 40 40, coulomb cutoff 0.920: 319.9 M-cycles
> optimal pme grid 40 40 36, coulomb cutoff 0.963
> Step Time
> 10000 10.00000
>
> Writing checkpoint, step 10000 at Wed Feb 22 02:11:41 2017
>
>
> Energies (kJ/mol)
>            Bond          Angle    Proper Dih.  Ryckaert-Bell.  Improper Dih.
>     2.99123e+03    4.14451e+03    5.19572e+03     2.56045e+01    2.74109e+02
>           LJ-14     Coulomb-14        LJ (SR)   Disper. corr.   Coulomb (SR)
>     2.01371e+03    1.45326e+04    1.55974e+04    -2.43903e+03   -1.88805e+05
>    Coul. recip. Position Rest.      Potential     Kinetic En.   Total Energy
>     1.26353e+03    7.39689e+01   -1.45132e+05     3.14390e+04   -1.13693e+05
>     Temperature Pres. DC (bar) Pressure (bar)       dVcoul/dl       dVvdw/dl
>     3.01283e+02   -3.61306e+02    1.35461e+02     3.46732e+02    1.03533e+01
>     dVbonded/dl
>    -1.08537e+01
>
> <====== ############### ==>
> <==== A V E R A G E S ====>
> <== ############### ======>
>
> Statistics over 10001 steps using 101 frames
>
> Energies (kJ/mol)
>            Bond          Angle    Proper Dih.  Ryckaert-Bell.  Improper Dih.
>     3.01465e+03    4.25438e+03    5.23249e+03     3.47157e+01    2.59375e+02
>           LJ-14     Coulomb-14        LJ (SR)   Disper. corr.   Coulomb (SR)
>     2.02486e+03    1.45795e+04    1.58085e+04    -2.42589e+03   -1.89788e+05
>    Coul. recip. Position Rest.      Potential     Kinetic En.   Total Energy
>     1.28411e+03    6.08802e+01   -1.45660e+05     3.09346e+04   -1.14726e+05
>     Temperature Pres. DC (bar) Pressure (bar)       dVcoul/dl       dVvdw/dl
>     2.96448e+02   -3.57435e+02    3.32252e+01     4.36060e+02    1.77368e+01
>     dVbonded/dl
>    -1.82384e+01
>
> Box-X Box-Y Box-Z
> 4.99607e+00 4.89654e+00 4.61444e+00
>
> Total Virial (kJ/mol)
> 1.00345e+04 5.03211e+01 -1.17351e+02
> 4.69630e+01 1.04021e+04 1.73033e+02
> -1.16637e+02 1.75781e+02 1.01673e+04
>
> Pressure (bar)
> 7.67740e+01 -1.32678e+01 3.58518e+01
> -1.22810e+01 -2.15571e+01 -5.79828e+01
> 3.56420e+01 -5.87931e+01 4.44585e+01
>
> T-Protein T-LIG T-SOL
> 2.98707e+02 2.97436e+02 2.95680e+02
>
>
> P P - P M E L O A D B A L A N C I N G
>
> PP/PME load balancing changed the cut-off and PME settings:
> particle-particle PME
> rcoulomb rlist grid spacing 1/beta
> initial 0.900 nm 0.932 nm 42 42 40 0.119 nm 0.288 nm
> final 0.963 nm 0.995 nm 40 40 36 0.128 nm 0.308 nm
> cost-ratio 1.22 0.82
> (note that these numbers concern only part of the total PP and PME load)
>
>
> M E G A - F L O P S A C C O U N T I N G
>
> NB=Group-cutoff nonbonded kernels NxN=N-by-N cluster Verlet kernels
> RF=Reaction-Field VdW=Van der Waals QSTab=quadratic-spline table
> W3=SPC/TIP3p W4=TIP4p (single or pairs)
> V&F=Potential and force V=Potential only F=Force only
>
> Computing:                               M-Number         M-Flops   % Flops
>
> -----------------------------------------------------------------------------
> NB Free energy kernel 20441.549154 20441.549 0.3
> Pair Search distance check 289.750448 2607.754 0.0
> NxN Ewald Elec. + LJ [F] 78217.065728 5162326.338 85.6
> NxN Ewald Elec. + LJ [V&F] 798.216192 85409.133 1.4
> 1,4 nonbonded interactions 55.597769 5003.799 0.1
> Calc Weights 344.614458 12406.120 0.2
> Spread Q Bspline 49624.481952 99248.964 1.6
> Gather F Bspline 49624.481952 297746.892 4.9
> 3D-FFT 36508.030372 292064.243 4.8
> Solve PME 31.968000 2045.952 0.0
> Shift-X 2.882986 17.298 0.0
> Bonds 21.487804 1267.780 0.0
> Angles 38.645175 6492.389 0.1
> Propers 58.750116 13453.777 0.2
> Impropers 4.270427 888.249 0.0
> RB-Dihedrals 0.445700 110.088 0.0
> Pos. Restr. 0.900090 45.005 0.0
> Virial 23.073531 415.324 0.0
> Update 114.871486 3561.016 0.1
> Stop-CM 1.171572 11.716 0.0
> Calc-Ekin 45.966972 1241.108 0.0
> Constraint-V 187.108062 1496.864 0.0
> Constraint-Vir 18.717354 449.216 0.0
> Settle 62.372472 20146.308 0.3
>
> -----------------------------------------------------------------------------
> Total 6028896.883 100.0
>
> -----------------------------------------------------------------------------
>
>
> R E A L C Y C L E A N D T I M E A C C O U N T I N G
>
> On 1 MPI rank, each using 8 OpenMP threads
>
> Computing: Num Num Call Wall time Giga-Cycles
> Ranks Threads Count (s) total sum %
>
> -----------------------------------------------------------------------------
> Neighbor search 1 8 251 0.530 9.754 1.4
> Launch GPU ops. 1 8 10001 0.509 9.357 1.3
> Force 1 8 10001 10.634 195.662 27.3
> PME mesh 1 8 10001 22.173 407.991 57.0
> Wait GPU local 1 8 10001 0.073 1.338 0.2
> NB X/F buffer ops. 1 8 19751 0.255 4.690 0.7
> Write traj. 1 8 3 0.195 3.587 0.5
> Update 1 8 20002 1.038 19.093 2.7
> Constraints 1 8 20002 0.374 6.887 1.0
> Rest 3.126 57.513 8.0
>
> -----------------------------------------------------------------------------
> Total 38.906 715.871 100.0
>
> -----------------------------------------------------------------------------
> Breakdown of PME mesh computation
>
> -----------------------------------------------------------------------------
> PME spread/gather 1 8 40004 19.289 354.929 49.6
> PME 3D-FFT 1 8 40004 2.319 42.665 6.0
> PME solve Elec 1 8 20002 0.518 9.538 1.3
>
> -----------------------------------------------------------------------------
>
> GPU timings
>
> -----------------------------------------------------------------------------
> Computing: Count Wall t (s) ms/step %
>
> -----------------------------------------------------------------------------
> Pair list H2D 251 0.023 0.090 1.1
> X / q H2D 10001 0.269 0.027 12.5
> Nonbonded F kernel 9700 1.615 0.166 75.0
> Nonbonded F+ene k. 50 0.014 0.273 0.6
> Nonbonded F+prune k. 200 0.039 0.196 1.8
> Nonbonded F+ene+prune k. 51 0.016 0.323 0.8
> F D2H 10001 0.177 0.018 8.2
>
> -----------------------------------------------------------------------------
> Total 2.153 0.215 100.0
>
> -----------------------------------------------------------------------------
>
> Average per-step force GPU/CPU evaluation time ratio: 0.215 ms/3.280 ms = 0.066
> For optimal performance this ratio should be close to 1!
>
>
> NOTE: The GPU has >25% less load than the CPU. This imbalance causes
> performance loss.
>
> Core t (s) Wall t (s) (%)
> Time: 311.246 38.906 800.0
> (ns/day) (hour/ns)
> Performance: 22.210 1.081
> =================================================
>
> On 2/22/2017 1:04 AM, Igor Leontyev wrote:
> > Hi.
> > I am having a hard time accelerating free energy (FE) simulations on my
> > high-end GPU. I am not sure whether this is normal for systems of this
> > size or whether I am doing something wrong.
> >
> > The efficiency of GPU acceleration decreases for smaller systems, right?
> > A typical FE simulation box in water is 32x32x32 A^3 (~3.5K atoms), and
> > in protein about 60x60x60 A^3 (~25K atoms); a larger MD box is rarely
> > needed for FE simulations.
> >
> > For my system (11K atoms), running on 8 CPU cores with a GTX 1080 GPU, I
> > am getting only up to a 50% speedup. GPU utilization during the
> > simulation is only 1-2%. Does that sound right? (I am using the current
> > gmx version 2016.2 and CUDA driver 8.0; on request I will attach log
> > files with all the details.)
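> >
> > (For reference, a simple way to watch per-GPU load while mdrun is
> > running is something like the command below; the query fields are only a
> > suggestion, adjust as needed:
> >
> >   nvidia-smi --query-gpu=index,utilization.gpu --format=csv -l 1
> >
> > which prints each GPU index and its utilization once per second.)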
> >
> > BTW, regarding how much the perturbed interactions cost: in my case the
> > same simulation with "free_energy = no" runs about TWICE as fast.
> >
> > Igor
> >
> >> On 2/13/17, 1:32 AM,
> >> "gromacs.org_gmx-developers-bounces at maillist.sys.kth.se on behalf of
> >> Berk Hess" <gromacs.org_gmx-developers-bounces at maillist.sys.kth.se on
> >> behalf of hess at kth.se> wrote:
> >>
> >> That depends on what you mean by this.
> >> With free-energy, all non-perturbed non-bonded interactions can run on
> >> the GPU. The perturbed ones currently can not. For a large system with a
> >> few perturbed atoms this is no issue. For smaller systems the
> >> free-energy kernel can be the limiting factor. I think there is a lot of
> >> gain to be had in making the extremely complex CPU free-energy kernel
> >> faster. Initially I thought SIMD would not help there. But since any
> >> perturbed i-particle will have perturbed interactions with all j's, this
> >> will help a lot.
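> >>
> >> (A rough way to see how much of a run's arithmetic the CPU free-energy
> >> kernel accounts for is the flop accounting at the end of the log, e.g.
> >> with a placeholder log name:
> >>
> >>   grep "NB Free energy kernel" md.log
> >>
> >> keeping in mind that this kernel likely costs much more wall time per
> >> flop than the SIMD/GPU kernels, so its flop share understates its real
> >> cost.)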
> >>
> >> Cheers,
> >>
> >> Berk
> >>
> >> On 2017-02-13 01:08, Michael R Shirts wrote:
> >> > What's the current state of the free energy code on GPUs, and what
> >> > are the roadblocks?
> >> >
> >> > Thanks!
> >> > ~~~~~~~~~~~~~~~~
> >> > Michael Shirts