[gmx-developers] free energies on GPUs?

Berk Hess hess at kth.se
Thu Feb 23 01:52:50 CET 2017


I don't see anything strange, apart from the multiple-run issue Mark 
noticed.

For performance, pme-order=6 is bad: you spend 50% of the CPU time in 
PME spread+gather, and order 6 is not accelerated with SIMD intrinsics. 
Using pme-order=5 will be about twice as fast. You can reduce the grid 
spacing a bit if you think you need higher PME accuracy.
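
Something like the following in the mdp is what I mean (a sketch; the
fourierspacing value is only illustrative, check with grompp that the
resulting grid gives you the Ewald accuracy you want):

    pme-order      = 5     ; order 6 spread+gather is not SIMD accelerated
    fourierspacing = 0.11  ; a bit finer than the current 0.12, only if
                           ; you really need the extra PME accuracy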

Cheers,

Berk

On 22/02/17 11:16, Igor Leontyev wrote:
> >
> > What CPU vs GPU time per step gets reported at the end of the log
> > file?
>
> Thank you, Berk, for the prompt response. Here is my log file with
> all the details.
>
> =================================================
> Host: compute-0-113.local  pid: 12081  rank ID: 0  number of ranks:  1
>                       :-) GROMACS - gmx mdrun, 2016.2 (-:
>
>                             GROMACS is written by:
> ...........................................................
>
> GROMACS:      gmx mdrun, version 2016.2
> Executable:   /home/leontyev/programs/bin/gromacs/gromacs-2016.2/bin/gmx_avx2_gpu
> Data prefix:  /home/leontyev/programs/bin/gromacs/gromacs-2016.2
> Working dir:  /share/COMMON2/MDRUNS/GROMACS/MUTATIONS/PROTEINS/coc-Flu_A-B_LIGs/MDRUNS/InP/fluA/Output_test/6829_6818_9/Gromacs.571690
> Command line:
>   gmx_avx2_gpu mdrun -nb gpu -gpu_id 3 -pin on -nt 8 -s 
> 6829_6818-liq_0.tpr -e 
> /state/partition1/Gromacs.571690.0//6829_6818-liq_0.edr -dhdl 
> /state/partition1/Gromacs.571690.0//6829_6818-liq_0.xvg -o 
> /state/partition1/Gromacs.571690.0//6829_6818-liq_0.trr -x 
> /state/partition1/Gromacs.571690.0//6829_6818-liq_0.xtc -cpo 
> /state/partition1/Gromacs.571690.0//6829_6818-liq_0.cpt -c 
> 6829_6818-liq_0.gro -g 6829_6818-liq_0.log
>
> GROMACS version:    2016.2
> Precision:          single
> Memory model:       64 bit
> MPI library:        thread_mpi
> OpenMP support:     enabled (GMX_OPENMP_MAX_THREADS = 32)
> GPU support:        CUDA
> SIMD instructions:  AVX2_256
> FFT library:        fftw-3.3.4-sse2-avx
> RDTSCP usage:       enabled
> TNG support:        enabled
> Hwloc support:      disabled
> Tracing support:    disabled
> Built on:           Mon Feb 20 18:26:54 PST 2017
> Built by:           leontyev at cluster01.interxinc.com [CMAKE]
> Build OS/arch:      Linux 2.6.32-642.el6.x86_64 x86_64
> Build CPU vendor:   Intel
> Build CPU brand:    Intel(R) Xeon(R) CPU E5-1620 0 @ 3.60GHz
> Build CPU family:   6   Model: 45   Stepping: 7
> Build CPU features: aes apic avx clfsh cmov cx8 cx16 htt lahf mmx msr 
> nonstop_tsc pcid pclmuldq pdcm pdpe1gb popcnt pse rdtscp sse2 sse3 
> sse4.1 sse4.2 ssse3 tdt x2apic
> C compiler:         /share/apps/devtoolset-1.1/root/usr/bin/gcc GNU 4.7.2
> C compiler flags:    -march=core-avx2   -static-libgcc 
> -static-libstdc++   -O3 -DNDEBUG -funroll-all-loops 
> -fexcess-precision=fast
> C++ compiler:       /share/apps/devtoolset-1.1/root/usr/bin/g++ GNU 4.7.2
> C++ compiler flags:  -march=core-avx2    -std=c++0x   -O3 -DNDEBUG 
> -funroll-all-loops -fexcess-precision=fast
> CUDA compiler:      /share/apps/cuda-8.0/bin/nvcc nvcc: NVIDIA (R) 
> Cuda compiler driver;Copyright (c) 2005-2016 NVIDIA Corporation;Built 
> on Sun_Sep__4_22:14:01_CDT_2016;Cuda compilation tools, release 8.0, 
> V8.0.44
> CUDA compiler 
> flags:-gencode;arch=compute_35,code=sm_35;-gencode;arch=compute_52,code=sm_52;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_35,code=compute_35;-gencode;arch=compute_52,code=compute_52;-gencode;arch=compute_61,code=compute_61;-use_fast_math;;;-Xcompiler;,-march=core-avx2,,,,,,;-Xcompiler;-O3,-DNDEBUG,-funroll-all-loops,-fexcess-precision=fast,,; 
>
> CUDA driver:        8.0
> CUDA runtime:       8.0
>
>
> Running on 1 node with total 24 cores, 24 logical cores, 4 compatible 
> GPUs
> Hardware detected:
>   CPU info:
>     Vendor: Intel
>     Brand:  Intel(R) Xeon(R) CPU E5-2670 v3 @ 2.30GHz
>     Family: 6   Model: 63   Stepping: 2
>     Features: aes apic avx avx2 clfsh cmov cx8 cx16 f16c fma htt lahf 
> mmx msr nonstop_tsc pcid pclmuldq pdcm pdpe1gb popcnt pse rdrnd rdtscp 
> sse2 sse3 sse4.1 sse4.2 ssse3 tdt x2apic
>     SIMD instructions most likely to fit this hardware: AVX2_256
>     SIMD instructions selected at GROMACS compile time: AVX2_256
>
>   Hardware topology: Basic
>     Sockets, cores, and logical processors:
>       Socket  0: [   0] [   1] [   2] [   3] [   4] [   5] [   6] [   7] [   8] [   9] [  10] [  11]
>       Socket  1: [  12] [  13] [  14] [  15] [  16] [  17] [  18] [  19] [  20] [  21] [  22] [  23]
>   GPU info:
>     Number of GPUs detected: 4
>     #0: NVIDIA GeForce GTX 1080, compute cap.: 6.1, ECC:  no, stat: compatible
>     #1: NVIDIA GeForce GTX 1080, compute cap.: 6.1, ECC:  no, stat: compatible
>     #2: NVIDIA GeForce GTX 1080, compute cap.: 6.1, ECC:  no, stat: compatible
>     #3: NVIDIA GeForce GTX 1080, compute cap.: 6.1, ECC:  no, stat: compatible
>
>
> For optimal performance with a GPU nstlist (now 10) should be larger.
> The optimum depends on your CPU and GPU resources.
> You might want to try several nstlist values.
> Changing nstlist from 10 to 40, rlist from 0.9 to 0.932
>
> Input Parameters:
>    integrator                     = sd
>    tinit                          = 0
>    dt                             = 0.001
>    nsteps                         = 10000
>    init-step                      = 0
>    simulation-part                = 1
>    comm-mode                      = Linear
>    nstcomm                        = 100
>    bd-fric                        = 0
>    ld-seed                        = 1103660843
>    emtol                          = 10
>    emstep                         = 0.01
>    niter                          = 20
>    fcstep                         = 0
>    nstcgsteep                     = 1000
>    nbfgscorr                      = 10
>    rtpi                           = 0.05
>    nstxout                        = 10000000
>    nstvout                        = 10000000
>    nstfout                        = 0
>    nstlog                         = 20000
>    nstcalcenergy                  = 100
>    nstenergy                      = 1000
>    nstxout-compressed             = 5000
>    compressed-x-precision         = 1000
>    cutoff-scheme                  = Verlet
>    nstlist                        = 40
>    ns-type                        = Grid
>    pbc                            = xyz
>    periodic-molecules             = false
>    verlet-buffer-tolerance        = 0.005
>    rlist                          = 0.932
>    coulombtype                    = PME
>    coulomb-modifier               = Potential-shift
>    rcoulomb-switch                = 0.9
>    rcoulomb                       = 0.9
>    epsilon-r                      = 1
>    epsilon-rf                     = inf
>    vdw-type                       = Cut-off
>    vdw-modifier                   = Potential-shift
>    rvdw-switch                    = 0.9
>    rvdw                           = 0.9
>    DispCorr                       = EnerPres
>    table-extension                = 1
>    fourierspacing                 = 0.12
>    fourier-nx                     = 42
>    fourier-ny                     = 42
>    fourier-nz                     = 40
>    pme-order                      = 6
>    ewald-rtol                     = 1e-05
>    ewald-rtol-lj                  = 0.001
>    lj-pme-comb-rule               = Geometric
>    ewald-geometry                 = 0
>    epsilon-surface                = 0
>    implicit-solvent               = No
>    gb-algorithm                   = Still
>    nstgbradii                     = 1
>    rgbradii                       = 1
>    gb-epsilon-solvent             = 80
>    gb-saltconc                    = 0
>    gb-obc-alpha                   = 1
>    gb-obc-beta                    = 0.8
>    gb-obc-gamma                   = 4.85
>    gb-dielectric-offset           = 0.009
>    sa-algorithm                   = Ace-approximation
>    sa-surface-tension             = 2.05016
>    tcoupl                         = No
>    nsttcouple                     = 5
>    nh-chain-length                = 0
>    print-nose-hoover-chain-variables = false
>    pcoupl                         = Parrinello-Rahman
>    pcoupltype                     = Isotropic
>    nstpcouple                     = 5
>    tau-p                          = 0.5
>    compressibility (3x3):
>       compressibility[    0]={ 5.00000e-05,  0.00000e+00, 0.00000e+00}
>       compressibility[    1]={ 0.00000e+00,  5.00000e-05, 0.00000e+00}
>       compressibility[    2]={ 0.00000e+00,  0.00000e+00, 5.00000e-05}
>    ref-p (3x3):
>       ref-p[    0]={ 1.01325e+00,  0.00000e+00,  0.00000e+00}
>       ref-p[    1]={ 0.00000e+00,  1.01325e+00,  0.00000e+00}
>       ref-p[    2]={ 0.00000e+00,  0.00000e+00,  1.01325e+00}
>    refcoord-scaling               = All
>    posres-com (3):
>       posres-com[0]= 0.00000e+00
>       posres-com[1]= 0.00000e+00
>       posres-com[2]= 0.00000e+00
>    posres-comB (3):
>       posres-comB[0]= 0.00000e+00
>       posres-comB[1]= 0.00000e+00
>       posres-comB[2]= 0.00000e+00
>    QMMM                           = false
>    QMconstraints                  = 0
>    QMMMscheme                     = 0
>    MMChargeScaleFactor            = 1
> qm-opts:
>    ngQM                           = 0
>    constraint-algorithm           = Lincs
>    continuation                   = false
>    Shake-SOR                      = false
>    shake-tol                      = 0.0001
>    lincs-order                    = 12
>    lincs-iter                     = 1
>    lincs-warnangle                = 30
>    nwall                          = 0
>    wall-type                      = 9-3
>    wall-r-linpot                  = -1
>    wall-atomtype[0]               = -1
>    wall-atomtype[1]               = -1
>    wall-density[0]                = 0
>    wall-density[1]                = 0
>    wall-ewald-zfac                = 3
>    pull                           = false
>    rotation                       = false
>    interactiveMD                  = false
>    disre                          = No
>    disre-weighting                = Conservative
>    disre-mixed                    = false
>    dr-fc                          = 1000
>    dr-tau                         = 0
>    nstdisreout                    = 100
>    orire-fc                       = 0
>    orire-tau                      = 0
>    nstorireout                    = 100
>    free-energy                    = yes
>    init-lambda                    = -1
>    init-lambda-state              = 0
>    delta-lambda                   = 0
>    nstdhdl                        = 100
>    n-lambdas                      = 13
>    separate-dvdl:
>        fep-lambdas =   FALSE
>       mass-lambdas =   FALSE
>       coul-lambdas =   TRUE
>        vdw-lambdas =   TRUE
>     bonded-lambdas =   TRUE
>  restraint-lambdas =   FALSE
> temperature-lambdas =   FALSE
> all-lambdas:
>        fep-lambdas =  0  0     0    0    0    0    0    0    0    0    0    0     0
>       mass-lambdas =  0  0     0    0    0    0    0    0    0    0    0    0     0
>       coul-lambdas =  0  0.03  0.1  0.2  0.3  0.4  0.5  0.6  0.7  0.8  0.9  0.97  1
>        vdw-lambdas =  0  0.03  0.1  0.2  0.3  0.4  0.5  0.6  0.7  0.8  0.9  0.97  1
>     bonded-lambdas =  0  0.03  0.1  0.2  0.3  0.4  0.5  0.6  0.7  0.8  0.9  0.97  1
>  restraint-lambdas =  0  0     0    0    0    0    0    0    0    0    0    0     0
> temperature-lambdas =  0  0     0    0    0    0    0    0    0    0    0    0     0
>    calc-lambda-neighbors          = -1
>    dhdl-print-energy              = potential
>    sc-alpha                       = 0.1
>    sc-power                       = 1
>    sc-r-power                     = 6
>    sc-sigma                       = 0.3
>    sc-sigma-min                   = 0.3
>    sc-coul                        = true
>    dh-hist-size                   = 0
>    dh-hist-spacing                = 0.1
>    separate-dhdl-file             = yes
>    dhdl-derivatives               = yes
>    cos-acceleration               = 0
>    deform (3x3):
>       deform[    0]={ 0.00000e+00,  0.00000e+00,  0.00000e+00}
>       deform[    1]={ 0.00000e+00,  0.00000e+00,  0.00000e+00}
>       deform[    2]={ 0.00000e+00,  0.00000e+00,  0.00000e+00}
>    simulated-tempering            = false
>    E-x:
>       n = 0
>    E-xt:
>       n = 0
>    E-y:
>       n = 0
>    E-yt:
>       n = 0
>    E-z:
>       n = 0
>    E-zt:
>       n = 0
>    swapcoords                     = no
>    userint1                       = 0
>    userint2                       = 0
>    userint3                       = 0
>    userint4                       = 0
>    userreal1                      = 0
>    userreal2                      = 0
>    userreal3                      = 0
>    userreal4                      = 0
> grpopts:
>    nrdf:     6332.24     62.9925     18705.8
>    ref-t:      298.15      298.15      298.15
>    tau-t:           1           1           1
> annealing:          No          No          No
> annealing-npoints:           0           0           0
>    acc:               0           0           0
>    nfreeze:           N           N           N
>    energygrp-flags[  0]: 0
>
> Using 1 MPI thread
> Using 8 OpenMP threads
>
> 1 GPU user-selected for this run.
> Mapping of GPU ID to the 1 PP rank in this node: 3
>
> Will do PME sum in reciprocal space for electrostatic interactions.
>
> ++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
> U. Essmann, L. Perera, M. L. Berkowitz, T. Darden, H. Lee and L. G. 
> Pedersen
> A smooth particle mesh Ewald method
> J. Chem. Phys. 103 (1995) pp. 8577-8592
> -------- -------- --- Thank You --- -------- --------
>
> Will do ordinary reciprocal space Ewald sum.
> Using a Gaussian width (1/beta) of 0.288146 nm for Ewald
> Cut-off's:   NS: 0.932   Coulomb: 0.9   LJ: 0.9
> Long Range LJ corr.: <C6> 3.6183e-04
> System total charge, top. A: 7.000 top. B: 7.000
> Generated table with 965 data points for Ewald.
> Tabscale = 500 points/nm
> Generated table with 965 data points for LJ6.
> Tabscale = 500 points/nm
> Generated table with 965 data points for LJ12.
> Tabscale = 500 points/nm
> Generated table with 965 data points for 1-4 COUL.
> Tabscale = 500 points/nm
> Generated table with 965 data points for 1-4 LJ6.
> Tabscale = 500 points/nm
> Generated table with 965 data points for 1-4 LJ12.
> Tabscale = 500 points/nm
> Potential shift: LJ r^-12: -3.541e+00 r^-6: -1.882e+00, Ewald -1.000e-05
> Initialized non-bonded Ewald correction tables, spacing: 8.85e-04 
> size: 1018
>
>
> Using GPU 8x8 non-bonded kernels
>
> Using Lorentz-Berthelot Lennard-Jones combination rule
>
> There are 21 atoms and 21 charges for free energy perturbation
> Removing pbc first time
> Pinning threads with an auto-selected logical core stride of 1
>
> ++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
> S. Miyamoto and P. A. Kollman
> SETTLE: An Analytical Version of the SHAKE and RATTLE Algorithms for 
> Rigid
> Water Models
> J. Comp. Chem. 13 (1992) pp. 952-962
> -------- -------- --- Thank You --- -------- --------
>
> Intra-simulation communication will occur every 5 steps.
> Initial vector of lambda components:[     0.0000     0.0000     0.0000     0.0000     0.0000     0.0000     0.0000 ]
> Center of mass motion removal mode is Linear
> We have the following groups for center of mass motion removal:
>   0:  rest
>
> ++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
> N. Goga and A. J. Rzepiela and A. H. de Vries and S. J. Marrink and H. 
> J. C.
> Berendsen
> Efficient Algorithms for Langevin and DPD Dynamics
> J. Chem. Theory Comput. 8 (2012) pp. 3637--3649
> -------- -------- --- Thank You --- -------- --------
>
> There are: 11486 Atoms
>
> Constraining the starting coordinates (step 0)
>
> Constraining the coordinates at t0-dt (step 0)
> RMS relative constraint deviation after constraining: 0.00e+00
> Initial temperature: 291.365 K
>
> Started mdrun on rank 0 Wed Feb 22 02:11:02 2017
>            Step           Time
>               0        0.00000
>
>    Energies (kJ/mol)
>            Bond          Angle    Proper Dih. Ryckaert-Bell.  Improper Dih.
>     2.99018e+03    4.09043e+03    5.20416e+03    4.32600e+01    2.38045e+02
>           LJ-14     Coulomb-14        LJ (SR)  Disper. corr.   Coulomb (SR)
>     2.04778e+03    1.45523e+04    1.59846e+04   -2.41317e+03   -1.92125e+05
>    Coul. recip. Position Rest.      Potential    Kinetic En.   Total Energy
>     1.58368e+03    2.08367e-09   -1.47804e+05    3.03783e+04   -1.17425e+05
>     Temperature Pres. DC (bar) Pressure (bar)      dVcoul/dl       dVvdw/dl
>     2.91118e+02   -3.53694e+02   -3.01252e+02    4.77627e+02    1.41810e+01
>     dVbonded/dl
>    -2.15074e+01
>
> step   80: timed with pme grid 42 42 40, coulomb cutoff 0.900: 391.6 M-cycles
> step  160: timed with pme grid 36 36 36, coulomb cutoff 1.043: 595.7 M-cycles
> step  240: timed with pme grid 40 36 36, coulomb cutoff 1.022: 401.1 M-cycles
> step  320: timed with pme grid 40 40 36, coulomb cutoff 0.963: 318.8 M-cycles
> step  400: timed with pme grid 40 40 40, coulomb cutoff 0.938: 349.9 M-cycles
> step  480: timed with pme grid 42 40 40, coulomb cutoff 0.920: 319.9 M-cycles
>               optimal pme grid 40 40 36, coulomb cutoff 0.963
>            Step           Time
>           10000       10.00000
>
> Writing checkpoint, step 10000 at Wed Feb 22 02:11:41 2017
>
>
>    Energies (kJ/mol)
>            Bond          Angle    Proper Dih. Ryckaert-Bell.  Improper Dih.
>     2.99123e+03    4.14451e+03    5.19572e+03    2.56045e+01    2.74109e+02
>           LJ-14     Coulomb-14        LJ (SR)  Disper. corr.   Coulomb (SR)
>     2.01371e+03    1.45326e+04    1.55974e+04   -2.43903e+03   -1.88805e+05
>    Coul. recip. Position Rest.      Potential    Kinetic En.   Total Energy
>     1.26353e+03    7.39689e+01   -1.45132e+05    3.14390e+04   -1.13693e+05
>     Temperature Pres. DC (bar) Pressure (bar)      dVcoul/dl       dVvdw/dl
>     3.01283e+02   -3.61306e+02    1.35461e+02    3.46732e+02    1.03533e+01
>     dVbonded/dl
>    -1.08537e+01
>
>     <======  ###############  ==>
>     <====  A V E R A G E S  ====>
>     <==  ###############  ======>
>
>     Statistics over 10001 steps using 101 frames
>
>    Energies (kJ/mol)
>            Bond          Angle    Proper Dih. Ryckaert-Bell.  Improper Dih.
>     3.01465e+03    4.25438e+03    5.23249e+03    3.47157e+01    2.59375e+02
>           LJ-14     Coulomb-14        LJ (SR)  Disper. corr.   Coulomb (SR)
>     2.02486e+03    1.45795e+04    1.58085e+04   -2.42589e+03   -1.89788e+05
>    Coul. recip. Position Rest.      Potential    Kinetic En.   Total Energy
>     1.28411e+03    6.08802e+01   -1.45660e+05    3.09346e+04   -1.14726e+05
>     Temperature Pres. DC (bar) Pressure (bar)      dVcoul/dl       dVvdw/dl
>     2.96448e+02   -3.57435e+02    3.32252e+01    4.36060e+02    1.77368e+01
>     dVbonded/dl
>    -1.82384e+01
>
>           Box-X          Box-Y          Box-Z
>     4.99607e+00    4.89654e+00    4.61444e+00
>
>    Total Virial (kJ/mol)
>     1.00345e+04    5.03211e+01   -1.17351e+02
>     4.69630e+01    1.04021e+04    1.73033e+02
>    -1.16637e+02    1.75781e+02    1.01673e+04
>
>    Pressure (bar)
>     7.67740e+01   -1.32678e+01    3.58518e+01
>    -1.22810e+01   -2.15571e+01   -5.79828e+01
>     3.56420e+01   -5.87931e+01    4.44585e+01
>
>       T-Protein          T-LIG          T-SOL
>     2.98707e+02    2.97436e+02    2.95680e+02
>
>
>        P P   -   P M E   L O A D   B A L A N C I N G
>
>  PP/PME load balancing changed the cut-off and PME settings:
>            particle-particle                    PME
>             rcoulomb  rlist            grid      spacing   1/beta
>    initial  0.900 nm  0.932 nm      42  42  40   0.119 nm  0.288 nm
>    final    0.963 nm  0.995 nm      40  40  36   0.128 nm  0.308 nm
>  cost-ratio           1.22             0.82
>  (note that these numbers concern only part of the total PP and PME load)
>
>
>     M E G A - F L O P S   A C C O U N T I N G
>
>  NB=Group-cutoff nonbonded kernels    NxN=N-by-N cluster Verlet kernels
>  RF=Reaction-Field  VdW=Van der Waals  QSTab=quadratic-spline table
>  W3=SPC/TIP3p  W4=TIP4p (single or pairs)
>  V&F=Potential and force  V=Potential only  F=Force only
>
>  Computing:                               M-Number         M-Flops  % Flops
> -----------------------------------------------------------------------------
>  NB Free energy kernel                20441.549154       20441.549     0.3
>  Pair Search distance check             289.750448        2607.754     0.0
>  NxN Ewald Elec. + LJ [F]             78217.065728     5162326.338    85.6
>  NxN Ewald Elec. + LJ [V&F]             798.216192       85409.133     1.4
>  1,4 nonbonded interactions              55.597769        5003.799     0.1
>  Calc Weights                           344.614458       12406.120     0.2
>  Spread Q Bspline                     49624.481952       99248.964     1.6
>  Gather F Bspline                     49624.481952      297746.892     4.9
>  3D-FFT                               36508.030372      292064.243     4.8
>  Solve PME                               31.968000        2045.952     0.0
>  Shift-X                                  2.882986          17.298     0.0
>  Bonds                                   21.487804        1267.780     0.0
>  Angles                                  38.645175        6492.389     0.1
>  Propers                                 58.750116       13453.777     0.2
>  Impropers                                4.270427         888.249     0.0
>  RB-Dihedrals                             0.445700         110.088     0.0
>  Pos. Restr.                              0.900090          45.005     0.0
>  Virial                                  23.073531         415.324     0.0
>  Update                                 114.871486        3561.016     0.1
>  Stop-CM                                  1.171572          11.716     0.0
>  Calc-Ekin                               45.966972        1241.108     0.0
>  Constraint-V                           187.108062        1496.864     0.0
>  Constraint-Vir                          18.717354         449.216     0.0
>  Settle                                  62.372472       20146.308     0.3
> -----------------------------------------------------------------------------
>  Total                                                  6028896.883   100.0
> -----------------------------------------------------------------------------
>
>
>
>      R E A L   C Y C L E   A N D   T I M E   A C C O U N T I N G
>
> On 1 MPI rank, each using 8 OpenMP threads
>
>  Computing:          Num   Num      Call    Wall time     Giga-Cycles
>                      Ranks Threads  Count      (s)       total sum    %
> -----------------------------------------------------------------------------
>  Neighbor search        1    8        251       0.530        9.754   1.4
>  Launch GPU ops.        1    8      10001       0.509        9.357   1.3
>  Force                  1    8      10001      10.634      195.662  27.3
>  PME mesh               1    8      10001      22.173      407.991  57.0
>  Wait GPU local         1    8      10001       0.073        1.338   0.2
>  NB X/F buffer ops.     1    8      19751       0.255        4.690   0.7
>  Write traj.            1    8          3       0.195        3.587   0.5
>  Update                 1    8      20002       1.038       19.093   2.7
>  Constraints            1    8      20002       0.374        6.887   1.0
>  Rest                                           3.126       57.513   8.0
> -----------------------------------------------------------------------------
>  Total                                         38.906      715.871 100.0
> -----------------------------------------------------------------------------
>
>  Breakdown of PME mesh computation
> -----------------------------------------------------------------------------
>  PME spread/gather      1    8      40004      19.289      354.929  49.6
>  PME 3D-FFT             1    8      40004       2.319       42.665   6.0
>  PME solve Elec         1    8      20002       0.518        9.538   1.3
> -----------------------------------------------------------------------------
>
>
>  GPU timings
> -----------------------------------------------------------------------------
>  Computing:                         Count  Wall t (s)  ms/step       %
> -----------------------------------------------------------------------------
>  Pair list H2D                        251       0.023    0.090     1.1
>  X / q H2D                          10001       0.269    0.027    12.5
>  Nonbonded F kernel                  9700       1.615    0.166    75.0
>  Nonbonded F+ene k.                    50       0.014    0.273     0.6
>  Nonbonded F+prune k.                 200       0.039    0.196     1.8
>  Nonbonded F+ene+prune k.              51       0.016    0.323     0.8
>  F D2H                              10001       0.177    0.018     8.2
> -----------------------------------------------------------------------------
>  Total                                          2.153    0.215   100.0
> -----------------------------------------------------------------------------
>
>
> Average per-step force GPU/CPU evaluation time ratio: 0.215 ms / 3.280 ms = 0.066
> For optimal performance this ratio should be close to 1!
>
>
> NOTE: The GPU has >25% less load than the CPU. This imbalance causes
>       performance loss.
>
>                Core t (s)   Wall t (s)        (%)
>        Time:      311.246       38.906      800.0
>                  (ns/day)    (hour/ns)
> Performance:       22.210        1.081
> =================================================
>
> On 2/22/2017 1:04 AM, Igor Leontyev wrote:
>> Hi.
>> I am having a hard time accelerating free energy (FE) simulations on
>> my high-end GPU. I am not sure whether this is normal for my rather
>> small systems or whether I am doing something wrong.
>>
>> The efficiency of GPU acceleration seems to decrease with system
>> size, right? A typical system size in FE simulations is 32x32x32 A^3
>> (~3.5K atoms) in water and about 60x60x60 A^3 (~25K atoms) in
>> protein; a larger MD box is rarely required in FE simulations.
>>
>> For my system (11K atoms) on 8 CPUs with a GTX 1080 GPU I am getting
>> only up to a 50% speedup, and GPU utilization during the simulation
>> is only 1-2%. Does that sound right? (I am using the current gmx
>> ver-2016.2 and CUDA driver 8.0; on request I will attach log files
>> with all the details.)
>>
>> BTW, regarding how much the perturbed interactions cost: in my case a
>> simulation with "free_energy = no" runs about TWICE as fast.
>>
>> Igor
>>
>>> On 2/13/17, 1:32 AM,
>>> "gromacs.org_gmx-developers-bounces at maillist.sys.kth.se on behalf of
>>> Berk Hess" <gromacs.org_gmx-developers-bounces at maillist.sys.kth.se on
>>> behalf of hess at kth.se> wrote:
>>>
>>>     That depends on what you mean by this.
>>>     With free energy, all non-perturbed non-bonded interactions can run
>>>     on the GPU. The perturbed ones currently can not. For a large
>>>     system with a few perturbed atoms this is no issue. For smaller
>>>     systems the free-energy kernel can be the limiting factor. I think
>>>     there is a lot of gain to be had in making the extremely complex
>>>     CPU free-energy kernel faster. Initially I thought SIMD would not
>>>     help there. But since any perturbed i-particle will have perturbed
>>>     interactions with all j's, this will help a lot.
>>>
>>>     Cheers,
>>>
>>>     Berk
>>>
>>>     On 2017-02-13 01:08, Michael R Shirts wrote:
>>>     > What's the current state of the free energy code on GPUs, and
>>>     > what are the roadblocks?
>>>     >
>>>     > Thanks!
>>>     > ~~~~~~~~~~~~~~~~
>>>     > Michael Shirts


