[gmx-developers] free energies on GPUs?
Berk Hess
hess at kth.se
Thu Feb 23 01:52:50 CET 2017
I don't see anything strange, apart from the multiple-run issue Mark
noticed.
For performance, pme-order=6 is bad: you spend 50% of the CPU time in PME
spread+gather, and order 6 is not accelerated with SIMD intrinsics. Using
pme-order=5 will be about twice as fast. You can reduce the grid spacing a
bit if you think you need high PME accuracy.
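For illustration, a minimal sketch of the corresponding .mdp change (the
fourierspacing value below is only an example; keep your current 0.12 unless
you actually need higher PME accuracy):

    pme-order       = 5      ; order-5 spread/gather runs roughly twice as fast as order 6
    fourierspacing  = 0.11   ; optionally a slightly finer grid if you want more PME accuracy

Note that grompp derives fourier-nx/ny/nz from fourierspacing only when they
are not set explicitly, and mdrun's PP-PME tuning will still adjust the grid
and Coulomb cut-off at run time, as in the log below.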
Cheers,
Berk
On 22/02/17 11:16, Igor Leontyev wrote:
> >
> > What CPU vs GPU time per step gets reported at the end of the log
> > file?
>
> Thank you, Berk, for the prompt response. Here is my log file with all
> the details.
>
> =================================================
> Host: compute-0-113.local pid: 12081 rank ID: 0 number of ranks: 1
> :-) GROMACS - gmx mdrun, 2016.2 (-:
>
> GROMACS is written by:
> ...........................................................
>
> GROMACS: gmx mdrun, version 2016.2
> Executable:
> /home/leontyev/programs/bin/gromacs/gromacs-2016.2/bin/gmx_avx2_gpu
> Data prefix: /home/leontyev/programs/bin/gromacs/gromacs-2016.2
> Working dir:
> /share/COMMON2/MDRUNS/GROMACS/MUTATIONS/PROTEINS/coc-Flu_A-B_LIGs/MDRUNS/InP/fluA/Output_test/6829_6818_9/Gromacs.571690
> Command line:
> gmx_avx2_gpu mdrun -nb gpu -gpu_id 3 -pin on -nt 8 -s
> 6829_6818-liq_0.tpr -e
> /state/partition1/Gromacs.571690.0//6829_6818-liq_0.edr -dhdl
> /state/partition1/Gromacs.571690.0//6829_6818-liq_0.xvg -o
> /state/partition1/Gromacs.571690.0//6829_6818-liq_0.trr -x
> /state/partition1/Gromacs.571690.0//6829_6818-liq_0.xtc -cpo
> /state/partition1/Gromacs.571690.0//6829_6818-liq_0.cpt -c
> 6829_6818-liq_0.gro -g 6829_6818-liq_0.log
>
> GROMACS version: 2016.2
> Precision: single
> Memory model: 64 bit
> MPI library: thread_mpi
> OpenMP support: enabled (GMX_OPENMP_MAX_THREADS = 32)
> GPU support: CUDA
> SIMD instructions: AVX2_256
> FFT library: fftw-3.3.4-sse2-avx
> RDTSCP usage: enabled
> TNG support: enabled
> Hwloc support: disabled
> Tracing support: disabled
> Built on: Mon Feb 20 18:26:54 PST 2017
> Built by: leontyev at cluster01.interxinc.com [CMAKE]
> Build OS/arch: Linux 2.6.32-642.el6.x86_64 x86_64
> Build CPU vendor: Intel
> Build CPU brand: Intel(R) Xeon(R) CPU E5-1620 0 @ 3.60GHz
> Build CPU family: 6 Model: 45 Stepping: 7
> Build CPU features: aes apic avx clfsh cmov cx8 cx16 htt lahf mmx msr
> nonstop_tsc pcid pclmuldq pdcm pdpe1gb popcnt pse rdtscp sse2 sse3
> sse4.1 sse4.2 ssse3 tdt x2apic
> C compiler: /share/apps/devtoolset-1.1/root/usr/bin/gcc GNU 4.7.2
> C compiler flags: -march=core-avx2 -static-libgcc
> -static-libstdc++ -O3 -DNDEBUG -funroll-all-loops
> -fexcess-precision=fast
> C++ compiler: /share/apps/devtoolset-1.1/root/usr/bin/g++ GNU 4.7.2
> C++ compiler flags: -march=core-avx2 -std=c++0x -O3 -DNDEBUG
> -funroll-all-loops -fexcess-precision=fast
> CUDA compiler: /share/apps/cuda-8.0/bin/nvcc nvcc: NVIDIA (R)
> Cuda compiler driver;Copyright (c) 2005-2016 NVIDIA Corporation;Built
> on Sun_Sep__4_22:14:01_CDT_2016;Cuda compilation tools, release 8.0,
> V8.0.44
> CUDA compiler
> flags:-gencode;arch=compute_35,code=sm_35;-gencode;arch=compute_52,code=sm_52;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_35,code=compute_35;-gencode;arch=compute_52,code=compute_52;-gencode;arch=compute_61,code=compute_61;-use_fast_math;;;-Xcompiler;,-march=core-avx2,,,,,,;-Xcompiler;-O3,-DNDEBUG,-funroll-all-loops,-fexcess-precision=fast,,;
>
> CUDA driver: 8.0
> CUDA runtime: 8.0
>
>
> Running on 1 node with total 24 cores, 24 logical cores, 4 compatible GPUs
> Hardware detected:
> CPU info:
> Vendor: Intel
> Brand: Intel(R) Xeon(R) CPU E5-2670 v3 @ 2.30GHz
> Family: 6 Model: 63 Stepping: 2
> Features: aes apic avx avx2 clfsh cmov cx8 cx16 f16c fma htt lahf
> mmx msr nonstop_tsc pcid pclmuldq pdcm pdpe1gb popcnt pse rdrnd rdtscp
> sse2 sse3 sse4.1 sse4.2 ssse3 tdt x2apic
> SIMD instructions most likely to fit this hardware: AVX2_256
> SIMD instructions selected at GROMACS compile time: AVX2_256
>
> Hardware topology: Basic
> Sockets, cores, and logical processors:
> Socket 0: [ 0] [ 1] [ 2] [ 3] [ 4] [ 5] [ 6] [ 7] [ 8] [ 9] [ 10] [ 11]
> Socket 1: [ 12] [ 13] [ 14] [ 15] [ 16] [ 17] [ 18] [ 19] [ 20] [ 21] [ 22] [ 23]
> GPU info:
> Number of GPUs detected: 4
> #0: NVIDIA GeForce GTX 1080, compute cap.: 6.1, ECC: no, stat: compatible
> #1: NVIDIA GeForce GTX 1080, compute cap.: 6.1, ECC: no, stat: compatible
> #2: NVIDIA GeForce GTX 1080, compute cap.: 6.1, ECC: no, stat: compatible
> #3: NVIDIA GeForce GTX 1080, compute cap.: 6.1, ECC: no, stat: compatible
>
>
> For optimal performance with a GPU nstlist (now 10) should be larger.
> The optimum depends on your CPU and GPU resources.
> You might want to try several nstlist values.
> Changing nstlist from 10 to 40, rlist from 0.9 to 0.932
>
> Input Parameters:
> integrator = sd
> tinit = 0
> dt = 0.001
> nsteps = 10000
> init-step = 0
> simulation-part = 1
> comm-mode = Linear
> nstcomm = 100
> bd-fric = 0
> ld-seed = 1103660843
> emtol = 10
> emstep = 0.01
> niter = 20
> fcstep = 0
> nstcgsteep = 1000
> nbfgscorr = 10
> rtpi = 0.05
> nstxout = 10000000
> nstvout = 10000000
> nstfout = 0
> nstlog = 20000
> nstcalcenergy = 100
> nstenergy = 1000
> nstxout-compressed = 5000
> compressed-x-precision = 1000
> cutoff-scheme = Verlet
> nstlist = 40
> ns-type = Grid
> pbc = xyz
> periodic-molecules = false
> verlet-buffer-tolerance = 0.005
> rlist = 0.932
> coulombtype = PME
> coulomb-modifier = Potential-shift
> rcoulomb-switch = 0.9
> rcoulomb = 0.9
> epsilon-r = 1
> epsilon-rf = inf
> vdw-type = Cut-off
> vdw-modifier = Potential-shift
> rvdw-switch = 0.9
> rvdw = 0.9
> DispCorr = EnerPres
> table-extension = 1
> fourierspacing = 0.12
> fourier-nx = 42
> fourier-ny = 42
> fourier-nz = 40
> pme-order = 6
> ewald-rtol = 1e-05
> ewald-rtol-lj = 0.001
> lj-pme-comb-rule = Geometric
> ewald-geometry = 0
> epsilon-surface = 0
> implicit-solvent = No
> gb-algorithm = Still
> nstgbradii = 1
> rgbradii = 1
> gb-epsilon-solvent = 80
> gb-saltconc = 0
> gb-obc-alpha = 1
> gb-obc-beta = 0.8
> gb-obc-gamma = 4.85
> gb-dielectric-offset = 0.009
> sa-algorithm = Ace-approximation
> sa-surface-tension = 2.05016
> tcoupl = No
> nsttcouple = 5
> nh-chain-length = 0
> print-nose-hoover-chain-variables = false
> pcoupl = Parrinello-Rahman
> pcoupltype = Isotropic
> nstpcouple = 5
> tau-p = 0.5
> compressibility (3x3):
> compressibility[ 0]={ 5.00000e-05, 0.00000e+00, 0.00000e+00}
> compressibility[ 1]={ 0.00000e+00, 5.00000e-05, 0.00000e+00}
> compressibility[ 2]={ 0.00000e+00, 0.00000e+00, 5.00000e-05}
> ref-p (3x3):
> ref-p[ 0]={ 1.01325e+00, 0.00000e+00, 0.00000e+00}
> ref-p[ 1]={ 0.00000e+00, 1.01325e+00, 0.00000e+00}
> ref-p[ 2]={ 0.00000e+00, 0.00000e+00, 1.01325e+00}
> refcoord-scaling = All
> posres-com (3):
> posres-com[0]= 0.00000e+00
> posres-com[1]= 0.00000e+00
> posres-com[2]= 0.00000e+00
> posres-comB (3):
> posres-comB[0]= 0.00000e+00
> posres-comB[1]= 0.00000e+00
> posres-comB[2]= 0.00000e+00
> QMMM = false
> QMconstraints = 0
> QMMMscheme = 0
> MMChargeScaleFactor = 1
> qm-opts:
> ngQM = 0
> constraint-algorithm = Lincs
> continuation = false
> Shake-SOR = false
> shake-tol = 0.0001
> lincs-order = 12
> lincs-iter = 1
> lincs-warnangle = 30
> nwall = 0
> wall-type = 9-3
> wall-r-linpot = -1
> wall-atomtype[0] = -1
> wall-atomtype[1] = -1
> wall-density[0] = 0
> wall-density[1] = 0
> wall-ewald-zfac = 3
> pull = false
> rotation = false
> interactiveMD = false
> disre = No
> disre-weighting = Conservative
> disre-mixed = false
> dr-fc = 1000
> dr-tau = 0
> nstdisreout = 100
> orire-fc = 0
> orire-tau = 0
> nstorireout = 100
> free-energy = yes
> init-lambda = -1
> init-lambda-state = 0
> delta-lambda = 0
> nstdhdl = 100
> n-lambdas = 13
> separate-dvdl:
> fep-lambdas = FALSE
> mass-lambdas = FALSE
> coul-lambdas = TRUE
> vdw-lambdas = TRUE
> bonded-lambdas = TRUE
> restraint-lambdas = FALSE
> temperature-lambdas = FALSE
> all-lambdas:
> fep-lambdas = 0 0 0 0 0 0 0 0 0 0 0 0 0
> mass-lambdas = 0 0 0 0 0 0 0 0 0 0 0 0 0
> coul-lambdas = 0 0.03 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 0.97 1
> vdw-lambdas = 0 0.03 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 0.97 1
> bonded-lambdas = 0 0.03 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 0.97 1
> restraint-lambdas = 0 0 0 0 0 0 0 0 0 0 0 0 0
> temperature-lambdas = 0 0 0 0 0 0 0 0 0 0 0 0 0
> calc-lambda-neighbors = -1
> dhdl-print-energy = potential
> sc-alpha = 0.1
> sc-power = 1
> sc-r-power = 6
> sc-sigma = 0.3
> sc-sigma-min = 0.3
> sc-coul = true
> dh-hist-size = 0
> dh-hist-spacing = 0.1
> separate-dhdl-file = yes
> dhdl-derivatives = yes
> cos-acceleration = 0
> deform (3x3):
> deform[ 0]={ 0.00000e+00, 0.00000e+00, 0.00000e+00}
> deform[ 1]={ 0.00000e+00, 0.00000e+00, 0.00000e+00}
> deform[ 2]={ 0.00000e+00, 0.00000e+00, 0.00000e+00}
> simulated-tempering = false
> E-x:
> n = 0
> E-xt:
> n = 0
> E-y:
> n = 0
> E-yt:
> n = 0
> E-z:
> n = 0
> E-zt:
> n = 0
> swapcoords = no
> userint1 = 0
> userint2 = 0
> userint3 = 0
> userint4 = 0
> userreal1 = 0
> userreal2 = 0
> userreal3 = 0
> userreal4 = 0
> grpopts:
> nrdf: 6332.24 62.9925 18705.8
> ref-t: 298.15 298.15 298.15
> tau-t: 1 1 1
> annealing: No No No
> annealing-npoints: 0 0 0
> acc: 0 0 0
> nfreeze: N N N
> energygrp-flags[ 0]: 0
>
> Using 1 MPI thread
> Using 8 OpenMP threads
>
> 1 GPU user-selected for this run.
> Mapping of GPU ID to the 1 PP rank in this node: 3
>
> Will do PME sum in reciprocal space for electrostatic interactions.
>
> ++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
> U. Essmann, L. Perera, M. L. Berkowitz, T. Darden, H. Lee and L. G.
> Pedersen
> A smooth particle mesh Ewald method
> J. Chem. Phys. 103 (1995) pp. 8577-8592
> -------- -------- --- Thank You --- -------- --------
>
> Will do ordinary reciprocal space Ewald sum.
> Using a Gaussian width (1/beta) of 0.288146 nm for Ewald
> Cut-off's: NS: 0.932 Coulomb: 0.9 LJ: 0.9
> Long Range LJ corr.: <C6> 3.6183e-04
> System total charge, top. A: 7.000 top. B: 7.000
> Generated table with 965 data points for Ewald.
> Tabscale = 500 points/nm
> Generated table with 965 data points for LJ6.
> Tabscale = 500 points/nm
> Generated table with 965 data points for LJ12.
> Tabscale = 500 points/nm
> Generated table with 965 data points for 1-4 COUL.
> Tabscale = 500 points/nm
> Generated table with 965 data points for 1-4 LJ6.
> Tabscale = 500 points/nm
> Generated table with 965 data points for 1-4 LJ12.
> Tabscale = 500 points/nm
> Potential shift: LJ r^-12: -3.541e+00 r^-6: -1.882e+00, Ewald -1.000e-05
> Initialized non-bonded Ewald correction tables, spacing: 8.85e-04 size: 1018
>
>
> Using GPU 8x8 non-bonded kernels
>
> Using Lorentz-Berthelot Lennard-Jones combination rule
>
> There are 21 atoms and 21 charges for free energy perturbation
> Removing pbc first time
> Pinning threads with an auto-selected logical core stride of 1
>
> ++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
> S. Miyamoto and P. A. Kollman
> SETTLE: An Analytical Version of the SHAKE and RATTLE Algorithms for
> Rigid
> Water Models
> J. Comp. Chem. 13 (1992) pp. 952-962
> -------- -------- --- Thank You --- -------- --------
>
> Intra-simulation communication will occur every 5 steps.
> Initial vector of lambda components:[ 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 ]
> Center of mass motion removal mode is Linear
> We have the following groups for center of mass motion removal:
> 0: rest
>
> ++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
> N. Goga and A. J. Rzepiela and A. H. de Vries and S. J. Marrink and H.
> J. C.
> Berendsen
> Efficient Algorithms for Langevin and DPD Dynamics
> J. Chem. Theory Comput. 8 (2012) pp. 3637--3649
> -------- -------- --- Thank You --- -------- --------
>
> There are: 11486 Atoms
>
> Constraining the starting coordinates (step 0)
>
> Constraining the coordinates at t0-dt (step 0)
> RMS relative constraint deviation after constraining: 0.00e+00
> Initial temperature: 291.365 K
>
> Started mdrun on rank 0 Wed Feb 22 02:11:02 2017
> Step Time
> 0 0.00000
>
> Energies (kJ/mol)
> Bond Angle Proper Dih. Ryckaert-Bell. Improper Dih.
> 2.99018e+03 4.09043e+03 5.20416e+03 4.32600e+01 2.38045e+02
> LJ-14 Coulomb-14 LJ (SR) Disper. corr. Coulomb (SR)
> 2.04778e+03 1.45523e+04 1.59846e+04 -2.41317e+03 -1.92125e+05
> Coul. recip. Position Rest. Potential Kinetic En. Total Energy
> 1.58368e+03 2.08367e-09 -1.47804e+05 3.03783e+04 -1.17425e+05
> Temperature Pres. DC (bar) Pressure (bar) dVcoul/dl dVvdw/dl
> 2.91118e+02 -3.53694e+02 -3.01252e+02 4.77627e+02 1.41810e+01
> dVbonded/dl
> -2.15074e+01
>
> step 80: timed with pme grid 42 42 40, coulomb cutoff 0.900: 391.6 M-cycles
> step 160: timed with pme grid 36 36 36, coulomb cutoff 1.043: 595.7 M-cycles
> step 240: timed with pme grid 40 36 36, coulomb cutoff 1.022: 401.1 M-cycles
> step 320: timed with pme grid 40 40 36, coulomb cutoff 0.963: 318.8 M-cycles
> step 400: timed with pme grid 40 40 40, coulomb cutoff 0.938: 349.9 M-cycles
> step 480: timed with pme grid 42 40 40, coulomb cutoff 0.920: 319.9 M-cycles
> optimal pme grid 40 40 36, coulomb cutoff 0.963
> Step Time
> 10000 10.00000
>
> Writing checkpoint, step 10000 at Wed Feb 22 02:11:41 2017
>
>
> Energies (kJ/mol)
> Bond Angle Proper Dih. Ryckaert-Bell. Improper Dih.
> 2.99123e+03 4.14451e+03 5.19572e+03 2.56045e+01 2.74109e+02
> LJ-14 Coulomb-14 LJ (SR) Disper. corr. Coulomb (SR)
> 2.01371e+03 1.45326e+04 1.55974e+04 -2.43903e+03 -1.88805e+05
> Coul. recip. Position Rest. Potential Kinetic En. Total Energy
> 1.26353e+03 7.39689e+01 -1.45132e+05 3.14390e+04 -1.13693e+05
> Temperature Pres. DC (bar) Pressure (bar) dVcoul/dl dVvdw/dl
> 3.01283e+02 -3.61306e+02 1.35461e+02 3.46732e+02 1.03533e+01
> dVbonded/dl
> -1.08537e+01
>
> <====== ############### ==>
> <==== A V E R A G E S ====>
> <== ############### ======>
>
> Statistics over 10001 steps using 101 frames
>
> Energies (kJ/mol)
> Bond Angle Proper Dih. Ryckaert-Bell. Improper Dih.
> 3.01465e+03 4.25438e+03 5.23249e+03 3.47157e+01 2.59375e+02
> LJ-14 Coulomb-14 LJ (SR) Disper. corr. Coulomb (SR)
> 2.02486e+03 1.45795e+04 1.58085e+04 -2.42589e+03 -1.89788e+05
> Coul. recip. Position Rest. Potential Kinetic En. Total Energy
> 1.28411e+03 6.08802e+01 -1.45660e+05 3.09346e+04 -1.14726e+05
> Temperature Pres. DC (bar) Pressure (bar) dVcoul/dl dVvdw/dl
> 2.96448e+02 -3.57435e+02 3.32252e+01 4.36060e+02 1.77368e+01
> dVbonded/dl
> -1.82384e+01
>
> Box-X Box-Y Box-Z
> 4.99607e+00 4.89654e+00 4.61444e+00
>
> Total Virial (kJ/mol)
> 1.00345e+04 5.03211e+01 -1.17351e+02
> 4.69630e+01 1.04021e+04 1.73033e+02
> -1.16637e+02 1.75781e+02 1.01673e+04
>
> Pressure (bar)
> 7.67740e+01 -1.32678e+01 3.58518e+01
> -1.22810e+01 -2.15571e+01 -5.79828e+01
> 3.56420e+01 -5.87931e+01 4.44585e+01
>
> T-Protein T-LIG T-SOL
> 2.98707e+02 2.97436e+02 2.95680e+02
>
>
> P P - P M E L O A D B A L A N C I N G
>
> PP/PME load balancing changed the cut-off and PME settings:
> particle-particle PME
> rcoulomb rlist grid spacing 1/beta
> initial 0.900 nm 0.932 nm 42 42 40 0.119 nm 0.288 nm
> final 0.963 nm 0.995 nm 40 40 36 0.128 nm 0.308 nm
> cost-ratio 1.22 0.82
> (note that these numbers concern only part of the total PP and PME load)
>
>
> M E G A - F L O P S A C C O U N T I N G
>
> NB=Group-cutoff nonbonded kernels NxN=N-by-N cluster Verlet kernels
> RF=Reaction-Field VdW=Van der Waals QSTab=quadratic-spline table
> W3=SPC/TIP3p W4=TIP4p (single or pairs)
> V&F=Potential and force V=Potential only F=Force only
>
> Computing: M-Number M-Flops % Flops
> -----------------------------------------------------------------------------
>
> NB Free energy kernel 20441.549154 20441.549 0.3
> Pair Search distance check 289.750448 2607.754 0.0
> NxN Ewald Elec. + LJ [F] 78217.065728 5162326.338 85.6
> NxN Ewald Elec. + LJ [V&F] 798.216192 85409.133 1.4
> 1,4 nonbonded interactions 55.597769 5003.799 0.1
> Calc Weights 344.614458 12406.120 0.2
> Spread Q Bspline 49624.481952 99248.964 1.6
> Gather F Bspline 49624.481952 297746.892 4.9
> 3D-FFT 36508.030372 292064.243 4.8
> Solve PME 31.968000 2045.952 0.0
> Shift-X 2.882986 17.298 0.0
> Bonds 21.487804 1267.780 0.0
> Angles 38.645175 6492.389 0.1
> Propers 58.750116 13453.777 0.2
> Impropers 4.270427 888.249 0.0
> RB-Dihedrals 0.445700 110.088 0.0
> Pos. Restr. 0.900090 45.005 0.0
> Virial 23.073531 415.324 0.0
> Update 114.871486 3561.016 0.1
> Stop-CM 1.171572 11.716 0.0
> Calc-Ekin 45.966972 1241.108 0.0
> Constraint-V 187.108062 1496.864 0.0
> Constraint-Vir 18.717354 449.216 0.0
> Settle 62.372472 20146.308 0.3
> -----------------------------------------------------------------------------
>
> Total 6028896.883 100.0
> -----------------------------------------------------------------------------
>
>
>
> R E A L C Y C L E A N D T I M E A C C O U N T I N G
>
> On 1 MPI rank, each using 8 OpenMP threads
>
> Computing: Num Num Call Wall time Giga-Cycles
> Ranks Threads Count (s) total sum %
> -----------------------------------------------------------------------------
>
> Neighbor search 1 8 251 0.530 9.754 1.4
> Launch GPU ops. 1 8 10001 0.509 9.357 1.3
> Force 1 8 10001 10.634 195.662 27.3
> PME mesh 1 8 10001 22.173 407.991 57.0
> Wait GPU local 1 8 10001 0.073 1.338 0.2
> NB X/F buffer ops. 1 8 19751 0.255 4.690 0.7
> Write traj. 1 8 3 0.195 3.587 0.5
> Update 1 8 20002 1.038 19.093 2.7
> Constraints 1 8 20002 0.374 6.887 1.0
> Rest 3.126 57.513 8.0
> -----------------------------------------------------------------------------
>
> Total 38.906 715.871 100.0
> -----------------------------------------------------------------------------
>
> Breakdown of PME mesh computation
> -----------------------------------------------------------------------------
>
> PME spread/gather 1 8 40004 19.289 354.929 49.6
> PME 3D-FFT 1 8 40004 2.319 42.665 6.0
> PME solve Elec 1 8 20002 0.518 9.538 1.3
> -----------------------------------------------------------------------------
>
>
> GPU timings
> -----------------------------------------------------------------------------
>
> Computing: Count Wall t (s) ms/step %
> -----------------------------------------------------------------------------
>
> Pair list H2D 251 0.023 0.090 1.1
> X / q H2D 10001 0.269 0.027 12.5
> Nonbonded F kernel 9700 1.615 0.166 75.0
> Nonbonded F+ene k. 50 0.014 0.273 0.6
> Nonbonded F+prune k. 200 0.039 0.196 1.8
> Nonbonded F+ene+prune k. 51 0.016 0.323 0.8
> F D2H 10001 0.177 0.018 8.2
> -----------------------------------------------------------------------------
>
> Total 2.153 0.215 100.0
> -----------------------------------------------------------------------------
>
>
> Average per-step force GPU/CPU evaluation time ratio: 0.215 ms/3.280 ms = 0.066
> For optimal performance this ratio should be close to 1!
>
>
> NOTE: The GPU has >25% less load than the CPU. This imbalance causes
> performance loss.
>
> Core t (s) Wall t (s) (%)
> Time: 311.246 38.906 800.0
> (ns/day) (hour/ns)
> Performance: 22.210 1.081
> =================================================
>
> On 2/22/2017 1:04 AM, Igor Leontyev wrote:
>> Hi.
>> I am having a hard time accelerating free energy (FE) simulations on
>> my high-end GPU. I am not sure whether this is normal for my fairly
>> small systems or whether I am doing something wrong.
>>
>> The efficiency of GPU acceleration seems to decrease with system
>> size, right? A typical box in FE simulations is 32x32x32 A^3
>> (~3.5K atoms) in water and about 60x60x60 A^3 (~25K atoms) in protein;
>> a larger MD box is rarely needed for FE simulations.
>>
>> For my system (11K atoms), running on 8 CPU cores with a GTX 1080
>> GPU gives only up to a 50% speedup. GPU utilization during the
>> simulation is only 1-2%. Does that sound right? (I am using the current
>> gmx version 2016.2 and CUDA driver 8.0; on request I will attach
>> log files with all the details.)
>>
>> BTW, regarding how much the perturbed interactions cost: in my case a
>> simulation with "free_energy = no" runs about TWICE as fast.
>>
>> Igor
>>
>>> On 2/13/17, 1:32 AM,
>>> "gromacs.org_gmx-developers-bounces at maillist.sys.kth.se on behalf of
>>> Berk Hess" <gromacs.org_gmx-developers-bounces at maillist.sys.kth.se on
>>> behalf of hess at kth.se> wrote:
>>>
>>> That depends on what you mean by this.
>>> With free energy, all non-perturbed non-bonded interactions can run on
>>> the GPU. The perturbed ones currently cannot. For a large system with a
>>> few perturbed atoms this is no issue. For smaller systems the
>>> free-energy kernel can be the limiting factor. I think there is a lot of
>>> gain to be had in making the extremely complex CPU free-energy kernel
>>> faster. Initially I thought SIMD would not help there. But since any
>>> perturbed i-particle has perturbed interactions with all j's, this
>>> will help a lot.
>>>
>>> Cheers,
>>>
>>> Berk
>>>
>>> On 2017-02-13 01:08, Michael R Shirts wrote:
>>> > What's the current state of the free energy code on GPUs, and what
>>> > are the roadblocks?
>>> >
>>> > Thanks!
>>> > ~~~~~~~~~~~~~~~~
>>> > Michael Shirts