[gmx-users] Help on MD performance, GPU has less load than CPU.
Mark Abraham
mark.j.abraham at gmail.com
Tue Jul 11 15:24:26 CEST 2017
Hi,
I'm genuinely curious why people set ewald_rtol smaller (which is unlikely
to be useful, because the accumulation of forces in single precision has
round-off error, so the approximation to the correct sum is not reliably
accurate to better than about 1 part in 1e5), and thus pme_order to large
values - this is the second time I've seen it in 24 hours. Is there data
somewhere that shows this is useful?
In any case, it a) causes a lot more work on the CPU, and b) loses out
because only pme_order 4 (and, to a lesser extent, 5) is optimized for
performance (since there's no data showing that higher orders are useful).
And for a free-energy calculation, that extra expense accrues for each
lambda state. See the "PME mesh" parts of the performance report.
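For reference, the stock PME settings (from memory - roughly what grompp
uses if you leave these lines out of the .mdp entirely, so do double-check
against the manual for your version) look like this:
    pme-order      = 4      ; the order the kernels are optimized for
    ewald-rtol     = 1e-5   ; about what single-precision summation can resolve
    fourierspacing = 0.12   ; grompp derives fourier-nx/ny/nz from this and the box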
Guessing wildly, the cost of your simulation is probably at least double
what the defaults would give, and for that cost, I'd want to know why.
Mark
On Mon, Jul 10, 2017 at 5:02 PM Davide Bonanni <davide.bonanni at unito.it>
wrote:
> Hi,
>
> I am working on a node with an Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz
> (16 physical cores, 32 logical cores) and 1 NVIDIA GeForce GTX 980 Ti GPU.
> I am launching a series of 2 ns molecular dynamics simulations of a system
> of 60000 atoms.
> I tried various setting combinations, but I obtained the best
> performance with the command:
>
> "gmx mdrun -deffnm md_LIG -cpt 1 -cpo restart1.cpt -pin on"
>
> which uses 32 OpenMP threads, 1 MPI thread, and the GPU.
> At the end of the .log file of the molecular dynamics production I obtain
> this message:
>
> "NOTE: The GPU has >25% less load than the CPU. This imbalance causes
> performance loss."
>
> I don't know how I can improve the load on the CPU more than this, or how
> I can decrease the load on the GPU. Do you have any suggestions?
>
> Thank you in advance.
>
> Cheers,
>
> Davide Bonanni
>
>
> The initial and final parts of the log file are below:
>
> Log file opened on Sun Jul 9 04:02:44 2017
> Host: bigblue pid: 16777 rank ID: 0 number of ranks: 1
> :-) GROMACS - gmx mdrun, VERSION 5.1.4 (-:
>
>
>
> GROMACS: gmx mdrun, VERSION 5.1.4
> Executable: /usr/bin/gmx
> Data prefix: /usr/local/gromacs
> Command line:
> gmx mdrun -deffnm md_fluo_7 -cpt 1 -cpo restart1.cpt -pin on
>
> GROMACS version: VERSION 5.1.4
> Precision: single
> Memory model: 64 bit
> MPI library: thread_mpi
> OpenMP support: enabled (GMX_OPENMP_MAX_THREADS = 32)
> GPU support: enabled
> OpenCL support: disabled
> invsqrt routine: gmx_software_invsqrt(x)
> SIMD instructions: AVX2_256
> FFT library: fftw-3.3.4-sse2-avx
> RDTSCP usage: enabled
> C++11 compilation: disabled
> TNG support: enabled
> Tracing support: disabled
> Built on: Tue 8 Nov 12:26:14 CET 2016
> Built by: root at bigblue [CMAKE]
> Build OS/arch: Linux 3.10.0-327.el7.x86_64 x86_64
> Build CPU vendor: GenuineIntel
> Build CPU brand: Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz
> Build CPU family: 6 Model: 63 Stepping: 2
> Build CPU features: aes apic avx avx2 clfsh cmov cx8 cx16 f16c fma htt
> lahf_lm mmx msr nonstop_tsc pcid pclmuldq pdcm pdpe1gb popcnt pse rdrnd
> rdtscp sse2 sse3 sse4.1 sse4.2 ssse3 tdt x2apic
> C compiler: /bin/cc GNU 4.8.5
> C compiler flags: -march=core-avx2 -Wextra
> -Wno-missing-field-initializers
> -Wno-sign-compare -Wpointer-arith -Wall -Wno-unused -Wunused-value
> -Wunused-parameter -O3 -DNDEBUG -funroll-all-loops -fexcess-precision=fast
> -Wno-array-bounds
> C++ compiler: /bin/c++ GNU 4.8.5
> C++ compiler flags: -march=core-avx2 -Wextra
> -Wno-missing-field-initializers
> -Wpointer-arith -Wall -Wno-unused-function -O3 -DNDEBUG -funroll-all-loops
> -fexcess-precision=fast -Wno-array-bounds
> Boost version: 1.55.0 (internal)
> CUDA compiler: /usr/local/cuda/bin/nvcc nvcc: NVIDIA (R) Cuda compiler
> driver;Copyright (c) 2005-2016 NVIDIA Corporation;Built on
> Sun_Sep__4_22:14:01_CDT_2016;Cuda compilation tools, release 8.0, V8.0.44
> CUDA compiler flags:-gencode;arch=compute_20,code=sm_20;-gencode;arch=
> compute_30,code=sm_30;-gencode;arch=compute_35,code=
> sm_35;-gencode;arch=compute_37,code=sm_37;-gencode;arch=
> compute_50,code=sm_50;-gencode;arch=compute_52,code=
> sm_52;-gencode;arch=compute_60,code=sm_60;-gencode;arch=
> compute_61,code=sm_61;-gencode;arch=compute_60,code=
> compute_60;-gencode;arch=compute_61,code=compute_61;-use_fast_math;;
> ;-march=core-avx2;-Wextra;-Wno-missing-field-initializers;-Wpointer-arith;-
> Wall;-Wno-unused-function;-O3;-DNDEBUG;-funroll-all-loops;-
> fexcess-precision=fast;-Wno-array-bounds;
> CUDA driver: 8.0
> CUDA runtime: 8.0
>
>
> Running on 1 node with total 16 cores, 32 logical cores, 1 compatible GPU
> Hardware detected:
> CPU info:
> Vendor: GenuineIntel
> Brand: Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz
> Family: 6 model: 63 stepping: 2
> CPU features: aes apic avx avx2 clfsh cmov cx8 cx16 f16c fma htt
> lahf_lm mmx msr nonstop_tsc pcid pclmuldq pdcm pdpe1gb popcnt pse rdrnd
> rdtscp sse2 sse3 sse4.1 sse4.2 ssse3 tdt x2apic
> SIMD instructions most likely to fit this hardware: AVX2_256
> SIMD instructions selected at GROMACS compile time: AVX2_256
> GPU info:
> Number of GPUs detected: 1
> #0: NVIDIA GeForce GTX 980 Ti, compute cap.: 5.2, ECC: no, stat: compatible
>
>
>
> Changing nstlist from 20 to 40, rlist from 1.2 to 1.2
>
> Input Parameters:
> integrator = sd
> tinit = 0
> dt = 0.002
> nsteps = 1000000
> init-step = 0
> simulation-part = 1
> comm-mode = Linear
> nstcomm = 100
> bd-fric = 0
> ld-seed = 57540858
> emtol = 10
> emstep = 0.01
> niter = 20
> fcstep = 0
> nstcgsteep = 1000
> nbfgscorr = 10
> rtpi = 0.05
> nstxout = 5000
> nstvout = 500
> nstfout = 0
> nstlog = 500
> nstcalcenergy = 100
> nstenergy = 1000
> nstxout-compressed = 0
> compressed-x-precision = 1000
> cutoff-scheme = Verlet
> nstlist = 40
> ns-type = Grid
> pbc = xyz
> periodic-molecules = FALSE
> verlet-buffer-tolerance = 0.005
> rlist = 1.2
> rlistlong = 1.2
> nstcalclr = 20
> coulombtype = PME
> coulomb-modifier = Potential-shift
> rcoulomb-switch = 0
> rcoulomb = 1.2
> epsilon-r = 1
> epsilon-rf = inf
> vdw-type = Cut-off
> vdw-modifier = Potential-switch
> rvdw-switch = 1
> rvdw = 1.2
> DispCorr = EnerPres
> table-extension = 1
> fourierspacing = 0.12
> fourier-nx = 72
> fourier-ny = 72
> fourier-nz = 72
> pme-order = 6
> ewald-rtol = 1e-06
> ewald-rtol-lj = 0.001
> lj-pme-comb-rule = Geometric
> ewald-geometry = 0
> epsilon-surface = 0
> implicit-solvent = No
> gb-algorithm = Still
> nstgbradii = 1
> rgbradii = 1
> gb-epsilon-solvent = 80
> gb-saltconc = 0
> gb-obc-alpha = 1
> gb-obc-beta = 0.8
> gb-obc-gamma = 4.85
> gb-dielectric-offset = 0.009
> sa-algorithm = Ace-approximation
> sa-surface-tension = 2.05016
> tcoupl = No
> nsttcouple = -1
> nh-chain-length = 0
> print-nose-hoover-chain-variables = FALSE
> pcoupl = Parrinello-Rahman
> pcoupltype = Isotropic
> nstpcouple = 20
> tau-p = 1
>
>
> Using 1 MPI thread
> Using 32 OpenMP threads
>
> 1 compatible GPU is present, with ID 0
> 1 GPU auto-selected for this run.
> Mapping of GPU ID to the 1 PP rank in this node: 0
>
> Will do PME sum in reciprocal space for electrostatic interactions.
>
> ++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
> U. Essmann, L. Perera, M. L. Berkowitz, T. Darden, H. Lee and L. G.
> Pedersen
> A smooth particle mesh Ewald method
> J. Chem. Phys. 103 (1995) pp. 8577-8592
> -------- -------- --- Thank You --- -------- --------
>
> Will do ordinary reciprocal space Ewald sum.
> Using a Gaussian width (1/beta) of 0.34693 nm for Ewald
> Cut-off's: NS: 1.2 Coulomb: 1.2 LJ: 1.2
> Long Range LJ corr.: <C6> 3.2003e-04
> System total charge, top. A: -0.000 top. B: -0.000
> Generated table with 1100 data points for Ewald.
> Tabscale = 500 points/nm
> Generated table with 1100 data points for LJ6Switch.
> Tabscale = 500 points/nm
> Generated table with 1100 data points for LJ12Switch.
> Tabscale = 500 points/nm
> Generated table with 1100 data points for 1-4 COUL.
> Tabscale = 500 points/nm
> Generated table with 1100 data points for 1-4 LJ6.
> Tabscale = 500 points/nm
> Generated table with 1100 data points for 1-4 LJ12.
> Tabscale = 500 points/nm
> Potential shift: LJ r^-12: 0.000e+00 r^-6: 0.000e+00, Ewald -1.000e-06
> Initialized non-bonded Ewald correction tables, spacing: 9.71e-04 size: 1237
>
>
> Using GPU 8x8 non-bonded kernels
>
>
> NOTE: With GPUs, reporting energy group contributions is not supported
>
> There are 39 atoms and 39 charges for free energy perturbation
> Pinning threads with an auto-selected logical core stride of 1
>
> Initializing LINear Constraint Solver
>
> -------- -------- --- Thank You --- -------- --------
>
> There are: 59559 Atoms
> Initial temperature: 301.342 K
>
> Started mdrun on rank 0 Sun Jul 9 04:02:47 2017
> Step Time Lambda
> 0 0.00000 0.35000
>
>
>
> .....
> .....
> .....
> .....
> .....
>
>
>
>
> M E G A - F L O P S A C C O U N T I N G
>
> NB=Group-cutoff nonbonded kernels NxN=N-by-N cluster Verlet kernels
> RF=Reaction-Field VdW=Van der Waals QSTab=quadratic-spline table
> W3=SPC/TIP3p W4=TIP4p (single or pairs)
> V&F=Potential and force V=Potential only F=Force only
>
> Computing: M-Number M-Flops % Flops
> -----------------------------------------------------------------------------
> NB Free energy kernel 7881861.469518 7881861.470 0.1
> Pair Search distance check 211801.978992 1906217.811 0.0
> NxN Ewald Elec. + LJ [F] 61644114.490880 5732902647.652 91.3
> NxN Ewald Elec. + LJ [V&F] 622729.312576 79086622.697 1.3
> 1,4 nonbonded interactions 15157.138733 1364142.486 0.0
> Calc Weights 178677.178677 6432378.432 0.1
> Spread Q Bspline 25729513.729488 51459027.459 0.8
> Gather F Bspline 25729513.729488 154377082.377 2.5
> 3D-FFT 27628393.815424 221027150.523 3.5
> Solve PME 10366.046848 663426.998 0.0
> Shift-X 1489.034559 8934.207 0.0
> Angles 10513.850597 1766326.900 0.0
> Propers 18191.018191 4165743.166 0.1
> Impropers 1133.001133 235664.236 0.0
> Virial 2980.259604 53644.673 0.0
> Update 59559.059559 1846330.846 0.0
> Stop-CM 595.649559 5956.496 0.0
> Calc-Ekin 5956.019118 160812.516 0.0
> Lincs 11610.011610 696600.697 0.0
> Lincs-Mat 588728.588728 2354914.355 0.0
> Constraint-V 130824.130824 1046593.047 0.0
> Constraint-Vir 2980.409607 71529.831 0.0
> Settle 35868.035868 11585375.585 0.2
> -----------------------------------------------------------------------------
> Total 6281098984.459 100.0
> -----------------------------------------------------------------------------
>
>
> R E A L C Y C L E A N D T I M E A C C O U N T I N G
>
> On 1 MPI rank, each using 32 OpenMP threads
>
> Computing: Num Num Call Wall time Giga-Cycles
> Ranks Threads Count (s) total sum %
> -----------------------------------------------------------------------------
> Neighbor search 1 32 25001 170.606 13073.577 1.5
> Launch GPU ops. 1 32 1000001 97.251 7452.377 0.8
> Force 1 32 1000001 2462.595 188709.029 21.0
> PME mesh 1 32 1000001 7214.132 552819.972 61.5
> Wait GPU local 1 32 1000001 22.963 1759.683 0.2
> NB X/F buffer ops. 1 32 1975001 303.888 23287.017 2.6
> Write traj. 1 32 2190 41.970 3216.155 0.4
> Update 1 32 2000002 374.895 28728.243 3.2
> Constraints 1 32 2000002 718.184 55034.545 6.1
> Rest 315.793 24199.295 2.7
> -----------------------------------------------------------------------------
> Total 11722.279 898279.893 100.0
> -----------------------------------------------------------------------------
> Breakdown of PME mesh computation
> -----------------------------------------------------------------------------
> PME spread/gather 1 32 4000004 5659.890 433718.207 48.3
> PME 3D-FFT 1 32 4000004 1447.568 110927.319 12.3
> PME solve Elec 1 32 2000002 85.838 6577.816 0.7
> -----------------------------------------------------------------------------
>
> GPU timings
> -----------------------------------------------------------------------------
> Computing: Count Wall t (s) ms/step %
> -----------------------------------------------------------------------------
> Pair list H2D 25001 14.012 0.560 0.6
> X / q H2D 1000001 171.474 0.171 7.7
> Nonbonded F kernel 970000 1852.997 1.910 82.8
> Nonbonded F+ene k. 5000 13.053 2.611 0.6
> Nonbonded F+prune k. 20000 47.018 2.351 2.1
> Nonbonded F+ene+prune k. 5001 15.825 3.164 0.7
> F D2H 1000001 124.521 0.125 5.6
> -----------------------------------------------------------------------------
> Total 2238.898 2.239 100.0
> -----------------------------------------------------------------------------
>
> Force evaluation time GPU/CPU: 2.239 ms/9.677 ms = 0.231
> For optimal performance this ratio should be close to 1!
>
>
> NOTE: The GPU has >25% less load than the CPU. This imbalance causes
> performance loss.
>
> Core t (s) Wall t (s) (%)
> Time: 374361.605 11722.279 3193.6
> 3h15:22
> (ns/day) (hour/ns)
> Performance: 14.741 1.628
> Finished mdrun on rank 0 Sun Jul 9 07:18:10 2017