[gmx-users] GMX GPU Rest Time
Mark Abraham
mark.j.abraham at gmail.com
Thu Jun 8 10:33:47 CEST 2017
Hi,
On Thu, Jun 8, 2017 at 8:55 AM Daniel Kozuch <dkozuch at princeton.edu> wrote:
> Hello,
>
> I recently changed the number of cpus I was pairing with each gpu and I
> noticed a significant slowdown, more than I would have expected simply due
> to a reduction in the number of cpus.
>
> From the log file it appears that the GPU is resting for a large amount of
> time. Is there something I can do about this?
>
That's not just the GPUs resting; clearly many replicas are stalled waiting
at the exchange-attempt synchronization.
> I have attached parts of the log file. For reference, this is a REMD
> simulation with 60 replicas on 360 CPUs and 60 GPUs.
The run reports 15 nodes with 28 cores per node but only 24 compatible GPUs
in total (and some nodes with none), so mapping your simulation to your
hardware is the first thing to focus on. With 60 replicas on 24 GPUs, at
least some replicas are sharing GPUs and so run at different speeds, and
because all replicas can only progress at the rate of the slowest one, the
replicas that finish each step first sit waiting for the others.
If e.g. this cluster has 4 GPUs and 28 cores per node, then you want to
place 4 replicas per node, e.g. 1 MPI rank per replica with 7 OpenMP
threads per rank. See https://arxiv.org/abs/1507.00898 for further clues.
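A rough sketch of such a launch, assuming the job is restricted to nodes that
each expose 4 of these GPUs and 28 cores, and assuming your MPI launcher or
batch system is told to place 4 ranks per node (e.g. Open MPI's
--map-by ppr:4:node, or your scheduler's tasks-per-node setting; those details
are cluster-specific), would be

  mpirun -np 60 gmx_514_gpu mdrun -v -deffnm 1msi_eq \
      -multidir $(seq 1 60) \
      -ntomp 7 -gpu_id 0123 -pin on

Here -ntomp 7 gives each of the 4 ranks on a node 7 OpenMP threads (4 x 7 =
28 cores), -gpu_id 0123 maps those 4 PP ranks onto the node's GPUs 0-3, and
$(seq 1 60) is just shorthand for listing the 60 replica directories as in
your original command line.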
GROMACS 2016 does slightly improve the implementation of the coupling
between simulations in replica exchange, but you need to address all of the
above issues first.
Mark
> I have set the environment
> variable OMP_NUM_THREADS to six in order to assign 6 CPUs to each replica
> and avoid domain decomposition for my small system (as recommended in an
> earlier correspondence).
>
> Any help is appreciated,
> Dan
>
>
> -----------------------------------------------------------------------------------------------------------------------------
>
> GROMACS: gmx mdrun, VERSION 5.1.4
> Executable: /home/dkozuch/programs/gromacs_514_gpu/bin/gmx_514_gpu
> Data prefix: /home/dkozuch/programs/gromacs_514_gpu
> Command line:
> gmx_514_gpu mdrun -v -deffnm 1msi_eq -multidir 1 2 3 4 5 6 7 8 9 10 11 12
> 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37
> 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 -pin
> on
>
> GROMACS version: VERSION 5.1.4
> Precision: single
> Memory model: 64 bit
> MPI library: MPI
> OpenMP support: enabled (GMX_OPENMP_MAX_THREADS = 32)
> GPU support: enabled
> OpenCL support: disabled
> invsqrt routine: gmx_software_invsqrt(x)
> SIMD instructions: AVX2_256
> FFT library: fftw-3.3.4-sse2-avx
> RDTSCP usage: enabled
> C++11 compilation: disabled
> TNG support: enabled
> Tracing support: disabled
> Built on: Mon May 22 18:29:21 EDT 2017
> Built by: dkozuch at tigergpu.princeton.edu [CMAKE]
> Build OS/arch: Linux 3.10.0-514.16.1.el7.x86_64 x86_64
> Build CPU vendor: GenuineIntel
> Build CPU brand: Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz
> Build CPU family: 6 Model: 79 Stepping: 1
> Build CPU features: aes apic avx avx2 clfsh cmov cx8 cx16 f16c fma htt
> lahf_lm mmx msr nonstop_tsc pcid pclmuldq pdcm pdpe1gb popcnt pse rdrnd
> rdtscp sse2 sse3 sse4.1 sse4.2 ssse3 tdt x2apic
> C compiler: /usr/bin/cc GNU 4.8.5
> C compiler flags: -march=core-avx2 -Wextra
> -Wno-missing-field-initializers -Wno-sign-compare -Wpointer-arith -Wall
> -Wno-unused -Wunused-value -Wunused-parameter -O3 -DNDEBUG
> -funroll-all-loops -fexcess-precision=fast -Wno-array-bounds
> C++ compiler: /usr/bin/c++ GNU 4.8.5
> C++ compiler flags: -march=core-avx2 -Wextra
> -Wno-missing-field-initializers -Wpointer-arith -Wall -Wno-unused-function
> -O3 -DNDEBUG -funroll-all-loops -fexcess-precision=fast -Wno-array-bounds
> Boost version: 1.53.0 (external)
> CUDA compiler: /usr/local/cuda-8.0/bin/nvcc nvcc: NVIDIA (R) Cuda
> compiler driver;Copyright (c) 2005-2016 NVIDIA Corporation;Built on
> Sun_Sep__4_22:14:01_CDT_2016;Cuda compilation tools, release 8.0, V8.0.44
> CUDA compiler flags:
> -gencode;arch=compute_20,code=sm_20;-gencode;arch=compute_30,code=sm_30;-gencode;arch=compute_35,code=sm_35;-gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_52,code=sm_52;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_60,code=compute_60;-gencode;arch=compute_61,code=compute_61;-use_fast_math;;
> ;-march=core-avx2;-Wextra;-Wno-missing-field-initializers;-Wpointer-arith;-Wall;-Wno-unused-function;-O3;-DNDEBUG;-funroll-all-loops;-fexcess-precision=fast;-Wno-array-bounds;
> CUDA driver: 8.0
> CUDA runtime: 8.0
>
>
> Number of logical cores detected (28) does not match the number reported by
> OpenMP (6).
> Consider setting the launch configuration manually!
>
> Running on 15 nodes with total 420 cores, 420 logical cores, 24 compatible
> GPUs
> Cores per node: 28
> Logical cores per node: 28
> Compatible GPUs per node: 0 - 4
> Different nodes have different type(s) and/or order of GPUs
> Hardware detected on host tiger-i20g2 (the node of MPI rank 4):
> CPU info:
> Vendor: GenuineIntel
> Brand: Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz
> Family: 6 model: 79 stepping: 1
> CPU features: aes apic avx avx2 clfsh cmov cx8 cx16 f16c fma htt
> lahf_lm mmx msr nonstop_tsc pcid pclmuldq pdcm pdpe1gb popcnt pse rdrnd
> rdtscp sse2 sse3 sse4.1 sse4.2 ssse3 tdt x2apic
> SIMD instructions most likely to fit this hardware: AVX2_256
> SIMD instructions selected at GROMACS compile time: AVX2_256
> GPU info:
> Number of GPUs detected: 4
> #0: NVIDIA Tesla P100-PCIE-16GB, compute cap.: 6.0, ECC: yes, stat:
> compatible
> #1: NVIDIA Tesla P100-PCIE-16GB, compute cap.: 6.0, ECC: yes, stat:
> compatible
> #2: NVIDIA Tesla P100-PCIE-16GB, compute cap.: 6.0, ECC: yes, stat:
> compatible
> #3: NVIDIA Tesla P100-PCIE-16GB, compute cap.: 6.0, ECC: yes, stat:
> compatible
>
> This is simulation 4 out of 60 running as a composite GROMACS
> multi-simulation job. Setup for this simulation:
>
> Using 1 MPI process
> Using 6 OpenMP threads
>
> 4 compatible GPUs are present, with IDs 0,1,2,3
> 4 GPUs auto-selected for this run.
> Mapping of GPU IDs to the 4 PP ranks in this node: 0,1,2,3
>
> Will do PME sum in reciprocal space for electrostatic interactions.
>
> ++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
> U. Essmann, L. Perera, M. L. Berkowitz, T. Darden, H. Lee and L. G.
> Pedersen
> A smooth particle mesh Ewald method
> J. Chem. Phys. 103 (1995) pp. 8577-8592
> -------- -------- --- Thank You --- -------- --------
>
> Will do ordinary reciprocal space Ewald sum.
> Using a Gaussian width (1/beta) of 0.288146 nm for Ewald
> Cut-off's: NS: 1.003 Coulomb: 0.9 LJ: 0.9
> System total charge: -0.000
> Generated table with 1001 data points for Ewald.
> Tabscale = 500 points/nm
> Generated table with 1001 data points for LJ6.
> Tabscale = 500 points/nm
> Generated table with 1001 data points for LJ12.
> Tabscale = 500 points/nm
> Generated table with 1001 data points for 1-4 COUL.
> Tabscale = 500 points/nm
> Generated table with 1001 data points for 1-4 LJ6.
> Tabscale = 500 points/nm
> Generated table with 1001 data points for 1-4 LJ12.
> Tabscale = 500 points/nm
> Potential shift: LJ r^-12: -3.541e+00 r^-6: -1.882e+00, Ewald -1.000e-05
> Initialized non-bonded Ewald correction tables, spacing: 8.85e-04 size:
> 1018
>
>
> NOTE: GROMACS was configured without NVML support hence it can not exploit
> application clocks of the detected Tesla P100-PCIE-16GB GPU to
> improve performance.
> Recompile with the NVML library (compatible with the driver used) or
> set application clocks manually.
>
>
> Using GPU 8x8 non-bonded kernels
>
> Removing pbc first time
>
> Overriding thread affinity set outside gmx_514_gpu
>
> Pinning threads with an auto-selected logical core stride of 1
>
> Initializing LINear Constraint Solver
>
> Center of mass motion removal mode is Linear
> We have the following groups for center of mass motion removal:
> 0: rest
> There are: 7898 Atoms
> There are: 2300 VSites
>
> M E G A - F L O P S A C C O U N T I N G
>
> NB=Group-cutoff nonbonded kernels NxN=N-by-N cluster Verlet kernels
> RF=Reaction-Field VdW=Van der Waals QSTab=quadratic-spline table
> W3=SPC/TIP3p W4=TIP4p (single or pairs)
> V&F=Potential and force V=Potential only F=Force only
>
> Computing: M-Number M-Flops % Flops
>
> -----------------------------------------------------------------------------
> Pair Search distance check 3465.668704 31191.018 0.1
> NxN Ewald Elec. + LJ [F] 415473.225344 27421232.873 94.0
> NxN Ewald Elec. + LJ [V&F] 4204.264896 449856.344 1.5
> 1,4 nonbonded interactions 131.602632 11844.237 0.0
> Calc Weights 1529.730594 55070.301 0.2
> Spread Q Bspline 32634.252672 65268.505 0.2
> Gather F Bspline 32634.252672 195805.516 0.7
> 3D-FFT 102116.765540 816934.124 2.8
> Solve PME 79.966400 5117.850 0.0
> Shift-X 25.505198 153.031 0.0
> Angles 92.051841 15464.709 0.1
> Propers 143.502870 32862.157 0.1
> Impropers 11.050221 2298.446 0.0
> Virial 51.225243 922.054 0.0
> Stop-CM 5.119396 51.194 0.0
> P-Coupling 50.990000 305.940 0.0
> Calc-Ekin 102.000396 2754.011 0.0
> Lincs 50.203012 3012.181 0.0
> Lincs-Mat 1104.666276 4418.665 0.0
> Constraint-V 445.417816 3563.343 0.0
> Constraint-Vir 39.527904 948.670 0.0
> Settle 115.006900 37147.229 0.1
> Virtual Site 3 126.504600 4680.670 0.0
>
> -----------------------------------------------------------------------------
> Total 29160903.068 100.0
>
> -----------------------------------------------------------------------------
>
>
> R E A L C Y C L E A N D T I M E A C C O U N T I N G
>
> On 1 MPI rank, each using 6 OpenMP threads
>
> Computing: Num Num Call Wall time Giga-Cycles
> Ranks Threads Count (s) total sum %
>
> -----------------------------------------------------------------------------
> Vsite constr. 1 6 50001 0.960 13.824 0.7
> Neighbor search 1 6 2501 2.908 41.869 2.1
> Launch GPU ops. 1 6 50001 2.081 29.973 1.5
> Force 1 6 50001 4.203 60.525 3.0
> PME mesh 1 6 50001 19.931 287.004 14.2
> Wait GPU local 1 6 50001 0.722 10.398 0.5
> NB X/F buffer ops. 1 6 97501 0.830 11.957 0.6
> Vsite spread 1 6 55002 1.189 17.116 0.8
> Write traj. 1 6 6 0.040 0.580 0.0
> Update 1 6 50001 3.417 49.202 2.4
> Constraints 1 6 50001 4.544 65.428 3.2
> *Rest 99.392 1431.228 70.9*
>
> -----------------------------------------------------------------------------
> Total 140.217 2019.104 100.0
>
> -----------------------------------------------------------------------------
> Breakdown of PME mesh computation
>
> -----------------------------------------------------------------------------
> PME spread/gather 1 6 100002 11.795 169.845 8.4
> PME 3D-FFT 1 6 100002 6.712 96.651 4.8
> PME solve Elec 1 6 50001 1.320 19.010 0.9
>
> -----------------------------------------------------------------------------
>
> GPU timings
>
> -----------------------------------------------------------------------------
> Computing: Count Wall t (s) ms/step %
>
> -----------------------------------------------------------------------------
> Pair list H2D 2501 0.174 0.069 0.7
> X / q H2D 50001 1.052 0.021 4.3
> Nonbonded F kernel 47500 21.005 0.442 86.5
> Nonbonded F+prune k. 2000 0.920 0.460 3.8
> Nonbonded F+ene+prune k. 501 0.254 0.507 1.0
> F D2H 50001 0.873 0.017 3.6
>
> -----------------------------------------------------------------------------
> Total 24.277 0.486 100.0
>
> -----------------------------------------------------------------------------
>
> Force evaluation time GPU/CPU: 0.486 ms/0.483 ms = 1.006
> For optimal performance this ratio should be close to 1!
>
> Core t (s) Wall t (s) (%)
> Time: 839.987 140.217 599.1
> (ns/day) (hour/ns)
> Performance: 123.240 0.195
> Finished mdrun on rank 0 Wed Jun 7 10:58:29 2017