[gmx-users] Poor GPU Performance with GROMACS 5.1.4
Mark Abraham
mark.j.abraham at gmail.com
Thu May 25 09:51:03 CEST 2017
Hi,
Good. Remember that the job scheduler is a degree of freedom that matters,
so how you used it, and why, would have been worth mentioning the first time
;-) And don't just set your time step to an arbitrary number unless you know
why the resulting integration is still stable.
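As a rough illustration only (the numbers are examples, not a recommendation
for your particular system), the relevant .mdp settings usually move together:

    dt          = 0.002    ; 2 fs is the usual choice when bonds to hydrogens are constrained
    constraints = h-bonds  ; removes the fastest X-H vibrations; steps much beyond
                           ; 2 fs generally need all-bonds and/or virtual sites
    lincs-order = 4        ; default LINCS accuracy settings
    lincs-iter  = 1

and whatever you pick, check that energy and temperature behave sensibly.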
Mark
On Thu, May 25, 2017 at 4:48 AM Daniel Kozuch <dkozuch at princeton.edu> wrote:
> I apologize for the confusion, but I found my error. I was not requesting
> a specific number of cpus-per-task, so the scheduler was having trouble
> assigning the threads. Speed is now ~400 ns/day with a 3 fs time step,
> which seems reasonable.
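>
> For reference, a minimal sketch of the kind of Slurm script that requests
> this (the gmx_gpu wrapper and my_tpr names are from earlier in this thread;
> the exact partition/GPU request syntax varies by cluster):
>
> #!/bin/bash
> #SBATCH --nodes=1
> #SBATCH --ntasks=1
> #SBATCH --cpus-per-task=6     # one core per OpenMP thread
> #SBATCH --gres=gpu:1
>
> gmx_gpu mdrun -deffnm my_tpr -ntomp 6 -pin on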
>
> Thanks for all the help,
> Dan
>
> On Wed, May 24, 2017 at 9:48 PM, Daniel Kozuch <dkozuch at princeton.edu>
> wrote:
>
> > Szilárd,
> >
> > I think I must be misunderstanding your advice. If I remove the domain
> > decomposition and turn pinning on as Mark suggested, using
> >
> > gmx_gpu mdrun -deffnm my_tpr -dd 1 -pin on
> >
> > then I get very poor performance and the following note:
> >
> > NOTE: Affinity setting for 6/6 threads failed. This can cause performance
> > degradation.
> > If you think your settings are correct, ask on the gmx-users list.
> >
> > I am running only one rank with 6 threads (I do not want to use all 28
> > cores available on the node, because I hope to run several of these jobs
> > per node in the near future).
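> >
> > When I do get to several jobs per node, my rough plan (just a sketch built
> > from the mdrun options, not something I have tested; the run names are
> > placeholders) is to give each run its own block of cores and share the one
> > GPU:
> >
> > gmx_gpu mdrun -deffnm run1 -ntomp 6 -pin on -pinoffset 0 -pinstride 1 -gpu_id 0 &
> > gmx_gpu mdrun -deffnm run2 -ntomp 6 -pin on -pinoffset 6 -pinstride 1 -gpu_id 0 &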
> >
> > Thanks for the help,
> > Dan
> >
> >
> > -----------------------------------------------------------------------------
> > Log File:
> >
> > GROMACS version: VERSION 5.1.4
> > Precision: single
> > Memory model: 64 bit
> > MPI library: MPI
> > OpenMP support: enabled (GMX_OPENMP_MAX_THREADS = 32)
> > GPU support: enabled
> > OpenCL support: disabled
> > invsqrt routine: gmx_software_invsqrt(x)
> > SIMD instructions: AVX2_256
> > FFT library: fftw-3.3.4-sse2-avx
> > RDTSCP usage: enabled
> > C++11 compilation: disabled
> > TNG support: enabled
> > Tracing support: disabled
> > Built on: Mon May 22 18:29:21 EDT 2017
> > Built by: dkozuch at tigergpu.princeton.edu [CMAKE]
> > Build OS/arch: Linux 3.10.0-514.16.1.el7.x86_64 x86_64
> > Build CPU vendor: GenuineIntel
> > Build CPU brand: Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz
> > Build CPU family: 6 Model: 79 Stepping: 1
> > Build CPU features: aes apic avx avx2 clfsh cmov cx8 cx16 f16c fma htt
> > lahf_lm mmx msr nonstop_tsc pcid pclmuldq pdcm pdpe1gb popcnt pse rdrnd
> > rdtscp sse2 sse3 sse4.1 sse4.2 ssse3 tdt x2apic
> > C compiler: /usr/bin/cc GNU 4.8.5
> > C compiler flags: -march=core-avx2 -Wextra
> > -Wno-missing-field-initializers -Wno-sign-compare -Wpointer-arith -Wall
> > -Wno-unused -Wunused-value -Wunused-parameter -O3 -DNDEBUG
> > -funroll-all-loops -fexcess-precision=fast -Wno-array-bounds
> > C++ compiler: /usr/bin/c++ GNU 4.8.5
> > C++ compiler flags: -march=core-avx2 -Wextra
> > -Wno-missing-field-initializers -Wpointer-arith -Wall
> > -Wno-unused-function -O3 -DNDEBUG -funroll-all-loops
> > -fexcess-precision=fast -Wno-array-bounds
> > Boost version: 1.53.0 (external)
> > CUDA compiler: /usr/local/cuda-8.0/bin/nvcc nvcc: NVIDIA (R) Cuda compiler
> >   driver; Copyright (c) 2005-2016 NVIDIA Corporation; Built on
> >   Sun_Sep__4_22:14:01_CDT_2016; Cuda compilation tools, release 8.0, V8.0.44
> > CUDA compiler flags: -gencode;arch=compute_20,code=sm_20;
> >   -gencode;arch=compute_30,code=sm_30; -gencode;arch=compute_35,code=sm_35;
> >   -gencode;arch=compute_37,code=sm_37; -gencode;arch=compute_50,code=sm_50;
> >   -gencode;arch=compute_52,code=sm_52; -gencode;arch=compute_60,code=sm_60;
> >   -gencode;arch=compute_61,code=sm_61; -gencode;arch=compute_60,code=compute_60;
> >   -gencode;arch=compute_61,code=compute_61; -use_fast_math;
> >   -march=core-avx2;-Wextra;-Wno-missing-field-initializers;-Wpointer-arith;
> >   -Wall;-Wno-unused-function;-O3;-DNDEBUG;-funroll-all-loops;
> >   -fexcess-precision=fast;-Wno-array-bounds
> > CUDA driver: 8.0
> > CUDA runtime: 8.0
> >
> >
> > Number of logical cores detected (28) does not match the number reported
> > by OpenMP (1).
> > Consider setting the launch configuration manually!
> >
> > Running on 1 node with total 28 logical cores, 1 compatible GPU
> > Hardware detected on host tiger-i23g14 (the node of MPI rank 0):
> > CPU info:
> > Vendor: GenuineIntel
> > Brand: Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz
> > Family: 6 model: 79 stepping: 1
> > CPU features: aes apic avx avx2 clfsh cmov cx8 cx16 f16c fma htt
> > lahf_lm mmx msr nonstop_tsc pcid pclmuldq pdcm pdpe1gb popcnt pse rdrnd
> > rdtscp sse2 sse3 sse4.1 sse4.2 ssse3 tdt x2apic
> > SIMD instructions most likely to fit this hardware: AVX2_256
> > SIMD instructions selected at GROMACS compile time: AVX2_256
> > GPU info:
> > Number of GPUs detected: 1
> > #0: NVIDIA Tesla P100-PCIE-16GB, compute cap.: 6.0, ECC: yes, stat:
> > compatible
> >
> > Using 1 MPI process
> > Using 6 OpenMP threads
> >
> > 1 compatible GPU is present, with ID 0
> > 1 GPU auto-selected for this run.
> > Mapping of GPU ID to the 1 PP rank in this node: 0
> >
> > NOTE: GROMACS was configured without NVML support hence it can not exploit
> >       application clocks of the detected Tesla P100-PCIE-16GB GPU to improve
> >       performance.
> >       Recompile with the NVML library (compatible with the driver used) or
> >       set application clocks manually.
> >
> >
> > Using GPU 8x8 non-bonded kernels
> >
> > Removing pbc first time
> >
> > Overriding thread affinity set outside gmx_514_gpu
> >
> > Pinning threads with an auto-selected logical core stride of 4
> >
> > NOTE: Affinity setting for 6/6 threads failed. This can cause performance
> > degradation.
> > If you think your settings are correct, ask on the gmx-users list.
> >
> > Initializing LINear Constraint Solver
> >
> > R E A L   C Y C L E   A N D   T I M E   A C C O U N T I N G
> >
> > On 1 MPI rank, each using 6 OpenMP threads
> >
> >  Computing:          Num   Num      Call    Wall time     Giga-Cycles
> >                      Ranks Threads  Count      (s)        total sum     %
> > -----------------------------------------------------------------------------
> >  Neighbor search        1    6        201       1.402         20.185    9.4
> >  Launch GPU ops.        1    6       5001       0.216          3.116    1.5
> >  Force                  1    6       5001       1.070         15.402    7.2
> >  PME mesh               1    6       5001       5.538         79.745   37.1
> >  Wait GPU local         1    6       5001       0.072          1.043    0.5
> >  NB X/F buffer ops.     1    6       9801       0.396          5.706    2.7
> >  Write traj.            1    6          2       0.022          0.310    0.1
> >  Update                 1    6       5001       1.683         24.232   11.3
> >  Constraints            1    6       5001       2.488         35.833   16.7
> >  Rest                                            2.031         29.247   13.6
> > -----------------------------------------------------------------------------
> >  Total                                          14.918        214.819  100.0
> > -----------------------------------------------------------------------------
> >  Breakdown of PME mesh computation
> > -----------------------------------------------------------------------------
> >  PME spread/gather      1    6      10002       4.782         68.865   32.1
> >  PME 3D-FFT             1    6      10002       0.654          9.411    4.4
> >  PME solve Elec         1    6       5001       0.024          0.352    0.2
> > -----------------------------------------------------------------------------
> >
> > GPU timings
> > -----------------------------------------------------------------------------
> >  Computing:                         Count  Wall t (s)      ms/step       %
> > -----------------------------------------------------------------------------
> >  Pair list H2D                        201       0.020        0.099      0.3
> >  X / q H2D                           5001       0.090        0.018      1.5
> >  Nonbonded F kernel                  4800       5.617        1.170     92.8
> >  Nonbonded F+prune k.                 150       0.186        1.240      3.1
> >  Nonbonded F+ene+prune k.              51       0.064        1.257      1.1
> >  F D2H                               5001       0.075        0.015      1.2
> > -----------------------------------------------------------------------------
> >  Total                                          6.052        1.210    100.0
> > -----------------------------------------------------------------------------
> >
> > Force evaluation time GPU/CPU: 1.210 ms/1.321 ms = 0.916
> > For optimal performance this ratio should be close to 1!
> >
> >                Core t (s)   Wall t (s)        (%)
> >        Time:       23.471       14.918      157.3
> >                  (ns/day)    (hour/ns)
> > Performance:       86.893        0.276
> > Finished mdrun on rank 0 Wed May 24 21:36:47 2017
> >