[gmx-users] problem with gpu performance
jagannath mondal
jm3745 at columbia.edu
Fri Sep 4 15:02:00 CEST 2015
Dear Gromacs Users,

I am trying to run the GPU version of GROMACS 5.0.6 on a workstation with a
hexacore processor that can be hyperthreaded to 12 threads. The workstation
has 2 GeForce GT 610 GPUs. I am finding that a simulation run with -nb gpu is
exceedingly slower than one run with -nb cpu (i.e. with the GPUs turned off).

I installed cuda-7.0 and with it built the GPU version of GROMACS 5.0.6 as
follows:
cmake ../ -DGMX_BUILD_OWN_FFTW=ON \
    -DCMAKE_INSTALL_PREFIX=/home/jmondal/UTIL/GROMACS_5.0.6_gpu/ \
    -DCMAKE_C_COMPILER=gcc -DCMAKE_CXX_COMPILER=g++ -DGMX_GPU=ON \
    -DCUDA_TOOLKIT_ROOT_DIR=/usr/local/cuda/
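The build does appear to have CUDA enabled (md.log detects both GPUs, see
below); for completeness, the compiled-in GPU support can also be checked
from the version header, e.g. (exact field names may differ slightly between
versions):

gmx --version | grep -i -e "GPU support" -e CUDA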
However, the performance with the GPUs is very strange. I run mdrun with the
following command:

1) gmx mdrun -s topol -nb gpu -v &> log_run

and then repeat the same run with GPU usage turned off:

2) gmx mdrun -s topol -nb cpu -v &> log_run

With the GPUs the performance drops about 3-fold! Using both GPUs along with
the CPUs the performance is 1.620 ns/day; using only the CPUs it is 4.6
ns/day. Using the GPUs is frustratingly slowing down the simulation.
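In both runs I let mdrun choose the thread layout automatically. An explicit
equivalent of the GPU run, based on the 2 thread-MPI ranks and 6 OpenMP
threads per rank that md.log reports (the -gpu_id string maps GPU ids to the
PP ranks), should be:

gmx mdrun -s topol -nb gpu -ntmpi 2 -ntomp 6 -gpu_id 01 -v &> log_run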
When using the -nb gpu option, the GROMACS md.log correctly detects the GPUs
and the CPU as follows:
Using 2 MPI threads
Using 6 OpenMP threads per tMPI thread
Detecting CPU SIMD instructions.
Present hardware specification:
Vendor: GenuineIntel
Brand: Intel(R) Core(TM) i7-5930K CPU @ 3.50GHz
Family: 6 Model: 63 Stepping: 2
Features: aes apic avx avx2 clfsh cmov cx8 cx16 f16c fma htt lahf_lm mmx
msr nonstop_tsc pcid pclmuldq pdcm pdpe1gb popcnt pse rdrnd rdtscp sse2
sse3 sse4.1 sse4.2 ssse3 tdt x2apic
SIMD instructions most likely to fit this hardware: AVX2_256
SIMD instructions selected at GROMACS compile time: AVX2_256
2 GPUs detected:
#0: NVIDIA GeForce GT 610, compute cap.: 2.1, ECC: no, stat: compatible
#1: NVIDIA GeForce GT 610, compute cap.: 2.1, ECC: no, stat: compatible
2 GPUs auto-selected for this run.
Mapping of GPUs to the 2 PP ranks in this node: #0, #1
However, when I look at the performance accounting at the end of the
simulation (appended below), the 'Wait GPU nonlocal' step takes an awfully
long time.
I also tried a few other options (such as using only 1 GPU via -gpu_id 0)
and played with the -ntmpi and -ntomp options, but the GPU performance
remains drastically poor (surprisingly, 3 times slower than the CPU-only
simulation).
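For example, one of the single-GPU attempts looked roughly like this (the
exact -ntmpi/-ntomp combinations varied between attempts, so take this as a
representative command line rather than the only one I tried):

gmx mdrun -s topol -nb gpu -gpu_id 0 -ntmpi 1 -ntomp 12 -v &> log_run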
I am struggling to figure out whether this is a hardware issue, a GPU-driver
issue, or whether I am simply not using the optimal options. Your suggestions
would be very helpful in resolving the issue.
Jagannath
     R E A L   C Y C L E   A N D   T I M E   A C C O U N T I N G

On 2 MPI ranks, each using 6 OpenMP threads

 Computing:          Num   Num      Call    Wall time         Giga-Cycles
                     Ranks Threads  Count      (s)         total sum    %
-----------------------------------------------------------------------------
 Domain decomp.         2    6         63       0.270         11.322   0.2
 DD comm. load          2    6         13       0.000          0.002   0.0
 Neighbor search        2    6         63       0.311         13.062   0.2
 Launch GPU ops.        2    6       5002       0.205          8.614   0.2
 Comm. coord.           2    6       2438       0.239         10.016   0.2
 Force                  2    6       2501       1.358         57.011   1.0
 Wait + Comm. F         2    6       2501       0.404         16.954   0.3
 PME mesh               2    6       2501       9.734        408.587   7.3
 Wait GPU nonlocal      2    6       2501     117.798       4944.651  88.3
 Wait GPU local         2    6       2501       0.005          0.206   0.0
 NB X/F buffer ops.     2    6       9878       0.255         10.683   0.2
 Write traj.            2    6          4       0.180          7.558   0.1
 Update                 2    6       2501       0.807         33.886   0.6
 Constraints            2    6       2501       1.216         51.025   0.9
 Comm. energies         2    6        126       0.001          0.055   0.0
 Rest                                            0.609         25.573   0.5
-----------------------------------------------------------------------------
 Total                                         133.392       5599.205 100.0