[gmx-users] problem with gpu performance
jagannath mondal
jm3745 at columbia.edu
Fri Sep 4 15:02:00 CEST 2015
Dear Gromacs Users,

I am trying to run the GPU version of GROMACS 5.0.6 on a workstation with a
hexacore processor that can be hyperthreaded to 12 threads. The workstation
has 2 GeForce GT 610 GPUs. I am finding that a simulation run with -nb gpu is
exceedingly slower than one run with -nb cpu (i.e. with the GPUs turned off).

I installed cuda-7.0 and with it built the GPU version of GROMACS 5.0.6 as
follows:
cmake ../ -DGMX_BUILD_OWN_FFTW=ON \
    -DCMAKE_INSTALL_PREFIX=/home/jmondal/UTIL/GROMACS_5.0.6_gpu/ \
    -DCMAKE_C_COMPILER=gcc -DCMAKE_CXX_COMPILER=g++ -DGMX_GPU=ON \
    -DCUDA_TOOLKIT_ROOT_DIR=/usr/local/cuda/
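The build does appear to have CUDA enabled (md.log detects both GPUs, see
below); for completeness, the compiled-in GPU support can also be checked
from the version header, e.g. (exact field names may differ slightly between
versions):

gmx --version | grep -i -e "GPU support" -e CUDA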
However, the performance with the GPUs is very strange. I run mdrun with the
following command:

1) gmx mdrun -s topol -nb gpu -v &> log_run

and then repeat the same run with GPU usage turned off:

2) gmx mdrun -s topol -nb cpu -v &> log_run

With the GPUs the performance drops about 3-fold! Using both GPUs along with
the CPUs the performance is 1.620 ns/day; using only the CPUs it is 4.6
ns/day. Using the GPUs is frustratingly slowing down the simulation.
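In both runs I let mdrun choose the thread layout automatically. An explicit
equivalent of the GPU run, based on the 2 thread-MPI ranks and 6 OpenMP
threads per rank that md.log reports (the -gpu_id string maps GPU ids to the
PP ranks), should be:

gmx mdrun -s topol -nb gpu -ntmpi 2 -ntomp 6 -gpu_id 01 -v &> log_run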
When using the -nb gpu option, the GROMACS md.log correctly detects the GPUs
and the CPU as follows:
Using 2 MPI threads
Using 6 OpenMP threads per tMPI thread
Detecting CPU SIMD instructions.
Present hardware specification:
Vendor: GenuineIntel
Brand: Intel(R) Core(TM) i7-5930K CPU @ 3.50GHz
Family: 6 Model: 63 Stepping: 2
Features: aes apic avx avx2 clfsh cmov cx8 cx16 f16c fma htt lahf_lm mmx
msr nonstop_tsc pcid pclmuldq pdcm pdpe1gb popcnt pse rdrnd rdtscp sse2
sse3 sse4.1 sse4.2 ssse3 tdt x2apic
SIMD instructions most likely to fit this hardware: AVX2_256
SIMD instructions selected at GROMACS compile time: AVX2_256
2 GPUs detected:
#0: NVIDIA GeForce GT 610, compute cap.: 2.1, ECC: no, stat: compatible
#1: NVIDIA GeForce GT 610, compute cap.: 2.1, ECC: no, stat: compatible
2 GPUs auto-selected for this run.
Mapping of GPUs to the 2 PP ranks in this node: #0, #1
However, when I look at the performance accounting at the end of the
simulation (appended below), the 'Wait GPU nonlocal' step takes an awfully
long time.
I also tried a few other options (such as using only 1 GPU via -gpu_id 0)
and played with the -ntmpi and -ntomp options, but the GPU performance
remains drastically poor (surprisingly, 3 times slower than the CPU-only
simulation).
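For example, one of the single-GPU attempts looked roughly like this (the
exact -ntmpi/-ntomp combinations varied between attempts, so take this as a
representative command line rather than the only one I tried):

gmx mdrun -s topol -nb gpu -gpu_id 0 -ntmpi 1 -ntomp 12 -v &> log_run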
I am struggling to figure out whether this is a hardware issue, a GPU-driver
issue, or whether I am simply not using the optimal options. Your suggestions
would be very helpful in resolving the issue.
Jagannath
     R E A L   C Y C L E   A N D   T I M E   A C C O U N T I N G

On 2 MPI ranks, each using 6 OpenMP threads

 Computing:          Num   Num      Call    Wall time         Giga-Cycles
                     Ranks Threads  Count      (s)         total sum    %
-----------------------------------------------------------------------------
 Domain decomp.         2    6         63       0.270         11.322   0.2
 DD comm. load          2    6         13       0.000          0.002   0.0
 Neighbor search        2    6         63       0.311         13.062   0.2
 Launch GPU ops.        2    6       5002       0.205          8.614   0.2
 Comm. coord.           2    6       2438       0.239         10.016   0.2
 Force                  2    6       2501       1.358         57.011   1.0
 Wait + Comm. F         2    6       2501       0.404         16.954   0.3
 PME mesh               2    6       2501       9.734        408.587   7.3
 Wait GPU nonlocal      2    6       2501     117.798       4944.651  88.3
 Wait GPU local         2    6       2501       0.005          0.206   0.0
 NB X/F buffer ops.     2    6       9878       0.255         10.683   0.2
 Write traj.            2    6          4       0.180          7.558   0.1
 Update                 2    6       2501       0.807         33.886   0.6
 Constraints            2    6       2501       1.216         51.025   0.9
 Comm. energies         2    6        126       0.001          0.055   0.0
 Rest                                            0.609         25.573   0.5
-----------------------------------------------------------------------------
 Total                                         133.392       5599.205 100.0