[gmx-users] 1 gpu vs 2 gpu speedup
Harry Mark Greenblatt
harry.greenblatt at weizmann.ac.il
Mon Jul 7 13:01:38 CEST 2014
BS"D
Dear All,
I was given access to a test machine with
2 x Intel Xeon E5-2630 2.3 GHz 6-core processors
2 x Tesla K20X GPUs
GROMACS 5.0, compiled with gcc 4.4.7 and Intel MPI support.
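For completeness, the build was configured along these lines (a rough sketch rather than the exact configure command; the source/install paths and the own-FFTW option are assumptions):

# Sketch of a GROMACS 5.0 MPI + CUDA build; paths and the FFTW option are
# illustrative, only the MPI and GPU switches matter for this discussion.
cd gromacs-5.0/build
CC=mpicc CXX=mpicxx cmake .. -DGMX_MPI=ON -DGMX_GPU=ON \
    -DGMX_BUILD_OWN_FFTW=ON -DCMAKE_INSTALL_PREFIX=$HOME/gromacs-5.0
make -j 12 && make install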
Ran a 1 ns simulation of a three-domain (single-chain) DNA-binding protein, dsDNA, water, and ions (~32,600 atoms). The DNA was constrained.
Using a VDW cutoff of 1.3 nm gave a close balance between GPU and CPU load with 6 cores and 1 GPU (GPU/CPU force evaluation time ratio of 1.061).
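The relevant non-bonded settings were along these lines (a sketch: only the 1.3 nm VDW cutoff is from this run; the cutoff scheme, electrostatics type, and matching Coulomb cutoff are assumptions, inferred from the PME timings in the log below):

# Hypothetical .mdp fragment for the cutoffs discussed above;
# only rvdw = 1.3 is taken from this run, the rest is assumed.
cat >> md.mdp <<'EOF'
cutoff-scheme = Verlet   ; needed for GPU non-bonded offload in 5.0
coulombtype   = PME
rcoulomb      = 1.3      ; nm; mdrun may scale this up when balancing PP/PME load
rvdw          = 1.3      ; nm
EOF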
Results:
Setup                  Wall time (s)   ns/day   Speedup (vs. 1st row)
6 cores,  no GPU            12,996       6.65       1.00
12 cores, no GPU             7,037      12.3        1.85
6 cores,  1 GPU              1,853      46.6        7.01
2 x 6 cores, 2 GPUs          1,342      64.4        9.68
I was a bit disappointed by the 2-GPU case (less than a 1.4x speedup relative to 1 GPU). Unlike the other runs, I submitted this one with mpirun -np 2 and no other mdrun arguments, apart from adding -pin on, which made no difference.
The job certainly ran as 2 MPI ranks, using all 12 cores and both GPUs (~75% utilization each). I'm wondering whether I failed to specify the proper mdrun command-line arguments (based on the log file below, where it notes there are too few total ranks for separate PME ranks), or whether my system is simply not amenable to efficient GPU acceleration beyond one GPU.
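In case it helps, the launch lines were essentially the following (the binary name, file handling, and the single-GPU line are reconstructions rather than verbatim commands):

# 2-GPU run as described: 2 MPI ranks, no other mdrun options;
# "mdrun_mpi" stands for whatever the MPI-enabled binary is called locally.
mpirun -np 2 mdrun_mpi -pin on

# Roughly what the 6-core / 1-GPU runs correspond to (a reconstruction)
mpirun -np 1 mdrun_mpi -ntomp 6 -gpu_id 0

# An alternative launch configuration: 4 PP ranks sharing the two GPUs,
# with an explicit rank-to-GPU map (illustrative only)
mpirun -np 4 mdrun_mpi -ntomp 3 -gpu_id 0011 -pin on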
Comments? Below are some excerpts from the end of the log file:
...
Number of hardware threads detected (12) does not match the number reported by OpenMP (6).
Consider setting the launch configuration manually!
...
Initializing Domain Decomposition on 2 ranks
Dynamic load balancing: auto
Will sort the charge groups at every domain (re)decomposition
Initial maximum inter charge-group distances:
two-body bonded interactions: 0.410 nm, LJ-14, atoms 1018 1025
multi-body bonded interactions: 0.410 nm, Proper Dih., atoms 1018 1025
Minimum cell size due to bonded interactions: 0.452 nm
Maximum distance for 5 constraints, at 120 deg. angles, all-trans: 0.819 nm
Estimated maximum distance required for P-LINCS: 0.819 nm
This distance will limit the DD cell size, you can override this with -rcon
Using 0 separate PME ranks, as there are too few total
ranks for efficient splitting
Scaling the initial minimum size with 1/0.8 (option -dds) = 1.25
Optimizing the DD grid for 2 cells with a minimum initial size of 1.024 nm
The maximum allowed number of cells is: X 6 Y 6 Z 5
Domain decomposition grid 2 x 1 x 1, separate PME ranks 0
PME domain decomposition: 2 x 1 x 1
Domain decomposition rank 0, coordinates 0 0 0
Using 2 MPI processes
Using 6 OpenMP threads per MPI process
...
2 GPUs detected on host cff042:
#0: NVIDIA Tesla K20Xm, compute cap.: 3.5, ECC: no, stat: compatible
#1: NVIDIA Tesla K20Xm, compute cap.: 3.5, ECC: no, stat: compatible
2 GPUs auto-selected for this run.
Mapping of GPUs to the 2 PP ranks in this node: #0, #1
...
D O M A I N D E C O M P O S I T I O N S T A T I S T I C S
av. #atoms communicated per step for force: 2 x 16308.8
av. #atoms communicated per step for LINCS: 2 x 1444.2
Average load imbalance: 11.5 %
Part of the total run time spent waiting due to load imbalance: 1.2 %
R E A L C Y C L E A N D T I M E A C C O U N T I N G
On 2 MPI ranks, each using 6 OpenMP threads
 Computing:          Num   Num      Call    Wall time     Giga-Cycles
                     Ranks Threads  Count      (s)       total sum    %
-----------------------------------------------------------------------------
 Domain decomp.         2    6      12500      42.424     1167.938   3.2
 DD comm. load          2    6       2497       0.023        0.621   0.0
 Neighbor search        2    6      12501      35.169      968.215   2.6
 Launch GPU ops.        2    6    1000002      39.832     1096.595   3.0
 Comm. coord.           2    6     487500      50.194     1381.851   3.7
 Force                  2    6     500001     110.595     3044.707   8.2
 Wait + Comm. F         2    6     500001      56.881     1565.951   4.2
 PME mesh               2    6     500001     688.824    18963.576  51.3
 Wait GPU nonlocal      2    6     500001       5.112      140.734   0.4
 Wait GPU local         2    6     500001      66.823     1839.662   5.0
 NB X/F buffer ops.     2    6    1975002      23.432      645.092   1.7
 Write traj.            2    6       1002       3.453       95.073   0.3
 Update                 2    6     500001      24.698      679.933   1.8
 Constraints            2    6     500001     181.295     4991.120  13.5
 Comm. energies         2    6      25001       0.500       13.757   0.0
 Rest                                           12.633      347.800   0.9
-----------------------------------------------------------------------------
 Total                                        1341.887    36942.624 100.0
-----------------------------------------------------------------------------
 Breakdown of PME mesh computation
-----------------------------------------------------------------------------
 PME redist. X/F        2    6    1000002     102.268     2815.465   7.6
 PME spread/gather      2    6    1000002     319.578     8798.101  23.8
 PME 3D-FFT             2    6    1000002     123.860     3409.896   9.2
 PME 3D-FFT Comm.       2    6    1000002     116.919     3218.822   8.7
 PME solve Elec         2    6     500001      24.586      676.870   1.8
-----------------------------------------------------------------------------
Thanks for any suggestions
Harry
-------------------------------------------------------------------------
Harry M. Greenblatt
Associate Staff Scientist
Dept of Structural Biology
Weizmann Institute of Science
234 Herzl St.
Rehovot, 76100
Israel
Phone: 972-8-934-3625
Facsimile: 972-8-934-4159
Harry.Greenblatt at weizmann.ac.il