[gmx-users] 1 gpu vs 2 gpu speedup
Harry Mark Greenblatt
harry.greenblatt at weizmann.ac.il
Mon Jul 7 13:01:38 CEST 2014
BS"D
Dear All,
I was given access to a test machine with
2 x Intel Xeon E5-2630 2.3 GHz 6-core processors
2 x Tesla K20X GPUs
GROMACS 5.0, compiled with gcc 4.4.7 and Intel MPI support.
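For completeness, the build was configured along these lines (a rough sketch rather than the exact configure command; the source/install paths and the own-FFTW option are assumptions):

# Sketch of a GROMACS 5.0 MPI + CUDA build; paths and the FFTW option are
# illustrative, only the MPI and GPU switches matter for this discussion.
cd gromacs-5.0/build
CC=mpicc CXX=mpicxx cmake .. -DGMX_MPI=ON -DGMX_GPU=ON \
    -DGMX_BUILD_OWN_FFTW=ON -DCMAKE_INSTALL_PREFIX=$HOME/gromacs-5.0
make -j 12 && make install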
Ran a 1 ns simulation of a three-domain (single-chain) DNA-binding protein, dsDNA, water, and ions (~32,600 atoms). The DNA was constrained.
Using a VDW cutoff of 1.3 nm gave a close balance between GPU and CPU load with 6 cores and 1 GPU (GPU/CPU force evaluation time ratio of 1.061).
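The relevant non-bonded settings were along these lines (a sketch: only the 1.3 nm VDW cutoff is from this run; the cutoff scheme, electrostatics type, and matching Coulomb cutoff are assumptions, inferred from the PME timings in the log below):

# Hypothetical .mdp fragment for the cutoffs discussed above;
# only rvdw = 1.3 is taken from this run, the rest is assumed.
cat >> md.mdp <<'EOF'
cutoff-scheme = Verlet   ; needed for GPU non-bonded offload in 5.0
coulombtype   = PME
rcoulomb      = 1.3      ; nm; mdrun may scale this up when balancing PP/PME load
rvdw          = 1.3      ; nm
EOF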
Results:
Setup                  Wall time (s)   ns/day   Speedup (vs. 1st row)
6 cores,  no GPU            12,996       6.65       1.00
12 cores, no GPU             7,037      12.3        1.85
6 cores,  1 GPU              1,853      46.6        7.01
2 x 6 cores, 2 GPUs          1,342      64.4        9.68
I was a bit disappointed by the 2-GPU case (less than a 1.4x speedup relative to 1 GPU). Unlike the other runs, I submitted this one with mpirun -np 2 and no other mdrun arguments, apart from adding -pin on, which made no difference.
The job certainly ran as 2 MPI ranks, using all 12 cores and both GPUs (~75% utilization each). I'm wondering whether I failed to specify the proper mdrun command-line arguments (based on the log file below, where it notes there are too few total ranks for separate PME ranks), or whether my system is simply not amenable to efficient GPU acceleration beyond one GPU.
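In case it helps, the launch lines were essentially the following (the binary name, file handling, and the single-GPU line are reconstructions rather than verbatim commands):

# 2-GPU run as described: 2 MPI ranks, no other mdrun options;
# "mdrun_mpi" stands for whatever the MPI-enabled binary is called locally.
mpirun -np 2 mdrun_mpi -pin on

# Roughly what the 6-core / 1-GPU runs correspond to (a reconstruction)
mpirun -np 1 mdrun_mpi -ntomp 6 -gpu_id 0

# An alternative launch configuration: 4 PP ranks sharing the two GPUs,
# with an explicit rank-to-GPU map (illustrative only)
mpirun -np 4 mdrun_mpi -ntomp 3 -gpu_id 0011 -pin on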
Comments? Below are some excerpts from the end of the log file:
...
Number of hardware threads detected (12) does not match the number reported by OpenMP (6).
Consider setting the launch configuration manually!
...
Initializing Domain Decomposition on 2 ranks
Dynamic load balancing: auto
Will sort the charge groups at every domain (re)decomposition
Initial maximum inter charge-group distances:
two-body bonded interactions: 0.410 nm, LJ-14, atoms 1018 1025
multi-body bonded interactions: 0.410 nm, Proper Dih., atoms 1018 1025
Minimum cell size due to bonded interactions: 0.452 nm
Maximum distance for 5 constraints, at 120 deg. angles, all-trans: 0.819 nm
Estimated maximum distance required for P-LINCS: 0.819 nm
This distance will limit the DD cell size, you can override this with -rcon
Using 0 separate PME ranks, as there are too few total
ranks for efficient splitting
Scaling the initial minimum size with 1/0.8 (option -dds) = 1.25
Optimizing the DD grid for 2 cells with a minimum initial size of 1.024 nm
The maximum allowed number of cells is: X 6 Y 6 Z 5
Domain decomposition grid 2 x 1 x 1, separate PME ranks 0
PME domain decomposition: 2 x 1 x 1
Domain decomposition rank 0, coordinates 0 0 0
Using 2 MPI processes
Using 6 OpenMP threads per MPI process
...
2 GPUs detected on host cff042:
#0: NVIDIA Tesla K20Xm, compute cap.: 3.5, ECC: no, stat: compatible
#1: NVIDIA Tesla K20Xm, compute cap.: 3.5, ECC: no, stat: compatible
2 GPUs auto-selected for this run.
Mapping of GPUs to the 2 PP ranks in this node: #0, #1
...
D O M A I N D E C O M P O S I T I O N S T A T I S T I C S
av. #atoms communicated per step for force: 2 x 16308.8
av. #atoms communicated per step for LINCS: 2 x 1444.2
Average load imbalance: 11.5 %
Part of the total run time spent waiting due to load imbalance: 1.2 %
R E A L C Y C L E A N D T I M E A C C O U N T I N G
On 2 MPI ranks, each using 6 OpenMP threads
 Computing:          Num   Num      Call    Wall time     Giga-Cycles
                     Ranks Threads  Count      (s)       total sum    %
-----------------------------------------------------------------------------
 Domain decomp.         2    6      12500      42.424     1167.938   3.2
 DD comm. load          2    6       2497       0.023        0.621   0.0
 Neighbor search        2    6      12501      35.169      968.215   2.6
 Launch GPU ops.        2    6    1000002      39.832     1096.595   3.0
 Comm. coord.           2    6     487500      50.194     1381.851   3.7
 Force                  2    6     500001     110.595     3044.707   8.2
 Wait + Comm. F         2    6     500001      56.881     1565.951   4.2
 PME mesh               2    6     500001     688.824    18963.576  51.3
 Wait GPU nonlocal      2    6     500001       5.112      140.734   0.4
 Wait GPU local         2    6     500001      66.823     1839.662   5.0
 NB X/F buffer ops.     2    6    1975002      23.432      645.092   1.7
 Write traj.            2    6       1002       3.453       95.073   0.3
 Update                 2    6     500001      24.698      679.933   1.8
 Constraints            2    6     500001     181.295     4991.120  13.5
 Comm. energies         2    6      25001       0.500       13.757   0.0
 Rest                                           12.633      347.800   0.9
-----------------------------------------------------------------------------
 Total                                        1341.887    36942.624 100.0
-----------------------------------------------------------------------------
 Breakdown of PME mesh computation
-----------------------------------------------------------------------------
 PME redist. X/F        2    6    1000002     102.268     2815.465   7.6
 PME spread/gather      2    6    1000002     319.578     8798.101  23.8
 PME 3D-FFT             2    6    1000002     123.860     3409.896   9.2
 PME 3D-FFT Comm.       2    6    1000002     116.919     3218.822   8.7
 PME solve Elec         2    6     500001      24.586      676.870   1.8
-----------------------------------------------------------------------------
Thanks for any suggestions
Harry
-------------------------------------------------------------------------
Harry M. Greenblatt
Associate Staff Scientist
Dept of Structural Biology
Weizmann Institute of Science
234 Herzl St.
Rehovot, 76100
Israel
Phone: 972-8-934-3625
Facsimile: 972-8-934-4159
Harry.Greenblatt at weizmann.ac.il