[gmx-users] 1 gpu vs 2 gpu speedup
Szilárd Páll
pall.szilard at gmail.com
Tue Jul 8 11:52:26 CEST 2014
Hi,
32k atoms is quite a small system to parallelize across two GPUs, so it is
no surprise that you see only a 1.3x improvement.
More comments inline.
On Mon, Jul 7, 2014 at 12:13 PM, Harry Mark Greenblatt
<harry.greenblatt at weizmann.ac.il> wrote:
> BS"D
> Dear All,
>
> I was given access to a test machine with
>
> 2 x E5-2630 2.3GHz 6 core processors
> 2 x Tesla K20X GPUs
> Gromacs 5.0, compiled (gcc 4.4.7) with support for Intel MPI.
This compiler is very outdated; you should use at least gcc 4.7 or 4.8
for best performance. The CPU-only runs in particular should get quite a
bit faster.
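If you do rebuild, you can point CMake at the newer compiler explicitly; a
minimal sketch, assuming gcc 4.8 is installed as gcc-4.8/g++-4.8 on your
machine (adjust the names and paths to your setup):

  cmake .. -DCMAKE_C_COMPILER=gcc-4.8 \
           -DCMAKE_CXX_COMPILER=g++-4.8 \
           -DGMX_MPI=ON -DGMX_GPU=ON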
>
> Ran a 1 ns simulation of a 3-domain (all in one chain) DNA-binding protein, dsDNA, waters, and ions (~32,600 atoms). The DNA was constrained.
>
> Using a VDW cutoff of 1.3 nm gave a close balance between GPU and CPU usage with 6 cores and 1 GPU card (1.061).
>
> Results:
>
> Setup                  Wall time (s)   ns/day   Speedup (relative to 1st result)
>
> 6 cores, no GPU            12,996        6.65
> 12 cores, no GPU            7,037       12.3        1.85
> 6 cores, 1 GPU              1,853       46.6        7.01
> 2 x 6 cores, 2 GPUs         1,342       64.4        9.68
Is that really a 7x speedup with respect to 6 cores? It should be more
like 3-4x, so I suspect your CPU-only performance is 1.5-2x off.
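For reference, 46.6 / 6.65 ≈ 7.0, whereas 46.6 / 12.3 ≈ 3.8; so if the
6-core baseline were closer to the 12.3 ns/day your 12-core run reaches,
the single-GPU speedup would land in the expected range.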
> I was a bit disappointed by the 2-GPU case (less than a 1.4x speedup relative to 1 GPU). Unlike the others, I submitted this job with mpirun -np 2 and no other mdrun arguments, apart from adding -pin on, which made no difference.
>
> The job certainly ran on 2 MPI ranks, using 12 cores and 2 GPUs (~75% usage each). I'm wondering whether I failed to specify the proper mdrun command-line arguments (based on the log file, see below, where it complains about too few total ranks), or whether my system is simply not amenable to efficient GPU acceleration beyond one GPU.
>
> Comments? Below is an excerpt from the end of the log file:
> ...
> Number of hardware threads detected (12) does not match the number reported by OpenMP (6).
This does not look good; I think it means that your job scheduler
expects you to use only 6 cores. You should make sure that thread
affinities are set correctly, and getting rid of the above warning could
help too.
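As a rough sketch for the single-GPU case, assuming the scheduler really
gives you all 6 cores of one socket (adjust the thread count to whatever
is actually allocated):

  export OMP_NUM_THREADS=6
  mdrun -ntomp 6 -pin on

If only 6 cores are granted, asking for 12 threads will oversubscribe
them and hurt performance.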
> Consider setting the launch configuration manually!
> ...
> Initializing Domain Decomposition on 2 ranks
> Dynamic load balancing: auto
> Will sort the charge groups at every domain (re)decomposition
> Initial maximum inter charge-group distances:
> two-body bonded interactions: 0.410 nm, LJ-14, atoms 1018 1025
> multi-body bonded interactions: 0.410 nm, Proper Dih., atoms 1018 1025
> Minimum cell size due to bonded interactions: 0.452 nm
> Maximum distance for 5 constraints, at 120 deg. angles, all-trans: 0.819 nm
> Estimated maximum distance required for P-LINCS: 0.819 nm
> This distance will limit the DD cell size, you can override this with -rcon
> Using 0 separate PME ranks, as there are too few total
> ranks for efficient splitting
> Scaling the initial minimum size with 1/0.8 (option -dds) = 1.25
> Optimizing the DD grid for 2 cells with a minimum initial size of 1.024 nm
> The maximum allowed number of cells is: X 6 Y 6 Z 5
> Domain decomposition grid 2 x 1 x 1, separate PME ranks 0
> PME domain decomposition: 2 x 1 x 1
> Domain decomposition rank 0, coordinates 0 0 0
>
> Using 2 MPI processes
> Using 6 OpenMP threads per MPI process
> ...
> 2 GPUs detected on host cff042:
> #0: NVIDIA Tesla K20Xm, compute cap.: 3.5, ECC: no, stat: compatible
> #1: NVIDIA Tesla K20Xm, compute cap.: 3.5, ECC: no, stat: compatible
>
> 2 GPUs auto-selected for this run.
> Mapping of GPUs to the 2 PP ranks in this node: #0, #1
> ...
You may want to try using multiple ranks per GPU, e.g.
mpirun -np 4 mdrun -gpu_id 0011
mpirun -np 6 mdrun -gpu_id 000111
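With all 12 cores those would look roughly like this (treat the -ntomp
values as placeholders for however many cores per rank you actually get):

  mpirun -np 4 mdrun -ntomp 3 -gpu_id 0011 -pin on
  mpirun -np 6 mdrun -ntomp 2 -gpu_id 000111 -pin on

i.e. 2 or 3 PP ranks per GPU, each with enough OpenMP threads to keep all
cores busy.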
> D O M A I N   D E C O M P O S I T I O N   S T A T I S T I C S
>
> av. #atoms communicated per step for force: 2 x 16308.8
> av. #atoms communicated per step for LINCS: 2 x 1444.2
>
> Average load imbalance: 11.5 %
> Part of the total run time spent waiting due to load imbalance: 1.2 %
>
>
> R E A L   C Y C L E   A N D   T I M E   A C C O U N T I N G
>
> On 2 MPI ranks, each using 6 OpenMP threads
>
> Computing:            Num    Num      Call      Wall time     Giga-Cycles
>                       Ranks  Threads  Count        (s)       total sum    %
> -----------------------------------------------------------------------------
> Domain decomp.          2      6       12500       42.424      1167.938   3.2
> DD comm. load           2      6        2497        0.023         0.621   0.0
> Neighbor search         2      6       12501       35.169       968.215   2.6
> Launch GPU ops.         2      6     1000002       39.832      1096.595   3.0
> Comm. coord.            2      6      487500       50.194      1381.851   3.7
> Force                   2      6      500001      110.595      3044.707   8.2
> Wait + Comm. F          2      6      500001       56.881      1565.951   4.2
> PME mesh                2      6      500001      688.824     18963.576  51.3
> Wait GPU nonlocal       2      6      500001        5.112       140.734   0.4
> Wait GPU local          2      6      500001       66.823      1839.662   5.0
> NB X/F buffer ops.      2      6     1975002       23.432       645.092   1.7
> Write traj.             2      6        1002        3.453        95.073   0.3
> Update                  2      6      500001       24.698       679.933   1.8
> Constraints             2      6      500001      181.295      4991.120  13.5
> Comm. energies          2      6       25001        0.500        13.757   0.0
> Rest                                                12.633       347.800   0.9
> -----------------------------------------------------------------------------
> Total                                             1341.887     36942.624 100.0
> -----------------------------------------------------------------------------
> Breakdown of PME mesh computation
> -----------------------------------------------------------------------------
> PME redist. X/F         2      6     1000002      102.268      2815.465   7.6
> PME spread/gather       2      6     1000002      319.578      8798.101  23.8
> PME 3D-FFT              2      6     1000002      123.860      3409.896   9.2
> PME 3D-FFT Comm.        2      6     1000002      116.919      3218.822   8.7
> PME solve Elec          2      6      500001       24.586       676.870   1.8
> -----------------------------------------------------------------------------
>
> Thanks for any suggestions
>
> Harry
>
> -------------------------------------------------------------------------
> Harry M. Greenblatt
> Associate Staff Scientist
> Dept of Structural Biology
> Weizmann Institute of Science        Phone: 972-8-934-3625
> 234 Herzl St.                        Facsimile: 972-8-934-4159
> Rehovot, 76100
> Israel
> Harry.Greenblatt at weizmann.ac.il