[gmx-users] 1 gpu vs 2 gpu speedup
Szilárd Páll
pall.szilard at gmail.com
Tue Jul 8 11:52:26 CEST 2014
Hi,
32k atoms is quite a small system to parallelize across two GPUs, so it is
no surprise that you see only a 1.3x improvement.
More comments inline.
On Mon, Jul 7, 2014 at 12:13 PM, Harry Mark Greenblatt
<harry.greenblatt at weizmann.ac.il> wrote:
> BS"D
> Dear All,
>
> I was given access to a test machine with
>
> 2 x E5-2630 2.3GHz 6 core processors
> 2 x Tesla K20X GPUs
> Gromacs 5.0, compiled (gcc 4.4.7) with support for Intel MPI.
This compiler is very outdated; you should use at least gcc 4.7 or 4.8
for best performance. The CPU-only runs in particular should get quite a
bit faster.
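If you do rebuild, you can point CMake at the newer compiler explicitly; a
minimal sketch, assuming gcc 4.8 is installed as gcc-4.8/g++-4.8 on your
machine (adjust the names and paths to your setup):

  cmake .. -DCMAKE_C_COMPILER=gcc-4.8 \
           -DCMAKE_CXX_COMPILER=g++-4.8 \
           -DGMX_MPI=ON -DGMX_GPU=ON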
>
> Ran a 1 ns simulation of a 3-domain (all in one chain) DNA-binding protein, dsDNA, waters, and ions (~32,600 atoms). The DNA was constrained.
>
> Using a VDW cutoff of 1.3 nm gave a close balance between GPU and CPU usage with 6 cores and 1 GPU card (1.061).
>
> Results:
>
> Setup                  Wall time (s)   ns/day   Speedup (relative to 1st result)
>
> 6 cores, no GPU            12,996        6.65
> 12 cores, no GPU            7,037       12.3        1.85
> 6 cores, 1 GPU              1,853       46.6        7.01
> 2 x 6 cores, 2 GPUs         1,342       64.4        9.68
Is that really a 7x speedup with respect to 6 cores? It should be more
like 3-4x, so I suspect your CPU-only performance is 1.5-2x off.
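For reference, 46.6 / 6.65 ≈ 7.0, whereas 46.6 / 12.3 ≈ 3.8; so if the
6-core baseline were closer to the 12.3 ns/day your 12-core run reaches,
the single-GPU speedup would land in the expected range.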
> I was a bit disappointed by the 2-GPU case (less than a 1.4x speedup relative to 1 GPU). Unlike the others, I submitted this job with mpirun -np 2 and no other mdrun arguments, apart from adding -pin on, which made no difference.
>
> The job certainly ran on 2 MPI ranks, using 12 cores and 2 GPUs (~75% usage each). I'm wondering whether I failed to specify the proper mdrun command-line arguments (based on the log file, see below, where it complains about too few total ranks), or whether my system is simply not amenable to efficient GPU acceleration beyond one GPU.
>
> Comments? Below is an excerpt from the end of the log file:
> ...
> Number of hardware threads detected (12) does not match the number reported by OpenMP (6).
This does not look good; I think it means that your job scheduler
expects you to use only 6 cores. You should make sure that thread
affinities are set correctly, and getting rid of the above warning could
help too.
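As a rough sketch for the single-GPU case, assuming the scheduler really
gives you all 6 cores of one socket (adjust the thread count to whatever
is actually allocated):

  export OMP_NUM_THREADS=6
  mdrun -ntomp 6 -pin on

If only 6 cores are granted, asking for 12 threads will oversubscribe
them and hurt performance.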
> Consider setting the launch configuration manually!
> ...
> Initializing Domain Decomposition on 2 ranks
> Dynamic load balancing: auto
> Will sort the charge groups at every domain (re)decomposition
> Initial maximum inter charge-group distances:
> two-body bonded interactions: 0.410 nm, LJ-14, atoms 1018 1025
> multi-body bonded interactions: 0.410 nm, Proper Dih., atoms 1018 1025
> Minimum cell size due to bonded interactions: 0.452 nm
> Maximum distance for 5 constraints, at 120 deg. angles, all-trans: 0.819 nm
> Estimated maximum distance required for P-LINCS: 0.819 nm
> This distance will limit the DD cell size, you can override this with -rcon
> Using 0 separate PME ranks, as there are too few total
> ranks for efficient splitting
> Scaling the initial minimum size with 1/0.8 (option -dds) = 1.25
> Optimizing the DD grid for 2 cells with a minimum initial size of 1.024 nm
> The maximum allowed number of cells is: X 6 Y 6 Z 5
> Domain decomposition grid 2 x 1 x 1, separate PME ranks 0
> PME domain decomposition: 2 x 1 x 1
> Domain decomposition rank 0, coordinates 0 0 0
>
> Using 2 MPI processes
> Using 6 OpenMP threads per MPI process
> ...
> 2 GPUs detected on host cff042:
> #0: NVIDIA Tesla K20Xm, compute cap.: 3.5, ECC: no, stat: compatible
> #1: NVIDIA Tesla K20Xm, compute cap.: 3.5, ECC: no, stat: compatible
>
> 2 GPUs auto-selected for this run.
> Mapping of GPUs to the 2 PP ranks in this node: #0, #1
> ...
You may want to try using multiple ranks per GPU, e.g.
mpirun -np 4 mdrun -gpu_id 0011
mpirun -np 6 mdrun -gpu_id 000111
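With all 12 cores those would look roughly like this (treat the -ntomp
values as placeholders for however many cores per rank you actually get):

  mpirun -np 4 mdrun -ntomp 3 -gpu_id 0011 -pin on
  mpirun -np 6 mdrun -ntomp 2 -gpu_id 000111 -pin on

i.e. 2 or 3 PP ranks per GPU, each with enough OpenMP threads to keep all
cores busy.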
> D O M A I N   D E C O M P O S I T I O N   S T A T I S T I C S
>
> av. #atoms communicated per step for force: 2 x 16308.8
> av. #atoms communicated per step for LINCS: 2 x 1444.2
>
> Average load imbalance: 11.5 %
> Part of the total run time spent waiting due to load imbalance: 1.2 %
>
>
> R E A L   C Y C L E   A N D   T I M E   A C C O U N T I N G
>
> On 2 MPI ranks, each using 6 OpenMP threads
>
> Computing:            Num    Num      Call      Wall time     Giga-Cycles
>                       Ranks  Threads  Count        (s)       total sum    %
> -----------------------------------------------------------------------------
> Domain decomp.          2      6       12500       42.424      1167.938   3.2
> DD comm. load           2      6        2497        0.023         0.621   0.0
> Neighbor search         2      6       12501       35.169       968.215   2.6
> Launch GPU ops.         2      6     1000002       39.832      1096.595   3.0
> Comm. coord.            2      6      487500       50.194      1381.851   3.7
> Force                   2      6      500001      110.595      3044.707   8.2
> Wait + Comm. F          2      6      500001       56.881      1565.951   4.2
> PME mesh                2      6      500001      688.824     18963.576  51.3
> Wait GPU nonlocal       2      6      500001        5.112       140.734   0.4
> Wait GPU local          2      6      500001       66.823      1839.662   5.0
> NB X/F buffer ops.      2      6     1975002       23.432       645.092   1.7
> Write traj.             2      6        1002        3.453        95.073   0.3
> Update                  2      6      500001       24.698       679.933   1.8
> Constraints             2      6      500001      181.295      4991.120  13.5
> Comm. energies          2      6       25001        0.500        13.757   0.0
> Rest                                                12.633       347.800   0.9
> -----------------------------------------------------------------------------
> Total                                             1341.887     36942.624 100.0
> -----------------------------------------------------------------------------
> Breakdown of PME mesh computation
> -----------------------------------------------------------------------------
> PME redist. X/F         2      6     1000002      102.268      2815.465   7.6
> PME spread/gather       2      6     1000002      319.578      8798.101  23.8
> PME 3D-FFT              2      6     1000002      123.860      3409.896   9.2
> PME 3D-FFT Comm.        2      6     1000002      116.919      3218.822   8.7
> PME solve Elec          2      6      500001       24.586       676.870   1.8
> -----------------------------------------------------------------------------
>
> Thanks for any suggestions
>
> Harry
>
> -------------------------------------------------------------------------
> Harry M. Greenblatt
> Associate Staff Scientist
> Dept of Structural Biology
> Weizmann Institute of Science        Phone: 972-8-934-3625
> 234 Herzl St.                        Facsimile: 972-8-934-4159
> Rehovot, 76100
> Israel
> Harry.Greenblatt at weizmann.ac.il