[gmx-users] Tests with Threadripper and dual gpu setup

Harry Mark Greenblatt harry.greenblatt at weizmann.ac.il
Wed Jan 24 10:15:05 CET 2018


BS”D

In case anybody is interested, we have tested Gromacs on a Threadripper machine with two GPUs.

Hardware:

Ryzen Threadripper 1950X 16-core CPU (multithreading on), with Corsair H100i V2 liquid cooling
Asus Prime X399-A motherboard
2 x GeForce GTX 1080 GPUs
32 GB of 3200 MHz memory
Samsung 850 Pro 512 GB SSD

OS, software:

CentOS 7.4, with a 4.14 kernel from ELRepo
gcc 4.8.5 and gcc 5.5.0
FFTW 3.3.7 (AVX2 enabled)
CUDA 8
Gromacs 2016.4
Gromacs 2018-rc1 and the final 2018 release
Using thread-MPI


I managed to compile gcc 5.5.0, but when I went to use it to compile Gromacs, the compiler could not recognise the hardware, whereas the native gcc 4.8.5 had no problem.
With 2016.4 I was able to specify which SIMD instruction set to use, so this was not an issue.  In any case, there was very little difference between gcc 5.5.0 and 4.8.5, so I used 4.8.5 for 2018.
Any ideas how to overcome this problem with 5.5.0?
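One possible workaround, if the failure comes from gcc 5.5.0 not recognising the Zen microarchitecture via -march=native (support for znver1 only arrived in gcc 6.1): tell CMake the compiler and SIMD level explicitly, so no hardware detection is needed. A minimal sketch, with the compiler install path and source directory as placeholders:

```shell
# Hypothetical paths; GMX_SIMD avoids -march=native hardware detection.
mkdir build && cd build
cmake ../gromacs-2018 \
  -DCMAKE_C_COMPILER=/opt/gcc-5.5.0/bin/gcc \
  -DCMAKE_CXX_COMPILER=/opt/gcc-5.5.0/bin/g++ \
  -DGMX_SIMD=AVX2_256 \
  -DGMX_GPU=ON
make -j 16
```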

————————————
Gromacs 2016.4
————————————

System: Protein/DNA complex, with 438,397 atoms (including waters/ions), 100 ps npt equilibration.

Allowing Gromacs to choose how it wanted to allocate the hardware gave

8 tMPI ranks, 4 threads per rank, both GPUs

12.4 ns/day

When I told it to use 4 tMPI ranks, 8 threads per rank, both GPUs

12.2 ns/day
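For reference, a rank/thread split like the one above can be requested explicitly; a sketch of the presumed command (file names as used elsewhere in this post):

```shell
# -ntmpi sets the thread-MPI rank count, -ntomp the OpenMP threads per rank;
# mdrun then assigns the two GPUs to the ranks itself.
gmx mdrun -v -deffnm test.npt -s test.npt.tpr -ntmpi 4 -ntomp 8 -pin on
```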


Running on “real” cores only

4 tMPI ranks, 4 threads per rank, 2 GPUs

10.2 ns/day

1 tMPI rank, 16 threads per rank, *one* GPU (“half” the machine; pin on, but pinstride and pinoffset automatic)

10.6 ns/day

1 tMPI rank, 16 threads per rank, one GPU, with all pinning options set manually:

gmx mdrun -v -deffnm test.npt -s test.npt.tpr -pin on -ntomp 16 -ntmpi 1 -gpu_id 0 -pinoffset 0 -pinstride 2

12.3 ns/day

Presumably, the gain here is because “-pinstride 2” caused the job to run on the “real” cores (1,2,3…15) and not on the virtual cores.  The automatic pinstride above used cores [0,16], [1,17], [2,18]…[7,23], half of which are virtual, and so gave only 10.6 ns/day.
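The physical-vs-virtual core numbering can be checked before choosing -pinoffset/-pinstride; on Linux, lscpu reports which logical CPUs share a physical core (a generic sketch, not specific to this machine):

```shell
# Each physical core appears once per hardware thread in the CORE column,
# so on an SMT-enabled machine every CORE value shows up twice.
lscpu -p=CPU,CORE | grep -v '^#'

# The same sibling mapping for one logical CPU, straight from sysfs:
cat /sys/devices/system/cpu/cpu0/topology/thread_siblings_list
```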

** So there was very little gain from the second GPU, and very little gain from multithreading. **

Using AVX_256 rather than AVX2_256 with the above command gave a small speed-up (although using AVX instead of AVX2 for FFTW made things worse).

12.5 ns/day


To compare with an Intel Xeon Silver system:
2 x Xeon Silver 4116 (2.1GHz base clock, 12 cores each, no Hyperthreading), 64GB memory
2 x GeForce GTX 1080 GPUs (as used in the above tests)

gcc 4.8.5
Gromacs 2016.4, with MPI, AVX_256 (compiled on an older GPU machine, and not by me).


2 MPI ranks, 12 threads per rank, 2 GPUs

11.7 ns/day

4 MPI ranks, 6 threads per rank, 2 GPUs

13.0 ns/day

6 MPI ranks, 4 threads per rank, 2 GPUs

14.0 ns/day

To compare with the AMD machine, using the same number of cores:

1 MPI rank, 16 threads, 1 GPU

11.2 ns/day

—————————————————
Gromacs 2018 rc1 (using gcc 4.8.5)
—————————————————

Using AVX_256

In ‘classic’ mode, not using the GPU for PME

8 tMPI ranks, 4 threads per rank, 2 GPUs

12.7 ns/day (modest speed up from 12.4 ns/day with 2016.4)

Now using a GPU for PME

gmx mdrun -v -deffnm test.npt -s test.npt.tpr -pme gpu -pin on

mdrun chose 1 tMPI rank, 32 OpenMP threads, 1 GPU

14.9 ns/day

Forcing the program to use both GPUs

gmx mdrun -v -deffnm test.npt -s test.npt.tpr -pme gpu -pin on -ntmpi 4 -npme 1 -gputasks 0011 -nb gpu

18.5 ns/day

Now with AVX2_128

19.0 ns/day

Now force Dynamic Load Balancing

gmx mdrun -v -deffnm test.npt -s test.npt.tpr -pme gpu -pin on -ntmpi 4 -npme 1 -gputasks 0011 -nb gpu -dlb yes

20.1 ns/day

Now use more (8) tMPI ranks

gmx mdrun -v -deffnm test.npt -s test.npt.tpr -pme gpu -pin on -ntmpi 8 -npme 1 -gputasks 00001111 -nb gpu -dlb yes

20.7 ns/day

And finally, using the final 2018 release (AVX2_128) with the above command line

20.9 ns/day

Here are the final lines from the log file

Dynamic load balancing report:
 DLB was permanently on during the run per user request.
 Average load imbalance: 7.7%.
 The balanceable part of the MD step is 51%, load imbalance is computed from this.
 Part of the total run time spent waiting due to load imbalance: 3.9%.
 Steps where the load balancing was limited by -rdd, -rcon and/or -dds: X 0 %
 Average PME mesh/force load: 1.275
 Part of the total run time spent waiting due to PP/PME imbalance: 9.4 %

NOTE: 9.4 % performance was lost because the PME ranks
      had more work to do than the PP ranks.
      You might want to increase the number of PME ranks
      or increase the cut-off and the grid spacing.
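Following the log's suggestion, one way to shift work from the PME rank to the PP ranks is to scale the Coulomb cut-off and the PME grid spacing by the same factor in the .mdp file, which (with PME) leaves the electrostatics accuracy essentially unchanged. The values below are purely illustrative:

```
; Hypothetical values: scale rcoulomb and fourierspacing by the same
; factor (here 1.2x) to move load from the PME mesh to the PP ranks.
rcoulomb        = 1.2    ; was e.g. 1.0 nm
fourierspacing  = 0.144  ; was e.g. 0.12 nm
```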


     R E A L   C Y C L E   A N D   T I M E   A C C O U N T I N G

On 7 MPI ranks doing PP, each using 4 OpenMP threads, and
on 1 MPI rank doing PME, using 4 OpenMP threads

 Computing:          Num   Num      Call    Wall time         Giga-Cycles
                     Ranks Threads  Count      (s)         total sum    %
-----------------------------------------------------------------------------
 Domain decomp.         7    4        500      13.721       1306.196   2.9
 DD comm. load          7    4        500       0.366         34.875   0.1
 DD comm. bounds        7    4        500       0.036          3.445   0.0
 Send X to PME          7    4      50001       7.047        670.854   1.5
 Neighbor search        7    4        501       6.060        576.925   1.3
 Launch GPU ops.        7    4     100002      11.335       1079.049   2.4
 Comm. coord.           7    4      49500      38.156       3632.409   8.1
 Force                  7    4      50001      38.357       3651.633   8.1
 Wait + Comm. F         7    4      50001      42.186       4016.143   8.9
 PME mesh *             1    4      50001     205.801       2798.887   6.2
 PME wait for PP *                            207.924       2827.762   6.3
 Wait + Recv. PME F     7    4      50001      70.682       6728.928  14.9
 Wait PME GPU gather    7    4      50001      28.106       2675.682   5.9
 Wait GPU NB nonloc.    7    4      50001      20.463       1948.121   4.3
 Wait GPU NB local      7    4      50001      12.992       1236.845   2.7
 NB X/F buffer ops.     7    4     199002      24.396       2322.498   5.2
 Write traj.            7    4        501       9.081        864.479   1.9
 Update                 7    4      50001      24.809       2361.775   5.2
 Constraints            7    4      50001      79.806       7597.527  16.9
 Comm. energies         7    4       2501      11.961       1138.713   2.5
-----------------------------------------------------------------------------
Total                                        413.769      45018.045 100.0
-----------------------------------------------------------------------------
(*) Note that with separate PME ranks, the walltime column actually sums to
    twice the total reported, but the cycle count total and % are correct.
-----------------------------------------------------------------------------

               Core t (s)   Wall t (s)        (%)
       Time:    13240.604      413.769     3200.0
                 (ns/day)    (hour/ns)
Performance:       20.882        1.149




--------------------------------------------------------------------
Harry M. Greenblatt
Associate Staff Scientist
Dept of Structural Biology           harry.greenblatt at weizmann.ac.il
Weizmann Institute of Science        Phone:  972-8-934-6340
234 Herzl St.                        Facsimile:   972-8-934-3361
Rehovot, 7610001
Israel



More information about the gromacs.org_gmx-users mailing list