[gmx-users] Tests with Threadripper and dual gpu setup
Harry Mark Greenblatt
harry.greenblatt at weizmann.ac.il
Wed Jan 24 10:15:05 CET 2018
BS”D
In case anybody is interested, we have tested Gromacs on a Threadripper machine with two GPUs.
Hardware:
Ryzen Threadripper 1950X 16 core CPU (multithreading on), with Corsair H100i V2 Liquid cooling
Asus Prime X399-A M/B
2 x GeForce GTX 1080 GPUs
32 GB of 3200 MHz memory
Samsung 850 Pro 512GB SSD
OS, software:
CentOS 7.4, with the 4.14 kernel from ELRepo
gcc 4.8.5 and gcc 5.5.0
fftw 3.3.7 (AVX2 enabled)
CUDA 8
Gromacs 2016.4
Gromacs 2018-rc1 and the final 2018 release
Using thread-MPI
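(The exact configure lines are not reproduced here; a typical CMake invocation for a build like this would be something along the lines of

cmake .. -DGMX_GPU=ON -DGMX_FFT_LIBRARY=fftw3 -DCMAKE_INSTALL_PREFIX=/opt/gromacs

where the install prefix is a placeholder. GMX_GPU and GMX_FFT_LIBRARY are standard Gromacs CMake options, and thread-MPI is the default, so it needs no extra option.)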
I managed to compile gcc 5.5.0, but when I then used it to build Gromacs, the compiler could not recognise the hardware, although the distribution's native gcc 4.8.5 had no problem.
In 2016.4 I was able to specify which SIMD set to use, so this was not an issue there; in any case there was very little difference between gcc 5.5.0 and 4.8.5, so I used 4.8.5 for 2018.
Any ideas how to overcome this problem with 5.5.0?
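(One untested thought: gcc only gained Zen support, -march=znver1, in version 6, so the failure is presumably in -march=native detection. It might be possible to sidestep it by naming both the SIMD set and a target that gcc 5.5.0 does know, e.g.

cmake .. -DGMX_SIMD=AVX2_256 -DCMAKE_C_FLAGS="-march=core-avx2" -DCMAKE_CXX_FLAGS="-march=core-avx2"

-march=core-avx2 is the generic Haswell-class target; whether this actually cures the 5.5.0 problem is a guess on my part.)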
————————————
Gromacs 2016.4
————————————
System: protein/DNA complex with 438,397 atoms (including waters/ions), 100 ps NPT equilibration.
Allowing Gromacs to choose how to allocate the hardware gave
8 tMPI ranks, 4 threads per rank, both GPUs
12.4 ns/day
When I told it to use 4 tMPI ranks, 8 threads per rank, both GPUs
12.2 ns/day
Running on “real” cores only
4 tMPI ranks, 4 threads per rank, 2 GPUs
10.2 ns/day
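(For anyone reproducing these: I have not pasted every command line. Runs like the three above are requested with explicit rank/thread counts; the 4 x 8 case, for example, would be something like

gmx mdrun -v -deffnm test.npt -s test.npt.tpr -ntmpi 4 -ntomp 8 -gpu_id 0011 -pin on

where in 2016.4 -gpu_id gives one digit per PP rank, mapping the four ranks onto the two GPUs.)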
1 tMPI rank, 16 threads per rank, *one* GPU (“half” the machine; pin on, but pinstride and pinoffset automatic)
10.6 ns/day
1 tMPI rank, 16 threads per rank, one GPU, with all pinning options set manually:
gmx mdrun -v -deffnm test.npt -s test.npt.tpr -pin on -ntomp 16 -ntmpi 1 -gpu_id 0 -pinoffset 0 -pinstride 2
12.3 ns/day
Presumably, the gain here is because “-pinstride 2” caused the job to run on the “real” (1,2,3…15) cores, and not on virtual cores. The automatic pinstride above used cores [0,16], [1,17], [2,18]…[7,23], half of which are virtual, and so gave only 10.6 ns/day.
** So there was very little gain from the second GPU, and very little gain from multithreading. **
Using AVX_256 rather than AVX2_256 with the above command gave a small speed-up (although using AVX instead of AVX2 for FFTW made things worse).
12.5 ns/day
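(A note for reproducibility: the FFTW SIMD level is fixed when FFTW itself is configured, not by Gromacs. A typical single-precision 3.3.7 build would be along these lines, with --enable-avx2 omitted for the AVX-only comparison:

./configure --enable-float --enable-shared --enable-sse2 --enable-avx --enable-avx2

These are all standard FFTW configure switches; --enable-float matters because Gromacs runs in single precision by default.)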
To compare with an Intel Xeon Silver system:
2 x Xeon Silver 4116 (2.1GHz base clock, 12 cores each, no Hyperthreading), 64GB memory
2 x GeForce GTX 1080s (as used in the above tests)
gcc 4.8.5
Gromacs 2016.4, with MPI, AVX_256 (compiled on an older GPU machine, and not by me).
2 MPI ranks, 12 threads per rank, 2 GPUs
11.7 ns/day
4 MPI ranks, 6 threads per rank, 2 GPUs
13.0 ns/day
6 MPI ranks, 4 threads per rank, 2 GPUs
14.0 ns/day
To compare with the AMD machine, using the same number of cores:
1 MPI rank, 16 threads, 1 GPU
11.2 ns/day
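(That machine has a real-MPI build, so the runs above were launched through mpirun rather than thread-MPI; the 6-rank case would look something like

mpirun -np 6 gmx_mpi mdrun -v -deffnm test.npt -ntomp 4 -gpu_id 000111 -pin on

with one -gpu_id digit per PP rank, as before. Since I did not build or run that machine myself, treat the exact line as a sketch.)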
—————————————————
Gromacs 2018 rc1 (using gcc 4.8.5)
—————————————————
Using AVX_256
In ‘classic’ mode, not using a GPU for PME
8 tMPI ranks, 4 threads per rank, 2 GPUs
12.7 ns/day (a modest speed-up from 12.4 ns/day with 2016.4)
Now use a GPU for PME
gmx mdrun -v -deffnm test.npt -s test.npt.tpr -pme gpu -pin on
This used 1 tMPI rank, 32 OpenMP threads, and 1 GPU
14.9 ns/day
Forcing the program to use both GPUs
gmx mdrun -v -deffnm test.npt -s test.npt.tpr -pme gpu -pin on -ntmpi 4 -npme 1 -gputasks 0011 -nb gpu
18.5 ns/day
Now with AVX2_128
19.0 ns/day
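(AVX2_128 is chosen at build time; 2018 accepts it as a SIMD flavour suited to Zen, which executes 256-bit AVX2 instructions as two 128-bit halves:

cmake .. -DGMX_SIMD=AVX2_128

The mdrun command line was the same as above, so the rebuild accounts for the 18.5 to 19.0 ns/day difference.)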
Now force Dynamic Load Balancing
gmx mdrun -v -deffnm test.npt -s test.npt.tpr -pme gpu -pin on -ntmpi 4 -npme 1 -gputasks 0011 -nb gpu -dlb yes
20.1 ns/day
Now use more (8) tMPI ranks
gmx mdrun -v -deffnm test.npt -s test.npt.tpr -pme gpu -pin on -ntmpi 8 -npme 1 -gputasks 00001111 -nb gpu -dlb yes
20.7 ns/day
And finally, using the final 2018 release (AVX2_128) with the above command line
20.9 ns/day
Here are the final lines from the log file:
Dynamic load balancing report:
DLB was permanently on during the run per user request.
Average load imbalance: 7.7%.
The balanceable part of the MD step is 51%, load imbalance is computed from this.
Part of the total run time spent waiting due to load imbalance: 3.9%.
Steps where the load balancing was limited by -rdd, -rcon and/or -dds: X 0 %
Average PME mesh/force load: 1.275
Part of the total run time spent waiting due to PP/PME imbalance: 9.4 %
NOTE: 9.4 % performance was lost because the PME ranks
had more work to do than the PP ranks.
You might want to increase the number of PME ranks
or increase the cut-off and the grid spacing.
R E A L   C Y C L E   A N D   T I M E   A C C O U N T I N G
On 7 MPI ranks doing PP, each using 4 OpenMP threads, and
on 1 MPI rank doing PME, using 4 OpenMP threads
 Computing:          Num   Num      Call    Wall time         Giga-Cycles
                     Ranks Threads  Count      (s)         total sum    %
-----------------------------------------------------------------------------
 Domain decomp.         7    4        500      13.721       1306.196   2.9
 DD comm. load          7    4        500       0.366         34.875   0.1
 DD comm. bounds        7    4        500       0.036          3.445   0.0
 Send X to PME          7    4      50001       7.047        670.854   1.5
 Neighbor search        7    4        501       6.060        576.925   1.3
 Launch GPU ops.        7    4     100002      11.335       1079.049   2.4
 Comm. coord.           7    4      49500      38.156       3632.409   8.1
 Force                  7    4      50001      38.357       3651.633   8.1
 Wait + Comm. F         7    4      50001      42.186       4016.143   8.9
 PME mesh *             1    4      50001     205.801       2798.887   6.2
 PME wait for PP *                            207.924       2827.762   6.3
 Wait + Recv. PME F     7    4      50001      70.682       6728.928  14.9
 Wait PME GPU gather    7    4      50001      28.106       2675.682   5.9
 Wait GPU NB nonloc.    7    4      50001      20.463       1948.121   4.3
 Wait GPU NB local      7    4      50001      12.992       1236.845   2.7
 NB X/F buffer ops.     7    4     199002      24.396       2322.498   5.2
 Write traj.            7    4        501       9.081        864.479   1.9
 Update                 7    4      50001      24.809       2361.775   5.2
 Constraints            7    4      50001      79.806       7597.527  16.9
 Comm. energies         7    4       2501      11.961       1138.713   2.5
-----------------------------------------------------------------------------
 Total                                        413.769      45018.045 100.0
-----------------------------------------------------------------------------
(*) Note that with separate PME ranks, the walltime column actually sums to
twice the total reported, but the cycle count total and % are correct.
-----------------------------------------------------------------------------
               Core t (s)   Wall t (s)        (%)
       Time:    13240.604      413.769     3200.0
                 (ns/day)    (hour/ns)
Performance:       20.882        1.149
--------------------------------------------------------------------
Harry M. Greenblatt
Associate Staff Scientist
Dept of Structural Biology           harry.greenblatt at weizmann.ac.il
Weizmann Institute of Science Phone: 972-8-934-6340
234 Herzl St. Facsimile: 972-8-934-3361
Rehovot, 7610001
Israel