[gmx-users] Poor GPU Performance with GROMACS 5.1.4
Daniel Kozuch
dkozuch at princeton.edu
Wed May 24 22:18:02 CEST 2017
Thanks so much for the quick reply. That seems to have fixed the wait time
issues. Unfortunately, I'm still only getting ~300 ns/day for the benchmark
system (villin vsites, http://www.gromacs.org/GPU_acceleration), while the
website claims over 1000 ns/day.
I'm running on an NVIDIA Tesla P100-PCIE-16GB with 8 Xeon(R) E5-2680 v4 cores
@ 2.40GHz. I can see that the CPUs are now underutilized (324% used, out of a
possible 800% for the 8 cores).
Any suggestions?
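For example, would something along these lines be worth testing? (Only the
rank/thread/GPU flags are shown below; everything else in my run line would
stay the same.)

# single PP rank driving the GPU, all 8 cores as OpenMP threads
mpirun -n 1 -npernode 1 gmx_mpi mdrun (your run stuff here) -ntomp 8 -gpu_id 0 -pin on

# or four PP ranks sharing GPU 0, with 2 OpenMP threads each
mpirun -n 4 -npernode 4 gmx_mpi mdrun (your run stuff here) -ntomp 2 -gpu_id 0000 -pin on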
_________________________________________________________

     R E A L   C Y C L E   A N D   T I M E   A C C O U N T I N G

On 2 MPI ranks, each using 4 OpenMP threads

 Computing:          Num   Num      Call    Wall time      Giga-Cycles
                     Ranks Threads  Count      (s)       total sum     %
-------------------------------------------------------------------------
 Domain decomp.         2    4       4001       4.402        84.517   3.2
 DD comm. load          2    4       3983       0.021         0.402   0.0
 DD comm. bounds        2    4       3982       0.014         0.267   0.0
 Vsite constr.          2    4     100001       3.330        63.929   2.4
 Neighbor search        2    4       4001       7.495       143.911   5.5
 Launch GPU ops.        2    4     200002       4.820        92.537   3.5
 Comm. coord.           2    4      96000       2.212        42.468   1.6
 Force                  2    4     100001      12.465       239.335   9.1
 Wait + Comm. F         2    4     100001       2.572        49.381   1.9
 PME mesh               2    4     100001      59.323      1139.002  43.3
 Wait GPU nonlocal      2    4     100001       0.483         9.282   0.4
 Wait GPU local         2    4     100001       0.292         5.607   0.2
 NB X/F buffer ops.     2    4     392002       5.703       109.491   4.2
 Vsite spread           2    4     101002       2.762        53.030   2.0
 Write traj.            2    4          1       0.007         0.130   0.0
 Update                 2    4     100001       4.372        83.942   3.2
 Constraints            2    4     100001      23.858       458.072  17.4
 Comm. energies         2    4      20001       0.146         2.803   0.1
 Rest                                           2.739        52.595   2.0
-------------------------------------------------------------------------
 Total                                        137.015      2630.701 100.0
-------------------------------------------------------------------------
 Breakdown of PME mesh computation
-------------------------------------------------------------------------
 PME redist. X/F        2    4     200002       6.021       115.598   4.4
 PME spread/gather      2    4     200002      36.204       695.123  26.4
 PME 3D-FFT             2    4     200002      13.127       252.036   9.6
 PME 3D-FFT Comm.       2    4     200002       2.007        38.538   1.5
 PME solve Elec         2    4     100001       0.541        10.392   0.4
-------------------------------------------------------------------------

               Core t (s)   Wall t (s)        (%)
       Time:      444.060      137.015      324.1
                 (ns/day)    (hour/ns)
Performance:      315.296        0.076
Finished mdrun on rank 0 Wed May 24 15:48:59 2017
On Wed, May 24, 2017 at 3:25 PM, Smith, Micholas D. <smithmd at ornl.gov>
wrote:
> Try just using your equivalent of:
>
> mpirun -n 2 -npernode 2 gmx_mpi mdrun (your run stuff here) -ntomp 4
> -gpu_id 00
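> (i.e. two PP ranks both mapped to GPU 0 by the "-gpu_id 00" string, each
> running 4 OpenMP threads, so all 8 cores get used)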
>
> That may speed it up.
>
> ===================
> Micholas Dean Smith, PhD.
> Post-doctoral Research Associate
> University of Tennessee/Oak Ridge National Laboratory
> Center for Molecular Biophysics
>
> ________________________________________
> From: gromacs.org_gmx-users-bounces at maillist.sys.kth.se <
> gromacs.org_gmx-users-bounces at maillist.sys.kth.se> on behalf of Daniel
> Kozuch <dkozuch at princeton.edu>
> Sent: Wednesday, May 24, 2017 3:08 PM
> To: gromacs.org_gmx-users at maillist.sys.kth.se
> Subject: [gmx-users] Poor GPU Performance with GROMACS 5.1.4
>
> Hello,
>
> I'm using GROMACS 5.1.4 on 8 CPUs and 1 GPU for a system of ~8000 atoms in
> a dodecahedron box, and I'm having trouble getting good performance out of
> the GPU. Specifically, it appears that there is a significant performance loss
> to wait times ("Wait + Comm. F" and "Wait GPU nonlocal"). I have pasted the
> relevant parts of the log file below. I suspect that I have set up my
> ranks/threads badly, but I am unsure where the issue is. I have tried
> changing the environment variable OMP_NUM_THREADS from 1 to 2 per the note
> generated by GROMACS, but this severely slowed down the simulation, to the
> point where it took 10 minutes to run a few picoseconds.
>
> I have tried browsing through the mailing lists, but I haven't found a
> solution to this particular problem.
>
> Any help is appreciated,
> Dan
>