[gmx-users] Poor GPU Performance with GROMACS 5.1.4
Mark Abraham
mark.j.abraham at gmail.com
Thu May 25 00:09:00 CEST 2017
Hi,
I'm wondering why you want 8 ranks on the 14 or 28 cores. The log reports
that something else is controlling thread affinity, which is the easiest
thing to screw up if you are doing node sharing. The job manager has to
give you cores that are solely yours, and you/it need to set the affinities
of your threads to them. Or use mdrun -pin on and let mdrun do it properly
(but you are still dead if there's another job on your cores).
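For example (just a sketch, with made-up core numbers and -deffnm md standing
in for your own files), a pinned 2-rank x 4-thread job that owns the last 8
cores of a 28-core node could be launched roughly like

  export OMP_NUM_THREADS=4
  mpirun -np 2 gmx_mpi mdrun -deffnm md -ntomp 4 -gpu_id 00 \
         -pin on -pinoffset 20 -pinstride 1

where -pinoffset/-pinstride pick which logical cores mdrun locks its threads
to; the right offset depends on which cores the scheduler actually gave you.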
Mark
On Wed, 24 May 2017 22:18 Daniel Kozuch <dkozuch at princeton.edu> wrote:
> Thanks so much for the quick reply. That seems to have fixed the wait time
> issues. Unfortunately, I'm still only getting ~300 ns/day for the benchmark
> system (villin vsites, http://www.gromacs.org/GPU_acceleration), while the
> website claims over 1000 ns/day.
>
> I'm running on an NVIDIA Tesla P100-PCIE-16GB with 8 cores of a Xeon(R) CPU
> E5-2680 v4 @ 2.40GHz. I can see that the CPUs are now underperforming (324%
> used). Any suggestions?
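> (For reference, that 324% is just Core t / Wall t from the log below:
> 444.060 s / 137.015 s = ~3.24, i.e. on average only about 3.2 of the 8
> cores are doing useful work.)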
>
>
> _________________________________________________________
>
>
> R E A L   C Y C L E   A N D   T I M E   A C C O U N T I N G
>
> On 2 MPI ranks, each using 4 OpenMP threads
>
> Computing:          Num   Num       Call    Wall time      Giga-Cycles
>                    Ranks Threads    Count       (s)         total sum     %
> -----------------------------------------------------------------------------
> Domain decomp.        2     4       4001        4.402          84.517   3.2
> DD comm. load         2     4       3983        0.021           0.402   0.0
> DD comm. bounds       2     4       3982        0.014           0.267   0.0
> Vsite constr.         2     4     100001        3.330          63.929   2.4
> Neighbor search       2     4       4001        7.495         143.911   5.5
> Launch GPU ops.       2     4     200002        4.820          92.537   3.5
> Comm. coord.          2     4      96000        2.212          42.468   1.6
> Force                 2     4     100001       12.465         239.335   9.1
> Wait + Comm. F        2     4     100001        2.572          49.381   1.9
> PME mesh              2     4     100001       59.323        1139.002  43.3
> Wait GPU nonlocal     2     4     100001        0.483           9.282   0.4
> Wait GPU local        2     4     100001        0.292           5.607   0.2
> NB X/F buffer ops.    2     4     392002        5.703         109.491   4.2
> Vsite spread          2     4     101002        2.762          53.030   2.0
> Write traj.           2     4          1        0.007           0.130   0.0
> Update                2     4     100001        4.372          83.942   3.2
> Constraints           2     4     100001       23.858         458.072  17.4
> Comm. energies        2     4      20001        0.146           2.803   0.1
> Rest                                            2.739          52.595   2.0
> -----------------------------------------------------------------------------
> Total                                         137.015        2630.701 100.0
> -----------------------------------------------------------------------------
> Breakdown of PME mesh computation
> -----------------------------------------------------------------------------
> PME redist. X/F       2     4     200002        6.021         115.598   4.4
> PME spread/gather     2     4     200002       36.204         695.123  26.4
> PME 3D-FFT            2     4     200002       13.127         252.036   9.6
> PME 3D-FFT Comm.      2     4     200002        2.007          38.538   1.5
> PME solve Elec        2     4     100001        0.541          10.392   0.4
> -----------------------------------------------------------------------------
>
>                Core t (s)   Wall t (s)        (%)
>        Time:      444.060      137.015      324.1
>                  (ns/day)    (hour/ns)
> Performance:      315.296        0.076
> Finished mdrun on rank 0 Wed May 24 15:48:59 2017
>
> On Wed, May 24, 2017 at 3:25 PM, Smith, Micholas D. <smithmd at ornl.gov>
> wrote:
>
> > Try just using your equivalent of:
> >
> > mpirun -n 2 -npernode 2 gmx_mpi mdrun (your run stuff here) -ntomp 4
> > -gpu_id 00
> >
> > That may speed it up.
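> >
> > For example, with a hypothetical run input bench.tpr (substitute your own
> > files), the full command would look something like
> >
> >   mpirun -n 2 -npernode 2 gmx_mpi mdrun -deffnm bench -ntomp 4 -gpu_id 00
> >
> > where -gpu_id 00 maps both ranks on the node to GPU 0.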
> >
> > ===================
> > Micholas Dean Smith, PhD.
> > Post-doctoral Research Associate
> > University of Tennessee/Oak Ridge National Laboratory
> > Center for Molecular Biophysics
> >
> > ________________________________________
> > From: gromacs.org_gmx-users-bounces at maillist.sys.kth.se <
> > gromacs.org_gmx-users-bounces at maillist.sys.kth.se> on behalf of Daniel
> > Kozuch <dkozuch at princeton.edu>
> > Sent: Wednesday, May 24, 2017 3:08 PM
> > To: gromacs.org_gmx-users at maillist.sys.kth.se
> > Subject: [gmx-users] Poor GPU Performance with GROMACS 5.1.4
> >
> > Hello,
> >
> > I'm using GROMACS 5.1.4 on 8 CPU cores and 1 GPU for a system of ~8000
> > atoms in a dodecahedron box, and I'm having trouble getting good
> > performance out of the GPU. Specifically, it appears that there is
> > significant performance loss to wait times ("Wait + Comm. F" and "Wait GPU
> > nonlocal"). I have pasted the relevant parts of the log file below. I
> > suspect that I have set up my ranks/threads badly, but I am unsure where
> > the issue is. I have tried changing the environment variable
> > OMP_NUM_THREADS from 1 to 2 per the note generated by GROMACS, but this
> > severely slows down the simulation, to the point where it takes 10 minutes
> > to get a few picoseconds.
> >
> > I have tried browsing through the mailing lists, but I haven't found a
> > solution to this particular problem.
> >
> > Any help is appreciated,
> > Dan
> >