[gmx-users] Poor GPU Performance with GROMACS 5.1.4
Mark Abraham
mark.j.abraham at gmail.com
Thu May 25 00:09:00 CEST 2017
Hi,
I'm wondering why you want 8 ranks on the 14 or 28 cores. The log reports
that something else is controlling thread affinity, which is the easiest
thing to screw up if you are doing node sharing. The job manager has to
give you cores that are solely yours, and you/it need to set the affinities
of your threads to them. Or use mdrun -pin on and let mdrun do it properly
(but you are still dead if there's another job on your cores).
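For example (just a sketch, with made-up core numbers and -deffnm md standing
in for your own files), a pinned 2-rank x 4-thread job that owns the last 8
cores of a 28-core node could be launched roughly like

  export OMP_NUM_THREADS=4
  mpirun -np 2 gmx_mpi mdrun -deffnm md -ntomp 4 -gpu_id 00 \
         -pin on -pinoffset 20 -pinstride 1

where -pinoffset/-pinstride pick which logical cores mdrun locks its threads
to; the right offset depends on which cores the scheduler actually gave you.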
Mark
On Wed, 24 May 2017 22:18 Daniel Kozuch <dkozuch at princeton.edu> wrote:
> Thanks so much for the quick reply. That seems to have fixed the wait time
> issues. Unfortunately, I'm still only getting ~300 ns/day for the benchmark
> system (villin vsites, http://www.gromacs.org/GPU_acceleration), while the
> website claims over 1000 ns/day.
>
> I'm running on an NVIDIA Tesla P100-PCIE-16GB with 8 cores of a Xeon(R) CPU
> E5-2680 v4 @ 2.40GHz. I can see that the CPUs are now underperforming (324%
> used). Any suggestions?
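> (For reference, that 324% is just Core t / Wall t from the log below:
> 444.060 s / 137.015 s = ~3.24, i.e. on average only about 3.2 of the 8
> cores are doing useful work.)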
>
>
> _________________________________________________________
>
>
> R E A L   C Y C L E   A N D   T I M E   A C C O U N T I N G
>
> On 2 MPI ranks, each using 4 OpenMP threads
>
> Computing:          Num   Num       Call    Wall time      Giga-Cycles
>                    Ranks Threads    Count       (s)         total sum     %
> -----------------------------------------------------------------------------
> Domain decomp.        2     4       4001        4.402          84.517   3.2
> DD comm. load         2     4       3983        0.021           0.402   0.0
> DD comm. bounds       2     4       3982        0.014           0.267   0.0
> Vsite constr.         2     4     100001        3.330          63.929   2.4
> Neighbor search       2     4       4001        7.495         143.911   5.5
> Launch GPU ops.       2     4     200002        4.820          92.537   3.5
> Comm. coord.          2     4      96000        2.212          42.468   1.6
> Force                 2     4     100001       12.465         239.335   9.1
> Wait + Comm. F        2     4     100001        2.572          49.381   1.9
> PME mesh              2     4     100001       59.323        1139.002  43.3
> Wait GPU nonlocal     2     4     100001        0.483           9.282   0.4
> Wait GPU local        2     4     100001        0.292           5.607   0.2
> NB X/F buffer ops.    2     4     392002        5.703         109.491   4.2
> Vsite spread          2     4     101002        2.762          53.030   2.0
> Write traj.           2     4          1        0.007           0.130   0.0
> Update                2     4     100001        4.372          83.942   3.2
> Constraints           2     4     100001       23.858         458.072  17.4
> Comm. energies        2     4      20001        0.146           2.803   0.1
> Rest                                            2.739          52.595   2.0
> -----------------------------------------------------------------------------
> Total                                         137.015        2630.701 100.0
> -----------------------------------------------------------------------------
> Breakdown of PME mesh computation
> -----------------------------------------------------------------------------
> PME redist. X/F       2     4     200002        6.021         115.598   4.4
> PME spread/gather     2     4     200002       36.204         695.123  26.4
> PME 3D-FFT            2     4     200002       13.127         252.036   9.6
> PME 3D-FFT Comm.      2     4     200002        2.007          38.538   1.5
> PME solve Elec        2     4     100001        0.541          10.392   0.4
> -----------------------------------------------------------------------------
>
>                Core t (s)   Wall t (s)        (%)
>        Time:      444.060      137.015      324.1
>                  (ns/day)    (hour/ns)
> Performance:      315.296        0.076
> Finished mdrun on rank 0 Wed May 24 15:48:59 2017
>
> On Wed, May 24, 2017 at 3:25 PM, Smith, Micholas D. <smithmd at ornl.gov>
> wrote:
>
> > Try just using your equivalent of:
> >
> > mpirun -n 2 -npernode 2 gmx_mpi mdrun (your run stuff here) -ntomp 4
> > -gpu_id 00
> >
> > That may speed it up.
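> >
> > For example, with a hypothetical run input bench.tpr (substitute your own
> > files), the full command would look something like
> >
> >   mpirun -n 2 -npernode 2 gmx_mpi mdrun -deffnm bench -ntomp 4 -gpu_id 00
> >
> > where -gpu_id 00 maps both ranks on the node to GPU 0.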
> >
> > ===================
> > Micholas Dean Smith, PhD.
> > Post-doctoral Research Associate
> > University of Tennessee/Oak Ridge National Laboratory
> > Center for Molecular Biophysics
> >
> > ________________________________________
> > From: gromacs.org_gmx-users-bounces at maillist.sys.kth.se <
> > gromacs.org_gmx-users-bounces at maillist.sys.kth.se> on behalf of Daniel
> > Kozuch <dkozuch at princeton.edu>
> > Sent: Wednesday, May 24, 2017 3:08 PM
> > To: gromacs.org_gmx-users at maillist.sys.kth.se
> > Subject: [gmx-users] Poor GPU Performance with GROMACS 5.1.4
> >
> > Hello,
> >
> > I'm using GROMACS 5.1.4 on 8 CPU cores and 1 GPU for a system of ~8000
> > atoms in a dodecahedron box, and I'm having trouble getting good
> > performance out of the GPU. Specifically, it appears that there is
> > significant performance loss to wait times ("Wait + Comm. F" and "Wait GPU
> > nonlocal"). I have pasted the relevant parts of the log file below. I
> > suspect that I have set up my ranks/threads badly, but I am unsure where
> > the issue is. I have tried changing the environment variable
> > OMP_NUM_THREADS from 1 to 2 per the note generated by GROMACS, but this
> > severely slows down the simulation, to the point where it takes 10 minutes
> > to get a few picoseconds.
> >
> > I have tried browsing through the mailing lists, but I haven't found a
> > solution to this particular problem.
> >
> > Any help is appreciated,
> > Dan
> >