[gmx-users] Poor GPU Performance with GROMACS 5.1.4
Daniel Kozuch
dkozuch at princeton.edu
Wed May 24 22:18:02 CEST 2017
Thanks so much for the quick reply. That seems to have fixed the wait time
issues. Unfortunately, I'm still only getting ~300 ns/day for the benchmark
system (villin vsites, http://www.gromacs.org/GPU_acceleration), while the
website claims over 1000 ns/day.
I'm running on an NVIDIA Tesla P100-PCIE-16GB with 8 Xeon(R) E5-2680 v4 cores
@ 2.40GHz. I can see that the CPUs are now underutilized (324% used, out of a
possible 800% for the 8 cores).
Any suggestions?
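For example, would something along these lines be worth testing? (Only the
rank/thread/GPU flags are shown below; everything else in my run line would
stay the same.)

# single PP rank driving the GPU, all 8 cores as OpenMP threads
mpirun -n 1 -npernode 1 gmx_mpi mdrun (your run stuff here) -ntomp 8 -gpu_id 0 -pin on

# or four PP ranks sharing GPU 0, with 2 OpenMP threads each
mpirun -n 4 -npernode 4 gmx_mpi mdrun (your run stuff here) -ntomp 2 -gpu_id 0000 -pin on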
_________________________________________________________

     R E A L   C Y C L E   A N D   T I M E   A C C O U N T I N G

On 2 MPI ranks, each using 4 OpenMP threads

 Computing:          Num   Num      Call    Wall time      Giga-Cycles
                     Ranks Threads  Count      (s)       total sum     %
-------------------------------------------------------------------------
 Domain decomp.         2    4       4001       4.402        84.517   3.2
 DD comm. load          2    4       3983       0.021         0.402   0.0
 DD comm. bounds        2    4       3982       0.014         0.267   0.0
 Vsite constr.          2    4     100001       3.330        63.929   2.4
 Neighbor search        2    4       4001       7.495       143.911   5.5
 Launch GPU ops.        2    4     200002       4.820        92.537   3.5
 Comm. coord.           2    4      96000       2.212        42.468   1.6
 Force                  2    4     100001      12.465       239.335   9.1
 Wait + Comm. F         2    4     100001       2.572        49.381   1.9
 PME mesh               2    4     100001      59.323      1139.002  43.3
 Wait GPU nonlocal      2    4     100001       0.483         9.282   0.4
 Wait GPU local         2    4     100001       0.292         5.607   0.2
 NB X/F buffer ops.     2    4     392002       5.703       109.491   4.2
 Vsite spread           2    4     101002       2.762        53.030   2.0
 Write traj.            2    4          1       0.007         0.130   0.0
 Update                 2    4     100001       4.372        83.942   3.2
 Constraints            2    4     100001      23.858       458.072  17.4
 Comm. energies         2    4      20001       0.146         2.803   0.1
 Rest                                           2.739        52.595   2.0
-------------------------------------------------------------------------
 Total                                        137.015      2630.701 100.0
-------------------------------------------------------------------------
 Breakdown of PME mesh computation
-------------------------------------------------------------------------
 PME redist. X/F        2    4     200002       6.021       115.598   4.4
 PME spread/gather      2    4     200002      36.204       695.123  26.4
 PME 3D-FFT             2    4     200002      13.127       252.036   9.6
 PME 3D-FFT Comm.       2    4     200002       2.007        38.538   1.5
 PME solve Elec         2    4     100001       0.541        10.392   0.4
-------------------------------------------------------------------------

               Core t (s)   Wall t (s)        (%)
       Time:      444.060      137.015      324.1
                 (ns/day)    (hour/ns)
Performance:      315.296        0.076
Finished mdrun on rank 0 Wed May 24 15:48:59 2017
On Wed, May 24, 2017 at 3:25 PM, Smith, Micholas D. <smithmd at ornl.gov>
wrote:
> Try just using your equivalent of:
>
> mpirun -n 2 -npernode 2 gmx_mpi mdrun (your run stuff here) -ntomp 4
> -gpu_id 00
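> (i.e. two PP ranks both mapped to GPU 0 by the "-gpu_id 00" string, each
> running 4 OpenMP threads, so all 8 cores get used)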
>
> That may speed it up.
>
> ===================
> Micholas Dean Smith, PhD.
> Post-doctoral Research Associate
> University of Tennessee/Oak Ridge National Laboratory
> Center for Molecular Biophysics
>
> ________________________________________
> From: gromacs.org_gmx-users-bounces at maillist.sys.kth.se <
> gromacs.org_gmx-users-bounces at maillist.sys.kth.se> on behalf of Daniel
> Kozuch <dkozuch at princeton.edu>
> Sent: Wednesday, May 24, 2017 3:08 PM
> To: gromacs.org_gmx-users at maillist.sys.kth.se
> Subject: [gmx-users] Poor GPU Performance with GROMACS 5.1.4
>
> Hello,
>
> I'm using GROMACS 5.1.4 on 8 CPUs and 1 GPU for a system of ~8000 atoms in
> a dodecahedron box, and I'm having trouble getting good performance out of
> the GPU. Specifically, it appears that there is a significant performance loss
> to wait times ("Wait + Comm. F" and "Wait GPU nonlocal"). I have pasted the
> relevant parts of the log file below. I suspect that I have set up my
> ranks/threads badly, but I am unsure where the issue is. I have tried
> changing the environment variable OMP_NUM_THREADS from 1 to 2 per the note
> generated by GROMACS, but this severely slowed down the simulation, to the
> point where it took 10 minutes to run a few picoseconds.
>
> I have tried browsing through the mailing lists, but I haven't found a
> solution to this particular problem.
>
> Any help is appreciated,
> Dan
>