[gmx-users] Poor GPU Performance with GROMACS 5.1.4
Szilárd Páll
pall.szilard at gmail.com
Thu May 25 00:58:27 CEST 2017
+ let me emphasize again what Mark said: do not use
domain-decomposition with such a small system! All the overhead you
see comes from the communication you force mdrun to do by running
multiple ranks.
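In practice that means running a single rank that uses all your cores as
OpenMP threads, something along these lines (the MPI launcher, binary name
and GPU id below are placeholders for your setup, not a prescription):

mpirun -np 1 -npernode 1 gmx_mpi mdrun -ntomp 8 -pin on -gpu_id 0 (your run stuff here)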
BTW, the 1.1 us/day number you quote comes from a ~6000-atom simulation with
a 4 or 5 fs time step (so >500 ns/day with your system should be easily
doable).
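(That benchmark gets its long time step from hydrogen virtual sites; roughly,
the topology is built with pdb2gmx -vsite hydrogen and the .mdp contains
something like the lines below, where the exact values are illustrative
rather than the benchmark's actual input.)

integrator  = md
dt          = 0.005      ; 5 fs, made possible by the hydrogen virtual sites
constraints = all-bonds  ; remove the fast bond vibrations that would otherwise limit dt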
Cheers,
--
Szilárd
On Thu, May 25, 2017 at 12:08 AM, Mark Abraham <mark.j.abraham at gmail.com> wrote:
> Hi,
>
> I'm wondering why you want 8 ranks on the 14 or 28 cores. The log reports
> that something else is controlling thread affinity, which is the easiest
> thing to screw up if you are doing node sharing. The job manager has to
> give you cores that are solely yours, and you/it need to set the affinities
> of your threads to them. Or use mdrun -pin on and let mdrun do it properly
> (but you are still dead if there's another job on your cores).
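> For example, if the job manager handed you the second half of a 16-core node,
> something like
>
>   gmx_mpi mdrun -ntomp 8 -pin on -pinoffset 8 -pinstride 1 (your run stuff here)
>
> keeps your threads on your own cores; the offset and stride values are only
> an illustration and have to match what was actually allocated to you.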
>
> Mark
>
> On Wed, 24 May 2017 22:18 Daniel Kozuch <dkozuch at princeton.edu> wrote:
>
>> Thanks so much for the quick reply. That seems to have fixed the wait time
>> issues. Unfortunately, I'm still only getting ~300 ns/day for the benchmark
>> system (villin vsites, http://www.gromacs.org/GPU_acceleration), while the
>> website claims over 1000 ns/day.
>>
>> I'm running on an NVIDIA Tesla P100-PCIE-16GB with 8 cores of a Xeon(R) CPU
>> E5-2680 v4 @ 2.40GHz. I can see that the CPUs are now underperforming (324%
>> used). Any suggestions?
>>
>>
>> _________________________________________________________
>>
>>
>> R E A L   C Y C L E   A N D   T I M E   A C C O U N T I N G
>>
>> On 2 MPI ranks, each using 4 OpenMP threads
>>
>> Computing:             Num   Num      Call    Wall time    Giga-Cycles
>>                        Ranks Threads  Count      (s)       total sum     %
>> -----------------------------------------------------------------------------
>> Domain decomp.            2     4      4001       4.402        84.517   3.2
>> DD comm. load             2     4      3983       0.021         0.402   0.0
>> DD comm. bounds           2     4      3982       0.014         0.267   0.0
>> Vsite constr.             2     4    100001       3.330        63.929   2.4
>> Neighbor search           2     4      4001       7.495       143.911   5.5
>> Launch GPU ops.           2     4    200002       4.820        92.537   3.5
>> Comm. coord.              2     4     96000       2.212        42.468   1.6
>> Force                     2     4    100001      12.465       239.335   9.1
>> Wait + Comm. F            2     4    100001       2.572        49.381   1.9
>> PME mesh                  2     4    100001      59.323      1139.002  43.3
>> Wait GPU nonlocal         2     4    100001       0.483         9.282   0.4
>> Wait GPU local            2     4    100001       0.292         5.607   0.2
>> NB X/F buffer ops.        2     4    392002       5.703       109.491   4.2
>> Vsite spread              2     4    101002       2.762        53.030   2.0
>> Write traj.               2     4         1       0.007         0.130   0.0
>> Update                    2     4    100001       4.372        83.942   3.2
>> Constraints               2     4    100001      23.858       458.072  17.4
>> Comm. energies            2     4     20001       0.146         2.803   0.1
>> Rest                                               2.739        52.595   2.0
>> -----------------------------------------------------------------------------
>> Total                                            137.015      2630.701 100.0
>> -----------------------------------------------------------------------------
>> Breakdown of PME mesh computation
>> -----------------------------------------------------------------------------
>> PME redist. X/F           2     4    200002       6.021       115.598   4.4
>> PME spread/gather         2     4    200002      36.204       695.123  26.4
>> PME 3D-FFT                2     4    200002      13.127       252.036   9.6
>> PME 3D-FFT Comm.          2     4    200002       2.007        38.538   1.5
>> PME solve Elec            2     4    100001       0.541        10.392   0.4
>> -----------------------------------------------------------------------------
>>
>>                Core t (s)   Wall t (s)        (%)
>>        Time:      444.060      137.015      324.1
>>                  (ns/day)    (hour/ns)
>> Performance:      315.296        0.076
>> Finished mdrun on rank 0 Wed May 24 15:48:59 2017
>>
>> On Wed, May 24, 2017 at 3:25 PM, Smith, Micholas D. <smithmd at ornl.gov>
>> wrote:
>>
>> > Try just using your equivalent of:
>> >
>> > mpirun -n 2 -npernode 2 gmx_mpi mdrun (your run stuff here) -ntomp 4
>> > -gpu_id 00
>> >
>> > That may speed it up.
>> >
>> > ===================
>> > Micholas Dean Smith, PhD.
>> > Post-doctoral Research Associate
>> > University of Tennessee/Oak Ridge National Laboratory
>> > Center for Molecular Biophysics
>> >
>> > ________________________________________
>> > From: gromacs.org_gmx-users-bounces at maillist.sys.kth.se <
>> > gromacs.org_gmx-users-bounces at maillist.sys.kth.se> on behalf of Daniel
>> > Kozuch <dkozuch at princeton.edu>
>> > Sent: Wednesday, May 24, 2017 3:08 PM
>> > To: gromacs.org_gmx-users at maillist.sys.kth.se
>> > Subject: [gmx-users] Poor GPU Performance with GROMACS 5.1.4
>> >
>> > Hello,
>> >
>> > I'm using GROMACS 5.1.4 on 8 CPU cores and 1 GPU for a system of ~8000 atoms in
>> > a dodecahedron box, and I'm having trouble getting good performance out of
>> > the GPU. Specifically, it appears that there is significant performance loss
>> > to wait times ("Wait + Comm. F" and "Wait GPU nonlocal"). I have pasted the
>> > relevant parts of the log file below. I suspect that I have set up my
>> > ranks/threads badly, but I am unsure where the issue is. I have tried
>> > changing the environment variable OMP_NUM_THREADS from 1 to 2 per the note
>> > generated by GROMACS, but this severely slows down the simulation to the
>> > point where it takes 10 minutes to get a few picoseconds.
>> >
>> > I have tried browsing through the mailing lists, but I haven't found a
>> > solution to this particular problem.
>> >
>> > Any help is appreciated,
>> > Dan
>> >
> --
> Gromacs Users mailing list
>
> * Please search the archive at http://www.gromacs.org/Support/Mailing_Lists/GMX-Users_List before posting!
>
> * Can't post? Read http://www.gromacs.org/Support/Mailing_Lists
>
> * For (un)subscribe requests visit
> https://maillist.sys.kth.se/mailman/listinfo/gromacs.org_gmx-users or send a mail to gmx-users-request at gromacs.org.