[gmx-users] Poor GPU Performance with GROMACS 5.1.4
Szilárd Páll
pall.szilard at gmail.com
Thu May 25 00:58:27 CEST 2017
+ let me emphasize again what Mark said: do not use
domain-decomposition with such a small system! All the overhead you
see comes from the communication you force mdrun to do by running
multiple ranks.
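In practice that means running a single rank that uses all your cores as
OpenMP threads, something along these lines (the MPI launcher, binary name
and GPU id below are placeholders for your setup, not a prescription):

mpirun -np 1 -npernode 1 gmx_mpi mdrun -ntomp 8 -pin on -gpu_id 0 (your run stuff here)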
BTW, the 1.1 us/day number you quote comes from a ~6000-atom simulation with
a 4 or 5 fs time step (so >500 ns/day with your system should be easily
doable).
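(That benchmark gets its long time step from hydrogen virtual sites; roughly,
the topology is built with pdb2gmx -vsite hydrogen and the .mdp contains
something like the lines below, where the exact values are illustrative
rather than the benchmark's actual input.)

integrator  = md
dt          = 0.005      ; 5 fs, made possible by the hydrogen virtual sites
constraints = all-bonds  ; remove the fast bond vibrations that would otherwise limit dt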
Cheers,
--
Szilárd
On Thu, May 25, 2017 at 12:08 AM, Mark Abraham <mark.j.abraham at gmail.com> wrote:
> Hi,
>
> I'm wondering why you want 8 ranks on the 14 or 28 cores. The log reports
> that something else is controlling thread affinity, which is the easiest
> thing to screw up if you are doing node sharing. The job manager has to
> give you cores that are solely yours, and you/it need to set the affinities
> of your threads to them. Or use mdrun -pin on and let mdrun do it properly
> (but you are still dead if there's another job on your cores).
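> For example, if the job manager handed you the second half of a 16-core node,
> something like
>
>   gmx_mpi mdrun -ntomp 8 -pin on -pinoffset 8 -pinstride 1 (your run stuff here)
>
> keeps your threads on your own cores; the offset and stride values are only
> an illustration and have to match what was actually allocated to you.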
>
> Mark
>
> On Wed, 24 May 2017 22:18 Daniel Kozuch <dkozuch at princeton.edu> wrote:
>
>> Thanks so much for the quick reply. That seems to have fixed the wait time
>> issues. Unfortunately, I'm still only getting ~300 ns/day for the benchmark
>> system (villin vsites, http://www.gromacs.org/GPU_acceleration), while the
>> website claims over 1000 ns/day.
>>
>> I'm running on an NVIDIA Tesla P100-PCIE-16GB with 8 cores of a Xeon(R) CPU
>> E5-2680 v4 @ 2.40GHz. I can see that the CPUs are now underperforming (324%
>> used). Any suggestions?
>>
>>
>> _________________________________________________________
>>
>>
>> R E A L   C Y C L E   A N D   T I M E   A C C O U N T I N G
>>
>> On 2 MPI ranks, each using 4 OpenMP threads
>>
>> Computing:             Num   Num      Call    Wall time    Giga-Cycles
>>                        Ranks Threads  Count      (s)       total sum     %
>> -----------------------------------------------------------------------------
>> Domain decomp.            2     4      4001       4.402        84.517   3.2
>> DD comm. load             2     4      3983       0.021         0.402   0.0
>> DD comm. bounds           2     4      3982       0.014         0.267   0.0
>> Vsite constr.             2     4    100001       3.330        63.929   2.4
>> Neighbor search           2     4      4001       7.495       143.911   5.5
>> Launch GPU ops.           2     4    200002       4.820        92.537   3.5
>> Comm. coord.              2     4     96000       2.212        42.468   1.6
>> Force                     2     4    100001      12.465       239.335   9.1
>> Wait + Comm. F            2     4    100001       2.572        49.381   1.9
>> PME mesh                  2     4    100001      59.323      1139.002  43.3
>> Wait GPU nonlocal         2     4    100001       0.483         9.282   0.4
>> Wait GPU local            2     4    100001       0.292         5.607   0.2
>> NB X/F buffer ops.        2     4    392002       5.703       109.491   4.2
>> Vsite spread              2     4    101002       2.762        53.030   2.0
>> Write traj.               2     4         1       0.007         0.130   0.0
>> Update                    2     4    100001       4.372        83.942   3.2
>> Constraints               2     4    100001      23.858       458.072  17.4
>> Comm. energies            2     4     20001       0.146         2.803   0.1
>> Rest                                               2.739        52.595   2.0
>> -----------------------------------------------------------------------------
>> Total                                            137.015      2630.701 100.0
>> -----------------------------------------------------------------------------
>> Breakdown of PME mesh computation
>> -----------------------------------------------------------------------------
>> PME redist. X/F           2     4    200002       6.021       115.598   4.4
>> PME spread/gather         2     4    200002      36.204       695.123  26.4
>> PME 3D-FFT                2     4    200002      13.127       252.036   9.6
>> PME 3D-FFT Comm.          2     4    200002       2.007        38.538   1.5
>> PME solve Elec            2     4    100001       0.541        10.392   0.4
>> -----------------------------------------------------------------------------
>>
>>                Core t (s)   Wall t (s)        (%)
>>        Time:      444.060      137.015      324.1
>>                  (ns/day)    (hour/ns)
>> Performance:      315.296        0.076
>> Finished mdrun on rank 0 Wed May 24 15:48:59 2017
>>
>> On Wed, May 24, 2017 at 3:25 PM, Smith, Micholas D. <smithmd at ornl.gov>
>> wrote:
>>
>> > Try just using your equivalent of:
>> >
>> > mpirun -n 2 -npernode 2 gmx_mpi mdrun (your run stuff here) -ntomp 4
>> > -gpu_id 00
>> >
>> > That may speed it up.
>> >
>> > ===================
>> > Micholas Dean Smith, PhD.
>> > Post-doctoral Research Associate
>> > University of Tennessee/Oak Ridge National Laboratory
>> > Center for Molecular Biophysics
>> >
>> > ________________________________________
>> > From: gromacs.org_gmx-users-bounces at maillist.sys.kth.se <
>> > gromacs.org_gmx-users-bounces at maillist.sys.kth.se> on behalf of Daniel
>> > Kozuch <dkozuch at princeton.edu>
>> > Sent: Wednesday, May 24, 2017 3:08 PM
>> > To: gromacs.org_gmx-users at maillist.sys.kth.se
>> > Subject: [gmx-users] Poor GPU Performance with GROMACS 5.1.4
>> >
>> > Hello,
>> >
>> > I'm using GROMACS 5.1.4 on 8 CPU cores and 1 GPU for a system of ~8000 atoms in
>> > a dodecahedron box, and I'm having trouble getting good performance out of
>> > the GPU. Specifically, it appears that there is significant performance loss
>> > to wait times ("Wait + Comm. F" and "Wait GPU nonlocal"). I have pasted the
>> > relevant parts of the log file below. I suspect that I have set up my
>> > ranks/threads badly, but I am unsure where the issue is. I have tried
>> > changing the environment variable OMP_NUM_THREADS from 1 to 2 per the note
>> > generated by GROMACS, but this severely slows down the simulation to the
>> > point where it takes 10 minutes to get a few picoseconds.
>> >
>> > I have tried browsing through the mailing lists, but I haven't found a
>> > solution to this particular problem.
>> >
>> > Any help is appreciated,
>> > Dan
>> >
> --
> Gromacs Users mailing list
>
> * Please search the archive at http://www.gromacs.org/Support/Mailing_Lists/GMX-Users_List before posting!
>
> * Can't post? Read http://www.gromacs.org/Support/Mailing_Lists
>
> * For (un)subscribe requests visit
> https://maillist.sys.kth.se/mailman/listinfo/gromacs.org_gmx-users or send a mail to gmx-users-request at gromacs.org.