[gmx-users] >60% slowdown with GPU / verlet and sd integrator
James Starlight
jmsstarlight at gmail.com
Wed Jan 16 15:26:18 CET 2013
Hi all!
I've also run some calculations with the SD integrator used as the
thermostat (without t_coupl). With a system of 65k atoms I obtained
10 ns/day on a GTX 670 and a 4-core i5.
I haven't run any simulations with the MD integrator yet, so I should test that.
James
2013/1/15 Szilárd Páll <szilard.pall at cbr.su.se>:
> Hi Floris,
>
> Great feedback, this needs to be looked into. Could you please file a bug
> report, preferably with a tpr (and/or all inputs) as well as log files.
>
> Thanks,
>
> --
> Szilárd
>
>
> On Tue, Jan 15, 2013 at 3:50 AM, Floris Buelens <floris_buelens at yahoo.com> wrote:
>
>> Hi,
>>
>>
>> I'm seeing MD simulations run a lot slower with the sd integrator than
>> with md: ca. 10 vs. 30 ns/day for my 47000 atom system. I found no
>> documented indication that this should be the case.
>> Timings and logs are pasted in below. Wall time seems to be accumulating
>> in Update and Rest, which add up to >60% of the total. The effect is still
>> there without a GPU: ca. 40% slowdown when switching from group to Verlet
>> with the SD integrator.
>> System: Xeon E5-1620, 1x GTX 680, gromacs
>> 4.6-beta3-dev-20130107-e66851a-unknown, GCC 4.4.6 and 4.7.0
>>
>> I haven't filed a bug report yet, as I don't have much variety of testing
>> conditions available right now. I hope someone else has a moment to try to
>> reproduce this.
>>
>> Timings:
>>
>> cpu (ns/day)
>> sd / verlet: 6
>> sd / group: 10
>> md / verlet: 9.2
>> md / group: 11.4
>>
>> gpu (ns/day)
>> sd / verlet: 11
>> md / verlet: 29.8
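
The slowdown percentages quoted in the message can be cross-checked against the ns/day figures in the timings list. The following is a minimal Python sketch of that arithmetic; the dictionary layout and the helper name slowdown_percent are illustrative choices, not anything provided by GROMACS.

```python
# ns/day figures copied from the timings list in the message.
# Keys are (hardware, integrator, cutoff scheme); this layout is an assumption.
timings = {
    ("cpu", "sd", "verlet"): 6.0,
    ("cpu", "sd", "group"): 10.0,
    ("cpu", "md", "verlet"): 9.2,
    ("cpu", "md", "group"): 11.4,
    ("gpu", "sd", "verlet"): 11.0,
    ("gpu", "md", "verlet"): 29.8,
}


def slowdown_percent(slow, fast):
    """Percentage drop in throughput when going from `fast` to `slow` ns/day."""
    return 100.0 * (1.0 - slow / fast)


# sd vs. md on the GPU with the Verlet scheme
gpu_sd_vs_md = slowdown_percent(
    timings[("gpu", "sd", "verlet")], timings[("gpu", "md", "verlet")]
)

# group -> Verlet on the CPU with the sd integrator
cpu_sd_verlet_vs_group = slowdown_percent(
    timings[("cpu", "sd", "verlet")], timings[("cpu", "sd", "group")]
)

print(f"GPU, sd vs. md (verlet):       {gpu_sd_vs_md:.1f}% slower")
print(f"CPU, sd, verlet vs. group:     {cpu_sd_verlet_vs_group:.1f}% slower")
```

With these inputs the GPU sd-vs-md slowdown comes out at roughly 63% and the CPU group-to-Verlet slowdown with sd at 40%, consistent with the ">60%" and "ca. 40%" figures in the message.
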
>>
>>
>>
>> **************MD integrator, GPU / verlet
>>
>> M E G A - F L O P S A C C O U N T I N G
>>
>> NB=Group-cutoff nonbonded kernels NxN=N-by-N cluster Verlet kernels
>> RF=Reaction-Field VdW=Van der Waals QSTab=quadratic-spline table
>> W3=SPC/TIP3p W4=TIP4p (single or pairs)
>> V&F=Potential and force V=Potential only F=Force only
>>
>> Computing:                         M-Number       M-Flops % Flops
>> -----------------------------------------------------------------------------
>> Pair Search distance check      1244.988096     11204.893     0.1
>> NxN QSTab Elec. + VdW [F]     194846.615488   7988711.235    91.9
>> NxN QSTab Elec. + VdW [V&F]     2009.923008    118585.457     1.4
>> 1,4 nonbonded interactions        31.616322      2845.469     0.0
>> Calc Weights                     703.010574     25308.381     0.3
>> Spread Q Bspline               14997.558912     29995.118     0.3
>> Gather F Bspline               14997.558912     89985.353     1.0
>> 3D-FFT                         47658.567884    381268.543     4.4
>> Solve PME                         20.580896      1317.177     0.0
>> Shift-X                            9.418458        56.511     0.0
>> Angles                            21.879375      3675.735     0.0
>> Propers                           48.599718     11129.335     0.1
>> Virial                            23.498403       422.971     0.0
>> Stop-CM                            2.436616        24.366     0.0
>> Calc-Ekin                         93.809716      2532.862     0.0
>> Lincs                             12.147284       728.837     0.0
>> Lincs-Mat                        131.328750       525.315     0.0
>> Constraint-V                     246.633614      1973.069     0.0
>> Constraint-Vir                    23.486379       563.673     0.0
>> Settle                            74.129451     23943.813     0.3
>> -----------------------------------------------------------------------------
>> Total                                         8694798.114   100.0
>> -----------------------------------------------------------------------------
>>
>>
>> R E A L C Y C L E A N D T I M E A C C O U N T I N G
>>
>> Computing:           Nodes  Th.   Count  Wall t (s)    G-Cycles       %
>> -----------------------------------------------------------------------------
>> Neighbor search          1    8     201       0.944      27.206     3.3
>> Launch GPU ops.          1    8    5001       0.371      10.690     1.3
>> Force                    1    8    5001       2.185      62.987     7.7
>> PME mesh                 1    8    5001      15.033     433.441    52.9
>> Wait GPU local           1    8    5001       1.551      44.719     5.5
>> NB X/F buffer ops.       1    8    9801       0.538      15.499     1.9
>> Write traj.              1    8       2       0.725      20.912     2.6
>> Update                   1    8    5001       2.318      66.826     8.2
>> Constraints              1    8    5001       2.898      83.551    10.2
>> Rest                     1                    1.832      52.828     6.5
>> -----------------------------------------------------------------------------
>> Total                    1                   28.394     818.659   100.0
>> -----------------------------------------------------------------------------
>> -----------------------------------------------------------------------------
>> PME spread/gather        1    8   10002       8.745     252.144    30.8
>> PME 3D-FFT               1    8   10002       5.392     155.458    19.0
>> PME solve                1    8    5001       0.869      25.069     3.1
>> -----------------------------------------------------------------------------
>>
>> GPU timings
>>
>> -----------------------------------------------------------------------------
>> Computing:                   Count  Wall t (s)   ms/step       %
>> -----------------------------------------------------------------------------
>> Pair list H2D                  201       0.080     0.397     0.4
>> X / q H2D                     5001       0.698     0.140     3.7
>> Nonbonded F kernel            4400      14.856     3.376    79.1
>> Nonbonded F+ene k.             400       1.667     4.167     8.9
>> Nonbonded F+prune k.           100       0.441     4.407     2.3
>> Nonbonded F+ene+prune k.       101       0.535     5.300     2.9
>> F D2H                         5001       0.501     0.100     2.7
>> -----------------------------------------------------------------------------
>> Total                                   18.778     3.755   100.0
>> -----------------------------------------------------------------------------
>>
>> Force evaluation time GPU/CPU: 3.755 ms/3.443 ms = 1.091
>> For optimal performance this ratio should be close to 1!
>>
>>
>>                Core t (s)   Wall t (s)        (%)
>>        Time:      221.730       28.394      780.9
>>                  (ns/day)    (hour/ns)
>> Performance:       30.435        0.789
>>
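
The summary lines at the end of the log are related by simple arithmetic: the hour/ns figure is 24 divided by the ns/day figure, and the GPU/CPU ratio is the quotient of the two per-step force evaluation times. A small Python sketch that reproduces the MD-run values from the log above; the function names are hypothetical and chosen for illustration only.

```python
def hours_per_ns(ns_per_day):
    """Convert throughput in ns/day to cost in hours per simulated ns."""
    return 24.0 / ns_per_day


def gpu_cpu_ratio(gpu_ms_per_step, cpu_ms_per_step):
    """Ratio of GPU to CPU force evaluation time per step."""
    return gpu_ms_per_step / cpu_ms_per_step


# Values taken from the MD integrator log above.
md_hours_per_ns = hours_per_ns(30.435)
md_ratio = gpu_cpu_ratio(3.755, 3.443)

print(f"hour/ns:       {md_hours_per_ns:.3f}")
print(f"GPU/CPU ratio: {md_ratio:.3f}")
```

Both results agree with the log to three decimal places (0.789 hour/ns and a ratio of 1.091).
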
>>
>>
>>
>> *****************SD integrator, GPU / verlet
>>
>> M E G A - F L O P S A C C O U N T I N G
>>
>> NB=Group-cutoff nonbonded kernels NxN=N-by-N cluster Verlet kernels
>> RF=Reaction-Field VdW=Van der Waals QSTab=quadratic-spline table
>> W3=SPC/TIP3p W4=TIP4p (single or pairs)
>> V&F=Potential and force V=Potential only F=Force only
>>
>> Computing:                         M-Number       M-Flops % Flops
>> -----------------------------------------------------------------------------
>> Pair Search distance check      1254.604928     11291.444     0.1
>> NxN QSTab Elec. + VdW [F]     197273.059584   8088195.443    91.6
>> NxN QSTab Elec. + VdW [V&F]     2010.150784    118598.896     1.3
>> 1,4 nonbonded interactions        31.616322      2845.469     0.0
>> Calc Weights                     703.010574     25308.381     0.3
>> Spread Q Bspline               14997.558912     29995.118     0.3
>> Gather F Bspline               14997.558912     89985.353     1.0
>> 3D-FFT                         47473.892284    379791.138     4.3
>> Solve PME                         20.488896      1311.289     0.0
>> Shift-X                            9.418458        56.511     0.0
>> Angles                            21.879375      3675.735     0.0
>> Propers                           48.599718     11129.335     0.1
>> Virial                            23.498403       422.971     0.0
>> Update                           234.336858      7264.443     0.1
>> Stop-CM                            2.436616        24.366     0.0
>> Calc-Ekin                         93.809716      2532.862     0.0
>> Lincs                             24.289712      1457.383     0.0
>> Lincs-Mat                        262.605000      1050.420     0.0
>> Constraint-V                     246.633614      1973.069     0.0
>> Constraint-Vir                    23.486379       563.673     0.0
>> Settle                           148.229268     47878.054     0.5
>> -----------------------------------------------------------------------------
>> Total                                         8825351.354   100.0
>> -----------------------------------------------------------------------------
>>
>>
>> R E A L C Y C L E A N D T I M E A C C O U N T I N G
>>
>> Computing:           Nodes  Th.   Count  Wall t (s)    G-Cycles       %
>> -----------------------------------------------------------------------------
>> Neighbor search          1    8     201       0.945      27.212     1.2
>> Launch GPU ops.          1    8    5001       0.384      11.069     0.5
>> Force                    1    8    5001       2.180      62.791     2.7
>> PME mesh                 1    8    5001      15.029     432.967    18.5
>> Wait GPU local           1    8    5001       3.327      95.844     4.1
>> NB X/F buffer ops.       1    8    9801       0.542      15.628     0.7
>> Write traj.              1    8       2       0.749      21.582     0.9
>> Update                   1    8    5001      28.044     807.908    34.5
>> Constraints              1    8   10002       5.562     160.243     6.8
>> Rest                     1                   24.488     705.458    30.1
>> -----------------------------------------------------------------------------
>> Total                    1                   81.250    2340.701   100.0
>> -----------------------------------------------------------------------------
>> -----------------------------------------------------------------------------
>> PME spread/gather        1    8   10002       8.769     252.615    10.8
>> PME 3D-FFT               1    8   10002       5.367     154.630     6.6
>> PME solve                1    8    5001       0.865      24.910     1.1
>> -----------------------------------------------------------------------------
>>
>> GPU timings
>>
>> -----------------------------------------------------------------------------
>> Computing:                   Count  Wall t (s)   ms/step       %
>> -----------------------------------------------------------------------------
>> Pair list H2D                  201       0.080     0.398     0.4
>> X / q H2D                     5001       0.699     0.140     3.4
>> Nonbonded F kernel            4400      16.271     3.698    79.6
>> Nonbonded F+ene k.             400       1.827     4.568     8.9
>> Nonbonded F+prune k.           100       0.482     4.816     2.4
>> Nonbonded F+ene+prune k.       101       0.584     5.787     2.9
>> F D2H                         5001       0.505     0.101     2.5
>> -----------------------------------------------------------------------------
>> Total                                   20.448     4.089   100.0
>> -----------------------------------------------------------------------------
>>
>> Force evaluation time GPU/CPU: 4.089 ms/3.441 ms = 1.188
>> For optimal performance this ratio should be close to 1!
>>
>>                Core t (s)   Wall t (s)        (%)
>>        Time:      643.440       81.250      791.9
>>                  (ns/day)    (hour/ns)
>> Performance:       10.636        2.256
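
To put a number on how much of the wall time goes to Update and Rest in each run, the relevant rows of the two cycle accounting tables can be compared directly. A minimal Python sketch, assuming only the wall-time values printed in the logs above; the dictionary layout and the helper name share_percent are illustrative.

```python
# Wall times in seconds, copied from the "Wall t (s)" column of the
# cycle and time accounting tables for the two GPU / Verlet runs.
md_run = {"Update": 2.318, "Rest": 1.832, "Total": 28.394}
sd_run = {"Update": 28.044, "Rest": 24.488, "Total": 81.250}


def share_percent(run, rows=("Update", "Rest")):
    """Percentage of total wall time spent in the given rows."""
    return 100.0 * sum(run[name] for name in rows) / run["Total"]


md_share = share_percent(md_run)
sd_share = share_percent(sd_run)

# How much of the extra wall time of the sd run sits in Update + Rest.
extra_total = sd_run["Total"] - md_run["Total"]
extra_update_rest = (sd_run["Update"] + sd_run["Rest"]) - (
    md_run["Update"] + md_run["Rest"]
)
extra_share = 100.0 * extra_update_rest / extra_total

print(f"Update + Rest, md run: {md_share:.1f}% of wall time")
print(f"Update + Rest, sd run: {sd_share:.1f}% of wall time")
print(f"Share of the extra wall time in Update + Rest: {extra_share:.1f}%")
```

For these logs, Update and Rest together account for about 14.6% of the wall time in the md run and about 64.7% in the sd run, and they cover roughly 91.5% of the additional wall time, which matches the ">60%" observation in the message.
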
>> --
>> gmx-users mailing list gmx-users at gromacs.org
>> http://lists.gromacs.org/mailman/listinfo/gmx-users
>> * Please search the archive at
>> http://www.gromacs.org/Support/Mailing_Lists/Search before posting!
>> * Please don't post (un)subscribe requests to the list. Use the
>> www interface or send it to gmx-users-request at gromacs.org.
>> * Can't post? Read http://www.gromacs.org/Support/Mailing_Lists
>>