[gmx-users] possible configuration for gromacs gpu node
Szilárd Páll
pall.szilard at gmail.com
Tue May 6 16:26:34 CEST 2014
Hi,
Based on the performance data you provided, I'm afraid a GTX 770 won't
be fast enough combined with an E5-2643V2 - at least for your system.
Notice in the log output that "Wait GPU local" accounts for 46% of the
runtime. This is because the bonded + PME force computation on the CPU
takes 2.77 ms/step, while the nonbonded computation on the GPU takes
~5.8 ms/step, more than twice as long. As the CPU and GPU force
computations overlap so poorly, almost half of the CPU time is spent
waiting for the GPU.
Hence, to get a balanced hardware combination (assuming the same input
system and settings), you would need a GPU that's about 2x faster than
the K5000. The GTX 770 is perhaps 50% faster*; my guess is that even a
780 could be on the slow side, but a 780 Ti should be fast enough.
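To make the arithmetic explicit, here is a minimal sketch (Python; all
numbers are taken from your log output quoted below):

    # CPU force work per step: "Force" + "PME mesh" wall time over 500001 steps
    cpu_ms_per_step = (123.471 + 1261.777) / 500001 * 1000   # ~2.77 ms/step
    # GPU force work per step: total of the GPU timings table
    gpu_ms_per_step = 2888.085 / 500001 * 1000                # ~5.78 ms/step
    # For good overlap both should take about the same time, so the GPU
    # would need to be roughly this much faster:
    print(gpu_ms_per_step / cpu_ms_per_step)                  # ~2.1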
Cheers,
--
Szilárd
* The GROMACS non-bonded kernels are compute-bound, so one can roughly
compare the performance of two cards of identical compute capability (!)
by looking at the ratio of #multiprocessors * clock frequency (assuming
an input large enough to reach the peak of the respective GPU), i.e. for
the GTX 770 vs the K5000 roughly (1085*8)/(706.0*8).
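As a rough sketch of that back-of-the-envelope estimate (the
multiprocessor counts and clocks below are the published specs I'm
assuming for the two cards, both compute capability 3.0):

    # multiprocessors * clock frequency (MHz)
    gtx_770 = 8 * 1085.0   # GeForce GTX 770 (boost clock)
    k5000   = 8 * 706.0    # Quadro K5000
    print(gtx_770 / k5000) # ~1.54, i.e. perhaps ~50% faster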
On Tue, May 6, 2014 at 9:15 AM, Harry Mark Greenblatt
<harry.greenblatt at weizmann.ac.il> wrote:
> BS"D
>
> Dear All,
>
> I was asked to provide some examples of what we are doing, to assess whether my proposal for a GPU compute node is reasonable
> (2 x 3.5 GHz E5-2643V2 hexa-core with 2 x GeForce GTX 770; running two jobs, each with six cores and one GPU). I did some tests on a workstation some time ago with Gromacs 4.6.2, so I am including those results now. Please let me know if this is enough information.
>
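> For reference, this is roughly how I imagine launching the two jobs on such a node with GROMACS 4.6 (the file names are only illustrative; the pin offsets assume hyper-threading is disabled, so the 12 logical cores are the 12 physical cores):
>
>   mdrun -deffnm job1 -ntmpi 1 -ntomp 6 -gpu_id 0 -pin on -pinoffset 0
>   mdrun -deffnm job2 -ntmpi 1 -ntomp 6 -gpu_id 1 -pin on -pinoffset 6
>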
> It seems from these tests that the CPU (an E5-1650 at 3.2 GHz, paired with a Quadro K4000) outstripped the GPU. This GPU has half the CUDA cores of the card we are proposing. The system is a protein bound to double-stranded B-DNA (the DNA is restrained). The log suggests using a shorter cut-off, but I was already using 1.0 nm here, which is shorter than what I used with the older cut-off scheme.
>
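> If I understand that suggestion correctly, rebalancing would amount to something like the following changes to the .mdp below (values are only illustrative; the cut-offs and the grid spacing would have to be scaled together to keep the same Ewald accuracy):
>
>   rcoulomb       = 0.9    ; shorter real-space cut-off -> less GPU work
>   rvdw           = 0.9    ; kept equal to rcoulomb (Verlet scheme)
>   fourierspacing = 0.108  ; finer PME grid (0.12 nm default * 0.9) -> more CPU work
>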
> Here is the .mdp file
>
>
> define = -DPOSRES
> integrator = md
> dt = 0.002 ; ps ! 2 fs
> nsteps = 500000 ; total 1,000 ps (1ns)
> nstcomm = 10
> nstxout = 500 ; collect data every 1 ps
> nstxtcout = 500
> xtc_grps = Protein DNA Ion
> nstenergy = 100
> nstvout = 0
> nstfout = 0
> nstlist = 10
> ns_type = grid
> rlist = 1.0
> coulombtype = PME
> ;rcoulomb = 1.0
> rcoulomb = 1.0
> vdwtype = cut-off
> cutoff-scheme = Verlet
> rvdw = 1.0
> pme_order = 4
> ewald_rtol = 1e-5
> optimize_fft = yes
> DispCorr = no
> ; OPTIONS FOR BONDS
> constraints = all-bonds
> continuation = yes ; continuation from NPT PR
> constraint_algorithm = lincs ; holonomic constraints
> lincs_iter = 1 ; accuracy of LINCS
> lincs_order = 4 ; also related to accuracy
>
> ; Temperature coupling is on (v-rescale)
> Tcoupl = v-rescale
> tau_t = 0.1 0.1
> tc-grps = protein non-protein
> ref_t = 300 300
> ; Pressure coupling is off
> ;Pcoupl = parrinello-rahman
> Pcoupl = no
> Pcoupltype = isotropic
> tau_p = 1.0
> compressibility = 4.5e-5
> ref_p = 1.0
> ; Velocity generation is off (continuation from previous run)
> gen_vel = no
> gen_temp = 300.0
> gen_seed = -1
> ;
>
>
> And at the end of the run:
>
>
> Computing: M-Number M-Flops % Flops
> -----------------------------------------------------------------------------
> Pair Search distance check 55582.974192 500246.768 0.1
> NxN QSTab Elec. + VdW [F] 15309048.189184 627670975.757 88.9
> NxN QSTab Elec. + VdW [V&F] 154666.831424 9125343.054 1.3
> 1,4 nonbonded interactions 3121.006242 280890.562 0.0
> Calc Weights 48916.597833 1760997.522 0.2
> Spread Q Bspline 1043554.087104 2087108.174 0.3
> Gather F Bspline 1043554.087104 6261324.523 0.9
> 3D-FFT 6907906.423072 55263251.385 7.8
> Solve PME 2591.791424 165874.651 0.0
> Shift-X 407.670111 2446.021 0.0
> Angles 2274.504549 382116.764 0.1
> Propers 3495.506991 800471.101 0.1
> Impropers 245.500491 51064.102 0.0
> Pos. Restr. 325.000650 16250.033 0.0
> Virial 163.312656 2939.628 0.0
> Stop-CM 163.120222 1631.202 0.0
> Calc-Ekin 3261.165222 88051.461 0.0
> Lincs 1262.502525 75750.151 0.0
> Lincs-Mat 27294.054588 109176.218 0.0
> Constraint-V 17579.035158 140632.281 0.0
> Constraint-Vir 163.197633 3916.743 0.0
> Settle 5018.010036 1620817.242 0.2
> -----------------------------------------------------------------------------
> Total 706411275.342 100.0
> -----------------------------------------------------------------------------
>
>
> R E A L C Y C L E A N D T I M E A C C O U N T I N G
>
> Computing: Nodes Th. Count Wall t (s) G-Cycles %
> -----------------------------------------------------------------------------
> Neighbor search 1 6 12501 36.427 699.418 1.1
> Launch GPU ops. 1 6 500001 35.610 683.734 1.1
> Force 1 6 500001 123.471 2370.727 3.8
> PME mesh 1 6 500001 1261.777 24227.040 38.9
> Wait GPU local 1 6 500001 1488.623 28582.658 45.9
> NB X/F buffer ops. 1 6 987501 34.047 653.734 1.0
> Write traj. 1 6 1004 4.602 88.359 0.1
> Update 1 6 500001 41.532 797.453 1.3
> Constraints 1 6 500001 197.492 3791.991 6.1
> Rest 1 20.561 394.787 0.6
> -----------------------------------------------------------------------------
> Total 1 3244.142 62289.902 100.0
> -----------------------------------------------------------------------------
> -----------------------------------------------------------------------------
> PME spread/gather 1 6 1000002 510.512 9802.198 15.7
> PME 3D-FFT 1 6 1000002 683.758 13128.652 21.1
> PME solve 1 6 500001 65.117 1250.298 2.0
> -----------------------------------------------------------------------------
>
> GPU timings
> -----------------------------------------------------------------------------
> Computing: Count Wall t (s) ms/step %
> -----------------------------------------------------------------------------
> Pair list H2D 12501 3.637 0.291 0.1
> X / q H2D 500001 49.444 0.099 1.7
> Nonbonded F kernel 485000 2685.409 5.537 93.0
> Nonbonded F+ene k. 2500 18.131 7.252 0.6
> Nonbonded F+prune k. 10000 73.623 7.362 2.5
> Nonbonded F+ene+prune k. 2501 22.572 9.025 0.8
> F D2H 500001 35.269 0.071 1.2
> -----------------------------------------------------------------------------
> Total 2888.085 5.776 100.0
> -----------------------------------------------------------------------------
>
> Force evaluation time GPU/CPU: 5.776 ms/2.770 ms = 2.085
> For optimal performance this ratio should be close to 1!
>
>
> NOTE: The GPU has >20% more load than the CPU. This imbalance causes
> performance loss, consider using a shorter cut-off and a finer PME grid.
>
> Core t (s) Wall t (s) (%)
> Time: 19439.980 3244.142 599.2
> 54:04
> (ns/day) (hour/ns)
> Performance: 26.633 0.901
> Finished mdrun on node 0 Wed Jul 10 17:12:21 2013
>
>
> Thanks very much,
>
>
> Harry
>
>
> -------------------------------------------------------------------------
>
> Harry M. Greenblatt
> Associate Staff Scientist
> Dept of Structural Biology       Harry.Greenblatt at weizmann.ac.il
> Weizmann Institute of Science    Phone:     972-8-934-3625
> 234 Herzl St.                    Facsimile: 972-8-934-4159
> Rehovot, 76100
> Israel