[gmx-users] possible configuration for gromacs gpu node

Szilárd Páll pall.szilard at gmail.com
Tue May 6 16:26:34 CEST 2014


Hi,

Based on the performance data you provided, I'm afraid a GTX 770 won't
be fast enough combined with an E5-2643V2 - at least for your system.

Notice in the log output that "Wait GPU local" accounts for 46% of
the runtime. This is because the bonded + PME force computation on the
CPU takes 2.77 ms/step, while the nonbonded computation on the GPU
takes ~5.8 ms/step, more than twice as long. Since the CPU work can
hide only part of the GPU compute time, almost half of the CPU time is
spent waiting for the GPU.
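
To make that "almost half the runtime is waiting" arithmetic concrete, here
is a back-of-the-envelope check in plain Python (nothing GROMACS-specific;
the numbers are simply copied from the accounting tables quoted below, and
the estimate assumes the wait is just the difference of the two per-step
times):

# numbers copied from the cycle/time accounting tables below
steps        = 500001      # force evaluations in the run
total_wall_s = 3244.142    # total wall time
wait_gpu_s   = 1488.623    # "Wait GPU local" wall time
cpu_force_ms = 2.770       # bonded + PME force compute per step (CPU)
gpu_force_ms = 5.776       # nonbonded work + transfers per step (GPU)

# the CPU can hide only ~2.77 ms of the GPU's ~5.78 ms, so it idles
# for roughly the difference on every step
print("estimated wait: %.2f ms/step" % (gpu_force_ms - cpu_force_ms))
print("measured wait:  %.2f ms/step" % (wait_gpu_s / steps * 1e3))
print("wait fraction:  %.1f%% of runtime" % (wait_gpu_s / total_wall_s * 100))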

Hence, to get a balanced hardware combination (assuming the same input
system and settings), you would need a GPU that's about 2x faster than
the K5000. The GTX 770 is perhaps 50% faster*; my guess is that even a
780 could be on the slow side, but a 780 Ti should be fast enough.

Cheers,
--
Szilárd

* The GROMACS non-bonded kernels are compute-bound, so one can roughly
compare the performance of two cards of identical compute capability (!)
by looking at the ratio of #multiprocessors * frequency (assuming an
input large enough to reach the peak of the respective GPU), i.e. for
the GTX 770 vs the K5000 roughly (8*1085)/(8*706) ≈ 1.5
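
To illustrate that rule of thumb, a small sketch (plain Python; the K5000
and GTX 770 figures are the ones used above, while the 780/780 Ti entries
use approximate boost clocks quoted from memory, so treat those two as
assumptions):

# relative throughput estimate for compute-bound kernels on cards of the
# same compute capability: #multiprocessors * clock (MHz)
cards = {
    "Quadro K5000": (8, 706),     # figures used in the ratio above
    "GTX 770":      (8, 1085),
    "GTX 780":      (12, 900),    # assumed ~900 MHz boost clock
    "GTX 780 Ti":   (15, 928),    # assumed ~928 MHz boost clock
}
ref_sm, ref_mhz = cards["Quadro K5000"]
for name, (sm, mhz) in cards.items():
    print("%12s: %.2fx vs K5000" % (name, sm * mhz / (ref_sm * ref_mhz)))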


On Tue, May 6, 2014 at 9:15 AM, Harry Mark Greenblatt
<harry.greenblatt at weizmann.ac.il> wrote:
> BS"D
>
> Dear All,
>
>   I was asked to provide some examples of what we are doing, to assess whether my proposal for a GPU compute node is reasonable
> (2 x 3.5 GHz E5-2643V2 hexacore, with 2 x GeForce GTX 770; we would run two jobs, each with six cores and 1 GPU).  I did some tests on a workstation some time ago with Gromacs 4.6.2, so I am including those now.  Please let me know if this is enough information.
>
>  It seems from these tests that the CPU (E5-1650, 3.2 GHz, paired with a Quadro K4000) outstripped the GPU.  This GPU has half the CUDA cores of what we are proposing.  The system is a protein bound to double-stranded B-DNA (the DNA is restrained).  The log suggests using a shorter cut-off, but I was using 1.0 nm here, which is shorter than what I was using with the older cut-off scheme.
>
> Here is the .mdp file
>
>
> define              = -DPOSRES
> integrator          = md
> dt                  = 0.002 ; ps ! 2 fs
> nsteps              = 500000 ; total 1,000 ps (1ns)
> nstcomm             = 10
> nstxout             = 500     ; collect data every 1 ps
> nstxtcout           = 500
> xtc_grps            = Protein DNA Ion
> nstenergy           = 100
> nstvout             = 0
> nstfout             = 0
> nstlist             = 10
> ns_type             = grid
> rlist               = 1.0
> coulombtype         = PME
> ;rcoulomb            = 1.0
> rcoulomb            = 1.0
> vdwtype             = cut-off
> cutoff-scheme       = Verlet
> rvdw                = 1.0
> pme_order           = 4
> ewald_rtol          = 1e-5
> optimize_fft        = yes
> DispCorr            = no
> ; OPTIONS FOR BONDS
> constraints         = all-bonds
> continuation        = yes      ; continuation from NPT PR
> constraint_algorithm  = lincs  ; holonomic constraints
> lincs_iter            = 1      ; accuracy of LINCS
> lincs_order           = 4      ;  also related to accuracy
>
> ; Berendsen temperature coupling is on
> Tcoupl                = v-rescale
> tau_t                 = 0.1     0.1
> tc-grps               = protein     non-protein
> ref_t                 = 300         300
> ; Pressure coupling is on
> ;Pcoupl              = parrinello-rahman
> Pcoupl              = no
> Pcoupltype          = isotropic
> tau_p               = 1.0
> compressibility     = 4.5e-5
> ref_p               = 1.0
> ; Generate velocities is on at 300 K.
> gen_vel             = no
> gen_temp            = 300.0
> gen_seed            = -1
> ;
>
>
> And at the end of the run:
>
>
> Computing:                               M-Number         M-Flops  % Flops
> -----------------------------------------------------------------------------
>  Pair Search distance check           55582.974192      500246.768     0.1
>  NxN QSTab Elec. + VdW [F]         15309048.189184   627670975.757    88.9
>  NxN QSTab Elec. + VdW [V&F]         154666.831424     9125343.054     1.3
>  1,4 nonbonded interactions            3121.006242      280890.562     0.0
>  Calc Weights                         48916.597833     1760997.522     0.2
>  Spread Q Bspline                   1043554.087104     2087108.174     0.3
>  Gather F Bspline                   1043554.087104     6261324.523     0.9
>  3D-FFT                             6907906.423072    55263251.385     7.8
>  Solve PME                             2591.791424      165874.651     0.0
>  Shift-X                                407.670111        2446.021     0.0
>  Angles                                2274.504549      382116.764     0.1
>  Propers                               3495.506991      800471.101     0.1
>  Impropers                              245.500491       51064.102     0.0
>  Pos. Restr.                            325.000650       16250.033     0.0
>  Virial                                 163.312656        2939.628     0.0
>  Stop-CM                                163.120222        1631.202     0.0
>  Calc-Ekin                             3261.165222       88051.461     0.0
>  Lincs                                 1262.502525       75750.151     0.0
>  Lincs-Mat                            27294.054588      109176.218     0.0
>  Constraint-V                         17579.035158      140632.281     0.0
>  Constraint-Vir                         163.197633        3916.743     0.0
>  Settle                                5018.010036     1620817.242     0.2
> -----------------------------------------------------------------------------
>  Total                                               706411275.342   100.0
> -----------------------------------------------------------------------------
>
>
>      R E A L   C Y C L E   A N D   T I M E   A C C O U N T I N G
>
>  Computing:         Nodes   Th.     Count  Wall t (s)     G-Cycles       %
> -----------------------------------------------------------------------------
>  Neighbor search        1    6      12501      36.427      699.418     1.1
>  Launch GPU ops.        1    6     500001      35.610      683.734     1.1
>  Force                  1    6     500001     123.471     2370.727     3.8
>  PME mesh               1    6     500001    1261.777    24227.040    38.9
>  Wait GPU local         1    6     500001    1488.623    28582.658    45.9
>  NB X/F buffer ops.     1    6     987501      34.047      653.734     1.0
>  Write traj.            1    6       1004       4.602       88.359     0.1
>  Update                 1    6     500001      41.532      797.453     1.3
>  Constraints            1    6     500001     197.492     3791.991     6.1
>  Rest                   1                      20.561      394.787     0.6
> -----------------------------------------------------------------------------
>  Total                  1                    3244.142    62289.902   100.0
> -----------------------------------------------------------------------------
> -----------------------------------------------------------------------------
>  PME spread/gather      1    6    1000002     510.512     9802.198    15.7
>  PME 3D-FFT             1    6    1000002     683.758    13128.652    21.1
>  PME solve              1    6     500001      65.117     1250.298     2.0
> -----------------------------------------------------------------------------
>
> GPU timings
> -----------------------------------------------------------------------------
>  Computing:                         Count  Wall t (s)      ms/step       %
> -----------------------------------------------------------------------------
>  Pair list H2D                      12501       3.637        0.291     0.1
>  X / q H2D                         500001      49.444        0.099     1.7
>  Nonbonded F kernel                485000    2685.409        5.537    93.0
>  Nonbonded F+ene k.                  2500      18.131        7.252     0.6
>  Nonbonded F+prune k.               10000      73.623        7.362     2.5
>  Nonbonded F+ene+prune k.            2501      22.572        9.025     0.8
>  F D2H                             500001      35.269        0.071     1.2
> -----------------------------------------------------------------------------
>  Total                                       2888.085        5.776   100.0
> -----------------------------------------------------------------------------
>
> Force evaluation time GPU/CPU: 5.776 ms/2.770 ms = 2.085
> For optimal performance this ratio should be close to 1!
>
>
> NOTE: The GPU has >20% more load than the CPU. This imbalance causes
>       performance loss, consider using a shorter cut-off and a finer PME grid.
>
>                Core t (s)   Wall t (s)        (%)
>        Time:    19439.980     3244.142      599.2
>                          54:04
>                  (ns/day)    (hour/ns)
> Performance:       26.633        0.901
> Finished mdrun on node 0 Wed Jul 10 17:12:21 2013
>
>
> Thanks very much,
>
>
> Harry
>
>
> -------------------------------------------------------------------------
>
> Harry M. Greenblatt
>
> Associate Staff Scientist
>
> Dept of Structural Biology           Harry.Greenblatt at weizmann.ac.il
>
> Weizmann Institute of Science        Phone:  972-8-934-3625
>
> 234 Herzl St.                        Facsimile:   972-8-934-4159
>
> Rehovot, 76100
>
> Israel
>
>
>
>
>
> --
> Gromacs Users mailing list
>
> * Please search the archive at http://www.gromacs.org/Support/Mailing_Lists/GMX-Users_List before posting!
>
> * Can't post? Read http://www.gromacs.org/Support/Mailing_Lists
>
> * For (un)subscribe requests visit
> https://maillist.sys.kth.se/mailman/listinfo/gromacs.org_gmx-users or send a mail to gmx-users-request at gromacs.org.

