[gmx-users] using dual CPU's

Kutzner, Carsten ckutzne at gwdg.de
Wed Dec 12 16:02:04 CET 2018


Hi Paul,

> On 12. Dec 2018, at 15:36, pbuscemi at q.com wrote:
> 
> Dear users (one more try),
> 
> I am trying to use two GPU cards to improve modeling speed. The computer described in the log files is used to iron out models, and I am using it to learn how to run on two GPU cards before purchasing two new RTX 2080 Ti's. The CPU is an 8-core, 16-thread AMD and the GPUs are two GTX 1060s; there are 50000 atoms in the model.
> 
> With ntmpi:ntomp settings of 1:16, auto (4:4), and 2:8 (and any other combination factoring to 16), I get approx. 12-16 ns/day for 1:16 and ~6-8 ns/day for any other setting, i.e. adding a card cuts efficiency by half. The average load imbalance is less than 3.4% for the multi-card setup.
> 
> I am not at this point trying to maximize efficiency, but only to show some improvement going from one to two cards. According to a 2015 paper from the GROMACS group, “Best bang for your buck: GPU nodes for GROMACS biomolecular simulations”, I should expect maybe (at best) a 50% improvement for 90k atoms (with 2x GTX 970).
We did not benchmark GTX 970 in that publication.

But from Table 6 you can see that we also had quite a few cases with our 80k benchmark
where going from 1 to 2 GPUs, simulation speed did not increase much: E.g. for the
E5-2670v2 going from one to 2 GTX 980 GPUs led to an increase of 10 percent.
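
To put numbers on the logs quoted below: a minimal sketch in Python that takes the
ns/day values from the three "Performance:" lines and expresses each relative to the
1 rank x 16 threads run. The dictionary labels and the function name are illustrative
choices only. Note that the three runs cover different numbers of steps (29301, 3301
and 11201), so the comparison is only rough.

```python
# ns/day values copied from the "Performance:" lines of the three quoted logs.
performance = {
    "1 rank x 16 threads": 12.032,
    "4 ranks x 4 threads": 6.052,
    "2 ranks x 8 threads": 8.883,
}


def relative_speed(perf, baseline):
    """Return each entry's ns/day divided by the baseline entry's ns/day."""
    base = perf[baseline]
    return {name: value / base for name, value in perf.items()}


if __name__ == "__main__":
    ratios = relative_speed(performance, "1 rank x 16 threads")
    for name, ratio in ratios.items():
        print(f"{name}: {ratio:.2f} x the single-rank speed")
```

This gives about 0.50 for the 4 x 4 run and about 0.74 for the 2 x 8 run.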

Did you use counter resetting for the benchmarks?
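
Counter resetting means the cycle and time counters are restarted part-way through
the run, so that start-up and the initial PP/PME tuning phase do not enter the
reported ns/day. A hedged sketch, assuming mdrun's -resetstep option and placeholder
values that are not taken from your setup (the file prefix "bench", 20000 steps with
a reset after 10000). It only assembles and prints command lines for the three
rank/thread layouts you tried; it does not run anything.

```python
def mdrun_benchmark_args(ntmpi, ntomp, nsteps=20000, resetstep=10000, deffnm="bench"):
    """Build a gmx mdrun command line whose counters are reset after `resetstep` steps.

    deffnm, nsteps and resetstep are placeholder assumptions, not recommended values.
    """
    if not 0 < resetstep < nsteps:
        raise ValueError("resetstep must lie strictly inside the run")
    return [
        "gmx", "mdrun",
        "-deffnm", deffnm,              # hypothetical run-file prefix
        "-ntmpi", str(ntmpi),           # number of thread-MPI ranks
        "-ntomp", str(ntomp),           # OpenMP threads per rank
        "-nsteps", str(nsteps),         # same run length for every layout
        "-resetstep", str(resetstep),   # restart the counters at this step
        "-noconfout",                   # skip writing the final configuration
    ]


if __name__ == "__main__":
    for ntmpi, ntomp in [(1, 16), (2, 8), (4, 4)]:
        print(" ".join(mdrun_benchmark_args(ntmpi, ntomp)))
```

Using the same -nsteps and -resetstep for every layout keeps the timed portion of
each run the same length, which makes the ns/day figures easier to compare.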

Carsten


> What bothers me in my initial attempts is that my simulations became slower by adding the second GPU - it is frustrating to say the least. It's like swimming backwards.
> 
> I know I am missing - as a minimum - the correct setup for mdrun, and suggestions would be welcome.
> 
> The output from the last section of the log files is included below.
> 
> =========================== ntmpi  1  ntomp 16 ==============================
> 
> 	<======  ###############  ==>
> 	<====  A V E R A G E S  ====>
> 	<==  ###############  ======>
> 
> 	Statistics over 29301 steps using 294 frames
> 
>   Energies (kJ/mol)
>          Angle       G96Angle    Proper Dih.  Improper Dih.          LJ-14
>    9.17533e+05    2.27874e+04    6.64128e+04    2.31214e+02    8.34971e+04
>     Coulomb-14        LJ (SR)  Disper. corr.   Coulomb (SR)   Coul. recip.
>   -2.84567e+07   -1.43385e+05   -2.04658e+03    1.33320e+07    1.59914e+05
> Position Rest.      Potential    Kinetic En.   Total Energy    Temperature
>    7.79893e+01   -1.40196e+07    1.88467e+05   -1.38312e+07    3.00376e+02
> Pres. DC (bar) Pressure (bar)   Constr. rmsd
>   -2.88685e+00    3.75436e+01    0.00000e+00
> 
>   Total Virial (kJ/mol)
>    5.27555e+04   -4.87626e+02    1.86144e+02
>   -4.87648e+02    4.04479e+04   -1.91959e+02
>    1.86177e+02   -1.91957e+02    5.45671e+04
> 
>   Pressure (bar)
>    2.22202e+01    1.27887e+00   -4.71738e-01
>    1.27893e+00    6.48135e+01    5.12638e-01
>   -4.71830e-01    5.12632e-01    2.55971e+01
> 
>         T-PDMS         T-VMOS
>    2.99822e+02    3.32834e+02
> 
> 
> 	M E G A - F L O P S   A C C O U N T I N G
> 
> NB=Group-cutoff nonbonded kernels    NxN=N-by-N cluster Verlet kernels
> RF=Reaction-Field  VdW=Van der Waals  QSTab=quadratic-spline table
> W3=SPC/TIP3p  W4=TIP4p (single or pairs)
> V&F=Potential and force  V=Potential only  F=Force only
> 
> Computing:                               M-Number         M-Flops  % Flops
> -----------------------------------------------------------------------------
> Pair Search distance check            2349.753264       21147.779     0.0
> NxN Ewald Elec. + LJ [F]           1771584.591744   116924583.055    96.6
> NxN Ewald Elec. + LJ [V&F]           17953.091840     1920980.827     1.6
> 1,4 nonbonded interactions            5278.575150      475071.763     0.4
> Shift-X                                 22.173480         133.041     0.0
> Angles                                4178.908620      702056.648     0.6
> Propers                                879.909030      201499.168     0.2
> Impropers                                5.274180        1097.029     0.0
> Pos. Restr.                             42.193440        2109.672     0.0
> Virial                                  22.186710         399.361     0.0
> Update                                2209.881420       68506.324     0.1
> Stop-CM                                 22.248900         222.489     0.0
> Calc-Ekin                               44.346960        1197.368     0.0
> Lincs                                 4414.639320      264878.359     0.2
> Lincs-Mat                           100297.229760      401188.919     0.3
> Constraint-V                          8829.127980       70633.024     0.1
> Constraint-Vir                          22.147020         531.528     0.0
> -----------------------------------------------------------------------------
> Total                                               121056236.355   100.0
> -----------------------------------------------------------------------------
>     R E A L   C Y C L E   A N D   T I M E   A C C O U N T I N G
> On 1 MPI rank, each using 16 OpenMP threads
> 
> Computing:          Num   Num      Call    Wall time         Giga-Cycles
>                     Ranks Threads  Count      (s)         total sum    %
> -----------------------------------------------------------------------------
> Neighbor search        1   16        294       2.191        129.485   1.0
> Launch GPU ops.        1   16      58602       4.257        251.544   2.0
> Force                  1   16      29301      23.769       1404.510  11.3
> Wait PME GPU gather    1   16      29301      33.740       1993.695  16.0
> Reduce GPU PME F       1   16      29301       7.244        428.079   3.4
> Wait GPU NB local      1   16      29301      60.054       3548.612  28.5
> NB X/F buffer ops.     1   16      58308       9.823        580.459   4.7
> Write traj.            1   16          7       0.119          7.048   0.1
> Update                 1   16      58602      11.089        655.275   5.3
> Constraints            1   16      58602      40.378       2385.992  19.2
> Rest                                          17.743       1048.462   8.4
> -----------------------------------------------------------------------------
> Total                                        210.408      12433.160 100.0
> -----------------------------------------------------------------------------
> 
>               Core t (s)   Wall t (s)        (%)
>       Time:     3366.529      210.408     1600.0
>                 (ns/day)    (hour/ns)
> Performance:       12.032        1.995
> Finished mdrun on rank 0 Mon Dec 10 17:17:04 2018
> 
> 
> =========================== ntmpi and ntomp   auto  ( 4:4 ) =======================================
> 
> 
> 	<======  ###############  ==>
> 	<====  A V E R A G E S  ====>
> 	<==  ###############  ======>
> 
> 	Statistics over 3301 steps using 34 frames
> 
>   Energies (kJ/mol)
>          Angle       G96Angle    Proper Dih.  Improper Dih.          LJ-14
>    9.20586e+05    1.95534e+04    6.56058e+04    2.21093e+02    8.56673e+04
>     Coulomb-14        LJ (SR)  Disper. corr.   Coulomb (SR)   Coul. recip.
>   -2.84553e+07   -1.44595e+05   -2.04658e+03    1.34518e+07    4.26167e+04
> Position Rest.      Potential    Kinetic En.   Total Energy    Temperature
>    3.83653e+01   -1.40159e+07    1.90353e+05   -1.38255e+07    3.03381e+02
> Pres. DC (bar) Pressure (bar)   Constr. rmsd
>   -2.88685e+00    2.72913e+02    0.00000e+00
> 
>   Total Virial (kJ/mol)
>   -5.05948e+04   -3.29107e+03    4.84786e+02
>   -3.29135e+03   -3.42006e+04   -3.32392e+03
>    4.84606e+02   -3.32403e+03   -2.06849e+04
> 
>   Pressure (bar)
>    3.09713e+02    8.98192e+00   -1.19828e+00
>    8.98270e+00    2.73248e+02    8.99543e+00
>   -1.19778e+00    8.99573e+00    2.35776e+02
> 
>         T-PDMS         T-VMOS
>    2.98623e+02    5.82467e+02
> 
> 
>       P P   -   P M E   L O A D   B A L A N C I N G
> 
> NOTE: The PP/PME load balancing was limited by the maximum allowed grid scaling,
>       you might not have reached a good load balance.
> 
> PP/PME load balancing changed the cut-off and PME settings:
>           particle-particle                    PME
>            rcoulomb  rlist            grid      spacing   1/beta
>   initial  1.000 nm  1.000 nm     160 160 128   0.156 nm  0.320 nm
>   final    1.628 nm  1.628 nm      96  96  80   0.260 nm  0.521 nm
> cost-ratio           4.31             0.23
> (note that these numbers concern only part of the total PP and PME load)
> 
> 
> 	M E G A - F L O P S   A C C O U N T I N G
> 
> NB=Group-cutoff nonbonded kernels    NxN=N-by-N cluster Verlet kernels
> RF=Reaction-Field  VdW=Van der Waals  QSTab=quadratic-spline table
> W3=SPC/TIP3p  W4=TIP4p (single or pairs)
> V&F=Potential and force  V=Potential only  F=Force only
> 
> Computing:                               M-Number         M-Flops  % Flops
> -----------------------------------------------------------------------------
> Pair Search distance check             285.793872        2572.145     0.0
> NxN Ewald Elec. + LJ [F]            367351.034688    24245168.289    92.1
> NxN Ewald Elec. + LJ [V&F]            3841.181056      411006.373     1.6
> 1,4 nonbonded interactions             594.675150       53520.763     0.2
> Calc Weights                           746.884260       26887.833     0.1
> Spread Q Bspline                     15933.530880       31867.062     0.1
> Gather F Bspline                     15933.530880       95601.185     0.4
> 3D-FFT                              154983.295306     1239866.362     4.7
> Solve PME                               40.079616        2565.095     0.0
> Reset In Box                             2.564280           7.693     0.0
> CG-CoM                                   2.639700           7.919     0.0
> Angles                                 470.788620       79092.488     0.3
> Propers                                 99.129030       22700.548     0.1
> Impropers                                0.594180         123.589     0.0
> Pos. Restr.                              4.753440         237.672     0.0
> Virial                                   2.570400          46.267     0.0
> Update                                 248.961420        7717.804     0.0
> Stop-CM                                  2.639700          26.397     0.0
> Calc-Ekin                                5.128560         138.471     0.0
> Lincs                                  557.713246       33462.795     0.1
> Lincs-Mat                            12624.363456       50497.454     0.2
> Constraint-V                          1115.257670        8922.061     0.0
> Constraint-Vir                           2.871389          68.913     0.0
> -----------------------------------------------------------------------------
> Total                                                26312105.181   100.0
> -----------------------------------------------------------------------------
> 
> 
>    D O M A I N   D E C O M P O S I T I O N   S T A T I S T I C S
> 
> av. #atoms communicated per step for force:  2 x 16748.9
> av. #atoms communicated per step for LINCS:  2 x 9361.6
> 
> 
> Dynamic load balancing report:
> DLB was off during the run due to low measured imbalance.
> Average load imbalance: 3.4%.
> The balanceable part of the MD step is 46%, load imbalance is computed from this.
> Part of the total run time spent waiting due to load imbalance: 1.6%.
> 
> 
>     R E A L   C Y C L E   A N D   T I M E   A C C O U N T I N G
> 
> On 4 MPI ranks, each using 4 OpenMP threads
> 
> Computing:          Num   Num      Call    Wall time         Giga-Cycles
>                     Ranks Threads  Count      (s)         total sum    %
> -----------------------------------------------------------------------------
> Domain decomp.         4    4         34       0.457         26.976   1.0
> DD comm. load          4    4          2       0.000          0.008   0.0
> Neighbor search        4    4         34       0.138          8.160   0.3
> Launch GPU ops.        4    4       6602       0.441         26.070   0.9
> Comm. coord.           4    4       3267       0.577         34.081   1.2
> Force                  4    4       3301       2.298        135.761   4.9
> Wait + Comm. F         4    4       3301       0.276         16.330   0.6
> PME mesh               4    4       3301      25.822       1525.817  54.8
> Wait GPU NB nonloc.    4    4       3301       0.132          7.819   0.3
> Wait GPU NB local      4    4       3301       0.012          0.724   0.0
> NB X/F buffer ops.     4    4      13136       0.471         27.822   1.0
> Write traj.            4    4          2       0.014          0.839   0.0
> Update                 4    4       6602       1.006         59.442   2.1
> Constraints            4    4       6602       6.926        409.290  14.7
> Comm. energies         4    4         34       0.009          0.524   0.0
> Rest                                           8.548        505.108  18.1
> -----------------------------------------------------------------------------
> Total                                         47.127       2784.772 100.0
> -----------------------------------------------------------------------------
> Breakdown of PME mesh computation
> -----------------------------------------------------------------------------
> PME redist. X/F        4    4       6602       2.538        149.998   5.4
> PME spread             4    4       3301       6.055        357.770  12.8
> PME gather             4    4       3301       3.432        202.814   7.3
> PME 3D-FFT             4    4       6602      10.559        623.925  22.4
> PME 3D-FFT Comm.       4    4       6602       2.691        158.993   5.7
> PME solve Elec         4    4       3301       0.521         30.805   1.1
> -----------------------------------------------------------------------------
> 
>               Core t (s)   Wall t (s)        (%)
>       Time:      754.033       47.127     1600.0
>                 (ns/day)    (hour/ns)
> Performance:        6.052        3.966
> Finished mdrun on rank 0 Mon Dec 10 17:10:34 2018
> 
> 
> =========================================== ntmpi  2: ntomp 8 ==============================================
> 
> 	<======  ###############  ==>
> 	<====  A V E R A G E S  ====>
> 	<==  ###############  ======>
> 
> 	Statistics over 11201 steps using 113 frames
> 
>   Energies (kJ/mol)
>          Angle       G96Angle    Proper Dih.  Improper Dih.          LJ-14
>    9.16403e+05    2.12953e+04    6.61725e+04    2.26296e+02    8.35215e+04
>     Coulomb-14        LJ (SR)  Disper. corr.   Coulomb (SR)   Coul. recip.
>   -2.84508e+07   -1.43740e+05   -2.04658e+03    1.34647e+07    2.76232e+04
> Position Rest.      Potential    Kinetic En.   Total Energy    Temperature
>    5.93627e+01   -1.40166e+07    1.88847e+05   -1.38277e+07    3.00981e+02
> Pres. DC (bar) Pressure (bar)   Constr. rmsd
>   -2.88685e+00    8.53077e+01    0.00000e+00
> 
>   Total Virial (kJ/mol)
>    3.15233e+04   -6.80636e+02    9.80007e+01
>   -6.81075e+02    2.45640e+04   -1.40642e+03
>    9.81033e+01   -1.40643e+03    4.02877e+04
> 
>   Pressure (bar)
>    8.11163e+01    1.87348e+00   -2.03329e-01
>    1.87469e+00    1.09211e+02    3.83468e+00
>   -2.03613e-01    3.83470e+00    6.55961e+01
> 
>         T-PDMS         T-VMOS
>    2.99551e+02    3.84895e+02
> 
> 
>       P P   -   P M E   L O A D   B A L A N C I N G
> 
> NOTE: The PP/PME load balancing was limited by the maximum allowed grid scaling,
>       you might not have reached a good load balance.
> 
> PP/PME load balancing changed the cut-off and PME settings:
>           particle-particle                    PME
>            rcoulomb  rlist            grid      spacing   1/beta
>   initial  1.000 nm  1.000 nm     160 160 128   0.156 nm  0.320 nm
>   final    1.628 nm  1.628 nm      96  96  80   0.260 nm  0.521 nm
> cost-ratio           4.31             0.23
> (note that these numbers concern only part of the total PP and PME load)
> 
> 
> 	M E G A - F L O P S   A C C O U N T I N G
> 
> NB=Group-cutoff nonbonded kernels    NxN=N-by-N cluster Verlet kernels
> RF=Reaction-Field  VdW=Van der Waals  QSTab=quadratic-spline table
> W3=SPC/TIP3p  W4=TIP4p (single or pairs)
> V&F=Potential and force  V=Potential only  F=Force only
> 
> Computing:                               M-Number         M-Flops  % Flops
> -----------------------------------------------------------------------------
> Pair Search distance check            1057.319360        9515.874     0.0
> NxN Ewald Elec. + LJ [F]           1410325.411968    93081477.190    93.9
> NxN Ewald Elec. + LJ [V&F]           14378.367616     1538485.335     1.6
> 1,4 nonbonded interactions            2017.860150      181607.413     0.2
> Calc Weights                          2534.338260       91236.177     0.1
> Spread Q Bspline                     54065.882880      108131.766     0.1
> Gather F Bspline                     54065.882880      324395.297     0.3
> 3D-FFT                              383450.341906     3067602.735     3.1
> Solve PME                              113.199616        7244.775     0.0
> Reset In Box                             8.522460          25.567     0.0
> CG-CoM                                   8.597880          25.794     0.0
> Angles                                1597.486620      268377.752     0.3
> Propers                                336.366030       77027.821     0.1
> Impropers                                2.016180         419.365     0.0
> Pos. Restr.                             16.129440         806.472     0.0
> Virial                                   8.532630         153.587     0.0
> Update                                 844.779420       26188.162     0.0
> Stop-CM                                  8.597880          85.979     0.0
> Calc-Ekin                               17.044920         460.213     0.0
> Lincs                                 1753.732822      105223.969     0.1
> Lincs-Mat                            39788.083512      159152.334     0.2
> Constraint-V                          3507.309174       28058.473     0.0
> Constraint-Vir                           8.845375         212.289     0.0
> -----------------------------------------------------------------------------
> Total                                                99075914.342   100.0
> -----------------------------------------------------------------------------
> 
> 
>    D O M A I N   D E C O M P O S I T I O N   S T A T I S T I C S
> 
> av. #atoms communicated per step for force:  2 x 6810.8
> av. #atoms communicated per step for LINCS:  2 x 3029.3
> 
> 
> Dynamic load balancing report:
> DLB was off during the run due to low measured imbalance.
> Average load imbalance: 0.8%.
> The balanceable part of the MD step is 46%, load imbalance is computed from this.
> Part of the total run time spent waiting due to load imbalance: 0.4%.
> 
> 
>     R E A L   C Y C L E   A N D   T I M E   A C C O U N T I N G
> 
> On 2 MPI ranks, each using 8 OpenMP threads
> 
> Computing:          Num   Num      Call    Wall time         Giga-Cycles
>                     Ranks Threads  Count      (s)         total sum    %
> -----------------------------------------------------------------------------
> Domain decomp.         2    8        113       1.532         90.505   1.4
> DD comm. load          2    8          4       0.000          0.027   0.0
> Neighbor search        2    8        113       0.442         26.107   0.4
> Launch GPU ops.        2    8      22402       1.230         72.668   1.1
> Comm. coord.           2    8      11088       0.894         52.844   0.8
> Force                  2    8      11201       8.166        482.534   7.5
> Wait + Comm. F         2    8      11201       0.672         39.720   0.6
> PME mesh               2    8      11201      61.637       3642.183  56.6
> Wait GPU NB nonloc.    2    8      11201       0.342         20.205   0.3
> Wait GPU NB local      2    8      11201       0.031          1.847   0.0
> NB X/F buffer ops.     2    8      44578       1.793        105.947   1.6
> Write traj.            2    8          4       0.040          2.386   0.0
> Update                 2    8      22402       4.148        245.121   3.8
> Constraints            2    8      22402      19.207       1134.940  17.6
> Comm. energies         2    8        113       0.006          0.354   0.0
> Rest                                           8.801        520.065   8.1
> -----------------------------------------------------------------------------
> Total                                        108.942       6437.452 100.0
> -----------------------------------------------------------------------------
> Breakdown of PME mesh computation
> -----------------------------------------------------------------------------
> PME redist. X/F        2    8      22402       4.992        294.991   4.6
> PME spread             2    8      11201      16.979       1003.299  15.6
> PME gather             2    8      11201      11.687        690.563  10.7
> PME 3D-FFT             2    8      22402      21.648       1279.195  19.9
> PME 3D-FFT Comm.       2    8      22402       4.985        294.567   4.6
> PME solve Elec         2    8      11201       1.241         73.332   1.1
> -----------------------------------------------------------------------------
> 
>               Core t (s)   Wall t (s)        (%)
>       Time:     1743.073      108.942     1600.0
>                 (ns/day)    (hour/ns)
> Performance:        8.883        2.702
> Finished mdrun on rank 0 Mon Dec 10 17:01:45 2018
> 
> 
> 