[gmx-users] using dual CPU's
pbuscemi at q.com
Wed Dec 12 15:36:26 CET 2018
Dear users ( one more try )
I am trying to use two GPU cards to improve modeling speed. The computer described in the log files is used to iron out models, and I am using it to learn how to run with two GPU cards before purchasing two new RTX 2080 Ti's. The CPU is an 8-core, 16-thread AMD and the GPUs are two GTX 1060s; there are 50000 atoms in the model.
Using ntmpi:ntomp settings of 1:16, auto (4:4), and 2:8 (and any other combination factoring to 16), the rates are approx. 12-16 ns/day for 1:16 and ~6-8 ns/day for any other setting, i.e. adding a card cuts efficiency by half. The average load imbalance is less than 3.4% for the multicard setup.
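For reference, the rank/thread combinations that factor to 16 can be listed with a short Python sketch. The helper name `rank_thread_splits` is made up for illustration; only the `-ntmpi` and `-ntomp` options come from the runs described here, and any other mdrun flags would have to be added by hand.

```python
# Enumerate every (ntmpi, ntomp) pair whose product uses all hardware threads.
HW_THREADS = 16  # 8 cores x 2 threads on the CPU described above


def rank_thread_splits(total):
    """Return all (ranks, threads_per_rank) pairs with ranks * threads == total."""
    return [(r, total // r) for r in range(1, total + 1) if total % r == 0]


splits = rank_thread_splits(HW_THREADS)
for ntmpi, ntomp in splits:
    # Only the two options varied in this post are printed.
    print(f"gmx mdrun -ntmpi {ntmpi} -ntomp {ntomp}")
```

This prints one candidate command per split, including the 1:16, 2:8, and 4:4 cases whose logs are shown below.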
I am not at this point trying to maximize efficiency, only to show some improvement going from one to two cards. According to a 2015 paper from the Gromacs group, "Best bang for your buck: GPU nodes for GROMACS biomolecular simulations", I should expect maybe (at best) a 50% improvement for 90k atoms (with 2x GTX 970). What bothers me in my initial attempts is that my simulations became slower when I added the second GPU. It is frustrating, to say the least. It's like swimming backwards.
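To put numbers on the slowdown, here is a minimal Python sketch that compares the ns/day figures from the three logs in this post. It assumes the single-rank 1:16 run is the one-card baseline, which the logs do not state explicitly; the 50% figure is the rough best case quoted above, not a measured value.

```python
# ns/day values copied from the "Performance:" lines of the three logs.
ns_per_day = {"1:16": 12.032, "4:4": 6.052, "2:8": 8.883}

baseline = ns_per_day["1:16"]  # assumption: single-rank run = one-card baseline
expected_gain = 0.50           # rough best case quoted from the 2015 paper


def relative_speed(setting):
    """Throughput of a setting as a fraction of the baseline throughput."""
    return ns_per_day[setting] / baseline


for setting in ("4:4", "2:8"):
    print(f"{setting}: {relative_speed(setting):.2f}x of baseline, "
          f"hoped for up to {1 + expected_gain:.2f}x")
```

Under that assumption, both multi-rank runs land well below 1x rather than anywhere near the hoped-for 1.5x.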
I know I am missing, as a minimum, the correct setup for mdrun, and suggestions would be welcome.
The output from the last section of the log files is included below.
=========================== ntmpi 1 : ntomp 16 ==============================
<====== ############### ==>
<==== A V E R A G E S ====>
<== ############### ======>
Statistics over 29301 steps using 294 frames
Energies (kJ/mol)
Angle G96Angle Proper Dih. Improper Dih. LJ-14
9.17533e+05 2.27874e+04 6.64128e+04 2.31214e+02 8.34971e+04
Coulomb-14 LJ (SR) Disper. corr. Coulomb (SR) Coul. recip.
-2.84567e+07 -1.43385e+05 -2.04658e+03 1.33320e+07 1.59914e+05
Position Rest. Potential Kinetic En. Total Energy Temperature
7.79893e+01 -1.40196e+07 1.88467e+05 -1.38312e+07 3.00376e+02
Pres. DC (bar) Pressure (bar) Constr. rmsd
-2.88685e+00 3.75436e+01 0.00000e+00
Total Virial (kJ/mol)
5.27555e+04 -4.87626e+02 1.86144e+02
-4.87648e+02 4.04479e+04 -1.91959e+02
1.86177e+02 -1.91957e+02 5.45671e+04
Pressure (bar)
2.22202e+01 1.27887e+00 -4.71738e-01
1.27893e+00 6.48135e+01 5.12638e-01
-4.71830e-01 5.12632e-01 2.55971e+01
T-PDMS T-VMOS
2.99822e+02 3.32834e+02
M E G A - F L O P S A C C O U N T I N G
NB=Group-cutoff nonbonded kernels NxN=N-by-N cluster Verlet kernels
RF=Reaction-Field VdW=Van der Waals QSTab=quadratic-spline table
W3=SPC/TIP3p W4=TIP4p (single or pairs)
V&F=Potential and force V=Potential only F=Force only
Computing: M-Number M-Flops % Flops
-----------------------------------------------------------------------------
Pair Search distance check 2349.753264 21147.779 0.0
NxN Ewald Elec. + LJ [F] 1771584.591744 116924583.055 96.6
NxN Ewald Elec. + LJ [V&F] 17953.091840 1920980.827 1.6
1,4 nonbonded interactions 5278.575150 475071.763 0.4
Shift-X 22.173480 133.041 0.0
Angles 4178.908620 702056.648 0.6
Propers 879.909030 201499.168 0.2
Impropers 5.274180 1097.029 0.0
Pos. Restr. 42.193440 2109.672 0.0
Virial 22.186710 399.361 0.0
Update 2209.881420 68506.324 0.1
Stop-CM 22.248900 222.489 0.0
Calc-Ekin 44.346960 1197.368 0.0
Lincs 4414.639320 264878.359 0.2
Lincs-Mat 100297.229760 401188.919 0.3
Constraint-V 8829.127980 70633.024 0.1
Constraint-Vir 22.147020 531.528 0.0
-----------------------------------------------------------------------------
Total 121056236.355 100.0
-----------------------------------------------------------------------------
R E A L C Y C L E A N D T I M E A C C O U N T I N G
On 1 MPI rank, each using 16 OpenMP threads
Computing: Num Num Call Wall time Giga-Cycles
Ranks Threads Count (s) total sum %
-----------------------------------------------------------------------------
Neighbor search 1 16 294 2.191 129.485 1.0
Launch GPU ops. 1 16 58602 4.257 251.544 2.0
Force 1 16 29301 23.769 1404.510 11.3
Wait PME GPU gather 1 16 29301 33.740 1993.695 16.0
Reduce GPU PME F 1 16 29301 7.244 428.079 3.4
Wait GPU NB local 1 16 29301 60.054 3548.612 28.5
NB X/F buffer ops. 1 16 58308 9.823 580.459 4.7
Write traj. 1 16 7 0.119 7.048 0.1
Update 1 16 58602 11.089 655.275 5.3
Constraints 1 16 58602 40.378 2385.992 19.2
Rest 17.743 1048.462 8.4
-----------------------------------------------------------------------------
Total 210.408 12433.160 100.0
-----------------------------------------------------------------------------
               Core t (s)   Wall t (s)        (%)
       Time:     3366.529      210.408     1600.0
                 (ns/day)    (hour/ns)
Performance:       12.032        1.995
Finished mdrun on rank 0 Mon Dec 10 17:17:04 2018
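As a rough reading of the 1-rank table above, the following Python sketch adds up the rows whose names start with "Wait" to see how much of the wall time the CPU spent waiting for GPU results. The row values are copied from the log; treating the "Wait" rows as idle CPU time is an interpretation, not something the log states.

```python
# Wall-time rows (seconds) copied from the 1-rank cycle accounting table.
wall_s = {
    "Neighbor search": 2.191,
    "Launch GPU ops.": 4.257,
    "Force": 23.769,
    "Wait PME GPU gather": 33.740,
    "Reduce GPU PME F": 7.244,
    "Wait GPU NB local": 60.054,
    "NB X/F buffer ops.": 9.823,
    "Write traj.": 0.119,
    "Update": 11.089,
    "Constraints": 40.378,
    "Rest": 17.743,
}
total_wall_s = 210.408  # "Total" row of the same table

# Sum only the rows that represent waiting on GPU work.
gpu_wait_s = sum(t for name, t in wall_s.items() if name.startswith("Wait"))
gpu_wait_fraction = gpu_wait_s / total_wall_s

print(f"CPU waiting on GPU: {gpu_wait_s:.3f} s "
      f"({100 * gpu_wait_fraction:.1f}% of wall time)")
```

The result matches the sum of the two percentages in the table (16.0 + 28.5) to within rounding.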
=========================== ntmpi and ntomp auto ( 4:4 ) =======================================
<====== ############### ==>
<==== A V E R A G E S ====>
<== ############### ======>
Statistics over 3301 steps using 34 frames
Energies (kJ/mol)
Angle G96Angle Proper Dih. Improper Dih. LJ-14
9.20586e+05 1.95534e+04 6.56058e+04 2.21093e+02 8.56673e+04
Coulomb-14 LJ (SR) Disper. corr. Coulomb (SR) Coul. recip.
-2.84553e+07 -1.44595e+05 -2.04658e+03 1.34518e+07 4.26167e+04
Position Rest. Potential Kinetic En. Total Energy Temperature
3.83653e+01 -1.40159e+07 1.90353e+05 -1.38255e+07 3.03381e+02
Pres. DC (bar) Pressure (bar) Constr. rmsd
-2.88685e+00 2.72913e+02 0.00000e+00
Total Virial (kJ/mol)
-5.05948e+04 -3.29107e+03 4.84786e+02
-3.29135e+03 -3.42006e+04 -3.32392e+03
4.84606e+02 -3.32403e+03 -2.06849e+04
Pressure (bar)
3.09713e+02 8.98192e+00 -1.19828e+00
8.98270e+00 2.73248e+02 8.99543e+00
-1.19778e+00 8.99573e+00 2.35776e+02
T-PDMS T-VMOS
2.98623e+02 5.82467e+02
P P - P M E L O A D B A L A N C I N G
NOTE: The PP/PME load balancing was limited by the maximum allowed grid scaling,
you might not have reached a good load balance.
PP/PME load balancing changed the cut-off and PME settings:
                particle-particle                    PME
                rcoulomb     rlist            grid       spacing    1/beta
   initial      1.000 nm   1.000 nm     160 160 128     0.156 nm   0.320 nm
   final        1.628 nm   1.628 nm      96  96  80     0.260 nm   0.521 nm
   cost-ratio            4.31                          0.23
(note that these numbers concern only part of the total PP and PME load)
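The two cost-ratio values in this load-balancing block can be reproduced with simple scaling arithmetic, sketched below in Python. This is one reading of the numbers (pair work growing with the cube of the cut-off, mesh work with the number of grid points), not something confirmed from the GROMACS source.

```python
# Values copied from the "initial" and "final" rows of the block above.
rc_initial_nm, rc_final_nm = 1.000, 1.628
grid_initial = (160, 160, 128)
grid_final = (96, 96, 80)


def n_points(grid):
    """Total number of PME grid points for an (nx, ny, nz) grid."""
    nx, ny, nz = grid
    return nx * ny * nz


# Assumption: particle-particle cost scales with the cut-off volume.
pp_cost_estimate = (rc_final_nm / rc_initial_nm) ** 3
# Assumption: PME mesh cost scales with the number of grid points.
pme_cost_estimate = n_points(grid_final) / n_points(grid_initial)

print(f"PP cost estimate:  {pp_cost_estimate:.3f}  (log reports 4.31)")
print(f"PME cost estimate: {pme_cost_estimate:.3f}  (log reports 0.23)")
```

In other words, the balancer moved a large amount of work from the mesh part to the short-range nonbonded part, up to the grid-scaling limit mentioned in the NOTE.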
M E G A - F L O P S A C C O U N T I N G
NB=Group-cutoff nonbonded kernels NxN=N-by-N cluster Verlet kernels
RF=Reaction-Field VdW=Van der Waals QSTab=quadratic-spline table
W3=SPC/TIP3p W4=TIP4p (single or pairs)
V&F=Potential and force V=Potential only F=Force only
Computing: M-Number M-Flops % Flops
-----------------------------------------------------------------------------
Pair Search distance check 285.793872 2572.145 0.0
NxN Ewald Elec. + LJ [F] 367351.034688 24245168.289 92.1
NxN Ewald Elec. + LJ [V&F] 3841.181056 411006.373 1.6
1,4 nonbonded interactions 594.675150 53520.763 0.2
Calc Weights 746.884260 26887.833 0.1
Spread Q Bspline 15933.530880 31867.062 0.1
Gather F Bspline 15933.530880 95601.185 0.4
3D-FFT 154983.295306 1239866.362 4.7
Solve PME 40.079616 2565.095 0.0
Reset In Box 2.564280 7.693 0.0
CG-CoM 2.639700 7.919 0.0
Angles 470.788620 79092.488 0.3
Propers 99.129030 22700.548 0.1
Impropers 0.594180 123.589 0.0
Pos. Restr. 4.753440 237.672 0.0
Virial 2.570400 46.267 0.0
Update 248.961420 7717.804 0.0
Stop-CM 2.639700 26.397 0.0
Calc-Ekin 5.128560 138.471 0.0
Lincs 557.713246 33462.795 0.1
Lincs-Mat 12624.363456 50497.454 0.2
Constraint-V 1115.257670 8922.061 0.0
Constraint-Vir 2.871389 68.913 0.0
-----------------------------------------------------------------------------
Total 26312105.181 100.0
-----------------------------------------------------------------------------
D O M A I N D E C O M P O S I T I O N S T A T I S T I C S
av. #atoms communicated per step for force: 2 x 16748.9
av. #atoms communicated per step for LINCS: 2 x 9361.6
Dynamic load balancing report:
DLB was off during the run due to low measured imbalance.
Average load imbalance: 3.4%.
The balanceable part of the MD step is 46%, load imbalance is computed from this.
Part of the total run time spent waiting due to load imbalance: 1.6%.
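The three percentages in this report appear to be related by a simple product, shown in the Python sketch below. That relation is inferred from the numbers in this log, so it should be taken as a plausibility check rather than as the formula mdrun uses.

```python
# Values copied from the dynamic load balancing report above (4 ranks x 4 threads).
load_imbalance = 0.034        # "Average load imbalance: 3.4%"
balanceable_fraction = 0.46   # "The balanceable part of the MD step is 46%"

# Assumption: time lost = imbalance applied only to the balanceable part.
estimated_loss = load_imbalance * balanceable_fraction

print(f"Estimated run time lost to imbalance: {100 * estimated_loss:.1f}% "
      f"(log reports 1.6%)")
```

Either way, a loss of this size is far too small to explain a factor-of-two drop in ns/day.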
R E A L C Y C L E A N D T I M E A C C O U N T I N G
On 4 MPI ranks, each using 4 OpenMP threads
Computing: Num Num Call Wall time Giga-Cycles
Ranks Threads Count (s) total sum %
-----------------------------------------------------------------------------
Domain decomp. 4 4 34 0.457 26.976 1.0
DD comm. load 4 4 2 0.000 0.008 0.0
Neighbor search 4 4 34 0.138 8.160 0.3
Launch GPU ops. 4 4 6602 0.441 26.070 0.9
Comm. coord. 4 4 3267 0.577 34.081 1.2
Force 4 4 3301 2.298 135.761 4.9
Wait + Comm. F 4 4 3301 0.276 16.330 0.6
PME mesh 4 4 3301 25.822 1525.817 54.8
Wait GPU NB nonloc. 4 4 3301 0.132 7.819 0.3
Wait GPU NB local 4 4 3301 0.012 0.724 0.0
NB X/F buffer ops. 4 4 13136 0.471 27.822 1.0
Write traj. 4 4 2 0.014 0.839 0.0
Update 4 4 6602 1.006 59.442 2.1
Constraints 4 4 6602 6.926 409.290 14.7
Comm. energies 4 4 34 0.009 0.524 0.0
Rest 8.548 505.108 18.1
-----------------------------------------------------------------------------
Total 47.127 2784.772 100.0
-----------------------------------------------------------------------------
Breakdown of PME mesh computation
-----------------------------------------------------------------------------
PME redist. X/F 4 4 6602 2.538 149.998 5.4
PME spread 4 4 3301 6.055 357.770 12.8
PME gather 4 4 3301 3.432 202.814 7.3
PME 3D-FFT 4 4 6602 10.559 623.925 22.4
PME 3D-FFT Comm. 4 4 6602 2.691 158.993 5.7
PME solve Elec 4 4 3301 0.521 30.805 1.1
-----------------------------------------------------------------------------
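A short Python sketch of where the time goes in this 4-rank run, using only values from the two tables above. The conclusion drawn from it, that the presence of a "PME mesh" row (instead of the "Wait PME GPU gather" row seen in the 1-rank log) suggests the PME work was done on the CPU here, is an inference from the logs and not something they state directly.

```python
# Wall times (seconds) copied from the PME mesh breakdown above.
pme_breakdown_s = {
    "PME redist. X/F": 2.538,
    "PME spread": 6.055,
    "PME gather": 3.432,
    "PME 3D-FFT": 10.559,
    "PME 3D-FFT Comm.": 2.691,
    "PME solve Elec": 0.521,
}
pme_mesh_s = 25.822    # "PME mesh" row of the main table
total_wall_s = 47.127  # "Total" row of the main table

breakdown_sum_s = sum(pme_breakdown_s.values())
pme_share = pme_mesh_s / total_wall_s

print(f"Breakdown rows sum to {breakdown_sum_s:.3f} s of {pme_mesh_s:.3f} s")
print(f"PME mesh share of wall time: {100 * pme_share:.1f}%")
```

More than half of the wall time in this run is PME mesh work, whereas the 1-rank log shows no such row at all.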
               Core t (s)   Wall t (s)        (%)
       Time:      754.033       47.127     1600.0
                 (ns/day)    (hour/ns)
Performance:        6.052        3.966
Finished mdrun on rank 0 Mon Dec 10 17:10:34 2018
=========================================== ntmpi 2: ntomp 8 ==============================================
<====== ############### ==>
<==== A V E R A G E S ====>
<== ############### ======>
Statistics over 11201 steps using 113 frames
Energies (kJ/mol)
Angle G96Angle Proper Dih. Improper Dih. LJ-14
9.16403e+05 2.12953e+04 6.61725e+04 2.26296e+02 8.35215e+04
Coulomb-14 LJ (SR) Disper. corr. Coulomb (SR) Coul. recip.
-2.84508e+07 -1.43740e+05 -2.04658e+03 1.34647e+07 2.76232e+04
Position Rest. Potential Kinetic En. Total Energy Temperature
5.93627e+01 -1.40166e+07 1.88847e+05 -1.38277e+07 3.00981e+02
Pres. DC (bar) Pressure (bar) Constr. rmsd
-2.88685e+00 8.53077e+01 0.00000e+00
Total Virial (kJ/mol)
3.15233e+04 -6.80636e+02 9.80007e+01
-6.81075e+02 2.45640e+04 -1.40642e+03
9.81033e+01 -1.40643e+03 4.02877e+04
Pressure (bar)
8.11163e+01 1.87348e+00 -2.03329e-01
1.87469e+00 1.09211e+02 3.83468e+00
-2.03613e-01 3.83470e+00 6.55961e+01
T-PDMS T-VMOS
2.99551e+02 3.84895e+02
P P - P M E L O A D B A L A N C I N G
NOTE: The PP/PME load balancing was limited by the maximum allowed grid scaling,
you might not have reached a good load balance.
PP/PME load balancing changed the cut-off and PME settings:
                particle-particle                    PME
                rcoulomb     rlist            grid       spacing    1/beta
   initial      1.000 nm   1.000 nm     160 160 128     0.156 nm   0.320 nm
   final        1.628 nm   1.628 nm      96  96  80     0.260 nm   0.521 nm
   cost-ratio            4.31                          0.23
(note that these numbers concern only part of the total PP and PME load)
M E G A - F L O P S A C C O U N T I N G
NB=Group-cutoff nonbonded kernels NxN=N-by-N cluster Verlet kernels
RF=Reaction-Field VdW=Van der Waals QSTab=quadratic-spline table
W3=SPC/TIP3p W4=TIP4p (single or pairs)
V&F=Potential and force V=Potential only F=Force only
Computing: M-Number M-Flops % Flops
-----------------------------------------------------------------------------
Pair Search distance check 1057.319360 9515.874 0.0
NxN Ewald Elec. + LJ [F] 1410325.411968 93081477.190 93.9
NxN Ewald Elec. + LJ [V&F] 14378.367616 1538485.335 1.6
1,4 nonbonded interactions 2017.860150 181607.413 0.2
Calc Weights 2534.338260 91236.177 0.1
Spread Q Bspline 54065.882880 108131.766 0.1
Gather F Bspline 54065.882880 324395.297 0.3
3D-FFT 383450.341906 3067602.735 3.1
Solve PME 113.199616 7244.775 0.0
Reset In Box 8.522460 25.567 0.0
CG-CoM 8.597880 25.794 0.0
Angles 1597.486620 268377.752 0.3
Propers 336.366030 77027.821 0.1
Impropers 2.016180 419.365 0.0
Pos. Restr. 16.129440 806.472 0.0
Virial 8.532630 153.587 0.0
Update 844.779420 26188.162 0.0
Stop-CM 8.597880 85.979 0.0
Calc-Ekin 17.044920 460.213 0.0
Lincs 1753.732822 105223.969 0.1
Lincs-Mat 39788.083512 159152.334 0.2
Constraint-V 3507.309174 28058.473 0.0
Constraint-Vir 8.845375 212.289 0.0
-----------------------------------------------------------------------------
Total 99075914.342 100.0
-----------------------------------------------------------------------------
D O M A I N D E C O M P O S I T I O N S T A T I S T I C S
av. #atoms communicated per step for force: 2 x 6810.8
av. #atoms communicated per step for LINCS: 2 x 3029.3
Dynamic load balancing report:
DLB was off during the run due to low measured imbalance.
Average load imbalance: 0.8%.
The balanceable part of the MD step is 46%, load imbalance is computed from this.
Part of the total run time spent waiting due to load imbalance: 0.4%.
R E A L C Y C L E A N D T I M E A C C O U N T I N G
On 2 MPI ranks, each using 8 OpenMP threads
Computing: Num Num Call Wall time Giga-Cycles
Ranks Threads Count (s) total sum %
-----------------------------------------------------------------------------
Domain decomp. 2 8 113 1.532 90.505 1.4
DD comm. load 2 8 4 0.000 0.027 0.0
Neighbor search 2 8 113 0.442 26.107 0.4
Launch GPU ops. 2 8 22402 1.230 72.668 1.1
Comm. coord. 2 8 11088 0.894 52.844 0.8
Force 2 8 11201 8.166 482.534 7.5
Wait + Comm. F 2 8 11201 0.672 39.720 0.6
PME mesh 2 8 11201 61.637 3642.183 56.6
Wait GPU NB nonloc. 2 8 11201 0.342 20.205 0.3
Wait GPU NB local 2 8 11201 0.031 1.847 0.0
NB X/F buffer ops. 2 8 44578 1.793 105.947 1.6
Write traj. 2 8 4 0.040 2.386 0.0
Update 2 8 22402 4.148 245.121 3.8
Constraints 2 8 22402 19.207 1134.940 17.6
Comm. energies 2 8 113 0.006 0.354 0.0
Rest 8.801 520.065 8.1
-----------------------------------------------------------------------------
Total 108.942 6437.452 100.0
-----------------------------------------------------------------------------
Breakdown of PME mesh computation
-----------------------------------------------------------------------------
PME redist. X/F 2 8 22402 4.992 294.991 4.6
PME spread 2 8 11201 16.979 1003.299 15.6
PME gather 2 8 11201 11.687 690.563 10.7
PME 3D-FFT 2 8 22402 21.648 1279.195 19.9
PME 3D-FFT Comm. 2 8 22402 4.985 294.567 4.6
PME solve Elec 2 8 11201 1.241 73.332 1.1
-----------------------------------------------------------------------------
               Core t (s)   Wall t (s)        (%)
       Time:     1743.073      108.942     1600.0
                 (ns/day)    (hour/ns)
Performance:        8.883        2.702
Finished mdrun on rank 0 Mon Dec 10 17:01:45 2018
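Because the three runs cover different numbers of steps, it helps to confirm that the reported ns/day values are comparable. The Python sketch below recomputes them from the step counts and wall times in the logs. The 1 fs time step is an assumption: it is not shown anywhere in these excerpts and was chosen only because it reproduces the reported figures.

```python
# (steps, wall seconds, reported ns/day) copied from the three logs.
runs = {
    "1:16": (29301, 210.408, 12.032),
    "4:4": (3301, 47.127, 6.052),
    "2:8": (11201, 108.942, 8.883),
}
DT_PS = 0.001  # ASSUMPTION: 1 fs time step, inferred, not stated in the logs
SECONDS_PER_DAY = 86400


def ns_per_day(steps, wall_s, dt_ps=DT_PS):
    """Simulated nanoseconds per day of wall time."""
    simulated_ns = steps * dt_ps / 1000.0
    return simulated_ns * SECONDS_PER_DAY / wall_s


for name, (steps, wall_s, reported) in runs.items():
    print(f"{name}: computed {ns_per_day(steps, wall_s):.3f} ns/day, "
          f"reported {reported}")
```

With that assumed time step, all three computed values agree with the logs to three decimals, so the differences in ns/day are not an artifact of the different run lengths.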