[gmx-users] mpirun -npernode option gives gromacs slowdown unless used with mpirun option -np or gromacs option -ntomp 1
Christopher Neale
chris.neale at alum.utoronto.ca
Tue Jul 26 22:35:04 CEST 2016
Dear Users:
This is simply an informational post in case somebody runs into similar trouble in the future. I don't understand why the usage must be this way, but empirically it works.
I find that when I use (A) "mpirun -np 4 gmx_mpi -ntomp 6" I get 32 ns/day. However, if I instead use (B) "mpirun -npernode 4 gmx_mpi -ntomp 6" I get only 10 ns/day. Finally, if I use (C) both the -npernode and -np options to mpirun, "mpirun -np 4 -npernode 4 gmx_mpi -ntomp 6", then I again get 32 ns/day. A diff of the .log files from option B (no-np.log) and option C (yes-np.log) doesn't give me any clear indication of how gromacs was set up differently in the two cases (see end of this post).
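For a quick comparison without reading the whole diff, the ns/day figure can be pulled straight out of each log. The following is only a sketch: it assumes each log has a single line whose first field is "Performance:" and whose second field is ns/day (as in the diff at the end of this post), and the function names ns_per_day and speedup are made up for illustration.

```shell
#!/bin/sh
# Print the ns/day figure from an mdrun log.
# Assumption: the first field of the relevant line is "Performance:"
# and the second field is ns/day.
ns_per_day() {
    awk '$1 == "Performance:" { print $2 }' "$1"
}

# Print the ratio of ns/day between two logs (first / second),
# rounded to two decimals.
speedup() {
    fast=$(ns_per_day "$1")
    slow=$(ns_per_day "$2")
    awk -v f="$fast" -v s="$slow" 'BEGIN { printf "%.2f\n", f / s }'
}
```

Usage would be something like "speedup yes-np.log no-np.log". A log that was appended to over several runs may contain more than one Performance line, which this sketch does not handle.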
Why bother with the -npernode option at all? Because I was having trouble getting -multi to work with GPUs otherwise. I found that if I simply used "mpirun -np 8 gmx_mpi -ntomp 6 -multi 2" then both jobs got put on the first node and the second allocated node was left empty. Therefore, the only way I have found to run CPU/GPU jobs efficiently with the -multi keyword is to use "mpirun -npernode" (which seems to be required to get a good distribution of processes across nodes), and for some reason this then leads to performance degradation in gromacs.
I found two ways to get good performance from gromacs with mpirun -npernode. The first is to simply include the -np option as well, for example:
mpirun -np 8 -npernode 4 gmx_mpi mdrun -notunepme -deffnm MD_ -dlb yes -npme 0 -cpt 60 -maxh 0.05 -cpi MD_.cpt -ntomp 6 -gpu_id 0123 -multi 2
I guess that might be obvious to some people, but the Open MPI man page for mpirun reads to me as if -npernode is an alternative to -np rather than something to combine with it.
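The numbers in that command line have to agree with each other, and it is easy to get one of them wrong. As a rough sanity check, the arithmetic can be written out as below. This is just a sketch; the function name layout is hypothetical, and it only does integer division on the values passed to -np, -npernode, -ntomp and -multi.

```shell
#!/bin/sh
# Print the layout implied by a combination of mpirun and mdrun options.
# Arguments: total ranks (-np), ranks per node (-npernode),
#            OpenMP threads per rank (-ntomp), number of simulations (-multi).
layout() {
    np=$1; npernode=$2; ntomp=$3; nsim=$4
    echo "nodes=$((np / npernode))" \
         "ranks_per_sim=$((np / nsim))" \
         "threads_per_node=$((npernode * ntomp))"
}

# Values taken from the command above.
layout 8 4 6 2
```

For the command above this prints "nodes=2 ranks_per_sim=4 threads_per_node=24", which matches the two allocated nodes and the 24 logical cores per node that gromacs reports in the log.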
The second alternative is to run one OpenMP thread per MPI rank (-ntomp 1), like this:
mpirun -bind-to core:overload-allowed -npernode 24 gmx_mpi mdrun -notunepme -deffnm MD_ -dlb yes -npme 0 -cpt 60 -maxh 0.05 -cpi MD_.cpt -ntomp 1 -gpu_id 000000111111222222333333 -multi 2
where the "-bind-to core:overload-allowed" option is probably only required with hyperthreading.
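Typing out the long -gpu_id string by hand is error-prone, so here is a small sketch that builds it. It assumes consecutive ranks on a node should be assigned to GPUs in equal blocks, that the number of ranks per node is an exact multiple of the number of GPUs, and that there are fewer than 10 GPUs per node so that each id is a single digit. The function name gpu_id_string is hypothetical.

```shell
#!/bin/sh
# Build a -gpu_id string that maps consecutive ranks on a node to GPUs
# in equal-sized blocks.
# Arguments: ranks per node, GPUs per node.
# Assumptions: ranks is an exact multiple of gpus, and gpus is below 10.
gpu_id_string() {
    ranks=$1; gpus=$2
    per_gpu=$((ranks / gpus))
    ids=""
    gpu=0
    while [ "$gpu" -lt "$gpus" ]; do
        n=0
        while [ "$n" -lt "$per_gpu" ]; do
            ids="${ids}${gpu}"
            n=$((n + 1))
        done
        gpu=$((gpu + 1))
    done
    echo "$ids"
}
```

"gpu_id_string 24 4" gives 000000111111222222333333 as used above, and "gpu_id_string 4 4" gives 0123 as used in the first command.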
$ diff no-np.log yes-np.log
1,2c1,2
< Log file opened on Tue Jul 26 09:16:25 2016
< Host: node001 pid: 21102 rank ID: 0 number of ranks: 4
---
> Log file opened on Tue Jul 26 09:41:11 2016
> Host: node001 pid: 22346 rank ID: 0 number of ranks: 4
66c66
< Number of logical cores detected (24) does not match the number reported by OpenMP (2).
---
> Number of logical cores detected (24) does not match the number reported by OpenMP (12).
162c162
< ld-seed = 3114924391
---
> ld-seed = 4178809860
429,430c429,430
< RMS relative constraint deviation after constraining: 3.46e-06
< Initial temperature: 309.815 K
---
> RMS relative constraint deviation after constraining: 3.42e-06
> Initial temperature: 310.938 K
432c432
< Started mdrun on rank 0 Tue Jul 26 09:16:29 2016
---
> Started mdrun on rank 0 Tue Jul 26 09:41:15 2016
442c442
< -4.90630e+05 1.68715e+05 -3.21916e+05 3.21682e+02 -3.95211e+01
---
> -4.90630e+05 1.69398e+05 -3.21232e+05 3.22985e+02 -5.09389e+01
444c444
< 4.26491e-06
---
> 4.34646e-06
446c446
< DD step 39 vol min/aver 1.000 load imb.: force 4.4%
---
> DD step 39 vol min/aver 1.000 load imb.: force 7.4%
449c449
< Step 10320: Run time exceeded 0.050 hours, will terminate the run
---
> Step 32640: Run time exceeded 0.050 hours, will terminate the run
451c451
< 10360 20.72000 0.00000
---
> 32680 65.36000 0.00000
453c453
< Writing checkpoint, step 10360 at Tue Jul 26 09:19:28 2016
---
> Writing checkpoint, step 32680 at Tue Jul 26 09:44:13 2016
458c458
< 1.22604e+04 5.70873e+04 3.77247e+04 8.94662e+02 -1.42204e+03
---
> 1.22192e+04 5.73049e+04 3.79451e+04 9.68468e+02 -1.52979e+03
460c460
< 8.43362e+03 -1.28393e+04 8.15360e+03 -6.05532e+05 2.67908e+03
---
> 8.57418e+03 -1.24316e+04 1.00830e+04 -6.06735e+05 2.66427e+03
462c462
< -4.92560e+05 1.62211e+05 -3.30349e+05 3.09282e+02 -2.47940e+02
---
> -4.90937e+05 1.62776e+05 -3.28161e+05 3.10359e+02 2.17014e+02
470c470
< Statistics over 10361 steps using 104 frames
---
> Statistics over 32681 steps using 327 frames
474c474
< 1.21421e+04 5.71798e+04 3.80964e+04 9.85006e+02 -1.46855e+03
---
> 1.20981e+04 5.70274e+04 3.79428e+04 9.64206e+02 -1.50478e+03
476c476
< 8.62873e+03 -1.26960e+04 9.41233e+03 -6.06182e+05 2.71017e+03
---
> 8.61301e+03 -1.24384e+04 9.36526e+03 -6.05690e+05 2.70108e+03
478c478
< -4.91192e+05 1.62881e+05 -3.28311e+05 3.10559e+02 7.81796e+00
---
> -4.90921e+05 1.62879e+05 -3.28042e+05 3.10555e+02 6.78073e+00
483c483
< 7.40515e+00 7.40515e+00 1.03635e+01
---
> 7.40100e+00 7.40100e+00 1.03708e+01
486,488c486,488
< 5.33098e+04 4.30370e+02 1.93729e+02
< 4.29164e+02 5.35544e+04 -1.86077e+02
< 1.93524e+02 -1.83077e+02 5.56151e+04
---
> 5.33490e+04 1.51469e+02 6.52398e+01
> 1.52200e+02 5.39549e+04 -1.33843e+02
> 6.68338e+01 -1.34647e+02 5.52271e+04
491,493c491,493
< 2.34064e+01 -2.47322e+01 -1.67206e+01
< -2.46617e+01 1.68480e+01 9.55039e+00
< -1.67086e+01 9.37497e+00 -1.68005e+01
---
> 2.16749e+01 -1.01125e+01 -4.92329e+00
> -1.01552e+01 -1.03203e+01 1.21038e+01
> -5.01651e+00 1.21508e+01 8.98758e+00
505,530c505,530
< NB VdW [V&F] 456.951183 456.951 0.0
< Pair Search distance check 2229.399952 20064.600 0.0
< NxN Ewald Elec. + LJ [F] 710016.305728 55381271.847 95.0
< NxN Ewald Elec. + LJ [V&F] 7268.991360 937699.885 1.6
< 1,4 nonbonded interactions 639.781389 57580.325 0.1
< Calc Weights 1812.449730 65248.190 0.1
< Spread Q Bspline 38665.594240 77331.188 0.1
< Gather F Bspline 38665.594240 231993.565 0.4
< 3D-FFT 151434.386688 1211475.094 2.1
< Solve PME 42.438656 2716.074 0.0
< Reset In Box 15.160600 45.482 0.0
< CG-CoM 15.218910 45.657 0.0
< Bonds 97.983977 5781.055 0.0
< Propers 736.822515 168732.356 0.3
< Impropers 10.682191 2221.896 0.0
< Virial 60.654130 1091.774 0.0
< Update 604.149910 18728.647 0.0
< Stop-CM 6.122550 61.225 0.0
< P-Coupling 60.409160 362.455 0.0
< Calc-Ekin 120.934940 3265.243 0.0
< Lincs 302.400510 18144.031 0.0
< Lincs-Mat 2689.009088 10756.036 0.0
< Constraint-V 1363.641087 10909.129 0.0
< Constraint-Vir 53.106186 1274.548 0.0
< Settle 252.968620 81708.864 0.1
< (null) 3.108300 0.000 0.0
---
> NB VdW [V&F] 1441.330143 1441.330 0.0
> Pair Search distance check 6990.356112 62913.205 0.0
> NxN Ewald Elec. + LJ [F] 2236130.475520 174418177.091 95.0
> NxN Ewald Elec. + LJ [V&F] 22669.995648 2924429.439 1.6
> 1,4 nonbonded interactions 2018.019069 181621.716 0.1
> Calc Weights 5716.887330 205807.944 0.1
> Spread Q Bspline 121960.263040 243920.526 0.1
> Gather F Bspline 121960.263040 731761.578 0.4
> 3D-FFT 477659.221248 3821273.770 2.1
> Solve PME 133.861376 8567.128 0.0
> Reset In Box 47.697580 143.093 0.0
> CG-CoM 47.755890 143.268 0.0
> Bonds 309.064217 18234.789 0.0
> Propers 2324.109315 532221.033 0.3
> Impropers 33.694111 7008.375 0.0
> Virial 191.203810 3441.669 0.0
> Update 1905.629110 59074.502 0.0
> Stop-CM 19.125680 191.257 0.0
> P-Coupling 190.557080 1143.342 0.0
> Calc-Ekin 381.230780 10293.231 0.0
> Lincs 952.147766 57128.866 0.0
> Lincs-Mat 8450.181728 33800.727 0.0
> Constraint-V 4296.810133 34374.481 0.0
> Constraint-Vir 167.277361 4014.657 0.0
> Settle 797.526798 257601.156 0.1
> (null) 9.804300 0.000 0.0
532c532
< Total 58308966.118 100.0
---
> Total 183618728.172 100.0
538,539c538,539
< av. #atoms communicated per step for force: 2 x 39597.3
< av. #atoms communicated per step for LINCS: 2 x 2855.3
---
> av. #atoms communicated per step for force: 2 x 39598.1
> av. #atoms communicated per step for LINCS: 2 x 2808.9
541,542c541,542
< Average load imbalance: 2.8 %
< Part of the total run time spent waiting due to load imbalance: 0.8 %
---
> Average load imbalance: 0.6 %
> Part of the total run time spent waiting due to load imbalance: 0.2 %
553,569c553,569
< Domain decomp. 4 6 260 3.195 183.586 1.8
< DD comm. load 4 6 259 0.010 0.593 0.0
< DD comm. bounds 4 6 260 0.059 3.413 0.0
< Neighbor search 4 6 260 2.627 150.975 1.5
< Launch GPU ops. 4 6 20722 1.150 66.067 0.6
< Comm. coord. 4 6 10101 2.839 163.129 1.6
< Force 4 6 10361 54.113 3109.598 30.1
< Wait + Comm. F 4 6 10361 1.577 90.599 0.9
< PME mesh 4 6 10361 73.011 4195.535 40.7
< Wait GPU nonlocal 4 6 10361 0.179 10.297 0.1
< Wait GPU local 4 6 10361 0.044 2.520 0.0
< NB X/F buffer ops. 4 6 40924 2.848 163.647 1.6
< Write traj. 4 6 2 0.017 0.999 0.0
< Update 4 6 20722 11.184 642.700 6.2
< Constraints 4 6 20722 25.794 1482.226 14.4
< Comm. energies 4 6 1037 0.170 9.758 0.1
< Rest 0.753 43.281 0.4
---
> Domain decomp. 4 6 818 4.527 260.126 2.5
> DD comm. load 4 6 817 0.004 0.225 0.0
> DD comm. bounds 4 6 818 0.055 3.182 0.0
> Neighbor search 4 6 818 2.497 143.466 1.4
> Launch GPU ops. 4 6 65362 3.204 184.088 1.8
> Comm. coord. 4 6 31863 5.387 309.550 3.0
> Force 4 6 32681 48.870 2808.314 27.4
> Wait + Comm. F 4 6 32681 5.033 289.221 2.8
> PME mesh 4 6 32681 71.720 4121.326 40.2
> Wait GPU nonlocal 4 6 32681 0.393 22.593 0.2
> Wait GPU local 4 6 32681 0.112 6.457 0.1
> NB X/F buffer ops. 4 6 129088 2.409 138.419 1.3
> Write traj. 4 6 3 0.012 0.705 0.0
> Update 4 6 65362 9.216 529.577 5.2
> Constraints 4 6 65362 23.384 1343.761 13.1
> Comm. energies 4 6 3269 0.234 13.446 0.1
> Rest 1.433 82.373 0.8
571c571
< Total 179.571 10318.922 100.0
---
> Total 178.490 10256.829 100.0
575,579c575,579
< PME redist. X/F 4 6 20722 12.505 718.602 7.0
< PME spread/gather 4 6 20722 35.759 2054.864 19.9
< PME 3D-FFT 4 6 20722 18.237 1047.978 10.2
< PME 3D-FFT Comm. 4 6 20722 4.554 261.665 2.5
< PME solve Elec 4 6 10361 0.894 51.350 0.5
---
> PME redist. X/F 4 6 65362 8.981 516.074 5.0
> PME spread/gather 4 6 65362 30.705 1764.430 17.2
> PME 3D-FFT 4 6 65362 15.356 882.437 8.6
> PME 3D-FFT Comm. 4 6 65362 13.560 779.189 7.6
> PME solve Elec 4 6 32681 2.992 171.962 1.7
583c583
< Time: 1786.804 179.571 995.0
---
> Time: 4282.038 178.490 2399.0
585,586c585,586
< Performance: 9.970 2.407
< Finished mdrun on rank 0 Tue Jul 26 09:19:29 2016
---
> Performance: 31.639 0.759
> Finished mdrun on rank 0 Tue Jul 26 09:44:14 2016
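One more number that can be read off the logs is how many cores were kept busy on average, by dividing core time by wall time on the "Time:" line. The sketch below assumes that the second and third fields of that line are core time (s) and wall time (s); the function name busy_cores is made up for illustration.

```shell
#!/bin/sh
# Print core time divided by wall time from the "Time:" line of an mdrun log.
# Assumption: fields 2 and 3 of that line are core time (s) and wall time (s).
busy_cores() {
    awk '$1 == "Time:" { printf "%.2f\n", $2 / $3 }' "$1"
}
```

With the values in the diff above this gives about 9.95 for no-np.log and about 23.99 for yes-np.log, compared with the 24 threads requested (4 ranks x 6 OpenMP threads). That might point to a difference in where the threads were placed, but it is a guess and has not been verified.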