[gmx-users] mpirun -npernode option gives gromacs slowdown unless used with mpirun option -np or gromacs option -ntomp 1
Mark Abraham
mark.j.abraham at gmail.com
Tue Jul 26 23:08:07 CEST 2016
Hi,
On Tue, Jul 26, 2016 at 10:35 PM Christopher Neale <
chris.neale at alum.utoronto.ca> wrote:
> Dear Users:
>
> this is simply an informational post in case somebody runs into similar
> troubles in the future. I don't understand why the usage must be this way,
> but empirically it works.
>
> I find that when I use (A) "mpirun -np 4 gmx_mpi -ntomp 6" I get 32
> ns/day. However, if I instead use (B) "mpirun -npernode 4 gmx_mpi -ntomp 6"
> I get only 10 ns/day. Finally, if I use (C) both the -npernode and -np
> options to mpirun, "mpirun -np 4 -npernode 4 gmx_mpi -ntomp 6", then I get
> again 32 ns/day. A diff of the .log files from option B (no-np.log) and
> option C (yes-np.log) doesn't contain any clues as to how things were set
> up differently with gromacs in the two cases (see end of this post).
>
> Why bother with the -npernode option at all? Because I was having trouble
> getting -multi to work with gpus otherwise. I found that if I simply used
> "mpirun -np 8 gmx_mpi -ntomp 6 -multi 2" then both jobs got put on the
> first node and the second allocated node was left empty. Therefore, the only
> way I have found to use CPU/GPU runs and the -multi keyword efficiently is to
> use "mpirun -npernode" (which seems to be required to get a good distribution
> of processes across nodes), and yet for some reason that alone leads to the
> performance degradation in gromacs.
>
Yeah, this is irritating. We haven't found a good way to help the user
manage things. There's a fundamental conflict over which piece of
infrastructure should have the last word on which CPU threads have affinity
for which sets of cores. We (think we) know what's best for mdrun, but if
something external has set affinity masks, then by default mdrun needs to
respect that. Unfortunately it's very easy to mis-configure MPI libraries,
or to use mpirun in a suboptimal way, and that produces externally-set
affinity masks that mdrun notionally should respect; we haven't yet
implemented a good way to detect that this is probably a problem and react
to it. The good news is that mdrun -pin on will instruct mdrun to ignore
external affinities and set the kind of affinity patterns we've designed
for, and those work pretty well on mainstream hardware. The mpirun -bind-to
options are another way to be specific about your requirements. Whether any
of this matters is probably also affected by the presence of system
processes on the machine, and by how much the kernel has been set up to
reflect the realities of HPC workloads.
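For what it's worth, a minimal sketch of both routes, assuming Open MPI and
the same layout as in this thread (2 nodes, 4 ranks x 6 OpenMP threads and
4 GPUs per node); treat it as a template, since option spellings vary
between MPI versions:

  # Let mdrun manage affinity itself, ignoring any externally-set mask:
  mpirun -np 8 -npernode 4 gmx_mpi mdrun -pin on -ntomp 6 -gpu_id 0123 -multi 2

  # Or keep a working placement and ask mpirun to show what it did;
  # --report-bindings prints the mask each rank actually received:
  mpirun -np 8 -npernode 4 --report-bindings \
      gmx_mpi mdrun -ntomp 6 -gpu_id 0123 -multi 2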
> I found two ways to get good performance from gromacs with mpirun
> -npernode. The first is to simply include the -np option as well. Could be
> like this:
>
> mpirun -np 8 -npernode 4 gmx_mpi mdrun -notunepme -deffnm MD_ -dlb yes
> -npme 0 -cpt 60 -maxh 0.05 -cpi MD_.cpt -ntomp 6 -gpu_id 0123 -multi 2
>
> I guess that might be obvious to some people, but the man page for openmpi
> mpirun reads to me as if -npernode is an alternative to -np rather than an
> augmentation.
>
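A quick way to check what a given option combination actually does with
rank placement, before involving mdrun at all (assuming Open MPI; hostname
here is just a stand-in executable):

  # Count how many ranks land on each node:
  mpirun -np 8 -npernode 4 hostname | sort | uniq -c

  # Or ask Open MPI to print the process map it computed:
  mpirun -np 8 -npernode 4 --display-map true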
> The second alternative is this:
>
> mpirun -bind-to core:overload-allowed -npernode 24 gmx_mpi mdrun
> -notunepme -deffnm MD_ -dlb yes -npme 0 -cpt 60 -maxh 0.05 -cpi MD_.cpt
> -ntomp 1 -gpu_id 000000111111222222333333 -multi 2
>
> where the "-bind-to core:overload-allowed" option is probably only
> required with hyperthreading.
>
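As an aside on that second form: the -gpu_id string is one digit per PP
rank on the node, in rank order, naming the GPU that rank should use, so
with 24 ranks and 4 GPUs the six-fold repetition maps ranks 0-5 to GPU 0,
6-11 to GPU 1, and so on. A small shell sketch for generating it (the
variable names are made up, not anything GROMACS provides):

  # Build "000000111111222222333333": 4 GPUs, 6 consecutive ranks per GPU.
  N_GPUS=4
  RANKS_PER_GPU=6
  GPU_ID=$(for g in $(seq 0 $((N_GPUS - 1))); do
             printf "${g}%.0s" $(seq $RANKS_PER_GPU)   # repeat digit g
           done)
  echo "$GPU_ID"   # -> 000000111111222222333333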
> $ diff no-np.log yes-np.log
>
> 1,2c1,2
> < Log file opened on Tue Jul 26 09:16:25 2016
> < Host: node001 pid: 21102 rank ID: 0 number of ranks: 4
> ---
> > Log file opened on Tue Jul 26 09:41:11 2016
> > Host: node001 pid: 22346 rank ID: 0 number of ranks: 4
> 66c66
> < Number of logical cores detected (24) does not match the number reported
> by OpenMP (2).
> ---
> > Number of logical cores detected (24) does not match the number reported
> by OpenMP (12).
>
This is actually the only clue available, though the wording of the message
is sufficiently unclear that we've removed it for 2016, until we come up
with a way to detect foolish-looking affinity patterns and say something
that a user can act upon. The 2 and 12 are how many of the node's 24
hardware threads (with hyperthreading) each rank's software threads are
permitted to migrate over: in the slow case each rank's 6 OpenMP threads
were apparently confined to just 2 hardware threads, i.e. a single physical
core. Migration off a physical core wrecks the performance of compute-bound
code because it invalidates caches, so one doesn't want to permit that.
Mark
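Incidentally, a quick way to see the affinity mask each rank inherits from
the launcher, before mdrun does anything, is something like the following
(assuming Open MPI on Linux); in the slow case above each rank would report
a Cpus_allowed_list covering only 2 of the 24 hardware threads:

  # One line per rank: which node it is on and which CPUs it may run on.
  mpirun -npernode 4 bash -c \
      'echo "$(hostname): $(grep Cpus_allowed_list /proc/self/status)"'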
> 162c162
> < ld-seed = 3114924391
> ---
> > ld-seed = 4178809860
> 429,430c429,430
> < RMS relative constraint deviation after constraining: 3.46e-06
> < Initial temperature: 309.815 K
> ---
> > RMS relative constraint deviation after constraining: 3.42e-06
> > Initial temperature: 310.938 K
> 432c432
> < Started mdrun on rank 0 Tue Jul 26 09:16:29 2016
> ---
> > Started mdrun on rank 0 Tue Jul 26 09:41:15 2016
> 442c442
> < -4.90630e+05 1.68715e+05 -3.21916e+05 3.21682e+02
> -3.95211e+01
> ---
> > -4.90630e+05 1.69398e+05 -3.21232e+05 3.22985e+02
> -5.09389e+01
> 444c444
> < 4.26491e-06
> ---
> > 4.34646e-06
> 446c446
> < DD step 39 vol min/aver 1.000 load imb.: force 4.4%
> ---
> > DD step 39 vol min/aver 1.000 load imb.: force 7.4%
> 449c449
> < Step 10320: Run time exceeded 0.050 hours, will terminate the run
> ---
> > Step 32640: Run time exceeded 0.050 hours, will terminate the run
> 451c451
> < 10360 20.72000 0.00000
> ---
> > 32680 65.36000 0.00000
> 453c453
> < Writing checkpoint, step 10360 at Tue Jul 26 09:19:28 2016
> ---
> > Writing checkpoint, step 32680 at Tue Jul 26 09:44:13 2016
> 458c458
> < 1.22604e+04 5.70873e+04 3.77247e+04 8.94662e+02
> -1.42204e+03
> ---
> > 1.22192e+04 5.73049e+04 3.79451e+04 9.68468e+02
> -1.52979e+03
> 460c460
> < 8.43362e+03 -1.28393e+04 8.15360e+03 -6.05532e+05
> 2.67908e+03
> ---
> > 8.57418e+03 -1.24316e+04 1.00830e+04 -6.06735e+05
> 2.66427e+03
> 462c462
> < -4.92560e+05 1.62211e+05 -3.30349e+05 3.09282e+02
> -2.47940e+02
> ---
> > -4.90937e+05 1.62776e+05 -3.28161e+05 3.10359e+02
> 2.17014e+02
> 470c470
> < Statistics over 10361 steps using 104 frames
> ---
> > Statistics over 32681 steps using 327 frames
> 474c474
> < 1.21421e+04 5.71798e+04 3.80964e+04 9.85006e+02
> -1.46855e+03
> ---
> > 1.20981e+04 5.70274e+04 3.79428e+04 9.64206e+02
> -1.50478e+03
> 476c476
> < 8.62873e+03 -1.26960e+04 9.41233e+03 -6.06182e+05
> 2.71017e+03
> ---
> > 8.61301e+03 -1.24384e+04 9.36526e+03 -6.05690e+05
> 2.70108e+03
> 478c478
> < -4.91192e+05 1.62881e+05 -3.28311e+05 3.10559e+02
> 7.81796e+00
> ---
> > -4.90921e+05 1.62879e+05 -3.28042e+05 3.10555e+02
> 6.78073e+00
> 483c483
> < 7.40515e+00 7.40515e+00 1.03635e+01
> ---
> > 7.40100e+00 7.40100e+00 1.03708e+01
> 486,488c486,488
> < 5.33098e+04 4.30370e+02 1.93729e+02
> < 4.29164e+02 5.35544e+04 -1.86077e+02
> < 1.93524e+02 -1.83077e+02 5.56151e+04
> ---
> > 5.33490e+04 1.51469e+02 6.52398e+01
> > 1.52200e+02 5.39549e+04 -1.33843e+02
> > 6.68338e+01 -1.34647e+02 5.52271e+04
> 491,493c491,493
> < 2.34064e+01 -2.47322e+01 -1.67206e+01
> < -2.46617e+01 1.68480e+01 9.55039e+00
> < -1.67086e+01 9.37497e+00 -1.68005e+01
> ---
> > 2.16749e+01 -1.01125e+01 -4.92329e+00
> > -1.01552e+01 -1.03203e+01 1.21038e+01
> > -5.01651e+00 1.21508e+01 8.98758e+00
> 505,530c505,530
> < NB VdW [V&F] 456.951183 456.951
> 0.0
> < Pair Search distance check 2229.399952 20064.600
> 0.0
> < NxN Ewald Elec. + LJ [F] 710016.305728 55381271.847
> 95.0
> < NxN Ewald Elec. + LJ [V&F] 7268.991360 937699.885
> 1.6
> < 1,4 nonbonded interactions 639.781389 57580.325
> 0.1
> < Calc Weights 1812.449730 65248.190
> 0.1
> < Spread Q Bspline 38665.594240 77331.188
> 0.1
> < Gather F Bspline 38665.594240 231993.565
> 0.4
> < 3D-FFT 151434.386688 1211475.094
> 2.1
> < Solve PME 42.438656 2716.074
> 0.0
> < Reset In Box 15.160600 45.482
> 0.0
> < CG-CoM 15.218910 45.657
> 0.0
> < Bonds 97.983977 5781.055
> 0.0
> < Propers 736.822515 168732.356
> 0.3
> < Impropers 10.682191 2221.896
> 0.0
> < Virial 60.654130 1091.774
> 0.0
> < Update 604.149910 18728.647
> 0.0
> < Stop-CM 6.122550 61.225
> 0.0
> < P-Coupling 60.409160 362.455
> 0.0
> < Calc-Ekin 120.934940 3265.243
> 0.0
> < Lincs 302.400510 18144.031
> 0.0
> < Lincs-Mat 2689.009088 10756.036
> 0.0
> < Constraint-V 1363.641087 10909.129
> 0.0
> < Constraint-Vir 53.106186 1274.548
> 0.0
> < Settle 252.968620 81708.864
> 0.1
> < (null) 3.108300 0.000
> 0.0
> ---
> > NB VdW [V&F] 1441.330143 1441.330
> 0.0
> > Pair Search distance check 6990.356112 62913.205
> 0.0
> > NxN Ewald Elec. + LJ [F] 2236130.475520 174418177.091
> 95.0
> > NxN Ewald Elec. + LJ [V&F] 22669.995648 2924429.439
> 1.6
> > 1,4 nonbonded interactions 2018.019069 181621.716
> 0.1
> > Calc Weights 5716.887330 205807.944
> 0.1
> > Spread Q Bspline 121960.263040 243920.526
> 0.1
> > Gather F Bspline 121960.263040 731761.578
> 0.4
> > 3D-FFT 477659.221248 3821273.770
> 2.1
> > Solve PME 133.861376 8567.128
> 0.0
> > Reset In Box 47.697580 143.093
> 0.0
> > CG-CoM 47.755890 143.268
> 0.0
> > Bonds 309.064217 18234.789
> 0.0
> > Propers 2324.109315 532221.033
> 0.3
> > Impropers 33.694111 7008.375
> 0.0
> > Virial 191.203810 3441.669
> 0.0
> > Update 1905.629110 59074.502
> 0.0
> > Stop-CM 19.125680 191.257
> 0.0
> > P-Coupling 190.557080 1143.342
> 0.0
> > Calc-Ekin 381.230780 10293.231
> 0.0
> > Lincs 952.147766 57128.866
> 0.0
> > Lincs-Mat 8450.181728 33800.727
> 0.0
> > Constraint-V 4296.810133 34374.481
> 0.0
> > Constraint-Vir 167.277361 4014.657
> 0.0
> > Settle 797.526798 257601.156
> 0.1
> > (null) 9.804300 0.000
> 0.0
> 532c532
> < Total 58308966.118
> 100.0
> ---
> > Total 183618728.172
> 100.0
> 538,539c538,539
> < av. #atoms communicated per step for force: 2 x 39597.3
> < av. #atoms communicated per step for LINCS: 2 x 2855.3
> ---
> > av. #atoms communicated per step for force: 2 x 39598.1
> > av. #atoms communicated per step for LINCS: 2 x 2808.9
> 541,542c541,542
> < Average load imbalance: 2.8 %
> < Part of the total run time spent waiting due to load imbalance: 0.8 %
> ---
> > Average load imbalance: 0.6 %
> > Part of the total run time spent waiting due to load imbalance: 0.2 %
> 553,569c553,569
> < Domain decomp. 4 6 260 3.195 183.586
> 1.8
> < DD comm. load 4 6 259 0.010 0.593
> 0.0
> < DD comm. bounds 4 6 260 0.059 3.413
> 0.0
> < Neighbor search 4 6 260 2.627 150.975
> 1.5
> < Launch GPU ops. 4 6 20722 1.150 66.067
> 0.6
> < Comm. coord. 4 6 10101 2.839 163.129
> 1.6
> < Force 4 6 10361 54.113 3109.598
> 30.1
> < Wait + Comm. F 4 6 10361 1.577 90.599
> 0.9
> < PME mesh 4 6 10361 73.011 4195.535
> 40.7
> < Wait GPU nonlocal 4 6 10361 0.179 10.297
> 0.1
> < Wait GPU local 4 6 10361 0.044 2.520
> 0.0
> < NB X/F buffer ops. 4 6 40924 2.848 163.647
> 1.6
> < Write traj. 4 6 2 0.017 0.999
> 0.0
> < Update 4 6 20722 11.184 642.700
> 6.2
> < Constraints 4 6 20722 25.794 1482.226
> 14.4
> < Comm. energies 4 6 1037 0.170 9.758
> 0.1
> < Rest 0.753 43.281
> 0.4
> ---
> > Domain decomp. 4 6 818 4.527 260.126
> 2.5
> > DD comm. load 4 6 817 0.004 0.225
> 0.0
> > DD comm. bounds 4 6 818 0.055 3.182
> 0.0
> > Neighbor search 4 6 818 2.497 143.466
> 1.4
> > Launch GPU ops. 4 6 65362 3.204 184.088
> 1.8
> > Comm. coord. 4 6 31863 5.387 309.550
> 3.0
> > Force 4 6 32681 48.870 2808.314
> 27.4
> > Wait + Comm. F 4 6 32681 5.033 289.221
> 2.8
> > PME mesh 4 6 32681 71.720 4121.326
> 40.2
> > Wait GPU nonlocal 4 6 32681 0.393 22.593
> 0.2
> > Wait GPU local 4 6 32681 0.112 6.457
> 0.1
> > NB X/F buffer ops. 4 6 129088 2.409 138.419
> 1.3
> > Write traj. 4 6 3 0.012 0.705
> 0.0
> > Update 4 6 65362 9.216 529.577
> 5.2
> > Constraints 4 6 65362 23.384 1343.761
> 13.1
> > Comm. energies 4 6 3269 0.234 13.446
> 0.1
> > Rest 1.433 82.373
> 0.8
> 571c571
> < Total 179.571 10318.922
> 100.0
> ---
> > Total 178.490 10256.829
> 100.0
> 575,579c575,579
> < PME redist. X/F 4 6 20722 12.505 718.602
> 7.0
> < PME spread/gather 4 6 20722 35.759 2054.864
> 19.9
> < PME 3D-FFT 4 6 20722 18.237 1047.978
> 10.2
> < PME 3D-FFT Comm. 4 6 20722 4.554 261.665
> 2.5
> < PME solve Elec 4 6 10361 0.894 51.350
> 0.5
> ---
> > PME redist. X/F 4 6 65362 8.981 516.074
> 5.0
> > PME spread/gather 4 6 65362 30.705 1764.430
> 17.2
> > PME 3D-FFT 4 6 65362 15.356 882.437
> 8.6
> > PME 3D-FFT Comm. 4 6 65362 13.560 779.189
> 7.6
> > PME solve Elec 4 6 32681 2.992 171.962
> 1.7
> 583c583
> < Time: 1786.804 179.571 995.0
> ---
> > Time: 4282.038 178.490 2399.0
> 585,586c585,586
> < Performance: 9.970 2.407
> < Finished mdrun on rank 0 Tue Jul 26 09:19:29 2016
> ---
> > Performance: 31.639 0.759
> > Finished mdrun on rank 0 Tue Jul 26 09:44:14 2016