[gmx-users] mpirun -npernode option gives gromacs slowdown unless used with mpirun option -np or gromacs option -ntomp 1

Mark Abraham mark.j.abraham at gmail.com
Tue Jul 26 23:08:07 CEST 2016


Hi,

On Tue, Jul 26, 2016 at 10:35 PM Christopher Neale <
chris.neale at alum.utoronto.ca> wrote:

> Dear Users:
>
> This is simply an informational post in case somebody runs into similar
> trouble in the future. I don't understand why the usage must be this way,
> but empirically it works.
>
> I find that when I use (A) "mpirun -np 4 gmx_mpi -ntomp 6" I get 32
> ns/day. However, if I instead use (B) "mpirun -npernode 4 gmx_mpi -ntomp 6"
> I get only 10 ns/day. Finally, if I use (C) both the -npernode and -np
> options to mpirun, "mpirun -np 4 -npernode 4 gmx_mpi -ntomp 6", then I
> again get 32 ns/day. A diff of the .log files from option B (no-np.log) and
> option C (yes-np.log) doesn't contain any clues as to how gromacs was set
> up differently in the two cases (see the end of this post).
>
> Why bother with the -npernode option at all? Because I was having trouble
> getting -multi to work with gpus otherwise. I found that if I simply used
> "mpirun -np 8 gmx_mpi -ntomp 6 -multi 2" then both jobs got put on the
> first node and the second allocated node was left empty. Therefore, the
> only way I can find to use CPU/GPU runs and the -multi keyword efficiently
> is to use "mpirun -npernode" (which seems to be required to get a good
> distribution of processes across nodes), and for some reason that leads to
> performance degradation in gromacs.
>

Yeah, this is irritating. We haven't found a good way to help the user
manage things. There's a fundamental conflict over which piece of
infrastructure should have the last word on which CPU threads have
affinity for which sets of cores. We (think we) know what's best for mdrun,
but if something external has set affinity masks, then by default mdrun
needs to respect them. However, it's very easy to mis-configure MPI
libraries, or to use mpirun in a suboptimal way, and that leads to
externally-set affinity masks that mdrun notionally should respect; we
haven't yet implemented a good way to detect that this is probably a
problem and react. The good news is that mdrun -pin on will instruct mdrun
to ignore external affinities and set the kind of affinity patterns we've
designed for, and those work pretty well on mainstream hardware. The mpirun
-bind-to options are another way to be specific about your requirements.
Whether any of this matters is probably also affected by the presence of
system processes on the machine, and by how well the kernel has been set up
to reflect the realities of HPC workloads.
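
For example, on your allocation something like the following ought to let
the MPI library spread the ranks across the nodes while mdrun does its own
pinning. This is only a sketch, not something I've run on your machine, and
the exact option spellings depend on your OpenMPI version; the rank and
thread counts just mirror the ones in your commands:

  # let OpenMPI place 4 ranks on each of the 2 nodes, set no affinity mask,
  # and have mdrun pin its own threads instead
  mpirun -np 8 -npernode 4 -bind-to none gmx_mpi mdrun -pin on -ntomp 6 \
      -multi 2 -gpu_id 0123 -deffnm MD_

Here -bind-to none tells OpenMPI not to bind the ranks at all, and -pin on
tells mdrun to lay out its threads the way it prefers.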

> I found two ways to get good performance from gromacs with mpirun
> -npernode. The first is to simply include the -np option as well. Could be
> like this:
>
> mpirun -np 8 -npernode 4 gmx_mpi mdrun -notunepme -deffnm MD_ -dlb yes
> -npme 0 -cpt 60 -maxh 0.05 -cpi MD_.cpt -ntomp 6 -gpu_id 0123 -multi 2
>
> I guess that might be obvious to some people, but the man page for openmpi
> mpirun reads to me as if -npernode is an alternative to -np rather than an
> augmentation.
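>
> In hindsight the counting is simple enough: -npernode only fixes the
> per-node rank count and -np still sets the total, so (if I have it right)
> for my two allocated nodes it works out as follows (other mdrun options
> omitted):
>
>   # 2 nodes x 4 ranks/node = 8 MPI ranks in total
>   mpirun -np 8 -npernode 4 gmx_mpi mdrun -ntomp 6 -multi 2 -gpu_id 0123 -deffnm MD_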
>
> The second alternative is this:
>
> mpirun -bind-to core:overload-allowed -npernode 24 gmx_mpi mdrun
> -notunepme -deffnm MD_ -dlb yes -npme 0 -cpt 60 -maxh 0.05 -cpi MD_.cpt
> -ntomp 1 -gpu_id 000000111111222222333333 -multi 2
>
> where the "-bind-to core:overload-allowed" option is probably only
> required with hyperthreading.
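>
> If I have counted right (these nodes report 24 hardware threads, so
> presumably 12 physical cores with hyperthreading), the numbers in that
> command work out as:
>
>   # -npernode 24 with -ntomp 1          -> one rank per hardware thread
>   # -gpu_id 000000111111222222333333    -> 6 consecutive ranks share each of the 4 GPUs
>   # -bind-to core:overload-allowed      -> needed because binding 24 ranks to
>   #                                        12 physical cores puts 2 ranks on each core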
>
> $ diff no-np.log yes-np.log
>
> 1,2c1,2
> < Log file opened on Tue Jul 26 09:16:25 2016
> < Host: node001  pid: 21102  rank ID: 0  number of ranks:  4
> ---
> > Log file opened on Tue Jul 26 09:41:11 2016
> > Host: node001  pid: 22346  rank ID: 0  number of ranks:  4
> 66c66
> < Number of logical cores detected (24) does not match the number reported by OpenMP (2).
> ---
> > Number of logical cores detected (24) does not match the number reported by OpenMP (12).
>

This is actually the only clue available, though the wording of the message
is sufficiently unclear that we've removed it for the 2016 release, until
we've come up with a way to detect foolish-looking affinity patterns and say
something that a user can act upon. The 2 and 12 reflect the number of
hardware threads (out of the 24 available with hyperthreading) over which
software threads are permitted to migrate. Migration off a physical core
wrecks the performance of compute-bound code because it invalidates caches,
so one doesn't want to permit that.

Mark

> 162c162
> <    ld-seed                        = 3114924391
> ---
> >    ld-seed                        = 4178809860
> 429,430c429,430
> < RMS relative constraint deviation after constraining: 3.46e-06
> < Initial temperature: 309.815 K
> ---
> > RMS relative constraint deviation after constraining: 3.42e-06
> > Initial temperature: 310.938 K
> 432c432
> < Started mdrun on rank 0 Tue Jul 26 09:16:29 2016
> ---
> > Started mdrun on rank 0 Tue Jul 26 09:41:15 2016
> 442c442
> <    -4.90630e+05    1.68715e+05   -3.21916e+05    3.21682e+02
>  -3.95211e+01
> ---
> >    -4.90630e+05    1.69398e+05   -3.21232e+05    3.22985e+02
>  -5.09389e+01
> 444c444
> <     4.26491e-06
> ---
> >     4.34646e-06
> 446c446
> < DD  step 39  vol min/aver 1.000  load imb.: force  4.4%
> ---
> > DD  step 39  vol min/aver 1.000  load imb.: force  7.4%
> 449c449
> < Step 10320: Run time exceeded 0.050 hours, will terminate the run
> ---
> > Step 32640: Run time exceeded 0.050 hours, will terminate the run
> 451c451
> <           10360       20.72000        0.00000
> ---
> >           32680       65.36000        0.00000
> 453c453
> < Writing checkpoint, step 10360 at Tue Jul 26 09:19:28 2016
> ---
> > Writing checkpoint, step 32680 at Tue Jul 26 09:44:13 2016
> 458c458
> <     1.22604e+04    5.70873e+04    3.77247e+04    8.94662e+02
>  -1.42204e+03
> ---
> >     1.22192e+04    5.73049e+04    3.79451e+04    9.68468e+02
>  -1.52979e+03
> 460c460
> <     8.43362e+03   -1.28393e+04    8.15360e+03   -6.05532e+05
> 2.67908e+03
> ---
> >     8.57418e+03   -1.24316e+04    1.00830e+04   -6.06735e+05
> 2.66427e+03
> 462c462
> <    -4.92560e+05    1.62211e+05   -3.30349e+05    3.09282e+02
>  -2.47940e+02
> ---
> >    -4.90937e+05    1.62776e+05   -3.28161e+05    3.10359e+02
> 2.17014e+02
> 470c470
> <       Statistics over 10361 steps using 104 frames
> ---
> >       Statistics over 32681 steps using 327 frames
> 474c474
> <     1.21421e+04    5.71798e+04    3.80964e+04    9.85006e+02
>  -1.46855e+03
> ---
> >     1.20981e+04    5.70274e+04    3.79428e+04    9.64206e+02
>  -1.50478e+03
> 476c476
> <     8.62873e+03   -1.26960e+04    9.41233e+03   -6.06182e+05
> 2.71017e+03
> ---
> >     8.61301e+03   -1.24384e+04    9.36526e+03   -6.05690e+05
> 2.70108e+03
> 478c478
> <    -4.91192e+05    1.62881e+05   -3.28311e+05    3.10559e+02
> 7.81796e+00
> ---
> >    -4.90921e+05    1.62879e+05   -3.28042e+05    3.10555e+02
> 6.78073e+00
> 483c483
> <     7.40515e+00    7.40515e+00    1.03635e+01
> ---
> >     7.40100e+00    7.40100e+00    1.03708e+01
> 486,488c486,488
> <     5.33098e+04    4.30370e+02    1.93729e+02
> <     4.29164e+02    5.35544e+04   -1.86077e+02
> <     1.93524e+02   -1.83077e+02    5.56151e+04
> ---
> >     5.33490e+04    1.51469e+02    6.52398e+01
> >     1.52200e+02    5.39549e+04   -1.33843e+02
> >     6.68338e+01   -1.34647e+02    5.52271e+04
> 491,493c491,493
> <     2.34064e+01   -2.47322e+01   -1.67206e+01
> <    -2.46617e+01    1.68480e+01    9.55039e+00
> <    -1.67086e+01    9.37497e+00   -1.68005e+01
> ---
> >     2.16749e+01   -1.01125e+01   -4.92329e+00
> >    -1.01552e+01   -1.03203e+01    1.21038e+01
> >    -5.01651e+00    1.21508e+01    8.98758e+00
> 505,530c505,530
> <  NB VdW [V&F]                           456.951183         456.951
>  0.0
> <  Pair Search distance check            2229.399952       20064.600
>  0.0
> <  NxN Ewald Elec. + LJ [F]            710016.305728    55381271.847
> 95.0
> <  NxN Ewald Elec. + LJ [V&F]            7268.991360      937699.885
>  1.6
> <  1,4 nonbonded interactions             639.781389       57580.325
>  0.1
> <  Calc Weights                          1812.449730       65248.190
>  0.1
> <  Spread Q Bspline                     38665.594240       77331.188
>  0.1
> <  Gather F Bspline                     38665.594240      231993.565
>  0.4
> <  3D-FFT                              151434.386688     1211475.094
>  2.1
> <  Solve PME                               42.438656        2716.074
>  0.0
> <  Reset In Box                            15.160600          45.482
>  0.0
> <  CG-CoM                                  15.218910          45.657
>  0.0
> <  Bonds                                   97.983977        5781.055
>  0.0
> <  Propers                                736.822515      168732.356
>  0.3
> <  Impropers                               10.682191        2221.896
>  0.0
> <  Virial                                  60.654130        1091.774
>  0.0
> <  Update                                 604.149910       18728.647
>  0.0
> <  Stop-CM                                  6.122550          61.225
>  0.0
> <  P-Coupling                              60.409160         362.455
>  0.0
> <  Calc-Ekin                              120.934940        3265.243
>  0.0
> <  Lincs                                  302.400510       18144.031
>  0.0
> <  Lincs-Mat                             2689.009088       10756.036
>  0.0
> <  Constraint-V                          1363.641087       10909.129
>  0.0
> <  Constraint-Vir                          53.106186        1274.548
>  0.0
> <  Settle                                 252.968620       81708.864
>  0.1
> <  (null)                                   3.108300           0.000
>  0.0
> ---
> >  NB VdW [V&F]                          1441.330143        1441.330
>  0.0
> >  Pair Search distance check            6990.356112       62913.205
>  0.0
> >  NxN Ewald Elec. + LJ [F]           2236130.475520   174418177.091
> 95.0
> >  NxN Ewald Elec. + LJ [V&F]           22669.995648     2924429.439
>  1.6
> >  1,4 nonbonded interactions            2018.019069      181621.716
>  0.1
> >  Calc Weights                          5716.887330      205807.944
>  0.1
> >  Spread Q Bspline                    121960.263040      243920.526
>  0.1
> >  Gather F Bspline                    121960.263040      731761.578
>  0.4
> >  3D-FFT                              477659.221248     3821273.770
>  2.1
> >  Solve PME                              133.861376        8567.128
>  0.0
> >  Reset In Box                            47.697580         143.093
>  0.0
> >  CG-CoM                                  47.755890         143.268
>  0.0
> >  Bonds                                  309.064217       18234.789
>  0.0
> >  Propers                               2324.109315      532221.033
>  0.3
> >  Impropers                               33.694111        7008.375
>  0.0
> >  Virial                                 191.203810        3441.669
>  0.0
> >  Update                                1905.629110       59074.502
>  0.0
> >  Stop-CM                                 19.125680         191.257
>  0.0
> >  P-Coupling                             190.557080        1143.342
>  0.0
> >  Calc-Ekin                              381.230780       10293.231
>  0.0
> >  Lincs                                  952.147766       57128.866
>  0.0
> >  Lincs-Mat                             8450.181728       33800.727
>  0.0
> >  Constraint-V                          4296.810133       34374.481
>  0.0
> >  Constraint-Vir                         167.277361        4014.657
>  0.0
> >  Settle                                 797.526798      257601.156
>  0.1
> >  (null)                                   9.804300           0.000
>  0.0
> 532c532
> <  Total                                                58308966.118
>  100.0
> ---
> >  Total                                               183618728.172
>  100.0
> 538,539c538,539
> <  av. #atoms communicated per step for force:  2 x 39597.3
> <  av. #atoms communicated per step for LINCS:  2 x 2855.3
> ---
> >  av. #atoms communicated per step for force:  2 x 39598.1
> >  av. #atoms communicated per step for LINCS:  2 x 2808.9
> 541,542c541,542
> <  Average load imbalance: 2.8 %
> <  Part of the total run time spent waiting due to load imbalance: 0.8 %
> ---
> >  Average load imbalance: 0.6 %
> >  Part of the total run time spent waiting due to load imbalance: 0.2 %
> 553,569c553,569
> <  Domain decomp.         4    6        260       3.195        183.586
>  1.8
> <  DD comm. load          4    6        259       0.010          0.593
>  0.0
> <  DD comm. bounds        4    6        260       0.059          3.413
>  0.0
> <  Neighbor search        4    6        260       2.627        150.975
>  1.5
> <  Launch GPU ops.        4    6      20722       1.150         66.067
>  0.6
> <  Comm. coord.           4    6      10101       2.839        163.129
>  1.6
> <  Force                  4    6      10361      54.113       3109.598
> 30.1
> <  Wait + Comm. F         4    6      10361       1.577         90.599
>  0.9
> <  PME mesh               4    6      10361      73.011       4195.535
> 40.7
> <  Wait GPU nonlocal      4    6      10361       0.179         10.297
>  0.1
> <  Wait GPU local         4    6      10361       0.044          2.520
>  0.0
> <  NB X/F buffer ops.     4    6      40924       2.848        163.647
>  1.6
> <  Write traj.            4    6          2       0.017          0.999
>  0.0
> <  Update                 4    6      20722      11.184        642.700
>  6.2
> <  Constraints            4    6      20722      25.794       1482.226
> 14.4
> <  Comm. energies         4    6       1037       0.170          9.758
>  0.1
> <  Rest                                           0.753         43.281
>  0.4
> ---
> >  Domain decomp.         4    6        818       4.527        260.126
>  2.5
> >  DD comm. load          4    6        817       0.004          0.225
>  0.0
> >  DD comm. bounds        4    6        818       0.055          3.182
>  0.0
> >  Neighbor search        4    6        818       2.497        143.466
>  1.4
> >  Launch GPU ops.        4    6      65362       3.204        184.088
>  1.8
> >  Comm. coord.           4    6      31863       5.387        309.550
>  3.0
> >  Force                  4    6      32681      48.870       2808.314
> 27.4
> >  Wait + Comm. F         4    6      32681       5.033        289.221
>  2.8
> >  PME mesh               4    6      32681      71.720       4121.326
> 40.2
> >  Wait GPU nonlocal      4    6      32681       0.393         22.593
>  0.2
> >  Wait GPU local         4    6      32681       0.112          6.457
>  0.1
> >  NB X/F buffer ops.     4    6     129088       2.409        138.419
>  1.3
> >  Write traj.            4    6          3       0.012          0.705
>  0.0
> >  Update                 4    6      65362       9.216        529.577
>  5.2
> >  Constraints            4    6      65362      23.384       1343.761
> 13.1
> >  Comm. energies         4    6       3269       0.234         13.446
>  0.1
> >  Rest                                           1.433         82.373
>  0.8
> 571c571
> <  Total                                        179.571      10318.922
> 100.0
> ---
> >  Total                                        178.490      10256.829
> 100.0
> 575,579c575,579
> <  PME redist. X/F        4    6      20722      12.505        718.602
>  7.0
> <  PME spread/gather      4    6      20722      35.759       2054.864
> 19.9
> <  PME 3D-FFT             4    6      20722      18.237       1047.978
> 10.2
> <  PME 3D-FFT Comm.       4    6      20722       4.554        261.665
>  2.5
> <  PME solve Elec         4    6      10361       0.894         51.350
>  0.5
> ---
> >  PME redist. X/F        4    6      65362       8.981        516.074
>  5.0
> >  PME spread/gather      4    6      65362      30.705       1764.430
> 17.2
> >  PME 3D-FFT             4    6      65362      15.356        882.437
>  8.6
> >  PME 3D-FFT Comm.       4    6      65362      13.560        779.189
>  7.6
> >  PME solve Elec         4    6      32681       2.992        171.962
>  1.7
> 583c583
> <        Time:     1786.804      179.571      995.0
> ---
> >        Time:     4282.038      178.490     2399.0
> 585,586c585,586
> < Performance:        9.970        2.407
> < Finished mdrun on rank 0 Tue Jul 26 09:19:29 2016
> ---
> > Performance:       31.639        0.759
> > Finished mdrun on rank 0 Tue Jul 26 09:44:14 2016

