[gmx-users] mpirun -npernode option gives gromacs slowdown unless used with mpirun option -np or gromacs option -ntomp 1

Christopher Neale chris.neale at alum.utoronto.ca
Tue Jul 26 22:35:04 CEST 2016


Dear Users:

This is simply an informational post in case somebody runs into similar trouble in the future. I don't understand why the usage must be this way, but empirically it works.

I find that when I use (A) "mpirun -np 4 gmx_mpi -ntomp 6" I get 32 ns/day. However, if I instead use (B) "mpirun -npernode 4 gmx_mpi -ntomp 6" I get only 10 ns/day. Finally, if I use (C) both the -npernode and -np options to mpirun, "mpirun -np 4 -npernode 4 gmx_mpi -ntomp 6", then I again get 32 ns/day. A diff of the .log files from option B (no-np.log) and option C (yes-np.log) doesn't contain any obvious clues as to how things were set up differently with gromacs in the two cases (see end of this post).
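One line in the diff at the end of this post that may be worth a closer look is the note about logical cores: no-np.log says OpenMP reported 2 cores, while yes-np.log says 12. Whether that difference explains the slowdown is an assumption, not something the logs confirm. A minimal Python sketch for pulling that note out of a .log file (the function name is made up for illustration; the pattern is copied from the wording in the diff):

```python
import re
import sys

# Wording of the note as it appears in the diff at the end of this post.
NOTE = re.compile(
    r"Number of logical cores detected \((\d+)\) does not match "
    r"the number reported by OpenMP \((\d+)\)"
)

def openmp_core_note(log_text):
    """Return (detected, reported_by_openmp), or None if the note is absent."""
    match = NOTE.search(log_text)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))

if __name__ == "__main__":
    # Usage: python3 this_script.py no-np.log yes-np.log
    for path in sys.argv[1:]:
        with open(path) as handle:
            print(path, openmp_core_note(handle.read()))
```

Running it on both logs before a long production run is a cheap way to spot the two cases apart.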

Why bother with the -npernode option at all? Because I was having trouble getting -multi to work with GPUs otherwise. I found that if I simply used "mpirun -np 8 gmx_mpi -ntomp 6 -multi 2" then both jobs got put on the first node and the second allocated node was left empty. Therefore, the only way that I can find to run CPU/GPU jobs efficiently with the -multi keyword is to use "mpirun -npernode" (which seems to be required to get a good distribution of processes across nodes), and then for some reason this leads to performance degradation in gromacs.
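To see where mpirun actually puts the processes, and how many CPUs each one is allowed to run on, a small stand-in program can be launched in place of gmx_mpi. The following is only a sketch: it assumes Linux (os.sched_getaffinity is not available elsewhere) and Open MPI (which sets the OMPI_COMM_WORLD_RANK environment variable); the function and script names are hypothetical.

```python
import os
import socket

def placement_report(environ=os.environ):
    """Describe where this process runs and which CPUs it may use."""
    # Open MPI's mpirun sets OMPI_COMM_WORLD_RANK; "?" means not under mpirun.
    rank = environ.get("OMPI_COMM_WORLD_RANK", "?")
    # Linux-only: the set of CPU ids this process is allowed to run on.
    allowed = sorted(os.sched_getaffinity(0))
    return "host=%s rank=%s allowed_cpus=%d %s" % (
        socket.gethostname(), rank, len(allowed), allowed)

if __name__ == "__main__":
    print(placement_report())
```

Saved as, say, where_am_i.py, it could be run as "mpirun -npernode 4 python3 where_am_i.py" and again as "mpirun -np 4 -npernode 4 python3 where_am_i.py" to compare the two launch styles. Open MPI's own --report-bindings option to mpirun should give similar information.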

I found two ways to get good performance from gromacs with mpirun -npernode. The first is to simply include the -np option as well, for example:

mpirun -np 8 -npernode 4 gmx_mpi mdrun -notunepme -deffnm MD_ -dlb yes -npme 0 -cpt 60 -maxh 0.05 -cpi MD_.cpt -ntomp 6 -gpu_id 0123 -multi 2

I guess that might be obvious to some people, but the Open MPI man page for mpirun reads to me as if -npernode is an alternative to -np rather than an augmentation.
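The arithmetic behind that command line can be sketched as follows. This is an illustration under assumptions: the helper name is invented, the default of 24 logical cores per node is taken from the "Number of logical cores detected (24)" line in the logs, and -multi is assumed to split the ranks evenly between the simulations.

```python
def layout(np_total, npernode, ntomp, multi=1, logical_cores=24):
    """Work out nodes, ranks per simulation and threads per node."""
    if np_total % npernode:
        raise ValueError("-np should be a multiple of -npernode")
    if np_total % multi:
        raise ValueError("-np should be a multiple of -multi")
    threads_per_node = npernode * ntomp
    return {
        "nodes": np_total // npernode,
        "ranks_per_simulation": np_total // multi,
        "threads_per_node": threads_per_node,
        "oversubscribed": threads_per_node > logical_cores,
    }

# The command above: -np 8 -npernode 4 -ntomp 6 -multi 2
print(layout(8, 4, 6, multi=2))
# -> {'nodes': 2, 'ranks_per_simulation': 4, 'threads_per_node': 24, 'oversubscribed': False}
```

So 8 ranks at 4 per node fill two nodes, each simulation gets 4 ranks, and 4 ranks x 6 OpenMP threads use exactly the 24 logical cores of a node.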

The second alternative is this:

mpirun -bind-to core:overload-allowed -npernode 24 gmx_mpi mdrun -notunepme -deffnm MD_ -dlb yes -npme 0 -cpt 60 -maxh 0.05 -cpi MD_.cpt -ntomp 1 -gpu_id 000000111111222222333333 -multi 2

where the "-bind-to core:overload-allowed" option is probably only required with hyperthreading.
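The -gpu_id strings in the two commands follow a simple pattern: one digit per rank on a node, with the ranks divided evenly over the GPUs. A minimal sketch that builds such a string (the helper name is hypothetical, and it assumes single-digit GPU ids and an even split, as in both commands above):

```python
def gpu_id_string(ranks_per_node, n_gpus):
    """Build a -gpu_id string mapping consecutive ranks to GPUs 0..n_gpus-1."""
    if not 1 <= n_gpus <= 10:
        raise ValueError("this sketch only handles single-digit GPU ids")
    if ranks_per_node % n_gpus:
        raise ValueError("ranks per node should be a multiple of the GPU count")
    per_gpu = ranks_per_node // n_gpus
    # One character per rank: GPU 0 repeated per_gpu times, then GPU 1, ...
    return "".join(str(gpu) * per_gpu for gpu in range(n_gpus))

print(gpu_id_string(4, 4))   # -> 0123
print(gpu_id_string(24, 4))  # -> 000000111111222222333333
```

With 4 ranks per node each rank gets its own GPU; with 24 single-threaded ranks per node, six ranks share each of the four GPUs.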

$ diff no-np.log yes-np.log 

1,2c1,2
< Log file opened on Tue Jul 26 09:16:25 2016
< Host: node001  pid: 21102  rank ID: 0  number of ranks:  4
---
> Log file opened on Tue Jul 26 09:41:11 2016
> Host: node001  pid: 22346  rank ID: 0  number of ranks:  4
66c66
< Number of logical cores detected (24) does not match the number reported by OpenMP (2).
---
> Number of logical cores detected (24) does not match the number reported by OpenMP (12).
162c162
<    ld-seed                        = 3114924391
---
>    ld-seed                        = 4178809860
429,430c429,430
< RMS relative constraint deviation after constraining: 3.46e-06
< Initial temperature: 309.815 K
---
> RMS relative constraint deviation after constraining: 3.42e-06
> Initial temperature: 310.938 K
432c432
< Started mdrun on rank 0 Tue Jul 26 09:16:29 2016
---
> Started mdrun on rank 0 Tue Jul 26 09:41:15 2016
442c442
<    -4.90630e+05    1.68715e+05   -3.21916e+05    3.21682e+02   -3.95211e+01
---
>    -4.90630e+05    1.69398e+05   -3.21232e+05    3.22985e+02   -5.09389e+01
444c444
<     4.26491e-06
---
>     4.34646e-06
446c446
< DD  step 39  vol min/aver 1.000  load imb.: force  4.4%
---
> DD  step 39  vol min/aver 1.000  load imb.: force  7.4%
449c449
< Step 10320: Run time exceeded 0.050 hours, will terminate the run
---
> Step 32640: Run time exceeded 0.050 hours, will terminate the run
451c451
<           10360       20.72000        0.00000
---
>           32680       65.36000        0.00000
453c453
< Writing checkpoint, step 10360 at Tue Jul 26 09:19:28 2016
---
> Writing checkpoint, step 32680 at Tue Jul 26 09:44:13 2016
458c458
<     1.22604e+04    5.70873e+04    3.77247e+04    8.94662e+02   -1.42204e+03
---
>     1.22192e+04    5.73049e+04    3.79451e+04    9.68468e+02   -1.52979e+03
460c460
<     8.43362e+03   -1.28393e+04    8.15360e+03   -6.05532e+05    2.67908e+03
---
>     8.57418e+03   -1.24316e+04    1.00830e+04   -6.06735e+05    2.66427e+03
462c462
<    -4.92560e+05    1.62211e+05   -3.30349e+05    3.09282e+02   -2.47940e+02
---
>    -4.90937e+05    1.62776e+05   -3.28161e+05    3.10359e+02    2.17014e+02
470c470
< 	Statistics over 10361 steps using 104 frames
---
> 	Statistics over 32681 steps using 327 frames
474c474
<     1.21421e+04    5.71798e+04    3.80964e+04    9.85006e+02   -1.46855e+03
---
>     1.20981e+04    5.70274e+04    3.79428e+04    9.64206e+02   -1.50478e+03
476c476
<     8.62873e+03   -1.26960e+04    9.41233e+03   -6.06182e+05    2.71017e+03
---
>     8.61301e+03   -1.24384e+04    9.36526e+03   -6.05690e+05    2.70108e+03
478c478
<    -4.91192e+05    1.62881e+05   -3.28311e+05    3.10559e+02    7.81796e+00
---
>    -4.90921e+05    1.62879e+05   -3.28042e+05    3.10555e+02    6.78073e+00
483c483
<     7.40515e+00    7.40515e+00    1.03635e+01
---
>     7.40100e+00    7.40100e+00    1.03708e+01
486,488c486,488
<     5.33098e+04    4.30370e+02    1.93729e+02
<     4.29164e+02    5.35544e+04   -1.86077e+02
<     1.93524e+02   -1.83077e+02    5.56151e+04
---
>     5.33490e+04    1.51469e+02    6.52398e+01
>     1.52200e+02    5.39549e+04   -1.33843e+02
>     6.68338e+01   -1.34647e+02    5.52271e+04
491,493c491,493
<     2.34064e+01   -2.47322e+01   -1.67206e+01
<    -2.46617e+01    1.68480e+01    9.55039e+00
<    -1.67086e+01    9.37497e+00   -1.68005e+01
---
>     2.16749e+01   -1.01125e+01   -4.92329e+00
>    -1.01552e+01   -1.03203e+01    1.21038e+01
>    -5.01651e+00    1.21508e+01    8.98758e+00
505,530c505,530
<  NB VdW [V&F]                           456.951183         456.951     0.0
<  Pair Search distance check            2229.399952       20064.600     0.0
<  NxN Ewald Elec. + LJ [F]            710016.305728    55381271.847    95.0
<  NxN Ewald Elec. + LJ [V&F]            7268.991360      937699.885     1.6
<  1,4 nonbonded interactions             639.781389       57580.325     0.1
<  Calc Weights                          1812.449730       65248.190     0.1
<  Spread Q Bspline                     38665.594240       77331.188     0.1
<  Gather F Bspline                     38665.594240      231993.565     0.4
<  3D-FFT                              151434.386688     1211475.094     2.1
<  Solve PME                               42.438656        2716.074     0.0
<  Reset In Box                            15.160600          45.482     0.0
<  CG-CoM                                  15.218910          45.657     0.0
<  Bonds                                   97.983977        5781.055     0.0
<  Propers                                736.822515      168732.356     0.3
<  Impropers                               10.682191        2221.896     0.0
<  Virial                                  60.654130        1091.774     0.0
<  Update                                 604.149910       18728.647     0.0
<  Stop-CM                                  6.122550          61.225     0.0
<  P-Coupling                              60.409160         362.455     0.0
<  Calc-Ekin                              120.934940        3265.243     0.0
<  Lincs                                  302.400510       18144.031     0.0
<  Lincs-Mat                             2689.009088       10756.036     0.0
<  Constraint-V                          1363.641087       10909.129     0.0
<  Constraint-Vir                          53.106186        1274.548     0.0
<  Settle                                 252.968620       81708.864     0.1
<  (null)                                   3.108300           0.000     0.0
---
>  NB VdW [V&F]                          1441.330143        1441.330     0.0
>  Pair Search distance check            6990.356112       62913.205     0.0
>  NxN Ewald Elec. + LJ [F]           2236130.475520   174418177.091    95.0
>  NxN Ewald Elec. + LJ [V&F]           22669.995648     2924429.439     1.6
>  1,4 nonbonded interactions            2018.019069      181621.716     0.1
>  Calc Weights                          5716.887330      205807.944     0.1
>  Spread Q Bspline                    121960.263040      243920.526     0.1
>  Gather F Bspline                    121960.263040      731761.578     0.4
>  3D-FFT                              477659.221248     3821273.770     2.1
>  Solve PME                              133.861376        8567.128     0.0
>  Reset In Box                            47.697580         143.093     0.0
>  CG-CoM                                  47.755890         143.268     0.0
>  Bonds                                  309.064217       18234.789     0.0
>  Propers                               2324.109315      532221.033     0.3
>  Impropers                               33.694111        7008.375     0.0
>  Virial                                 191.203810        3441.669     0.0
>  Update                                1905.629110       59074.502     0.0
>  Stop-CM                                 19.125680         191.257     0.0
>  P-Coupling                             190.557080        1143.342     0.0
>  Calc-Ekin                              381.230780       10293.231     0.0
>  Lincs                                  952.147766       57128.866     0.0
>  Lincs-Mat                             8450.181728       33800.727     0.0
>  Constraint-V                          4296.810133       34374.481     0.0
>  Constraint-Vir                         167.277361        4014.657     0.0
>  Settle                                 797.526798      257601.156     0.1
>  (null)                                   9.804300           0.000     0.0
532c532
<  Total                                                58308966.118   100.0
---
>  Total                                               183618728.172   100.0
538,539c538,539
<  av. #atoms communicated per step for force:  2 x 39597.3
<  av. #atoms communicated per step for LINCS:  2 x 2855.3
---
>  av. #atoms communicated per step for force:  2 x 39598.1
>  av. #atoms communicated per step for LINCS:  2 x 2808.9
541,542c541,542
<  Average load imbalance: 2.8 %
<  Part of the total run time spent waiting due to load imbalance: 0.8 %
---
>  Average load imbalance: 0.6 %
>  Part of the total run time spent waiting due to load imbalance: 0.2 %
553,569c553,569
<  Domain decomp.         4    6        260       3.195        183.586   1.8
<  DD comm. load          4    6        259       0.010          0.593   0.0
<  DD comm. bounds        4    6        260       0.059          3.413   0.0
<  Neighbor search        4    6        260       2.627        150.975   1.5
<  Launch GPU ops.        4    6      20722       1.150         66.067   0.6
<  Comm. coord.           4    6      10101       2.839        163.129   1.6
<  Force                  4    6      10361      54.113       3109.598  30.1
<  Wait + Comm. F         4    6      10361       1.577         90.599   0.9
<  PME mesh               4    6      10361      73.011       4195.535  40.7
<  Wait GPU nonlocal      4    6      10361       0.179         10.297   0.1
<  Wait GPU local         4    6      10361       0.044          2.520   0.0
<  NB X/F buffer ops.     4    6      40924       2.848        163.647   1.6
<  Write traj.            4    6          2       0.017          0.999   0.0
<  Update                 4    6      20722      11.184        642.700   6.2
<  Constraints            4    6      20722      25.794       1482.226  14.4
<  Comm. energies         4    6       1037       0.170          9.758   0.1
<  Rest                                           0.753         43.281   0.4
---
>  Domain decomp.         4    6        818       4.527        260.126   2.5
>  DD comm. load          4    6        817       0.004          0.225   0.0
>  DD comm. bounds        4    6        818       0.055          3.182   0.0
>  Neighbor search        4    6        818       2.497        143.466   1.4
>  Launch GPU ops.        4    6      65362       3.204        184.088   1.8
>  Comm. coord.           4    6      31863       5.387        309.550   3.0
>  Force                  4    6      32681      48.870       2808.314  27.4
>  Wait + Comm. F         4    6      32681       5.033        289.221   2.8
>  PME mesh               4    6      32681      71.720       4121.326  40.2
>  Wait GPU nonlocal      4    6      32681       0.393         22.593   0.2
>  Wait GPU local         4    6      32681       0.112          6.457   0.1
>  NB X/F buffer ops.     4    6     129088       2.409        138.419   1.3
>  Write traj.            4    6          3       0.012          0.705   0.0
>  Update                 4    6      65362       9.216        529.577   5.2
>  Constraints            4    6      65362      23.384       1343.761  13.1
>  Comm. energies         4    6       3269       0.234         13.446   0.1
>  Rest                                           1.433         82.373   0.8
571c571
<  Total                                        179.571      10318.922 100.0
---
>  Total                                        178.490      10256.829 100.0
575,579c575,579
<  PME redist. X/F        4    6      20722      12.505        718.602   7.0
<  PME spread/gather      4    6      20722      35.759       2054.864  19.9
<  PME 3D-FFT             4    6      20722      18.237       1047.978  10.2
<  PME 3D-FFT Comm.       4    6      20722       4.554        261.665   2.5
<  PME solve Elec         4    6      10361       0.894         51.350   0.5
---
>  PME redist. X/F        4    6      65362       8.981        516.074   5.0
>  PME spread/gather      4    6      65362      30.705       1764.430  17.2
>  PME 3D-FFT             4    6      65362      15.356        882.437   8.6
>  PME 3D-FFT Comm.       4    6      65362      13.560        779.189   7.6
>  PME solve Elec         4    6      32681       2.992        171.962   1.7
583c583
<        Time:     1786.804      179.571      995.0
---
>        Time:     4282.038      178.490     2399.0
585,586c585,586
< Performance:        9.970        2.407
< Finished mdrun on rank 0 Tue Jul 26 09:19:29 2016
---
> Performance:       31.639        0.759
> Finished mdrun on rank 0 Tue Jul 26 09:44:14 2016

