[gmx-users] Too few cells to run on multiple cores

Szilárd Páll pall.szilard at gmail.com
Tue Aug 7 14:27:26 CEST 2018


The domain decomposition has certain algorithmic limits that you can relax,
but as you notice that comes at the cost of deteriorating load balance --
and at a certain point it might come at the cost of simulations aborting
mid-run (if you make -rdd too large). More load imbalance does not
necessarily mean less performance, so if your only way of using more cores
is to "squeeze out" more domains of your system, as long as you get more
performance, that may be fine.
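
As a rough illustration of how that limit is relaxed, the sketch below redoes
the arithmetic from the log quoted further down ("Scaling the initial minimum
size with 1/0.8 (option -dds) = 1.25"). The function name is made up for this
example, and treating -rdd as a drop-in replacement for the bonded limit is an
assumption, not something the log confirms.

```python
def initial_min_cell_size(distance_limit_nm: float, dds: float = 0.8) -> float:
    """Scale a distance limit by 1/dds to get the initial minimum cell size.

    Hypothetical helper: it only mirrors the 1/dds scaling the log reports,
    not the full logic mdrun uses to set up domain decomposition.
    """
    return distance_limit_nm / dds


# Default run: bonded limit of 1.505 nm (rounded in the log), -dds 0.8.
default_size = initial_min_cell_size(1.505, dds=0.8)

# Relaxed run: assumes -rdd 1.2 replaces the bonded limit, with -dds 0.9.
relaxed_size = initial_min_cell_size(1.2, dds=0.9)

print(f"default: {default_size:.3f} nm, relaxed: {relaxed_size:.3f} nm")
```

The default value agrees with the 1.881 nm reported by mdrun up to rounding of
the printed 1.505 nm; the relaxed value is only an estimate under the stated
assumption.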

However, instead of trying to squeeze out more domains, you can actually
use multiple CPU cores per domain; see the -ntomp option and examples here:
For instance, in your case you could use 4 threads per MPI rank and with a
3x2x2 decomposition you'd get 12 domains x 4 threads = 48 threads total.
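
The rank/thread arithmetic above can be sketched as a small enumeration. It
lists every grid that fits within the 3x3x3 maximum from the error message and
divides 48 cores evenly. The function name is invented for this example. It
ignores separate PME ranks and does not predict which grid mdrun would pick.

```python
from itertools import product


def layouts(total_cores, max_cells=(3, 3, 3)):
    """Return (nx, ny, nz, threads_per_rank) tuples that use all cores.

    Hypothetical helper: one rank per domain, every rank gets the same
    number of OpenMP threads, and no separate PME ranks are considered.
    """
    results = []
    for nx, ny, nz in product(*(range(1, m + 1) for m in max_cells)):
        ranks = nx * ny * nz
        if total_cores % ranks == 0:
            results.append((nx, ny, nz, total_cores // ranks))
    return results


options = layouts(48)
for nx, ny, nz, threads in options:
    print(f"{nx}x{ny}x{nz} = {nx * ny * nz} domains x {threads} threads")
```

The 3x2x2 grid with 4 threads per rank is among the results. A full 3x3x3
grid is not, since 27 does not divide 48.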

A few more tips:
- your original cell size limit was due to bonded interactions, so tweaking
LINCS would not help with that
- you can also try to use separate PME ranks; that reserves some cores for
PME work, so the domain decomposition is stretched a bit less, e.g.:
-ntmpi 24 -npme 6 -ntomp 2 will give a 3x3x2 decomposition. The success of
this split will of course depend on the PME load in the system (which is
estimated to be very high -- are you using some non-default settings?)
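
The counting behind the PME example in the last tip, as a small sketch. The
helper name and the returned dictionary are made up for illustration. It only
does the bookkeeping and does not estimate PME load or performance.

```python
def pme_split(ntmpi, npme, ntomp):
    """Count particle-particle (PP) ranks and total threads for a split.

    Hypothetical helper: ntmpi is the total rank count, npme the ranks
    reserved for PME, and ntomp the OpenMP threads per rank.
    """
    pp_ranks = ntmpi - npme
    return {
        "pp_ranks": pp_ranks,            # ranks left for domain decomposition
        "pme_ranks": npme,               # ranks doing only PME work
        "total_threads": ntmpi * ntomp,  # cores used overall
    }


split = pme_split(ntmpi=24, npme=6, ntomp=2)
print(split)
```

With 24 ranks, 6 of them PME-only, 18 ranks remain for the decomposition,
which matches 3x3x2, and 24 ranks x 2 threads uses all 48 cores.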



On Tue, Aug 7, 2018 at 1:52 PM Adrian Devitt-Lee <adriandlee at gmail.com>
wrote:

> Hi,
> I'm having an issue using mdrun in parallel on 48 cores. I'm trying to
> figure out which options I can include in the .mdp file to increase the
> number of cells in my system. The full error message is:
> Initializing Domain Decomposition on 48 ranks
> Dynamic load balancing: auto
> Will sort the charge groups at every domain (re)decomposition
> Initial maximum inter charge-group distances:
>     two-body bonded interactions: 1.368 nm, LJC Pairs NB, atoms 2123 2152
>   multi-body bonded interactions: 0.428 nm, Proper Dih., atoms 1105 1113
> *Minimum cell size due to bonded interactions: 1.505 nm*
> Maximum distance for 7 constraints, at 120 deg. angles, all-trans: 0.219 nm
> Estimated maximum distance required for P-LINCS: 0.219 nm
> Guess for relative PME load: 0.65
> Using 0 separate PME ranks, as guessed by mdrun
> Scaling the initial minimum size with 1/0.8 (option -dds) = 1.25
> *Optimizing the DD grid for 48 cells with a minimum initial size of 1.881 nm
> The maximum allowed number of cells is: X 3 Y 3 Z 3*
> -------------------------------------------------------
> Program mdrun_mpi, VERSION 5.1
> Source code file:
> /work/y07/y07/gmx/5.1-phase2/source/src/gromacs/domdec/domdec.cpp, line:
> 6969
> Fatal error:
> There is no domain decomposition for 48 ranks that is compatible with the
> given box and a minimum cell size of 1.88133 nm
> I understand the problem -- the system wants to assign one core to each
> grid cell, but there are only 3x3x3 = 27 cells. I don't know what I can do
> to fix this problem. The system is a solvated protein bound to a ligand in
> a ~15 nm box, and it has > 40,000 atoms. I have tried changing the
> lincs-order and fourier-spacing to no avail.
> I was able to get the system to run by adding the following flags to mdrun:
>   -rdd 1.2 -dds 0.9
> But when I did this, the force imbalance went to > 140%, and 10-30% of the
> cpu time was lost due to load imbalance.
> Can someone suggest how I could edit my .mdp file to increase the number of
> allowed cells?