[gmx-users] On what scale will simulation with PME-dedicated nodes perform better?

Mark Abraham mark.j.abraham at gmail.com
Fri Sep 19 13:17:32 CEST 2014

On Fri, Sep 19, 2014 at 5:35 AM, Theodore Si <sjyzhxw at gmail.com> wrote:

> Hi all,
> I ran GROMACS 4.6 on 5 nodes (each has 16 CPU cores and 2 Nvidia K20m GPUs)
> and on 4 nodes, in the following ways:
> 5 nodes:
> 1. Each node has 8 MPI processes, and use one node as PME-dedicated node
> 2. Each node has 8 MPI processes, and use two nodes as PME-dedicated nodes
> 3. Each node has 4 MPI processes, and use one node as PME-dedicated node
> In these settings, the log files complain that PME nodes have more work to
> do than PP nodes, and the average imbalance is 20% - 40%.
> 4 nodes:
> Each node has 8 MPI processes, and there is no PME-dedicated node
> In the log file, the PME mesh wall time is about half of that in the
> settings above. My guess is that my run is too small in scale for
> PME-dedicated nodes to do any good.

Can't comment on that without knowing about the size of your system and
whether you have Infiniband.

> So, on what condition should I set PME nodes manually?

Assuming each node has two CPU sockets, and thus 8 cores per socket, I
suspect you will do best with a PP rank and a PME rank *on each socket*,
with each PP rank mapped to a single GPU, and OpenMP used to split the cores
in a socket between the PP and PME ranks, e.g. 5:3 or 4:4. That is based on
a very quick understanding of some results Carsten Kutzner shared recently
(do correct me if I'm wrong, Carsten). This kind of setup
* uses all GPUs,
* avoids overheads from sharing GPUs between ranks,
* takes advantage of any socket<->GPU locality effects, and
* might mean you get to spend some quality time with your MPI documentation
to work out how to get it done
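
As a rough sketch of the arithmetic behind such a split, the snippet below
derives the rank and thread counts from the node shape. The function name
plan_split and its parameters are made up for illustration and are not part
of GROMACS; it assumes one PP rank and one PME rank per socket, as described
above.

```python
def plan_split(nodes, sockets_per_node, cores_per_socket, pp_threads):
    """Hypothetical helper: one PP rank + one PME rank per socket.

    pp_threads is the number of cores in each socket given to the PP rank;
    the PME rank on that socket gets the remaining cores.
    """
    if not 0 < pp_threads < cores_per_socket:
        raise ValueError("both ranks on a socket need at least one core")
    pme_threads = cores_per_socket - pp_threads
    pp_ranks = nodes * sockets_per_node   # one PP rank per socket
    pme_ranks = pp_ranks                  # one PME rank per socket
    return {
        "total_ranks": pp_ranks + pme_ranks,
        "pme_ranks": pme_ranks,
        "pp_threads": pp_threads,
        "pme_threads": pme_threads,
    }


# 5 nodes, 2 sockets per node, 8 cores per socket, 5:3 split
print(plan_split(5, 2, 8, 5))
# → {'total_ranks': 20, 'pme_ranks': 10, 'pp_threads': 5, 'pme_threads': 3}
```

The four values correspond to what one would pass as the total MPI rank
count and to the mdrun options -npme, -ntomp and -ntomp_pme; a 4:4 split
only changes the last two.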

Ideally, on 5 such nodes something like

mpirun -np 20 mdrun -npme 10 -ntomp 5 -ntomp_pme 3 -gpu_id 01

would work, which takes advantage of the (default) mdrun -ddorder
interleave and hopefully the setup of your MPI installation, so that the
ranks get laid out PP-PME-PP-PME within each node and that maps to filling
sockets in the desired way.
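
To make the intended layout concrete, here is a small sketch that lists
where each rank would end up. It is only an illustration under stated
assumptions: that the MPI launcher places consecutive ranks on the same node
and fills socket 0 before socket 1, that interleaving alternates PP and PME
ranks starting with PP, and that GPU i is the one local to socket i. None of
that is guaranteed by a given MPI installation, and the function name layout
is hypothetical.

```python
def layout(nodes=5, sockets_per_node=2):
    """Return (rank, node, socket, role, gpu) tuples for a PP-PME-PP-PME layout.

    Assumes block placement of ranks across nodes and two ranks per socket;
    gpu is None for PME ranks, which do not use a GPU here.
    """
    ranks_per_node = 2 * sockets_per_node
    rows = []
    for rank in range(nodes * ranks_per_node):
        node, local = divmod(rank, ranks_per_node)   # position within the node
        socket, slot = divmod(local, 2)              # two ranks per socket
        role = "PP" if slot == 0 else "PME"
        gpu = socket if role == "PP" else None       # assumed socket<->GPU pairing
        rows.append((rank, node, socket, role, gpu))
    return rows


for row in layout()[:4]:   # the four ranks on the first node
    print(row)
```

If the real placement differs (for example round-robin across nodes), the
same check can be done against whatever rank-to-core report your MPI
launcher provides.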

In our ideal world, needing to choose a static split for the cores within
sockets and ranks within nodes would go away, but that's (a lot of) work in

