[gmx-users] Best performance with 0 cores for PME calculation
Mark Abraham
Mark.Abraham at anu.edu.au
Sat Jan 10 02:45:29 CET 2009
Nicolas wrote:
> Hello,
>
> I'm trying to do a benchmark with Gromacs 4 on our cluster, but I don't
> completely understand the results I obtain. The system I used is a
> 128-lipid DOPC bilayer hydrated by ~18800 SPC waters, for a total of
> ~70200 atoms. The size of the system is 9.6x9.6x10.1 nm^3. I'm using
> the following parameters:
>
> * nstlist = 10
> * rlist = 1
> * coulombtype = PME
> * rcoulomb = 1
> * fourier_spacing = 0.12
> * vdwtype = Cut-off
> * rvdw = 1
>
> The cluster itself has 2 processors/node connected by 100 Mbit/s Ethernet.
Ethernet and Gigabit Ethernet are not fast enough to get reasonable
scaling. There have been quite a few posts on this topic in the last six
months.
Hmm, I see you've corrected your post to refer to InfiniBand with four
cores/node. That should be reasonable, as I understand it (but search
the archive).
You should also check that your benchmark calculation is long enough
that the measured simulation time isn't dominated by setup costs. Some
years ago a non-MD sysadmin complained of poor scaling when he was
testing over only 10 or so MD steps!
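For example (the numbers below are only illustrative - the point is
simply that the run must be much longer than the setup and
load-balancing phase), an .mdp fragment like

    ; benchmark length - values are only illustrative
    integrator = md
    dt         = 0.002    ; 2 fs time step
    nsteps     = 25000    ; ~50 ps, so that startup and dynamic load
                          ; balancing are a small fraction of the
                          ; measured wall time

gives the dynamic load balancing time to settle before you read off
the ns/day figure.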
> I'm using mpiexec to run Gromacs. When I use -npme 2 -ddorder
> interleave, I get:
> ncore   Perf (ns/day)   PME (%)
>
>     1        0.00           0
>     2        0.00           0
>     3        0.00           0
>     4        1.35          28
>     5        1.84          31
>     6        2.08          27
>     8        2.09          21
>    10        2.25          17
>    12        2.02          15
>    14        2.20          13
>    16        2.04          11
>    18        2.18          10
>    20        2.29           9
>
> So, above 6-8 cores, the PP nodes spend too much time waiting for the
> PME nodes and the performance reaches a plateau.
That's not surprising - the heuristic is that about a quarter to a third
of the cores want to be PME-only nodes. Of course, that depends on the
relative sizes of the real- and reciprocal-space parts of the calculation.
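As a rough sketch of what that heuristic means in practice (the binary
name mdrun_mpi and the file topol.tpr are just placeholders for whatever
you actually use), on 16 cores you would dedicate about 4 of them to PME:

    # illustrative only: 12 PP ranks + 4 PME-only ranks on 16 cores
    mpiexec -np 16 mdrun_mpi -s topol.tpr -npme 4 -ddorder interleave

and then vary -npme around that value to find the sweet spot for your
system and network.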
> When I use -npme 0, I get:
>
> ncore   Perf (ns/day)   PME (%)
>     1        0.43          33
>     2        0.92          34
>     3        1.34          35
>     4        1.69          36
>     5        2.17          33
>     6        2.56          32
>     8        3.24          33
>    10        3.84          34
>    12        4.34          35
>    14        5.05          32
>    16        5.47          34
>    18        5.54          37
>    20        6.13          36
>
> I obtain much better performance when there are no PME nodes, whereas
> I was expecting the opposite. Does someone have an explanation for
> that? Does that mean domain decomposition is useless below a certain
> real-space cutoff? I'm quite confused.
The relevant observations are those for 4, 5, 6 and 8 cores, for which
shared duty outperforms -npme 2. I think your observations support the
conclusion that your network hardware is more limiting for PME-only
nodes than for shared-duty nodes. They don't support the conclusion that
domain decomposition (DD) is useless, since you haven't compared it with
particle decomposition (PD).
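If you want to make that comparison, mdrun can be asked to use particle
decomposition with the -pd flag (check mdrun -h for your build; the names
below are again just placeholders):

    # illustrative only: same system, but with particle decomposition
    mpiexec -np 8 mdrun_mpi -pd -s topol.tpr -deffnm bench_pd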
You can play with the PME parameters to shift more load into the
real-space part - IIRC Carsten suggested a heuristic a few months back.
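The general idea (only a sketch - the factor of 1.2 is arbitrary, so
check both accuracy and performance for your own system) is to scale the
Coulomb cut-off and the Fourier grid spacing by the same factor, which
keeps the PME accuracy roughly constant while moving work from the
reciprocal-space grid onto the real-space (PP) side:

    ; illustrative only: shift load from reciprocal to real space
    rlist           = 1.2      ; was 1.0, scaled together with rcoulomb
    rcoulomb        = 1.2      ; was 1.0
    fourier_spacing = 0.144    ; was 0.12, scaled by the same factor (1.2)
    ; check grompp's notes/warnings - depending on the cut-off setup it
    ; may also want rvdw kept consistent with rlist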
Mark