[gmx-users] Best performance with 0 cores for PME calculation
Mark Abraham
Mark.Abraham at anu.edu.au
Sat Jan 10 02:45:29 CET 2009
Nicolas wrote:
> Hello,
>
> I'm trying to do a benchmark with Gromacs 4 on our cluster, but I don't
> completely understand the results I obtain. The system I used is a
> 128-lipid DOPC bilayer hydrated by ~18800 SPC waters, for a total of
> ~70200 atoms. The size of the system is 9.6x9.6x10.1 nm^3. I'm using
> the following parameters:
>
> * nstlist = 10
> * rlist = 1
> * coulombtype = PME
> * rcoulomb = 1
> * fourier_spacing = 0.12
> * vdwtype = Cut-off
> * rvdw = 1
>
> The cluster itself has 2 processors/node connected by 100 Mbit/s Ethernet.
Ethernet and Gigabit Ethernet are not fast enough to get reasonable
scaling. There have been quite a few posts on this topic in the last six
months.
Hmm, I see you've corrected your post to refer to InfiniBand with four
cores/node. That should be reasonable, as I understand it (but search
the archive).
You should also check that your benchmark calculation is long enough
that the measured simulation time isn't dominated by setup costs. Some
years ago a non-MD sysadmin complained of poor scaling when he was
testing over only 10 or so MD steps!
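For example (the numbers below are only illustrative - the point is
simply that the run must be much longer than the setup and
load-balancing phase), an .mdp fragment like

    ; benchmark length - values are only illustrative
    integrator = md
    dt         = 0.002    ; 2 fs time step
    nsteps     = 25000    ; ~50 ps, so that startup and dynamic load
                          ; balancing are a small fraction of the
                          ; measured wall time

gives the dynamic load balancing time to settle before you read off
the ns/day figure.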
> I'm using mpiexec to run Gromacs. When I use -npme 2 -ddorder
> interleave, I get:
> ncore   Perf (ns/day)   PME (%)
>
>     1        0.00           0
>     2        0.00           0
>     3        0.00           0
>     4        1.35          28
>     5        1.84          31
>     6        2.08          27
>     8        2.09          21
>    10        2.25          17
>    12        2.02          15
>    14        2.20          13
>    16        2.04          11
>    18        2.18          10
>    20        2.29           9
>
> So, above 6-8 cores, the PP nodes spend too much time waiting for the
> PME nodes and the performance reaches a plateau.
That's not surprising - the heuristic is that about a quarter to a third
of the cores want to be PME-only nodes. Of course, that depends on the
relative sizes of the real- and reciprocal-space parts of the calculation.
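As a rough sketch of what that heuristic means in practice (the binary
name mdrun_mpi and the file topol.tpr are just placeholders for whatever
you actually use), on 16 cores you would dedicate about 4 of them to PME:

    # illustrative only: 12 PP ranks + 4 PME-only ranks on 16 cores
    mpiexec -np 16 mdrun_mpi -s topol.tpr -npme 4 -ddorder interleave

and then vary -npme around that value to find the sweet spot for your
system and network.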
> When I use -npme 0, I get:
>
> ncore   Perf (ns/day)   PME (%)
>     1        0.43          33
>     2        0.92          34
>     3        1.34          35
>     4        1.69          36
>     5        2.17          33
>     6        2.56          32
>     8        3.24          33
>    10        3.84          34
>    12        4.34          35
>    14        5.05          32
>    16        5.47          34
>    18        5.54          37
>    20        6.13          36
>
> I obtain much better performance when there are no PME nodes, whereas
> I was expecting the opposite. Does someone have an explanation for
> that? Does that mean domain decomposition is useless below a certain
> real-space cutoff? I'm quite confused.
The relevant observations are those for 4, 5, 6 and 8 cores, for which
shared duty outperforms -npme 2. I think your observations support the
conclusion that your network hardware is more limiting for PME-only
nodes than for shared-duty nodes. They don't support the conclusion that
domain decomposition (DD) is useless, since you haven't compared it with
particle decomposition (PD).
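If you want to make that comparison, mdrun can be asked to use particle
decomposition with the -pd flag (check mdrun -h for your build; the names
below are again just placeholders):

    # illustrative only: same system, but with particle decomposition
    mpiexec -np 8 mdrun_mpi -pd -s topol.tpr -deffnm bench_pd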
You can play with the PME parameters to shift more load into the
real-space part - IIRC Carsten suggested a heuristic a few months back.
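The general idea (only a sketch - the factor of 1.2 is arbitrary, so
check both accuracy and performance for your own system) is to scale the
Coulomb cut-off and the Fourier grid spacing by the same factor, which
keeps the PME accuracy roughly constant while moving work from the
reciprocal-space grid onto the real-space (PP) side:

    ; illustrative only: shift load from reciprocal to real space
    rlist           = 1.2      ; was 1.0, scaled together with rcoulomb
    rcoulomb        = 1.2      ; was 1.0
    fourier_spacing = 0.144    ; was 0.12, scaled by the same factor (1.2)
    ; check grompp's notes/warnings - depending on the cut-off setup it
    ; may also want rvdw kept consistent with rlist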
Mark