[gmx-users] PME grid parameters in large system run in parallel in Gromacs 4 CVS

Berk Hess gmx3 at hotmail.com
Wed Feb 27 12:09:11 CET 2008

> From: larsson at xray.bmc.uu.se
> Date: Wed, 27 Feb 2008 11:48:30 +0100
> To: gmx-users at gromacs.org
> Subject: [gmx-users] PME grid parameters in large system run in parallel in Gromacs 4 CVS 
> Ok, this will be a long one.
> Currently I am simulating large systems (20-35 nm sides; rhombic  
> dodecahedron or rectangular boxes) in the latest CVS version of  
> Gromacs on 70-500 nodes using PME to calculate long range  
> electrostatic interactions. Now I am looking for some advice  
> regarding how to set up my system for good performance.
> There is a correlation between the short range cutoff (rlist/rcoulomb/ 
> rvdw), PME grid spacing (fourierspacing, fourier_nx/fourier_ny/ 
> fourier_nz) and the PME interpolation order (pme_order). Grompp tells  
> me that for best performance there should be about 25-33% load on the  
> dedicated PME nodes compared to the regular particle-particle nodes.  
> I guess this value is due to the communication overhead of the PME  
> nodes.

No, this guideline comes from balancing the computational cost: with dedicated
PME nodes, the PP and PME parts run concurrently, so the fraction of nodes
given to PME should match the fraction of the total work that PME represents.
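To illustrate the balance (this is just back-of-the-envelope arithmetic, not anything grompp or mdrun actually runs):

```python
# Illustrative only: why grompp's ~25-33% PME load guideline balances work.
# With dedicated PME nodes, the PP and PME parts run concurrently, so wall
# time is set by whichever group finishes last.  If PME is a fraction f of
# the total cost, giving PME roughly that fraction of the nodes balances them.

def balanced_pme_nodes(total_nodes, pme_load):
    """Rough estimate of PME-only nodes for a given estimated PME load."""
    return round(total_nodes * pme_load)

# e.g. 100 nodes with an estimated PME load of 0.28
print(balanced_pme_nodes(100, 0.28))  # -> 28
```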

> Two of my setups are:
> Dimensions: 24.12032  24.12032  17.05213   0.00000   0.00000    
> 0.00000   0.00000  12.06016  12.06016
> Fourier grid spacing: 0.137 x 0.137 x 0.137
> Cutoffs: 1.1
> Interpolation order: 4
> Estimated PME load: 0.28
> Dimensions: 23.53603  24.05188  35.23882
> Fourier grid spacing: 0.134 0.137 0.133
> Cutoffs: 1.1
> Interpolation order: 4
> Estimated PME load: 0.51
> So I am interested to hear from you how far I can push the fourier  
> grid spacing before I lose accuracy in terms of force and energy.  
> How much must I compensate with increased interpolation and increased  
> cutoff? Cutoff will put more load on the PP nodes which is what I  
> want in this case, but would an increased interpolation order also be  
> required/advantageous? I use the fft optimization option.
> In the original paper on PME (Darden et al. 1993) there is a  
> comparison of PME order 1-4 and grid size of 0.1-0.05 nm which shows  
> that a grid spacing of 0.075 nm and an interpolation order of 3 gives  
> an rms force error of 2x10^-4, which they say is reasonable. Does  
> anyone of you know of a more recent equivalent test on larger systems?

The answer to this is really simple.
I tested it, and the accuracy does not change when you scale
the cut-off and the grid spacing by the same factor.
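The trade-off behind this can be sketched with standard scaling arguments (this is not Gromacs code; the cost relations below are the usual approximations):

```python
# Sketch of the trade-off: scaling the cut-off and the PME grid spacing by
# the same factor keeps the accuracy roughly constant while shifting cost
# between the PP and PME parts.  The pair count within the cut-off grows as
# rc^3, while the number of FFT grid points shrinks as 1/spacing^3.

def relative_costs(scale):
    """Relative PP and PME-grid costs after scaling rc and spacing by `scale`."""
    pp_cost = scale ** 3          # pairs within cut-off ~ rc^3
    pme_grid = 1.0 / scale ** 3   # FFT grid points ~ (box / spacing)^3
    return pp_cost, pme_grid

# Scaling both by 1.1: ~33% more PP work, ~25% fewer grid points,
# which moves load from the PME nodes onto the PP nodes.
pp, pme = relative_costs(1.1)
```

This is exactly the knob to turn when grompp reports too high an estimated PME load: increase the cut-off and the grid spacing together.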

> A somewhat unrelated question is about the heuristics of mdrun when  
> it decides how many PME nodes versus PP nodes it should use and when  
> it does the PP domain decomposition. It would be quite useful to have  
> a tool that could suggest a desirable number of nodes within a given  
> interval. If I can understand in which order the decisions are made  
> and which constraints are imposed I will gladly write such a tool  
> myself. (More than once my simulations have failed after a long time  
> in the queue due to an incompatible number of nodes. A nice feature  
> would be for mdrun to choose not to use some of the nodes if that  
> would improve performance, rather than just dying.)

When you are running over large numbers of nodes and using a large amount
of computational resources, it is better (and worth the effort) to set the number
of PME-only nodes by hand.
It is desirable to use a number of particle-only (PP) nodes that is a multiple
of the number of PME-only nodes. So you choose the number of PME nodes,
which should be compatible with the grid spacing, and then you add
a number of PP nodes which is 3x or 4x the number of PME nodes, depending on
what grompp has told you, or better, on what you determined from a (very short)
benchmark run.
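As a starting point for the tool the original poster offered to write, the constraints above can be enumerated directly (a hypothetical helper, not part of Gromacs, and it ignores the additional grid-compatibility constraint):

```python
# Hypothetical helper: enumerate node splits matching the advice above,
# i.e. the PP node count is a multiple of the PME-only node count, with a
# PP:PME ratio of 3:1 or 4:1.  The PME node count must additionally be
# compatible with the fourier grid, which this sketch does not check.

def pme_splits(total_nodes, min_ratio=3, max_ratio=4):
    """Return [(npp, npme), ...] with npp + npme == total_nodes,
    npp a multiple of npme, and npp/npme in [min_ratio, max_ratio]."""
    splits = []
    for npme in range(1, total_nodes):
        npp = total_nodes - npme
        if npp % npme == 0 and min_ratio <= npp // npme <= max_ratio:
            splits.append((npp, npme))
    return splits

# e.g. 80 nodes: 64 PP + 16 PME (4:1) or 60 PP + 20 PME (3:1)
print(pme_splits(80))
```

Running a short benchmark for each candidate split would then show which ratio matches the actual PME load of the system.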

Note that version 4 is still not out, and the documentation is not complete.
I am open to suggestions on how to improve the automated setup of mdrun.
But I don't know if it would be better to automatically fall back to an inefficient
setup (such as simply not using some of the cpus) instead of producing a fatal error.
For small numbers of cpus things should nearly always work, while for large
numbers of cpus the user should do a bit of work to get the optimal performance
out of the large computational resources.

