[gmx-users] Re: The number of PME nodes

Mon Jun 23 14:11:11 CEST 2008

Hi Xuji,

xuji wrote:
> 
> Hi Carsten:
>  
>     First, thanks for your reply.
>     Second, someone on the mailing list said: "Please cut of your email 
> from the digest. It confuses people." And it seems that nobody else 
> is interested in this topic. So I think it may be better to 
> discuss this problem by private email.
I think Yang Ye simply meant that you should choose a meaningful subject,
not "Re: [gmx-developers]". I am also forwarding this to the users mailing
list, which is the best place for this discussion.

>     Now to the problem. I chose 6 as the number of PME nodes because I 
> have 3 nodes (each node has 8 CPUs).
>     I also tested other numbers of PME nodes. By default mdrun uses 12 
> PME nodes here. That is, when I run 'mdrun' with the 
> command line 'mpiexec -machinefile ./mf -n 24 mdrun -v -dlb -s md1.tpr 
> -o md1.trr -g md1.log  -e md1.edr -x md1.xtc>& md1.job &' the number of 
> PME nodes is 12. The CPU occupancy is almost the same for the PME nodes 
> and the DD (domain decomposition) nodes.
>     When I run with 'mpiexec -machinefile ./mf -n 24 mdrun -v -dd 7 3 1 
> -npme 3 -dlb -s md1.tpr -o md1.trr -g md1.log  -e md1.edr -x md1.xtc>& 
> md1.job &', each node has one CPU doing PME. The CPU occupancy of the PME 
> nodes is higher than that of the DD nodes.
>     When I run with 'mpiexec -machinefile ./mf -n 24 mdrun -v -dd 6 3 1 
> -npme 6 -dlb -s md1.tpr -o md1.trr -g md1.log  -e md1.edr -x md1.xtc>& 
> md1.job &', each node has two CPUs doing PME. The CPU occupancy of the PME 
> nodes is nearly the same as that of the DD nodes.
>     In all three cases the CPU occupancy is only about 10.0%, which is 
> very low.
>     I also tested 16 CPUs on two nodes, again with 6 PME nodes. The 
> PME nodes run longer than the DD nodes, which I think means the workload 
> of the PME nodes is higher than that of the DD nodes. The occupancy of 
> each CPU is about 50.0%.
>     When you run your program, how do you determine the number of PME 
> nodes? And how do you choose the DD mode: one-dimensional, 
> two-dimensional, or three-dimensional? I think the DD method also 
> affects the efficiency.
Ok, here is what I would do in your case:

1. Start simulations on 1, 2, 4, 8, 16, and 24 CPUs without PME nodes first
   and write down how many nanoseconds you get per day. Try different
   DD grids to see which one is the fastest. It is usually a good idea to
   have the largest number in the x-direction, because this makes it easier
   to redistribute the data for the PME calculation, which needs slabs in
   the x-direction. But this is not necessarily the case. Just play around
   with the values to get a feeling for it.
2. Calculate the scaling and/or speedup for 1-24 CPUs. The point where the
   scaling breaks down helps to identify where the problem might be.
3. If the scaling is really bad, the problem has to be found first. PME
   nodes will not miraculously turn a bad scaling into a perfect one.
4. If the scaling is as expected, see whether you can improve it further by
   using PME/PP node separation. Do not look at the CPU occupancy, since a
   high value does not necessarily mean good scaling! It is better to use the
   GROMACS ns/day values at the end of the md.log file. For balancing the PME
   against the PP nodes, look at the load balancing info you get every 100
   steps (if -v is turned on). Try to approach a pme/F ratio near 1, which
   means the PME and the (short-range) force calculation take approximately
   the same amount of time. Also play around with the -ddorder parameter: it
   could be a good idea to have a whole node (8 CPUs) do PME, while the other
   one or two nodes do PP. This way the communication-intensive FFT can make
   use of fast shared-memory communication.
5. Be aware that dynamic load balancing may take some time until it has
   converged to optimal cell sizes. The "imb F" just before the "pme/F"
   value tells you how big the imbalance in the force calculation between
   the individual processes still is. It should converge to about 1%.
6. Choose enough time steps for the benchmarks that you can neglect the
   initial phase, where the load balancing adaptations take place
   (maybe 1000 time steps, possibly more).
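
The bookkeeping for steps 1 and 2 can be sketched in a few lines of Python.
This is only an illustration under stated assumptions: the helper names
dd_grids and scaling are made up here and are not part of GROMACS, and the
ns/day numbers in the example are placeholders, not measurements. dd_grids
lists the candidate "-dd nx ny nz" grids for a given number of PP processes
(largest x-dimension first), and scaling turns a table of ns/day values into
speedup and parallel efficiency relative to the smallest run.

```python
# Illustrative sketch only: helper names are hypothetical, not GROMACS tools.

def dd_grids(n_pp):
    """Return all (nx, ny, nz) with nx*ny*nz == n_pp and nx >= ny >= nz."""
    grids = []
    for nz in range(1, n_pp + 1):
        if n_pp % nz:
            continue
        rest = n_pp // nz
        for ny in range(nz, rest + 1):
            if rest % ny:
                continue
            nx = rest // ny
            if nx >= ny:
                grids.append((nx, ny, nz))
    # Largest x-dimension first, as a starting order for the benchmarks.
    return sorted(grids, reverse=True)


def scaling(ns_per_day):
    """Map CPU count -> (speedup, efficiency) relative to the smallest run.

    ns_per_day is a dict {number of CPUs: measured ns/day}.
    """
    base_n = min(ns_per_day)
    base_perf = ns_per_day[base_n]
    result = {}
    for n in sorted(ns_per_day):
        speedup = ns_per_day[n] / base_perf
        efficiency = speedup / (n / base_n)
        result[n] = (speedup, efficiency)
    return result


if __name__ == "__main__":
    # Candidate grids for 18 PP processes (24 CPUs minus 6 PME nodes).
    for nx, ny, nz in dd_grids(18):
        print(f"-dd {nx} {ny} {nz} -npme 6")

    # Placeholder ns/day values; replace them with the numbers from md.log.
    measured = {1: 1.0, 2: 1.9, 4: 3.6, 8: 6.4}
    for n, (speedup, eff) in scaling(measured).items():
        print(f"{n:3d} CPUs: speedup {speedup:5.2f}, efficiency {eff:5.2f}")
```

The CPU count at which the efficiency column drops sharply is the point
worth investigating before adding separate PME nodes.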

Carsten

>  
>  
>  
> Best wishes!
>  
> Ji Xu
> xuji at home.ipe.ac.cn
> 
> 2008-06-23

-- 
Dr. Carsten Kutzner
Max Planck Institute for Biophysical Chemistry
Theoretical and Computational Biophysics Department
Am Fassberg 11
37077 Goettingen, Germany
Tel. +49-551-2012313, Fax: +49-551-2012302
http://www.mpibpc.mpg.de/research/dep/grubmueller/
http://www.gwdg.de/~ckutzne