[gmx-users] Re: The number of PME nodes
Carsten Kutzner
ckutzne at gwdg.de
Mon Jun 23 14:11:11 CEST 2008
Hi Xuji,
xuji wrote:
>
> Hi Carsten:
>
> First, thanks for your reply.
> Second, someone on the mailing list said: "Please cut your email
> out of the digest. It confuses people." And it seems that no one else is
> interested in this topic, so I think it may be better if we discuss this
> problem in private email.
I think Yang Ye simply meant that you should choose a meaningful subject -
not "Re: [gmx-developers]". I am also forwarding this to the users mailing
list, which is the best place for this discussion.
> Now it's time to discuss the problem. I chose 6 as the number of PME
> nodes because I have 3 nodes (each node has 8 CPUs).
> I also tested other numbers of PME nodes. The default number of PME
> nodes chosen by mdrun is 12. That is to say, when I run mdrun with the
> command line 'mpiexec -machinefile ./mf -n 24 mdrun -v -dlb -s md1.tpr
> -o md1.trr -g md1.log -e md1.edr -x md1.xtc >& md1.job &', the number of
> PME nodes is 12. The CPU occupancy is almost the same on the PME nodes
> and the DD (domain decomposition) nodes.
> When I run with 'mpiexec -machinefile ./mf -n 24 mdrun -v -dd 7 3 1
> -npme 3 -dlb -s md1.tpr -o md1.trr -g md1.log -e md1.edr -x md1.xtc >&
> md1.job &', each node has one CPU doing PME. The CPU occupancy of the
> PME nodes is higher than that of the DD nodes.
> When I run with 'mpiexec -machinefile ./mf -n 24 mdrun -v -dd 6 3 1
> -npme 6 -dlb -s md1.tpr -o md1.trr -g md1.log -e md1.edr -x md1.xtc >&
> md1.job &', each node has two CPUs doing PME. The CPU occupancy of the
> PME nodes is nearly the same as that of the DD nodes.
> In all three cases the CPU occupancy is about 10.0%, which is very
> low.
> I also tested 16 CPUs on two nodes, again with 6 PME nodes. The
> runtime of the PME nodes is longer than that of the DD nodes, which I
> take to mean that the workload of the PME nodes is larger than that of
> the DD nodes. The occupancy of each CPU is about 50.0%.
> When you run your own simulations, how do you determine the number
> of PME nodes? And how do you choose the DD grid: two-dimensional,
> three-dimensional, or just one-dimensional? I think the DD method also
> affects the efficiency.
Ok, here is what I would do in your case:
1. Start simulations on 1, 2, 4, 8, 16, and 24 CPUs without PME nodes first
and write down how many nanoseconds you get per day. Choose different
DD grids to see which one is the fastest (see the sketch after this list).
It is usually a good idea to have the largest number in the x-direction,
because this makes it easier to redistribute the data for the PME
calculation, which needs slabs in the x-direction. But this is not
necessarily the case; just play around with the values to get a feeling
for it.
2. Calculate the scaling and/or speedup for 1-24 CPUs. The point where the
scaling breaks down helps to identify where the problem might be.
3. If the scaling is really bad, the problem has to be found first. PME
nodes will not miraculously turn a bad scaling into a perfect one.
4. If the scaling is as expected, see whether you can further enhance it by
using PME/PP node separation. Do not look at the CPU occupancy, since a high
value does not necessarily mean good scaling! Instead, use the GROMACS
ns/day values at the end of the md.log file. For balancing the PME against
the PP nodes, look at the load balancing info you get every 100 steps
(if -v is turned on). Try to approach a pme/F ratio near 1, which means
the PME and the (short-range) force calculation take approximately the
same amount of time. Also play around with the -ddorder parameter: it
could be a good idea to have a whole node (8 CPUs) do PME while the other
one or two nodes do PP (see the example command after this list). This way
the communication-intensive FFT can make use of fast shared-memory
communication.
5. Be aware that dynamic load balancing may take some time until it has
converged to optimal cell sizes. The "imb F" value just before the "pme/F"
value tells you how big the imbalance in the force calculation between
the individual processes still is. It should converge to about 1%.
6. Choose enough time steps for the benchmarks that you can neglect the
initial phase, where the load balancing adaptations take place
(maybe 1000 time steps, possibly longer).
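
As a rough illustration of steps 1 and 2, here is a minimal shell sketch
(only a sketch: it assumes your md1.tpr and machinefile from above, the
bench_* file names are just placeholders, and it assumes a GROMACS 4.x
mdrun whose md.log ends with a "Performance:" line where ns/day is the
fourth column - check that column once by hand for your version):

  #!/bin/bash
  # Run the same system on increasing CPU counts without separate
  # PME nodes (-npme 0) and collect ns/day from each log file.
  # Add e.g. "-dd X Y Z" to try different DD grids for a given count.
  for NCPU in 1 2 4 8 16 24; do
      mpiexec -machinefile ./mf -n $NCPU \
          mdrun -v -npme 0 -s md1.tpr -g bench_${NCPU}.log \
                -o bench_${NCPU}.trr -e bench_${NCPU}.edr
  done
  # Extract ns/day and compute the speedup relative to the 1-CPU run.
  REF=$(grep "Performance:" bench_1.log | awk '{print $4}')
  for NCPU in 1 2 4 8 16 24; do
      NSDAY=$(grep "Performance:" bench_${NCPU}.log | awk '{print $4}')
      echo "$NCPU CPUs: $NSDAY ns/day, speedup $(echo "$NSDAY / $REF" | bc -l)"
  done

The speedup on N CPUs divided by N gives the parallel efficiency; the CPU
count where it drops noticeably is the place to start investigating.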
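
For step 4, a possible starting point on your 3 nodes (again only a sketch;
whether the 8 PME ranks really end up together on one physical node depends
on how your mpiexec maps ranks to the hosts in ./mf, so check the node
assignment printed at the top of md.log):

  mpiexec -machinefile ./mf -n 24 \
      mdrun -v -dd 4 2 2 -npme 8 -ddorder pp_pme -dlb yes \
            -s md1.tpr -o md1.trr -g md1.log -e md1.edr -x md1.xtc

Here 16 ranks do PP work on a 4x2x2 DD grid and 8 ranks do PME; with
-ddorder pp_pme the PP ranks come first and the PME ranks last, so a
machinefile that lists each node 8 times in a row puts all PME ranks on one
node. Then watch the pme/F ratio and adjust -npme until it is close to 1.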
Carsten
>
> Best wishes!
>
> Ji Xu
> xuji at home.ipe.ac.cn
>
> 2008-06-23
--
Dr. Carsten Kutzner
Max Planck Institute for Biophysical Chemistry
Theoretical and Computational Biophysics Department
Am Fassberg 11
37077 Goettingen, Germany
Tel. +49-551-2012313, Fax: +49-551-2012302
http://www.mpibpc.mpg.de/research/dep/grubmueller/
http://www.gwdg.de/~ckutzne