[gmx-developers] Why is domain decomposition the default parallelization algorithm in version 4 and later?

Fri May 11 22:03:05 CEST 2012

Hi,

I hope that this is an appropriate topic for this list.  If it is not,
please let me know and I will be happy to move it.  

I think that in versions prior to 4, particle decomposition was the only
parallelization method available.  According to the 2005 J. Comput. Chem.
paper (http://onlinelibrary.wiley.com/doi/10.1002/jcc.20291/abstract) by Dr.
van der Spoel et al.:

"An early design decision was the choice to work with particle decomposition
rather than domain decomposition to distribute work over the processors.
...  Domain decomposition is a better choice only when linear system size
considerably exceeds the range of interaction, which is seldom the case in
molecular dynamics.  With particle decomposition ... every processor keeps
in its local memory the complete coordinate set of the system rather than
restricting storage to the coordinates it needs. This is simpler and saves
communication overhead, while the memory claim is usually not a limiting
factor at all, even for millions of particles.  ...  Communication is
essentially restricted to sending coordinates and forces once per time step
around the processor ring.  These choices have proven to be robust over time
and easily applicable to modern processor clusters."  [page 1702]

But in version 4, domain decomposition was implemented and is now the
default parallelization algorithm in mdrun.  Why is this the case?  From
reading the 2008 paper by Hess et al. in JCTC
(http://pubs.acs.org/doi/abs/10.1021/ct700301q), it seems that domain
decomposition, if implemented cleverly, can and does outperform particle
decomposition, despite domain decomposition's higher communication
overhead:

"GROMACS was in fact set up to run in parallel on 10Mbit ethernet from the
start in 1992 but used a particle/force decomposition that did not scale
well.  The single-instruction-multiple-data kernels we introduced in 2000
made the relative scaling even worse (although absolute performance improved
significantly), since the fraction of remaining time spent on communication
increased.  A related problem was load imbalance; with particle
decomposition one can frequently avoid imbalance by distributing different
types of molecules uniformly over the processors.  Domain decomposition, on
the other hand, requires automatic load balancing to avoid deterioration of
performance."  [page 436]

"Previous GROMACS versions used a ring communication topology, where half of
the coordinates/forces were sent over half the ring. To be frank, the only
thing to be said in favor of that is that it was simple."  [page 441]
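
A rough way to see the trade-off described in these quotes is to count how
many coordinates each processor must receive per time step.  The sketch below
is a toy model only, not taken from either paper or from the GROMACS source.
It assumes a cubic box with uniform particle density, a processor count that
is a perfect cube, and a full halo shell of thickness equal to the cutoff
around each cubic domain.  The function names and parameters are hypothetical.

```python
def particle_decomposition_volume(n_particles, n_ranks):
    """Coordinates one rank must receive per step if every rank keeps
    the complete coordinate set (toy model of particle decomposition)."""
    # Each rank owns n_particles / n_ranks and needs all the others.
    return n_particles * (n_ranks - 1) / n_ranks


def domain_decomposition_volume(n_particles, n_ranks, box_length, cutoff):
    """Coordinates one rank must receive per step if it only needs
    particles within `cutoff` of its own cubic domain (toy model)."""
    cells_per_dim = round(n_ranks ** (1.0 / 3.0))
    if cells_per_dim ** 3 != n_ranks:
        raise ValueError("this sketch assumes n_ranks is a perfect cube")

    density = n_particles / box_length ** 3
    cell_length = box_length / cells_per_dim

    # Halo = shell of thickness `cutoff` around the domain.
    halo = (cell_length + 2.0 * cutoff) ** 3 - cell_length ** 3
    # The halo can never be larger than the rest of the box.
    halo = min(halo, box_length ** 3 - cell_length ** 3)
    return density * halo


if __name__ == "__main__":
    n, p, cutoff = 100_000, 8, 1.0
    for box in (3.0, 10.0, 30.0):
        pd = particle_decomposition_volume(n, p)
        dd = domain_decomposition_volume(n, p, box, cutoff)
        print(f"box={box:5.1f}  PD recv={pd:9.0f}  DD recv={dd:9.0f}")
```

In this toy model the two methods need the same amount of data when the box
is only a few cutoffs wide, and domain decomposition needs progressively less
as the box grows relative to the cutoff.  It ignores latency, load balancing,
and everything else that matters for real performance.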

Unfortunately, I am not very well-versed in parallelization algorithms and
high-performance computing in general.  Can you please tell me in 1-2
sentences why domain decomposition is now the default parallelization
method?  

(In recent simulations I have run, I have seen some seemingly significant
differences in the electric potential (calculated using g_potential) when I
use particle decomposition versus when I use domain decomposition.  Do you
know if this has been observed?  I do not see any discussion of the
differences in results between the two algorithms.  I am convinced that I
must be making a mistake (it seems unlikely that I, of all people, would
find a bug), but I have not yet found my mistake.)

Thanks for your time!

Andrew DeYoung
Carnegie Mellon University