[gmx-developers] parallel optimizations of inl1130
lindahl at cbr.su.se
Thu Nov 15 23:37:32 CET 2007
On Nov 15, 2007, at 9:49 PM, James Schwarzmeier wrote:
> This is a question from a non-chemist. In the version of Gromacs I
> am looking at, routine inl1130 appears to be calculating long range
> forces on atoms of all water molecules due to atoms of all other
> water molecules. This is implemented as a doubly nested loop. The
> outer loop is over three atoms of a fixed water molecule, and I
> believe the inner loop goes over atoms of all other water molecules.
Almost - the inner loop goes over the entries in a neighborlist, which
contains only a small fraction of all the other water molecules.
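To make the loop structure concrete, here is a minimal sketch in C of a neighborlist-driven kernel. It is not the actual inl1130 code: it uses single atoms rather than three-site waters, only a Lennard-Jones term, no cutoff and no periodic shifts. The struct name `nblist_sketch`, its fields, the function name and the parameters `c6`/`c12` are assumptions chosen for illustration; `faction`, `ii3`, `j3`, `n` and `k` follow the names used in the question.

```c
/* Hypothetical neighborlist layout, for illustration only:
 * outer entry n owns the j atoms jjnr[jindex[n]] .. jjnr[jindex[n+1]-1]. */
typedef struct {
    int        nri;     /* number of outer-loop entries              */
    const int *iinr;    /* i atom of each outer entry                */
    const int *jindex;  /* nri+1 offsets into jjnr                   */
    const int *jjnr;    /* concatenated j atom indices               */
} nblist_sketch;

/* Lennard-Jones forces for all pairs in the list (sketch, not inl1130). */
void lj_kernel_sketch(const nblist_sketch *nl, const double *x,
                      double c6, double c12, double *faction)
{
    for (int n = 0; n < nl->nri; n++) {
        int    ii3 = 3 * nl->iinr[n];
        double ix = x[ii3], iy = x[ii3 + 1], iz = x[ii3 + 2];
        double fix = 0.0, fiy = 0.0, fiz = 0.0;

        /* Inner loop: only the neighbors of this i atom, not all atoms. */
        for (int k = nl->jindex[n]; k < nl->jindex[n + 1]; k++) {
            int    j3 = 3 * nl->jjnr[k];           /* gather index */
            double dx = ix - x[j3];
            double dy = iy - x[j3 + 1];
            double dz = iz - x[j3 + 2];
            double rinvsq = 1.0 / (dx * dx + dy * dy + dz * dz);
            double rinv6  = rinvsq * rinvsq * rinvsq;
            /* F_i = (12*c12/r^12 - 6*c6/r^6) / r^2 * (x_i - x_j) */
            double fscal = (12.0 * c12 * rinv6 * rinv6 - 6.0 * c6 * rinv6) * rinvsq;
            double tx = fscal * dx, ty = fscal * dy, tz = fscal * dz;

            fix += tx;  fiy += ty;  fiz += tz;
            /* Scatter: Newton's third law, opposite force on the j atom. */
            faction[j3]     -= tx;
            faction[j3 + 1] -= ty;
            faction[j3 + 2] -= tz;
        }
        /* The i force is accumulated in registers and written once. */
        faction[ii3]     += fix;
        faction[ii3 + 1] += fiy;
        faction[ii3 + 2] += fiz;
    }
}
```

The inner loop only touches `faction` at index `j3`, so as long as no j atom appears twice in the list of one outer entry, the scatter has no write conflicts within that inner loop.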
> I have two questions about parallelization opportunities. First,
> even though there are gathers and scatters going on, isn’t the inner
> loop fully vectorizable? That is, the force on a given atom should
> be a completely parallel computation, given positions of all other
> atoms. This implies that all indirect addresses of the ‘faction’
> array have no repeated indices ‘j3’ as you vary loop index ‘k’. Is
> this correct?
Correct. To some extent we already do this with SIMD-style
instructions for x86. A long time ago we even had special vector
versions for Cray, but I think they disappeared somewhere with version
3.0 or so, since they were never tested :-)
> The second question is about the outer loop over do-n, which I
> believe counts unique water molecules in the system. Isn’t this also
> a fully parallel operation that could be made into an OpenMP loop
> (with a long list of private variables!)? That is, the force
> calculation of all atoms should be a completely parallel operation,
> given positions of all particles. Again this implies indices ‘ii3’
> of faction, ‘i3’ of fshift, etc, have no repeats as you vary outer
> loop index ‘n’. Is this correct?
No, that's not correct in general. A complication with MD is that we
need periodic boundary conditions. To avoid having these conditionals
in the innermost loop, we record the relative box position ("shift")
when we do neighborsearching every 10-15 steps. Each atom therefore
gives rise to a number (usually 2-3) of "i particles" with different
shift indices, so the same 'ii3' index can show up for more than one
value of 'n'. Any "j particle" only occurs in one of these, though.
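A small sketch in C of what this means for the outer loop. This is an illustration under assumptions, not the real kernel: single atoms instead of waters, Lennard-Jones only, and the names `shifted_nblist_sketch`, `shiftvec`, `lj_kernel_with_shifts` and `count_repeated_i_entries` are hypothetical. The point is that the shift is added to the i coordinate once per outer entry, so the inner loop has no periodicity conditionals, and that the same i atom can appear in several outer entries.

```c
/* Hypothetical list layout: each outer entry is an (i atom, shift) pair. */
typedef struct {
    int        nri;     /* number of outer-loop entries                */
    const int *iinr;    /* i atom per entry; an atom may occur twice   */
    const int *shift;   /* shift index per entry                       */
    const int *jindex;  /* nri+1 offsets into jjnr                     */
    const int *jjnr;    /* concatenated j atom indices                 */
} shifted_nblist_sketch;

void lj_kernel_with_shifts(const shifted_nblist_sketch *nl,
                           const double *shiftvec, const double *x,
                           double c6, double c12,
                           double *faction, double *fshift)
{
    for (int n = 0; n < nl->nri; n++) {
        int is3 = 3 * nl->shift[n];
        int ii3 = 3 * nl->iinr[n];
        /* Periodic image of the i atom, computed once per outer entry. */
        double ix = x[ii3]     + shiftvec[is3];
        double iy = x[ii3 + 1] + shiftvec[is3 + 1];
        double iz = x[ii3 + 2] + shiftvec[is3 + 2];
        double fix = 0.0, fiy = 0.0, fiz = 0.0;

        for (int k = nl->jindex[n]; k < nl->jindex[n + 1]; k++) {
            int    j3 = 3 * nl->jjnr[k];
            double dx = ix - x[j3];
            double dy = iy - x[j3 + 1];
            double dz = iz - x[j3 + 2];
            double rinvsq = 1.0 / (dx * dx + dy * dy + dz * dz);
            double rinv6  = rinvsq * rinvsq * rinvsq;
            double fscal = (12.0 * c12 * rinv6 * rinv6 - 6.0 * c6 * rinv6) * rinvsq;
            double tx = fscal * dx, ty = fscal * dy, tz = fscal * dz;
            fix += tx;  fiy += ty;  fiz += tz;
            faction[j3]     -= tx;
            faction[j3 + 1] -= ty;
            faction[j3 + 2] -= tz;
        }
        /* Another entry n' with the same i atom writes to the same slots,
         * so running the outer loop in parallel would race here. */
        faction[ii3]     += fix;
        faction[ii3 + 1] += fiy;
        faction[ii3 + 2] += fiz;
        fshift[is3]      += fix;
        fshift[is3 + 1]  += fiy;
        fshift[is3 + 2]  += fiz;
    }
}

/* Counts outer entries whose i atom already occurred in an earlier entry. */
int count_repeated_i_entries(const shifted_nblist_sketch *nl)
{
    int repeats = 0;
    for (int n = 1; n < nl->nri; n++) {
        for (int m = 0; m < n; m++) {
            if (nl->iinr[m] == nl->iinr[n]) { repeats++; break; }
        }
    }
    return repeats;
}
```

With an atom near a box edge, for example, one outer entry would carry the zero shift for neighbors inside the box and a second entry would carry a one-box-length shift for neighbors reached through the periodic boundary; both entries accumulate into the same `faction[ii3]`.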
We tried OpenMP a few years ago, but the performance couldn't compete
with the current MPI implementation in CVS where we essentially scale
perfectly as long as we stay within a dual quad-core node.
However, we have already prepared the kernels to use threads
explicitly, and even the x86 assembly kernels use compare-exchange to
sync workunits of 1-10 outer loop iterations. Once gromacs 4.0 is
released we plan to start the work on a mixed MPI/threads version.
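For readers unfamiliar with the compare-exchange idea, a rough sketch in C11 follows. It is not the GROMACS code (the real kernels do this in x86 assembly); the function name `claim_workunit` and its parameters are hypothetical. Threads share one counter of outer-loop entries and each thread atomically claims the next chunk of iterations.

```c
#include <stdatomic.h>

/* Try to claim the next work unit of up to `chunk` outer iterations.
 * Returns 1 and sets the half-open range [*start, *end) on success,
 * or 0 when all nri entries have already been handed out. */
int claim_workunit(atomic_int *counter, int nri, int chunk,
                   int *start, int *end)
{
    int n0 = atomic_load(counter);
    while (n0 < nri) {
        int n1 = n0 + chunk;
        if (n1 > nri) {
            n1 = nri;
        }
        /* Succeeds only if no other thread moved the counter since we
         * read n0; on failure n0 is refreshed with the current value. */
        if (atomic_compare_exchange_weak(counter, &n0, n1)) {
            *start = n0;
            *end   = n1;
            return 1;
        }
    }
    return 0;
}
```

Each thread would call this in a loop and run the outer loop over its claimed range until it returns 0. Claiming disjoint ranges of outer entries does not by itself remove the write conflicts on the force array, since different entries can refer to the same atoms; how the per-thread results are combined is not described here, so that part is left out of the sketch.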
We've worked a lot on these optimizations over the last few years, but
it's always nice to get a fresh perspective, so we'd be happy for any
suggestions you might have :-)