[gmx-developers] Cache Locality in gather_f_bsplines()
Ansgar.Esztermann at mpi-bpc.mpg.de
Thu Apr 29 10:23:33 CEST 2010
here is some work I have done on locality optimization in the PME routines. I hope it proves useful. (See attachment for the patch.)
In a typical simulation, a sizable fraction of PME runtime is spent outside the FFT. Very roughly, PME force calculation works in three steps: first, charges are transferred from the atoms to a grid (spread_q_bsplines()); secondly, the forces on the grid points are calculated; finally, these forces are transferred back to the atoms (gather_f_bsplines()).
Although the spread_q_bsplines() and gather_f_bsplines() routines are computationally not very demanding, their impact on runtime is not negligible. Analyzing a simulation run with a tool like Shark yields a very high number of cache misses in these routines.
*Theory of operation
When calculating the forces on a certain atom, gather_f_bsplines() takes into account a small number of grid points surrounding the atom. That is, the routine is very "local" in (simulated) physical space. However, multidimensional arrays in C are stored in row-major format, thereby losing this locality. In fact, two adjacent grid points may be stored kilobytes apart in memory. With today's fast CPUs and (comparatively) slow RAM, this imposes severe penalties on runtime.
If cache geometry is known, code may be optimized to group together operations on data belonging to the same cache line (and therefore loaded simultaneously). However, when cache geometry is not known beforehand (e.g. when the same code is supposed to run on various architectures), a different approach is called for.
Space filling (fractal) curves tend to preserve locality much better than row-major format does. That is, two points close together in (say) 3d space are close together along such a curve with a higher probability. The Hilbert curve (built by recursively replacing straight lines with U shapes) is a good example for this. Apart from locality preservation, a second criterion is important for such a curve: to be useful for representing a 3d grid, it must be fast to construct. For a given atom, the x, y, and z coordinates of the surrounding grid points are easily obtained. Thus, calculating their memory address in row-major format is easy. Calculating the corresponding point along a space-filling curve necessitates more computation, thus diminishing the runtime advantage.
For the present implementation, I have chosen the Morton curve: it can be created by recursively replacing straight lines with the letter Z. Because of the diagonal "jumps", its locality properties are somewhat less optimal, but there is a very fast way of calculating addresses: simply interleave bits of the x, y, and z coordinates. Thus, in an eight by eight by eight grid with zero-based indices, (3,4,5)=(b011,b100,b101) becomes b011100101=229 (ordering the bits as xyzxyzxyz).
Implementing the core functionality using this approach is simple enough: array indices are converted in morton_index() using shift operations. Shift masks are given in octal notation since its three bits per character work well with three-dimensional arrays, leading to an easily recognizable symmetry.
At the beginning of gather_f_bsplines(), the routine to_morton() traces the grid in natural row-major order, then writes it into a 1d array in Morton curve order. At this point, there is a marked increase in cache misses; however, this happens outside the atom loop, minimizing its impact on overall runtime. The inner loop then proceeds almost unchanged, simply reading forces from the new 1d array (real *morton), at the index calculated on-the-fly by morton_index().
*Costs and Drawbacks
By design, some overhead is incurred by this approach. Notably, the FFT grid is duplicated, using space (once) and time (once per PME loop).
Moreover, index calculation (each access to a grid point) is fast but not instantaneous.
Besides these fundamental costs, there is one quirk of the implementation. The amount of memory allocated for the Morton curve is 2^(3n)*sizeof(real), where n is the smallest integer such that 2^n >= max(x,y,z). In other words, the grid is blown up to a cube with power-of-two edge length. This speeds up index calculation, but may pose space (and perhaps time) problems with very flat or elongated Fourier grids, and with grid dimensions slightly larger than a power of two.
I have tested the implementation on a simple system with 80k atoms. The run produced identical trajectories compared to an unaltered gmx version. It was, 3 to 4 percent faster overall, and PME spread/gather time dropped about 20% as taken from md.log.
The test runs have been performed on x86_64.
Firstly, more testing for performance and correctness with different input files and with parallel runs is needed. Secondly, the user should be able to switch the new behaviour on and off, e.g. via a command line option. (Currently, this is controlled via the preprocessor macro PME_GRID_MORTON.) Thirdly, it should be investigated whether a similar approach is feasible for spread_q_bsplines().
Finally, it is a pleasure to thank Carsten Kutzner for advice and helpful discussions.
The bit shift algorithms have been adapted from http://graphics.stanford.edu/~seander/bithacks.html.
-------------- next part --------------
A non-text attachment was scrubbed...
Size: 5243 bytes
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
More information about the gromacs.org_gmx-developers