[gmx-developers] Cache Locality in gather_f_bsplines()

Tue May 4 11:19:09 CEST 2010

Esztermann, Ansgar wrote:
> Hello Berk,
>
>   
>> Carsten told me about your work when I was in Goettingen half a year ago.
>> But I forgot why this access pattern can improve cache misses.
>> Currently Gromacs reads and writes from the grid in 3D loops with indices
>> major, middle, minor. To me this seems as if you would already get
>> optimal caching.
>>     
>
> As long as access follows these loops, I would indeed expect optimal cache behaviour. However, the pattern during spread/gather is different: there is a loop over all atoms, and for each atom, neighbouring grid points are accessed. Taking into account only nearest neighbours, and assuming x is the fastest varying index, the following ensues: points <x,y,z> and <x+1,y,z> are next to each other in memory; <x,y+1,z> is already max_x*sizeof(real) bytes away; for <x,y,z+1>, the distance is max_x*max_y*sizeof(real), which might well be as large as 25kB.
> This is true for any value of z, i.e. every atom has its neighbouring grid points spread out across memory in this way.
>
> When storing data along the Morton curve, locality is on average much better. The curve traces a hierarchy of cubes with edges of 2^n grid points. For example, a cube with 7*d edges (where d is the grid constant, for simplicity assuming d_x=d_y=d_z) contains 8*8*8=512 grid points in a contiguous region of 512*sizeof(real) bytes. Therefore, any atom contained in 7^3 out of 8^3 unit cubes (a volume fraction of 67%) has *all* of its nearest grid points within said 512*sizeof(real) bytes. Because of the self-similarity of the curve (and fractals in general), this argument applies at any scale.
>
>   
Ah, you reorder the whole grid in memory, I didn't realize this.

Did you test the performance with domain decomposition?
When running in serial the atoms are stored in nearly random order.
When running with domain decomposition, atoms are ordered on x major, y
middle and z minor.

Berk
>> We reimplemented most of the PME routines in the 2dpme git branch.
>> Here we have a 20% performance improvement in spreading and gathering,
>> because we now use a local grid which is not periodically wrapped.
>> This avoids an extra indexation, but, probably more importantly, avoids
>> cache missing because some grid points might be far away in memory
>> over the periodic boundary.
>>     
>
> Thanks for pointing this out. I will take a look at the branch and see whether the idea might be useful there.
>
>
> A.
>
>
>
>