[gmx-developers] Unnecessary shl in assembly inner loops?
Mark.Abraham at anu.edu.au
Fri Mar 3 08:35:01 CET 2006
While examining the IA32_SSE assembly language inner loops, I noticed a
possible speed improvement to the way lookups of LJ vdW parameters are
done. They are taken from a 2D array (indexed by the two atom types) of
pointers to two contiguous float locations containing c6 and c12
parameters. In nb_kernel110_ia32_sse.s, the row of the array is
constructed early in the outer loop with
mov ebx, [ecx +esi*4] ;# ebx =ii
mov edx, [ebp + nb110_type]
mov edx, [edx + ebx*4]
imul edx, [esp + nb110_ntype]
shl edx, 1
mov [esp + nb110_ntia], edx
with the offset of the row stored in nb110_ntia for use in the inner loop.
Later on in the quad-unrolled inner loop, we have
mov edx, [esp + nb110_innerjjnr] ;# pointer to jjnr[k]
mov eax, [edx]
mov ebx, [edx + 4]
mov ecx, [edx + 8]
mov edx, [edx + 12] ;# eax-edx=jnr1-4
mov esi, [ebp + nb110_type]
mov eax, [esi + eax*4]
mov ebx, [esi + ebx*4]
mov ecx, [esi + ecx*4]
mov edx, [esi + edx*4]
mov esi, [ebp + nb110_vdwparam]
shl eax, 1
shl ebx, 1
shl ecx, 1
shl edx, 1
mov edi, [esp + nb110_ntia]
add eax, edi
add ebx, edi
add ecx, edi
add edx, edi
movlps xmm6, [esi + eax*4]
movlps xmm7, [esi + ecx*4]
movhps xmm6, [esi + ebx*4]
movhps xmm7, [esi + edx*4]
Here the columns for the lookup of the vdW array for four j atoms are
constructed in eax-edx. These all get shifted left by 1 and added to the
row offset we stored earlier, for the lookup of c6 and c12
simultaneously with the mov[lh]ps instructions - that's elegantly done!
However, I note that nb110_ntia is used only in this type of operation:
it was shifted left by 1 before being stored, and we have just shifted
left by 1 every operand we add to it after retrieving it.
This immediately suggests moving the four shl operations to after the
addition in the inner loop, and eliminating the shl in the outer loop.
This saves one shl per outer loop, which isn't much to write home about.
However the first thing we do to eax-edx is use them in an effective
address construction as an index scaled by 4. We don't use these values
afterwards, so we can eliminate all of the shl operations and replace
them by the same indices scaled by 8. Scaling by any of 1, 2, 4 and 8 is
permitted on the IA32, and my guess would be that 4 and 8 are equally
fast - and that 8 will be faster than 4 and a pre-shift left. If my
understanding of the algorithm and my guess about the timing is right,
that will save time(shl)*(4*nrj+1)*nri per call to this function. Of
course time(shl) is infinitesimal, but we are doing a lot of them if
we're calling nb_kernel??0 a lot.
A quick look at nb_kernel310, nb_kernel111 and nb_kernel112 suggests
that the first has the same two-part optimization available in both
inner and outer loops; the second has the optimization available only
outside the outer loop and inside the inner loop, which saves
time(shl)*(4*nrj*nri+1); and the third saves only outside the outer
loop, for a trivial gain of time(shl)*1. This is all expected from the nature
of the water optimization.
Because of the large relative amount of time typically spent in
water-water loops, the gain would be only a tiny increment for someone
using an optimized water solvent on a small solute. However, for a system
with very large solute, or non-optimized solvent, or membranes, or no
solvent, the gain might be appreciable.
Hope this helps!