[gmx-developers] Unnecessary shl in assembly inner loops?

Fri Mar 3 08:35:01 CET 2006

Hi,

While examining the IA32_SSE assembly language inner loops, I noticed a 
possible speed improvement to the way lookups of LJ vdW parameters are 
done. They are taken from a 2D array (indexed by the two atom types) of 
pointers to two contiguous float locations containing c6 and c12 
parameters. In nb_kernel110_ia32_sse.s, the row of the array is 
constructed early in the outer loop with

	mov   ebx, [ecx + esi*4]	;# ebx = ii
<snippage>
	mov   edx, [ebp + nb110_type]
	mov   edx, [edx + ebx*4]
	imul  edx, [esp + nb110_ntype]
	shl   edx, 1
	mov   [esp + nb110_ntia], edx

with the offset of the row stored in nb110_ntia for use in the inner loop.
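
For readers who prefer C, the outer-loop setup above can be modelled roughly as follows. This is only a sketch of the index arithmetic: the function name row_offset_ntia is hypothetical (it is not a GROMACS routine), and the parameters type, ntype and ii simply mirror the names used in the assembly.

```c
/* Sketch of the outer-loop row offset, modelled on the assembly above.
 * row_offset_ntia is a hypothetical name, not a GROMACS function. */
int row_offset_ntia(const int *type, int ntype, int ii)
{
    int row = type[ii] * ntype;  /* imul edx, [esp + nb110_ntype] */
    return row << 1;             /* shl edx, 1: two floats (c6, c12) per entry */
}
```

The returned value is an index counted in floats, which is why the inner loop can later scale it by 4 to obtain a byte offset.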

Later on in the quad-unrolled inner loop, we have

	mov   edx, [esp + nb110_innerjjnr]     ;# pointer to jjnr[k]
	mov   eax, [edx]	
	mov   ebx, [edx + 4]
	mov   ecx, [edx + 8]
	mov   edx, [edx + 12]         ;# eax-edx=jnr1-4
<snippage>
	mov esi, [ebp + nb110_type]
	mov eax, [esi + eax*4]
	mov ebx, [esi + ebx*4]
	mov ecx, [esi + ecx*4]
	mov edx, [esi + edx*4]
	mov esi, [ebp + nb110_vdwparam]
	shl eax, 1	
	shl ebx, 1	
	shl ecx, 1	
	shl edx, 1	
	mov edi, [esp + nb110_ntia]
	add eax, edi
	add ebx, edi
	add ecx, edi
	add edx, edi

	movlps xmm6, [esi + eax*4]
	movlps xmm7, [esi + ecx*4]
	movhps xmm6, [esi + ebx*4]
	movhps xmm7, [esi + edx*4]

Here the columns for the lookup of the vdW array for four j atoms are 
constructed in eax-edx. These all get shifted left by 1 and added to the 
row offset we stored earlier, for the lookup of c6 and c12 
simultaneously with the mov[lh]ps instructions - that's elegantly done!
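
The lookup described above, for a single j atom, might be sketched in C like this. The struct lj_pair_t and the function lookup_lj_current are hypothetical names introduced for illustration. The sketch assumes, as the movlps/movhps loads suggest, that c6 and c12 sit in consecutive floats of the vdwparam array.

```c
/* Sketch of the current inner-loop lookup for one j atom.
 * lj_pair_t and lookup_lj_current are hypothetical names. */
typedef struct {
    float c6;
    float c12;
} lj_pair_t;

lj_pair_t lookup_lj_current(const float *vdwparam, const int *type,
                            int ntia, int jnr)
{
    int idx = (type[jnr] << 1) + ntia;  /* shl eax, 1 ; add eax, edi */
    /* movlps/movhps fetch two consecutive floats at [esi + eax*4] */
    lj_pair_t p = { vdwparam[idx], vdwparam[idx + 1] };
    return p;
}
```

The assembly does this for four j atoms at once and packs the results into xmm6 and xmm7; the sketch shows one lane only.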

However I note that we use nb110_ntia only in this type of operation, 
and that we did a shift left by 1 before storing it, and we've just done 
a shift left by 1 on all the operands we add to it after retrieving it. 
This immediately suggests moving the four shl operations to after the 
addition in the inner loop, and eliminating the shl in the outer loop. 
This saves one shl per outer-loop iteration, which isn't much to write home about.

However the first thing we do to eax-edx is use them in an effective 
address construction as an index scaled by 4. We don't use these values 
afterwards, so we can eliminate all of the shl operations and replace 
them by the same indices scaled by 8. Scaling by any of 1, 2, 4 and 8 is 
permitted on the IA32, and my guess would be that 4 and 8 are equally 
fast - and that 8 will be faster than 4 and a pre-shift left. If my 
understanding of the algorithm and my guess about the timing is right, 
that will save time(shl)*(4*nrj+1)*nri per call to this function.  Of 
course time(shl) is infinitesimal, but we are doing a lot of them if 
we're calling nb_kernel??0 a lot.
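
The claim that the two addressing schemes reach the same memory location can be checked with a small C model of the byte offsets. The function names below are hypothetical, and the sketch only models the arithmetic; it says nothing about whether scale-by-8 addressing is actually as fast as scale-by-4 on a given CPU, which would need measuring.

```c
#include <stddef.h>

/* Byte offset as computed now: both terms pre-shifted, index scaled by 4. */
size_t offset_current(int type_i, int type_j, int ntype)
{
    int ntia = (type_i * ntype) << 1;  /* outer loop: imul, then shl edx, 1 */
    int idx  = (type_j << 1) + ntia;   /* inner loop: shl eax, 1 ; add eax, edi */
    return (size_t)idx * 4;            /* [esi + eax*4] */
}

/* Byte offset under the proposal: no shifts at all, index scaled by 8. */
size_t offset_proposed(int type_i, int type_j, int ntype)
{
    int idx = type_i * ntype + type_j; /* add eax, edi only */
    return (size_t)idx * 8;            /* [esi + eax*8] */
}

/* shl instructions removed per kernel call, following the estimate in the
 * text: one per outer iteration plus four per inner iteration. */
long shl_saved(long nri, long nrj)
{
    return (4 * nrj + 1) * nri;
}
```

Both functions compute 8*(type_i*ntype + type_j), so the loads would hit the same c6/c12 pair provided the index registers are not reused afterwards, as noted above.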

A quick look at nb_kernel310, nb_kernel111 and nb_kernel112 suggests 
that the first has the same two-part optimization available in both 
inner and outer loops, the second has the optimization available outside 
the outer loop and inside the inner loop only, which saves 
time(shl)*(4*nrj*nri+1), and the third only saves outside the outer loop 
for a trivial gain of time(shl)*1. This is all expected from the nature 
of the water optimization.

Because of the large relative amount of time typically spent in 
water-water loops, the gain would only be a tiny increment for someone 
using an optimized water solvent on a small solute. However for a system 
with a very large solute, or non-optimized solvent, or membranes, or no 
solvent, the gain might be appreciable.

Hope this helps!

Mark