[gmx-developers] auto vectorization calc_gb_rad_obc

Thu Apr 17 09:54:14 CEST 2014

A user on our system noticed that GROMACS 4.6 was noticeably slower than
version 4.5.
To investigate this further, we ran the dhfr-impl-2nm.bench example from
gromacs-gpubench-dhfr.tar.gz.
Running GROMACS on a single CPU core (not using the GPU), we indeed
measured 1.57 ns/day for version 4.5 but only 0.5 ns/day for version 4.6,
roughly a factor of three slower.
On closer inspection we noticed that version 4.5 calls a vectorized version
of the most time-consuming routine, calc_gb_rad_hct_obc_sse2_single, whereas
in version 4.6 this routine is disabled by the preprocessor guard
#if 0 && defined (GMX_X86_SSE2) and a non-vectorized version of the routine
is used instead.
Unfortunately, version 14.02 of the Intel compiler does not auto-vectorize
the main inner loop of calc_gb_rad_obc.
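
To illustrate why such a guard disables the SSE2 path regardless of the build configuration, here is a minimal sketch. It is not GROMACS code: the enum, the function name selected_kernel and the local definition of GMX_X86_SSE2 are stand-ins chosen for this example.

```c
/* Stand-in definition for this sketch; in a real build the macro would
 * come from the build system. */
#define GMX_X86_SSE2 1

enum kernel_kind { KERNEL_PLAIN_C = 0, KERNEL_SSE2 = 1 };

/* Hypothetical selector that mimics the guard quoted above. */
static enum kernel_kind selected_kernel(void)
{
#if 0 && defined(GMX_X86_SSE2)
    /* Never compiled: "0 && anything" is always false. */
    return KERNEL_SSE2;
#else
    return KERNEL_PLAIN_C;
#endif
}
```

Even with the macro defined, the leading 0 makes the whole condition false, so only the plain C branch is compiled.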

To allow the compiler to vectorize the loop, we had to make a few changes.

We added the following (line 896 of genborn.c):
#pragma ivdep
        for (k = nj0; k < nj1; k++)
        {
	    if ( nl->jjnr[k] <0) exit;
The Intel compiler cannot handle the more complex loop condition, so we had
to manually move its second part (the test on nl->jjnr[k]) into the loop
body.
We also had to replace the calls to
gmx_invsqrt(dr2);
with 
1.0f/sqrtf(dr2)
(we also added the f suffix to all numerical constants to prevent
float/double type conversions).

After these changes the Intel compiler can auto-vectorize the loop (AVX),
and the performance becomes slightly better than that of the SSE-vectorized
code in version 4.5: 1.59 ns/day versus 1.57 ns/day.

Now the most time-consuming part of the loop is the reduction operation at
the end of the loop:
 born->gpol_hct_work[aj] += 0.5*tmp; (line 1011)
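
One common way to deal with such an indirect update is to split the work into a contiguous, vectorizable part and a short scalar scatter. The sketch below shows that idea only; it is not something we have tried in genborn.c, and no speed-up is claimed. The function name scatter_half, the CHUNK size and the flat arrays are assumptions made for this example, with work standing in for born->gpol_hct_work and jjnr for the neighbour indices.

```c
#include <assert.h>

#define CHUNK 8  /* hypothetical block size, chosen only for illustration */

/* Hypothetical sketch, not GROMACS code: add 0.5f*tmp[k] to
 * work[jjnr[k]] for k = 0..n-1. */
static void scatter_half(float *work, const int *jjnr, const float *tmp, int n)
{
    float buf[CHUNK];
    int start, k, len;

    assert(n >= 0);
    for (start = 0; start < n; start += CHUNK)
    {
        len = (n - start < CHUNK) ? (n - start) : CHUNK;

        /* Contiguous loads and stores only: a candidate for vectorization. */
        for (k = 0; k < len; k++)
        {
            buf[k] = 0.5f * tmp[start + k];
        }

        /* Indirect writes stay scalar, so repeated indices accumulate
         * correctly. */
        for (k = 0; k < len; k++)
        {
            work[jjnr[start + k]] += buf[k];
        }
    }
}
```

Keeping the scatter scalar avoids wrong results when the same index aj appears more than once within a block.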

best
Thomas

--
View this message in context: http://gromacs.5086.x6.nabble.com/auto-vectorization-calc-gb-rad-obc-tp5015895.html
Sent from the GROMACS Developers Forum mailing list archive at Nabble.com.