[gmx-developers] SSE+FMA4 kernel

Wed May 30 16:31:28 CEST 2012

Dear list,

I would like to share my attempt to port the GROMACS 4.5 SSE/SSE2 nonbonded kernels
to AMD's new family 15h chips ("Bulldozer" architecture), done to measure
the benefit of the new instructions. Family 15h adds the FMA4 instructions,
which perform a fused multiply-add (or multiply-subtract) as a single operation.
FMA4 reduces the number of instructions and the latency, giving a marginal performance boost.
There are also XOP instructions, which are useful for implementing the table interpolation
used in GROMACS.
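
To make the instruction-count argument concrete, here is a minimal sketch (not
taken from the patch) that evaluates a polynomial in Horner form and counts
arithmetic instructions with and without a fused multiply-add. The function names
and the coefficients are hypothetical, and Python is used only as a neutral
notation; the real kernels are SIMD code. The sketch counts instructions and
says nothing about actual timings.

```python
# Illustrative model only: counts arithmetic instructions, does not measure speed.

def horner_unfused(coeffs, x):
    """Evaluate a polynomial with separate multiply and add steps.

    coeffs are ordered from the highest power down to the constant term.
    Returns (value, number_of_arithmetic_instructions).
    """
    acc = coeffs[0]
    ops = 0
    for c in coeffs[1:]:
        acc = acc * x   # one multiply instruction
        acc = acc + c   # one add instruction
        ops += 2
    return acc, ops


def horner_fused(coeffs, x):
    """Same evaluation, counting each acc*x + c step as one FMA instruction."""
    acc = coeffs[0]
    ops = 0
    for c in coeffs[1:]:
        acc = acc * x + c   # stands in for a single fused multiply-add
        ops += 1
    return acc, ops


if __name__ == "__main__":
    cubic = [0.5, -1.0, 2.0, 3.0]   # hypothetical coefficients
    print(horner_unfused(cubic, 0.25))
    print(horner_fused(cubic, 0.25))
```

For a cubic, the unfused form needs six arithmetic instructions and the fused form
three, which is the kind of reduction FMA4 offers for multiply-then-add chains.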

In my tests, the speedup is at most about 5% for SP kernels (polych2 is in fact
slightly slower) and around 10% for DP kernels on gmxbench
( http://www.gromacs.org/About_Gromacs/Benchmarks ).
The source code and the generated kernels are available for download at
( https://bitbucket.org/shun.sakuraba/gromacs-fma4-patch ).
I hope this is of interest to some developers.

gmxbench 3.0, ns/day
           vanilla fma4+xop
SP    dppc  2.13    2.21
SP lzm.pme 16.15   16.73
SP lzm.cut 23.93   24.79
SP polych2 38.88   38.13
SP  villin 88.32   92.71
DP    dppc  1.35    1.52
DP lzm.pme 10.28   11.14
DP lzm.cut 15.34   17.09
DP polych2 26.73   29.87
DP  villin 48.82   54.92
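
The relative speedups can be recomputed from the table above. The following is a
small sketch that does so; the ns/day values are copied from the table, while the
names `RESULTS`, `speedup_percent` and `mean_speedup` are introduced here purely
for illustration.

```python
# ns/day values copied from the gmxbench table:
# (precision, system) -> (vanilla, fma4+xop)
RESULTS = {
    ("SP", "dppc"): (2.13, 2.21),
    ("SP", "lzm.pme"): (16.15, 16.73),
    ("SP", "lzm.cut"): (23.93, 24.79),
    ("SP", "polych2"): (38.88, 38.13),
    ("SP", "villin"): (88.32, 92.71),
    ("DP", "dppc"): (1.35, 1.52),
    ("DP", "lzm.pme"): (10.28, 11.14),
    ("DP", "lzm.cut"): (15.34, 17.09),
    ("DP", "polych2"): (26.73, 29.87),
    ("DP", "villin"): (48.82, 54.92),
}


def speedup_percent(vanilla, patched):
    """Relative change in ns/day, in percent (positive means faster)."""
    return (patched / vanilla - 1.0) * 100.0


def mean_speedup(precision):
    """Arithmetic mean of the per-system speedups for 'SP' or 'DP'."""
    values = [
        speedup_percent(vanilla, patched)
        for (prec, _system), (vanilla, patched) in RESULTS.items()
        if prec == precision
    ]
    return sum(values) / len(values)


if __name__ == "__main__":
    for (precision, system), (vanilla, patched) in RESULTS.items():
        change = speedup_percent(vanilla, patched)
        print(f"{precision} {system:8s} {change:+6.1f}%")
    for precision in ("SP", "DP"):
        print(f"{precision} mean {mean_speedup(precision):+6.1f}%")
```

Averaging the per-system percentages is only one way to summarize the table; it
weights every system equally regardless of its size.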

The benchmarks were run in the following environment:
AMD FX-4100 (2 modules, 4 cores, 3.6GHz / up to 3.9GHz with turbo)
Linux (ArchLinux) 3.1.12-1-ARCH
gcc (GCC) 4.7.0  20120505 (prerelease)
FFTW 3.3.2
GROMACS 4.5.5, run with thread parallelization on 4 threads, compiled with
CFLAGS="-O3 -fomit-frame-pointer -finline-functions -Wall -Wno-unused -march=native -funroll-all-loops -std=gnu99 -fexcess-precision=fast".
Each benchmark was run 3 times, and the median is reported.

Notes:
* If "-march=native" is replaced with "-msse2" (the default in GROMACS 4.5.5),
  GROMACS runs 5-10% slower than with "-march=native".
* I also compiled FFTW with and without -march=native. With the default flags (i.e. without
  -march=native), SP/DP runs are ~3% slower with PME.
* This version of FFTW does not use FMA4 SIMD instructions. Using FMA4 SIMD instructions in FFTW
  should increase performance, but I have not tried it.
* In NVE simulations, energy conservation is slightly better than with the vanilla kernels, possibly
  because an IEEE-compliant FMA does not round the intermediate product: the product is computed
  exactly and only the final sum is rounded.
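
The rounding difference mentioned in the last note can be illustrated with a short
sketch. It models a fused multiply-add using exact rational arithmetic followed by
one rounding, and compares it with an ordinary multiply followed by an add. This is
a model in double precision written for illustration, not the kernel code; the
function names and the input values are hypothetical and were chosen so that the
effect is visible.

```python
from fractions import Fraction


def fused_multiply_add(a, b, c):
    """Model a*b + c with a single rounding at the end.

    Fraction arithmetic on floats is exact, and converting the result back
    with float() rounds once to the nearest double.
    """
    return float(Fraction(a) * Fraction(b) + Fraction(c))


def separate_multiply_add(a, b, c):
    """Ordinary float arithmetic: the product is rounded, then the sum."""
    product = a * b      # rounded to the nearest double here
    return product + c   # rounded again here


if __name__ == "__main__":
    # Hypothetical inputs: the exact product is 1 - 2**-60, which is not
    # representable as a double and rounds to 1.0 when computed separately.
    a = 1.0 + 2.0**-30
    b = 1.0 - 2.0**-30
    c = -1.0
    print("separate:", separate_multiply_add(a, b, c))
    print("fused:   ", fused_multiply_add(a, b, c))
```

With these inputs the separate version loses the small residual entirely, while the
fused model keeps it. Whether this is what improves energy conservation in the
benchmarks is not established here; the sketch only shows the mechanism.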

-- 
Shun SAKURABA, Ph.D.
Postdoc @ Molecular Modeling & Simulation Group, Japan Atomic Energy Agency