[gmx-developers] SSE+FMA4 kernel
Shun Sakuraba
shun.sakuraba at gmail.com
Wed May 30 16:31:28 CEST 2012
Dear list,
I would like to share my attempt at porting the GROMACS 4.5 SSE/SSE2 nonbonded kernels
to AMD's new family 15h chips ("Bulldozer" architecture), done to measure
the benefit of the new instructions. AMD family 15h adds the FMA4 instructions,
which perform fused multiply-add (and multiply-subtract) operations.
FMA4 reduces the instruction count and latency, giving a marginal performance boost.
There are also the new XOP instructions, which are useful for implementing the table
interpolation used in GROMACS.
As far as I could measure, the speedup is only up to about 5% for SP kernels (polych2 is
in fact slightly slower) and around 10% for DP kernels
on gmxbench ( http://www.gromacs.org/About_Gromacs/Benchmarks ).
The source code and generated kernels are available for download at
( https://bitbucket.org/shun.sakuraba/gromacs-fma4-patch ).
I hope this interests some developers.
gmxbench 3.0, ns/day
               vanilla   fma4+xop
SP dppc           2.13       2.21
SP lzm.pme       16.15      16.73
SP lzm.cut       23.93      24.79
SP polych2       38.88      38.13
SP villin        88.32      92.71
DP dppc           1.35       1.52
DP lzm.pme       10.28      11.14
DP lzm.cut       15.34      17.09
DP polych2       26.73      29.87
DP villin        48.82      54.92
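The percentage gains behind these numbers can be recomputed from the table. The following is a minimal C sketch and not part of the patch: the names speedup_percent and mean_speedup_percent are made up for illustration, and the arrays simply copy the ns/day values listed above.

```c
#include <assert.h>
#include <stddef.h>

/* ns/day values copied from the gmxbench table above
   (order: dppc, lzm.pme, lzm.cut, polych2, villin). */
static const double sp_vanilla[5] = { 2.13, 16.15, 23.93, 38.88, 88.32 };
static const double sp_fma4xop[5] = { 2.21, 16.73, 24.79, 38.13, 92.71 };
static const double dp_vanilla[5] = { 1.35, 10.28, 15.34, 26.73, 48.82 };
static const double dp_fma4xop[5] = { 1.52, 11.14, 17.09, 29.87, 54.92 };

/* Relative gain in percent; positive means the patched kernel is faster. */
double speedup_percent(double vanilla, double patched)
{
    assert(vanilla > 0.0);
    return (patched / vanilla - 1.0) * 100.0;
}

/* Arithmetic mean of the per-benchmark gains (hypothetical helper). */
double mean_speedup_percent(const double *vanilla, const double *patched, size_t n)
{
    double sum = 0.0;
    assert(n > 0);
    for (size_t i = 0; i < n; i++) {
        sum += speedup_percent(vanilla[i], patched[i]);
    }
    return sum / (double)n;
}
```

For example, speedup_percent(1.35, 1.52) gives about 12.6 for DP dppc, while speedup_percent(38.88, 38.13) is slightly negative for SP polych2. Averaged over the five benchmarks, the gain is roughly 2.8% for SP and 11.3% for DP.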
Benchmark results were obtained in the following environment:
AMD FX-4100 (2 modules, 4 cores, 3.6GHz / up to 3.9GHz with turbo)
Linux (ArchLinux) 3.1.12-1-ARCH
gcc (GCC) 4.7.0 20120505 (prerelease)
FFTW 3.3.2
GROMACS 4.5.5, run with thread parallelization on 4 threads, compiled with
CFLAGS="-O3 -fomit-frame-pointer -finline-functions -Wall -Wno-unused -march=native -funroll-all-loops -std=gnu99 -fexcess-precision=fast".
Each benchmark was run 3 times and the median is reported.
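Taking the median of three runs means that a single disturbed run does not move the reported figure. A minimal C sketch of that selection step follows; median3 is a hypothetical helper written for this note, not code from the benchmark scripts.

```c
/* Median of three values: the middle one after ordering. */
double median3(double a, double b, double c)
{
    if (a > b) { double t = a; a = b; b = t; }  /* now a <= b */
    if (b > c) { double t = b; b = c; c = t; }  /* now c is the largest */
    if (a > b) { double t = a; a = b; b = t; }  /* now a <= b <= c */
    return b;
}
```

With made-up timings, median3(16.70, 16.73, 12.00) returns 16.70, so the one slow run is ignored, whereas a plain average would be pulled down by it.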
Notes:
* In GROMACS, if "-march=native" is replaced with "-msse2" (which is the default in GROMACS 4.5.5),
the program runs 5-10% slower than with "-march=native".
* I also compiled FFTW with and without -march=native. With the default flags (i.e. without
-march=native), SP/DP runs are ~3% slower in PME.
* This version of FFTW does not use FMA4 SIMD instructions. Using FMA4 SIMD instructions in FFTW
should increase performance, but I have not tried it.
* In NVE simulations, energy conservation is slightly better than with the vanilla kernels, possibly
because in an IEEE-compliant FMA the intermediate product is computed with infinite precision,
so the result is rounded only once.
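The rounding difference mentioned in the last note can be illustrated in portable C. This is only a sketch under stated assumptions: it does not use the FMA4 instructions themselves, and the names mul_then_add and emulated_fmaf are made up here. The fused result is emulated by carrying the float product in double, where it is exact; the sum is then rounded to double and again to float, so in rare cases it could still differ from a true FMA (the C99 function fmaf from <math.h> performs the real single-rounding operation).

```c
/* Unfused: the product is rounded to float before the addition.
   The volatile store keeps the compiler from fusing the two steps. */
float mul_then_add(float a, float b, float c)
{
    volatile float product = a * b;   /* first rounding */
    return product + c;               /* second rounding */
}

/* Emulated fused multiply-add: a float product has at most 48 significant
   bits, so it is exact in double; rounding to float happens at the end. */
float emulated_fmaf(float a, float b, float c)
{
    double wide = (double)a * (double)b + (double)c;
    return (float)wide;
}
```

Take a = 1 + 2^-13, b = 1 - 2^-13 and c = -1. The exact product is 1 - 2^-26, which is not representable in single precision and rounds to 1.0, so mul_then_add returns 0. The emulated fused version keeps the product exact and returns -2^-26. Small per-interaction differences of this kind are what could add up to the slightly better energy conservation.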
--
Shun SAKURABA, Ph.D.
Postdoc @ Molecular Modeling & Simulation Group, Japan Atomic Energy Agency