[gmx-developers] SSE+FMA4 kernel

Mon Jun 11 18:55:25 CEST 2012

Thanks for the suggestions.

 > > With AVX, you can replace two-operand SSE instructions by three-operand AVX instructions, and avoid copies. For example:
Since Bulldozer uses register renaming, the latency of register-to-register moves can be hidden from the subsequent operations.
(See section 10.16, "Move/Compute Optimization", in the AMD optimization manual for details.)
Thus, the only benefit of rewriting to AVX128 is reduced instruction decoding time and fewer front-end pipeline stalls.
However, the front-end stall cycles I observed are only about 5% across the benchmarks, so rewriting
all mov operations would not help much, in my opinion. Also note that GROMACS already uses registers
very carefully to avoid loads/stores.
I also tried removing some of the mov operations, but the performance gain was not significant.

 > > Another direction is using 256-bit ymm registers instead of 128-bit xmm registers, but that would require a thorough rewrite of the kernels. Especially on Sandy Bridge, where each core has its own 256-bit floating point unit, you may very well increase performance by 50% or more.
Yes. In Bulldozer, the two cores of a module share an FP unit with two 128-bit pipelines, and most FP
benchmarks report that using AVX256 instructions does not increase performance significantly.
Rewriting for AVX256 in asm is also too hard, especially loading 8 table entries from memory
and shuffling them, so I am not going to try it.

Since GROMACS ticket #923 suggests that these asm kernels will be removed in 4.6,
I am not going to tune this kernel further for now.
( http://redmine.gromacs.org/issues/923 )
Now I am eagerly waiting for the 4.6 AVX/FMA merge :)

BTW, in ticket #923 Rossen Apostolov reported that mixing SSE/AVX incurs expensive switches
between instruction sets on Bulldozer, which I did not observe in my experiments.
The rewritten kernel contains many switches between SSE and AVX128 (from the FMA4 instructions), but I observed no performance problems.
Also, according to AMD engineers, the Flex FP unit in Bulldozer has no switching penalty between SSE and AVX128.
( e.g. page 9 of http://metavo.metacentrum.cz/export/sites/meta/cs/seminars/seminar4/AMD_HPC_Brno.pdf )
Does anyone know under which conditions this happens? Is it SSE/AVX256 switching? Does the vzeroupper instruction help?

On 2012-06-12 01:27, Szilárd Páll wrote:
> Hi,
>
> Keep an eye on the release-4-6 git branch; the AVX 128 (+FMA4) and AVX 256 versions of the non-bonded kernels will very soon get merged upstream.
>
> Cheers,
> --
> Szilárd
>
>
> On Wed, May 30, 2012 at 4:31 PM, Shun Sakuraba <shun.sakuraba at gmail.com> wrote:
>
>     Dear list,
>
>     I would like to share my attempt to port the GROMACS 4.5 SSE/SSE2 nonbonded kernels
>     to AMD's new family 15h "Bulldozer" architecture, done to measure
>     the benefit of the new instructions. AMD family 15h adds new FMA4 instructions;
>     FMA4 provides fused multiply-add (and multiply-subtract) operations.
>     FMA4 reduces the number of instructions and the latency, giving a marginal performance boost.
>     There are also XOP instructions, which are useful for implementing the table interpolation
>     used in GROMACS.
>
>       In my tests, the speedup is only about 5% for SP kernels and around 10% for
>     DP kernels on gmxbench ( http://www.gromacs.org/About_Gromacs/Benchmarks ).
>     The source code and generated kernels are available for download at
>     ( https://bitbucket.org/shun.sakuraba/gromacs-fma4-patch ).
>     I hope this interests some developers.
>
>     gmxbench 3.0, ns/day
>                vanilla fma4+xop
>     SP    dppc  2.13    2.21
>     SP lzm.pme 16.15   16.73
>     SP lzm.cut 23.93   24.79
>     SP polych2 38.88   38.13
>     SP  villin 88.32   92.71
>     DP    dppc  1.35    1.52
>     DP lzm.pme 10.28   11.14
>     DP lzm.cut 15.34   17.09
>     DP polych2 26.73   29.87
>     DP  villin 48.82   54.92
>
>     Benchmark results were obtained in the following environment:
>     AMD FX-4100 (2 modules, 4 cores, 3.6GHz / up to 3.9GHz with turbo)
>     Linux (ArchLinux) 3.1.12-1-ARCH
>     gcc (GCC) 4.7.0  20120505 (prerelease)
>     FFTW 3.3.2
>     GROMACS 4.5.5, run with thread parallelization on 4 threads, compiled with
>     CFLAGS="-O3 -fomit-frame-pointer -finline-functions -Wall -Wno-unused -march=native -funroll-all-loops -std=gnu99 -fexcess-precision=fast".
>     Each benchmark was run 3 times and the median is reported.
>
>     Notes:
>     * In GROMACS, if "-march=native" is replaced with "-msse2" (which is default in GROMACS 4.5.5),
>       the program runs 5~10% slower than "-march=native".
>     * I also compiled FFTW with and without -march=native. With default (i.e. without -march=native)
>       SP/DP runs are ~3% slower in PME.
>     * This version of FFTW does not use FMA4 SIMD instructions. Using FMA4 SIMD instructions in FFTW
>       should increase performance, but I have not tried this.
>     * In NVE simulations, energy conservation is slightly better than with vanilla, possibly because
>       an IEEE-compliant FMA computes the intermediate product exactly, with a single rounding at the end.
>
>     --
>     Shun SAKURABA, Ph.D.
>     Postdoc @ Molecular Modeling & Simulation Group, Japan Atomic Energy Agency

-- 
Shun SAKURABA, Ph.D.
Postdoc @ Molecular Modeling & Simulation Group, Japan Atomic Energy Agency