[gmx-developers] SSE+FMA4 kernel

Szilárd Páll szilard.pall at cbr.su.se
Mon Jun 11 18:27:44 CEST 2012


Hi,

Keep an eye on the release-4-6 git branch: the AVX 128 (+FMA4) and AVX 256
versions of the non-bonded kernels will be merged upstream very soon.

Cheers,
--
Szilárd


On Wed, May 30, 2012 at 4:31 PM, Shun Sakuraba <shun.sakuraba at gmail.com> wrote:

> Dear list,
>
> I would like to share my attempt to port the GROMACS 4.5 SSE/SSE2 nonbonded
> kernels to AMD's new family 15h ("Bulldozer") architecture, done to measure
> the benefit of its new instructions. Family 15h adds FMA4 instructions,
> which perform fused multiply-add (and multiply-subtract) operations.
> FMA4 reduces the number of instructions and the latency, giving a marginal
> performance boost. There are also XOP instructions, which are useful for
> implementing the table interpolation used in GROMACS.
>
> As far as I have tested, the speedup is at most about 5% for SP kernels and
> around 10% for DP kernels on gmxbench
> (http://www.gromacs.org/About_Gromacs/Benchmarks).
> The source code and generated kernels are available for download at
> https://bitbucket.org/shun.sakuraba/gromacs-fma4-patch
> I hope this interests some developers.
>
> gmxbench 3.0, ns/day
>           vanilla fma4+xop
> SP    dppc  2.13    2.21
> SP lzm.pme 16.15   16.73
> SP lzm.cut 23.93   24.79
> SP polych2 38.88   38.13
> SP  villin 88.32   92.71
> DP    dppc  1.35    1.52
> DP lzm.pme 10.28   11.14
> DP lzm.cut 15.34   17.09
> DP polych2 26.73   29.87
> DP  villin 48.82   54.92
>
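The per-system speedups behind the "about 5%" and "around 10%" figures can be recomputed from the table. The following is a minimal sketch in Python, not part of the patch; the function names `speedup_percent` and `summarize` are made up for illustration, and the numbers are copied verbatim from the table above.

```python
# Throughput in ns/day copied from the gmxbench 3.0 table above:
# (precision, system) -> (vanilla, fma4+xop)
RESULTS = {
    ("SP", "dppc"): (2.13, 2.21),
    ("SP", "lzm.pme"): (16.15, 16.73),
    ("SP", "lzm.cut"): (23.93, 24.79),
    ("SP", "polych2"): (38.88, 38.13),
    ("SP", "villin"): (88.32, 92.71),
    ("DP", "dppc"): (1.35, 1.52),
    ("DP", "lzm.pme"): (10.28, 11.14),
    ("DP", "lzm.cut"): (15.34, 17.09),
    ("DP", "polych2"): (26.73, 29.87),
    ("DP", "villin"): (48.82, 54.92),
}


def speedup_percent(vanilla, patched):
    """Relative change in throughput, in percent (positive = faster)."""
    return (patched / vanilla - 1.0) * 100.0


def summarize(results):
    """Return {precision: (min, mean, max)} of the per-system speedups."""
    by_precision = {}
    for (precision, _system), (vanilla, patched) in results.items():
        by_precision.setdefault(precision, []).append(
            speedup_percent(vanilla, patched)
        )
    return {
        precision: (min(values), sum(values) / len(values), max(values))
        for precision, values in by_precision.items()
    }


if __name__ == "__main__":
    for (precision, system), (vanilla, patched) in RESULTS.items():
        change = speedup_percent(vanilla, patched)
        print(f"{precision} {system:>8} {change:+6.1f}%")
    for precision, (low, mean, high) in summarize(RESULTS).items():
        print(f"{precision}: min {low:+.1f}%  mean {mean:+.1f}%  max {high:+.1f}%")
```

Note that SP polych2 is the one case where the fma4+xop build is slightly slower than vanilla (38.13 vs 38.88 ns/day), so the SP minimum is negative.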
> Benchmark results were taken in the following environment:
> AMD FX-4100 (2 modules, 4 cores, 3.6GHz / up to 3.9GHz with turbo)
> Linux (ArchLinux) 3.1.12-1-ARCH
> gcc (GCC) 4.7.0  20120505 (prerelease)
> FFTW 3.3.2
> GROMACS 4.5.5, run with thread parallelization on 4 threads, compiled with
> CFLAGS="-O3 -fomit-frame-pointer -finline-functions -Wall -Wno-unused
> -march=native -funroll-all-loops -std=gnu99 -fexcess-precision=fast".
> Each benchmark was run 3 times and the median is reported.
>
> Notes:
> * In GROMACS, if "-march=native" is replaced with "-msse2" (the default
>   in GROMACS 4.5.5), the program runs 5-10% slower.
> * I also compiled FFTW with and without -march=native. With the default
>   flags (i.e. without -march=native), the SP and DP PME runs are ~3% slower.
> * This version of FFTW does not use FMA4 SIMD instructions. Using them in
>   FFTW should increase performance, but I have not tried it.
> * In NVE simulations, energy conservation is slightly better than with the
>   vanilla kernels, possibly because in an IEEE-compliant FMA the
>   intermediate product is computed exactly (as if with infinite precision)
>   and rounded only once, after the addition.
>
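The last note turns on the difference between one rounding and two. The following is a minimal sketch in Python, not the kernel code: it emulates a double-precision fused multiply-add with exact rational arithmetic instead of using the FMA4 instructions. The helper names and the input values are hypothetical, chosen so that the product is not exactly representable as a double.

```python
from fractions import Fraction


def fused_multiply_add(a, b, c):
    """Emulate a fused multiply-add on doubles: a*b + c with one rounding.

    Fraction arithmetic is exact, so the only rounding happens in float().
    """
    return float(Fraction(a) * Fraction(b) + Fraction(c))


def multiply_then_add(a, b, c):
    """Separate multiply and add: the product is rounded before the add."""
    return a * b + c


if __name__ == "__main__":
    # Hypothetical inputs: the exact product is 1 - 2**-60, which is not
    # representable as a double and rounds to 1.0 in a separate multiply.
    a = 1.0 + 2.0**-30
    b = 1.0 - 2.0**-30
    c = -1.0
    print("multiply, then add:", multiply_then_add(a, b, c))
    print("fused multiply-add:", fused_multiply_add(a, b, c))
```

With these inputs the separate multiply loses the 2**-60 term before the addition, while the fused version keeps it. Whether this effect explains the improved energy conservation is the hypothesis stated in the note, not something this sketch demonstrates.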
> --
> Shun SAKURABA, Ph.D.
> Postdoc @ Molecular Modeling & Simulation Group, Japan Atomic Energy Agency