[gmx-developers] SSE+FMA4 kernel

Maik Nijhuis maik.nijhuis at clustervision.com
Wed Jun 6 11:43:14 CEST 2012

2012/5/30 Shun Sakuraba <shun.sakuraba at gmail.com>

> Dear list,
> I would like to share my trial to port GROMACS 4.5 SSE/SSE2 nonbonded kernel
> to AMD's new family 15h chip, "Bulldozer" architecture, performed to measure
> the benefit of new instructions. In AMD family 15h new FMA4 instructions
> are added; FMA4 is the fused multiplication and addition (subtraction)
> operations.
> FMA4 reduces number of instructions and latency, giving a marginal
> performance boost.
> Also there are XOP instructions, which is useful implementing table
> interpolation used in GROMACS.

Dear Shun,

Great work! However, besides FMA4, Bulldozer also supports three-operand
AVX instructions. In fact, FMA4 is an extension to AVX. Intel Sandy Bridge
supports AVX but not FMA4.

With AVX, you can replace two-operand SSE instructions with three-operand AVX
instructions and avoid register copies. For example, the SSE sequence

movaps %xmm6,%xmm1
mulps %xmm6,%xmm1 ## rinv4
mulps %xmm6,%xmm1 ## rinv6
movaps %xmm1,%xmm2
mulps %xmm2,%xmm2 ## xmm2=rinv12

becomes

vmulps %xmm6,%xmm6,%xmm1 ## rinv4
vmulps %xmm6,%xmm1,%xmm1 ## rinv6
vmulps %xmm1,%xmm1,%xmm2 ## xmm2=rinv12

With those modifications, you may get additional performance. Often, you
can free up registers that hold temporary values, and store values there
that would otherwise end up on the stack.
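
For readers who prefer intrinsics to hand-written assembly, the same data flow
can be sketched in C. This is only an illustration, not GROMACS kernel code:
the names lj_powers, lj_powers_from_rinvsq and rinv12_first_lane are
hypothetical, and it assumes that %xmm6 in the snippets above holds 1/r^2
(rinvsq). With intrinsics the compiler picks the encoding, so building with an
AVX option such as GCC's -mavx should give the three-operand vmulps forms
without source changes.

```c
#include <xmmintrin.h>  /* SSE intrinsics: __m128, _mm_mul_ps, ... */

/* Hypothetical container for the two powers produced by the sequence. */
typedef struct {
    __m128 rinv6;
    __m128 rinv12;
} lj_powers;

/* Assumption: rinvsq holds 1/r^2 for four pairs (the role of %xmm6). */
static lj_powers lj_powers_from_rinvsq(__m128 rinvsq)
{
    lj_powers p;
    __m128 rinv4 = _mm_mul_ps(rinvsq, rinvsq);  /* rinv4  = rinvsq * rinvsq */
    p.rinv6  = _mm_mul_ps(rinv4, rinvsq);       /* rinv6  = rinv4 * rinvsq  */
    p.rinv12 = _mm_mul_ps(p.rinv6, p.rinv6);    /* rinv12 = rinv6 * rinv6   */
    return p;
}

/* Scalar convenience wrapper: broadcast one value, return the first lane. */
static float rinv12_first_lane(float rinvsq)
{
    float out[4];
    lj_powers p = lj_powers_from_rinvsq(_mm_set1_ps(rinvsq));
    _mm_storeu_ps(out, p.rinv12);
    return out[0];
}
```

Calling rinv12_first_lane(0.5f) broadcasts 0.5 to all four lanes and returns
0.5^6 from the first one.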

Another direction is using 256-bit ymm registers instead of 128-bit xmm
registers, but that would require a thorough rewrite of the
kernels. Especially on Sandy Bridge, where each core has its own 256-bit
floating point unit, you may very well increase performance by 50% or more.
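
To give a rough idea of what the wider registers mean for the arithmetic, here
is a minimal sketch that processes eight floats per operation instead of four.
It uses the GCC/Clang vector extension rather than real kernel code; the type
v8sf and the function rinv12_from_rinvsq_256 are hypothetical names, and the
input is assumed to hold 1/r^2 values with a length that is a multiple of 8.
When built with -mavx each multiply should map to one ymm instruction; without
it the compiler splits the work into narrower operations, so the result is the
same either way.

```c
#include <stddef.h>
#include <string.h>

/* GCC/Clang vector extension: eight packed floats (256 bits). */
typedef float v8sf __attribute__((vector_size(32)));

/* Hypothetical helper: rinv12[i] = rinvsq[i]^6 for n floats.
 * Assumption: n is a multiple of 8; any remainder is left untouched. */
static void rinv12_from_rinvsq_256(const float *rinvsq, float *rinv12, size_t n)
{
    for (size_t i = 0; i + 8 <= n; i += 8) {
        v8sf x, x4, x6, x12;
        memcpy(&x, rinvsq + i, sizeof x);      /* unaligned 256-bit load  */
        x4  = x * x;                           /* rinv4  = rinvsq^2       */
        x6  = x4 * x;                          /* rinv6  = rinv4 * rinvsq */
        x12 = x6 * x6;                         /* rinv12 = rinv6^2        */
        memcpy(rinv12 + i, &x12, sizeof x12);  /* unaligned 256-bit store */
    }
}
```

A full 256-bit kernel would also need wider loads of coordinates and
parameters and wider force accumulation, which is where the rewrite effort
mentioned above comes from.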

On Bulldozer, two cores share one 256-bit floating point unit (FPU). With
128-bit instructions, each core uses 'half' of the FPU, and the instructions
run at full speed. With 256-bit instructions, the two cores compete for the
FPU, so the performance gain would be limited compared to Sandy Bridge.
Bulldozer can still benefit from FMA4, though, which is not available on
Sandy Bridge.

Dr. Maik Nijhuis
HPC Benchmark Specialist
