[gmx-users] How to let Gromacs run 60% faster..

Sun Oct 18 15:40:52 CEST 2009

Vaclav Horacek wrote:
> Hello,
> 
> I just came across the site www.yasara.org, which claims that their newly released MD algorithms are 60% faster than Gromacs.
> Actually they don't say 'Gromacs', but 'closest competitor', which I assume is Gromacs, looking at the benchmark numbers.

One should always be skeptical of people who claim to be better but 
don't show their comparisons. It's very difficult to compare different 
MD packages fairly, because of fundamental differences in algorithms 
and optimization levels. See the discussion in the GROMACS 4 paper, for 
example. Even if you can design a fair test, you still need to be sure 
you've done the best by all codes with the compiler at hand. 
Further, the metric they quote (time for a single integration step) is 
not very useful. Anyone doing serious MD is going to run calculations 
for at least days, if not months - comparisons need to be over *those* 
timeframes. They claim to be doing PME with a 0.786 nm real-space 
cut-off, which ought to require a Fourier grid spacing much smaller 
than 0.1 nm for the reciprocal-space part to get decent accuracy. Speed is 
only one part of the issue. There might be other reasons they aren't 
referring to peer-reviewed literature to support these claims :-)

> From the numbers, I also saw that they seem to do particularly well on newer CPUs like Core 2 Duo and Xeon L5420, using code for SSSE3 and SSE4.1.

They don't show performance numbers without such extensions being used, 
so it looks like marketing hype. I don't see SSE3 or higher being very 
useful at all.

> I am not an expert in this kind of low-level stuff, but typing SSE4 into Wikipedia shows lots of instructions that look useful for MD. For example the 'dpps' instruction does an entire dot product at once.

IIRC, there's only one dot-product-like operation per interaction in a 
PME non-bonded inner loop, which is the operation for r^2= (x1-x2)^2 + 
(y1-y2)^2 + (z1-z2)^2, which is probably already spread out SIMD-style 
over several interactions with SSE or SSE2. At best you might gain 2 
flops per interaction which is a percent or two. Whether that might come 
at a cost to the existing SSE/SSE2 SIMD is a harder question.
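
To make the "spread out SIMD-style over several interactions" point 
concrete, here is a minimal sketch. It is not the GROMACS kernel: plain 
Python lists stand in for 4-wide SSE registers, and the names r2_scalar, 
r2_lanes and LANES are made up for illustration. The point it shows is 
that with one interaction per lane, the sum dx^2 + dy^2 + dz^2 runs 
across three separate registers, so no horizontal dot product is needed.

```python
# Illustrative sketch only, not the GROMACS kernel: plain Python lists
# stand in for 4-wide SSE registers, with one interaction per lane.
LANES = 4  # an SSE register holds four single-precision floats


def r2_scalar(pi, pj):
    """Squared distance for one pair: 3 subtracts, 3 multiplies, 2 adds."""
    dx, dy, dz = pi[0] - pj[0], pi[1] - pj[1], pi[2] - pj[2]
    return dx * dx + dy * dy + dz * dz


def r2_lanes(xi, yi, zi, xj, yj, zj):
    """Squared distances for LANES pairs at once.

    Inputs are per-coordinate lists of length LANES (structure-of-arrays).
    """
    # Each line below stands for one packed subtract over all lanes.
    dx = [a - b for a, b in zip(xi, xj)]
    dy = [a - b for a, b in zip(yi, yj)]
    dz = [a - b for a, b in zip(zi, zj)]
    # Three packed multiplies and two packed adds. The adds combine the
    # dx, dy and dz registers lane by lane, never the lanes of one register.
    return [x * x + y * y + z * z for x, y, z in zip(dx, dy, dz)]


# Atom i at the origin interacting with four neighbours j.
xi, yi, zi = [0.0] * LANES, [0.0] * LANES, [0.0] * LANES
xj = [1.0, 3.0, 0.0, 1.0]
yj = [2.0, 4.0, 0.0, 1.0]
zj = [2.0, 0.0, 1.0, 1.0]
r2 = r2_lanes(xi, yi, zi, xj, yj, zj)
print(r2)  # [9.0, 25.0, 1.0, 3.0]
```

In this layout the two additions per interaction are the only part a 
dot-product instruction could absorb, which is where the "2 flops" 
figure above comes from.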

A single-cycle "floating-point distance to nearest integer":

y <- x - floor(x)

would be noteworthy :-)
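
For reference, a small sketch of that operation. Where it would be used 
is my assumption (for example wrapping a coordinate, in box-length 
units, back into a periodic cell); the function names frac_floor and 
frac_trunc are hypothetical. It also shows why a plain 
truncate-to-integer conversion is not a substitute for floor when x is 
negative.

```python
import math


def frac_floor(x):
    """y <- x - floor(x): offset of x above the integer at or below it."""
    return x - math.floor(x)


def frac_trunc(x):
    """Same idea with truncation toward zero; differs for negative x."""
    return x - int(x)


print(frac_floor(2.75))   # 0.75
print(frac_floor(-0.25))  # 0.75, wrapped into the positive range
print(frac_trunc(-0.25))  # -0.25, truncation leaves it negative
```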

> I looked at the gmxlib/nonbonded directory and saw that SSE2 seems to be the most supported by Gromacs. So maybe adding support for SSE3 and SSE4 can still help a lot! Are there any plans for that?

Mark