[gmx-developers] 4.6 Binaries and Acceleration levels

Fri Jul 6 10:47:37 CEST 2012

On 07/06/2012 09:55 AM, Roland Schulz wrote:
> On Fri, Jul 6, 2012 at 3:16 AM, Berk Hess <hess at kth.se> wrote:
>> On 07/06/2012 05:51 AM, Nicholas Breen wrote:
>>> On Thu, Jul 05, 2012 at 11:05:53PM +0200, Szilárd Páll wrote:
>>>> On Thu, Jul 5, 2012 at 10:59 PM, Christoph Junghans <junghans at votca.org> wrote:
>>>>> Roland has a good point, but Debian and Fedora already compile Gromacs
>>>>> for different MPI versions.
>>>>>
>>>>> 4 acc. x 3 (serial, openmpi & mpich2) = 12 packages!
>>>>>
>>>>> @ Jussi: Would that still be possible?
>>>> I guess possible is one thing and probable is a totally different one...
>>>>
>>>> Perhaps it would be good to ask the Ubuntu/Debian maintainer as well...
>>> I would not want to create that much package complexity (the gromacs source
>>> package already builds five binary packages!), especially if it would all be on
>>> only one of the many architectures supported, and I cannot think of any other
>>> package in the Debian archive that operates that way -- everything that I know
>>> of with multiple CPU optimizations uses run-time detection, except for the
>>> unavoidable cases with packaging the Linux kernel itself.  If it is
>>> functionality that is contained purely within the shared libraries, then
>>> glibc's hwcap support might be a workaround if the build system can permit
>>> compiling one copy of the libraries for every supported variant.  Otherwise, if
>>> it's all within mdrun itself, maybe just stick to run-time detection and
>>> downgrade or eliminate the warnings it issues for "suboptimal" CPUs?
>>>
>>>
>> Erik's non-bonded kernels are present as source in all flavors.
>> My non-bonded kernel sources can easily be added to all flavors.
>> Then we would still need compilation and call selection support.
>> But there is also x86 SIMD code in the PME and bonded code,
>> as well as in some other parts of the code.
> How big is the maximum speedup of AVX/SSE4 over SSE2 in anything other
> than the non-bonded kernels (PME, bonded, ...)?
> If it isn't very dramatic compared to the speedup of just the
> non-bonded, I think there would be great value in having
> multi-acceleration non-bonded kernels, even if that means that all
> other functions are limited to the lowest common denominator targeted
> by the binary, because it would at least prevent people from using the
> SSE2 non-bonded kernels when using binaries. But if the speedup of
> bonded/PME is also very significant, then it is probably better to
> advise people not to use binaries for mdrun but only for analysis
> tools, and to always compile mdrun.
The difference between SSE2 and SSE4.1 will be minor, not really
worth putting much effort into. But AVX is much faster than SSE on
both Intel and AMD. Having the compiler convert SSE intrinsics, as well
as the rest of the code, to AVX helps a lot, and on Intel 256-bit SIMD
also helps.
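
A rough sketch of what "call selection support" for multi-acceleration
non-bonded kernels could look like, purely as an illustration. All names
below (nb_kernel_fn, acc_level, detect_acc_level, select_nb_kernel) are
hypothetical and not the GROMACS API, and the kernel bodies are plain C
stand-ins. It assumes GCC or Clang on x86, using the compiler builtins
__builtin_cpu_init and __builtin_cpu_supports, and falls back to the
baseline everywhere else. In a real build each kernel flavor would be
compiled separately with its own SIMD flags.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical kernel signature for illustration; not the GROMACS API. */
typedef void (*nb_kernel_fn)(const float *x, float *f, size_t n);

/* Illustrative subset of acceleration levels, lowest first. */
typedef enum { ACC_BASELINE = 0, ACC_SSE4_1, ACC_AVX_256, ACC_COUNT } acc_level;

/* Stand-in kernels: identical plain C bodies. Real flavors would differ
 * only in the SIMD instructions used, not in the result. */
static void nb_kernel_baseline(const float *x, float *f, size_t n)
{
    for (size_t i = 0; i < n; i++) f[i] += x[i];
}

static void nb_kernel_sse4_1(const float *x, float *f, size_t n)
{
    for (size_t i = 0; i < n; i++) f[i] += x[i];
}

static void nb_kernel_avx_256(const float *x, float *f, size_t n)
{
    for (size_t i = 0; i < n; i++) f[i] += x[i];
}

/* One entry per acceleration level, indexed by acc_level. */
static const nb_kernel_fn kernel_table[ACC_COUNT] = {
    nb_kernel_baseline, nb_kernel_sse4_1, nb_kernel_avx_256
};

/* Ask the CPU at run time which level it supports (assumption: GCC or
 * Clang builtins on x86; any other platform gets the baseline). */
static acc_level detect_acc_level(void)
{
#if (defined(__x86_64__) || defined(__i386__)) && defined(__GNUC__)
    __builtin_cpu_init();
    if (__builtin_cpu_supports("avx"))    return ACC_AVX_256;
    if (__builtin_cpu_supports("sse4.1")) return ACC_SSE4_1;
#endif
    return ACC_BASELINE;
}

/* Map an acceleration level to the kernel compiled for it. */
static nb_kernel_fn select_nb_kernel(acc_level level)
{
    assert((int)level >= 0 && level < ACC_COUNT);
    return kernel_table[level];
}
```

The selection would be done once at startup, e.g.
select_nb_kernel(detect_acc_level()), and the returned pointer reused for
every step, so only the non-bonded call goes through the dispatch while
everything else stays at the lowest common denominator of the binary.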

For a run with my nbnxn kernels on Sandy Bridge, comparing AVX-256 with
SSE2, I get:

Change in wall time (negative means AVX-256 is faster) using 4 OpenMP threads:
Overall: -15%

NS: -32%
bondeds: -4%
non-bonded: -20%
PME: -1%
update: -9%
constraints: -6%

But with 8 OpenMP threads using HT, AVX gets slower whereas SSE2 gets
faster, so AVX-256 on 4 threads is then only 3% faster than SSE2 on
8 threads. This is negligible and shows that we might have to worry more
about automating and optimizing the run setup than about optimizing the
compile configuration! We should not run HT+OpenMP+AVX-256 code.
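
As a minimal sketch of what automating this part of the run setup might
mean, assuming the rule of thumb from the single Sandy Bridge measurement
above holds more generally (which is untested): with AVX-256, use one
OpenMP thread per physical core; with SSE2, use all hardware threads. The
function name and parameters are hypothetical and not part of mdrun.

```c
/* Hypothetical helper, not part of mdrun. Picks an OpenMP thread count
 * from the core counts and the SIMD flavor of the binary. */
static int choose_openmp_threads(int physical_cores, int hw_threads,
                                 int uses_avx256)
{
    /* Guard against nonsensical input. */
    if (physical_cores < 1) physical_cores = 1;
    if (hw_threads < physical_cores) hw_threads = physical_cores;

    /* Assumption from the measurement above: AVX-256 kernels got slower
     * with hyper-threading, SSE2 kernels got faster. */
    return uses_avx256 ? physical_cores : hw_threads;
}
```

For the machine above (4 cores, 8 hardware threads) this gives 4 threads
for an AVX-256 build and 8 threads for an SSE2 build.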

With Erik's kernels the picture might be different, as my kernels (and
PME) lose with HT at 8 threads mainly because of reduction overhead.
But using MPI threads or mixed parallelization is slower.

Cheers,

Berk