[gmx-developers] 4.6 Binaries and Acceleration levels

Fri Jul 6 11:06:14 CEST 2012

On 07/06/2012 10:47 AM, Berk Hess wrote:
> On 07/06/2012 09:55 AM, Roland Schulz wrote:
>> On Fri, Jul 6, 2012 at 3:16 AM, Berk Hess <hess at kth.se> wrote:
>>> On 07/06/2012 05:51 AM, Nicholas Breen wrote:
>>>> On Thu, Jul 05, 2012 at 11:05:53PM +0200, Szilárd Páll wrote:
>>>>> On Thu, Jul 5, 2012 at 10:59 PM, Christoph Junghans <junghans at votca.org> wrote:
>>>>>> Roland has a good point, but Debian and Fedora already compile
>>>>>> Gromacs for different MPI versions.
>>>>>>
>>>>>> 4 acc. x 3 (serial, openmpi & mpich2) = 12 packages!
>>>>>>
>>>>>> @ Jussi: Would that still be possible?
>>>>> I guess possible is one thing and probable is a totally different one...
>>>>>
>>>>> Perhaps it would be good to ask the Ubuntu/Debian maintainer as well...
>>>> I would not want to create that much package complexity (the gromacs
>>>> source package already builds five binary packages!), especially if it
>>>> would all be on only one of the many architectures supported, and I
>>>> cannot think of any other package in the Debian archive that operates
>>>> that way -- everything that I know of with multiple CPU optimizations
>>>> uses run-time detection, except for the unavoidable cases with
>>>> packaging the Linux kernel itself. If it is functionality that is
>>>> contained purely within the shared libraries, then glibc's hwcap
>>>> support might be a workaround if the build system can permit compiling
>>>> one copy of the libraries for every supported variant. Otherwise, if
>>>> it's all within mdrun itself, maybe just stick to run-time detection
>>>> and downgrade or eliminate the warnings it issues for "suboptimal"
>>>> CPUs?
>>>>
>>>>
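To make the run-time detection idea quoted above concrete, here is a minimal sketch of picking the best supported acceleration level from the CPU flags that Linux reports in /proc/cpuinfo. This is not GROMACS code: the level names, the `pick_level` and `read_cpu_flags` helpers, and the fallback name "reference" are all hypothetical, and the flag lookup assumes an x86 Linux system.

```python
# Hypothetical acceleration levels, best first, paired with the
# /proc/cpuinfo flag each one needs (x86 Linux flag names).
LEVELS = [
    ("AVX_256", "avx"),
    ("SSE4.1", "sse4_1"),
    ("SSE2", "sse2"),
]


def pick_level(cpu_flags):
    """Return the first (best) level whose required flag is present."""
    for name, flag in LEVELS:
        if flag in cpu_flags:
            return name
    # No SIMD level matched: fall back to a plain, unaccelerated path.
    return "reference"


def read_cpu_flags(path="/proc/cpuinfo"):
    """Read the flag set of the first CPU; empty set if unavailable."""
    try:
        with open(path) as fh:
            for line in fh:
                if line.startswith("flags"):
                    return set(line.split(":", 1)[1].split())
    except OSError:
        pass
    return set()


print(pick_level(read_cpu_flags()))
```

A single binary built this way would carry every kernel flavor and select one at startup, which is the alternative to shipping one package per acceleration level.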
>>> Erik's non-bonded kernels are present as source in all flavors.
>>> My non-bonded kernel sources can easily be put in all flavors.
>>> Then we would still need compilation and call selection support.
>>> But there is also x86 SIMD code in the PME and bonded code,
>>> as well as in some other parts of the code.
>> How big is the maximum speedup we get in anything but non-bonded
>> (PME, bonded, ...) of AVX/SSE4 over SSE2?
>> If it isn't very dramatic compared to the speedup of just the
>> non-bonded, I think there would be great value in having
>> multi-acceleration non-bonded kernels. Even if that means that all
>> other functions are limited to the lowest common denominator targeted
>> by the binary. Because it would at least prevent people from using the
>> SSE2 non-bonded kernels when using binaries. But if the speedup of
>> bonded/PME is also very significant, then it is probably better to
>> advise people not to use binaries for mdrun but only for analysis
>> tools and always compile mdrun.
> The difference between SSE2 and SSE4.1 will be minor, not really
> worth putting much effort into. But AVX is much faster than SSE on
> both Intel and AMD. The compiler converting SSE intrinsics as well as
> code to AVX helps a lot, and on Intel 256-bit SIMD also helps.
>
> For a run with my nbnxn kernels on Sandy Bridge for AVX-256 vs SSE2 I get:
>
> Wall time improvement using 4 OpenMP threads:
> Overall: -15%
>
> NS: -32%
> bondeds: -4%
> non-bonded: -20%
> PME: -1%
> update: -9%
> constraints: -6%
>
> But with 8 OpenMP threads using HT, AVX gets slower, whereas SSE2 gets
> faster, so then AVX-256 on 4 threads is only 3% faster than SSE2 on 8
> threads. This is negligible and shows that we might have to worry more
> about automating and optimizing the run setup than optimizing the
> compile configuration! We should not run HT+OpenMP+AVX-256 code.
>
I forgot to pin neighboring threads to the same core (this is not yet
automated, but will be soon). So now I have for rnase with PME,
triclinic box, 16813 atoms:
#threads=4 SSE2: 25.4 ns/day
#threads=4 AVX256: 26.5 ns/day
#threads=8 SSE2: 25.2 ns/day
#threads=8 AVX256: 27.2 ns/day

So the difference is 7 to 8%.
It would be nice to have this, but it's not a lot either.
I could live with losing this in a binary distribution.
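
For reference, the percentages can be recomputed from the four ns/day figures above. The sketch below assumes that "7 to 8%" compares the best AVX-256 result (8 threads) against each of the two SSE2 results; that reading is an interpretation, not something stated in the message. The helper name `gain_percent` is made up for the example.

```python
# ns/day figures copied from the benchmark above (rnase, PME, 16813 atoms).
ns_per_day = {
    ("SSE2", 4): 25.4,
    ("AVX256", 4): 26.5,
    ("SSE2", 8): 25.2,
    ("AVX256", 8): 27.2,
}


def gain_percent(fast, slow):
    """Relative throughput gain of `fast` over `slow`, in percent."""
    return (fast / slow - 1.0) * 100.0


# Assumed reading: best AVX-256 run versus each SSE2 run.
best_avx = max(v for (acc, _), v in ns_per_day.items() if acc == "AVX256")
sse2_runs = [v for (acc, _), v in ns_per_day.items() if acc == "SSE2"]
gains_vs_best_avx = sorted(gain_percent(best_avx, v) for v in sse2_runs)

# Alternative reading: same thread count on both sides.
gains_same_threads = {
    n: gain_percent(ns_per_day[("AVX256", n)], ns_per_day[("SSE2", n)])
    for n in (4, 8)
}

print([round(g, 1) for g in gains_vs_best_avx])
print({n: round(g, 1) for n, g in gains_same_threads.items()})
```

Under the first reading the gains fall between 7 and 8%; at equal thread counts the 4-thread gain is smaller than the 8-thread one.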

I think it will be more on AMD Bulldozer.
It will also be more when you need to use MPI, which is when
running on more than 8 cores or with the group cut-off scheme.

Cheers,

Berk