[gmx-developers] OpenACC Development

Millad Ghane mghane at cs.uh.edu
Fri Jul 1 22:21:19 CEST 2016

Hi Erik,

> Hi,
>> On 01 Jul 2016, at 20:21, Millad Ghane <mghane at cs.uh.edu> wrote:
>> I am not saying there is no performance loss. There is, but the
>> performance loss shouldn't be more than like 30% "for the GPU codes".
> I think you are completely underestimating the importance of data layout.
> Of course OpenACC would do decently if we keep the 90% of the GPU-related
> code we wrote to provide e.g. GPU-optimized data layouts.
> However, in that case there is no use whatsoever for OpenACC since the few
> lines of “CUDA” code in the kernels are straightforward - and you are
> losing 30% for no use whatsoever.
> You would still need a different data layout for Xeon Phi.
> Adding OpenACC pragmas isn’t very difficult, so again: if you believe
> that you can write a completely general implementation that is only 30%
> slower in the accelerated kernels (~15% in total) and that works on all
> architectures, please do it and show us :-)

I agree with you. Adding OpenACC pragmas to a serial code is not hard, and
the performance bottleneck would be the data layout on the device (how the
data is transferred from host to GPU). The current kernels are written in a
highly optimized form, and they are very hard to beat.

>>> That’s like saying all C++ implementations of molecular dynamics
>>> should
>>> have the same performance because it’s the same language.
>>> If that was true, you should not see any performance difference when
>>> you
>>> disable SIMD. After all, all floating-point math on x86 is implemented
>>> with SSE/AVX instructions today.
>> It's different. By enabling and using SIMD instructions you are
>> actually accessing and exploiting hardware features of the CPU. Since
>> you are exploiting those high-performance features, the code executes
>> faster.
> No. Please check the assembly output of your compiler. Your compiler WILL
> be generating AVX2 instructions with proper flags and "-O3", but it is
> not capable of reorganizing the data layout.
>> But
>> changing languages with the same SIMD configuration, the output should
>> be
>> roughly the same.
> “same SIMD configuration” means keeping the 99% of the data layouts
> optimized for each architecture. Sorry, but that’s not making your
> implementation portable.
>> However, I argue that in this case, C is a little faster than C++.
> No. Intrinsics will generate identical assembly in C and C++ - and that is
> based on knowledge, not opinions. Once we exploit template parameters, C++
> completely kills C for complex kernels.

Have you ever tried other high-performance approaches, such as task-based
systems? For instance, Habanero-C from Rice University.

Do you think that would be a promising approach?


> Cheers,
> Erik
> --
> Gromacs Developers mailing list
> * Please search the archive at
> http://www.gromacs.org/Support/Mailing_Lists/GMX-developers_List before
> posting!
> * Can't post? Read http://www.gromacs.org/Support/Mailing_Lists
> * For (un)subscribe requests visit
> https://maillist.sys.kth.se/mailman/listinfo/gromacs.org_gmx-developers or
> send a mail to gmx-developers-request at gromacs.org.
