[gmx-developers] OpenACC Development

Fri Jul 1 20:45:11 CEST 2016

Hi,

> On 01 Jul 2016, at 20:21, Millad Ghane <mghane at cs.uh.edu> wrote:
> 
> I am not saying there is no performance loss. There is, but the
> performance loss shouldn't be more than like 30% "for the GPU codes".

I think you are completely underestimating the importance of data layout.

Of course OpenACC would do decently if we kept the 90% of the GPU-related code we wrote to provide, e.g., GPU-optimized data layouts.

However, in that case there is no use whatsoever for OpenACC, since the few lines of “CUDA” code in the kernels are straightforward - and you would be losing 30% for no benefit.
You would still need a different data layout for Xeon Phi.

Adding OpenACC pragmas isn’t very difficult, so again: if you believe that you can write a completely general implementation that is only 30% slower in the accelerated kernels (~15% in total) and that works on all architectures, please do it and show us :-)

>> That’s like saying all C++ implementations of molecular dynamics should
>> have the same performance because it’s the same language.
>> If that was true, you should not see any performance difference when you
>> disable SIMD. After all, all floating-point math on x86 is implemented
>> with SSE/AVX instructions today.
>> 
> It's different. By enabling and using SIMD commands you are actually
> accessing and exploiting some hardware features of CPU. So, since you are
> accessing high performance features, the code executes faster.

No. Please check the assembly output of your compiler. Your compiler WILL be generating AVX2 instructions with proper flags and “-O3”, but it is not capable of reorganizing the data layout.
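
To make the layout point concrete, here is a minimal sketch (not GROMACS code; Vec3, Coords, and the function names are hypothetical, made up for illustration). Both functions compute the same sum, but only the programmer can decide to store coordinates as separate x/y/z arrays; the compiler will vectorize the loop it is given, it will not convert one layout into the other.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical array-of-structures (AoS) layout: x, y, z interleaved in memory.
struct Vec3
{
    float x, y, z;
};

// Hypothetical structure-of-arrays (SoA) layout: each component is contiguous,
// so consecutive SIMD lanes can be filled with plain contiguous loads.
struct Coords
{
    std::vector<float> x, y, z;
};

// Sum of squared coordinates over the interleaved layout.
float sumSquaredAoS(const std::vector<Vec3>& r)
{
    float sum = 0.0f;
    for (const Vec3& v : r)
    {
        sum += v.x * v.x + v.y * v.y + v.z * v.z;
    }
    return sum;
}

// Same computation over the SoA layout.
float sumSquaredSoA(const Coords& r)
{
    float sum = 0.0f;
    for (std::size_t i = 0; i < r.x.size(); ++i)
    {
        sum += r.x[i] * r.x[i] + r.y[i] * r.y[i] + r.z[i] * r.z[i];
    }
    return sum;
}

// The conversion is an explicit step the programmer has to write;
// no optimization flag performs it.
Coords toSoA(const std::vector<Vec3>& r)
{
    Coords out;
    out.x.reserve(r.size());
    out.y.reserve(r.size());
    out.z.reserve(r.size());
    for (const Vec3& v : r)
    {
        out.x.push_back(v.x);
        out.y.push_back(v.y);
        out.z.push_back(v.z);
    }
    return out;
}
```

Which layout is faster depends on the kernel and the hardware, which is exactly why the layout code ends up being architecture-specific.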

> But
> changing languages with the same SIMD configuration, the output should be
> roughly the same.

“same SIMD configuration” means keeping the 99% of the data layouts optimized for each architecture. Sorry, but that’s not making your implementation portable.

> However, I argue that in this case, C is a little bit
> faster than C++.

No. Intrinsics will generate identical assembly in C and C++ - and that is based on knowledge, not opinions. Once we exploit template parameters, C++ completely kills C for complex kernels.
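
As a rough illustration of the template point (a toy sketch, not the actual GROMACS kernels; harmonicKernel and runKernel are hypothetical names): a bool template parameter lets one source function produce separately compiled kernel variants, with the unused branch removed from the inner loop at compile time. In plain C the same effect typically requires macros or duplicated kernel sources.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical toy kernel: harmonic force f = -k*x, optionally accumulating
// the energy 0.5*k*x^2. computeEnergy is a compile-time constant, so each
// instantiation contains only the code it needs.
template <bool computeEnergy>
float harmonicKernel(const std::vector<float>& x, std::vector<float>& f, float k)
{
    float energy = 0.0f;
    for (std::size_t i = 0; i < x.size(); ++i)
    {
        f[i] = -k * x[i];
        if (computeEnergy) // constant condition; the dead branch is optimized away
        {
            energy += 0.5f * k * x[i] * x[i];
        }
    }
    return energy;
}

// The runtime decision is made once, outside the hot loop.
float runKernel(bool wantEnergy, const std::vector<float>& x, std::vector<float>& f, float k)
{
    return wantEnergy ? harmonicKernel<true>(x, f, k) : harmonicKernel<false>(x, f, k);
}
```

With several such parameters (energy on/off, interaction type, and so on) the number of variants multiplies, and the compiler generates all of them from a single source.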

Cheers,

Erik