[gmx-developers] OpenACC Development

Fri Jul 1 22:21:19 CEST 2016

Hi Erik,

> Hi,
>
>> On 01 Jul 2016, at 20:21, Millad Ghane <mghane at cs.uh.edu> wrote:
>>
>> I am not saying there is no performance loss. There is, but the
>> performance loss shouldn't be more than like 30% "for the GPU codes".
>
> I think you are completely underestimating the importance of data layout.
>
> Of course OpenACC would do decently if we keep the 90% of the GPU-related
> code we wrote to provide e.g. GPU-optimized data layouts.
>
> However, in that case there is no use whatsoever for OpenACC since the few
> lines of "CUDA" code in the kernels are straightforward - and you are
> losing 30% for no use whatsoever.
> You would still need a different data layout for Xeon Phi.
>
> Adding OpenACC pragmas isn't very difficult, so again: if you believe
> that you can write a completely general implementation that is only 30%
> slower in the accelerated kernels (~15% in total) and that works on all
> architectures, please do it and show us :-)
>
>

I agree with you. Adding OpenACC pragmas to serial code is not hard, and
the performance bottleneck would be the data layout on the device (and how
to transfer the data from host to GPU). The current kernels are already
very well optimized, and it is very hard to beat them.
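
To make the layout point concrete, here is a minimal sketch of what I mean
by a layout change. It is only my own illustration, not GROMACS code: the
names AtomAoS, AtomsSoA, toSoA and sumChargeWeightedX are hypothetical, and
real kernels use far more elaborate, architecture-specific layouts.

```cpp
#include <cstddef>
#include <vector>

// Array-of-structures: natural for serial code, but the x, y, z, q fields
// of consecutive atoms are interleaved in memory.
struct AtomAoS {
    float x, y, z, q;
};

// Structure-of-arrays: each field is stored contiguously, so consecutive
// loop iterations (or GPU threads) touch consecutive addresses.
struct AtomsSoA {
    std::vector<float> x, y, z, q;
};

// Explicit layout conversion. A pragma on a loop does not do this for you.
AtomsSoA toSoA(const std::vector<AtomAoS>& atoms) {
    AtomsSoA out;
    out.x.reserve(atoms.size());
    out.y.reserve(atoms.size());
    out.z.reserve(atoms.size());
    out.q.reserve(atoms.size());
    for (const AtomAoS& a : atoms) {
        out.x.push_back(a.x);
        out.y.push_back(a.y);
        out.z.push_back(a.z);
        out.q.push_back(a.q);
    }
    return out;
}

// Toy reduction over the SoA layout: unit-stride loads from q and x.
float sumChargeWeightedX(const AtomsSoA& s) {
    float acc = 0.0f;
    for (std::size_t i = 0; i < s.x.size(); ++i) {
        acc += s.q[i] * s.x[i];
    }
    return acc;
}
```

The loop body is the easy part to annotate; the conversion step is the part
that has to be written by hand and may differ per architecture.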

>>> That's like saying all C++ implementations of molecular dynamics should
>>> have the same performance because it's the same language.
>>> If that was true, you should not see any performance difference when you
>>> disable SIMD. After all, all floating-point math on x86 is implemented
>>> with SSE/AVX instructions today.
>>>
>> It's different. By enabling and using SIMD instructions you are actually
>> accessing and exploiting hardware features of the CPU. So, since you are
>> accessing high-performance features, the code executes faster.
>
> No. Please check the assembly output of your compiler. Your compiler WILL
> be generating AVX2 instructions with proper flags and "-O3", but it is
> not capable of reorganizing the data layout.
>
>> But when changing languages with the same SIMD configuration, the output
>> should be roughly the same.
>
> "same SIMD configuration" means keeping the 99% of the data layouts
> optimized for each architecture. Sorry, but that's not making your
> implementation portable.
>
>> However, I argue that in this case, C is a little bit
>> faster than C++.
>
> No. Intrinsics will generate identical assembly in C and C++ - and that is
> based on knowledge, not opinions. Once we exploit template parameters, C++
> completely kills C for complex kernels.
>
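
If I understand the template-parameter point correctly, it would look
roughly like the sketch below. This is a toy example of my own, not an
actual GROMACS kernel: PairResult, pairKernel and runKernel are hypothetical
names, and the "interaction" is just a 1-D harmonic term.

```cpp
#include <cstddef>
#include <vector>

struct PairResult {
    float force;
    float energy;
};

// The kernel flavor is selected by a template parameter. Inside each
// instantiation computeEnergy is a compile-time constant, so the compiler
// can drop the energy branch entirely from pairKernel<false>.
template <bool computeEnergy>
PairResult pairKernel(const std::vector<float>& dx, float k) {
    PairResult r{0.0f, 0.0f};
    for (std::size_t i = 0; i < dx.size(); ++i) {
        r.force += -k * dx[i];
        if (computeEnergy) {
            r.energy += 0.5f * k * dx[i] * dx[i];
        }
    }
    return r;
}

// The run-time choice is made once, outside the hot loop.
PairResult runKernel(const std::vector<float>& dx, float k, bool wantEnergy) {
    return wantEnergy ? pairKernel<true>(dx, k) : pairKernel<false>(dx, k);
}
```

In C the equivalent would need either a run-time branch inside the loop or
several hand-written (or macro-generated) copies of the kernel.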

Have you ever tried other high-performance approaches too, such as
task-based systems? For instance, Habanero-C from Rice University.

Do you think it would be a promising approach?

Regards,
Millad

>
> Cheers,
>
> Erik
>