[gmx-developers] OpenACC Development

Szilárd Páll pall.szilard at gmail.com
Thu Jul 28 17:36:58 CEST 2016


On Fri, Jul 1, 2016 at 10:21 PM, Millad Ghane <mghane at cs.uh.edu> wrote:
> Hi Erik,
>
>> Hi,
>>
>>> On 01 Jul 2016, at 20:21, Millad Ghane <mghane at cs.uh.edu> wrote:
>>>
>>> I am not saying there is no performance loss. There is, but the
>>> performance loss shouldn't be more than like 30% "for the GPU codes".
>>
>> I think you are completely underestimating the importance of data layout.
>>
>> Of course OpenACC would do decently if we kept the 90% of the GPU-related
>> code we wrote to provide, e.g., GPU-optimized data layouts.
>>
>> However, in that case there is no use whatsoever for OpenACC, since the few
>> lines of "CUDA" code in the kernels are straightforward - and you are
>> losing 30% for nothing.
>> You would still need a different data layout for Xeon Phi.
>>
>> Adding OpenACC pragmas isn't very difficult, so again: if you believe
>> that you can write a completely general implementation that is only 30%
>> slower in the accelerated kernels (~15% in total) and that works on all
>> architectures, please do it and show us :-)
>>
>>
>
> I agree with you. Adding OpenACC pragmas to serial code is not hard, and
> the performance bottleneck would be the data layout on the device (how the
> data is transferred from host to GPU). The current kernels are written in a
> highly tuned way, and it is very hard to beat them.
>
>
>>>> That's like saying all C++ implementations of molecular dynamics
>>>> should have the same performance because it's the same language.
>>>> If that were true, you should not see any performance difference when
>>>> you disable SIMD. After all, all floating-point math on x86 is
>>>> implemented with SSE/AVX instructions today.
>>>>
>>> It's different. By enabling and using SIMD instructions you are actually
>>> accessing and exploiting hardware features of the CPU. So, since you are
>>> accessing high-performance features, the code executes faster.
>>
>> No. Please check the assembly output of your compiler. Your compiler WILL
>> be generating AVX2 instructions with the proper flags and "-O3", but it is
>> not capable of reorganizing the data layout.
>>
>>> But
>>> changing languages with the same SIMD configuration, the output should
>>> be
>>> roughly the same.
>>
>> "Same SIMD configuration" means keeping the 99% of the data layouts
>> optimized for each architecture. Sorry, but that's not making your
>> implementation portable.
>>
>>> However, I argue that in this case, C is a little bit
>>> faster than C++.
>>
>> No. Intrinsics will generate identical assembly in C and C++ - and that is
>> based on knowledge, not opinions. Once we exploit template parameters, C++
>> completely kills C for complex kernels.
>>
>
> Have you ever tried other high-performance approaches, like task-based
> systems? For instance, Habanero-C from Rice University.
>
> Do you think it would be a promising approach?

That's a totally different topic. The original discussion/debate was
about generating the right SIMD instructions, ensuring ILP, and
maximizing IPC, which compilers can't do -- at least nowhere near as
well as some of us ;)

That is very fine-grained parallelism, while tasking and work-stealing
(which Habanero-C[++] and other similar libraries exploit) are
higher-level parallelization techniques.


PS: Somewhat off-topic, but if I may, I personally vote for open
standards: the ones that are truly open, not through an "Open" prefix
in their name while backed by a single commercial entity, but through
the power of an open standards committee. That's not to say that exploring
pragma-based heterogeneous programming models with a tool
that's convenient for you is uninteresting! On the contrary, it is
useful, and the programming model used in such an exploration is less
important (though it does make an implicit statement of support). However,
the fact that there is a single commercial compiler with decent OpenACC
support (and one that's not great at all in other respects, as Erik
mentioned) is not a great start.

In the long run, the actually open programming models (e.g. OpenMP) are
the ones we want to rely on. Of course, we have to be willing to
compromise if there is no other way to achieve good performance;
that's why we use CUDA ;)

BTW, Intel won't back OpenACC anytime soon (fun video to watch:
https://goo.gl/k4Vi9x).

Cheers,
--
Szilárd

>
> Regards,
> Millad
>
>
>>
>> Cheers,
>>
>> Erik
>>
>>
>> --
>> Gromacs Developers mailing list
>>
>> * Please search the archive at
>> http://www.gromacs.org/Support/Mailing_Lists/GMX-developers_List before
>> posting!
>>
>> * Can't post? Read http://www.gromacs.org/Support/Mailing_Lists
>>
>> * For (un)subscribe requests visit
>> https://maillist.sys.kth.se/mailman/listinfo/gromacs.org_gmx-developers or
>> send a mail to gmx-developers-request at gromacs.org.
>
>

