[gmx-developers] OpenACC Development
mghane at cs.uh.edu
Fri Jul 1 20:45:47 CEST 2016
Thanks, Szilárd, for the information.
> On Fri, Jul 1, 2016 at 6:17 PM, Millad Ghane <mghane at cs.uh.edu> wrote:
>> Hi Berk,
>> Thanks for your reply.
>> I know that OpenACC is a programming language. What I meant by OpenACC
>> architecture was "OpenACC software architecture". I know that
>> the OpenACC code is executed by an underlying hardware architecture
>> (which could be NVIDIA, AMD/ATI, or even Xeon Phi from Intel).
>> What I hope to achieve: introducing a set of kernels that are optimized
>> for OpenACC.
> There is no such thing as "optimized for OpenACC" (well, you could coin
> that term to express data-flow optimizations, but you'd need to
To clarify, what I meant by "optimized for OpenACC" was "to benefit from
the OpenACC programming model", or, put another way, "optimized to be
executed through OpenACC".
> You implement kernels _using_ OpenACC + code transformations to cater
> for the different arch you target. TBH OpenACC doesn't define much
> more than the pragmas for data flow and constraints on this, but the
> code transformations to make the kernels run with OK performance are up
> to the developers to do on a case-by-case basis while hoping that the
> compiler does a reasonable job in optimizing.
>> What I have done: trying to parallelize the 4x4 plain-C CPU code using
>> OpenACC compiler directives.
> Plain C SIMD (reference SIMD) kernels are the ones to target
> (GMX_SIMD=Reference). These run significantly faster than the no SIMD
> ones (GMX_SIMD=None). 4x2 reference is typically the fastest with the
> amount/quality of auto-vectorization compilers can accomplish.
>> The problem is that the code, and especially the data structures (in
>> the plain-C CPU kernels), become rather complex at the kernel level.
>> Therefore, that is what I was looking for: introducing kernels for the
>> OpenACC programming model.
> That still feels like it may be the wrong target, but of course it depends
> what _you_ want to accomplish/investigate.
> There are two questions that come to my mind:
> - How fast can you get some GROMACS kernels when expressed with OpenACC?
> - How can you implement the data/algorithmic flow with OpenACC to
> target multiple architectures?
> I think focusing on the kernels is not a very interesting or rewarding
> task -- unless the question you are posing is exactly as expressed
> above. In that case, I'd hope one's explicit goal is to i) investigate
> the source of the differences between what the OpenACC compiler can
> achieve vs. the manually tuned kernels and ii) provide solid data on
> how to get the kernels faster (i.e. feedback to the OpenACC team
> and/or us).
> The latter seems more interesting (and potentially more rewarding).
> That's because you can potentially replace a huge amount of host-side
> boilerplate code (far more than kernel code) with "just" a few
> pragmas. It also feels like it is likely a more realistic target to
> implement decent data flow and CPU-GPU (or CPU-CPU) concurrency using
> #pragma programming and achieve good performance. I'm not sure, but
> it's quite likely that the existing fast kernels can be reused.
What a great insight! Thanks, Szilárd.
Actually, my goal was to address the first question. The data transfer
between the device and the host is not really under my control (unless I
use the OpenACC API to reduce some of the overhead of the pragmas). That's
why I focused on the parallelization capabilities of GPUs: bring all of
the data onto the GPU and parallelize the code there (to improve IPC).
I had two crazy ideas regarding data flow and "computation control";
things I wondered why they were missing!
>> Best Regards,
>>> Hi Millad,
>>> Welcome to the list.
>>> GROMACS aims for close to optimal performance on all relevant hardware.
>>> To achieve this, we write highly optimized non-bonded (and other)
>>> kernels for all relevant architectures. Currently we have plain-C CPU
>>> kernels (extremely slow), SIMD, CUDA and OpenCL non-bonded kernels. The
>>> optimization is not so much in the arithmetics, but rather in the data
>>> layout. OpenACC is not an architecture, but a programming standard. So
>>> you would need to choose a target architecture for your OpenACC
>>> "acceleration" and then choose one of our kernel types. But the only
>>> gain of this would be semi-automated offloading. The question is then
>>> whether the offloading will be efficient enough.
>>> On 2016-07-01 02:05, Millad Ghane wrote:
>>>> Hello everyone,
>>>> I am a PhD student in computer science at University of Houston.
>>>> Currently, in this Summer as an intern, I am working with the physics
>>>> department in our school to work on GROMACS in order to port it to
>>>> OpenACC. My adviser for this project is Prof. Cheung.
>>>> My understanding is that GROMACS currently supports NVIDIA GPUs (and
>>>> SIMDs on CPUs), however my job is to investigate the ability of
>>>> transferring the code to OpenACC, which is more heterogeneous and
>>>> ubiquitous compared to CUDA.
>>>> My question regarding the development of GROMACS is whether you are
>>>> supporting, or planning to support, OpenACC currently or in the future.
>>>> And, if you are not supporting OpenACC, my question is how I can introduce
>>>> new "kernel functions" for supporting OpenACC, and how much work that
>>>> would be. Based on my understanding, you have different kernel functions
>>>> for different architectures (CPU, CPU with SIMD, GPU). I wanted to know
>>>> how much work is required to introduce a new architecture like OpenACC.
>>>> Before getting in touch with you, I dug into some of the code and tried
>>>> to parallelize the CPU version of the kernel code using OpenACC
>>>> constructs, but the code got messy and the data was not transferred
>>>> correctly to the device. So, my hope is to introduce new kernels for
>>>> OpenACC, like the way you introduced different ones for CPUs and GPUs.
>>>> This way, we have more control over data transfers and kernel codes.
>>>> I hope I was clear enough and made you interested.
>>>> Best Regards,
>>>> Millad Ghane
>>>> Computer Science Department
>>>> University of Houston
>>> Gromacs Developers mailing list
>>> * Please search the archive at
>>> http://www.gromacs.org/Support/Mailing_Lists/GMX-developers_List before posting!
>>> * Can't post? Read http://www.gromacs.org/Support/Mailing_Lists
>>> * For (un)subscribe requests visit
>>> https://maillist.sys.kth.se/mailman/listinfo/gromacs.org_gmx-developers or
>>> send a mail to gmx-developers-request at gromacs.org.
More information about the gromacs.org_gmx-developers mailing list