[gmx-developers] GROMACS OpenCL on Gallium

Szilárd Páll pall.szilard at gmail.com
Tue Dec 8 17:20:58 CET 2015


Hi,

Thanks Mark for the fixes, I'll review the change this afternoon.

Yesterday I filed a Redmine issue about the NVIDIA OpenCL segfaults (
http://redmine.gromacs.org/issues/1871) because in my testing I reproduced
the issue with a recent CUDA compiler/driver too. I'm not sure whether this
is a bug in the release 5.1 code or in the NVIDIA runtime, but given that on
two ranks I reproducibly got segfaults in the three tests that Jenkins
complains about, the documentation change that suggests CUDA >= v6.5 may not
be enough.

Cheers,
--
Szilárd

On Tue, Dec 8, 2015 at 4:27 AM, Mark Abraham <mark.j.abraham at gmail.com>
wrote:

> Hi,
>
> I've uploaded a patch that addresses a couple of the issues - the
> regressiontests are fine like I said - I think the segfaults are indeed
> coming from a broken version of CUDA (have updated the opencl test config
> to try 6.5). Agree we should probably bump the minimum version of CUDA for
> OpenCL and avoid trouble.
>
> The empty-domain test (that I added to cover a hard-to-reproduce bug in
> our GPU stream handling) requires two ranks. I used to hard-code this in
> the CUDA days, which was OK then but not now with OpenCL needed in Jenkins,
> so my patch tries to rely better on the new automated resource assignment,
> but Jenkins can be the judge of that. I think we were also mis-managing the
> OpenCL version of the code that waited for non-local events before starting
> local events - that test case at least did its job (eventually).
>
> Also added some error code strings that we might make more general use of
> in future.
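
As a rough illustration of the error code strings mentioned above (a sketch
only, not the code in the Gerrit change): the helper name oclStatusString is
hypothetical, and the integer values are assumed to match the constants
defined in the Khronos CL/cl.h header. Only a handful of codes that could
plausibly come back from an event wait are covered.

```cpp
#include <string>

// Hypothetical helper, NOT the GROMACS implementation: translate a few
// OpenCL status codes into readable names. The numeric values are the ones
// defined in the Khronos CL/cl.h header; they are written as plain integers
// so the sketch compiles without an OpenCL SDK installed.
std::string oclStatusString(int status)
{
    switch (status)
    {
        case 0:   return "CL_SUCCESS";
        case -5:  return "CL_OUT_OF_RESOURCES";
        case -6:  return "CL_OUT_OF_HOST_MEMORY";
        case -14: return "CL_EXEC_STATUS_ERROR_FOR_EVENTS_IN_WAIT_LIST";
        case -30: return "CL_INVALID_VALUE";
        case -34: return "CL_INVALID_CONTEXT";
        case -36: return "CL_INVALID_COMMAND_QUEUE";
        case -57: return "CL_INVALID_EVENT_WAIT_LIST";
        case -58: return "CL_INVALID_EVENT";
        // Anything not listed is reported with its raw numeric value.
        default:  return "unknown OpenCL status " + std::to_string(status);
    }
}
```

A caller could then report something like oclStatusString(-58) next to the
raw number, instead of printing only the number, when an event wait fails.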
>
> http://jenkins.gromacs.org/job/Gromacs_Gerrit_5_1-test-opencl-slave/15/
> https://gerrit.gromacs.org/#/c/5430/
>
> Mark
>
> On Tue, Dec 8, 2015 at 4:36 AM Szilárd Páll <pall.szilard at gmail.com>
> wrote:
>
>> Hi,
>>
>> All three segfaults produce a backtrace similar to this:
>>
>> [...]
>> #2  0x00007fcf53618632 in ?? () from /usr/lib/x86_64-linux-gnu/libnvidia-opencl.so.1
>> #3  0x00007fcf58b6fdca in sync_ocl_event (stream=0x7fcf4c820160, ocl_event=0x7fcf4c031380)
>>     at /mnt/workspace/Gromacs_Gerrit_5_1-test-opencl-slave/d27c5006/gromacs/src/gromacs/mdlib/nbnxn_ocl/nbnxn_ocl.cpp:331
>> #4  0x00007fcf58b70f7d in nbnxn_gpu_launch_cpyback (nb=0x7fcf4c030f40, nbatom=0x7fcf4c022ca0, flags=1015, aloc=0)
>>     at /mnt/workspace/Gromacs_Gerrit_5_1-test-opencl-slave/d27c5006/gromacs/src/gromacs/mdlib/nbnxn_ocl/nbnxn_ocl.cpp:952
>> #5  0x00007fcf58b65fcc in do_force_cutsVERLET ()
>>     at /mnt/workspace/Gromacs_Gerrit_5_1-test-opencl-slave/d27c5006/gromacs/src/gromacs/mdlib/sim_util.cpp:1061
>> #6  0x00007fcf58b68e02 in do_force ()
>>     at /mnt/workspace/Gromacs_Gerrit_5_1-test-opencl-slave/d27c5006/gromacs/src/gromacs/mdlib/sim_util.cpp:2009
>> #7  0x000000000041ac0e in do_md ()
>>     at /mnt/workspace/Gromacs_Gerrit_5_1-test-opencl-slave/d27c5006/gromacs/src/programs/mdrun/md.cpp:1078
>> #8  0x000000000042835b in mdrunner ()
>>     at /mnt/workspace/Gromacs_Gerrit_5_1-test-opencl-slave/d27c5006/gromacs/src/programs/mdrun/runner.cpp:1282
>> #9  0x000000000042528e in mdrunner_start_fn (arg=0xb8ddd0)
>>     at /mnt/workspace/Gromacs_Gerrit_5_1-test-opencl-slave/d27c5006/gromacs/src/programs/mdrun/runner.cpp:186
>> [...]
>>
>> This could be due to an old CUDA being used. I'll check that, but in any
>> case, especially since we know NVIDIA OpenCL has been buggy (and as far
>> as I know still is), we probably really should not use anything older than
>> CUDA 7.0 or 7.5.
>>
>> The other failures on the AMD test machine seem to be caused by the tests
>> being called in an incompatible way, although I have the feeling that
>> something is off with that too (because tMPI+OpenCL multi-GPU should work,
>> I thought).
>>
>> --
>> Szilárd
>>
>> On Mon, Dec 7, 2015 at 2:46 PM, Vedran Miletić <rivanvx at gmail.com> wrote:
>>
>>> Szilard, Mark,
>>>
>>> thanks for looking into this.
>>>
>>> 2015-12-07 14:29 GMT+01:00 Szilárd Páll <pall.szilard at gmail.com>:
>>> > http://jenkins.gromacs.org/job/Gromacs_Gerrit_5_1-test-opencl-slave/14
>>>
>>> Didn't know we had that one. Very nice.
>>>
>>> Regards,
>>> Vedran
>>>
>>> --
>>> Vedran Miletić
>>> http://vedranmileti.ch/
>>> --
>>> Gromacs Developers mailing list
>>>
>>> * Please search the archive at
>>> http://www.gromacs.org/Support/Mailing_Lists/GMX-developers_List before
>>> posting!
>>>
>>> * Can't post? Read http://www.gromacs.org/Support/Mailing_Lists
>>>
>>> * For (un)subscribe requests visit
>>> https://maillist.sys.kth.se/mailman/listinfo/gromacs.org_gmx-developers
>>> or send a mail to gmx-developers-request at gromacs.org.
>>>
>