[gmx-developers] mdrun 4.6.7 with GPU sharing between thread-MPI ranks yields crashes

Szilárd Páll pall.szilard at gmail.com
Wed Dec 17 19:39:19 CET 2014


PS: I forgot to mention, this is the Redmine issue for the crashes we observed:
http://redmine.gromacs.org/issues/1623
--
Szilárd


On Wed, Dec 17, 2014 at 4:20 PM, Szilárd Páll <pall.szilard at gmail.com> wrote:
> Update: I've been running some tests again, trying to reproduce the
> issue; I'll post updates on the current findings to the existing bug
> page.
> --
> Szilárd
>
>
> On Fri, Dec 12, 2014 at 2:34 AM, Szilárd Páll <pall.szilard at gmail.com> wrote:
>> On Thu, Dec 11, 2014 at 6:40 PM, Berk Hess <hess at kth.se> wrote:
>>> On 12/11/2014 06:36 PM, Szilárd Páll wrote:
>>>>
>>>> On Thu, Dec 11, 2014 at 5:38 PM, Berk Hess <hess at kth.se> wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> We are also having some as-yet-unexplained issues that only seem to
>>>>> show up with GPU sharing. We have done a lot of checking, so a bug in
>>>>> Gromacs seems unlikely. NVIDIA says this is officially not supported
>>>>> because of some issues, so this could be one of those.
>>>>
>>>> What is not supported? To the best of my knowledge, everything we do is
>>>> officially supported. Moreover, even the Tesla-only features formerly
>>>> called Hyper-Q (now CUDA MPS) work, because a GPU context is by
>>>> definition shared between the pthreads (= tMPI ranks) of a process.
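
A minimal sketch of that point, assuming the CUDA runtime API (none of the code below is taken from GROMACS, and the file name shared_context.cu is made up): every host thread of a process that selects the same device attaches to that device's primary context, so pthread-based tMPI ranks share one context per GPU without any special setup.

// shared_context.cu -- hypothetical standalone example, built e.g. with
// nvcc -std=c++11 shared_context.cu
#include <cstdio>
#include <thread>
#include <vector>
#include <cuda_runtime.h>

__global__ void touch(float *x) { x[threadIdx.x] += 1.0f; }

// One "tMPI-like rank": a plain host thread that picks device 0 and does work.
// No context is created per thread; every thread of the process attaches to
// device 0's primary context, i.e. the context is shared by definition.
static void rankWork(int rank)
{
    cudaSetDevice(0);
    float *d = nullptr;
    cudaMalloc(&d, 32 * sizeof(float));
    cudaMemset(d, 0, 32 * sizeof(float));
    touch<<<1, 32>>>(d);
    cudaDeviceSynchronize();
    cudaFree(d);
    std::printf("rank %d done on device 0\n", rank);
}

int main()
{
    std::vector<std::thread> ranks;
    for (int r = 0; r < 4; ++r)   // e.g. four ranks sharing one GPU
        ranks.emplace_back(rankWork, r);
    for (auto &t : ranks)
        t.join();
    return 0;
}
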
>>>
>>> Am I confusing it with real MPI, then, where there is an issue?
>>
>> I don't know of any issue; as far as I know, we do nothing that goes
>> against NVIDIA's specs or recommendations.
>>
>> The only "issue" is not on our side but in NVIDIA's incomplete support
>> for CUDA MPS in CUDA versions before 6.5; that is only a performance
>> concern (especially in GPU rank-sharing setups) and should not affect
>> correctness.
>>
>>> We also observed issues with thread-MPI and GPU sharing. I thought these
>>> could be attributable to a CUDA issue, but if not, we might have a bug in
>>> Gromacs (although one that is very hard to find).
>>
>> Given how hard it has been to reproduce, I am hesitant to call it a
>> CUDA bug, but it could very well be one. We should try to reproduce it
>> again; one thing we could concentrate on is attempting to reproduce it
>> with Tesla GPUs.
>>
>> Cheers,
>> --
>> Szilárd
>>
>>> Cheers,
>>>
>>> Berk
>>>
>>>> --
>>>> Szilárd
>>>>
>>>>> PS Why are you not using 5.0? I don't recall that anything related to
>>>>> sharing has changed, but in such cases I would try the newest version.
>>>>>
>>>>> Cheers,
>>>>>
>>>>> Berk
>>>>>
>>>>>
>>>>> On 12/11/2014 05:32 PM, Carsten Kutzner wrote:
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> we are seeing a weird problem here with 4.6.7 on GPU nodes.
>>>>>> A 146k-atom system that already ran happily on a lot of different
>>>>>> nodes (with and without GPUs) now often crashes on GPU nodes
>>>>>> with the error message:
>>>>>>
>>>>>> x particles communicated to PME node y are more than 2/3 times the
>>>>>> cut-off … dimension x
>>>>>>
>>>>>> DD is 8 x 1 x 1 in all cases; mdrun is started with the somewhat unusual
>>>>>> (but best-performing) options
>>>>>>
>>>>>> -ntmpi 8 -ntomp 5 -gpu_id 00001111 -dlb no
>>>>>>
>>>>>> on nodes with 2x GTX 780Ti and 40 logical cores. Out of 20 such runs,
>>>>>> approximately 14 die within the first 100k time steps with some variation
>>>>>> of the above error message.
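
Spelled out as a full launch line (the input file name topol.tpr is assumed here, it is not given in the thread):

  mdrun -s topol.tpr -ntmpi 8 -ntomp 5 -gpu_id 00001111 -dlb no

The eight digits of -gpu_id assign one GPU per thread-MPI rank, so ranks 0-3 share GPU 0 and ranks 4-7 share GPU 1, while 8 ranks x 5 OpenMP threads fill the 40 logical cores.
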
>>>>>>
>>>>>> Our solution for now is to run it with
>>>>>>
>>>>>> -ntmpi 2 -ntomp 20 -gpu_id 01 -dlb no
>>>>>>
>>>>>> (no crashes so far), albeit at a large performance penalty.
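
Likewise, as a sketch with the same hypothetical topol.tpr:

  mdrun -s topol.tpr -ntmpi 2 -ntomp 20 -gpu_id 01 -dlb no

i.e. one thread-MPI rank per GPU (rank 0 on GPU 0, rank 1 on GPU 1) with 20 OpenMP threads each, which avoids sharing a GPU between ranks altogether.
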
>>>>>>
>>>>>> Comments on how to debug this further are welcome.
>>>>>>
>>>>>> Thanks!
>>>>>>     Carsten
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Dr. Carsten Kutzner
>>>>>> Max Planck Institute for Biophysical Chemistry
>>>>>> Theoretical and Computational Biophysics
>>>>>> Am Fassberg 11, 37077 Goettingen, Germany
>>>>>> Tel. +49-551-2012313, Fax: +49-551-2012302
>>>>>> http://www.mpibpc.mpg.de/grubmueller/kutzner
>>>>>> http://www.mpibpc.mpg.de/grubmueller/sppexa
>>>>>>

