[gmx-developers] mdrun 4.6.7 with GPU sharing between thread-MPI ranks yields crashes

Szilárd Páll pall.szilard at gmail.com
Wed Dec 17 16:20:05 CET 2014


Update: I've been running some tests again to try to reproduce the
issue; I'll post the current findings on the existing bug page.
--
Szilárd


On Fri, Dec 12, 2014 at 2:34 AM, Szilárd Páll <pall.szilard at gmail.com> wrote:
> On Thu, Dec 11, 2014 at 6:40 PM, Berk Hess <hess at kth.se> wrote:
>> On 12/11/2014 06:36 PM, Szilárd Páll wrote:
>>>
>>> On Thu, Dec 11, 2014 at 5:38 PM, Berk Hess <hess at kth.se> wrote:
>>>>
>>>> Hi,
>>>>
>>>> We are also having some as-yet unexplained issues that only seem to
>>>> show up with GPU sharing. We have done a lot of checking, so a bug in
>>>> Gromacs seems unlikely. NVIDIA says this is officially not supported
>>>> because of some issues, so this could be one of those.
>>>
>>> What is not supported? To the best of my knowledge, everything we do is
>>> officially supported. Moreover, even the (formerly Hyper-Q, now CUDA
>>> MPS-related) Tesla-only features work, because GPU contexts are by
>>> definition shared between pthreads (= tMPI ranks).
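>>>
>>> To make that concrete, here is a minimal sketch (not GROMACS code; it
>>> only relies on standard CUDA runtime/driver API behaviour, and the file
>>> name below is arbitrary) showing that two pthreads selecting the same
>>> device end up in one shared primary context:
>>>
>>> #include <cuda.h>
>>> #include <cuda_runtime.h>
>>> #include <pthread.h>
>>> #include <stdio.h>
>>>
>>> /* Each "rank" (pthread) picks the same device; with the runtime API all
>>>    threads of a process attach to that device's primary context, so no
>>>    per-thread context is created. */
>>> static void *rank_func(void *arg)
>>> {
>>>     int rank = *(int *)arg;
>>>     cudaSetDevice(0);        /* same GPU for both "ranks" */
>>>     cudaFree(0);             /* force context initialization */
>>>     CUcontext ctx;
>>>     cuCtxGetCurrent(&ctx);   /* driver-API handle of the current context */
>>>     printf("rank %d: context %p\n", rank, (void *)ctx);
>>>     return NULL;
>>> }
>>>
>>> int main(void)
>>> {
>>>     cuInit(0);
>>>     pthread_t t[2];
>>>     int id[2] = {0, 1};
>>>     for (int i = 0; i < 2; i++) pthread_create(&t[i], NULL, rank_func, &id[i]);
>>>     for (int i = 0; i < 2; i++) pthread_join(t[i], NULL);
>>>     return 0;
>>> }
>>>
>>> Compiled with e.g. "nvcc shared_ctx.cu -lcuda", both threads should print
>>> the same context address, which is all that tMPI GPU sharing relies on.
>>>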
>>
>> Am I confusing it with real MPI then, where there is an issue?
>
> I don't know of any issue; as far as I know, we do nothing that goes
> against NVIDIA's specs or recommendations.
>
> The only "issue" is not on our side but in NVIDIA's incomplete support
> for CUDA MPS in CUDA < 6.5; however, that is only a performance concern
> (especially in GPU rank-sharing setups) and should not affect
> correctness.
>
>> We have also observed issues with thread-MPI and GPU sharing. I thought these
>> could be attributable to a CUDA issue, but if not, we might have a bug in
>> Gromacs (although one that would be very hard to find).
>
> Given how hard it has been to reproduce, I am hesitant to call it a
> CUDA bug, but it could very well be one. We should try to reproduce it
> again; one thing we could concentrate on is attempting to reproduce it
> with Tesla GPUs.
>
> Cheers,
> --
> Szilárd
>
>> Cheers,
>>
>> Berk
>>
>>> --
>>> Szilárd
>>>
>>>> PS Why are you not using 5.0? I don't recall that anything related to
>>>> sharing has changed, but in such cases I would try the newest version.
>>>>
>>>> Cheers,
>>>>
>>>> Berk
>>>>
>>>>
>>>> On 12/11/2014 05:32 PM, Carsten Kutzner wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> we are seeing a weird problem here with 4.6.7 on GPU nodes.
>>>>> A 146k atom system that already ran happily on a lot of different
>>>>> nodes (with and without GPU) now often crashes on GPU nodes
>>>>> with the error message:
>>>>>
>>>>> x particles communicated to PME node y are more than 2/3 times the
>>>>> cut-off … dimension x
>>>>>
>>>>> DD is 8 x 1 x 1 in all cases; mdrun is started with the somewhat unusual
>>>>> (but best-performing) options
>>>>>
>>>>> -ntmpi 8 -ntomp 5 -gpu_id 00001111 -dlb no
>>>>>
>>>>> on nodes with 2x GTX 780Ti and 40 logical cores. Out of 20 such runs,
>>>>> approximately 14 die within the first 100k time steps with a variation of
>>>>> the above error message.
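>>>>>
>>>>> Each digit of -gpu_id is the device id used by the corresponding PP
>>>>> (thread-MPI) rank, so here four ranks share each card. A toy sketch of
>>>>> that mapping (illustration only, not GROMACS code):
>>>>>
>>>>> #include <stdio.h>
>>>>> #include <string.h>
>>>>>
>>>>> int main(void)
>>>>> {
>>>>>     const char gpu_id[] = "00001111";   /* one digit per PP rank */
>>>>>     for (size_t rank = 0; rank < strlen(gpu_id); rank++)
>>>>>     {
>>>>>         printf("tMPI rank %zu -> GPU %c\n", rank, gpu_id[rank]);
>>>>>     }
>>>>>     return 0;   /* ranks 0-3 share GPU 0, ranks 4-7 share GPU 1 */
>>>>> }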
>>>>>
>>>>> Our solution for now is to run it with
>>>>>
>>>>> -ntmpi 2 -ntomp 20 -gpu_id 01 -dlb no
>>>>>
>>>>> (no crashes so far), albeit at a large performance penalty.
>>>>>
>>>>> Comments on how to debug this further are welcome.
>>>>>
>>>>> Thanks!
>>>>>     Carsten
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Dr. Carsten Kutzner
>>>>> Max Planck Institute for Biophysical Chemistry
>>>>> Theoretical and Computational Biophysics
>>>>> Am Fassberg 11, 37077 Goettingen, Germany
>>>>> Tel. +49-551-2012313, Fax: +49-551-2012302
>>>>> http://www.mpibpc.mpg.de/grubmueller/kutzner
>>>>> http://www.mpibpc.mpg.de/grubmueller/sppexa
>>>>>

