[gmx-developers] mdrun 4.6.7 with GPU sharing between thread-MPI ranks yields crashes

Szilárd Páll pall.szilard at gmail.com
Fri Dec 12 02:34:08 CET 2014


On Thu, Dec 11, 2014 at 6:40 PM, Berk Hess <hess at kth.se> wrote:
> On 12/11/2014 06:36 PM, Szilárd Páll wrote:
>>
>> On Thu, Dec 11, 2014 at 5:38 PM, Berk Hess <hess at kth.se> wrote:
>>>
>>> Hi,
>>>
>>> We are also having some as-yet unexplained issues that only seem to
>>> show up with GPU sharing. We have done a lot of checking, so a bug in
>>> Gromacs seems unlikely. NVIDIA says this is officially not supported
>>> because of some issues, so this could be one of those.
>>
>> What is not supported? To the best of my knowledge, everything we do is
>> officially supported; moreover, even the Tesla-only features formerly
>> known as Hyper-Q (now CUDA MPS) work, because GPU contexts are by
>> definition shared between pthreads (= tMPI ranks).
>
> Am I confusing it with real MPI, then, where there is an issue?

I don't know of any such issue; as far as I know, we do nothing that
goes against NVIDIA's specs or recommendations.

The only "issue" is not on our side, but in NVIDIA's incomplete
support for CUDA MPS in CUDA <v6.5, but that's only a performance
concern (especially in GPU rank sharing setups) and should not affect
correctness.
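
Just to make the context-sharing point concrete, here is a minimal sketch
(illustrative only, not GROMACS code): within a single process, every host
thread that touches the same device ends up in that device's primary
context, which is exactly the tMPI rank situation.

// Minimal illustration (not GROMACS code): two host threads, standing in
// for two tMPI ranks, submit work to the same device and therefore share
// its primary context. Compiles with something like:
//   nvcc -o ctx_share ctx_share.cu -lpthread
#include <cstdio>
#include <pthread.h>
#include <cuda_runtime.h>

__global__ void scale(float *x, int n, float a)
{
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    if (i < n)
    {
        x[i] *= a;
    }
}

static void *rank_work(void *arg)
{
    int    rank = *(int *)arg;
    float *d    = NULL;

    cudaSetDevice(0);                        /* both threads attach to device 0    */
    cudaMalloc((void **)&d, 1024*sizeof(float));
    cudaMemset(d, 0, 1024*sizeof(float));
    scale<<<(1024 + 255)/256, 256>>>(d, 1024, 2.0f);
    cudaDeviceSynchronize();                 /* same primary context, no MPS needed */
    printf("thread (rank) %d: %s\n", rank, cudaGetErrorString(cudaGetLastError()));
    cudaFree(d);
    return NULL;
}

int main()
{
    pthread_t t[2];
    int       id[2] = { 0, 1 };

    for (int r = 0; r < 2; r++) pthread_create(&t[r], NULL, rank_work, &id[r]);
    for (int r = 0; r < 2; r++) pthread_join(t[r], NULL);
    return 0;
}

Both threads' work runs in the one shared context; CUDA MPS only enters the
picture when separate processes (i.e. real MPI ranks) want the same overlap.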

> We also observed issues with thread-MPI and GPU sharing. I thought these could
> be attributable to a CUDA issue, but if not, we might have a bug in Gromacs
> (although one that is very hard to find).

Given how hard it has been to reproduce, I am hesitant to call it a
CUDA bug, but it could very well be one. We should try to reproduce it
again; one thing we could concentrate on is attempting to reproduce it
on Tesla GPUs.

Cheers,
--
Szilárd

> Cheers,
>
> Berk
>
>> --
>> Szilárd
>>
>>> PS: Why are you not using 5.0? I don't recall that anything related to
>>> sharing has changed, but in such cases I would try the newest version.
>>>
>>> Cheers,
>>>
>>> Berk
>>>
>>>
>>> On 12/11/2014 05:32 PM, Carsten Kutzner wrote:
>>>>
>>>> Hi,
>>>>
>>>> we are seeing a weird problem here with 4.6.7 on GPU nodes.
>>>> A 146k atom system that already ran happily on a lot of different
>>>> nodes (with and without GPU) now often crashes on GPU nodes
>>>> with the error message:
>>>>
>>>> x particles communicated to PME node y are more than 2/3 times the cut-off
>>>> … dimension x
>>>>
>>>> DD is 8 x 1 x 1 in all cases; mdrun is started with the somewhat unusual
>>>> (but best-performing) options
>>>>
>>>> -ntmpi 8 -ntomp 5 -gpu_id 00001111 -dlb no
>>>>
>>>> on nodes with 2x GTX 780Ti and 40 logical cores. Out of 20 such runs,
>>>> approximately 14 die within the first 100k time steps with a variation of
>>>> the above error message.
>>>>
>>>> Our solution for now is to run it with
>>>>
>>>> -ntmpi 2 -ntomp 20 -gpu_id 01 -dlb no
>>>>
>>>> (no crashes so far); however, this comes at a large performance penalty.
>>>>
>>>> Comments on how to debug this further are welcome.
>>>>
>>>> Thanks!
>>>>     Carsten
>>>>
>>>>
>>>>
>>>> --
>>>> Dr. Carsten Kutzner
>>>> Max Planck Institute for Biophysical Chemistry
>>>> Theoretical and Computational Biophysics
>>>> Am Fassberg 11, 37077 Goettingen, Germany
>>>> Tel. +49-551-2012313, Fax: +49-551-2012302
>>>> http://www.mpibpc.mpg.de/grubmueller/kutzner
>>>> http://www.mpibpc.mpg.de/grubmueller/sppexa
>>>>

