[gmx-developers] mdrun 4.6.7 with GPU sharing between thread-MPI ranks yields crashes

Berk Hess hess at kth.se
Thu Dec 11 18:39:30 CET 2014


On 12/11/2014 06:36 PM, Szilárd Páll wrote:
> On Thu, Dec 11, 2014 at 5:38 PM, Berk Hess <hess at kth.se> wrote:
>> Hi,
>>
>> We are also seeing some, as yet unexplained, issues that only show up
>> with GPU sharing. We have done a lot of checking, so a bug in Gromacs
>> seems unlikely. Nvidia says this is officially not supported because of
>> some known issues, so ours could be one of those.
> What is not supported? To the best of my knowledge, everything we do is
> officially supported. Moreover, even the Tesla-only Hyper-Q (now CUDA
> MPS) functionality works here, because GPU contexts are by definition
> shared between pthreads (=tMPI ranks).
Am I confusing it with real MPI then, where there is an issue?

We also observed issues with thread-MPI and GPU sharing. I thought these 
could be attributable to a CUDA issue, but if not, we might have a bug 
in Gromacs (although one that would be very hard to find).
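
To make concrete what "GPU sharing between thread-MPI ranks" means at the
CUDA level, here is a minimal sketch (illustrative only, not GROMACS code):
two pthreads in one process stand in for two tMPI ranks, both drive GPU 0
through the CUDA runtime API, and therefore share the device's primary
context.

#include <cstdio>
#include <pthread.h>
#include <cuda_runtime.h>

__global__ void touch(int *flag) { *flag = 1; }

/* One "tMPI rank": bind to GPU 0, allocate, launch, synchronize. */
static void *rankWork(void *arg)
{
    long rank   = (long)arg;
    int *d_flag = NULL;

    cudaSetDevice(0);               /* both ranks pick the same device    */
    cudaMalloc((void **)&d_flag, sizeof(int));
    touch<<<1, 1>>>(d_flag);        /* runs in the shared primary context */
    cudaDeviceSynchronize();
    cudaFree(d_flag);
    printf("rank %ld done: %s\n", rank,
           cudaGetErrorString(cudaGetLastError()));
    return NULL;
}

int main()
{
    pthread_t t[2];
    for (long r = 0; r < 2; r++)
        pthread_create(&t[r], NULL, rankWork, (void *)r);
    for (int r = 0; r < 2; r++)
        pthread_join(t[r], NULL);
    return 0;
}

With the runtime API, all host threads of a process attach to the same
primary context per device, which is Szilárd's point: tMPI ranks need no
MPS-style machinery to share a GPU.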

Cheers,

Berk
> --
> Szilárd
>
>> PS Why are you not using 5.0? I don't recall that anything related to
>> GPU sharing has changed, but in cases like this I would try the newest version.
>>
>> Cheers,
>>
>> Berk
>>
>>
>> On 12/11/2014 05:32 PM, Carsten Kutzner wrote:
>>> Hi,
>>>
>>> we are seeing a weird problem here with 4.6.7 on GPU nodes.
>>> A 146k-atom system that has already run happily on many different
>>> nodes (with and without GPUs) now often crashes on GPU nodes
>>> with the error message:
>>>
>>> x particles communicated to PME node y are more than 2/3 times the cut-off
>>> … dimension x
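
For readers unfamiliar with this message, here is a hypothetical sketch of
the kind of sanity check behind it (not the actual GROMACS source): each
particle handed off to a PME rank is checked against the sending domain
decomposition cell, and any particle sitting more than 2/3 of the cut-off
outside that cell in some dimension triggers the fatal error, since nothing
should legitimately drift that far.

#include <stdio.h>

/* Hypothetical check (illustrative only): count particles whose
 * coordinate in one dimension lies more than 2/3 * cutoff outside
 * the local DD cell [cell_x0, cell_x1]. */
static int countFarOutliers(const float *x, int n,
                            float cell_x0, float cell_x1, float cutoff)
{
    const float slack = (2.0f / 3.0f) * cutoff;
    int         nOut  = 0;

    for (int i = 0; i < n; i++)
    {
        if (x[i] < cell_x0 - slack || x[i] > cell_x1 + slack)
        {
            nOut++;
        }
    }
    return nOut;
}

int main(void)
{
    float x[] = { 1.0f, 2.5f, 9.9f };  /* toy x coordinates in nm */
    int   n   = countFarOutliers(x, 3, 0.0f, 3.0f, 1.2f);

    if (n > 0)
    {
        fprintf(stderr, "%d particles are more than 2/3 times the "
                        "cut-off out of the DD cell\n", n);
    }
    return 0;
}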
>>>
>>> DD is 8 x 1 x 1 in all cases; mdrun is started with the somewhat unusual
>>> (but best-performing) options
>>>
>>> -ntmpi 8 -ntomp 5 -gpu_id 00001111 -dlb no
>>>
>>> on nodes with 2x GTX 780Ti and 40 logical cores. Out of 20 such runs,
>>> approximately 14 die within the first 100k time steps with some
>>> variation of the above error message.
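
To spell out the mapping encoded in -gpu_id 00001111, here is a hypothetical
re-implementation (not mdrun's actual code): character i of the string names
the CUDA device for thread-MPI rank i, so ranks 0-3 share GPU 0 and ranks
4-7 share GPU 1.

#include <stdio.h>
#include <string.h>
#include <cuda_runtime.h>

/* Hypothetical helper: pick this rank's GPU from a -gpu_id style
 * string; returns the device id, or -1 if the string is too short. */
static int selectGpuForRank(const char *gpuIdStr, int rank)
{
    if (rank >= (int)strlen(gpuIdStr))
    {
        return -1;
    }
    int dev = gpuIdStr[rank] - '0';
    cudaSetDevice(dev);  /* in mdrun, each rank's thread would call this */
    return dev;
}

int main(void)
{
    const char *gpuIdStr = "00001111";  /* as passed via -gpu_id */

    for (int rank = 0; rank < 8; rank++)
    {
        printf("tMPI rank %d -> GPU %d\n",
               rank, selectGpuForRank(gpuIdStr, rank));
    }
    return 0;
}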
>>>
>>> Our solution for now is to run it with
>>>
>>> -ntmpi 2 -ntomp 20 -gpu_id 01 -dlb no
>>>
>>> (no crashes so far), albeit at a large performance penalty.
>>>
>>> Comments on how to debug this further are welcome.
>>>
>>> Thanks!
>>>     Carsten
>>>
>>>
>>>
>>> --
>>> Dr. Carsten Kutzner
>>> Max Planck Institute for Biophysical Chemistry
>>> Theoretical and Computational Biophysics
>>> Am Fassberg 11, 37077 Goettingen, Germany
>>> Tel. +49-551-2012313, Fax: +49-551-2012302
>>> http://www.mpibpc.mpg.de/grubmueller/kutzner
>>> http://www.mpibpc.mpg.de/grubmueller/sppexa
>>>