[gmx-developers] mdrun 4.6.7 with GPU sharing between thread-MPI ranks yields crashes

Berk Hess hess at kth.se
Thu Dec 11 17:37:27 CET 2014


Hi,

We are also seeing some as yet unexplained issues that only seem to 
show up with GPU sharing. We have done a lot of checking, so a bug in 
Gromacs seems unlikely. Nvidia says GPU sharing is officially not 
supported because of some issues, so this could be one of those.

PS: Why are you not using 5.0? I don't recall that anything related to 
sharing has changed, but in such cases I would try the newest version.

Cheers,

Berk

On 12/11/2014 05:32 PM, Carsten Kutzner wrote:
> Hi,
>
> we are seeing a weird problem here with 4.6.7 on GPU nodes.
> A 146k atom system that already ran happily on a lot of different
> nodes (with and without GPU) now often crashes on GPU nodes
> with the error message:
>
> x particles communicated to PME node y are more than 2/3 times the cut-off … dimension x
>
> DD is 8 x 1 x 1 in all cases, mdrun is started with the somewhat unusual
> (but best performing) options
>
> -ntmpi 8 -ntomp 5 -gpu_id 00001111 -dlb no
>
> on nodes with 2x GTX 780Ti and 40 logical cores. Of 20 such runs,
> approx. 14 die within the first 100k time steps with a variation of the
> above error message.
>
> Our workaround for now is to run it with
>
> -ntmpi 2 -ntomp 20 -gpu_id 01 -dlb no
>
> (no crashes so far), but at a large performance penalty.
>
> Comments on how to debug this further are welcome.
>
> Thanks!
>    Carsten
>
>
>
> --
> Dr. Carsten Kutzner
> Max Planck Institute for Biophysical Chemistry
> Theoretical and Computational Biophysics
> Am Fassberg 11, 37077 Goettingen, Germany
> Tel. +49-551-2012313, Fax: +49-551-2012302
> http://www.mpibpc.mpg.de/grubmueller/kutzner
> http://www.mpibpc.mpg.de/grubmueller/sppexa
>
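
To make the difference between the two launch configurations quoted above
concrete, here is a minimal sketch of how each one distributes ranks over
the two GPUs. It assumes that every thread-MPI rank uses a GPU and that the
-gpu_id string holds one GPU id digit per rank, in rank order. The helper
rank_layout is hypothetical and not part of Gromacs.

```python
from collections import defaultdict


def rank_layout(ntmpi, ntomp, gpu_id):
    """Summarise a launch configuration (hypothetical helper, not Gromacs code).

    Assumes one digit in gpu_id per thread-MPI rank, in rank order.
    """
    if len(gpu_id) != ntmpi:
        raise ValueError("expected one GPU id digit per rank")
    ranks_per_gpu = defaultdict(list)
    for rank, digit in enumerate(gpu_id):
        ranks_per_gpu[int(digit)].append(rank)
    return {
        "total_threads": ntmpi * ntomp,  # ranks times OpenMP threads per rank
        "ranks_per_gpu": dict(ranks_per_gpu),
    }


# -ntmpi 8 -ntomp 5 -gpu_id 00001111 (the configuration that crashes)
sharing = rank_layout(8, 5, "00001111")
# -ntmpi 2 -ntomp 20 -gpu_id 01 (the workaround)
no_sharing = rank_layout(2, 20, "01")

print(sharing)
print(no_sharing)
```

Under these assumptions both configurations use all 40 logical cores. The
first places four ranks on each GPU, while the second gives each GPU a
single rank, so no GPU is shared.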


