[gmx-developers] mdrun 4.6.7 with GPU sharing between thread-MPI ranks yields crashes

Carsten Kutzner ckutzne at gwdg.de
Thu Dec 11 17:32:44 CET 2014


We are seeing a weird problem with mdrun 4.6.7 on GPU nodes.
A 146k atom system that already ran happily on a lot of different
nodes (with and without GPU) now often crashes on GPU nodes
with the error message:

x particles communicated to PME node y are more than 2/3 times the cut-off … dimension x

The domain decomposition (DD) is 8 x 1 x 1 in all cases; mdrun is started
with the somewhat unusual (but best performing) options

-ntmpi 8 -ntomp 5 -gpu_id 00001111 -dlb no

on nodes with 2x GTX 780Ti and 40 logical cores. Out of 20 such runs,
approximately 14 die within the first 100k time steps with a variation of
the above error message.

Our workaround for now is to run it with

-ntmpi 2 -ntomp 20 -gpu_id 01 -dlb no

which has produced no crashes so far, but comes at a large performance penalty.
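
To make the difference between the two launch configurations explicit, here
is a minimal Python sketch that tabulates how a -gpu_id string spreads ranks
over the GPUs. It assumes that every thread-MPI rank is a particle-particle
rank and that the n-th digit of -gpu_id assigns a GPU to the n-th rank on the
node. The helper name describe_layout is hypothetical and not part of GROMACS.

```python
from collections import Counter


def describe_layout(ntmpi, ntomp, gpu_id):
    """Summarise an mdrun launch layout (illustrative helper, not a GROMACS API).

    Assumption: one digit of gpu_id per rank, in rank order.
    """
    if len(gpu_id) != ntmpi:
        raise ValueError("expected one -gpu_id digit per thread-MPI rank")

    # Rank index -> GPU index, taken from the digit at the same position.
    rank_to_gpu = {rank: int(digit) for rank, digit in enumerate(gpu_id)}

    # Number of ranks sharing each GPU.
    ranks_per_gpu = dict(Counter(rank_to_gpu.values()))

    return {
        "total_threads": ntmpi * ntomp,  # ranks times OpenMP threads per rank
        "rank_to_gpu": rank_to_gpu,
        "ranks_per_gpu": ranks_per_gpu,
    }


if __name__ == "__main__":
    # Crashing setup: -ntmpi 8 -ntomp 5 -gpu_id 00001111
    print(describe_layout(8, 5, "00001111"))
    # Stable but slower setup: -ntmpi 2 -ntomp 20 -gpu_id 01
    print(describe_layout(2, 20, "01"))
```

Under these assumptions both setups use all 40 logical cores, but the first
places four ranks on each GPU while the second places only one rank on each.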

Comments on how to debug this further are welcome.


Dr. Carsten Kutzner
Max Planck Institute for Biophysical Chemistry
Theoretical and Computational Biophysics
Am Fassberg 11, 37077 Goettingen, Germany
Tel. +49-551-2012313, Fax: +49-551-2012302
