[gmx-developers] mdrun 4.6.7 with GPU sharing between thread-MPI ranks yields crashes

Carsten Kutzner ckutzne at gwdg.de
Thu Dec 11 17:32:44 CET 2014


Hi,

We are seeing a weird problem with mdrun 4.6.7 on GPU nodes.
A 146k-atom system that previously ran without problems on many different
nodes (with and without GPUs) now often crashes on GPU nodes
with the error message:

x particles communicated to PME node y are more than 2/3 times the cut-off … dimension x

The domain decomposition (DD) is 8 x 1 x 1 in all cases; mdrun is started
with the somewhat unusual (but best performing) options

-ntmpi 8 -ntomp 5 -gpu_id 00001111 -dlb no

on nodes with 2x GTX 780Ti and 40 logical cores. Of 20 such runs,
approx. 14 die within the first 100k time steps with a variant of the above
error message.

Our solution for now is to run it with

-ntmpi 2 -ntomp 20 -gpu_id 01 -dlb no

(no crashes so far), but at a large performance penalty.
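
To make the difference between the two launch configurations explicit, here is
a minimal sketch of the rank-to-GPU layout each one implies. It assumes that
every thread-MPI rank is a PP rank (no separate PME ranks) and that each digit
of -gpu_id assigns a GPU to one PP rank in order. The helper name gpu_layout
is hypothetical and not part of GROMACS.

```python
def gpu_layout(ntmpi, ntomp, gpu_id):
    """Return total thread count and the ranks sharing each GPU.

    Assumption: one -gpu_id digit per PP rank, and all ranks are PP ranks.
    """
    if len(gpu_id) != ntmpi:
        raise ValueError("expected one -gpu_id digit per thread-MPI rank")
    ranks_per_gpu = {}
    for rank, digit in enumerate(gpu_id):
        # Ranks that carry the same digit share that GPU.
        ranks_per_gpu.setdefault(int(digit), []).append(rank)
    return {"threads": ntmpi * ntomp, "ranks_per_gpu": ranks_per_gpu}


# Configuration that often crashes: -ntmpi 8 -ntomp 5 -gpu_id 00001111
crashing = gpu_layout(8, 5, "00001111")
# Workaround: -ntmpi 2 -ntomp 20 -gpu_id 01
stable = gpu_layout(2, 20, "01")

for name, layout in (("crashing", crashing), ("stable", stable)):
    print(name, layout["threads"], "threads,", layout["ranks_per_gpu"])
```

Under these assumptions both setups use all 40 logical cores; they differ only
in that four ranks share each GPU in the first case, while each GPU serves a
single rank in the second.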

Comments on how to debug this further are welcome.

Thanks!
  Carsten



--
Dr. Carsten Kutzner
Max Planck Institute for Biophysical Chemistry
Theoretical and Computational Biophysics
Am Fassberg 11, 37077 Goettingen, Germany
Tel. +49-551-2012313, Fax: +49-551-2012302
http://www.mpibpc.mpg.de/grubmueller/kutzner
http://www.mpibpc.mpg.de/grubmueller/sppexa


