[gmx-developers] mdrun 4.6.7 with GPU sharing between thread-MPI ranks yields crashes
Carsten Kutzner
ckutzne at gwdg.de
Thu Dec 11 17:32:44 CET 2014
Hi,
we are seeing a weird problem here with 4.6.7 on GPU nodes.
A 146k atom system that already ran happily on a lot of different
nodes (with and without GPU) now often crashes on GPU nodes
with the error message:
x particles communicated to PME node y are more than 2/3 times the cut-off … dimension x
The domain decomposition (DD) is 8 x 1 x 1 in all cases, and mdrun is
started with the somewhat unusual (but best performing) options
-ntmpi 8 -ntomp 5 -gpu_id 00001111 -dlb no
on nodes with 2x GTX 780Ti and 40 logical cores. Of 20 such runs,
approximately 14 die within the first 100k time steps with a variation
of the above error message.
Our workaround for now is to run it with
-ntmpi 2 -ntomp 20 -gpu_id 01 -dlb no
which has not crashed so far, but comes at a large performance penalty.
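For reference, here is a minimal Python sketch of how the two option sets distribute ranks over the two GPUs. It assumes that the i-th digit of -gpu_id names the GPU used by the i-th rank on the node and that every thread-MPI rank uses a GPU. rank_layout is a hypothetical helper for illustration only, not part of GROMACS.

```python
def rank_layout(ntmpi, ntomp, gpu_id):
    """Return (total_threads, {gpu: [ranks]}) for a single node.

    Assumption: digit i of gpu_id is the GPU assigned to rank i,
    and all ranks use a GPU (no separate PME-only ranks).
    """
    if len(gpu_id) != ntmpi:
        raise ValueError("need exactly one GPU digit per rank")
    by_gpu = {}
    for rank, digit in enumerate(gpu_id):
        by_gpu.setdefault(int(digit), []).append(rank)
    return ntmpi * ntomp, by_gpu


# Crashing setup: 8 ranks x 5 OpenMP threads, 4 ranks share each GPU.
print(rank_layout(8, 5, "00001111"))
# Workaround: 2 ranks x 20 OpenMP threads, 1 rank per GPU.
print(rank_layout(2, 20, "01"))
```

Both setups use all 40 logical cores. Under the assumptions above they differ only in the number of ranks that share a GPU (four versus one).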
Comments on how to debug this further are welcome.
Thanks!
Carsten
--
Dr. Carsten Kutzner
Max Planck Institute for Biophysical Chemistry
Theoretical and Computational Biophysics
Am Fassberg 11, 37077 Goettingen, Germany
Tel. +49-551-2012313, Fax: +49-551-2012302
http://www.mpibpc.mpg.de/grubmueller/kutzner
http://www.mpibpc.mpg.de/grubmueller/sppexa