[gmx-users] Gromacs GPU hang

M Teguh Satria mteguhsat at gmail.com
Tue Oct 13 00:19:46 CEST 2015


Hi Szilárd,

I attached strace to one of the MPI ranks; the output is below. There are
some timeouts in the OpenMP threads, but I have no idea what the root
cause is. Could this be a bug in Gromacs, or maybe in MPI / OpenMP? Can
you tell what the root cause is?

FYI, we use the Intel compiler v15.0.2 and OpenMPI v1.8.8. In this
simulation I use 4 MPI ranks, each running 4 OpenMP threads.

PIDs of the MPI processes:
$ ps -axf
 3014 ?        Sl    16:50 /opt/gridengine/bin/linux-x64/sge_execd
 3027 ?        S      2:37  \_ /opt/gridengine/util/gpu/gpu_sensor
12938 ?        S      0:00  \_ sge_shepherd-407 -bg
12967 ?        Ss     0:00      \_ -bash /opt/gridengine/apollo4/spool/apollo4-g01/job_scripts/407
12987 ?        Sl     0:00          \_ mpirun -np 4 gmx_mpi mdrun -deffnm md -pin on -pinoffset 0 -gpu_id 0000
12989 ?        Sl   4680:09              \_ gmx_mpi mdrun -deffnm md -pin on -pinoffset 0 -gpu_id 0000
12990 ?        Sl   4681:36              \_ gmx_mpi mdrun -deffnm md -pin on -pinoffset 0 -gpu_id 0000
12991 ?        Sl   4681:17              \_ gmx_mpi mdrun -deffnm md -pin on -pinoffset 0 -gpu_id 0000
12992 ?        Sl   4681:25              \_ gmx_mpi mdrun -deffnm md -pin on -pinoffset 0 -gpu_id 0000
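
Per your earlier suggestion, I will also try attaching gdb to each rank to
get per-thread stack traces. A rough sketch of what I have in mind, using
the PIDs above (it assumes gdb is available on the node and that the build
has debug symbols; the output file names are arbitrary):

for pid in 12989 12990 12991 12992; do
    gdb -batch -p "$pid" -ex "thread apply all bt" > "gdb_rank_${pid}.txt"
done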


strace output from one of the MPI processes:
[root@apollo4-g01 strace407]# strace -s 128 -x -p 12989 -f 2>&1 | head -50
Process 12989 attached with 9 threads
[pid 13039] restart_syscall(<... resuming interrupted call ...> <unfinished
...>
[pid 13031] futex(0x2b428203be84, FUTEX_WAIT_PRIVATE, 3, NULL <unfinished
...>
[pid 13025] futex(0x2b428205d484, FUTEX_WAIT_PRIVATE, 5, NULL <unfinished
...>
[pid 13024] futex(0x2b428205e984, FUTEX_WAIT_PRIVATE, 5, NULL <unfinished
...>
[pid 13023] restart_syscall(<... resuming interrupted call ...> <unfinished
...>
[pid 13020] restart_syscall(<... resuming interrupted call ...> <unfinished
...>
[pid 13005] select(23, [12 16 22], [], NULL, NULL <unfinished ...>
[pid 12995] restart_syscall(<... resuming interrupted call ...> <unfinished
...>
[pid 12989] futex(0x2b428205f784, FUTEX_WAIT_PRIVATE, 1, NULL <unfinished
...>
[pid 13039] <... restart_syscall resumed> ) = 0
[pid 13039] clock_gettime(CLOCK_MONOTONIC_RAW, {3330079, 937607479}) = 0
[pid 13039] clock_gettime(CLOCK_MONOTONIC_RAW, {3330079, 937666819}) = 0
[pid 13039] poll([{fd=38, events=POLLIN}, {fd=40, events=POLLIN}, {fd=41,
events=POLLIN}, {fd=42, events=POLLIN}, {fd=43, events=POLLIN}, {fd=45,
events=POLLIN}], 6, 100) = 0 (Timeout)
[pid 13039] clock_gettime(CLOCK_MONOTONIC_RAW, {3330080, 38112431}) = 0
[pid 13039] clock_gettime(CLOCK_MONOTONIC_RAW, {3330080, 38178662}) = 0
[pid 13039] poll([{fd=38, events=POLLIN}, {fd=40, events=POLLIN}, {fd=41,
events=POLLIN}, {fd=42, events=POLLIN}, {fd=43, events=POLLIN}, {fd=45,
events=POLLIN}], 6, 100 <unfinished ...>
*[pid 13023] <... restart_syscall resumed> ) = -1 ETIMEDOUT (Connection
timed out)*
[pid 13023] futex(0x2b426ed9aa00, FUTEX_WAKE_PRIVATE, 1) = 0
[pid 13023] futex(0x2b426ed9aa44,
FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 1103881, {1444684996,
236306000}, ffffffff <unfinished ...>
[pid 13039] <... poll resumed> )        = 0 (Timeout)
[pid 13039] clock_gettime(CLOCK_MONOTONIC_RAW, {3330080, 138386651}) = 0
[pid 13039] clock_gettime(CLOCK_MONOTONIC_RAW, {3330080, 138430809}) = 0
[pid 13039] poll([{fd=38, events=POLLIN}, {fd=40, events=POLLIN}, {fd=41,
events=POLLIN}, {fd=42, events=POLLIN}, {fd=43, events=POLLIN}, {fd=45,
events=POLLIN}], 6, 100) = 0 (Timeout)
[pid 13039] clock_gettime(CLOCK_MONOTONIC_RAW, {3330080, 238630125}) = 0
[pid 13039] clock_gettime(CLOCK_MONOTONIC_RAW, {3330080, 238673628}) = 0
[pid 13039] poll([{fd=38, events=POLLIN}, {fd=40, events=POLLIN}, {fd=41,
events=POLLIN}, {fd=42, events=POLLIN}, {fd=43, events=POLLIN}, {fd=45,
events=POLLIN}], 6, 100 <unfinished ...>
[pid 13023] <... futex resumed> )       = -1 ETIMEDOUT (Connection timed
out)
[pid 13023] futex(0x2b426ed9aa00, FUTEX_WAKE_PRIVATE, 1) = 0
[pid 13023] futex(0x2b426ed9aa44,
FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 1103883, {1444684996,
436466000}, ffffffff <unfinished ...>
[pid 13039] <... poll resumed> )        = 0 (Timeout)
[pid 13039] clock_gettime(CLOCK_MONOTONIC_RAW, {3330080, 338874153}) = 0
[pid 13039] clock_gettime(CLOCK_MONOTONIC_RAW, {3330080, 338918688}) = 0
*[pid 13039] poll([{fd=38, events=POLLIN}, {fd=40, events=POLLIN}, {fd=41,
events=POLLIN}, {fd=42, events=POLLIN}, {fd=43, events=POLLIN}, {fd=45,
events=POLLIN}], 6, 100) = 0 (Timeout)*
[pid 13039] clock_gettime(CLOCK_MONOTONIC_RAW, {3330080, 439131341}) = 0
[pid 13039] clock_gettime(CLOCK_MONOTONIC_RAW, {3330080, 439174001}) = 0
[pid 13039] poll([{fd=38, events=POLLIN}, {fd=40, events=POLLIN}, {fd=41,
events=POLLIN}, {fd=42, events=POLLIN}, {fd=43, events=POLLIN}, {fd=45,
events=POLLIN}], 6, 100 <unfinished ...>
*[pid 13023] <... futex resumed> )       = -1 ETIMEDOUT (Connection timed
out)*
[pid 13023] futex(0x2b426ed9aa00, FUTEX_WAKE_PRIVATE, 1) = 0
[pid 13023] futex(0x2b426ed9aa44,
FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 1103885, {1444684996,
636581000}, ffffffff <unfinished ...>
[pid 13039] <... poll resumed> )        = 0 (Timeout)
[pid 13039] clock_gettime(CLOCK_MONOTONIC_RAW, {3330080, 539370079}) = 0
[pid 13039] clock_gettime(CLOCK_MONOTONIC_RAW, {3330080, 539411729}) = 0
[pid 13039] poll([{fd=38, events=POLLIN}, {fd=40, events=POLLIN}, {fd=41,
events=POLLIN}, {fd=42, events=POLLIN}, {fd=43, events=POLLIN}, {fd=45,
events=POLLIN}], 6, 100) = 0 (Timeout)
[pid 13039] clock_gettime(CLOCK_MONOTONIC_RAW, {3330080, 639606472}) = 0
[pid 13039] clock_gettime(CLOCK_MONOTONIC_RAW, {3330080, 639648695}) = 0
[pid 13039] poll([{fd=38, events=POLLIN}, {fd=40, events=POLLIN}, {fd=41,
events=POLLIN}, {fd=42, events=POLLIN}, {fd=43, events=POLLIN}, {fd=45,
events=POLLIN}], 6, 100 <unfinished ...>
[pid 13023] <... futex resumed> )       = -1 ETIMEDOUT (Connection timed
out)
[pid 13023] futex(0x2b426ed9aa00, FUTEX_WAKE_PRIVATE, 1) = 0
[pid 13023] futex(0x2b426ed9aa44,
FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 1103887, {1444684996,
836695000}, ffffffff <unfinished ...>


[root@apollo4-g01 ~]# ls -l /proc/12989/fd
total 0
lr-x------ 1 he hpcusers 64 Oct 13 04:47 0 -> pipe:[1707538]
lrwx------ 1 he hpcusers 64 Oct 13 04:47 1 -> /dev/pts/0
lrwx------ 1 he hpcusers 64 Oct 13 04:47 10 -> socket:[1689136]
lrwx------ 1 he hpcusers 64 Oct 13 04:47 11 -> socket:[1711145]
lr-x------ 1 he hpcusers 64 Oct 13 04:47 12 -> pipe:[1689141]
l-wx------ 1 he hpcusers 64 Oct 13 04:47 13 -> pipe:[1689141]
lr-x------ 1 he hpcusers 64 Oct 13 04:47 14 -> pipe:[1689142]
l-wx------ 1 he hpcusers 64 Oct 13 04:47 15 -> pipe:[1689142]
lrwx------ 1 he hpcusers 64 Oct 13 04:47 16 -> /dev/infiniband/rdma_cm
lrwx------ 1 he hpcusers 64 Oct 13 04:47 17 -> /dev/infiniband/uverbs0
lr-x------ 1 he hpcusers 64 Oct 13 04:47 18 -> anon_inode:[infinibandevent]
lrwx------ 1 he hpcusers 64 Oct 13 04:47 19 -> /dev/infiniband/uverbs0
l-wx------ 1 he hpcusers 64 Oct 13 04:47 2 -> pipe:[1707539]
lr-x------ 1 he hpcusers 64 Oct 13 04:47 20 -> anon_inode:[infinibandevent]
l-wx------ 1 he hpcusers 64 Oct 13 04:47 21 -> pipe:[1707540]
lr-x------ 1 he hpcusers 64 Oct 13 04:47 22 -> anon_inode:[infinibandevent]
lrwx------ 1 he hpcusers 64 Oct 13 04:47 23 -> socket:[1689145]
lr-x------ 1 he hpcusers 64 Oct 13 04:47 24 -> pipe:[1689146]
l-wx------ 1 he hpcusers 64 Oct 13 04:47 25 -> pipe:[1689146]
lr-x------ 1 he hpcusers 64 Oct 13 04:47 26 -> pipe:[1689147]
l-wx------ 1 he hpcusers 64 Oct 13 04:47 27 -> pipe:[1689147]
lrwx------ 1 he hpcusers 64 Oct 13 04:47 28 -> /home/he/C1C2C3/C3Test/Run1_g51/md.log
lrwx------ 1 he hpcusers 64 Oct 13 04:47 29 -> /dev/nvidiactl
lrwx------ 1 he hpcusers 64 Oct 13 04:47 3 -> socket:[1689129]
lrwx------ 1 he hpcusers 64 Oct 13 04:47 30 -> /dev/nvidia0
lrwx------ 1 he hpcusers 64 Oct 13 04:47 31 -> /dev/nvidia0
lrwx------ 1 he hpcusers 64 Oct 13 04:47 32 -> /dev/nvidia0
lrwx------ 1 he hpcusers 64 Oct 13 04:47 33 -> /dev/nvidia-uvm
lrwx------ 1 he hpcusers 64 Oct 13 04:47 34 -> /dev/nvidiactl
lrwx------ 1 he hpcusers 64 Oct 13 04:47 35 -> /dev/nvidia0
lrwx------ 1 he hpcusers 64 Oct 13 04:47 36 -> /dev/nvidia0
lrwx------ 1 he hpcusers 64 Oct 13 04:47 37 -> /dev/nvidia0
lr-x------ 1 he hpcusers 64 Oct 13 04:47 38 -> pipe:[1677448]
l-wx------ 1 he hpcusers 64 Oct 13 04:47 39 -> pipe:[1677448]
lrwx------ 1 he hpcusers 64 Oct 13 04:47 4 -> socket:[1689130]
lrwx------ 1 he hpcusers 64 Oct 13 04:47 40 -> /dev/nvidia0
lrwx------ 1 he hpcusers 64 Oct 13 04:47 41 -> /dev/nvidia0
lrwx------ 1 he hpcusers 64 Oct 13 04:47 42 -> /dev/nvidia0
lr-x------ 1 he hpcusers 64 Oct 13 04:47 43 -> pipe:[1677449]
l-wx------ 1 he hpcusers 64 Oct 13 04:47 44 -> pipe:[1677449]
lr-x------ 1 he hpcusers 64 Oct 13 04:47 45 -> pipe:[1677450]
l-wx------ 1 he hpcusers 64 Oct 13 04:47 46 -> pipe:[1677450]
lrwx------ 1 he hpcusers 64 Oct 13 04:47 47 -> /home/he/C1C2C3/C3Test/Run1_g51/md.xtc
lrwx------ 1 he hpcusers 64 Oct 13 04:47 48 -> /home/he/C1C2C3/C3Test/Run1_g51/md.edr
lrwx------ 1 he hpcusers 64 Oct 13 04:47 5 -> anon_inode:[eventfd]
*lrwx------ 1 he hpcusers 64 Oct 13 04:47 6 -> /dev/shm/open_mpi.0000 (deleted)*
lrwx------ 1 he hpcusers 64 Oct 13 04:47 7 -> socket:[1689133]
lrwx------ 1 he hpcusers 64 Oct 13 04:47 8 -> socket:[1689134]
lrwx------ 1 he hpcusers 64 Oct 13 04:47 9 -> anon_inode:[eventfd]


Regards,
Teguh


On Thu, Oct 1, 2015 at 10:03 PM, Szilárd Páll <pall.szilard at gmail.com>
wrote:

> The only way to know more is to either attach a debugger to the hanging
> process, or possibly use ltrace/strace to see in which library or syscall
> the process is hanging.
>
> I suggest you try attaching a debugger and getting a stack trace (see
> https://sourceware.org/gdb/onlinedocs/gdb/Attach.html). Note that you'll
> need debugging symbols (e.g. switch to a RelWithDebInfo build in CMake) to
> get a result that can be interpreted.
>
> Cheers,
>
> --
> Szilárd
>
> On Thu, Oct 1, 2015 at 2:20 AM, M Teguh Satria <mteguhsat at gmail.com>
> wrote:
>
> > Hi Stéphane,
> >
> > Thanks for your reply.
> >
> > Actually, everything is fine when we run shorter Gromacs GPU jobs. We only
> > get this problem when we run a longer Gromacs GPU job (one that needs 20+
> > hours of running).
> >
> > I recorded the nvidia-smi output every 10 minutes. From these records, I
> > doubt that temperature was the cause.
> >
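> > Such periodic logging can be done with a simple loop along these lines
> > (just a sketch; the log file name is arbitrary):
> >
> > while true; do
> >     date >> gpu_watch.log
> >     nvidia-smi >> gpu_watch.log
> >     sleep 600
> > done
> >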
> > Before drop:
> > Tue Sep 29 11:59:59 2015
> > +------------------------------------------------------+
> > | NVIDIA-SMI 346.46     Driver Version: 346.46         |
> > |-------------------------------+----------------------+----------------------+
> > | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
> > | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
> > |   0  Tesla K40m          Off  | 0000:82:00.0     Off |                    0 |
> > | N/A   41C    P0   110W / 235W |    139MiB / 11519MiB |     72%      Default |
> > +-------------------------------+----------------------+----------------------+
> >
> > +-----------------------------------------------------------------------------+
> > | Processes:                                                       GPU Memory |
> > |  GPU       PID  Type  Process name                               Usage      |
> > |    0     17500    C   mdrun_mpi                                       82MiB |
> > +-----------------------------------------------------------------------------+
> >
> >
> > After drop to 0%:
> > Tue Sep 29 12:09:59 2015
> > +------------------------------------------------------+
> > | NVIDIA-SMI 346.46     Driver Version: 346.46         |
> > |-------------------------------+----------------------+----------------------+
> > | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
> > | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
> > |   0  Tesla K40m          Off  | 0000:82:00.0     Off |                    0 |
> > | N/A   34C    P0    62W / 235W |    139MiB / 11519MiB |      0%      Default |
> > +-------------------------------+----------------------+----------------------+
> >
> > +-----------------------------------------------------------------------------+
> > | Processes:                                                       GPU Memory |
> > |  GPU       PID  Type  Process name                               Usage      |
> > |    0     17500    C   mdrun_mpi                                       82MiB |
> > +-----------------------------------------------------------------------------+
> >
> >
> > On Wed, Sep 30, 2015 at 5:43 PM, Téletchéa Stéphane <
> > stephane.teletchea at univ-nantes.fr> wrote:
> >
> > > On 29/09/2015 at 23:40, M Teguh Satria wrote:
> > >
> > >> Is any of you experiencing a similar problem? Is there any way to
> > >> troubleshoot/debug to find the cause? I ask because I didn't get any
> > >> warning or error message.
> > >>
> > >
> > > Hello,
> > >
> > > This can be a driver issue (or a hardware one; think of temperature,
> > > dust, ...), and it happens to me from time to time.
> > >
> > > The only solution I found was to reset the GPU (see the nvidia-smi
> > > options). If that is not sufficient, you will have to reboot (and do a
> > > cold boot: turn the computer off for more than 30 s, then boot again).
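> > >
> > > For example, something along these lines (run as root, with no processes
> > > using the GPU; check the nvidia-smi help output for the exact option on
> > > your driver version):
> > >
> > > nvidia-smi -i 0 --gpu-reset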
> > >
> > > If this happens too often, you may have a defective card; see your
> > > vendor in that case...
> > >
> > > Best,
> > >
> > > Stéphane Téletchéa
> > >
> > > --
> > > Assistant Professor, UFIP, UMR 6286 CNRS, Team Protein Design In Silico
> > > UFR Sciences et Techniques, 2, rue de la Houssinière, Bât. 25, 44322
> > > Nantes cedex 03, France
> > > Tél : +33 251 125 636 / Fax : +33 251 125 632
> > > http://www.ufip.univ-nantes.fr/ - http://www.steletch.org
> > >



-- 
-----------------------------------------------------------------------------------
Regards,
*Teguh* <http://www.linkedin.com/in/mteguhsatria>

