[gmx-users] Gromacs GPU got hang

Szilárd Páll pall.szilard at gmail.com
Tue Oct 13 15:54:02 CEST 2015


Hi Teguh,

Unfortunately, I can't see anything out of the ordinary in these outputs,
and, admittedly, the library trace was what I had hoped would tell us the most.

I can't exclude the possibility of this being a bug - either in GROMACS or
in one of the runtimes used. To test this and have a chance of tracking
down the issue, I suggest:

- Attaching a debugger and getting a stack trace (e.g. gdb -p PID, then
type "bt"; example commands below). This is most helpful if GROMACS is built
with debug symbols (CMAKE_BUILD_TYPE=RelWithDebInfo); it should tell you
where exactly the run is stuck.

- Trying different builds, e.g. gcc >= 4.8, or a different MPI or non-MPI
build (on a single node you can run the same setup with 4 thread-MPI ranks).
It may be worth setting up a test or two based on the outcome of the above.
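
As a rough sketch (the PID here is just the one from the ps listing in your
mail below, and "thread apply all bt" dumps the backtrace of every thread):

$ gdb -p 12989
(gdb) thread apply all bt
(gdb) detach
(gdb) quit

and a single-node thread-MPI test, assuming a non-MPI build installed as
"gmx" and the same 4x4 rank/thread layout, could look like:

$ gmx mdrun -deffnm md -ntmpi 4 -ntomp 4 -pin on -gpu_id 0000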

I hope that helps.

Cheers,
--
Szilárd

On Tue, Oct 13, 2015 at 12:19 AM, M Teguh Satria <mteguhsat at gmail.com>
wrote:

> Hi Szilárd,
>
> I tried attaching strace to one of the MPI ranks. Below are the outputs. There
> are some timeouts in the OpenMP threads, but I have no idea what the root
> cause is. Is it some kind of bug in GROMACS, or maybe in MPI / OpenMP? Can you
> see what the root cause is?
>
> FYI, we use Intel compiler v15.0.2 and OpenMPI v1.8.8. In this simulation,
> I use 4 MPI ranks and each rank has 4 OpenMP threads.
>
> PIDs of the MPI processes:
> $ ps -axf
> 3014 ?        Sl    16:50 /opt/gridengine/bin/linux-x64/sge_execd
>  3027 ?        S      2:37  \_ /opt/gridengine/util/gpu/gpu_sensor
> 12938 ?        S      0:00  \_ sge_shepherd-407 -bg
> 12967 ?        Ss     0:00      \_ -bash
> /opt/gridengine/apollo4/spool/apollo4-g01/job_scripts/407
> 12987 ?        Sl     0:00          \_ mpirun -np 4 gmx_mpi mdrun -deffnm
> md -pin on -pinoffset 0 -gpu_id 0000
> 12989 ?        Sl   4680:09              \_ gmx_mpi mdrun -deffnm md -pin
> on -pinoffset 0 -gpu_id 0000
> 12990 ?        Sl   4681:36              \_ gmx_mpi mdrun -deffnm md -pin
> on -pinoffset 0 -gpu_id 0000
> 12991 ?        Sl   4681:17              \_ gmx_mpi mdrun -deffnm md -pin
> on -pinoffset 0 -gpu_id 0000
> 12992 ?        Sl   4681:25              \_ gmx_mpi mdrun -deffnm md -pin
> on -pinoffset 0 -gpu_id 0000
>
>
> strace output for one of the MPI processes:
> [root at apollo4-g01 strace407]# strace -s 128 -x -p 12989 -f 2>&1 | head -50
> Process 12989 attached with 9 threads
> [pid 13039] restart_syscall(<... resuming interrupted call ...> <unfinished
> ...>
> [pid 13031] futex(0x2b428203be84, FUTEX_WAIT_PRIVATE, 3, NULL <unfinished
> ...>
> [pid 13025] futex(0x2b428205d484, FUTEX_WAIT_PRIVATE, 5, NULL <unfinished
> ...>
> [pid 13024] futex(0x2b428205e984, FUTEX_WAIT_PRIVATE, 5, NULL <unfinished
> ...>
> [pid 13023] restart_syscall(<... resuming interrupted call ...> <unfinished
> ...>
> [pid 13020] restart_syscall(<... resuming interrupted call ...> <unfinished
> ...>
> [pid 13005] select(23, [12 16 22], [], NULL, NULL <unfinished ...>
> [pid 12995] restart_syscall(<... resuming interrupted call ...> <unfinished
> ...>
> [pid 12989] futex(0x2b428205f784, FUTEX_WAIT_PRIVATE, 1, NULL <unfinished
> ...>
> [pid 13039] <... restart_syscall resumed> ) = 0
> [pid 13039] clock_gettime(CLOCK_MONOTONIC_RAW, {3330079, 937607479}) = 0
> [pid 13039] clock_gettime(CLOCK_MONOTONIC_RAW, {3330079, 937666819}) = 0
> [pid 13039] poll([{fd=38, events=POLLIN}, {fd=40, events=POLLIN}, {fd=41,
> events=POLLIN}, {fd=42, events=POLLIN}, {fd=43, events=POLLIN}, {fd=45,
> events=POLLIN}], 6, 100) = 0 (Timeout)
> [pid 13039] clock_gettime(CLOCK_MONOTONIC_RAW, {3330080, 38112431}) = 0
> [pid 13039] clock_gettime(CLOCK_MONOTONIC_RAW, {3330080, 38178662}) = 0
> [pid 13039] poll([{fd=38, events=POLLIN}, {fd=40, events=POLLIN}, {fd=41,
> events=POLLIN}, {fd=42, events=POLLIN}, {fd=43, events=POLLIN}, {fd=45,
> events=POLLIN}], 6, 100 <unfinished ...>
> *[pid 13023] <... restart_syscall resumed> ) = -1 ETIMEDOUT (Connection
> timed out)*
> [pid 13023] futex(0x2b426ed9aa00, FUTEX_WAKE_PRIVATE, 1) = 0
> [pid 13023] futex(0x2b426ed9aa44,
> FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 1103881, {1444684996,
> 236306000}, ffffffff <unfinished ...>
> [pid 13039] <... poll resumed> )        = 0 (Timeout)
> [pid 13039] clock_gettime(CLOCK_MONOTONIC_RAW, {3330080, 138386651}) = 0
> [pid 13039] clock_gettime(CLOCK_MONOTONIC_RAW, {3330080, 138430809}) = 0
> [pid 13039] poll([{fd=38, events=POLLIN}, {fd=40, events=POLLIN}, {fd=41,
> events=POLLIN}, {fd=42, events=POLLIN}, {fd=43, events=POLLIN}, {fd=45,
> events=POLLIN}], 6, 100) = 0 (Timeout)
> [pid 13039] clock_gettime(CLOCK_MONOTONIC_RAW, {3330080, 238630125}) = 0
> [pid 13039] clock_gettime(CLOCK_MONOTONIC_RAW, {3330080, 238673628}) = 0
> [pid 13039] poll([{fd=38, events=POLLIN}, {fd=40, events=POLLIN}, {fd=41,
> events=POLLIN}, {fd=42, events=POLLIN}, {fd=43, events=POLLIN}, {fd=45,
> events=POLLIN}], 6, 100 <unfinished ...>
> [pid 13023] <... futex resumed> )       = -1 ETIMEDOUT (Connection timed
> out)
> [pid 13023] futex(0x2b426ed9aa00, FUTEX_WAKE_PRIVATE, 1) = 0
> [pid 13023] futex(0x2b426ed9aa44,
> FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 1103883, {1444684996,
> 436466000}, ffffffff <unfinished ...>
> [pid 13039] <... poll resumed> )        = 0 (Timeout)
> [pid 13039] clock_gettime(CLOCK_MONOTONIC_RAW, {3330080, 338874153}) = 0
> [pid 13039] clock_gettime(CLOCK_MONOTONIC_RAW, {3330080, 338918688}) = 0
> *[pid 13039] poll([{fd=38, events=POLLIN}, {fd=40, events=POLLIN}, {fd=41,
> events=POLLIN}, {fd=42, events=POLLIN}, {fd=43, events=POLLIN}, {fd=45,
> events=POLLIN}], 6, 100) = 0 (Timeout)*
> [pid 13039] clock_gettime(CLOCK_MONOTONIC_RAW, {3330080, 439131341}) = 0
> [pid 13039] clock_gettime(CLOCK_MONOTONIC_RAW, {3330080, 439174001}) = 0
> [pid 13039] poll([{fd=38, events=POLLIN}, {fd=40, events=POLLIN}, {fd=41,
> events=POLLIN}, {fd=42, events=POLLIN}, {fd=43, events=POLLIN}, {fd=45,
> events=POLLIN}], 6, 100 <unfinished ...>
> *[pid 13023] <... futex resumed> )       = -1 ETIMEDOUT (Connection timed
> out)*
> [pid 13023] futex(0x2b426ed9aa00, FUTEX_WAKE_PRIVATE, 1) = 0
> [pid 13023] futex(0x2b426ed9aa44,
> FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 1103885, {1444684996,
> 636581000}, ffffffff <unfinished ...>
> [pid 13039] <... poll resumed> )        = 0 (Timeout)
> [pid 13039] clock_gettime(CLOCK_MONOTONIC_RAW, {3330080, 539370079}) = 0
> [pid 13039] clock_gettime(CLOCK_MONOTONIC_RAW, {3330080, 539411729}) = 0
> [pid 13039] poll([{fd=38, events=POLLIN}, {fd=40, events=POLLIN}, {fd=41,
> events=POLLIN}, {fd=42, events=POLLIN}, {fd=43, events=POLLIN}, {fd=45,
> events=POLLIN}], 6, 100) = 0 (Timeout)
> [pid 13039] clock_gettime(CLOCK_MONOTONIC_RAW, {3330080, 639606472}) = 0
> [pid 13039] clock_gettime(CLOCK_MONOTONIC_RAW, {3330080, 639648695}) = 0
> [pid 13039] poll([{fd=38, events=POLLIN}, {fd=40, events=POLLIN}, {fd=41,
> events=POLLIN}, {fd=42, events=POLLIN}, {fd=43, events=POLLIN}, {fd=45,
> events=POLLIN}], 6, 100 <unfinished ...>
> [pid 13023] <... futex resumed> )       = -1 ETIMEDOUT (Connection timed
> out)
> [pid 13023] futex(0x2b426ed9aa00, FUTEX_WAKE_PRIVATE, 1) = 0
> [pid 13023] futex(0x2b426ed9aa44,
> FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 1103887, {1444684996,
> 836695000}, ffffffff <unfinished ...>
>
>
> [root at apollo4-g01 ~]# ls -l /proc/12989/fd
> total 0
> lr-x------ 1 he hpcusers 64 Oct 13 04:47 0 -> pipe:[1707538]
> lrwx------ 1 he hpcusers 64 Oct 13 04:47 1 -> /dev/pts/0
> lrwx------ 1 he hpcusers 64 Oct 13 04:47 10 -> socket:[1689136]
> lrwx------ 1 he hpcusers 64 Oct 13 04:47 11 -> socket:[1711145]
> lr-x------ 1 he hpcusers 64 Oct 13 04:47 12 -> pipe:[1689141]
> l-wx------ 1 he hpcusers 64 Oct 13 04:47 13 -> pipe:[1689141]
> lr-x------ 1 he hpcusers 64 Oct 13 04:47 14 -> pipe:[1689142]
> l-wx------ 1 he hpcusers 64 Oct 13 04:47 15 -> pipe:[1689142]
> lrwx------ 1 he hpcusers 64 Oct 13 04:47 16 -> /dev/infiniband/rdma_cm
> lrwx------ 1 he hpcusers 64 Oct 13 04:47 17 -> /dev/infiniband/uverbs0
> lr-x------ 1 he hpcusers 64 Oct 13 04:47 18 -> anon_inode:[infinibandevent]
> lrwx------ 1 he hpcusers 64 Oct 13 04:47 19 -> /dev/infiniband/uverbs0
> l-wx------ 1 he hpcusers 64 Oct 13 04:47 2 -> pipe:[1707539]
> lr-x------ 1 he hpcusers 64 Oct 13 04:47 20 -> anon_inode:[infinibandevent]
> l-wx------ 1 he hpcusers 64 Oct 13 04:47 21 -> pipe:[1707540]
> lr-x------ 1 he hpcusers 64 Oct 13 04:47 22 -> anon_inode:[infinibandevent]
> lrwx------ 1 he hpcusers 64 Oct 13 04:47 23 -> socket:[1689145]
> lr-x------ 1 he hpcusers 64 Oct 13 04:47 24 -> pipe:[1689146]
> l-wx------ 1 he hpcusers 64 Oct 13 04:47 25 -> pipe:[1689146]
> lr-x------ 1 he hpcusers 64 Oct 13 04:47 26 -> pipe:[1689147]
> l-wx------ 1 he hpcusers 64 Oct 13 04:47 27 -> pipe:[1689147]
> lrwx------ 1 he hpcusers 64 Oct 13 04:47 28 ->
> /home/he/C1C2C3/C3Test/Run1_g51/md.log
> lrwx------ 1 he hpcusers 64 Oct 13 04:47 29 -> /dev/nvidiactl
> lrwx------ 1 he hpcusers 64 Oct 13 04:47 3 -> socket:[1689129]
> lrwx------ 1 he hpcusers 64 Oct 13 04:47 30 -> /dev/nvidia0
> lrwx------ 1 he hpcusers 64 Oct 13 04:47 31 -> /dev/nvidia0
> lrwx------ 1 he hpcusers 64 Oct 13 04:47 32 -> /dev/nvidia0
> lrwx------ 1 he hpcusers 64 Oct 13 04:47 33 -> /dev/nvidia-uvm
> lrwx------ 1 he hpcusers 64 Oct 13 04:47 34 -> /dev/nvidiactl
> lrwx------ 1 he hpcusers 64 Oct 13 04:47 35 -> /dev/nvidia0
> lrwx------ 1 he hpcusers 64 Oct 13 04:47 36 -> /dev/nvidia0
> lrwx------ 1 he hpcusers 64 Oct 13 04:47 37 -> /dev/nvidia0
> lr-x------ 1 he hpcusers 64 Oct 13 04:47 38 -> pipe:[1677448]
> l-wx------ 1 he hpcusers 64 Oct 13 04:47 39 -> pipe:[1677448]
> lrwx------ 1 he hpcusers 64 Oct 13 04:47 4 -> socket:[1689130]
> lrwx------ 1 he hpcusers 64 Oct 13 04:47 40 -> /dev/nvidia0
> lrwx------ 1 he hpcusers 64 Oct 13 04:47 41 -> /dev/nvidia0
> lrwx------ 1 he hpcusers 64 Oct 13 04:47 42 -> /dev/nvidia0
> lr-x------ 1 he hpcusers 64 Oct 13 04:47 43 -> pipe:[1677449]
> l-wx------ 1 he hpcusers 64 Oct 13 04:47 44 -> pipe:[1677449]
> lr-x------ 1 he hpcusers 64 Oct 13 04:47 45 -> pipe:[1677450]
> l-wx------ 1 he hpcusers 64 Oct 13 04:47 46 -> pipe:[1677450]
> lrwx------ 1 he hpcusers 64 Oct 13 04:47 47 ->
> /home/he/C1C2C3/C3Test/Run1_g51/md.xtc
> lrwx------ 1 he hpcusers 64 Oct 13 04:47 48 ->
> /home/he/C1C2C3/C3Test/Run1_g51/md.edr
> lrwx------ 1 he hpcusers 64 Oct 13 04:47 5 -> anon_inode:[eventfd]
> *lrwx------ 1 he hpcusers 64 Oct 13 04:47 6 -> /dev/shm/open_mpi.0000
> (deleted)*
> lrwx------ 1 he hpcusers 64 Oct 13 04:47 7 -> socket:[1689133]
> lrwx------ 1 he hpcusers 64 Oct 13 04:47 8 -> socket:[1689134]
> lrwx------ 1 he hpcusers 64 Oct 13 04:47 9 -> anon_inode:[eventfd]
>
>
> Regards,
> Teguh
>
>
> On Thu, Oct 1, 2015 at 10:03 PM, Szilárd Páll <pall.szilard at gmail.com>
> wrote:
>
> > The only way to know more is to either attach a debugger to the hanging
> > process, or possibly use ltrace/strace to see in which library or
> > syscall the process is hanging.
> >
> > I suggest you try attaching a debugger and getting a stack trace (see
> > https://sourceware.org/gdb/onlinedocs/gdb/Attach.html). Note that you'll
> > need debugging symbols (e.g. switch to a RelWithDebInfo build in CMake) to
> > get a result that can be interpreted.
> >
> > Cheers,
> >
> > --
> > Szilárd
> >
> > On Thu, Oct 1, 2015 at 2:20 AM, M Teguh Satria <mteguhsat at gmail.com>
> > wrote:
> >
> > > Hi Stéphane,
> > >
> > > Thanks for your reply.
> > >
> > > Actually, everything is fine when we run shorter GROMACS GPU jobs. Only
> > > when we run a longer GROMACS GPU job (requiring 20+ hours of running) do
> > > we get this problem.
> > >
> > > I recorded nvidia-smi output every 10 minutes. From these records, I doubt
> > > that temperature was the cause.
> > >
> > > Before drop:
> > > Tue Sep 29 11:59:59 2015
> > > +------------------------------------------------------+
> > > | NVIDIA-SMI 346.46     Driver Version: 346.46         |
> > > |-------------------------------+----------------------+----------------------+
> > > | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
> > > | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
> > > |   0  Tesla K40m          Off  | 0000:82:00.0     Off |                    0 |
> > > | N/A   41C    P0   110W / 235W |    139MiB / 11519MiB |     72%      Default |
> > > +-------------------------------+----------------------+----------------------+
> > >
> > > +-----------------------------------------------------------------------------+
> > > | Processes:                                                       GPU Memory |
> > > |  GPU       PID  Type  Process name                               Usage      |
> > > |    0     17500    C   mdrun_mpi                                       82MiB |
> > > +-----------------------------------------------------------------------------+
> > >
> > >
> > > After drop to 0%:
> > > Tue Sep 29 12:09:59 2015
> > > +------------------------------------------------------+
> > > | NVIDIA-SMI 346.46     Driver Version: 346.46         |
> > > |-------------------------------+----------------------+----------------------+
> > > | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
> > > | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
> > > |   0  Tesla K40m          Off  | 0000:82:00.0     Off |                    0 |
> > > | N/A   34C    P0    62W / 235W |    139MiB / 11519MiB |      0%      Default |
> > > +-------------------------------+----------------------+----------------------+
> > >
> > > +-----------------------------------------------------------------------------+
> > > | Processes:                                                       GPU Memory |
> > > |  GPU       PID  Type  Process name                               Usage      |
> > > |    0     17500    C   mdrun_mpi                                       82MiB |
> > > +-----------------------------------------------------------------------------+
> > >
> > >
> > > On Wed, Sep 30, 2015 at 5:43 PM, Téletchéa Stéphane <
> > > stephane.teletchea at univ-nantes.fr> wrote:
> > >
> > > > On 29/09/2015 at 23:40, M Teguh Satria wrote:
> > > >
> > > >> Are any of you experiencing a similar problem? Is there any way to
> > > >> troubleshoot/debug to find the cause? Because I didn't get any warning
> > > >> or error message.
> > > >>
> > > >
> > > > Hello,
> > > >
> > > > This can be a driver issue (or hardware; think of temperature, dust,
> > > > ...), and it happens to me from time to time.
> > > >
> > > > The only solution I found was to reset the GPU (see the nvidia-smi
> > > > options). If this is not sufficient you will have to reboot (and use a
> > > > cold boot: turn off the computer for more than 30 s, then boot again).
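> > > > For example, something like this (just a sketch; it assumes the card is
> > > > GPU 0, nothing is using it, and you have root):
> > > >
> > > > $ nvidia-smi --gpu-reset -i 0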
> > > >
> > > > If this happens too often, you may have a defective card; see your
> > > > vendor in that case...
> > > >
> > > > Best,
> > > >
> > > > Stéphane Téletchéa
> > > >
> > > > --
> > > > Assistant Professor, UFIP, UMR 6286 CNRS, Team Protein Design In
> Silico
> > > > UFR Sciences et Techniques, 2, rue de la Houssinière, Bât. 25, 44322
> > > > Nantes cedex 03, France
> > > > Tél : +33 251 125 636 / Fax : +33 251 125 632
> > > > http://www.ufip.univ-nantes.fr/ - http://www.steletch.org
> > > >
> > >
> > >
> > >
> > > --
> > > -----------------------------------------------------------------------------------
> > > Regards,
> > > *Teguh* <http://www.linkedin.com/in/mteguhsatria>
>
>
>
> --
>
> -----------------------------------------------------------------------------------
> Regards,
> *Teguh* <http://www.linkedin.com/in/mteguhsatria>
> --
> Gromacs Users mailing list
>
> * Please search the archive at
> http://www.gromacs.org/Support/Mailing_Lists/GMX-Users_List before
> posting!
>
> * Can't post? Read http://www.gromacs.org/Support/Mailing_Lists
>
> * For (un)subscribe requests visit
> https://maillist.sys.kth.se/mailman/listinfo/gromacs.org_gmx-users or
> send a mail to gmx-users-request at gromacs.org.
>


More information about the gromacs.org_gmx-users mailing list