[gmx-users] Gromacs GPU got hang

Szilárd Páll pall.szilard at gmail.com
Thu Oct 1 16:03:16 CEST 2015


The only way to learn more is to attach a debugger to the hanging
process, or possibly to run ltrace/strace on it, to see in which library
call or syscall the process is stuck.

I suggest you try attaching a debugger and getting a stack trace (see
https://sourceware.org/gdb/onlinedocs/gdb/Attach.html). Note that you'll
need debugging symbols (e.g. switch to the RelWithDebInfo build type in
CMake) to get a result that can be interpreted.
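
As a minimal sketch of the above (not a GROMACS-specific tool), the Python
snippet below builds a non-interactive gdb command that attaches to a process
and prints a backtrace of every thread, plus an strace command for watching
syscalls. The helper names (gdb_backtrace_cmd, strace_attach_cmd, dump_stacks)
are made up for illustration. It assumes gdb and strace are installed and that
you are allowed to ptrace the process (same user or root, depending on system
settings).

```python
import subprocess


def gdb_backtrace_cmd(pid):
    """Build a non-interactive gdb command that dumps every thread's stack."""
    return [
        "gdb",
        "-p", str(pid),                 # attach to the running process
        "-batch",                       # exit once the -ex commands have run
        "-ex", "thread apply all bt",   # backtrace of all threads
    ]


def strace_attach_cmd(pid):
    """Build an strace command that attaches to all threads of a process."""
    return ["strace", "-f", "-p", str(pid)]


def dump_stacks(pid, timeout_s=120):
    """Run gdb against `pid` and return its output (stdout, then stderr)."""
    result = subprocess.run(
        gdb_backtrace_cmd(pid),
        capture_output=True,
        text=True,
        timeout=timeout_s,
    )
    return result.stdout + result.stderr
```

For example, dump_stacks(17500) would target the mdrun_mpi PID shown in the
nvidia-smi output quoted below. With an MPI run, each rank is a separate
process, so each hanging rank would likely need its own trace.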

Cheers,

--
Szilárd

On Thu, Oct 1, 2015 at 2:20 AM, M Teguh Satria <mteguhsat at gmail.com> wrote:

> Hi Stéphane,
>
> Thanks for your reply.
>
> Actually, everything is fine when we run shorter GROMACS GPU jobs. We only
> hit this problem with longer GPU jobs (20+ hours of running).
>
> I recorded the nvidia-smi output every 10 minutes. From these records, I
> doubt that temperature was the cause.
>
> Before drop:
> Tue Sep 29 11:59:59 2015
> +------------------------------------------------------+
> | NVIDIA-SMI 346.46     Driver Version: 346.46         |
> |-------------------------------+----------------------+----------------------+
> | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
> | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
> |   0  Tesla K40m          Off  | 0000:82:00.0     Off |                    0 |
> | N/A   41C    P0   110W / 235W |    139MiB / 11519MiB |     72%      Default |
> +-------------------------------+----------------------+----------------------+
>
> +-----------------------------------------------------------------------------+
> | Processes:                                                       GPU Memory |
> |  GPU       PID  Type  Process name                               Usage      |
> |    0     17500    C   mdrun_mpi                                       82MiB |
> +-----------------------------------------------------------------------------+
>
>
> After drop to 0%:
> Tue Sep 29 12:09:59 2015
> +------------------------------------------------------+
> | NVIDIA-SMI 346.46     Driver Version: 346.46         |
> |-------------------------------+----------------------+----------------------+
> | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
> | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
> |   0  Tesla K40m          Off  | 0000:82:00.0     Off |                    0 |
> | N/A   34C    P0    62W / 235W |    139MiB / 11519MiB |      0%      Default |
> +-------------------------------+----------------------+----------------------+
>
> +-----------------------------------------------------------------------------+
> | Processes:                                                       GPU Memory |
> |  GPU       PID  Type  Process name                               Usage      |
> |    0     17500    C   mdrun_mpi                                       82MiB |
> +-----------------------------------------------------------------------------+
>
>
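
Periodic records like the ones above could also be collected in CSV form and
scanned automatically. The sketch below assumes a log produced by something
like

  nvidia-smi --query-gpu=timestamp,temperature.gpu,utilization.gpu,power.draw \
             --format=csv,noheader,nounits -l 600

and that every field is numeric (no "[Not Supported]" entries). The function
names and the 10% "busy" threshold are illustrative choices, and the two sample
lines are built from the values reported in this thread, not from a real log.

```python
import csv
import io

# Column order requested via --query-gpu (an assumption of this sketch).
QUERY_FIELDS = ["timestamp", "temperature.gpu", "utilization.gpu", "power.draw"]


def parse_samples(text):
    """Parse CSV lines into dicts with numeric temperature, load and power."""
    samples = []
    for row in csv.reader(io.StringIO(text), skipinitialspace=True):
        if len(row) != len(QUERY_FIELDS):
            continue  # skip blank or malformed lines
        timestamp, temp, util, power = (field.strip() for field in row)
        samples.append({
            "timestamp": timestamp,
            "temp_c": int(temp),
            "util_pct": int(util),
            "power_w": float(power),
        })
    return samples


def find_stall(samples, busy_pct=10):
    """Return the first sample at 0% load that directly follows a busy one."""
    for prev, cur in zip(samples, samples[1:]):
        if prev["util_pct"] >= busy_pct and cur["util_pct"] == 0:
            return cur
    return None


# Two samples using the temperature, utilization and power values from the
# thread; the timestamp format is assumed.
log = (
    "2015/09/29 11:59:59.000, 41, 72, 110.00\n"
    "2015/09/29 12:09:59.000, 34, 0, 62.00\n"
)
stall = find_stall(parse_samples(log))
print(stall)
```

A shorter sampling interval than 10 minutes would narrow down when the drop
happened, and the temperature just before the stall can then be read from the
preceding sample.
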
> On Wed, Sep 30, 2015 at 5:43 PM, Téletchéa Stéphane <
> stephane.teletchea at univ-nantes.fr> wrote:
>
> > On 29/09/2015 23:40, M Teguh Satria wrote:
> >
> >> Have any of you experienced a similar problem? Is there any way to
> >> troubleshoot/debug it to find the cause? I didn't get any warning or
> >> error message.
> >>
> >
> > Hello,
> >
> > This can be a driver issue (or hardware: think of temperature, dust, ...),
> > and it happens to me from time to time.
> >
> > The only solution I found was to reset the GPU (see the nvidia-smi options).
> > If this is not sufficient, you will have to reboot (using a cold boot:
> > turn off the computer for more than 30 s, then boot again).
> >
> > If this happens too often, you may have a defective card; see your vendor
> > in that case...
> >
> > Best,
> >
> > Stéphane Téletchéa
> >
> > --
> > Assistant Professor, UFIP, UMR 6286 CNRS, Team Protein Design In Silico
> > UFR Sciences et Techniques, 2, rue de la Houssinière, Bât. 25, 44322
> > Nantes cedex 03, France
> > Tél : +33 251 125 636 / Fax : +33 251 125 632
> > http://www.ufip.univ-nantes.fr/ - http://www.steletch.org
> >
> > --
> > Gromacs Users mailing list
> >
> > * Please search the archive at
> > http://www.gromacs.org/Support/Mailing_Lists/GMX-Users_List before
> > posting!
> >
> > * Can't post? Read http://www.gromacs.org/Support/Mailing_Lists
> >
> > * For (un)subscribe requests visit
> > https://maillist.sys.kth.se/mailman/listinfo/gromacs.org_gmx-users or
> > send a mail to gmx-users-request at gromacs.org.
> >
>
>
>
> --
>
> -----------------------------------------------------------------------------------
> Regards,
> *Teguh* <http://www.linkedin.com/in/mteguhsatria>
>

