[gmx-users] Gromacs GPU got hang
Szilárd Páll
pall.szilard at gmail.com
Thu Oct 1 16:03:16 CEST 2015
The only way to learn more is to attach a debugger to the hanging
process, or possibly to run ltrace/strace on it, to see in which library
call or syscall the process is hanging.
I suggest you try attaching a debugger and getting a stack trace (see
https://sourceware.org/gdb/onlinedocs/gdb/Attach.html). Note that you'll
need debugging symbols (e.g. switch to the RelWithDebInfo build type in
CMake) to get a result that can be interpreted.
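
As a rough illustration of the attach-and-backtrace step, here is a minimal
Python sketch. It assumes gdb and strace are installed on the compute node
and that you are allowed to attach to the process (same user or root). The
function names are hypothetical helpers; only the gdb/strace options
themselves are standard.

```python
import subprocess


def gdb_backtrace_command(pid):
    """Build a gdb command line that attaches to `pid` and prints all stacks."""
    return [
        "gdb",
        "-p", str(pid),                # attach to the already-running process
        "-batch",                      # run the -ex commands, then exit
        "-ex", "thread apply all bt",  # backtrace of every thread
    ]


def strace_command(pid):
    """Build an strace command line that follows all threads of `pid`."""
    return ["strace", "-f", "-p", str(pid)]


def dump_stacks(pid, timeout=120):
    """Run gdb against `pid` and return its combined output as text.

    Nothing is executed until this function is called explicitly.
    """
    result = subprocess.run(
        gdb_backtrace_command(pid),
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        universal_newlines=True,
        timeout=timeout,
    )
    return result.stdout
```

For the hanging mdrun_mpi process, one would call print(dump_stacks(PID))
with the PID reported by nvidia-smi or ps. Without debugging symbols the
frames will mostly show raw addresses instead of function names.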
Cheers,
--
Szilárd
On Thu, Oct 1, 2015 at 2:20 AM, M Teguh Satria <mteguhsat at gmail.com> wrote:
> Hi Stéphane,
>
> Thanks for your reply.
>
> Actually, everything is fine when we run shorter GROMACS GPU jobs. We only
> hit this problem when we run longer GROMACS GPU jobs (20+ hours of run
> time).
>
> I recorded the nvidia-smi output every 10 minutes. Based on these records,
> I doubt that temperature was the cause.
>
> Before drop:
> Tue Sep 29 11:59:59 2015
> +------------------------------------------------------+
> | NVIDIA-SMI 346.46     Driver Version: 346.46         |
> |-------------------------------+----------------------+----------------------+
> | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
> | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
> |   0  Tesla K40m          Off  | 0000:82:00.0     Off |                    0 |
> | N/A   41C    P0   110W / 235W |    139MiB / 11519MiB |     72%      Default |
> +-------------------------------+----------------------+----------------------+
>
> +-----------------------------------------------------------------------------+
> | Processes:                                                       GPU Memory |
> |  GPU       PID  Type  Process name                               Usage      |
> |    0     17500    C   mdrun_mpi                                       82MiB |
> +-----------------------------------------------------------------------------+
>
>
> After drop to 0%:
> Tue Sep 29 12:09:59 2015
> +------------------------------------------------------+
> | NVIDIA-SMI 346.46     Driver Version: 346.46         |
> |-------------------------------+----------------------+----------------------+
> | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
> | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
> |   0  Tesla K40m          Off  | 0000:82:00.0     Off |                    0 |
> | N/A   34C    P0    62W / 235W |    139MiB / 11519MiB |      0%      Default |
> +-------------------------------+----------------------+----------------------+
>
> +-----------------------------------------------------------------------------+
> | Processes:                                                       GPU Memory |
> |  GPU       PID  Type  Process name                               Usage      |
> |    0     17500    C   mdrun_mpi                                       82MiB |
> +-----------------------------------------------------------------------------+
>
>
> On Wed, Sep 30, 2015 at 5:43 PM, Téletchéa Stéphane <
> stephane.teletchea at univ-nantes.fr> wrote:
>
> > On 29/09/2015 23:40, M Teguh Satria wrote:
> >
> >> Are any of you experiencing a similar problem? Is there any way to
> >> troubleshoot/debug this to find the cause? I didn't get any warning
> >> or error message.
> >>
> >
> > Hello,
> >
> > This can be a driver issue (or hardware: think of temperature, dust,
> > ...), and it happens to me from time to time.
> >
> > The only solution I found was to reset the GPU (see the nvidia-smi
> > options). If that is not sufficient, you will have to reboot (and do a
> > cold boot: turn off the computer for more than 30 s, then boot again).
> >
> > If this happens too often, you may have a defective card; see your
> > vendor in that case...
> >
> > Best,
> >
> > Stéphane Téletchéa
> >
> > --
> > Assistant Professor, UFIP, UMR 6286 CNRS, Team Protein Design In Silico
> > UFR Sciences et Techniques, 2, rue de la Houssinière, Bât. 25, 44322
> > Nantes cedex 03, France
> > Tél : +33 251 125 636 / Fax : +33 251 125 632
> > http://www.ufip.univ-nantes.fr/ - http://www.steletch.org
> >
> > --
> > Gromacs Users mailing list
> >
> > * Please search the archive at
> > http://www.gromacs.org/Support/Mailing_Lists/GMX-Users_List before
> > posting!
> >
> > * Can't post? Read http://www.gromacs.org/Support/Mailing_Lists
> >
> > * For (un)subscribe requests visit
> > https://maillist.sys.kth.se/mailman/listinfo/gromacs.org_gmx-users or
> > send a mail to gmx-users-request at gromacs.org.
> >
>
>
>
> --
>
> -----------------------------------------------------------------------------------
> Regards,
> *Teguh* <http://www.linkedin.com/in/mteguhsatria>