[gmx-users] Gromacs GPU got hang

M Teguh Satria mteguhsat at gmail.com
Thu Oct 1 02:20:57 CEST 2015


Hi Stéphane,

Thanks for your reply.

Actually, everything is fine when we run shorter GROMACS GPU jobs. The problem
only appears with longer GPU jobs (those that need 20+ hours of runtime).

I recorded the nvidia-smi output every 10 minutes. From these records, I doubt
that temperature was the cause: the GPU was at 41C in the last sample before
utilization dropped.
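
For anyone who wants to collect similar records, here is a rough sketch of how periodic sampling could be scripted. It is not necessarily how the records in this message were collected. It assumes nvidia-smi is on the PATH and relies on its --query-gpu CSV output; the choice of fields, the log file name gpu_log.csv, and the helper names are illustrative assumptions.

```python
import csv
import subprocess
import time

# Fields passed to `nvidia-smi --query-gpu`; this selection is an assumption.
QUERY_FIELDS = ["timestamp", "temperature.gpu", "utilization.gpu",
                "power.draw", "memory.used"]


def parse_sample(line):
    """Turn one CSV line printed by nvidia-smi into a dict keyed by field name."""
    values = [v.strip() for v in next(csv.reader([line]))]
    return dict(zip(QUERY_FIELDS, values))


def read_sample(gpu_index=0):
    """Query one GPU once and return the parsed sample."""
    out = subprocess.check_output(
        ["nvidia-smi", "-i", str(gpu_index),
         "--query-gpu=" + ",".join(QUERY_FIELDS),
         "--format=csv,noheader,nounits"],
        text=True,
    )
    return parse_sample(out.strip().splitlines()[0])


def log_forever(path="gpu_log.csv", interval_s=600):
    """Append one sample every interval_s seconds (600 s = 10 minutes)."""
    with open(path, "a", newline="") as fh:
        writer = csv.writer(fh)
        while True:
            sample = read_sample()
            writer.writerow([sample[f] for f in QUERY_FIELDS])
            fh.flush()  # keep the log readable even if the node hangs later
            time.sleep(interval_s)
```

Calling log_forever() in a separate terminal (or under nohup) alongside the mdrun job would produce one CSV row per interval, which is easier to scan afterwards than the full nvidia-smi table.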

Before drop:
Tue Sep 29 11:59:59 2015
+------------------------------------------------------+
| NVIDIA-SMI 346.46     Driver Version: 346.46         |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|   0  Tesla K40m          Off  | 0000:82:00.0     Off |                    0 |
| N/A   41C    P0   110W / 235W |    139MiB / 11519MiB |     72%      Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|    0     17500    C   mdrun_mpi                                       82MiB |
+-----------------------------------------------------------------------------+


After drop to 0%:
Tue Sep 29 12:09:59 2015
+------------------------------------------------------+
| NVIDIA-SMI 346.46     Driver Version: 346.46         |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|   0  Tesla K40m          Off  | 0000:82:00.0     Off |                    0 |
| N/A   34C    P0    62W / 235W |    139MiB / 11519MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|    0     17500    C   mdrun_mpi                                       82MiB |
+-----------------------------------------------------------------------------+
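
The pattern in the two snapshots above is that GPU utilization falls from 72% to 0% while PID 17500 (mdrun_mpi) still holds its GPU memory, i.e. the job is still attached but no longer doing GPU work. A minimal sketch of how logged samples could be checked for this pattern follows. The Snapshot type, the find_stalls helper, and the 10% "busy" threshold are assumptions made for illustration; they are not part of GROMACS or nvidia-smi.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    time: str
    temp_c: int
    power_w: int
    gpu_util_pct: int
    pids: frozenset  # PIDs listed in the nvidia-smi "Processes" table


def find_stalls(snapshots, busy_pct=10):
    """Return (before, after) pairs where utilization fell to 0% while at
    least one process from the earlier sample was still on the GPU.

    busy_pct is an assumed threshold for "the GPU was doing real work".
    """
    stalls = []
    for before, after in zip(snapshots, snapshots[1:]):
        was_busy = before.gpu_util_pct >= busy_pct
        now_idle = after.gpu_util_pct == 0
        same_job = bool(before.pids & after.pids)
        if was_busy and now_idle and same_job:
            stalls.append((before, after))
    return stalls


# The two samples reported in this message.
samples = [
    Snapshot("Tue Sep 29 11:59:59 2015", 41, 110, 72, frozenset({17500})),
    Snapshot("Tue Sep 29 12:09:59 2015", 34, 62, 0, frozenset({17500})),
]

for before, after in find_stalls(samples):
    print(f"possible hang between {before.time} and {after.time}")
```

A job that simply finished would not be flagged, because its PID disappears from the process table in the later sample.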


On Wed, Sep 30, 2015 at 5:43 PM, Téletchéa Stéphane <stephane.teletchea at univ-nantes.fr> wrote:

> On 29/09/2015 23:40, M Teguh Satria wrote:
>
>> Is anyone else experiencing a similar problem? Is there any way to
>> troubleshoot or debug this to find the cause? I didn't get any warning or
>> error message.
>>
>
> Hello,
>
> This can be a driver issue (or a hardware one: think of temperature, dust, ...),
> and it happens to me from time to time.
>
> The only solution I found was to reset the GPU (see the nvidia-smi options).
> If that is not sufficient, you will have to reboot, using a cold boot: turn
> the computer off for more than 30 s, then boot again.
>
> If this happens too often, you may have a defective card; contact your
> vendor in that case.
>
> Best,
>
> Stéphane Téletchéa
>
> --
> Assistant Professor, UFIP, UMR 6286 CNRS, Team Protein Design In Silico
> UFR Sciences et Techniques, 2, rue de la Houssinière, Bât. 25, 44322
> Nantes cedex 03, France
> Tél : +33 251 125 636 / Fax : +33 251 125 632
> http://www.ufip.univ-nantes.fr/ - http://www.steletch.org
>



-- 
-----------------------------------------------------------------------------------
Regards,
*Teguh* <http://www.linkedin.com/in/mteguhsatria>

