[gmx-users] [gmx-developers] Fatal error: cudaStreamSynchronize failed in cu_blockwait_nb

Wed Jan 29 22:29:43 CET 2014

Hi Anders,

This mail belongs to the users' list.

This type of error typically means that a CUDA kernel failed, either
because of a nasty bug in the code or because of a hardware error. The
dmesg message is suspicious and may hint at a hardware error (see
https://www.kubuntuforums.net/showthread.php?64133-kwin-crashes-repeatedly).

I would not make any assumptions though, but rather try a few things first:
- Does the card pass a memtest (sourceforge.net/projects/cudagpumemtest/)?
- Does the installation pass the regressiontests?
- Is the error reproducible with other inputs?
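
Alongside these checks, it is worth confirming whether the same Xid
code shows up in dmesg after each failed run. A minimal sketch,
assuming the NVRM line format quoted in your mail; the `line` variable
is a hypothetical stand-in for what `dmesg | grep 'NVRM: Xid'` would
return on your node:

```shell
# Stand-in for one line of `dmesg | grep 'NVRM: Xid'` output
# (copied from the original report; replace with live dmesg output).
line='NVRM: Xid (0000:42:00): 31, Ch 00000003, engmask 00000101, intr 10000000'

# Pull out the PCI address and the numeric Xid code.
pci=$(printf '%s\n' "$line" | sed -n 's/.*Xid (\([^)]*\)).*/\1/p')
xid=$(printf '%s\n' "$line" | sed -n 's/.*Xid ([^)]*): \([0-9]*\),.*/\1/p')

echo "GPU at $pci reported Xid $xid"
```

If the same code appears with different inputs, that points away from
a problem with one particular simulation.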

Also note that with the default invocation of mdrun you are attempting
to use all cores/hardware threads in your machine (I assume a
2x12-core IVB-E node with HT on). This requires a huge number of
OpenMP threads (48 according to your log), which will lead to pretty
bad performance in the CPU code. Typically a one-to-one CPU-to-GPU
ratio is decent, especially with fast Intel Xeons. Use only one socket
for your tests or, if you plan to use a single GPU per node, use at
most 1 thread/core, i.e. 24 threads in total.
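
The thread counts above work out as follows. This is only a sketch
under my assumption about your hardware (2 sockets, 12 cores each, HT
on); adjust the three variables if the node differs. It prints
candidate mdrun command lines rather than running them, using the
`-ntmpi`, `-ntomp` and `-pin` options of mdrun 4.6:

```shell
# Assumed topology (a guess from the log, not confirmed): 2 x 12 cores, HT on.
sockets=2
cores_per_socket=12
threads_per_core=2

# All hardware threads: what the default invocation ends up using.
hw_threads=$((sockets * cores_per_socket * threads_per_core))
# One OpenMP thread per physical core across both sockets.
per_core=$((sockets * cores_per_socket))
# One OpenMP thread per physical core on a single socket.
per_socket=$cores_per_socket

echo "default run:     $hw_threads OpenMP threads"
echo "one socket:      mdrun -ntmpi 1 -ntomp $per_socket -pin on"
echo "one thread/core: mdrun -ntmpi 1 -ntomp $per_core -pin on"
```

Starting with the one-socket variant keeps the test simple; the
24-thread variant is the upper bound I would try with a single GPU.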

Cheers,
--
Szilárd

On Wed, Jan 29, 2014 at 7:07 PM, AOWI (Anders Ossowicki)
<AOWI at novozymes.com> wrote:
> Hello,
>
> We are testing out Gromacs 4.6.5 with an Nvidia K20 card. We keep running into the error message below, no matter which setup we're trying. In the included case, it was the RNAse example from http://www.gromacs.org/GPU_acceleration. Furthermore, we get the following line in dmesg as well:
>
>    NVRM: GPU at 0000:42:00: GPU-d0b07804-027a-5a02-43bc-fd7dc9064637
>    NVRM: Xid (0000:42:00): 31, Ch 00000003, engmask 00000101, intr 10000000
>
> Are we just completely out of luck with this card, or have we done something wrong?
>
> We've built Gromacs from source against the cuda 5.5 libraries straight from Nvidia. The system is Ubuntu 12.04. Gromacs works fine when it's not using the GPU.
> The card identifies itself as NVIDIA Corporation GK110GL [Tesla K20m] (rev a1)
>
> This is what we've done to trigger the error:
> $ grompp -f pme_verlet.mdp -c conf.gro -p topol.top
> $ mdrun
>
> Here is the output from mdrun. The error message tells me absolutely nothing, so any advice on how to proceed with debugging this would be much appreciated.
>
> Reading file topol.tpr, VERSION 4.6.5 (single precision)
> Changing nstlist from 10 to 40, rlist from 0.9 to 0.996
>
> Using 1 MPI thread
> Using 48 OpenMP threads
>
> 1 GPU detected:
>   #0: NVIDIA Tesla K20m, compute cap.: 3.5, ECC: yes, stat: compatible
>
> 1 GPU auto-selected for this run.
> Mapping of GPU to the 1 PP rank in this node: #0
>
>
> Back Off! I just backed up ener.edr to ./#ener.edr.1#
> starting mdrun 'RNASE ZF-1A in water'
> 10000 steps,     20.0 ps.
>
> -------------------------------------------------------
> Program mdrun, VERSION 4.6.5
> Source code file: /home/nztest/src/gromacs-4.6.5/src/mdlib/nbnxn_cuda/nbnxn_cuda.cu, line: 591
>
> Fatal error:
> cudaStreamSynchronize failed in cu_blockwait_nb: unspecified launch failure
>
> For more information and tips for troubleshooting, please check the GROMACS
> website at http://www.gromacs.org/Documentation/Errors
> -------------------------------------------------------
>
> Thanks in advance!
>
> --
> Best Regards
> Anders Ossowicki