[gmx-users] [gmx-developers] Fatal error: cudaStreamSynchronize failed in cu_blockwait_nb

Thu Jan 30 15:51:08 CET 2014

On Thu, Jan 30, 2014 at 2:10 PM, AOWI (Anders Ossowicki)
<AOWI at novozymes.com> wrote:
>> Does the error happen at step? Assuming it does occur within the first 10 steps, here are a few things to try:
>
> It happens immediately. As in:
>
> $ time mdrun
> <snip>
> real    0m3.312s
> user    0m6.768s
> sys     0m1.968s
> $

Well, with a 24k-atom system a single iteration can be done in 2-3 ms, so
those 3.3 seconds are mostly initialization plus some number of steps -
could be one, ten, or even a hundred.
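
To put rough numbers on that, here is a minimal sketch that multiplies the
2-3 ms per-step estimate by a few step counts. The function name estimate_ms
is made up for illustration, and the per-step cost is only the ballpark
figure quoted above, not a measurement:

```shell
# Rough estimate of wall time spent in MD steps.
# $1 = number of steps, $2 = assumed cost per step in milliseconds.
estimate_ms() {
    echo $(( $1 * $2 ))
}

# Lower bound uses 2 ms/step, upper bound uses 3 ms/step.
for n in 1 10 100; do
    echo "$n steps: $(estimate_ms "$n" 2)-$(estimate_ms "$n" 3) ms"
done
```

Even 100 steps would account for at most about 0.3 s of the 3.3 s wall time,
so the timing alone cannot tell how many steps completed before the crash.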

>> - Run "cuda-memcheck mdrun -nsteps 10";
>
> A wild backtrace appeared!
>
> starting mdrun 'RNASE ZF-1A in water'
> 10 steps,      0.0 ps.
> ========= Program hit error 4 on CUDA API call to cudaStreamSynchronize
> =========     Saved host backtrace up to driver entry point at error
> =========     Host Frame:/usr/lib/nvidia-current/libcuda.so [0x26d660]
> =========     Host Frame:/usr/local/cuda-5.5/lib64/libcudart.so.5.5 (cudaStreamSynchronize + 0x15e) [0x36f5e]
> =========     Host Frame:/usr/bin/../lib/libmd.so.8 (nbnxn_cuda_wait_gpu + 0x9d9) [0x3bc779]
> =========     Host Frame:/usr/bin/../lib/libmd.so.8 (do_force_cutsVERLET + 0x1ff8) [0x275f98]
> =========     Host Frame:/usr/bin/../lib/libmd.so.8 (do_force + 0x3bf) [0x27a88f]
> =========     Host Frame:mdrun (do_md + 0x7fc7) [0x34267]
> =========     Host Frame:mdrun (mdrunner + 0x18a1) [0x11491]
> =========     Host Frame:mdrun (cmain + 0x1a30) [0x38cb0]
> =========     Host Frame:/lib/x86_64-linux-gnu/libc.so.6 (__libc_start_main + 0xed) [0x2176d]
> =========     Host Frame:mdrun [0x76a1]
> =========
> <snip the usual error>

That doesn't tell us much. Could you add -g to the CXX flags?
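
A minimal sketch of one way to do that, assuming a CMake-based build:
CMAKE_C_FLAGS and CMAKE_CXX_FLAGS are standard CMake variables, while the
source path below is a placeholder and the rest of your configure options are
not known to me. The snippet only prints the command so it can be checked
before it is run:

```shell
# Placeholder: replace with the path to your GROMACS source tree.
GMX_SRC="/path/to/gromacs-source"

# -g asks the host compiler to emit debug symbols so that backtrace
# frames can be resolved to source locations.
DEBUG_FLAGS="-g"

# The C flags are set too, on the assumption that some frames come from C sources.
CONFIGURE_CMD="cmake $GMX_SRC -DCMAKE_C_FLAGS=$DEBUG_FLAGS -DCMAKE_CXX_FLAGS=$DEBUG_FLAGS"

# Dry run: print the command. Append your usual options (GPU support etc.)
# before actually running it.
echo "$CONFIGURE_CMD"
```

After rebuilding and reinstalling, rerunning the same cuda-memcheck command
should give a backtrace with more usable frame information.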

>> - Try running with GMX_EMULATE_GPU env. var. set? This will run the GPU acceleration code-path, but will use CPU kernels (an implementation equivalent to the CUDA one, but slow).
>
> This seems to run correctly.

Does "correctly" mean that you've checked the results or just that it
completed without a crash?

>
>> - Run with GMX_EMULATE_GPU using valgrind: "GMX_EMULATE_GPU=1 valgrind mdrun -nsteps 10"
> Valgrind dies immediately with
>
> nztest at ubuntu:~/rnase_bench/rnase_cubic$ GMX_EMULATE_GPU=YesPlease valgrind mdrun -nsteps 10
> ==13510== Memcheck, a memory error detector
> ==13510== Copyright (C) 2002-2011, and GNU GPL'd, by Julian Seward et al.
> ==13510== Using Valgrind-3.7.0 and LibVEX; rerun with -h for copyright info
> ==13510== Command: mdrun -nsteps 10
> ==13510==
>                          :-)  G  R  O  M  A  C  S  (-:
>
> vex amd64->IR: unhandled instruction bytes: 0xC5 0xFA 0x2A 0xC2 0xC5 0xFA 0x59 0xD
> ==13510== valgrind: Unrecognised instruction at address 0x5b5ac9d.
> ==13510==    at 0x5B5AC9D: rando (in /usr/lib/libgmx.so.8)
> ==13510==    by 0x5BAB0A4: pukeit (in /usr/lib/libgmx.so.8)
> ==13510==    by 0x5BAB420: bromacs (in /usr/lib/libgmx.so.8)
> ==13510==    by 0x5BAB933: CopyRight (in /usr/lib/libgmx.so.8)
> ==13510==    by 0x438E26: cmain (in /usr/bin/mdrun)
> ==13510==    by 0x65D976C: (below main) (libc-start.c:226)

Yeah, your valgrind does not support VEX-encoded instructions (=AVX). Use
SSE4.1 on the CPU, and AFAIK you may also need to set
GMX_DISTRIBUTABLE_BINARY=ON. However, I do not expect this to shed
more light on the issue.
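
For what it's worth, the AVX diagnosis can be read straight off the valgrind
message: in 64-bit mode a first instruction byte of 0xC4 or 0xC5 is a VEX
prefix, which is the encoding AVX instructions use. A small sketch of that
check, where classify_first_byte is a hypothetical helper name and the byte
string is copied from the output above:

```shell
# Classify the leading byte of an x86-64 instruction.
# In 64-bit mode 0xC5 and 0xC4 are the 2-byte and 3-byte VEX prefixes.
classify_first_byte() {
    case "$1" in
        0xC5|0xc5) echo "2-byte VEX prefix" ;;
        0xC4|0xc4) echo "3-byte VEX prefix" ;;
        *)         echo "not a VEX prefix" ;;
    esac
}

# Bytes reported by valgrind as "unhandled instruction bytes".
bytes="0xC5 0xFA 0x2A 0xC2 0xC5 0xFA 0x59 0xD"

# Take everything up to the first space, i.e. the first byte.
first=${bytes%% *}
classify_first_byte "$first"
```

This only looks at the prefix byte; it does not decode the instruction itself.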

> --
> Anders Ossowicki
>