[gmx-users] [gmx-developers] Fatal error: cudaStreamSynchronize failed in cu_blockwait_nb

AOWI (Anders Ossowicki) AOWI at novozymes.com
Thu Jan 30 14:10:23 CET 2014

> Does the error happen at step? Assuming it does occur within the first 10 steps, here are a few things to try:

It happens immediately. As in:

$ time mdrun
real    0m3.312s
user    0m6.768s
sys     0m1.968s

> - Run "cuda-memcheck mdrun -nsteps 10";

A wild backtrace appeared!

starting mdrun 'RNASE ZF-1A in water'
10 steps,      0.0 ps.
========= Program hit error 4 on CUDA API call to cudaStreamSynchronize
=========     Saved host backtrace up to driver entry point at error
=========     Host Frame:/usr/lib/nvidia-current/libcuda.so [0x26d660]
=========     Host Frame:/usr/local/cuda-5.5/lib64/libcudart.so.5.5 (cudaStreamSynchronize + 0x15e) [0x36f5e]
=========     Host Frame:/usr/bin/../lib/libmd.so.8 (nbnxn_cuda_wait_gpu + 0x9d9) [0x3bc779]
=========     Host Frame:/usr/bin/../lib/libmd.so.8 (do_force_cutsVERLET + 0x1ff8) [0x275f98]
=========     Host Frame:/usr/bin/../lib/libmd.so.8 (do_force + 0x3bf) [0x27a88f]
=========     Host Frame:mdrun (do_md + 0x7fc7) [0x34267]
=========     Host Frame:mdrun (mdrunner + 0x18a1) [0x11491]
=========     Host Frame:mdrun (cmain + 0x1a30) [0x38cb0]
=========     Host Frame:/lib/x86_64-linux-gnu/libc.so.6 (__libc_start_main + 0xed) [0x2176d]
=========     Host Frame:mdrun [0x76a1]
<snip the usual error>
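
For anyone who wants the complete output rather than a snipped version, a minimal sketch of a wrapper that keeps both stdout and stderr in a log file while preserving the exit status of the command. `run_logged` and the log file name are hypothetical helpers for illustration, not part of GROMACS or the CUDA toolkit:

```shell
# run_logged LOGFILE COMMAND [ARGS...]
# Hypothetical helper: runs COMMAND, stores stdout and stderr in LOGFILE,
# prints the log afterwards, and returns COMMAND's own exit status.
run_logged() {
    log=$1
    shift
    status=0
    "$@" > "$log" 2>&1 || status=$?
    cat "$log"
    return "$status"
}

# Example use with the command from this thread (log name is an assumption):
#   run_logged memcheck.log cuda-memcheck mdrun -nsteps 10
```

Unlike piping through `tee`, this keeps the exit status of the wrapped command, so a failing run is still reported as a failure.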

> - Try running with the GMX_EMULATE_GPU env. var. set. This will run the GPU acceleration code path, but will use CPU kernels (equivalent to the CUDA implementation, but slow).

This seems to run correctly. 

> - Run with GMX_EMULATE_GPU using valgrind: "GMX_EMULATE_GPU=1 valgrind mdrun -nsteps 10"

Valgrind dies immediately with:

nztest@ubuntu:~/rnase_bench/rnase_cubic$ GMX_EMULATE_GPU=YesPlease valgrind mdrun -nsteps 10
==13510== Memcheck, a memory error detector
==13510== Copyright (C) 2002-2011, and GNU GPL'd, by Julian Seward et al.
==13510== Using Valgrind-3.7.0 and LibVEX; rerun with -h for copyright info
==13510== Command: mdrun -nsteps 10
                         :-)  G  R  O  M  A  C  S  (-:

vex amd64->IR: unhandled instruction bytes: 0xC5 0xFA 0x2A 0xC2 0xC5 0xFA 0x59 0xD
==13510== valgrind: Unrecognised instruction at address 0x5b5ac9d.
==13510==    at 0x5B5AC9D: rando (in /usr/lib/libgmx.so.8)
==13510==    by 0x5BAB0A4: pukeit (in /usr/lib/libgmx.so.8)
==13510==    by 0x5BAB420: bromacs (in /usr/lib/libgmx.so.8)
==13510==    by 0x5BAB933: CopyRight (in /usr/lib/libgmx.so.8)
==13510==    by 0x438E26: cmain (in /usr/bin/mdrun)
==13510==    by 0x65D976C: (below main) (libc-start.c:226)
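
A possible reading of the failure above, offered as a sketch rather than a diagnosis: on amd64 the bytes 0xC4 and 0xC5 introduce a VEX prefix, which is how AVX-family instructions are encoded, so the instruction Valgrind gave up on looks like an AVX one. As far as I know, Valgrind only gained AVX support in 3.8, which would make 3.7.0 unable to run an AVX-enabled build. The helper name `is_vex_prefixed` is hypothetical; the byte string is copied from the log above:

```shell
# Hypothetical helper: succeeds if the given byte (e.g. 0xC5) is one of the
# two VEX prefix bytes used on amd64 (0xC4 = 3-byte form, 0xC5 = 2-byte form).
is_vex_prefixed() {
    case "$1" in
        0xC4|0xc4|0xC5|0xc5) return 0 ;;
        *) return 1 ;;
    esac
}

# Instruction bytes as reported by Valgrind in the log above.
bytes="0xC5 0xFA 0x2A 0xC2 0xC5 0xFA 0x59 0xD"
first=${bytes%% *}   # keep only the first byte

if is_vex_prefixed "$first"; then
    echo "first byte $first is a VEX prefix (AVX-family encoding)"
else
    echo "first byte $first is not a VEX prefix"
fi
```

If that reading is right, the Valgrind failure is unrelated to the CUDA error; a newer Valgrind or a build without AVX would be needed to get past the banner.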

Anders Ossowicki
