[gmx-users] [gmx-developers] Fatal error: cudaStreamSynchronize failed in cu_blockwait_nb
AOWI (Anders Ossowicki)
AOWI at novozymes.com
Thu Jan 30 14:10:23 CET 2014
> Does the error happen at step? Assuming it does occur within the first 10 steps, here are a few things to try:
It happens immediately. As in:
$ time mdrun
<snip>
real 0m3.312s
user 0m6.768s
sys 0m1.968s
$
> - Run "cuda-memcheck mdrun -nsteps 10";
A wild backtrace appeared!
starting mdrun 'RNASE ZF-1A in water'
10 steps, 0.0 ps.
========= Program hit error 4 on CUDA API call to cudaStreamSynchronize
========= Saved host backtrace up to driver entry point at error
========= Host Frame:/usr/lib/nvidia-current/libcuda.so [0x26d660]
========= Host Frame:/usr/local/cuda-5.5/lib64/libcudart.so.5.5 (cudaStreamSynchronize + 0x15e) [0x36f5e]
========= Host Frame:/usr/bin/../lib/libmd.so.8 (nbnxn_cuda_wait_gpu + 0x9d9) [0x3bc779]
========= Host Frame:/usr/bin/../lib/libmd.so.8 (do_force_cutsVERLET + 0x1ff8) [0x275f98]
========= Host Frame:/usr/bin/../lib/libmd.so.8 (do_force + 0x3bf) [0x27a88f]
========= Host Frame:mdrun (do_md + 0x7fc7) [0x34267]
========= Host Frame:mdrun (mdrunner + 0x18a1) [0x11491]
========= Host Frame:mdrun (cmain + 0x1a30) [0x38cb0]
========= Host Frame:/lib/x86_64-linux-gnu/libc.so.6 (__libc_start_main + 0xed) [0x2176d]
========= Host Frame:mdrun [0x76a1]
=========
<snip the usual error>
> - Try running with the GMX_EMULATE_GPU env. var. set. This will run the GPU acceleration code path, but will use CPU kernels (equivalent to the CUDA implementation, but slow).
This seems to run correctly.
> - Run with GMX_EMULATE_GPU using valgrind: "GMX_EMULATE_GPU=1 valgrind mdrun -nsteps 10"
Valgrind dies immediately with
nztest@ubuntu:~/rnase_bench/rnase_cubic$ GMX_EMULATE_GPU=YesPlease valgrind mdrun -nsteps 10
==13510== Memcheck, a memory error detector
==13510== Copyright (C) 2002-2011, and GNU GPL'd, by Julian Seward et al.
==13510== Using Valgrind-3.7.0 and LibVEX; rerun with -h for copyright info
==13510== Command: mdrun -nsteps 10
==13510==
:-) G R O M A C S (-:
vex amd64->IR: unhandled instruction bytes: 0xC5 0xFA 0x2A 0xC2 0xC5 0xFA 0x59 0xD
==13510== valgrind: Unrecognised instruction at address 0x5b5ac9d.
==13510== at 0x5B5AC9D: rando (in /usr/lib/libgmx.so.8)
==13510== by 0x5BAB0A4: pukeit (in /usr/lib/libgmx.so.8)
==13510== by 0x5BAB420: bromacs (in /usr/lib/libgmx.so.8)
==13510== by 0x5BAB933: CopyRight (in /usr/lib/libgmx.so.8)
==13510== by 0x438E26: cmain (in /usr/bin/mdrun)
==13510== by 0x65D976C: (below main) (libc-start.c:226)
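
A note on the unhandled bytes above: they start with 0xC5, which matches a two-byte VEX prefix, i.e. an AVX-encoded instruction. That would suggest Valgrind 3.7.0 dies on an instruction encoding it cannot decode rather than on a memory error in GROMACS, though this is an inference from the bytes and not something the log itself states. A minimal sketch in Python that picks the prefix apart, assuming the standard two-byte VEX layout (the helper name decode_vex2 and the SIMD_PREFIX table are made up for illustration):

```python
# Hypothetical helper: decodes only the two-byte VEX form (leading 0xC5).
# Field layout of the second byte: [~R | ~vvvv (4 bits) | L | pp (2 bits)].

# pp field -> legacy SIMD prefix that the VEX encoding stands in for
SIMD_PREFIX = {0: None, 1: 0x66, 2: 0xF3, 3: 0xF2}


def decode_vex2(code: bytes) -> dict:
    """Decode a two-byte VEX prefix plus the opcode byte that follows it."""
    if len(code) < 3 or code[0] != 0xC5:
        raise ValueError("not a two-byte VEX prefix")
    p = code[1]
    return {
        "R": ((p >> 7) & 1) ^ 1,            # stored inverted
        "vvvv": ((p >> 3) & 0xF) ^ 0xF,     # stored inverted: extra source register
        "vector_bits": 256 if (p >> 2) & 1 else 128,
        "simd_prefix": SIMD_PREFIX[p & 0b11],
        "opcode": code[2],                  # opcode in the implied 0F map
    }


# The bytes Valgrind reported (the dump is cut off after the second opcode).
first = decode_vex2(bytes([0xC5, 0xFA, 0x2A, 0xC2]))
second = decode_vex2(bytes([0xC5, 0xFA, 0x59]))

for name, insn in (("first", first), ("second", second)):
    print(name, {k: (hex(v) if isinstance(v, int) else v) for k, v in insn.items()})
```

Both instructions decode to an F3 prefix with a 128-bit operand, and opcodes 0x2A and 0x59 in the 0F map correspond to cvtsi2ss and mulss, so these look like the AVX forms vcvtsi2ss and vmulss. If that reading is right, a newer Valgrind with AVX support or a GROMACS build without AVX acceleration would be needed before the valgrind run can say anything useful.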
--
Anders Ossowicki