[gmx-users] Unexpected cudaStreamQuery failure: unspecified launch failure

Michael Brunsteiner mbx0009 at yahoo.com
Tue Nov 20 16:29:07 CET 2018


Hi,
GROMACS has lately started dying on me with rather obscure error messages
like the one in the subject of this mail. The errors seem to be related to the NVIDIA driver (see below for more output, and further below for the mdp file). I run a large number of short (2 ns) simulations, and this happens in perhaps one out of ten runs; it is non-reproducible and unpredictable.
Has anybody experienced anything like it? And is this a) a GROMACS issue, b) an NVIDIA driver issue, or c) a hardware issue?
Thanks for any help!
Michael

My system is a vanilla Debian stretch with cuda-toolkit version 9.1.85-4 and nvidia-driver 390.87-2, with unmodified GROMACS 2018.3.


What I see:

gmx mdrun dies and writes in log:
Source file: src/gromacs/gpu_utils/cudautils.cuh (line 298)
Fatal error:
Unexpected cudaStreamQuery failure: unspecified launch failure

At the same time I see in syslog:
Nov 20 09:34:56 rcpepc01797 kernel: [350452.906685] NVRM: Xid (PCI:0000:20:00): 69, Class Error: ChId 001a, Class 0000c1c0, Offset 000001b0, Data 00000041, ErrorCode 00000053

or:
gmx mdrun dies and writes in log:
Source file: src/gromacs/gpu_utils/cudautils.cuh (line 298)
Fatal error:
Unexpected cudaStreamQuery failure: unspecified launch failure

At the same time I see in syslog:
Nov 20 06:10:54 rcpepc01797 kernel: [338210.240084] NVRM: Xid (PCI:0000:20:00): 13, Graphics Exception:  EXTRA_INLINE_DATA
Nov 20 06:10:54 rcpepc01797 kernel: [338210.240088] NVRM: Xid (PCI:0000:20:00): 13, Graphics Exception: ESR 0x404600=0x80000001
Nov 20 06:10:54 rcpepc01797 kernel: [338210.240115] NVRM: Xid (PCI:0000:20:00): 13, Graphics Exception: ChID 001a, Class 0000c1c0, Offset 000001b4, Data 00000000

or:
gmx mdrun dies and writes in log:
Source file: src/gromacs/ewald/pme.cu (line 76)
Fatal error:
Failed to synchronize the PME GPU stream!: unspecified launch failure

At the same time I see in syslog:
Nov 17 03:16:15 rcpepc01797 kernel: [68519.703064] NVRM: GPU at PCI:0000:20:00: GPU-16aba4a6-68c1-44ab-47dd-7c7d06d2ddc5
Nov 17 03:16:15 rcpepc01797 kernel: [68519.703066] NVRM: GPU Board Serial Number:
Nov 17 03:16:15 rcpepc01797 kernel: [68519.703068] NVRM: Xid (PCI:0000:20:00): 13, Graphics Exception - INSTR_RAM_ACCESS_OUT_OF_BOUNDS
Nov 17 03:16:15 rcpepc01797 kernel: [68519.703072] NVRM: Xid (PCI:0000:20:00): 13, Graphics Exception: ESR 0x404490=0x80000020
Nov 17 03:16:15 rcpepc01797 kernel: [68519.703096] NVRM: Xid (PCI:0000:20:00): 13, Graphics Exception: ChID 001b, Class 0000c197, Offset 00000000, Data 00000000
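
To see which driver events accompany the crashes, the NVRM lines above can be tallied by Xid code. The following is a minimal Python sketch, assuming the kernel log format shown in the excerpts above. The function name `tally_xids` and the short labels in `XID_LABELS` are illustrative; the labels paraphrase NVIDIA's Xid documentation and should be checked against it.

```python
import re
from collections import Counter

# Matches kernel lines such as:
#   NVRM: Xid (PCI:0000:20:00): 13, Graphics Exception: ...
XID_RE = re.compile(r"NVRM: Xid \((PCI:[0-9a-fA-F:.]+)\): (\d+),")

# Short labels paraphrased from NVIDIA's Xid documentation (assumption,
# verify against the documentation for the installed driver version).
XID_LABELS = {
    13: "Graphics Engine Exception",
    69: "Graphics Engine Class Error",
}


def tally_xids(lines):
    """Count Xid events per (PCI address, Xid code) in an iterable of log lines."""
    counts = Counter()
    for line in lines:
        match = XID_RE.search(line)
        if match:
            counts[(match.group(1), int(match.group(2)))] += 1
    return counts


# Sample lines copied from the syslog excerpts in this report.
sample = [
    "Nov 20 09:34:56 rcpepc01797 kernel: [350452.906685] NVRM: Xid (PCI:0000:20:00): 69, Class Error: ChId 001a, Class 0000c1c0, Offset 000001b0, Data 00000041, ErrorCode 00000053",
    "Nov 20 06:10:54 rcpepc01797 kernel: [338210.240084] NVRM: Xid (PCI:0000:20:00): 13, Graphics Exception:  EXTRA_INLINE_DATA",
    "Nov 20 06:10:54 rcpepc01797 kernel: [338210.240088] NVRM: Xid (PCI:0000:20:00): 13, Graphics Exception: ESR 0x404600=0x80000001",
    "Nov 20 06:10:54 rcpepc01797 kernel: [338210.240115] NVRM: Xid (PCI:0000:20:00): 13, Graphics Exception: ChID 001a, Class 0000c1c0, Offset 000001b4, Data 00000000",
    "Nov 17 03:16:15 rcpepc01797 kernel: [68519.703066] NVRM: GPU Board Serial Number:",
]

counts = tally_xids(sample)
for (pci, xid), n in sorted(counts.items()):
    label = XID_LABELS.get(xid, "unknown")
    print(f"{pci}  Xid {xid:3d}  x{n}  {label}")
```

In practice the lines would come from the real log, for example `tally_xids(open("/var/log/syslog"))`; the path is an assumption about where the distribution writes kernel messages.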


mdp file:
integrator               = md
dt                       = 0.002
nsteps                   = 1000000
comm-grps                = System
;
nstxout                  = 50
nstvout                  = 0
nstfout                  = 0
nstlog                   = 50
nstenergy                = 50
;
nstlist                  = 50
ns_type                  = grid
pbc                      = xyz
rlist                    = 1.2
cutoff-scheme            = Verlet
;
coulombtype              = PME
rcoulomb                 = 1.2
vdw_type                 = cut-off 
rvdw                     = 1.2
;
constraints              = h-bonds
;
tcoupl                   = v-rescale
tau-t                    = 0.1
ref-t                    = 300.0
tc-grps                  = System
;
acc-grps                 = api pol
accelerate               = 0 0.5 0 0 -0.5 0
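
As a sanity check on the settings above, the run length and output volume follow directly from `dt`, `nsteps` and `nstxout`. The following is a minimal Python sketch, assuming `dt` is in picoseconds and that a frame is also written at step 0. The helper `parse_mdp` is a hypothetical stand-in, not a GROMACS tool, and the embedded text repeats only the relevant keys of the mdp file.

```python
def parse_mdp(text):
    """Parse 'key = value' lines; ';' starts a comment.

    Keys are lower-cased and '_' is mapped to '-', since GROMACS
    accepts both spellings in mdp option names.
    """
    params = {}
    for raw in text.splitlines():
        line = raw.split(";", 1)[0].strip()
        if "=" not in line:
            continue
        key, value = line.split("=", 1)
        params[key.strip().lower().replace("_", "-")] = value.strip()
    return params


# Relevant subset of the mdp file from this report.
MDP = """
integrator               = md
dt                       = 0.002
nsteps                   = 1000000
;
nstxout                  = 50
nstlog                   = 50
nstenergy                = 50
"""

params = parse_mdp(MDP)
dt_ps = float(params["dt"])            # time step in ps
nsteps = int(params["nsteps"])
length_ns = dt_ps * nsteps / 1000.0    # ps -> ns
# Assumes one frame every nstxout steps plus the initial frame at step 0.
frames = nsteps // int(params["nstxout"]) + 1

print(f"simulation length: {length_ns:g} ns")
print(f"coordinate frames written: {frames}")
```

For these values the length comes out at 2 ns, matching the "short (2 ns)" runs described at the top, and the coordinate output is on the order of 20,000 frames per run.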





