[gmx-users] Gromacs 2019.4 - cudaStreamSynchronize failed issue
Chenou Zhang
czhan178 at asu.edu
Tue Dec 3 23:11:04 CET 2019
Hi,
I've run 30 tests with the -notunepme option. One of them failed with the
following error (still the same *cudaStreamSynchronize failed* error):
```
DD step 1422999 vol min/aver 0.639 load imb.: force 1.1% pme mesh/force 1.079
           Step           Time
        1423000     2846.00000

   Energies (kJ/mol)
           Bond            U-B    Proper Dih.  Improper Dih.      CMAP Dih.
    3.79755e+04    1.78943e+05    1.22798e+05    2.83835e+03   -9.19303e+02
          LJ-14     Coulomb-14        LJ (SR)   Coulomb (SR)   Coul. recip.
    2.56547e+04    5.11714e+05    9.77218e+03   -2.07148e+06    8.64504e+03
      Potential    Kinetic En.   Total Energy  Conserved En.    Temperature
    7.64126e+13    4.79398e+05    7.64126e+13    7.64126e+13    3.58009e+02
 Pressure (bar)   Constr. rmsd
   -6.03201e+01    4.56399e-06

-------------------------------------------------------
Program: gmx mdrun, version 2019.4
Source file: src/gromacs/gpu_utils/cudautils.cuh (line 229)
MPI rank: 2 (out of 8)
Fatal error:
cudaStreamSynchronize failed: an illegal memory access was encountered
For more information and tips for troubleshooting, please check the GROMACS
website at http://www.gromacs.org/Documentation/Errors
-------------------------------------------------------
```
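For triaging a batch of runs like these 30 tests, a minimal sketch of a log scanner may help. It only assumes the fatal-error block has the layout quoted above ("MPI rank: N (out of M)" followed by "Fatal error:" and a one-line message). The function names and the idea of passing logs as a name-to-text dict are illustrative choices, not part of GROMACS.

```python
import re

# Matches the fatal-error block that mdrun prints, as quoted above.
# \s* between the pieces also tolerates blank lines inside the block.
FATAL_RE = re.compile(
    r"MPI rank:\s*(?P<rank>\d+)\s*\(out of\s*(?P<nranks>\d+)\)\s*"
    r"Fatal error:\s*(?P<message>[^\n]+)"
)


def find_fatal_errors(log_text):
    """Return a list of (rank, nranks, message) tuples found in log_text."""
    return [
        (int(m.group("rank")), int(m.group("nranks")), m.group("message").strip())
        for m in FATAL_RE.finditer(log_text)
    ]


def summarize(logs):
    """Map each log name to its fatal errors, skipping runs without any.

    `logs` is a hypothetical dict of {run name: log text}; how the text
    is read from disk is left to the caller.
    """
    summary = {}
    for name, text in logs.items():
        errors = find_fatal_errors(text)
        if errors:
            summary[name] = errors
    return summary
```

Collecting the failing rank for each run this way makes it easier to see whether the failures cluster on ranks that share one GPU.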
Here is the command and the driver info:
```
Command line:
gmx mdrun -v -s md_seed_fixed.tpr -deffnm md_seed_fixed -ntmpi 8 -pin on
-nb gpu -ntomp 3 -pme gpu -pmefft gpu -notunepme -npme 1 -gputasks 00112233
-maxh 2 -cpt 60 -cpi -noappend
GROMACS version: 2019.4
Precision: single
Memory model: 64 bit
MPI library: thread_mpi
OpenMP support: enabled (GMX_OPENMP_MAX_THREADS = 64)
GPU support: CUDA
SIMD instructions: AVX2_256
FFT library: fftw-3.3.8-sse2-avx-avx2-avx2_128-avx512
RDTSCP usage: enabled
TNG support: enabled
Hwloc support: hwloc-1.11.2
Tracing support: disabled
C compiler: /packages/7x/gcc/gcc-7.3.0/bin/gcc GNU 7.3.0
C compiler flags: -mavx2 -mfma -O3 -DNDEBUG -funroll-all-loops
-fexcess-precision=fast
C++ compiler: /packages/7x/gcc/gcc-7.3.0/bin/g++ GNU 7.3.0
C++ compiler flags: -mavx2 -mfma -std=c++11 -O3 -DNDEBUG
-funroll-all-loops -fexcess-precision=fast
CUDA compiler: /packages/7x/cuda/9.2.88.1/bin/nvcc nvcc: NVIDIA (R)
Cuda compiler driver;Copyright (c) 2005-2018 NVIDIA Corporation;Built on
Wed_Apr_11_23:16:29_CDT_2018;Cuda compilation tools, release 9.2, V9.2.88
CUDA compiler
flags:-gencode;arch=compute_30,code=sm_30;-gencode;arch=compute_35,code=sm_35;-gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_52,code=sm_52;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_70,code=compute_70;-use_fast_math;;;
;-mavx2;-mfma;-std=c++11;-O3;-DNDEBUG;-funroll-all-loops;-fexcess-precision=fast;
CUDA driver: 9.20
CUDA runtime: 9.20
Running on 1 node with total 24 cores, 24 logical cores, 4 compatible GPUs
Hardware detected:
CPU info:
Vendor: Intel
Brand: Intel(R) Xeon(R) CPU E5-2687W v4 @ 3.00GHz
Family: 6 Model: 79 Stepping: 1
Features: aes apic avx avx2 clfsh cmov cx8 cx16 f16c fma hle htt intel
lahf mmx msr nonstop_tsc pcid pclmuldq pdcm pdpe1gb popcnt pse rdrnd rdtscp
rtm sse2 sse3 sse4.1 sse4.2 ssse3 tdt x2apic
Hardware topology: Full, with devices
Sockets, cores, and logical processors:
Socket 0: [ 0] [ 1] [ 2] [ 3] [ 4] [ 5] [ 6] [ 7] [ 8] [ 9] [ 10] [ 11]
Socket 1: [ 12] [ 13] [ 14] [ 15] [ 16] [ 17] [ 18] [ 19] [ 20] [ 21] [ 22] [ 23]
Numa nodes:
Node 0 (34229563392 bytes mem): 0 1 2 3 4 5 6 7 8 9 10 11
Node 1 (34359738368 bytes mem): 12 13 14 15 16 17 18 19 20 21 22 23
Latency:
0 1
0 1.00 2.10
1 2.10 1.00
Caches:
L1: 32768 bytes, linesize 64 bytes, assoc. 8, shared 1 ways
L2: 262144 bytes, linesize 64 bytes, assoc. 8, shared 1 ways
L3: 31457280 bytes, linesize 64 bytes, assoc. 20, shared 12 ways
PCI devices:
0000:01:00.0 Id: 15b3:1007 Class: 0x0200 Numa: 0
0000:02:00.0 Id: 10de:1b06 Class: 0x0300 Numa: 0
0000:03:00.0 Id: 10de:1b06 Class: 0x0300 Numa: 0
0000:00:11.4 Id: 8086:8d62 Class: 0x0106 Numa: 0
0000:06:00.0 Id: 1a03:2000 Class: 0x0300 Numa: 0
0000:00:1f.2 Id: 8086:8d02 Class: 0x0106 Numa: 0
0000:81:00.0 Id: 8086:1521 Class: 0x0200 Numa: 1
0000:81:00.1 Id: 8086:1521 Class: 0x0200 Numa: 1
0000:82:00.0 Id: 15b3:1007 Class: 0x0280 Numa: 1
0000:83:00.0 Id: 10de:1b06 Class: 0x0300 Numa: 1
0000:84:00.0 Id: 10de:1b06 Class: 0x0300 Numa: 1
GPU info:
Number of GPUs detected: 4
#0: NVIDIA GeForce GTX 1080 Ti, compute cap.: 6.1, ECC: no, stat: compatible
#1: NVIDIA GeForce GTX 1080 Ti, compute cap.: 6.1, ECC: no, stat: compatible
#2: NVIDIA GeForce GTX 1080 Ti, compute cap.: 6.1, ECC: no, stat: compatible
#3: NVIDIA GeForce GTX 1080 Ti, compute cap.: 6.1, ECC: no, stat: compatible
```
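As a cross-check of the launch options against the detected hardware, here is a small sketch. It assumes that each of the 8 thread-MPI ranks carries exactly one GPU task (nonbonded on the PP ranks, PME on the single PME rank), so that each digit of -gputasks is the GPU ID for one rank. The function and its return keys are hypothetical helpers for illustration only.

```python
def check_layout(ntmpi, ntomp, npme, gputasks, n_cores, n_gpus):
    """Sanity-check an mdrun launch layout on a single node.

    Assumption: one GPU task per rank, so the number of digits in
    `gputasks` should equal `ntmpi`. Each digit is read as one GPU ID.
    """
    threads = ntmpi * ntomp
    gpu_ids = [int(ch) for ch in gputasks]

    # Count how many GPU tasks land on each device.
    tasks_per_gpu = {}
    for gpu_id in gpu_ids:
        tasks_per_gpu[gpu_id] = tasks_per_gpu.get(gpu_id, 0) + 1

    return {
        "threads": threads,
        "oversubscribed": threads > n_cores,
        "pp_ranks": ntmpi - npme,
        "gpu_tasks_match_ranks": len(gpu_ids) == ntmpi,
        "gpu_ids_valid": all(0 <= g < n_gpus for g in gpu_ids),
        "tasks_per_gpu": tasks_per_gpu,
    }


# Values taken from the command line and hardware report above.
layout = check_layout(
    ntmpi=8, ntomp=3, npme=1, gputasks="00112233", n_cores=24, n_gpus=4
)
```

Under that assumption the layout uses 8 x 3 = 24 threads on 24 cores and places two GPU tasks on each of the four GPUs, so the options themselves look consistent with the node.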
Note that the simulation ran for about 2.8 ns (2846 ps) before failing, and the
last energy output shows an abnormally high potential energy
(7.64126e+13 kJ/mol).
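To make the size of that anomaly concrete, the sketch below sums the individual terms printed in the energy block and compares the result with the reported Potential. The values are copied from the log excerpt. The helper names and the relative tolerance are illustrative assumptions, and the excerpt may not list every contribution, so a gap is only a hint that something outside the listed terms has blown up.

```python
# Energy terms copied from the log excerpt above (kJ/mol).
terms = {
    "Bond": 3.79755e+04,
    "U-B": 1.78943e+05,
    "Proper Dih.": 1.22798e+05,
    "Improper Dih.": 2.83835e+03,
    "CMAP Dih.": -9.19303e+02,
    "LJ-14": 2.56547e+04,
    "Coulomb-14": 5.11714e+05,
    "LJ (SR)": 9.77218e+03,
    "Coulomb (SR)": -2.07148e+06,
    "Coul. recip.": 8.64504e+03,
}
reported_potential = 7.64126e+13


def potential_gap(terms, reported):
    """Return (sum of listed terms, reported minus that sum)."""
    listed = sum(terms.values())
    return listed, reported - listed


def looks_blown_up(terms, reported, rel_tol=1e-3):
    """True if the reported potential is not explained by the listed terms.

    rel_tol is an arbitrary threshold chosen for illustration.
    """
    listed, gap = potential_gap(terms, reported)
    scale = max(abs(listed), abs(reported))
    return abs(gap) > rel_tol * scale


listed_sum, gap = potential_gap(terms, reported_potential)
```

The listed terms add up to roughly -1.17e+06 kJ/mol, so nearly all of the reported 7.64126e+13 kJ/mol is unaccounted for by them.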
On Mon, Dec 2, 2019 at 2:13 PM Mark Abraham <mark.j.abraham at gmail.com>
wrote:
> Hi,
>
> What driver version is reported in the respective log files? Does the error
> persist if mdrun -notunepme is used?
>
> Mark
>
> On Mon., 2 Dec. 2019, 21:18 Chenou Zhang, <czhan178 at asu.edu> wrote:
>
> > Hi Gromacs developers,
> >
> > I'm currently running GROMACS 2019.4 on our university's HPC cluster. To
> > fully utilize the GPU nodes, I followed the notes at
> > http://manual.gromacs.org/documentation/current/user-guide/mdrun-performance.html.
> >
> > Here is the command I used for my runs:
> > ```
> > gmx mdrun -v -s $TPR -deffnm md_seed_fixed -ntmpi 8 -pin on -nb gpu \
> >   -ntomp 3 -pme gpu -pmefft gpu -npme 1 -gputasks 00112233 -maxh $HOURS \
> >   -cpt 60 -cpi -noappend
> > ```
> >
> > Some of those runs fail with the following error:
> > ```
> > -------------------------------------------------------
> > Program: gmx mdrun, version 2019.4
> > Source file: src/gromacs/gpu_utils/cudautils.cuh (line 229)
> > MPI rank: 3 (out of 8)
> >
> > Fatal error:
> > cudaStreamSynchronize failed: an illegal memory access was encountered
> >
> > For more information and tips for troubleshooting, please check the GROMACS
> > website at http://www.gromacs.org/Documentation/Errors
> > ```
> >
> > We also got a different error from the Slurm system:
> > ```
> > step 4400: timed with pme grid 96 96 60, coulomb cutoff 1.446: 467.9 M-cycles
> > step 4600: timed with pme grid 96 96 64, coulomb cutoff 1.372: 451.4 M-cycles
> > /var/spool/slurmd/job2321134/slurm_script: line 44: 29866 Segmentation fault gmx mdrun -v -s $TPR -deffnm md_seed_fixed -ntmpi 8 -pin on -nb gpu -ntomp 3 -pme gpu -pmefft gpu -npme 1 -gputasks 00112233 -maxh $HOURS -cpt 60 -cpi -noappend
> > ```
> >
> > We first thought this could be due to a compiler issue and tried the
> > following build settings:
> > ===test1===
> > <source>
> > module load cuda/9.2.88.1
> > module load gcc/7.3.0
> > . /home/rsexton2/Library/gromacs/2019.4/test1/bin/GMXRC
> > </source>
> > ===test2===
> > <source>
> > module load cuda/9.2.88.1
> > module load gcc/6x
> > . /home/rsexton2/Library/gromacs/2019.4/test2/bin/GMXRC
> > </source>
> > ===test3===
> > <source>
> > module load cuda/9.2.148
> > module load gcc/7.3.0
> > . /home/rsexton2/Library/gromacs/2019.4/test3/bin/GMXRC
> > </source>
> > ===test4===
> > <source>
> > module load cuda/9.2.148
> > module load gcc/6x
> > . /home/rsexton2/Library/gromacs/2019.4/test4/bin/GMXRC
> > </source>
> > ===test5===
> > <source>
> > module load cuda/9.1.85
> > module load gcc/6x
> > . /home/rsexton2/Library/gromacs/2019.4/test5/bin/GMXRC
> > </source>
> > ===test6===
> > <source>
> > module load cuda/9.0.176
> > module load gcc/6x
> > . /home/rsexton2/Library/gromacs/2019.4/test6/bin/GMXRC
> > </source>
> > ===test7===
> > <source>
> > module load cuda/9.2.88.1
> > module load gccgpu/7.4.0
> > . /home/rsexton2/Library/gromacs/2019.4/test7/bin/GMXRC
> > </source>
> >
> > However, we still ended up with the same errors shown above. Does anyone
> > know where the cudaStreamSynchronize failure comes from? Or am I using
> > the gmx GPU options incorrectly?
> >
> > Any input will be appreciated!
> >
> > Thanks!
> > Chenou
> > --
> > Gromacs Users mailing list
> >
> > * Please search the archive at
> > http://www.gromacs.org/Support/Mailing_Lists/GMX-Users_List before
> > posting!
> >
> > * Can't post? Read http://www.gromacs.org/Support/Mailing_Lists
> >
> > * For (un)subscribe requests visit
> > https://maillist.sys.kth.se/mailman/listinfo/gromacs.org_gmx-users or
> > send a mail to gmx-users-request at gromacs.org.
> >