[gmx-users] domain decomposition error >60 ns into simulation on a specific machine
Mark Abraham
mark.j.abraham at gmail.com
Thu Feb 14 21:35:34 CET 2019
Hi,
What does the trajectory look like before it crashes?
We did recently fix a bug affecting simulations that use CHARMM switching
functions on GPUs, if that could be the explanation. We will probably put
out a new 2018 release with that fix next week (or so).
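As a first look (a rough sketch, assuming the run's output files are named md.* -- adjust
to your actual names), something like

  gmx check -f md.xtc
  gmx energy -f md.edr -o energy.xvg

can flag broken frames, or sudden jumps in temperature, pressure, or potential energy,
in the frames just before the crash.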
Mark
On Thu., 14 Feb. 2019, 20:26 Mala L Radhakrishnan, <mradhakr at wellesley.edu>
wrote:
> Hi all,
>
> My student is trying to do a fairly straightforward MD simulation -- a
> protein complex in water with ions, with *no* pull coordinate. It's on an
> NVIDIA GPU-based machine and we're running GROMACS 2018.3.
>
> About 65 ns into the simulation, it dies with:
>
> "an atom moved too far between two domain decomposition steps. This usually
> means that your system is not well equilibrated"
>
> If we restart at, say, 2 ns before it died, it then runs fine, PAST where
> it died before, for another ~63 ns or so, and then dies with the same
> error. We have had far larger and arguably more complex GROMACS jobs run
> fine on this same machine.
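>
> (For reference, a minimal sketch of the kind of restart we do, assuming default
> file names from -deffnm md and a checkpoint saved a couple of ns before the crash:
>
>   gmx mdrun -deffnm md -cpi md.cpt
>
> and the run then goes past the point where it previously died.)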
>
> Even stranger, when we run the same problematic job on a different NVIDIA
> GPU-based machine with slightly older CPUs that's running GROMACS 2016.4,
> it runs fine (it's currently at 200 ns).
>
> Below are the GROMACS hardware and compilation specs of the machine on
> which it died, in case that helps anyone -- there is a note at the end of
> this log file output that might be useful. Thanks in advance for any
> ideas.
> -----------------------------------------
>
> GROMACS version: 2018.3
> Precision: single
> Memory model: 64 bit
> MPI library: thread_mpi
> OpenMP support: enabled (GMX_OPENMP_MAX_THREADS = 64)
> GPU support: CUDA
> SIMD instructions: AVX2_256
> FFT library: fftw-3.3.8-sse2-avx-avx2-avx2_128
> RDTSCP usage: enabled
> TNG support: enabled
> Hwloc support: disabled
> Tracing support: disabled
> Built on: 2018-10-31 22:05:13
> Build OS/arch: Linux 3.10.0-693.21.1.el7.x86_64 x86_64
> Build CPU vendor: Intel
> Build CPU brand: Intel(R) Xeon(R) Silver 4114 CPU @ 2.20GHz
> Build CPU family: 6 Model: 85 Stepping: 4
> Build CPU features: aes apic avx avx2 avx512f avx512cd avx512bw avx512vl
> clfsh cmov cx8 cx16 f16c fma hle htt intel lahf mmx msr nonstop_tsc pcid
> pclmuldq pdcm pdpe1gb popcnt pse rdrnd rdtscp rtm sse2 sse3 sse4.1 sse4.2
> ssse3 tdt x2apic
> C compiler: /usr/bin/cc GNU 4.8.5
> C compiler flags: -march=core-avx2 -O3 -DNDEBUG -funroll-all-loops
> -fexcess-precision=fast
> C++ compiler: /usr/bin/c++ GNU 4.8.5
> C++ compiler flags: -march=core-avx2 -std=c++11 -O3 -DNDEBUG
> -funroll-all-loops -fexcess-precision=fast
> CUDA compiler: /usr/local/cuda/bin/nvcc nvcc: NVIDIA (R) Cuda compiler
> driver;Copyright (c) 2005-2018 NVIDIA Corporation;Built on
> Sat_Aug_25_21:08:01_CDT_2018;Cuda compilation tools, release 10.0, V10.0.130
> CUDA compiler flags: -gencode;arch=compute_30,code=sm_30;
> -gencode;arch=compute_35,code=sm_35;-gencode;arch=compute_37,code=sm_37;
> -gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_52,code=sm_52;
> -gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;
> -gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_70,code=compute_70;
> -use_fast_math;;;-march=core-avx2;-std=c++11;-O3;-DNDEBUG;-funroll-all-loops;
> -fexcess-precision=fast;
> CUDA driver: 10.0
> CUDA runtime: 10.0
> Running on 1 node with total 20 cores, 40 logical cores, 4 compatible GPUs
> Hardware detected:
> CPU info:
> Vendor: Intel
> Brand: Intel(R) Xeon(R) Silver 4114 CPU @ 2.20GHz
> Family: 6 Model: 85 Stepping: 4
> Features: aes apic avx avx2 avx512f avx512cd avx512bw avx512vl clfsh
> cmov cx8 cx16 f16c fma hle htt intel lahf mmx msr
> nonstop_tsc pcid pclmuldq pdcm pdpe1gb popcnt pse rdrnd rdtscp rtm sse2
> sse3 sse4.1 sse4.2 ssse3 tdt x2apic
> Number of AVX-512 FMA units: Cannot run AVX-512 detection - assuming 2
> Hardware topology: Basic
> Sockets, cores, and logical processors:
> Socket 0: [ 0 20] [ 1 21] [ 2 22] [ 3 23] [ 4 24] [ 5 25] [ 6 26] [ 7 27] [ 8 28] [ 9 29]
> Socket 1: [ 10 30] [ 11 31] [ 12 32] [ 13 33] [ 14 34] [ 15 35] [ 16 36] [ 17 37] [ 18 38] [ 19 39]
> GPU info:
> Number of GPUs detected: 4
> #0: NVIDIA GeForce GTX 1080 Ti, compute cap.: 6.1, ECC: no, stat: compatible
> #1: NVIDIA GeForce GTX 1080 Ti, compute cap.: 6.1, ECC: no, stat: compatible
> #2: NVIDIA GeForce GTX 1080 Ti, compute cap.: 6.1, ECC: no, stat: compatible
> #3: NVIDIA GeForce GTX 1080 Ti, compute cap.: 6.1, ECC: no, stat: compatible
>
> Highest SIMD level requested by all nodes in run: AVX_512
> SIMD instructions selected at compile time: AVX2_256
> This program was compiled for different hardware than you are running on,
> which could influence performance. This build might have been configured on a
> login node with only a single AVX-512 FMA unit (in which case AVX2 is faster),
> while the node you are running on has dual AVX-512 FMA units.
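>
> (If it's relevant: I gather this note is only about performance; a rebuild
> configured with something like
>
>   cmake .. -DGMX_SIMD=AVX_512 -DGMX_GPU=ON
>
> should match the hardware, though I assume it's unrelated to the crash.)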
>
>
>
> --
> Mala L. Radhakrishnan
> Whitehead Associate Professor of Critical Thought
> Associate Professor of Chemistry
> Wellesley College
> 106 Central Street
> Wellesley, MA 02481
> (781)283-2981