[gmx-users] domain decomposition error >60 ns into simulation on a specific machine

Mark Abraham mark.j.abraham at gmail.com
Thu Feb 14 21:35:34 CET 2019


Hi,

What does the trajectory look like before it crashes?
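
One way to approach that question is to look for unusually large per-frame atomic displacements in the frames saved just before the crash. Below is a minimal Python sketch, assuming the coordinates have already been loaded into plain lists and the box is rectangular. The function name `max_step_displacement` and the data layout are hypothetical and not part of GROMACS. Saved frames are usually much sparser than domain decomposition steps, so this is only a rough indicator of where to look.

```python
import math

def max_step_displacement(frames, box):
    """Return (frame_index, atom_index, distance) for the largest jump.

    frames: list of frames, each a list of (x, y, z) tuples in nm
            (assumed layout, loaded elsewhere from the trajectory)
    box:    (lx, ly, lz) edge lengths of a rectangular box in nm
    """
    worst = (0, 0, 0.0)
    for f in range(1, len(frames)):
        for a, (prev, curr) in enumerate(zip(frames[f - 1], frames[f])):
            d2 = 0.0
            for p, c, length in zip(prev, curr, box):
                delta = c - p
                # Minimum-image convention: ignore wraps across the periodic box
                delta -= length * round(delta / length)
                d2 += delta * delta
            dist = math.sqrt(d2)
            if dist > worst[2]:
                worst = (f, a, dist)
    return worst
```

An atom that moves much farther than its neighbours between consecutive saved frames would be a natural place to start inspecting the structure visually.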

We recently fixed a bug that affects simulations using CHARMM switching
functions on GPUs, which might explain this. We will probably release a
new 2018 version with that fix next week (or so).

Mark

On Thu., 14 Feb. 2019, 20:26 Mala L Radhakrishnan, <mradhakr at wellesley.edu>
wrote:

> Hi all,
>
> My student is trying to run a fairly straightforward MD simulation: a
> protein complex in water with ions and *no* pull coordinate. It's on an
> NVIDIA GPU-based machine and we're running GROMACS 2018.3.
>
> About 65 ns into the simulation, it dies with:
>
> "an atom moved too far between two domain decomposition steps. This usually
> means that your system is not well equilibrated"
>
> If we restart at, say, 2 ns before it died, it then runs fine, PAST where
> it died before, for another ~63 ns or so, and then dies with the same
> error. We have had far larger and arguably more complex GROMACS jobs run
> fine on this same machine.
>
> Even stranger, when we run the same problematic job on a different NVIDIA
> GPU-based machine with slightly older CPUs that's running GROMACS 2016.4,
> it runs fine (it's currently at 200 ns).
>
> Below are the GROMACS hardware and compilation specs of the machine on
> which it died, in case that helps anyone. There is a note at the end of
> this logfile output that might be useful. Thanks in advance for any
> ideas.
> -----------------------------------------
>
> GROMACS version:    2018.3
> Precision:          single
> Memory model:       64 bit
> MPI library:        thread_mpi
> OpenMP support:     enabled (GMX_OPENMP_MAX_THREADS = 64)
> GPU support:        CUDA
> SIMD instructions:  AVX2_256
> FFT library:        fftw-3.3.8-sse2-avx-avx2-avx2_128
> RDTSCP usage:       enabled
> TNG support:        enabled
> Hwloc support:      disabled
> Tracing support:    disabled
> Built on:           2018-10-31 22:05:13
> Build OS/arch:      Linux 3.10.0-693.21.1.el7.x86_64 x86_64
> Build CPU vendor:   Intel
> Build CPU brand:    Intel(R) Xeon(R) Silver 4114 CPU @ 2.20GHz
> Build CPU family:   6   Model: 85   Stepping: 4
> Build CPU features: aes apic avx avx2 avx512f avx512cd avx512bw avx512vl clfsh cmov cx8 cx16 f16c fma hle htt intel lahf mmx msr nonstop_tsc pcid pclmuldq pdcm pdpe1gb popcnt pse rdrnd rdtscp rtm sse2 sse3 sse4.1 sse4.2 ssse3 tdt x2apic
> C compiler:         /usr/bin/cc GNU 4.8.5
> C compiler flags:    -march=core-avx2     -O3 -DNDEBUG -funroll-all-loops -fexcess-precision=fast
> C++ compiler:       /usr/bin/c++ GNU 4.8.5
> C++ compiler flags:  -march=core-avx2    -std=c++11   -O3 -DNDEBUG -funroll-all-loops -fexcess-precision=fast
> CUDA compiler:      /usr/local/cuda/bin/nvcc nvcc: NVIDIA (R) Cuda compiler driver;Copyright (c) 2005-2018 NVIDIA Corporation;Built on Sat_Aug_25_21:08:01_CDT_2018;Cuda compilation tools, release 10.0, V10.0.130
> CUDA compiler flags:-gencode;arch=compute_30,code=sm_30;-gencode;arch=compute_35,code=sm_35;-gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_52,code=sm_52;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_70,code=compute_70;-use_fast_math;;; ;-march=core-avx2;-std=c++11;-O3;-DNDEBUG;-funroll-all-loops;-fexcess-precision=fast;
> CUDA driver:        10.0
> CUDA runtime:       10.0
> Running on 1 node with total 20 cores, 40 logical cores, 4 compatible GPUs
> Hardware detected:
>   CPU info:
>     Vendor: Intel
>     Brand:  Intel(R) Xeon(R) Silver 4114 CPU @ 2.20GHz
>     Family: 6   Model: 85   Stepping: 4
>     Features: aes apic avx avx2 avx512f avx512cd avx512bw avx512vl clfsh cmov cx8 cx16 f16c fma hle htt intel lahf mmx msr nonstop_tsc pcid pclmuldq pdcm pdpe1gb popcnt pse rdrnd rdtscp rtm sse2 sse3 sse4.1 sse4.2 ssse3 tdt x2apic
>     Number of AVX-512 FMA units: Cannot run AVX-512 detection - assuming 2
>   Hardware topology: Basic
>     Sockets, cores, and logical processors:
>       Socket  0: [   0  20] [   1  21] [   2  22] [   3  23] [   4  24] [   5  25] [   6  26] [   7  27] [   8  28] [   9  29]
>       Socket  1: [  10  30] [  11  31] [  12  32] [  13  33] [  14  34] [  15  35] [  16  36] [  17  37] [  18  38] [  19  39]
>   GPU info:
>     Number of GPUs detected: 4
>     #0: NVIDIA GeForce GTX 1080 Ti, compute cap.: 6.1, ECC:  no, stat: compatible
>     #1: NVIDIA GeForce GTX 1080 Ti, compute cap.: 6.1, ECC:  no, stat: compatible
>     #2: NVIDIA GeForce GTX 1080 Ti, compute cap.: 6.1, ECC:  no, stat: compatible
>     #3: NVIDIA GeForce GTX 1080 Ti, compute cap.: 6.1, ECC:  no, stat: compatible
>
> Highest SIMD level requested by all nodes in run: AVX_512
> SIMD instructions selected at compile time:       AVX2_256
> This program was compiled for different hardware than you are running on,
> which could influence performance. This build might have been configured on a
> login node with only a single AVX-512 FMA unit (in which case AVX2 is faster),
> while the node you are running on has dual AVX-512 FMA units.
>
>
>
> --
> Mala L. Radhakrishnan
> Whitehead Associate Professor of Critical Thought
> Associate Professor of Chemistry
> Wellesley College
> 106 Central Street
> Wellesley, MA 02481
> (781)283-2981

