[gmx-users] Gromacs 2019.2 on Power9 + Volta GPUs (building and running)
Alex
nedomacho at gmail.com
Thu May 9 22:00:28 CEST 2019
Okay, we're positively unable to run a GROMACS (2019.1) test on Power9. The
test procedure is simple, using slurm:
1. Request an interactive session: srun -N 1 -n 20 --pty
--partition=debug --time=1:00:00 --gres=gpu:1 bash
2. Load the CUDA module: module load cuda
3. Run the test batch. It starts with a CPU-only steepest-descent EM which,
despite the mdrun thread options, runs on a single thread (the full command
sequence is sketched below). Any help will be highly appreciated.
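For anyone trying to reproduce, the whole sequence boils down to the
following (partition and module names are obviously site-specific, so
treat this as a sketch):

    # request one node, one GPU, 20 tasks on our debug partition
    srun -N 1 -n 20 --pty --partition=debug --time=1:00:00 --gres=gpu:1 bash
    # load the CUDA toolkit module (name may differ on other sites)
    module load cuda
    # the EM step that misbehaves: 4 thread-MPI ranks x 4 OpenMP threads,
    # nonbonded and PME both forced onto the CPU
    gmx mdrun -pin on -pinstride 2 -ntomp 4 -ntmpi 4 -pme cpu -nb cpu \
        -s em.tpr -o traj.trr -g md.log -c after_em.pdb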
md.log below:
GROMACS: gmx mdrun, version 2019.1
Executable: /home/reida/ppc64le/stow/gromacs/bin/gmx
Data prefix: /home/reida/ppc64le/stow/gromacs
Working dir: /home/smolyan/gmx_test1
Process ID: 115831
Command line:
gmx mdrun -pin on -pinstride 2 -ntomp 4 -ntmpi 4 -pme cpu -nb cpu -s
em.tpr -o traj.trr -g md.log -c after_em.pdb
GROMACS version: 2019.1
Precision: single
Memory model: 64 bit
MPI library: thread_mpi
OpenMP support: enabled (GMX_OPENMP_MAX_THREADS = 64)
GPU support: CUDA
SIMD instructions: IBM_VSX
FFT library: fftw-3.3.8
RDTSCP usage: disabled
TNG support: enabled
Hwloc support: hwloc-1.11.8
Tracing support: disabled
C compiler: /opt/rh/devtoolset-7/root/usr/bin/cc GNU 7.3.1
C compiler flags: -mcpu=power9 -mtune=power9 -mvsx -O2 -DNDEBUG
-funroll-all-loops -fexcess-precision=fast
C++ compiler: /opt/rh/devtoolset-7/root/usr/bin/c++ GNU 7.3.1
C++ compiler flags: -mcpu=power9 -mtune=power9 -mvsx -std=c++11 -O2
-DNDEBUG -funroll-all-loops -fexcess-precision=fast
CUDA compiler: /usr/local/cuda-10.0/bin/nvcc nvcc: NVIDIA (R) Cuda
compiler driver;Copyright (c) 2005-2018 NVIDIA Corporation;Built on
Sat_Aug_25_21:10:00_CDT_2018;Cuda compilation tools, release 10.0, V10.0.130
CUDA compiler
flags:-gencode;arch=compute_30,code=sm_30;-gencode;arch=compute_35,code=sm_35;-gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_52,code=sm_52;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=compute_75;-use_fast_math;;;
-mcpu=power9;-mtune=power9;-mvsx;-std=c++11;-O2;-DNDEBUG;-funroll-all-loops;-fexcess-precision=fast;
CUDA driver: 10.10
CUDA runtime: 10.0
Running on 1 node with total 160 cores, 160 logical cores, 1 compatible GPU
Hardware detected:
CPU info:
Vendor: IBM
Brand: POWER9, altivec supported
Family: 0 Model: 0 Stepping: 0
Features: vmx vsx
Hardware topology: Only logical processor count
GPU info:
Number of GPUs detected: 1
#0: NVIDIA Tesla V100-SXM2-16GB, compute cap.: 7.0, ECC: yes, stat:
compatible
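(Aside: "Hardware topology: Only logical processor count" above suggests
mdrun could not read the full machine topology from hwloc inside the job.
For comparison, this is roughly how one can inspect what the allocation
actually exposes, using stock hwloc/Linux tools rather than anything
GROMACS-specific:

    # full topology as hwloc sees it inside the srun allocation
    lstopo-no-graphics --no-io
    # logical CPUs this shell is actually allowed to run on
    grep Cpus_allowed_list /proc/self/status

If that list shows only a single CPU, that would point at the job's CPU
binding rather than at mdrun itself.)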
++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
*SKIPPED*
Input Parameters:
integrator = steep
tinit = 0
dt = 0.001
nsteps = 50000
init-step = 0
simulation-part = 1
comm-mode = Linear
nstcomm = 100
bd-fric = 0
ld-seed = 1941752878
emtol = 100
emstep = 0.01
niter = 20
fcstep = 0
nstcgsteep = 1000
nbfgscorr = 10
rtpi = 0.05
nstxout = 0
nstvout = 0
nstfout = 0
nstlog = 1000
nstcalcenergy = 100
nstenergy = 1000
nstxout-compressed = 0
compressed-x-precision = 1000
cutoff-scheme = Verlet
nstlist = 1
ns-type = Grid
pbc = xyz
periodic-molecules = true
verlet-buffer-tolerance = 0.005
rlist = 1.2
coulombtype = PME
coulomb-modifier = Potential-shift
rcoulomb-switch = 0
rcoulomb = 1.2
epsilon-r = 1
epsilon-rf = inf
vdw-type = Cut-off
vdw-modifier = Potential-shift
rvdw-switch = 0
rvdw = 1.2
DispCorr = No
table-extension = 1
fourierspacing = 0.12
fourier-nx = 52
fourier-ny = 52
fourier-nz = 52
pme-order = 4
ewald-rtol = 1e-05
ewald-rtol-lj = 0.001
lj-pme-comb-rule = Geometric
ewald-geometry = 0
epsilon-surface = 0
tcoupl = No
nsttcouple = -1
nh-chain-length = 0
print-nose-hoover-chain-variables = false
pcoupl = No
pcoupltype = Isotropic
nstpcouple = -1
tau-p = 1
compressibility (3x3):
compressibility[ 0]={ 0.00000e+00, 0.00000e+00, 0.00000e+00}
compressibility[ 1]={ 0.00000e+00, 0.00000e+00, 0.00000e+00}
compressibility[ 2]={ 0.00000e+00, 0.00000e+00, 0.00000e+00}
ref-p (3x3):
ref-p[ 0]={ 0.00000e+00, 0.00000e+00, 0.00000e+00}
ref-p[ 1]={ 0.00000e+00, 0.00000e+00, 0.00000e+00}
ref-p[ 2]={ 0.00000e+00, 0.00000e+00, 0.00000e+00}
refcoord-scaling = No
posres-com (3):
posres-com[0]= 0.00000e+00
posres-com[1]= 0.00000e+00
posres-com[2]= 0.00000e+00
posres-comB (3):
posres-comB[0]= 0.00000e+00
posres-comB[1]= 0.00000e+00
posres-comB[2]= 0.00000e+00
QMMM = false
QMconstraints = 0
QMMMscheme = 0
MMChargeScaleFactor = 1
qm-opts:
ngQM = 0
constraint-algorithm = Lincs
continuation = false
Shake-SOR = false
shake-tol = 0.0001
lincs-order = 4
lincs-iter = 1
lincs-warnangle = 30
nwall = 0
wall-type = 9-3
wall-r-linpot = -1
wall-atomtype[0] = -1
wall-atomtype[1] = -1
wall-density[0] = 0
wall-density[1] = 0
wall-ewald-zfac = 3
pull = false
awh = false
rotation = false
interactiveMD = false
disre = No
disre-weighting = Conservative
disre-mixed = false
dr-fc = 1000
dr-tau = 0
nstdisreout = 100
orire-fc = 0
orire-tau = 0
nstorireout = 100
free-energy = no
cos-acceleration = 0
deform (3x3):
deform[ 0]={ 0.00000e+00, 0.00000e+00, 0.00000e+00}
deform[ 1]={ 0.00000e+00, 0.00000e+00, 0.00000e+00}
deform[ 2]={ 0.00000e+00, 0.00000e+00, 0.00000e+00}
simulated-tempering = false
swapcoords = no
userint1 = 0
userint2 = 0
userint3 = 0
userint4 = 0
userreal1 = 0
userreal2 = 0
userreal3 = 0
userreal4 = 0
applied-forces:
electric-field:
x:
E0 = 0
omega = 0
t0 = 0
sigma = 0
y:
E0 = 0
omega = 0
t0 = 0
sigma = 0
z:
E0 = 0
omega = 0
t0 = 0
sigma = 0
grpopts:
nrdf: 47805
ref-t: 0
tau-t: 0
annealing: No
annealing-npoints: 0
acc: 0 0 0
nfreeze: N N N
energygrp-flags[ 0]: 0
Initializing Domain Decomposition on 4 ranks
NOTE: disabling dynamic load balancing as it is only supported with
dynamics, not with integrator 'steep'.
Dynamic load balancing: auto
Using update groups, nr 10529, average size 2.5 atoms, max. radius 0.078 nm
Minimum cell size due to atom displacement: 0.000 nm
NOTE: Periodic molecules are present in this system. Because of this, the
domain decomposition algorithm cannot easily determine the minimum cell
size that it requires for treating bonded interactions. Instead, domain
decomposition will assume that half the non-bonded cut-off will be a
suitable lower bound.
Minimum cell size due to bonded interactions: 0.678 nm
Using 0 separate PME ranks, as there are too few total
ranks for efficient splitting
Optimizing the DD grid for 4 cells with a minimum initial size of 0.678 nm
The maximum allowed number of cells is: X 8 Y 8 Z 8
Domain decomposition grid 1 x 4 x 1, separate PME ranks 0
PME domain decomposition: 1 x 4 x 1
Domain decomposition rank 0, coordinates 0 0 0
The initial number of communication pulses is: Y 1
The initial domain decomposition cell size is: Y 1.50 nm
The maximum allowed distance for atom groups involved in interactions is:
non-bonded interactions 1.356 nm
two-body bonded interactions (-rdd) 1.356 nm
multi-body bonded interactions (-rdd) 1.356 nm
virtual site constructions (-rcon) 1.503 nm
Using 4 MPI threads
Using 4 OpenMP threads per tMPI thread
Overriding thread affinity set outside gmx mdrun
Pinning threads with a user-specified logical core stride of 2
NOTE: Thread affinity was not set.
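(Aside: the three lines above look like the heart of the problem: mdrun
claims to pin with a stride of 2 and then reports that thread affinity was
not set. It seems worth checking whether slurm has already confined the
shell to a subset of cores, which would also explain a single-threaded EM;
these are stock commands, not GROMACS ones:

    # CPU affinity mask of the current shell inside the allocation
    taskset -cp $$
    # what slurm believes it handed out
    echo "$SLURM_CPUS_ON_NODE" "$SLURM_JOB_CPUS_PER_NODE"
)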
System total charge: 0.000
Will do PME sum in reciprocal space for electrostatic interactions.
++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
U. Essmann, L. Perera, M. L. Berkowitz, T. Darden, H. Lee and L. G.
Pedersen
A smooth particle mesh Ewald method
J. Chem. Phys. 103 (1995) pp. 8577-8592
-------- -------- --- Thank You --- -------- --------
Using a Gaussian width (1/beta) of 0.384195 nm for Ewald
Potential shift: LJ r^-12: -1.122e-01 r^-6: -3.349e-01, Ewald -8.333e-06
Initialized non-bonded Ewald correction tables, spacing: 1.02e-03 size: 1176
Generated table with 1100 data points for 1-4 COUL.
Tabscale = 500 points/nm
Generated table with 1100 data points for 1-4 LJ6.
Tabscale = 500 points/nm
Generated table with 1100 data points for 1-4 LJ12.
Tabscale = 500 points/nm
Using SIMD 4x4 nonbonded short-range kernels
Using a 4x4 pair-list setup:
updated every 1 steps, buffer 0.000 nm, rlist 1.200 nm
Using geometric Lennard-Jones combination rule
++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
S. Miyamoto and P. A. Kollman
SETTLE: An Analytical Version of the SHAKE and RATTLE Algorithms for Rigid
Water Models
J. Comp. Chem. 13 (1992) pp. 952-962
-------- -------- --- Thank You --- -------- --------
Linking all bonded interactions to atoms
There are 5407 inter charge-group virtual sites,
will do an extra communication step for selected coordinates and forces
Note that activating steepest-descent energy minimization via the
integrator .mdp option and the command gmx mdrun may be available in a
different form in a future version of GROMACS, e.g. gmx minimize and an
.mdp option.
Initiating Steepest Descents
Atom distribution over 4 domains: av 6687 stddev 134 min 6515 max 6792
Started Steepest Descents on rank 0 Thu May 9 15:49:36 2019