[gmx-users] Gromacs 2019.2 on Power9 + Volta GPUs (building and running)

Szilárd Páll pall.szilard at gmail.com
Thu May 9 22:51:26 CEST 2019

On Thu, May 9, 2019 at 10:01 PM Alex <nedomacho at gmail.com> wrote:

> Okay, we're positively unable to run a Gromacs (2019.1) test on Power9. The
> test procedure is simple, using slurm:
> 1. Request an interactive session:
>    srun -N 1 -n 20 --pty --partition=debug --time=1:00:00 --gres=gpu:1 bash
> 2. Load CUDA library: module load cuda
> 3. Run the test batch. This starts with a CPU-only static EM (energy
> minimization), which, despite the mdrun thread options, runs on a single
> thread. Any help will be highly appreciated.
>  md.log below:
> GROMACS:      gmx mdrun, version 2019.1
> Executable:   /home/reida/ppc64le/stow/gromacs/bin/gmx
> Data prefix:  /home/reida/ppc64le/stow/gromacs
> Working dir:  /home/smolyan/gmx_test1
> Process ID:   115831
> Command line:
>   gmx mdrun -pin on -pinstride 2 -ntomp 4 -ntmpi 4 -pme cpu -nb cpu -s
> em.tpr -o traj.trr -g md.log -c after_em.pdb
> GROMACS version:    2019.1
> Precision:          single
> Memory model:       64 bit
> MPI library:        thread_mpi
> OpenMP support:     enabled (GMX_OPENMP_MAX_THREADS = 64)
> GPU support:        CUDA
> SIMD instructions:  IBM_VSX
> FFT library:        fftw-3.3.8
> RDTSCP usage:       disabled
> TNG support:        enabled
> Hwloc support:      hwloc-1.11.8
> Tracing support:    disabled
> C compiler:         /opt/rh/devtoolset-7/root/usr/bin/cc GNU 7.3.1
> C compiler flags:   -mcpu=power9 -mtune=power9  -mvsx     -O2 -DNDEBUG
> -funroll-all-loops -fexcess-precision=fast
> C++ compiler:       /opt/rh/devtoolset-7/root/usr/bin/c++ GNU 7.3.1
> C++ compiler flags: -mcpu=power9 -mtune=power9  -mvsx    -std=c++11   -O2
> -DNDEBUG -funroll-all-loops -fexcess-precision=fast
> CUDA compiler:      /usr/local/cuda-10.0/bin/nvcc nvcc: NVIDIA (R) Cuda
> compiler driver;Copyright (c) 2005-2018 NVIDIA Corporation;Built on
> Sat_Aug_25_21:10:00_CDT_2018;Cuda compilation tools, release 10.0,
> V10.0.130
> CUDA compiler
> flags:-gencode;arch=compute_30,code=sm_30;-gencode;arch=compute_35,code=sm_35;-gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_52,code=sm_52;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=compute_75;-use_fast_math;;;
> -mcpu=power9;-mtune=power9;-mvsx;-std=c++11;-O2;-DNDEBUG;-funroll-all-loops;-fexcess-precision=fast;
> CUDA driver:        10.10
> CUDA runtime:       10.0
> Running on 1 node with total 160 cores, 160 logical cores, 1 compatible GPU
> Hardware detected:
>   CPU info:
>     Vendor: IBM
>     Brand:  POWER9, altivec supported
>     Family: 0   Model: 0   Stepping: 0
>     Features: vmx vsx
>   Hardware topology: Only logical processor count
>   GPU info:
>     Number of GPUs detected: 1
>     #0: NVIDIA Tesla V100-SXM2-16GB, compute cap.: 7.0, ECC: yes, stat:
> compatible
> Input Parameters:
>    integrator                     = steep
>    tinit                          = 0
>    dt                             = 0.001
>    nsteps                         = 50000
>    init-step                      = 0
>    simulation-part                = 1
>    comm-mode                      = Linear
>    nstcomm                        = 100
>    bd-fric                        = 0
>    ld-seed                        = 1941752878
>    emtol                          = 100
>    emstep                         = 0.01
>    niter                          = 20
>    fcstep                         = 0
>    nstcgsteep                     = 1000
>    nbfgscorr                      = 10
>    rtpi                           = 0.05
>    nstxout                        = 0
>    nstvout                        = 0
>    nstfout                        = 0
>    nstlog                         = 1000
>    nstcalcenergy                  = 100
>    nstenergy                      = 1000
>    nstxout-compressed             = 0
>    compressed-x-precision         = 1000
>    cutoff-scheme                  = Verlet
>    nstlist                        = 1
>    ns-type                        = Grid
>    pbc                            = xyz
>    periodic-molecules             = true
>    verlet-buffer-tolerance        = 0.005
>    rlist                          = 1.2
>    coulombtype                    = PME
>    coulomb-modifier               = Potential-shift
>    rcoulomb-switch                = 0
>    rcoulomb                       = 1.2
>    epsilon-r                      = 1
>    epsilon-rf                     = inf
>    vdw-type                       = Cut-off
>    vdw-modifier                   = Potential-shift
>    rvdw-switch                    = 0
>    rvdw                           = 1.2
>    DispCorr                       = No
>    table-extension                = 1
>    fourierspacing                 = 0.12
>    fourier-nx                     = 52
>    fourier-ny                     = 52
>    fourier-nz                     = 52
>    pme-order                      = 4
>    ewald-rtol                     = 1e-05
>    ewald-rtol-lj                  = 0.001
>    lj-pme-comb-rule               = Geometric
>    ewald-geometry                 = 0
>    epsilon-surface                = 0
>    tcoupl                         = No
>    nsttcouple                     = -1
>    nh-chain-length                = 0
>    print-nose-hoover-chain-variables = false
>    pcoupl                         = No
>    pcoupltype                     = Isotropic
>    nstpcouple                     = -1
>    tau-p                          = 1
>    compressibility (3x3):
>       compressibility[    0]={ 0.00000e+00,  0.00000e+00,  0.00000e+00}
>       compressibility[    1]={ 0.00000e+00,  0.00000e+00,  0.00000e+00}
>       compressibility[    2]={ 0.00000e+00,  0.00000e+00,  0.00000e+00}
>    ref-p (3x3):
>       ref-p[    0]={ 0.00000e+00,  0.00000e+00,  0.00000e+00}
>       ref-p[    1]={ 0.00000e+00,  0.00000e+00,  0.00000e+00}
>       ref-p[    2]={ 0.00000e+00,  0.00000e+00,  0.00000e+00}
>    refcoord-scaling               = No
>    posres-com (3):
>       posres-com[0]= 0.00000e+00
>       posres-com[1]= 0.00000e+00
>       posres-com[2]= 0.00000e+00
>    posres-comB (3):
>       posres-comB[0]= 0.00000e+00
>       posres-comB[1]= 0.00000e+00
>       posres-comB[2]= 0.00000e+00
>    QMMM                           = false
>    QMconstraints                  = 0
>    QMMMscheme                     = 0
>    MMChargeScaleFactor            = 1
> qm-opts:
>    ngQM                           = 0
>    constraint-algorithm           = Lincs
>    continuation                   = false
>    Shake-SOR                      = false
>    shake-tol                      = 0.0001
>    lincs-order                    = 4
>    lincs-iter                     = 1
>    lincs-warnangle                = 30
>    nwall                          = 0
>    wall-type                      = 9-3
>    wall-r-linpot                  = -1
>    wall-atomtype[0]               = -1
>    wall-atomtype[1]               = -1
>    wall-density[0]                = 0
>    wall-density[1]                = 0
>    wall-ewald-zfac                = 3
>    pull                           = false
>    awh                            = false
>    rotation                       = false
>    interactiveMD                  = false
>    disre                          = No
>    disre-weighting                = Conservative
>    disre-mixed                    = false
>    dr-fc                          = 1000
>    dr-tau                         = 0
>    nstdisreout                    = 100
>    orire-fc                       = 0
>    orire-tau                      = 0
>    nstorireout                    = 100
>    free-energy                    = no
>    cos-acceleration               = 0
>    deform (3x3):
>       deform[    0]={ 0.00000e+00,  0.00000e+00,  0.00000e+00}
>       deform[    1]={ 0.00000e+00,  0.00000e+00,  0.00000e+00}
>       deform[    2]={ 0.00000e+00,  0.00000e+00,  0.00000e+00}
>    simulated-tempering            = false
>    swapcoords                     = no
>    userint1                       = 0
>    userint2                       = 0
>    userint3                       = 0
>    userint4                       = 0
>    userreal1                      = 0
>    userreal2                      = 0
>    userreal3                      = 0
>    userreal4                      = 0
>    applied-forces:
>      electric-field:
>        x:
>          E0                       = 0
>          omega                    = 0
>          t0                       = 0
>          sigma                    = 0
>        y:
>          E0                       = 0
>          omega                    = 0
>          t0                       = 0
>          sigma                    = 0
>        z:
>          E0                       = 0
>          omega                    = 0
>          t0                       = 0
>          sigma                    = 0
> grpopts:
>    nrdf:       47805
>    ref-t:           0
>    tau-t:           0
> annealing:          No
> annealing-npoints:           0
>    acc:            0           0           0
>    nfreeze:           N           N           N
>    energygrp-flags[  0]: 0
> Initializing Domain Decomposition on 4 ranks
> NOTE: disabling dynamic load balancing as it is only supported with
> dynamics, not with integrator 'steep'.
> Dynamic load balancing: auto
> Using update groups, nr 10529, average size 2.5 atoms, max. radius 0.078 nm
> Minimum cell size due to atom displacement: 0.000 nm
> NOTE: Periodic molecules are present in this system. Because of this, the
> domain decomposition algorithm cannot easily determine the minimum cell
> size that it requires for treating bonded interactions. Instead, domain
> decomposition will assume that half the non-bonded cut-off will be a
> suitable lower bound.
> Minimum cell size due to bonded interactions: 0.678 nm
> Using 0 separate PME ranks, as there are too few total
>  ranks for efficient splitting
> Optimizing the DD grid for 4 cells with a minimum initial size of 0.678 nm
> The maximum allowed number of cells is: X 8 Y 8 Z 8
> Domain decomposition grid 1 x 4 x 1, separate PME ranks 0
> PME domain decomposition: 1 x 4 x 1
> Domain decomposition rank 0, coordinates 0 0 0
> The initial number of communication pulses is: Y 1
> The initial domain decomposition cell size is: Y 1.50 nm
> The maximum allowed distance for atom groups involved in interactions is:
>                  non-bonded interactions           1.356 nm
>             two-body bonded interactions  (-rdd)   1.356 nm
>           multi-body bonded interactions  (-rdd)   1.356 nm
>               virtual site constructions  (-rcon)  1.503 nm
> Using 4 MPI threads
> Using 4 OpenMP threads per tMPI thread
> Overriding thread affinity set outside gmx mdrun
> Pinning threads with a user-specified logical core stride of 2
> NOTE: Thread affinity was not set.

The threads are not pinned (see the NOTE just above), but I can't say why.
I suggest: i) talk to your admins; ii) try telling the job scheduler not to
set affinities, so that mdrun can set them itself.
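
One way to gather evidence before talking to the admins is to look at the affinity mask the scheduler hands to the job. The log line "Overriding thread affinity set outside gmx mdrun" shows that a mask was already set when mdrun started. The sketch below, run inside the same interactive session, compares that mask with the logical cores that the command line above would ask for. It assumes a simplified layout in which thread i goes to logical core pinoffset + i * pinstride; the helper names are hypothetical, and this is only a diagnostic aid, not a confirmed explanation of why pinning failed.

```python
import os


def requested_cores(ntmpi, ntomp, pinstride, pinoffset=0):
    """Logical cores used under an assumed 'offset + i * stride' layout.

    This is a simplified model of -pinoffset/-pinstride for illustration,
    not a copy of mdrun's internal logic.
    """
    nthreads = ntmpi * ntomp
    return [pinoffset + i * pinstride for i in range(nthreads)]


def missing_cores(allowed, ntmpi, ntomp, pinstride, pinoffset=0):
    """Requested cores that fall outside the allowed affinity mask."""
    wanted = requested_cores(ntmpi, ntomp, pinstride, pinoffset)
    return [core for core in wanted if core not in allowed]


if __name__ == "__main__":
    # os.sched_getaffinity is only available on some platforms (e.g. Linux).
    if hasattr(os, "sched_getaffinity"):
        allowed = os.sched_getaffinity(0)  # mask inherited from the scheduler
        # Values taken from the command line in the quoted log:
        # -ntmpi 4 -ntomp 4 -pinstride 2
        missing = missing_cores(allowed, ntmpi=4, ntomp=4, pinstride=2)
        print("logical cores in mask:", len(allowed))
        print("requested cores outside mask:", missing)
```

If the mask turns out to be much smaller than the 160 logical cores mdrun detects, that would be worth showing to the admins. With Slurm, srun's --cpu-bind=none option is one way to ask the scheduler not to bind tasks, though whether it is permitted or sufficient depends on the site configuration.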

> System total charge: 0.000
> Will do PME sum in reciprocal space for electrostatic interactions.
> U. Essmann, L. Perera, M. L. Berkowitz, T. Darden, H. Lee and L. G.
> Pedersen
> A smooth particle mesh Ewald method
> J. Chem. Phys. 103 (1995) pp. 8577-8592
> -------- -------- --- Thank You --- -------- --------
> Using a Gaussian width (1/beta) of 0.384195 nm for Ewald
> Potential shift: LJ r^-12: -1.122e-01 r^-6: -3.349e-01, Ewald -8.333e-06
> Initialized non-bonded Ewald correction tables, spacing: 1.02e-03 size:
> 1176
> Generated table with 1100 data points for 1-4 COUL.
> Tabscale = 500 points/nm
> Generated table with 1100 data points for 1-4 LJ6.
> Tabscale = 500 points/nm
> Generated table with 1100 data points for 1-4 LJ12.
> Tabscale = 500 points/nm
> Using SIMD 4x4 nonbonded short-range kernels
> Using a 4x4 pair-list setup:
>   updated every 1 steps, buffer 0.000 nm, rlist 1.200 nm
> Using geometric Lennard-Jones combination rule
> S. Miyamoto and P. A. Kollman
> SETTLE: An Analytical Version of the SHAKE and RATTLE Algorithms for Rigid
> Water Models
> J. Comp. Chem. 13 (1992) pp. 952-962
> -------- -------- --- Thank You --- -------- --------
> Linking all bonded interactions to atoms
> There are 5407 inter charge-group virtual sites,
> will an extra communication step for selected coordinates and forces
> Note that activating steepest-descent energy minimization via the
> integrator .mdp option and the command gmx mdrun may be available in a
> different form in a future version of GROMACS, e.g. gmx minimize and an
> .mdp option.
> Initiating Steepest Descents
> Atom distribution over 4 domains: av 6687 stddev 134 min 6515 max 6792
> Started Steepest Descents on rank 0 Thu May  9 15:49:36 2019
