[gmx-users] GROMACS performance issues on POWER9/V100 node
Jonathan D. Halverson
halverson at Princeton.EDU
Thu Apr 23 18:40:41 CEST 2020
We are finding that GROMACS (2018.x, 2019.x, 2020.x) performs worse on an IBM POWER9/V100 node than on an Intel Broadwell/P100 node. Both are running RHEL 7.7 and Slurm 19.05.5. We have no concerns about GROMACS on our Intel nodes. Everything below is about the POWER9/V100 node.
We ran the RNASE benchmark (PME, cubic box) with GROMACS 2019.6 using 1 CPU-core and 1 GPU (ftp://ftp.gromacs.org/pub/benchmarks/rnase_bench_systems.tar.gz) and found that the Broadwell/P100 gives 144 ns/day while the POWER9/V100 gives 102 ns/day. The performance gap is roughly the same for the larger ADH benchmark and when different numbers of CPU-cores are used. GROMACS consistently underperforms on our POWER9/V100 nodes. We have pinning turned on (see the Slurm script at the bottom).
Below is our build procedure on the POWER9/V100 node:
version_gmx=2019.6
wget ftp://ftp.gromacs.org/pub/gromacs/gromacs-${version_gmx}.tar.gz
tar zxvf gromacs-${version_gmx}.tar.gz
cd gromacs-${version_gmx}
mkdir build && cd build
module purge
module load rh/devtoolset/7
module load cudatoolkit/10.2
OPTFLAGS="-Ofast -mcpu=power9 -mtune=power9 -mvsx -DNDEBUG"
cmake3 .. -DCMAKE_BUILD_TYPE=Release \
-DCMAKE_C_COMPILER=gcc -DCMAKE_C_FLAGS_RELEASE="$OPTFLAGS" \
-DCMAKE_CXX_COMPILER=g++ -DCMAKE_CXX_FLAGS_RELEASE="$OPTFLAGS" \
-DGMX_BUILD_MDRUN_ONLY=OFF -DGMX_MPI=OFF -DGMX_OPENMP=ON \
-DGMX_SIMD=IBM_VSX -DGMX_DOUBLE=OFF \
-DGMX_BUILD_OWN_FFTW=ON \
-DGMX_GPU=ON -DGMX_CUDA_TARGET_SM=70 \
-DGMX_OPENMP_MAX_THREADS=128 \
-DCMAKE_INSTALL_PREFIX=$HOME/.local \
-DGMX_COOL_QUOTES=OFF -DREGRESSIONTEST_DOWNLOAD=ON
make -j 10
make check
make install
45 of the 46 tests pass; the only failure is HardwareUnitTests. There are several posts about this failure and apparently it is not a concern. The full build log is here:
https://github.com/jdh4/running_gromacs/blob/master/03_benchmarks/build.log
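As a sanity check that the build picked up the intended settings (assuming $HOME/.local/bin is on the PATH), we look at the configuration summary that gmx prints:

$ gmx --version | grep -iE "SIMD|GPU|CUDA"

which for this build should list IBM_VSX SIMD instructions and CUDA GPU support.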
Here is more info about our POWER9/V100 node:
$ lscpu
Architecture: ppc64le
Byte Order: Little Endian
CPU(s): 128
On-line CPU(s) list: 0-127
Thread(s) per core: 4
Core(s) per socket: 16
Socket(s): 2
NUMA node(s): 6
Model: 2.3 (pvr 004e 1203)
Model name: POWER9, altivec supported
CPU max MHz: 3800.0000
CPU min MHz: 2300.0000
As the lscpu output shows, we have 4 hardware threads per physical core. If we use 4 hardware threads for the RNASE benchmark instead of 1, the performance rises to 119 ns/day, which is still about 20% less than the Broadwell/P100 value. When using multiple CPU-cores on the POWER9/V100 there is also significant run-to-run variation in execution time.
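Is explicit control of the pinning what we are missing? What we had in mind is something like the following (a sketch only; it assumes logical CPUs 0-3 are the four SMT threads of the first physical core, which lscpu -e or ppc64_cpu --info would confirm):

# 4 OpenMP threads packed onto the SMT siblings of one physical core
gmx mdrun -ntmpi 1 -ntomp 4 -pin on -pinoffset 0 -pinstride 1 -s bench.tpr

# 4 OpenMP threads spread over four physical cores, one per core
gmx mdrun -ntmpi 1 -ntomp 4 -pin on -pinoffset 0 -pinstride 4 -s bench.tpr

Presumably the second form is the one we want when whole physical cores are available.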
There are four GPUs per POWER9/V100 node:
$ nvidia-smi -q
Driver Version : 440.33.01
CUDA Version : 10.2
GPU 00000004:04:00.0
Product Name : Tesla V100-SXM2-32GB
The GPUs have been shown to perform as expected on other applications.
The following lines are found in md.log for the POWER9/V100 run:
Overriding thread affinity set outside gmx mdrun
Pinning threads with an auto-selected logical core stride of 128
NOTE: Thread affinity was not set.
The full md.log is available here:
https://github.com/jdh4/running_gromacs/blob/master/03_benchmarks/md.log
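Given the "Thread affinity was not set" note, we also wonder whether mdrun is seeing all 128 logical CPUs while the Slurm cgroup only lets the step run on one of them (the stride of 128 presumably being 128 logical CPUs divided by 1 thread). A check we could run from inside the job (not captured in the log above) is:

$ grep Cpus_allowed_list /proc/self/status
$ taskset -cp $$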
Below is the MegaFlops accounting for the POWER9/V100 versus the Broadwell/P100:
================ IBM POWER9 WITH NVIDIA V100 ================
Computing: M-Number M-Flops % Flops
-----------------------------------------------------------------------------
Pair Search distance check 297.763872 2679.875 0.0
NxN Ewald Elec. + LJ [F] 244214.215808 16118138.243 98.0
NxN Ewald Elec. + LJ [V&F] 2483.565760 265741.536 1.6
1,4 nonbonded interactions 53.415341 4807.381 0.0
Shift-X 3.029040 18.174 0.0
Angles 37.043704 6223.342 0.0
Propers 55.825582 12784.058 0.1
Impropers 4.220422 877.848 0.0
Virial 2.432585 43.787 0.0
Stop-CM 2.452080 24.521 0.0
Calc-Ekin 48.128080 1299.458 0.0
Lincs 20.536159 1232.170 0.0
Lincs-Mat 444.613344 1778.453 0.0
Constraint-V 261.192228 2089.538 0.0
Constraint-Vir 2.430161 58.324 0.0
Settle 73.382008 23702.389 0.1
-----------------------------------------------------------------------------
Total 16441499.096 100.0
-----------------------------------------------------------------------------
================ INTEL BROADWELL WITH NVIDIA P100 ================
Computing: M-Number M-Flops % Flops
-----------------------------------------------------------------------------
Pair Search distance check 271.334272 2442.008 0.0
NxN Ewald Elec. + LJ [F] 191599.850112 12645590.107 98.0
NxN Ewald Elec. + LJ [V&F] 1946.866432 208314.708 1.6
1,4 nonbonded interactions 53.415341 4807.381 0.0
Shift-X 3.029040 18.174 0.0
Bonds 10.541054 621.922 0.0
Angles 37.043704 6223.342 0.0
Propers 55.825582 12784.058 0.1
Impropers 4.220422 877.848 0.0
Virial 2.432585 43.787 0.0
Stop-CM 2.452080 24.521 0.0
Calc-Ekin 48.128080 1299.458 0.0
Lincs 9.992997 599.580 0.0
Lincs-Mat 50.775228 203.101 0.0
Constraint-V 240.108012 1920.864 0.0
Constraint-Vir 2.323707 55.769 0.0
Settle 73.382008 23702.389 0.2
-----------------------------------------------------------------------------
Total 12909529.017 100.0
-----------------------------------------------------------------------------
Some of the rows are identical between the two tables above. The largest difference is in the "NxN Ewald Elec. + LJ [F]" row: the POWER9/V100 run reports about 27% more work there (16,118,138 M-Flops versus 12,645,590 M-Flops), which suggests it is evaluating a noticeably larger pair list. The constraint rows (Lincs, Lincs-Mat) also differ, and the POWER9 table has no "Bonds" row.
Here is our Slurm script:
#!/bin/bash
#SBATCH --job-name=gmx # create a short name for your job
#SBATCH --nodes=1 # node count
#SBATCH --ntasks=1 # total number of tasks across all nodes
#SBATCH --cpus-per-task=1 # cpu-cores per task (>1 if multi-threaded tasks)
#SBATCH --mem=4G # memory per node (4G per cpu-core is default)
#SBATCH --time=00:10:00 # total run time limit (HH:MM:SS)
#SBATCH --gres=gpu:1 # number of gpus per node
module purge
module load cudatoolkit/10.2
BCH=../rnase_cubic
gmx grompp -f $BCH/pme_verlet.mdp -c $BCH/conf.gro -p $BCH/topol.top -o bench.tpr
gmx mdrun -pin on -ntmpi $SLURM_NTASKS -ntomp $SLURM_CPUS_PER_TASK -s bench.tpr
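Would a resource request along these lines be closer to what mdrun expects? This is an untested sketch (whether --cpus-per-task counts hardware threads or physical cores, and whether consecutive logical CPUs are SMT siblings of the same core, depends on the site's Slurm configuration):

#!/bin/bash
#SBATCH --job-name=gmx           # create a short name for your job
#SBATCH --nodes=1                # node count
#SBATCH --ntasks=1               # total number of tasks across all nodes
#SBATCH --cpus-per-task=16       # 16 logical CPUs = 4 physical cores with SMT4
#SBATCH --mem=4G                 # memory per node
#SBATCH --time=00:10:00          # total run time limit (HH:MM:SS)
#SBATCH --gres=gpu:1             # number of gpus per node

module purge
module load cudatoolkit/10.2

BCH=../rnase_cubic
gmx grompp -f $BCH/pme_verlet.mdp -c $BCH/conf.gro -p $BCH/topol.top -o bench.tpr
# one OpenMP thread per physical core; stride 4 skips over the SMT siblings
gmx mdrun -pin on -pinoffset 0 -pinstride 4 -ntmpi 1 -ntomp 4 -s bench.tpr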
How do we get optimal performance out of GROMACS on our POWER9/V100 nodes?
Jon