[gmx-users] GROMACS performance issues on POWER9/V100 node
Jonathan D. Halverson
halverson at Princeton.EDU
Thu Apr 23 18:40:41 CEST 2020
We are finding that GROMACS (2018.x, 2019.x, 2020.x) performs worse on an IBM POWER9/V100 node than on an Intel Broadwell/P100 node. Both are running RHEL 7.7 and Slurm 19.05.5. We have no concerns about GROMACS on our Intel nodes. Everything below is about the POWER9/V100 node.
We ran the RNASE benchmark (PME, cubic box) with GROMACS 2019.6 using 1 CPU-core and 1 GPU (ftp://ftp.gromacs.org/pub/benchmarks/rnase_bench_systems.tar.gz) and found that the Broadwell/P100 gives 144 ns/day while the POWER9/V100 gives 102 ns/day. The performance gap is roughly the same for the larger ADH benchmark and when different numbers of CPU-cores are used. GROMACS consistently underperforms on our POWER9/V100 nodes. We have pinning turned on (see the Slurm script at the bottom).
Below is our build procedure on the POWER9/V100 node:
version_gmx=2019.6
wget ftp://ftp.gromacs.org/pub/gromacs/gromacs-${version_gmx}.tar.gz
tar zxvf gromacs-${version_gmx}.tar.gz
cd gromacs-${version_gmx}
mkdir build && cd build
module purge
module load rh/devtoolset/7
module load cudatoolkit/10.2
OPTFLAGS="-Ofast -mcpu=power9 -mtune=power9 -mvsx -DNDEBUG"
cmake3 .. -DCMAKE_BUILD_TYPE=Release \
-DCMAKE_C_COMPILER=gcc -DCMAKE_C_FLAGS_RELEASE="$OPTFLAGS" \
-DCMAKE_CXX_COMPILER=g++ -DCMAKE_CXX_FLAGS_RELEASE="$OPTFLAGS" \
-DGMX_BUILD_MDRUN_ONLY=OFF -DGMX_MPI=OFF -DGMX_OPENMP=ON \
-DGMX_SIMD=IBM_VSX -DGMX_DOUBLE=OFF \
-DGMX_BUILD_OWN_FFTW=ON \
-DGMX_GPU=ON -DGMX_CUDA_TARGET_SM=70 \
-DGMX_OPENMP_MAX_THREADS=128 \
-DCMAKE_INSTALL_PREFIX=$HOME/.local \
-DGMX_COOL_QUOTES=OFF -DREGRESSIONTEST_DOWNLOAD=ON
make -j 10
make check
make install
45 of the 46 tests pass; the only failure is HardwareUnitTests. There are several posts about this failure and apparently it is not a concern. The full build log is here:
https://github.com/jdh4/running_gromacs/blob/master/03_benchmarks/build.log
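As a sanity check that the build picked up the intended settings (assuming $HOME/.local/bin is on the PATH), we look at the configuration summary that gmx prints:

$ gmx --version | grep -iE "SIMD|GPU|CUDA"

which for this build should list IBM_VSX SIMD instructions and CUDA GPU support.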
Here is more info about our POWER9/V100 node:
$ lscpu
Architecture: ppc64le
Byte Order: Little Endian
CPU(s): 128
On-line CPU(s) list: 0-127
Thread(s) per core: 4
Core(s) per socket: 16
Socket(s): 2
NUMA node(s): 6
Model: 2.3 (pvr 004e 1203)
Model name: POWER9, altivec supported
CPU max MHz: 3800.0000
CPU min MHz: 2300.0000
As the lscpu output shows, we have 4 hardware threads per physical core. If we use 4 hardware threads for the RNASE benchmark instead of 1, the performance rises to 119 ns/day, which is still about 20% less than the Broadwell/P100 value. When using multiple CPU-cores on the POWER9/V100 there is also significant run-to-run variation in execution time.
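Is explicit control of the pinning what we are missing? What we had in mind is something like the following (a sketch only; it assumes logical CPUs 0-3 are the four SMT threads of the first physical core, which lscpu -e or ppc64_cpu --info would confirm):

# 4 OpenMP threads packed onto the SMT siblings of one physical core
gmx mdrun -ntmpi 1 -ntomp 4 -pin on -pinoffset 0 -pinstride 1 -s bench.tpr

# 4 OpenMP threads spread over four physical cores, one per core
gmx mdrun -ntmpi 1 -ntomp 4 -pin on -pinoffset 0 -pinstride 4 -s bench.tpr

Presumably the second form is the one we want when whole physical cores are available.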
There are four GPUs per POWER9/V100 node:
$ nvidia-smi -q
Driver Version : 440.33.01
CUDA Version : 10.2
GPU 00000004:04:00.0
Product Name : Tesla V100-SXM2-32GB
The GPUs have been shown to perform as expected on other applications.
The following lines are found in md.log for the POWER9/V100 run:
Overriding thread affinity set outside gmx mdrun
Pinning threads with an auto-selected logical core stride of 128
NOTE: Thread affinity was not set.
The full md.log is available here:
https://github.com/jdh4/running_gromacs/blob/master/03_benchmarks/md.log
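Given the "Thread affinity was not set" note, we also wonder whether mdrun is seeing all 128 logical CPUs while the Slurm cgroup only lets the step run on one of them (the stride of 128 presumably being 128 logical CPUs divided by 1 thread). A check we could run from inside the job (not captured in the log above) is:

$ grep Cpus_allowed_list /proc/self/status
$ taskset -cp $$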
Below is the MegaFlops accounting for the POWER9/V100 versus the Broadwell/P100:
================ IBM POWER9 WITH NVIDIA V100 ================
Computing: M-Number M-Flops % Flops
-----------------------------------------------------------------------------
Pair Search distance check 297.763872 2679.875 0.0
NxN Ewald Elec. + LJ [F] 244214.215808 16118138.243 98.0
NxN Ewald Elec. + LJ [V&F] 2483.565760 265741.536 1.6
1,4 nonbonded interactions 53.415341 4807.381 0.0
Shift-X 3.029040 18.174 0.0
Angles 37.043704 6223.342 0.0
Propers 55.825582 12784.058 0.1
Impropers 4.220422 877.848 0.0
Virial 2.432585 43.787 0.0
Stop-CM 2.452080 24.521 0.0
Calc-Ekin 48.128080 1299.458 0.0
Lincs 20.536159 1232.170 0.0
Lincs-Mat 444.613344 1778.453 0.0
Constraint-V 261.192228 2089.538 0.0
Constraint-Vir 2.430161 58.324 0.0
Settle 73.382008 23702.389 0.1
-----------------------------------------------------------------------------
Total 16441499.096 100.0
-----------------------------------------------------------------------------
================ INTEL BROADWELL WITH NVIDIA P100 ================
Computing: M-Number M-Flops % Flops
-----------------------------------------------------------------------------
Pair Search distance check 271.334272 2442.008 0.0
NxN Ewald Elec. + LJ [F] 191599.850112 12645590.107 98.0
NxN Ewald Elec. + LJ [V&F] 1946.866432 208314.708 1.6
1,4 nonbonded interactions 53.415341 4807.381 0.0
Shift-X 3.029040 18.174 0.0
Bonds 10.541054 621.922 0.0
Angles 37.043704 6223.342 0.0
Propers 55.825582 12784.058 0.1
Impropers 4.220422 877.848 0.0
Virial 2.432585 43.787 0.0
Stop-CM 2.452080 24.521 0.0
Calc-Ekin 48.128080 1299.458 0.0
Lincs 9.992997 599.580 0.0
Lincs-Mat 50.775228 203.101 0.0
Constraint-V 240.108012 1920.864 0.0
Constraint-Vir 2.323707 55.769 0.0
Settle 73.382008 23702.389 0.2
-----------------------------------------------------------------------------
Total 12909529.017 100.0
-----------------------------------------------------------------------------
Some of the rows are identical between the two tables above. The largest difference is in the "NxN Ewald Elec. + LJ [F]" row: the POWER9/V100 run reports about 27% more work there (16,118,138 M-Flops versus 12,645,590 M-Flops), which suggests it is evaluating a noticeably larger pair list. The constraint rows (Lincs, Lincs-Mat) also differ, and the POWER9 table has no "Bonds" row.
Here is our Slurm script:
#!/bin/bash
#SBATCH --job-name=gmx # create a short name for your job
#SBATCH --nodes=1 # node count
#SBATCH --ntasks=1 # total number of tasks across all nodes
#SBATCH --cpus-per-task=1 # cpu-cores per task (>1 if multi-threaded tasks)
#SBATCH --mem=4G # memory per node (4G per cpu-core is default)
#SBATCH --time=00:10:00 # total run time limit (HH:MM:SS)
#SBATCH --gres=gpu:1 # number of gpus per node
module purge
module load cudatoolkit/10.2
BCH=../rnase_cubic
gmx grompp -f $BCH/pme_verlet.mdp -c $BCH/conf.gro -p $BCH/topol.top -o bench.tpr
gmx mdrun -pin on -ntmpi $SLURM_NTASKS -ntomp $SLURM_CPUS_PER_TASK -s bench.tpr
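Would a resource request along these lines be closer to what mdrun expects? This is an untested sketch (whether --cpus-per-task counts hardware threads or physical cores, and whether consecutive logical CPUs are SMT siblings of the same core, depends on the site's Slurm configuration):

#!/bin/bash
#SBATCH --job-name=gmx           # create a short name for your job
#SBATCH --nodes=1                # node count
#SBATCH --ntasks=1               # total number of tasks across all nodes
#SBATCH --cpus-per-task=16       # 16 logical CPUs = 4 physical cores with SMT4
#SBATCH --mem=4G                 # memory per node
#SBATCH --time=00:10:00          # total run time limit (HH:MM:SS)
#SBATCH --gres=gpu:1             # number of gpus per node

module purge
module load cudatoolkit/10.2

BCH=../rnase_cubic
gmx grompp -f $BCH/pme_verlet.mdp -c $BCH/conf.gro -p $BCH/topol.top -o bench.tpr
# one OpenMP thread per physical core; stride 4 skips over the SMT siblings
gmx mdrun -pin on -pinoffset 0 -pinstride 4 -ntmpi 1 -ntomp 4 -s bench.tpr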
How do we get optimal performance out of GROMACS on our POWER9/V100 nodes?
Jon