[gmx-users] GROMACS performance issues on POWER9/V100 node

Jonathan D. Halverson halverson at Princeton.EDU
Fri Apr 24 19:27:54 CEST 2020


I cannot force the pinning via GROMACS so I will look at what can be done with hwloc.
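
For the record, this is the kind of hwloc-based wrapper I plan to try (the core indices are placeholders; the right ones depend on what hwloc-ls reports for our node):

# inspect the topology the job actually sees
hwloc-ls --only core

# bind mdrun (and the OpenMP threads it spawns) to the first 8 physical cores,
# leaving mdrun's own pinning off so the two mechanisms don't fight
hwloc-bind core:0-7 -- gmx mdrun -ntmpi 1 -ntomp 8 -pin off -s bench.tpr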

On the POWER9 the hardware appears to be detected correctly (only the Intel node produces a note):
Running on 1 node with total 128 cores, 128 logical cores, 1 compatible GPU

But during the build it fails the HardwareUnitTests:
https://github.com/jdh4/running_gromacs/blob/master/03_benchmarks/build.log#L3338


Here are more benchmarks based on Kevin and Szilárd's suggestions:

ADH (134177 atoms, ftp://ftp.gromacs.org/pub/benchmarks/ADH_bench_systems.tar.gz)
2019.6, PME and cubic box
nsteps = 40000

Intel Broadwell-NVIDIA P100
ntomp (rate, wall time)
1 (21 ns/day, 323 s)
4 (56 ns/day, 123 s)
8 (69 ns/day, 100 s)

IBM POWER9-NVIDIA V100
ntomp (rate, wall time)
 1 (14 ns/day, 500 s)
 1 (14 ns/day, 502 s)
 1 (14 ns/day, 510 s)
 4 (19 ns/day, 357 s)
 4 (17 ns/day, 397 s)
 4 (20 ns/day, 346 s)
 8 (30 ns/day, 232 s)
 8 (24 ns/day, 288 s)
 8 (31 ns/day, 222 s)
16 (59 ns/day, 117 s)
16 (65 ns/day, 107 s)
16 (63 ns/day, 110 s) [md.log on GitHub is https://bit.ly/3aCm1gw]
32 (89 ns/day,  76 s)
32 (93 ns/day,  75 s)
32 (89 ns/day,  78 s)
64 (57 ns/day, 122 s)
64 (43 ns/day, 159 s)
64 (46 ns/day, 152 s)

Yes, there is variability between identical runs for POWER9/V100.

For the Intel case, ntomp equals the number of physical cores. For the IBM case, ntomp equals the number of hardware threads (4 hardware threads per physical core). On a per-physical-core basis these numbers look better, but clearly there are still problems.

I tried different values for -pinoffset but didn't see performance gains that couldn't be explained by the run-to-run variation.
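
For reference, this is the kind of invocation I am planning to try next to get one OpenMP thread per physical core on this SMT4 machine (the stride of 4 skips the extra hardware threads, assuming the four hardware threads of a core are numbered consecutively; the offset is just an example value):

gmx mdrun -ntmpi 1 -ntomp 32 -pin on -pinoffset 0 -pinstride 4 -s bench.tpr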

I've written to contacts at ORNL and IBM.

Jon

________________________________
From: gromacs.org_gmx-users-bounces at maillist.sys.kth.se <gromacs.org_gmx-users-bounces at maillist.sys.kth.se> on behalf of Szilárd Páll <pall.szilard at gmail.com>
Sent: Friday, April 24, 2020 10:23 AM
To: Discussion list for GROMACS users <gmx-users at gromacs.org>
Subject: Re: [gmx-users] GROMACS performance issues on POWER9/V100 node

Using a single thread per GPU, as the linked log files show, is not
sufficient for GROMACS (and any modern machine has more than that anyway),
but I take it from your mail that this was only meant to debug the
performance instability?

Your performance variation on POWER9 may be related to the fact that you are
either not setting affinities or the affinity settings are not correct.
However, you also have a job scheduler in the way (which I suspect is either
not configured well or is not passed the options required to correctly
assign resources to jobs); it obfuscates the machine layout and makes things
look weird to mdrun [1].

I suggest simplifying the problem and debugging it step by step. Start by
allocating full nodes and testing that you can pin (either with mdrun -pin
on or with hwloc) while avoiding [1], and get an understanding of what you
should expect from the node sharing, which does not seem to work correctly.
Building GROMACS with hwloc may also help, as you get better reporting in
the log.

[1]
https://github.com/jdh4/running_gromacs/blob/master/03_benchmarks/md.log.intel-broadwell-P100#L58
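
To be concrete, a minimal sketch of the kind of setup I mean (exclusive node so nothing else touches the affinities, mdrun doing the pinning; thread counts are only an example and should match your node):

#SBATCH --nodes=1
#SBATCH --exclusive
#SBATCH --cpus-per-task=32
#SBATCH --gres=gpu:1

gmx mdrun -ntmpi 1 -ntomp 32 -pin on -s bench.tpr

For the hwloc support it should be enough to have hwloc (and its headers) installed and to configure with -DGMX_HWLOC=ON; the log then reports the detected topology in more detail.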

--
Szilárd


On Fri, Apr 24, 2020 at 3:43 AM Jonathan D. Halverson <
halverson at princeton.edu> wrote:

> Hi Kevin,
>
> md.log for the Intel run is here:
>
> https://github.com/jdh4/running_gromacs/blob/master/03_benchmarks/md.log.intel-broadwell-P100
>
> Thanks for the info on constraints with 2020. I'll try some runs with
> different values of -pinoffset for 2019.6.
>
> I know a group at NIST is having the same or similar problems with
> POWER9/V100.
>
> Jon
> ________________________________
> From: gromacs.org_gmx-users-bounces at maillist.sys.kth.se <
> gromacs.org_gmx-users-bounces at maillist.sys.kth.se> on behalf of Kevin
> Boyd <kevin.boyd at uconn.edu>
> Sent: Thursday, April 23, 2020 9:08 PM
> To: gmx-users at gromacs.org <gmx-users at gromacs.org>
> Subject: Re: [gmx-users] GROMACS performance issues on POWER9/V100 node
>
> Hi,
>
> Can you post the full log for the Intel system? I typically find the real
> cycle and time accounting section a better place to start debugging
> performance issues.
>
> A couple of quick notes; a side-by-side comparison is needed for a more
> useful analysis, and these points may apply to both systems, so they may
> not be your root cause:
> * At first glance, your POWER system spends 1/3 of its time in constraint
> calculation, which is unusual. This can be reduced in two ways: first, by
> adding more CPU cores. It doesn't make a ton of sense to benchmark on one
> core if your applications will use more. Second, if you upgrade to GROMACS
> 2020 you can probably put the constraint calculation on the GPU with
> -update gpu.
> * The POWER system log has this line:
> https://github.com/jdh4/running_gromacs/blob/master/03_benchmarks/md.log#L304
> indicating that the threads were perhaps not actually pinned. Try adding
> -pinoffset 0 (or some other core) to specify where you want the process
> pinned.
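>
> For example, with 2020 something along these lines (thread count and offset
> are just illustrative, and whether -update gpu is accepted depends on your
> constraint setup):
>
> gmx mdrun -ntmpi 1 -ntomp 8 -pin on -pinoffset 0 \
>           -nb gpu -pme gpu -bonded gpu -update gpu -s bench.tpr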
>
> Kevin
>
> On Thu, Apr 23, 2020 at 9:40 AM Jonathan D. Halverson <
> halverson at princeton.edu> wrote:
>
> >
> > We are finding that GROMACS (2018.x, 2019.x, 2020.x) performs worse on an
> > IBM POWER9/V100 node than on an Intel Broadwell/P100 node. Both are running
> > RHEL 7.7 and Slurm 19.05.5. We have no concerns about GROMACS on our Intel
> > nodes. Everything below is about the POWER9/V100 node.
> >
> > We ran the RNASE benchmark with 2019.6 with PME and a cubic box using 1
> > CPU-core and 1 GPU (
> > ftp://ftp.gromacs.org/pub/benchmarks/rnase_bench_systems.tar.gz) and
> > found that the Broadwell/P100 gives 144 ns/day while the POWER9/V100 gives
> > 102 ns/day. The difference in performance is roughly the same for the
> > larger ADH benchmark and when different numbers of CPU-cores are used.
> > GROMACS is always underperforming on our POWER9/V100 nodes. We have
> > pinning turned on (see Slurm script at bottom).
> >
> > Below is our build procedure on the POWER9/V100 node:
> >
> > version_gmx=2019.6
> > wget ftp://ftp.gromacs.org/pub/gromacs/gromacs-${version_gmx}.tar.gz
> > tar zxvf gromacs-${version_gmx}.tar.gz
> > cd gromacs-${version_gmx}
> > mkdir build && cd build
> >
> > module purge
> > module load rh/devtoolset/7
> > module load cudatoolkit/10.2
> >
> > OPTFLAGS="-Ofast -mcpu=power9 -mtune=power9 -mvsx -DNDEBUG"
> >
> > cmake3 .. -DCMAKE_BUILD_TYPE=Release \
> > -DCMAKE_C_COMPILER=gcc -DCMAKE_C_FLAGS_RELEASE="$OPTFLAGS" \
> > -DCMAKE_CXX_COMPILER=g++ -DCMAKE_CXX_FLAGS_RELEASE="$OPTFLAGS" \
> > -DGMX_BUILD_MDRUN_ONLY=OFF -DGMX_MPI=OFF -DGMX_OPENMP=ON \
> > -DGMX_SIMD=IBM_VSX -DGMX_DOUBLE=OFF \
> > -DGMX_BUILD_OWN_FFTW=ON \
> > -DGMX_GPU=ON -DGMX_CUDA_TARGET_SM=70 \
> > -DGMX_OPENMP_MAX_THREADS=128 \
> > -DCMAKE_INSTALL_PREFIX=$HOME/.local \
> > -DGMX_COOL_QUOTES=OFF -DREGRESSIONTEST_DOWNLOAD=ON
> >
> > make -j 10
> > make check
> > make install
> >
> > 45 of the 46 tests pass with the exception being HardwareUnitTests. There
> > are several posts about this and apparently it is not a concern. The full
> > build log is here:
> >
> > https://github.com/jdh4/running_gromacs/blob/master/03_benchmarks/build.log
> >
> >
> >
> > Here is more info about our POWER9/V100 node:
> >
> > $ lscpu
> > Architecture:          ppc64le
> > Byte Order:            Little Endian
> > CPU(s):                128
> > On-line CPU(s) list:   0-127
> > Thread(s) per core:    4
> > Core(s) per socket:    16
> > Socket(s):             2
> > NUMA node(s):          6
> > Model:                 2.3 (pvr 004e 1203)
> > Model name:            POWER9, altivec supported
> > CPU max MHz:           3800.0000
> > CPU min MHz:           2300.0000
> >
> > You see that we have 4 hardware threads per physical core. If we use 4
> > hardware threads on the RNASE benchmark instead of 1, the performance goes
> > up to 119 ns/day, which is still about 20% less than the Broadwell/P100
> > value. When using multiple CPU-cores on the POWER9/V100 there is
> > significant variation in the execution time of the code.
> >
> > There are four GPUs per POWER9/V100 node:
> >
> > $ nvidia-smi -q
> > Driver Version                      : 440.33.01
> > CUDA Version                        : 10.2
> > GPU 00000004:04:00.0
> >     Product Name                    : Tesla V100-SXM2-32GB
> >
> > The GPUs have been shown to perform as expected on other applications.
> >
> >
> >
> >
> > The following lines are found in md.log for the POWER9/V100 run:
> >
> > Overriding thread affinity set outside gmx mdrun
> > Pinning threads with an auto-selected logical core stride of 128
> > NOTE: Thread affinity was not set.
> >
> > The full md.log is available here:
> > https://github.com/jdh4/running_gromacs/blob/master/03_benchmarks/md.log
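> >
> > One way to sanity-check, independently of GROMACS, what binding the job
> > step actually receives (diagnostic only; run inside the Slurm job):
> >
> > srun --cpu-bind=verbose true     # Slurm prints the CPU binding it applies
> > taskset -cp $$                   # affinity mask of the current shell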
> >
> >
> >
> >
> > Below are the MegaFlops Accounting for the POWER9/V100 versus
> > Broadwell/P100:
> >
> > ================ IBM POWER9 WITH NVIDIA V100 ================
> >  Computing:                               M-Number         M-Flops  % Flops
> > -----------------------------------------------------------------------------
> >  Pair Search distance check             297.763872        2679.875     0.0
> >  NxN Ewald Elec. + LJ [F]            244214.215808    16118138.243    98.0
> >  NxN Ewald Elec. + LJ [V&F]            2483.565760      265741.536     1.6
> >  1,4 nonbonded interactions              53.415341        4807.381     0.0
> >  Shift-X                                  3.029040          18.174     0.0
> >  Angles                                  37.043704        6223.342     0.0
> >  Propers                                 55.825582       12784.058     0.1
> >  Impropers                                4.220422         877.848     0.0
> >  Virial                                   2.432585          43.787     0.0
> >  Stop-CM                                  2.452080          24.521     0.0
> >  Calc-Ekin                               48.128080        1299.458     0.0
> >  Lincs                                   20.536159        1232.170     0.0
> >  Lincs-Mat                              444.613344        1778.453     0.0
> >  Constraint-V                           261.192228        2089.538     0.0
> >  Constraint-Vir                           2.430161          58.324     0.0
> >  Settle                                  73.382008       23702.389     0.1
> > -----------------------------------------------------------------------------
> >  Total                                                16441499.096   100.0
> > -----------------------------------------------------------------------------
> >
> >
> > ================ INTEL BROADWELL WITH NVIDIA P100 ================
> >  Computing:                               M-Number         M-Flops  % Flops
> > -----------------------------------------------------------------------------
> >  Pair Search distance check             271.334272        2442.008     0.0
> >  NxN Ewald Elec. + LJ [F]            191599.850112    12645590.107    98.0
> >  NxN Ewald Elec. + LJ [V&F]            1946.866432      208314.708     1.6
> >  1,4 nonbonded interactions              53.415341        4807.381     0.0
> >  Shift-X                                  3.029040          18.174     0.0
> >  Bonds                                   10.541054         621.922     0.0
> >  Angles                                  37.043704        6223.342     0.0
> >  Propers                                 55.825582       12784.058     0.1
> >  Impropers                                4.220422         877.848     0.0
> >  Virial                                   2.432585          43.787     0.0
> >  Stop-CM                                  2.452080          24.521     0.0
> >  Calc-Ekin                               48.128080        1299.458     0.0
> >  Lincs                                    9.992997         599.580     0.0
> >  Lincs-Mat                               50.775228         203.101     0.0
> >  Constraint-V                           240.108012        1920.864     0.0
> >  Constraint-Vir                           2.323707          55.769     0.0
> >  Settle                                  73.382008       23702.389     0.2
> > -----------------------------------------------------------------------------
> >  Total                                                12909529.017   100.0
> > -----------------------------------------------------------------------------
> >
> > Some of the rows are identical between the two tables above. The largest
> > difference is observed for the "NxN Ewald Elec. + LJ [F]" row.
> >
> >
> >
> > Here is our Slurm script:
> >
> > #!/bin/bash
> > #SBATCH --job-name=gmx           # create a short name for your job
> > #SBATCH --nodes=1                # node count
> > #SBATCH --ntasks=1               # total number of tasks across all nodes
> > #SBATCH --cpus-per-task=1        # cpu-cores per task (>1 if multi-threaded tasks)
> > #SBATCH --mem=4G                 # memory per node (4G per cpu-core is default)
> > #SBATCH --time=00:10:00          # total run time limit (HH:MM:SS)
> > #SBATCH --gres=gpu:1             # number of gpus per node
> >
> > module purge
> > module load cudatoolkit/10.2
> >
> > BCH=../rnase_cubic
> > gmx grompp -f $BCH/pme_verlet.mdp -c $BCH/conf.gro -p $BCH/topol.top -o bench.tpr
> > gmx mdrun -pin on -ntmpi $SLURM_NTASKS -ntomp $SLURM_CPUS_PER_TASK -s bench.tpr
> >
> >
> >
> > How do we get optimal performance out of GROMACS on our POWER9/V100 nodes?
> >
> > Jon
>
--
Gromacs Users mailing list

* Please search the archive at http://www.gromacs.org/Support/Mailing_Lists/GMX-Users_List before posting!

* Can't post? Read http://www.gromacs.org/Support/Mailing_Lists

* For (un)subscribe requests visit
https://maillist.sys.kth.se/mailman/listinfo/gromacs.org_gmx-users or send a mail to gmx-users-request at gromacs.org.

