[gmx-users] GROMACS performance issues on POWER9/V100 node

Alex nedomacho at gmail.com
Sat Apr 25 02:19:38 CEST 2020


Hi Szilárd,

My comment was as follows:

1. We have been unable to pin threads (mdrun overrides -pin on), with or 
without the pin stride set to 4.

2. We have basically accepted that Power9/V100 performance (in ns/day) on 
identical systems is much worse than what we get from an Intel-based 
machine. Both jobs run with -nt 32 and use four GPUs (a representative 
invocation is shown below).

3. We have not tried to reach out to IBM or take any other steps. As I 
said, we accepted crappy performance.
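
For reference, a representative invocation looks roughly like the following 
(the tpr name and GPU id string are placeholders):

gmx mdrun -s bench.tpr -nt 32 -pin on -pinstride 4 -gpu_id 0123

With or without the -pinstride flag, mdrun ends up not pinning the threads.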

We would of course very much appreciate any further clarification from 
you -- without being able to point to a specific issue (e.g., with the OS), 
I am unable to productively bug our sysadmins (the cluster is 
institution-wide and only two people deal with all of its users). I do not 
have admin privileges on this machine myself. The only reason I commented 
was that Jon revitalized my old thread. ;)

Alex

On 4/24/2020 2:31 PM, Szilárd Páll wrote:
> On Fri, Apr 24, 2020 at 5:55 AM Alex <nedomacho at gmail.com> wrote:
>
>> Hi Kevin,
>>
>> We've been having issues with Power9/V100 very similar to what Jon
>> described and basically settled on what I believe is sub-par
>> performance. We tested it on systems with ~30-50K particles and threads
>> simply cannot be pinned.
>
> What does that mean, how did you verify that?
> The Linux kernel can in general set affinities on ppc64el, whether that's
> requested by mdrun or some other tool, so if you have observed that the
> affinity mask is not respected (or it does not change), that more likely OS
> / setup issue, I'd think.
>
> What is different compared to x86 is the hardware thread layout: on Power9
> (with default Linux kernel configs) hardware threads are exposed as
> consecutive "CPUs" by the OS rather than strided by the number of cores.
>
> I could try to sum up some details on how to set affinities (with mdrun or
> external tools), if that is of interest. However, it really should be
> possible to do even through the job scheduler (given a reasonable system
> configuration).
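>
> As a quick sketch in the meantime (the core ids below are assumptions for a
> 2 x 16-core SMT4 node where CPUs 0-3 all belong to the first core), one can
> either let mdrun pin one thread per physical core:
>
>   gmx mdrun -ntmpi 1 -ntomp 16 -pin on -pinoffset 0 -pinstride 4 ...
>
> or set the mask externally and tell mdrun to leave it alone:
>
>   taskset -c 0-63:4 gmx mdrun -ntomp 16 -pin off ...
>
> and then verify what was actually applied with something like
> "taskset -cp <mdrun pid>".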
>
>
>> As far as Gromacs is concerned, our brand-new
>> Power9 nodes operate as if they were based on Intel CPUs (two threads
>> per core)
>
> Unless the hardware thread layout has been changed, that's perhaps not the
> case, see above.
>
>
>> and zero advantage is being taken of the IBM parallelization.
>>
> You mean the SMT4?
>
>
>> Other users of the same nodes reported similar issues with other
>> software, which to me suggests that our sysadmins don't really know how
>> to set these nodes up.
>>
>> At this point, if someone could figure out a clear set of build
>> instructions in combination with slurm/mdrun inputs, it would be very
>> much appreciated.
>>
> Have you checked the public documentation on ORNL's sites? GROMACS has been
> used successfully on Summit. What about IBM support?
>
> --
> Szilárd
>
>
>> Alex
>>
>> On 4/23/2020 9:37 PM, Kevin Boyd wrote:
>>> I'm not entirely sure how thread-pinning plays with slurm allocations on
>>> partial nodes. I always reserve the entire node when I use thread pinning,
>>> and run a bunch of simulations by pinning to different cores manually,
>>> rather than relying on slurm to divvy up resources for multiple jobs.
>>>
>>> Looking at both logs now, a few more points
>>>
>>> * Your benchmarks are short enough that little things like cores spinning
>>> up their frequencies can matter. I suggest running longer (increase nsteps
>>> in the mdp or at the command line), and throwing away your initial
>>> benchmark data (see -resetstep and -resethway) to avoid artifacts.
>>> * Your benchmark system is quite small for such a powerful GPU. I might
>>> expect better performance running multiple simulations per GPU if the
>>> workflows being run can rely on replicates, and a larger system would
>>> probably scale better to the V100.
>>> * The P100/Intel system appears to have pinned cores properly; it's
>>> unclear whether that had a real impact on these benchmarks.
>>> * It looks like the CPU-based computations were the primary contributors
>>> to the observed difference in performance. That should decrease or go away
>>> with increased core counts and with shifting the update phase to the GPU.
>>> It may be (I have no prior experience to indicate either way) that the
>>> Intel cores are simply better on a one-to-one basis than the Power cores.
>>> If you have 4-8 cores per simulation (try -ntomp 4 and increase the
>>> allocation of your Slurm job), the individual core performance shouldn't
>>> matter too much; with only one CPU core per GPU you are almost certainly
>>> CPU-bottlenecked, which can emphasize per-core performance differences.
>>> An example along these lines is sketched below.
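>>>
>>> For instance, something like this (a sketch only; the core counts, offset,
>>> and step numbers are illustrative, not values I have tested on your node):
>>>
>>>   gmx mdrun -s bench.tpr -ntmpi 1 -ntomp 4 -pin on -pinoffset 0 \
>>>             -pinstride 4 -nsteps 100000 -resetstep 50000
>>>
>>> run inside a Slurm allocation with at least 4 cores and 1 GPU.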
>>>
>>> Kevin
>>>
>>> On Thu, Apr 23, 2020 at 6:43 PM Jonathan D. Halverson <
>>> halverson at princeton.edu> wrote:
>>>
>>>> Hi Kevin,
>>>>
>>>> md.log for the Intel run is here:
>>>>
>>>> https://github.com/jdh4/running_gromacs/blob/master/03_benchmarks/md.log.intel-broadwell-P100
>>>> Thanks for the info on constraints with 2020. I'll try some runs with
>>>> different values of -pinoffset for 2019.6.
>>>>
>>>> I know a group at NIST is having the same or similar problems with
>>>> POWER9/V100.
>>>>
>>>> Jon
>>>> ________________________________
>>>> From: gromacs.org_gmx-users-bounces at maillist.sys.kth.se <
>>>> gromacs.org_gmx-users-bounces at maillist.sys.kth.se> on behalf of Kevin
>>>> Boyd <kevin.boyd at uconn.edu>
>>>> Sent: Thursday, April 23, 2020 9:08 PM
>>>> To: gmx-users at gromacs.org <gmx-users at gromacs.org>
>>>> Subject: Re: [gmx-users] GROMACS performance issues on POWER9/V100 node
>>>>
>>>> Hi,
>>>>
>>>> Can you post the full log for the Intel system? I typically find the
>>>> "Real cycle and time accounting" section a better place to start
>>>> debugging performance issues.
>>>>
>>>> A couple of quick notes; I need a side-by-side comparison for a more
>>>> useful analysis, and these points may apply to both systems, so they may
>>>> not be your root cause:
>>>> * At first glance, your Power system spends 1/3 of its time in constraint
>>>> calculation, which is unusual. This can be reduced in two ways: first, by
>>>> adding more CPU cores. It doesn't make a ton of sense to benchmark on one
>>>> core if your applications will use more. Second, if you upgrade to
>>>> Gromacs 2020 you can probably put the constraint calculation on the GPU
>>>> with -update gpu.
>>>> * The Power system log has this line:
>>>>
>>>> https://github.com/jdh4/running_gromacs/blob/master/03_benchmarks/md.log#L304
>>>>
>>>> indicating that threads perhaps were not actually pinned. Try adding
>>>> -pinoffset 0 (or some other core) to specify where you want the process
>>>> pinned.
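>>>>
>>>> For example (just a sketch; the offset value is an assumption and should
>>>> avoid cores already in use by other jobs on the node):
>>>>
>>>>   gmx mdrun -s bench.tpr -ntmpi 1 -ntomp 1 -pin on -pinoffset 0 -pinstride 4
>>>>
>>>> and then check whether md.log still reports that thread affinity was not
>>>> set.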
>>>>
>>>> Kevin
>>>>
>>>> On Thu, Apr 23, 2020 at 9:40 AM Jonathan D. Halverson <
>>>> halverson at princeton.edu> wrote:
>>>>
>>>>> We are finding that GROMACS (2018.x, 2019.x, 2020.x) performs worse on an
>>>>> IBM POWER9/V100 node than on an Intel Broadwell/P100 node. Both are
>>>>> running RHEL 7.7 and Slurm 19.05.5. We have no concerns about GROMACS on
>>>>> our Intel nodes. Everything below is about the POWER9/V100 node.
>>>>>
>>>>> We ran the RNASE benchmark with 2019.6, using PME and the cubic box, on 1
>>>>> CPU-core and 1 GPU (
>>>>> ftp://ftp.gromacs.org/pub/benchmarks/rnase_bench_systems.tar.gz) and
>>>>> found that the Broadwell/P100 gives 144 ns/day while the POWER9/V100
>>>>> gives 102 ns/day. The difference in performance is roughly the same for
>>>>> the larger ADH benchmark and when different numbers of CPU-cores are
>>>>> used. GROMACS is always underperforming on our POWER9/V100 nodes. We have
>>>>> pinning turned on (see the Slurm script at the bottom).
>>>>>
>>>>> Below is our build procedure on the POWER9/V100 node:
>>>>>
>>>>> version_gmx=2019.6
>>>>> wget ftp://ftp.gromacs.org/pub/gromacs/gromacs-${version_gmx}.tar.gz
>>>>> tar zxvf gromacs-${version_gmx}.tar.gz
>>>>> cd gromacs-${version_gmx}
>>>>> mkdir build && cd build
>>>>>
>>>>> module purge
>>>>> module load rh/devtoolset/7
>>>>> module load cudatoolkit/10.2
>>>>>
>>>>> OPTFLAGS="-Ofast -mcpu=power9 -mtune=power9 -mvsx -DNDEBUG"
>>>>>
>>>>> cmake3 .. -DCMAKE_BUILD_TYPE=Release \
>>>>> -DCMAKE_C_COMPILER=gcc -DCMAKE_C_FLAGS_RELEASE="$OPTFLAGS" \
>>>>> -DCMAKE_CXX_COMPILER=g++ -DCMAKE_CXX_FLAGS_RELEASE="$OPTFLAGS" \
>>>>> -DGMX_BUILD_MDRUN_ONLY=OFF -DGMX_MPI=OFF -DGMX_OPENMP=ON \
>>>>> -DGMX_SIMD=IBM_VSX -DGMX_DOUBLE=OFF \
>>>>> -DGMX_BUILD_OWN_FFTW=ON \
>>>>> -DGMX_GPU=ON -DGMX_CUDA_TARGET_SM=70 \
>>>>> -DGMX_OPENMP_MAX_THREADS=128 \
>>>>> -DCMAKE_INSTALL_PREFIX=$HOME/.local \
>>>>> -DGMX_COOL_QUOTES=OFF -DREGRESSIONTEST_DOWNLOAD=ON
>>>>>
>>>>> make -j 10
>>>>> make check
>>>>> make install
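>>>>>
>>>>> (As a quick sanity check on the build, "gmx --version" from this install
>>>>> should report "SIMD instructions: IBM_VSX" and "GPU support: CUDA".)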
>>>>>
>>>>> 45 of the 46 tests pass with the exception being HardwareUnitTests. There
>>>>> are several posts about this and apparently it is not a concern. The full
>>>>> build log is here:
>>>>>
>>>>> https://github.com/jdh4/running_gromacs/blob/master/03_benchmarks/build.log
>>>>>
>>>>> Here is more info about our POWER9/V100 node:
>>>>>
>>>>> $ lscpu
>>>>> Architecture:          ppc64le
>>>>> Byte Order:            Little Endian
>>>>> CPU(s):                128
>>>>> On-line CPU(s) list:   0-127
>>>>> Thread(s) per core:    4
>>>>> Core(s) per socket:    16
>>>>> Socket(s):             2
>>>>> NUMA node(s):          6
>>>>> Model:                 2.3 (pvr 004e 1203)
>>>>> Model name:            POWER9, altivec supported
>>>>> CPU max MHz:           3800.0000
>>>>> CPU min MHz:           2300.0000
>>>>>
>>>>> You can see that we have 4 hardware threads per physical core. If we use
>>>>> 4 hardware threads on the RNASE benchmark instead of 1, the performance
>>>>> goes up to 119 ns/day, which is still about 20% less than the
>>>>> Broadwell/P100 value. When using multiple CPU-cores on the POWER9/V100
>>>>> there is significant variation in the execution time of the code.
>>>>>
>>>>> There are four GPUs per POWER9/V100 node:
>>>>>
>>>>> $ nvidia-smi -q
>>>>> Driver Version                      : 440.33.01
>>>>> CUDA Version                        : 10.2
>>>>> GPU 00000004:04:00.0
>>>>>       Product Name                    : Tesla V100-SXM2-32GB
>>>>>
>>>>> The GPUs have been shown to perform as expected on other applications.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> The following lines are found in md.log for the POWER9/V100 run:
>>>>>
>>>>> Overriding thread affinity set outside gmx mdrun
>>>>> Pinning threads with an auto-selected logical core stride of 128
>>>>> NOTE: Thread affinity was not set.
>>>>>
>>>>> The full md.log is available here:
>>>>>
>>>>> https://github.com/jdh4/running_gromacs/blob/master/03_benchmarks/md.log
>>>>>
>>>>>
>>>>>
>>>>> Below are the MegaFlops Accounting for the POWER9/V100 versus
>>>>> Broadwell/P100:
>>>>>
>>>>> ================ IBM POWER9 WITH NVIDIA V100 ================
>>>>> Computing:                               M-Number         M-Flops  % Flops
>>>>> -----------------------------------------------------------------------------
>>>>>  Pair Search distance check             297.763872        2679.875     0.0
>>>>>  NxN Ewald Elec. + LJ [F]            244214.215808    16118138.243    98.0
>>>>>  NxN Ewald Elec. + LJ [V&F]            2483.565760      265741.536     1.6
>>>>>  1,4 nonbonded interactions              53.415341        4807.381     0.0
>>>>>  Shift-X                                  3.029040          18.174     0.0
>>>>>  Angles                                  37.043704        6223.342     0.0
>>>>>  Propers                                 55.825582       12784.058     0.1
>>>>>  Impropers                                4.220422         877.848     0.0
>>>>>  Virial                                   2.432585          43.787     0.0
>>>>>  Stop-CM                                  2.452080          24.521     0.0
>>>>>  Calc-Ekin                               48.128080        1299.458     0.0
>>>>>  Lincs                                   20.536159        1232.170     0.0
>>>>>  Lincs-Mat                              444.613344        1778.453     0.0
>>>>>  Constraint-V                           261.192228        2089.538     0.0
>>>>>  Constraint-Vir                           2.430161          58.324     0.0
>>>>>  Settle                                  73.382008       23702.389     0.1
>>>>> -----------------------------------------------------------------------------
>>>>>  Total                                                16441499.096   100.0
>>>>> -----------------------------------------------------------------------------
>>>>>
>>>>> ================ INTEL BROADWELL WITH NVIDIA P100 ================
>>>>> Computing:                               M-Number         M-Flops  % Flops
>>>>> -----------------------------------------------------------------------------
>>>>>  Pair Search distance check             271.334272        2442.008     0.0
>>>>>  NxN Ewald Elec. + LJ [F]            191599.850112    12645590.107    98.0
>>>>>  NxN Ewald Elec. + LJ [V&F]            1946.866432      208314.708     1.6
>>>>>  1,4 nonbonded interactions              53.415341        4807.381     0.0
>>>>>  Shift-X                                  3.029040          18.174     0.0
>>>>>  Bonds                                   10.541054         621.922     0.0
>>>>>  Angles                                  37.043704        6223.342     0.0
>>>>>  Propers                                 55.825582       12784.058     0.1
>>>>>  Impropers                                4.220422         877.848     0.0
>>>>>  Virial                                   2.432585          43.787     0.0
>>>>>  Stop-CM                                  2.452080          24.521     0.0
>>>>>  Calc-Ekin                               48.128080        1299.458     0.0
>>>>>  Lincs                                    9.992997         599.580     0.0
>>>>>  Lincs-Mat                               50.775228         203.101     0.0
>>>>>  Constraint-V                           240.108012        1920.864     0.0
>>>>>  Constraint-Vir                           2.323707          55.769     0.0
>>>>>  Settle                                  73.382008       23702.389     0.2
>>>>> -----------------------------------------------------------------------------
>>>>>  Total                                                12909529.017   100.0
>>>>> -----------------------------------------------------------------------------
>>>>> Some of the rows are identical between the two tables above. The largest
>>>>> difference is observed for the "NxN Ewald Elec. + LJ [F]" row.
>>>>>
>>>>>
>>>>>
>>>>> Here is our Slurm script:
>>>>>
>>>>> #!/bin/bash
>>>>> #SBATCH --job-name=gmx           # create a short name for your job
>>>>> #SBATCH --nodes=1                # node count
>>>>> #SBATCH --ntasks=1               # total number of tasks across all nodes
>>>>> #SBATCH --cpus-per-task=1        # cpu-cores per task (>1 if multi-threaded tasks)
>>>>> #SBATCH --mem=4G                 # memory per node (4G per cpu-core is default)
>>>>> #SBATCH --time=00:10:00          # total run time limit (HH:MM:SS)
>>>>> #SBATCH --gres=gpu:1             # number of gpus per node
>>>>>
>>>>> module purge
>>>>> module load cudatoolkit/10.2
>>>>>
>>>>> BCH=../rnase_cubic
>>>>> gmx grompp -f $BCH/pme_verlet.mdp -c $BCH/conf.gro -p $BCH/topol.top -o bench.tpr
>>>>> gmx mdrun -pin on -ntmpi $SLURM_NTASKS -ntomp $SLURM_CPUS_PER_TASK -s bench.tpr
>>>>>
>>>>>
>>>>>
>>>>> How do we get optimal performance out of GROMACS on our POWER9/V100 nodes?
>>>>> Jon


More information about the gromacs.org_gmx-users mailing list