[gmx-users] GROMACS performance issues on POWER9/V100 node
Alex
nedomacho at gmail.com
Fri Apr 24 05:54:47 CEST 2020
Hi Kevin,

We've been having issues with POWER9/V100 very similar to what Jon
described and have basically settled for what I believe is sub-par
performance. We tested systems with ~30-50K particles, and threads
simply cannot be pinned. As far as GROMACS is concerned, our brand-new
POWER9 nodes operate as if they were based on Intel CPUs (two threads
per core), and no advantage is taken of IBM's SMT4 (four hardware
threads per core). Other users of the same nodes reported similar
issues with other software, which suggests to me that our sysadmins
don't really know how to set these nodes up.

At this point, if someone could figure out a clear set of build
instructions in combination with Slurm/mdrun inputs, it would be very
much appreciated.

Alex

On 4/23/2020 9:37 PM, Kevin Boyd wrote:
> I'm not entirely sure how thread pinning plays with Slurm allocations on
> partial nodes. I always reserve the entire node when I use thread pinning,
> and run a bunch of simulations by pinning to different cores manually,
> rather than relying on Slurm to divvy up resources for multiple jobs.
>
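One way to lay out such manual pinning on a fully reserved node, as a hedged sketch rather than a tested recipe. It assumes the 128 logical CPUs and 4 GPUs reported for the POWER9 node further down, four simulations with 8 OpenMP threads each, and that the four hardware threads of a core carry consecutive CPU numbers. The sim0..sim3 file names are made up. The script only prints the commands, it does not launch anything:

```shell
#!/bin/sh
# Dry run: print one mdrun command per simulation instead of launching it.
# All three values below are assumptions for illustration.
nsim=4      # simulations sharing the node (one per GPU)
ntomp=8     # OpenMP threads per simulation
stride=4    # logical CPUs per physical core (lscpu "Thread(s) per core")

print_pinned_cmds() (
    i=0
    while [ "$i" -lt "$nsim" ]; do
        # Each simulation spans ntomp*stride consecutive logical CPUs,
        # so the next one starts right after that range.
        offset=$((i * ntomp * stride))
        echo "gmx mdrun -deffnm sim$i -ntmpi 1 -ntomp $ntomp -pin on -pinoffset $offset -pinstride $stride -gpu_id $i"
        i=$((i + 1))
    done
)

print_pinned_cmds
```

With these numbers the offsets come out as 0, 32, 64 and 96, so the four runs cover logical CPUs 0-127 without overlap. Whether GPU i is actually closest to that block of cores depends on the node topology and is not checked here.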
> Looking at both logs now, a few more points:
>
> * Your benchmarks are short enough that little things like cores spinning
> up frequencies can matter. I suggest running longer (increase nsteps in the
> mdp or at the command line) and throwing away your initial benchmark data
> (see -resetstep and -resethway) to avoid artifacts.
> * Your benchmark system is quite small for such a powerful GPU. I might
> expect better performance running multiple simulations per GPU if the
> workflows being run can rely on replicates, and a larger system would
> probably scale better to the V100.
> * The P100/Intel system appears to have pinned cores properly; it's
> unclear whether that had a real impact on these benchmarks.
> * It looks like the CPU-based computations were the primary contributors to
> the observed difference in performance. That should decrease or go away
> with increased core counts and shifting the update phase to the GPU. It may
> be (I have no prior experience to indicate either way) that the Intel cores
> are simply better on a 1-1 basis than the Power cores. If you have 4-8
> cores per simulation (try -ntomp 4 and increasing the allocation of your
> Slurm job), the individual core performance shouldn't matter too
> much. As it stands, you're certainly bottlenecked on one CPU core per GPU,
> which can emphasize performance differences.
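The benchmarking suggestions above could be combined into a single mdrun call along these lines. This is a minimal dry-run sketch: the helper name and the step count are assumptions, and bench.tpr is the file name from Jon's job script. It prints the command rather than running it:

```shell
#!/bin/sh
# Dry run: print a benchmark command that follows the suggestions above.
bench_cmd() (
    ntomp=$1     # OpenMP threads, e.g. 4 as suggested above
    nsteps=$2    # run length in MD steps; overrides nsteps from the .mdp
    # -resethway resets the performance counters halfway through the run,
    # so start-up effects are left out of the reported ns/day.
    # -noconfout skips writing the final configuration, which a benchmark
    # does not need.
    echo "gmx mdrun -s bench.tpr -ntmpi 1 -ntomp $ntomp -pin on -nsteps $nsteps -resethway -noconfout"
)

bench_cmd 4 100000
```

The Slurm allocation has to grow with -ntomp (for example --cpus-per-task=4), otherwise the extra threads end up sharing the single allocated core.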
>
> Kevin
>
> On Thu, Apr 23, 2020 at 6:43 PM Jonathan D. Halverson <
> halverson at princeton.edu> wrote:
>
>> *Message sent from a system outside of UConn.*
>>
>>
>> Hi Kevin,
>>
>> md.log for the Intel run is here:
>>
>> https://github.com/jdh4/running_gromacs/blob/master/03_benchmarks/md.log.intel-broadwell-P100
>>
>> Thanks for the info on constraints with 2020. I'll try some runs with
>> different values of -pinoffset for 2019.6.
>>
>> I know a group at NIST is having the same or similar problems with
>> POWER9/V100.
>>
>> Jon
>> ________________________________
>> From: gromacs.org_gmx-users-bounces at maillist.sys.kth.se <
>> gromacs.org_gmx-users-bounces at maillist.sys.kth.se> on behalf of Kevin
>> Boyd <kevin.boyd at uconn.edu>
>> Sent: Thursday, April 23, 2020 9:08 PM
>> To: gmx-users at gromacs.org <gmx-users at gromacs.org>
>> Subject: Re: [gmx-users] GROMACS performance issues on POWER9/V100 node
>>
>> Hi,
>>
>> Can you post the full log for the Intel system? I typically find the real
>> cycle and time accounting section a better place to start debugging
>> performance issues.
>>
>> A couple of quick notes, though I'd need a side-by-side comparison for a
>> more useful analysis, and these points may apply to both systems so may
>> not be your root cause:
>> * At first glance, your Power system spends 1/3 of its time in constraint
>> calculation, which is unusual. This can be reduced in two ways. First, by
>> adding more CPU cores: it doesn't make a ton of sense to benchmark on one
>> core if your applications will use more. Second, if you upgrade to GROMACS
>> 2020, you can probably put the constraint calculation on the GPU with
>> -update gpu.
>> * The Power system log has this line:
>> https://github.com/jdh4/running_gromacs/blob/master/03_benchmarks/md.log#L304
>> indicating that threads perhaps were not actually pinned. Try adding
>> -pinoffset 0 (or some other core) to specify where you want the process
>> pinned.
>>
>> Kevin
>>
>> On Thu, Apr 23, 2020 at 9:40 AM Jonathan D. Halverson <
>> halverson at princeton.edu> wrote:
>>
>>> We are finding that GROMACS (2018.x, 2019.x, 2020.x) performs worse on an
>>> IBM POWER9/V100 node versus an Intel Broadwell/P100. Both are running RHEL
>>> 7.7 and Slurm 19.05.5. We have no concerns about GROMACS on our Intel
>>> nodes. Everything below is about the POWER9/V100 node.
>>>
>>> We ran the RNASE benchmark with 2019.6 with PME and cubic box using 1
>>> CPU-core and 1 GPU (
>>> ftp://ftp.gromacs.org/pub/benchmarks/rnase_bench_systems.tar.gz) and
>>> found that the Broadwell/P100 gives 144 ns/day while POWER9/V100 gives 102
>>> ns/day. The difference in performance is roughly the same for the larger
>>> ADH benchmark and when different numbers of CPU-cores are used. GROMACS is
>>> always underperforming on our POWER9/V100 nodes. We have pinning turned on
>>> (see Slurm script at bottom).
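To put the two ns/day figures above on one scale, the relative shortfall can be worked out with a one-line awk call. A small sketch; the helper name pct_slower is made up for illustration:

```shell
#!/bin/sh
# pct_slower REF VALUE: print by how many percent VALUE falls short of REF.
pct_slower() {
    awk -v ref="$1" -v val="$2" 'BEGIN { printf "%.1f\n", (ref - val) / ref * 100 }'
}

# Broadwell/P100 at 144 ns/day versus POWER9/V100 at 102 ns/day (1 CPU-core)
pct_slower 144 102
```

For the single-core runs this gives a shortfall of roughly 29% on the POWER9/V100 node.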
>>>
>>> Below is our build procedure on the POWER9/V100 node:
>>>
>>> version_gmx=2019.6
>>> wget ftp://ftp.gromacs.org/pub/gromacs/gromacs-${version_gmx}.tar.gz
>>> tar zxvf gromacs-${version_gmx}.tar.gz
>>> cd gromacs-${version_gmx}
>>> mkdir build && cd build
>>>
>>> module purge
>>> module load rh/devtoolset/7
>>> module load cudatoolkit/10.2
>>>
>>> OPTFLAGS="-Ofast -mcpu=power9 -mtune=power9 -mvsx -DNDEBUG"
>>>
>>> cmake3 .. -DCMAKE_BUILD_TYPE=Release \
>>> -DCMAKE_C_COMPILER=gcc -DCMAKE_C_FLAGS_RELEASE="$OPTFLAGS" \
>>> -DCMAKE_CXX_COMPILER=g++ -DCMAKE_CXX_FLAGS_RELEASE="$OPTFLAGS" \
>>> -DGMX_BUILD_MDRUN_ONLY=OFF -DGMX_MPI=OFF -DGMX_OPENMP=ON \
>>> -DGMX_SIMD=IBM_VSX -DGMX_DOUBLE=OFF \
>>> -DGMX_BUILD_OWN_FFTW=ON \
>>> -DGMX_GPU=ON -DGMX_CUDA_TARGET_SM=70 \
>>> -DGMX_OPENMP_MAX_THREADS=128 \
>>> -DCMAKE_INSTALL_PREFIX=$HOME/.local \
>>> -DGMX_COOL_QUOTES=OFF -DREGRESSIONTEST_DOWNLOAD=ON
>>>
>>> make -j 10
>>> make check
>>> make install
>>>
>>> 45 of the 46 tests pass, the exception being HardwareUnitTests. There
>>> are several posts about this and apparently it is not a concern. The full
>>> build log is here:
>>>
>>> https://github.com/jdh4/running_gromacs/blob/master/03_benchmarks/build.log
>>>
>>>
>>> Here is more info about our POWER9/V100 node:
>>>
>>> $ lscpu
>>> Architecture: ppc64le
>>> Byte Order: Little Endian
>>> CPU(s): 128
>>> On-line CPU(s) list: 0-127
>>> Thread(s) per core: 4
>>> Core(s) per socket: 16
>>> Socket(s): 2
>>> NUMA node(s): 6
>>> Model: 2.3 (pvr 004e 1203)
>>> Model name: POWER9, altivec supported
>>> CPU max MHz: 3800.0000
>>> CPU min MHz: 2300.0000
>>>
>>> You see that we have 4 hardware threads per physical core. If we use 4
>>> hardware threads on the RNASE benchmark instead of 1, the performance goes
>>> to 119 ns/day, which is still about 17% less than the Broadwell/P100 value.
>>> When using multiple CPU-cores on the POWER9/V100 there is significant
>>> variation in the execution time of the code.
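The thread count per core matters for pinning because -pinstride is given in logical CPUs. A small sketch that reads the value from lscpu-style text; it assumes the logical CPUs of one physical core are numbered consecutively (0-3 on core 0, 4-7 on core 1, and so on), which is worth confirming with lscpu -e on the node. The function name is hypothetical:

```shell
#!/bin/sh
# Read lscpu-style text on stdin and print the "Thread(s) per core" value.
# Under the numbering assumption stated above, this value can be passed
# to mdrun as -pinstride to place one thread per physical core.
threads_per_core() {
    awk -F: '/^Thread\(s\) per core/ { gsub(/[ \t]/, "", $2); print $2 }'
}

# Example fed with the values reported above.
# On a live node, use: lscpu | threads_per_core
printf 'CPU(s): 128\nThread(s) per core: 4\nCore(s) per socket: 16\nSocket(s): 2\n' | threads_per_core
```

With a value of 4, -pinstride 4 would spread threads over separate physical cores, while -pinstride 1 would fill all four hardware threads of one core before moving to the next.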
>>>
>>> There are four GPUs per POWER9/V100 node:
>>>
>>> $ nvidia-smi -q
>>> Driver Version : 440.33.01
>>> CUDA Version : 10.2
>>> GPU 00000004:04:00.0
>>> Product Name : Tesla V100-SXM2-32GB
>>>
>>> The GPUs have been shown to perform as expected on other applications.
>>>
>>>
>>>
>>>
>>> The following lines are found in md.log for the POWER9/V100 run:
>>>
>>> Overriding thread affinity set outside gmx mdrun
>>> Pinning threads with an auto-selected logical core stride of 128
>>> NOTE: Thread affinity was not set.
>>>
>>> The full md.log is available here:
>>> https://github.com/jdh4/running_gromacs/blob/master/03_benchmarks/md.log
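A quick way to scan a log for the messages quoted above, sketched as a small shell helper. The helper name is hypothetical, and the "pinned" verdict is only a heuristic built on the strings shown here, not an official GROMACS check:

```shell
#!/bin/sh
# check_pinning LOGFILE: report what an md.log says about thread pinning.
check_pinning() (
    log=$1
    if grep -q "NOTE: Thread affinity was not set" "$log"; then
        # mdrun tried to pin but the affinity call did not take effect
        echo "not pinned"
    elif grep -q "Pinning threads with" "$log"; then
        # pinning was attempted and no failure note was logged
        echo "pinned"
    else
        echo "unknown"
    fi
)
```

Calling check_pinning md.log on the POWER9 log linked above would report "not pinned", since that log contains the NOTE line.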
>>>
>>>
>>>
>>>
>>> Below is the MegaFlops accounting for the POWER9/V100 versus
>>> Broadwell/P100:
>>>
>>> ================ IBM POWER9 WITH NVIDIA V100 ================
>>> Computing:                         M-Number        M-Flops  % Flops
>>> -------------------------------------------------------------------
>>> Pair Search distance check       297.763872       2679.875      0.0
>>> NxN Ewald Elec. + LJ [F]      244214.215808   16118138.243     98.0
>>> NxN Ewald Elec. + LJ [V&F]      2483.565760     265741.536      1.6
>>> 1,4 nonbonded interactions        53.415341       4807.381      0.0
>>> Shift-X                            3.029040         18.174      0.0
>>> Angles                            37.043704       6223.342      0.0
>>> Propers                           55.825582      12784.058      0.1
>>> Impropers                          4.220422        877.848      0.0
>>> Virial                             2.432585         43.787      0.0
>>> Stop-CM                            2.452080         24.521      0.0
>>> Calc-Ekin                         48.128080       1299.458      0.0
>>> Lincs                             20.536159       1232.170      0.0
>>> Lincs-Mat                        444.613344       1778.453      0.0
>>> Constraint-V                     261.192228       2089.538      0.0
>>> Constraint-Vir                     2.430161         58.324      0.0
>>> Settle                            73.382008      23702.389      0.1
>>> -------------------------------------------------------------------
>>> Total                                         16441499.096    100.0
>>> -------------------------------------------------------------------
>>>
>>> ================ INTEL BROADWELL WITH NVIDIA P100 ================
>>> Computing:                         M-Number        M-Flops  % Flops
>>> -------------------------------------------------------------------
>>> Pair Search distance check       271.334272       2442.008      0.0
>>> NxN Ewald Elec. + LJ [F]      191599.850112   12645590.107     98.0
>>> NxN Ewald Elec. + LJ [V&F]      1946.866432     208314.708      1.6
>>> 1,4 nonbonded interactions        53.415341       4807.381      0.0
>>> Shift-X                            3.029040         18.174      0.0
>>> Bonds                             10.541054        621.922      0.0
>>> Angles                            37.043704       6223.342      0.0
>>> Propers                           55.825582      12784.058      0.1
>>> Impropers                          4.220422        877.848      0.0
>>> Virial                             2.432585         43.787      0.0
>>> Stop-CM                            2.452080         24.521      0.0
>>> Calc-Ekin                         48.128080       1299.458      0.0
>>> Lincs                              9.992997        599.580      0.0
>>> Lincs-Mat                         50.775228        203.101      0.0
>>> Constraint-V                     240.108012       1920.864      0.0
>>> Constraint-Vir                     2.323707         55.769      0.0
>>> Settle                            73.382008      23702.389      0.2
>>> -------------------------------------------------------------------
>>> Total                                         12909529.017    100.0
>>> -------------------------------------------------------------------
>>>
>>> Some of the rows are identical between the two tables above. The largest
>>> difference is observed for the "NxN Ewald Elec. + LJ [F]" row.
>>>
>>>
>>>
>>> Here is our Slurm script:
>>>
>>> #!/bin/bash
>>> #SBATCH --job-name=gmx           # create a short name for your job
>>> #SBATCH --nodes=1                # node count
>>> #SBATCH --ntasks=1               # total number of tasks across all nodes
>>> #SBATCH --cpus-per-task=1        # cpu-cores per task (>1 if multi-threaded tasks)
>>> #SBATCH --mem=4G                 # memory per node (4G per cpu-core is default)
>>> #SBATCH --time=00:10:00          # total run time limit (HH:MM:SS)
>>> #SBATCH --gres=gpu:1             # number of gpus per node
>>>
>>> module purge
>>> module load cudatoolkit/10.2
>>>
>>> BCH=../rnase_cubic
>>> gmx grompp -f $BCH/pme_verlet.mdp -c $BCH/conf.gro -p $BCH/topol.top \
>>>     -o bench.tpr
>>> gmx mdrun -pin on -ntmpi $SLURM_NTASKS -ntomp $SLURM_CPUS_PER_TASK \
>>>     -s bench.tpr
>>>
>>>
>>>
>>> How do we get optimal performance out of GROMACS on our POWER9/V100 nodes?
>>>
>>> Jon
>>> --
>>> Gromacs Users mailing list
>>>
>>> * Please search the archive at
>>> http://www.gromacs.org/Support/Mailing_Lists/GMX-Users_List before
>>> posting!
>>>
>>> * Can't post? Read http://www.gromacs.org/Support/Mailing_Lists
>>>
>>> * For (un)subscribe requests visit
>>> https://maillist.sys.kth.se/mailman/listinfo/gromacs.org_gmx-users or
>>> send a mail to gmx-users-request at gromacs.org.