[gmx-users] gromacs.org_gmx-users Digest, Vol 192, Issue 89
Jun Zhou
jun.zhou at monash.edu
Fri Apr 24 23:00:25 CEST 2020
Hi,
I use gromacs-2019.4.
> On 25 Apr 2020, at 6:54 am, gromacs.org_gmx-users-request at maillist.sys.kth.se wrote:
>
> Today's Topics:
>
> 1. Re: GROMACS performance issues on POWER9/V100 node (Szilárd Páll)
> 2. Re: GROMACS performance issues on POWER9/V100 node (Szilárd Páll)
> 3. Re: GROMACS performance issues on POWER9/V100 node (Szilárd Páll)
>
>
> ----------------------------------------------------------------------
>
> Message: 1
> Date: Fri, 24 Apr 2020 22:31:11 +0200
> From: Szilárd Páll <pall.szilard at gmail.com>
> To: Discussion list for GROMACS users <gmx-users at gromacs.org>
> Subject: Re: [gmx-users] GROMACS performance issues on POWER9/V100
> node
> Message-ID:
> <CANnYEw410kwAD9ivgCayUC_nU4i6eJ+KtK-o0ztc8W+voL=x8g at mail.gmail.com>
> Content-Type: text/plain; charset="UTF-8"
>
>> On Fri, Apr 24, 2020 at 5:55 AM Alex <nedomacho at gmail.com> wrote:
>>
>> Hi Kevin,
>>
>> We've been having issues with Power9/V100 very similar to what Jon
>> described and basically settled on what I believe is sub-par
>> performance. We tested it on systems with ~30-50K particles and threads
>> simply cannot be pinned.
>
>
> What does that mean, and how did you verify it?
> The Linux kernel can in general set affinities on ppc64el, whether that's
> requested by mdrun or some other tool, so if you have observed that the
> affinity mask is not respected (or does not change), that is more likely
> an OS / setup issue, I'd think.
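One way to check whether an affinity mask was actually applied is to read what the Linux kernel reports for the process. This is a minimal sketch, assuming a Linux /proc filesystem; the helper name allowed_cpus is made up for illustration and is not part of GROMACS or Slurm.

```shell
#!/bin/bash
# Print the CPUs a process may run on, as reported by the Linux kernel in
# /proc/<pid>/status. allowed_cpus is a hypothetical helper name.
allowed_cpus() {
    local pid="${1:-$$}"   # default: the current shell
    # The line of interest looks like "Cpus_allowed_list:<tab>0-127".
    awk '/^Cpus_allowed_list:/ {print $2}' "/proc/${pid}/status"
}

# Inside a job script this shows the mask the scheduler handed to the shell.
echo "Allowed CPUs for this shell: $(allowed_cpus)"
```

Passing the PID of a running mdrun process instead shows that process's mask; individual threads have their own entries under /proc/<pid>/task/<tid>/status, and `taskset -cp <pid>` from util-linux reports the same information.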
>
> What differs from x86 is the hardware thread layout on Power9 (with
> default Linux kernel configs): hardware threads are exposed as consecutive
> "CPUs" by the OS rather than strided by #cores.
>
> I could try to sum up some details on how to set affinities (with mdrun or
> external tools), if that is of interest. However, it really should be
> something that's possible to do even using the job scheduler (along with a
> reasonable system configuration).
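The consecutive numbering described here can be made concrete with a little arithmetic. This is a minimal sketch, assuming SMT4 with the four hardware threads of core c exposed as CPUs 4c to 4c+3; the helper names are hypothetical, and the mdrun line is only printed, not executed.

```shell
#!/bin/bash
# With consecutive numbering and `smt` hardware threads per core, core c owns
# logical CPUs c*smt .. c*smt+smt-1. Helper names are illustrative only.
cpus_of_core() {
    local core="$1" smt="$2"
    local first=$(( core * smt ))
    echo "${first}-$(( first + smt - 1 ))"
}

# First hardware thread of each of the first `ncores` cores, comma-separated.
first_threads() {
    local ncores="$1" smt="$2" list="" c
    for (( c = 0; c < ncores; c++ )); do
        list+="${list:+,}$(( c * smt ))"
    done
    echo "$list"
}

cpus_of_core 5 4      # logical CPUs belonging to core 5 on an SMT4 machine
first_threads 4 4     # one logical CPU per core for cores 0..3

# Under this numbering, one thread per physical core corresponds to a pin
# stride equal to the SMT width (assumed values, printed as a dry run):
echo "gmx mdrun -pin on -pinoffset 0 -pinstride 4 -ntomp 4 -s bench.tpr"
```

On an x86 node where sibling threads are strided by the core count, the same one-thread-per-core placement would instead use consecutive CPU numbers, which is why a pinning recipe copied from an Intel node may not carry over.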
>
>
>> As far as Gromacs is concerned, our brand-new
>> Power9 nodes operate as if they were based on Intel CPUs (two threads
>> per core)
>
>
> Unless the hardware thread layout has been changed, that's perhaps not the
> case, see above.
>
>
>> and zero advantage of IBM parallelization is being taken.
>>
>
> You mean the SMT4?
>
>
>> Other users of the same nodes reported similar issues with other
>> software, which to me suggests that our sysadmins don't really know how
>> to set these nodes up.
>>
>> At this point, if someone could figure out a clear set of build
>> instructions in combination with slurm/mdrun inputs, it would be very
>> much appreciated.
>>
>
> Have you checked public documentation on ORNL's sites? GROMACS has been
> used successfully on Summit. What about IBM support?
>
> --
> Szilárd
>
>
>>
>> Alex
>>
>>> On 4/23/2020 9:37 PM, Kevin Boyd wrote:
>>> I'm not entirely sure how thread-pinning plays with slurm allocations on
>>> partial nodes. I always reserve the entire node when I use thread pinning,
>>> and run a bunch of simulations by pinning to different cores manually,
>>> rather than relying on slurm to divvy up resources for multiple jobs.
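Pinning several simulations to different cores by hand comes down to giving each run its own -pinoffset. This is a rough sketch under assumed values (4 runs, 8 threads each, stride 1, one GPU per run); pin_offset and the run names are hypothetical, and the commands are printed rather than launched.

```shell
#!/bin/bash
# Offset (in logical CPUs) for run number `i` when every run uses `ntomp`
# threads placed `stride` logical CPUs apart. Hypothetical helper.
pin_offset() {
    local i="$1" ntomp="$2" stride="$3"
    echo $(( i * ntomp * stride ))
}

nruns=4    # assumed: one run per GPU on the node
ntomp=8    # assumed thread count per run
stride=1   # assumed: consecutive logical CPUs

for (( i = 0; i < nruns; i++ )); do
    # Dry run: print each command instead of launching it.
    echo "gmx mdrun -deffnm run${i} -ntmpi 1 -ntomp ${ntomp} -gpu_id ${i}" \
         "-pin on -pinoffset $(pin_offset "$i" "$ntomp" "$stride")" \
         "-pinstride ${stride} &"
done
echo "wait"
```

The offsets are chosen so that the CPU ranges of different runs never overlap; with a stride larger than 1 the offset grows accordingly.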
>>>
>>> Looking at both logs now, a few more points:
>>>
>>> * Your benchmarks are short enough that little things like cores spinning
>>> up frequencies can matter. I suggest running longer (increase nsteps in the
>>> mdp or at the command line), and throwing away your initial benchmark data
>>> (see -resetstep and -resethway) to avoid artifacts.
>>> * Your benchmark system is quite small for such a powerful GPU. I might
>>> expect better performance running multiple simulations per GPU if the
>>> workflows being run can rely on replicates, and a larger system would
>>> probably scale better to the V100.
>>> * The P100/Intel system appears to have pinned cores properly; it's
>>> unclear whether it had a real impact on these benchmarks.
>>> * It looks like the CPU-based computations were the primary contributors to
>>> the observed difference in performance. That should decrease or go away
>>> with increased core counts and shifting the update phase to the GPU. It may
>>> be (I have no prior experience to indicate either way) that the Intel cores
>>> are simply better on a 1-1 basis than the Power cores. If you have 4-8
>>> cores per simulation (try -ntomp 4 and increasing the allocation of your
>>> slurm job), the individual core performance shouldn't matter too much. As
>>> it stands, you're certainly bottlenecked on one CPU core per GPU, which can
>>> emphasize performance differences.
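A longer benchmark that discards the warm-up half and then reads back the result could look like the sketch below. The step count is an arbitrary example value and ns_per_day is a hypothetical helper; it relies on the "Performance:" line near the end of md.log, which lists ns/day first and hour/ns second. The run itself is guarded so the script does nothing when gmx or bench.tpr is missing.

```shell
#!/bin/bash
# Pull the ns/day figure out of an mdrun log file.
ns_per_day() {
    awk '/^Performance:/ {print $2}' "$1"
}

# Assumed benchmark settings; adjust the step and thread counts to taste.
if command -v gmx >/dev/null 2>&1 && [ -f bench.tpr ]; then
    # -resethway resets the timers halfway through, so start-up effects
    # such as clock ramp-up are excluded from the reported performance.
    gmx mdrun -s bench.tpr -nsteps 50000 -resethway -ntomp 4 -pin on -g md.log
    echo "ns/day: $(ns_per_day md.log)"
fi
```

Running the same script a few times and comparing the extracted values gives a quick feel for run-to-run variation.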
>>>
>>> Kevin
>>>
>>> On Thu, Apr 23, 2020 at 6:43 PM Jonathan D. Halverson <
>>> halverson at princeton.edu> wrote:
>>>
>>>> *Message sent from a system outside of UConn.*
>>>>
>>>>
>>>> Hi Kevin,
>>>>
>>>> md.log for the Intel run is here:
>>>>
>>>>
>>>> https://github.com/jdh4/running_gromacs/blob/master/03_benchmarks/md.log.intel-broadwell-P100
>>>>
>>>> Thanks for the info on constraints with 2020. I'll try some runs with
>>>> different values of -pinoffset for 2019.6.
>>>>
>>>> I know a group at NIST is having the same or similar problems with
>>>> POWER9/V100.
>>>>
>>>> Jon
>>>> ________________________________
>>>> From: gromacs.org_gmx-users-bounces at maillist.sys.kth.se <
>>>> gromacs.org_gmx-users-bounces at maillist.sys.kth.se> on behalf of Kevin
>>>> Boyd <kevin.boyd at uconn.edu>
>>>> Sent: Thursday, April 23, 2020 9:08 PM
>>>> To: gmx-users at gromacs.org <gmx-users at gromacs.org>
>>>> Subject: Re: [gmx-users] GROMACS performance issues on POWER9/V100 node
>>>>
>>>> Hi,
>>>>
>>>> Can you post the full log for the Intel system? I typically find the real
>>>> cycle and time accounting section a better place to start debugging
>>>> performance issues.
>>>>
>>>> A couple of quick notes, but I need a side-by-side comparison for more
>>>> useful analysis, and these points may apply to both systems so may not be
>>>> your root cause:
>>>> * At first glance, your Power system spends 1/3 of its time in constraint
>>>> calculation, which is unusual. This can be reduced in two ways. First, by
>>>> adding more CPU cores: it doesn't make a ton of sense to benchmark on one
>>>> core if your applications will use more. Second, if you upgrade to GROMACS
>>>> 2020 you can probably put the constraint calculation on the GPU with
>>>> -update gpu.
>>>> * The Power system log has this line:
>>>> https://github.com/jdh4/running_gromacs/blob/master/03_benchmarks/md.log#L304
>>>> indicating that threads perhaps were not actually pinned. Try adding
>>>> -pinoffset 0 (or some other core) to specify where you want the process
>>>> pinned.
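Whether a given run ended up pinned can be checked from the log itself by searching for the note quoted later in this thread. This is a minimal sketch; affinity_not_set is a hypothetical helper, and the log path defaults to md.log in the current directory.

```shell
#!/bin/bash
# Succeed (exit status 0) if the log contains mdrun's note that pinning failed.
affinity_not_set() {
    grep -q "Thread affinity was not set" "$1"
}

log="${1:-md.log}"
if [ -f "$log" ]; then
    if affinity_not_set "$log"; then
        echo "Pinning did not take effect; consider an explicit -pinoffset."
    else
        echo "No pinning failure note found in $log."
    fi
fi
```

This only detects the specific NOTE line; a log without it does not by itself prove that the threads landed on the intended cores.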
>>>>
>>>> Kevin
>>>>
>>>> On Thu, Apr 23, 2020 at 9:40 AM Jonathan D. Halverson <
>>>> halverson at princeton.edu> wrote:
>>>>
>>>>> *Message sent from a system outside of UConn.*
>>>>>
>>>>>
>>>>> We are finding that GROMACS (2018.x, 2019.x, 2020.x) performs worse on an
>>>>> IBM POWER9/V100 node versus an Intel Broadwell/P100. Both are running RHEL
>>>>> 7.7 and Slurm 19.05.5. We have no concerns about GROMACS on our Intel
>>>>> nodes. Everything below is about the POWER9/V100 node.
>>>>>
>>>>> We ran the RNASE benchmark with 2019.6 with PME and cubic box using 1
>>>>> CPU-core and 1 GPU (
>>>>> ftp://ftp.gromacs.org/pub/benchmarks/rnase_bench_systems.tar.gz) and
>>>>> found that the Broadwell/P100 gives 144 ns/day while POWER9/V100 gives 102
>>>>> ns/day. The difference in performance is roughly the same for the larger
>>>>> ADH benchmark and when different numbers of CPU-cores are used. GROMACS is
>>>>> always underperforming on our POWER9/V100 nodes. We have pinning turned on
>>>>> (see Slurm script at bottom).
>>>>>
>>>>> Below is our build procedure on the POWER9/V100 node:
>>>>>
>>>>> version_gmx=2019.6
>>>>> wget ftp://ftp.gromacs.org/pub/gromacs/gromacs-${version_gmx}.tar.gz
>>>>> tar zxvf gromacs-${version_gmx}.tar.gz
>>>>> cd gromacs-${version_gmx}
>>>>> mkdir build && cd build
>>>>>
>>>>> module purge
>>>>> module load rh/devtoolset/7
>>>>> module load cudatoolkit/10.2
>>>>>
>>>>> OPTFLAGS="-Ofast -mcpu=power9 -mtune=power9 -mvsx -DNDEBUG"
>>>>>
>>>>> cmake3 .. -DCMAKE_BUILD_TYPE=Release \
>>>>> -DCMAKE_C_COMPILER=gcc -DCMAKE_C_FLAGS_RELEASE="$OPTFLAGS" \
>>>>> -DCMAKE_CXX_COMPILER=g++ -DCMAKE_CXX_FLAGS_RELEASE="$OPTFLAGS" \
>>>>> -DGMX_BUILD_MDRUN_ONLY=OFF -DGMX_MPI=OFF -DGMX_OPENMP=ON \
>>>>> -DGMX_SIMD=IBM_VSX -DGMX_DOUBLE=OFF \
>>>>> -DGMX_BUILD_OWN_FFTW=ON \
>>>>> -DGMX_GPU=ON -DGMX_CUDA_TARGET_SM=70 \
>>>>> -DGMX_OPENMP_MAX_THREADS=128 \
>>>>> -DCMAKE_INSTALL_PREFIX=$HOME/.local \
>>>>> -DGMX_COOL_QUOTES=OFF -DREGRESSIONTEST_DOWNLOAD=ON
>>>>>
>>>>> make -j 10
>>>>> make check
>>>>> make install
>>>>>
>>>>> 45 of the 46 tests pass with the exception being HardwareUnitTests. There
>>>>> are several posts about this and apparently it is not a concern. The full
>>>>> build log is here:
>>>>>
>>>>> https://github.com/jdh4/running_gromacs/blob/master/03_benchmarks/build.log
>>>>>
>>>>>
>>>>> Here is more info about our POWER9/V100 node:
>>>>>
>>>>> $ lscpu
>>>>> Architecture: ppc64le
>>>>> Byte Order: Little Endian
>>>>> CPU(s): 128
>>>>> On-line CPU(s) list: 0-127
>>>>> Thread(s) per core: 4
>>>>> Core(s) per socket: 16
>>>>> Socket(s): 2
>>>>> NUMA node(s): 6
>>>>> Model: 2.3 (pvr 004e 1203)
>>>>> Model name: POWER9, altivec supported
>>>>> CPU max MHz: 3800.0000
>>>>> CPU min MHz: 2300.0000
>>>>>
>>>>> You see that we have 4 hardware threads per physical core. If we use 4
>>>>> hardware threads on the RNASE benchmark instead of 1 the performance goes
>>>>> to 119 ns/day which is still about 20% less than the Broadwell/P100 value.
>>>>> When using multiple CPU-cores on the POWER9/V100 there is significant
>>>>> variation in the execution time of the code.
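The size of the gap follows directly from the ns/day figures quoted in this message. A small sketch of the arithmetic, using awk for the floating-point division; percent_below is a hypothetical helper name.

```shell
#!/bin/bash
# Percentage by which `value` falls short of `reference`, to one decimal.
percent_below() {
    awk -v ref="$1" -v val="$2" \
        'BEGIN { printf "%.1f\n", 100 * (ref - val) / ref }'
}

percent_below 144 102   # 1 CPU-core on POWER9/V100 vs Broadwell/P100
percent_below 144 119   # 4 hardware threads on POWER9/V100
```

With these inputs the shortfall is 29.2% for the single-core run and 17.4% with four hardware threads, so "about 20%" is a rounded figure.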
>>>>>
>>>>> There are four GPUs per POWER9/V100 node:
>>>>>
>>>>> $ nvidia-smi -q
>>>>> Driver Version : 440.33.01
>>>>> CUDA Version : 10.2
>>>>> GPU 00000004:04:00.0
>>>>> Product Name : Tesla V100-SXM2-32GB
>>>>>
>>>>> The GPUs have been shown to perform as expected on other applications.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> The following lines are found in md.log for the POWER9/V100 run:
>>>>>
>>>>> Overriding thread affinity set outside gmx mdrun
>>>>> Pinning threads with an auto-selected logical core stride of 128
>>>>> NOTE: Thread affinity was not set.
>>>>>
>>>>> The full md.log is available here:
>>>>>
>>>>> https://github.com/jdh4/running_gromacs/blob/master/03_benchmarks/md.log
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> Below are the MegaFlops Accounting for the POWER9/V100 versus
>>>>> Broadwell/P100:
>>>>>
>>>>> ================ IBM POWER9 WITH NVIDIA V100 ================
>>>>> Computing:                         M-Number         M-Flops  % Flops
>>>>> --------------------------------------------------------------------
>>>>> Pair Search distance check       297.763872        2679.875      0.0
>>>>> NxN Ewald Elec. + LJ [F]      244214.215808    16118138.243     98.0
>>>>> NxN Ewald Elec. + LJ [V&F]      2483.565760      265741.536      1.6
>>>>> 1,4 nonbonded interactions        53.415341        4807.381      0.0
>>>>> Shift-X                            3.029040          18.174      0.0
>>>>> Angles                            37.043704        6223.342      0.0
>>>>> Propers                           55.825582       12784.058      0.1
>>>>> Impropers                          4.220422         877.848      0.0
>>>>> Virial                             2.432585          43.787      0.0
>>>>> Stop-CM                            2.452080          24.521      0.0
>>>>> Calc-Ekin                         48.128080        1299.458      0.0
>>>>> Lincs                             20.536159        1232.170      0.0
>>>>> Lincs-Mat                        444.613344        1778.453      0.0
>>>>> Constraint-V                     261.192228        2089.538      0.0
>>>>> Constraint-Vir                     2.430161          58.324      0.0
>>>>> Settle                            73.382008       23702.389      0.1
>>>>> --------------------------------------------------------------------
>>>>> Total                                          16441499.096    100.0
>>>>> --------------------------------------------------------------------
>>>>>
>>>>> ================ INTEL BROADWELL WITH NVIDIA P100 ================
>>>>> Computing:                         M-Number         M-Flops  % Flops
>>>>> --------------------------------------------------------------------
>>>>> Pair Search distance check       271.334272        2442.008      0.0
>>>>> NxN Ewald Elec. + LJ [F]      191599.850112    12645590.107     98.0
>>>>> NxN Ewald Elec. + LJ [V&F]      1946.866432      208314.708      1.6
>>>>> 1,4 nonbonded interactions        53.415341        4807.381      0.0
>>>>> Shift-X                            3.029040          18.174      0.0
>>>>> Bonds                             10.541054         621.922      0.0
>>>>> Angles                            37.043704        6223.342      0.0
>>>>> Propers                           55.825582       12784.058      0.1
>>>>> Impropers                          4.220422         877.848      0.0
>>>>> Virial                             2.432585          43.787      0.0
>>>>> Stop-CM                            2.452080          24.521      0.0
>>>>> Calc-Ekin                         48.128080        1299.458      0.0
>>>>> Lincs                              9.992997         599.580      0.0
>>>>> Lincs-Mat                         50.775228         203.101      0.0
>>>>> Constraint-V                     240.108012        1920.864      0.0
>>>>> Constraint-Vir                     2.323707          55.769      0.0
>>>>> Settle                            73.382008       23702.389      0.2
>>>>> --------------------------------------------------------------------
>>>>> Total                                          12909529.017    100.0
>>>>> --------------------------------------------------------------------
>>>>> Some of the rows are identical between the two tables above. The largest
>>>>> difference is observed for the "NxN Ewald Elec. + LJ [F]" row.
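To put a number on that difference, the M-Flops entries of the two tables can be divided directly. This is a small sketch of the arithmetic only; ratio is a hypothetical helper, and the inputs are copied from the tables quoted above.

```shell
#!/bin/bash
# Ratio of two M-Flops counts, rounded to two decimals.
ratio() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.2f\n", a / b }'
}

ratio 16118138.243 12645590.107   # NxN Ewald Elec. + LJ [F], POWER9 / Broadwell
ratio 16441499.096 12909529.017   # Total, POWER9 / Broadwell
```

Both ratios come out at about 1.27. These entries count operations performed during the run rather than speed, so the ratio on its own does not say which node is faster.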
>>>>>
>>>>>
>>>>>
>>>>> Here is our Slurm script:
>>>>>
>>>>> #!/bin/bash
>>>>> #SBATCH --job-name=gmx           # create a short name for your job
>>>>> #SBATCH --nodes=1                # node count
>>>>> #SBATCH --ntasks=1               # total number of tasks across all nodes
>>>>> #SBATCH --cpus-per-task=1        # cpu-cores per task (>1 if multi-threaded tasks)
>>>>> #SBATCH --mem=4G                 # memory per node (4G per cpu-core is default)
>>>>> #SBATCH --time=00:10:00          # total run time limit (HH:MM:SS)
>>>>> #SBATCH --gres=gpu:1             # number of gpus per node
>>>>>
>>>>> module purge
>>>>> module load cudatoolkit/10.2
>>>>>
>>>>> BCH=../rnase_cubic
>>>>> gmx grompp -f $BCH/pme_verlet.mdp -c $BCH/conf.gro -p $BCH/topol.top -o bench.tpr
>>>>> gmx mdrun -pin on -ntmpi $SLURM_NTASKS -ntomp $SLURM_CPUS_PER_TASK -s bench.tpr
>>>>>
>>>>>
>>>>>
>>>>> How do we get optimal performance out of GROMACS on our POWER9/V100 nodes?
>>>>>
>>>>> Jon
>>>>> --
>>>>> Gromacs Users mailing list
>>>>>
>>>>> * Please search the archive at
>>>>> http://www.gromacs.org/Support/Mailing_Lists/GMX-Users_List before
>>>>> posting!
>>>>>
>>>>> * Can't post? Read http://www.gromacs.org/Support/Mailing_Lists
>>>>>
>>>>> * For (un)subscribe requests visit
>>>>> https://maillist.sys.kth.se/mailman/listinfo/gromacs.org_gmx-users or
>>>>> send a mail to gmx-users-request at gromacs.org.
>>>>>
>
>
> ------------------------------
>
> Message: 2
> Date: Fri, 24 Apr 2020 22:52:48 +0200
> From: Szilárd Páll <pall.szilard at gmail.com>
> To: Discussion list for GROMACS users <gmx-users at gromacs.org>
> Cc: "gromacs.org_gmx-users at maillist.sys.kth.se"
> <gromacs.org_gmx-users at maillist.sys.kth.se>
> Subject: Re: [gmx-users] GROMACS performance issues on POWER9/V100
> node
> Message-ID:
> <CANnYEw6j7b5FSJLRkDi7z2paHKo_rBvFt173kfUZK6+c7gUQwA at mail.gmail.com>
> Content-Type: text/plain; charset="UTF-8"
>
>> The following lines are found in md.log for the POWER9/V100 run:
>>
>> Overriding thread affinity set outside gmx mdrun
>> Pinning threads with an auto-selected logical core stride of 128
>> NOTE: Thread affinity was not set.
>>
>> The full md.log is available here:
>> https://github.com/jdh4/running_gromacs/blob/master/03_benchmarks/md.log
>
>
> I glanced over that at first, will see if I can reproduce it, though I only
> have access to a Raptor Talos, not an IBM machine with Ubuntu.
>
> What OS are you using?
>
>
>
>
> ------------------------------
>
> End of gromacs.org_gmx-users Digest, Vol 192, Issue 89
> ******************************************************