[gmx-developers] 2018-beta1: Performance of membrane simulations
Szilárd Páll
pall.szilard at gmail.com
Fri Feb 9 18:36:08 CET 2018
Hi,
Quite belated, but let me clarify the flops vs. execution time question a bit:

Flop counts are not a good indicator of the wall-time required to execute a
kernel (and that is increasingly the case on modern hardware). Many
algorithms, including several in MD, are not flop- but memory-bound (most
notably PME, integration, etc.). For that reason, most code cannot reach
anywhere near the peak theoretical instruction rate (let alone floating-point
rate) of a CPU; instead it is limited by the rate at which the operands of
the arithmetic instructions can be loaded from memory (and in that case the
flops can essentially be considered "free").
In our case, although the short-range nonbonded computation can represent
>90% of the flops per step, these kernels will not take up anywhere near
that fraction of the runtime, not even in CPU-only runs. That is because
they are the most arithmetically intensive of all the ingredients of a
typical MD run (especially with the MxN Verlet scheme) and therefore execute
at a higher flop rate than the rest of the code. Consequently, they take up
relatively less of the wall-time than their flop share would suggest. PME,
for instance, is strongly memory-bound, and even the bonded kernels are not
in the arithmetically bound regime.
Therefore, when looking for bottlenecks it is best to look at the wall-time
(or cycles) reported in the performance table (note that you can build with
-DGMX_CYCLE_SUBCOUNTERS=ON to get a further breakdown of some of the
aggregate counters). Additionally, when using GPU offload, the rate-limiting
computation can run on the CPU, on the GPU, or on both.
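(For reference, the subcounters are a compile-time option, so enabling them
would look roughly like the sketch below; the source path stands in for
whatever your usual build setup is.)

    cmake /path/to/gromacs -DGMX_CYCLE_SUBCOUNTERS=ON   # plus your usual options
    make -j
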
That brings us to your performance tables, which show that with both 4 and
10 cores the runs are CPU-bound, as the "Wait GPU" entries (the time the CPU
spent waiting for GPU results) are negligible in both cases.
Cheers,
--
Szilárd
On Fri, Dec 1, 2017 at 11:44 PM, Magnus Lundborg
<magnus.lundborg at scilifelab.se> wrote:
> Hi,
>
> I guess it would be of general interest, so it could be worth considering,
> but what goes into the release is not in my hands. However, I think most
> people (probably including you) would not see quite as dramatic a gain, as
> my system contains very little water, which means that bonded interactions
> make up a large portion of the work.
>
> In the long run I think it would be good to both make UB SIMD optimised and
> to move all bondeds to GPUs. But that's for later.
>
> Cheers,
>
> Magnus
>
> On 1 Dec 2017 at 23:21, "Jochen Hub" <jhub at gwdg.de> wrote:
>>
>> Hi Magnus,
>>
>> many thanks, and impressive, that clarifies my question.
>>
>> Since UB has such a drastic effect on performance, maybe you can convince
>> the other developers to make an exception and get a UB-SIMD patch into 2018?
>>
>> I understand that normally only bug fixes go in after the beta release,
>> but is a 33% performance loss (a 50% gain) not close to a bug?
>>
>> Cheers,
>> Jochen
>>
>> On 01.12.17 at 22:54, Magnus Lundborg wrote:
>>>
>>> Hi,
>>>
>>> I'm running simulations with the CHARMM force field, which also uses UB,
>>> and have experienced similar things. Apparently the flop count in the
>>> first table does not reflect the actual time spent on the calculations,
>>> if I understood the explanations correctly. So it's the Force row in the
>>> second table that contains the bonded forces (with long-range and PME on
>>> the GPU). I tried making a SIMD version of UB (only standard angles are
>>> SIMD optimised) and got almost a 50% performance gain. Making the bonds
>>> use SIMD as well gives only an additional 1 or 2%. My patch is just a
>>> draft, as it's not clear what future SIMD functions should look like, but
>>> I'll share it with you so that you can try it. However, it won't be in
>>> the next release, I guess.
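>>>
>>> (For illustration only, here is a rough, generic sketch of what such a
>>> kernel boils down to; this is plain C++ laid out structure-of-arrays so
>>> it vectorises, not the actual patch and not the GROMACS SIMD layer.)
>>>
>>>     #include <cmath>
>>>     #include <cstddef>
>>>
>>>     // Evaluate a batch of Urey-Bradley 1-3 terms, V = 0.5*k*(r13 - r0)^2.
>>>     // Contiguous, branch-free data lets the compiler (or an explicit SIMD
>>>     // layer) process several terms per instruction.
>>>     void ubForcesBatch(std::size_t n,
>>>                        const float* dx, const float* dy, const float* dz, // 1-3 vectors
>>>                        const float* r0, const float* kUB,                 // parameters
>>>                        float* fx, float* fy, float* fz)                   // force on atom 1 (atom 3 gets -f)
>>>     {
>>>         for (std::size_t i = 0; i < n; ++i)
>>>         {
>>>             const float r2    = dx[i]*dx[i] + dy[i]*dy[i] + dz[i]*dz[i];
>>>             const float invR  = 1.0f / std::sqrt(r2);
>>>             const float r     = r2 * invR;
>>>             const float fscal = -kUB[i] * (r - r0[i]) * invR; // -dV/dr divided by r
>>>             fx[i] = fscal * dx[i];
>>>             fy[i] = fscal * dy[i];
>>>             fz[i] = fscal * dz[i];
>>>         }
>>>     }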
>>>
>>> Cheers,
>>>
>>> Magnus
>>>
>>>
>>> On 1 Dec 2017 at 22:34, "Jochen Hub" <jhub at gwdg.de> wrote:
>>>
>>> Dear developers,
>>>
>>> I started a thread in the user list yesterday (and Szilard already
>>> gave a quick answer) but I felt this point is relevant for the
>>> developers list.
>>>
>>> We did some benchmarks with 2018-beta1 running PME on the GPU -
>>> overall fantastic (!!) - but we just don't understand the performance of
>>> lipid membrane simulations (Slipids or Charmm36, with UB
>>> potentials). They contain roughly 50% lipid, 50% water atoms. Please
>>> see here:
>>>
>>> http://cmb.bio.uni-goettingen.de/bench.pdf
>>>
>>> As you can see in the linked PDF, the Slipid simulations are limited by
>>> the CPU up to 10 (!) quite strong Xeon cores when using a GTX 1080.
>>> Szilard pointed out that this is probably due to the bonded UB
>>> interactions - however, they account for only 0.2% of the flops; see the
>>> log output pasted below for ntomp=4 and 10 (for a 128-Slipid system
>>> with a 1 nm cutoff). The flops summary is nearly the same for ntomp=4
>>> and 10, so only the ntomp=4 case is shown below.
>>>
>>> In contrast, protein simulations (whether membrane protein or purely
>>> in water) behave as one hopes, showing that we can buy a cheap CPU
>>> when doing PME on the GPU.
>>>
>>> So my question is: Is this expected? Is this really due to
>>> Urey-Bradley, or maybe due to constraints? In case UB is limiting,
>>> are there any plans to also port it to the GPU in the future?
>>>
>>> This also has an impact on hardware choices: depending on whether you
>>> run protein or membrane simulations, you need to buy different hardware.
>>>
>>> Many thanks for any input, and many thanks again for the fabulous
>>> work on 2018!
>>>
>>> Jochen
>>>
>>>
>>> Computing:                            M-Number         M-Flops  % Flops
>>> -----------------------------------------------------------------------------
>>> Pair Search distance check          151.929968        1367.370      0.0
>>> NxN Ewald Elec. + LJ [F]         157598.160192    10401478.573     97.2
>>> NxN Ewald Elec. + LJ [V&F]         1623.781504      173744.621      1.6
>>> 1,4 nonbonded interactions          200.360064       18032.406      0.2
>>> Shift-X                               1.553664           9.322      0.0
>>> Propers                             246.449280       56436.885      0.5
>>> Impropers                             1.280256         266.293      0.0
>>> Virial                                7.657759         137.840      0.0
>>> Stop-CM                               1.553664          15.537      0.0
>>> P-Coupling                            7.646464          45.879      0.0
>>> Calc-Ekin                            15.262464         412.087      0.0
>>> Lincs                                74.894976        4493.699      0.0
>>> Lincs-Mat                          1736.027136        6944.109      0.1
>>> Constraint-V                        226.605312        1812.842      0.0
>>> Constraint-Vir                        7.614336         182.744      0.0
>>> Settle                               25.605120        8270.454      0.1
>>> Urey-Bradley                        144.668928       26474.414      0.2
>>> -----------------------------------------------------------------------------
>>> Total                                            10700125.072    100.0
>>> -----------------------------------------------------------------------------
>>>
>>>
>>> R E A L C Y C L E A N D T I M E A C C O U N T I N G
>>>
>>> On 1 MPI rank, each using 4 OpenMP threads
>>>
>>> Computing:             Num    Num     Call    Wall time   Giga-Cycles
>>>                        Ranks  Threads Count      (s)      total sum      %
>>> -----------------------------------------------------------------------------
>>> Neighbor search          1      4        51      0.260       2.284     2.4
>>> Launch GPU ops.          1      4     10002      0.591       5.191     5.4
>>> Force                    1      4      5001      7.314      64.211    67.1
>>> Wait PME GPU gather      1      4      5001      0.071       0.626     0.7
>>> Reduce GPU PME F         1      4      5001      0.078       0.684     0.7
>>> Wait GPU NB local        1      4      5001      0.017       0.151     0.2
>>> NB X/F buffer ops.       1      4      9951      0.321       2.822     2.9
>>> Write traj.              1      4         2      0.117       1.026     1.1
>>> Update                   1      4      5001      0.199       1.749     1.8
>>> Constraints              1      4      5001      1.853      16.270    17.0
>>> Rest                                              0.085       0.743     0.8
>>> -----------------------------------------------------------------------------
>>> Total                                            10.907      95.757   100.0
>>> -----------------------------------------------------------------------------
>>>
>>> ********************************
>>> ****** 10 OpenMP threads ******
>>> ********************************
>>>
>>> On 1 MPI rank, each using 10 OpenMP threads
>>>
>>> Computing:             Num    Num     Call    Wall time   Giga-Cycles
>>>                        Ranks  Threads Count      (s)      total sum      %
>>> -----------------------------------------------------------------------------
>>> Neighbor search          1     10        51      0.120       2.625     2.3
>>> Launch GPU ops.          1     10     10002      0.580      12.731    11.3
>>> Force                    1     10      5001      2.999      65.828    58.4
>>> Wait PME GPU gather      1     10      5001      0.066       1.459     1.3
>>> Reduce GPU PME F         1     10      5001      0.045       0.980     0.9
>>> Wait GPU NB local        1     10      5001      0.014       0.308     0.3
>>> NB X/F buffer ops.       1     10      9951      0.157       3.453     3.1
>>> Write traj.              1     10         2      0.147       3.224     2.9
>>> Update                   1     10      5001      0.140       3.067     2.7
>>> Constraints              1     10      5001      0.814      17.867    15.9
>>> Rest                                              0.053       1.161     1.0
>>> -----------------------------------------------------------------------------
>>> Total                                             5.135     112.703   100.0
>>> -----------------------------------------------------------------------------
>>>
>>>
>>>
>>> --
>>> ---------------------------------------------------
>>> Dr. Jochen Hub
>>> Computational Molecular Biophysics Group
>>> Institute for Microbiology and Genetics
>>> Georg-August-University of Göttingen
>>> Justus-von-Liebig-Weg 11, 37077 Göttingen, Germany.
>>> Phone: +49-551-39-14189
>>> http://cmb.bio.uni-goettingen.de/
>>> ---------------------------------------------------
>>
>> --
>> ---------------------------------------------------
>> Dr. Jochen Hub
>> Computational Molecular Biophysics Group
>> Institute for Microbiology and Genetics
>> Georg-August-University of Göttingen
>> Justus-von-Liebig-Weg 11, 37077 Göttingen, Germany.
>> Phone: +49-551-39-14189
>> http://cmb.bio.uni-goettingen.de/
>> ---------------------------------------------------