[gmx-users] Performance gains with AVX_512 ?

Tue Dec 12 23:11:18 CET 2017

Hi Szilárd,

> On 12. Dec 2017, at 17:58, Szilárd Páll <pall.szilard at gmail.com> wrote:
> 
> Hi Carsten,
> 
> The performance behavior you observe is expected; I have observed it
> myself. Nothing seems unusual in the performance numbers you report.
> 
> The AVX512 clock throttle is additional (10-20% IIRC) to the AVX2 throttle,
> and the only code that really gains significantly from AVX512 is the
> nonbonded kernels. When those are offloaded, the gain from higher clocks
> with AVX2 will translate to better CPU performance (and especially if the
> run is CPU-bound, that will make a significant difference).
> 
> BTW, on the low- and mid-range CPUs ("Bronze"/"Silver" and "cut-down" i9s)
> AVX512 is even less likely to ever be worth it.
So using AVX2 on GPU nodes seems generally to be the fastest option. 
Thanks a lot for the info! 
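
For reference, the size of the AVX2_256 advantage on the GPU node can be worked out directly from the ns/day figures quoted further down. This is only a minimal Python sketch: the dictionary layout and the helper name `avx2_gain_percent` are illustrative and are not part of GROMACS or its tooling.

```python
# ns/day figures copied from the GPU-node benchmark table quoted below
# (2x Xeon Gold 6146 + 2x GTX 1080Ti, 80k atom membrane system, PME on CPU).
gpu_node_ns_per_day = {
    "2016":   {"AVX_512": 102.3, "AVX2_256": 119.3},
    "2018b2": {"AVX_512": 107.9, "AVX2_256": 123.2},
}


def avx2_gain_percent(results):
    """Return by how many percent AVX2_256 outperforms AVX_512 (hypothetical helper)."""
    return (results["AVX2_256"] / results["AVX_512"] - 1.0) * 100.0


# One gain value per GROMACS version.
gains = {version: avx2_gain_percent(r) for version, r in gpu_node_ns_per_day.items()}

for version, gain in gains.items():
    print(f"GROMACS {version}: AVX2_256 is {gain:.1f}% faster than AVX_512")
```

Both versions land close to the 15 % figure mentioned in the original question.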

Best,
  Carsten

> 
> Cheers,
> 
> --
> Szilárd
> 
> On Tue, Dec 12, 2017 at 3:07 PM, Kutzner, Carsten <ckutzne at gwdg.de> wrote:
> 
>> Hi,
>> 
>> what are the expected performance benefits of AVX_512 SIMD instructions
>> on Intel Skylake processors, compared to AVX2_256? In many cases, I see
>> a significantly (15 %) higher GROMACS 2016 / 2018b2 performance when using
>> AVX2_256 instead of AVX_512. I would have guessed that AVX_512 is at least
>> not slower than inferior instruction sets.
>> 
>> Some quick benchmark results:
>> Node with 2x12 core (48 threads) Xeon Gold 6146 plus 2x GTX 1080Ti
>> 80k atoms membrane benchmark system, 2 fs time step, pme on cpu
>> 
>> GROMACS v.    SIMD        ns/d
>> 2016          AVX_512     102.3
>> 2016          AVX2_256    119.3
>> 2018b2        AVX_512     107.9
>> 2018b2        AVX2_256    123.2
>> 
>> I realize that AVX_512 turbo frequencies are significantly lower
>> compared to AVX2_256 if all cores are in use, and for a serial run,
>> AVX_512 is indeed about 6% faster than AVX2_256.
>> 
> 
> By "serial" you mean single threaded runs? Single-core turbo on this 165W
> CPU will be pretty high (>=4.2 GHz) and it is not likely to reflect the
> relative difference at the base-clock.
> 
>> Gromacs 2018b2, -nb cpu
>> thread-MPI  ns/day   ns/day     improvement
>> threads     AVX_512  AVX2_256   over AVX2
>> 1           2.880    2.702     1.065
>> 2           5.451    5.209     1.046
>> 4           9.617    9.332     1.031
>> 8          17.469   17.276     1.011
>> 12          21.852   24.245      .901
>> 16          28.579   31.691      .902
>> 24          39.731   41.576      .956
>> 48          41.831   39.336     1.063
>> 
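
The "improvement over AVX2" column in the table above is just the AVX_512 ns/day divided by the AVX2_256 ns/day. A minimal Python sketch of that arithmetic follows; the variable names are illustrative only, and the numbers are copied from the table rather than re-measured.

```python
# (thread-MPI threads, ns/day with AVX_512, ns/day with AVX2_256),
# copied from the GROMACS 2018b2 "-nb cpu" table above.
scaling = [
    (1, 2.880, 2.702),
    (2, 5.451, 5.209),
    (4, 9.617, 9.332),
    (8, 17.469, 17.276),
    (12, 21.852, 24.245),
    (16, 28.579, 31.691),
    (24, 39.731, 41.576),
    (48, 41.831, 39.336),
]

# Ratio > 1 means AVX_512 is faster; ratio < 1 means AVX2_256 is faster.
ratios = {threads: avx512 / avx2 for threads, avx512, avx2 in scaling}

# Thread counts at which AVX2_256 comes out ahead.
avx2_wins = [threads for threads, ratio in ratios.items() if ratio < 1.0]

for threads, ratio in ratios.items():
    faster = "AVX_512" if ratio > 1.0 else "AVX2_256"
    print(f"{threads:>2} threads: AVX_512/AVX2_256 = {ratio:.3f} ({faster} faster)")
```

With these numbers AVX2_256 is ahead only at 12, 16, and 24 threads, while AVX_512 is ahead at low thread counts and again at 48 threads.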
> 
> Does this mean that for all but rows 5, 7, and 8 you left socket(s)
> partially empty?
> 
> 
> Cheers,
> --
> Szilárd
> 
> 
>> Can anyone comment on whether that is the expected behavior and why?
>> 
>> Thanks!
>>  Carsten
>> 
>> 
>> 
>> --
>> Gromacs Users mailing list
>> 
>> * Please search the archive at http://www.gromacs.org/Support/Mailing_Lists/GMX-Users_List before posting!
>> 
>> * Can't post? Read http://www.gromacs.org/Support/Mailing_Lists
>> 
>> * For (un)subscribe requests visit
>> https://maillist.sys.kth.se/mailman/listinfo/gromacs.org_gmx-users or
>> send a mail to gmx-users-request at gromacs.org.
>> 