[gmx-developers] Any news on SIMD intrinsic accelerated kernel for Blue Gene/Q?

Jeff Hammond jhammond at alcf.anl.gov
Thu May 9 04:29:07 CEST 2013


Hi,

On Wed, May 8, 2013 at 8:45 PM, Szilárd Páll <szilard.pall at cbr.su.se> wrote:
> On Wed, May 8, 2013 at 11:48 PM, Jeff Hammond <jhammond at alcf.anl.gov> wrote:
>> Have you compiled Gromacs with auto-vectorization turned on (e.g.
>> CFLAGS = -g -O3 -qarch=qp -qtune=qp -qsimd=auto -qhot=level=1
>> -qprefetch -qunroll=yes -qreport) and run the code through HPM (see
>> https://wiki.alcf.anl.gov/bgq-earlyaccess/images/4/4a/Hpct-bgq.pdf for
>> instructions - it requires only modest source changes) to determine
>> that the compiler is not capable of vectorizing the code?  I am far
>> from an XL fanboy but it is imprudent to devote human effort when a
>> machine can do the job already.
>
> I doubt that auto-vectorization will give much performance benefit -
> at least based on my experience on x86. However, as we've invested no
> effort into making the code auto-vectorization friendly, it may be
> possible to make the plain C kernels more vectorizer-friendly.

I agree completely.  XL is far less likely to auto-vectorize than the
Intel compilers, if for no other reason than the alignment
requirements of QPX (in contrast, AVX tolerates unaligned accesses,
albeit at some performance cost).
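
To make this concrete, the loops with the best chance are the ones
where the compiler can prove there is no aliasing and the body is
branch-free (illustrative sketch; the XL builtin __alignx(), as I
recall from the XL docs, can additionally assert the 32-byte alignment
that QPX loads want):

  /* Vectorizer-friendly C: restrict rules out aliasing, the body is
   * branch-free, and the access pattern is unit-stride. */
  void scale_add(int n, double a, const double *restrict x,
                 double *restrict y)
  {
      for (int i = 0; i < n; i++)
          y[i] = a * x[i] + y[i];
  }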

>> The HPM results will be of value no matter what the compiler is doing,
>> as this level of profiling information is absolutely critical to
>> making intelligent choices about how to optimize any code for BGQ.
>> You might discover, for example, that QPX is not the best place to
>> start tuning anyways.  It could be that there are memory access issues
>> that are costing Gromacs far more than a factor of 4 (that is the most
>> one can expect from a 4-way FPU, of course - I assume that the FMA is
>> already used because it is generic to PowerPC).
>
> I'm not sure what you mean by "generic to PowerPC", but unless the
> compiler happens to be able to auto-vectorize a lot of the kernel
> code, FMA will not be used (intensively).

FMA (fused multiply-add) is not a vectorized FPU instruction but
merely y += a*x for floating-point numbers, which is quite common in
scientific codes.  I think force evaluation in MD is almost trivially
going to use FMA.  This means that one is doing 2 flops/cycle instead
of 1.  Intel chips are only now getting FMA (it arrives with Haswell)
and AMD introduced it recently as well; it has been in the POWER
architecture since the original POWER chips circa 1990.
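
For example, a loop like the following turns into one fmadd per
iteration on PowerPC, with no vectorization involved at all
(illustrative sketch, not Gromacs code):

  /* Scalar FMA: each iteration is one multiply feeding one add. */
  double dot(int n, const double *x, const double *y)
  {
      double s = 0.0;
      for (int i = 0; i < n; i++)
          s += x[i] * y[i];   /* one fmadd per iteration */
      return s;
  }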

> If we are already at it, I'd like to ask your opinion on the ease of
> use and potential advantage of a few BG features that looked
> particularly interesting to me at first while briefly browsing through
> the docs. Note that I have very limited knowledge of BG, so forgive me
> if I'm asking trivial questions.

At least you're asking.  The standard practice I see is to not ask,
to assume the machine behaves like an x86 cluster, and then to waste
inordinate amounts of time rediscovering what BG support staff like
myself have known for years.  I'm always happy to help new BG users.

> - Efficient atomics: how efficient is efficient?

The performance I've seen
(https://wiki.alcf.anl.gov/parts/index.php/Blue_Gene/Q#L2_Atomics) is
pretty amazing.  I have not compared to atomics on x86, but the BGQ L2
atomics are at least an order of magnitude faster than LL/SC
(load-linked/store-conditional, aka load-with-reservation and
store-conditional, i.e. lwarx/stwcx., nicknamed "larks and sticks").
However, John Mellor-Crummey wrote a lock (an MCS lock, in fact, which
makes sense given that he's the MC in MCS) that is faster than L2
locks on BGQ under full contention; this is not too surprising,
though, given that a computer science colleague of mine refers to him
as "parallel Jesus".

> What are the consequences of having to "predesignate" memory for atomic use?

It means that libraries needing a predetermined set of atomic
locations are fine, but one cannot generally use L2 atomics on an
arbitrary memory location, e.g. to implement atomics in the
application, unless they act on pre-allocated locations set aside
specifically for this purpose.  If you want to use L2 atomics inside
of, say, a dynamic load-balancing utility, that's fine, but if you
want to do this ad hoc inside of a loop, you cannot rely on L2 slots
being available.
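
Off the top of my head, the pattern looks something like this.  This
is untested and the SPI names are from memory, so check
spi/include/kernel/memory.h, spi/include/l2/atomic.h, and the Redbook
for the exact spellings and the precise contract (alignment, when the
designation call must happen, etc.):

  #include <stdint.h>
  #include <stdio.h>
  #include <spi/include/kernel/memory.h>  /* Kernel_L2AtomicsAllocate */
  #include <spi/include/l2/atomic.h>      /* L2_Atomic* operations */

  static volatile uint64_t counter;

  int main(void)
  {
      /* Predesignate the region for L2 atomic use -- this is the
       * step you cannot do ad hoc inside an inner loop. */
      if (Kernel_L2AtomicsAllocate((void *)&counter,
                                   sizeof(counter)) != 0) {
          fprintf(stderr, "L2 atomic designation failed\n");
          return 1;
      }
      counter = 0;

      /* Returns the previous value; the increment happens in the L2,
       * without stalling the core. */
      uint64_t old = L2_AtomicLoadIncrement(&counter);
      printf("old = %llu\n", (unsigned long long)old);
      return 0;
  }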

> How limiting is the number of memory translation entries?

I've not seen any problems related to this, but so far the
near-exclusive use of L2 atomics is inside of system software like MPI
and the OS.  If a lot of user code were trying to use them, one might
hit the TLB limitations noted in the documentation (not necessarily
easy to find, but I recall it is in the latest version of the Apps
Redbook).

> What makes it sounds particularly interesting is that stores are
> queued and do not stall the CPU. So, more concretely, is it feasible
> to implement a reduction with OpenMP threads doing potentially
> conflicting updates (low conflict rate) and get this faster than code
> that produces N outputs on N threads and reducing these?

The enqueued stores to which you refer (I've not studied this too
much) are probably the reason that strided stores are far less of an
issue on BGQ than on x86 (which I have studied extensively).  The loop
order of matrix transpose that works best on BG is the opposite of the
one that works best on x86, and the gap between the two orderings is
much larger (5x on BGP compared to 1.5x on x86; I can't remember the
difference on BGQ off-hand).
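
To be concrete about the two orderings (illustrative sketch):

  /* Order 1: a is read with stride n, b is written contiguously. */
  void transpose_read_strided(int n, const double *a, double *b)
  {
      for (int i = 0; i < n; i++)
          for (int j = 0; j < n; j++)
              b[i * n + j] = a[j * n + i];
  }

  /* Order 2: a is read contiguously, b is written with stride n.
   * With BGQ's enqueued stores the strided stores are cheap, so the
   * winning order flips relative to x86. */
  void transpose_write_strided(int n, const double *a, double *b)
  {
      for (int i = 0; i < n; i++)
          for (int j = 0; j < n; j++)
              b[j * n + i] = a[i * n + j];
  }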

Integer atomics in this case could be very fast if they used L2
atomics, but that would require the aforementioned preallocation.
OpenMP atomics will have to use LL/SC.
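
For reference, the two strategies from your question look like this
(illustrative sketch; with stock OpenMP, the atomic in the first
version compiles down to lwarx/stwcx.):

  #include <omp.h>
  #include <stdlib.h>

  /* (a) Threads update the shared output directly; conflicts are
   * resolved by atomics. */
  void reduce_atomic(int n, const double *contrib, const int *target,
                     double *out)
  {
      #pragma omp parallel for
      for (int i = 0; i < n; i++) {
          #pragma omp atomic
          out[target[i]] += contrib[i];
      }
  }

  /* (b) Each thread fills a private buffer; the buffers are reduced
   * at the end. */
  void reduce_private(int n, int nout, const double *contrib,
                      const int *target, double *out)
  {
      int nth = omp_get_max_threads();
      double *buf = calloc((size_t)nth * nout, sizeof(double));
      #pragma omp parallel
      {
          double *mine = buf + (size_t)omp_get_thread_num() * nout;
          #pragma omp for
          for (int i = 0; i < n; i++)
              mine[target[i]] += contrib[i];
          /* implicit barrier here, then reduce across the buffers */
          #pragma omp for
          for (int j = 0; j < nout; j++)
              for (int t = 0; t < nth; t++)
                  out[j] += buf[(size_t)t * nout + j];
      }
      free(buf);
  }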

> - Is OpenMP synchronization much faster than on x86 (due to the above)?

The OpenMP runtime that we use in production doesn't have all the
hardcore L2-atomics tuning (that runtime is in beta, and I don't know
if it will ever be generally available), but even without it, I find
that OpenMP scaling is much better than on x86.  This is primarily due
to the lack of NUMA: BGQ is a single-socket SMP across all the cores,
whereas HPC systems with Intel chips are almost always dual-socket,
meaning that scaling an OpenMP region past the Ncores/2 (socket)
boundary is nearly impossible.
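
A quick way to see this for yourself is to time a bare barrier on
both machines (illustrative sketch):

  #include <omp.h>
  #include <stdio.h>

  int main(void)
  {
      const int iters = 100000;
      double t0 = omp_get_wtime();
      /* every thread executes the full loop, so all threads hit
       * each barrier */
      #pragma omp parallel
      for (int i = 0; i < iters; i++) {
          #pragma omp barrier
      }
      double dt = (omp_get_wtime() - t0) / iters;
      printf("%d threads, mean barrier: %.0f ns\n",
             omp_get_max_threads(), dt * 1e9);
      return 0;
  }

Run it with the same thread count on a dual-socket x86 node and on
BGQ; the jump when you cross the x86 socket boundary shows up
directly.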

> - Collective FP network operations: is it feasible to use it for
> anything but huge problem sizes? I am thinking about potential use in
> the reduction required after halo exchange with domain-decomposition
> (which is otherwise quite lightweight).

As on BGL and BGP, most users find that BGQ does collectives so fast
they need diapers.  The short-message allreduce latency is so low it's
silly: less than 50 microseconds for 1-40 doubles on the full machine,
or something along those lines.  I have full data if people care about
the details.

Similarly, alltoall(v) is crazy fast relative to Cray or InfiniBand.
It's probably not a good idea to compare; people might start trying to
get refunds on their Cray machines :-)
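
If you want to check such numbers on your own machine, a minimal probe
looks like this (illustrative sketch, not the benchmark that produced
the figures above):

  #include <mpi.h>
  #include <stdio.h>

  int main(int argc, char **argv)
  {
      MPI_Init(&argc, &argv);
      int rank;
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);

      double in[40] = {0}, out[40];
      const int iters = 1000;

      MPI_Allreduce(in, out, 40, MPI_DOUBLE, MPI_SUM,
                    MPI_COMM_WORLD);                    /* warm-up */
      MPI_Barrier(MPI_COMM_WORLD);
      double t0 = MPI_Wtime();
      for (int i = 0; i < iters; i++)
          MPI_Allreduce(in, out, 40, MPI_DOUBLE, MPI_SUM,
                        MPI_COMM_WORLD);
      double dt = (MPI_Wtime() - t0) / iters;

      if (rank == 0)
          printf("mean 40-double allreduce: %.2f us\n", dt * 1e6);
      MPI_Finalize();
      return 0;
  }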

> - What is the efficiency of SMT, in particular compared to
> HyperThreading (for compute/cache intensive code like MD)?

In an abstract sense, threads exist to hide latency.  On many
systems, they hide I/O latency or memory latency.  On BGQ, they are
used to hide not only memory latency but also instruction latency: you
need two or more threads per core to saturate the instruction
throughput, so using them is not optional.  The advantage here is that
codes that go to L2 or DRAM a lot will see a nice speedup from latency
hiding.  L1 thrashing can be an issue, but that is unlikely in an MD
code, which is either in the streaming limit or needs enough cache for
tables, etc., that those end up living in L2 anyway.

LAMMPS runs just fine with 64 MPI processes per node at scale
(https://wiki.alcf.anl.gov/parts/index.php/Mira_LAMMPS_Documentation),
particularly for weak scaling, but the long-range solvers benefit from
threads and hierarchical parallelism.  Apps that aren't properly
threaded can always use this mode, but it isn't ideal.  Codes that
have fine-grained OpenMP will run fine at 16x3 or 16x4 (MPI ranks x
OpenMP threads), while codes with chunkier OpenMP, more OpenMP in
general, or that benefit from greater latency hiding (the fusion
particle codes are a good example) can run all the way out to 1x64.

I hope the information I've provided here is useful.  It is hardly
Gromacs-specific since I am far less familiar with Gromacs than other
codes.  I signed up to the list in hopes of learning more and
interacting with people interested in optimizing it for BGQ, so I'm
very happy to see this coming to pass.

Best,

Jeff


>> On Wed, May 8, 2013 at 3:46 PM, Bin Liu <fdusuperstring at gmail.com> wrote:
>>> Dear developers,
>>>
>>>
>>> On the GROMACS acceleration and parallelization webpage, I found:
>>>
>>> We will add Blue Gene P and/or Q and AVX2 in the near future.
>>>
>>> I am quite excited about this news, since I am a researcher in Canada, and
>>> the University of Toronto has purchased a Blue Gene/Q cluster which will be
>>> operated by the SciNet Consortium.
>>>
>>> https://support.scinet.utoronto.ca/wiki/index.php/BGQ#System_Status:_BETA
>>>
>>> Without SIMD intrinsic accelerated kernels for Blue Gene/Q, perhaps they
>>> won't install GROMACS on it, since a lot of computational resources would be
>>> wasted.  If the GROMACS developers can implement the SIMD intrinsic
>>> accelerated kernels, I will be more than grateful.
>>>
>>>
>>> Bin
>>>



-- 
Jeff Hammond
Argonne Leadership Computing Facility
University of Chicago Computation Institute
jhammond at alcf.anl.gov / (630) 252-5381
http://www.linkedin.com/in/jeffhammond
https://wiki.alcf.anl.gov/parts/index.php/User:Jhammond
ALCF docs: http://www.alcf.anl.gov/user-guides


