[gmx-developers] 4.6 Binaries and Acceleration levels

Fri Jul 6 11:18:03 CEST 2012

On Fri, Jul 6, 2012 at 4:47 AM, Berk Hess <hess at kth.se> wrote:
> On 07/06/2012 09:55 AM, Roland Schulz wrote:
>> On Fri, Jul 6, 2012 at 3:16 AM, Berk Hess <hess at kth.se> wrote:
>>> On 07/06/2012 05:51 AM, Nicholas Breen wrote:
>>>> On Thu, Jul 05, 2012 at 11:05:53PM +0200, Szilárd Páll wrote:
>>>>> On Thu, Jul 5, 2012 at 10:59 PM, Christoph Junghans <junghans at votca.org> wrote:
>>>>>> Roland has a good point, but Debian and Fedora already compile Gromacs
>>>>>> for different MPI versions.
>>>>>>
>>>>>> 4 acc. x 3 (serial, openmpi & mpich2) = 12 packages!
>>>>>>
>>>>>> @ Jussi: Would that still be possible?
>>>>> I guess possible is one thing and probable is a totally different one...
>>>>>
>>>>> Perhaps it would be good to ask the Ubuntu/Debian maintainer as well...
>>>> I would not want to create that much package complexity (the gromacs source
>>>> package already builds five binary packages!), especially if it would all be on
>>>> only one of the many architectures supported, and I cannot think of any other
>>>> package in the Debian archive that operates that way -- everything that I know
>>>> of with multiple CPU optimizations uses run-time detection, except for the
>>>> unavoidable cases with packaging the Linux kernel itself.  If it is
>>>> functionality that is contained purely within the shared libraries, then
>>>> glibc's hwcap support might be a workaround if the build system can permit
>>>> compiling one copy of the libraries for every supported variant.  Otherwise, if
>>>> it's all within mdrun itself, maybe just stick to run-time detection and
>>>> downgrade or eliminate the warnings it issues for "suboptimal" CPUs?
>>>>
>>>>
>>> Erik's non-bonded kernels are present as source in all flavors.
>>> My non-bonded kernel sources can easily be put into all flavors.
>>> Then we would still need compilation and call selection support.
>>> But there is also x86 SIMD code in the PME and bonded code,
>>> as well as in some other parts of the code.
>> How big is the maximum speedup we get in anything but non-bonded
>> (PME, bonded, ...) of AVX/SSE4 over SSE2?
>> If it isn't very dramatic compared to the speedup of just the
>> non-bonded, I think there would be great value in having
>> multi-acceleration non-bonded kernels, even if that means that all
>> other functions are limited to the lowest common denominator targeted
>> by the binary, because it would at least prevent people from using the
>> SSE2 non-bonded kernels when using binaries. But if the speedup of
>> bonded/PME is also very significant, then it is probably better to
>> advise people not to use binaries for mdrun but only for analysis
>> tools, and to always compile mdrun.
> The difference between SSE2 and SSE4.1 will be minor, not really
> worth putting much effort into. But AVX is much faster than SSE on
> both Intel and AMD. The compiler converting SSE intrinsics as well as
> code to AVX helps a lot, and on Intel 256-bit SIMD also helps.
>
> For a run with my nbnxn kernels on Sandy Bridge for AVX-256 vs SSE2 I get:
>
> Wall time improvement using 4 OpenMP threads:
> Overall: -15%
>
> NS: -32%
> bondeds: -4%
> non-bonded: -20%
> PME: -1%
> update: -9%
> constraints: -6%

Well, more important, I think, are the absolute savings in total
runtime, not the savings relative to the runtime of the function/unit
itself. From your numbers I would expect that everything besides
non-bonded is less than 4%, because e.g. NS should usually be less than
10% of total time. In other words, most of the saving comes from the
non-bonded speed-up (and it is less than the 20% in total because PME
doesn't gain at all).
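
The arithmetic behind this argument can be sketched as follows. This is only
an illustration in Python, not GROMACS code: absolute_saving is a hypothetical
helper, the relative savings are the numbers quoted above, and the 10% share
for NS is the rough upper estimate from this message, not a measurement.

```python
# Relative wall-time reductions (AVX-256 vs SSE2) quoted above, as fractions.
RELATIVE_SAVING = {
    "NS": 0.32,
    "bondeds": 0.04,
    "non-bonded": 0.20,
    "PME": 0.01,
    "update": 0.09,
    "constraints": 0.06,
}


def absolute_saving(time_share, relative_saving):
    """Fraction of the TOTAL run time saved by one component.

    time_share: the component's share of the total (SSE2) run time, 0..1.
    relative_saving: reduction of the component's own run time, 0..1.
    """
    if not 0.0 <= time_share <= 1.0:
        raise ValueError("time_share must be between 0 and 1")
    return time_share * relative_saving


# ASSUMPTION: NS takes at most 10% of the total time (estimate, not measured).
ns_bound = absolute_saving(0.10, RELATIVE_SAVING["NS"])
print(f"NS saves at most {ns_bound:.1%} of the total run time")
```

With a 10% share, the 32% reduction in NS is worth about 3.2% of the total
run time, which is where the "less than 4%" estimate comes from. The shares
of the other components would have to come from an actual timing breakdown.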

> I forgot to pin neighboring threads to the same core.
> So the difference is 7 to 8%.
> It would be nice to have this, but it's not a lot either.
> I could live with losing this in a binary distribution.

Yes, it isn't so huge that we have to make the warning even more obvious
than it is right now. But I would think that if we can easily compile
more than one non-bonded kernel, and thus get most of that speed-up
(only non-bonded), this would still be something worthwhile.
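
A minimal sketch of what selecting among several compiled non-bonded kernels
at run time could look like. It is written in Python purely for illustration
and is not how GROMACS implements it: the kernel names and both functions are
hypothetical, and the flag names assume the spelling used in the "flags" line
of /proc/cpuinfo on Linux.

```python
# Hypothetical kernel flavors, ordered from most to least preferred.
# Each entry is (CPU flag as spelled in /proc/cpuinfo, kernel name).
KERNEL_PREFERENCE = [
    ("avx", "nonbonded_avx_256"),
    ("sse4_1", "nonbonded_sse4_1"),
    ("sse2", "nonbonded_sse2"),
]
FALLBACK_KERNEL = "nonbonded_plain_c"  # hypothetical non-SIMD kernel


def read_cpu_flags(path="/proc/cpuinfo"):
    """Return the CPU flags listed in a Linux cpuinfo file, or an empty set."""
    try:
        with open(path) as handle:
            for line in handle:
                if line.startswith("flags"):
                    return set(line.split(":", 1)[1].split())
    except OSError:
        pass  # file missing or unreadable: report no known flags
    return set()


def select_nonbonded_kernel(cpu_flags, compiled_kernels):
    """Pick the most preferred kernel that is both compiled in and supported."""
    for flag, kernel in KERNEL_PREFERENCE:
        if flag in cpu_flags and kernel in compiled_kernels:
            return kernel
    return FALLBACK_KERNEL


if __name__ == "__main__":
    compiled = {"nonbonded_avx_256", "nonbonded_sse2"}
    print(select_nonbonded_kernel(read_cpu_flags(), compiled))
```

Only the kernel call is dispatched here; everything else in the binary would
still be built for the lowest common denominator, as discussed above.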

Roland

>
> But with 8 OpenMP threads using HT, AVX gets slower, whereas SSE2 gets
> faster, so AVX-256 on 4 threads is only 3% faster than SSE2 on 8 threads.
> This is negligible and shows that we might have to worry more about
> automating and optimizing the run setup than optimizing the compile
> configuration!
> We should not run HT+OpenMP+AVX-256 code.
>
> With Erik's kernels the picture might be different, as my kernels (and
> PME) lose with HT at 8 threads mainly because of reduction overhead.
> But using MPI threads or mixed parallelization is slower.
>
> Cheers,
>
> Berk
>

-- 
ORNL/UT Center for Molecular Biophysics cmb.ornl.gov
865-241-1537, ORNL PO BOX 2008 MS6309