[gmx-developers] Making libgromacs a dynamic loadable object

Mon Sep 30 11:23:04 CEST 2013

Hi,

On Sep 30, 2013, at 11:14 AM, Mark Abraham <mark.j.abraham at gmail.com> wrote:
> 
> > I think this is unlikely to make it for 5.0, but long-term I would like to support multiple hardware accelerations in a single binary again, by making the actual binaries very small and loading one of several libraries as a dynamic module at runtime. This is not technically difficult to do, but there is one step that will be a bit of a pain for us: each symbol we want to use from the library must be resolved manually with a call to dlsym().
> 
> I can see three possible division levels: mdrun vs tools, md-loop vs rest, hardware-tuned inner loops vs rest. The third is by far the easiest to do.
> 
We already discussed this in Redmine (see the thread Teemu linked to), and unfortunately the problem is not limited to the inner loops. CPU-specific optimization flags have a significant impact on large parts of the code, and improve performance by ~20% on top of what the inner kernels give. I don't think we're willing to sacrifice 20% performance.

> Cray still requires static linking, and BlueGene/Q encourages it, so I think it is important that the implementation does not require dynamic linking in the cases where portability of the binary is immaterial.
> 
I don't think we can have our cake and eat it too. For special-purpose, highly parallel architectures that require static linking, I think it is reasonable that the Gromacs binary will be specific to that particular architecture.

> > This means we should start thinking of two things to make life simpler in the future:
> >
> > 1) Decide on what level we want the interface between library and executables, and keep this interface _really_ small (in the sense that we want to resolve as few symbols as possible).
> > 2) Since we will have to compile the high-level binaries with generic compiler flags, any code that is performance-sensitive should go in the architecture-specific optimized library.
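
As a rough illustration of point 1), one way to keep the number of resolved symbols minimal is to export a single function that returns a table of function pointers. This is only a sketch under assumptions: the names md_module_api, md_module_get_api, run_with_module and the "AVX" stand-in functions are hypothetical and do not exist in Gromacs, and both sides are shown in one file purely so the example compiles on its own.

```c
/* Hypothetical sketch: none of these names exist in Gromacs. */
#include <stddef.h>

/* The entire library/executable interface: one table of function pointers. */
typedef struct {
    int         api_version;                 /* bumped on incompatible changes   */
    const char *(*acceleration_name)(void);  /* e.g. "SSE4.1" or "AVX"           */
    int         (*run_md)(int nsteps);       /* stand-in for the md loop entry   */
} md_module_api;

/* --- would live in the architecture-specific library --- */
static const char *avx_name(void) { return "AVX"; }

/* Placeholder for the real md loop: pretends every step was run. */
static int avx_run_md(int nsteps) { return nsteps; }

/* The only exported symbol, so the executable needs one dlsym() call. */
const md_module_api *md_module_get_api(void)
{
    static const md_module_api api = { 1, avx_name, avx_run_md };
    return &api;
}

/* --- would live in the small, generically compiled executable --- */
int run_with_module(const md_module_api *api, int nsteps)
{
    if (api == NULL || api->api_version != 1)
    {
        return -1; /* missing or incompatible module */
    }
    return api->run_md(nsteps);
}
```

With this layout, adding a function to the interface means adding a field to the table rather than another manually resolved symbol, and the version field gives the executable a cheap way to reject a mismatched library.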
> 
> I think the third option I give above is the most achievable. I do not know whether the dynamic function calls incur overhead per call, or whether that can be mitigated by the helper object Teemu suggested, but he sounds right (as usual). I hope the libraries would share the same address space. Since we anyway plan for tasks to wrap function calls, the implementations converge.
> 
See above. It would cost ~20% performance, which I think is unacceptable. The main md loop and all functions under it need to be compiled with CPU-specific optimization, so that's the lowest level we can split on. Otherwise we might just as well disable AVX optimization and ship SSE4.1 binaries to be portable :-)
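
For reference, a split at the md-loop level could be loaded roughly as follows. This is a minimal sketch, not a worked-out design: the library names, the "md_main" symbol and both helper functions are assumptions made up for illustration, and how the CPU features are detected is left to the caller.

```c
/* Hypothetical sketch: library names and the entry-point symbol are made up. */
#include <dlfcn.h>
#include <stddef.h>

/* Signature assumed for the single entry point of each library. */
typedef int (*md_entry_fn)(int argc, char **argv);

/* Map detected CPU features to a library built with matching compiler flags. */
const char *pick_md_library(int has_avx, int has_sse41)
{
    if (has_avx)
    {
        return "libmd_avx.so";
    }
    if (has_sse41)
    {
        return "libmd_sse41.so";
    }
    return "libmd_generic.so";
}

/* Load one library and resolve its entry point.
 * Returns NULL (and leaves *handle_out NULL) if either step fails. */
md_entry_fn load_md_entry(const char *libname, void **handle_out)
{
    md_entry_fn entry = NULL;
    void       *handle;

    *handle_out = NULL;
    handle      = dlopen(libname, RTLD_NOW | RTLD_LOCAL);
    if (handle == NULL)
    {
        return NULL;
    }
    /* Idiom from the POSIX dlsym() documentation for converting the
     * returned void* into a function pointer. */
    *(void **)(&entry) = dlsym(handle, "md_main");
    if (entry == NULL)
    {
        dlclose(handle);
        return NULL;
    }
    *handle_out = handle;
    return entry;
}
```

The feature flags could come from something like GCC's __builtin_cpu_supports("avx") on x86. Once the pointer is resolved, calling it is an ordinary indirect function call, so the dlsym() lookup is paid once at startup rather than per call. On some systems this needs to be linked with -ldl.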

Cheers,

Erik