[gmx-developers] Making libgromacs a dynamic loadable object

Erik Lindahl erik.lindahl at scilifelab.se
Mon Sep 30 11:23:04 CEST 2013


On Sep 30, 2013, at 11:14 AM, Mark Abraham <mark.j.abraham at gmail.com> wrote:
> > I think this is unlikely to make it for 5.0, but long-term I would like to support multiple hardware accelerations in a single binary again, by making the actual binaries very small and loading one of several libraries as a dynamic module at runtime. This is not technically difficult to do, but there is one step that will be a little pain for us: Each symbol we want to use from the library must be resolved manually with a call to dlsym().
> I can see three possible division levels: mdrun vs tools, md-loop vs rest, hardware-tuned inner loops vs rest. The third is by far the easiest to do.
We already discussed this in Redmine (see the thread Teemu linked to), and unfortunately the problem is not limited to inner loops - CPU-specific optimization flags have a significant impact on large parts of the code, and improve performance by ~20% beyond the inner kernels alone - I don't think we're willing to sacrifice 20% performance.
> Cray still requires static linking, and BlueGene/Q encourages it, so I think it is important that the implementation does not require dynamic linking in the cases where portability of the binary is immaterial.
I don't think we can both have our cake and eat it too. For special-purpose highly parallel architectures that require static linking, I think it is reasonable that the Gromacs binary will be specific to that particular architecture.
> > This means we should start thinking of two things to make life simpler in the future:
> >
> > 1) Decide on what level we want the interface between library and executables, and keep this interface _really_ small (in the sense that we want to resolve as few symbols as possible).
> > 2) Since we will have to compile the high-level binaries with generic compiler flags, any code that is performance-sensitive should go in the architecture-specific optimized library.
> I think the third option I give above is the most achievable. I do not know whether the dynamic function calls incur per-call overhead, or whether that can be mitigated by the helper object Teemu suggested, but he sounds right (as usual). I hope the libraries would share the same address space. Since we plan for tasks to wrap function calls anyway, the implementations converge.
See above. It would lose ~20% performance, which I think is unacceptable. The main md loop and all functions under it need to be compiled with CPU-specific optimization, so that's the lowest level we can split on. Otherwise we might just as well disable AVX optimization and ship SSE4.1 binaries to be portable :-)


