Mark Abraham
Tue Mar 24 09:30:31 CET 2020


Jan, the biggest bang-for-buck optimizations relevant to Folding at Home are to

a) offer to build them an OpenCL-enabled GROMACS "core" (i.e. the version of
GROMACS that they run, when they run GROMACS). Currently they seem to run
all GPU jobs with OpenCL and OpenMM, which is nice but leaves a lot of
throughput on the table. The GROMACS OpenCL port is mature and stable, runs
on current AMD/NVIDIA/Intel GPUs, and should present no more driver/user
problems than their OpenMM one. Their concept of a GPU slot is a single GPU
accompanied by a single CPU thread, whereas the GROMACS OpenCL port would
prefer multiple dedicated cores. That's still better than leaving GPUs
empty if there aren't enough OpenMM jobs in the queue, though the actual
performance will be woeful compared to what GROMACS delivers when you also
give it a healthy chunk of CPU cores. It could even be better than OpenMM's
GPU core, depending on how modern that one is ;-) The GROMACS CUDA port is better
still (and in 2020 can do a decent job even with only a single CPU core),
but they have made a philosophical choice to use OpenCL only.
b) update the GROMACS CPU core in F at H, because the one they use is
several years behind and is missing out on the hard optimization work
that we've done in the meantime.
c) demonstrate that they can maintainably and usefully offer more than two
x86 builds of that GROMACS CPU core. GROMACS has lots of SIMD-specialized
flavours, but F at H only offers SSE4.1 and basic AVX from those flavours,
which leaves a lot of performance on the table on recent x86 CPUs. We
already have all the logic needed to work out which pre-built GROMACS to
download and run, because we also use it in containerized GROMACS builds.
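
To make (c) concrete, here is a rough sketch of that kind of selection
logic. It is not the code GROMACS or F at H actually uses: CpuFeatures,
pickSimdFlavour and detectCpuFeatures are hypothetical names made up for
illustration, the detection relies on the GCC/Clang x86 builtins, and only
a handful of the GMX_SIMD flavour names are covered.

```cpp
#include <string>

// Hypothetical feature flags for this sketch; not a GROMACS type.
struct CpuFeatures
{
    bool sse4_1  = false;
    bool avx     = false;
    bool avx2    = false;
    bool avx512f = false;
};

// Map the detected features to the name of a pre-built flavour, checking
// the widest instruction set first. The strings follow GMX_SIMD values.
// Assumes x86-64, where SSE2 is always available as the fallback.
std::string pickSimdFlavour(const CpuFeatures& cpu)
{
    if (cpu.avx512f) { return "AVX_512"; }
    if (cpu.avx2)    { return "AVX2_256"; }
    if (cpu.avx)     { return "AVX_256"; }
    if (cpu.sse4_1)  { return "SSE4.1"; }
    return "SSE2";
}

// Query the running CPU. Only compilers with the GCC/Clang x86 builtins are
// handled; elsewhere every flag stays false and the fallback is chosen.
CpuFeatures detectCpuFeatures()
{
    CpuFeatures cpu;
#if (defined(__GNUC__) || defined(__clang__)) && (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();
    cpu.sse4_1  = __builtin_cpu_supports("sse4.1") != 0;
    cpu.avx     = __builtin_cpu_supports("avx") != 0;
    cpu.avx2    = __builtin_cpu_supports("avx2") != 0;
    cpu.avx512f = __builtin_cpu_supports("avx512f") != 0;
#endif
    return cpu;
}
```

A client would call pickSimdFlavour(detectCpuFeatures()) and download the
matching build. The widest SIMD is not always the fastest on a given CPU,
so a real selector would probably need some per-model tuning on top.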

Unfortunately they've never open-sourced any of that, so finding out where
to start is the first challenge. But that way you'll have a lot more impact
sooner than you will from profiling GROMACS runs after 30 years of
optimization. ;-)
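
On the do_linacc snippet quoted below: the two branches assign the same
value, but they are guarded by opposite sign conditions, so they are not
redundant as such. They can be folded into one condition, though. A minimal
sketch, using plain scalars in place of the t_edpar fields and a double
standing in for the GROMACS real type (both are simplifications for
illustration, not the actual edsam.cpp code):

```cpp
// Stand-in for the GROMACS `real` type; an assumption for this sketch.
using real = double;

// Branch structure from the quoted snippet, reduced to scalar inputs.
real preFactorOriginal(real stepSize, real proj, real refProj)
{
    real preFactor = 0.0;
    if (stepSize > 0.0)
    {
        if ((proj - refProj) < 0.0)
        {
            preFactor = refProj - proj;
        }
    }
    if (stepSize < 0.0)
    {
        if ((proj - refProj) > 0.0)
        {
            preFactor = refProj - proj;
        }
    }
    return preFactor;
}

// Same result with a single condition: a correction is applied only when
// the step size and (proj - refProj) have strictly opposite signs.
real preFactorSimplified(real stepSize, real proj, real refProj)
{
    const real diff = proj - refProj;
    const bool needsCorrection =
            (stepSize > 0.0 && diff < 0.0) || (stepSize < 0.0 && diff > 0.0);
    return needsCorrection ? refProj - proj : 0.0;
}
```

Whether that is worth changing is a readability question only; it will not
show up in any profile.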


On Mon, 23 Mar 2020 at 14:59, jan wrote:

> Hi,
> I'm a general back-end dev.  Given the situation, and folding at home
> using gromacs, I thought I'd poke through the code. I noticed
> something unexpected, and was advised to email it here. In edsam.cpp,
> this:
> void do_linacc(rvec* xcoll, t_edpar* edi)
> {
>     /* loop over linacc vectors */
>     for (int i = 0; i < edi->vecs.linacc.neig; i++)
>     {
>         /* calculate the projection */
>         real proj = projectx(*edi, xcoll, edi->vecs.linacc.vec[i]);
>         /* calculate the correction */
>         real preFactor = 0.0;
>         if (edi->vecs.linacc.stpsz[i] > 0.0)
>         {
>             if ((proj - edi->vecs.linacc.refproj[i]) < 0.0)
>             {
>                 preFactor = edi->vecs.linacc.refproj[i] - proj;
>             }
>         }
>         if (edi->vecs.linacc.stpsz[i] < 0.0)
>         {
>             if ((proj - edi->vecs.linacc.refproj[i]) > 0.0)
>             {
>                 preFactor = edi->vecs.linacc.refproj[i] - proj;
>             }
>         }
>        [...]
> In both cases it reaches the same code
>   preFactor = edi->vecs.linacc.refproj[i] - proj
> That surprised me a bit, is it deliberate? If so it may be the code
> can be simplified anyway.
> That aside, if you're looking for performance I might be able to help.
> I don't know the high level stuff *at this point* and my C++ is so
> rusty it creaks, but I can brush that up, do profiling and whatnot.
> I'm pretty experienced, just not in this area.  Speeding things up is
> something I've got a track record of (though I usually have a good
> feel for the problem domain first, which I don't here)
> Would it be of some value for me to try getting more speed? If so,
> first thing I'd need is to get this running under cygwin, which I'm
> struggling with.
> regards
> jan
