[gmx-developers] automated performance testing

Mark Abraham mark.j.abraham at gmail.com
Thu Oct 23 23:43:36 CEST 2014


On Fri, Oct 17, 2014 at 1:27 AM, Szilárd Páll <pall.szilard at gmail.com>
wrote:

> Hi,
>
> There have been attempts at putting together a benchmark suite, the
> first one quite a few years ago resulting in the gmxbench.sh script
> and few input systems; more additions have been made a couple of
> months ago, everything is here: git.gromacs.org/benchmarks.git
>
> On Tue, Sep 30, 2014 at 6:22 PM, Mark Abraham <mark.j.abraham at gmail.com>
> wrote:
> > Hi,
> >
> > Cherry-picking Michael's email into its own thread:
> >
> >> Is there a plan (long term) to do (essentially) automated performance
> >> tests so that we
> >> can perform consistent(ish) checks for new changes in code, then post
> the
> >> results in an
> >> easy(ish) way to interpret for others?
> >
> > There's no organized plan.  I've lately been trying to organize a
> dedicated
> > machine here so we can start to do some of this - had we had it and the
> > right kinds of tests then various bugs would not have gone unnoticed.
>
> While a machine is useful, I think the what and how of benchmarking are
> what need clarifying first; given the difficulties in concretizing the
> benchmark setup in the past, these aspects require attention sooner
> rather than later, I think.


Indeed, there are multiple objectives that a Gromacs benchmark suite might
address...
1) a sysadmin/user would like to run one, to verify that they have not done
a horrible job of installation
2) some sysadmins/users/devs would want to try to observe maximum
performance on given hardware
3) devs would like a time-efficient way to be able to show that their
proposed/actual code change is neutral-or-better

> Some of the questions that previous attempts brought
> up:
> - facilitate comparing results (to other codes or older versions of
> GROMACS) while avoiding the pitfall of using "smallest" common
> denominator features/algorithms (like JACC or the STFC benchmarks);
>

There's a Daresbury benchmark suite that attempted a large-scale
cross-code comparison, but I don't think we have much to gain from
facilitating that kind of comparison, for a range of reasons. There's merit
in maintaining a benchmark that permits comparisons with other codes if the
settings are ones that are actually reasonable to use - but e.g. a
benchmark that stipulated the neighbour list update frequency and buffer
size would be implementable but of doubtful real interest. Choosing a
simulation package based on least-common-denominator performance is really
not what we should encourage anybody to do.

Comparing with older versions of Gromacs is worthwhile only over a short
time frame, and only to demonstrate that the expected performance behaviour
is realized. There's no ongoing value in being able to run a current
benchmark suite on 4.5, because it has none of the optimizations we've
implemented since. So I think a benchmark set should evolve along with the
code, and that backward compatibility should be a fair way down the list of
priorities.

> - test algorithms or functionality: some may be interested in the
> algorithmic performance while others want to know how fast one can
> compute X.


Indeed, these are my cases 3) & 2). It seems to me unlikely that their
solutions should have the same form. Case 3) might call for testing
performance on inputs that are deliberately a bit artificial. For example,
to show that a bonded implementation is an improvement, you want a setup
where any relevant difference can be observed to useful precision with an
affordable amount of computation. That's not necessarily in the regime
where the code would be used in practice. So some Martini polymer system
with small-cutoff RF on a GPU might be good for a test of bonded-forces
performance characteristics, while perhaps being doubtful as a useful model
of anything, and thus not belonging in a benchmark suite for 1) or 2).


> > In
> > principle, people could run that test suite on their own hardware, of
> > course.
>
> I think the best would be if many people ran it on many different kinds
> of hardware.
>

There's value there, but only if there's a lightweight way to curate any
shared results. "Hi, I compiled GROMACS at -O2 with the default gcc 4.4.7
on our new cluster" is of zero value... Results for cases 1) and 2) are
interesting.
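
To be worth curating, each shared result would need to carry enough build
and hardware metadata to be interpretable. As an untested Python sketch
(field names and values purely illustrative, not a proposal for a real
schema), I mean something like:

import json

# All values illustrative; the point is that a bare ns/day number without
# this context is not worth curating.
result = {
    "gromacs_version": "5.0.2",           # e.g. git describe output
    "compiler": "gcc 4.9.1",
    "compiler_flags": "-O3 -march=native",
    "cpu": "Core i7-4770K",
    "gpu": "GTX 780",                     # empty string if none was used
    "mpi_ranks": 1,
    "openmp_threads": 8,
    "input_system": "rnase_cubic",        # which benchmark input was run
    "ns_per_day": 85.4,                   # headline number from the md.log
}
print(json.dumps(result, indent=2))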

> However, I think reproducibility is quite tricky. It can only be
> ensured if we get a couple of identical machines, set them up with
> identical software and avoid most upgrades (e.g. kernel, libc[++],
> etc.) that can affect performance, keeping a machine as backup to
> avoid having to rerun all reference runs when the hardware breaks.
> Even this will only let us observe performance on one particular
> piece of hardware with the set of algorithms, parallelization and
> optimizations actually used.


Why is reproducibility across machines useful? The absolute performance
number on any configuration is of nearly no interest. I think that the
change over a short period of time on given hardware is what is of interest
for 2) or 3), and that can be measured most reliably by running the old
software version again. That doesn't scale for keeping an historical record
of trends in performance, but I haven't been able to think of a need for
that other than for released versions.
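
Concretely, for 2) or 3) on a fixed machine one can just build the old and
new versions, run the same .tpr with each, and compare the ns/day the two
md.log files report. An untested sketch of that comparison, assuming the
usual "Performance:" line near the end of the log:

import re
import sys

def ns_per_day(log_path):
    """Return the ns/day figure from an mdrun md.log."""
    with open(log_path) as log:
        for line in log:
            match = re.match(r"Performance:\s+([0-9.]+)", line)
            if match:
                return float(match.group(1))
    raise ValueError("no Performance: line found in " + log_path)

if __name__ == "__main__":
    # usage: python compare_perf.py old/md.log new/md.log
    old, new = ns_per_day(sys.argv[1]), ns_per_day(sys.argv[2])
    print("old %.2f ns/day, new %.2f ns/day, change %+.1f%%"
          % (old, new, 100.0 * (new - old) / old))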

> > One option I've been toying with lately is dumping the mdrun performance
> > data matrix to XML (or something) so that some existing plotting
> > machinery can show the trend over time (and also observe a per-commit
> > delta large enough to vote -1 on Jenkins).
>
> Isn't per-commit overkill and an ill-scaling setup? I think it's
> better to do less frequent (e.g. weekly) and per-request performance
> regression tests. Proper testing anyway requires running dozens of
> combinations of input+launch configurations on several platforms. It's
> not a fun task, I know, because I've been doing quite extensive
> performance testing semi-manually.
>

Right, per-commit testing can't be done on a wide range of configurations.
But we can test something per-commit, which is better than what we do right
now. The time to trigger wider-scale testing is when we expect a change
(e.g. as I have done for some of the C->C++ changes that are now merged),
but having any form of "canary in the coal mine" for unanticipated changes
has decent value. It could probably be combined with e.g. tests of ensemble
quality.
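
As a sketch of what such a canary could look like (the threshold, file
names and formats are all assumptions on my part, not anything that exists
today): a small script that compares the per-commit measurements against a
stored baseline and exits non-zero, which Jenkins can turn into a -1, when
the slowdown exceeds a tolerance chosen to sit above run-to-run noise.

import json
import sys

TOLERANCE = 0.05   # flag slowdowns beyond 5%; must exceed measured noise

def check(baseline_file, current_file):
    # Both files are assumed to hold {"config name": ns_per_day, ...}
    with open(baseline_file) as f:
        baseline = json.load(f)
    with open(current_file) as f:
        current = json.load(f)
    regressions = []
    for config, old in sorted(baseline.items()):
        new = current.get(config)
        if new is None:
            continue   # configuration not measured for this commit
        rel_change = (new - old) / old
        print("%-30s %8.2f -> %8.2f  (%+.1f%%)"
              % (config, old, new, 100.0 * rel_change))
        if rel_change < -TOLERANCE:
            regressions.append(config)
    return regressions

if __name__ == "__main__":
    slower = check(sys.argv[1], sys.argv[2])
    if slower:
        print("Performance regression in: " + ", ".join(slower))
        sys.exit(1)    # Jenkins treats a non-zero exit as a failed build

The tolerance would have to be set per configuration from observed
run-to-run variance, otherwise the canary cries wolf on every commit.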

Mark

> --
> Szilárd
>
> > I also mean to have a poke around with
> > http://www.phoromatic.com/ to see if maybe it already has
> > infrastructure we could use.
> >
> > Mark
> >