[gmx-developers] New Test Set

Shirts, Michael (mrs5pt) mrs5pt at eservices.virginia.edu
Sun Feb 5 21:53:50 CET 2012

Hi, all-

My opinion:

I think there should probably be two classes of sets -- fast fully automated
sets, and more sophisticated content validation sets.

For the fast fully automated test, I would suggest:
-testing a large range of input .mdps, tops and gros for whether they run
through grompp and mdrun. Not checking whether the output is correct or
not, because that is very hard to automate -- just testing whether it runs
for, say, 20-100 steps or so.
-testing input files for backwards compatibility.
-testing whether tools run on typical data sets.

All of these tasks can be easily automated, with unambiguous results (does
it run to completion yes/no).
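
As a rough illustration of the "does it run to completion yes/no" idea, here
is a minimal Python sketch. Everything in it is an assumption for the sake of
example: the helper names (run_case, gromacs_commands, run_suite) are
hypothetical, each test case is assumed to live in its own directory holding
<name>.mdp, <name>.gro and <name>.top, and the short step count is assumed to
be set through nsteps in the .mdp file. Binary names and flags may need
adjusting for a given installation.

```python
import subprocess
import sys
from pathlib import Path


def run_case(commands, workdir, timeout=600):
    """Run commands in order; return (passed, first_failing_command)."""
    for cmd in commands:
        try:
            result = subprocess.run(cmd, cwd=workdir, capture_output=True,
                                    timeout=timeout)
        except (OSError, subprocess.TimeoutExpired):
            # Missing binary, unusable working directory, or a hung run.
            return False, cmd
        if result.returncode != 0:
            return False, cmd
    return True, None


def gromacs_commands(stem):
    """Assumed grompp/mdrun invocation for one case (hypothetical layout)."""
    return [
        ["grompp", "-f", f"{stem}.mdp", "-c", f"{stem}.gro",
         "-p", f"{stem}.top", "-o", f"{stem}.tpr"],
        ["mdrun", "-s", f"{stem}.tpr", "-deffnm", stem],
    ]


def run_suite(case_dirs, make_commands=gromacs_commands):
    """Return {case_name: passed} and report failures on stderr."""
    results = {}
    for case in case_dirs:
        case = Path(case)
        passed, failed_cmd = run_case(make_commands(case.name), case)
        if not passed:
            print(f"FAIL {case.name}: {' '.join(failed_cmd)}", file=sys.stderr)
        results[case.name] = passed
    return results
```

Only exit codes are checked here, never the numerical output, which matches
the yes/no criterion above. Passing make_commands as a parameter is just one
way to let the same loop cover tools other than grompp and mdrun.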

Such a test set can (and should) be run by people editing the code by
themselves, and should also be tested using something like Jenkins to verify
that it passes the tests on multiple platforms, either on commit, or more
likely as part of a nightly test process.

Longer term, we should look more at validating code at a physical level.
Clearly testing energy conservation is a good idea for integrators; it's
fairly sensitive.  I think we need to discuss a bit more about how to
evaluate energy conservation.  This actually can take a fair amount of time,
and I'm thinking this is something that perhaps should wait for 5.0.  For
thermostats and barostats, I'm working on a very sensitive test of ensemble
validity. I'll email a copy to the list when it's ready to go (1-2 weeks?),
and this is something that can also be incorporated in an integrated testing
regime, but again, these sorts of tests will take multiple hours, not
seconds.   That sort of testing level can't be part of the day to day build.
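
To make the energy conservation idea a bit more concrete, here is one
possible sketch in Python: fit a straight line to the total energy over time
and compare the slope, normalized per atom, against a threshold. This is only
an assumption about how such a check might look, not an agreed procedure. The
function names are hypothetical, and the threshold is a placeholder that
would have to come out of the discussion mentioned above.

```python
def energy_drift(times, energies):
    """Least-squares slope of total energy versus time."""
    n = len(times)
    if n < 2 or n != len(energies):
        raise ValueError("need at least two matched (time, energy) samples")
    t_mean = sum(times) / n
    e_mean = sum(energies) / n
    num = sum((t - t_mean) * (e - e_mean) for t, e in zip(times, energies))
    den = sum((t - t_mean) ** 2 for t in times)
    if den == 0:
        raise ValueError("all time values are identical")
    return num / den


def conserves_energy(times, energies, n_atoms, max_drift_per_atom):
    """True if |drift| per atom is within the (placeholder) threshold."""
    drift = energy_drift(times, energies)
    return abs(drift) / n_atoms <= max_drift_per_atom
```

A linear fit is the simplest choice; it ignores fluctuations around the trend
and says nothing about thermostats or barostats, which need a different kind
of test.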

> - Why do the current tests fail? Is it only because of different floating
> point rounding or are there other problems? What's the best procedure to
> find out why a test fails?

I think there are a lot of reasons that the current tests that do diffs to
previous results can fail -- floating point rounding is definitely an issue,
but there can be small changes in algorithms that can affect things, or
differences in the output format.  Perhaps the number of random number calls
is changed, and thus different random numbers are used for different

> - Should the current tests all be part of the new test set?

I'm not so sure about this -- I think we should think a little differently
about how to implement them.

> - How should the new test be implemented? Should the comparison with the
> reference value be done in C (within mdrun), ctest script, python or perl?

I think that a script would be better.  I think we should isolate the test
itself from mdrun.
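
For example, a comparison living entirely outside mdrun could look something
like the Python sketch below. It assumes the observed and reference values
have already been extracted into plain dictionaries by some other step (that
extraction is not shown), and the function name and default tolerances are
hypothetical placeholders rather than recommended values.

```python
import math


def compare_to_reference(observed, reference, rel_tol=1e-5, abs_tol=1e-8):
    """Return (name, observed, reference) for every value that is missing
    or differs from the reference by more than the given tolerances."""
    mismatches = []
    for name, ref in reference.items():
        obs = observed.get(name)
        if obs is None or not math.isclose(obs, ref,
                                           rel_tol=rel_tol, abs_tol=abs_tol):
            mismatches.append((name, obs, ref))
    return mismatches
```

Returning the list of mismatches, rather than a bare pass/fail, keeps the
numbers available for a person to look at when deciding whether a difference
actually matters.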

> - Should the new test execute mdrun for each test? Or should we somehow
> (e.g. python wrapper or within mdrun) load the binary only once and run
> many test per execution?

I think executing mdrun for each test is fine, unless it slows things down
drastically.  You could imagine a smaller set (10-20) and a larger set
(1000s) of inputs that can be run.

> - What are the requirements for the new test set? E.g. how easy should it
> be to see whats wrong when a test fails?

For the first set of tests, I can imagine that it would be nice to be able
to look at the outputs of the tests, and diff different outputs
corresponding to different code versions to help track down where the
changes were.  But I'm suspicious about having the evaluation of these tests
decided automatically at first.  I agree such comparisons should EVENTUALLY
be automated, but I'd prefer several months of investigation and discussion
before deciding exactly what "correct" is.

> Should the test support being run
> under valgrind? Other?

Valgrind is incredibly slow and can fail for weird reasons -- I'm not sure
it would add much to do it under valgrind.

> - Do we have any other bugs which have to be solved before the test can be
> implemented? E.g. is the problem with shared libraries solved? Are there
> any open redmine issues related to the new test set?

I have no clue.

> - Should we have a policy that everyone who adds a feature also has to
> provide tests covering those features?


> - Should we have a conference call to discuss the test set? If yes when?
No idea.

> - Should we agree that we won't release 4.6 without the test set to give it
> a high priority?

I'm OK with having a first pass of the fully automated test set that I
describe above (does it run on all input it's supposed to).  I think we can
have a beta release even if the test set isn't finished, as that can be fine
tuned while beta bugs are being found.

I DON'T think we should have any test set that starts to look at more
complicated features right now -- it will take months to get that working,
and we need to get 4.6 out of the door on the order of weeks, so we can move
on to the next improvements.  4.6 doesn't have to be perfectly flawless, as
long as it's closer to perfect than 4.5.

Michael Shirts
Assistant Professor
Department of Chemical Engineering
University of Virginia
michael.shirts at virginia.edu
