[gmx-developers] New Test Set
david.bowman at gmail.com
Sat Feb 11 18:49:42 CET 2012
I have a few comments and suggestions that I hope will help. I have been
designing software and building products for 30 years in industry, in the
areas of compilers, optimizers, databases, imaging and other software where
testing is a critical issue. In many cases this involved thousands or even
millions of copies. I have been working on performance improvements to
GROMACS, and the difficulty of knowing whether I broke something has been a
big challenge.
1) Unit and system test suites are developed as products go through
release cycles and as bugs are being fixed. Trying to build tests after a
release or near the end of a release cycle does not work because of
limitations on resources, the desire to get the release out or because of
work on future releases.
2) The important thing for 4.6 is to catch the big things and priority 1
bugs through the GROMACS community before it goes out, and to put in place a
simple, manageable process for testing going forward. I would not release it
with a priority 1 bug (e.g. if Reaction Field is broken, I would not ship).
3) If conventions for a test framework can quickly be established, and the
developers of 4.6 functionality and recent critical bug fixes have tests
from their own work that can reasonably be adapted to the new framework,
those tests should be added.
4) The foundation of test suites is the unit/module level. If the base is
not sound, then the system/simulation level will not be sound. At the module
level it is almost always possible to determine what the 'right' answer is
with relatively small datasets, in short amounts of time, for many parameter
sets. A set of pass/fail test programs at the module level is a good
long-term goal. It seems unrealistic to me to try to go back and build
tests for all of GROMACS with limited resources, money and time. It is more
realistic to add tests as bugs are fixed and new features are added.
5) Keep the test framework very simple. There are a number of 'test
environments' and products available. I have used a lot of them in
commercial product development, and I would not recommend the use of any of
them for GROMACS.
The US National Institute of Standards and Technology has test suites for
compilers, databases and many other software technologies consisting of
numerous programs that test functionality at a module level where each test
program performs multiple tests sequentially, logs the results as text and
returns a program exit indicating pass/fail. Test programs are typically
grouped by functional area and run as a set of scripts. The tests that fail
are clearly identified in the log and reading the test program code is
simple because the tests are executed sequentially. Sometimes there are
input/output files that are validated against a reference file. This simple
model is used extensively for ANSI standards compliance. Using this model,
unit/module-level testing would be easy, and application/simulation-level
tests could be run as scripts that diff outputs against references with
configurable 'tolerances', using the same style of logging and pass/fail
return method.
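To make the model concrete, here is a minimal sketch of such a sequential
test program in Python. The test names, reference values and tolerances are
purely illustrative, not any existing GROMACS API:

```python
# Sketch of the NIST-style test program model: tests run sequentially,
# results are logged as text, and the exit status signals overall pass/fail.
# All names and reference values here are illustrative placeholders.
def check_close(name, value, reference, tol, log):
    """Run one test, append a PASS/FAIL line to the log, return the result."""
    ok = abs(value - reference) <= tol
    log.append(f"{'PASS' if ok else 'FAIL'}: {name} "
               f"(got {value}, expected {reference} +/- {tol})")
    return ok

def main():
    log = []
    results = [
        check_close("lj_energy", -1.234, -1.234, 1e-6, log),
        check_close("coulomb_energy", -0.500, -0.500, 1e-6, log),
    ]
    for line in log:
        print(line)
    # 0 = all tests passed; the driving script inspects this exit status.
    return 0 if all(results) else 1
```

A driving script would run a set of such programs grouped by functional
area, collect their text logs, and report any program that exited non-zero.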
What would be needed: test program and test script templates, conventions
for test logging, individual program/script exit status (pass/fail),
conventions for dataset naming, and configurable tolerances for higher-level
application/simulation tests. The process would also need to be managed.
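As a sketch of what a configurable-tolerance diff could look like (the file
layout and field conventions here are assumptions, not existing GROMACS
conventions):

```python
# Compare whitespace-separated fields of a test output against a reference,
# allowing numeric fields to differ by up to an absolute tolerance.
# Returns a list of mismatch descriptions; an empty list means PASS.
def diff_with_tolerance(output_lines, reference_lines, tol):
    mismatches = []
    for i, (out, ref) in enumerate(zip(output_lines, reference_lines), 1):
        for j, (a, b) in enumerate(zip(out.split(), ref.split()), 1):
            try:
                if abs(float(a) - float(b)) > tol:
                    mismatches.append(f"line {i}, field {j}: {a} vs {b}")
            except ValueError:
                # Non-numeric fields must match exactly.
                if a != b:
                    mismatches.append(f"line {i}, field {j}: {a} vs {b}")
    return mismatches
```

The tolerance would be configured per test, so that simulation-level tests
can be looser than deterministic module-level tests.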
6) In industry, decisions about how to test functionality at the module
level are distributed among the developers who build initial or new features
and fix bugs. For future releases of GROMACS, developers should be required
to add regression test programs that verify, on a pass/fail basis, all new
functionality and priority 1 bug fixes.
Developers usually have the basic logic of such test programs in their own
test programs, simulations, files and criteria that they use to validate
their own development. That test work is usually thrown away. It should not
be too time consuming to use this code as the basis of regression tests in
the future, if developers understand that it is a requirement for code
submission. Tests must be designed, and the developers implementing new
features and fixing bugs should take primary responsibility for how to test
their changes. Unfortunately this takes time, costs money, and slows the
development and publication process.
7) It is important to get a second opinion about how something important is
to be fixed or a new feature is to be added, and to have someone 'sign off'
on another person's work and test program/script prior to incorporation into
a release. It would be good to integrate this 'signoff' into the process.
On Sat, Feb 11, 2012 at 7:35 AM, David van der Spoel
<spoel at xray.bmc.uu.se> wrote:
> On 2012-02-10 20:17, Roland Schulz wrote:
>> On Sun, Feb 5, 2012 at 3:53 PM, Shirts, Michael (mrs5pt)
>> <mrs5pt at eservices.virginia.edu> wrote:
>> Hi, all-
>> My opinion:
>> I think there should probably be two classes of sets -- fast fully
>> automated sets, and more sophisticated content validation sets.
>> For the fast fully automated test, I would suggest:
>> -testing a large range of input .mdps, tops and gros for whether
>> they run
>> through grompp and mdrun. Not testing whether the output is
>> correct or
>> not, because that is very hard to automate -- just testing whether
>> it runs
>> for, say 20-100 steps or so.
>> Yes having that set of inputs is needed. Should we start a wiki page to
>> start listing all the inputs we want to include? Or what would be the
>> best way to collaboratively create this set of inputs?
> A while ago I talked about this with Nicu who works with Siewert Jan
> Marrink. He is a software engineer by training and suggested the following.
> For each input parameter you take extreme values (e.g. timestep 5 fs and
> 0.001 fs) and a random value in between. Then there would be 3^N different
> parameter combinations for N parameters, which is probably way too many
> combinations, even if N were only 20. Therefore you now pick a subset
> of, say, 200 or 1000 out of these 3^N possible tests, and this becomes the
> test set. With such a set-up it is quite easy to see that we'd test at
> least the extreme values, which are where things can go wrong. A few
> of these tests would actually be prohibited by grompp too, but in all
> likelihood not nearly enough.
> At the time when Nicu & I discussed this we even considered publishing
> this, since I am not aware of another scientific code that has such
> rigorous testing tools.
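This sampling scheme is easy to sketch: with three candidate values per
parameter (both extremes plus one random value in between) there are 3^N
combinations for N parameters, and a fixed-size random subset of those
becomes the test set. The parameter names and ranges below are purely
illustrative:

```python
# Sketch of the proposed parameter sampling: three candidate values per
# parameter, then a random subset of the 3^N combinations as the test set.
# Parameter names and ranges are illustrative, not real .mdp options.
import itertools
import random

def build_test_set(param_ranges, subset_size, seed=0):
    rng = random.Random(seed)  # seeded so the test set is reproducible
    candidates = {
        name: (lo, hi, rng.uniform(lo, hi))
        for name, (lo, hi) in param_ranges.items()
    }
    names = list(candidates)
    combos = [dict(zip(names, values))
              for values in itertools.product(*(candidates[n] for n in names))]
    return rng.sample(combos, min(subset_size, len(combos)))

# Illustrative parameters: timestep in fs and cutoff in nm.
tests = build_test_set({"dt_fs": (0.001, 5.0), "rcut_nm": (0.8, 1.6)}, 5)
```

For realistic N the full set would never be materialized; the sketch only
shows the sampling idea.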
>> Longer term, we should look more at validating code at a physical level.
>> Clearly testing energy conservation is a good idea for integrators;
>> fairly sensitive. I think we need to discuss a bit more about how to
>> evaluate energy conservation. This actually can take a fair amount
>> of time,
>> and I'm thinking this is something that perhaps should wait for 5.0.
>> thermostats and barostats, I'm working on a very sensitive test of
>> validity. I'll email a copy to the list when it's ready to go (1-2
>> and this is something that can also be incorporated in an integrated
>> regime, but again, these sorts of tests will take multiple hours, not
>> seconds. That sort of testing level can't be part of the day to
>> day build.
>> Well even if the tests take ~2000 CPU-hours, I think we (maybe even Erik
>> by himself) have the resources to run this weekly.
>> > - What are the requirements for the new test set? E.g. how easy
>> should it
>> > be to see what's wrong when a test fails?
>> For the first set of tests, I can imagine that it would be nice to
>> be able
>> to look at the outputs of the tests, and diff different outputs
>> corresponding to different code versions to help track down changes.
>> But I'm suspicious about making the evaluation of these tests decided
>> automatically at first. I agree such differences should EVENTUALLY be
>> automated, but I'd prefer several months of investigation before
>> deciding exactly what "correct" is.
>> I think a wrong reference value is better than no reference value. Even
>> a wrong reference value would allow us to detect if e.g. different
>> compilers give significantly different results (maybe some give the
>> correct value). Also it would help to avoid adding additional bugs. Of
>> course we shouldn't release the test set to the outside before we are
>> relatively sure that it is actually correct.
>> > Should the test support being run
>> > under valgrind? Other?
>> Valgrind is incredibly slow and can fail for weird reasons -- I'm
>> not sure
>> it would add much to do it under valgrind.
>> I have the current C++ tests (those written by Teemu) running under
>> valgrind in Jenkins. It wasn't very hard to write a few suppression
>> rules to make valgrind not report any false positives. Now Jenkins
>> can automatically fail the build if the code has any memory errors.
>> Obviously one wouldn't run any of the long running tests with valgrind.
>> But for the unit tests I think it might be very useful to catch bugs.
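For reference, a valgrind suppression rule of the kind Roland describes
looks roughly like the following; the suppression name and stack frames are
placeholders, not real GROMACS symbols. Such a file is passed to valgrind
with --suppressions=<file>:

```
# Hypothetical suppression: ignore a known-benign allocation reported from
# a one-time library initialization. Frame names below are placeholders.
{
   one_time_init_leak
   Memcheck:Leak
   match-leak-kinds: reachable
   fun:malloc
   fun:some_library_init
}
```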
>> I DON'T think we should have any test set that starts to look at more
>> complicated features right now -- it will take months to get that
>> right, and we need to get 4.6 out of the door on the order of weeks, so we
>> can move
>> on to the next improvements. 4.6 doesn't have to be perfectly
>> flawless, as
>> long as it's closer to perfect than 4.5.
>> My reason for delaying the 4.6 release would not be to improve the 4.6
>> release. I agree with you we probably can't guarantee that the reference
>> values are correct in time anyhow, so we probably wouldn't even want to
>> ship the tests with 4.6. My worry is that as soon as 4.6 is out the
>> focus is on adding new cool features instead of working on these boring
>> tasks we should do, because they help us in the long run. E.g. if we
>> would have agreed that we don't have a 4.6 release, the C++ conversion
>> would most likely be much further along. And I don't see how we can
>> create an incentive mechanism to work on these issues without somehow
>> coupling it to releases.
>> Michael Shirts
>> Assistant Professor
>> Department of Chemical Engineering
>> University of Virginia
>> michael.shirts at virginia.edu
>> (434)-243-1821
>> gmx-developers mailing list
>> gmx-developers at gromacs.org
>> Please don't post (un)subscribe requests to the list. Use the
>> www interface or send it to gmx-developers-request at gromacs.org
>> ORNL/UT Center for Molecular Biophysics cmb.ornl.gov <http://cmb.ornl.gov>
>> 865-241-1537, ORNL PO BOX 2008 MS6309
> David van der Spoel, Ph.D., Professor of Biology
> Dept. of Cell & Molec. Biol., Uppsala University.
> Box 596, 75124 Uppsala, Sweden. Phone: +46184714205.
> spoel at xray.bmc.uu.se http://folding.bmc.uu.se