[gmx-developers] New Test Set

Sun Feb 5 22:54:47 CET 2012

Personally, I think that the most important tests are those that  
validate the code. See, for example, the tip4p free-energy code bug:

http://www.mail-archive.com/gmx-users@gromacs.org/msg18846.html

I think that it is this type of error, which can silently lead to the
wrong values, that we really need a test set to catch. The gmx-users
list is already not such a bad way to catch errors that lead to
grompp failure.

It is not immediately clear, however, who would be responsible for  
developing such a test (those who added tip4p, those who added  
optimization loops, or those who added the free energy code). I recall  
seeing a post indicating that developers would be required to test  
their code prior to incorporation, but with so many usage options in  
mdrun I think that it will be essential to figure out how to define  
that requirement more precisely.

Finally, this is not to suggest that such a test set needs to be  
incorporated into version 4.6. I do appreciate the need to get  
intermediate versions out the door.

Chris.

Quoting "Shirts, Michael (mrs5pt)" <mrs5pt at eservices.virginia.edu>:

> Hi, all-
>
> My opinion:
>
> I think there should probably be two classes of sets -- fast fully automated
> sets, and more sophisticated content validation sets.
>
> For the fast fully automated test, I would suggest:
> -testing a large range of input .mdps, tops and gros for whether they run
> through grompp and mdrun. Not checking whether the output is correct or
> not, because that is very hard to automate -- just testing whether it runs
> for, say, 20-100 steps or so.
> -testing input files for backwards compatibility.
> -testing whether tools run on typical data sets.
>
> All of these tasks can be easily automated, with unambiguous results (does
> it run to completion yes/no).
>
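A minimal sketch of the "does it run to completion yes/no" check described above. The names run_smoke_case and SmokeResult are hypothetical, and the file names in the usage note are placeholders; the only thing assumed about grompp and mdrun is that they return a nonzero exit code on failure.

```python
import subprocess
import sys
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class SmokeResult:
    name: str
    passed: bool
    failed_step: str  # empty when every step succeeded


def run_smoke_case(name: str,
                   steps: Sequence[Sequence[str]],
                   cwd: Optional[str] = None,
                   timeout: float = 600.0) -> SmokeResult:
    """Run each command in order; the case passes only if all exit with 0."""
    for step in steps:
        label = " ".join(step)
        try:
            proc = subprocess.run(step, cwd=cwd, timeout=timeout,
                                  stdout=subprocess.DEVNULL,
                                  stderr=subprocess.DEVNULL)
            ok = proc.returncode == 0
        except (OSError, subprocess.TimeoutExpired):
            # Missing binary or a hung run both count as a failure.
            ok = False
        if not ok:
            print(f"FAIL {name}: {label}", file=sys.stderr)
            return SmokeResult(name, False, label)
    return SmokeResult(name, True, "")
```

A case could then be a pair of commands such as ["grompp", "-f", "case.mdp", "-c", "conf.gro", "-p", "topol.top", "-o", "topol.tpr"] followed by ["mdrun", "-s", "topol.tpr"], with the 20-100 step limit set through nsteps in the .mdp file. The output is deliberately not inspected, in line with the point above that correctness is hard to automate.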
> Such a test set can (and should) be run by people editing the code
> themselves, and should also be run using something like Jenkins to verify
> that it passes the tests on multiple platforms, either on commit, or more
> likely as part of a nightly test process.
>
> Longer term, we should look more at validating code at a physical level.
> Clearly testing energy conservation is a good idea for integrators; it's
> fairly sensitive.  I think we need to discuss a bit more about how to
> evaluate energy conservation.  This actually can take a fair amount of time,
> and I'm thinking this is something that perhaps should wait for 5.0.  For
> thermostats and barostats, I'm working on a very sensitive test of ensemble
> validity. I'll email a copy to the list when it's ready to go (1-2 weeks?),
> and this is something that can also be incorporated in an integrated testing
> regime, but again, these sorts of tests will take multiple hours, not
> seconds.   That sort of testing level can't be part of the day to day build.
>
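One possible way to put a number on energy conservation, offered only as a sketch since how to evaluate it is still to be discussed: fit a straight line to total energy against time and look at the slope per atom. The function names and the threshold argument are hypothetical, and reading the energies out of a run is left out.

```python
from typing import Sequence


def energy_drift(times: Sequence[float], energies: Sequence[float]) -> float:
    """Least-squares slope of total energy against time."""
    n = len(times)
    if n != len(energies) or n < 2:
        raise ValueError("need two or more matching time/energy samples")
    t_mean = sum(times) / n
    e_mean = sum(energies) / n
    sxx = sum((t - t_mean) ** 2 for t in times)
    if sxx == 0.0:
        raise ValueError("all time points are identical")
    sxy = sum((t - t_mean) * (e - e_mean) for t, e in zip(times, energies))
    return sxy / sxx


def conserves_energy(times: Sequence[float],
                     energies: Sequence[float],
                     n_atoms: int,
                     max_drift_per_atom: float) -> bool:
    """True if the absolute drift per atom is within the chosen threshold.

    The threshold is an assumption the caller has to supply; no value is
    proposed here.
    """
    return abs(energy_drift(times, energies)) / n_atoms <= max_drift_per_atom
```

A linear fit only captures systematic drift, not fluctuations, so it would be a first pass at best, and it says nothing about thermostat or barostat ensemble validity.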
>> - Why do the current tests fail? Is it only because of different floating
>> point rounding or are there other problems? What's the best procedure to
>> find out why a test fails?
>
> I think there are a lot of reasons that the current tests that do diffs to
> previous results can fail -- floating point rounding is definitely an issue,
> but there can be small changes in algorithms that can affect things, or
> differences in the output format.  Perhaps the number of random number calls
> is changed, and thus different random numbers are used for different
> functions.
>
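For the rounding and output-format part of this, a script could compare numbers within a tolerance instead of doing a plain text diff. A minimal sketch follows; the name outputs_match and the default tolerances are assumptions, not agreed values.

```python
import math
from typing import Optional


def _as_float(token: str) -> Optional[float]:
    """Return the token as a float, or None if it is not numeric."""
    try:
        return float(token)
    except ValueError:
        return None


def outputs_match(reference: str, candidate: str,
                  rel_tol: float = 1e-5, abs_tol: float = 1e-8) -> bool:
    """Compare two text outputs token by token.

    Numeric tokens must agree within the tolerances; all other tokens
    must be identical. Splitting on whitespace makes the comparison
    insensitive to column alignment.
    """
    ref_tokens = reference.split()
    new_tokens = candidate.split()
    if len(ref_tokens) != len(new_tokens):
        return False
    for ref, new in zip(ref_tokens, new_tokens):
        ref_val, new_val = _as_float(ref), _as_float(new)
        if ref_val is not None and new_val is not None:
            if not math.isclose(ref_val, new_val,
                                rel_tol=rel_tol, abs_tol=abs_tol):
                return False
        elif ref != new:
            return False
    return True
```

This would absorb floating point noise and whitespace changes only. A change in the algorithm or in the sequence of random numbers can move results well outside any sensible tolerance, so those cases would still need a person to look at them.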
>> - Should the current tests all be part of the new test set?
>
> I'm not so sure about this -- I think we should think a little differently
> about how to implement them.
>
>> - How should the new test be implemented? Should the comparison with the
>> reference value be done in C (within mdrun), ctest script, python or perl?
>
> I think that script would be better.  I think we should isolate the test
> itself from mdrun.
>
>> - Should the new test execute mdrun for each test? Or should we somehow
>> (e.g. python wrapper or within mdrun) load the binary only once and run
>> many test per execution?
>
> I think executing mdrun for each test is fine, unless it slows things down
> drastically.  You could imagine a smaller set (10-20) and a larger set
> (1000s) of inputs that can be run.
>
>> - What are the requirements for the new test set? E.g. how easy should it
>> be to see what's wrong when a test fails?
>
> For the first set of tests, I can imagine that it would be nice to be able
> to look at the outputs of the tests, and diff the outputs corresponding to
> different code versions to help track down where the changes were.
> But I'm suspicious about having the evaluation of these tests decided
> automatically at first.  I agree such differences should EVENTUALLY be
> evaluated automatically, but I'd prefer several months of investigation and
> discussion before deciding exactly what "correct" is.
>
>> Should the test support being run
>> under valgrind? Other?
>
> Valgrind is incredibly slow and can fail for weird reasons -- I'm not sure
> it would add much to do it under valgrind.
>
>> - Do we have any other bugs which have to be solved before the test can be
>> implemented? E.g. is the problem with shared libraries solved? Are there
>> any open redmine issues related to the new test set?
>
> I have no clue.
>
>> - Should we have a policy that everyone who adds a feature also has to
>> provide tests covering those features?
>
> Yes.
>
>> - Should we have a conference call to discuss the test set? If yes when?
> No idea.
>
>> - Should we agree that we won't release 4.6 without the test set to give it
>> a high priority?
>
> I'm OK with having a first pass of the fully automated test set that I
> describe above (does it run on all input it's supposed to).  I think we can
> have a beta release even if the test set isn't finished, as that can be fine
> tuned while beta bugs are being found.
>
> I DON'T think we should have any test set that starts to look at more
> complicated features right now -- it will take months to get that working,
> and we need to get 4.6 out of the door on the order of weeks, so we can move
> on to the next improvements.  4.6 doesn't have to be perfectly flawless, as
> long as it's closer to perfect than 4.5.
>
> Best,
> ~~~~~~~~~~~~
> Michael Shirts
> Assistant Professor
> Department of Chemical Engineering
> University of Virginia
> michael.shirts at virginia.edu
> (434)-243-1821