[gmx-developers] libxml2

Mon Nov 11 13:32:49 CET 2013

PS:

If it wasn’t implicitly clear, I can try to help realize this, although I can’t promise to do it right away, and I can’t do it myself :-)

To try to be constructive, I’ve been considering the scenario where we want to describe the execution of a complete job, including some components that will require extra chemical data. Longer-term, I think it would be great if we could assemble a single XML document that really describes the entire system (even coordinates), force field, topology, simulation, the MDP settings, metadata settings for parallelization, and not least the chemical data.

Then we could have a structure with a top-level “gromacs” XML namespace that just contains metadata (user, generating program, etc) and a bunch of lower namespaces that contain the actual data.
These could for instance be “forcefield”, “topology”, mdp parameters, and likely a separate block to be able to describe higher-level simulation metadata (e.g. parallelization or that we should run N simulations in REMD).
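
To make the layout above a bit more concrete, here is a minimal sketch in Python (standard library only) of what such a document skeleton could look like. Everything named in it is an assumption made for illustration: the namespace URIs, the "job", "metadata" and "data" element names, and the "run" block for higher-level simulation metadata. None of it is an agreed Gromacs format.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace URIs, invented for this sketch only.
NS = {
    "gromacs": "urn:example:gromacs",
    "forcefield": "urn:example:gromacs:forcefield",
    "topology": "urn:example:gromacs:topology",
    "mdp": "urn:example:gromacs:mdp",
    "run": "urn:example:gromacs:run",
}
for prefix, uri in NS.items():
    # Ask ElementTree to use readable prefixes when serializing.
    ET.register_namespace(prefix, uri)


def q(prefix, tag):
    """Return the namespace-qualified tag name ElementTree expects."""
    return "{%s}%s" % (NS[prefix], tag)


def build_job(user, program):
    """Build a top-level document: metadata first, then one block per namespace."""
    root = ET.Element(q("gromacs", "job"))
    meta = ET.SubElement(root, q("gromacs", "metadata"))
    ET.SubElement(meta, q("gromacs", "user")).text = user
    ET.SubElement(meta, q("gromacs", "program")).text = program
    # Empty placeholders; the contents would be designed when each is implemented.
    for prefix in ("forcefield", "topology", "mdp", "run"):
        ET.SubElement(root, q(prefix, "data"))
    return root


job = build_job("erik", "grompp")
xml_text = ET.tostring(job, encoding="unicode")
```

The point of the sketch is only that the top level stays small (metadata plus a list of blocks), so each lower namespace can be specified and validated separately.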

We don’t need to think of the contents of most of these until we implement them. If we want to start with the special case of structure factors I guess the questions we should think of are:

1) Where do we see this type of data fitting in a bigger Gromacs namespace? What other similar data might we have in the future?

2) Are there any other structure factors that could occur in a simulation (say, X-ray)? Can we describe those in the same datastructure, or should they be separate? If separate, we should reflect that in the naming, etc.

3) Can we design a simple datastructure for _this_ type of data, so other programs that need it can ask Gromacs (which will also validate input xml files) rather than write their own XML parsing code?
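
As a rough illustration of questions 2 and 3, the sketch below (Python, standard library only) shows one possible shape: a single record type with a "kind" field so that different kinds of factors share one data structure, and a single reader that checks the input so other programs do not need their own parsing code. The element and attribute names, the set of kinds, and the numbers are all hypothetical, and the hand-written checks merely stand in for the schema validation a real implementation would use.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass

# Assumed set of kinds; a real design might need more or different ones.
KNOWN_KINDS = {"neutron", "xray"}


@dataclass(frozen=True)
class ScatteringFactor:
    element: str  # chemical element symbol
    kind: str     # which experiment the value applies to
    value: float  # units left unspecified in this sketch


def load_scattering_factors(xml_text):
    """Parse the hypothetical format and reject anything unexpected."""
    root = ET.fromstring(xml_text)
    if root.tag != "scattering_factors":
        raise ValueError("unexpected root element: %r" % root.tag)
    factors = []
    for node in root:
        if node.tag != "factor":
            raise ValueError("unexpected element: %r" % node.tag)
        element = node.get("element")
        if not element:
            raise ValueError("factor without an element attribute")
        kind = node.get("kind")
        if kind not in KNOWN_KINDS:
            raise ValueError("unknown kind for %s: %r" % (element, kind))
        try:
            value = float(node.get("value", ""))
        except ValueError:
            raise ValueError("non-numeric value for %s" % element) from None
        factors.append(ScatteringFactor(element, kind, value))
    return factors


# Illustrative numbers only, not reference data.
EXAMPLE = """<scattering_factors>
  <factor element="H" kind="neutron" value="-3.74"/>
  <factor element="C" kind="xray" value="6.0"/>
</scattering_factors>"""

factors = load_scattering_factors(EXAMPLE)
```

Whether separate kinds belong in one structure or in separately named ones is exactly the open question; the "kind" field is just one way to answer it.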

If that sounds potentially interesting, I can try to contribute by starting to sketch the highest-level namespace.

Cheers,

Erik

On 11 Nov 2013, at 03:47, Erik Lindahl <erik.lindahl at scilifelab.se> wrote:

> Hi,
> 
> On 10 Nov 2013, at 23:33, David van der Spoel <spoel at xray.bmc.uu.se> wrote:
>> 
>> I guess this will prevent us from using xml in practice. We have 
>> discussed xml for ten years or so, but the transition to xml schema is a 
>> real show stopper. I don't have the time to learn that as well. Does 
>> that imply I should stop developing? In addition, for many small files 
>> you don't need a dtd or schema (and in fact there isn't one for these 
>> xml files), it's just that the libxml2 library demands you put it into 
>> the file. If we're talking rtp files then that's another matter where 
>> more structure is needed.
> 
> I think the ability to validate the contents of a file is the core concept we want from XML. An XML file that doesn’t have any DTD or Schema is just a text file that looks fancier - you can add illegal data anywhere, and then you rely only on the internal logic of the program reading it to catch your error (or not) - that won’t really be much safer than our current text files.
> 
> Writing a schema for a simple file takes less than an hour to learn, and there are even free DTD-to-schema converters. Obviously, it will still be a lot of work to write an advanced schema e.g. for topologies, but I don’t think that’s on the table right now.  However, just as class design is a pain for all of us (well, maybe not Teemu :-), the reason for doing it is that it will save time for all developers and lead to fewer bugs in the long run.
>> 
>> Some other points, like having clear names and units I do agree with and 
can change it in my present application.
>> 
>> Common modules for writing and reading implies that all possible data 
>> should be merged into one or a few monster formats. This in itself will 
>> create extra problems.
> 
> Well, it doesn’t necessarily have to be _one_ single format, but I think it is a far better solution to standardize on how we do it rather than ~20 tools each inventing their own structure for how to store and read data? That is what we have right now with the text files...
> 
>> As for changing names of files, this shouldn't be necessary as one 
>> should be able to see from the content what kind of file this is. No 
>> strong feelings here but it would be very confusing to add many new 
>> files names.
> 
> If we have a good namespace structure we can probably get around without it. However, at some point we have to consider how to separate the topology XML file from the mdp XML file in each directory.
> 
>> @Mark: an extra layer wouldn't help would it - there is no competing 
>> package as far as I know. There is, however, libxml++, a C++ wrapper 
>> around libxml2, which is slightly more logical to use in C++ code, but 
>> it would imply an extra library. On the other hand that might function 
>> as a thin wrapper around the library.
> 
> I know of at least Expat and MSXML, and quickly also found Mini-XML, Xerces, AsmXml and RapidXml, where the last two claim parsing speeds an order of magnitude faster than libxml2.
> I see no particular reason for using any of those libraries today, but this sounds like exactly the same situation where we originally saw no reason for any other FFT libraries than FFTW :-)
> 
> Cheers,
> 
> Erik