[gmx-developers] libxml2

Mon Nov 11 14:12:03 CET 2013

On 2013-11-11 13:32, Erik Lindahl wrote:
> PS:
>
> If it wasn’t implicitly clear, I can try to help realize this, although
> I can’t promise to do it right away, and I can’t do it myself :-)
>
>
> To try to be constructive, I’ve been considering the scenario where we
> want to describe the execution of a complete job, including some
> components that will require extra chemical data. Longer-term, I think
> it would be great if we could assemble a single XML document that really
> describes the entire system (even coordinates), force field, topology,
> simulation, the MDP settings, metadata settings for parallelization, and
> not least the chemical data.
>
> Then we could have a structure with a top-level “gromacs” XML namespace
> that just contains metadata (user, generating program, etc) and a bunch
> of lower namespaces that contain the actual data.
> These could for instance be “forcefield”, “topology”, mdp parameters,
> and likely a separate block to be able to describe higher-level
> simulation metadata (e.g. parallelization or that we should run N
> simulations in REMD).
>
> We don’t need to think of the contents of most of these until we
> implement them. If we want to start with the special case of structure
> factors I guess the questions we should think of are:
>
>
> 1) Where do we see this type of data fitting in a bigger Gromacs
> namespace? What other similar data might we have in the future?
>
> 2) Are there any other structure factors that could occur in a
> simulation (say, X-ray)? Can we describe those in the same
> datastructure, or should they be separate? If separate, we should
> reflect that in the naming, etc.
>
> 3) Can we design a simple datastructure for _this_ type of data, so
> other programs that need it can ask Gromacs (which will also validate
> input xml files) rather than write their own XML parsing code?
>
>
> If that sounds potentially interesting I can try to contribute by
> starting to sketch on the highest-level namespace?
>

OK, sorry for whining, but my problem is, like most of us, lack of time. 
I realize that everything will be a lot better with namespaces and 
schemas, although I do think that even simple xml is a lot better than 
text.

Another drawback with a monster XML file is that it would move us away 
from modularity again: having sfactor reading and writing completely 
encapsulated in the waxsdebye module also means that if I break that 
code, nothing else will be affected. For instance, I am about 
to implement gtest stuff for reading and writing sfactors (and for 
computing sfactors). Of course you can argue that having a single XML 
I/O module also is modular, but then we will have to change formats for 
every addition that we make. Hence we have a conflict between two design 
goals that we have to resolve.

If you or anyone else wants to sketch a namespace that's fine with me, 
but first we have to resolve the above question.
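
To make the trade-off concrete, here is a minimal sketch of the layout 
proposed in the quoted text: one top-level "gromacs" element with 
separate namespaced blocks below it, and a reader that only looks at its 
own block. Everything in it is hypothetical and for illustration only: 
the namespace URIs, the element and attribute names, and the numeric 
values are invented, and it uses Python's standard xml.etree.ElementTree 
rather than libxml2 to keep it short. It does no DTD or schema 
validation.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace URIs, invented for this sketch only.
NS = {
    "gmx": "http://example.org/gromacs",
    "sf": "http://example.org/gromacs/sfactor",
    "mdp": "http://example.org/gromacs/mdp",
}

# One document, several independent blocks. All names and values are
# placeholders, not a real GROMACS format.
DOCUMENT = """\
<gmx:gromacs xmlns:gmx="http://example.org/gromacs"
             xmlns:sf="http://example.org/gromacs/sfactor"
             xmlns:mdp="http://example.org/gromacs/mdp">
  <gmx:metadata user="someuser" program="example"/>
  <mdp:parameters>
    <mdp:param name="nsteps" value="1000"/>
  </mdp:parameters>
  <sf:sfactors>
    <sf:element name="C" a="1.0" b="2.0"/>
    <sf:element name="O" a="3.0" b="4.0"/>
  </sf:sfactors>
</gmx:gromacs>
"""


def read_sfactors(root):
    """Return {element name: {attribute: float}} from the sfactor block.

    Only elements in the sfactor namespace are inspected, so changes to
    other blocks do not affect this reader.
    """
    block = root.find("sf:sfactors", NS)
    if block is None:
        return {}
    return {
        e.get("name"): {k: float(v) for k, v in e.attrib.items() if k != "name"}
        for e in block.findall("sf:element", NS)
    }


root = ET.fromstring(DOCUMENT)
sfactors = read_sfactors(root)
print(sorted(sfactors))
```

In this arrangement the sfactor reader would not need to change when 
another block is added or modified, although the top-level document 
structure would still have to be agreed on.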

>
> Cheers,
>
> Erik
>
>
> On 11 Nov 2013, at 03:47, Erik Lindahl <erik.lindahl at scilifelab.se>
> wrote:
>
>> Hi,
>>
>> On 10 Nov 2013, at 23:33, David van der Spoel <spoel at xray.bmc.uu.se>
>> wrote:
>>>
>>> I guess this will prevent us from using xml in practice. We have
>>> discussed xml for ten years or so, but the transition to xml schema is a
>>> real show stopper. I don't have the time to learn that as well. Does
>>> that imply I should stop developing? In addition, for many small files
>>> you don't need a dtd or schema (and in fact there isn't one for these
>>> xml files), it's just that the libxml2 library demands you put it into
>>> the file. If we're talking rtp files then that's another matter where
>>> more structure is needed.
>>
>> I think the ability to validate the contents of a file is the core
>> concept we want from XML. An XML file that doesn’t have any DTD or
>> Schema is just a textfile that looks fancier - you can add illegal
>> data anywhere, and then you rely only on the internal logic of the
>> program reading it to catch your error (or not) - that won’t really be
>> much safer than our current text files.
>>
>> Writing a schema for a simple file takes less than an hour to learn,
>> and there are even free DTD-to-schema converters. Obviously, it will
>> still be a lot of work to write an advanced schema e.g. for
>> topologies, but I don’t think that’s on the table right now.  However,
>> just as class design is a pain for all of us (well, maybe not Teemu
>> :-), the reason for doing it is that it will save time for all
>> developers and lead to fewer bugs in the long run.
>>>
>>> Some other points, like having clear names and units I do agree with and
>>> can change it in my present application.
>>>
>>> Common modules for writing and reading implies that all possible data
>>> should be merged into one or a few monster formats. This in itself will
>>> create extra problems.
>>
>> Well, it doesn’t necessarily have to be _one_ single format, but I
>> think it is a far better solution to standardize on how we do it
>> rather than ~20 tools each inventing their own structure for how to
>> store and read data? That is what we have right now with the text files...
>>
>>> As for changing names of files, this shouldn't be necessary as one
>>> should be able to see from the content what kind of file this is. No
>>> strong feelings here but it would be very confusing to add many new
>>> file names.
>>
>> If we have a good namespace structure we can probably get around
>> without it. However, at some point we have to consider how to separate
>> the topology XML file from the mdp XML file in each directory.
>>
>>> @Mark: an extra layer wouldn't help would it - there is no competing
>>> package as far as I know. There is, however, libxml++, a C++ wrapper
>>> around libxml2, which is slightly more logical to use in C++ code, but
>>> it would imply an extra library. On the other hand that might function
>>> as a thin wrapper around the library.
>>
>> I know of at least Expat and MSXML, and quickly also found Mini-XML,
>> Xerces, AsmXml and RapidXml, where the last two are claiming an order
>> of magnitude faster parsing speeds than libxml2.
>> I see no particular reason for using any of those libraries today, but
>> this sounds like exactly the same situation where we originally saw no
>> reason for any other FFT libraries than FFTW :-)
>>
>> Cheers,
>>
>> Erik
>>
>
>
>

-- 
David van der Spoel, Ph.D., Professor of Biology
Dept. of Cell & Molec. Biol., Uppsala University.
Box 596, 75124 Uppsala, Sweden. Phone:	+46184714205.
spoel at xray.bmc.uu.se    http://folding.bmc.uu.se