[gmx-users] GROMACS and XML

Wed Apr 17 14:49:23 CEST 2002

At 19:40 17/04/2002 +0200, David van der Spoel wrote:
>On Wed, 17 Apr 2002, Peter Murray-Rust wrote:
>
> >One general rule: for every XML element ("tag") you create software has to
> >exist! So always bear in mind how the XML is to be processed. For CML we
> >have three approaches - XSLT (probably the easiest and most generally
> >applicable), CML-DOM and CML-SAX. I expect that in computational chemistry
> >the XSLT approach will initially be the most useful.
> >
> >I certainly don't want to reinvent anything that has already been done - is
> >the GROMACS DTD reachable from the website or is it in a CVS repository?
>
>There is some stuff in the release (share/top/gromacs.dtd) but it's still
>under development.

Some comments on DTDs. DTDs serve the following purposes:
         - to validate the final document. This validation covers 
vocabulary (element names, attribute names, allowed attribute values) and 
syntactic structure (what children (elements, text) can each element 
contain). Validation does not cover values (except in enumerated 
attributes). It does not distinguish data types (all 'data' is character 
data - #PCDATA /CDATA)
         - to supply default values for attributes. This can be useful, but 
means that all documents must be packaged with the DTD (or linked to it). I 
don't believe this is of high value for most applications.
         - to assemble the final document (using ENTITYs). This can be very 
valuable for large, semistructured material. It's excellent for books (e.g. 
each manual chapter could be an entity) but it has limitations and the W3C 
is creating XInclude as a more natural way.
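
A minimal sketch of the second and third purposes (attribute defaults and entity assembly), using Python's standard expat binding. The element, attribute and entity names here are invented for illustration and are not taken from gromacs.dtd. expat is a non-validating parser, so it reads the DTD declarations but does not check the document against the content models.

```python
from xml.parsers import expat

# Hypothetical document carrying its DTD as an internal subset.
DOC = """<!DOCTYPE manual [
  <!ELEMENT manual (chapter+)>
  <!ELEMENT chapter (#PCDATA)>
  <!ATTLIST chapter status (draft|final) "draft">
  <!ENTITY intro "Introduction to the force field.">
]>
<manual>
  <chapter>&intro;</chapter>
  <chapter status="final">Simulation parameters.</chapter>
</manual>
"""

statuses = []    # status attribute seen on each <chapter>
text_parts = []  # all character data, in document order


def start_element(name, attrs):
    # attrs includes values defaulted from the ATTLIST declaration.
    if name == "chapter":
        statuses.append(attrs.get("status"))


def character_data(data):
    # Internal entity references arrive here already expanded.
    text_parts.append(data)


parser = expat.ParserCreate()
parser.StartElementHandler = start_element
parser.CharacterDataHandler = character_data
parser.Parse(DOC, True)

full_text = "".join(text_parts)
print(statuses)
print(full_text.strip())
```

The first chapter never states a status, yet the program sees one. That only works because the declarations travel with the document, which is the packaging cost mentioned above.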

It is very difficult to write a DTD for a complex document that anticipates 
all the possible variations. It has been done (TEI and DocBook are 
examples) but they have difficult structures. DTDs do not support 
namespaces, which is a major drawback. In general I doubt that any document 
containing CML instances is likely to be validatable, or to benefit from 
validation.

Having said that, DTDs are an excellent tool for systematising program 
architecture.

Because of those limitations I have moved to Schemas. They support namespaces, dataTypes and 
re-usable structures. Moreover governments are starting to mandate them. 
But they are overly complex and there is a large body of XML experts who 
prefer less complex tools like RELAX and Schematron.

>You will find newer versions in the CVS. I think I have
>the description of macromolecules solved at a level that we can describe
>generally the construction of a macromolecule from building blocks (e.g.
>amino acids) and the links between them (including peptide links, cys-cys
>bridges etc.). Furthermore we have added  elements to the DTD for
>describing modifications to amino acids, e.g. N- and C-terminus, but also
>other modifications, e.g. methylation of a Lysine chain can be described
>this way. The big point I want to put through is separation of molecule
>description from force field description. Basically the GROMACS input
>should consist of three parts:
>1. Molecule description
>2. Force field description
>3. Simulation parameters

I agree with this.

>My current way of thinking is to create 3 DTDs which then are combined
>into one master DTD. One could also conceive having quantum chemical
>information instead of 2 (i.e. basis sets).
Yes. Multiple DTDs/Schemas are the way to do it. It also, I think, requires 
multiple namespaces. DTDs made up of smaller DTDs are not fun to manage and 
that is partly why schemas were invented. Schemas are meant to be 
pluggable, so that (1) could be GROMACS for a Macromolecule or CML for a 
small molecule.
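
As a rough illustration of why namespaces make the parts pluggable, the Python sketch below routes each child of a combined input by its namespace URI. The two URIs are the ones used in the foo.cml and ff.xml sketches in this message. The <input> wrapper element and the handler labels are assumptions made up for the example.

```python
import xml.etree.ElementTree as ET

CML_NS = "http://www.xml-cml.org/schema/core/cml2"
FF_NS = "http://www.gromacs.org/forcefield"

# Hypothetical combined input: the <input> wrapper is invented here.
DOC = f"""<input>
  <molecule xmlns="{CML_NS}"><atomArray/></molecule>
  <forceField xmlns="{FF_NS}"><atomTypes/></forceField>
</input>"""

# Which part of the program would consume each vocabulary.
HANDLERS = {
    CML_NS: "molecule description",
    FF_NS: "force field description",
}


def split_tag(tag):
    """Split ElementTree's '{uri}local' tag form into (uri, local)."""
    if tag.startswith("{"):
        uri, local = tag[1:].split("}", 1)
        return uri, local
    return "", tag


root = ET.fromstring(DOC)
parts = {}
for child in root:
    uri, local = split_tag(child.tag)
    parts[local] = HANDLERS.get(uri, "unknown vocabulary")

print(parts)
```

Swapping the molecule vocabulary would then mean registering a different namespace URI, without touching the force field handling.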

>A practical  drawback of separating 1 and 2 is that a certain amount of
>information has to be duplicated. E.g. in 1 we would specify that there is
>a bond between N and Ca in an Ala residue, this information would also
>have to be incorporated in 2 in some way, to be able to add the
>appropriate bond length and force constants.
I think there is a subtle difference. In (1) these are instances of atoms, 
in (2) they are generic descriptions ("atom types"). It is perfectly 
possible to input (1) and (2) independently to a program and ask it to 
assign (2) to each (1). The linkage could then be done with an extended 
vocabulary. My current thoughts are:

foo.cml
<molecule xmlns="http://www.xml-cml.org/schema/core/cml2"> <!-- CML Schema namespace -->
   <atomArray>
     <atom id="a1" elementType="N" xyz3="1.2 2.3 3.4">
       <scalar title="Gromacs atom type" dictRef="http://www.gromacs.org/dictionary.xml#atomType">amide</scalar>
     </atom>
...
   </atomArray>
</molecule>

ff.xml
<!-- warning - details are rubbish! -->
<forceField xmlns="http://www.gromacs.org/forcefield">
     <atomTypes>
       <atomType id="amide">
         <parameter dictRef="oopBend" title="out of plane bending for amide" dataType="xsd:decimal" units="dynes/angstrom">1.234</parameter>
       </atomType>
     </atomTypes>
</forceField>

This is schematic only, but shows how the two parts are completely 
decoupled - in 3 different files.  The user creates the molecule input. 
Somewhere s/he decides that the atoms have appropriate types. CML 
deliberately omitted atomTypes as there is so much variation, but allows 
extension through the dictRef attribute. This says that atom a1 (BTW almost 
all XML elements should have IDs) has a property (atomType) which is 
described in the Gromacs online XML dictionary. The user therefore knows 
what s/he is doing and why. The Gromacs forcefield specifies what atomTypes 
it recognises and (presumably) the program will have a tool to discover all 
atoms that are instances of this atom type. This discovery is very easy 
with XSLT. To find all atoms in an XML file which are labelled as amides, 
we could write:
         select="atom[scalar[@title='Gromacs atom type'][.='amide']]"
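
The same discovery, plus the assignment of force field parameters to each atom, can be sketched in Python rather than XSLT. This is only an illustration under assumptions: the two documents repeat the schematic foo.cml and ff.xml above (whose details are admittedly rubbish), atom a2 is invented so that there is an untyped atom, and the type label is read from the <scalar> child of each <atom>, which is where foo.cml puts it.

```python
import xml.etree.ElementTree as ET

CML = "http://www.xml-cml.org/schema/core/cml2"
FF = "http://www.gromacs.org/forcefield"
NS = {"cml": CML, "ff": FF}

# foo.cml from the message, plus an invented untyped atom a2.
FOO_CML = f"""<molecule xmlns="{CML}">
  <atomArray>
    <atom id="a1" elementType="N" xyz3="1.2 2.3 3.4">
      <scalar title="Gromacs atom type"
              dictRef="http://www.gromacs.org/dictionary.xml#atomType">amide</scalar>
    </atom>
    <atom id="a2" elementType="C" xyz3="2.2 2.3 3.4"/>
  </atomArray>
</molecule>"""

# ff.xml from the message.
FF_XML = f"""<forceField xmlns="{FF}">
  <atomTypes>
    <atomType id="amide">
      <parameter dictRef="oopBend" title="out of plane bending for amide"
                 dataType="xsd:decimal" units="dynes/angstrom">1.234</parameter>
    </atomType>
  </atomTypes>
</forceField>"""

molecule = ET.fromstring(FOO_CML)
force_field = ET.fromstring(FF_XML)

# Index the generic descriptions: atom type id -> {parameter name: value}.
params_by_type = {
    atom_type.get("id"): {
        p.get("dictRef"): float(p.text)
        for p in atom_type.findall("ff:parameter", NS)
    }
    for atom_type in force_field.iterfind(".//ff:atomType", NS)
}

# Assign parameters to each atom instance that carries a type label.
assigned = {}
for atom in molecule.iterfind(".//cml:atom", NS):
    for scalar in atom.findall("cml:scalar", NS):
        if scalar.get("title") == "Gromacs atom type":
            type_id = scalar.text.strip()
            assigned[atom.get("id")] = params_by_type.get(type_id, {})

print(assigned)
```

Neither document refers to the other by file name. The only coupling is the shared type label "amide".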

>Sorry for the long expose, but it would really be beneficial to cooperate
>with you, since you have much more experience in XML. One reason we didn't
>base our efforts on CML is exactly the lack of macromolecule support,

This is deliberate. The macromolecular community has many ontological 
efforts (I have been involved with CIF for many years, for example). In 
fact mmCIF is the basis of the OMG/CORBA formulation, and I suspect it to 
be a candidate for XML serialization. EBI/MSD now outputs XML/CML, though 
without coordinates. Macromolecules are complex and I am happy to re-use 
the community's efforts and try to make sure CML interoperates.

mmCIF is rather complex (it's based on relational descriptions) and you may 
well find that your own approach is tractable and valuable.

You have mentioned input, but not output. (probably wisely!) Output is both 
extensive and complex. There are several issues (some of which extend to MO 
as well):
         - output can be very variable depending on input options. Many 
programs are based on modules so it seems useful to output a <module> 
element for each main program section. In itself this can make a major 
contribution to parsability.
         - many outputs are iterative. <cycle> information can be valuable, 
so that plots of key quantities can be easily extracted.
         - many outputs consist of multiple views of the same system. For a 
single molecule these might be snapshots on a trajectory, conformations, 
vibrational modes, points on a reaction pathway, etc. There is usually some 
molecular information which is invariant (atom count, atom labels, nuclear 
properties) which does not need to be repeated. The coordinates (and 
sometimes other quantities - charges, bond orders, etc.) change. This is an 
inheritance design. Herman and I like the term <configuration> but others 
disagree. (Babel uses <pose> which is at least novel). The design has to 
accommodate this.
         - there are often problems of reference frame. Do all components 
of a system have the same reference frame, or is there a hierarchical 
structure?
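
To make the first three points concrete, here is a small Python sketch that reads a made-up output file. Only the element names <module>, <cycle> and <configuration> come from the discussion above. The <output> wrapper, the <coords> element and all attribute names and numbers are assumptions invented for the example, not an existing GROMACS or CML format.

```python
import xml.etree.ElementTree as ET

# Hypothetical program output, one <module> per main program section.
OUTPUT = """<output>
  <module name="minimise">
    <cycle n="1"><scalar title="energy">-10.5</scalar></cycle>
    <cycle n="2"><scalar title="energy">-11.2</scalar></cycle>
    <cycle n="3"><scalar title="energy">-11.3</scalar></cycle>
  </module>
  <module name="dynamics">
    <atomArray>
      <atom id="a1" elementType="N"/>
      <atom id="a2" elementType="C"/>
    </atomArray>
    <configuration id="c1">
      <coords atomRef="a1">0.0 0.0 0.0</coords>
      <coords atomRef="a2">1.5 0.0 0.0</coords>
    </configuration>
    <configuration id="c2">
      <coords atomRef="a1">0.1 0.0 0.0</coords>
      <coords atomRef="a2">1.6 0.0 0.0</coords>
    </configuration>
  </module>
</output>"""

root = ET.fromstring(OUTPUT)

# Iterative output: pull one key quantity per cycle, ready for plotting.
minimise = root.find("module[@name='minimise']")
energies = [
    (int(cycle.get("n")), float(cycle.find("scalar[@title='energy']").text))
    for cycle in minimise.findall("cycle")
]

# Multiple views: the invariant atom data is stored once, and each
# configuration supplies only the coordinates that change.
dynamics = root.find("module[@name='dynamics']")
elements = {a.get("id"): a.get("elementType") for a in dynamics.find("atomArray")}
trajectory = {}
for conf in dynamics.findall("configuration"):
    trajectory[conf.get("id")] = [
        (elements[c.get("atomRef")], tuple(float(v) for v in c.text.split()))
        for c in conf.findall("coords")
    ]

print(energies)
print(trajectory["c2"])
```

Because each section is wrapped in its own <module>, the reader can pick out the part it understands and ignore the rest.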

Many of these can be solved, especially if dictionaries are created. For 
example the concept (not the value!) of "moment of inertia", "dipole 
moment", etc. should be independent of Gromacs, MOPAC, etc. I am currently 
working with Steve Stein's IUPAC project to systematize XML dictionaries, 
so a collection of re-usable concepts would be extremely valuable.

>but
>also I found the use of schema rather (maybe even overly) complicated.

I sympathize about the schema and suspect that schema validation is not 
critical for Gromacs itself. However if you are getting CML or other XML 
files you will often need to validate them in some way. XSLT is often 
better (and more powerful) than Schema.
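
The kind of check meant here is rule-based: constraints such as "every atom has an id" or "every atom type is known to the force field", which a grammar-style schema does not express easily. The sketch below shows the idea in Python instead of XSLT. The rules, the sample molecule and the set of known types are assumptions chosen for illustration, and namespaces are left out to keep it short.

```python
import xml.etree.ElementTree as ET

# Hypothetical incoming file: the second atom breaks both rules.
MOLECULE = """<molecule>
  <atomArray>
    <atom id="a1" elementType="N"><scalar title="Gromacs atom type">amide</scalar></atom>
    <atom elementType="C"><scalar title="Gromacs atom type">carbonyl</scalar></atom>
  </atomArray>
</molecule>"""

# Atom types the (hypothetical) force field recognises.
KNOWN_TYPES = {"amide"}


def check(molecule_xml, known_types):
    """Return a list of human-readable rule violations."""
    problems = []
    root = ET.fromstring(molecule_xml)
    for index, atom in enumerate(root.iter("atom"), start=1):
        label = atom.get("id") or f"atom #{index}"
        # Rule 1: every atom must carry an id.
        if atom.get("id") is None:
            problems.append(f"{label}: missing id")
        # Rule 2: every declared atom type must be known to the force field.
        for scalar in atom.findall("scalar"):
            if scalar.get("title") == "Gromacs atom type":
                type_id = (scalar.text or "").strip()
                if type_id not in known_types:
                    problems.append(f"{label}: unknown atom type {type_id!r}")
    return problems


problems = check(MOLECULE, KNOWN_TYPES)
for line in problems:
    print(line)
```

The second rule depends on data from another document (the force field), which is the sort of cross-document constraint that is awkward to state in a schema.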

         P.

--
Peter Murray-Rust, pm286 AT cam.ac.uk
Unilever Centre for Molecular Informatics, Chemistry Department
Lensfield Road, Cambridge, CB2 1EW, UK
+44-1223-336-432