[gmx-users] GROMACS and XML
Peter Murray-Rust
pm286 at cam.ac.uk
Wed Apr 17 14:49:23 CEST 2002
At 19:40 17/04/2002 +0200, David van der Spoel wrote:
>On Wed, 17 Apr 2002, Peter Murray-Rust wrote:
>
> >One general rule: for every XML element ("tag") you create software has to
> >exist! So always bear in mind how the XML is to be processed. For CML we
> >have three approaches - XSLT (probably the easiest and most generally
> >applicable), CML-DOM and CML-SAX. I expect that in computational chemistry
> >the XSLT approach will initially be the most useful.
> >
> >I certainly don't want to reinvent anything that has already been done - is
> >the GROMACS DTD reachable from the website or is it in a CVS repository?
>
>There is some stuff in the release (share/top/gromacs.dtd) but it's still
>under development.
Some comments on DTDs. DTDs serve the following purposes:
- to validate the final document. This validation covers
vocabulary (element names, attribute names, allowed attribute values) and
syntactic structure (what children (elements, text) can each element
contain). Validation does not cover values (except in enumerated
attributes). It does not distinguish data types (all 'data' is character
data - #PCDATA/CDATA).
- to supply default values for attributes. This can be useful, but
means that all documents must be packaged with the DTD (or linked to it). I
don't believe this is of high value for most applications.
- to assemble the final document (using ENTITYs). This can be very
valuable for large, semistructured material. It's excellent for books (e.g.
each manual chapter could be an entity) but it has limitations and the W3C
is creating XInclude as a more natural way.
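
The second and third purposes can be sketched in Python. This is only an illustration under stated assumptions: the element names (manual, chapter), the "status" attribute and the "intro" entity are invented, and the DTD is kept as an internal subset so the example is self-contained. The standard-library parser used here reads the DTD but does not validate against it, so the first purpose is not shown.

```python
import xml.etree.ElementTree as ET

# Hypothetical document with an internal DTD subset. The DTD declares a
# default for the "status" attribute and an entity holding reusable text.
DOC = """<?xml version="1.0"?>
<!DOCTYPE manual [
  <!ELEMENT manual (chapter+)>
  <!ELEMENT chapter (#PCDATA)>
  <!ATTLIST chapter status (draft|final) "draft">
  <!ENTITY intro "Introduction to the force field">
]>
<manual>
  <chapter>&intro;</chapter>
  <chapter status="final">Simulation parameters</chapter>
</manual>
"""

root = ET.fromstring(DOC)
chapters = root.findall("chapter")

# The first chapter has no status in the instance; its value is the DTD default.
statuses = [c.get("status") for c in chapters]

# The entity reference has been replaced by its declared text.
titles = [c.text for c in chapters]

print(statuses)
print(titles)
```

If the DOCTYPE block is removed, the first chapter loses its status and the entity reference becomes a parse error, which is the packaging cost mentioned above.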
It is very difficult to write a DTD for a complex document that anticipates
all the possible variations. It has been done (TEI and DocBook are
examples) but they have difficult structures. DTDs do not support
namespaces, which is a major drawback. In general I doubt that any document
containing CML instances is likely to be validatable, or to benefit from
validation.
Having said that, DTDs are an excellent tool for systematising program
architecture.
The drawbacks above are why I have moved to Schemas. They support
namespaces, dataTypes and
re-usable structures. Moreover governments are starting to mandate them.
But they are overly complex and there is a large body of XML experts who
prefer less complex tools like RELAX and Schematron.
>You will find newer versions in the CVS. I think I have
>the description of macromolecules solved at a level that we can describe
>generally the construction of a macromolecule from building blocks (e.g.
>amino acids) and the links between them (including peptide links, cys-cys
>bridges etc.). Furthermore we have added elements to the DTD for
>describing modifications to amino acids, e.g. N- and C-terminus, but also
>other modifications, e.g. methylation of a Lysine chain can be described
>this way. The big point I want to put through is separation of molecule
>description from force field description. Basically the GROMACS input
>should consist of three parts:
>1. Molecule description
>2. Force field description
>3. Simulation parameters
I agree with this.
>My current way of thinking is to create 3 DTDs which then are combined
>into one master DTD. One could also conceive having quantum chemical
>information instead of 2 (i.e. basis sets).
Yes, multiple DTDs/Schemas are the way to do it. It also, I think, requires
multiple namespaces. DTDs made up of smaller DTDs are not fun to manage and
that is partly why schemas were invented. Schemas are meant to be
pluggable, so that (1) could be GROMACS for a Macromolecule or CML for a
small molecule.
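
A rough Python sketch of what "pluggable" could mean for a program reading such input: each part is recognised by its namespace, not by its element name. The <simulation> wrapper and the handler labels are invented for illustration; the two namespace URIs are the ones used in the foo.cml and ff.xml fragments in this message.

```python
import xml.etree.ElementTree as ET

CML_NS = "http://www.xml-cml.org/schema/core/cml2"
FF_NS = "http://www.gromacs.org/forcefield"

# Hypothetical wrapper document: the <simulation> root and its layout are
# assumptions made for this sketch.
DOC = f"""
<simulation xmlns:cml="{CML_NS}" xmlns:ff="{FF_NS}">
  <cml:molecule id="m1"/>
  <ff:forceField id="f1"/>
</simulation>
"""

def split_tag(tag):
    """Split ElementTree's '{uri}local' tag into (uri, local)."""
    if tag.startswith("{"):
        uri, local = tag[1:].split("}", 1)
        return uri, local
    return "", tag

# Which vocabulary is responsible for which part of the input.
handlers = {CML_NS: "molecule description", FF_NS: "force field description"}

parts = {}
for child in ET.fromstring(DOC):
    uri, local = split_tag(child.tag)
    parts[local] = handlers.get(uri, "unknown vocabulary")

print(parts)
```

Swapping the molecule part for another vocabulary would then only mean registering another namespace URI in the handler table.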
>A practical drawback of separating 1 and 2 is that a certain amount of
>information has to be duplicated. E.g. in 1 we would specify that there is
>a bond between N and Ca in an Ala residue, this information would also
>have to be incorporated in 2 in some way, to be able to add the
>appropriate bond length and force constants.
I think there is a subtle difference. In (1) these are instances of atoms,
in (2) they are generic descriptions ("atom types"). It is perfectly
possible to input (1) and (2) independently to a program and ask it to
assign (2) to each (1). The linkage could then be done with an extended
vocabulary. My current thoughts are:
foo.cml
<molecule xmlns="http://www.xml-cml.org/schema/core/cml2"> <!-- CML Schema namespace -->
  <atomArray>
    <atom id="a1" elementType="N" xyz3="1.2 2.3 3.4">
      <scalar title="Gromacs atom type"
        dictRef="http://www.gromacs.org/dictionary.xml#atomType">amide</scalar>
    </atom>
    ...
  </atomArray>
</molecule>
ff.xml
<!-- warning - details are rubbish! -->
<forceField xmlns="http://www.gromacs.org/forcefield">
  <atomTypes>
    <atomType id="amide">
      <parameter dictRef="oopBend" title="out of plane bending for amide"
        dataType="xsd:decimal" units="dynes/angstrom">1.234</parameter>
    </atomType>
  </atomTypes>
</forceField>
This is schematic only, but shows how the two parts are completely
decoupled - in 3 different files. The user creates the molecule input.
Somewhere s/he decides that the atoms have appropriate types. CML
deliberately omitted atomTypes as there is so much variation, but allows
extension through the dictRef attribute. This says that atom a1 (BTW almost
all XML elements should have IDs) has a property (atomType) which is
described in the Gromacs online XML dictionary. The user therefore knows
what s/he is doing and why. The Gromacs forcefield specifies what atomTypes
it recognises and (presumably) the program will have a tool to discover all
atoms that are instances of this atom type. This discovery is very easy
with XSLT. To find all atoms in an XML file which are labelled as amides,
we could write:
select="atom[scalar[@title='Gromacs atom type'][.='amide']]"
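
The same discovery, and the assignment of (2) to each (1), can be sketched outside XSLT. The Python below is a minimal illustration, not how Gromacs does or will do it: it reuses the two schematic fragments, adds an untyped atom a2 that is invented for the example, and matches on the dictRef attribute (an assumption; matching on title would work the same way).

```python
import xml.etree.ElementTree as ET

CML = "{http://www.xml-cml.org/schema/core/cml2}"
FF = "{http://www.gromacs.org/forcefield}"
ATOM_TYPE = "http://www.gromacs.org/dictionary.xml#atomType"

# Molecule input; atom a2 is an invented, untyped atom.
MOLECULE = """
<molecule xmlns="http://www.xml-cml.org/schema/core/cml2">
  <atomArray>
    <atom id="a1" elementType="N" xyz3="1.2 2.3 3.4">
      <scalar title="Gromacs atom type"
              dictRef="http://www.gromacs.org/dictionary.xml#atomType">amide</scalar>
    </atom>
    <atom id="a2" elementType="C" xyz3="2.2 2.3 3.4"/>
  </atomArray>
</molecule>
"""

# Force field input; the parameter values are schematic only.
FORCE_FIELD = """
<forceField xmlns="http://www.gromacs.org/forcefield">
  <atomTypes>
    <atomType id="amide">
      <parameter dictRef="oopBend" dataType="xsd:decimal"
                 units="dynes/angstrom">1.234</parameter>
    </atomType>
  </atomTypes>
</forceField>
"""

def atom_types(molecule):
    """Map atom id -> atom type for atoms carrying an atomType scalar."""
    types = {}
    for atom in molecule.iter(CML + "atom"):
        for scalar in atom.findall(CML + "scalar"):
            if scalar.get("dictRef") == ATOM_TYPE:
                types[atom.get("id")] = (scalar.text or "").strip()
    return types

def parameters(force_field):
    """Map atom type id -> {parameter dictRef: numeric value}."""
    table = {}
    for atom_type in force_field.iter(FF + "atomType"):
        table[atom_type.get("id")] = {
            p.get("dictRef"): float(p.text)
            for p in atom_type.findall(FF + "parameter")
        }
    return table

types = atom_types(ET.fromstring(MOLECULE))
table = parameters(ET.fromstring(FORCE_FIELD))

# Assign force field parameters to each typed atom instance.
assigned = {atom_id: table[t] for atom_id, t in types.items() if t in table}
amides = [atom_id for atom_id, t in types.items() if t == "amide"]

print(amides)
print(assigned)
```

Note that both files declare a default namespace, so any query has to use namespace-qualified names; in XSLT that means binding a prefix to each namespace URI.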
>Sorry for the long expose, but it would really be beneficial to cooperate
>with you, since you have much more experience in XML. One reason we didn't
>base our efforts on CML is exactly the lack of macromolecule support,
This is deliberate. The macromolecular community has many ontological
efforts (I have been involved with CIF for many years, for example). In
fact mmCIF is the basis of the OMG/CORBA formulation, and I suspect it to
be a candidate for XML serialization. EBI/MSD now outputs XML/CML, though
without coordinates. Macromolecules are complex and I am happy to re-use
the community's efforts and try to make sure CML interoperates.
mmCIF is rather complex (it's based on relational descriptions) and you may
well find that your own approach is tractable and valuable.
You have mentioned input, but not output (probably wisely!). Output is both
extensive and complex. There are several issues (some of which extend to MO
as well):
- output can be very variable depending on input options. Many
programs are based on modules so it seems useful to output a <module>
element for each main program section. In itself this can make a major
contribution to parsability.
- many outputs are iterative. <cycle> information can be valuable,
so that plots of key quantities can be easily extracted
- many outputs consist of multiple views of the same system. For a
single molecule these might be snapshots on a trajectory, conformations,
vibrational modes, points on a reaction pathway, etc. There is usually some
molecular information which is invariant (atom count, atom labels, nuclear
properties) which does not need to be repeated. The coordinates (and
sometimes other quantities - charges, bond orders, etc.) change. This is an
inheritance design. Herman and I like the term <configuration> but others
disagree. (Babel uses <pose> which is at least novel). The design has to
accommodate this.
- there are often problems of reference frame. Do all components
of a system have the same reference frame or does one have hierarchical
structure?
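
To make the first three points concrete, here is a hedged Python sketch of reading such output. Only <module>, <cycle> and <configuration> are taken from the discussion above; the <output>, <atoms> and <coords> elements, all attribute names and all numbers are invented for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical output layout; see the assumptions stated above.
OUTPUT = """
<output>
  <module name="minimise">
    <cycle n="1"><scalar title="energy">-10.5</scalar></cycle>
    <cycle n="2"><scalar title="energy">-12.0</scalar></cycle>
  </module>
  <module name="dynamics">
    <atoms>
      <atom id="a1" elementType="N"/>
      <atom id="a2" elementType="C"/>
    </atoms>
    <configuration id="c1">
      <coords atomRef="a1">0.0 0.0 0.0</coords>
      <coords atomRef="a2">1.5 0.0 0.0</coords>
    </configuration>
    <configuration id="c2">
      <coords atomRef="a1">0.1 0.0 0.0</coords>
      <coords atomRef="a2">1.4 0.0 0.0</coords>
    </configuration>
  </module>
</output>
"""

root = ET.fromstring(OUTPUT)

# One <module> per program section lets a reader jump straight to a section.
minimise = root.find("module[@name='minimise']")

# Per-cycle values come out as a series that can be plotted directly.
energies = [
    (int(c.get("n")), float(c.find("scalar[@title='energy']").text))
    for c in minimise.findall("cycle")
]

# Invariant atom data is read once; each configuration adds only coordinates.
dynamics = root.find("module[@name='dynamics']")
elements = {a.get("id"): a.get("elementType") for a in dynamics.iter("atom")}
frames = {
    conf.get("id"): {
        c.get("atomRef"): tuple(float(v) for v in c.text.split())
        for c in conf.findall("coords")
    }
    for conf in dynamics.findall("configuration")
}

print(energies)
print(elements)
print(frames["c2"]["a2"])
```

The reference-frame question is not addressed by this sketch; every coordinate here is assumed to share one frame.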
Many of these can be solved, especially if dictionaries are created. For
example the concept (not the value!) of "moment of inertia", "dipole
moment", etc. should be independent of Gromacs, MOPAC, etc. I am currently
working with Steve Stein's IUPAC project to systematize XML dictionaries,
so a collection of re-usable concepts would be extremely valuable.
>but
>also I found the use of schema rather (maybe even overly) complicated.
I sympathize about the schema and suspect that schema validation is not
critical for Gromacs itself. However if you are getting CML or other XML
files you will often need to validate them in some way. XSLT is often
better (and more powerful) than Schema.
P.
--
Peter Murray-Rust, pm286 AT cam.ac.uk
Unilever Centre for Molecular Informatics, Chemistry Department
Lensfield Road, Cambridge, CB2 1EW, UK
+44-1223-336-432