[gmx-developers] Shall we ditch gro and g96 files?

John Chodera jchodera at gmail.com
Tue Apr 1 19:04:34 CEST 2008


To address a number of excellent comments and questions:

On 01/04/2008, Erik Lindahl <lindahl at cbr.su.se> wrote:

> I think a lot of people (including me...) like to be able to do "simple"
> coordinate manipulation through scripts that just grep/awk for atom names,
> but I like Mathias suggestion of having a separate tool to translate
> back/forth instead, and keep the "core" format HDF5.

I agree that this is a much better idea than restricting the format to

Another drawback of ASCII that hasn't been mentioned is that it is
impossible to represent the floating-point numbers in exactly the same
way as they are used in gromacs.  The exact representation becomes
important when one is trying to implement algorithms on top of gromacs
that do Monte Carlo sampling by adjusting momenta, such as transition
path sampling.
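
As a rough illustration of the point above, here is a minimal Python
sketch. It assumes single-precision values and a fixed-width decimal
field ("%8.4f" is used here as a stand-in for a coordinate-file column;
the helper names and the sample value are hypothetical, not taken from
gromacs). It compares a round trip through decimal text with a round
trip through the raw bytes:

```python
import struct

def to_float32(x):
    """Round a Python float to the nearest IEEE 754 single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def ascii_roundtrip(x, fmt="%8.4f"):
    """Write x as fixed-width decimal text, then parse it back as float32."""
    return to_float32(float(fmt % x))

def binary_roundtrip(x):
    """Store the raw 4 bytes of x and read them back unchanged."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

# Hypothetical velocity component, already rounded to single precision.
velocity = to_float32(0.123456789)

ascii_copy = ascii_roundtrip(velocity)    # digits beyond the 4th decimal are lost
binary_copy = binary_roundtrip(velocity)  # bit-for-bit identical

print("ASCII round trip exact: ", ascii_copy == velocity)
print("binary round trip exact:", binary_copy == velocity)
```

Printing enough significant digits (9 for single precision) would also
survive a round trip, but fixed-width columns of this kind do not carry
that many digits, which is the situation sketched here.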

> The only thing that worries me (just a little bit :-) is that it would make
> us entirely dependent on a big external library. I know that HDF5 is _very_
> portable, but at least in theory we could end up in a situation where
> Gromacs doesn't work on some obscure platform e.g. because there's a
> compiler bug affecting HDF5.

I can't speak to whether this is a reasonable concern or not (it
sounds like you may have encountered these problems before with other
libraries).  But there is a substantial amount of funding and
engineering behind both netCDF and HDF5, so I would imagine actual
bugs would be resolved rather quickly.

If the existing formats are still supported, then it will be possible
to work around these issues until they are resolved, at least in the
majority of cases where the extra information that could be contained
in netCDF and HDF5 is irrelevant.

> Mathias/John, do you or anybody else have any experience from using HDF5 for
> development? Have there been different library versions that you need to
> install, or do packages usually include their own copy of the library?

I have personally used netCDF (the C API, the Fortran 90/95 API, and a
Python netCDF module).  For these scientific applications, I installed
the library separately.  AMBER9 comes with a distribution of netCDF
that it will configure and build if you do not already have netCDF
libraries installed.  MODELLER [http://www.salilab.org/modeller/] uses
HDF5 to store its library files, but is distributed precompiled with
HDF5 libraries for various platforms.

Many Linux distributions come with (or can easily install) netCDF or
HDF5 libraries using standard package management tools, but it might
be convenient to also include a distribution of the library with
gromacs (licenses permitting) that is configured and compiled should
no existing library be detected.

Roland's suggestion to ask around whether a 'standard' format for MD
data interchange already exists is also good, but it may be best to
just lead by example, except where a suitable format already exists
(such as AMBER's netCDF trajectory specification -- for trajectories,
one could simply extend that).

Regarding Berk's comments that he and David want a "flexible and
extendible" format for manipulating single conformations before and
after runs: I would remind you that there are convenient Python
modules that make manipulation of this data very, very easy, even for
quick scripts:

PyTables (for HDF5):


(or for netCDF4):

This sort of robust but convenient way of interacting with datafile
formats should always, ALWAYS be preferred over the "quick-and-dirty"
manipulation of ASCII files, even for "one-off" scripts.  We've all
heard the story of Geoffrey Chang's quintuple retraction by now -- I
am sure you all want to avoid similar situations.  :)


- John

Dr. John D. Chodera <jchodera at gmail.com>      | Mobile    : 415.867.7384
Postdoctoral researcher, Pande lab            | Lab phone : 650.723.1097
Department of Chemistry, Stanford University  | Lab fax   : 650.724.4021
