[gmx-developers] Re: TNG format in Gromacs

Tue Apr 17 16:47:36 CEST 2012

Hi,

On Tue, Apr 17, 2012 at 9:48 AM, Erik Lindahl <erik at kth.se> wrote:

>
> On Apr 17, 2012, at 3:18 PM, Roland Schulz wrote:
>
> > E.g. for parallelization the issue is very similar to what it is for
> portability. Supporting domain decomposition makes it more difficult for
> everyone and everyone has to make sure that they don't break it. And it is
> only included because it is essential to Gromacs and used by almost everyone.
>
> Right - and that's of course something we don't want to push down just on
> the few people working with parallelization :-) We don't have automated
> tests for it yet, but when we have more functional tests the idea is that
> we should automatically reject patches that break parallel runs!
>
Yes. But we only do it for parallelization because the majority (in this
case probably everyone) agrees that this is important. We wouldn't accept a
feature which would be as time-consuming for every developer as
parallelization is, but only useful for a small minority. :-)

> I simply don't buy the argument that just because these 1132 lines are not
> perfect (they obviously aren't) portability doesn't matter at all and we
> might as well include 10 megabytes of additional source code where we have
> no control of the portability.
>
I didn't say portability isn't important at all. All I'm saying is that
portability shouldn't be treated as a Boolean. In practice portability is,
like any other metric, a scale. And the decision to support 99.9% of
platforms instead of 99.5% should be a matter of cost-benefit analysis, as is
adding a new feature.

> But I think that "fancy" IO is also an optional feature. I agree that it
> is a very important feature and it has many disadvantages if the same
> format is not used everywhere. But it is also non-essential. And at that
> point it should become a matter of cost-benefit and not a matter of
> principle. I.e. how many people benefit from features made possible by HDF5
> (e.g. because limited developer time wouldn't allow them without HDF5)
> versus how much of a pain is it to the few people who have to live with XTC
> (and conversion). And one very important factor in that cost-benefit
> analysis is the ratio of users.
>
> But now you are moving the goal-posts!  The aim of the present TNG-based
> project was NOT "fancy" IO, but a new default simple portable Gromacs
> trajectory format that (1) includes headers for atom names and stuff, (2)
> is a small free library that can easily be contributed to other codes so
> they can read/write our files, and (3) enable better compression.
>
What I meant by "fancy" IO was that it is optional. These 3 things aren't
required to run a simulation on an exotic platform (e.g. Kei) and to be
able to analyze the results (after potentially converting).

> It would of course be nice if this format also allowed efficient parallel
> IO and advanced slicing, but that has never been the primary goal of the
> file format project, in particular not if it starts to come in conflict
> with the aims above.
>
As I said before, parallel IO isn't the issue. (Simple) parallel writing is
easier without HDF5. Parallel reading (for analysis) is possible as long as
the format is seekable (can be easily added even to XTC by creating a 2nd
file with the index).
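
A minimal sketch of the index idea, assuming a hypothetical layout in which
every frame is stored as a 4-byte big-endian length followed by its payload.
This is NOT the real XTC layout, and all function names are made up for
illustration. One sequential scan records the byte offset of each frame, the
offsets go into a second file, and any reader (or any rank in a parallel
analysis) can then seek straight to the frame it wants.

```python
import io
import struct

def write_frames(stream, frames):
    """Write each frame as a 4-byte length prefix plus payload (hypothetical layout)."""
    for payload in frames:
        stream.write(struct.pack(">I", len(payload)))
        stream.write(payload)

def build_index(stream):
    """Scan the trajectory once and return the byte offset of every frame."""
    offsets = []
    stream.seek(0)
    while True:
        pos = stream.tell()
        header = stream.read(4)
        if len(header) < 4:          # end of file
            break
        (size,) = struct.unpack(">I", header)
        offsets.append(pos)
        stream.seek(size, io.SEEK_CUR)  # skip the payload without reading it
    return offsets

def save_index(index_stream, offsets):
    """Store the offsets in a separate (second) file as 8-byte integers."""
    for off in offsets:
        index_stream.write(struct.pack(">Q", off))

def load_index(index_stream):
    """Read the offsets back from the index file."""
    index_stream.seek(0)
    return [off for (off,) in struct.iter_unpack(">Q", index_stream.read())]

def read_frame(stream, offsets, i):
    """Random access: jump to frame i using the index."""
    stream.seek(offsets[i])
    (size,) = struct.unpack(">I", stream.read(4))
    return stream.read(size)
```

With real files the two streams would be opened with open(..., "rb") or
open(..., "wb"); io.BytesIO objects work the same way for trying it out.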

>
> Having said that, we just discussed things here in the lab, and one
> alternative could be to have a simple built-in HDF5 implementation that can
> write correct headers for 1-3 dimensional arrays so our normal files are
> HDF5-compliant when written on a single node. This should be possible to do
> in ~100k of source code. If there is no external HDF5 library present, this
> will be the only alternative supported, and you will not be able to use
> e.g. parallel IO - but the file format will work.
>

Option 1) Up to 100k lines we have to write and support. And the code can
only use the subset of HDF5 supported.
Option 2) Users on very exotic platforms have to keep using XTC and convert
their files in post-production (only if they want to benefit from HDF5's
advantages in analysis).

I really don't see how Option 1 could win in any reasonable
cost-benefit analysis. :-)

BTW: All of HDF5 is 135k lines (according to sloccount, excluding the C++, HL
and Fortran bindings). And HDF5 has all OS-dependent functions (IO, threads, ...)
abstracted. Thus only a small part (18 files, 9300 lines in total; this
includes the respective headers and the abstraction layer itself) has any
#ifdef for Windows. Thus only those files would need to be touched to add
support for an OS that is not POSIX, Windows, or VMS. It is even possible to
write our own low-level file layer (
http://www.hdfgroup.org/HDF5/doc/TechNotes/VFL.html), which could be based
on futil.c, to have our own OS abstraction.

> The caveat is what happens to the physical file format when HDF5 writes
> parallel IO? Will this result in a file with different properties that is
> difficult for us to read with a naive implementation?

No problem. HDF5 parallel IO doesn't produce different formats. It writes
in standard chunks (which would need to be supported anyhow for block
compression and fast seek).
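
To illustrate why chunking gives both block compression and fast seek, here
is a small sketch using only zlib. It does not use the HDF5 API and does not
reproduce the HDF5 on-disk layout. The chunk size, the fixed frame size and
the function names are all assumptions made for the example. Each chunk is
compressed independently, so reading one frame means decompressing one chunk
only.

```python
import zlib

FRAMES_PER_CHUNK = 4  # hypothetical chunk size, chosen only for the example

def compress_chunks(frames, frames_per_chunk=FRAMES_PER_CHUNK):
    """Compress fixed-size groups of equally sized frames independently."""
    chunks = []
    for start in range(0, len(frames), frames_per_chunk):
        block = b"".join(frames[start:start + frames_per_chunk])
        chunks.append(zlib.compress(block))
    return chunks

def read_frame_from_chunks(chunks, i, frame_size,
                           frames_per_chunk=FRAMES_PER_CHUNK):
    """Fetch frame i by decompressing only the chunk that contains it."""
    raw = zlib.decompress(chunks[i // frames_per_chunk])
    k = i % frames_per_chunk            # position of the frame inside its chunk
    return raw[k * frame_size:(k + 1) * frame_size]
```

Larger chunks usually compress better but cost more to decompress per seek,
so the chunk size is a trade-off between the two.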

Roland


-- 
ORNL/UT Center for Molecular Biophysics cmb.ornl.gov
865-241-1537, ORNL PO BOX 2008 MS6309