[gmx-developers] Gromacs parallel I/O?
roland at utk.edu
Wed Jul 7 08:24:18 CEST 2010
On Wed, Jul 7, 2010 at 1:14 AM, David van der Spoel <spoel at xray.bmc.uu.se> wrote:
> On 7/7/10 1:57 AM, Roland Schulz wrote:
>> On Tue, Jul 6, 2010 at 7:18 PM, Shirts, Michael (mrs5pt)
>> <mrs5pt at eservices.virginia.edu> wrote:
>> > BTW: Regarding parallel read of XTC for analysis tools. I suggest we add an
>> > XTC meta-file to solve the problem of parallel read for XTC. To be able to
>> > read frames in parallel we need to know the starting position of each frame.
>> > Using the bisect search for XTC in parallel will probably give poor
>> > performance on most parallel IO systems (a small random-access IO pattern is
>> > exactly what parallel IO systems don't like). Using TRR instead for parallel
>> > analysis is also not such a good idea because even with parallel IO several
>> > analyses will be IO bound and thus we could benefit from the XTC compression.
>> > Thus an XTC file with a meta-file containing the starting positions should
>> > give the best performance. A separate meta-file instead of adding
>> > positions to the header has the advantage that we don't change the current
>> > format and thus don't break compatibility with 3rd party software. Having a
>> > separate meta-file has the disadvantage of the required bookkeeping to make
>> > sure that the XTC file and the meta-file are in sync with each other, but I
>> > think this shouldn't be too difficult to solve. And if a meta-file is missing
>> > or not up to date it is possible to generate it on the fly.
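A minimal sketch of the meta-file idea described above. It does not parse real XTC: as a stand-in it assumes a toy format in which every frame is prefixed by a 4-byte length, and all function names and the index layout are hypothetical. The staleness check (file size plus modification time) is also only one assumed way to do the bookkeeping; if the index is missing or stale it is regenerated on the fly.

```python
import os
import struct

HEADER = ">QQQ"  # assumed index header: trajectory size, mtime (ns), frame count


def write_frames(path, frames):
    """Write variable-length frames, each prefixed by a 4-byte length (toy format, not XTC)."""
    with open(path, "wb") as f:
        for payload in frames:
            f.write(struct.pack(">I", len(payload)) + payload)


def build_index(traj_path):
    """Scan the trajectory once and return the starting byte offset of every frame."""
    offsets, pos = [], 0
    size = os.path.getsize(traj_path)
    with open(traj_path, "rb") as f:
        while pos < size:
            offsets.append(pos)
            f.seek(pos)
            (length,) = struct.unpack(">I", f.read(4))
            pos += 4 + length
    return offsets


def load_index(traj_path, index_path):
    """Return the frame offsets, regenerating the meta-file if it is missing or stale."""
    st = os.stat(traj_path)
    try:
        with open(index_path, "rb") as f:
            size, mtime_ns, n = struct.unpack(HEADER, f.read(24))
            if (size, mtime_ns) == (st.st_size, st.st_mtime_ns):
                return list(struct.unpack(f">{n}Q", f.read(8 * n)))
    except (FileNotFoundError, struct.error):
        pass  # missing or truncated meta-file: fall through and rebuild
    offsets = build_index(traj_path)
    with open(index_path, "wb") as f:
        f.write(struct.pack(HEADER, st.st_size, st.st_mtime_ns, len(offsets)))
        f.write(struct.pack(f">{len(offsets)}Q", *offsets))
    return offsets


def read_frame(traj_path, offset):
    """Read one frame directly at a known offset, with no searching."""
    with open(traj_path, "rb") as f:
        f.seek(offset)
        (length,) = struct.unpack(">I", f.read(4))
        return f.read(length)
```

With the offsets known, each reader can seek straight to its own frames, which avoids the small random reads of a bisect search.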
>> I'm wondering if this is the sort of problem that eventually moving to
>> something like netCDF might help solve. Clearly, it would be a
>> move, and would require interconversion utilities for backward
>> I looked into this. The compression of XTC is very good. And good
>> compression is important if you want to have a good IO rate (of the
>> uncompressed data). NetCDF3 doesn't support compression (there are
>> unsupported extensions). HDF5/NetCDF4 support compression but
>> only parallel read of compressed data, not parallel write of compressed
>> data. Also the zlib compression would have a significantly lower
>> compression ratio than the XTC compression does.
>> Thus none of them would by itself do all we would like to do. Of course one
>> could do the XTC compression within a NetCDF/HDF5 container, but I don't
>> see how this would help anyone. Without the full required support for
>> compression the only other advantage I could see in moving to
>> NetCDF/HDF5 is that it is easier for others to program readers/writers (which is
>> already very easy since the library xdrfile has been released). And if
>> we have our custom compression within NetCDF/HDF5 then reading those
>> files wouldn't be any easier than reading/writing current XTC files.
>> Without compression we could as well use TRR. Writing a parallel
>> reader/writer for that is dead simple (since the position of each frame
>> is known from the number of atoms).
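To illustrate why fixed-size frames make this simple, here is a rough sketch. The frame layout (a fixed-size header followed by three 4-byte values per atom per stored vector) is an assumption for illustration only and is not the actual TRR layout; `header_bytes` and both function names are hypothetical.

```python
def frame_offset(frame, natoms, header_bytes=64, bytes_per_value=4, vectors_per_atom=1):
    """Byte offset of a frame when every frame has the same size (assumed layout)."""
    frame_bytes = header_bytes + natoms * 3 * bytes_per_value * vectors_per_atom
    return frame * frame_bytes


def frames_for_rank(rank, nranks, nframes):
    """Contiguous block of frame numbers handled by one rank."""
    base, extra = divmod(nframes, nranks)
    start = rank * base + min(rank, extra)
    stop = start + base + (1 if rank < extra else 0)
    return range(start, stop)
```

Each rank can compute its own offsets without reading anything or talking to any other rank, so no index or search is needed.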
> A person here at UU (Daniel Spångberg) has developed a new trajectory
> library (TNG - trajectory next generation). We are about to submit a paper
> about it. Key advantages over xtc:
> - slightly better compression (slightly slower in the best form, but
> algorithm is tunable)
> - support for velocities
> - support for additional information (e.g. atom names) in one or more
> - random search supported without binary search
> - parallel compression
parallel in the sense of multi-threaded?
> - open source
> This will provide a very good basis for parallel trajectory I/O.
This sounds great!
> The main problem for parallel I/O is management of atom numbers in a domain
> decomposition setup. If atoms drift to another processor over time this will
> imply that bookkeeping has to deal with this, in particular when assembling
> the trajectories later for analysis.
Yes. Without DD and TNG (or TRR) it is extremely easy. The question is how
to do DD correctly. This is what I tried to address. But in case it wasn't
quite clear what I meant, here is some more detail.
There are two possible approaches I can think of for parallel I/O with DD (I
describe here only how to write. Reading should be similar):
- If you want to minimize the communication and want to be able to write a
single frame in parallel you can't write the atoms in their original order.
If you did, the required sorting and small writes would kill
the performance. As I understand you, writing in this non-sorted way would
require the bookkeeping you mention. One needs to write both the atom data
and the atom numbers to the trajectory (to be able to sort later) and then
either do the sorting as a post-processing step or make the analysis tools
able to read a format where atoms are not sorted.
- Or you do the global sort of atoms as currently (involving global
communication) but in a way that you hide latencies and communication time.
I think this is possible by writing less frequently and thus always >~50
frames at one time. This allows you to do the sorting as currently, and thus
no post-processing or bookkeeping in the analysis tools is required. A single
frame is still written serially but you are able to write several frames in
parallel.
I think the 2nd approach is much better. It should be easier to implement
because it doesn't require any changes to analysis tools and the current
writing functions only need a few changes. Also the less frequent writes are
needed anyhow for good performance because of I/O latency and because I/O
systems have low bandwidth with small chunk sizes. And since the I/O bandwidth
will always be much lower than the communication bandwidth (if optimized)
the global communication required for the sort won't be a bottleneck.
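A small sketch of how the second approach could look, again as a serial simulation without MPI. The round-robin assignment of buffered frames to writer ranks is my assumption about one possible scheme, and the function name and data layout are hypothetical.

```python
def flush_batch(batch, nranks, natoms):
    """Sort and distribute a batch of buffered frames.

    batch: list of frames; each frame is a list over ranks of
           (global_index, xyz) records for the atoms local to that rank.
    Returns {writer_rank: [(frame_number, sorted_frame), ...]}.
    Frame k is collected (the global sort) on rank k % nranks, so each
    frame is written serially while different frames go out in parallel.
    """
    per_writer = {rank: [] for rank in range(nranks)}
    for k, pieces in enumerate(batch):
        frame = [None] * natoms
        for local in pieces:
            for global_index, xyz in local:
                frame[global_index] = xyz
        per_writer[k % nranks].append((k, frame))
    return per_writer
```

With a batch of 50 or more frames every rank has several whole, already sorted frames to write, so the output file looks exactly like what the current serial code produces.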
The only thing the 2nd approach doesn't address is the memory requirement.
If we want to be able to simulate systems larger than what fits into one
node's memory (may be interesting for Blue Gene - for Cray it is > 200M atoms),
then we need to do parallel IO within one frame. But the 2nd approach is
easily extensible for that. If you sort the atoms anyhow (and do this partly
in parallel) it is easy to send only the atoms within 1/N of a frame to one
node and the next 1/N to the next node. If you are already writing only
every 15 min then the latency problems of small communication (for sorting)
and small chunks (for writing) are not an issue (and they wouldn't really be
that small anyhow if your system is bigger than your memory ;-)).
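To make the 1/N split concrete, a minimal sketch (function names are hypothetical): each node owns a contiguous range of global atom numbers, and any rank can compute which node an atom has to be sent to.

```python
def slice_bounds(node, nnodes, natoms):
    """Half-open range [start, stop) of global atom numbers held by one node."""
    return node * natoms // nnodes, (node + 1) * natoms // nnodes


def owner(atom, nnodes, natoms):
    """Node whose slice contains the given global atom number (inverse of slice_bounds)."""
    return ((atom + 1) * nnodes - 1) // natoms
```

Each node then sorts and writes only its own slice, so no single node ever has to hold a complete frame.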
>> Michael Shirts
>> Assistant Professor
>> Department of Chemical Engineering
>> University of Virginia
>> michael.shirts at virginia.edu
>> gmx-developers mailing list
>> gmx-developers at gromacs.org
>> Please don't post (un)subscribe requests to the list. Use the
>> www interface or send it to gmx-developers-request at gromacs.org.
>> ORNL/UT Center for Molecular Biophysics cmb.ornl.gov
>> 865-241-1537, ORNL PO BOX 2008 MS6309
> David van der Spoel, PhD, Professor of Biology
> Dept. of Cell and Molecular Biology, Uppsala University.
> Husargatan 3, Box 596, 75124 Uppsala, Sweden
> phone: 46 18 471 4205 fax: 46 18 511 755
> spoel at xray.bmc.uu.se spoel at gromacs.org http://folding.bmc.uu.se
ORNL/UT Center for Molecular Biophysics cmb.ornl.gov
865-241-1537, ORNL PO BOX 2008 MS6309