[gmx-developers] Python interface for Gromacs

Thu Sep 9 11:28:43 CEST 2010

----- Original Message -----
From: Berk Hess <hess at cbr.su.se>
Date: Thursday, September 9, 2010 17:57
Subject: Re: [gmx-developers] Python interface for Gromacs
To: Discussion list for GROMACS development <gmx-developers at gromacs.org>

> On 09/09/2010 09:35 AM, Mark Abraham wrote:
> > > People in the Lindahl group are working on parallelizing
> > > analysis tools because they are quickly becoming the bottleneck.
> > > We run simulations of large systems on hundreds of processors,
> > > and due to checkpointing this can be done largely unattended.
> >
> > If so, then an important issue to address is using MPI-2 parallel I/O
> > properly. At the moment, for DD, mdrun collects vectors for I/O on the
> > master and writes them out in serial. Proper use of parallel I/O might
> > be worth the investment in restructuring the output. Maintaining the
> > DD processor-local file view suited for I/O of the local atoms is
> > probably not any more complex than the existing contortions that are
> > gone through to gather global vectors. Likewise, a parallel analysis
> > tool will often wish to be doing its I/O in parallel.
> >
> >
> The main issue here is that the atom order changes every nstlist
> steps. We could write files with all atom indices in there as well, but
> that would double the file size.

No, you wouldn't want to write the indices out. :-)

The current mdrun implementation does an MPI Gather of the blocks of data local to each node to the master process and then unscrambles the atom ordering, before writing in serial. 

Ideally we could do the unscrambling in the parallel I/O operation, but that kind of random access is not supported in MPI I/O.

AFAIK it shouldn't cost any more communication time to do the existing process as an Allgather, followed by parallel I/O of the unscrambled vector. The catch is that we have to allocate a global vector on each node. A two-tier approach might be better: Gather the data to a subset of DD nodes that will later do the parallel I/O, Allgather among them, unscramble, and then write. This has a lower memory footprint.

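Note that on a machine like BlueGene, with a fixed amount of real memory per node (and no virtual memory), the "extra" memory needed on non-master nodes for the above approach doesn't matter. The global-array memory needs of the master node are already dominating GROMACS scalability, and since every node has the same amount of memory, giving the other nodes a global vector too does not shrink the largest system that fits.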

The most scalable solution I can come up with is for each atom to have a designated "I/O home" node (that node is also a normal DD node). When an atom changes its DD home node, the index of its I/O home node is passed along with it, and the I/O home node is notified of the change. That probably minimizes bookkeeping communication. At global I/O time, each I/O home node knows where all its atoms are, and each DD home node knows which I/O node to send its home atoms to. So a straight Alltoall with well-designed MPI receive datatypes does the unscrambling. The result is ready for chunk-style parallel I/O as above. No need for global vectors at all.

> Also I have my doubts about the efficiency of MPI i/o.

Sure. It would depend a lot on the extent to which the MPI I/O implementation matched the hardware attributes. I found it useful on BlueGene for an I/O-heavy trajectory-comparison task.

> Ideally we would want the i/o to happen in the background; I don't know
> if the MPI file i/o can do this.

MPI-2 "split collective data access" (MPI_File_write_all_begin/MPI_File_write_all_end and friends) seems to fit the bill; see http://www.mcs.anl.gov/research/projects/mpi/mpi-standard/mpi-report-2.0/node197.htm#Node197. Whether there are decent implementations would be the catch.

Mark

> With Roland Schulz I have been discussing the possibility of some
> (dedicated) processes collecting and writing the data using some kind
> of tree structure.
> 
> Berk
> > We would probably wish to write our own data representation conversion
> > functions to hook into MPI_File_set_view so that we can read/write our
> > standard XDR formats in parallel. (Unless, of course, the existing
> > "external32" representation can be made to do the job.)
> >
> > Mark 