[gmx-developers] Collective IO

Fri Oct 1 10:28:28 CEST 2010

On Fri, Oct 1, 2010 at 3:35 AM, Mark Abraham <mark.abraham at anu.edu.au> wrote:

>
>
> ----- Original Message -----
> From: Roland Schulz <roland at utk.edu>
> Date: Friday, October 1, 2010 16:58
> Subject: Re: [gmx-developers] Collective IO
> To: Discussion list for GROMACS development <gmx-developers at gromacs.org>
>
> >
> >
> > On Thu, Sep 30, 2010 at 9:19 PM, Mark Abraham <mark.abraham at anu.edu.au> wrote:
>
>> >
>> >
>> > ----- Original Message -----
>> > From: Roland Schulz <roland at utk.edu>
>> > Date: Friday, October 1, 2010 9:04
>> > Subject: Re: [gmx-developers] Collective IO
>> > To: Discussion list for GROMACS development <gmx-developers at gromacs.org
>> >
>> >
>> > >
>> > >
>> > > On Thu, Sep 30, 2010 at 6:21 PM, Szilárd Páll <szilard.pall at cbr.su.se
>> > wrote:
>>
>>> > > Hi Roland,
>>> > >
>>> > > Nice work, I'll definitely take a look at it!
>>> > >
>>> > > Any idea how this improves scaling in general, and at what
>>> > > problem size it starts to really matter? Does it introduce any overhead
>>> > > in smaller simulations, or is it only conditionally turned on?
>>>
>> > >
>> > > At the moment it is always turned on for XTC when compiled with MPI.
>> In serial or with threads nothing changes. At the moment we buffer at
>> most 100 frames. If one uses fewer than 100 PP nodes then we buffer as
>> many frames as there are PP nodes. We also make sure that we don't
>> buffer more than 250MB per node.
>> > >
>> > > The 100 frames and 250MB are both constants which should probably
>> still be tuned.
>> >
>> > Indeed - and the user should be able to tune them, too. They won't want
>> to exceed their available physical memory, since buffering frames to virtual
>> memory (if any) loses any gains from collective I/O.
>>
> > Honestly we hadn't thought much about the 250MB limit. We first wanted
> to get feedback on the approach and the code before doing more benchmarks
> and tuning these parameters. It is very likely that there are no cases which
> benefit from using more than 2MB per MPI process.
> >
> > If we limit the memory usage to 2MB, should we still make it
> configurable? I think adding too many mdrun options gets confusing. Should we
> make the number of buffered frames a hidden mdrun option or
> an environment variable (the default would be that the number is
> auto-tuned)?
>
> Hmmm. 2MB feels like quite a low lower bound.
>
The buffering is done on every node before the data is collected on the IO
nodes. Thus if you have, e.g., 1000 nodes each buffering 2MB and 20 IO nodes,
each IO node gets 100MB.
The memory requirement of an IO node is the same as it currently is for the
master: a whole frame has to fit into memory (an IO node never has more than
one frame). This can of course be a problem, but it is independent of the
current approach.
As soon as someone writes code that allows a single frame to be written in
parallel, this can easily be combined with our buffering approach and would
overcome this memory requirement.
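
To make the arithmetic above concrete, here is a minimal sketch. It assumes
the buffered data is spread evenly over the IO nodes; the function name and
the use of decimal megabytes are illustrative choices, not part of the patch.

```python
MB = 1_000_000  # decimal megabyte, chosen only for this illustration


def bytes_per_io_node(n_nodes, buffer_bytes_per_node, n_io_nodes):
    """Data each IO node receives if all buffered data is spread evenly.

    Hypothetical helper: it only reproduces the arithmetic of the example
    (1000 nodes, 2MB each, 20 IO nodes) and is not code from the patch.
    """
    if n_io_nodes < 1:
        raise ValueError("need at least one IO node")
    return n_nodes * buffer_bytes_per_node / n_io_nodes


per_io = bytes_per_io_node(n_nodes=1000, buffer_bytes_per_node=2 * MB, n_io_nodes=20)
print(per_io / MB)  # 100.0
```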

Thus our approach doesn't improve the current memory requirement. But it
also doesn't increase the memory per process (<2MB) significantly.
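
The limits mentioned earlier in the thread (at most 100 frames, no more
frames than PP nodes, and a per-node memory cap) could be combined roughly
as in the sketch below. Two things here are assumptions rather than a
description of the patch: that the memory cap is enforced by reducing the
number of buffered frames, and that at least one frame is always buffered.
The function and parameter names are made up for the example.

```python
MAX_FRAMES = 100                 # from the thread: at most 100 frames are buffered
MAX_BYTES_PER_NODE = 250 * 10**6  # from the thread: at most 250MB per node


def frames_to_buffer(n_pp_nodes, local_frame_bytes,
                     max_frames=MAX_FRAMES, max_bytes=MAX_BYTES_PER_NODE):
    """Number of frames a node buffers before the collective write.

    local_frame_bytes is the size of this node's share of one frame.
    Assumption: the memory cap is applied by lowering the frame count,
    and one frame is always buffered even if it exceeds the cap.
    """
    if n_pp_nodes < 1 or local_frame_bytes < 1:
        raise ValueError("n_pp_nodes and local_frame_bytes must be positive")
    by_nodes = min(max_frames, n_pp_nodes)       # never more frames than PP nodes
    by_memory = max_bytes // local_frame_bytes   # frames that fit under the cap
    return max(1, min(by_nodes, by_memory))


print(frames_to_buffer(64, 100_000))                         # 64
print(frames_to_buffer(1000, 100_000))                       # 100
print(frames_to_buffer(1000, 100_000, max_bytes=2 * 10**6))  # 20
```

With a 2MB cap and a 100KB local share per frame, the last call gives 20
frames, which lines up with the 20 IO nodes in the example above if each IO
node handles one frame.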

Currently more than one IO node (= 1 MPI task on one core) can be placed on
one physical node. We will improve this and limit it to one IO node per
physical node (thus at most one core per physical node participates in the
collective IO). This makes sure that the total memory per shared memory node
also doesn't increase significantly.
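
One simple way to get at most one IO node per physical node is sketched
below. It assumes each rank's host name is known (in MPI code this could
come from MPI_Get_processor_name) and picks the lowest rank on each host.
That selection rule and the host names are assumptions for illustration;
the planned change only says there will be one IO node per physical node.

```python
def pick_io_ranks(hostnames):
    """Return one rank per physical node: the lowest rank on each host.

    hostnames[i] is the host name reported by rank i. Hypothetical helper,
    not taken from the patch.
    """
    first_rank_on_host = {}
    for rank, host in enumerate(hostnames):
        # setdefault keeps the first (lowest) rank seen for each host
        first_rank_on_host.setdefault(host, rank)
    return sorted(first_rank_on_host.values())


# Six ranks spread over three made-up hosts
hosts = ["node0", "node0", "node1", "node1", "node1", "node2"]
print(pick_io_ranks(hosts))  # [0, 2, 5]
```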

> Collective I/O requires of the order of several MB per process per
> operation to be worthwhile.
>
Yes. The IO nodes have that as long as one frame is not too small. For very
small frames (few atoms), collective IO shouldn't be important. But performance
might still improve because the XTC compression is done in parallel and the
write frequency is reduced.

> OTOH you don't want to buffer excessively, because that loses more when
> hardware crashes occur. You do have the checkpoint interval as another upper
> bound, so that's probably fine. 250MB concerned me, because the BlueGene
> cpus have up to about 1GB per cpu...
>

> I think a hidden option is probably best.
>
OK. We'll do that (unless there are other opinions).

Roland