[gmx-developers] Collective IO
David van der Spoel
spoel at xray.bmc.uu.se
Mon Oct 4 15:08:54 CEST 2010
On 2010-10-01 10.28, Roland Schulz wrote:
>
>
> On Fri, Oct 1, 2010 at 3:35 AM, Mark Abraham <mark.abraham at anu.edu.au> wrote:
>
>
>
> ----- Original Message -----
> From: Roland Schulz <roland at utk.edu>
> Date: Friday, October 1, 2010 16:58
> Subject: Re: [gmx-developers] Collective IO
> To: Discussion list for GROMACS development <gmx-developers at gromacs.org>
>
> >
> >
> > On Thu, Sep 30, 2010 at 9:19 PM, Mark Abraham <mark.abraham at anu.edu.au> wrote:
>
> >
> >
> > ----- Original Message -----
> > From: Roland Schulz <roland at utk.edu>
> > Date: Friday, October 1, 2010 9:04
> > Subject: Re: [gmx-developers] Collective IO
> > To: Discussion list for GROMACS development <gmx-developers at gromacs.org>
> >
> > >
> > >
> > > On Thu, Sep 30, 2010 at 6:21 PM, Szilárd Páll <szilard.pall at cbr.su.se> wrote:
>
> > > Hi Roland,
> > >
> > > Nice work, I'll definitely take a look at it!
> > >
> > > Any idea how this improves scaling in general, and at what
> > > problem size it really starts to matter? Does it introduce an
> > > overhead in smaller simulations, or is it only conditionally
> > > turned on?
>
> > >
> > > At the moment it is always turned on for XTC when compiled with
> > > MPI. In serial or with threads nothing changes. We currently
> > > buffer at most 100 frames; if one uses fewer than 100 PP nodes,
> > > then we buffer as many frames as there are PP nodes. We also
> > > make sure that we don't buffer more than 250MB per node.
> > >
> > > The 100 frames and 250MB are both constants which should
> > > probably still be tuned.
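
As a rough sketch (not the actual patch; the names and constants below
are illustrative), the limits mentioned above (at most 100 frames, at
most one frame per PP node, and at most 250MB per node) could be
combined like this:

    /* Rough sketch, not the actual GROMACS code: combining the
     * "100 frames", "one frame per PP node" and "250MB per node"
     * limits described above.  All names are illustrative. */
    #include <stddef.h>

    #define MAX_BUFFERED_FRAMES 100               /* frame-count cap */
    #define MAX_BUFFER_BYTES    (250UL*1024*1024) /* 250MB per node  */

    static int choose_nframes_to_buffer(int n_pp_nodes, size_t frame_bytes)
    {
        /* buffer at most one frame per PP node, never more than 100 */
        int nframes = (n_pp_nodes < MAX_BUFFERED_FRAMES) ? n_pp_nodes
                                                         : MAX_BUFFERED_FRAMES;

        /* shrink the buffer until it fits under the per-node memory cap */
        while (nframes > 1 && (size_t)nframes*frame_bytes > MAX_BUFFER_BYTES)
        {
            nframes--;
        }
        return nframes;
    }

With 100 or more PP nodes and small frames this simply returns 100; the
memory cap only kicks in for very large per-node frame sizes.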
> >
> > Indeed - and the user should be able to tune them, too. They
> > won't want to exceed their available physical memory, since
> > buffering frames to virtual memory (if any) loses any gains from
> > collective I/O.
>
> > Honestly we hadn't thought much about the 250MB limit. We first
> > wanted to get feedback on the approach and the code before doing
> > more benchmarks and tuning these parameters. It is very likely
> > that there are no cases which benefit from using more than 2MB
> > per MPI process.
> >
> > If we limit the memory usage to 2MB, should we still make it
> > configurable? I think adding too many mdrun options gets
> > confusing. Should we make the number of buffered frames a hidden
> > mdrun option or an environment variable (the default would be
> > that the number is auto-tuned)?
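
A minimal sketch of the environment-variable route discussed above,
assuming a hypothetical variable name GMX_XTC_BUFFER_FRAMES (the real
name and default behaviour would be whatever is eventually agreed on):

    /* Sketch of an environment-variable override.  The variable name
     * GMX_XTC_BUFFER_FRAMES is hypothetical; an unset or invalid value
     * leaves the auto-tuned default in place. */
    #include <stdio.h>
    #include <stdlib.h>

    static int get_nframes_buffered(int auto_tuned_default)
    {
        const char *env = getenv("GMX_XTC_BUFFER_FRAMES"); /* hypothetical */

        if (env != NULL)
        {
            int n = atoi(env);
            if (n > 0)
            {
                return n;               /* explicit user override */
            }
            fprintf(stderr, "Ignoring invalid GMX_XTC_BUFFER_FRAMES='%s'\n",
                    env);
        }
        return auto_tuned_default;      /* fall back to auto-tuning */
    }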
>
> Hmmm. 2MB feels like quite a low lower bound.
>
> The buffering is done on every node before the data is collected to
> the IO nodes. Thus if you have, e.g., 1000 nodes each buffering 2MB
> and 20 IO nodes, each IO node gets 100MB.
> The IO requirement of an IO node is the same as it currently is for
> the master: a whole frame has to fit into memory (an IO node never
> holds more than one frame). This can of course be a problem, but it
> is independent of the current approach.
> As soon as someone writes code that allows writing a single frame in
> parallel, this can easily be combined with our buffering approach and
> would overcome this memory requirement.
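
A rough sketch (not the actual patch) of the gather step described
above, assuming each PP rank holds its buffered, compressed frames in a
byte buffer and that the ranks assigned to one IO node share a
communicator io_comm with the IO rank at rank 0:

    /* Each PP rank has `nbytes` of buffered, compressed frame data in
     * `buf`; the IO rank (rank 0 of io_comm) collects everything in one
     * collective call and then writes it.  All names are illustrative. */
    #include <mpi.h>
    #include <stdlib.h>

    void flush_buffers_to_io_rank(MPI_Comm io_comm, char *buf, int nbytes)
    {
        int   rank, size;
        int  *counts  = NULL, *displs = NULL;
        char *recvbuf = NULL;

        MPI_Comm_rank(io_comm, &rank);
        MPI_Comm_size(io_comm, &size);

        if (rank == 0)
        {
            counts = malloc(size*sizeof(*counts));
            displs = malloc(size*sizeof(*displs));
        }
        /* the IO rank learns how much each rank will send */
        MPI_Gather(&nbytes, 1, MPI_INT, counts, 1, MPI_INT, 0, io_comm);

        if (rank == 0)
        {
            int total = 0;
            for (int i = 0; i < size; i++)
            {
                displs[i] = total;
                total    += counts[i];
            }
            recvbuf = malloc(total);
        }
        /* collect all buffered frames on the IO rank */
        MPI_Gatherv(buf, nbytes, MPI_CHAR,
                    recvbuf, counts, displs, MPI_CHAR, 0, io_comm);

        if (rank == 0)
        {
            /* ... the IO rank now writes recvbuf to the XTC file ... */
            free(recvbuf);
            free(counts);
            free(displs);
        }
    }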
>
> Thus our approach doesn't improve the current memory requirement, but
> it also doesn't increase the memory per process (<2MB) significantly.
>
> Currently more than one IO node (= 1 MPI task on one core) can be
> placed on one physical node. We will improve this and limit it to one
> IO node per physical node (thus at most one core participates in the
> collective IO). This makes sure that the total memory per
> shared-memory node also doesn't increase significantly.
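
One possible way to restrict this to a single IO rank per physical
node is MPI_Comm_split_type from MPI-3 (which postdates this
discussion), so the actual implementation may well look different:

    /* Pick one IO rank per physical node by grouping ranks that share
     * memory; rank 0 of each node communicator acts as the IO rank. */
    #include <mpi.h>

    /* Returns 1 if the calling rank should act as its node's IO rank. */
    int i_am_io_rank(MPI_Comm comm, MPI_Comm *node_comm)
    {
        int node_rank;

        MPI_Comm_split_type(comm, MPI_COMM_TYPE_SHARED, 0,
                            MPI_INFO_NULL, node_comm);
        MPI_Comm_rank(*node_comm, &node_rank);

        return (node_rank == 0);
    }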
>
> Collective I/O requires on the order of several MB per process per
> operation to be worthwhile.
>
> Yes. The IO nodes have that, as long as a frame is not too small. For
> very small frames / few atoms, collective IO shouldn't be important,
> but performance might still improve because the XTC compression is
> done in parallel and the write frequency is reduced.
>
> OTOH you don't want to buffer excessively, because that loses more
> when hardware crashes occur. You do have the checkpoint interval as
> another upper bound, so that's probably fine. 250MB concerned me,
> because the BlueGene cpus have up to about 1GB per cpu...
>
>
> I think a hidden option is probably best.
>
> OK. We'll do that (unless there are other opinions).
>
> Roland
>
Hi Ryan & Roland, great stuff!
I'm giving a gromacs talk in an hour and will mention this immediately!
Have you also tested PME? It would be interesting if the performance
with PME also increases by 13 ns/day for the same setup...
--
David van der Spoel, Ph.D., Professor of Biology
Dept. of Cell & Molec. Biol., Uppsala University.
Box 596, 75124 Uppsala, Sweden. Phone: +46184714205.
spoel at xray.bmc.uu.se http://folding.bmc.uu.se