[gmx-developers] Collective IO

David van der Spoel spoel at xray.bmc.uu.se
Mon Oct 4 15:08:54 CEST 2010


On 2010-10-01 10.28, Roland Schulz wrote:
>
>
> On Fri, Oct 1, 2010 at 3:35 AM, Mark Abraham <mark.abraham at anu.edu.au
> <mailto:mark.abraham at anu.edu.au>> wrote:
>
>
>
>     ----- Original Message -----
>     From: Roland Schulz <roland at utk.edu <mailto:roland at utk.edu>>
>     Date: Friday, October 1, 2010 16:58
>     Subject: Re: [gmx-developers] Collective IO
>     To: Discussion list for GROMACS development
>     <gmx-developers at gromacs.org <mailto:gmx-developers at gromacs.org>>
>
>      >
>      >
>      > On Thu, Sep 30, 2010 at 9:19 PM, Mark Abraham
>     <mark.abraham at anu.edu.au> wrote:
>
>          >
>          >
>          > ----- Original Message -----
>          > From: Roland Schulz <roland at utk.edu>
>          > Date: Friday, October 1, 2010 9:04
>          > Subject: Re: [gmx-developers] Collective IO
>          > To: Discussion list for GROMACS development
>         <gmx-developers at gromacs.org>
>          >
>          > >
>          > >
>          > > On Thu, Sep 30, 2010 at 6:21 PM, Szilárd Páll
>         <szilard.pall at cbr.su.se> wrote:
>
>              > > Hi Roland,
>              > >
>              > > Nice work, I'll definitely take a look at it!
>              > >
>              > > Any idea how this improves scaling in general, and at
>             what problem size it starts to really matter? Does it
>             introduce an overhead in smaller simulations, or is it only
>             conditionally turned on?
>
>          > >
>          > > At the moment it is always turned on for XTC when compiled
>         with MPI. In serial or with threads nothing changes. We buffer
>         at most 100 frames; if one uses fewer than 100 PP nodes, then we
>         buffer as many frames as the number of PP nodes. We also make
>         sure that we don't buffer more than 250MB per node.
>          > >
>          > > The 100 frames and 250MB are both constants which should
>         probably still be tuned.
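
For illustration, a minimal sketch of the frame-count selection described
above; the function and constant names here are assumptions, not the
actual patch:

    #include <stddef.h>

    /* Tunable constants mentioned above (assumed names). */
    #define MAX_BUFFERED_FRAMES 100
    #define MAX_BUFFER_BYTES    (250*1024*1024)

    /* Pick how many XTC frames to buffer per PP node: at most 100 frames,
     * no more than the number of PP nodes, and no more than 250MB. */
    static int choose_nbuffer(int npp_nodes, size_t frame_bytes)
    {
        int nframes = (npp_nodes < MAX_BUFFERED_FRAMES) ?
                      npp_nodes : MAX_BUFFERED_FRAMES;

        while (nframes > 1 && (size_t)nframes*frame_bytes > MAX_BUFFER_BYTES)
        {
            nframes--;
        }
        return nframes;
    }
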
>          >
>          > Indeed - and the user should be able to tune them, too. They
>         won't want to exceed their available physical memory, since
>         buffering frames to virtual memory (if any) loses any gains from
>         collective I/O.
>
>      > Honestly we hadn't thought much about the 250MB limit. We first
>     wanted to get feedback on the approach and the code before doing
>     more benchmarks and tuning these parameters. It is very likely that
>     there are no cases which benefit from using more than 2MB per MPI
>     process.
>      >
>      > In case we limit the memory usage to 2MB, should we still make it
>     configurable? I think adding too many mdrun options gets confusing.
>     Should we make the number of buffered frames a hidden mdrun option
>     or an environment variable (the default would be that the number is
>     auto-tuned)?
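
As a sketch of the environment-variable route, assuming a hypothetical
variable name GMX_COLLECTIVE_IO_NFRAMES (not an existing GROMACS
variable):

    #include <stdlib.h>

    /* Return the user override from the (hypothetical) environment
     * variable, or the auto-tuned value if it is not set. */
    static int get_nbuffer(int auto_tuned)
    {
        const char *env = getenv("GMX_COLLECTIVE_IO_NFRAMES");

        if (env != NULL && atoi(env) > 0)
        {
            return atoi(env);
        }
        return auto_tuned;
    }
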
>
>     Hmmm. 2MB feels like quite a low lower bound.
>
> The buffering is done on every node before the data is collected to the
> IO nodes. Thus if you e.g. have 1000 nodes each buffering 2MB and you
> have 20 IO nodes, each IO node gets 100MB.
> The IO requirement of an IO node is the same as it is currently for the
> master: a whole frame has to fit into memory (an IO node never has more
> than one frame). This can of course be a problem, but it is independent
> of the current approach.
> As soon as someone writes code which allows writing a single frame in
> parallel, this can easily be combined with our buffering approach and
> would remove this memory requirement.
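
To make the data flow concrete, here is a rough sketch (not the actual
patch) of collecting the variable-sized per-rank buffers onto the IO rank
of a sub-communicator with MPI_Gatherv; io_comm, buf and nbytes are
assumed names:

    #include <stdlib.h>
    #include <mpi.h>

    /* Gather each rank's buffered frames onto rank 0 of io_comm,
     * which then writes them to the XTC file. */
    static void gather_to_io_rank(MPI_Comm io_comm, const char *buf, int nbytes)
    {
        int   rank, size, i, total = 0;
        int  *counts = NULL, *displs = NULL;
        char *recvbuf = NULL;

        MPI_Comm_rank(io_comm, &rank);
        MPI_Comm_size(io_comm, &size);

        if (rank == 0)
        {
            counts = malloc(size*sizeof(int));
            displs = malloc(size*sizeof(int));
        }
        /* first exchange the buffer sizes */
        MPI_Gather(&nbytes, 1, MPI_INT, counts, 1, MPI_INT, 0, io_comm);

        if (rank == 0)
        {
            for (i = 0; i < size; i++)
            {
                displs[i] = total;
                total    += counts[i];
            }
            recvbuf = malloc(total);
        }
        /* then gather the buffered frames themselves */
        MPI_Gatherv((void *)buf, nbytes, MPI_BYTE,
                    recvbuf, counts, displs, MPI_BYTE, 0, io_comm);

        if (rank == 0)
        {
            /* ... write recvbuf to the trajectory file here ... */
            free(recvbuf);
            free(displs);
            free(counts);
        }
    }
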
>
> Thus our approach doesn't improve the current memory requirement. But it
> also doesn't increase the memory per process (<2MB) significantly.
>
> Currently more than one IO node (= 1 MPI task on one core) can be
> placed on one physical node. We will improve this and limit it to one IO
> node per physical node (thus at most one core participates in the
> collective IO). This makes sure that the total memory per shared-memory
> node also doesn't increase significantly.
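
For reference, one possible way to select a single IO rank per physical
node with MPI-2 calls; this is only a sketch under assumed names, not the
actual code:

    #include <stdlib.h>
    #include <string.h>
    #include <mpi.h>

    /* Return 1 if this rank is the lowest-numbered rank on its physical
     * node (and should therefore act as the IO rank), 0 otherwise. */
    static int i_am_node_io_rank(MPI_Comm comm)
    {
        char  name[MPI_MAX_PROCESSOR_NAME];
        char *all;
        int   len, rank, size, i, is_io = 1;

        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &size);
        memset(name, 0, sizeof(name));
        MPI_Get_processor_name(name, &len);

        all = malloc((size_t)size*MPI_MAX_PROCESSOR_NAME);
        MPI_Allgather(name, MPI_MAX_PROCESSOR_NAME, MPI_CHAR,
                      all,  MPI_MAX_PROCESSOR_NAME, MPI_CHAR, comm);

        /* the lowest rank sharing our hostname becomes the IO rank */
        for (i = 0; i < rank; i++)
        {
            if (strcmp(all + (size_t)i*MPI_MAX_PROCESSOR_NAME, name) == 0)
            {
                is_io = 0;
                break;
            }
        }
        free(all);
        return is_io;
    }
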
>
>     Collective I/O requires of the order of several MB per process per
>     operation to be worthwhile.
>
> Yes. The IO nodes have that as long as a frame is not too small. For
> very small frames/few atoms collective IO shouldn't be important. But
> performance might still improve, because the XTC compression is done in
> parallel and the write frequency is reduced.
>
>     OTOH you don't want to buffer excessively, because that loses more
>     when hardware crashes occur. You do have the checkpoint interval as
>     another upper bound, so that's probably fine. 250MB concerned me,
>     because the BlueGene cpus have up to about 1GB per cpu...
>
>
>     I think a hidden option is probably best.
>
> OK. We'll do that (unless there'll be other opinions).
>
> Roland
>
Hi Ryan & Roland, great stuff!

I'm giving a GROMACS talk in an hour and will mention this immediately! 
Have you also tested PME? It would be interesting to see whether the 
performance with PME also increases by 13 ns/day for the same setup...


-- 
David van der Spoel, Ph.D., Professor of Biology
Dept. of Cell & Molec. Biol., Uppsala University.
Box 596, 75124 Uppsala, Sweden. Phone:	+46184714205.
spoel at xray.bmc.uu.se    http://folding.bmc.uu.se


