[gmx-developers] Collective IO

Mon Oct 4 19:05:15 CEST 2010

On Mon, Oct 4, 2010 at 9:08 AM, David van der Spoel <spoel at xray.bmc.uu.se> wrote:

> On 2010-10-01 10.28, Roland Schulz wrote:
>
>>
>>
>> On Fri, Oct 1, 2010 at 3:35 AM, Mark Abraham <mark.abraham at anu.edu.au>
>> wrote:
>>
>>
>>
>>    ----- Original Message -----
>>    From: Roland Schulz <roland at utk.edu>
>>    Date: Friday, October 1, 2010 16:58
>>    Subject: Re: [gmx-developers] Collective IO
>>    To: Discussion list for GROMACS development
>>    <gmx-developers at gromacs.org>
>>
>>     >
>>     >
>>     > On Thu, Sep 30, 2010 at 9:19 PM, Mark Abraham
>>    <mark.abraham at anu.edu.au> wrote:
>>
>>         >
>>         >
>>         > ----- Original Message -----
>>         > From: Roland Schulz <roland at utk.edu>
>>         > Date: Friday, October 1, 2010 9:04
>>         > Subject: Re: [gmx-developers] Collective IO
>>         > To: Discussion list for GROMACS development
>>        <gmx-developers at gromacs.org>
>>         >
>>         > >
>>         > >
>>         > > On Thu, Sep 30, 2010 at 6:21 PM, Szilárd Páll
>>        <szilard.pall at cbr.su.se> wrote:
>>
>>             > > Hi Roland,
>>             > >
>>             > > Nice work, I'll definitely take a look at it!
>>             > >
>>             > > Any idea of how much this improves scaling in general,
>>            and at what
>>             > > problem size it starts to really matter? Does it introduce
>>            any overhead
>>             > > in smaller simulations, or is it only conditionally
>>            turned on?
>>
>>         > >
>>         > > At the moment it is always turned on for XTC when compiled
>>        with MPI. In serial or with threads nothing changes. At the
>>        moment we buffer at most 100 frames. If one uses fewer than
>>        100 PP nodes, then we buffer as many frames as the number of PP
>>        nodes. We also make sure that we don't buffer more than 250MB
>>        per node.
>>         > >
>>         > > The 100 frames and 250MB are both constants which should
>>        probably still be tuned.
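
The buffering rule described above can be sketched roughly as follows. This is only an illustration of the stated limits (at most 100 frames, no more frames than PP nodes, no more than 250MB per node), not the code from the patch. The function name, the parameter names, and the way the memory cap reduces the frame count are assumptions.

```python
# Constants quoted in the message above; both are expected to be tuned later.
MAX_BUFFERED_FRAMES = 100
MAX_BUFFER_BYTES = 250 * 1024**2  # 250MB per node


def frames_to_buffer(n_pp_nodes, bytes_per_frame_per_node):
    """Return how many frames a node would buffer (hypothetical helper).

    Assumption: the memory cap is enforced by lowering the frame count,
    and at least one frame is always buffered.
    """
    if n_pp_nodes < 1 or bytes_per_frame_per_node < 1:
        raise ValueError("need at least one PP node and a positive frame size")
    # At most 100 frames, and never more frames than PP nodes.
    by_count = min(MAX_BUFFERED_FRAMES, n_pp_nodes)
    # Number of frames that fit under the per-node memory cap.
    by_memory = MAX_BUFFER_BYTES // bytes_per_frame_per_node
    return max(1, min(by_count, by_memory))
```

With 64 PP nodes and small frames this gives 64 buffered frames. With 8000 PP nodes it gives 100, unless the per-node share of a frame is large enough for the memory cap to take over.
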
>>         >
>>         > Indeed - and the user should be able to tune them, too. They
>>        won't want to exceed their available physical memory, since
>>        buffering frames to virtual memory (if any) loses any gains from
>>        collective I/O.
>>
>>     > Honestly we hadn't thought much about the 250MB limit. We first
>>    wanted to get feedback on the approach and the code before doing
>>    more benchmarks and tuning these parameters. It is very likely that
>>    there are no cases which benefit from using more than 2MB per MPI
>>    process.
>>     >
>>     > In case we limit the memory usage to 2MB should we still make it
>>    configurable? I think adding too many mdrun options gets confusing.
>>    Should we make the number of buffered frames a hidden mdrun option
>>    or an environment variable (the default would be that the number is
>>    auto-tuned)?
>>
>>    Hmmm. 2MB feels like quite a low lower bound.
>>
>> The buffering is done on every node before the data is collected to the
>> IO nodes. Thus if you e.g. have 1000 nodes each buffering 2MB and you
>> have 20 IO nodes each IO node gets 100MB.
>> The IO requirement of an IO node is the same as it is currently for the
>> master. A whole frame has to fit into memory (an IO node never has more
>> than one frame). This can of course be a problem but it is independent
>> of the current approach.
>> As soon as someone writes code which allows writing a single frame in
>> parallel, this can easily be combined with our buffering approach and
>> would overcome this memory requirement.
>>
>> Thus our approach doesn't improve the current memory requirement. But it
>> also doesn't increase the memory per process (<2MB) significantly.
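
The arithmetic in the 1000-node example above can be written out as a small sketch. The function name is hypothetical, and it assumes the buffered data is spread evenly over the IO nodes.

```python
def collected_mb_per_io_node(n_nodes, buffer_mb_per_node, n_io_nodes):
    """Data (in MB) each IO node receives when all buffers are collected.

    Assumption: the buffered data is divided evenly among the IO nodes.
    """
    if n_io_nodes < 1:
        raise ValueError("need at least one IO node")
    return n_nodes * buffer_mb_per_node / n_io_nodes


# 1000 nodes each buffering 2MB, collected onto 20 IO nodes.
print(collected_mb_per_io_node(1000, 2, 20))  # 100.0
```
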
>>
>> Currently more than one IO node (= 1 MPI task on one core) can be
>> placed on one physical node. We will improve this and limit it to one IO
>> node per physical node (thus at most one core participates in the
>> collective IO). This also makes sure that the total memory per shared
>> memory node doesn't increase significantly.
>>
>>    Collective I/O requires of the order of several MB per process per
>>    operation to be worthwhile.
>>
>> Yes. The IO nodes have that as long as one frame is not too small. For
>> very small frames/few atoms, collective IO shouldn't be important. But
>> performance might still improve because the XTC compression is done in
>> parallel and the write frequency is reduced.
>>
>>    OTOH you don't want to buffer excessively, because that loses more
>>    when hardware crashes occur. You do have the checkpoint interval as
>>    another upper bound, so that's probably fine. 250MB concerned me,
>>    because the BlueGene cpus have up to about 1GB per cpu...
>>
>>
>>    I think a hidden option is probably best.
>>
>> OK. We'll do that (unless there are other opinions).
>>
>> Roland
>>
> Hi Ryan & Roland, great stuff!
>
> I'm giving a gromacs talk in an hour and will mention this immediately!
> Have you also tested PME? It would be interesting if the performance with
> PME also increases by 13 ns/day for the same setup...
>
PME won't scale to 8000 cores for 3M atoms. But for a problem size for which
PME scales, the speed-up should be the same.
Also, Berk and I will probably have an improved version of PME (with
threads) ready by the end of the week. This should help a lot with PME
scaling at this scale, and thus make the IO change more important even with
PME for more problem sizes.

Roland

>
>
> --
> David van der Spoel, Ph.D., Professor of Biology
> Dept. of Cell & Molec. Biol., Uppsala University.
> Box 596, 75124 Uppsala, Sweden. Phone:  +46184714205.
> spoel at xray.bmc.uu.se    http://folding.bmc.uu.se
>

-- 
ORNL/UT Center for Molecular Biophysics cmb.ornl.gov
865-241-1537, ORNL PO BOX 2008 MS6309