[gmx-developers] Collective IO
roland at utk.edu
Mon Oct 4 19:05:15 CEST 2010
On Mon, Oct 4, 2010 at 9:08 AM, David van der Spoel <spoel at xray.bmc.uu.se> wrote:
> On 2010-10-01 10.28, Roland Schulz wrote:
>> On Fri, Oct 1, 2010 at 3:35 AM, Mark Abraham <mark.abraham at anu.edu.au>
>> wrote:
>> ----- Original Message -----
>> From: Roland Schulz <roland at utk.edu>
>> Date: Friday, October 1, 2010 16:58
>> Subject: Re: [gmx-developers] Collective IO
>> To: Discussion list for GROMACS development
>> <gmx-developers at gromacs.org>
>> > On Thu, Sep 30, 2010 at 9:19 PM, Mark Abraham
>> <mark.abraham at anu.edu.au> wrote:
>> > ----- Original Message -----
>> > From: Roland Schulz <roland at utk.edu>
>> > Date: Friday, October 1, 2010 9:04
>> > Subject: Re: [gmx-developers] Collective IO
>> > To: Discussion list for GROMACS development
>> <gmx-developers at gromacs.org>
>> > >
>> > >
>> > > On Thu, Sep 30, 2010 at 6:21 PM, Szilárd Páll
>> <szilard.pall at cbr.su.se> wrote:
>> > > Hi Roland,
>> > >
>> > > Nice work, I'll definitely take a look at it!
>> > >
>> > > Any idea how this improves scaling in general, and at what
>> > > problem size it starts to really matter? Does it introduce any
>> > > overhead in smaller simulations, or is it only conditionally
>> > > turned on?
>> > >
>> > > At the moment it is always turned on for XTC when compiled
>> with MPI. In serial or with threads nothing changes. At the
>> moment we buffer at most 100 frames. If one uses fewer than
>> 100 PP nodes, then we buffer as many frames as there are PP
>> nodes. We also make sure that we don't buffer more than 250MB
>> per node.
>> > >
>> > > The 100 frames and 250MB are both constants which should
>> probably still be tuned.
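
The buffering rule described above can be sketched as follows. This is only an illustration of the stated limits, not the GROMACS code: the function name, the `local_frame_bytes` parameter, and the way the 250MB cap is turned into a frame count are assumptions, and MB is taken as 1024*1024 bytes.

```python
# Constants quoted in the thread; both were described as still needing tuning.
MAX_BUFFERED_FRAMES = 100
MAX_BUFFER_BYTES = 250 * 1024 ** 2  # 250MB per node (assumed to mean MiB)


def frames_to_buffer(n_pp_nodes, local_frame_bytes):
    """Return how many frames a node would buffer before a collective write.

    n_pp_nodes: number of PP (particle-particle) nodes in the run.
    local_frame_bytes: hypothetical size of this node's share of one frame.
    """
    if n_pp_nodes < 1 or local_frame_bytes < 1:
        raise ValueError("both arguments must be positive")
    # At most 100 frames, and no more frames than there are PP nodes.
    by_count = min(MAX_BUFFERED_FRAMES, n_pp_nodes)
    # Assumed interpretation of the memory cap: whole frames that fit in 250MB.
    by_memory = MAX_BUFFER_BYTES // local_frame_bytes
    # Always buffer at least one frame so output can make progress.
    return max(1, min(by_count, by_memory))
```

With these assumptions, a run on 64 PP nodes buffers 64 frames, and a run on 8000 PP nodes buffers 100 frames unless the per-node memory cap is reached first.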
>> > Indeed - and the user should be able to tune them, too. They
>> won't want to exceed their available physical memory, since
>> buffering frames to virtual memory (if any) loses any gains from
>> collective I/O.
>> > Honestly we hadn't thought much about the 250MB limit. We first
>> wanted to get feedback on the approach and the code before doing
>> more benchmarks and tuning these parameters. It is very likely that
>> there are no cases which benefit from using more than 2MB per MPI
>> > In case we limit the memory usage to 2MB, should we still make it
>> configurable? I think adding too many mdrun options gets confusing.
>> Should we make the number of buffered frames a hidden mdrun option
>> or an environment variable (the default would be that the number is
>> Hmmm. 2MB feels like quite a low lower bound.
>> The buffering is done on every node before the data is collected to the
>> IO nodes. Thus if you e.g. have 1000 nodes each buffering 2MB and you
>> have 20 IO nodes each IO node gets 100MB.
>> The IO requirement of an IO node is the same as it is currently for the
>> master. A whole frame has to fit into memory (an IO node never has more
>> than one frame). This can of course be a problem but it is independent
>> of the current approach.
>> As soon as someone writes code which allows writing a single frame in
>> parallel this can be easily combined with our buffering approach and
>> would overcome this memory requirement.
>> Thus our approach doesn't improve the current memory requirement. But it
>> also doesn't increase the memory per process (<2MB) significantly.
>> Currently more than one IO node (= 1 MPI task on one core) can be
>> placed on one physical node. We will improve this and limit it to one IO
>> node per physical node (thus at most one core participates in the
>> collective IO). This also ensures that the total memory per shared
>> memory node doesn't increase significantly.
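
One way to picture the planned "one IO node per physical node" restriction is the sketch below. It is not taken from the GROMACS source: the function name, the use of a per-rank hostname list, and the choice of the lowest rank on each host are all assumptions made for illustration.

```python
def pick_io_ranks(hostnames, n_io):
    """Choose up to n_io ranks as IO nodes, at most one per physical node.

    hostnames: hostnames[rank] is the physical node that rank runs on
               (hypothetical input; a real run would gather this via MPI).
    n_io: desired number of IO nodes.
    """
    seen_hosts = set()
    candidates = []
    for rank, host in enumerate(hostnames):
        # Keep only the first (lowest) rank found on each physical node.
        if host not in seen_hosts:
            seen_hosts.add(host)
            candidates.append(rank)
    # Fewer than n_io ranks are returned if there are fewer physical nodes.
    return candidates[:n_io]
```

For example, with ranks placed on hosts `["a", "a", "b", "b", "c"]` and two IO nodes requested, ranks 0 and 2 are chosen, so no physical node hosts more than one IO rank.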
>> Collective I/O requires of the order of several MB per process per
>> operation to be worthwhile.
>> Yes. The IO nodes have that as long as one frame is not too small. For
>> very small frames/few atoms Collective IO shouldn't be important. But
>> performance might still improve because the XTC compression is done in
>> parallel and the write frequency is reduced.
>> OTOH you don't want to buffer excessively, because that loses more
>> when hardware crashes occur. You do have the checkpoint interval as
>> another upper bound, so that's probably fine. 250MB concerned me,
>> because the BlueGene cpus have up to about 1GB per cpu...
>> I think a hidden option is probably best.
>> OK. We'll do that (unless there are other opinions).
>> Hi Ryan & Roland, great stuff!
> I'm giving a gromacs talk in an hour and will mention this immediately!
> Have you also tested PME? It would be interesting if the performance with
> PME also increases by 13 ns/day for the same setup...
PME won't scale to 8000 cores for 3M atoms. But for a problem size for which
PME scales, the speed-up should be the same.
Also, we (Berk and I) will probably have an improved version of PME (with
threads) ready by the end of the week. This should help a lot with PME
scaling at this scale, and thus make the IO change more important even with
PME for more problem sizes.
> David van der Spoel, Ph.D., Professor of Biology
> Dept. of Cell & Molec. Biol., Uppsala University.
> Box 596, 75124 Uppsala, Sweden. Phone: +46184714205.
> spoel at xray.bmc.uu.se http://folding.bmc.uu.se
ORNL/UT Center for Molecular Biophysics cmb.ornl.gov
865-241-1537, ORNL PO BOX 2008 MS6309