[gmx-developers] Re: flushing files

hess at sbc.su.se hess at sbc.su.se
Wed Oct 13 10:18:01 CEST 2010


> On Wed, Oct 13, 2010 at 3:35 AM, Erik Lindahl <lindahl at cbr.su.se> wrote:
>
>> Hi,
>> On Oct 13, 2010, at 9:23 AM, Roland Schulz wrote:
>>
>>
>> This is not what we are doing at the moment. At the moment (flush after
>> frame, sync after checkpoint) it is possible that the trajectory is
>> broken. But the check-pointing append feature guarantees that it
>> automatically fixes it. I like the approach of fast writing + automatic
>> fix in the worst case better than having to guarantee that it is always
>> correct from the beginning. Also it would be extremely difficult to
>> guarantee it for all cases (e.g. for the case of a crash during writing
>> of a frame).
>>
>>
>> Yes, but that's a huge difference: Presently you might get broken frames
>> if your simulation crashes. If you are on a file system that never
>> flushes to disk with fflush() you won't get frames on the frontend, but
>> at least they aren't broken.
>>
>
> I think it is also currently possible (but unlikely) that the trajectory
> appears broken: this can happen while a frame is being written (I'm pretty
> sure I have encountered that before). But I see the point that we at least
> want to make it as unlikely as possible (most of the time no frame is
> being written) without affecting the performance.
>
> This might actually not be a problem with MPI-IO because we buffer the
> whole frame in memory and then have one MPI_File_write call for the whole
> frame (or, more precisely, one MPI_File_write_ordered for a couple of
> frames). Thus, because we always write a whole frame in one go, it should
> not be an issue. We'll test to make sure.
> If it is still an issue we can buffer more frames so that calling
> MPI_File_sync after each write does not cause a performance problem.
>
> Independent of my original question and the CollectiveIO work, we might
> want to make sure that we guarantee to fsync every 15 min, even when we
> don't checkpoint or only checkpoint infrequently. This might be a fix we
> want to add to the release branch.
>
> Roland
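
[Illustrative sketch, not GROMACS code.] The idea above of assembling a whole frame in memory and handing it over in one write call can be shown without MPI. The following is a minimal serial C analogue, assuming the hypothetical names frame_buf, frame_buf_append and frame_buf_commit. It uses a single fwrite() where the MPI-IO code discussed would use MPI_File_write_ordered, and it makes no claim about how the real trajectory writer is structured.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical in-memory frame buffer (illustration only). */
typedef struct {
    unsigned char *data; /* bytes of the frame assembled so far */
    size_t         len;  /* number of valid bytes in data */
    size_t         cap;  /* allocated size of data */
} frame_buf;

/* Append n bytes to the frame being assembled. Nothing touches the file yet.
 * Returns 0 on success, -1 if memory could not be allocated. */
static int frame_buf_append(frame_buf *b, const void *src, size_t n)
{
    if (b->len + n > b->cap) {
        size_t cap = b->cap ? b->cap : 64;
        while (cap < b->len + n) {
            cap *= 2;
        }
        unsigned char *p = realloc(b->data, cap);
        if (p == NULL) {
            return -1;
        }
        b->data = p;
        b->cap  = cap;
    }
    memcpy(b->data + b->len, src, n);
    b->len += n;
    return 0;
}

/* Hand the complete frame to the file with one fwrite() and flush it.
 * The buffer is reset for the next frame either way.
 * Returns 0 on success, -1 on a short write or flush error. */
static int frame_buf_commit(frame_buf *b, FILE *fp)
{
    size_t written = fwrite(b->data, 1, b->len, fp);
    int    ok      = (written == b->len) && (fflush(fp) == 0);
    b->len = 0;
    return ok ? 0 : -1;
}
```

With this layout a crash while the frame is still being assembled leaves the file untouched. A crash during the commit itself can still leave a partial frame, since stdio may split a large fwrite() into several system calls, so this only narrows the window rather than closing it.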

Checkpointing every 15 minutes is the default and doesn't cost much.
I don't see a good reason for users to change this to a higher value.
So I don't think we currently need to change things.
Furthermore, I would rather avoid messing with such critical things in a
release.

Berk
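
[Illustrative sketch, not GROMACS code.] For illustration, the policy discussed in this thread (fflush after every frame, plus a sync to disk at least every 15 minutes even when no checkpoint is written) could look roughly like this in C on a POSIX system. traj_writer and write_frame are hypothetical names, and the caller passes in the current time so that the interval logic is easy to exercise. This is a sketch under those assumptions, not the GROMACS implementation.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L /* for fileno() and fsync() */
#endif

#include <stdio.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical writer state (illustration only). */
typedef struct {
    FILE  *fp;            /* open trajectory file */
    time_t last_sync;     /* time of the last successful fsync() */
    double sync_interval; /* seconds between forced syncs, e.g. 15 * 60 */
} traj_writer;

/* Write one frame and flush it to the OS. If at least sync_interval seconds
 * have passed since the last sync, also force the data to disk with fsync().
 * *synced is set to 1 when an fsync() was performed, otherwise 0.
 * Returns 0 on success, -1 on any I/O error. */
static int write_frame(traj_writer *w, const void *frame, size_t nbytes,
                       time_t now, int *synced)
{
    *synced = 0;
    if (fwrite(frame, 1, nbytes, w->fp) != nbytes) {
        return -1;
    }
    if (fflush(w->fp) != 0) { /* cheap: moves data from stdio to the OS */
        return -1;
    }
    if (difftime(now, w->last_sync) >= w->sync_interval) {
        if (fsync(fileno(w->fp)) != 0) { /* expensive: waits for the disk */
            return -1;
        }
        w->last_sync = now;
        *synced      = 1;
    }
    return 0;
}
```

A caller would typically pass time(NULL) as now. Keeping the expensive fsync() on a timer rather than on every frame preserves the fast-write behaviour while bounding how much output can be lost if the machine goes down.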



