[gmx-users] chiller failure leads to truncated .cpt and _prev.cpt files using gromacs 4.6.1

Fri Mar 29 15:56:00 CET 2013

On Fri, Mar 29, 2013 at 12:25 PM, Berk Hess <gmx3 at hotmail.com> wrote:

> Hi,
>
> I don't know enough about the details of Lustre to understand what's going
> on exactly.
> But I think mdrun can't do more than check the return value of fsync and
> believe that the file is completely flushed to disk. Possibly Lustre does
> some syncing, but doesn't actually flush the file physically to disk, which
> could lead to corruption when power goes down unexpectedly.
> But I hope this would happen so infrequently that you can take your losses
> (of up to the queue time, which is, hopefully, around 24 hours).
>

Full loss of checkpoint data might occur if the fsync() has incorrectly
returned 0 at least twice and writing to disk has not actually completed.
But that does not sound like what happened to Chris in the few minutes
before power loss.

When writing a new checkpoint file, GROMACS writes the new checkpoint to a
temporary name, does a copy of the old checkpoint file to _prev.cpt, and
only then renames the temporary checkpoint file to the normal name. This
way there is always a valid checkpoint file on disk, even if there's a
failure during any stage. But if the filesystem is coded/configured to
cheat such that any of those operations are not as atomic as they should be
(and that's a tempting thing to do, to "speed up" the filesystem), and more
than one file operation is actually still pending when a power failure
occurs, there will be a problem. Chris's .cpt timestamps are perhaps
consistent with this scenario. If such a scenario could exist, the best
thing mdrun could do is to re-load the checkpoint files after closing or
renaming, and re-compute the md5sum (like gmxcheck does). However, the
filesystem would probably just give mdrun the copy that is sitting around
in a memory buffer waiting to be flushed, so that is a false security (and
costs execution time for all the correctly-functioning filesystems). If the
return from fsync() was a lie, then there's nothing upon which GROMACS can
rely, I'm afraid.
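
To make the sequence above concrete, here is a minimal Python sketch of the
same write-temporary / keep-previous / rename pattern. It is not the GROMACS
source: the function name write_checkpoint and the "_tmp" suffix are
hypothetical, and the directory fsync at the end is an extra step I am
assuming is useful on POSIX systems, not something mdrun is stated to do.

```python
import os
import shutil


def write_checkpoint(path, data):
    """Write `data` (bytes) to `path`, keeping the old file as *_prev.

    Illustrative only: names and suffixes are assumptions, not GROMACS code.
    """
    base, ext = os.path.splitext(path)
    tmp_path = base + "_tmp" + ext    # hypothetical temporary name
    prev_path = base + "_prev" + ext  # e.g. state.cpt -> state_prev.cpt

    # 1. Write the new checkpoint under a temporary name and ask the OS to
    #    push it to stable storage. os.fsync raises OSError on failure.
    with open(tmp_path, "wb") as tmp:
        tmp.write(data)
        tmp.flush()
        os.fsync(tmp.fileno())

    # 2. Copy the current checkpoint (if any) to _prev and fsync the copy.
    if os.path.exists(path):
        with open(path, "rb") as src, open(prev_path, "wb") as dst:
            shutil.copyfileobj(src, dst)
            dst.flush()
            os.fsync(dst.fileno())

    # 3. Only now rename the temporary file over the normal name.
    #    os.replace is atomic on POSIX when both names are on one filesystem.
    os.replace(tmp_path, path)

    # 4. Assumption: also fsync the directory so the rename itself is durable.
    if os.name == "posix":
        dir_fd = os.open(os.path.dirname(os.path.abspath(path)), os.O_RDONLY)
        try:
            os.fsync(dir_fd)
        finally:
            os.close(dir_fd)
```

At every point in this sequence at least one complete checkpoint exists
under some name, but only as long as each fsync() that returned success
really did reach the disk. If it did not, no ordering of these steps helps.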

Mark

> I assume your problem is that you don't even have the checkpoint file of
> the previous simulation part left. Another option would then be using mdrun
> -noappend
>
> Cheers,
>
> Berk
>
> ----------------------------------------
> > From: chris.neale at mail.utoronto.ca
> > To: gmx-users at gromacs.org
> > Date: Fri, 29 Mar 2013 01:15:06 +0000
> > Subject: [gmx-users] chiller failure leads to truncated .cpt and _prev.cpt files using gromacs 4.6.1
> >
> > Thank you, Berk, Justin, and Matthew, for your assistance.
> >
> > I checked with my sysadmin, who said:
> >
> > The /global/scratch FS is Lustre. It is fully POSIX and the fsync etc
> > are fully and well implemented. However when the 'power off' command is
> > issued there is no way OS can finish I/O in a controlled way.
> >
> > Note that the power off command was given when they realized that they had
> > lost all cooling in the data room, and they had just a few minutes to
> > react, forcing them to shut down all compute nodes.
> >
> > Justin's suggestion to use -cpnum is good, although I think it will be
> > easier to simply have a script that runs gmxcheck once every 12 hours and
> > backs up the .cpt file if it is ok.
> >
> > I don't know enough about computer OS's to say if there is any possible way
> > for gromacs to avoid this in the future, but if it was possible, then it
> > would be useful.
> >
> > Thank you again,
> > Chris.
> >
> > -- original message --
> >
> > Gromacs calls fsync for every checkpoint file written:
> >
> > fsync() transfers ("flushes") all modified in-core data of (i.e., modified
> > buffer cache pages for) the file referred to by the file descriptor fd to
> > the disk device (or other permanent storage device) so that all changed
> > information can be retrieved even after the system crashed or was rebooted.
> > This includes writing through or flushing a disk cache if present. The call
> > blocks until the device reports that the transfer has completed. It also
> > flushes metadata information associated with the file (see stat(2)).
> >
> > If fsync fails, mdrun exits with a fatal error.
> > We have experience with unreliable AFS file systems, where the fsync in
> > mdrun could wait for hours and fail, for which we added an environment
> > variable.
> > So either fsync is not supported on your system (highly unlikely) or your
> > file system returns 0, indicating the file was synced, but it actually
> > didn't fully sync.
> >
> > Note that we first write a new checkpoint file with a number, fsync that,
> > then move the current to _prev (thereby losing the old prev) and then the
> > numbered one to the current.
> > So you should never end up with only corrupted files, unless fsync doesn't
> > do what it's supposed to do.
> >
> > Cheers,
> >
> > Berk
> >
> > --
> > gmx-users mailing list gmx-users at gromacs.org
> > http://lists.gromacs.org/mailman/listinfo/gmx-users
> > * Please search the archive at
> > http://www.gromacs.org/Support/Mailing_Lists/Search before posting!
> > * Please don't post (un)subscribe requests to the list. Use the
> > www interface or send it to gmx-users-request at gromacs.org.
> > * Can't post? Read http://www.gromacs.org/Support/Mailing_Lists