[gmx-users] chiller failure leads to truncated .cpt and _prev.cpt files using gromacs 4.6.1

Fri Mar 29 12:25:12 CET 2013

Hi,

I don't know enough about the details of Lustre to understand what's going on exactly.
But I think mdrun can't do more than check the return value of fsync and trust that the file has been completely flushed to disk. Possibly Lustre does some syncing but doesn't actually flush the file physically to disk, which could lead to corruption when the power goes down unexpectedly.
But I hope this would happen so infrequently that you can take your losses (of up to the queue time, which is, hopefully, around 24 hours).
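
To illustrate what "checking the return value of fsync" amounts to, here is a minimal Python sketch. It is not the mdrun code (which is C), and the helper name write_and_sync is made up for this example. In Python, os.fsync raises OSError instead of returning a non-zero value.

```python
import os

def write_and_sync(path, data):
    """Write data to path and ask the OS to flush it to stable storage."""
    with open(path, "wb") as f:
        f.write(data)
        f.flush()              # push Python's user-space buffer to the OS
        os.fsync(f.fileno())   # raises OSError if the OS reports a failure
    # Getting here only means fsync reported success. If the file system
    # reports success without flushing physically, the caller cannot tell.
    return os.path.getsize(path)
```

The point is that a program only sees what fsync reports; it has no independent way to verify that the bytes are physically on disk.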

I assume your problem is that you don't even have the checkpoint file of the previous simulation part left. Another option would then be to use mdrun -noappend, which writes each simulation part to new output files instead of appending to the existing ones.

Cheers,

Berk

----------------------------------------
> From: chris.neale at mail.utoronto.ca
> To: gmx-users at gromacs.org
> Date: Fri, 29 Mar 2013 01:15:06 +0000
> Subject: [gmx-users] chiller failure leads to truncated .cpt and _prev.cpt files using gromacs 4.6.1
>
> Thank you, Berk, Justin, and Matthew, for your assistance.
>
> I checked with my sysadmin, who said:
>
> The /global/scratch FS is Lustre. It is fully POSIX and the fsync etc
> are fully and well implemented. However when the 'power off' command is
> issued there is no way the OS can finish I/O in a controlled way.
>
> Note that the power off command was given when they
> realized that they had lost all cooling in the data room, and they had just a few
> minutes to react, forcing them to shut down all compute nodes.
>
> Justin's suggestion to use -cpnum is good, although I think it will be easier to simply have a script that runs
> gmxcheck once every 12 hours and backs up the .cpt file if it is ok.
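
Such a script could look roughly like the Python sketch below. The function name backup_if_ok and the file names are hypothetical, and the sketch assumes that gmxcheck -f exits with a non-zero status when it cannot read the checkpoint; that assumption should be verified on a deliberately truncated file before relying on it.

```python
import os
import shutil
import subprocess

def backup_if_ok(cpt, backup, check_cmd=("gmxcheck", "-f")):
    """Copy cpt to backup only if the check command exits with status 0."""
    result = subprocess.run(list(check_cmd) + [cpt],
                            stdout=subprocess.DEVNULL,
                            stderr=subprocess.DEVNULL)
    if result.returncode != 0:
        return False           # check failed: leave the old backup untouched
    tmp = backup + ".tmp"
    shutil.copy2(cpt, tmp)     # copy under a temporary name first
    os.replace(tmp, backup)    # then rename over the old backup
    return True
```

Calling backup_if_ok("state.cpt", "state_backup.cpt") from cron or from a sleep loop every 12 hours would give the behaviour described. Copying to a temporary name and renaming means an interrupted copy does not overwrite the last good backup.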
>
> I don't know enough about operating systems to say if there is any possible way for gromacs to avoid this
> in the future, but if it was possible, then it would be useful.
>
> Thank you again,
> Chris.
>
> -- original message --
>
> Gromacs calls fsync for every checkpoint file written:
>
> fsync() transfers ("flushes") all modified in-core data of (i.e., modified
> buffer cache pages for) the file referred to by the file descriptor fd to
> the disk device (or other permanent storage device) so that all changed
> information can be retrieved even after the system crashed or was rebooted.
> This includes writing through or flushing a disk cache if present. The call
> blocks until the device reports that the transfer has completed. It also
> flushes metadata information associated with the file (see stat(2)).
>
> If fsync fails, mdrun exits with a fatal error.
> We have experience with unreliable AFS file systems, where mdrun could wait in fsync for hours and then fail,
> for which we added an environment variable.
> So either fsync is not supported on your system (highly unlikely)
> or your file system returns 0, indicating that the file was synced, when it actually didn't fully sync.
>
> Note that we first write a new checkpoint file with a number in its name, fsync that, then move the current
> one to _prev (thereby losing the old _prev) and then the numbered one to the current.
> So you should never end up with only corrupted files, unless fsync doesn't do what it's supposed to do.
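
The order of operations described above can be sketched in Python as follows. This is an illustration of the sequence only, not the actual mdrun implementation; the function name write_checkpoint, the default file names and the "_step" naming of the numbered file are assumptions made for the example.

```python
import os

def write_checkpoint(data, current="state.cpt", prev="state_prev.cpt", step=0):
    """Rotate checkpoint files in the order: numbered, _prev, current."""
    root, ext = os.path.splitext(current)
    numbered = f"{root}_step{step}{ext}"
    # 1. Write the new checkpoint under a numbered name and sync it.
    with open(numbered, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())
    # 2. Move the current checkpoint to _prev, replacing the old _prev.
    if os.path.exists(current):
        os.replace(current, prev)
    # 3. Move the numbered file into place as the current checkpoint.
    os.replace(numbered, current)
```

At every point in this sequence at least one complete, synced checkpoint exists under some name, which is why only a file system that reports a successful fsync without really flushing can leave all files corrupted.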
>
> Cheers,
>
> Berk