[gmx-users] chiller failure leads to truncated .cpt and _prev.cpt files using gromacs 4.6.1
Christopher Neale
chris.neale at mail.utoronto.ca
Fri Mar 29 02:15:06 CET 2013
Thank you, Berk, Justin, and Matthew, for your assistance.
I checked with my sysadmin, who said:
The /global/scratch FS is Lustre. It is fully POSIX and the fsync etc
are fully and well implemented. However, when the 'power off' command is
issued, there is no way the OS can finish I/O in a controlled way.
Note that the power off command was given when they
realized that they had lost all cooling in the data room, and they had just a few
minutes to react, forcing them to shutdown all compute nodes.
Justin's suggestion to use -cpnum is good, although I think it will be easier to simply have a script that runs
gmxcheck once every 12 hours and backs up the .cpt file if it is ok.
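A minimal sketch of such a backup script, in Python. It assumes that gmxcheck exits with a non-zero status when it cannot read the checkpoint; that is an assumption to verify on your own installation, not documented behaviour. The function names, the backup directory, and the file naming are all hypothetical. The checkpoint is copied first and the copy is checked, so that mdrun replacing state.cpt in the meantime cannot invalidate the result.

```python
import shutil
import subprocess
import time
from pathlib import Path


def gmxcheck_ok(cpt_path):
    """Return True if gmxcheck reads the file without error.

    Assumption: gmxcheck exits with a non-zero status on an unreadable file.
    """
    result = subprocess.run(
        ["gmxcheck", "-f", str(cpt_path)],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def backup_if_ok(cpt_path, backup_dir, is_ok=gmxcheck_ok):
    """Copy the checkpoint into backup_dir and keep the copy only if it passes is_ok.

    Returns the path of the kept backup, or None if nothing was kept.
    """
    cpt_path = Path(cpt_path)
    backup_dir = Path(backup_dir)
    if not cpt_path.is_file():
        return None
    backup_dir.mkdir(parents=True, exist_ok=True)
    stamp = time.strftime("%Y%m%d_%H%M%S")
    # Keep the .cpt extension on the temporary copy so the checker sees a checkpoint file.
    candidate = backup_dir / f"candidate_{stamp}.cpt"
    final = backup_dir / f"{cpt_path.stem}_{stamp}.cpt"
    shutil.copy2(cpt_path, candidate)
    if is_ok(candidate):
        candidate.rename(final)
        return final
    candidate.unlink()  # discard a copy that failed the check
    return None
```

Calling backup_if_ok("state.cpt", "cpt_backups") from a cron job every 12 hours would give the behaviour described above. Old backups are never pruned in this sketch.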
I don't know enough about operating systems to say whether there is any possible way for gromacs to avoid this
in the future, but if it were possible, it would be useful.
Thank you again,
Chris.
-- original message --
Gromacs calls fsync for every checkpoint file written:
fsync() transfers ("flushes") all modified in-core data of (i.e., modified
buffer cache pages for) the file referred to by the file descriptor fd to
the disk device (or other permanent storage device) so that all changed
information can be retrieved even after the system crashed or was rebooted.
This includes writing through or flushing a disk cache if present. The call
blocks until the device reports that the transfer has completed. It also
flushes metadata information associated with the file (see stat(2)).
If fsync fails, mdrun exits with a fatal error.
We have experience with unreliable AFS file systems, where mdrun could wait for hours in fsync and then fail,
for which we added an environment variable.
So either fsync is not supported on your system (highly unlikely)
or your file system returns 0, indicating the file was synched, but it actually didn't fully sync.
Note that we first write a new checkpoint file with a number, fsync that, then move the current
to _prev (thereby losing the old prev) and then the numbered one to the current.
So you should never end up with only corrupted files, unless fsync doesn't do what it's supposed to do.
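The write, sync, and rename order described above can be sketched as follows. This is an illustration in Python, not the GROMACS source (which is C); the function name and the numbered file name pattern are assumptions made for the example.

```python
import os
from pathlib import Path


def write_checkpoint(directory, data, step, name="state"):
    """Write a checkpoint using the order: numbered file, fsync, rotate to _prev, promote."""
    directory = Path(directory)
    current = directory / f"{name}.cpt"
    previous = directory / f"{name}_prev.cpt"
    numbered = directory / f"{name}_step{step}.cpt"  # hypothetical naming

    # 1. Write the new checkpoint under a numbered name and force it to disk.
    with open(numbered, "wb") as fh:
        fh.write(data)
        fh.flush()             # push Python's buffer to the OS
        os.fsync(fh.fileno())  # ask the OS to push it to the device; raises OSError on failure

    # 2. Move the current checkpoint to _prev, replacing the old _prev.
    if current.exists():
        os.replace(current, previous)

    # 3. Promote the numbered file to be the current checkpoint.
    os.replace(numbered, current)
    return current
```

At every point in this sequence at least one fully synced file exists on disk, provided fsync really does what it reports. If the file system returns success without the data having reached the device, both the current and the _prev file can be truncated, as seen here.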
Cheers,
Berk