[gmx-users] chiller failure leads to truncated .cpt and _prev.cpt files using gromacs 4.6.1

Christopher Neale chris.neale at mail.utoronto.ca
Fri Mar 29 02:15:06 CET 2013


Thank you, Berk, Justin, and Matthew, for your assistance.

I checked with my sysadmin, who said:

The /global/scratch FS is Lustre. It is fully POSIX and fsync etc.
are fully and well implemented. However, when the 'power off' command is
issued, there is no way the OS can finish I/O in a controlled way.

Note that the power off command was given when they
realized that they had lost all cooling in the data room. They had just a few
minutes to react, which forced them to shut down all compute nodes.

Justin's suggestion to use -cpnum is good, although I think it will be easier to simply have a script that runs
gmxcheck once every 12 hours and backs up the .cpt file if it is OK.
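
A rough sketch of such a script, in Python, might look like the following. It rests on assumptions that are not confirmed in this thread: that "gmxcheck -f state.cpt" exits with a non-zero status when the checkpoint is truncated (this should be verified against a known-bad file first), and that the file and directory names (state.cpt, cpt_backup) are placeholders for whatever the run actually uses. The function names are hypothetical.

```python
import shutil
import subprocess
from pathlib import Path


def gmxcheck_ok(cpt_path):
    """Return True if gmxcheck exits with status 0 for this file.

    Assumption: gmxcheck exits non-zero on a truncated checkpoint.
    """
    result = subprocess.run(
        ["gmxcheck", "-f", str(cpt_path)],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def backup_if_valid(cpt_path, backup_dir, is_valid=gmxcheck_ok):
    """Copy cpt_path into backup_dir only if is_valid(cpt_path) is true.

    Returns the path of the backup, or None if nothing was copied.
    """
    cpt_path = Path(cpt_path)
    backup_dir = Path(backup_dir)
    if not cpt_path.is_file() or not is_valid(cpt_path):
        return None
    backup_dir.mkdir(parents=True, exist_ok=True)
    # Copy under a temporary name first, so that an interrupted copy
    # never overwrites the last good backup.
    tmp = backup_dir / (cpt_path.name + ".tmp")
    final = backup_dir / cpt_path.name
    shutil.copy2(cpt_path, tmp)
    tmp.replace(final)
    return final
```

Calling backup_if_valid("state.cpt", "cpt_backup") from a cron job every 12 hours would give the behaviour described above. Because the check function is passed in as an argument, the copy logic can be tested without gromacs installed.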

I don't know enough about operating systems to say whether there is any possible way for gromacs to avoid this
in the future, but if it is possible, then it would be useful.

Thank you again,
Chris.

-- original message --

Gromacs calls fsync for every checkpoint file written:

       fsync() transfers ("flushes") all modified in-core data of (i.e., modi-
       fied  buffer cache pages for) the file referred to by the file descrip-
       tor fd to the disk device (or other permanent storage device)  so  that
       all  changed information can be retrieved even after the system crashed
       or was rebooted.  This includes writing  through  or  flushing  a  disk
       cache  if  present.   The call blocks until the device reports that the
       transfer has completed.  It also flushes metadata  information  associ-
       ated with the file (see stat(2)).

If fsync fails, mdrun exits with a fatal error.
We have experience with unreliable AFS file systems, where mdrun could wait in fsync for hours and then fail,
for which we added an environment variable.
So either fsync is not supported on your system (highly unlikely)
or your file system returns 0, indicating that the file was synced, when it actually didn't fully sync.

Note that we first write a new checkpoint file with a number, fsync that, then move the current
one to _prev (thereby losing the old _prev) and then move the numbered one to the current.
So you should never end up with only corrupted files, unless fsync doesn't do what it's supposed to do.
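
As a rough illustration of that ordering (a Python sketch, not the gromacs source code), the sequence could be written as below. The function name, the file names, and the "_stepN" naming of the numbered file are hypothetical.

```python
import os


def write_checkpoint(data, current="state.cpt",
                     previous="state_prev.cpt", step=0):
    """Write data as the new checkpoint, keeping one older copy.

    Order: write numbered file -> fsync -> current becomes previous
    -> numbered file becomes current.
    """
    root, ext = os.path.splitext(current)
    numbered = f"{root}_step{step}{ext}"  # hypothetical naming scheme

    with open(numbered, "wb") as f:
        f.write(data)
        f.flush()             # push Python's buffer to the OS
        os.fsync(f.fileno())  # ask the OS to push it to the storage device

    if os.path.exists(current):
        os.replace(current, previous)  # the old previous copy is lost here
    os.replace(numbered, current)
```

With this ordering, the existing files are only renamed after the new file has been reported as synced, so at any moment at least one complete checkpoint should exist, provided fsync really does what it reports.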

Cheers,

Berk



