[gmx-users] chiller failure leads to truncated .cpt and _prev.cpt files using gromacs 4.6.1
Christopher Neale
chris.neale at mail.utoronto.ca
Sat Mar 30 02:55:40 CET 2013
Thank you, Berk. My problem was indeed that I didn't have any valid .cpt files. The only way that I could proceed was to extract a frame from the .xtc file and run it through grompp again to get a new .tpr. That's fine and things are running again. I just wanted to pass all of this information along.
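For anyone who ends up in the same position, the two-step recovery described above can be sketched roughly as follows. This is only an illustration, not the exact commands that were run: the file names and the dump time are hypothetical placeholders, and the helper only builds the command lines rather than executing them. Note that trjconv asks interactively which group to write, and that an .xtc frame carries coordinates but no velocities, so the restart is not an exact continuation.

```python
def recovery_commands(xtc, old_tpr, mdp, top, time_ps,
                      frame_gro="restart_frame.gro", new_tpr="restart.tpr"):
    """Build argv lists for the two recovery steps; nothing is executed here.

    All file names are hypothetical placeholders chosen for illustration.
    """
    # Step 1: dump the frame closest to time_ps (in ps) from the trajectory.
    extract = ["trjconv", "-f", xtc, "-s", old_tpr,
               "-dump", str(time_ps), "-o", frame_gro]
    # Step 2: build a new run input file from that frame.
    rebuild = ["grompp", "-f", mdp, "-c", frame_gro, "-p", top, "-o", new_tpr]
    return extract, rebuild


if __name__ == "__main__":
    # Print the commands so they can be inspected before being run by hand.
    for cmd in recovery_commands("traj.xtc", "topol.tpr", "md.mdp",
                                 "topol.top", 50000):
        print(" ".join(cmd))
```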
It sounds, from Mark's most recent email on this subject, that there may be nothing that can be done to avoid this on Gromacs' end. I would have thought that gmxcheck would catch it, but if Mark is right that the OS might serve up the version of the .cpt that is in memory rather than what is on disk, then I suppose gmxcheck on the .cpt could pass just prior to the shutdown, even though the file was not yet on disk, and then fail later, after a reboot.
I've scripted the creation of automatic backups of good .cpt files, so I should be OK for the future. I thought that I'd get replies from lots of other people who had experienced this (since I've seen it on multiple clusters, all managed by entirely different staff), but since nobody has responded to say that they have, I guess it is probably not worth any more of your time.
To all those who helped me on this subject, I appreciate all of your assistance.
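A minimal sketch of that kind of backup step is below. It is not the actual script, just one way it could look. It assumes that the check command (by default gmxcheck -f) exits with a non-zero status when the checkpoint is unreadable; the function name, the backup directory and that exit-status behaviour are all assumptions. As discussed above, a check that passes may still have been answered from memory rather than from disk, so this reduces the risk rather than removing it.

```python
import os
import shutil
import subprocess


def backup_if_valid(cpt_path, backup_dir, check_cmd=("gmxcheck", "-f")):
    """Copy cpt_path into backup_dir only if the check command exits with 0.

    Returns the backup path on success, or None if the check failed.
    Assumption: the check command signals a bad file via its exit status.
    """
    result = subprocess.run(list(check_cmd) + [cpt_path],
                            stdout=subprocess.DEVNULL,
                            stderr=subprocess.DEVNULL)
    if result.returncode != 0:
        return None  # leave any previous good backup untouched
    os.makedirs(backup_dir, exist_ok=True)
    dest = os.path.join(backup_dir, os.path.basename(cpt_path))
    tmp = dest + ".tmp"
    shutil.copy2(cpt_path, tmp)  # copy next to the destination first
    os.replace(tmp, dest)        # then swap it into place in one step
    return dest
```

Copying to a temporary name and renaming means an interrupted copy never overwrites the last good backup.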
Chris.
-- original message --
I don't know enough about the details of Lustre to understand what's going on exactly.
But I think mdrun can't do more than check the return value of fsync and believe that the file is completely flushed to disk. Possibly Lustre does some syncing, but doesn't actually flush the file physically to disk, which could lead to corruption when power goes down unexpectedly.
But I hope this would happen so infrequently that you can take your losses (of up to the queue time, which is, hopefully, around 24 hours).
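To illustrate the point about fsync, here is a generic sketch of the usual write, fsync, rename pattern. It is not mdrun's actual code; the function name is made up for the example and the directory fsync assumes a POSIX system. The program can only act on what the operating system reports: if fsync returns without error but the filesystem has not really put the data on disk, nothing at this level can detect that.

```python
import os


def write_durably(path, data):
    """Write bytes to path via a temporary file, syncing before the rename."""
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:
        f.write(data)
        f.flush()              # push Python's buffer to the OS
        os.fsync(f.fileno())   # raises OSError if the OS reports a failure
    os.replace(tmp, path)      # swap the new file into place
    # Sync the containing directory so the rename itself is recorded (POSIX).
    dir_fd = os.open(os.path.dirname(os.path.abspath(path)), os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```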
I assume your problem is that you don't even have the checkpoint file of the previous simulation part left. Another option would then be using mdrun -noappend.
Cheers,
Berk