[gmx-users] chiller failure leads to truncated .cpt and _prev.cpt files using gromacs 4.6.1
Christopher Neale
chris.neale at mail.utoronto.ca
Wed Mar 27 03:04:44 CET 2013
Dear Users:
A cluster that I use went down today with a chiller failure. I lost all 16 jobs (running gromacs 4.6.1). For 13 of these jobs, not only is the .cpt file truncated, but the _prev.cpt file is truncated as well, which means that I will have to go back through the output files, extract a frame, make a new .tpr file (using a new, custom .mdp file to set the start time correctly), restart the runs, and later join the trajectory fragments.
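For reference, the recovery I have in mind looks roughly like this (only a sketch; the dump time of 391000 ps, the topology name, and the *_continue names are placeholders, and the new .mdp would set tinit to the dumped time with gen_vel = no):

# dump the last intact frame (with velocities) from the .trr; 391000 ps is a placeholder time
trjconv -f md3.trr -s md3.tpr -dump 391000 -o md3_restart.gro

# build a new .tpr from a copy of the production .mdp with tinit = 391000 and gen_vel = no
grompp -f md3_continue.mdp -c md3_restart.gro -p topol.top -o md3_continue.tpr

# run the continuation, then later join the fragments
mdrun -deffnm md3_continue
trjcat -f md3.xtc md3_continue.xtc -o md3_joined.xtc
eneconv -f md3.edr md3_continue.edr -o md3_joined.edr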
I have experienced this a number of times over the years with different versions of gromacs (see, for example, http://redmine.gromacs.org/issues/790 ), and I wonder whether anybody else has seen it.
Also, does anybody have advice on how to handle this? For now, my idea is to run a script in the background that periodically checks the .cpt file and makes a copy whenever it is not corrupted/truncated, so that I always have a valid checkpoint to restart from.
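Something along these lines is what I mean (just a sketch; md3.cpt and the 30-minute interval are only examples, and I am assuming gmxcheck exits with a nonzero status when it hits that fatal error):

#!/bin/bash
# keep a copy of the most recent checkpoint that passes gmxcheck
# (md3.cpt and the 30-minute interval are only examples)
while true; do
    sleep 1800
    cp md3.cpt md3.cpt.candidate                # copy first so mdrun can keep writing md3.cpt
    if gmxcheck -f md3.cpt.candidate > /dev/null 2>&1; then
        mv md3.cpt.candidate md3_knowngood.cpt  # last known-good checkpoint to restart from
    else
        rm -f md3.cpt.candidate                 # candidate is truncated; keep the previous good copy
    fi
done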
In case it is useful information: the corrupted .cpt and _prev.cpt files have the same size and timestamp as each other, but are smaller than non-corrupted .cpt files. E.g.:
$ ls -ltr --full-time *cpt
-rw-r----- 1 cneale cneale 1963536 2013-03-22 17:18:04.000000000 -0700 md2d_prev.cpt
-rw-r----- 1 cneale cneale 1963536 2013-03-22 17:18:04.000000000 -0700 md2d.cpt
-rw-r----- 1 cneale cneale 1048576 2013-03-26 12:46:02.000000000 -0700 md3.cpt
-rw-r----- 1 cneale cneale 1048576 2013-03-26 12:46:03.000000000 -0700 md3_prev.cpt
Above, md2d.cpt is from the last stage of my equilibration and md3.cpt is from my production run.
Here is another example from a different run with corruption:
$ ls -ltr --full-time *cpt
-rw-r----- 1 cneale cneale 2209508 2013-03-21 08:24:33.000000000 -0700 md2d_prev.cpt
-rw-r----- 1 cneale cneale 2209508 2013-03-21 08:24:33.000000000 -0700 md2d.cpt
-rw-r----- 1 cneale cneale 2097152 2013-03-26 12:46:01.000000000 -0700 md3_prev.cpt
-rw-r----- 1 cneale cneale 2097152 2013-03-26 12:46:01.000000000 -0700 md3.cpt
I detect corruption/truncation in the .cpt file like this:
$ gmxcheck -f md3.cpt
Fatal error:
Checkpoint file corrupted/truncated, or maybe you are out of disk space?
For more information and tips for troubleshooting, please check the GROMACS
website at http://www.gromacs.org/Documentation/Errors
Also, I confirmed the problem by trying to run mdrun:
$ mdrun -nt 1 -deffnm md3 -cpi md3.cpt -nsteps 5000000000
Fatal error:
Checkpoint file corrupted/truncated, or maybe you are out of disk space?
For more information and tips for troubleshooting, please check the GROMACS
website at http://www.gromacs.org/Documentation/Errors
(and I get the same thing using md3_prev.cpt)
I am not out of disk space now, although perhaps some condition like that existed when the chiller failed and the system went down:
$ df -h .
Filesystem Size Used Avail Use% Mounted on
342T 57T 281T 17% /global/scratch
Nor am I out of quota (although I have no command to show that here).
There is no corruption of the .edr, .trr, or .xtc files.
The .log files end like this:
Writing checkpoint, step 194008250 at Tue Mar 26 10:49:31 2013
Writing checkpoint, step 194932330 at Tue Mar 26 11:49:31 2013
Step Time Lambda
195757661 391515.32200 0.00000
Writing checkpoint, step 195757661 at Tue Mar 26 12:46:02 2013
I am motivated to help solve this problem, but I have no idea how to stop gromacs from copying a corrupted/truncated checkpoint file over to _prev.cpt. I presume that one could write a magic number at the end of the .cpt file and verify that it is present before moving .cpt to _prev.cpt, but perhaps I misunderstand the problem. If need be, perhaps mdrun could call the same check that gmxcheck uses, since that tool seems to detect the corruption/truncation; if this is done only every 30 minutes, it shouldn't affect performance.
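Expressed as shell, just to illustrate the ordering I am imagining (state_new.cpt, state.cpt, and state_prev.cpt only stand in for whatever mdrun uses internally, and I realize the real logic lives in the C code):

# 1. mdrun writes the new checkpoint to a temporary file (state_new.cpt here)
# 2. the file is verified, e.g. by a trailing magic number or a gmxcheck-style read-back
# 3. only then are the files rotated, so _prev.cpt is never overwritten by a truncated file
gmxcheck -f state_new.cpt > /dev/null 2>&1 \
    && mv state.cpt state_prev.cpt \
    && mv state_new.cpt state.cpt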
Thank you for any advice,
Chris.