[gmx-users] chiller failure leads to truncated .cpt and _prev.cpt files using gromacs 4.6.1

Christopher Neale chris.neale at mail.utoronto.ca
Wed Mar 27 03:04:44 CET 2013


Dear Users:

A cluster that I use went down today with a chiller failure, and I lost all 16 jobs (running gromacs 4.6.1). For 13 of these jobs, both the .cpt file and the _prev.cpt file are truncated, which means I will have to go back through the output, extract a frame, build a new .tpr file (using a new, custom .mdp file to set the start time correctly), restart the runs, and later join the trajectory fragments.
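
For what it is worth, the recovery I have in mind looks roughly like this (the topology name, output names, and the frame time are placeholders, and it assumes the .trr holds both coordinates and velocities at the frame I extract):

$ t=391000   # placeholder: time (ps) of the last intact frame, e.g. as reported by gmxcheck -f md3.trr
$ trjconv -f md3.trr -s md3.tpr -dump $t -o restart.gro
# edit restart.mdp so that tinit matches the time of the dumped frame
# (and init_step, if I want the step counter to line up), then:
$ grompp -f restart.mdp -c restart.gro -p topol.top -o md3_restart.tpr
$ mdrun -deffnm md3_restart
# and later, join the pieces:
$ trjcat -f md3.xtc md3_restart.xtc -o md3_joined.xtc
$ eneconv -f md3.edr md3_restart.edr -o md3_joined.edr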

I have experienced this a number of times over the years with different versions of gromacs (see, for example, http://redmine.gromacs.org/issues/790 ), and I wonder whether anybody else has experienced it too.

Also, does anybody have advice on how to handle this? For now, my idea is to run a background script that periodically checks the .cpt file and makes a copy of it whenever it is not corrupted/truncated, so that I always have a clean checkpoint to restart from.
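
Something along these lines is what I have in mind (a rough sketch, assuming gmxcheck exits with a non-zero status when it hits the fatal error shown below; the file names and the 30-minute interval are arbitrary):

#!/bin/bash
# keep a verified copy of the checkpoint so that a restartable file
# survives even if both .cpt and _prev.cpt end up truncated
cpt=md3.cpt
backup=md3_verified.cpt
while true; do
    # copy first, then test the copy, so that a checkpoint write racing
    # with the copy cannot leave me with a corrupted backup
    cp "$cpt" "$backup.tmp" &&
    gmxcheck -f "$backup.tmp" > /dev/null 2>&1 &&
    mv "$backup.tmp" "$backup"
    sleep 1800   # 30 minutes
done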

In case it is useful information: in each affected run, the .cpt and _prev.cpt files have the same size and timestamp as each other, but are smaller than non-corrupted .cpt files. E.g.:

$ ls -ltr --full-time *cpt
-rw-r----- 1 cneale cneale 1963536 2013-03-22 17:18:04.000000000 -0700 md2d_prev.cpt
-rw-r----- 1 cneale cneale 1963536 2013-03-22 17:18:04.000000000 -0700 md2d.cpt
-rw-r----- 1 cneale cneale 1048576 2013-03-26 12:46:02.000000000 -0700 md3.cpt
-rw-r----- 1 cneale cneale 1048576 2013-03-26 12:46:03.000000000 -0700 md3_prev.cpt

Above, md2d.cpt is from the last stage of my equilibration and md3.cpt is from my production run.

Here is another example from a different run with corruption:

$ ls -ltr --full-time *cpt
-rw-r----- 1 cneale cneale 2209508 2013-03-21 08:24:33.000000000 -0700 md2d_prev.cpt
-rw-r----- 1 cneale cneale 2209508 2013-03-21 08:24:33.000000000 -0700 md2d.cpt
-rw-r----- 1 cneale cneale 2097152 2013-03-26 12:46:01.000000000 -0700 md3_prev.cpt
-rw-r----- 1 cneale cneale 2097152 2013-03-26 12:46:01.000000000 -0700 md3.cpt

I detect corruption/truncation in the .cpt file like this:
$ gmxcheck -f md3.cpt
Fatal error:
Checkpoint file corrupted/truncated, or maybe you are out of disk space?
For more information and tips for troubleshooting, please check the GROMACS
website at http://www.gromacs.org/Documentation/Errors

Also, I confirmed the problem by trying to run mdrun:

$ mdrun -nt 1 -deffnm md3 -cpi md3.cpt -nsteps 5000000000
Fatal error:
Checkpoint file corrupted/truncated, or maybe you are out of disk space?
For more information and tips for troubleshooting, please check the GROMACS
website at http://www.gromacs.org/Documentation/Errors

(and I get the same fatal error using md3_prev.cpt)

I am not out of disk space, although perhaps some condition like that existed when the chiller failed and the system went down:
$ df -h .
Filesystem            Size  Used Avail Use% Mounted on
                      342T   57T  281T  17% /global/scratch

Nor am I out of quota (although I have no command to show that here).

There is no corruption of the .edr, .trr, or .xtc files.

The .log files end like this:

Writing checkpoint, step 194008250 at Tue Mar 26 10:49:31 2013
Writing checkpoint, step 194932330 at Tue Mar 26 11:49:31 2013
           Step           Time         Lambda
      195757661   391515.32200        0.00000
Writing checkpoint, step 195757661 at Tue Mar 26 12:46:02 2013

I am motivated to help solve this problem, but I have no idea how to stop gromacs from copying a corrupted/truncated checkpoint file over _prev.cpt. I presume one could write a magic number at the end of the .cpt file and test that it is present before moving .cpt to _prev.cpt, but perhaps I misunderstand the problem. If need be, perhaps mdrun could call gmxcheck, since that tool seems to detect the corruption/truncation; if it is done only every 30 minutes, it should not affect performance.
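
To make that concrete, the logic I am picturing is simply verify-before-rotate; in shell, only to illustrate the idea (the real change would of course live inside mdrun's checkpoint-writing code, and the .new temporary name here is made up):

# hypothetical sketch: promote a freshly written checkpoint only after it
# passes the same test that gmxcheck applies; otherwise keep the old pair
if gmxcheck -f md3.cpt.new > /dev/null 2>&1; then
    mv md3.cpt md3_prev.cpt
    mv md3.cpt.new md3.cpt
fi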

Thank you for any advice,
Chris.


