[gmx-users] chiller failure leads to truncated .cpt and _prev.cpt files using gromacs 4.6.1

Matthew Zwier mczwier at gmail.com
Wed Mar 27 03:47:22 CET 2013


Dear Chris,

While it's always possible that GROMACS can be improved (or debugged), this
smells more like a system-level problem. The corrupt checkpoint files are
precisely 1 MiB or 2 MiB, which strongly suggests either 1) GROMACS was in
the middle of a buffer flush when it was killed (but the filesystem did
everything right; it was just handed incomplete data), or 2) the filesystem
itself wrote a truncated file (but GROMACS wrote it successfully, the data
was buffered, and GROMACS went on its merry way).

#1 could happen, for example, if GROMACS was killed with SIGKILL while
copying .cpt to _prev.cpt (assuming it copies, rather than renames, its
checkpoint files). #2 could happen in any number of ways, depending on
precisely how your disks, filesystems, and network filesystems are
configured (for example, if a RAID array with per-drive writeback caches
enabled goes down hard, or your NFS share is soft-mounted and either the
client or the server goes down). Given that the sizes of the truncated
checkpoint files are such convenient numbers, my money is on #2.
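
To make the difference concrete, here is the checkpoint rotation expressed
as shell commands. This is purely illustrative (md3.cpt.tmp is a name I
made up, and I have not checked how mdrun actually rotates its checkpoints
internally):

    # rename-based rotation: rename() on the same filesystem is atomic, so
    # a crash leaves whole files behind, never a partially copied one
    mv md3.cpt md3_prev.cpt        # the old checkpoint becomes the backup
    mv md3.cpt.tmp md3.cpt         # the new checkpoint, first written to a temp file

    # copy-based rotation: a SIGKILL in the middle of the copy can leave a
    # partially written _prev.cpt behind
    cp md3.cpt md3_prev.cpt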

Have you contacted your sysadmins to report this? They may be able to take
steps to prevent it from happening again, and (if this is indeed a system
problem) doing so would give all of their users an added measure of safety
for their data.

Cheers,
MZ


On Tue, Mar 26, 2013 at 10:04 PM, Christopher Neale
<chris.neale at mail.utoronto.ca> wrote:

> Dear Users:
>
> A cluster that I use went down today with a chiller failure. I lost all 16
> jobs (running gromacs 4.6.1). For 13 of these jobs, not only the .cpt file
> but also the _prev.cpt file is truncated, meaning that I will have to go
> back through the files, extract a frame, make a new .tpr file (using a new,
> custom .mdp file to get the timestamp right), restart the runs, and then
> later join the trajectory data fragments.
>
> I have experienced this a number of times over the years with different
> versions of gromacs (see, for example,
> http://redmine.gromacs.org/issues/790 ). Has anybody else experienced this?
>
> Also, does anybody have any advice on how to handle this? For now, my
> idea is to run a script in the background to periodically check the .cpt
> file and make a copy if it is not corrupted/truncated, so that I can always
> restart.
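>
> Something like the following, for example (an untested sketch; the file
> name and check interval are placeholders, and it assumes gmxcheck exits
> with a nonzero status when it hits the fatal error shown below):
>
>     #!/bin/bash
>     # keep a verified copy of the live checkpoint around for restarts
>     while true; do
>         if gmxcheck -f md3.cpt >/dev/null 2>&1; then
>             cp md3.cpt md3_verified.cpt   # only refresh the backup if the check passed
>         fi
>         sleep 1800   # check every 30 minutes
>     done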
>
> If it is useful information, both the .cpt and the _prev.cpt files have
> the same size and timestamp, but are smaller than non-corrupted .cpt files.
> E.g.:
>
> $ ls -ltr --full-time *cpt
> -rw-r----- 1 cneale cneale 1963536 2013-03-22 17:18:04.000000000 -0700
> md2d_prev.cpt
> -rw-r----- 1 cneale cneale 1963536 2013-03-22 17:18:04.000000000 -0700
> md2d.cpt
> -rw-r----- 1 cneale cneale 1048576 2013-03-26 12:46:02.000000000 -0700
> md3.cpt
> -rw-r----- 1 cneale cneale 1048576 2013-03-26 12:46:03.000000000 -0700
> md3_prev.cpt
>
> Above, md2d.cpt is from the last stage of my equilibration and md3.cpt is
> from my production run.
>
> Here is another example from a different run with corruption:
>
> $ ls -ltr --full-time *cpt
> -rw-r----- 1 cneale cneale 2209508 2013-03-21 08:24:33.000000000 -0700
> md2d_prev.cpt
> -rw-r----- 1 cneale cneale 2209508 2013-03-21 08:24:33.000000000 -0700
> md2d.cpt
> -rw-r----- 1 cneale cneale 2097152 2013-03-26 12:46:01.000000000 -0700
> md3_prev.cpt
> -rw-r----- 1 cneale cneale 2097152 2013-03-26 12:46:01.000000000 -0700
> md3.cpt
>
> I detect corruption/truncation in the .cpt file like this:
> $ gmxcheck  -f md3.cpt
> Fatal error:
> Checkpoint file corrupted/truncated, or maybe you are out of disk space?
> For more information and tips for troubleshooting, please check the GROMACS
> website at http://www.gromacs.org/Documentation/Errors
>
> Also, I confirmed the problem by trying to run mdrun:
>
> $ mdrun -nt 1 -deffnm md3  -cpi md3.cpt -nsteps 5000000000
> Fatal error:
> Checkpoint file corrupted/truncated, or maybe you are out of disk space?
> For more information and tips for troubleshooting, please check the GROMACS
> website at http://www.gromacs.org/Documentation/Errors
>
> (and I get the same thing using md3_prev.cpt)
>
> I am not out of disk space, but probably some type of condition like that
> existed when the chiller failed and the system went down:
> $ df -h .
> Filesystem            Size  Used Avail Use% Mounted on
>                       342T   57T  281T  17% /global/scratch
>
> Nor am I out of quota (although I have no command to show that here).
>
> There is no corruption of the .edr, .trr, or .xtc files.
>
> The .log files end like this:
>
> Writing checkpoint, step 194008250 at Tue Mar 26 10:49:31 2013
> Writing checkpoint, step 194932330 at Tue Mar 26 11:49:31 2013
>            Step           Time         Lambda
>       195757661   391515.32200        0.00000
> Writing checkpoint, step 195757661 at Tue Mar 26 12:46:02 2013
>
> I am motivated to help solve this problem, but have no idea how to stop
> gromacs from copying corrupted/truncated checkpoint files to _prev.cpt. I
> presume that one could write a magic number to the end of the .cpt file and
> test that it exists before moving .cpt to _prev.cpt, but perhaps I
> misunderstand the problem. If need be, perhaps mdrun could call gmxcheck,
> since that tool seems to detect the corruption/truncation. If the check is
> only done every 30 minutes, it shouldn't affect performance.
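>
> To make the idea concrete, here is the check-before-rotate logic I have in
> mind, written as shell commands (just a sketch; md3.cpt.new is a
> hypothetical temporary name, and I am assuming gmxcheck exits nonzero on
> the corruption error shown above):
>
>     # verify the freshly written checkpoint before it replaces anything
>     if gmxcheck -f md3.cpt.new >/dev/null 2>&1; then
>         mv md3.cpt md3_prev.cpt     # keep the old checkpoint as the backup
>         mv md3.cpt.new md3.cpt      # promote the verified new checkpoint
>     fi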
>
> Thank you for any advice,
> Chris.


