[gmx-users] chiller failure leads to truncated .cpt and _prev.cpt files using gromacs 4.6.1

Justin Lemkul jalemkul at vt.edu
Wed Mar 27 12:51:31 CET 2013



On 3/26/13 11:13 PM, Christopher Neale wrote:
> Dear Matthew:
>
> Thank you for noticing the file size. This is a very good lead.
> I had not noticed that this was special. Indeed, here is the complete listing for truncated/corrupt .cpt files:
>
> -rw-r----- 1 cneale cneale 1048576 Mar 26 18:53 md3.cpt
> -rw-r----- 1 cneale cneale 1048576 Mar 26 18:54 md3.cpt
> -rw-r----- 1 cneale cneale 1048576 Mar 26 18:54 md3.cpt
> -rw-r----- 1 cneale cneale 1048576 Mar 26 18:54 md3.cpt
> -rw-r----- 1 cneale cneale 1048576 Mar 26 18:50 md3.cpt
> -rw-r----- 1 cneale cneale 1048576 Mar 26 18:50 md3.cpt
> -rw-r----- 1 cneale cneale 1048576 Mar 26 18:50 md3.cpt
> -rw-r----- 1 cneale cneale 1048576 Mar 26 18:51 md3.cpt
> -rw-r----- 1 cneale cneale 1048576 Mar 26 18:51 md3.cpt
> -rw-r----- 1 cneale cneale 2097152 Mar 26 18:52 md3.cpt
> -rw-r----- 1 cneale cneale 2097152 Mar 26 18:52 md3.cpt
> -rw-r----- 1 cneale cneale 2097152 Mar 26 18:52 md3.cpt
> -rw-r----- 1 cneale cneale 2097152 Mar 26 18:52 md3.cpt
>
> I will contact my sysadmins and let them know about your suggestions.
>
> Nevertheless, I respectfully reject the idea that there is really nothing that can be done about this inside
> gromacs. About 6 years ago, I worked on a cluster with massive sporadic NFS delays. The only way to
> automate runs on that machine was, for example, to use sed to create a .mdp from a template .mdp file
> that had ;;;EOF as its last line, and then to poll the created .mdp file until ;;;EOF appeared before running
> grompp (at the time I was using mdrun -sort and desorting with an in-house script, prior to domain
> decomposition, so I had to stop/start gromacs every couple of hours). This is not to say that such things are
> ideal, but I think gromacs would be all the better if it were able to avoid problems like this regardless of
> the cluster setup.
>
> Please note that, over the years, I have seen this on 4 different clusters (albeit with different versions of
> gromacs), so it's not just one setup that is to blame.
>
> Matthew, please don't take my comments the wrong way. I deeply appreciate your help. I just want to put it
> out there that I believe that gromacs would be better if it didn't overwrite good .cpt files with truncated/corrupt
> .cpt files ever, even if the cluster catches on fire or the earth's magnetic field reverses, etc.
> Also, I suspect that sysadmins don't have a lot of time to test their clusters for graceful exit upon chiller failure
> conditions, so a super-careful regime of .cpt update will always be useful.
>
> Thank you again for your help, I'll take it to my sysadmins, who are very good and may be able to remedy
> this on their cluster, but who knows what cluster I will be using in 5 years.
>
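
The sentinel-polling workaround described in the quoted message could look
roughly like the sketch below. It is only an illustration of the idea, not the
original in-house script: the function name, the timeout and the polling
interval are all assumptions. It relies on the template .mdp ending with a
;;;EOF line (harmless to grompp, since ";" starts a comment in .mdp files) and
waits until that line is visible in the generated file.

```python
import time
from pathlib import Path

SENTINEL = ";;;EOF"  # marker on the last line of the template .mdp


def wait_for_sentinel(path, timeout=600.0, interval=5.0):
    """Poll `path` until its last non-empty line is the sentinel.

    Returns True once the sentinel is seen, False if `timeout` seconds
    pass first. Names and defaults here are hypothetical.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            # errors="replace" tolerates a partially written file
            lines = Path(path).read_text(errors="replace").splitlines()
        except OSError:
            lines = []  # file not visible yet on this node
        non_empty = [ln.strip() for ln in lines if ln.strip()]
        if non_empty and non_empty[-1] == SENTINEL:
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

A job script would call wait_for_sentinel("md3.mdp") and only run grompp if it
returns True.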

Perhaps this is a case where the -cpnum option would be useful?  That may cause 
a lot of checkpoint files to accumulate, depending on the length of the run, but 
perhaps a scripted cleanup routine to preserve some subset of backups would be 
useful.
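
A cleanup routine along those lines might be sketched as follows. This is a
minimal example under stated assumptions, not a tested tool: it assumes the
numbered checkpoints are named <prefix>_step<N>.cpt (check what your mdrun
actually writes and adjust the pattern), and the function name and the
default of keeping five files are hypothetical.

```python
import re
from pathlib import Path


def prune_checkpoints(directory, prefix="md3", keep=5, dry_run=False):
    """Delete all but the `keep` highest-step numbered checkpoints.

    Assumes file names of the form <prefix>_step<N>.cpt (an assumption).
    Returns the list of paths that were (or, with dry_run, would be) removed.
    """
    if keep < 1:
        raise ValueError("keep must be at least 1")
    pattern = re.compile(re.escape(prefix) + r"_step(\d+)\.cpt")
    found = []
    for p in Path(directory).iterdir():
        m = pattern.fullmatch(p.name)
        if m:
            found.append((int(m.group(1)), p))
    found.sort(key=lambda item: item[0])  # oldest step first
    doomed = [p for _, p in found[:-keep]]
    for p in doomed:
        if not dry_run:
            p.unlink()
    return doomed
```

Keeping several recent files rather than one means that a truncated latest
checkpoint still leaves an older one to restart from. Running first with
dry_run=True shows what would be deleted without touching anything.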

-Justin

-- 
========================================

Justin A. Lemkul, Ph.D.
Research Scientist
Department of Biochemistry
Virginia Tech
Blacksburg, VA
jalemkul[at]vt.edu | (540) 231-9080
http://www.bevanlab.biochem.vt.edu/Pages/Personal/justin

========================================


