[gmx-users] Checkpointing GROMACS jobs

Mon Jan 28 20:09:29 CET 2008

Steven Kirk wrote:
> Hello,
> 
> I have been using GROMACS for some very long (in wall clock terms) 
> simulations, and am curious as to how other users on this list solve the 
> problem of checkpointing long MD runs. It's a problem because of the 
> tendency of computational nodes in large HPC facilities (the more 
> processors, the more prevalent the problem, it seems) to keel over near 
> the end of a very time consuming run. Intermittent disk and scheduler 
> faults can also trigger such conditions.
> 
> Checkpointing at the operating system level is very system-specific, and 
> occasionally compilers can produce executable 'dump' files that continue 
> from where your program left off, but I'm thinking that someone must 
> have automated this process directly using conventionally-compiled 
> GROMACS executables.
> 
> Of course, it is possible to do an exact continuation from a crashed run 
> using .edr and trajectory (.trr) files by generating a new .tpr from the 
> last trajectory frame that had both position and velocity data. This 
> seems to be, by necessity, an entirely interactive process (unless 
> someone out there has a cool auto-restart script ..).
> 
> I am thinking more in terms of 'proactive' checkpointing for long jobs,
> by the following process:
> 
> A script parses the desired .mdp file describing the user's MD run of T 
> timesteps, then asks the user how many sections (N) to split the run 
> into. The script will then auto-generate a shell script containing all 
> the necessary GROMACS commands to:
> 
> * Generate a new .mdp file almost identical to the original, but with 
> the number of timesteps set to T/N.
> 
> * Run N successive mdrun commands, where the output .trr and .edr files 
> from each short run using the modified .mdp file are used, to generate 
> an 'exact restart' .tpr file for the next 'mdrun' command, with the 
> appropriate continuation flag set.
> 
> * Log (to a file) how many of the N partial runs have been completed, in 
> such a way that if the shell script containing the commands is 
> restarted, it will jump to the correct point in the sequence, restarting 
> from the most recently completed partial run.
> 
> Has anyone else already solved this problem, or have a method 
> implementing some of the desirable properties above that I can then 
> extend to do exactly the things described above?
> 
> 
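For what it is worth, the bookkeeping part of the scheme described above
(T/N steps per partial run, plus a log that lets a restarted script skip
the parts already completed) can be sketched in a few lines of plain sh.
This is only an illustration: the function names, the log format and the
per-segment command are all made up here, and the actual mdrun/tpbconv
calls would go inside the command that is passed in.

```shell
#!/bin/sh
# Sketch only: all names and the log format below are hypothetical.

# Steps per partial run: ceil(T / N), so that N parts cover at least T steps.
steps_per_segment() {
    echo $(( ($1 + $2 - 1) / $2 ))
}

# Number of parts already finished = number of lines in the progress log
# (0 if the log does not exist yet).
completed_segments() {
    if [ -f "$1" ]; then
        wc -l < "$1" | tr -d ' '
    else
        echo 0
    fi
}

# run_segments N LOGFILE RUNNER
# Calls "RUNNER i" for every part i that is not yet logged, in order.
# A line is appended to the log only after RUNNER succeeded, so rerunning
# the script after a crash resumes at the first unfinished part.
run_segments() {
    seg_total=$1
    seg_log=$2
    seg_runner=$3
    seg_i=$(completed_segments "$seg_log")
    while [ "$seg_i" -lt "$seg_total" ]; do
        seg_i=$((seg_i + 1))
        "$seg_runner" "$seg_i" || return 1
        echo "segment $seg_i done" >> "$seg_log"
    done
}
```

For example, "run_segments 10 progress.log run_one_segment" would call a
(hypothetical) run_one_segment with the indices 1 to 10 on a fresh start,
and only with the indices not yet logged when restarted after a crash.
Because steps_per_segment rounds up, the last part may overshoot T
slightly when N does not divide T.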
Most queue systems allow you to chain jobs, that is, to let the next one 
start only after the previous one has finished. In PBS this is done like this:

qsub -Wdepend=afterok:prev_jobid

Combine this with a script to start the jobs and you are all set. I 
presume you are aware of tpbconv -extend and tpbconv -until?
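
As a rough sketch of such a submission script: the loop below submits the
same job script N times and makes each job depend on the previous one via
afterok. The function name submit_chain and the QSUB override are made up
for illustration, and the sketch assumes that qsub prints the id of the
new job on standard output, as PBS normally does.

```shell
#!/bin/sh
# Sketch only. QSUB can be overridden so the chain can be dry-run without PBS.
QSUB=${QSUB:-qsub}

# submit_chain N JOBSCRIPT
# Submits JOBSCRIPT N times; every job after the first is held until the
# previous one has finished without error. Prints one job id per line.
#
# JOBSCRIPT (hypothetical) would typically run mdrun and then prepare the
# next run input with something like
#   tpbconv -s run.tpr -f run.trr -e run.edr -extend <ps> -o next.tpr
# (check "tpbconv -h" for the options of your GROMACS version).
submit_chain() {
    chain_n=$1
    chain_script=$2
    chain_prev=""
    chain_i=1
    while [ "$chain_i" -le "$chain_n" ]; do
        if [ -z "$chain_prev" ]; then
            # First job: no dependency.
            chain_prev=$("$QSUB" "$chain_script") || return 1
        else
            # Later jobs: start only after the previous job ended OK.
            chain_prev=$("$QSUB" -W "depend=afterok:$chain_prev" "$chain_script") || return 1
        fi
        echo "$chain_prev"
        chain_i=$((chain_i + 1))
    done
}
```

Usage would be something like "submit_chain 10 segment.sh". With afterok,
a job whose predecessor failed will not run, so a crashed part stops the
chain rather than letting later parts continue from bad input.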

-- 
David.
________________________________________________________________________
David van der Spoel, PhD, Assoc. Prof., Molecular Biophysics group,
Dept. of Cell and Molecular Biology, Uppsala University.
Husargatan 3, Box 596,  	75124 Uppsala, Sweden
phone:	46 18 471 4205		fax: 46 18 511 755
spoel at xray.bmc.uu.se	spoel at gromacs.org   http://folding.bmc.uu.se
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++