[gmx-users] Checkpointing GROMACS jobs

Mon Jan 28 19:27:15 CET 2008

Hello,

I have been using GROMACS for some very long (in wall-clock terms) 
simulations, and am curious how other users on this list solve the 
problem of checkpointing long MD runs. It is a problem because compute 
nodes in large HPC facilities tend to keel over near the end of a very 
time-consuming run (the more processors, the more prevalent the problem, 
it seems). Intermittent disk and scheduler faults can also kill a run 
in the same way.

Checkpointing at the operating system level is very system-specific, and 
occasionally compilers can produce executable 'dump' files that continue 
from where your program left off, but I'm thinking that someone must 
have automated this process directly using conventionally-compiled 
GROMACS executables.

Of course, it is possible to do an exact continuation from a crashed run 
using .edr and trajectory (.trr) files by generating a new .tpr from the 
last trajectory frame that had both position and velocity data. This 
seems to be, by necessity, an entirely interactive process (unless 
someone out there has a cool auto-restart script...).

I am thinking more in terms of 'proactive' checkpointing for long jobs, 
by the following process:

A script parses the desired .mdp file describing the user's MD run of T 
timesteps, then asks the user how many sections (N) to split the run 
into. The script will then auto-generate a shell script containing all 
the necessary GROMACS commands to:

* Generate a new .mdp file almost identical to the original, but with 
the number of timesteps set to T/N.

* Run N successive mdrun commands, where the output .trr and .edr files 
from each short run (made with the modified .mdp file) are used to 
generate an 'exact restart' .tpr file for the next mdrun command, with 
the appropriate continuation flag set.

* Log (to a file) how many of the N partial runs have been completed, in 
such a way that if the shell script containing the commands is 
restarted, it will jump to the correct point in the sequence, restarting 
from the most recently completed partial run.
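
To make the idea concrete, here is a rough Python sketch of the two 
pieces of bookkeeping described above: rewriting nsteps to T/N, and a 
progress log that lets a restarted script skip the partial runs that 
have already finished. It is only an illustration under assumptions: the 
names split_mdp, run_sections and run_section are made up for this 
example, the .mdp file is assumed to contain a plain 
'nsteps = <integer>' line, and N is assumed to divide T exactly. The 
actual GROMACS commands are deliberately left to a user-supplied 
callable, since they depend on the installed version.

```python
import re
from pathlib import Path
from typing import Callable, List, Tuple

# Matches an uncommented 'nsteps = <int>' line; group 3 keeps any trailing comment.
NSTEPS_RE = re.compile(r"^([ \t]*nsteps[ \t]*=[ \t]*)(\d+)(.*)$", re.MULTILINE)


def split_mdp(mdp_text: str, n_sections: int) -> Tuple[str, int]:
    """Return (.mdp text with nsteps set to T/N, steps per section)."""
    match = NSTEPS_RE.search(mdp_text)
    if match is None:
        raise ValueError("no 'nsteps = <integer>' line found")
    total = int(match.group(2))
    if n_sections <= 0 or total % n_sections != 0:
        raise ValueError("N must be positive and divide T exactly")
    per_section = total // n_sections
    new_text = NSTEPS_RE.sub(
        lambda m: f"{m.group(1)}{per_section}{m.group(3)}", mdp_text, count=1
    )
    return new_text, per_section


def run_sections(n_sections: int, progress_file: Path,
                 run_section: Callable[[int], None]) -> List[int]:
    """Run sections 1..N in order, skipping those already logged as done.

    run_section(i) is a hypothetical user-supplied callable that builds the
    continuation .tpr for section i and runs mdrun; it must raise on failure.
    Returns the list of section numbers executed in this invocation.
    """
    done = 0
    if progress_file.exists():
        content = progress_file.read_text().strip()
        done = int(content) if content else 0

    executed = []
    for i in range(done + 1, n_sections + 1):
        run_section(i)  # an exception here leaves the log at the last good section
        # Write to a temporary file and rename it, so that a crash in the
        # middle of the write cannot leave a half-written progress log.
        tmp = progress_file.with_suffix(".tmp")
        tmp.write_text(f"{i}\n")
        tmp.replace(progress_file)
        executed.append(i)
    return executed
```

In use, run_section would typically wrap subprocess calls to whatever 
commands build the restart .tpr from the previous section's .trr and 
.edr files and then launch mdrun. If the job dies during section 3 of 4, 
the log still reads 2, and rerunning the same script resumes at 
section 3.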

Has anyone else already solved this problem, or does anyone have a 
method implementing some of the desirable properties above that I could 
extend to do exactly what I have described?

-- 
Dr. Steven R. Kirk           <steven.kirk at hv.se, S.R.Kirk at physics.org>
Dept. of Technology, Mathematics & Computer Science  (P)+46 520 223215
University West                                      (F)+46 520 223299
P.O. Box 957 Trollhattan 461 29 SWEDEN       http://beacon.webhop.org