[gmx-users] Checkpointing GROMACS jobs

Chris Neale chris.neale at utoronto.ca
Mon Jan 28 22:13:39 CET 2008


Steven Kirk wrote:
> Hello,
>
> I have been using GROMACS for some very long (in wall clock terms)
> simulations, and am curious as to how other users on this list solve the
> problem of checkpointing long MD runs. It's a problem because of the
> tendency of computational nodes in large HPC facilities (the more
> processors, the more prevalent the problem, it seems) to keel over near
> the end of a very time consuming run. Intermittent disk and scheduler
> faults can also trigger such conditions.
>
> Checkpointing at the operating system level is very system-specific, and
> occasionally compilers can produce executable 'dump' files that continue
> from where your program left off, but I'm thinking that someone must
> have automated this process directly using conventionally-compiled
> GROMACS executables.
>
> Of course, it is possible to do an exact continuation from a crashed run
> using .edr and trajectory (.trr) files by generating a new .tpr from the
> last trajectory frame that had both position and velocity data. This
> seems to be, by necessity, an entirely interactive process (unless
> someone out there has a cool auto-restart script ..).
>
> I am thinking more in terms of 'proactive' checkpointing for long jobs,
> by the following process:
>
> A script parses the desired .mdp file describing the user's MD run of T
> timesteps, then asks the user how many sections (N) to split the run
> into. The script will then auto-generate a shell script containing all
> the necessary GROMACS commands to:
>
> * Generate a new .mdp file almost identical to the original, but with
> the number of timesteps set to T/N.
>
> * Run N successive mdrun commands, where the output .trr and .edr files
> from each short run using the modified .mdp file are used, to generate
> an 'exact restart' .tpr file for the next 'mdrun' command, with the
> appropriate continuation flag set.
>
> * Log (to a file) how many of the N partial runs have been completed, in
> such a way that if the shell script containing the commands is
> restarted, it will jump to the correct point in the sequence, restarting
> from the most recently completed partial run.
>
> Has anyone else already solved this problem, or have a method
> implementing some of the desirable properties above that I can then
> extend to do exactly the things described above?

I have just posted some of my scripts to the wiki:

http://wiki.gromacs.org/index.php/Checkpointing_Jobs
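
For readers who want a concrete picture of the first step in the quoted
proposal (a copy of the .mdp file with the number of timesteps set to T/N),
here is a minimal Python sketch. It is not taken from the wiki scripts. The
function name split_nsteps is hypothetical, and the sketch assumes the .mdp
file holds a single uncommented "nsteps = <integer>" line and that T is an
exact multiple of N.

```python
import re

# Matches an uncommented "nsteps = <integer>" line and keeps any trailing comment.
NSTEPS_LINE = re.compile(r"^([ \t]*nsteps[ \t]*=[ \t]*)(\d+)(.*)$", re.MULTILINE)


def split_nsteps(mdp_text, n_sections):
    """Return (new_mdp_text, steps_per_section) with nsteps set to T/N.

    Hypothetical helper: assumes T is a positive exact multiple of N.
    """
    match = NSTEPS_LINE.search(mdp_text)
    if match is None:
        raise ValueError("no nsteps line found in the .mdp text")
    total_steps = int(match.group(2))
    if n_sections <= 0 or total_steps <= 0 or total_steps % n_sections != 0:
        raise ValueError("T must be a positive exact multiple of N")
    steps_per_section = total_steps // n_sections
    # Rewrite only the first nsteps line; everything else is left untouched.
    new_text = NSTEPS_LINE.sub(
        lambda m: f"{m.group(1)}{steps_per_section}{m.group(3)}",
        mdp_text,
        count=1,
    )
    return new_text, steps_per_section
```

Refusing a T that N does not divide evenly is a deliberate simplification; a
fuller script would have to decide where the remaining steps go.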

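The progress-log idea in the quoted message (record how many of the N partial
runs have finished, and resume from there on restart) can be sketched the same
way. Again, this is only an illustration under assumptions, not the wiki
scripts: run_sections and the log format are hypothetical, and run_one stands
in for whatever GROMACS commands prepare and run a single section.

```python
import os


def run_sections(n_sections, run_one, log_path):
    """Run sections 1..N in order, skipping sections already logged as done.

    run_one(index) is a caller-supplied callable (hypothetical here) that
    performs one partial run. Returns the number of sections that were
    already complete when this call started.
    """
    already_done = 0
    if os.path.exists(log_path):
        with open(log_path) as log:
            # One non-empty line is written per completed section.
            already_done = sum(1 for line in log if line.strip())

    for index in range(already_done + 1, n_sections + 1):
        run_one(index)  # if this raises, nothing is logged for this section
        with open(log_path, "a") as log:
            log.write(f"section {index} done\n")
            log.flush()
            os.fsync(log.fileno())  # push the record to disk before moving on
    return already_done
```

Because a section is logged only after run_one returns, rerunning the same
call after a crash repeats the interrupted section rather than skipping it.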