[gmx-users] Checkpointing GROMACS jobs
anoddlad at yahoo.com
Mon Jan 28 21:33:12 CET 2008
If the runs all finish successfully, then incorporating run continuations into your script is simple. I believe the real issue is the tendency of tpbconv to fail unpredictably: should the .edr file be even one frame shorter than the .trr file due to a crash, for instance, tpbconv will not succeed and your script dies. Parsing the relevant error messages to extract the information required (the value for the option -time in this example) is presumably possible and would solve the problem, but it's not a trivial thing to script.
Of course, the timescale of MD runs means that occasional manual intervention isn't too great a chore, but it can be annoying to almost complete a tpbconv on a very long run, only to find that it's missing the last couple of .edr frames due to a failure to flush the buffer...
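As a rough illustration of the mismatch described above, the restart time to hand to tpbconv -time could be chosen as the latest time present in both files. The following is only a sketch: the function name and the tolerance are hypothetical, and it assumes the frame times (in ps) have already been extracted from the .trr and .edr files by some other means, which is the hard part and is not shown here. For the .trr list, pass only frames that contain both positions and velocities.

```python
def last_common_time(trr_times, edr_times, tol=1e-6):
    """Return the latest time found in both lists, or None if there is none.

    trr_times: times of .trr frames holding positions and velocities
    edr_times: times of .edr energy frames
    tol:       assumed tolerance for comparing floating-point times
    """
    for t in sorted(trr_times, reverse=True):
        # Walk back from the newest trajectory frame until an energy
        # frame with a matching time is found.
        if any(abs(t - e) <= tol for e in edr_times):
            return t
    return None


# The .edr file is one frame short, so the restart point moves back one frame.
restart = last_common_time([0.0, 10.0, 20.0, 30.0], [0.0, 10.0, 20.0])
print(restart)  # 20.0
```

If no common frame exists the function returns None, and a script should stop and ask for manual intervention rather than guess.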
----- Original Message ----
From: David van der Spoel <spoel at xray.bmc.uu.se>
To: Discussion list for GROMACS users <gmx-users at gromacs.org>
Sent: Monday, January 28, 2008 7:09:29 PM
Subject: Re: [gmx-users] Checkpointing GROMACS jobs
Steven Kirk wrote:
> I have been using GROMACS for some very long (in wall clock terms)
> simulations, and am curious as to how other users on this list solve the
> problem of checkpointing long MD runs. It's a problem because of the
> tendency of computational nodes in large HPC facilities (the more
> processors, the more prevalent the problem, it seems) to keel over near
> the end of a very time consuming run. Intermittent disk and scheduler
> faults can also trigger such conditions.
> Checkpointing at the operating system level is very system-specific, and
> occasionally compilers can produce executable 'dump' files that continue
> from where your program left off, but I'm thinking that someone must
> have automated this process directly using conventionally-compiled
> GROMACS executables.
> Of course, it is possible to do an exact continuation from a crashed run
> using .edr and trajectory (.trr) files by generating a new .tpr from the
> last trajectory frame that had both position and velocity data. This
> seems to be, by necessity, an entirely interactive process (unless
> someone out there has a cool auto-restart script ..).
> I am thinking more in terms of 'proactive' checkpointing for long jobs,
> by the following process:
> A script parses the desired .mdp file describing the user's MD run of T
> timesteps, then asks the user how many sections (N) to split the run
> into. The script will then auto-generate a shell script containing all
> the necessary GROMACS commands to:
> * Generate a new .mdp file almost identical to the original, but with
> the number of timesteps set to T/N.
> * Run N successive mdrun commands, where the output .trr and .edr files
> from each short run using the modified .mdp file are used, to generate
> an 'exact restart' .tpr file for the next 'mdrun' command, with the
> appropriate continuation flag set.
> * Log (to a file) how many of the N partial runs have been completed, in
> such a way that if the shell script containing the commands is
> restarted, it will jump to the correct point in the sequence, restarting
> from the most recently completed partial run.
> Has anyone else already solved this problem, or have a method
> implementing some of the desirable properties above that I can then
> extend to do exactly the things described above?
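
The bookkeeping part of the scheme proposed above (splitting T steps into N sections, rewriting the .mdp file, and logging completed sections so a restarted script resumes at the right place) might be sketched as follows. This is an illustration under assumptions, not an existing tool: all function names and the log format are hypothetical, the .mdp file is assumed to contain exactly one "nsteps = <integer>" line, and the actual mdrun and tpbconv invocations are left to a caller-supplied function.

```python
import re
from pathlib import Path

NSTEPS_RE = re.compile(r"^(\s*nsteps\s*=\s*)(\d+)", re.MULTILINE)


def read_nsteps(mdp_text):
    """Return the total number of timesteps T from the .mdp text."""
    match = NSTEPS_RE.search(mdp_text)
    if match is None:
        raise ValueError("no 'nsteps' line found in the .mdp text")
    return int(match.group(2))


def set_nsteps(mdp_text, nsteps):
    """Return a copy of the .mdp text with 'nsteps' set to the given value."""
    new_text, count = NSTEPS_RE.subn(lambda m: m.group(1) + str(nsteps), mdp_text)
    if count != 1:
        raise ValueError("expected exactly one 'nsteps' line in the .mdp text")
    return new_text


def segment_lengths(total_steps, n_segments):
    """Split T steps into N sections; any remainder goes to the first sections."""
    base, extra = divmod(total_steps, n_segments)
    return [base + (1 if i < extra else 0) for i in range(n_segments)]


def completed_segments(log_path):
    """Count the sections already recorded in the progress log (one per line)."""
    path = Path(log_path)
    if not path.exists():
        return 0
    return sum(1 for line in path.read_text().splitlines() if line.strip())


def run_all(n_segments, log_path, run_segment):
    """Run the remaining sections in order, skipping those already logged.

    run_segment(i) is supplied by the caller and is expected to raise an
    exception on failure, so that a failed section is never logged as done.
    """
    for i in range(completed_segments(log_path), n_segments):
        run_segment(i)
        with open(log_path, "a") as log:
            log.write(f"section {i} done\n")
```

A section is logged only after run_segment returns, so rerunning the script after a crash repeats the interrupted section rather than skipping it.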
Most queue systems allow you to chain jobs, that is, let the next one
start after the previous one has finished. In PBS this is done with job
dependencies; combine this with a script to start the jobs and you are
all set. I presume you are aware of tpbconv -extend, or tpbconv -until?
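
A minimal sketch of such job chaining, for illustration only. It assumes a PBS/Torque-style qsub that prints the new job id on standard output and accepts "-W depend=afterok:<jobid>"; the exact command used in the original message is not preserved in this archive, and the function names and the script name run_segment.sh are hypothetical.

```python
import subprocess


def qsub(args):
    """Submit a job with qsub and return the job id printed on stdout."""
    result = subprocess.run(
        ["qsub", *args], check=True, capture_output=True, text=True
    )
    return result.stdout.strip()


def submit_chain(script, n_jobs, submit=qsub):
    """Submit n_jobs copies of script, each held until the previous one exits OK.

    'submit' can be replaced by a stand-in function for testing without a
    queue system.
    """
    job_ids = []
    for _ in range(n_jobs):
        args = [script]
        if job_ids:
            # afterok: start only if the previous job finished without error.
            args = ["-W", f"depend=afterok:{job_ids[-1]}", script]
        job_ids.append(submit(args))
    return job_ids


if __name__ == "__main__":
    # Dry run with a stand-in submit function; prints the argument lists
    # that would be passed to qsub.
    def fake_submit(args, _count=[0]):
        _count[0] += 1
        print("qsub", " ".join(args))
        return f"{_count[0]}.server"

    submit_chain("run_segment.sh", 3, submit=fake_submit)
```

With afterok, a crashed job leaves the rest of the chain held instead of running it on incomplete output, so the remaining jobs have to be released or resubmitted by hand once the restart files have been checked.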
David van der Spoel, PhD, Assoc. Prof., Molecular Biophysics group,
Dept. of Cell and Molecular Biology, Uppsala University.
Husargatan 3, Box 596, 75124 Uppsala, Sweden
phone: 46 18 471 4205 fax: 46 18 511 755
spoel at xray.bmc.uu.se spoel at gromacs.org http://folding.bmc.uu.se