[gmx-users] errors on restart
Mark.Abraham at anu.edu.au
Thu May 18 16:28:37 CEST 2006
gianluca santarossa wrote:
> Mark Abraham wrote:
>> You can
>> a) prevent your simulations from crashing,
> I can't. I run simulations on a cluster through a queue, and sometimes
> the jobs are longer than the max time of the queue.
Yes you can. Do a pilot run and look at the last few lines of the
logfile (or better, those of one of the crashed runs). The run should be
of reasonable length so that your setup time is amortized over all of the
timesteps. That will tell you how much simulation time you can do per
unit of wall clock time. Now adjust the number of simulation steps so
that each job finishes within the queue's time limit.
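As a rough sketch of that arithmetic (the figures below are made up, and
max_steps is a hypothetical helper name, not part of GROMACS): if the pilot
run completed a known number of steps in a known wall clock time, the step
count that fits in the queue limit, less a safety margin, is:

```shell
#!/bin/sh
# Hypothetical helper: estimate how many MD steps fit in one queue slot.
# Usage: max_steps PILOT_STEPS PILOT_SECONDS QUEUE_SECONDS MARGIN_PERCENT
max_steps() {
    pilot_steps=$1      # steps completed by the pilot run
    pilot_seconds=$2    # wall clock seconds the pilot run took
    queue_seconds=$3    # maximum wall clock time the queue allows
    margin_percent=$4   # safety margin to leave unused, in percent
    # Wall clock time we allow ourselves after subtracting the margin.
    usable=$(( queue_seconds * (100 - margin_percent) / 100 ))
    # Scale the pilot's step count by the ratio of usable to pilot time.
    echo $(( pilot_steps * usable / pilot_seconds ))
}

# Made-up example: 50000 steps in 30 minutes, 24 hour queue, 10% margin.
max_steps 50000 1800 86400 10   # prints 2160000
```

The result would go into the nsteps setting of the .mdp file before running
grompp. Because the pilot's wall clock time includes setup, the estimate
errs slightly on the safe side.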
>> b) restart from the last frame common to both files, here 8, or
> Ok, I want to do it automatically... Moreover, in this way it would be
> tricky to rebuild the trajectories and the energies.
Indeed, it isn't something you want to do all the time, so see (a) above.
>> c) if the simulations are crashing in response to a signal, use a
>> less vigorous one and gromacs will catch it and exit gracefully,
>> writing output.
> This is how my job works now. If you have better ideas, I would be happy
> to try them...
> I submit to the queue a script running mdrun. The script just traps the
> signals SIGUSR2 (or, possibly, SIGINT) and copies the trajectories and
> the energies back to $SOMEWHERE.
> The script looks like this:
> backup() {
>     cp ener.edr traj.trr "$SOMEWHERE"
>     # other stuff
> }
> trap backup SIGUSR2 SIGINT
> mdrun > mdrun.log
> At the end, if I try to restart the simulation with tpbconv, I sometimes
> find that, as you said, ener.edr was interrupted while writing. How can
> I modify the script to let mdrun exit normally?
> As you can see, I catch SIGUSR2, not SIGKILL...
That's a reasonable start, but the nature of buffered output is such
that you can't guarantee that ener.edr and traj.trr are at the same
point. What you need to do is get GROMACS to exit gracefully, having
flushed its buffers. My PBS setup sends a SIGHUP that GROMACS 3.3.1
catches, and it does an appropriate end-of-last-step flush and a pirouette
to finish :-) I suggest passing the SIGHUP on to mdrun, waiting as long as
you can afford for it to exit, and only then copying the files back. This
will work better on average. It's probably overkill if you implement
solution (a).
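
A minimal sketch of such a wrapper, assuming (as described above) that mdrun
exits cleanly on SIGHUP and that the queue delivers SIGUSR2 or SIGINT to the
script as in the quoted setup. The names run_job and forward_hup are
hypothetical, and the default destination is only a placeholder:

```shell
#!/bin/bash
# Sketch only: run_job and forward_hup are hypothetical helper names.
SOMEWHERE=${SOMEWHERE:-$HOME/backup}   # copy destination (placeholder default)
child=""

forward_hup() {
    # Pass a gentle signal on to the running job instead of copying at once.
    if [ -n "$child" ]; then
        kill -HUP "$child" 2>/dev/null
    fi
}

run_job() {
    trap forward_hup USR2 INT
    # Run the given command in the background so the shell can handle traps.
    "$@" > mdrun.log 2>&1 &
    child=$!
    # wait returns early when a trapped signal arrives, so keep waiting
    # until the job itself has exited and flushed its output.
    while kill -0 "$child" 2>/dev/null; do
        wait "$child"
    done
    trap - USR2 INT
    # Only now are the output files complete, so copy them back.
    cp ener.edr traj.trr "$SOMEWHERE"
}
```

In the queue script this would be called as "run_job mdrun". How long the
queue waits between its warning signal and the final kill is site-specific,
so the time available for mdrun to finish its last step needs checking.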