[gmx-users] errors on restart

Mark Abraham Mark.Abraham at anu.edu.au
Thu May 18 16:28:37 CEST 2006


gianluca santarossa wrote:
> Mark Abraham wrote:
> 
>> You can
>> a) prevent your simulations from crashing,
> 
> I can't. I run simulations on a cluster through a queue, and sometimes 
> the jobs are longer than the max time of the queue.

Yes you can. Do a pilot run and look at the last few lines of the 
logfile, or better, at the logfile of one of the crashed runs. You want a 
run of reasonable length so that the setup time is amortized over all of 
the timesteps. That will tell you how much simulation time you can do per 
unit of wall clock time. Now adjust the number of simulation steps so 
that the job finishes inside the queue's time limit.
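
The arithmetic can be sketched roughly as below. The function name and 
the safety margin are my own illustration, not anything GROMACS 
provides; the pilot numbers are made up, so substitute what your own 
logfile reports.

```shell
# Hypothetical helper: estimate how many MD steps fit in one queue slot.
#   $1 = steps completed in the pilot run
#   $2 = wall-clock seconds the pilot run took
#   $3 = queue wall-time limit in seconds
#   $4 = safety margin in percent (assumed default 10; leaves room for
#        setup and for copying files back)
steps_for_walltime() {
    pilot_steps=$1; pilot_secs=$2; limit_secs=$3; margin_pct=${4:-10}
    # Integer arithmetic only, so any POSIX shell will do.
    echo $(( pilot_steps * limit_secs * (100 - margin_pct) / (pilot_secs * 100) ))
}

# Example with made-up numbers: the pilot did 50000 steps in 1800 s
# and the queue limit is 12 h (43200 s).
steps_for_walltime 50000 1800 43200 10
```

The result is what you would put in nsteps in the .mdp file before 
running grompp for the next job.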

>> b) restart from the last frame common to both files, here 8, or
> 
> Ok, I want to do it automatically... Moreover, in this way it would be 
> tricky to rebuild the trajectories and the energies.

Indeed, it isn't something you want to do all the time, so see the above 
solution :-)

>> c) if the simulations are crashing in response to a signal, use a 
>> less vigorous one and gromacs will catch it and exit gracefully, 
>> writing output.
> 
> This is how my job works now. If you have better ideas, I would be happy 
> to try them...
> I submit to the queue a script running mdrun . The script just traps the 
> signals SIGUSR2 (or, eventually, SIGINT) and copies the trajectories and 
> the energies back to $SOMEWHERE.
> The script looks like this:
> 
> backup()
> {
> cp ener.edr traj.trr $SOMEWHERE
> ...
> other stuff
> ...
> }
> trap backup SIGUSR2 SIGINT
> mdrun > mdrun.log
> 
> At the end, if I try to restart the simulation with tpbconv, I sometimes 
> find that, as you said, ener.edr was interrupted while writing. How can 
> I modify the script to let mdrun exit normally?
> As you can see, I catch SIGUSR2, not SIGKILL...

That's a reasonable start, but the nature of buffered output is such 
that you can't guarantee that ener.edr and traj.trr are at the same 
point. What you need to do is get gromacs to exit gracefully having 
flushed its buffers. My PBS setup sends a SIGHUP, which GROMACS 3.3.1 
catches; it then does an appropriate end-of-last-step flush and a 
pirouette to finish :-) I suggest passing the SIGHUP on to mdrun, 
delaying as long as you can afford, and only then copying the files 
back. This will work better on average. It's probably overkill if you 
implement the first solution.
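
A minimal sketch of that idea follows, assuming mdrun exits cleanly on 
SIGHUP as described above. The function name run_and_forward is 
hypothetical, and $SOMEWHERE is the placeholder from the quoted script. 
Note that a shell only runs a trap after the foreground command has 
returned, so the sketch starts the simulation in the background and 
waits for it.

```shell
# Sketch: run a command, turn the queue's warning signal into SIGHUP for
# that command, and return only once the command has exited by itself.
run_and_forward() {
    "$@" &                      # start the simulation in the background
    child=$!
    # On SIGUSR2 or SIGINT from the queue, pass SIGHUP on to the child.
    trap 'kill -HUP "$child" 2>/dev/null' USR2 INT
    # wait returns early when a trapped signal arrives, so keep waiting
    # until the child has really gone.
    while kill -0 "$child" 2>/dev/null; do
        wait "$child"
    done
    trap - USR2 INT             # restore default handling
}

# Intended use in the job script (not run here):
#   run_and_forward mdrun > mdrun.log
#   cp ener.edr traj.trr "$SOMEWHERE"
```

The copy happens only after mdrun has finished writing, so the two 
files should end at the same frame. If the queue follows its warning 
with a SIGKILL after a fixed grace period, the flush and the copy both 
have to fit inside that period.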

Mark


