[gmx-users] errors on restart

Mark Abraham Mark.Abraham at anu.edu.au
Thu May 18 18:42:43 CEST 2006


gianluca santarossa wrote:

> I think you are right. I guess this is the best solution, after all.
> The drawback is that I need to do a pilot run for each system I need to 
> simulate. And the number of processors I choose, too. O, no! It smells 
> like a benchmark!!!! :P

Well, in the real world, your simulations are likely to be sufficiently 
similar that you can keep a spreadsheet of calculation rates as a 
function of the number of particles and interpolate readily...
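
As a rough illustration of the interpolation idea, here is a minimal 
sketch that reads a plain-text table instead of a spreadsheet. The name 
estimate_rate and the two-column "particles rate" file layout are 
assumptions made for this example, and it assumes the rows are sorted 
by strictly increasing particle count.

```shell
#!/bin/sh
# estimate_rate is a hypothetical helper: it linearly interpolates a
# calculation rate from a table of "particles rate" rows.
# Usage: estimate_rate TABLE_FILE N_PARTICLES
estimate_rate() {
    awk -v n="$2" '
        /^#/ || NF < 2 { next }        # skip comments and blank lines
        have_prev && n >= x0 && n <= $1 {
            # n lies between the previous row (x0, y0) and this row
            printf "%.3f\n", y0 + (n - x0) * ($2 - y0) / ($1 - x0)
            found = 1
            exit
        }
        { x0 = $1; y0 = $2; have_prev = 1 }   # remember the previous row
        END { if (!found) exit 1 }     # n is outside the measured range
    ' "$1"
}
```

Calling estimate_rate rates.txt 15000 (rates.txt being a placeholder 
file name) prints the interpolated rate, or returns a non-zero status 
when the requested size falls outside the pilot runs you have recorded.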

>> That's a reasonable start, but the nature of buffered output is such 
>> that you can't guarantee that ener.edr and traj.trr are at the same 
>> point. What you need to do is get gromacs to exit gracefully having 
>> flushed its buffers. My PBS setup sends a SIGHUP that GROMACS 3.3.1 
>> reads and does an appropriate end-of-last-step flush and a pirouette 
>> to finish :-) I suggest passing the SIGHUP, delaying as long as you 
>> can afford and only then copying the files back. This will work better 
>> on average. It's probably overkill if you implement the first solution.
> 
> I don't know how to do that... Can you help me? (At least, I can learn 
> something new about scripting...)
> If I'm right, trap is executed after its command finishes. So I cannot 
> send a SIGHUP signal from the trap.

I was theorizing that sending a SIGHUP would be possible... you need to 
catch the signal to send your output back, but you also need to send one 
to the child process to get the buffers flushed. Actually, man mdrun 
suggests that GROMACS doesn't listen for SIGHUP at all, so ignore me.

> On the other side, I have no rights on the signals from the queue. From 
> the FAQ of the cluster:
> "To give the application a chance to exit gracefully, LSF first sends a 
> “friendly” signal (SIGUSR2) to
> all processes of a job when its time limit is about to expire. If the 
> job is still running after a short
> grace period, LSF sends increasingly “unfriendly” signals (SIGINT, 
> SIGTERM and SIGKILL). The last
> one effectively kills the job."

man mdrun suggests that unless you were able to copy output back after 
the SIGTERM, you'd struggle.

A plan B would be to intercept the SIGUSR2 with your script and send a 
SIGTERM to the simulation... if that's possible.
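
A minimal sketch of that plan B, not tested under LSF. It assumes that 
the queue's SIGUSR2 actually reaches the job script and that the 
simulation shuts down cleanly on SIGTERM, as the man page suggests. 
The function name run_with_forward is made up for this example.

```shell
#!/bin/sh
# run_with_forward is a hypothetical helper, not part of GROMACS or LSF.
# It runs a command in the background and, if this shell receives
# SIGUSR2, sends SIGTERM to that command.
run_with_forward() {
    forwarded=0
    status=0

    "$@" &                      # start the simulation in the background
    child=$!

    # Assumption: the queue's warning reaches this script as SIGUSR2.
    trap 'forwarded=1; kill -TERM "$child" 2>/dev/null' USR2

    # wait returns early if the trapped signal arrives while it is blocked.
    wait "$child" || status=$?

    # In that case the child is still shutting down, so wait again to
    # collect its real exit status.
    if [ "$forwarded" -eq 1 ]; then
        status=0
        wait "$child" || status=$?
    fi

    trap - USR2                 # restore the default SIGUSR2 handling
    return "$status"
}
```

In a job script this would be called as something like 
run_with_forward mdrun -s topol.tpr (the arguments are placeholders), 
with the copy-back of the output files placed after the call returns. 
Running the simulation in the background matters here: a trap in the 
shell only fires once the foreground command has finished, whereas 
wait can be interrupted by the trapped signal.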

Mark


