[gmx-users] errors on restart

gianluca santarossa gianluca_santarossa at fastmail.fm
Thu May 18 17:07:09 CEST 2006


Mark Abraham wrote:
>> I can't. I run simulations on a cluster through a queue, and 
>> sometimes the jobs are longer than the max time of the queue.
>
> Yes you can. Do a pilot run and look at the last few lines of the 
> logfile - or better one of the crashed runs - you want reasonable 
> length so your setup time is amortized over all of the timesteps. That 
> will tell you how much simulation time you can do per unit wall clock 
> time. Now adjust the number of simulation steps accordingly.
I think you are right. I guess this is the best solution, after all.
The drawback is that I need to do a pilot run for each system I want to 
simulate, and again for each number of processors I choose. Oh no! It 
smells like a benchmark! :P
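
The arithmetic behind the pilot-run suggestion can be sketched as below. This
is only an illustration: the function name, its parameters and the safety
factor are assumptions made for the example, not anything GROMACS provides.
The throughput (steps per wall-clock second) would be read from the end of
the pilot run's logfile.

```python
def steps_for_walltime(pilot_steps, pilot_wall_s, queue_limit_s, safety=0.9):
    """Estimate how many MD steps fit in one queue slot.

    pilot_steps / pilot_wall_s is the throughput measured in a pilot run.
    safety < 1 leaves headroom for setup, I/O and copying files back
    (0.9 is an arbitrary illustrative default).
    """
    if pilot_steps <= 0 or pilot_wall_s <= 0:
        raise ValueError("pilot run must have positive steps and wall time")
    usable_s = queue_limit_s * safety
    # Round down so the estimate never exceeds the usable time.
    return int(pilot_steps * usable_s / pilot_wall_s)
```

The result would then be used as nsteps in the .mdp file. The pilot run has
to be repeated for every system and processor count, since the throughput
changes with both.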

> That's a reasonable start, but the nature of buffered output is such 
> that you can't guarantee that ener.edr and traj.trr are at the same 
> point. What you need to do is get gromacs to exit gracefully having 
> flushed its buffers. My PBS setup sends a SIGHUP that GROMACS 3.3.1 
> reads and does an appropriate end-of-last-step flush and a pirouette 
> to finish :-) I suggest passing the SIGHUP, delaying as long as you 
> can afford and only then copying the files back. This will work better 
> on average. It's probably overkill if you implement the first solution.
I don't know how to do that... Can you help me? (At least, I can learn 
something new about scripting...)
If I'm right, the shell runs a trap only after the command it is currently 
waiting on finishes, so I cannot send a SIGHUP to the running simulation 
from the trap.
On the other hand, I have no control over the signals sent by the queue. 
From the FAQ of the cluster:
"To give the application a chance to exit gracefully, LSF first sends a 
“friendly” signal (SIGUSR2) to
all processes of a job when its time limit is about to expire. If the 
job is still running after a short
grace period, LSF sends increasingly “unfriendly” signals (SIGINT, 
SIGTERM and SIGKILL). The last
one effectively kills the job."
Any ideas?
Thanks,
Gianluca
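
One possible way around the trap limitation is a small wrapper that starts
the simulation as a child process and relays the queue's signal to it as a
different one. The sketch below assumes the incoming signal is SIGUSR2 (as in
the LSF FAQ quoted above) and that the simulation flushes and exits on SIGHUP
(as described for GROMACS 3.3.1 in the quoted reply). The function name is
hypothetical. Note that the FAQ says LSF signals all processes of the job, so
the simulation may also receive SIGUSR2 directly; this sketch does not handle
that case.

```python
import signal
import subprocess


def run_forwarding(cmd, incoming=signal.SIGUSR2, outgoing=signal.SIGHUP):
    """Run cmd, relay `incoming` to it as `outgoing`, return its exit code."""
    child = None

    def relay(signum, frame):
        # Called in the wrapper while the child is still running.
        if child is not None:
            child.send_signal(outgoing)

    # Install the handler before starting the child so that an early
    # signal does not kill the wrapper with the default action.
    previous = signal.signal(incoming, relay)
    try:
        child = subprocess.Popen(cmd)
        # wait() resumes after the handler has run, so the wrapper stays
        # alive until the child has had the chance to exit on its own.
        return child.wait()
    finally:
        signal.signal(incoming, previous)
```

A job script could call something like run_forwarding(["mdrun", "-s",
"topol.tpr"]) (the command line here is just a placeholder) and copy the
output files back only after it returns.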



