[gmx-users] errors on restart

Thu May 18 19:13:21 CEST 2006

On Thursday 18 May 2006 18:42, Mark Abraham wrote:
> gianluca santarossa wrote:
> > I think you are right. I guess this is the best solution, after all.
> > The drawback is that I need to do a pilot run for each system I need to
> > simulate. And the number of processors I choose, too. O, no! It smells
> > like a benchmark!!!! :P
>
> Well in the real world, your simulations are likely to be sufficiently
> similar that you can keep a spreadsheet of calculation rates as a
> function of the number of particles and interpolate readily...
>
> >> That's a reasonable start, but the nature of buffered output is such
> >> that you can't guarantee that ener.edr and traj.trr are at the same
> >> point. What you need to do is get gromacs to exit gracefully having
> >> flushed its buffers. My PBS setup sends a SIGHUP that GROMACS 3.3.1
> >> reads and does an appropriate end-of-last-step flush and a pirouette
> >> to finish :-) I suggest passing the SIGHUP, delaying as long as you
> >> can afford and only then copying the files back. This will work better
> >> on average. It's probably overkill if you implement the first solution.
> >
> > I don't know how to do that... Can you help me? (At least, I can learn
> > something new about scripting...)
> > If I'm right, trap is executed after its command finishes. So I cannot
> > send a SIGHUP signal from the trap.
>
> I was theorizing that sending a SIGHUP would be possible... you need to
> catch the signal to send your output back, but you need to send one to
> the child process to get the buffers flushed. Actually man mdrun
> suggests that gromacs doesn't listen for SIGHUP at all, so ignore me.
>
> > On the other side, I have no rights on the signals from the queue. From
> > the FAQ of the cluster:
> > "To give the application a chance to exit gracefully, LSF first sends a
> > “friendly” signal (SIGUSR2) to
> > all processes of a job when its time limit is about to expire. If the
> > job is still running after a short
> > grace period, LSF sends increasingly “unfriendly” signals (SIGINT,
> > SIGTERM and SIGKILL). The last
> > one effectively kills the job."

I hope this helps. The script below catches SIGTERM with trap (a shell
builtin) and copies the scratch output directory to your home directory
through a tar pipe before the queuing system kills the job. It works with
Torque/Maui here on our cluster:

#!/bin/sh
#
#PBS ....
#

TMPDIR=/scratch/${USER}/${PBS_JOBID}
export TMPDIR

# on SIGTERM (15): copy the scratch directory back home, then exit
trap "sleep 5 ; cd /scratch/${USER} ; tar cf - ${PBS_JOBID} | \
      tar xf - -C /home/${GROUP}/${USER} ; exit" 15

# normal job script

cd $PBS_O_WORKDIR

# make scratch
mkdir -p $TMPDIR

# run
./temp.out
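
On the worry above that trap only runs after the current command finishes:
that applies to a foreground command. A minimal sketch of one way around it,
assuming the job can be started in the background and waited for. SCRATCH,
DEST, output.log and the "sleep 30" job are hypothetical stand-ins (for
/scratch/${USER}/${PBS_JOBID}, /home/${GROUP}/${USER} and ./temp.out), and
the "kill -TERM $$" line only plays the role of the queuing system so the
sketch can be run by itself:

```sh
#!/bin/sh
# Sketch only: SCRATCH and DEST are hypothetical stand-ins for
# /scratch/${USER}/${PBS_JOBID} and /home/${GROUP}/${USER}.
SCRATCH=$(mktemp -d)
DEST=$(mktemp -d)
JOBDIR=$(basename "$SCRATCH")

# Copy the whole scratch directory into DEST through a tar pipe.
save_output() {
    ( cd "$(dirname "$SCRATCH")" && tar cf - "$JOBDIR" ) | tar xf - -C "$DEST"
}

got_term=0
trap 'got_term=1' TERM      # only record the signal; act on it below

# Stand-in for ./temp.out: write some output, then keep running.
( echo "step 1" > "$SCRATCH/output.log"; exec sleep 30 ) &
child=$!

# Demo only: pretend to be the queuing system, send SIGTERM after 1 s.
( sleep 1; kill -TERM $$ ) &

# wait returns as soon as the trapped signal arrives, so the shell does
# not have to sit out the remaining run time of the job.
wait "$child" || true

if [ "$got_term" -eq 1 ]; then
    kill -TERM "$child" 2>/dev/null || true   # stop the job before copying
    wait "$child" 2>/dev/null || true
fi
save_output
echo "saved: $(ls "$DEST/$JOBDIR")"
```

The copy is done in the main flow rather than inside the trap, so the same
save_output call also covers a job that finishes normally.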

>
> man mdrun suggests that unless you were able to copy output back after
> the SIGTERM, you'd struggle.
>
> A plan B would be to intercept the SIGUSR2 with your script and send a
> SIGTERM to the simulation... if that's possible.
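
Plan B could look roughly like the sketch below. It is untested on LSF and
rests on assumptions: "sleep 30" stands in for the real mdrun command line,
forward_term is a name made up here, and the "kill -USR2 $$" line only
simulates the queue. Since the FAQ says LSF sends SIGUSR2 to all processes
of the job, the simulation itself will also receive it, so how it reacts to
SIGUSR2 needs checking on the cluster first.

```sh
#!/bin/sh
# Sketch only: "sleep 30" is a stand-in for the real simulation command.
forwarded=0

# Hypothetical handler: turn the friendly SIGUSR2 into a SIGTERM for the
# simulation process.
forward_term() {
    forwarded=1
    kill -TERM "$sim" 2>/dev/null || true
}
trap forward_term USR2

sleep 30 &          # stand-in for: mdrun ... &
sim=$!

# Demo only: pretend to be LSF, send SIGUSR2 to this script after 1 s.
( sleep 1; kill -USR2 $$ ) &

# The first wait returns early when SIGUSR2 is trapped; the loop waits
# again until the simulation has really exited.
while kill -0 "$sim" 2>/dev/null; do
    wait "$sim" || true
done

# The output files would be copied back at this point.
echo "simulation stopped, forwarded=$forwarded"
```

The grace period before LSF escalates to SIGKILL has to be long enough for
both the shutdown and the copy.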
>
> Mark

Greetings,

Florian

-- 
-------------------------------------------------------------------------------
 Florian Haberl                        
 Computer-Chemie-Centrum   
 Universitaet Erlangen/ Nuernberg
 Naegelsbachstr 25
 D-91052 Erlangen
 Mailto: florian.haberl AT chemie.uni-erlangen.de
-------------------------------------------------------------------------------