[gmx-developers] Checkpointing

David van der Spoel spoel at xray.bmc.uu.se
Thu Jun 6 20:52:59 CEST 2002


On Thu, 6 Jun 2002, Justin MacCallum wrote:

>Yes I know.  That is unfortunate, but it should be possible to make a
>script that invokes tpbconv to make a new tpr file from the trr file, etc.

Actually, we should just implement writing out a restart file every n
steps. It can overwrite the same file each time, or alternate between two
filenames so that one complete file always survives.
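
As a rough illustration of the alternating-file idea, here is a minimal
Python sketch. It is not GROMACS code: the file names, the nstcheckpoint
parameter, and the use of pickle as the on-disk format are all assumptions
made for the example. Each file is written under a temporary name and then
renamed, so a crash in the middle of a write cannot damage the older file.

```python
import os
import pickle

# Hypothetical file names; a real implementation would pick its own.
CHECKPOINT_NAMES = ("restart_a.cpt", "restart_b.cpt")


def write_checkpoint(state, step, nstcheckpoint, directory="."):
    """Write a restart file if `step` is a multiple of `nstcheckpoint`.

    Returns the path written, or None if this is not a checkpoint step.
    """
    if step % nstcheckpoint != 0:
        return None
    # Alternate between the two names on successive checkpoints.
    slot = (step // nstcheckpoint) % 2
    target = os.path.join(directory, CHECKPOINT_NAMES[slot])
    tmp = target + ".tmp"
    with open(tmp, "wb") as handle:
        pickle.dump({"step": step, "state": state}, handle)
        handle.flush()
        os.fsync(handle.fileno())
    os.replace(tmp, target)  # atomic rename on POSIX file systems
    return target


def load_latest_checkpoint(directory="."):
    """Return the readable checkpoint with the highest step, or None."""
    best = None
    for name in CHECKPOINT_NAMES:
        path = os.path.join(directory, name)
        try:
            with open(path, "rb") as handle:
                data = pickle.load(handle)
        except (OSError, EOFError, pickle.UnpicklingError):
            continue  # missing or truncated file: try the other one
        if best is None or data["step"] > best["step"]:
            best = data
    return best
```

On restart, load_latest_checkpoint() picks whichever of the two files is
readable and newest, so an incomplete last write only costs one checkpoint
interval.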

Checkpointing parallel runs at the OS level is complicated, I would
think.

>> 
>> One problem is that I don't think the MPI standard includes any way of
>> sending signals to other nodes - we will have to intercept the signal on
>> the node where we get it and do the communication ourselves.
>
>Everything seems to work fine if you send the signal to one of the mdrun
>processes. I think the problem is that PBS sends signals to the shell of
>the job.  mpirun does not appear to send the signals on to mdrun. This
>creates at least two problems:
>
>1. Executing qdel kills mpirun, but leaves lamd and mpirun process happily
>running. I guess I will have to find a good script to deal with this.

I have found the same thing. Have you tried the qdel -W option?

>2. When PBS wants to checkpoint a job, USR1 does not get sent to the mdrun
>processes, so they never checkpoint themselves.

Maybe mdrun has to be started differently. Have you tried launching it
from a Perl script?
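
To make the wrapper idea concrete, here is a minimal sketch, written in
Python rather than Perl, of a script that starts a command and relays
selected signals to it. The function name run_and_forward is made up for
this example. The sketch only forwards to its direct child, so it assumes
that the child is either mdrun itself or a launcher that passes the signal
on, which is exactly what is in doubt for mpirun above.

```python
import signal
import subprocess


def run_and_forward(command, signals=(signal.SIGUSR1, signal.SIGTERM)):
    """Start `command`, relay the listed signals to it, return its exit code."""
    child = subprocess.Popen(command)

    def forward(signum, frame):
        # Relay the signal only while the child is still running.
        if child.poll() is None:
            child.send_signal(signum)

    for signum in signals:
        signal.signal(signum, forward)
    # wait() resumes automatically after a handler has run (Python 3.5+).
    return child.wait()
```

A job script could then call something like
run_and_forward(["mdrun", "-s", "topol.tpr"]), where the command line is
only a placeholder, so that a USR1 delivered to the script reaches the
process it started.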


Regards, David.
________________________________________________________________________
Dr. David van der Spoel, 	Biomedical center, Dept. of Biochemistry
Husargatan 3, Box 576,  	75123 Uppsala, Sweden
phone:	46 18 471 4205		fax: 46 18 511 755
spoel at xray.bmc.uu.se	spoel at gromacs.org   http://zorn.bmc.uu.se/~spoel
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++



