[gmx-developers] Checkpointing
David van der Spoel
spoel at xray.bmc.uu.se
Thu Jun 6 20:52:59 CEST 2002
On Thu, 6 Jun 2002, Justin MacCallum wrote:
>Yes I know. That is unfortunate, but it should be possible to make a
>script that invokes tpbconv to make a new tpr file from the trr file, etc.
Actually, we should just implement writing out a restart file every n
steps, either overwriting the same file each time or alternating between
two filenames. Checkpointing parallel runs at the OS level is
complicated, I would think.
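
To make the alternating-filename idea concrete, here is a rough sketch in
Python (not the C that mdrun is written in). Everything in it is an
assumption made up for illustration: the function names, the .rst suffix,
the JSON format, and the single float standing in for the simulation
state. The point is only that with two slots, a crash in the middle of a
write still leaves the previous restart file intact.

```python
import json
import os


def restart_path(basename, step, interval):
    """Pick one of two restart filenames, alternating at each write."""
    slot = (step // interval) % 2
    return f"{basename}_{slot}.rst"


def write_restart(basename, step, interval, state):
    """Write the state to the slot for this step and return the path."""
    path = restart_path(basename, step, interval)
    tmp = path + ".tmp"
    with open(tmp, "w") as fh:
        json.dump({"step": step, "state": state}, fh)
        fh.flush()
        os.fsync(fh.fileno())
    # Rename only after the data is on disk, so a partially written
    # file never replaces a good one.
    os.replace(tmp, path)
    return path


def latest_restart(basename):
    """Return the readable restart record with the highest step, or None."""
    best = None
    for slot in (0, 1):
        path = f"{basename}_{slot}.rst"
        try:
            with open(path) as fh:
                record = json.load(fh)
        except (OSError, ValueError):
            continue  # missing or damaged slot: fall back to the other one
        if best is None or record["step"] > best["step"]:
            best = record
    return best


def run(basename, nsteps, interval):
    """Toy main loop: advance the state and write a restart every interval steps."""
    state = 0.0
    for step in range(1, nsteps + 1):
        state += 0.5  # stand-in for one MD step
        if step % interval == 0:
            write_restart(basename, step, interval, state)
    return state
```

On restart one would call latest_restart() and continue from the step it
reports. Using a single filename is the same code with the slot fixed at 0.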
>>
>> One problem is that I don't think the MPI standard includes any way of
>> sending signals to other nodes - we will have to intercept the signal on
>> the node where we get it and do the communication ourselves.
>
>Everything seems to work fine if you send the signal to one of the mdrun
>processes. I think the problem is that PBS sends signals to the shell of
>the job. mpirun does not appear to send the signals on to mdrun. This
>creates at least two problems:
>
>1. Executing qdel kills mpirun, but leaves the lamd and mdrun processes
>happily running. I guess I will have to find a good script to deal with this.
I have found the same thing. Have you tried the qdel -W option?
>2. When PBS wants to checkpoint a job, USR1 does not get sent to the mdrun
>processes, so they never checkpoint themselves.
Maybe one has to start the mdrun program differently. Have you tried
starting it from a Perl script?
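
As an illustration of the wrapper-script idea, here is a minimal sketch
(in Python rather than Perl) of a wrapper that starts a program and
passes USR1 and TERM on to it. The function name and the choice of
signals are assumptions. It only reaches the wrapper's direct child, so
whether the signal ever arrives at mdrun still depends on how the MPI
launcher starts the mdrun processes; I have not tested this with PBS or
LAM.

```python
import signal
import subprocess

# Signals to pass on to the child (an assumption for this sketch).
FORWARDED = (signal.SIGUSR1, signal.SIGTERM)


def run_with_forwarding(argv, signals=FORWARDED):
    """Start argv as a child process, forward the given signals to it,
    and return the child's exit code."""
    child = None

    def forward(signum, frame):
        # Pass the signal on only while the child is still running.
        if child is not None and child.poll() is None:
            child.send_signal(signum)

    # Install the handlers before starting the child and remember the
    # previous ones so they can be restored afterwards.
    previous = {s: signal.signal(s, forward) for s in signals}
    try:
        child = subprocess.Popen(argv)
        return child.wait()
    finally:
        for s, handler in previous.items():
            signal.signal(s, handler)
```

The job script would then call run_with_forwarding() with the command
line it would otherwise have run directly, and exit with the returned
code.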
Groeten, David.
________________________________________________________________________
Dr. David van der Spoel, Biomedical center, Dept. of Biochemistry
Husargatan 3, Box 576, 75123 Uppsala, Sweden
phone: 46 18 471 4205 fax: 46 18 511 755
spoel at xray.bmc.uu.se spoel at gromacs.org http://zorn.bmc.uu.se/~spoel
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++