[gmx-developers] Checkpointing

Justin MacCallum jlmaccal at ucalgary.ca
Thu Jun 6 20:19:55 CEST 2002


Hi Erik,

> 
> This would indeed be nice to have, but we will probably have to
> implement it 'manually' on Linux - there is no OS support for
> checkpointing like on SGI or CRAY.

Yes, I know.  That is unfortunate, but it should be possible to write a
script that invokes tpbconv to make a new tpr file from the trr file, etc.
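
A rough sketch of what such a script might do, in Python. The file names
and the helper names (tpbconv_command, make_restart_tpr) are made up for
illustration, and the -s/-f/-o flags are my assumption of the usual tpbconv
options, so check them against "tpbconv -h" for the installed version.

```python
import subprocess


def tpbconv_command(old_tpr, trr, new_tpr, tpbconv="tpbconv"):
    """Build the argument list for making a restart run input file.

    Assumed flags (verify with `tpbconv -h`):
      -s  the original run input file
      -f  the trajectory holding the last full-precision frame
      -o  the new run input file to write
    """
    return [tpbconv, "-s", old_tpr, "-f", trr, "-o", new_tpr]


def make_restart_tpr(old_tpr, trr, new_tpr):
    """Run tpbconv; raises CalledProcessError if it exits non-zero."""
    subprocess.run(tpbconv_command(old_tpr, trr, new_tpr), check=True)
```

Calling make_restart_tpr("topol.tpr", "traj.trr", "restart.tpr") would then
be the first step of a restart, with mdrun started on the new file afterwards.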

> 
> One problem is that I don't think the MPI standard includes any way of
> sending signals to other nodes - we will have to intercept the signal on
> the node where we get it and do the communication ourselves.

Everything seems to work fine if you send the signal to one of the mdrun
processes. I think the problem is that PBS sends signals to the shell of
the job.  mpirun does not appear to send the signals on to mdrun. This
creates at least two problems:

1. Executing qdel kills mpirun, but leaves the lamd and mdrun processes
happily running. I guess I will have to find a good script to deal with this.

2. When PBS wants to checkpoint a job, USR1 does not get sent to the mdrun
processes, so they never checkpoint themselves.
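
One possible workaround for both problems is a small wrapper that the job
script starts in place of the program itself, and which passes on whatever
signals it receives. The sketch below is only an assumption about how that
could look: run_and_forward and FORWARDED are hypothetical names, and it
covers only the hop from the job to its direct child. If that child is
mpirun, which does not seem to pass signals on, the signal would still have
to be sent to one of the mdrun processes directly.

```python
import signal
import subprocess

# Signals to pass on: USR1 for checkpointing, TERM for a qdel-style shutdown.
FORWARDED = (signal.SIGUSR1, signal.SIGTERM)


def run_and_forward(argv):
    """Start argv as a child process and relay FORWARDED signals to it.

    Returns the child's exit code. Must be called from the main thread,
    because Python only allows signal handlers to be installed there.
    """
    child = subprocess.Popen(argv)

    def relay(signum, frame):
        # Pass the signal on only while the child has not been reaped.
        if child.poll() is None:
            try:
                child.send_signal(signum)
            except ProcessLookupError:
                pass  # the child exited in the meantime

    for sig in FORWARDED:
        signal.signal(sig, relay)

    # wait() is restarted automatically after a handler has run.
    return child.wait()
```

There is a short window between starting the child and installing the
handlers in which a signal would still hit the wrapper with its default
action, so this is a starting point rather than a robust solution.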




