[gmx-developers] Checkpointing

Thu Jun 6 21:50:48 CEST 2002

> > Actually we should just implement writing out a restart file every n
> > steps, which can be the same file (or alternate between two filenames).
> 
> Since we anyway should get rid of all static data to support threads, it 
> would probably be a good idea to create one or a few structures that 
> contain all data and the current state of the system.
> 
> If we then write these structures in full precision to the file we 
> should be able to get a really transparent restart feature.
> 
> When writing files I'd suggest we use a temporary file, and once the 
> write is finished we just move it to the 'real' checkpoint file.

That sounds like a good idea.
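
A minimal sketch of the temporary-file approach suggested above, written in
Python purely for illustration (not GROMACS code). The function names, the
`.ckpt-` prefix, and the use of `pickle` for the state are all assumptions;
the only point is the ordering: write everything, flush it to disk, then
rename over the 'real' checkpoint file, so a crash mid-write never leaves a
truncated checkpoint behind.

```python
import os
import pickle
import tempfile


def write_checkpoint(state, path):
    """Write `state` to `path` via a temporary file plus an atomic rename.

    `state` is a hypothetical stand-in for the structures holding the full
    system state; pickle stores Python floats at full double precision.
    """
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temporary file in the same directory, so the final rename
    # stays on one filesystem and can be atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".ckpt-", suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            pickle.dump(state, f, protocol=pickle.HIGHEST_PROTOCOL)
            f.flush()
            os.fsync(f.fileno())  # make sure the data is on disk first
        # Only now replace the previous checkpoint with the new one.
        os.replace(tmp_path, path)
    except BaseException:
        # On any failure, remove the partial file and keep the old checkpoint.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise


def read_checkpoint(path):
    """Load a state previously written by write_checkpoint."""
    with open(path, "rb") as f:
        return pickle.load(f)
```

Calling `write_checkpoint(state, "state.cpt")` every n steps would then
always leave either the previous or the new checkpoint in place, never a
half-written one.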

> >>
> >>1. Executing qdel kills mpirun, but leaves lamd and mpirun process happily
> >>running. I guess I will have to find a good script to deal with this.
> > 
> 
> Another alternative might be to use a script like 'pbslam' (search the 
> net) and do an 'exec' command when you start it (to replace the shell 
> process). Not sure if it will help, but could be worth a try.

I actually am just playing with pbslam right now. I've had to modify it a
bit, but it actually doesn't seem to help. It does a good job of cleaning
up after a job, but it doesn't seem to pass signals along to child
processes very well. I might just use a simple script that looks at the
/tmp/lam-user@pbs###hostname files.  The processes in the current
lam-session are listed in the order:

	lamd
	mpirun
	mdrun_mpi
	mdrun_mpi

so if we just send signals to one of the mdrun processes, everything
should work fine.
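
A rough sketch of such a script, in Python for illustration only. It assumes
the session listing has already been parsed into (pid, name) pairs in the
order shown above; the on-disk format of the lam session files is not
covered here, and the function name and the default choice of SIGTERM are
assumptions.

```python
import os
import signal


def signal_first_mdrun(processes, sig=signal.SIGTERM, name="mdrun_mpi"):
    """Send `sig` to the first process called `name`.

    `processes` is a list of (pid, name) pairs, e.g. in the order
    lamd, mpirun, mdrun_mpi, mdrun_mpi. Returns the PID that was
    signalled, or None if no process with that name was listed.
    """
    for pid, proc_name in processes:
        if proc_name == name:
            os.kill(pid, sig)  # deliver the signal to that one process only
            return pid
    return None
```

Only the first mdrun_mpi entry is signalled; lamd and mpirun are left alone.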

Justin