[gmx-developers] Checkpointing
Justin MacCallum
jlmaccal at ucalgary.ca
Thu Jun 6 21:50:48 CEST 2002
>
> >
> > Actually we should just implement writing out a restart file every n
> > steps, which can be the same file (or alternate between two filenames).
>
> Since we anyway should get rid of all static data to support threads, it
> would probably be a good idea to create one or a few structures that
> contain all data and the current state of the system.
>
> If we then write these structures in full precision to the file we
> should be able to get a really transparent restart feature.
>
> When writing files I'd suggest we use a temporary file, and once the
> write is finished we just move it to the 'real' checkpoint file.
That sounds like a good idea.
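
A minimal sketch of the temp-file-then-move scheme, in Python for brevity rather than the C that mdrun itself would use. The function names write_checkpoint and read_checkpoint, the binary layout (step, count, then doubles), and the file name state.cpt are all made up for illustration; nothing like this exists in the code yet. The temporary file is created in the same directory as the target so that the final rename stays on one filesystem, where it is atomic on POSIX.

```python
import os
import struct
import tempfile


def write_checkpoint(path, step, positions):
    """Write step and a flat list of floats as 64-bit doubles, atomically.

    Hypothetical layout: two little-endian int64 (step, count), then doubles.
    """
    directory = os.path.dirname(os.path.abspath(path))
    # Temp file lives next to the target so os.replace never crosses filesystems.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".ckpt-")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(struct.pack("<qq", step, len(positions)))
            f.write(struct.pack("<%dd" % len(positions), *positions))
            f.flush()
            os.fsync(f.fileno())  # make sure the data is on disk before the move
        # Only now does the 'real' checkpoint change; readers never see a partial file.
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise


def read_checkpoint(path):
    """Read back what write_checkpoint wrote."""
    with open(path, "rb") as f:
        step, count = struct.unpack("<qq", f.read(16))
        positions = list(struct.unpack("<%dd" % count, f.read(8 * count)))
    return step, positions
```

If the job is killed in the middle of a write, the previous checkpoint is still intact and only a stray temporary file is left behind. Writing the values as raw doubles instead of formatted text is what keeps the restart at full precision.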
> >>
> >>1. Executing qdel kills mpirun, but leaves lamd and mpirun process happily
> >>running. I guess I will have to find a good script to deal with this.
> >
>
> Another alternative might be to use a script like 'pbslam' (search the
> net) and do an 'exec' command when you start it (to replace the shell
> process). Not sure if it will help, but could be worth a try.
I actually am just playing with pbslam right now. I've had to modify it a
bit, but it actually doesn't seem to help. It does a good job of cleaning
up after a job, but it doesn't seem to pass signals along to child
processes very well. I might just use a simple script that looks at the
/tmp/lam-user@pbs###hostname files. The processes in the current
lam-session are listed in the order:
lamd
mpirun
mdrun_mpi
mdrun_mpi
so if we just send signals to one of the mdrun processes, everything
should work fine.
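
A rough sketch of what such a script could look like, in Python. The layout of the lam session files is not described here, so parsing them is left out on purpose: the sketch assumes the process list has already been turned into (pid, name) pairs in the order shown above. The function names, the example PIDs, and the choice of SIGTERM as the default signal are assumptions for illustration only.

```python
import os
import signal


def mdrun_pids(entries, name="mdrun_mpi"):
    """Return the PIDs of all entries with the given process name, in order."""
    return [pid for pid, proc in entries if proc == name]


def signal_first_mdrun(entries, sig=signal.SIGTERM, kill=os.kill):
    """Send sig to the first mdrun_mpi process and return its PID.

    The kill argument can be swapped for a stub when trying this out.
    """
    pids = mdrun_pids(entries)
    if not pids:
        raise LookupError("no mdrun_mpi process found in this session")
    kill(pids[0], sig)
    return pids[0]
```

With entries such as [(101, "lamd"), (102, "mpirun"), (103, "mdrun_mpi"), (104, "mdrun_mpi")], the signal goes to PID 103 only, and lamd and mpirun are left alone.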
Justin